ABSTRACT
Spectral clustering refers to a flexible class of clustering procedures that can produce high-quality clusterings on small data sets but that has limited applicability to large-scale problems due to its computational complexity of O(n³) in general, with n the number of data points. We extend the range of spectral clustering by developing a general framework for fast approximate spectral clustering in which a distortion-minimizing local transformation is first applied to the data. This framework is based on a theoretical analysis that provides a statistical characterization of the effect of local distortion on the mis-clustering rate. We develop two concrete instances of our general framework, one based on local k-means clustering (KASP) and one based on random projection trees (RASP). Extensive experiments show that these algorithms can achieve significant speedups with little degradation in clustering accuracy. Specifically, our algorithms outperform k-means by a large margin in terms of accuracy, and run several times faster than approximate spectral clustering based on the Nyström method, with comparable accuracy and a significantly smaller memory footprint. Remarkably, our algorithms make it possible for a single machine to perform spectral clustering on data sets with a million observations within several minutes.
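The KASP idea described above can be illustrated with a minimal sketch: compress the data to a modest number of k-means centroids, run spectral clustering on the centroids only, then propagate each centroid's cluster label back to the points assigned to it. This is a hedged illustration, not the authors' implementation; it assumes a Gaussian affinity, plain Lloyd's k-means, and a dense eigendecomposition, and the function names (`kmeans`, `spectral_cluster`, `kasp`) and parameters (`sigma`, `n_reps`) are our own.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns (centroids, point-to-centroid labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):                      # keep old centroid if cluster empties
                centroids[j] = pts.mean(axis=0)
    return centroids, labels

def spectral_cluster(X, k, sigma=1.0, seed=0):
    """Normalized spectral clustering (Ng-Jordan-Weiss style) on a small set X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    A = np.exp(-d2 / (2 * sigma ** 2))        # Gaussian affinity
    np.fill_diagonal(A, 0.0)
    deg = A.sum(axis=1)
    dinv = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    L = dinv[:, None] * A * dinv[None, :]     # D^{-1/2} A D^{-1/2}
    _, vecs = np.linalg.eigh(L)               # symmetric, so eigh applies
    V = vecs[:, -k:]                          # top-k eigenvectors
    V /= np.maximum(np.linalg.norm(V, axis=1, keepdims=True), 1e-12)
    _, labels = kmeans(V, k, seed=seed)       # cluster the embedded rows
    return labels

def kasp(X, k, n_reps, sigma=1.0, seed=0):
    """KASP sketch: spectral-cluster n_reps k-means centroids, map labels back."""
    centroids, assign = kmeans(X, n_reps, seed=seed)
    rep_labels = spectral_cluster(centroids, k, sigma=sigma, seed=seed)
    return rep_labels[assign]
```

For example, on two well-separated Gaussian blobs, `kasp(X, k=2, n_reps=20)` spectrally clusters only 20 representatives instead of all points, which is where the speedup over full spectral clustering comes from: the O(n³) eigendecomposition is applied to the (much smaller) set of representatives.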