
Fast approximate spectral clustering

Published: 28 June 2009
DOI: 10.1145/1557019.1557118

ABSTRACT

Spectral clustering refers to a flexible class of clustering procedures that can produce high-quality clusterings on small data sets, but whose applicability to large-scale problems is limited by a computational complexity of O(n³) in general, where n is the number of data points. We extend the range of spectral clustering by developing a general framework for fast approximate spectral clustering in which a distortion-minimizing local transformation is first applied to the data. This framework is based on a theoretical analysis that provides a statistical characterization of the effect of local distortion on the mis-clustering rate. We develop two concrete instances of our general framework, one based on local k-means clustering (KASP) and one based on random projection trees (RASP). Extensive experiments show that these algorithms can achieve significant speedups with little degradation in clustering accuracy. Specifically, our algorithms outperform k-means by a large margin in terms of accuracy, and run several times faster than approximate spectral clustering based on the Nyström method, with comparable accuracy and a significantly smaller memory footprint. Remarkably, our algorithms make it possible for a single machine to spectral-cluster data sets with a million observations within several minutes.
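The KASP recipe summarized above is simple enough to sketch: compress the n input points to k' representative points with k-means, run spectral clustering on the representatives only, and let each original point inherit the label of its representative. The sketch below is a minimal illustration of that idea, not the authors' implementation; it assumes scikit-learn's KMeans and SpectralClustering, and the function name kasp and its parameters are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering


def kasp(X, n_clusters, n_representatives, random_state=0):
    """Minimal sketch of the KASP idea (hypothetical helper, not the
    authors' code): k-means compression followed by spectral clustering
    on the centroids.
    """
    # Step 1: distortion-minimizing local transformation -- replace the
    # n input points by n_representatives k-means centroids (k' << n).
    km = KMeans(n_clusters=n_representatives, n_init=10,
                random_state=random_state).fit(X)

    # Step 2: spectral clustering on the centroids only; the cubic
    # eigendecomposition cost now scales with n_representatives, not n.
    sc = SpectralClustering(n_clusters=n_clusters, affinity="rbf",
                            random_state=random_state)
    centroid_labels = sc.fit_predict(km.cluster_centers_)

    # Step 3: each original point inherits the cluster label of the
    # centroid it was assigned to.
    return centroid_labels[km.labels_]


# Toy usage: two Gaussian blobs, 10,000 points compressed to 200
# representatives before the spectral step.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (5000, 2)),
                   rng.normal(6, 1, (5000, 2))])
    labels = kasp(X, n_clusters=2, n_representatives=200)
    print(labels.shape)  # (10000,)
```

The speedup comes from the compression step: the eigendecomposition that dominates spectral clustering is performed on k' representatives rather than n points, so its cubic cost scales with k' while the paper's analysis bounds the resulting effect on the mis-clustering rate.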


Supplemental Material

p907-yan.mp4 (MP4, 92.1 MB)


Published in
        • Published in

KDD '09: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
June 2009, 1426 pages
ISBN: 9781605584959
DOI: 10.1145/1557019

          Copyright © 2009 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

          Publisher

          Association for Computing Machinery

          New York, NY, United States



          Qualifiers

          • research-article

          Acceptance Rates

Overall acceptance rate: 1,133 of 8,635 submissions, 13%
