ABSTRACT
Spectral clustering refers to a flexible class of clustering procedures that can produce high-quality clusterings on small data sets but that has limited applicability to large-scale problems due to its computational complexity of O(n³) in general, with n the number of data points. We extend the range of spectral clustering by developing a general framework for fast approximate spectral clustering in which a distortion-minimizing local transformation is first applied to the data. This framework is based on a theoretical analysis that provides a statistical characterization of the effect of local distortion on the mis-clustering rate. We develop two concrete instances of our general framework, one based on local k-means clustering (KASP) and one based on random projection trees (RASP). Extensive experiments show that these algorithms can achieve significant speedups with little degradation in clustering accuracy. Specifically, our algorithms outperform k-means by a large margin in terms of accuracy, and run several times faster than approximate spectral clustering based on the Nyström method, with comparable accuracy and a significantly smaller memory footprint. Remarkably, our algorithms make it possible for a single machine to perform spectral clustering on data sets with a million observations within several minutes.
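The KASP idea described above can be illustrated with a minimal sketch: compress the data to a modest number of k-means centroids, run spectral clustering on the centroids only, then propagate each centroid's cluster label back to the points assigned to it. This is a hedged illustration, not the authors' implementation; it assumes a Gaussian affinity, plain Lloyd's k-means, and a dense eigendecomposition, and the function names (`kmeans`, `spectral_cluster`, `kasp`) and parameters (`sigma`, `n_reps`) are our own.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns (centroids, point-to-centroid labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):                      # keep old centroid if cluster empties
                centroids[j] = pts.mean(axis=0)
    return centroids, labels

def spectral_cluster(X, k, sigma=1.0, seed=0):
    """Normalized spectral clustering (Ng-Jordan-Weiss style) on a small set X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    A = np.exp(-d2 / (2 * sigma ** 2))        # Gaussian affinity
    np.fill_diagonal(A, 0.0)
    deg = A.sum(axis=1)
    dinv = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    L = dinv[:, None] * A * dinv[None, :]     # D^{-1/2} A D^{-1/2}
    _, vecs = np.linalg.eigh(L)               # symmetric, so eigh applies
    V = vecs[:, -k:]                          # top-k eigenvectors
    V /= np.maximum(np.linalg.norm(V, axis=1, keepdims=True), 1e-12)
    _, labels = kmeans(V, k, seed=seed)       # cluster the embedded rows
    return labels

def kasp(X, k, n_reps, sigma=1.0, seed=0):
    """KASP sketch: spectral-cluster n_reps k-means centroids, map labels back."""
    centroids, assign = kmeans(X, n_reps, seed=seed)
    rep_labels = spectral_cluster(centroids, k, sigma=sigma, seed=seed)
    return rep_labels[assign]
```

For example, on two well-separated Gaussian blobs, `kasp(X, k=2, n_reps=20)` spectrally clusters only 20 representatives instead of all points, which is where the speedup over full spectral clustering comes from: the O(n³) eigendecomposition is applied to the (much smaller) set of representatives.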