Scalable k-means++

Authors:
Bahman Bahmani (Stanford University, Stanford, CA)
Benjamin Moseley (University of Illinois, Urbana, IL)
Andrea Vattani (University of California, San Diego, CA)
Ravi Kumar (Yahoo! Research, Sunnyvale, CA)
Sergei Vassilvitskii (Yahoo! Research, New York, NY)

Proceedings of the VLDB Endowment, Volume 5, Issue 7, pp. 622–633. https://doi.org/10.14778/2180912.2180915

Published: 01 March 2012

Abstract

Over half a century old and showing no signs of aging, k-means remains one of the most popular data processing algorithms. As is well known, a proper initialization of k-means is crucial for obtaining a good final solution. The recently proposed k-means++ initialization algorithm achieves this, obtaining an initial set of centers that is provably close to the optimum solution. A major downside of k-means++ is its inherently sequential nature, which limits its applicability to massive data: one must make k passes over the data to find a good initial set of centers. In this work we show how to drastically reduce the number of passes needed to obtain, in parallel, a good initialization. This is unlike prevailing efforts on parallelizing k-means, which have mostly focused on the post-initialization phases of k-means. We prove that our proposed initialization algorithm, k-means||, obtains a nearly optimal solution after a logarithmic number of passes, and then show that in practice a constant number of passes suffices. Experimental evaluation on real-world large-scale data demonstrates that k-means|| outperforms k-means++ in both sequential and parallel settings.
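
The contrast drawn in the abstract can be illustrated with a minimal sketch in plain Python. The abstract gives no pseudocode, so everything here is an assumption made for illustration: the function names, the oversampling factor `ell`, the fixed `rounds` count, and the choice to reduce the oversampled candidates with a weighted D^2 seeding. `kmeanspp_seed` shows why sequential seeding needs roughly one pass per center (each pick depends on all earlier picks); `oversample_seed` shows the general idea of sampling many candidates per pass, independently per point, so the number of passes no longer grows with k.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_sq_dists(points, centers):
    """One full pass over the data: squared distance to the closest center."""
    return [min(sq_dist(p, c) for c in centers) for p in points]


def kmeanspp_seed(points, k, rng, weights=None):
    """Sequential D^2 seeding; returns (centers, number of distance passes)."""
    w = weights if weights is not None else [1] * len(points)
    centers = [rng.choices(points, weights=w, k=1)[0]]
    passes = 0
    while len(centers) < k:
        d2 = nearest_sq_dists(points, centers)  # depends on every earlier pick
        passes += 1
        scores = [wi * di for wi, di in zip(w, d2)]
        if sum(scores) == 0:  # fewer than k distinct points are available
            break
        centers.append(rng.choices(points, weights=scores, k=1)[0])
    return centers, passes


def oversample_seed(points, k, ell, rounds, rng):
    """Oversampling sketch; returns (centers, number of passes over the data)."""
    candidates = [rng.choice(points)]
    passes = 0
    for _ in range(rounds):
        d2 = nearest_sq_dists(points, candidates)
        passes += 1
        total = sum(d2)
        if total == 0:
            break
        # Each point is kept independently of the others, so this pass could
        # be split across machines; at most about `ell` join per round.
        candidates += [p for p, d in zip(points, d2)
                       if rng.random() < min(1.0, ell * d / total)]
    # Weight each candidate by how many points it is closest to (one more pass).
    weights = [0] * len(candidates)
    for p in points:
        j = min(range(len(candidates)), key=lambda i: sq_dist(p, candidates[i]))
        weights[j] += 1
    passes += 1
    # The candidate set is small, so reducing it to k centers is cheap.
    centers, _ = kmeanspp_seed(candidates, k, rng, weights)
    return centers, passes


if __name__ == "__main__":
    rng = random.Random(0)
    blobs = [(cx + dx, cy + dy) for cx, cy in [(0, 0), (50, 0), (0, 50)]
             for dx in range(4) for dy in range(4)]
    print(kmeanspp_seed(blobs, 3, rng))
    print(oversample_seed(blobs, 3, ell=6, rounds=5, rng=rng))
```

In this sketch the sequential routine makes k - 1 distance passes after its first pick, while the oversampling routine makes at most `rounds + 1` passes regardless of k. The values `ell=6` and `rounds=5` in the demo are arbitrary and are not settings reported on this page.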


Index Terms

Computing methodologies > Machine learning > Learning paradigms > Unsupervised learning > Cluster analysis

Index terms have been assigned to the content through auto-classification.

Published in

Proceedings of the VLDB Endowment Volume 5, Issue 7
March 2012
94 pages
ISSN:2150-8097
Publisher
VLDB Endowment