Abstract
Over half a century old and showing no signs of aging, k-means remains one of the most popular data processing algorithms. As is well known, a proper initialization of k-means is crucial for obtaining a good final solution. The recently proposed k-means++ initialization algorithm achieves this, obtaining an initial set of centers that is provably close to the optimum solution. A major downside of k-means++ is its inherently sequential nature, which limits its applicability to massive data: one must make k passes over the data to find a good initial set of centers. In this work we show how to drastically reduce the number of passes needed to obtain, in parallel, a good initialization. This is unlike prevailing efforts on parallelizing k-means, which have mostly focused on the post-initialization phases of k-means. We prove that our proposed initialization algorithm k-means|| obtains a nearly optimal solution after a logarithmic number of passes, and then show that in practice a constant number of passes suffices. Experimental evaluation on real-world large-scale data demonstrates that k-means|| outperforms k-means++ in both sequential and parallel settings.
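To make the pass-count argument concrete, the following is a minimal sketch, not the authors' implementation. It contrasts standard k-means++ seeding, which needs one pass over the data per center, with the kind of oversampling round usually described for k-means||, in which every point is sampled independently so that a round can run in parallel. The function names, the oversampling factor `ell`, and the fixed `rounds` parameter are assumptions introduced for illustration; the abstract itself does not specify them. The final step of reducing the candidate set to exactly k centers is omitted.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seed(points, k, rng):
    """k-means++ seeding; returns the centers and the number of data passes."""
    centers = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    passes = 1
    while len(centers) < k:
        # Sample the next center with probability proportional to d2.
        # Fall back to uniform sampling if every point coincides with a center.
        weights = d2 if any(d2) else None
        idx = rng.choices(range(len(points)), weights=weights, k=1)[0]
        centers.append(points[idx])
        # Refreshing d2 costs another full pass, hence k passes in total.
        d2 = [min(old, sq_dist(p, points[idx])) for p, old in zip(points, d2)]
        passes += 1
    return centers, passes


def oversample_candidates(points, ell, rounds, rng):
    """Hypothetical sketch of k-means||-style oversampling rounds.

    Each round keeps every point independently with probability
    min(1, ell * d2 / cost), so a round needs no sequential dependency
    between points. The result is a candidate set, usually larger than k.
    """
    candidates = [rng.choice(points)]
    d2 = [sq_dist(p, candidates[0]) for p in points]
    for _ in range(rounds):
        cost = sum(d2)
        if cost == 0:
            break  # every point already coincides with a candidate
        new = [p for p, d in zip(points, d2)
               if rng.random() < min(1.0, ell * d / cost)]
        candidates.extend(new)
        # One pass to refresh distances against the newly added candidates.
        d2 = [min([old] + [sq_dist(p, c) for c in new])
              for p, old in zip(points, d2)]
    return candidates


if __name__ == "__main__":
    rng = random.Random(0)
    data = [(float(i % 10), float(i // 10)) for i in range(100)]
    centers, passes = kmeanspp_seed(data, 5, rng)
    print(len(centers), "centers after", passes, "passes")
    print(len(oversample_candidates(data, ell=10, rounds=3, rng=rng)), "candidates")
```

In this sketch the number of passes for `kmeanspp_seed` grows linearly with k, while `oversample_candidates` makes one pass per round regardless of k. The candidates would still have to be reduced to k centers, for example by weighting each candidate by the number of points nearest to it and reclustering the weighted set.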
Index Terms
- Scalable k-means++