research-article

Scalable k-means++

Published: 01 March 2012
Abstract

Over half a century old and showing no signs of aging, k-means remains one of the most popular data processing algorithms. As is well known, a proper initialization of k-means is crucial for obtaining a good final solution. The recently proposed k-means++ initialization algorithm achieves this, obtaining an initial set of centers that is provably close to the optimum solution. A major downside of k-means++ is its inherently sequential nature, which limits its applicability to massive data: one must make k passes over the data to find a good initial set of centers. In this work we show how to drastically reduce the number of passes needed to obtain, in parallel, a good initialization. This is unlike prevailing efforts on parallelizing k-means, which have mostly focused on the post-initialization phases of k-means. We prove that our proposed initialization algorithm k-means|| obtains a nearly optimal solution after a logarithmic number of passes, and then show that in practice a constant number of passes suffices. Experimental evaluation on real-world large-scale data demonstrates that k-means|| outperforms k-means++ in both sequential and parallel settings.
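
To make the sequential bottleneck concrete, the following is a minimal sketch of the baseline k-means++ seeding step (sampling each new center with probability proportional to its squared distance from the nearest center chosen so far). It is not the k-means|| algorithm proposed in the paper, and the names `squared_distance`, `kmeans_pp_seed`, and the `passes` counter are illustrative assumptions rather than anything defined in the source.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_seed(points, k, rng):
    """Pick k initial centers by D^2 sampling (k-means++ style seeding).

    Returns the chosen centers and the number of full scans over `points`.
    Hypothetical helper for illustration only.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and len(points)")

    # First center: chosen uniformly at random.
    centers = [points[rng.randrange(len(points))]]

    # d2[i] is the squared distance from points[i] to its nearest center.
    d2 = [squared_distance(p, centers[0]) for p in points]
    passes = 1  # the scan above is the first pass over the data

    while len(centers) < k:
        total = sum(d2)
        if total == 0:
            # Every point coincides with a chosen center: fall back to uniform.
            idx = rng.randrange(len(points))
        else:
            # Sample one index with probability proportional to d2.
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
        centers.append(points[idx])

        # Each new center depends on all previous ones, so the distances
        # must be refreshed with another full scan before the next draw.
        d2 = [min(old, squared_distance(p, centers[-1]))
              for p, old in zip(points, d2)]
        passes += 1

    return centers, passes


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0), (-10.0, 5.0)]
    chosen, scans = kmeans_pp_seed(data, 3, random.Random(0))
    print(chosen, scans)
```

In this sketch the number of scans equals k, because every draw needs the distances updated by the previous draw. That dependency is what the abstract refers to as the inherently sequential nature of k-means++.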

Published in Proceedings of the VLDB Endowment, Volume 5, Issue 7, March 2012, 94 pages. Publisher: VLDB Endowment.
