Skip to main content

Online K-Means Clustering with Lightweight Coresets

  • Conference paper
  • First Online:
  • 2197 Accesses

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 11919))

Abstract

Coresets are representative samples of data that can be used to train machine learning models with provable guarantees of approximating the accuracy of training on the full data set. They have been used for scalable clustering of large datasets and result in better cluster partitions compared to clustering a random sample. In this paper, we present a novel approach of constructing lightweight coresets on subsets of data that can fit in memory while performing a streaming variant of k-means clustering known as online k-means. Experimental results show that this approach generates cluster partitions of comparable accuracy to the regular online k-means algorithm in less time, or superior partitions in comparable time.

This is a preview of subscription content, log in via an institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD   39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD   54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

References

  1. Ailon, N., Jaiswal, R., Monteleoni, C.: Streaming k-means approximation. In: Advances in Neural Information Processing Systems, pp. 10–18 (2009)

    Google Scholar 

  2. Aloise, D., Deshpande, A., Hansen, P., Popat, P.: NP-hardness of Euclidean sum-of-squares clustering. Mach. Learn. 75(2), 245–248 (2009)

    Article  Google Scholar 

  3. Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035. Society for Industrial and Applied Mathematics (2007)

    Google Scholar 

  4. Bachem, O., Lucic, M., Hassani, S.H., Krause, A.: Approximate k-means++ in sublinear time. In: Thirtieth AAAI Conference on Artificial Intelligence (2016)

    Google Scholar 

  5. Bachem, O., Lucic, M., Krause, A.: Scalable k-means clustering via lightweight coresets. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1119–1127. ACM (2018)

    Google Scholar 

  6. Bay, S.D., Kibler, D.F., Pazzani, M.J., Smyth, P.: The UCI KDD archive of large data sets for data mining research and experimentation. SIGKDD Explor. 2(2), 81–85 (2000)

    Article  Google Scholar 

  7. Bezdek, J.C., Ehrlich, R., Full, W.: FCM: the fuzzy c-means clustering algorithm. Comput. Geosci. 10(2–3), 191–203 (1984)

    Article  Google Scholar 

  8. Bottou, L., Bengio, Y.: Convergence properties of the k-means algorithms. In: Advances in Neural Information Processing Systems, pp. 585–592 (1995)

    Google Scholar 

  9. Celebi, M.E., Kingravi, H.A., Vela, P.A.: A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Syst. Appl. 40(1), 200–210 (2013)

    Article  Google Scholar 

  10. Feldman, D., Schmidt, M., Sohler, C.: Turning big data into tiny data: constant-size coresets for k-means, PCA and projective clustering. In: Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1434–1453. Society for Industrial and Applied Mathematics (2013)

    Google Scholar 

  11. Feldman, D., Xiang, C., Zhu, R., Rus, D.: Coresets for differentially private k-means clustering and applications to privacy in mobile sensor networks. In: 2017 16th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN), pp. 3–16. IEEE (2017)

    Google Scholar 

  12. Gama, J.: A survey on learning from data streams: current and future trends. Prog. Artif. Intell. 1(1), 45–55 (2012)

    Article  Google Scholar 

  13. Har-Peled, S., Kushal, A.: Smaller coresets for k-median and k-means clustering. Discrete Comput. Geom. 37(1), 3–19 (2007)

    Article  MathSciNet  Google Scholar 

  14. Har-Peled, S., Mazumdar, S.: On coresets for k-means and k-median clustering. In: Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, pp. 291–300. ACM (2004)

    Google Scholar 

  15. Havens, T.C., Bezdek, J.C., Leckie, C., Hall, L.O., Palaniswami, M.: Fuzzy c-means algorithms for very large data. IEEE Trans. Fuzzy Syst. 20(6), 1130–1146 (2012)

    Article  Google Scholar 

  16. Hore, P., Hall, L., Goldgof, D., Cheng, W.: Online fuzzy c means. In: NAFIPS 2008–2008 Annual Meeting of the North American Fuzzy Information Processing Society, pp. 1–5. IEEE (2008)

    Google Scholar 

  17. Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985)

    Article  Google Scholar 

  18. Kune, R., Konugurthi, P.K., Agarwal, A., Chillarige, R.R., Buyya, R.: The anatomy of big data computing. Software Pract. Exper. 46(1), 79–105 (2016)

    Article  Google Scholar 

  19. Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982)

    Article  MathSciNet  Google Scholar 

  20. Low, J.S., Ghafoori, Z., Bezdek, J., Leckie, C.: Seeding on samples for accelerating k-means clustering. In: Proceedings of the 2019 3rd International Conference on Big Data and Internet of Things. ACM (2019, to appear)

    Google Scholar 

  21. Meidan, Y., et al.: N-BaIoT: network-based detection of IoT botnet attacks using deep autoencoders. IEEE Pervasive Comput. 17(3), 12–22 (2018)

    Article  Google Scholar 

  22. Nock, R., Canyasse, R., Boreli, R., Nielsen, F.: k-variates++: more pluses in the k-means++. In: International Conference on Machine Learning, pp. 145–154 (2016)

    Google Scholar 

  23. O’callaghan, L., Mishra, N., Meyerson, A., Guha, S., Motwani, R.: Streaming-data algorithms for high-quality clustering. In: Proceedings 18th International Conference on Data Engineering, pp. 685–694. IEEE (2002)

    Google Scholar 

  24. Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)

    MathSciNet  MATH  Google Scholar 

  25. Rathore, P., Kumar, D., Bezdek, J.C., Rajasegarar, S., Palaniswami, M.: A rapid hybrid clustering algorithm for large volumes of high dimensional data. IEEE Trans. Knowl. Data Eng. 31(4), 641–654 (2018)

    Article  Google Scholar 

  26. Sculley, D.: Web-scale k-means clustering. In: Proceedings of the 19th International Conference on World Wide Web, pp. 1177–1178. ACM (2010)

    Google Scholar 

  27. Wu, X., Kumar, V.: The Top Ten Algorithms in Data Mining. CRC Press, Boca Raton (2009)

    Book  Google Scholar 

  28. Zhang, Y., Tangwongsan, K., Tirthapura, S.: Streaming k-means clustering with fast queries. In: IEEE 33rd International Conference on Data Engineering (ICDE), pp. 449–460. IEEE (2017)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Jia Shun Low .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Low, J.S., Ghafoori, Z., Leckie, C. (2019). Online K-Means Clustering with Lightweight Coresets. In: Liu, J., Bailey, J. (eds) AI 2019: Advances in Artificial Intelligence. AI 2019. Lecture Notes in Computer Science(), vol 11919. Springer, Cham. https://doi.org/10.1007/978-3-030-35288-2_16

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-35288-2_16

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-35287-5

  • Online ISBN: 978-3-030-35288-2

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics