Online K-Means Clustering with Lightweight Coresets

Low, Jia Shun; Ghafoori, Zahra; Leckie, Christopher

doi:10.1007/978-3-030-35288-2_16

Conference paper
First Online: 25 November 2019

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11919)

Abstract

Coresets are representative samples of data that can be used to train machine learning models with provable guarantees of approximating the accuracy of training on the full dataset. They have been used for scalable clustering of large datasets and yield better cluster partitions than clustering a random sample. In this paper, we present a novel approach that constructs lightweight coresets on subsets of the data that fit in memory while performing online k-means, a streaming variant of k-means clustering. Experimental results show that this approach generates cluster partitions of comparable accuracy to the regular online k-means algorithm in less time, or superior partitions in comparable time.
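
The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the coreset and the online update are combined, so the combination here is an assumption. The sampling distribution is the lightweight-coreset rule usually attributed to Bachem, Lucic and Krause (2018), and the update is a weighted form of the standard sequential k-means step. The function names `lightweight_coreset`, `online_kmeans_update` and `cluster_stream`, and the choice to seed the centers with the first `k` sampled points, are hypothetical.

```python
import numpy as np


def lightweight_coreset(X, m, rng):
    """Sample a weighted lightweight coreset of size m from the rows of X."""
    n = X.shape[0]
    # Squared distance of every point to the mean of this chunk.
    sq_dist = np.sum((X - X.mean(axis=0)) ** 2, axis=1)
    total = sq_dist.sum()
    # Mix a uniform term with a distance-proportional term; fall back to
    # uniform sampling when all points coincide (total == 0).
    q = np.full(n, 1.0 / n) if total == 0 else 0.5 / n + 0.5 * sq_dist / total
    idx = rng.choice(n, size=m, replace=True, p=q)
    # Importance weights keep the weighted clustering cost unbiased.
    weights = 1.0 / (m * q[idx])
    return X[idx], weights


def online_kmeans_update(centers, counts, points, weights):
    """Apply one weighted sequential k-means step per point, in place."""
    for x, w in zip(points, weights):
        j = int(np.argmin(np.sum((centers - x) ** 2, axis=1)))
        counts[j] += w
        # Move the nearest center toward x by the point's share of the mass.
        centers[j] += (w / counts[j]) * (x - centers[j])
    return centers, counts


def cluster_stream(chunks, k, m, seed=0):
    """Assumed pipeline: build a coreset per in-memory chunk, then update."""
    rng = np.random.default_rng(seed)
    centers, counts = None, np.zeros(k)
    for chunk in chunks:
        points, weights = lightweight_coreset(chunk, m, rng)
        if centers is None:
            # Hypothetical initialisation: the first k sampled points (m >= k).
            centers = points[:k].astype(float).copy()
        online_kmeans_update(centers, counts, points, weights)
    return centers
```

Each chunk of size n is reduced to m weighted points before the per-point update runs, so the update loop touches m points per chunk rather than n. That is one plausible source of the time savings the abstract reports, though the paper itself should be consulted for the actual algorithm.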

Author information

Authors and Affiliations

School of Computing and Information Systems, The University of Melbourne, Melbourne, Australia
Jia Shun Low, Zahra Ghafoori & Christopher Leckie

Corresponding author

Correspondence to Jia Shun Low.

Editor information

Editors and Affiliations

University of South Australia, Adelaide, SA, Australia
Jixue Liu
The University of Melbourne, Melbourne, VIC, Australia
James Bailey

About this paper

Cite this paper

Low, J.S., Ghafoori, Z., Leckie, C. (2019). Online K-Means Clustering with Lightweight Coresets. In: Liu, J., Bailey, J. (eds) AI 2019: Advances in Artificial Intelligence. AI 2019. Lecture Notes in Computer Science, vol 11919. Springer, Cham. https://doi.org/10.1007/978-3-030-35288-2_16

DOI: https://doi.org/10.1007/978-3-030-35288-2_16
Published: 25 November 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-35287-5
Online ISBN: 978-3-030-35288-2
eBook Packages: Computer Science, Computer Science (R0)
