An Analysis of Distributed Document Clustering Using MapReduce Based K-Means Algorithm

Sardar, Tanvir Habib; Ansari, Zahid

doi:10.1007/s40031-020-00485-2

Original Contribution
Published: 19 October 2020

Volume 101, pages 641–650 (2020)
Journal of The Institution of Engineers (India): Series B

Abstract

Clustering is considered one of the important data mining techniques, and document clustering is one of its many applications. Traditional clustering algorithms have proven inefficient for clustering large, rapidly growing real-world datasets. As a solution, traditional clustering algorithms are adapted using a distributed programming paradigm. MapReduce is a popular distributed programming paradigm implemented in the Hadoop distributed framework. This paper demonstrates a MapReduce-based modification of the K-Means clustering algorithm for document datasets. The results show that the proposed algorithm is more efficient than traditional K-Means for document datasets of all sizes. The experiments also show that MapReduce clustering works more efficiently when both the dataset and the Hadoop cluster are large.
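
The abstract does not spell out the algorithm, so the following is only a minimal sketch of how one K-Means iteration is commonly split into a map step and a reduce step. It is not the authors' implementation and does not use Hadoop. The function names (kmeans_map, kmeans_reduce, kmeans_iteration), the toy document vectors, and the choice of Euclidean distance are all assumptions made for illustration.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def kmeans_map(doc_vector, centroids):
    """Map step: emit (nearest centroid index, (vector, 1)) for one document."""
    nearest = min(range(len(centroids)),
                  key=lambda i: dist(doc_vector, centroids[i]))
    return nearest, (doc_vector, 1)


def kmeans_reduce(pairs):
    """Reduce step: average all vectors that share a centroid index."""
    sums, counts = {}, defaultdict(int)
    for idx, (vec, n) in pairs:
        if idx in sums:
            sums[idx] = [a + b for a, b in zip(sums[idx], vec)]
        else:
            sums[idx] = list(vec)
        counts[idx] += n
    return {idx: [s / counts[idx] for s in total]
            for idx, total in sums.items()}


def kmeans_iteration(doc_vectors, centroids):
    """One full iteration: map every document, then reduce by cluster."""
    emitted = [kmeans_map(v, centroids) for v in doc_vectors]
    updated = kmeans_reduce(emitted)
    # A centroid that attracted no documents keeps its previous position.
    return [updated.get(i, list(c)) for i, c in enumerate(centroids)]


# Hypothetical toy term-weight vectors for four documents (assumed data).
docs = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
centroids = [[1.0, 0.0], [0.0, 1.0]]
new_centroids = kmeans_iteration(docs, centroids)
```

In a real Hadoop job the map calls would run in parallel over splits of the dataset and the framework would group the emitted pairs by key before the reduce step; here a plain list stands in for that shuffle. Repeating kmeans_iteration until the centroids stop moving gives the full algorithm.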

Author information

Authors and Affiliations

School of Computer Science and Engineering, Jain University, Bengaluru, India
Tanvir Habib Sardar
P.A. College of Engineering, Mangaluru, India
Zahid Ansari

Corresponding author

Correspondence to Zahid Ansari.

About this article

Cite this article

Sardar, T.H., Ansari, Z. An Analysis of Distributed Document Clustering Using MapReduce Based K-Means Algorithm. J. Inst. Eng. India Ser. B 101, 641–650 (2020). https://doi.org/10.1007/s40031-020-00485-2

Received: 08 November 2018
Accepted: 28 August 2020
Published: 19 October 2020
Issue Date: December 2020
DOI: https://doi.org/10.1007/s40031-020-00485-2
