Abstract
Clustering is one of the important data mining techniques, and document clustering is one of its many applications. Traditional clustering algorithms have proven inefficient for clustering large, rapidly generated real-world datasets. As a solution, traditional clustering algorithms are adapted to a distributed programming paradigm. MapReduce is a popular distributed programming paradigm supported by the Hadoop distributed framework. This paper demonstrates a MapReduce-based modification of the K-Means clustering algorithm for document datasets. The results show that the proposed algorithm is more efficient than traditional K-Means for clustering document datasets of all sizes. The experiments also show that MapReduce clustering works more efficiently when both the dataset and the Hadoop cluster are large.
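To make the idea concrete, one iteration of K-Means can be phrased as a map step followed by a reduce step. The sketch below is not the authors' implementation and does not use Hadoop: it simulates the two phases in plain Python, and it assumes each document has already been converted to a numeric vector (for example, term weights). All function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
import math


def nearest_centroid(vector, centroids):
    """Return the index of the centroid closest to `vector` (squared Euclidean distance)."""
    best_index, best_dist = 0, math.inf
    for index, centroid in enumerate(centroids):
        dist = sum((v - c) ** 2 for v, c in zip(vector, centroid))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def map_phase(split, centroids):
    """Mapper (illustrative): emit (cluster index, (vector, 1)) for each vector in one input split."""
    for vector in split:
        yield nearest_centroid(vector, centroids), (vector, 1)


def reduce_phase(pairs, centroids):
    """Reducer (illustrative): sum the vectors of each cluster and average them into new centroids."""
    sums = {}
    counts = defaultdict(int)
    for key, (vector, count) in pairs:
        if key in sums:
            sums[key] = [a + b for a, b in zip(sums[key], vector)]
        else:
            sums[key] = list(vector)
        counts[key] += count
    new_centroids = []
    for index, old in enumerate(centroids):
        if index in sums:
            new_centroids.append([s / counts[index] for s in sums[index]])
        else:
            # Assumption: a cluster that received no vectors keeps its old centroid.
            new_centroids.append(list(old))
    return new_centroids


def kmeans_iteration(splits, centroids):
    """Run one simulated map + reduce pass over all input splits."""
    pairs = [pair for split in splits for pair in map_phase(split, centroids)]
    return reduce_phase(pairs, centroids)


# Two input splits stand in for data blocks processed by separate mappers.
splits = [
    [[0.0, 0.0], [0.0, 2.0]],
    [[10.0, 10.0], [10.0, 12.0]],
]
centroids = [[0.0, 1.0], [9.0, 9.0]]
print(kmeans_iteration(splits, centroids))
```

In a real Hadoop job the splits would be processed on different nodes, and the iteration would be repeated until the centroids stop changing; here the loop over splits only mimics that distribution on a single machine.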
Cite this article
Sardar, T.H., Ansari, Z. An Analysis of Distributed Document Clustering Using MapReduce Based K-Means Algorithm. J. Inst. Eng. India Ser. B 101, 641–650 (2020). https://doi.org/10.1007/s40031-020-00485-2
DOI: https://doi.org/10.1007/s40031-020-00485-2