An Analysis of Distributed Document Clustering Using MapReduce Based K-Means Algorithm

  • Original Contribution
Journal of The Institution of Engineers (India): Series B

Abstract

Clustering is one of the most important data mining techniques, and document clustering is one of its many applications. Traditional clustering algorithms have proven inefficient for clustering large, rapidly generated real-world datasets. As a solution, traditional clustering algorithms are modified using a distributed programming paradigm. MapReduce is a popular distributed programming paradigm implemented by the Hadoop distributed framework. This paper demonstrates a MapReduce-based modification of the K-Means clustering algorithm for document datasets. The results show that the proposed algorithm is more efficient than traditional K-Means for clustering document datasets of all sizes. The experiments also show that MapReduce clustering works more efficiently when both the dataset and the Hadoop cluster are large.
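The abstract does not spell out how K-Means is split across map and reduce steps. The following is a minimal single-process sketch, assuming the commonly used formulation in which the map step assigns each document vector to its nearest centroid and the reduce step averages the vectors in each cluster. It is not the authors' implementation and does not use Hadoop; the function names (map_phase, reduce_phase, kmeans_iteration) and the use of squared Euclidean distance are assumptions made for illustration.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance, an assumption)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(points, centroids):
    """Map: emit (cluster index, (vector, 1)) for every document vector."""
    for point in points:
        yield nearest(point, centroids), (point, 1)


def reduce_phase(pairs, centroids):
    """Reduce: sum the vectors per cluster, then divide by the count."""
    sums, counts = {}, defaultdict(int)
    for idx, (vec, n) in pairs:
        if idx in sums:
            sums[idx] = [s + v for s, v in zip(sums[idx], vec)]
        else:
            sums[idx] = list(vec)
        counts[idx] += n
    # A cluster that received no points keeps its previous centroid.
    return [
        [s / counts[i] for s in sums[i]] if i in sums else list(centroids[i])
        for i in range(len(centroids))
    ]


def kmeans_iteration(points, centroids):
    """One K-Means iteration expressed as a map step followed by a reduce step."""
    return reduce_phase(map_phase(points, centroids), centroids)
```

In a real Hadoop job the map step would run in parallel over splits of the dataset and the framework would group the emitted pairs by cluster index before the reduce step; here both steps run sequentially in one process, and kmeans_iteration would be called repeatedly until the centroids stop changing.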

Author information

Corresponding author

Correspondence to Zahid Ansari.

Cite this article

Sardar, T.H., Ansari, Z. An Analysis of Distributed Document Clustering Using MapReduce Based K-Means Algorithm. J. Inst. Eng. India Ser. B 101, 641–650 (2020). https://doi.org/10.1007/s40031-020-00485-2

  • DOI: https://doi.org/10.1007/s40031-020-00485-2
