Pragmatic Evaluation of the Impact of Dimensionality Reduction in the Performance of Clustering Algorithms

Renjith, Shini; Sreekumar, A.; Jathavedan, M.

doi:10.1007/978-981-15-5558-9_45

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 672)

Abstract

With huge volumes of data available as input, modern statistical analysis leverages clustering techniques to limit the volume of data to be processed. These input data are mainly sourced from social media channels and typically have high dimensionality due to the diverse features they represent. This is commonly referred to as the curse of dimensionality, as it makes the clustering process highly computationally intensive and less efficient. Dimensionality reduction techniques have been proposed to address this issue. This paper presents an empirical analysis of the impact of applying dimensionality reduction during the data transformation phase of the clustering process. We measured the impact in terms of clustering quality and clustering performance for three of the most common clustering algorithms: k-means clustering, clustering large applications (CLARA), and agglomerative hierarchical clustering (AGNES). Clustering quality is compared using four internal evaluation criteria, namely the Silhouette index, Dunn index, Calinski-Harabasz index, and Davies-Bouldin index, and average execution time is used as the measure of clustering performance.
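
The kind of comparison described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code. It assumes PCA as the reduction step (the abstract does not name the technique), reduced to a single component via power iteration. It uses synthetic two-cluster data, only k-means out of the three algorithms, and only the Silhouette index out of the four criteria. It times a single run rather than averaging over several. All function names (`make_data`, `pca_1d`, `kmeans`, `silhouette`, `evaluate`) are hypothetical.

```python
import math
import random
import time


def make_data(n_per_cluster=30, dims=10, seed=1):
    """Synthetic stand-in data: two clusters that differ only in the first two features."""
    rng = random.Random(seed)
    return [[rng.gauss(center if j < 2 else 0.0, 1.0) for j in range(dims)]
            for center in (0.0, 5.0) for _ in range(n_per_cluster)]


def dist(a, b):
    """Euclidean distance between two points given as sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pca_1d(points, iters=100):
    """Project points onto their first principal component (assumed reduction step)."""
    n, d = len(points), len(points[0])
    means = [sum(col) / n for col in zip(*points)]
    centered = [[x - m for x, m in zip(p, means)] for p in points]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d  # power iteration converges to the dominant eigenvector of cov
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return [[sum(x * c for x, c in zip(r, v))] for r in centered]


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd-style k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def silhouette(points, labels):
    """Mean silhouette width; values near 1 indicate compact, well-separated clusters."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:  # singleton clusters score 0 by convention
            scores.append(0.0)
            continue
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def evaluate(points, k):
    """Cluster with k-means; return (silhouette, seconds spent in the clustering step)."""
    start = time.perf_counter()
    labels = kmeans(points, k)
    elapsed = time.perf_counter() - start
    return silhouette(points, labels), elapsed


data = make_data()
reduced = pca_1d(data)
full_sil, full_time = evaluate(data, 2)
red_sil, red_time = evaluate(reduced, 2)
print(f"full data:    silhouette={full_sil:.3f}, time={full_time:.5f}s")
print(f"reduced data: silhouette={red_sil:.3f}, time={red_time:.5f}s")
```

In this toy setup most features are pure noise, so projecting onto the dominant component tends to raise the Silhouette index and shorten the distance computations. Whether that carries over to real data depends on the dataset, which is what an empirical study of this kind sets out to measure.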

Author information

Authors and Affiliations

Department of Computer Applications, Cochin University of Science and Technology, Kochi, Kerala, 682022, India
Shini Renjith, A. Sreekumar & M. Jathavedan
Department of Computer Science and Engineering, Mar Baselios College of Engineering and Technology, Thiruvananthapuram, Kerala, 695015, India
Shini Renjith

Corresponding author

Correspondence to Shini Renjith.

Editor information

Editors and Affiliations

SVS College of Engineering, Coimbatore, Tamil Nadu, India
Thangaprakash Sengodan
Kuwait College of Science and Technology, Doha, Kuwait
M. Murugappan
Covenant University, Ota, Nigeria
Sanjay Misra

About this paper

Cite this paper

Renjith, S., Sreekumar, A., Jathavedan, M. (2020). Pragmatic Evaluation of the Impact of Dimensionality Reduction in the Performance of Clustering Algorithms. In: Sengodan, T., Murugappan, M., Misra, S. (eds) Advances in Electrical and Computer Technologies. Lecture Notes in Electrical Engineering, vol 672. Springer, Singapore. https://doi.org/10.1007/978-981-15-5558-9_45

DOI: https://doi.org/10.1007/978-981-15-5558-9_45
Published: 08 September 2020
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-5557-2
Online ISBN: 978-981-15-5558-9
eBook Packages: Computer Science, Computer Science (R0)
