A Comparative Analysis of Clustering Quality Based on Internal Validation Indices for Dimensionally Reduced Social Media Data

  • Conference paper
Advances in Artificial Intelligence and Data Engineering (AIDE 2019)

Abstract

Almost all modern industries use data analytics to support business functions such as demand forecasting, targeted marketing, and supply chain planning. In addition to historical data, social media data has become a prominent input for data analytics. The key challenges with social media data are its huge volume and its high dimensionality. Clustering is a proven strategy for segregating the relevant data for processing, thereby reducing the impact of volume. Dimensionality corresponds to the number of distinct features used to represent each data subject, and dimensionality reduction techniques can lower the computational cost caused by the curse of dimensionality. This paper presents an experimental analysis using four popular dimensionality reduction techniques, two linear (PCA and ICA) and two nonlinear (LLE and t-SNE), to verify the impact of dimensionality reduction on cluster quality as measured by internal clustering validation indices.
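To make the notion of an internal clustering validation index concrete, the following is a minimal sketch in Python. It is not the authors' code; the appendices report values obtained with the R package clusterCrit. The sketch computes two of the indices listed there, the silhouette index (higher is better) and the Davies–Bouldin index (lower is better), on a made-up six-point dataset. The function names, the toy points, and both labelings are assumptions chosen for illustration only.

```python
from math import dist
from statistics import mean


def group_by_label(points, labels):
    """Collect points into a dict keyed by cluster label."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters


def silhouette(points, labels):
    """Mean silhouette width; higher is better. Needs at least two clusters."""
    clusters = group_by_label(points, labels)
    widths = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            widths.append(0.0)  # usual convention for a singleton cluster
            continue
        # Mean distance to the other members of the same cluster
        # (the point's zero distance to itself adds nothing to the sum).
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # Smallest mean distance to the members of any other cluster.
        b = min(
            mean(dist(point, q) for q in members)
            for other, members in clusters.items()
            if other != label
        )
        widths.append((b - a) / max(a, b))
    return mean(widths)


def davies_bouldin(points, labels):
    """Davies-Bouldin index; lower is better. Needs at least two clusters."""
    clusters = group_by_label(points, labels)
    centroids = {
        label: tuple(mean(axis) for axis in zip(*members))
        for label, members in clusters.items()
    }
    # Scatter of a cluster: mean distance of its members to its centroid.
    scatter = {
        label: mean(dist(p, centroids[label]) for p in members)
        for label, members in clusters.items()
    }
    return mean(
        max(
            (scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
        for i in clusters
    )


# Toy data: two tight, well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good = [0, 0, 0, 1, 1, 1]  # labels that match the two groups
bad = [0, 1, 0, 1, 0, 1]   # labels that mix the two groups

for name, labels in (("good", good), ("bad", bad)):
    print(name, round(silhouette(points, labels), 3),
          round(davies_bouldin(points, labels), 3))
```

The labeling that matches the two groups scores a higher silhouette and a lower Davies–Bouldin value than the mixed labeling, which is the sense in which the "max" and "min" goodness indicators in the appendices are to be read.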

Author information

Corresponding author

Correspondence to Shini Renjith.

Appendices

Appendix 1

Complete observations from internal evaluation conducted on the k-means clustering results using the R package clusterCrit

The PCA, ICA, LLE, and t-SNE columns give the index value after dimensionality reduction using that technique.

| Index | ClusterCrit variable | Goodness indicator | PCA | ICA | LLE | t-SNE |
|---|---|---|---|---|---|---|
| Ball–Hall index | $ball_hall | max diff | 17.20457 | 1.391324 | 1.232667 | 232.7926 |
| Banfield–Raftery index | $banfeld_raftery | min | 11,029.27 | 572.3903 | 626.2364 | 27,286.89 |
| C index | $c_index | min | 0.09553103 | 0.2327327 | 0.2312412 | 0.08800492 |
| Calinski–Harabasz index | $calinski_harabasz | max | 3417.288 | 1782.438 | 1847.939 | 6416.44 |
| Davies–Bouldin index | $davies_bouldin | min | 0.9058952 | 1.034618 | 1.039139 | 0.8715404 |
| Det ratio index | $det_ratio | min diff | 3.871588 | 3.047042 | 3.026433 | 7.844465 |
| Dunn index | $dunn | max | 0.00173042 | 0.00110608 | 0.00136502 | 0.00453987 |
| Baker–Hubert Gamma index | $gamma | max | 0.7785525 | 0.5188169 | 0.4677813 | 0.7813951 |
| G plus index | $g_plus | min | 0.05533569 | 0.1182925 | 0.1279114 | 0.04907176 |
| GDI index | $gdi11 | max | 0.00173042 | 0.00110608 | 0.00136502 | 0.00453987 |
| GDI index | $gdi12 | max | 0.01493041 | 0.01049106 | 0.02023069 | 0.02933953 |
| GDI index | $gdi13 | max | 0.00511206 | 0.00372532 | 0.00717291 | 0.01019923 |
| GDI index | $gdi21 | max | 1.21189 | 1.324565 | 1.406772 | 1.227989 |
| GDI index | $gdi22 | max | 10.45642 | 12.56334 | 20.84945 | 7.936054 |
| GDI index | $gdi23 | max | 3.580196 | 4.461175 | 7.392295 | 2.75879 |
| GDI index | $gdi31 | max | 0.2752019 | 0.2248528 | 0.1981029 | 0.5592138 |
| GDI index | $gdi32 | max | 2.374494 | 2.132702 | 2.936039 | 3.614 |
| GDI index | $gdi33 | max | 0.8130083 | 0.7573111 | 1.04099 | 1.256326 |
| GDI index | $gdi41 | max | 0.2410528 | 0.1885064 | 0.1703023 | 0.4925548 |
| GDI index | $gdi42 | max | 2.079849 | 1.78796 | 2.524013 | 3.183207 |
| GDI index | $gdi43 | max | 0.7121241 | 0.6348952 | 0.8949034 | 1.10657 |
| GDI index | $gdi51 | max | 0.09224953 | 0.09543682 | 0.08576157 | 0.2129281 |
| GDI index | $gdi52 | max | 0.7959463 | 0.9052065 | 1.271053 | 1.376078 |
| GDI index | $gdi53 | max | 0.2725258 | 0.3214341 | 0.4506593 | 0.4783627 |
| Ksq DetW index | $ksq_detw | max diff | 7,186,235,240 | 73,842,109 | 74,344,937 | 3.02E+12 |
| Log Det ratio index | $log_det_ratio | min diff | 6768.324 | 5570.856 | 5536.924 | 10,299.04 |
| Log SS ratio index | $log_ss_ratio | min diff | 0.3131567 | −0.3377083 | −0.3016196 | 0.9431728 |
| McClain–Rao index | $mcclain_rao | min | 0.3721302 | 0.5643422 | 0.595994 | 0.4384923 |
| PBM index | $pbm | max | 42.68164 | 0.9640561 | 0.6983515 | 1497.468 |
| Point Biserial index | $point_biserial | max | −2.693745 | −0.4353606 | −0.3859055 | −11.7986 |
| Ray–Turi index | $ray_turi | min | 0.2860079 | 0.4883412 | 0.4338252 | 0.2302405 |
| Ratkowsky–Lance index | $ratkowsky_lance | max | 0.3568949 | 0.3725436 | 0.3764579 | 0.4178183 |
| Scott–Symons index | $scott_symons | min | 14,296.68 | −6276.843 | −5815.897 | 47,340.83 |
| SD index | $sd_scat | min | 0.5612758 | 0.7125735 | 0.6275758 | 0.2424001 |
| SD index | $sd_dis | min | 0.3689349 | 1.142325 | 1.029613 | 0.07119309 |
| S_Dbw index | $s_dbw | min | 1.177152 | 2.50059 | 4.480659 | 1.999409 |
| Silhouette index | $silhouette | max | 0.3342391 | 0.3077699 | 0.2943706 | 0.4099526 |
| Tau index | $tau | max | 0.5503895 | 0.3637915 | 0.3243153 | 0.5235662 |
| Trace W index | $trace_w | max diff | 56,644.87 | 5836.337 | 5748.384 | 1,175,143 |
| Trace WiB index | $trace_wib | max diff | 2.784339 | 1.556713 | 1.47942 | 5.971501 |
| Wemmert–Gancarski index | $wemmert_gancarski | max | 0.5552955 | 0.5115594 | 0.4629613 | 0.5244954 |
| Xie–Beni index | $xie_beni | min | 5550.067 | 14,184.04 | 6752.686 | 2710.222 |

Appendix 2

Complete observations from internal evaluation conducted on the AGNES clustering results using the R package clusterCrit

The PCA, ICA, LLE, and t-SNE columns give the index value after dimensionality reduction using that technique.

| Index | ClusterCrit variable | Goodness indicator | PCA | ICA | LLE | t-SNE |
|---|---|---|---|---|---|---|
| Ball–Hall index | $ball_hall | max diff | 19.53912 | 1.331913 | 1.623174 | 363.1799 |
| Banfield–Raftery index | $banfeld_raftery | min | 15,293.4 | 1734.322 | 1796.481 | 29,752.68 |
| C index | $c_index | min | 0.3628751 | 0.3158354 | 0.3416631 | 0.2416017 |
| Calinski–Harabasz index | $calinski_harabasz | max | 615.9316 | 1015.008 | 944.1628 | 2396.612 |
| Davies–Bouldin index | $davies_bouldin | min | 1.79305 | 1.329694 | 1.286655 | 1.515427 |
| Det ratio index | $det_ratio | min diff | 2.089563 | 2.089563 | 2.005657 | 4.858279 |
| Dunn index | $dunn | max | 0.00114581 | 0.00209901 | 0.00118646 | 0.00444984 |
| Baker–Hubert Gamma index | $gamma | max | 0.3062376 | 0.3735996 | 0.2849121 | 0.4667844 |
| G plus index | $g_plus | min | 0.1733258 | 0.1564965 | 0.1784953 | 0.1258491 |
| GDI index | $gdi11 | max | 0.00114581 | 0.00209901 | 0.00118646 | 0.00444984 |
| GDI index | $gdi12 | max | 0.01376865 | 0.02173099 | 0.0153745 | 0.02884337 |
| GDI index | $gdi13 | max | 0.00473243 | 0.00763023 | 0.00537622 | 0.00997892 |
| GDI index | $gdi21 | max | 1.15714 | 1.296565 | 1.365893 | 0.9185128 |
| GDI index | $gdi22 | max | 13.90482 | 13.42333 | 17.6997 | 5.953694 |
| GDI index | $gdi23 | max | 4.779239 | 4.713223 | 6.189309 | 2.059795 |
| GDI index | $gdi31 | max | 0.1850741 | 0.2396983 | 0.1761109 | 0.3479464 |
| GDI index | $gdi32 | max | 2.223952 | 2.481595 | 2.282103 | 2.255348 |
| GDI index | $gdi33 | max | 0.7643965 | 0.871342 | 0.7980158 | 0.7802811 |
| GDI index | $gdi41 | max | 0.09938497 | 0.16374 | 0.1200127 | 0.2061969 |
| GDI index | $gdi42 | max | 1.194264 | 1.695199 | 1.555164 | 1.336545 |
| GDI index | $gdi43 | max | 0.4104816 | 0.5952212 | 0.5438165 | 0.4624033 |
| GDI index | $gdi51 | max | 0.1013306 | 0.1180803 | 0.08362988 | 0.1522455 |
| GDI index | $gdi52 | max | 1.217644 | 1.222485 | 1.083704 | 0.9868379 |
| GDI index | $gdi53 | max | 0.4185175 | 0.4292411 | 0.3789543 | 0.3414155 |
| Ksq DetW index | $ksq_detw | max diff | 1.3315E+10 | 107,678,003 | 112,182,708 | 4.88E+12 |
| Log Det ratio index | $log_det_ratio | min diff | 3684.775 | 3684.775 | 3479.858 | 7903.421 |
| Log SS ratio index | $log_ss_ratio | min diff | −1.40031 | −0.9007937 | −0.9731472 | −0.0416343 |
| McClain–Rao index | $mcclain_rao | min | 0.7376162 | 0.6732493 | 0.7332948 | 0.6335125 |
| PBM index | $pbm | max | 7.771969 | 1.091036 | 1.552822 | 518.4593 |
| Point Biserial index | $point_biserial | max | −0.8765914 | −0.3172137 | −0.2494482 | −7.426781 |
| Ray–Turi index | $ray_turi | min | 1.920044 | 0.928044 | 1.039746 | 1.106356 |
| Ratkowsky–Lance index | $ratkowsky_lance | max | 0.3103157 | 0.3103157 | 0.3023539 | 0.4276503 |
| Scott–Symons index | $scott_symons | min | 20,174.66 | −3912.774 | −3847.927 | 48,827.03 |
| SD index | $sd_scat | min | 0.7649894 | 0.6738317 | 0.8583248 | 0.4549016 |
| SD index | $sd_dis | min | 0.5789202 | 1.703494 | 1.819146 | 0.1006037 |
| S_Dbw index | $s_dbw | min | 3.363968 | 2.699607 | 2.499008 | 2.769472 |
| Silhouette index | $silhouette | max | 0.2485357 | 0.3003016 | 0.2853209 | 0.2976705 |
| Tau index | $tau | max | 0.2164711 | 0.2640874 | 0.2013074 | 0.3207045 |
| Trace W index | $trace_w | max diff | 107,595.6 | 7111.126 | 7257.464 | 2,140,160 |
| Trace WiB index | $trace_wib | max diff | 0.9718295 | 0.9718295 | 0.9111961 | 2.86745 |
| Wemmert–Gancarski index | $wemmert_gancarski | max | 0.1370079 | 0.3178442 | 0.3221082 | 0.3178927 |
| Xie–Beni index | $xie_beni | min | 14,445.39 | 5647.428 | 10,638.43 | 2375.582 |
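
One way to read the two appendices is to apply each index's goodness indicator and note which technique it favours. The sketch below does this in Python for six indices whose indicator is a plain max or min. Rows marked "max diff" or "min diff" are left out, on the assumption that a difference-based rule cannot be applied to a single value per technique. The numbers are copied from Appendices 1 and 2; the helper names and the choice of six indices are illustrative assumptions, and the tally is a reading aid rather than a conclusion stated by the paper.

```python
from collections import Counter

TECHNIQUES = ("PCA", "ICA", "LLE", "t-SNE")

# (clusterCrit variable, goodness indicator, values in TECHNIQUES order)
KMEANS = [
    ("c_index", "min", (0.09553103, 0.2327327, 0.2312412, 0.08800492)),
    ("calinski_harabasz", "max", (3417.288, 1782.438, 1847.939, 6416.44)),
    ("davies_bouldin", "min", (0.9058952, 1.034618, 1.039139, 0.8715404)),
    ("dunn", "max", (0.00173042, 0.00110608, 0.00136502, 0.00453987)),
    ("silhouette", "max", (0.3342391, 0.3077699, 0.2943706, 0.4099526)),
    ("xie_beni", "min", (5550.067, 14184.04, 6752.686, 2710.222)),
]
AGNES = [
    ("c_index", "min", (0.3628751, 0.3158354, 0.3416631, 0.2416017)),
    ("calinski_harabasz", "max", (615.9316, 1015.008, 944.1628, 2396.612)),
    ("davies_bouldin", "min", (1.79305, 1.329694, 1.286655, 1.515427)),
    ("dunn", "max", (0.00114581, 0.00209901, 0.00118646, 0.00444984)),
    ("silhouette", "max", (0.2485357, 0.3003016, 0.2853209, 0.2976705)),
    ("xie_beni", "min", (14445.39, 5647.428, 10638.43, 2375.582)),
]


def best_technique(indicator, values):
    """Return the technique whose value is best under a max or min rule."""
    pick = max if indicator == "max" else min
    return TECHNIQUES[values.index(pick(values))]


def tally(rows):
    """Count, per technique, how many indices rank it best."""
    return Counter(best_technique(ind, vals) for _, ind, vals in rows)


kmeans_wins = tally(KMEANS)
agnes_wins = tally(AGNES)
print("k-means:", dict(kmeans_wins))
print("AGNES:  ", dict(agnes_wins))
```

On these six indices, t-SNE is favoured by all six for k-means and by four for AGNES, where LLE (Davies–Bouldin) and ICA (Silhouette) take one each.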

Copyright information

© 2021 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Renjith, S., Sreekumar, A., Jathavedan, M. (2021). A Comparative Analysis of Clustering Quality Based on Internal Validation Indices for Dimensionally Reduced Social Media Data. In: Chiplunkar, N.N., Fukao, T. (eds) Advances in Artificial Intelligence and Data Engineering. AIDE 2019. Advances in Intelligent Systems and Computing, vol 1133. Springer, Singapore. https://doi.org/10.1007/978-981-15-3514-7_78
