Introduction to Clustering

Partitional Clustering via Nonsmooth Optimization

Part of the book series: Unsupervised and Semi-Supervised Learning (UNSESUL)

Abstract

In this chapter, we first define the commonly used tasks and terminologies in data analysis and show the importance of data clustering. Then we give a mathematical formulation of the clustering problem and explain the frequently used similarity measures. Finally, we provide a short survey on different types of existing clustering algorithms and describe some of the most popular applications of data clustering.

Copyright information

© 2020 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Bagirov, A.M., Karmitsa, N., Taheri, S. (2020). Introduction to Clustering. In: Partitional Clustering via Nonsmooth Optimization. Unsupervised and Semi-Supervised Learning. Springer, Cham. https://doi.org/10.1007/978-3-030-37826-4_1

  • DOI: https://doi.org/10.1007/978-3-030-37826-4_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-37825-7

  • Online ISBN: 978-3-030-37826-4

  • eBook Packages: Engineering, Engineering (R0)
