Introduction to Clustering

Bagirov, Adil M.; Karmitsa, Napsu; Taheri, Sona

doi:10.1007/978-3-030-37826-4_1

Part of the book series: Unsupervised and Semi-Supervised Learning (UNSESUL)

Abstract

In this chapter, we first define the commonly used tasks and terminology in data analysis and show the importance of data clustering. We then give a mathematical formulation of the clustering problem and explain the frequently used similarity measures. Finally, we provide a short survey of the different types of existing clustering algorithms and describe some of the most popular applications of data clustering.

Author information

Authors and Affiliations

School of Science, Engineering & Information Technology, Federation University Australia, Ballarat, VIC, Australia
Adil M. Bagirov & Sona Taheri
Department of Mathematics and Statistics, University of Turku, Turku, Finland
Napsu Karmitsa

About this chapter

Cite this chapter

Bagirov, A.M., Karmitsa, N., Taheri, S. (2020). Introduction to Clustering. In: Partitional Clustering via Nonsmooth Optimization. Unsupervised and Semi-Supervised Learning. Springer, Cham. https://doi.org/10.1007/978-3-030-37826-4_1

DOI: https://doi.org/10.1007/978-3-030-37826-4_1
Published: 25 February 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-37825-7
Online ISBN: 978-3-030-37826-4
eBook Packages: Engineering, Engineering (R0)
