Abstract
In this chapter, we first define the commonly used tasks and terminologies in data analysis and show the importance of data clustering. Then we give a mathematical formulation of the clustering problem and explain the frequently used similarity measures. Finally, we provide a short survey on different types of existing clustering algorithms and describe some of the most popular applications of data clustering.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Aggarwal, C.C., Reddy, C.K.: Data Clustering: Algorithms and Applications. CRC Press, Boca Raton (2014)
Aliguliyev, R.M.: Performance evaluation of density-based clustering methods. Inf. Sci. 179(20), 3583–3602 (2009)
Aliguliyev, R.M.: Clustering of document collection: a weighting approach. Expert Syst. Appl. 36(4), 7904–7916 (2009)
Andritsos, P., Tsaparas, P., Miller, R.J., Servcik, K.C.: LIMBO: a linear algorithm to cluster categorical data. Technical Report CSRG-467, Department of Computer Science, UofT (2003)
Baeza-Yates, R.A.: Introduction to data structures and algorithms related to information retrieval. In: Frakes, W.B., Baeza Yates, R. (eds.) Information Retrieval: Data Structures and Algorithms. Prentice Hall, Upper Saddle River, NJ, pp. 13–27 (1992)
Bagirov, A.M., Mardaneh, K.: Modified global k-means algorithm for clustering in gene expression data sets. In: Boden, M., Bailey, T. (eds.) Proceedings of the AI 2006 Workshop on Intelligent Systems of Bioinformatics, pp. 23–28 (2006)
Bagirov, A.M., Ugon, J., Mirzayeva, H.: Nonsmooth nonconvex optimization approach to clusterwise linear regression problems. Eur. J. Oper. Res. 229(1), 132–142 (2013)
Bagirov, A.M., Ugon, J., Mirzayeva, H.: Nonsmooth optimization algorithm for solving clusterwise linear regression problems. J. Optim. Theory Appl. 164(3), 755–780 (2015)
Bagirov, A.M., Ugon, J., Mirzayeva, H.: An algorithm for clusterwise linear regression based on smoothing techniques. Optim. Lett. 9(2), 375–390 (2015)
Bagirov, A.M., Mahmood, A., Barton, A.: Prediction of monthly rainfall in Victoria, Australia: clusterwise linear regression approach. Atmos. Res. 188, 20–29 (2017)
Brauksa, I.: Use of cluster analysis in exploring economic indicator differences among regions: the case of latvia. J. Econ. Bus. Manag. 1(1), 42–45 (2013)
Brown, M., Grundy, W., Lin, D., Christianini, N., Sugnet, C., Furey, T., Ares, M., Haussler, D.: Knowledg-based analysis of microarray gene expression data using support vector machines. Proc. Natl. Acad. Sci. 97, 262–267 (2000)
Cariou, C., Chehdi, K.: Unsupervised nearest neighbors clustering with application to hyperspectral images. IEEE J. Sel. Top. Sign. Process. 9(6), 1105–1116 (2015)
Celebi, M.E.: Improving the performance of k-means for color quantization. Image Vis. Comput. 29(4), 260–271 (2011)
Chaudhuri, B.B., Garai, G.: Grid clustering with genetic algorithm and tabu search process. J. Pattern Recogn. Res. 4(1), 152–168 (2009)
Cheng, Y., Church, G.M.: Biclustering of expression data. In: Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, vol. 8, pp. 93–103 (2000)
Chipman, H., Tibshirani, R.: Hybrid hierarchical clustering with applications to microarray data. Biostatistics 7(2), 286–301 (2006)
Courvisanos, J., Jain, A., Mardaneh, K.: Economic resilience of regions under crises: a study of the Australian Economy. Reg. Stud. 50(4), 629–643 (2016)
DeSarbo, W.S., William, L.C.: A maximum likelihood methodology for clusterwise linear regression. J. Classif. 5(2), 249–282 (1988)
Dhillon, I.S., Fan, J., Guan, Y.: Efficient clustering of very large document collections. In: Kamath, C., Kumar, V., Grossman, R., Namburu, R. (eds.) Data Mining for Scientific and Engineering Applications, Massive Computing, vol. 2, pp. 357–381. Springer, Boston, MA (2001)
Dolnicar, S.: Using cluster analysis for market segmentation - typical misconceptions, established methodological weaknesses and some recommendations for improvement. Australasian J. Mark. Res. 11(2), 5–12 (2003)
Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D.: Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. 95, 14863–14868 (1998)
Eren, K., Deveci, M., Kücüktunc, O., Catalyürek, U.V.: A comparative analysis of biclustering algorithms for gene expression data. Brief. Bioinform. 14(3), 279–292 (2013)
Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Simoudis, E., Han, J., Fayyad, U.M. (eds.) Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, pp. 226–231 (1996)
Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P.: Advances in knowledge discovery and data mining. In: American Association for Artificial Intelligence, pp. 1–34 (1996)
Finnie, G., Sun, Z.: r 5 model for case-based reasoning. Knowl. Based Syst. 16, 59–65 (2003)
Frismantas, V., et al.: Ex vivo drug response profiling detects recurrent sensitivity patterns in drug-resistant acute lymphoblastic leukemia. Blood 129(11), e26–e37 (2017)
Ganti, V., Gehrke, J., Ramakrishnan, R.: CACTUS: clustering categorical data using summaries. In: Knowledge Discovery and Data Mining, pp. 73–83 (1999)
Gibson, D., Kleinberg, J., Raghavan, P.: Clustering categorical data: an approach based on dynamical systems. In: Proceedings of the 24th International Conference on Very Large Databases (VLDB), pp. 103–114 (1998)
Guha, S., Rastogi, R., Shim, K.: CURE: an efficient clustering algorithm for large databases. In: Proceedings of ACM SIGMOD International Conference on Management of Data, pp. 73–84. ACM Press, New York (1998)
Guha, S., Rastogi, R., Shim, K.: ROCK: a robust clustering algorithm for categorical attributes. Inf. Syst. 25(5), 345–366 (2000)
Han, J., Kamber, M., Pei, J.: Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems, 3rd edn. Morgan Kaufmann, San Francisco, CA (2011)
Hruschka, H., Natter, M.: Comparing performance of feedforward neural nets and k-means for cluster-based market segmentation. Eur. J. Oper. Res. 114(2), 346–353 (1999)
Huang, J.J., Tzeng, G.H., Ong, C.Sh.: Marketing segmentation using support vector clustering. Expert Syst. Appl. 32(2), 313–317 (2007)
Jain, A.K., Dubes, R.: Algorithms for Clustering Data. Prentice Hall, Upper Saddle River, NJ (1988)
Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Comput. Surv. 31(3), 264–323 (1999)
Jardine, N., Sibson, R.: Mathematical Taxonomy. Wiley, London/New York (1971)
Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Statistics. Wiley, New York (1990)
Ketchen, D.J., Shook, C.L.: The application of cluster analysis in strategic management research: an analysis and critique. Strateg. Manag. J. 17(6), 441–458 (1996)
King, B.: Step-wise clustering procedures. J. Am. Stat. Assoc. 69, 86–101 (1967)
Kuo, R.J., Ho, L.M., Hu, C.M.: Integration of self-organizing feature map and k-means algorithm for market segmentation. Comput. Oper. Res. 29(11), 1475–1493 (2002)
Le-Khac, N., Cai, F., Kechadi, M.: Clustering approaches for financial data analysis: a survey. In: Abou-Nasr, M. Arabnia, H. (eds.) Proceedings of the International Conference on Data Mining, Las Vegas, Nevada (2012)
Lu, S.Y., Fu, K.S.: A sentence to sentence clustering procedure for pattern analysis. IEEE Trans. Syst. Man Cybern. 8(5), 381–389 (1978)
Murtagh, F.: A survey of recent advances in hierarchical clustering algorithms which use cluster centres. Comput. J. 26(4), 354–359 (1984)
Mustjoki, S., et al.: Discovery of novel drug sensitivities in T-PLL by high-throughput ex vivo drug testing and mutation profiling. Leukemia 32, 774–787 (2017)
Nagy, G.: State of the art in pattern recognition. Proc. IEEE 56(5), 836–862 (1968)
Nappa, S.D., Wang, X., Nair, S.: A comparison of machine learning techniques for phishing detection. In: Proceedings of the Anti-Phishing Working Groups 2nd Annual eCrime Researchers Summit (eCrime 07), New York, pp. 60–69 (2007)
Oyelade, J., Isewon, I., Oladipupo, F., Aromolaran, O., Uwoghiren, E. Ameh, F., Achas, M., Adebiyi, E.: Clustering algorithms: their application to gene expression data. Bioinf. Biol. Insights 10, 237–253 (2016)
Parsons, L., Haque, E., Liu, H.: Subspace clustering for high dimensional data: a review. ACM SIGKDD Explorations Newsletter - Special issue on learning from imbalanced datasets 6(1), 90–105 (2004)
Pemovska, T., et al.: Individualized systems medicine strategy to tailor treatments for patients with chemorefractory acute myeloid leukemia. Cancer Discov. 3(12), 1416–1429 (2013)
Poggi, J.M., Portier, B.: PM10 forecasting using clusterwise regression. Atmos. Environ. 45(38), 7005–7014 (2011)
Punj, G., Stewart, D.W.: Cluster analysis in marketing research: review and suggestions for application. J. Mark. Res. 20(2), 134–148 (1983)
Rezanková, H.: Cluster analysis of economic data. Statistica 94(1), 73–86 (2014)
Rosch, E.: Principles of Categorization. MIT Press, Cambridge (1999)
Seifollahi, S., Bagirov, A.M. Layton, R., Gondal, I.: Optimization based clustering algorithms for authorship analysis of phishing emails. Neural Process. Lett. 46(2), 411–425 (2017)
Slonm, N., Tishby, N.: Document clustering using word clusters via the information bottleneck method. In: Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 208–215 (2000)
Sneath, P.H.A., Sokal, R.R.: Numerical Taxonomy. Freeman, London (1973)
Späth, H.: Algorithm 39: clusterwise linear regression. Computing 22(4), 367–373 (1979)
Späth, H.: Cluster Analysis Algorithms for Data Reduction and Classification of Objects. Computers and Their applications. Ellis Horwood Limited, Chichester (1980)
Späth, H.: The Cluster Dissection and Analysis Theory FORTRAN Programs Examples. Prentice-Hall, Upper Saddle River, NJ (1985)
Thalamuthu, A., Mukhopadhyay, I., Zheng, X., Tseng, G.C.: Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics 22(19), 2405–2412 (2006)
Tran, T.N., Wehrens, R., Buydens, L.M.C.: KNN-kernel density-based clustering for high-dimensional multivariate data. Comput. Stat. Data Anal. 51(2), 513–525 (2006)
Tsai, C.Y., Chiu, C.C.: A purchase-based market segmentation methodology. Expert Syst. Appl. 27(2), 265–276 (2004)
Ward, J.H.: Hierarchical grouping to optimize and objective function. J. Am. Stat. Assoc. 58(301), 236–244 (1963)
Wedel, M., Kistemaker, C.: Consumer benefit segmentation using clusterwise linear regression. Int. J. Res. Mark. 6(1), 45–59 (1989)
Wierzchon, S.T., Klopotek, M.A.: Modern Algorithms of Cluster Analysis. Springer, Cham (2018)
Yeung, K.Y., Haynor, D.R., Ruzzo, W.L.: Validating clustering for gene expression data. Bioinformatics 17(4), 309–318 (2001)
Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: an efficient data clustering method for very large databases. In: Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 103–114 (1996)
Author information
Authors and Affiliations
Rights and permissions
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
M. Bagirov, A., Karmitsa, N., Taheri, S. (2020). Introduction to Clustering. In: Partitional Clustering via Nonsmooth Optimization. Unsupervised and Semi-Supervised Learning. Springer, Cham. https://doi.org/10.1007/978-3-030-37826-4_1
Download citation
DOI: https://doi.org/10.1007/978-3-030-37826-4_1
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-37825-7
Online ISBN: 978-3-030-37826-4
eBook Packages: EngineeringEngineering (R0)