Clustering-Based Techniques for Big Data Analysis of Gene Expression

Das, Tanuja; Pratim Kalita, Partha; Saha, Goutam

doi:10.1007/978-981-33-4084-8_16

Tanuja Das, Partha Pratim Kalita & Goutam Saha

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 170)

Abstract

Thorough investigation of cancer has always been of foremost importance for accurate prognosis and, in turn, for choosing the correct treatment. Microarray-based gene expression profiling is widely practised for this purpose, which has made the discovery of gene clusters responsible for a particular behavior a leading research interest. Big data analytics provides an efficient way to extract facts about the biological processes underlying this microarray data. Many earlier attempts used a variety of clustering approaches, but their results deviated considerably from reality. In this work, we attempt to discover potential and accurate gene indicators from gene expression data using a well-known technique called quantum clustering. Its characteristic feature is that the number of clusters is not fixed in advance but is determined by the nature of the data. The method rests on the idea that a cluster is a dense region of the data space whose center lies at a density maximum, and this motivated us to look for clusters that may be involved in a particular biological process. The approach has the advantage that very dense regions are detected inherently and combined into arbitrarily shaped clusters, regardless of the dimension of the space. To compare the results, we also applied a non-parametric method, mean shift clustering, to the same gene expression data. For validation, we used DAVID to check the significance of the clusters obtained. The results show that the genes so discovered are highly indicative for the study of rare diseases.
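To make the two methods named in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes the usual formulation of quantum clustering by Horn and Gottlieb, in which a Gaussian density estimate ψ defines a potential V and cluster centers are the minima of V reached by gradient descent, and a plain Gaussian-kernel mean shift for comparison. The function names, the parameters `sigma`, `bandwidth`, `step`, `n_steps` and `merge_tol`, and the synthetic two-group data are all illustrative assumptions; real expression matrices would need normalisation and parameter tuning.

```python
import numpy as np

def qc_gradient(points, data, sigma):
    """Gradient of the quantum-clustering potential
    V(x) = const + sum_i |x - x_i|^2 w_i / (2 sigma^2 psi),  psi = sum_i w_i,
    with Gaussian weights w_i = exp(-|x - x_i|^2 / (2 sigma^2))."""
    diff = points[:, None, :] - data[None, :, :]          # (m, n, d)
    sq = np.sum(diff ** 2, axis=2)                        # squared distances (m, n)
    w = np.exp(-sq / (2.0 * sigma ** 2))                  # kernel weights (m, n)
    psi = w.sum(axis=1, keepdims=True)                    # density estimate (m, 1)
    v = (sq * w).sum(axis=1, keepdims=True) / (2.0 * sigma ** 2 * psi)
    wd = w[:, :, None] * diff                             # (m, n, d)
    grad_a = ((2.0 - sq / sigma ** 2)[:, :, None] * wd).sum(axis=1)
    grad_psi = -wd.sum(axis=1) / sigma ** 2
    return grad_a / (2.0 * sigma ** 2 * psi) - v * grad_psi / psi

def merge_converged(points, merge_tol):
    """Group converged points: points closer than merge_tol share a center."""
    centers, labels = [], np.empty(len(points), dtype=int)
    for i, p in enumerate(points):
        for k, c in enumerate(centers):
            if np.linalg.norm(p - c) < merge_tol:
                labels[i] = k
                break
        else:  # no existing center is close enough, so start a new cluster
            centers.append(p)
            labels[i] = len(centers) - 1
    return labels, np.array(centers)

def quantum_cluster(data, sigma, step=0.1, n_steps=300, merge_tol=0.5):
    """Move a copy of every sample downhill on V; the cluster count is not given."""
    data = np.asarray(data, dtype=float)
    x = data.copy()
    for _ in range(n_steps):
        x -= step * qc_gradient(x, data, sigma)
    return merge_converged(x, merge_tol)

def mean_shift(data, bandwidth, n_steps=300, merge_tol=0.5):
    """Gaussian-kernel mean shift: move each point to the weighted mean of the data."""
    data = np.asarray(data, dtype=float)
    x = data.copy()
    for _ in range(n_steps):
        diff = x[:, None, :] - data[None, :, :]
        w = np.exp(-np.sum(diff ** 2, axis=2) / (2.0 * bandwidth ** 2))
        x = w @ data / w.sum(axis=1, keepdims=True)
    return merge_converged(x, merge_tol)

# Synthetic stand-in for an expression matrix: 60 "genes" measured in 2 "conditions",
# drawn from two well-separated groups (purely illustrative data).
rng = np.random.default_rng(0)
expr = np.vstack([rng.normal(0.0, 0.3, (30, 2)), rng.normal(5.0, 0.3, (30, 2))])

qc_labels, qc_centers = quantum_cluster(expr, sigma=1.0)
ms_labels, ms_centers = mean_shift(expr, bandwidth=1.0)
print("quantum clustering found", len(qc_centers), "clusters")
print("mean shift found", len(ms_centers), "clusters")
```

Neither routine is told how many clusters to find; the count falls out of how many distinct points the samples converge to, which is the property the abstract highlights. The all-pairs distance arrays grow quadratically with the number of genes, so a sketch like this would need batching or approximation before it could be applied to full microarray data.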

Author information

Authors and Affiliations

Department of Information Technology, GUIST, Guwahati, Assam, India
Tanuja Das & Partha Pratim Kalita
Department of Information Technology, NEHU, Shillong, Meghalaya, India
Goutam Saha

Corresponding author

Correspondence to Tanuja Das.

Editor information

Editors and Affiliations

Department of Information Technology, North-Eastern Hill University, Shillong, Meghalaya, India
Arnab Kumar Maji
Department of Information Technology, North-Eastern Hill University, Shillong, Meghalaya, India
Goutam Saha
Department of Information Technology, North-Eastern Hill University, Shillong, Meghalaya, India
Sufal Das
Department of Computer Science and Engineering, Jadavpur University, Kolkata, West Bengal, India
Subhadip Basu
Departamento de Engenharia Mecânica, Faculdade de Engenharia, Universidade do Porto, Porto, Portugal
João Manuel R. S. Tavares

About this paper

Cite this paper

Das, T., Pratim Kalita, P., Saha, G. (2021). Clustering-Based Techniques for Big Data Analysis of Gene Expression. In: Maji, A.K., Saha, G., Das, S., Basu, S., Tavares, J.M.R.S. (eds) Proceedings of the International Conference on Computing and Communication Systems. Lecture Notes in Networks and Systems, vol 170. Springer, Singapore. https://doi.org/10.1007/978-981-33-4084-8_16

DOI: https://doi.org/10.1007/978-981-33-4084-8_16
Published: 11 April 2021
Publisher Name: Springer, Singapore
Print ISBN: 978-981-33-4083-1
Online ISBN: 978-981-33-4084-8
eBook Packages: Intelligent Technologies and Robotics (R0)