Integration of heterogeneous ‘omics’ data using semi-supervised network labelling to identify essential genes in colorectal cancer☆
Introduction
Network theory, the study of how complex systems interact, is widely applied in fields such as computer networks, social networks, and interactome networks in systems biology [1]. Network metrics such as node degree are often used to prioritise nodes within a network. Similarly, one of the main goals in cancer research is the identification of biomarkers or essential genes that can be used to understand the development or progression of a specific cancer type such as colorectal cancer (CRC).
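As a minimal illustration of degree-based prioritisation, the sketch below counts the neighbours of each node in a small undirected network and ranks the nodes by that count. The edge list and node names are hypothetical placeholders and are not taken from the data used in this paper.

```python
from collections import defaultdict

# Hypothetical toy edge list; node names are placeholders, not real genes.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]


def node_degrees(edge_list):
    """Return the number of distinct neighbours of each node in an undirected graph."""
    neighbours = defaultdict(set)
    for u, v in edge_list:
        neighbours[u].add(v)
        neighbours[v].add(u)
    return {node: len(nbrs) for node, nbrs in neighbours.items()}


degrees = node_degrees(edges)
# Rank nodes from highest to lowest degree; ties are broken alphabetically.
ranking = sorted(degrees, key=lambda n: (-degrees[n], n))
print(ranking)
```

Degree is only one of several possible metrics; it is used here because it is the one named in the text.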
To prioritise these genes, researchers often study the complex interactions between the numerous molecules within cells, such as proteins, deoxyribonucleic acid (DNA), ribonucleic acid (RNA) and other small molecules. The molecules are obtained from the global profiling of patient samples as well as representative cell lines at multiple layers; these layers constitute what is today referred to as ‘omics’ data. ‘Omics’ is an informal term for areas of study in biology whose names end in ‘omics’, such as genomics, proteomics and metabolomics [2]. The interactions, on the other hand, are collectively known as interactome networks and provide a global picture of how molecular interactions influence cellular behaviour, an example being protein-protein interactions (PPI) [3].
Omics data are high-dimensional in nature. Coupled with this is the heterogeneity of cancer, whereby two individuals with the same type of cancer may have different sets of biomarkers. This makes identifying and prioritising cancer-related genes a challenging task that cannot readily be achieved using traditional statistical methods. Network theory provides a means of using this complexity to model the behaviour of the cellular system. Barabási et al. [4] provide a summary of the application of network-based metrics in associating omics-related molecules with disease. Other works [5], [6], [7] applied network-based methods in areas such as identifying genes and associating them with disease, as well as identifying drug targets in various cancer types. In [8], [9], network-based methods integrated with machine learning techniques are applied to reduce the dimensionality of omics data and to build models that predict genes associated with disease and classify multiple cancer types. While the integration of omics data with networks has been gaining momentum over the years, a recurring theme in most of this research has been the use of a single type of omics data rather than the integration of the various types of omics data, which are heterogeneous in nature.
In this paper, we used an integrated approach to identify essential genes in colorectal cancer, a type of cancer that originates in the bowel, is the third most common form of cancer and has the fourth highest cancer mortality rate in the world [10]. The integrated approach employed a semi-supervised learning algorithm to propagate heterogeneous omics data into a protein-protein interaction network, which was followed by a downstream enrichment analysis to validate and understand the role of the predicted potential essential genes in CRC.
The rest of the paper is organised as follows. Section 2 describes the materials and methods used and gives an overview of related works. Section 3 discusses the experimental results and the implications of the findings. The paper concludes with a summary of the findings and the future directions of the research.
Section snippets
Proteomics data
We used proteomics and genomics data as the input to our method. The proteomics data consisted of protein-protein interactions. Weighted protein-protein interactions were downloaded from HIPPIE Version 2.0 [11], a web-based database of weighted protein-protein interactions. The weight of an interaction indicates the confidence in the interaction between two proteins and is calculated by the database authors based on the amount and reliability of the evidence supporting that interaction. The
Propagation of omics data
Network propagation was performed on both the mutation status data and the differential expression status data. Fig. 2 shows the distribution of the resulting propagation scores in TCGA samples for each data type, together with the relationship between the propagation scores and their corresponding status data. The figure shows that genes with a high frequency of mutation or differential expression across samples are labelled with a propagation score close to their initial label in the prior knowledge dataset.
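To make the idea of propagating prior labels over a weighted interaction network concrete, the following is a minimal sketch of one common propagation scheme, the iteration F ← αSF + (1 − α)Y, where S is the symmetrically normalised weighted adjacency matrix and Y holds the prior labels. This update rule, the value of `alpha`, the function name `propagate`, and the toy network with genes `G1` to `G4` are all assumptions made for illustration; this excerpt does not specify the exact algorithm or parameters used in the paper.

```python
import math


def propagate(weights, prior, alpha=0.5, n_iter=100):
    """Iterate F <- alpha * S F + (1 - alpha) * Y on a weighted undirected graph.

    weights: dict mapping (u, v) node pairs to edge confidence weights
    prior:   dict mapping node -> initial label (nodes not listed start at 0.0)
    """
    nodes = sorted({n for edge in weights for n in edge})
    # Build a symmetric adjacency map and the weighted degree of every node.
    adj = {n: {} for n in nodes}
    for (u, v), w in weights.items():
        adj[u][v] = w
        adj[v][u] = w
    strength = {n: sum(adj[n].values()) for n in nodes}
    y = {n: prior.get(n, 0.0) for n in nodes}
    f = dict(y)
    for _ in range(n_iter):
        # Each node takes a normalised weighted sum of its neighbours' scores,
        # blended with its own prior label.
        f = {
            u: alpha * sum(
                w / math.sqrt(strength[u] * strength[v]) * f[v]
                for v, w in adj[u].items()
            )
            + (1 - alpha) * y[u]
            for u in nodes
        }
    return f


# Hypothetical chain G1 - G2 - G3 - G4; only G1 carries a prior label.
ppi = {("G1", "G2"): 0.9, ("G2", "G3"): 0.8, ("G3", "G4"): 0.4}
scores = propagate(ppi, {"G1": 1.0})
for gene in sorted(scores):
    print(gene, round(scores[gene], 3))
```

In this toy chain the labelled node keeps the highest score and the scores decay with distance from it. A larger `alpha` lets scores spread further from the prior, while a smaller one keeps them closer to the initial labels.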
Acknowledgements
SM is supported by the Australian NHMRC fellowship (1016599) and Ramaciotti Establishment grant. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
This research was supported by the use of the NeCTAR Research Cloud, a collaborative Australian research platform supported by the National Collaborative Research Infrastructure Strategy.
David Chisanga is currently a Ph.D. candidate in the Department of Computer Science and Information Technology, at La Trobe University, Melbourne, VIC, Australia. He received his BSc degree in Computer Science from the University of Zambia in 2010 and MSc. degree in Information Technology Management from Binary University in 2013. His research interests include Bioinformatics, Data Science, and Machine Learning.
References (25)
- et al. Hallmarks of Cancer: The Next Generation. Cell (2011)
- et al. Biophysical changes of ATP binding pocket may explain loss of kinase activity in mutant DAPK3 in cancer: a molecular dynamic simulation analysis. Gene (2016)
- et al. Small molecule inhibition of the ubiquitin-specific protease USP2 accelerates cyclin D1 degradation and leads to cell cycle arrest in colorectal cancer and mantle cell lymphoma models. J Biol Chem (2016)
- et al. Network tools for the analysis of proteomic data
- et al. ‘Omic’ technologies: genomics, transcriptomics, proteomics and metabolomics. Obstetric Gynaecol (2011)
- et al. The BioGRID interaction database: 2017 update. Nucleic Acids Res (2017)
- et al. Network medicine: a network-based approach to human disease. Nat Rev Genet (2011)
- et al. Network-based gene prediction for plasmodium falciparum malaria towards genetics-based drug discovery. BMC Genom (2015)
- et al. Predicting disease-related genes using integrated biomedical networks. BMC Genom (2017)
- et al. Discovering disease-associated genes in weighted protein-protein interaction networks. Physica A (2017)
- Identification of hepatocellular carcinoma-related genes with a machine learning and network analysis. J Comput Biol
- Using machine learning algorithms to identify genes essential for cell survival. BMC Bioinformatics
Shivakumar Keerthikumar is currently a cancer bioinformatician at the Cancer Research Division, Peter MacCallum Cancer Centre, Melbourne, Australia. His current research combines whole-genome, transcriptomics and DNA-methylation data to understand tumour evolution in prostate cancer. He obtained his Ph.D. in Bioinformatics from the Institute of Bioinformatics, Bangalore, India. He did his post-doctoral research in Intellectual Disability (ID) at the Centre for Molecular and Biomolecular Informatics (CMBI), Radboud University Nijmegen Medical Centre, Nijmegen, The Netherlands. He also worked as a Research Officer at the Department of Biochemistry and Genetics, La Trobe University, Melbourne, Australia, on understanding the role of exosomes in CRC using genomics and proteomics data. His research interests range from functional genomics to systems biology using computational methods.
Suresh Mathivanan is currently Associate Professor and laboratory head, Biochemistry and Genetics, La Trobe University, Melbourne, VIC, Australia. He obtained his Ph.D. degree from Kuvempu University, India and Johns Hopkins University, USA. His current research areas include cancer, chemoresistance, exosomes, extracellular vesicles, tumour microenvironment, proteomics, and bioinformatics.
Naveen Chilamkurti is currently Associate Professor and Cybersecurity Program Coordinator, Computer Science and Information Technology, La Trobe University, Melbourne, VIC, Australia. He obtained his Ph.D. degree from La Trobe University. His current research areas include intelligent transport systems (ITS), smart grid computing, vehicular communications, vehicular cloud, cybersecurity, wireless multimedia, wireless sensor networks, and mobile security.
☆ Reviews processed and recommended for publication to the Editor-in-Chief by Guest Editor Dr. A. Sangaiah.