Integration of heterogeneous ‘omics’ data using semi-supervised network labelling to identify essential genes in colorectal cancer

https://doi.org/10.1016/j.compeleceng.2018.03.039

Abstract

Colorectal cancer (CRC) is the third most common form of cancer and has the fourth highest mortality rate in the world. To understand the origin and progression of this disease, biomedical researchers undertake global analyses of ‘omics’ data from CRC patient samples and representative cell lines. However, because ‘omics’ data are heterogeneous and high-dimensional, traditional tools for analysing them are inadequate, and the heterogeneous nature of cancer makes identifying essential genes very difficult. ‘Omics’ refers to areas of study in biology whose names end in ‘omics’, such as genomics, proteomics and metabolomics. This paper uses network theory-based methods to address the high dimensionality of omics datasets and applies network propagation to address the heterogeneity of both omics datasets and cancer when identifying essential genes. The method successfully identifies known essential genes in CRC as well as a new set of genes that are likely to be essential in the study of CRC.

Introduction

Network theory, the study of how complex systems interact, is widely applied in fields such as computer networks, social networks, and interactome networks in systems biology [1]. Network metrics such as node degree are often used to prioritise nodes within a network. Similarly, one of the main goals in cancer research is the identification of biomarkers or essential genes that can be used to understand the development or progression of a specific cancer type such as colorectal cancer (CRC).
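As a minimal illustration of degree-based prioritisation, the sketch below ranks the nodes of a small undirected network by their number of distinct neighbours. The edge list and the node names (geneA to geneE) are hypothetical placeholders, not data from this study, and the function name rank_by_degree is introduced here only for illustration.

```python
from collections import defaultdict


def rank_by_degree(edges):
    """Return (node, degree) pairs sorted by descending degree.

    Degree is counted as the number of distinct neighbours in an
    undirected network given as a list of (node, node) pairs.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    # Sort by degree (highest first); ties are broken alphabetically.
    return sorted(
        ((node, len(adjacent)) for node, adjacent in neighbours.items()),
        key=lambda item: (-item[1], item[0]),
    )


# Hypothetical toy network; the names are placeholders, not real genes.
toy_edges = [
    ("geneA", "geneB"),
    ("geneA", "geneC"),
    ("geneA", "geneD"),
    ("geneD", "geneE"),
]
degree_ranking = rank_by_degree(toy_edges)
```

In this toy network geneA has the most neighbours and is therefore ranked first. Degree is only one of several possible metrics and is shown here because the text names it explicitly.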

To prioritise these genes, researchers often study the complex interactions between the numerous molecules within cells, such as proteins, deoxyribonucleic acid (DNA), ribonucleic acid (RNA) and other small molecules. The molecules are obtained from the global profiling of patient samples and representative cell lines at multiple layers; these layers constitute what is today referred to as ‘omics’ data. ‘Omics’ is an informal term for areas of study in biology whose names end in ‘omics’, such as genomics, proteomics and metabolomics [2]. The interactions, on the other hand, are collectively known as interactome networks and provide a global picture of how molecular interactions influence cellular behaviour; protein-protein interactions (PPI) are one example [3].

Omics data are high-dimensional. Coupled with this is the heterogeneity of cancer, whereby two individuals with the same type of cancer may have different sets of biomarkers. This makes identifying and prioritising cancer-related genes a challenging task that cannot be achieved using traditional statistical methods. Network theory provides a means of modelling the behaviour of the cellular system in the presence of such complexity. Barabási et al. [4] summarise the application of network-based metrics in associating omics-related molecules with disease. Other works [5], [6], [7] applied network-based methods in areas such as identifying genes and associating them with disease, as well as identifying drug targets in various cancer types. In [8], [9], network-based methods are integrated with machine learning techniques to reduce the dimensionality of omics data and to build models that predict disease-associated genes and classify multiple cancer types. While the integration of omics data with networks has been gaining momentum over the years, a recurring theme in most of this research has been the use of a single type of omics data, as opposed to integrating the various types of omics data, which are heterogeneous in nature.

In this paper, we used an integrated approach to identify essential genes in colorectal cancer, a type of cancer that originates in the bowel. It is the third most common form of cancer and has the fourth highest cancer mortality rate in the world [10]. The integrated approach employed a semi-supervised learning algorithm to propagate heterogeneous omics data into a protein-protein interaction network. This was followed by a downstream enrichment analysis to validate the predicted potential essential genes and to understand their role in CRC.
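The idea of propagating prior labels through a weighted interaction network can be sketched as follows. This excerpt does not state which propagation algorithm was used, so the sketch assumes one common semi-supervised formulation, the iterative update F ← αSF + (1 − α)Y, where S is the symmetrically normalised weighted adjacency matrix and Y holds the prior labels. The function name propagate, the parameter values, the edge weights and the node names are all hypothetical.

```python
import math


def propagate(edges, prior, alpha=0.5, n_iter=100):
    """Iterate F <- alpha * S * F + (1 - alpha) * Y on an undirected graph.

    edges: list of (node, node, weight) with positive weights.
    prior: dict mapping some nodes to an initial label (others default to 0).
    This is an assumed, generic formulation, not necessarily the
    algorithm used in the paper.
    """
    # Weighted degree (strength) of every node.
    strength = {}
    for a, b, w in edges:
        strength[a] = strength.get(a, 0.0) + w
        strength[b] = strength.get(b, 0.0) + w
    nodes = sorted(strength)
    y = {n: float(prior.get(n, 0.0)) for n in nodes}
    f = dict(y)
    for _ in range(n_iter):
        spread = {n: 0.0 for n in nodes}
        for a, b, w in edges:
            # Entry of S = D^(-1/2) W D^(-1/2) for this edge.
            s = w / math.sqrt(strength[a] * strength[b])
            spread[a] += s * f[b]
            spread[b] += s * f[a]
        f = {n: alpha * spread[n] + (1 - alpha) * y[n] for n in nodes}
    return f


# Hypothetical chain g1 - g2 - g3 - g4 with made-up confidence weights;
# only g1 carries a prior label.
toy_edges = [("g1", "g2", 0.9), ("g2", "g3", 0.8), ("g3", "g4", 0.7)]
scores = propagate(toy_edges, prior={"g1": 1.0})
```

In this toy chain the labelled node keeps the highest score, and unlabelled nodes receive smaller, non-zero scores that decrease with their distance from it. The parameter alpha controls how much weight goes to the network neighbourhood relative to the prior label.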

The rest of the paper is organised as follows. Section 2 describes the materials and methods used and gives an overview of related works. Section 3 discusses the experimental results and the implications of the findings. The paper concludes with a summary of the findings and the future directions of the research.

Proteomics data

We used proteomics and genomics data as the input to our method. The proteomics data consisted of protein-protein interactions. Weighted protein-protein interactions were downloaded from HIPPIE Version 2.0 [11], a web-based database of weighted protein-protein interactions. The weight of an interaction expresses the confidence in the interaction between two proteins and is calculated by the HIPPIE authors from the amount and reliability of the evidence supporting that interaction. The

Propagation of omics data

Network propagation was performed on both the mutation status data and the differential expression status data; Fig. 2 shows the resulting distributions of scores in TCGA samples. The figure also shows the relationship between the propagation scores and their corresponding status data. It shows that genes with a high frequency of mutation or differential expression across samples are labelled with a propagation score close to their initial label in the prior knowledge dataset.
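The per-gene frequency referred to above can be illustrated with a short sketch that computes, for each gene, the fraction of samples in which it is flagged as mutated or differentially expressed. How the status data were actually encoded is not described in this excerpt, so the input layout (a mapping from sample identifier to a set of flagged genes), the function name status_frequency and the toy values are assumptions made only for illustration.

```python
from collections import Counter


def status_frequency(samples):
    """Fraction of samples in which each gene is flagged.

    samples: dict mapping a sample id to an iterable of flagged genes
    (for example, mutated or differentially expressed genes).
    """
    n_samples = len(samples)
    if n_samples == 0:
        return {}
    # Count each gene at most once per sample.
    counts = Counter(gene for genes in samples.values() for gene in set(genes))
    return {gene: count / n_samples for gene, count in counts.items()}


# Hypothetical status data for four samples; names are placeholders.
toy_samples = {
    "s1": {"g1", "g2"},
    "s2": {"g1"},
    "s3": {"g1", "g3"},
    "s4": {"g2"},
}
frequency = status_frequency(toy_samples)
```

Here g1 is flagged in three of the four toy samples, so it has the highest frequency of the three genes.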

Acknowledgements

SM is supported by the Australian NHMRC fellowship (1016599) and Ramaciotti Establishment grant. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

This research was supported by the use of the NeCTAR Research Cloud, a collaborative Australian research platform supported by the National Collaborative Research Infrastructure Strategy.

David Chisanga is currently a Ph.D. candidate in the Department of Computer Science and Information Technology at La Trobe University, Melbourne, VIC, Australia. He received his BSc degree in Computer Science from the University of Zambia in 2010 and his MSc degree in Information Technology Management from Binary University in 2013. His research interests include bioinformatics, data science, and machine learning.

References (25)

  • T. Gui et al. Identification of hepatocellular carcinoma-related genes with a machine learning and network analysis. J Comput Biol (2015)
  • S. Philips et al. Using machine learning algorithms to identify genes essential for cell survival. BMC Bioinformatics (2017)

    Shivakumar Keerthikumar is currently a cancer bioinformatician in the Cancer Research Division, Peter MacCallum Cancer Centre, Melbourne, Australia. His current research combines whole-genome, transcriptomics and DNA-methylation data to understand tumour evolution in prostate cancer. He obtained his Ph.D. in Bioinformatics from the Institute of Bioinformatics, Bangalore, India. He did his post-doctoral research in Intellectual Disability (ID) at the Centre for Molecular and Biomolecular Informatics (CMBI), Radboud University Nijmegen Medical Centre, Nijmegen, The Netherlands. He also worked as a Research Officer in the Department of Biochemistry and Genetics, La Trobe University, Melbourne, Australia, studying the role of exosomes in CRC using genomics and proteomics data. His research interests span from functional genomics to systems biology using computational methods.

    Suresh Mathivanan is currently Associate Professor and laboratory head, Biochemistry and Genetics, La Trobe University, Melbourne, VIC, Australia. He obtained his Ph.D. degree from Kuvempu University, India and Johns Hopkins University, USA. His current research areas include cancer, chemoresistance, exosomes, extracellular vesicles, tumour microenvironment, proteomics, and bioinformatics.

    Naveen Chilamkurti is currently Associate Professor and Cybersecurity Program Coordinator, Computer Science and Information Technology, La Trobe University, Melbourne, VIC, Australia. He obtained his Ph.D. degree from La Trobe University. His current research areas include intelligent transport systems (ITS), smart grid computing, vehicular communications, vehicular cloud, cybersecurity, wireless multimedia, wireless sensor networks, and mobile security.

    Reviews processed and recommended for publication to the Editor-in-Chief by Guest Editor Dr. A. Sangaiah.
