Analysis of high-dimensional biomedical data using an evolutionary multi-objective emperor penguin optimizer

https://doi.org/10.1016/j.swevo.2019.04.010

Abstract

Over the last two decades, the generation and exploration of high-dimensional biomedical data have expanded rapidly. Identifying biomarkers from genomics data poses a significant challenge in microarray data analysis, so effective algorithms are needed for the methodical analysis of genomics datasets. In this work, a multi-objective version of the emperor penguin optimization (EPO) algorithm with chaos, namely multi-objective chaotic EPO (MOCEPO), is proposed. The suggested approach extends the original continuous single-objective EPO to a binary multi-objective model. The objectives are to minimize the number of selected genes (NSG) and to maximize the classification accuracy (CA). Fisher score and minimum redundancy maximum relevance (mRMR) are independently used as initial filters, and the proposed MOCEPO is then employed for simultaneous optimal feature selection and cancer classification. The proposed algorithm is evaluated on seven well-known high-dimensional binary-class and multi-class datasets. To assess its effectiveness, the proposed method is compared with the non-dominated sorting genetic algorithm (NSGA-II), multi-objective particle swarm optimization (MOPSO), the chaotic version of GA for multi-objective optimization (CGAMO), and chaotic MOPSO (CMOPSO). The experimental results show that the proposed framework achieves better CA with fewer selected genes than the existing schemes. The approach demonstrates its efficacy with regard to NSG, accuracy, sensitivity, specificity, and F-measure.

Introduction

The rapidly growing DNA microarray technology has enabled researchers to measure the expression levels of thousands of genes simultaneously in a single experiment [1]. The biomarker genes extracted from microarray data help in the clinical diagnosis, prognosis, and treatment of cancer. However, the high dimensionality of microarray data increases the computational overhead and hence poses a significant challenge in biomedical data analysis. To overcome this issue, irrelevant and redundant genes need to be discarded using feature selection techniques [2,3]. Selecting only the relevant genes is expected to reduce the size of the gene expression data and enhance the CA.

Several gene selection (GS) techniques have been suggested in the literature to identify the useful genes present in microarray data. These GS techniques can be broadly classified into three categories, namely, filter methods, wrapper methods, and hybrid methods [4,5]. Filter-based methods select the relevant genes from the original gene set based on statistical characteristics. Despite their simplicity and computational efficiency, filter techniques cannot exploit the relationships among genes, which reduces the overall accuracy. Wrapper-based techniques, on the other hand, employ the knowledge of classifiers such as kernel ridge regression (KRR) [6], support vector machine (SVM) [7], K-nearest neighbor (KNN) [8,9], Naive Bayes (NB) [10], radial basis function neural networks (RBFN) [11], and decision tree (DT) [12] to find the biomarkers. The wrapper models use bio-inspired algorithms to identify optimal solutions by exploring the search space with a set of candidate solutions (a population). Evolutionary algorithms such as GA [13,14,15], differential evolution (DE) [16,17], artificial bee colony algorithm (ABC) [18], genetic bee colony optimization (GBC) [19], ant colony optimization (ACO) [20], salp swarm algorithm (SSA) [21], firefly algorithm (FA) [22], bidirectional elitist optimization [23], and PSO [24,25,26,27,28] have been successfully utilized for solving numerous feature selection problems. These methods are capable of learning the associations among genes and therefore lead to better CA. Hybrid methods combine the merits of both by first employing a filter method to reduce the NSG and then applying a wrapper method to explore the optimal gene subset.

To select biomarker genes faster and more efficiently, multi-objective methods have been designed. In the last two decades, several multi-objective optimization methods have been proposed, namely, MOPSO [29], CMOPSO [30], NSGA-II [31], CGAMO [32], multi-objective FA (MOFA) [33], multi-objective teaching-learning-based optimization (MOTLBO) [34], multi-objective gravitational search algorithm (MOGSA) [35], and multi-objective differential evolution (MODE) [36]. These methods have proved effective in solving multi-objective problems. Although each of the above algorithms is competent at solving specific tasks, none can solve all optimization problems with dissimilar characteristics [37]. Hence, there is always room for a novel method that can solve a problem the present methods cannot address.

The two important phases of any metaheuristic algorithm are diversification and intensification [38,39]. Diversification ensures that the algorithm explores the various promising regions of the search space, whereas intensification searches for optimal solutions around the promising regions identified by the diversification phase [40]. Properly balancing these two phases is important for any optimization problem, which motivates us to employ the EPO algorithm. The second motivation is the ‘no free lunch’ theorem, which states that no single metaheuristic is capable of solving all optimization problems [37].

In this paper, a novel multi-objective version of the EPO algorithm, namely MOCEPO, is proposed. EPO [41] is a recently developed metaheuristic method, originally designed for single-objective optimization problems. In this work, we extend the single-objective EPO to a multi-objective binary EPO by utilizing two multi-objective operators, namely, non-dominated sorting and crowding distance. The two objectives of our problem are to minimize the NSG and to maximize the CA. The CA is computed by the KRR classifier. To reduce the redundant genes, the Fisher score and mRMR filters are employed independently.
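The two operators named above can be illustrated with a minimal sketch. It assumes each candidate solution is summarized as a pair (NSG, CA), with fewer genes and higher accuracy preferred. The function names and the toy population are hypothetical and are not taken from the paper; the simple front-peeling loop follows the textbook definition of non-dominated sorting rather than any optimized variant the authors may have used.

```python
from math import inf

def dominates(a, b):
    """True if a dominates b. Each solution is (nsg, ca):
    fewer selected genes and higher accuracy are better."""
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better

def non_dominated_sort(points):
    """Return fronts as lists of indices; front 0 is the non-dominated set."""
    remaining = set(range(len(points)))
    fronts = []
    while remaining:
        # A point belongs to the current front if nothing left dominates it.
        front = sorted(
            i for i in remaining
            if not any(dominates(points[j], points[i])
                       for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts

def crowding_distance(points, front):
    """Crowding distance for each index in one front (larger = less crowded)."""
    dist = {i: 0.0 for i in front}
    for k in range(2):  # objective 0 = NSG, objective 1 = CA
        ordered = sorted(front, key=lambda i: points[i][k])
        lo, hi = points[ordered[0]][k], points[ordered[-1]][k]
        # Boundary solutions are always kept, so they get infinite distance.
        dist[ordered[0]] = dist[ordered[-1]] = inf
        if hi == lo:
            continue
        for prev, cur, nxt in zip(ordered, ordered[1:], ordered[2:]):
            dist[cur] += (points[nxt][k] - points[prev][k]) / (hi - lo)
    return dist

# Toy population of (NSG, CA) pairs, invented for illustration only.
population = [(10, 0.90), (5, 0.85), (20, 0.95), (12, 0.88), (25, 0.93)]
fronts = non_dominated_sort(population)   # [[0, 1, 2], [3, 4]]
distances = crowding_distance(population, fronts[0])
```

In this toy population, solutions 3 and 4 fall into the second front because solution 0 and solution 2, respectively, use fewer genes while achieving higher accuracy. Within a front, a larger crowding distance marks a solution in a sparser region of the objective space, which is the usual basis for preferring it during selection.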

The five major contributions of this work are as follows:

  • For the first time, a multi-objective version of the EPO algorithm is proposed.
  • Chaos theory is introduced into the MOEPO for faster convergence.
  • A multi-objective operator, non-dominated sorting, is incorporated to rank the Pareto-optimal solutions.
  • Selection of the fittest solution is carried out by the crowding distance operator.
  • The proposed method is applied to simultaneous GS and cancer classification.
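As a rough illustration of how chaos can enter a binary optimizer, the sketch below replaces uniform random numbers with a chaotic sequence. This excerpt does not state which chaotic map or which binarization rule MOCEPO uses, so both the logistic map and the sigmoid transfer function here are assumptions chosen because they are common in chaotic metaheuristics. The function names are hypothetical.

```python
import math

def logistic_map(x0=0.7, r=4.0):
    """Yield an endless chaotic sequence in [0, 1] using x <- r * x * (1 - x).
    The map and its parameters are assumptions, not the paper's stated choice."""
    x = x0
    while True:
        x = r * x * (1.0 - x)
        yield x

def binarize(position, chaos):
    """Turn a continuous position vector into a 0/1 gene mask.
    A sigmoid gives the selection probability of each gene; the chaotic
    value stands in for the uniform random threshold."""
    return [
        1 if 1.0 / (1.0 + math.exp(-v)) > next(chaos) else 0
        for v in position
    ]

chaos = logistic_map()
mask = binarize([10.0, -10.0], chaos)  # strongly positive -> 1, strongly negative -> 0
```

The design idea is that a chaotic sequence is deterministic yet non-repeating, so it can cover the unit interval differently from a pseudo-random stream. Whether this speeds up convergence depends on the algorithm and the problem.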

The proposed approach is evaluated on seven standard datasets. Its performance is measured in terms of CA, NSG, F-measure, specificity, Matthews correlation coefficient (MCC), and sensitivity. The results show that our method not only achieves higher CA but also reduces the NSG effectively.
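For reference, the classification metrics listed above can be computed from a confusion matrix as in the following sketch. It covers only the binary-class case and uses the standard textbook definitions. How the paper averages these metrics over multi-class datasets is not stated in this excerpt, so that is left out. The function name and the example labels are hypothetical.

```python
from math import sqrt

def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity, specificity, F-measure and MCC for two classes."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    def ratio(num, den):
        # Return 0.0 when a metric is undefined (zero denominator).
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)   # true positive rate (recall)
    specificity = ratio(tn, tn + fp)   # true negative rate
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": ratio(2 * precision * sensitivity, precision + sensitivity),
        "mcc": ratio(tp * tn - fp * fn,
                     sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))),
    }

# Invented labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
scores = binary_metrics([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0, 0, 1])
```

With these invented labels, accuracy, sensitivity, specificity, and F-measure all equal 0.75 and MCC equals 0.5.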

The remainder of the paper is organized as follows. Section 2 explains the methods used along with the proposed work. Experimental setup and performance metrics are presented in Section 3. The results of the work are presented and discussed in Section 4. Finally, we conclude the work in Section 5.

Section snippets

Pre-selection of genes

To filter out highly redundant and irrelevant genes, filter-based gene ranking algorithms are commonly used. In this paper, we employ the Fisher score [42] and mRMR [43] filters separately for gene pre-selection; both perform reliably in segregating the relevant genes [44,45]. Compared with methods such as the T-test, Z-score, and information gain, Fisher score and mRMR produce superior results [45,46]. Nonetheless, every technique has its advantages that influence

Datasets

The proposed method is applied to seven standard microarray cancer datasets [50,51], listed in Table 1. Of the seven datasets, three are binary-class and four are multi-class. Prior to feature selection by the Fisher score and mRMR filters, min-max normalization to the range [-1,1] is applied to the whole dataset.
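Min-max normalization to [-1,1] maps each value x to -1 + 2(x - min)/(max - min). The sketch below applies this per gene (per column), which is an assumption: the excerpt only says the normalization is applied to the whole dataset. Mapping a constant column to the midpoint of the range is likewise an illustrative choice, and the function name is hypothetical.

```python
def min_max_scale(rows, low=-1.0, high=1.0):
    """Scale each column (gene) of a samples-by-genes table to [low, high]."""
    scaled_columns = []
    for column in zip(*rows):
        cmin, cmax = min(column), max(column)
        span = cmax - cmin
        if span == 0:
            # Constant gene: no spread to rescale, so use the midpoint.
            scaled_columns.append([(low + high) / 2.0] * len(column))
        else:
            scaled_columns.append(
                [low + (high - low) * (v - cmin) / span for v in column]
            )
    # Transpose back to one row per sample.
    return [list(row) for row in zip(*scaled_columns)]

# Three samples, two genes; the second gene is constant.
scaled = min_max_scale([[0.0, 5.0], [5.0, 5.0], [10.0, 5.0]])
# scaled == [[-1.0, 0.0], [0.0, 0.0], [1.0, 0.0]]
```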

Experimental setup

The experiments are carried out in MATLAB 2017b on a machine with a Core i5 processor (2.70 GHz) and 8 GB of main memory. The simulation results are evaluated

Experimental results of feature selection using Fisher score and mRMR

In this experiment, two feature selection methods, namely, Fisher score and mRMR, are used independently as initial filters to select the top N statistically relevant biomarkers. N ranges from 1 to 500 (see Fig. 4).
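A minimal sketch of top-N selection with the Fisher score follows. It uses the common definition of the score, the between-class scatter of a gene divided by its within-class scatter. The exact variant used in the paper (reference [42]) may differ, and mRMR is not shown. The function names and the toy data are hypothetical.

```python
from collections import defaultdict

def fisher_scores(X, y):
    """Fisher score per gene for a samples-by-genes table X and labels y."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    scores = []
    for j in range(len(X[0])):
        overall_mean = sum(row[j] for row in X) / len(X)
        between = within = 0.0
        for rows in by_class.values():
            values = [row[j] for row in rows]
            mean = sum(values) / len(values)
            between += len(values) * (mean - overall_mean) ** 2
            within += sum((v - mean) ** 2 for v in values)  # n_c * variance_c
        if within > 0:
            scores.append(between / within)
        else:
            # No within-class spread: perfectly separating or constant gene.
            scores.append(float("inf") if between > 0 else 0.0)
    return scores

def top_n_genes(X, y, n):
    """Indices of the n genes with the highest Fisher score."""
    scores = fisher_scores(X, y)
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:n]

# Toy data: gene 0 separates the two classes, gene 1 does not.
X = [[1.0, 5.0], [2.0, 4.0], [9.0, 5.0], [10.0, 4.0]]
y = [0, 0, 1, 1]
selected = top_n_genes(X, y, 1)  # [0]
```

In the toy data, gene 0 scores 64 and gene 1 scores 0, so gene 0 is ranked first. Sweeping n over a range such as 1 to 500 and recording the accuracy for each subset would reproduce the kind of curve the text attributes to Fig. 4.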

Further, these selected features are sent to the KRR model with default C and γ. In this work, the default values of C and γ are taken as 1 and 100, respectively. Fig. 4 shows how the CA changes as the NSG increases on the various datasets for the Fisher score and mRMR filters. It is observed

Conclusion

Evolutionary algorithms play an important role in finding the relevant genes in high-dimensional microarray data and hence help systems biologists in cancer diagnosis. Identifying a smaller number of biomarkers with higher CA substantially improves the quality of the expert systems used in hospitals.

In the present work, a multi-objective model based on the principle of the chaotic EPO algorithm has been proposed for microarray cancer classification. There are two major merits of the

References (66)

  • S. Kar et al. Gene selection from microarray gene expression data for classification of cancer subgroups employing PSO and adaptive K-nearest neighborhood technique. Expert Syst. Appl. (2015)
  • Y. Sun et al. Chaotic multi-objective particle swarm optimization algorithm incorporating clone immunity. Mathematics (2019)
  • V. Ravi et al. Financial time series prediction using hybrids of chaos theory, multi-layer perceptron and multi-objective evolutionary algorithms. Swarm Evol. Comput. (2017)
  • V.K. Patel et al. A multi-objective improved teaching-learning based optimization algorithm (MO-ITLBO). Inf. Sci. (2016)
  • E. Rashedi et al. A comprehensive survey on gravitational search algorithm. Swarm Evol. Comput. (2018)
  • J. Cheng et al. A grid-based adaptive multi-objective differential evolution algorithm. Inf. Sci. (2016)
  • M. Lozano et al. Hybrid metaheuristics with evolutionary algorithms specializing in intensification and diversification: overview and progress report. Comput. Oper. Res. (2010)
  • G. Dhiman et al. Emperor penguin optimizer: a bio-inspired algorithm for engineering problems. Knowl.-Based Syst. (2018)
  • Q. Song et al. Feature selection based on FDA and F-score for multi-class classification. Expert Syst. Appl. (2017)
  • M. Dashtban et al. Gene selection for microarray cancer classification using a new evolutionary method employing artificial intelligence concepts. Genomics (2017)
  • J. Lv et al. A multi-objective heuristic algorithm for gene expression microarray data classification. Expert Syst. Appl. (2016)
  • A.H. Gandomi et al. Chaotic bat algorithm. J. Comput. Sci. (2014)
  • Z. Zhu et al. Markov blanket-embedded genetic algorithm for gene selection. Pattern Recogn. (2007)
  • E.F. Petricoin et al. Use of proteomic patterns in serum to identify ovarian cancer. Lancet (2002)
  • Y. Wang et al. Informative gene selection for microarray classification via adaptive elastic net with conditional mutual information. Appl. Math. Model. (2019)
  • K.-H. Liu et al. A hierarchical ensemble of ECOC for cancer classification based on multi-class microarray data. Inf. Sci. (2016)
  • W. Ding et al. A hierarchical-coevolutionary-MapReduce-based knowledge reduction algorithm with robust ensemble Pareto equilibrium. Inf. Sci. (2016)
  • W. Ding et al. Multiagent-consensus-MapReduce-based attribute reduction using co-evolutionary quantum PSO for big data applications. Neurocomputing (2018)
  • K. Muhammad et al. Early fire detection using convolutional neural networks during surveillance for effective disaster management. Neurocomputing (2018)
  • Y.-D. Zhang et al. Image based fruit category classification by 13-layer deep convolutional neural network and data augmentation. Multimed. Tool. Appl. (2017)
  • S. Pang et al. Classification consistency analysis for bootstrapping gene selection. Neural Comput. Appl. (2007)
  • E.K. Tang et al. Feature selection for microarray data using least squares SVM and particle swarm optimization
  • W. Ding et al. Multiple relevant feature ensemble selection based on multilayer co-evolutionary consensus MapReduce. IEEE Trans. Cybern. (2018)