Elsevier

Knowledge-Based Systems

Volume 37, January 2013, Pages 502-514

Wavelet feature extraction and genetic algorithm for biomarker detection in colorectal cancer data

https://doi.org/10.1016/j.knosys.2012.09.011

Abstract

Biomarkers that predict patient survival play an important role in medical diagnosis and treatment. Selecting the significant biomarkers from hundreds of protein markers is a key step in survival analysis. In this paper, a novel method is proposed to detect prognostic biomarkers of survival in colorectal cancer patients using wavelet analysis, a genetic algorithm, and a Bayes classifier. The one-dimensional discrete wavelet transform (DWT) is normally used to reduce the dimensionality of biomedical data. In this study, the one-dimensional continuous wavelet transform (CWT) is proposed to extract features from colorectal cancer data. The one-dimensional CWT cannot reduce the dimensionality of the data, but it captures features that the DWT misses and is therefore complementary to the DWT. A genetic algorithm was applied to the extracted wavelet coefficients to select the optimized features, using a Bayes classifier to build its fitness function. The corresponding protein markers were located based on the positions of the optimized features. The Kaplan–Meier curve and Cox regression model were used to evaluate the performance of the selected biomarkers. Experiments were conducted on a colorectal cancer dataset, and several significant biomarkers were detected. A new protein biomarker, CD46, was found to be significantly associated with survival time.

Introduction

Survival analysis involves estimating the distribution of the time until death occurs, depending on the biology of the disease. It allows clinicians to plan a suitable treatment and counsel patients about their prognosis. In medical domains, survival analysis is mainly based on the Kaplan–Meier (KM) estimator and the Cox proportional hazards regression model [1], [2], which are used to evaluate the performance of prognostic markers. However, how to rank these biomarkers is a key step in survival analysis. Normally, the selection of biomarkers is based on medical knowledge and the diagnosis of the clinician [1], [2], which may overlook potential biomarkers. Machine learning algorithms have been widely used in biomarker analysis of high-dimensional medical data, such as microarray data [3], [4], [5] or mass spectrometry data [6], [7]. Despite their potential advantages over standard statistical methods, their applications to survival analysis are rare because of the difficulty of dealing with censored data [8]. Recent research has shown that machine learning methods, such as neural networks [9], [10], Bayesian networks [11], decision trees and the Naïve Bayes classifier [8], can be used to improve the survival model. However, none of these methods deals with biomarker selection in survival analysis.

In this study we propose a novel method of biomarker selection based on the one-dimensional continuous wavelet transform (CWT). Normally, the one-dimensional discrete wavelet transform (DWT) is used to reduce dimensionality in the analysis of high-dimensional biomedical data [12], [13]. In biomarker detection, however, the feature space must correspond to the original data space so that a detected feature can be traced back to a biomarker. The one-dimensional CWT detects the features of the data at every scale and position and preserves the local properties of the original data. The wavelet feature vector of the CWT has the same length as the original data and can therefore be used to locate the biomarker in the original data space.

First, we perform the one-dimensional continuous wavelet transform at different scales on the colorectal cancer data to extract the discriminant features. Then we use a genetic algorithm (GA) and a Bayes classifier to select the optimized features from the extracted wavelet coefficients. Because wavelets reveal the local features of the data (or time features) without losing the position information of the original data, the corresponding protein markers in the original data space are obtained from the positions of the optimized wavelet features. Finally, the Kaplan–Meier (KM) estimator and Cox regression model are used to evaluate the performance of the selected protein markers. A new protein biomarker, CD46, was found to have independent prognostic significance. Recent research suggests that “the immune system might be involved in the development and progression of colorectal cancer” [1], [14]. The detection of CD46 supports this conclusion.

The rest of the paper is organized as follows. In Section 2, we describe the colorectal cancer data. Our proposed method is introduced in Section 3. Wavelet feature extraction for colorectal cancer data is described in Section 4. In Section 5, a genetic algorithm based on a Bayes classifier is used to select the optimized features. Survival models are used to evaluate the selected biomarkers in Section 6. The experiments are presented in Section 7, followed by discussion and concluding comments in Section 8.

Section snippets

Colorectal cancer data

We use the same dataset that Professor Lindy Durrant used in her research [1], [2]. The study cohort comprised a consecutive series of 462 archived specimens of primary invasive colorectal cancer (CRC) tissue obtained from patients undergoing elective surgical resection of a histologically proven primary CRC at Nottingham University Hospitals, Nottingham, UK. The samples were collected between January 1994 and December 2000 from

The proposed method

Fig. 2 shows the selection process of significant biomarkers in survival analysis. First, the data were transformed into wavelet space at different scales to find the most discriminant features between the two groups. A genetic algorithm was then used to select the best features from the extracted wavelet features, and the significant protein markers were detected based on the optimized features in wavelet space. Finally, the Kaplan–Meier curve and Cox regression model were used to evaluate the

Wavelet feature extraction

A wavelet is a “small wave” whose energy is concentrated in time. A wavelet system is generated from a single scaling function or wavelet by simple scaling and translation. Wavelets give an accurate local description and separation of signal characteristics, and provide a tool for the analysis of transient or time-varying signals [26]. Wavelets are widely used for image processing and feature extraction [27], [28]. One dimensional continuous wavelet transform (CWT) of signal or data s
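To illustrate why CWT coefficients can be mapped back to positions in the original data, the following is a minimal sketch of a one-dimensional CWT computed by direct summation. The Mexican hat mother wavelet, the truncation width `half_width`, and the function names are assumptions made for illustration; the excerpt above does not state which wavelet or implementation the authors used.

```python
import math


def mexican_hat(t):
    """Mexican hat (Ricker) wavelet shape; normalising constant omitted (assumed choice)."""
    return (1.0 - t * t) * math.exp(-0.5 * t * t)


def cwt_1d(signal, scale, wavelet=mexican_hat, half_width=5.0):
    """Return one CWT coefficient per input position at the given scale."""
    n = len(signal)
    radius = int(math.ceil(half_width * scale))  # truncate the wavelet support
    coeffs = []
    for b in range(n):  # b is the translation, i.e. a position in the input
        acc = 0.0
        for k in range(max(0, b - radius), min(n, b + radius + 1)):
            acc += signal[k] * wavelet((k - b) / scale)
        coeffs.append(acc / math.sqrt(scale))  # 1/sqrt(scale) normalisation
    return coeffs


def cwt_features(signal, scales=(1, 2, 3, 4)):
    """Map each scale to a coefficient vector of the same length as the signal."""
    return {a: cwt_1d(signal, a) for a in scales}
```

Because every scale yields exactly `len(signal)` coefficients, the index of a selected coefficient is also the index of a marker in the original vector, which is the property that a decimated DWT does not offer directly.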

Biomarker selection based on genetic algorithm and Bayes classifier

After wavelet feature extraction, a genetic algorithm is employed to select the best features. Floating-point (real) encoding is used in this study. Student’s t-test is performed on the wavelet coefficients of the CWT at scale 3 to select the initial chromosome. Uniform crossover and Gaussian mutation are performed to create the next generations. In the fitness function, a Bayes classifier is used to evaluate the performance of feature subsets, using a linear combination of the empirical error
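The operators named above can be sketched as a small real-coded genetic algorithm. This is an illustration under stated assumptions, not the authors' implementation: the gene range [0, 1], the threshold decoding in `decode`, tournament selection, elitism, and all parameter values are hypothetical, and `fitness` is a caller-supplied function to be minimised (in the paper it is built from a Bayes classifier, whose exact form is not given in this excerpt).

```python
import random


def uniform_crossover(parent_a, parent_b, rng, swap_prob=0.5):
    """Swap each gene between the two parents with probability swap_prob."""
    child_a, child_b = list(parent_a), list(parent_b)
    for i in range(len(child_a)):
        if rng.random() < swap_prob:
            child_a[i], child_b[i] = child_b[i], child_a[i]
    return child_a, child_b


def gaussian_mutation(chrom, rng, rate=0.1, sigma=0.1, low=0.0, high=1.0):
    """Add N(0, sigma) noise to each gene with probability rate, then clip."""
    out = []
    for g in chrom:
        if rng.random() < rate:
            g = min(high, max(low, g + rng.gauss(0.0, sigma)))
        out.append(g)
    return out


def decode(chrom, threshold=0.5):
    """Hypothetical decoding: feature i is selected when its gene exceeds threshold."""
    return [i for i, g in enumerate(chrom) if g > threshold]


def evolve(fitness, n_genes, rng, init=None, pop_size=20, generations=30):
    """Minimise fitness(chromosome); init optionally seeds one chromosome."""
    pop = [[rng.random() for _ in range(n_genes)] for _ in range(pop_size)]
    if init is not None:
        pop[0] = list(init)  # e.g. a chromosome derived from t-test ranking
    for _ in range(generations):
        scored = sorted(pop, key=fitness)
        nxt = scored[:2]  # elitism: carry the two best over unchanged
        while len(nxt) < pop_size:
            a = min(rng.sample(scored, 3), key=fitness)  # tournament of 3
            b = min(rng.sample(scored, 3), key=fitness)
            for child in uniform_crossover(a, b, rng):
                nxt.append(gaussian_mutation(child, rng))
        pop = nxt[:pop_size]
    return min(pop, key=fitness)
```

A call such as `evolve(fitness, n_genes, random.Random(0), init=seed)` returns the best chromosome found, and `decode` turns it into a list of selected feature positions. With elitism, the best fitness in the population never gets worse from one generation to the next.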

Kaplan–Meier estimator

Kaplan–Meier (KM) analysis is a non-parametric technique for estimating time-related events, especially when not all subjects continue in the study [39]. It analyzes the distribution of patient survival times following enrollment in a study, including the proportion of patients still alive up to a given time following enrollment, i.e. “censored data”. “Censored data” means that the survival time for the subjects cannot be accurately determined because these patients are still alive at the time of
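As a concrete illustration of how censored observations enter the estimate, here is a minimal sketch of the standard Kaplan–Meier product-limit estimator. The function name and the input convention (1 for an observed death, 0 for a censored patient) are assumptions made for this example.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times:  follow-up duration for each patient
    events: 1 if death was observed at that time, 0 if the patient was censored
    Returns a list of (time, survival probability) at each distinct death time.
    """
    order = sorted(zip(times, events))
    at_risk = len(order)  # patients still under observation just before t
    surv = 1.0
    curve = []
    i = 0
    while i < len(order):
        t = order[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(order) and order[i][0] == t:
            deaths += order[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        # Censored patients leave the risk set without changing the estimate.
        at_risk -= removed
    return curve
```

For example, with follow-up times `[1, 2, 3, 4, 5]` and event flags `[1, 0, 1, 1, 0]`, the estimate drops to 0.8 at time 1, to about 0.533 at time 3 and to about 0.267 at time 4. The censored patient at time 2 only shrinks the risk set from four to three.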

Experiments and results

In this section, several experiments are conducted. In Section 7.1, we compare the features extracted using CWT with those extracted using DWT. The experimental results show that our proposed CWT method captures information that DWT misses. In Section 7.2, the performance of the CWT features is compared with that of the original data, because other feature extraction methods, such as PCA and LDA, are not applicable in biomarker detection. In Section 7.3, several subsets of biomarkers are

Discussion and conclusions

In this study we propose a novel method of biomarker detection in survival analysis. Two groups of patients were used to select the biomarkers from the colorectal cancer data: patients with a survival time of less than 30 months, and patients with a survival time of more than 70 months. First, continuous wavelet analysis was used to extract the discriminant features between the two groups of patients. The best discriminant features were obtained based on CWT at scale 3.
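The two-group construction and a t-test ranking of features can be sketched as follows. This is a hedged illustration: the function names are hypothetical, the pooled-variance form of Student's t statistic is assumed, and the handling of censored patients is not described in this excerpt and is therefore not modelled. Only the 30-month and 70-month cut-offs come from the text.

```python
import math


def split_groups(survival_months, short_max=30.0, long_min=70.0):
    """Return (short_idx, long_idx); patients between the cut-offs are left out."""
    short_idx = [i for i, m in enumerate(survival_months) if m < short_max]
    long_idx = [i for i, m in enumerate(survival_months) if m > long_min]
    return short_idx, long_idx


def t_statistic(x, y):
    """Two-sample Student's t statistic with pooled variance."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    ssx = sum((v - mx) ** 2 for v in x)
    ssy = sum((v - my) ** 2 for v in y)
    pooled = (ssx + ssy) / (nx + ny - 2)
    return (mx - my) / math.sqrt(pooled * (1.0 / nx + 1.0 / ny))


def rank_features(rows, short_idx, long_idx):
    """rows[i][j] is feature j of patient i; return feature indices by decreasing |t|."""
    scores = []
    for j in range(len(rows[0])):
        a = [rows[i][j] for i in short_idx]
        b = [rows[i][j] for i in long_idx]
        scores.append((abs(t_statistic(a, b)), j))
    return [j for _, j in sorted(scores, reverse=True)]
```

Applied to wavelet coefficients at a single scale, the top-ranked indices give candidate positions that could seed the initial chromosome of the genetic algorithm. The sketch assumes each group has at least two patients and non-zero pooled variance for every feature.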

References (60)

  • R.W. Swiniarski et al.

    Rough set methods in feature selection and recognition

    Pattern Recognition Letters

    (2003)
  • P. Bermejo et al.

    Fast wrapper feature subset selection in high-dimensional datasets by means of filter re-ranking

    Knowledge-Based Systems

    (2012)
  • B. Huang et al.

    Dominance-based rough set model in intuitionistic fuzzy information systems

    Knowledge-Based Systems

    (2012)
  • Q. He et al.

    Fuzzy rough set based attribute reduction for information systems with fuzzy decisions

    Knowledge-Based Systems

    (2011)
  • S. Deng et al.

    G-ANMI: a mutual information based genetic clustering algorithm for categorical data

    Knowledge-Based Systems

    (2010)
  • I.A. Gheyas et al.

    Feature subset selection in large dimensionality domains

    Pattern Recognition

    (2010)
  • X. Wang et al.

    Palmprint verification based on 2D-Gabor wavelet and pulse-coupled neural network

    Knowledge-Based Systems

    (2012)
  • A.H. Wright

    Genetic algorithms for real parameter optimization

  • K. Valarmathi et al.

    Real-coded genetic algorithm for system identification and controller tuning

    Applied Mathematical Modelling

    (2009)
  • L.J. Eshelman et al.

    Real-coded genetic algorithms and interval schemata

  • E.D. Hawkins et al.

    CD46 signaling in T cells: linking pathogens with polarity

    FEBS Letters

    (2010)
  • Z. Fishelson et al.

    Obstacles to cancer immunotherapy: expression of membrane complement regulatory proteins (mCRPs) in tumors

    Molecular Immunology

    (2003)
  • S. Ni Choileain et al.

    CD46 processing: a means of expression

    Immunobiology

    (2012)
  • J.A.D. Simpson, A. Al-Attar, N.F.S. Watson, J.H. Scholefield, M. Ilyas, L.G. Durrant, Intratumoral T cell infiltration,...
  • G.J. Ullenhag et al.

    Overexpression of FLIP L is an independent marker of poor prognosis in colorectal cancer patients

    Clinical Cancer Research

    (2007)
  • Y. Liu

    Detect key genes information in classification of microarray data

    EURASIP Journal on Advances in Signal Processing

    (2008)
  • Y. Liu et al.

    Find significant gene information based on changing points of microarray data

    IEEE Transactions on Biomedical Engineering

    (2009)
  • J. Li et al.

    Discovery of significant rules for classifying cancer diagnosis data

    Bioinformatics

    (2003)
  • C. Greenhill

    Cancer: New biomarkers of good prognosis in colorectal cancer identified

    Nature Reviews Gastroenterology and Hepatology

    (2010)
  • H. Peng et al.

    Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2005)