Wavelet feature extraction and genetic algorithm for biomarker detection in colorectal cancer data
Introduction
Survival analysis estimates the distribution of the time until death occurs, which depends on the biology of the disease. It allows clinicians to plan suitable treatment and counsel patients about their prognosis. In medical domains, survival analysis is mainly based on the Kaplan–Meier (KM) estimator and the Cox proportional hazards regression model [1], [2], which are used to evaluate the performance of prognostic markers. However, ranking these biomarkers is a key step in survival analysis. Normally, the selection of biomarkers is based on medical knowledge and the diagnosis of the clinician [1], [2], which may overlook potential biomarkers. Machine learning algorithms have been widely used in biomarker analysis of high-dimensional medical data, such as microarray data [3], [4], [5] or mass spectrometry data [6], [7]. Despite their potential advantages over standard statistical methods, their applications to survival analysis are rare because of the difficulty of dealing with censored data [8]. Recent research has shown that machine learning methods, such as neural networks [9], [10], Bayesian networks [11], decision trees and the naïve Bayes classifier [8], can be used to improve the survival model. However, none of these methods deals with biomarker selection in survival analysis.
In this study we propose a novel method of biomarker selection based on the one-dimensional continuous wavelet transform (CWT). Normally, the one-dimensional discrete wavelet transform (DWT) is used to reduce dimensionality in the analysis of high-dimensional biomedical data [12], [13]. In biomarker detection, however, the feature space must correspond to the original data space so that a detected feature can be traced back to a biomarker. The one-dimensional CWT detects features of the data at every scale and position and keeps the local properties of the original data. The wavelet feature vector of the CWT has the same length as the original data and can therefore be used to locate the biomarker in the original data space.
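The length-preserving property described above can be illustrated with a minimal sketch. It assumes a Mexican hat mother wavelet and integer scales 1 to 3; the function names `mexican_hat` and `cwt` and the `half_width` parameter are hypothetical, and the excerpt does not state which mother wavelet the study used.

```python
import numpy as np

def mexican_hat(t):
    """Unnormalised Mexican hat wavelet (an assumed choice of mother wavelet)."""
    return (1.0 - t**2) * np.exp(-0.5 * t**2)

def cwt(signal, scales, half_width=5.0):
    """Return CWT coefficients with shape (len(scales), len(signal))."""
    signal = np.asarray(signal, dtype=float)
    coeffs = np.empty((len(scales), signal.size))
    for i, a in enumerate(scales):
        # Sample the scaled wavelet psi(t / a) / sqrt(a) on an integer grid.
        radius = int(np.ceil(half_width * a))
        t = np.arange(-radius, radius + 1) / a
        kernel = mexican_hat(t) / np.sqrt(a)
        # The Mexican hat is symmetric, so convolution equals correlation.
        full = np.convolve(signal, kernel, mode="full")
        # Keep the central part so that coefficient k aligns with sample k.
        coeffs[i] = full[radius:radius + signal.size]
    return coeffs

# Toy profile with a single peak at position 7 (hypothetical data).
profile = np.zeros(20)
profile[7] = 1.0
coeffs = cwt(profile, scales=[1, 2, 3])
peak_position_at_scale_3 = int(np.argmax(coeffs[2]))
```

Because each row of `coeffs` has one coefficient per input position, the index of a selected coefficient is also the index of the corresponding variable in the original data, which is what makes the mapping back to a marker possible.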
First we perform the one-dimensional continuous wavelet transform at different scales on colorectal cancer data to extract discriminant features. Then we use a genetic algorithm (GA) and a Bayes classifier to select the optimized features from the extracted wavelet coefficients. Wavelets are well known to reveal local features of the data (or time features) without losing the position information of the original data, so the corresponding protein markers in the original data space are obtained from the positions of the optimized wavelet features. Finally, the Kaplan–Meier (KM) estimator and Cox regression model are used to evaluate the performance of the selected protein markers. A new protein biomarker, CD46, was found to have independent prognostic significance. Recent research suggests that “the immune system might be involved in the development and progression of colorectal cancer” [1], [14]. The detection of CD46 supports this conclusion.
The rest of the paper is organized as follows. In Section 2, we describe the colorectal cancer data. Our proposed method is introduced in Section 3. Wavelet feature extraction for colorectal cancer data is described in Section 4. In Section 5, a genetic algorithm based on a Bayes classifier is used to select the optimized features. Survival models are used to evaluate the selected biomarkers in Section 6. The experiments are reported in Section 7, followed by discussion and concluding comments in Section 8.
Section snippets
Colorectal cancer data
We use the dataset from Professor Lindy Durrant’s research, where it is described in detail [1], [2]. The study population cohort comprised a consecutive series of 462 archived specimens of primary invasive cases of colorectal cancer (CRC) tissue obtained from patients undergoing elective surgical resection of a histologically proven primary CRC at Nottingham University Hospitals, Nottingham, UK. The samples were collected between January 1994 and December 2000 from
The proposed method
Fig. 2 shows the selection process of significant biomarkers in survival analysis. First, the data were transformed into wavelet space at different scales to find the most discriminant features between the two groups. A genetic algorithm was then used to select the best features from the extracted wavelet features, and the significant protein markers were detected based on the optimized features in wavelet space. Finally, the Kaplan–Meier curve and Cox regression model were used to evaluate the
Wavelet feature extraction
A wavelet is a “small wave” whose energy is concentrated in time. A wavelet system is generated from a single scaling function or wavelet by simple scaling and translation. Wavelets give an accurate local description and separation of signal characteristics, and provide a tool for the analysis of transient or time-varying signals [26]. Wavelets are widely used for image processing and feature extraction [27], [28]. The one-dimensional continuous wavelet transform (CWT) of signal or data s
Biomarker selection based on genetic algorithm and Bayes classifier
After wavelet feature extraction, a genetic algorithm is employed to select the best features. Floating-point (real) encoding is used in this study. Student’s t-test is performed on the wavelet coefficients of the CWT at scale 3 to select the initial chromosome. Uniform crossover and Gaussian mutation are performed to create the next generations. In the fitness function, a Bayes classifier is used to evaluate the performance of feature subsets, using a linear combination of the empirical error
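The two operators named in this snippet, uniform crossover and Gaussian mutation on real-coded chromosomes, can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: the fitness function is left as a user-supplied callable because the snippet is cut off before the fitness is fully defined, and the elitist selection scheme, the population size, the mutation rate and sigma, and the clipping of genes to [0, 1] are all hypothetical choices.

```python
import numpy as np

def uniform_crossover(parent_a, parent_b, rng):
    """Take each gene from parent_a or parent_b with equal probability."""
    mask = rng.random(parent_a.shape) < 0.5
    return np.where(mask, parent_a, parent_b)

def gaussian_mutation(chrom, rng, rate=0.1, sigma=0.1):
    """Add N(0, sigma^2) noise to a random subset of genes, then clip to [0, 1]."""
    hit = rng.random(chrom.shape) < rate
    noise = rng.normal(0.0, sigma, size=chrom.shape)
    return np.clip(chrom + hit * noise, 0.0, 1.0)

def run_ga(fitness, init, pop_size=20, generations=30, seed=0):
    """Maximise `fitness` starting from the chromosome `init` (assumed in [0, 1])."""
    rng = np.random.default_rng(seed)
    init = np.asarray(init, dtype=float)
    # Seed the population with the initial chromosome and mutated copies of it.
    pop = np.array([init] + [gaussian_mutation(init, rng, rate=0.5)
                             for _ in range(pop_size - 1)])
    for _ in range(generations):
        scores = np.array([fitness(c) for c in pop])
        order = np.argsort(scores)[::-1]           # higher fitness is better
        elite = pop[order[: pop_size // 2]]        # assumed elitist selection
        children = []
        while len(children) < pop_size - len(elite):
            i, j = rng.choice(len(elite), size=2, replace=False)
            child = uniform_crossover(elite[i], elite[j], rng)
            children.append(gaussian_mutation(child, rng))
        pop = np.vstack([elite, np.array(children)])
    scores = np.array([fitness(c) for c in pop])
    return pop[int(np.argmax(scores))]

# Toy fitness standing in for the classifier-based fitness (hypothetical).
target = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
toy_fitness = lambda c: -float(np.sum((c - target) ** 2))
start = np.full(5, 0.5)
best = run_ga(toy_fitness, start)
```

In a feature-selection setting, `toy_fitness` would be replaced by a function that trains and scores a classifier on the features a chromosome encodes; how a real-valued gene maps to "feature selected" is not given in the excerpt.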
Kaplan–Meier estimator
Kaplan–Meier (KM) analysis is a non-parametric technique for estimating time-related events, especially when not all subjects continue in the study [39]. It analyzes the distribution of patient survival times following enrollment into a study, including the proportion of patients alive up to a given time following enrollment, and it accommodates “censored data”. “Censored data” means that the survival time for the subjects cannot be accurately determined as these patients are still alive at the time of
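As an illustration of how censored observations enter the estimate, the standard product-limit form of the Kaplan–Meier estimator can be sketched as below. The function name `kaplan_meier` and the toy follow-up data are hypothetical; the coding 1 = death observed, 0 = censored is an assumption.

```python
import numpy as np

def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times  : follow-up time for each patient
    events : 1 if death was observed at that time, 0 if censored
    Returns the distinct death times and S(t) just after each of them.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    death_times = np.unique(times[events == 1])
    survival = []
    s = 1.0
    for t in death_times:
        at_risk = np.sum(times >= t)                     # still under observation at t
        deaths = np.sum((times == t) & (events == 1))    # deaths exactly at t
        s *= 1.0 - deaths / at_risk
        survival.append(s)
    return death_times, np.array(survival)

# Toy follow-up data in months (hypothetical); two patients are censored.
km_times, km_surv = kaplan_meier([5, 8, 12, 12, 20], [1, 0, 1, 1, 0])
```

A censored patient never counts as a death but stays in the at-risk count until the censoring time, which is how partial follow-up still contributes to the estimate.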
Experiments and results
In this section, several experiments are conducted. In Section 7.1, we compare features extracted using CWT with those extracted using DWT. The experimental results show that our proposed CWT method can capture information that DWT misses. In Section 7.2, the performance of CWT features is compared with that of the original data, because other feature extraction methods, such as PCA and LDA, are not applicable in biomarker detection. In Section 7.3, several subsets of biomarkers are
Discussion and conclusions
In this study we propose a novel method of biomarker detection in survival analysis. Two groups of patients were used to select the biomarkers from the colorectal cancer data: patients with a survival time of less than 30 months, and patients with a survival time of more than 70 months. First, continuous wavelet analysis was used to extract the discriminant features between the two groups of patients. The best discriminant features were obtained based on CWT at scale 3.
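The two-group setup can be sketched as follows, assuming survival time in months and a per-feature pooled-variance (Student) t statistic as the measure of discrimination. The function names, the label coding (0 = short survival, 1 = long survival, -1 = excluded) and the toy data are hypothetical, and the treatment of censored patients is not described in the excerpt and is ignored here.

```python
import numpy as np

def survival_groups(months, short=30.0, long=70.0):
    """Label patients: 0 if survival < short, 1 if survival > long, else -1."""
    months = np.asarray(months, dtype=float)
    labels = np.full(months.shape, -1, dtype=int)
    labels[months < short] = 0
    labels[months > long] = 1
    return labels

def student_t(features, labels):
    """Pooled-variance t statistic per feature column, group 0 versus group 1."""
    a = features[labels == 0]
    b = features[labels == 1]
    na, nb = a.shape[0], b.shape[0]
    pooled = ((na - 1) * a.var(axis=0, ddof=1)
              + (nb - 1) * b.var(axis=0, ddof=1)) / (na + nb - 2)
    return (a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(pooled * (1.0 / na + 1.0 / nb))

# Toy data (hypothetical): 5 patients, 2 features; only feature 0 separates the groups.
group_labels = survival_groups([10, 25, 50, 80, 90])
toy_features = np.array([[1.0, 5.0], [2.0, 6.0], [9.0, 9.0], [8.0, 5.0], [9.0, 6.0]])
t_stats = student_t(toy_features, group_labels)
```

Patients with intermediate survival receive the label -1 and are left out of both groups, so the comparison is between clearly short and clearly long survivors.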
References (60)
- et al. Use of proteomic patterns in serum to identify ovarian cancer. The Lancet (2002)
- et al. Genomics and proteomics: application of novel technology to early detection and prevention of cancer. Cancer Detection and Prevention (2002)
- et al. Machine learning for survival analysis: a case study on recurrence of prostate cancer. Artificial Intelligence in Medicine (2000)
- et al. Selection of artificial neural network models for survival analysis with genetic algorithms. Computational Statistics & Data Analysis (2007)
- et al. A novel neural network-based survival analysis model. Neural Networks (2003)
- et al. Impact of censoring on learning Bayesian networks in survival modeling. Artificial Intelligence in Medicine (2009)
- Feature extraction and dimensionality reduction for mass spectrometry data. Computers in Biology and Medicine (2009)
- Wavelet feature extraction for high-dimensional microarray data. Neurocomputing (2009)
- Prominent feature selection of microarray data. Progress in Nature Science (2009)
- Dimensionality reduction and main component extraction of mass spectrometry cancer data. Knowledge-Based Systems (2012)
- Rough set methods in feature selection and recognition. Pattern Recognition Letters
- Fast wrapper feature subset selection in high-dimensional datasets by means of filter re-ranking. Knowledge-Based Systems
- Dominance-based rough set model in intuitionistic fuzzy information systems. Knowledge-Based Systems
- Fuzzy rough set based attribute reduction for information systems with fuzzy decisions. Knowledge-Based Systems
- G-ANMI: a mutual information based genetic clustering algorithm for categorical data. Knowledge-Based Systems
- Feature subset selection in large dimensionality domains. Pattern Recognition
- Palmprint verification based on 2D-Gabor wavelet and pulse-coupled neural network. Knowledge-Based Systems
- Genetic algorithms for real parameter optimization
- Real-coded genetic algorithm for system identification and controller tuning. Applied Mathematical Modelling
- Real-coded genetic algorithms and interval schemata
- CD46 signaling in T cells: linking pathogens with polarity. FEBS Letters
- Obstacles to cancer immunotherapy: expression of membrane complement regulatory proteins (mCRPs) in tumors. Molecular Immunology
- CD46 processing: a means of expression. Immunobiology
- Overexpression of FLIP L is an independent marker of poor prognosis in colorectal cancer patients. Clinical Cancer Research
- Detect key genes information in classification of microarray data. EURASIP Journal on Advances in Signal Processing
- Find significant gene information based on changing points of microarray data. IEEE Transactions on Biomedical Engineering
- Discovery of significant rules for classifying cancer diagnosis data. Bioinformatics
- Cancer: New biomarkers of good prognosis in colorectal cancer identified. Nature Reviews Gastroenterology and Hepatology
- Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence