Wavelet feature extraction and genetic algorithm for biomarker detection in colorectal cancer data

doi:10.1016/j.knosys.2012.09.011

Knowledge-Based Systems

Volume 37, January 2013, Pages 502-514

Abstract

Biomarkers that predict patient survival play an important role in medical diagnosis and treatment. Selecting the significant biomarkers from hundreds of protein markers is a key step in survival analysis. In this paper, a novel method is proposed to detect prognostic biomarkers of survival in colorectal cancer patients using wavelet analysis, a genetic algorithm, and a Bayes classifier. The one-dimensional discrete wavelet transform (DWT) is normally used to reduce the dimensionality of biomedical data. In this study, the one-dimensional continuous wavelet transform (CWT) is proposed to extract the features of colorectal cancer data. The one-dimensional CWT does not reduce the dimensionality of the data, but it captures features that the DWT misses and is therefore complementary to the DWT. A genetic algorithm was run on the extracted wavelet coefficients to select the optimized features, with a Bayes classifier used to build its fitness function. The corresponding protein markers were located from the positions of the optimized features. Kaplan–Meier curves and the Cox regression model were used to evaluate the performance of the selected biomarkers. Experiments were conducted on a colorectal cancer dataset, and several significant biomarkers were detected. A new protein biomarker, CD46, was found to be significantly associated with survival time.

Introduction

Survival analysis estimates the distribution of the time until death occurs, which depends on the biology of the disease. It allows clinicians to plan a suitable treatment and counsel patients about their prognosis. In medical domains, survival analysis is mainly based on the Kaplan–Meier (KM) estimator and the Cox proportional hazards regression model [1], [2], which are used to evaluate the performance of prognostic markers. However, how to rank these biomarkers is a key step in survival analysis. Normally, the selection of biomarkers is based on medical knowledge and the diagnosis of the clinician [1], [2], which may overlook potential biomarkers. Machine learning algorithms have been widely used in biomarker analysis of high-dimensional medical data, such as microarray data [3], [4], [5] or mass spectrometry data [6], [7]. Despite their potential advantages over standard statistical methods, their applications to survival analysis are rare because of the difficulty of dealing with censored data [8]. Recent research has shown that machine learning methods, such as neural networks [9], [10], Bayesian networks [11], decision trees and the Naïve Bayes classifier [8], can be used to improve the survival model. However, none of these methods deals with biomarker selection in survival analysis.

In this study, we propose a novel method of biomarker selection based on the one-dimensional continuous wavelet transform (CWT). Normally, the one-dimensional discrete wavelet transform (DWT) is used to reduce dimensionality in the analysis of high-dimensional biomedical data [12], [13]. In biomarker detection, however, the feature space must correspond position by position to the original data space, so that a detected feature can be traced back to a biomarker. The one-dimensional CWT detects the features of the data at every scale and position, and preserves the local properties of the original data. The wavelet feature vector of the CWT has the same length as the original data, and can therefore be used to locate the biomarker in the original data space.
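The length-preserving property described above can be illustrated with a minimal sketch. The Mexican hat wavelet, the Haar filter, the helper names and the toy profile are all assumptions made for illustration; the excerpt does not state which wavelets were used.

```python
import math

def mexican_hat(t):
    """Mexican hat (Ricker) wavelet, unnormalised (illustrative choice)."""
    return (1.0 - t * t) * math.exp(-t * t / 2.0)

def cwt_1d(signal, scale):
    """CWT coefficients of `signal` at one scale: one coefficient per position."""
    n = len(signal)
    half = int(math.ceil(5 * scale))  # wavelet is negligible beyond |t| > 5 * scale
    coeffs = []
    for b in range(n):
        total = 0.0
        for k in range(max(0, b - half), min(n, b + half + 1)):
            total += signal[k] * mexican_hat((k - b) / scale)
        coeffs.append(total / math.sqrt(scale))
    return coeffs

def haar_dwt_approx(signal):
    """One level of Haar DWT approximation coefficients (halves the length)."""
    return [(signal[i] + signal[i + 1]) / math.sqrt(2.0)
            for i in range(0, len(signal) - 1, 2)]

# Toy marker profile with a single bump at index 5 (made-up values).
profile = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
features = cwt_1d(profile, scale=1.0)
peak_position = max(range(len(features)), key=lambda i: features[i])
```

Because `features` has one entry per input position, the index of a selected coefficient points directly at a position in the original data, whereas the DWT approximation has only half as many entries.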

First, we perform the one-dimensional continuous wavelet transform at different scales on the colorectal cancer data to extract the discriminant features. Then we use a genetic algorithm (GA) and a Bayes classifier to select the optimized features from the extracted wavelet coefficients. Because the wavelet transform reveals the local features of the data (or time features) without losing the position information of the original data, the corresponding protein markers in the original data space are obtained from the positions of the optimized wavelet features. Finally, the Kaplan–Meier (KM) estimator and the Cox regression model are used to evaluate the performance of the selected protein markers. A new protein biomarker, CD46, was found to have independent prognostic significance. Recent research suggests that “the immune system might be involved in the development and progression of colorectal cancer” [1], [14]. The detection of CD46 supports this conclusion.

The rest of the paper is organized as follows. Section 2 describes the colorectal cancer data. Our proposed method is introduced in Section 3. Wavelet feature extraction for colorectal cancer data is described in Section 4. In Section 5, a genetic algorithm based on a Bayes classifier is used to select the optimized features. Survival models are used to evaluate the selected biomarkers in Section 6. The experiments are presented in Section 7, followed by discussion and concluding comments in Section 8.

Section snippets

Colorectal cancer data

We use the same dataset that Professor Lindy Durrant used in her research; it is described in [1], [2]. The study cohort comprised a consecutive series of 462 archived specimens of primary invasive colorectal cancer (CRC) tissue obtained from patients undergoing elective surgical resection of a histologically proven primary CRC at Nottingham University Hospitals, Nottingham, UK. The samples were collected between January 1994 and December 2000 from

The proposed method

Fig. 2 shows the selection process of significant biomarkers in survival analysis. First the data was transformed into wavelet space at different scales to find the most discriminant features between the two groups. Genetic algorithm was used to select the best features from extracted wavelet features and then the significant protein markers were detected based on the optimized features in wavelet space. Finally Kaplan–Meier curve and Cox regression model were performed to evaluate the

Wavelet feature extraction

A wavelet is a “small wave” whose energy is concentrated in time. A wavelet system is generated from a single scaling function or wavelet by simple scaling and translation. Wavelets give an accurate local description and separation of signal characteristics, and provide a tool for the analysis of transient or time-varying signals [26]. Wavelets are widely used for image processing and feature extraction of data [27], [28]. One dimensional continuous wavelet transform (CWT) of signal or data s

Biomarker selection based on genetic algorithm and Bayes classifier

After wavelet feature extraction, a genetic algorithm is employed to select the best features. Floating-point (real) encoding is used in this study. Student’s t-test is performed on the wavelet coefficients of the CWT at scale 3 to select the initial chromosome. Uniform crossover and Gaussian mutation are used to create the next generations. In the fitness function, a Bayes classifier is used to evaluate the performance of feature subsets, using a linear combination of the empirical error
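A minimal sketch of such a real-coded genetic algorithm with a Bayes classifier in the fitness function is given here for illustration. Several choices are assumptions and are not taken from the source: the Gaussian naive Bayes classifier, the 0.5 threshold used to decode a real-valued gene into "selected", truncation selection, a random initial population (the source seeds it with a t-test), and a fitness equal to one minus the resubstitution error (the source uses a linear combination whose full form is not given in this excerpt). All function names are hypothetical.

```python
import math
import random

def gnb_fit(X, y):
    """Per-class prior, feature means and variances for Gaussian naive Bayes."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        d = len(rows[0])
        means = [sum(r[j] for r in rows) / len(rows) for j in range(d)]
        varis = [sum((r[j] - means[j]) ** 2 for r in rows) / len(rows) + 1e-6
                 for j in range(d)]
        model[label] = (len(rows) / len(X), means, varis)
    return model

def gnb_predict(model, x):
    """Class with the highest log posterior."""
    def log_post(params):
        prior, means, varis = params
        return math.log(prior) + sum(
            -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
            for xi, m, v in zip(x, means, varis))
    return max(model, key=lambda label: log_post(model[label]))

def decode(chromosome):
    """Assumption: a real-coded gene > 0.5 selects the feature at that position."""
    return [j for j, g in enumerate(chromosome) if g > 0.5]

def fitness(chromosome, X, y):
    """Assumption: 1 - empirical (resubstitution) error on the selected subset."""
    idx = decode(chromosome)
    if not idx:
        return 0.0
    Xs = [[row[j] for j in idx] for row in X]
    model = gnb_fit(Xs, y)
    hits = sum(gnb_predict(model, x) == t for x, t in zip(Xs, y))
    return hits / len(y)

def uniform_crossover(a, b, rng):
    """Each gene is copied from either parent with equal probability."""
    return [ai if rng.random() < 0.5 else bi for ai, bi in zip(a, b)]

def gaussian_mutation(c, rng, rate=0.1, sigma=0.2):
    """Add Gaussian noise to a gene with probability `rate`, clipped to [0, 1]."""
    return [min(1.0, max(0.0, g + rng.gauss(0, sigma))) if rng.random() < rate else g
            for g in c]

def run_ga(X, y, pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    d = len(X[0])
    pop = [[rng.random() for _ in range(d)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: fitness(c, X, y), reverse=True)
        parents = pop[:pop_size // 2]  # truncation selection (assumption)
        children = [gaussian_mutation(
            uniform_crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))]
        pop = parents + children  # parents survive unchanged (elitism)
    return max(pop, key=lambda c: fitness(c, X, y))

# Toy data (made-up): the feature at index 2 separates the two classes.
X = [[0.1, 5.0, 0.0], [0.9, 5.1, 0.1], [0.2, 4.9, 1.0], [0.8, 5.0, 1.1]]
y = [0, 0, 1, 1]
best = run_ga(X, y)
selected_positions = decode(best)
```

The positions returned by `decode(best)` play the role of the optimized features; in practice a held-out or cross-validated error would be a safer fitness than the resubstitution error used in this sketch.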

Kaplan–Meier estimator

Kaplan–Meier (KM) analysis is a non-parametric technique for estimating time-related events, especially when not all subjects continue in the study [39]. It analyzes the distribution of patient survival times following enrollment into a study, and accounts for patients who are still alive at a given time after enrollment, i.e. “censored data”. “Censored data” means that the survival time for the subjects cannot be accurately determined as these patients are still alive at the time of
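As an illustration of how censored subjects enter the estimate, the standard product-limit form of the KM estimator, S(t) = ∏ (1 − d_i / n_i) over the observed event times t_i ≤ t, can be sketched as follows. The function name and the follow-up times are made up for the example.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function S(t).

    times  : follow-up time for each subject
    events : 1 if death was observed at that time, 0 if the subject was censored
    Returns a list of (event_time, survival_probability) pairs.
    """
    survival = 1.0
    curve = []
    for t in sorted(set(t for t, e in zip(times, events) if e == 1)):
        # Subjects still under observation just before time t, censored or not.
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve

# Made-up follow-up times in months; 0 marks a censored subject.
times = [5, 8, 8, 12, 20, 30]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
```

Censored subjects never trigger a drop in the curve, but they count in the at-risk set until their censoring time, which is how their partial information is used.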

Experiments and results

In this section, several experiments are conducted. In Section 7.1, we compare the features extracted using CWT with those extracted using DWT. The experimental results show that our proposed CWT method captures information that DWT misses. In Section 7.2, the performance of the CWT features is compared with that of the original data, because other feature extraction methods, such as PCA and LDA, are not applicable in biomarker detection. In Section 7.3, several subsets of biomarkers are

Discussion and conclusions

In this study, we propose a novel method of biomarker detection in survival analysis. Two groups of patients were used to select the biomarkers from the colorectal cancer data: patients with a survival time of less than 30 months, and patients with a survival time of more than 70 months. First, continuous wavelet analysis was used to extract the discriminant features between the two groups of patients. The best discriminant features were obtained with the CWT at scale 3.
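The two-group setup can be sketched as below, ranking feature positions by a pooled-variance Student's t statistic. This is an illustrative sketch only: the function names and toy values are hypothetical, the survival times are taken at face value, and the handling of censored patients when forming the groups is not described in this excerpt.

```python
import math
import statistics

def split_groups(survival_months, features, short=30, long=70):
    """Rows for short (< short) and long (> long) survivors; others are left out."""
    short_rows = [f for m, f in zip(survival_months, features) if m < short]
    long_rows = [f for m, f in zip(survival_months, features) if m > long]
    return short_rows, long_rows

def student_t(a, b):
    """Two-sample Student's t statistic with pooled variance.

    Assumes each sample has at least two values and the pooled variance is non-zero.
    """
    na, nb = len(a), len(b)
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))

def rank_features(short_rows, long_rows):
    """Feature positions ordered by |t|, most discriminant first."""
    d = len(short_rows[0])
    scores = [abs(student_t([r[j] for r in short_rows], [r[j] for r in long_rows]))
              for j in range(d)]
    return sorted(range(d), key=lambda j: scores[j], reverse=True)

# Made-up survival times (months) and two wavelet features per patient.
months = [10, 20, 50, 80, 90, 25, 75]
feats = [[1.0, 0.10], [1.2, 0.20], [1.1, 0.50], [1.1, 0.90],
         [0.9, 1.00], [0.8, 0.15], [1.3, 0.95]]
short_rows, long_rows = split_groups(months, feats)
ranking = rank_features(short_rows, long_rows)
```

In this toy example the patient with 50 months of survival belongs to neither group, and the second feature ranks first because its group means differ far more than its within-group spread.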

References (60)

E.F. Petricoin et al., Use of proteomic patterns in serum to identify ovarian cancer, The Lancet (2002)
C.M. Michener et al., Genomics and proteomics: application of novel technology to early detection and prevention of cancer, Cancer Detection and Prevention (2002)
B. Zupan et al., Machine learning for survival analysis: a case study on recurrence of prostate cancer, Artificial Intelligence in Medicine (2000)
F. Ambrogi et al., Selection of artificial neural network models for survival analysis with genetic algorithms, Computational Statistics & Data Analysis (2007)
A. Eleuteri et al., A novel neural network-based survival analysis model, Neural Networks (2003)
I. Stajduhar et al., Impact of censoring on learning Bayesian networks in survival modeling, Artificial Intelligence in Medicine (2009)
Y. Liu, Feature extraction and dimensionality reduction for mass spectrometry data, Computers in Biology and Medicine (2009)
Y. Liu, Wavelet feature extraction for high-dimensional microarray data, Neurocomputing (2009)
Y. Liu, Prominent feature selection of microarray data, Progress in Natural Science (2009)
Y. Liu, Dimensionality reduction and main component extraction of mass spectrometry cancer data, Knowledge-Based Systems (2012)
R.W. Swiniarski et al., Rough set methods in feature selection and recognition, Pattern Recognition Letters (2003)
P. Bermejo et al., Fast wrapper feature subset selection in high-dimensional datasets by means of filter re-ranking, Knowledge-Based Systems (2012)
B. Huang et al., Dominance-based rough set model in intuitionistic fuzzy information systems, Knowledge-Based Systems (2012)
Q. He et al., Fuzzy rough set based attribute reduction for information systems with fuzzy decisions, Knowledge-Based Systems (2011)
S. Deng et al., G-ANMI: a mutual information based genetic clustering algorithm for categorical data, Knowledge-Based Systems (2010)
I.A. Gheyas et al., Feature subset selection in large dimensionality domains, Pattern Recognition (2010)
X. Wang et al., Palmprint verification based on 2D-Gabor wavelet and pulse-coupled neural network, Knowledge-Based Systems (2012)
A.H. Wright, Genetic algorithms for real parameter optimization
K. Valarmathi et al., Real-coded genetic algorithm for system identification and controller tuning, Applied Mathematical Modelling (2009)
L.J. Eshelman et al., Real-coded genetic algorithms and interval schemata
E.D. Hawkins et al., CD46 signaling in T cells: linking pathogens with polarity, FEBS Letters (2010)
Z. Fishelson et al., Obstacles to cancer immunotherapy: expression of membrane complement regulatory proteins (mCRPs) in tumors, Molecular Immunology (2003)
S. Ni Choileain et al., CD46 processing: a means of expression, Immunobiology (2012)
J.A.D. Simpson, A. Al-Attar, N.F.S. Watson, J.H. Scholefield, M. Ilyas, L.G. Durrant, Intratumoral T cell infiltration,...
G.J. Ullenhag et al., Overexpression of FLIP L is an independent marker of poor prognosis in colorectal cancer patients, Clinical Cancer Research (2007)
Y. Liu, Detect key genes information in classification of microarray data, EURASIP Journal on Advances in Signal Processing (2008)
Y. Liu et al., Find significant gene information based on changing points of microarray data, IEEE Transactions on Biomedical Engineering (2009)
J. Li et al., Discovery of significant rules for classifying cancer diagnosis data, Bioinformatics (2003)
C. Greenhill, Cancer: New biomarkers of good prognosis in colorectal cancer identified, Nature Reviews Gastroenterology and Hepatology (2010)
H. Peng et al., Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy, IEEE Transactions on Pattern Analysis and Machine Intelligence (2005)
