Multiclass classification and gene selection with a stochastic algorithm

https://doi.org/10.1016/j.csda.2009.02.028

Abstract

Microarray technology allows for the monitoring of thousands of gene expressions in various biological conditions, but most of these genes are irrelevant for classifying these conditions. Feature selection is consequently needed to help reduce the dimension of the variable space. Starting from the application of the stochastic meta-algorithm “Optimal Feature Weighting” (OFW) for selecting features in various classification problems, the focus is placed on the multiclass problem, which wrapper methods rarely handle. From a computational point of view, one of the main difficulties comes from the unbalanced class situation that is commonly encountered in microarray data. From a theoretical point of view, very few methods have been developed so far to minimize the classification error made on the minority classes. The OFW approach is developed to handle multiclass problems using CART and one-vs-one SVM classifiers. Comparisons are made with other multiclass selection algorithms such as Random Forests and the filter method F-test on five public microarray data sets of various complexities. The statistical relevancy of the gene selections is assessed by computing the performance and the stability of these different approaches, and the results obtained show that the two proposed approaches are competitive and relevant for selecting genes that classify the minority classes.

Application to a pig folliculogenesis study follows, and a detailed interpretation of the selected genes shows that the OFW approach answers the biological question.

Introduction

When dealing with microarray data, one of the most important issues to improve the classification task is to perform feature selection. Thousands of genes can be measured on a single array, most of which are irrelevant or uninformative for classification methods. Dimensionality must therefore be reduced without losing information.

In this context, our objective was to look for predictors (the genes) that would classify the observed cases (the microarrays) into their known classes. The selection of these discriminative variables can be performed in two ways: either explicitly, with filter methods, or implicitly, with wrapper methods. Filter methods measure the usefulness of each feature by ranking it with statistical tests such as t- or F-tests. These gene-by-gene approaches are robust against overfitting and computationally fast. However, they disregard the interactions between features and may fail to find the “useful” set of variables, as they usually select variables carrying redundant information. By contrast, the aim of wrapper methods is to measure the usefulness of a subset of features within the whole set of variables. However, when dealing with a large number of variables, as is the case here, it is computationally impossible to perform an exhaustive search among all subsets of features. Furthermore, these methods are prone to overfitting. One way to benefit from the wrapper approach is to perform a search using stochastic approximations that still cover a large portion of the feature space and avoid local minima. The “Optimal Feature Weighting” algorithm (OFW) proposed by Gadat and Younes (2007) allows for the selection of an optimal discriminative subset of variables. This meta-algorithm can be applied with any classifier. For example, Support Vector Machines (SVM, Vapnik, 1999) and Classification And Regression Trees (CART, Breiman et al., 1984) were combined with this stochastic approach in Lê Cao et al. (2007) for 2-class microarray problems. The aim was to make a comparative study of OFW+SVM/CART with other wrapper methods and a filter method on public microarray data sets. The relevancy of the results was assessed statistically, by measuring the performance of each gene selection, and through a thorough biological interpretation. The selections obtained with OFW were statistically competitive and biologically relevant, even for complex data sets.
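
As an illustration of the gene-by-gene filter approach used as a baseline in this study, the sketch below ranks genes by a one-way ANOVA F-statistic. It is a minimal R sketch only: the function name and the layout of the expression matrix (genes in rows, samples in columns) are our own assumptions, not the authors' implementation.

    # Minimal F-test filter sketch: rank genes by a one-way ANOVA F-statistic.
    # 'expr' is a genes x samples matrix and 'classes' a factor of sample labels
    # (both names are illustrative).
    ftest.filter <- function(expr, classes, n.keep = 50) {
      f.stat <- apply(expr, 1, function(gene) {
        # F-statistic of the one-way ANOVA of one gene against the classes
        anova(lm(gene ~ classes))[1, "F value"]
      })
      # keep the n.keep genes with the largest F-statistics
      names(sort(f.stat, decreasing = TRUE))[seq_len(n.keep)]
    }

    # Example on simulated data:
    # expr <- matrix(rnorm(1000 * 30), nrow = 1000,
    #                dimnames = list(paste0("gene", 1:1000), NULL))
    # classes <- factor(rep(c("A", "B", "C"), each = 10))
    # ftest.filter(expr, classes, n.keep = 10)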

From this point, we investigate OFW with multiclass microarray data sets. Multiclass problems are often considered as an extension of 2-class problems. However, this extension is not always straightforward, as the data sets are often characterized by unbalanced classes with a very small number of cases in at least one of the classes. Furthermore, this “rare” minority class is often the one of interest for the biologists, who would like, for example, to diagnose a disease. Nevertheless, most algorithms do not perform well on such problems, as they aim at minimizing the overall error rate instead of focusing on the minority class. Moreover, classification accuracy appears to degrade very quickly as the number of classes increases (Li et al., 2004). Several methods have been proposed in recent years. Chen et al. (2004) proposed balanced or weighted random forests, and McCarthy et al. (2005) compared sampling methods and cost-sensitive learning, but with no clear winner. More recently, Eitrich et al. (2007) and Qiao and Liu (in press) also addressed the unbalanced multiclass issue with cost-sensitive machine learning techniques and SVM.

In the specific context of multiclass microarray data, Li et al. (2004) applied various classifiers with various feature selection methods and concluded that accuracy depended strongly on the choice of the classifier rather than on the choice of the selection method, although the reverse would seem more natural. Chen et al. (2003) applied four filter methods with low correlation between the selected genes, Tibshirani et al. (2002) proposed the Shrunken Centroid approach, and Yeung and Bumgarner (2003) applied uncorrelated or error-weighted Shrunken Centroid. More recently, Chakraborty (2008) proposed a Bayesian Nearest Neighbor model.

In this study we compare two ways of handling multiclass data: with or without an internal weighting procedure in OFW that takes the minority classes into account. We do not intend to optimize the size of the gene subset. Instead, we focus on assessment criteria that measure the performance of the different methods on the first selected genes.
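
To fix ideas on what up-weighting minority classes means in practice, the R sketch below assigns class weights inversely proportional to class sizes and passes them to a one-vs-one SVM (e1071::svm). This is only a generic illustration under our own assumptions (the objects expr, classes and selected.genes are hypothetical); it is not the internal OFW weighting procedure studied in this paper.

    # Generic class-weighting illustration (NOT the OFW internal weighting):
    # minority classes receive larger weights, inversely proportional to their size.
    library(e1071)                        # provides svm(), one-vs-one for multiclass
    tab <- table(classes)                 # 'classes': factor of sample labels (assumed)
    w <- setNames(as.numeric(max(tab) / tab), names(tab))
    # 'expr' (genes x samples) and 'selected.genes' are assumed to exist
    fit <- svm(x = t(expr[selected.genes, , drop = FALSE]), y = classes,
               kernel = "linear", class.weights = w)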

Biological interpretation, which is one of the main factors in evaluating the relevancy of the results, will be given for one case study. The reader can also refer to Lê Cao et al. (2007), which highlights the importance of biological interpretation in the analysis.

We apply the multicategory classifier CART and the one-vs-one SVM approach with OFW on five public microarray data sets. Numerical comparisons are made with Random Forests, which are known to perform efficiently on such data sets, and with one filter method (F-test). We compute the e.632+ bootstrap error from Efron and Tibshirani (1997) for each feature selection method, assess the stability of the results with the Jaccard index and compare the different gene lists. The weighted and non-weighted approaches are then compared within OFW+CART and OFW+SVM with the same tools. Finally, application and biological analysis are performed on a pig folliculogenesis data set.
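
The Jaccard index used here as a stability measure is the standard ratio of the intersection to the union of two gene selections. The minimal R sketch below, with helper names of our own choosing, computes it and averages it over all pairs of selections (for instance one selection per bootstrap sample); the averaging scheme is an assumption, not the authors' exact protocol.

    # Jaccard index between two gene selections: |A n B| / |A u B|
    jaccard <- function(sel1, sel2) {
      length(intersect(sel1, sel2)) / length(union(sel1, sel2))
    }

    # Average pairwise stability of a list of gene selections of a fixed size
    # (e.g. the top-ranked genes selected on each bootstrap sample)
    stability <- function(selections) {
      pairs <- combn(length(selections), 2)
      mean(apply(pairs, 2, function(p) jaccard(selections[[p[1]]], selections[[p[2]]])))
    }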

The first section introduces the theoretical adaptation of the OFW model to the multiclass framework. In the next section we consider the computational aspects of applying CART and SVM within OFW and describe the different tools used to assess the performance of the results. Applications to public data sets and to a practical data set follow. The paper ends with further elements of discussion.

Section snippets

The OFW model

We introduce our feature selection model in the framework of multiclass analysis. As we focus here on microarray data, we will mostly refer to “genes” instead of “variables”.

Application of OFW and performance evaluation

We discuss the applications of OFW+CART/SVM in the context of multiclass problems. The binary case can be found in Lê Cao et al. (2007).

Statistical assessment on public data sets

A short description of the five public data sets is first given before we apply OFW+CART, OFW+SVM, RF and F-test with no weighting procedure. We compare the results obtained in terms of performance, stability and differences in the gene selections. We then focus on OFW and compare the weighted vs. non-weighted procedure with the same criteria cited above.

Application and biological interpretation

When developing feature selection algorithms for microarray data, it is useful to check if the actual gene selection is biologically relevant for the study. The biological interpretation of the results is therefore valuable to show the applicability of such algorithms.

Computation time

The experiments were performed in R on a 1.6 GHz AMD Turion 64 X2 PC with 960 MB of RAM, for OFW+SVM (implemented in R) and OFW+CART (implemented in C within an R package). The learning time of OFW mostly depends on the initial number of variables in the feature space and on the step of the stochastic scheme, as well as on the size of ω and on the number of trees aggregated for OFW+CART. For Brain (Lymphoma), which contains 1963 (4026) genes, the learning step took about 1 (1.5) h for OFW+SVM for 200 000

Conclusion

Starting from Lê Cao et al. (2007), which provided interesting results for binary problems, we extended the application of OFW+CART and OFW+SVM one-vs-one to multiclass microarray problems. These data sets are known to be complex because of their high dimensionality, their small sample size and the fact that at least one of the classes is under-represented. For most classifiers, this often results in a good overall classification accuracy even though the minority classes are misclassified.

We

References (33)

  • M. Keira et al.

    Identification of a molecular species in porcine ovarian luteal glutathione S-transferase and its hormonal regulation by pituitary gonadotropins

    Archives of Biochemistry and Biophysics

    (1994)
  • A. Alizadeh et al.

    Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling

    Nature

    (2000)
  • C. Ambroise et al.

    Selection bias in gene extraction in tumour classification on the basis of microarray gene expression data

    Proceedings of the National Academy of Sciences of the United States of America

    (2002)
  • A. Bonnet et al.

    Identification of gene networks involved in antral follicular development

    Reproduction

    (2008)
  • L. Breiman

    Bagging predictors

    Machine Learning

    (1996)
  • L. Breiman

    Random forests

    Machine Learning

    (2001)
  • L. Breiman et al.

    Classification and Regression Trees

    (1984)
  • C.J.C. Burges

    A tutorial on support vector machines for pattern recognition

    Data Mining and Knowledge Discovery

    (1998)
  • S. Chakraborty

    Simultaneous cancer classification and gene selection with Bayesian nearest neighbor method: An integrated approach

    Computational Statistics and Data Analysis

    (2008)
  • C. Chen et al.

    Using random forest to learn imbalanced data

    Technical Report 666, Dpt. of...

    (2004)
  • D. Chen et al.

    Gene selection for multi-class prediction of microarray data

    (2003)
  • M. Duflo

    Random Iterative Models

    (1997)
  • B. Efron et al.

    Improvements on cross-validation: the e.632+ bootstrap method

    Journal of the American Statistical Association

    (1997)
  • T. Eitrich et al.

    Classification of highly unbalanced CYP450 data of drugs using cost sensitive machine learning techniques

    Journal of Chemical Information and Modeling

    (2007)
  • S. Gadat et al.

    A stochastic algorithm for feature selection in pattern recognition

    Journal of Machine Learning Research

    (2007)
  • T. Golub et al.

    Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring

    Science

    (1999)