
Neurocomputing

Volume 149, Part B, 3 February 2015, Pages 767-776

A Projection Pursuit framework for supervised dimension reduction of high dimensional small sample datasets

https://doi.org/10.1016/j.neucom.2014.07.057

Abstract

The analysis and interpretation of datasets with a large number of features and few examples remains a challenging problem in the scientific community, owing to the difficulties associated with the curse-of-dimensionality phenomenon. Projection Pursuit (PP) has shown promise in circumventing this phenomenon by searching for low-dimensional projections of the data in which meaningful structures are exposed. However, PP faces computational difficulties when dealing with datasets containing thousands of features (typical in genomics and proteomics) due to the vast number of parameters to optimize. In this paper we describe and evaluate a PP framework aimed at relieving such difficulties and thus easing the construction of classifier systems. The framework is a two-stage approach, where the first stage performs a rapid compaction of the data and the second stage implements the PP search using an improved version of the SPP method (Guo et al., 2000, [32]). In an experimental evaluation with eight public microarray datasets, some configurations of the proposed framework clearly outperformed eight well-established dimension reduction methods in their ability to pack more discriminatory information into fewer dimensions.

Introduction

In the last few decades we have witnessed the rapid development and refinement of data acquisition technologies in several scientific and industrial areas [1]. This has led to the emergence of high-throughput technologies capable of generating datasets whose number of features (p) is far greater than the number of examples (n), the so-called large p small n datasets. A representative example of these technological developments is the microarray technology [2], which has made it possible to measure the expression levels of thousands of genes in a relatively rapid and economical way, leading to significant advances in the understanding of severe diseases, such as cancer, and raising hopes of possible cures [3], [4].

Though the collection of large p small n datasets is nowadays common practice in many fields, their analysis and interpretation is still a challenging task [5], [6], [1]. This difficulty originates mainly from the so-called “curse of dimensionality” phenomenon inherent in this kind of data [7]: as the dimensionality increases, the corresponding space becomes emptier and the data points tend to become equidistant. This has detrimental effects on most machine-learning and pattern-recognition methods (including model-estimation instability, model overfitting and convergence to poor local optima), compromising their generalization performance and reliability [5], [6].

A common approach to circumventing the curse of dimensionality is to reduce the dimensionality [6]. Two kinds of methods exist for this task: feature selection (FS) [8], [9] and feature extraction (FE) [10], [11]. The former tries to find small subsets of the original features that are relevant to the intended analysis. The latter reduces the dimensionality by building new features from combinations (linear or nonlinear) of the original features. FS has the benefit of keeping the original feature meaning, which facilitates interpretability by the domain expert [9]. However, it has been argued [12] that FE is preferable to FS when the final goal is an accurate system for classifying new examples and interpretability is less important. This is because FE is not tied to the original feature space, providing greater chances of finding representations more useful for the desired task [12].

Projection pursuit (PP) [13], [14] is an FE method that has been successfully applied in several domains for both supervised and unsupervised analyses (e.g. [15], [16], [17], [18]). PP seeks low-dimensional linear projections of the data that expose interesting aspects of it. To this end, a measure of “interestingness” is employed, known as the projection pursuit index (PP index). A key advantage of PP is its flexibility to fit different pattern recognition tasks, depending on the PP index used. For example, PP can be used to perform clustering analysis [19], [20], classification [21], [22], [23], [24], regression analysis [25] and density estimation [26] (reviews of PP indexes can be found in [21], [27], [28]). Another advantage of PP is its out-of-sample mapping capability, that is, the possibility of mapping new examples into the projection space after it has been constructed.
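To make the out-of-sample property concrete (with notation we introduce here for illustration; it is not taken from the paper's equations): once a projection matrix A of size p×m has been found from training data, any new example x in R^p is mapped into the projection space by a single matrix-vector product,

\[
z \;=\; A^{\top} x \;\in\; \mathbb{R}^{m},
\]

so no re-optimization is needed when new examples arrive.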

Despite the aforementioned advantages, the literature shows limited use of PP on large p small n datasets, such as those generated by microarray technology. This may be due to the high computational difficulty of finding optimal projection spaces in such cases. For instance, projecting a dataset with p = 10,000 features (a realistic number in microarray datasets) onto a target space of dimension m = 3 requires the optimization of a projection matrix of p×m = 30,000 elements. Evidently, the problem worsens as p or m increases. Traditional PP optimizers based on gradient or Newton methods [29], [30], [31], [19] are usually inadequate for this kind of data because of the vastness of possible projections and, thus, the high susceptibility to poor local optima [14]. More global PP optimizers have been described recently, including genetic algorithms (GA) [32], [33], simulated annealing (SA) [21], random scan sampling (RSSA) [34] and particle swarm optimization (PSO) [35]. However, none of these works has been applied directly to dimensionalities as high as those found in microarray data, which shows the difficulty of applying PP in such scenarios.

In this paper we present a framework to facilitate the applicability of PP to large p small n datasets for classification tasks. The framework comprises two main stages (Fig. 1): a compaction stage and a PP optimization stage. The first stage is devised to rapidly transform the original data into a less sparse representation. The second stage is the PP part, which is responsible for finding optimal projections taking the compacted representation as input.

For the compaction stage we use three well-known techniques: PCA, Whitening and Partial Least Squares (a rough illustration of these options appears below). For the PP stage, we adopt the Sequential Projection Pursuit (SPP) approach [32] coupled with the GA optimizer (PPGA) we described recently [33], in which a specialized crossover operator showed excellent search capabilities. An experimental study is presented over eight public microarray datasets. The evaluation systematically tested several configurations of the framework, including variations of the compaction method, the PP index function and the target dimensionality. We used the predictive accuracy of two popular classification methods (LDA and 3NN) to assess the quality of the tested configurations. We also compared the framework against eight well-established dimension reduction methods, including FE and FS methods.
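As a minimal sketch of the three compaction options just named (our own illustration using scikit-learn, not the authors' implementation; in particular, we take "Whitening" to be PCA with whitening enabled, which is an assumption):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5000))   # toy "large p small n" data: n=60, p=5000
y = np.repeat([0, 1], 30)         # binary class labels

k = 30                            # intermediate dimensionality, on the order of n
compactors = {
    "PCA":       PCA(n_components=k),
    "Whitening": PCA(n_components=k, whiten=True),  # assumed reading of "Whitening"
    "PLS":       PLSRegression(n_components=k),     # supervised: exploits y
}
for name, c in compactors.items():
    Xc = c.fit(X, y).transform(X)  # PCA ignores y; PLS uses it
    print(name, Xc.shape)          # each yields a (60, 30) compacted matrix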

The paper is organized as follows. Section 2 introduces some important concepts of PP, SPP, PP optimization and PP indexes used in the paper. Section 3 describes the proposed framework. Section 4 presents the experimental evaluation, including the experimental setup, results and corresponding discussion. Finally, our conclusions are presented in Section 5.

Section snippets

Projection pursuit

The projection pursuit (PP) concept was formally introduced in the paper by Friedman and Tukey [13], although the seminal ideas were originally posed by Kruskal [36]. To describe the PP concept, we assume a data matrix X of dimensions n×p, where n is the number of data examples or observations and p is the number of attributes or variables. PP can be defined as the constrained optimization problem in (1), where the aim is to seek an m-dimensional projection space (m<p) (defined by a p×m projection matrix) in which a given PP index is optimized.
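Since the snippet cuts off before Eq. (1), the standard PP formulation it refers to can be reconstructed from the surrounding definitions (a reconstruction, so the notation may differ from the paper's): seek

\[
A^{\ast} \;=\; \arg\max_{A \in \mathbb{R}^{p \times m}} \; I(XA)
\quad \text{subject to} \quad A^{\top} A = I_m ,
\]

where I(·) is the chosen PP index and the orthonormality constraint keeps the m projection directions non-redundant.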

A PP framework for supervised dimension reduction of large p small n data

We detail here the proposed framework for easing the applicability of PP to large p small n data. Fig. 2 shows the structure of this framework, which comprises two stages: the first stage implements a fast procedure to compact the data into an intermediate-dimensional representation (on the order of n); the second stage is a PP procedure over the compacted data, which implements an improved version of the SPP scheme. Next, we describe each framework component:
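The component-by-component description is truncated in this snippet. As a stand-in, the following end-to-end sketch shows how the two stages compose (our illustration only: PCA stands in for the compaction step, a naive random search stands in for the SPP/PPGA genetic optimizer, and the 1-D scatter-ratio index below is a simple stand-in for the paper's PP indexes):

import numpy as np
from sklearn.decomposition import PCA

def scatter_index(z, y):
    # Toy 1-D PP index: between-class over within-class scatter.
    classes = np.unique(y)
    between = sum((z[y == c].mean() - z.mean()) ** 2 for c in classes)
    within = sum(z[y == c].var() for c in classes) + 1e-12
    return between / within

def sequential_pp(X, y, m, n_candidates=2000, seed=0):
    # SPP-style greedy search: find m directions one at a time, deflating
    # the data after each so later directions expose new structure.
    rng = np.random.default_rng(seed)
    Xr, A = X.copy(), []
    for _ in range(m):
        best_w, best_val = None, -np.inf
        for _ in range(n_candidates):
            w = rng.normal(size=X.shape[1])
            w /= np.linalg.norm(w)
            val = scatter_index(Xr @ w, y)
            if val > best_val:
                best_w, best_val = w, val
        A.append(best_w)
        Xr = Xr - np.outer(Xr @ best_w, best_w)   # deflate along best_w
    return np.column_stack(A)

# Stage 1: compact 5000-D data to an n-order representation; stage 2: PP search.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5000))
y = np.repeat([0, 1], 30)
X[y == 1, :10] += 1.5                        # weak class signal in 10 features

Xc = PCA(n_components=50).fit_transform(X)   # stage 1: compaction
A = sequential_pp(Xc, y, m=3)                # stage 2: projection search
Z = Xc @ A                                   # final 3-D representation
print(Z.shape)                               # (60, 3)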

Experimental evaluation

This section presents the experimental evaluation conducted on the proposed framework in order to determine its suitability for classification tasks on large p small n data.
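As a hedged sketch of the quality criterion described in the introduction, the snippet below estimates cross-validated accuracy of LDA and 3NN on a reduced representation; the paper's exact split protocol is not reproduced here, and Z is a synthetic stand-in for a projected dataset:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
Z = rng.normal(size=(60, 3))      # stand-in for a projected dataset
y = np.repeat([0, 1], 30)
Z[y == 1] += 0.8                  # inject some class separability

for name, clf in [("LDA", LinearDiscriminantAnalysis()),
                  ("3NN", KNeighborsClassifier(n_neighbors=3))]:
    acc = cross_val_score(clf, Z, y, cv=5).mean()   # accuracy by default
    print(f"{name}: {acc:.3f}")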

Conclusion

Reducing the dimensionality of datasets with a large number of features and few examples is a challenging problem. In this paper we described and evaluated a Projection Pursuit framework intended to circumvent the difficulties associated with that kind of data and to facilitate the construction of classifiers. The framework is formed by two stages: the first stage performs a rapid compaction of the data, which is used by the second stage to perform a projection search, seeking to pack more discriminatory information into fewer dimensions.

Acknowledgments

We would like to thank CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico), grant #151547/2013-0, and FAPESP (São Paulo Research Foundation), grant #2012/22295-0, for funding this study.


References (83)

  • D. Singh et al., Gene expression correlates of clinical prostate cancer behavior, Cancer Cell (2002)
  • N. Zhou et al., A modified T-test feature selection method and its application on the HapMap genotype data, Genomics Proteomics Bioinform. (2007)
  • I.M. Johnstone et al., Statistical challenges of high-dimensional data, Philos. Trans. R. Soc. A: Math. Phys. Eng. Sci. (2009)
  • S. Russell et al., Microarray Technology in Practice (2008)
  • T.R. Golub et al., Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science (1999)
  • E. Segal et al., From signatures to models: understanding cancer using microarrays, Nat. Genet. (2005)
  • L. Jimenez et al., Hyperspectral data analysis and supervised feature reduction via projection pursuit, IEEE Trans. Geosci. Remote Sens. (1999)
  • R. Clarke et al., The properties of high-dimensional data spaces: implications for exploring gene and protein expression data, Nat. Rev. Cancer (2008)
  • F. Korn et al., On the “dimensionality curse” and the “self-similarity blessing”, IEEE Trans. Knowl. Data Eng. (2001)
  • H. Liu et al., Toward integrating feature selection algorithms for classification and clustering, IEEE Trans. Knowl. Data Eng. (2005)
  • Y. Saeys et al., A review of feature selection techniques in bioinformatics, Bioinformatics (2007)
  • C.J.C. Burges, Dimension reduction: a guided tour, Found. Trends Mach. Learn., Now Publishers Inc., Hanover, MA, USA, 2...
  • H. Liu et al., Feature Extraction, Construction and Selection: A Data Mining Perspective (1998)
  • J.H. Friedman et al., A projection pursuit algorithm for exploratory data analysis, IEEE Trans. Comput. (1974)
  • J.H. Friedman, Exploratory projection pursuit, J. Am. Stat. Assoc. (1987)
  • W. Huang et al., Projection pursuit flood disaster classification assessment method based on multi-swarm cooperative particle swarm optimization, J. Water Resour. Prot. (2011)
  • S. Wang et al., Projection pursuit dynamic cluster model and its application to water resources carrying capacity evaluation, J. Water Resour. Prot. (2010)
  • X. Cui et al., Detecting single-feature polymorphisms using oligonucleotide arrays and robustified projection pursuit, Bioinformatics (2005)
  • D. Pena et al., Cluster identification using projections, J. Am. Stat. Assoc. (2001)
  • R. Bolton et al., Projection pursuit clustering for exploratory data analysis, J. Comput. Graph. Stat. (2003)
  • E. Lee et al., Projection pursuit for exploratory supervised classification, J. Comput. Graph. Stat. (2005)
  • M. Grochowski et al., Fast projection pursuit based on quality of projected clusters
  • M. Aladjem, Projection pursuit mixture density estimation, IEEE Trans. Signal Process. (2005)
  • J.R. Jee, Projection pursuit, Wiley Interdiscip. Rev.: Comput. Stat. (2009)
  • E. Rodriguez-Martinez et al., Automatic induction of projection pursuit indices, IEEE Trans. Neural Netw. (2010)
  • P.J. Huber, Projection pursuit, Ann. Stat. (1985)
  • M.C. Jones et al., What is projection pursuit?, J. R. Stat. Soc. Ser. A: General (1987)
  • G. Nason, Three-dimensional projection pursuit, J. R. Stat. Soc. Ser. C (1995)
  • Q. Guo et al., Sequential projection pursuit using genetic algorithms for data mining of analytical data, Anal. Chem. (2000)
  • A. Berro et al., Genetic algorithms and particle swarm optimization for exploratory projection pursuit, Ann. Math. Artif. Intell. (2010)

Soledad Espezua is currently a Postdoctoral fellow in the Department of Computer Science at the University of São Paulo, Brazil. Her major research focuses are Evolutionary Computation, Bioinformatics, Machine Learning, Data Mining and Meta-learning. She received her M.Sc. and Ph.D. degrees in electrical engineering from the University of São Paulo, Brazil, in 2008 and 2013, respectively.

    Edwin Villanueva received his M.Sc. and Ph.D. degrees in electrical engineering from the University of São Paulo, Brazil, in 2007 and 2012 respectively. He is currently a Postdoctoral fellow in the Department of Computer Science at the University of São Paulo, Brazil. His main interests are Machine Learning, Data Mining, Meta-learning, Bioinformatics, Evolutionary Computation, Bioinspired Computing, Optimization and Probabilistic Graphical Models.

    Carlos D. Maciel is an Associate Professor in Statistical Signal Processing and Pattern Recognition at the Department of Electrical Engineering, University of São Paulo (USP) at Sao Carlos, Brazil. He received his B.Sc. from the Military Institute of Engineering (IME), Brazil, in 1989 and Ph.D. degree from the Federal University of Rio de Janeiro (UFRJ), Brazil, in 2000.

André Carvalho is a Full Professor in the Department of Computer Science at the University of São Paulo, Brazil. He received his Ph.D. in Electronics from the University of Kent, UK, in 1994. He co-authored one textbook on neural networks and one on artificial intelligence, both in Portuguese. He has several publications in books, refereed journals and conferences. He is on the editorial board of, and has been a guest editor for, international journals, and has served as general program chair of several national and international conferences. He has given invited talks and won best paper awards at national and international conferences. He collaborates with researchers from Brazil and abroad in several research projects.
