A Projection Pursuit framework for supervised dimension reduction of high dimensional small sample datasets
Introduction
In the last few decades we have witnessed rapid development and refinement of data acquisition technologies in several scientific and industrial areas [1]. This has led to the emergence of high-throughput technologies capable of generating datasets in which the number of features (p) is far greater than the number of examples (n), the so-called large p small n datasets. A representative example is microarray technology [2], which has made it possible to measure the expression levels of thousands of genes relatively quickly and cheaply, leading to significant advances in the understanding of severe diseases such as cancer and raising hopes of possible cures [3], [4].
Although collecting large p small n datasets is now common practice in many fields, their analysis and interpretation remain challenging [1], [5], [6]. The difficulty stems mainly from the so-called “curse of dimensionality” inherent in this kind of data [7]: as the dimensionality increases, the corresponding space becomes emptier and the data points tend to become equidistant. This is detrimental to most machine-learning and pattern-recognition methods (causing model-estimation instability, overfitting and convergence to local optima) and compromises their generalization performance and reliability [5], [6].
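The tendency toward equidistance can be observed numerically. The following is a minimal sketch, not taken from the paper: it assumes points drawn uniformly from the unit hypercube and Euclidean distance, and the function name relative_contrast is hypothetical. It reports how much the farthest point differs from the nearest one, relative to the nearest distance, as the dimensionality grows.

```python
import numpy as np

def relative_contrast(n_points, dim, seed=0):
    """Return (d_max - d_min) / d_min for the distances between one random
    query point and n_points random points in the unit hypercube."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, dim))   # uniform sample (assumption)
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# The contrast shrinks as the dimensionality increases: the nearest and
# farthest neighbours become almost equally distant from the query.
for dim in (2, 100, 10_000):
    print(dim, round(relative_contrast(50, dim), 3))
```

With only 50 points, as in a small-sample setting, the contrast in 10,000 dimensions is a small fraction of its value in two dimensions, which is why distance-based methods lose discriminating power on such data.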
A common approach to circumvent the curse of dimensionality is to reduce the dimensionality of the data [6]. Two kinds of methods exist for this task: feature selection (FS) [8], [9] and feature extraction (FE) [10], [11]. FS methods try to find small subsets of the original features that are relevant to the intended analysis. FE methods reduce the dimensionality by building new features from linear or nonlinear combinations of the original features. FS has the benefit of preserving the meaning of the original features, which makes the results easier for the domain expert to interpret [9]. However, it has been argued [12] that FE is preferable to FS when the final goal is an accurate system for classifying new examples and interpretability is less important. This is because FE is not tied to the original feature space and therefore has a greater chance of finding more useful representations for the desired task [12].
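The difference between the two families can be illustrated with a short sketch. It is not the procedure evaluated in this paper: the synthetic data, the Welch-type t-statistic used for ranking, the use of PCA as the extraction method and the function names select_features and extract_features are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 20, 500, 5                      # few examples, many features
y = np.repeat([0, 1], n // 2)             # two balanced classes
X = rng.normal(size=(n, p))
X[y == 1, :10] += 2.0                     # first 10 features carry class signal

def select_features(X, y, k):
    """FS: return the indices of the k original features with the largest
    absolute two-sample (Welch-type) t-statistic."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    score = np.abs(a.mean(axis=0) - b.mean(axis=0)) / se
    return np.argsort(score)[::-1][:k]

def extract_features(X, k):
    """FE: build k new features as linear combinations of all original
    features (here the leading principal components)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k].T                          # p x k projection matrix
    return Xc @ W, W

idx = select_features(X, y, k)
X_fs = X[:, idx]                          # columns keep their original meaning
X_fe, W = extract_features(X, k)          # columns mix all p original features
print(X_fs.shape, X_fe.shape)
```

Both outputs have k columns, but only the FS columns correspond to individual original features; each FE column depends on all p of them.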
Projection pursuit (PP) [13], [14] is an FE method that has been successfully applied in several domains for both supervised and unsupervised analyses (e.g. [15], [16], [17], [18]). PP seeks low-dimensional linear projections of the data that expose interesting aspects of the data. To this end, it employs a measure of “interestingness” known as the projection pursuit index (PP index). A key advantage of PP is its flexibility to fit different pattern recognition tasks, depending on the PP index used. For example, PP can be used to perform clustering analysis [19], [20], classification [21], [22], [23], [24], regression analysis [25] and density estimation [26] (reviews of PP indexes can be found in [21], [27], [28]). Another advantage of PP is its out-of-sample mapping capability, that is, the possibility of mapping new examples into the projection space after it has been constructed.
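A minimal one-dimensional sketch of these ideas is given below. It is an illustration under stated assumptions rather than any method from the cited works: the index is a simple ratio of between-class to within-class variance (the name fisher_index is hypothetical), and a plain random search over unit vectors stands in for a real PP optimizer.

```python
import numpy as np

def fisher_index(z, y):
    """Assumed PP index: between-class over within-class variation of the
    one-dimensional projected data z with class labels y."""
    overall = z.mean()
    between = within = 0.0
    for c in np.unique(y):
        zc = z[y == c]
        between += len(zc) * (zc.mean() - overall) ** 2
        within += np.sum((zc - zc.mean()) ** 2)
    return between / within

def random_search_pp(X, y, n_trials=2000, seed=0):
    """Keep the unit-length direction with the highest index value among
    n_trials random candidates (a stand-in for a real optimizer)."""
    rng = np.random.default_rng(seed)
    best_a, best_val = None, -np.inf
    for _ in range(n_trials):
        a = rng.normal(size=X.shape[1])
        a /= np.linalg.norm(a)            # enforce the unit-norm constraint
        val = fisher_index(X @ a, y)
        if val > best_val:
            best_a, best_val = a, val
    return best_a, best_val

rng = np.random.default_rng(1)
y_train = np.repeat([0, 1], 15)
X_train = rng.normal(size=(30, 4))
X_train[y_train == 1, 0] += 3.0           # classes differ along feature 0

a, value = random_search_pp(X_train, y_train)

# Out-of-sample mapping: new examples are projected with the same vector.
X_new = rng.normal(size=(5, 4))
z_new = X_new @ a
print(value, z_new.shape)
```

Swapping fisher_index for another function changes the kind of structure the search looks for, which is the flexibility referred to above. Because the projection is linear, mapping unseen examples only requires the stored vector.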
Despite these advantages, the literature shows limited use of PP on large p small n datasets, such as those generated by microarray technology. This may be due to the high computational difficulty of finding optimal projection spaces in such cases. For instance, projecting a dataset with p=10k features (a realistic number in microarray datasets) onto a target space of dimension m=3 requires optimizing a projection matrix of p×m = 30k elements. Evidently, the problem worsens as p or m increases. Traditional PP optimizers based on gradient or Newton methods [29], [30], [31], [19] are usually inadequate for this kind of data because of the vastness of the space of possible projections and, consequently, their high susceptibility to becoming trapped in poor local optima [14]. More global PP optimizers have been described recently, including genetic algorithms (GA) [32], [33], simulated annealing (SA) [21], random scan sampling (RSSA) [34] and particle swarm optimization (PSO) [35]. However, none of these approaches has been directly applied to dimensionalities as high as those found in microarray data, which illustrates the difficulty of applying PP in such scenarios.
In this paper we present a framework to facilitate the application of PP to large p small n datasets for classification tasks. The framework consists of two main stages (Fig. 1): a compaction stage and a PP optimization stage. The first stage is devised to rapidly transform the original data into a less sparse representation. The second stage is the PP part, which is responsible for finding optimal projections, taking the compacted representation as input.
For the compaction stage we use three well-known techniques: PCA, whitening and partial least squares. For the PP stage, we adopt the Sequential Projection Pursuit (SPP) approach [32] coupled with the GA optimizer (PPGA) we described recently [33], in which a specialized crossover operator showed excellent search capabilities. An experimental study on eight public microarray datasets is presented. The evaluation systematically tested several configurations of the framework, including variations of the compaction method, the PP index function and the target dimensionality. We used the predictive accuracy of two popular classification methods (LDA and 3NN) to assess the quality of the tested configurations. We also compare the framework against eight well-established dimension reduction methods, including both FE and FS methods.
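The two-stage idea can be sketched as follows. This is a simplified illustration under explicit assumptions, not the implementation evaluated in the paper: only the PCA variant of the compaction stage is shown, a random search over orthonormal matrices stands in for the SPP/PPGA optimizer, the index separation_index is an assumed class-separation measure, and all function names and the synthetic data are hypothetical.

```python
import numpy as np

def pca_compact(X):
    """Compaction stage (PCA variant): return the training mean and a p x r
    orthonormal basis with r <= n - 1 spanning the centered data."""
    mean = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    keep = s > 1e-10 * s.max()            # drop numerically null directions
    return mean, Vt[keep].T

def separation_index(Z, y):
    """Assumed PP index: between-class over within-class scatter of the
    projected data Z."""
    overall = Z.mean(axis=0)
    between = within = 0.0
    for c in np.unique(y):
        Zc = Z[y == c]
        between += len(Zc) * np.sum((Zc.mean(axis=0) - overall) ** 2)
        within += np.sum((Zc - Zc.mean(axis=0)) ** 2)
    return between / within

def search_projection(C, y, m, n_trials=500, seed=0):
    """PP stage stand-in: random search over orthonormal r x m matrices."""
    rng = np.random.default_rng(seed)
    best_A, best_val = None, -np.inf
    for _ in range(n_trials):
        A, _ = np.linalg.qr(rng.normal(size=(C.shape[1], m)))
        val = separation_index(C @ A, y)
        if val > best_val:
            best_A, best_val = A, val
    return best_A, best_val

rng = np.random.default_rng(0)
n, p, m = 20, 2000, 3
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[y == 1, :50] += 1.0                     # synthetic class signal

mean, V = pca_compact(X)                  # stage 1: p -> r dimensions
C = (X - mean) @ V                        # n x r compacted data
A, val = search_projection(C, y, m)       # stage 2: search in r dimensions
W = V @ A                                 # composed p x m projection
Z = (X - mean) @ W                        # final m-dimensional representation
print(V.shape, A.shape, W.shape)
```

In this sketch the search runs over an r × m matrix with r = n − 1 = 19 rather than over the full p × m matrix, and new examples can still be mapped directly from the original feature space with the composed matrix W after subtracting the training mean.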
The paper is organized as follows. Section 2 introduces some important concepts of PP, SPP, PP optimization and PP indexes used in the paper. Section 3 describes the proposed framework. Section 4 presents the experimental evaluation, including the experimental setup, results and corresponding discussion. Finally, our conclusions are presented in Section 5.
Projection pursuit
The projection pursuit (PP) concept was formally introduced by Friedman and Tukey [13], although the seminal ideas were originally posed by Kruskal [36]. To describe the PP concept we assume that we have a data matrix of dimensions n × p, where n is the number of data examples or observations and p is the number of attributes or variables. PP can be defined as the constrained optimization problem in (1), where the aim is to seek an m-dimensional projection space (defined by
A PP framework for supervised dimension reduction of large p small n data
We detail here the proposed framework to ease the application of PP to large p small n data. Fig. 2 shows the structure of this framework. It comprises two stages: the first stage implements a fast procedure to compact the data into an intermediate-dimensional representation (on the order of n). The second stage is a PP procedure over the compacted data, which implements an improved version of the SPP scheme. Next, we describe each framework component:
Experimental evaluation
This section presents the experimental evaluation of the proposed framework, conducted to determine its suitability for classification tasks on large p small n data.
Conclusion
Reducing the dimensionality of datasets with a large number of features and few examples is a challenging problem. In this paper we described and evaluated a Projection Pursuit framework intended to circumvent the difficulties associated with that kind of data and to facilitate the construction of classifiers. The framework is formed by two stages: the first stage performs a rapid compaction of the data, which is used by the second stage to perform a projection search, seeking to
Acknowledgments
We would like to thank CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico), grant #151547/2013-0, and FAPESP (São Paulo Research Foundation), grant #2012/22295-0, for funding this study.
Soledad Espezua is currently a Postdoctoral fellow in the Department of Computer Science at the University of São Paulo, Brazil. Her major research focuses are Evolutionary Computation, Bioinformatics, Machine Learning, Data Mining and Meta-learning. She received her Ph.D. and M.Sc. degrees in electrical engineering from the University of São Paulo, Brazil, in 2008 and 2013 respectively.
References (83)
- et al. A projection pursuit algorithm for anomaly detection in hyperspectral imagery. Pattern Recognit. (2008)
- et al. A projection pursuit algorithm to classify individuals using fMRI data: application to schizophrenia. Neuroimage (2008)
- et al. Projection-pursuit approach to robust linear discriminant analysis. J. Multivar. Anal. (2010)
- et al. Prediction of ozone tropospheric degradation rate constants by projection pursuit regression. Anal. Chim. Acta (2007)
- et al. Towards an efficient genetic algorithm optimizer for sequential projection pursuit. Neurocomputing (2014)
- et al. An improved optimization algorithm and Bayes factor termination criterion for sequential projection pursuit. Chemom. Intell. Lab. Syst. (2005)
- Toward a practical method which helps uncover the structure of a set of multivariate observations by finding the linear transformation which optimizes a new “Index of Condensation”. Stat. Comput. (1969)
- Projection pursuit exploratory data analysis. Comput. Stat. Data Anal. (1995)
- et al. Fast neighborhood component analysis. Neurocomputing (2012)
- et al. The small sample size problem of ICA: a comparative study and analysis. Pattern Recognit. (2012)
- Gene expression correlates of clinical prostate cancer behavior. Cancer Cell
- A modified T-test feature selection method and its application on the hapmap genotype data. Genomics Proteomics Bioinform.
- Statistical challenges of high-dimensional data. Philos. Trans. R. Soc. A: Math. Phys. Eng. Sci.
- Microarray Technology in Practice
- Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science
- From signatures to models: understanding cancer using microarrays. Nat. Genet.
- Hyperspectral data analysis and supervised feature reduction via projection pursuit. IEEE Trans. Geosci. Remote Sens.
- The properties of high-dimensional data spaces: implications for exploring gene and protein expression data. Nat. Rev. Cancer
- On the “dimensionality curse” and the “self-similarity blessing”. IEEE Trans. Knowl. Data Eng.
- Toward integrating feature selection algorithms for classification and clustering. IEEE Trans. Knowl. Data Eng.
- A review of feature selection techniques in bioinformatics. Bioinformatics
- Feature Extraction, Construction and Selection: A Data Mining Perspective
- A projection pursuit algorithm for exploratory data analysis. IEEE Trans. Comput.
- Exploratory projection pursuit. Am. Stat. Assoc.
- Projection pursuit flood disaster classification assessment method based on multi-swarm cooperative particle swarm optimization. J. Water Resour. Prot.
- Projection pursuit dynamic cluster model and its application to water resources carrying capacity evaluation. J. Water Resour. Prot.
- Detecting single-feature polymorphisms using oligonucleotide arrays and robustified projection pursuit. Bioinformatics
- Cluster identification using projections. J. Am. Stat. Assoc.
- Projection pursuit clustering for exploratory data analysis. J. Comput. Graph. Stat.
- Projection pursuit for exploratory supervised classification. J. Comput. Graph. Stat.
- Fast projection pursuit based on quality of projected clusters
- Projection pursuit mixture density estimation. IEEE Trans. Signal Process.
- Projection pursuit. Wiley Interdiscip. Rev.: Comput. Stat.
- Automatic induction of projection pursuit indices. IEEE Trans. Neural Netw.
- Projection pursuit. Ann. Stat.
- What is projection pursuit? J. R. Stat. Soc. Ser. A: General
- Three-dimensional projection pursuit. J. R. Stat. Soc. Ser. C
- Sequential projection pursuit using genetic algorithms for data mining of analytical data. Anal. Chem.
- Genetic algorithms and particle swarm optimization for exploratory projection pursuit. Ann. Math. Artif. Intell.
Edwin Villanueva received his M.Sc. and Ph.D. degrees in electrical engineering from the University of São Paulo, Brazil, in 2007 and 2012 respectively. He is currently a Postdoctoral fellow in the Department of Computer Science at the University of São Paulo, Brazil. His main interests are Machine Learning, Data Mining, Meta-learning, Bioinformatics, Evolutionary Computation, Bioinspired Computing, Optimization and Probabilistic Graphical Models.
Carlos D. Maciel is an Associate Professor in Statistical Signal Processing and Pattern Recognition at the Department of Electrical Engineering, University of São Paulo (USP) at São Carlos, Brazil. He received his B.Sc. from the Military Institute of Engineering (IME), Brazil, in 1989 and his Ph.D. degree from the Federal University of Rio de Janeiro (UFRJ), Brazil, in 2000.
André Carvalho is a Full Professor in the Department of Computer Science at the University of São Paulo, Brazil. He received his Ph.D. in Electronics from the University of Kent, UK, in 1994. He co-authored one textbook on neural networks and one on artificial intelligence, both in Portuguese. He has several publications in books, refereed journals and conferences. He is on the editorial board and was a guest editor of international journals, and was general program chair of several national and international conferences. He has given invited talks and won best paper awards at national and international conferences. He collaborates with researchers from Brazil and abroad in several research projects.