Abstract
Principal component analysis is widely used in the analysis of multivariate data in the agricultural, biological, and environmental sciences. The first few principal components (PCs) of a set of variables are derived variables with optimal properties in terms of approximating the original variables. This paper considers the problem of identifying subsets of variables that best approximate the full set of variables or their first few PCs, thus stressing dimensionality reduction in terms of the original variables rather than in terms of derived variables (PCs) whose definition requires all the original variables. Criteria for selecting variables are often ill defined and may produce inappropriate subsets. Indicators of the performance of different subsets of the variables are discussed and two criteria are defined. These criteria are used in stepwise selection-type algorithms to choose good subsets. Examples are given that show, among other things, that the selection of variable subsets should not be based only on the PC loadings of the variables.
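As a rough illustration of the stepwise idea described above, the following Python sketch greedily adds variables to a subset. The abstract does not state the two criteria, so the sketch uses an assumed stand-in: the fraction of total variance recovered when every variable is linearly regressed on the chosen subset. The function names `subset_variance_fraction` and `forward_select`, and the simulated data, are hypothetical and not taken from the paper.

```python
import numpy as np


def subset_variance_fraction(S, subset):
    """Assumed stand-in criterion (not necessarily the paper's):
    trace(S[:, K] S[K, K]^{-1} S[K, :]) / trace(S), i.e. the share of total
    variance retained when all variables are projected onto the subset K."""
    idx = list(subset)
    S_kk = S[np.ix_(idx, idx)]   # covariance among the subset variables
    S_k = S[:, idx]              # covariance of all variables with the subset
    explained = np.trace(S_k @ np.linalg.solve(S_kk, S_k.T))
    return explained / np.trace(S)


def forward_select(S, k):
    """Greedy forward selection: at each step, add the variable that most
    increases the criterion. Greedy search is not guaranteed to find the
    best subset of size k."""
    p = S.shape[0]
    chosen = []
    for _ in range(k):
        candidates = [j for j in range(p) if j not in chosen]
        best = max(
            candidates,
            key=lambda j: subset_variance_fraction(S, chosen + [j]),
        )
        chosen.append(best)
    return chosen


# Simulated data for illustration only: 200 observations on 5 variables,
# with variable 3 built to be strongly correlated with variable 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=200)
S = np.cov(X, rowvar=False)

selected = forward_select(S, 2)
fraction = subset_variance_fraction(S, selected)
```

Because the criterion is evaluated on the subset as a whole, two variables that both load heavily on the same PC but are nearly redundant add little when selected together. A rule that only ranks variables by their PC loadings does not account for this.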
Cite this article
Cadima, J.F.C.L., Jolliffe, I.T. Variable selection and the interpretation of principal subspaces. JABES 6, 62 (2001). https://doi.org/10.1198/108571101300325256
DOI: https://doi.org/10.1198/108571101300325256