Performance of some variable selection methods when multicollinearity is present

https://doi.org/10.1016/j.chemolab.2004.12.011

Abstract

Variable selection is an important practical issue for many scientists and engineers. Although PLS (partial least squares) regression combined with the VIP (variable importance in the projection) scores is often used when multicollinearity is present among variables, there are few guidelines on its use or its performance. The purpose of this paper is to explore the nature of the VIP method and to compare it with other methods through computer simulation experiments. We design 108 experiments in which observations are generated from true models considering four factors: the proportion of the number of relevant predictors, the magnitude of correlations between predictors, the structure of regression coefficients, and the magnitude of the signal-to-noise ratio. A confusion matrix is adopted to evaluate the performance of PLS, the Lasso, and the stepwise method. We also discuss the proper cutoff value of the VIP method to increase its performance. Some practical hints for the use of the VIP method are given based on the simulation results.

Introduction

The quality of a final product in a process industry is believed to be determined by a large number of process variables. Process engineers are often interested in finding the vital few process variables that are most influential on the quality of the product. With only a few variables in hand, the control problem for quality improvement becomes much easier. Although stepwise regression methods are often used for this purpose because of their simplicity, there are several reasons why process engineers are often not satisfied with the results. One of them is the poor performance of stepwise methods when multicollinearity exists among the variables. Under this situation, the VIP (variable importance in the projection) scores obtained from partial least squares (PLS) regression have received increasing attention as an importance measure for each explanatory variable, or predictor [1]. However, the performance and the proper use of the VIP scores are not well understood.

The objective of this study is to investigate the performance of the VIP scores for selecting the relevant process variables, that is, the variables that "really" have an effect on the response or have nonzero coefficients. For this purpose, we use computer simulation experiments in which true models are assumed and data sets are generated to mimic a typical manufacturing process consisting of consecutive unit processes. We compare the performance of the VIP scores under PLS (called the PLS-VIP method) with that of the PLS regression coefficients (called the PLS-BETA method), the Lasso regression [2], and stepwise regression [3]. We also aim to discuss the proper cutoff value for the PLS-VIP method.

The rest of the paper is organized as follows. A brief review of variable selection methods using PLS regression, the Lasso regression, and stepwise regression is given in Section 2. Section 3 describes the simulation design and the performance measure based on a confusion matrix. The simulation results and the discussion are given in Section 4. Finally, Section 5 concludes the paper with a summary.


Partial least squares regression

In the case of a single response y and p predictors, the PLS regression model with h (h ≤ p) latent variables can be expressed as follows [4], [5]:

$$X = TP^{t} + E \tag{1a}$$
$$y = Tb + f \tag{1b}$$

In Eqs. (1a) and (1b), X (n × p), T (n × h), P (p × h), y (n × 1), and b (h × 1) denote, respectively, the predictors, the X scores, the X loadings, the response, and the regression coefficients of T. The k-th element of the column vector b explains the relation between y and t_k, the k-th column vector of T. Meanwhile, E (n × p) and f (n × 1) stand for the random errors of X and y, respectively.
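To make this construction concrete, the following is a minimal sketch, not the authors' code, of how VIP scores can be obtained from a fitted PLS model. It assumes Python with scikit-learn's PLSRegression; the attribute names x_scores_, x_weights_, and y_loadings_ are scikit-learn's, not the paper's notation, and the VIP formula used is the standard one based on the y-variance explained by each latent variable.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def vip_scores(pls_model):
    """VIP score of each predictor from a fitted PLSRegression model."""
    T = pls_model.x_scores_    # (n, h) latent-variable scores
    W = pls_model.x_weights_   # (p, h) predictor weights
    Q = pls_model.y_loadings_  # (1, h) y loadings for a single response
    p, h = W.shape
    # y-variance explained by the k-th latent variable: SS_k = q_k^2 * t_k' t_k
    ss = (Q.ravel() ** 2) * np.sum(T ** 2, axis=0)
    # Normalize each weight vector so squared weights sum to one per component
    W2 = (W / np.linalg.norm(W, axis=0, keepdims=True)) ** 2
    # VIP_j = sqrt( p * sum_k SS_k * w_jk^2 / sum_k SS_k )
    return np.sqrt(p * (W2 @ ss) / ss.sum())

# Usage: predictors with VIP greater than 1 are commonly flagged as relevant, e.g.
#   pls = PLSRegression(n_components=3).fit(X, y)
#   selected = vip_scores(pls) > 1.0
```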

Design of simulations

We generate data sets by assuming that the true response follows a linear model with p predictors, defined as Eq. (6):

$$y_i = \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i, \qquad \varepsilon_i \overset{\mathrm{iid}}{\sim} N(0, \sigma^2), \quad i = 1, 2, \ldots, 500 \tag{6}$$

Here, the data matrix X = (x_{ij}) is generated by assuming a special correlation structure described in Section 3.1.2. For convenience, we fix the number of relevant predictors at 10, so that the remaining (p − 10) predictors are irrelevant to the response in all cases. We design 108 (= 3 × 3 × 4 × 3) different cases with four factors: the proportion of the number of relevant predictors, the magnitude of correlations between predictors, the structure of regression coefficients, and the magnitude of the signal-to-noise ratio.
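As an illustration of this data-generating step, here is a minimal sketch rather than the paper's actual design: the specific correlation structure of Section 3.1.2 is not reproduced in this excerpt, so an equicorrelated structure with a single correlation rho is assumed, and the nonzero coefficients are simply set to one.

```python
import numpy as np

def simulate(n=500, p=30, n_relevant=10, rho=0.5, sigma=1.0, seed=0):
    """Generate one data set following the linear model of Eq. (6).

    Assumptions for illustration only: equicorrelated predictors with
    correlation `rho` (the paper's Section 3.1.2 structure differs) and
    unit coefficients for the 10 relevant predictors.
    """
    rng = np.random.default_rng(seed)
    cov = rho * np.ones((p, p)) + (1.0 - rho) * np.eye(p)
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    beta = np.zeros(p)
    beta[:n_relevant] = 1.0
    y = X @ beta + rng.normal(0.0, sigma, size=n)
    return X, y, beta
```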

PLS-VIP method vs. the Lasso or Stepwise method

The number of latent variables for PLS regression, the tuning parameter for the Lasso, and the significance levels for stepwise regression are determined by five-fold cross-validation, which is widely used for estimating prediction error [12]. As mentioned before, 100 replications for each of the 108 cases are made to evaluate the performance of the variable selection methods. At each replication, the performance measure G was calculated. In addition, the root mean squared error (RMSE) of predicted
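To illustrate how such a comparison can be set up, the following sketch, again not the authors' code, tunes the Lasso and the number of PLS latent variables by five-fold cross-validation and tallies confusion-matrix counts of each selection result against the truly relevant predictors. It reuses simulate() and vip_scores() from the sketches above; the precise definition of the measure G is not shown in this excerpt, so only the raw confusion-matrix counts are computed here.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import GridSearchCV

def confusion_counts(selected, relevant):
    """Confusion-matrix counts for a variable-selection result (boolean arrays)."""
    tp = int(np.sum(selected & relevant))
    fp = int(np.sum(selected & ~relevant))
    fn = int(np.sum(~selected & relevant))
    tn = int(np.sum(~selected & ~relevant))
    return tp, fp, fn, tn

X, y, beta = simulate()            # from the earlier data-generation sketch
relevant = beta != 0

# Lasso: tuning parameter chosen by five-fold cross-validation
lasso = LassoCV(cv=5).fit(X, y)
lasso_selected = lasso.coef_ != 0

# PLS-VIP: number of latent variables chosen by five-fold cross-validation,
# then predictors flagged by the common "VIP > 1" rule
grid = GridSearchCV(PLSRegression(), {"n_components": range(1, 11)}, cv=5).fit(X, y)
pls_selected = vip_scores(grid.best_estimator_) > 1.0

print("Lasso   :", confusion_counts(lasso_selected, relevant))
print("PLS-VIP :", confusion_counts(pls_selected, relevant))
```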

Conclusions

In this paper, we conducted 10,800 experiments to explore the nature of the PLS-VIP method in comparison with other variable selection methods. The experiments were designed by considering four factors: the proportion of the number of relevant predictors among the total predictors, the magnitude of correlations between predictors, the structure of regression coefficients, and the magnitude of the signal-to-noise ratio.

First, the PLS-VIP method was compared with the Lasso and the stepwise method. The PLS-VIP

Acknowledgements

We would like to thank two anonymous referees for their valuable comments that have led to a substantial improvement in the paper. This work was supported by the Brain Korea 21 project and by the Systems Bio-Dynamics Research Center at POSTECH.

References (12)

  • S. Wold et al., Chemom. Intell. Lab. Syst. (2001)
  • P. Geladi et al., Anal. Chim. Acta (1986)
  • R. Tibshirani, J. R. Stat. Soc. (1996)
  • D.C. Montgomery et al.
  • L. Eriksson et al., Multi- and Megavariate Data Analysis: Principles and Applications (2001)
  • S. Wold et al.
