Performance of some variable selection methods when multicollinearity is present
Introduction
The quality of a final product in a process industry is believed to be determined by a large number of process variables. Process engineers are often interested in finding the vital few process variables that most influence product quality. With only a few variables in hand, the control problem for quality improvement becomes much easier. Although stepwise regression methods are often used for this purpose because of their simplicity, process engineers are often dissatisfied with the results for several reasons. One of them is the poor performance of stepwise regression when multicollinearity exists among the variables. In this situation, the VIP (Variable Importance in the Projection) scores obtained from partial least squares (PLS) regression have received increasing attention as a measure of the importance of each explanatory variable or predictor [1]. However, the performance and proper use of the VIP scores have not been well explored.
The objective of this study is to investigate the performance of the VIP scores in selecting the relevant process variables, that is, those which “really” have an effect on the response or have nonzero coefficients. For this purpose, we use computer simulation experiments in which true models are assumed and data sets are generated to mimic a typical manufacturing process consisting of consecutive unit processes. We compare the performance of the VIP scores under PLS (called the PLS-VIP method) with the PLS regression (called the PLS-BETA method), the Lasso regression [2], and stepwise regression [3]. We also discuss the proper cutoff value for the PLS-VIP method.
The rest of the paper is organized as follows. Section 2 gives a brief review of variable selection methods using PLS regression, the Lasso regression, and stepwise regression. Section 3 describes the simulation design and the performance measure based on the confusion matrix. The simulation results and the discussion are given in Section 4. Finally, Section 5 concludes the paper with a summary.
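As a rough illustration of a confusion-matrix performance measure for variable selection, the sketch below compares a set of selected predictors with the set of truly relevant ones. It assumes the measure is the geometric mean of sensitivity and specificity; this excerpt does not define the measure, and the function names `confusion_counts` and `g_mean` as well as the toy selection are hypothetical.

```python
import math


def confusion_counts(selected, relevant, n_predictors):
    """Count TP, FP, FN, TN for one variable-selection result.

    selected, relevant: iterables of predictor indices in range(n_predictors).
    """
    selected, relevant = set(selected), set(relevant)
    tp = len(selected & relevant)   # relevant and selected
    fp = len(selected - relevant)   # irrelevant but selected
    fn = len(relevant - selected)   # relevant but missed
    tn = n_predictors - tp - fp - fn
    return tp, fp, fn, tn


def g_mean(selected, relevant, n_predictors):
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    tp, fp, fn, tn = confusion_counts(selected, relevant, n_predictors)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# Toy example: 20 predictors, the first 10 relevant (hypothetical selection).
relevant = range(10)
selected = [0, 1, 2, 3, 4, 5, 6, 7, 12, 15]
print(round(g_mean(selected, relevant, 20), 3))
```

A measure of this kind rewards a method only when it both finds the relevant predictors and leaves the irrelevant ones out; selecting every predictor gives a specificity of zero and therefore a score of zero.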
Partial least squares regression
In the case of a single response y and p predictors, the PLS regression model with h (h≤p) latent variables can be expressed as follows [4], [5].
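Eqs. (1a) and (1b) are not reproduced in this excerpt. Assuming the standard bilinear PLS form, which is consistent with the dimensions stated for X, T, P, y, b, E, and f, they would read roughly as follows (the transpose notation is an assumption made here):

```latex
\begin{align}
X &= T P^{\top} + E, \tag{1a}\\
y &= T b + f. \tag{1b}
\end{align}
```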
In Eqs. (1a) and (1b), X (n×p), T (n×h), P (p×h), y (n×1), and b (h×1) denote the predictors, X scores, X loadings, response, and regression coefficients of T, respectively. The k-th element of the column vector b describes the relation between y and tk, the k-th column vector of T. Meanwhile, E (n×p) and f (n×1) stand for random errors of X and y,
Design of simulations
We generate datasets by assuming that the true response follows a linear model with p predictors, defined in Eq. (6).
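Eq. (6) itself is not reproduced in this excerpt. A generic linear model matching the description might be written as below, where the coefficient symbols β_j and the error term ε_i are notation assumed here rather than taken from the paper, and β_j is nonzero only for the relevant predictors:

```latex
y_i = \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i, \qquad i = 1, \dots, n. \tag{6}
```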
Here, the data matrix X=(xij) is generated by assuming a special correlation structure described in Section 3.1.2. For convenience, we fix the number of relevant predictors at 10, so the remaining p−10 predictors are irrelevant to the response in all cases. We design 108 (=3×3×4×3) different cases with four factors–the
PLS-VIP method vs. the Lasso or Stepwise method
The number of latent variables for PLS regression, the tuning parameter for the Lasso, and the significance levels for stepwise regression are determined by five-fold cross-validation, which is widely used for estimating prediction error [12]. As mentioned before, 100 replications are made for each of the 108 cases to evaluate the performance of the variable selection methods. At each replication, the performance measure G was calculated. In addition, the root mean squared error (RMSE) of predicted
Conclusions
In this paper, we conducted 10,800 experiments to explore the nature of the PLS-VIP method in comparison with other variable selection methods. The experiments were designed around four factors: the proportion of relevant predictors among all predictors, the magnitude of the correlations between predictors, the structure of the regression coefficients, and the magnitude of the signal-to-noise ratio.
First, the PLS-VIP method was compared with the Lasso and stepwise methods. The PLS-VIP
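The experiment count follows from the full factorial layout: 108 cases (3×3×4×3) times 100 replications per case. The sketch below enumerates such a design. The factor names and level labels are placeholders, since this excerpt gives only the number of levels per factor and does not say which factor has four; the helper names are hypothetical as well.

```python
from itertools import product

# Placeholder level labels: only the level counts (3 x 3 x 4 x 3) come
# from the text; names and values here are assumptions for illustration.
FACTOR_LEVELS = {
    "factor_a": ["a1", "a2", "a3"],
    "factor_b": ["b1", "b2", "b3"],
    "factor_c": ["c1", "c2", "c3", "c4"],
    "factor_d": ["d1", "d2", "d3"],
}
N_REPLICATIONS = 100  # replications per case, as stated in the text


def enumerate_cases(factor_levels):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factor_levels)
    return [dict(zip(names, combo)) for combo in product(*factor_levels.values())]


def enumerate_runs(cases, n_replications):
    """Yield (case_index, replication_index, case) for every simulation run."""
    for case_index, case in enumerate(cases):
        for rep in range(n_replications):
            yield case_index, rep, case


cases = enumerate_cases(FACTOR_LEVELS)
runs = list(enumerate_runs(cases, N_REPLICATIONS))
print(len(cases), len(runs))
```

Each run in such a loop would generate one dataset, apply every variable selection method to it, and record the performance measure.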
Acknowledgements
We would like to thank two anonymous referees for their valuable comments that have led to a substantial improvement in the paper. This work was supported by the Brain Korea 21 project and by the Systems Bio-Dynamics Research Center at POSTECH.
References
- et al., Chemom. Intell. Lab. Syst. (2001)
- et al., Anal. Chim. Acta (1986)
- J. R. Stat. Soc. (1996)
- et al.
- et al., Multi- and megavariate data analysis; principles and applications (2001)
- et al.