Abstract
Advances in computer technologies have enabled corporations to accumulate data at an unprecedented speed. Large-scale business data may contain billions of observations and thousands of features, easily pushing its size into the terabyte range. Most traditional feature selection algorithms are designed for a centralized computing architecture, and their usability deteriorates significantly when data size exceeds hundreds of gigabytes. High-performance distributed computing frameworks and protocols, such as the Message Passing Interface (MPI) and MapReduce, have been proposed to facilitate software development on grid infrastructures, enabling analysts to process large-scale problems efficiently. This paper presents a novel large-scale feature selection algorithm based on variance analysis: the algorithm selects features by evaluating their ability to explain data variance. It supports both supervised and unsupervised feature selection and can be readily implemented in most distributed computing environments. The algorithm was developed as a SAS High-Performance Analytics procedure, which can read data in distributed form and perform parallel feature selection in both symmetric multiprocessing (SMP) mode and massively parallel processing (MPP) mode. Experimental results demonstrate the superior performance of the proposed method for large-scale feature selection.
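The abstract's core idea of scoring features by their ability to explain data variance can be illustrated with a minimal sketch. This is not the paper's actual algorithm (which is distributed and supports a supervised variant); it is a hypothetical single-machine simplification in which each feature is scored by the fraction of the data's total variance recovered when every centered column is projected onto that feature. The function name and the scoring rule are assumptions for illustration only.

```python
import numpy as np

def variance_preservation_scores(X):
    """Score each feature by the fraction of the data's total variance
    explained when all centered columns are projected onto that feature.
    A hypothetical simplification of a variance-preservation criterion."""
    Xc = X - X.mean(axis=0)          # center each column
    total_var = np.sum(Xc ** 2)      # total (unnormalized) variance
    scores = []
    for j in range(Xc.shape[1]):
        f = Xc[:, j:j + 1]           # candidate feature, shape (n, 1)
        denom = (f.T @ f).item()
        if denom == 0.0:             # constant feature explains nothing
            scores.append(0.0)
            continue
        # Least-squares projection of every column onto feature j
        proj = f @ (f.T @ Xc) / denom
        scores.append(np.sum(proj ** 2) / total_var)
    return np.array(scores)
```

Under this toy criterion, a feature that is strongly correlated with many others receives a high score, because projecting the data onto it preserves most of the total variance; selecting the top-k scores then yields a feature subset. The distributed version described in the paper would compute the required cross-products in parallel across data partitions.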
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhao, Z., Cox, J., Duling, D., Sarle, W. (2012). Massively Parallel Feature Selection: An Approach Based on Variance Preservation. In: Flach, P.A., De Bie, T., Cristianini, N. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2012. Lecture Notes in Computer Science(), vol 7523. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33460-3_21
DOI: https://doi.org/10.1007/978-3-642-33460-3_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33459-7
Online ISBN: 978-3-642-33460-3
eBook Packages: Computer Science (R0)