Abstract
Advances in computer technologies have enabled corporations to accumulate data at an unprecedented speed. Large-scale business data may contain billions of observations and thousands of features, easily pushing its size into the terabyte range. Most traditional feature selection algorithms are designed for a centralized computing architecture, and their usability deteriorates significantly when data size exceeds hundreds of gigabytes. High-performance distributed computing frameworks and protocols, such as the Message Passing Interface (MPI) and MapReduce, have been proposed to facilitate software development on grid infrastructures, enabling analysts to process large-scale problems efficiently. This paper presents a novel large-scale feature selection algorithm based on variance analysis: the algorithm selects features by evaluating their ability to explain data variance. It supports both supervised and unsupervised feature selection and can be readily implemented in most distributed computing environments. The algorithm was developed as a SAS High-Performance Analytics procedure, which can read data in distributed form and perform parallel feature selection in both symmetric multiprocessing (SMP) mode and massively parallel processing (MPP) mode. Experimental results demonstrate the superior performance of the proposed method for large-scale feature selection.
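The abstract's core idea of scoring features by their ability to explain data variance can be illustrated with a minimal sketch. This is not the paper's actual algorithm (which is distributed and supports a supervised variant); it is a hypothetical single-machine simplification in which each feature is scored by the fraction of the data's total variance recovered when every centered column is projected onto that feature. The function name and the scoring rule are assumptions for illustration only.

```python
import numpy as np

def variance_preservation_scores(X):
    """Score each feature by the fraction of the data's total variance
    explained when all centered columns are projected onto that feature.
    A hypothetical simplification of a variance-preservation criterion."""
    Xc = X - X.mean(axis=0)          # center each column
    total_var = np.sum(Xc ** 2)      # total (unnormalized) variance
    scores = []
    for j in range(Xc.shape[1]):
        f = Xc[:, j:j + 1]           # candidate feature, shape (n, 1)
        denom = (f.T @ f).item()
        if denom == 0.0:             # constant feature explains nothing
            scores.append(0.0)
            continue
        # Least-squares projection of every column onto feature j
        proj = f @ (f.T @ Xc) / denom
        scores.append(np.sum(proj ** 2) / total_var)
    return np.array(scores)
```

Under this toy criterion, a feature that is strongly correlated with many others receives a high score, because projecting the data onto it preserves most of the total variance; selecting the top-k scores then yields a feature subset. The distributed version described in the paper would compute the required cross-products in parallel across data partitions.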
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zhao, Z., Cox, J., Duling, D., Sarle, W. (2012). Massively Parallel Feature Selection: An Approach Based on Variance Preservation. In: Flach, P.A., De Bie, T., Cristianini, N. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2012. Lecture Notes in Computer Science(), vol 7523. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-33460-3_21
DOI: https://doi.org/10.1007/978-3-642-33460-3_21
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-33459-7
Online ISBN: 978-3-642-33460-3
eBook Packages: Computer Science (R0)