
A Unified View of Causal and Non-causal Feature Selection

Published: 18 April 2021

Abstract

In this article, we develop a unified view of causal and non-causal feature selection methods, filling a gap in research on the relation between the two types of methods. Based on the Bayesian network framework and information theory, we first show that causal and non-causal feature selection methods share the same objective: to find the Markov blanket of the class attribute, the theoretically optimal feature set for classification. We then examine the assumptions that causal and non-causal feature selection methods make when searching for the optimal feature set, and unify these assumptions by mapping them to restrictions on the structure of the Bayesian network model of the studied problem. We further analyze in detail how the structural assumptions lead to the different levels of approximation the methods employ in their search, which in turn determine how closely the feature sets found by the methods approximate the optimal feature set. With the unified view, we can interpret the output of non-causal methods from a causal perspective and derive error bounds for both types of methods. Finally, we present a practical understanding of the relation between causal and non-causal methods through extensive experiments with synthetic data and various types of real-world data.
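The shared objective described above, recovering the Markov blanket of the class attribute, can be illustrated with a small sketch. Assuming the Bayesian network structure is already known (the toy DAG below is hypothetical, not from the article), the Markov blanket of a target node consists of its parents, its children, and its children's other parents (its spouses):

```python
def markov_blanket(parents, target):
    """Markov blanket of `target` in a DAG.

    `parents` maps each node to the list of its parent nodes.
    The blanket is: parents(T) | children(T) | spouses(T).
    """
    pa = set(parents.get(target, []))                                # parents of T
    children = {v for v, ps in parents.items() if target in ps}      # nodes with T as a parent
    spouses = {p for c in children for p in parents[c] if p != target}  # co-parents of T's children
    return pa | children | spouses

# Hypothetical toy DAG: A -> T, B -> T, T -> C, D -> C, E disconnected.
dag = {"T": ["A", "B"], "C": ["T", "D"], "A": [], "B": [], "D": [], "E": []}
print(sorted(markov_blanket(dag, "T")))  # -> ['A', 'B', 'C', 'D']
```

Note that E falls outside the blanket: given the blanket variables, T is conditionally independent of every other variable, which is why the blanket is the theoretically optimal feature set. In practice the DAG is unknown, and the methods surveyed here differ in how they approximate this set directly from data.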



    • Published in

      ACM Transactions on Knowledge Discovery from Data, Volume 15, Issue 4
      August 2021
      486 pages
      ISSN: 1556-4681
      EISSN: 1556-472X
      DOI: 10.1145/3458847

      Copyright © 2021 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 18 April 2021
      • Accepted: 1 November 2020
      • Revised: 1 September 2020
      • Received: 1 August 2019


      Qualifiers

      • research-article
      • Research
      • Refereed
