
A Unified View of Causal and Non-causal Feature Selection

Published: 18 April 2021

Abstract

In this article, we develop a unified view of causal and non-causal feature selection methods, filling a gap in research on the relation between the two types of methods. Based on the Bayesian network framework and information theory, we first show that causal and non-causal feature selection methods share the same objective: to find the Markov blanket of the class attribute, the theoretically optimal feature set for classification. We then examine the assumptions that causal and non-causal feature selection methods make when searching for the optimal feature set, and unify these assumptions by mapping them to restrictions on the structure of the Bayesian network model of the studied problem. We further analyze in detail how the structural assumptions lead to the different levels of approximation the methods employ in their search, which in turn determine how closely the feature sets found by the methods approximate the optimal feature set. With the unified view, we can interpret the output of non-causal methods from a causal perspective and derive error bounds for both types of methods. Finally, we present a practical understanding of the relation between causal and non-causal methods through extensive experiments with synthetic data and various types of real-world data.
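The shared objective described above, recovering the Markov blanket of the class attribute, can be illustrated with a small sketch. Assuming the Bayesian network structure is already known (the toy DAG below is hypothetical, not from the article), the Markov blanket of a target node consists of its parents, its children, and its children's other parents (its spouses):

```python
def markov_blanket(parents, target):
    """Markov blanket of `target` in a DAG.

    `parents` maps each node to the list of its parent nodes.
    The blanket is: parents(T) | children(T) | spouses(T).
    """
    pa = set(parents.get(target, []))                                # parents of T
    children = {v for v, ps in parents.items() if target in ps}      # nodes with T as a parent
    spouses = {p for c in children for p in parents[c] if p != target}  # co-parents of T's children
    return pa | children | spouses

# Hypothetical toy DAG: A -> T, B -> T, T -> C, D -> C, E disconnected.
dag = {"T": ["A", "B"], "C": ["T", "D"], "A": [], "B": [], "D": [], "E": []}
print(sorted(markov_blanket(dag, "T")))  # -> ['A', 'B', 'C', 'D']
```

Note that E falls outside the blanket: given the blanket variables, T is conditionally independent of every other variable, which is why the blanket is the theoretically optimal feature set. In practice the DAG is unknown, and the methods surveyed here differ in how they approximate this set directly from data.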



    • Published in

      ACM Transactions on Knowledge Discovery from Data, Volume 15, Issue 4
      August 2021
      486 pages
      ISSN: 1556-4681
      EISSN: 1556-472X
      DOI: 10.1145/3458847

      Copyright © 2021 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 18 April 2021
      • Accepted: 1 November 2020
      • Revised: 1 September 2020
      • Received: 1 August 2019


      Qualifiers

      • research-article
      • Research
      • Refereed
