
Information-Theoretic Measures for Knowledge Discovery and Data Mining

Chapter in: Entropy Measures, Maximum Entropy Principle and Emerging Applications

Part of the book series: Studies in Fuzziness and Soft Computing (STUDFUZZ, volume 119)

Abstract

A database may be viewed as a statistical population, and an attribute as a statistical variable taking values from its domain; one can therefore carry out statistical and information-theoretic analyses on a database. Based on its attribute values, a database can be partitioned into smaller populations. An attribute is deemed important if it partitions the database in such a way that previously unknown regularities and patterns become observable. Many information-theoretic measures have been proposed and applied in various fields to quantify the importance of attributes and the relationships between attributes. In the context of knowledge discovery and data mining (KDD), we present a critical review and analysis of information-theoretic measures of attribute importance and attribute association, with emphasis on their interpretations and connections.
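
As an informal illustration (not part of the original chapter), the sketch below treats each column of a small table as an attribute and computes two standard quantities used to assess attribute importance and association: the Shannon entropy H(A) of an attribute and the mutual information I(A; B) = H(A) + H(B) - H(A, B) between two attributes, each of which induces a partition of the rows. The toy table, column names, and helper functions are hypothetical and serve only to make the idea concrete.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy H(A) of an attribute, estimated from value frequencies."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def mutual_information(a_values, b_values):
    """I(A; B) = H(A) + H(B) - H(A, B): an entropy-based measure of association."""
    joint = list(zip(a_values, b_values))  # joint distribution of the two attributes
    return entropy(a_values) + entropy(b_values) - entropy(joint)


# Hypothetical toy table: each row is a record, each key an attribute.
table = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rain", "windy": "no", "play": "yes"},
    {"outlook": "rain", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
]

play = [row["play"] for row in table]
for attr in ("outlook", "windy"):
    col = [row[attr] for row in table]
    print(f"{attr}: H = {entropy(col):.3f}, "
          f"I({attr}; play) = {mutual_information(col, play):.3f}")
```

In this hypothetical example, the attribute whose induced partition shares more mutual information with the target attribute separates the target values more cleanly; this is one common entropy-based way of quantifying attribute importance and association of the kind reviewed in the chapter.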

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Yao, Y.Y. (2003). Information-Theoretic Measures for Knowledge Discovery and Data Mining. In: Karmeshu (eds) Entropy Measures, Maximum Entropy Principle and Emerging Applications. Studies in Fuzziness and Soft Computing, vol 119. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-36212-8_6

  • DOI: https://doi.org/10.1007/978-3-540-36212-8_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-05531-7

  • Online ISBN: 978-3-540-36212-8
