Mining Text Using Keyword Distributions

Journal of Intelligent Information Systems

Abstract

Knowledge Discovery in Databases (KDD) focuses on the computerized exploration of large amounts of data and on the discovery of interesting patterns within them. While most work on KDD has been concerned with structured databases, there has been little work on handling the huge amount of information that is available only in unstructured textual form. This paper describes the KDT system for Knowledge Discovery in Text, in which documents are labeled by keywords, and knowledge discovery is performed by analyzing the co-occurrence frequencies of the various keywords labeling the documents. We show how this keyword-frequency approach supports a range of KDD operations, providing a suitable foundation for knowledge discovery and exploration for collections of unstructured text.
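
As a rough illustration of the keyword-frequency idea described in the abstract, the sketch below counts keyword co-occurrences over a tiny labeled collection and derives the distribution of keywords among documents that carry a given keyword. It is a minimal sketch, not the KDT system's implementation: the documents, the keyword labels, and the function names `cooccurrence_counts` and `keyword_distribution` are all hypothetical and chosen only for this example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy collection: each document is represented only by
# the set of keywords that label it (the labels are invented).
documents = [
    {"crude", "opec", "shipping"},
    {"crude", "opec"},
    {"grain", "shipping", "trade"},
    {"grain", "trade"},
    {"crude", "trade"},
]


def cooccurrence_counts(docs):
    """Count how many documents are labeled with each unordered keyword pair."""
    counts = Counter()
    for labels in docs:
        # Sorting gives every pair one canonical (alphabetical) order.
        for pair in combinations(sorted(labels), 2):
            counts[pair] += 1
    return counts


def keyword_distribution(docs, context):
    """Among documents labeled with `context`, return the proportion
    that also carry each other keyword."""
    selected = [labels for labels in docs if context in labels]
    if not selected:
        return {}
    tally = Counter(k for labels in selected for k in labels if k != context)
    return {k: n / len(selected) for k, n in tally.items()}


pair_counts = cooccurrence_counts(documents)
crude_dist = keyword_distribution(documents, "crude")

for pair, n in pair_counts.most_common(3):
    print(pair, n)
print(sorted(crude_dist.items()))
```

Comparing such distributions across different subsets of the collection is one simple way to look for patterns of the kind the abstract mentions; the proportions here are plain relative frequencies, which is an assumption of this sketch rather than a statement about the paper's measures.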

Feldman, R., Dagan, I. & Hirsh, H. Mining Text Using Keyword Distributions. Journal of Intelligent Information Systems 10, 281–300 (1998). https://doi.org/10.1023/A:1008623632443

  • DOI: https://doi.org/10.1023/A:1008623632443
