Analyzing Document Collections via Context-Aware Term Extraction

  • Conference paper
Natural Language Processing and Information Systems (NLDB 2009)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5723)

Abstract

In large document collections that are divided into predefined classes, the differences and similarities between those classes are of special interest. This paper presents an approach that automatically extracts terms from such collections. The extracted terms describe which topics discriminate a single class from the others (discriminating terms) and which topics discriminate a subset of the classes from the remaining ones (overlap terms). Two practical examples demonstrate the importance for real-world applications and the effectiveness of our approach. In the first application, the predefined classes correspond to different scientific conferences. By extracting terms from collections of papers published at these conferences, we automatically determine the topical differences and similarities of the conferences. In the second application, we extract terms from a collection of product reviews that show which features reviewers commented on. We obtain these terms by discriminating the product review class against a suitable counter-balance class. Finally, we evaluate our method by comparing it to alternative approaches.
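
The distinction between discriminating terms and overlap terms can be illustrated with a small sketch. The abstract does not specify how terms are scored, so the following is not the authors' method: it assumes a simple document-frequency rule with two hypothetical thresholds (`high`, `low`), whitespace tokenization, and invented toy classes. A term counts as discriminating when it is frequent in exactly one class and rare in all others, and as an overlap term when it is frequent in a proper subset of two or more classes and rare in the rest.

```python
from collections import Counter, defaultdict


def doc_fraction(docs):
    """Fraction of documents in one class that contain each term."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.lower().split()))
    return {term: n / len(docs) for term, n in counts.items()}


def extract_terms(classes, high=0.6, low=0.2):
    """Group terms by the set of classes in which they are frequent.

    Assumed rule (illustrative only): in every class a term must be
    either frequent (fraction >= high) or rare (fraction <= low).
    Terms in between, and terms frequent in all classes, are skipped.
    """
    fractions = {name: doc_fraction(docs) for name, docs in classes.items()}
    vocabulary = set().union(*fractions.values())
    discriminating = defaultdict(set)  # class name -> terms
    overlap = defaultdict(set)         # frozenset of class names -> terms
    for term in vocabulary:
        frequent_in = set()
        ambiguous = False
        for name, frac in fractions.items():
            value = frac.get(term, 0.0)
            if value >= high:
                frequent_in.add(name)
            elif value > low:
                ambiguous = True
        if ambiguous or not frequent_in or len(frequent_in) == len(classes):
            continue
        if len(frequent_in) == 1:
            discriminating[next(iter(frequent_in))].add(term)
        else:
            overlap[frozenset(frequent_in)].add(term)
    return dict(discriminating), dict(overlap)


# Toy data: three invented classes standing in for conferences.
TOY_CLASSES = {
    "vis": ["graph layout rendering", "graph rendering pipeline", "layout rendering"],
    "db": ["query index graph", "query index storage", "graph query index"],
    "ir": ["query ranking retrieval", "ranking retrieval index", "retrieval query ranking"],
}

if __name__ == "__main__":
    disc, over = extract_terms(TOY_CLASSES)
    print("discriminating:", disc)
    print("overlap:", over)
```

On the toy data, "rendering" and "layout" separate the "vis" class from the others, while "graph" is shared by "vis" and "db" but absent from "ir", which makes it an overlap term for that pair. Terms with intermediate frequency in some class, such as "index", are dropped by the assumed rule.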

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Keim, D.A., Oelke, D., Rohrdantz, C. (2010). Analyzing Document Collections via Context-Aware Term Extraction. In: Horacek, H., Métais, E., Muñoz, R., Wolska, M. (eds) Natural Language Processing and Information Systems. NLDB 2009. Lecture Notes in Computer Science, vol 5723. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12550-8_13

  • DOI: https://doi.org/10.1007/978-3-642-12550-8_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-12549-2

  • Online ISBN: 978-3-642-12550-8

  • eBook Packages: Computer Science, Computer Science (R0)
