Analyzing Document Collections via Context-Aware Term Extraction

Keim, Daniel A.; Oelke, Daniela; Rohrdantz, Christian

doi:10.1007/978-3-642-12550-8_13

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5723)

Included in the following conference series:

International Conference on Application of Natural Language to Information Systems

Abstract

In large document collections that are divided into predefined classes, the differences and similarities between those classes are of special interest. This paper presents an approach that automatically extracts terms from such collections, describing which topics discriminate a single class from the others (discriminating terms) and which topics discriminate a subset of the classes from the remaining ones (overlap terms). The importance for real-world applications and the effectiveness of our approach are demonstrated by two practical examples. In the first application, the predefined classes correspond to different scientific conferences. By extracting terms from collections of papers published at these conferences, we automatically determine the topical differences and similarities of the conferences. In the second application, we extract terms from a collection of product reviews that show which features reviewers commented on. We obtain these terms by discriminating the product review class against a suitable counter-balance class. Finally, our method is evaluated by comparing it to alternative approaches.
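The distinction between discriminating terms and overlap terms can be made concrete with a small sketch. The abstract does not specify the scoring used in the paper, so the code below is only an illustrative assumption: it uses per-class document frequency with two hypothetical thresholds (`high` and `low`), and the function names `doc_frequency` and `classify_terms` are invented for this example.

```python
from collections import Counter


def doc_frequency(docs):
    """Return, per term, the fraction of documents containing it at least once."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.lower().split()))
    return {term: n / len(docs) for term, n in counts.items()}


def classify_terms(classes, high=0.5, low=0.2):
    """Split terms into discriminating and overlap terms.

    Assumption (not taken from the paper): a term is "frequent" in a class
    if its document frequency there is >= high, and "rare" if it is <= low.
    """
    freqs = {name: doc_frequency(docs) for name, docs in classes.items()}
    vocabulary = set().union(*freqs.values())
    discriminating, overlap = {}, {}
    for term in sorted(vocabulary):
        frequent_in = frozenset(
            name for name, f in freqs.items() if f.get(term, 0.0) >= high
        )
        rare_elsewhere = all(
            freqs[name].get(term, 0.0) <= low
            for name in freqs
            if name not in frequent_in
        )
        if not frequent_in or not rare_elsewhere:
            continue
        if len(frequent_in) == 1:
            # Frequent in exactly one class, rare in all others.
            discriminating[term] = next(iter(frequent_in))
        elif len(frequent_in) < len(freqs):
            # Frequent in a proper subset of classes, rare in the rest.
            overlap[term] = frequent_in
    return discriminating, overlap
```

Called with a mapping from class name to a list of document strings, `classify_terms` returns one dictionary mapping each discriminating term to its class and one mapping each overlap term to the set of classes that share it. Terms that are frequent in every class, or that fall between the two thresholds, are reported in neither.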

Author information

Authors and Affiliations

University of Konstanz, Germany
Daniel A. Keim, Daniela Oelke & Christian Rohrdantz

Editor information

Editors and Affiliations

Institut für Computertechnologie, Technische Universität Wien, A-1040, Wien, Austria
Helmut Horacek
CNAM- Laboratoire Cédric, 292 Rue St. Martin, 75141, Paris Cedex 03, France
Elisabeth Métais
Departamento de Lenguajes y Sistemas Informáticos, Universidad de Alicante, Campus de San Vincente del Raspeig, Apdo 99, 03080, Alicante, Spain
Rafael Muñoz
Dept. of Computational Linguistics, Saarland University, Germany
Magdalena Wolska

About this paper

Cite this paper

Keim, D.A., Oelke, D., Rohrdantz, C. (2010). Analyzing Document Collections via Context-Aware Term Extraction. In: Horacek, H., Métais, E., Muñoz, R., Wolska, M. (eds) Natural Language Processing and Information Systems. NLDB 2009. Lecture Notes in Computer Science, vol 5723. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12550-8_13

DOI: https://doi.org/10.1007/978-3-642-12550-8_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12549-2
Online ISBN: 978-3-642-12550-8
eBook Packages: Computer Science, Computer Science (R0)
