Topic-Oriented Words as Features for Named Entity Recognition

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7816)

Abstract

Research has shown that topic-oriented words are often related to named entities and can be used for Named Entity Recognition (NER). Many studies have proposed measuring the topicality of words in terms of ‘informativeness’, based on global distributional characteristics of words in a corpus. However, this study shows that there can be a large discrepancy between informativeness and topicality; empirically, informativeness-based features can damage the learning accuracy of NER. This paper proposes measuring words’ topicality based on local distributional features specific to individual documents, and proposes methods to transform topicality into gazetteer-like features for NER by binning. Evaluated on five datasets from three domains, the methods show consistent improvement over a baseline of between 0.9 and 4.0 in F-measure, and always outperform methods that use informativeness measures.
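The idea of turning a document-local score into gazetteer-like features by binning can be illustrated with a minimal sketch. The abstract does not specify the topicality measure or the binning scheme, so everything below is an assumption made for illustration: relative frequency within a single document stands in for the topicality score, words are cut into equal-sized rank bins, and the names `topicality_bins`, `token_features` and `TOPIC_BIN` are hypothetical rather than taken from the paper.

```python
from collections import Counter


def topicality_bins(tokens, num_bins=3):
    """Map each distinct word of ONE document to a bin index.

    Assumption: relative frequency within the document is used as a
    stand-in topicality score. Words are ranked by that score and the
    ranked list is cut into `num_bins` roughly equal slices, with bin 0
    holding the highest-scoring words.
    """
    counts = Counter(t.lower() for t in tokens)
    total = sum(counts.values())
    # Rank by score, highest first; ties are broken alphabetically.
    ranked = sorted(counts, key=lambda w: (-counts[w] / total, w))
    return {
        w: min(i * num_bins // len(ranked), num_bins - 1)
        for i, w in enumerate(ranked)
    }


def token_features(tokens, num_bins=3):
    """Attach one gazetteer-like feature string to every token."""
    bins = topicality_bins(tokens, num_bins)
    return [f"TOPIC_BIN={bins[t.lower()]}" for t in tokens]


if __name__ == "__main__":
    doc = "protein kinase binds the protein and the protein kinase".split()
    for token, feature in zip(doc, token_features(doc)):
        print(token, feature)
```

Each token receives a discrete feature that a sequence labeller could consume in the same way as a gazetteer lookup. Because the scores are computed per document, the same word can fall into different bins in different documents, which is the contrast with corpus-level informativeness that the abstract draws. A raw frequency score also ranks function words highly, so a real measure would need to account for that.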

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Zhang, Z., Cohn, T., Ciravegna, F. (2013). Topic-Oriented Words as Features for Named Entity Recognition. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2013. Lecture Notes in Computer Science, vol 7816. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37247-6_25

  • DOI: https://doi.org/10.1007/978-3-642-37247-6_25

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-37246-9

  • Online ISBN: 978-3-642-37247-6

  • eBook Packages: Computer Science, Computer Science (R0)
