DOI: 10.1145/2872518.2891065
tutorial

Automatic Entity Recognition and Typing in Massive Text Corpora

Published: 11 April 2016

ABSTRACT

In today's computerized and information-based society, we are inundated with vast amounts of natural language text data, ranging from news articles, product reviews, and advertisements to a wide range of user-generated content from social media. To turn such massive unstructured text data into actionable knowledge, one of the grand challenges is to gain an understanding of entities and the relationships between them. In this tutorial, we introduce data-driven methods to recognize typed entities of interest in various kinds of text corpora, especially massive, domain-specific ones. These methods can automatically identify token spans as entity mentions in text and label their types (e.g., people, product, food) in a scalable way. We demonstrate on real datasets, including news articles and Yelp reviews, how these typed entities aid in knowledge discovery and management.
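To make the task concrete, the following is a minimal sketch of what "identifying token spans as entity mentions and labeling their types" can look like as input and output. It is not the tutorial's method: it uses a plain greedy longest-match lookup against a small hand-written dictionary, whereas the abstract describes data-driven methods. The dictionary contents, the function name `find_typed_mentions`, and the `max_len` parameter are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical toy dictionary mapping lowercased token sequences to types.
# A data-driven system would derive such evidence from the corpus itself.
TYPE_DICTIONARY: Dict[Tuple[str, ...], str] = {
    ("barack", "obama"): "person",
    ("iphone",): "product",
    ("pad", "thai"): "food",
}


def find_typed_mentions(
    tokens: List[str], max_len: int = 3
) -> List[Tuple[int, int, str]]:
    """Return (start, end, type) spans with an exclusive end index.

    Scans left to right and, at each position, prefers the longest
    dictionary match of at most max_len tokens.
    """
    lowered = [t.lower() for t in tokens]
    mentions: List[Tuple[int, int, str]] = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shorter ones.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(lowered[i:i + length])
            if key in TYPE_DICTIONARY:
                mentions.append((i, i + length, TYPE_DICTIONARY[key]))
                i += length  # skip past the matched span
                matched = True
                break
        if not matched:
            i += 1
    return mentions


if __name__ == "__main__":
    sentence = "Barack Obama ordered pad thai on his iPhone".split()
    for start, end, entity_type in find_typed_mentions(sentence):
        print(" ".join(sentence[start:end]), "->", entity_type)
```

A fixed dictionary like this cannot handle unseen or ambiguous mentions, which is one reason corpus-driven approaches of the kind the tutorial covers are of interest.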

Published in

WWW '16 Companion: Proceedings of the 25th International Conference Companion on World Wide Web
April 2016
1094 pages
ISBN: 9781450341448

Copyright © 2016. Copyright is held by the owner/author(s).

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

International World Wide Web Conferences Steering Committee

Republic and Canton of Geneva, Switzerland

Acceptance Rates

WWW '16 Companion paper acceptance rate: 115 of 727 submissions (16%)
Overall acceptance rate: 1,899 of 8,196 submissions (23%)
