ABSTRACT
In today's computerized and information-based society, we are inundated with vast amounts of natural language text data, ranging from news articles, product reviews, and advertisements to a wide range of user-generated content from social media. To turn such massive unstructured text data into actionable knowledge, one of the grand challenges is to gain an understanding of entities and the relationships between them. In this tutorial, we introduce data-driven methods to recognize typed entities of interest in different kinds of text corpora (especially in massive, domain-specific text corpora). These methods can automatically identify token spans as entity mentions in text and label their types (e.g., person, product, food) in a scalable way. We demonstrate on real datasets, including news articles and Yelp reviews, how these typed entities aid in knowledge discovery and management.