tutorial

Automatic Entity Recognition and Typing in Massive Text Corpora

Authors:
Xiang Ren, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Ahmed El-Kishky, University of Illinois at Urbana-Champaign, Urbana, IL, USA
Chi Wang, Microsoft Research, Redmond, WA, USA
Jiawei Han, University of Illinois at Urbana-Champaign, Urbana, IL, USA

WWW '16 Companion: Proceedings of the 25th International Conference Companion on World Wide Web, April 2016, Pages 1025–1028
https://doi.org/10.1145/2872518.2891065

Published: 11 April 2016

ABSTRACT

In today's computerized, information-based society, we are inundated with vast amounts of natural language text, ranging from news articles, product reviews, and advertisements to a wide range of user-generated content on social media. One of the grand challenges in turning such massive unstructured text into actionable knowledge is understanding entities and the relationships between them. In this tutorial, we introduce data-driven methods for recognizing typed entities of interest in different kinds of text corpora, especially massive, domain-specific ones. These methods automatically identify token spans as entity mentions in text and label their types (e.g., people, product, food) in a scalable way. Using real datasets, including news articles and Yelp reviews, we demonstrate how these typed entities aid knowledge discovery and management.
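To make the task concrete, the following minimal sketch shows what "identifying token spans as entity mentions and labeling their types" can look like in its simplest form. It is not the method presented in the tutorial: it is a plain longest-match dictionary lookup, and the seed dictionary `SEED_TYPES`, the function `tag_entities`, and the `max_len` parameter are all hypothetical names introduced here for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical seed dictionary (an assumption, not from the tutorial):
# lowercase token tuples mapped to an entity type.
SEED_TYPES: Dict[Tuple[str, ...], str] = {
    ("barack", "obama"): "person",
    ("iphone",): "product",
    ("pad", "thai"): "food",
}


def tag_entities(
    text: str,
    seeds: Dict[Tuple[str, ...], str] = SEED_TYPES,
    max_len: int = 3,
) -> List[Tuple[int, int, str, str]]:
    """Return (start, end, surface, type) for each matched span.

    Token offsets are zero-based and `end` is exclusive. At each position
    the longest span found in `seeds` wins, and matching resumes after it.
    """
    # Whitespace tokenization with edge punctuation stripped; a real
    # system would use a proper tokenizer.
    cleaned = [tok.strip(".,!?;:") for tok in text.split()]
    lowered = [tok.lower() for tok in cleaned]

    mentions: List[Tuple[int, int, str, str]] = []
    i = 0
    while i < len(cleaned):
        matched = False
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(cleaned) - i), 0, -1):
            span = tuple(lowered[i:i + n])
            if span in seeds:
                surface = " ".join(cleaned[i:i + n])
                mentions.append((i, i + n, surface, seeds[span]))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return mentions


if __name__ == "__main__":
    for mention in tag_entities("Barack Obama ordered pad thai."):
        print(mention)
```

A lookup like this only finds mentions already in the dictionary and cannot resolve ambiguous names, which is the kind of limitation that motivates data-driven approaches on massive, domain-specific corpora.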

Index Terms

Automatic Entity Recognition and Typing in Massive Text Corpora
1. Information systems
  1. Information systems applications
    1. Data mining

Published in
WWW '16 Companion: Proceedings of the 25th International Conference Companion on World Wide Web
April 2016
1094 pages
ISBN: 9781450341448
General Chairs:
Jacqueline Bourdeau, Tele-university (TELUQ), Montreal, QC, Canada
Jim A. Hendler, Rensselaer Polytechnic Institute, Troy, NY, USA
Roger Nkambou, Université du Québec à Montréal, Montreal, QC, Canada
Program Chairs:
Ian Horrocks, University of Oxford, UK
Ben Y. Zhao, University of California at Santa Barbara, CA, USA
Copyright © 2016 held by the owner/author(s)
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Publisher
International World Wide Web Conferences Steering Committee
Republic and Canton of Geneva, Switzerland
Author Tags
entity recognition and typing
massive text corpora
Qualifiers
- tutorial
Acceptance Rates
WWW '16 Companion Paper Acceptance Rate: 115 of 727 submissions, 16%
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%