research-article

Incorporating Social Context and Domain Knowledge for Entity Recognition

Authors:
Jie Tang, Tsinghua University, Beijing, China
Zhanpeng Fang, Tsinghua University, Beijing, China
Jimeng Sun, Georgia Institute of Technology, Atlanta, GA, USA

WWW '15: Proceedings of the 24th International Conference on World Wide Web, May 2015, Pages 517–526. https://doi.org/10.1145/2736277.2741135

Published: 18 May 2015

ABSTRACT

Recognizing entity instances in documents according to a knowledge base is a fundamental problem in many data mining applications. The problem is extremely challenging for short documents in complex domains such as social media and biomedicine. Large concept spaces and instance ambiguity are key issues that need to be addressed. Most documents are created in a social context by common authors via social interactions, such as replies and citations. Such social contexts are largely ignored in the instance-recognition literature. How can users' interactions help entity instance recognition? How can the social context be modeled so as to resolve the ambiguity of different instances?

In this paper, we propose the SOCINST model to formalize the problem into a probabilistic model. Given a set of short documents (e.g., tweets or paper abstracts) posted by users who may connect with each other, SOCINST can automatically construct a context of subtopics for each instance, with each subtopic representing one possible meaning of the instance. The model is also able to incorporate social relationships between users to help build social context. We further incorporate domain knowledge into the model using a Dirichlet tree distribution.
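
To make the Dirichlet tree idea concrete, the following is a minimal sketch of sampling a word distribution from a Dirichlet tree. It is not the SOCINST implementation, which the abstract does not spell out. The toy vocabulary, the edge weights, and the function names (`sample_dirichlet`, `sample_dirichlet_tree`) are assumptions made for illustration. Each internal node draws branch probabilities from a Dirichlet over its children, and a word's probability is the product of the branch probabilities on its path from the root. Placing related words under one subtree with large edge weights is one way domain knowledge can tie their probabilities together.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one sample from Dirichlet(alphas) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_dirichlet_tree(node, rng, mass=1.0, out=None):
    """Recursively sample leaf (word) probabilities from a Dirichlet tree.

    A node is either a leaf (a word string) or a list of
    (edge_weight, child) pairs describing an internal node.
    """
    if out is None:
        out = {}
    if isinstance(node, str):
        # Leaf: accumulate the probability mass that reached this word.
        out[node] = out.get(node, 0.0) + mass
        return out
    weights = [w for w, _ in node]
    branch_probs = sample_dirichlet(weights, rng)
    for p, (_, child) in zip(branch_probs, node):
        sample_dirichlet_tree(child, rng, mass * p, out)
    return out


# Hypothetical toy tree (an assumption, not from the paper): the two drug
# names share a subtree with large edge weights, so the mass of that
# subtree is split nearly evenly between them in most samples.
tree = [
    (2.0, [(50.0, "aspirin"), (50.0, "ibuprofen")]),
    (1.0, "tweet"),
    (1.0, "paper"),
]

rng = random.Random(0)
word_probs = sample_dirichlet_tree(tree, rng)
for word, prob in sorted(word_probs.items()):
    print(f"{word}: {prob:.4f}")
```

With all edge weights set so that every internal node is an ordinary symmetric Dirichlet over leaves, the tree reduces to a flat Dirichlet prior; the nested structure is what lets some groups of words be constrained more tightly than others.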

We evaluate the proposed model on three different genres of datasets: ICDM'12 Contest, Weibo, and I2B2. On the ICDM'12 Contest data, the proposed model clearly outperforms all the top contestants (+21.4%; p ≪ 1e-5 with a t-test). On Weibo and I2B2, our results also show that the recognition accuracy of SOCINST is 5.3-26.6% better than that of several alternative methods.

Index Terms

1. Applied computing
  1. Law, social and behavioral sciences
    1. Sociology

Published in
WWW '15: Proceedings of the 24th International Conference on World Wide Web
May 2015, 1460 pages
ISBN: 9781450334693
General Chairs: Aldo Gangemi (National Research Council, Italy & Paris 13 University-CNRS, France), Stefano Leonardi (Sapienza University of Rome, Italy), Alessandro Panconesi (Sapienza University of Rome, Italy)
Copyright © 2015 Copyright is held by the International World Wide Web Conference Committee (IW3C2)
Publisher: International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland
Author Tags
instance recognition
probabilistic model
social network
Acceptance Rates
WWW '15 Paper Acceptance Rate: 131 of 929 submissions, 14%
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%