ABSTRACT
Recognizing entity instances in documents according to a knowledge base is a fundamental problem in many data mining applications. The problem is extremely challenging for short documents in complex domains such as social media and biomedicine, where large concept spaces and instance ambiguity are the key issues to address. Most such documents are created in a social context, by authors who are linked through social interactions such as replies and citations. These social contexts are largely ignored in the instance-recognition literature. How can users' interactions help entity instance recognition? How can the social context be modeled so as to resolve the ambiguity of different instances?
In this paper, we propose SOCINST, which formalizes the problem as a probabilistic model. Given a set of short documents (e.g., tweets or paper abstracts) posted by users who may be connected with each other, SOCINST automatically constructs a context of subtopics for each instance, with each subtopic representing one possible meaning of the instance. The model can also incorporate social relationships between users to help build the social context. We further incorporate domain knowledge into the model using a Dirichlet tree distribution.
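As a rough illustration of the Dirichlet tree distribution mentioned above, the following Python sketch draws one word distribution from a small tree. It is not the SOCINST implementation: the toy vocabulary, the edge weights, and the names `sample_dirichlet`, `sample_dirichlet_tree`, and `toy_tree` are all assumptions made for this example. The only property it relies on is the standard one: each internal node draws a Dirichlet sample over its children, and a leaf's probability is the product of the branch probabilities on its path from the root.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet(alphas) sample by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_dirichlet_tree(node, rng, mass=1.0, out=None):
    """Split probability mass down the tree and collect it at the leaves.

    A node is either a leaf (a word, given as a string) or a list of
    (edge_weight, child) pairs describing an internal node.
    """
    if out is None:
        out = {}
    if isinstance(node, str):
        out[node] = out.get(node, 0.0) + mass
        return out
    weights = [w for w, _ in node]
    branch_probs = sample_dirichlet(weights, rng)
    for p, (_, child) in zip(branch_probs, node):
        sample_dirichlet_tree(child, rng, mass * p, out)
    return out


# Hypothetical domain knowledge: "iphone" and "ipad" should behave alike,
# so they share a subtree whose large edge weights keep their split near 50/50.
toy_tree = [
    (1.0, "battery"),
    (1.0, "screen"),
    (2.0, [(50.0, "iphone"), (50.0, "ipad")]),
]

rng = random.Random(0)
word_probs = sample_dirichlet_tree(toy_tree, rng)
for word, prob in sorted(word_probs.items()):
    print(f"{word}: {prob:.3f}")
```

In this toy tree, the large weights inside the shared subtree make the two related words receive similar probability in most draws, which is one way prior knowledge about related terms can be encoded. How SOCINST builds its tree is not described in the abstract.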
We evaluate the proposed model on three different genres of datasets: ICDM'12 Contest, Weibo, and I2B2. On ICDM'12 Contest, the proposed model clearly outperforms all the top contestants (+21.4%; p ≪ 1e-5 with t-test). On Weibo and I2B2, the recognition accuracy of SOCINST is 5.3-26.6% higher than that of several alternative methods.
Index Terms
- Incorporating Social Context and Domain Knowledge for Entity Recognition