Abstract
We study the problem of automatically extracting information networks, formed by recognizable entities and the relations among them, from social media sites. Our approach uses state-of-the-art natural language processing tools to identify entities and extract the sentences that relate them, and then applies text-clustering algorithms to identify the relations within the information network. We propose a new term-weighting scheme that significantly improves on the state of the art in relation extraction, both when used in conjunction with the standard tf·idf scheme and when used as a pruning filter. We describe an effective method for identifying benchmarks for open information extraction that relies on a curated online database and is comparable to the hand-crafted evaluation datasets in the literature. From this benchmark, we derive a much larger dataset that mimics realistic conditions for the task of open information extraction. We report on extensive experiments on both datasets, which shed light not only on the accuracy levels achieved by state-of-the-art open information extraction tools, but also on how to tune such tools for better results.
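As a rough illustration of the standard tf·idf baseline mentioned in the abstract, the sketch below weights the words that co-occur with each entity pair and compares pairs by cosine similarity, which is the kind of signal a text-clustering step could group on. It is a minimal sketch of plain tf·idf only, not the new term-weighting scheme the paper proposes; the function names and the toy contexts are hypothetical and chosen purely for illustration.

```python
import math
from collections import Counter


def tf_idf(contexts):
    """Return one {term: weight} dict per context, using plain tf * log(N / df)."""
    n = len(contexts)
    # Document frequency: in how many contexts each term appears.
    df = Counter()
    for tokens in contexts:
        df.update(set(tokens))
    vectors = []
    for tokens in contexts:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy data: words found between the two entities of each pair.
contexts = [
    ["acquired", "company"],   # pair 0
    ["acquired", "startup"],   # pair 1
    ["married", "actress"],    # pair 2
]
vectors = tf_idf(contexts)

# Pairs 0 and 1 share the term "acquired", so they are more similar
# to each other than either is to pair 2.
print(cosine(vectors[0], vectors[1]) > cosine(vectors[0], vectors[2]))
```

A clustering algorithm would then group entity pairs whose context vectors are close; which algorithm and which similarity threshold to use are left open here, since the abstract does not specify them.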