Extracting information networks from the blogosphere

Authors:
Yuval Merhav, Illinois Institute of Technology, Chicago, IL
Filipe Mesquita, University of Alberta
Denilson Barbosa, University of Alberta
Wai Gen Yee, Orbitz Worldwide, Chicago, IL
Ophir Frieder, Georgetown University, Washington, D.C.

ACM Transactions on the Web, Volume 6, Issue 3, Article No. 11, pp. 1–33. https://doi.org/10.1145/2344416.2344418

Published: 02 October 2012


Abstract

We study the problem of automatically extracting information networks, formed by recognizable entities and the relations among them, from social media sites. Our approach uses state-of-the-art natural language processing tools to identify entities and extract the sentences that relate them, then applies text-clustering algorithms to identify the relations within the information network. We propose a new term-weighting scheme that significantly improves on the state of the art in relation extraction, both when used in conjunction with the standard tf·idf scheme and when used as a pruning filter. We describe an effective method for identifying benchmarks for open information extraction that relies on a curated online database that is comparable to the hand-crafted evaluation datasets in the literature. From this benchmark, we derive a much larger dataset that mimics realistic conditions for the task of open information extraction. We report on extensive experiments on both datasets, which shed light not only on the accuracy levels achieved by state-of-the-art open information extraction tools, but also on how to tune such tools for better results.
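
The clustering step mentioned in the abstract can be pictured with a minimal sketch. It covers only the standard tf·idf baseline, not the new term-weighting scheme the article proposes, whose definition is not given on this page. The toy entity pairs, the context words, the function names, the similarity threshold, and the greedy one-pass clustering are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy input (invented for illustration): each entity pair maps to
# the words observed between the two entities in sentences mentioning both.
contexts = {
    ("Obama", "McCain"): ["debate", "opponent", "debate"],
    ("Clinton", "Obama"): ["opponent", "debate"],
    ("Google", "YouTube"): ["acquired", "bought"],
    ("Oracle", "Sun"): ["acquired", "deal"],
}


def tfidf_vectors(contexts):
    """Weight each term by tf * log(N / df), treating each entity pair as a document."""
    n = len(contexts)
    df = Counter(t for terms in contexts.values() for t in set(terms))
    return {
        pair: {t: tf * math.log(n / df[t]) for t, tf in Counter(terms).items()}
        for pair, terms in contexts.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster(vectors, threshold=0.1):
    """Greedy one-pass clustering (an assumption, not the article's algorithm):
    a pair joins the first cluster containing a member at least `threshold` similar,
    otherwise it starts a new cluster."""
    clusters = []
    for pair, vec in vectors.items():
        for members in clusters:
            if any(cosine(vec, vectors[m]) >= threshold for m in members):
                members.append(pair)
                break
        else:
            clusters.append([pair])
    return clusters


vectors = tfidf_vectors(contexts)
clusters = cluster(vectors)
for members in clusters:
    print(members)
```

In this toy data the two political pairs share "debate" and "opponent" and the two corporate pairs share "acquired", so each group ends up in its own cluster; each cluster stands in for one candidate relation.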


Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
2. Information systems
  1. Information retrieval
    1. Document representation
      1. Content analysis and feature selection


Published in

ACM Transactions on the Web Volume 6, Issue 3
September 2012
133 pages
ISSN: 1559-1131
EISSN: 1559-114X
DOI: 10.1145/2344416

Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 2 October 2012
- Accepted: 1 April 2012
- Revised: 1 January 2012
- Received: 1 September 2010

Author Tags
clustering
domain frequency
named entities
open information extraction
relation extraction
Qualifiers
- research-article
- Research
- Refereed
Article Metrics
- Total Citations: 10
- Total Downloads: 531
- Downloads (Last 12 months): 3
- Downloads (Last 6 weeks): 0