Extracting information networks from the blogosphere

Published: 02 October 2012

Abstract

We study the problem of automatically extracting information networks formed by recognizable entities as well as relations among them from social media sites. Our approach consists of using state-of-the-art natural language processing tools to identify entities and extract sentences that relate such entities, followed by using text-clustering algorithms to identify the relations within the information network. We propose a new term-weighting scheme that significantly improves on the state of the art in the task of relation extraction, both when used in conjunction with the standard tf·idf scheme and when used as a pruning filter. We describe an effective method for identifying benchmarks for open information extraction that relies on a curated online database that is comparable to the hand-crafted evaluation datasets in the literature. From this benchmark, we derive a much larger dataset which mimics realistic conditions for the task of open information extraction. We report on extensive experiments on both datasets, which shed light not only on the accuracy levels achieved by state-of-the-art open information extraction tools, but also on how to tune such tools for better results.
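
As a rough illustration of the standard tf·idf baseline mentioned in the abstract, the sketch below weights the words that appear between two entities and compares the resulting vectors with cosine similarity, which is the kind of signal a text-clustering step could use to group entity pairs by relation. It is a minimal sketch under stated assumptions: the example contexts, the function names `tfidf_vectors` and `cosine`, and the natural-log idf variant are all illustrative choices, not taken from the article. The new term-weighting scheme proposed in the article is not described in the abstract and is not reproduced here.

```python
import math
from collections import Counter


def tfidf_vectors(contexts):
    """Return one sparse tf·idf vector (a dict) per tokenized context.

    Assumption: raw term counts for tf and log(N / df) for idf,
    which is one common variant among several.
    """
    n = len(contexts)
    df = Counter()
    for tokens in contexts:
        df.update(set(tokens))  # count each term once per context
    vectors = []
    for tokens in contexts:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical contexts: the words found between two entities in a sentence.
contexts = [
    "was elected president of".split(),
    "elected president of".split(),
    "acquired the company".split(),
    "has acquired".split(),
]

vectors = tfidf_vectors(contexts)
same_relation = cosine(vectors[0], vectors[1])
different_relation = cosine(vectors[0], vectors[2])
print(same_relation > different_relation)
```

Contexts that share relation-bearing words score higher than contexts with no words in common, so a clustering algorithm run over these vectors would tend to place the first two entity pairs together.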

      • Published in

        ACM Transactions on the Web  Volume 6, Issue 3
        September 2012
        133 pages
        ISSN: 1559-1131
        EISSN: 1559-114X
        DOI: 10.1145/2344416

        Copyright © 2012 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 2 October 2012
        • Accepted: 1 April 2012
        • Revised: 1 January 2012
        • Received: 1 September 2010

        Qualifiers

        • research-article
        • Research
        • Refereed
