DOI: 10.1145/3394486.3406468
tutorial

Multi-modal Information Extraction from Text, Semi-structured, and Tabular Data on the Web

Published: 20 August 2020

ABSTRACT

How do we surface the large amount of information present in HTML documents on the Web, from news articles to Rotten Tomatoes pages to tables of sports scores? Such information can enable a variety of applications, including knowledge base construction, question answering, and recommendation. In this tutorial, we present approaches for information extraction (IE) from Web data that can be differentiated along two key dimensions: 1) the diversity of data modalities that are leveraged, e.g., text, visual, and XML/HTML, and 2) the drive to develop scalable approaches that need little to no human supervision.


Published in

KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
August 2020
3664 pages
ISBN: 9781450379984
DOI: 10.1145/3394486

Copyright © 2020 Owner/Author

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
