tutorial

Multi-modal Information Extraction from Text, Semi-structured, and Tabular Data on the Web

Authors:
Xin Luna Dong, Amazon, Seattle, WA, USA
Hannaneh Hajishirzi, University of Washington & Allen Institute for AI, Seattle, WA, USA
Colin Lockard, University of Washington & Amazon, Seattle, WA, USA
Prashant Shiralkar, Amazon, Seattle, WA, USA

KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, August 2020, Pages 3543–3544. https://doi.org/10.1145/3394486.3406468

Published: 20 August 2020

ABSTRACT

How do we surface the large amount of information present in HTML documents on the Web, from news articles to Rotten Tomatoes pages to tables of sports scores? Such information can enable a variety of applications, including knowledge base construction, question answering, and recommendation. In this tutorial, we present approaches for information extraction (IE) from Web data that can be differentiated along two key dimensions: 1) the diversity of data modalities leveraged (e.g., text, visual, XML/HTML), and 2) the drive to develop scalable approaches with zero to limited human supervision.

Index Terms

Multi-modal Information Extraction from Text, Semi-structured, and Tabular Data on the Web
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Information extraction
2. Information systems
  1. World Wide Web
    1. Web mining
      1. Data extraction and integration

Published in
KDD '20: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
August 2020
3664 pages
ISBN:9781450379984
DOI:10.1145/3394486
General Chairs: Rajesh Gupta (UC San Diego, USA), Yan Liu (USC, USA)
Program Chairs: Mohak Shah (LG Electronics, USA), Suju Rajan (LinkedIn, USA)
Publications Chairs: Jiliang Tang (Michigan State, USA), B. Aditya Prakash (Georgia Tech, USA)
Copyright © 2020 Owner/Author
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
information extraction
semi-structured data
web extraction
web mining
Qualifiers
- tutorial
Conference

Acceptance Rates
Overall acceptance rate: 1,133 of 8,635 submissions (13%)