ABSTRACT
How do we surface the large amount of information present in HTML documents on the Web, from news articles to Rotten Tomatoes pages to tables of sports scores? Such information can enable a variety of applications including knowledge base construction, question answering, recommendation, and more. In this tutorial, we present approaches for information extraction (IE) from Web data that can be differentiated along two key dimensions: 1) the diversity in data modality that is leveraged, e.g. text, visual, XML/HTML, and 2) the thrust to develop scalable approaches with zero to limited human supervision.
- Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, and Paolo Papotti. 2013. Extraction and integration of partially overlapping web sources. PVLDB 6, 10 (2013), 805--816.Google ScholarDigital Library
- Michael Cafarella, Alon Halevy, Hongrae Lee, Jayant Madhavan, Cong Yu, Daisy Zhe Wang, and Eugene Wu. 2018. Ten years of webtables. PVLDB 11, 12 (2018), 2140--2149.Google ScholarDigital Library
- Pankaj Gulhane, Amit Madaan, Rupesh Mehta, Jeyashankher Ramamirtham, Rajeev Rastogi, Sandeep Satpal, Srinivasan H Sengamedu, Ashwin Tengli, and Charu Tiwari. 2011. Web-scale information extraction with vertex. In ICDM. IEEE, 1209--1220.Google Scholar
- Anoop R. Katti, Christian Reisswig, Cordula Guder, Sebastian Brarda, Steffen Bickel, Johannes Höhne, and Jean Baptiste Faddoul. 2018. Chargrid: Towards Understanding 2D Documents. In EMNLP.Google Scholar
- Colin Lockard, Xin Luna Dong, Arash Einolghozati, and Prashant Shiralkar. 2018. CERES: Distantly supervised relation extraction from the semi-structured web. PVLDB 11, 10 (2018), 1084--1096.Google ScholarDigital Library
- Colin Lockard, Prashant Shiralkar, Xin Luna Dong, and Hannaneh Hajishirzi. 2020. ZeroShotCeres: Zero-Shot Relation Extraction from Semi-Structured Webpages. In ACL. Association for Computational Linguistics, Online, 8105--8117. https: //www.aclweb.org/anthology/2020.acl-main.721Google Scholar
- Yi Luan, Dave Wadden, Luheng He, Amy Shah, Mari Ostendorf, and Hannaneh Hajishirzi. 2019. A general framework for information extraction using dynamic span graphs. In NAACL-HLT.Google Scholar
- Yujie Qian, Enrico Santus, Zhijing Jin, Jiang Guo, and Regina Barzilay. 2018. GraphIE: A Graph-Based Framework for Information Extraction. In NAACL-HLT.Google Scholar
- Sen Wu, Luke Hsiao, Xiao Cheng, Braden Hancock, Theodoros Rekatsinas, Philip Levis, and Christopher Ré. 2018. Fonduer: Knowledge Base Construction from Richly Formatted Data. SIGMOD 2018 (2018), 1301--1316.Google ScholarDigital Library
Index Terms
- Multi-modal Information Extraction from Text, Semi-structured, and Tabular Data on the Web
Recommendations
Automatic information extraction from semi-structured Web pages by pattern discovery
Web retrieval and miningThe World Wide Web is now undeniably the richest and most dense source of information; yet, its structure makes it difficult to make use of that information in a systematic way. This paper proposes a pattern discovery approach to the rapid generation of ...
Web-scale Knowledge Collection
WSDM '20: Proceedings of the 13th International Conference on Web Search and Data MiningHow do we surface the large amount of information present in HTML documents on the Web, from news articles to scientific papers to Rotten Tomatoes pages to tables of sports scores? Such information can enable a variety of applications including ...
Information Extraction Using Web Usage Mining, Web Scrapping and Semantic Annotation
CICN '11: Proceedings of the 2011 International Conference on Computational Intelligence and Communication NetworksExtracting useful information from the web is the most significant issue of concern for the realization of semantic web. This may be achieved by several ways among which Web Usage Mining, Web Scrapping and Semantic Annotation plays an important role. ...
Comments