Article

Towards domain-independent information extraction from web tables

Authors:
Wolfgang Gatterbauer, Paul Bohunsky, Marcus Herzog, Bernhard Krüpl, Bernhard Pollak
(all: Vienna University of Technology, Vienna, Austria)

WWW '07: Proceedings of the 16th international conference on World Wide Web, May 2007, Pages 71–80, https://doi.org/10.1145/1242572.1242583

Published: 08 May 2007


ABSTRACT

Traditionally, information extraction from web tables has focused on small, more or less homogeneous corpora, often relying on assumptions about the use of <table> tags. The multitude of different HTML implementations of web tables makes these approaches difficult to scale. In this paper, we approach the problem of domain-independent information extraction from web tables by shifting our attention from the tree-based representation of web pages to a variation of the two-dimensional visual box model used by web browsers to display information on the screen. The topological and style information obtained in this way allows us to fill the gap created by missing domain-specific knowledge about content and table templates. We believe that, in a future step, this approach can become the basis for a new way of large-scale knowledge acquisition from the current "Visual Web."
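
To illustrate the shift from markup to rendered geometry described in the abstract, the following is a minimal sketch and not the authors' algorithm. It assumes each page element has already been reduced to a rectangle with pixel coordinates and text; the `Box` type, the `group_into_grid` function, and the `tolerance` parameter are hypothetical names introduced here. Rows are recovered purely from vertical alignment, regardless of whether the page used <table> tags.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """A rendered element (hypothetical type): position, size, and text."""
    x: int
    y: int
    width: int
    height: int
    text: str


def group_into_grid(boxes, tolerance=2):
    """Arrange boxes into rows using only their rendered coordinates.

    A box joins the current row when its top edge is within `tolerance`
    pixels of that row's first box; otherwise it starts a new row.
    Within each row, boxes are ordered left to right.
    """
    rows = []  # list of (reference_y, [boxes in that row])
    for box in sorted(boxes, key=lambda b: (b.y, b.x)):
        if rows and abs(rows[-1][0] - box.y) <= tolerance:
            rows[-1][1].append(box)
        else:
            rows.append((box.y, [box]))
    return [[b.text for b in sorted(row, key=lambda b: b.x)] for _, row in rows]


# Boxes arrive in arbitrary order, as they might from differing HTML layouts.
boxes = [
    Box(100, 31, 80, 20, "12.50"),
    Box(0, 0, 80, 20, "Item"),
    Box(0, 30, 80, 20, "Tea"),
    Box(100, 0, 80, 20, "Price"),
]
grid = group_into_grid(boxes)
```

The sketch uses position only; the abstract also mentions style information (for example fonts or colors), which a fuller treatment would need to take into account.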


Index Terms

Towards domain-independent information extraction from web tables
1. Information systems
  1. Information retrieval
  2. Information systems applications
    1. Data mining


Published in
WWW '07: Proceedings of the 16th international conference on World Wide Web
May 2007
1382 pages
ISBN:9781595936547
DOI:10.1145/1242572
General Chairs: Carey Williamson (University of Calgary, Canada), Mary Ellen Zurko (IBM, USA)
Program Chairs: Peter Patel-Schneider (Bell Labs Research, USA), Prashant Shenoy (University of Massachusetts at Amherst, USA)
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
information extraction
visual analysis
web mining
web page representation
web tables
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%

Article Metrics
- Total Citations: 162
- Total Downloads: 1,785
- Downloads (Last 12 months): 32
- Downloads (Last 6 weeks): 2