DOI: 10.1145/1242572.1242583
Article

Towards domain-independent information extraction from web tables

Published: 08 May 2007

ABSTRACT

Traditionally, information extraction from web tables has focused on small, more or less homogeneous corpora, often based on assumptions about the use of <table> tags. A multitude of different HTML implementations of web tables make these approaches difficult to scale. In this paper, we approach the problem of domain-independent information extraction from web tables by shifting our attention from the tree-based representation of webpages to a variation of the two-dimensional visual box model used by web browsers to display the information on the screen. The topological and style information thereby obtained allows us to fill the gap created by missing domain-specific knowledge about content and table templates. We believe that, in a future step, this approach can become the basis for a new way of large-scale knowledge acquisition from the current "Visual Web."
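
As an illustration only, and not the authors' algorithm, the following minimal sketch shows how positional information from rendered boxes could recover a grid without inspecting <table> tags. The `Box` type, the `group_into_grid` function, and the pixel tolerance are hypothetical names and values introduced here; the sketch assumes that the bounding box of each text element has already been obtained from a browser's layout engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical rendered box: top-left corner, size, and text content."""
    x: float
    y: float
    width: float
    height: float
    text: str


def cluster(values, tolerance):
    """Group coordinates so that sorted neighbours within `tolerance` share a cluster."""
    clusters = []
    for value in sorted(values):
        if clusters and value - clusters[-1][-1] <= tolerance:
            clusters[-1].append(value)
        else:
            clusters.append([value])
    return clusters


def index_of(value, clusters):
    """Return the index of the cluster that contains `value`."""
    for i, members in enumerate(clusters):
        if value in members:
            return i
    raise ValueError(f"{value} not in any cluster")


def group_into_grid(boxes, tolerance=2.0):
    """Arrange boxes into rows and columns by aligning their top and left edges."""
    row_clusters = cluster({b.y for b in boxes}, tolerance)
    col_clusters = cluster({b.x for b in boxes}, tolerance)
    grid = [[None] * len(col_clusters) for _ in row_clusters]
    for b in boxes:
        grid[index_of(b.y, row_clusters)][index_of(b.x, col_clusters)] = b.text
    return grid


# Four boxes with slightly misaligned edges, as a real layout might produce.
boxes = [
    Box(10, 10, 80, 20, "Name"), Box(100, 11, 60, 20, "Price"),
    Box(10, 40, 80, 20, "Widget"), Box(101, 40, 60, 20, "9.99"),
]
grid = group_into_grid(boxes)
```

Here `grid` holds one list per visual row, with cells ordered left to right. The sketch uses only edge alignment; style cues such as font or background colour, which the abstract also mentions, are left out.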


      • Published in

        WWW '07: Proceedings of the 16th international conference on World Wide Web
        May 2007
        1382 pages
        ISBN: 9781595936547
        DOI: 10.1145/1242572

        Copyright © 2007 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
