ABSTRACT
Traditionally, information extraction from web tables has focused on small, more or less homogeneous corpora, often based on assumptions about the use of <table> tags. A multitude of different HTML implementations of web tables make these approaches difficult to scale. In this paper, we approach the problem of domain-independent information extraction from web tables by shifting our attention from the tree-based representation of webpages to a variation of the two-dimensional visual box model used by web browsers to display the information on the screen. The topological and style information thereby obtained allows us to fill the gap created by missing domain-specific knowledge about content and table templates. We believe that, in a future step, this approach can become the basis for a new way of large-scale knowledge acquisition from the current "Visual Web."
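The shift from markup to rendered positions can be illustrated with a minimal sketch. It is not the algorithm of the paper: the `Box` record, the `cluster` and `boxes_to_grid` helpers, and the pixel tolerance are all assumptions made for illustration. The sketch takes rectangles as a browser might report them and recovers rows and columns purely from aligned edges, without looking at any tags.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Rendered rectangle of one page element (hypothetical record)."""
    text: str
    left: float
    top: float
    width: float
    height: float


def cluster(values, tolerance):
    """Map each coordinate to the index of its group of near-equal coordinates."""
    index = {}
    current = -1
    previous = None
    for value in sorted(set(values)):
        # Start a new group when the gap to the previous coordinate is too large.
        if previous is None or value - previous > tolerance:
            current += 1
        index[value] = current
        previous = value
    return index


def boxes_to_grid(boxes, tolerance=2.0):
    """Place boxes into a row/column grid using only their top and left edges."""
    if not boxes:
        return []
    rows = cluster([b.top for b in boxes], tolerance)
    cols = cluster([b.left for b in boxes], tolerance)
    n_rows = max(rows.values()) + 1
    n_cols = max(cols.values()) + 1
    grid = [[None] * n_cols for _ in range(n_rows)]
    for b in boxes:
        grid[rows[b.top]][cols[b.left]] = b.text
    return grid


# Assumed sample input: four boxes laid out as a 2x2 table, with slight jitter.
sample = [
    Box("Name", 10, 10, 80, 20),
    Box("Price", 100, 10.5, 60, 20),
    Box("Pen", 10, 32, 80, 20),
    Box("1.20", 100, 31, 60, 20),
]
grid = boxes_to_grid(sample)
```

Because only coordinates are used, the same grid would result whether the page built the layout with <table> tags, nested <div> elements, or absolutely positioned boxes, which is the kind of independence from markup the abstract argues for.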
Index Terms
- Towards domain-independent information extraction from web tables