TUTA: Tree-based Transformers for Generally Structured Table Pre-training

Authors:
Zhiruo Wang (Carnegie Mellon University, Pittsburgh, PA, USA)
Haoyu Dong (Microsoft Research, Beijing, China)
Ran Jia (Microsoft Research, Beijing, China)
Jia Li (Peking University, Beijing, China)
Zhiyi Fu (Peking University, Beijing, China)
Shi Han (Microsoft Research, Beijing, China)
Dongmei Zhang (Microsoft Research, Beijing, China)

KDD '21: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, August 2021, Pages 1780–1790
https://doi.org/10.1145/3447548.3467434

Published: 14 August 2021

ABSTRACT

We propose TUTA, a unified pre-training architecture for understanding generally structured tables. Noting that understanding a table requires spatial, hierarchical, and semantic information, we enhance transformers with three novel structure-aware mechanisms. First, we devise a unified tree-based structure, called a bi-dimensional coordinate tree, to describe both the spatial and hierarchical information of generally structured tables. Building on this structure, we propose tree-based attention and position embedding to better capture the spatial and hierarchical information. Moreover, we devise three progressive pre-training objectives to enable representations at the token, cell, and table levels. We pre-train TUTA on a wide range of unlabeled web and spreadsheet tables and fine-tune it on two critical tasks in the field of table structure understanding: cell type classification and table type classification. Experiments show that TUTA is highly effective, achieving state-of-the-art results on five widely studied datasets.
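
To make the idea of tree-based attention more concrete, here is a minimal Python sketch. The abstract does not give construction details, so everything in it is an assumption made for illustration and should not be read as the authors' implementation. Each cell gets a hypothetical pair of coordinates: a path from the root of a top-header tree and a path from the root of a left-header tree. The distance between two cells is taken to be the sum of the path distances in the two trees, and attention is allowed only between cells whose distance is at most a hypothetical `threshold`.

```python
from typing import List, Tuple

# Hypothetical coordinate layout (not taken from the paper):
# a Path is the sequence of child indices from a tree root to a node,
# and a cell's Coord is (path in the top-header tree, path in the left-header tree).
Path = Tuple[int, ...]
Coord = Tuple[Path, Path]


def path_distance(a: Path, b: Path) -> int:
    """Number of edges between two nodes of one tree, given their root paths."""
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    # Climb from each node up to the lowest common ancestor.
    return (len(a) - common) + (len(b) - common)


def tree_distance(c1: Coord, c2: Coord) -> int:
    """Assumed cell distance: top-tree distance plus left-tree distance."""
    return path_distance(c1[0], c2[0]) + path_distance(c1[1], c2[1])


def attention_mask(coords: List[Coord], threshold: int) -> List[List[bool]]:
    """mask[i][j] is True when cell i may attend to cell j."""
    return [[tree_distance(ci, cj) <= threshold for cj in coords] for ci in coords]


# Four data cells: two columns under one parent header, two top-level rows.
cells: List[Coord] = [
    ((0, 0), (0,)),  # row 0, column 0
    ((0, 1), (0,)),  # row 0, column 1
    ((0, 0), (1,)),  # row 1, column 0
    ((0, 1), (1,)),  # row 1, column 1
]
mask = attention_mask(cells, threshold=2)
```

With this toy table, cells that share a row or a column are at distance 2 and remain mutually visible, while diagonal cells are at distance 4 and are masked out. In a transformer, a Boolean matrix of this kind would typically be applied to the attention scores before the softmax.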

Index Terms

1. Information systems
  1. Information retrieval

Published in
KDD '21: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining
August 2021
4259 pages
ISBN: 9781450383325
DOI: 10.1145/3447548
General Chairs: Feida Zhu (Singapore Management University), Beng Chin Ooi (National University of Singapore), Chunyan Miao (Nanyang Technological University)
Program Chairs: Haixun Wang, Iryna Skrypnyk, Wynne Hsu, Sanjay Chawla
Copyright © 2021 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
generally structured table
self supervision
transformer
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%