DOI: 10.1145/3340531.3411908 · CIKM Conference Proceedings
Research Article · Open Access

Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching

Published: 19 October 2020

ABSTRACT

Many natural language processing and information retrieval problems can be formalized as the task of semantic matching. Existing work in this area has largely focused on matching between short texts (e.g., question answering) or between a short and a long text (e.g., ad-hoc retrieval). Semantic matching between long-form documents, which has many important applications such as news recommendation, related article recommendation, and document clustering, is relatively less explored and needs more research effort. In recent years, self-attention-based models like Transformers and BERT have achieved state-of-the-art performance in the task of text matching. These models, however, are still limited to short text such as a few sentences or one paragraph due to the quadratic computational complexity of self-attention with respect to input text length. In this paper, we address this issue by proposing the Siamese Multi-depth Transformer-based Hierarchical (SMITH) Encoder for long-form document matching. Our model contains several innovations to adapt self-attention models to longer text input. We propose a Transformer-based hierarchical encoder to capture document structure information. To better capture sentence-level semantic relations within a document, we pre-train the model with a novel masked sentence block language modeling task in addition to the masked word language modeling task used by BERT. Our experimental results on several benchmark data sets for long-form document matching show that the proposed SMITH model outperforms previous state-of-the-art models, including hierarchical attention, the multi-depth attention-based hierarchical recurrent neural network, and BERT. Compared with BERT-based baselines, our model is able to increase the maximum input text length from 512 to 2048 tokens. We will open-source a Wikipedia-based benchmark data set, code, and a pre-trained model to accelerate future research on long-form document matching.
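The key idea in the abstract, a two-level encoder in which token-level self-attention runs only within fixed-length sentence blocks and a second Transformer attends over the resulting block representations of a document, can be illustrated with a short sketch. The PyTorch code below is a minimal, hypothetical rendition of that idea and not the authors' released implementation: the class names, hyperparameters, mean-pooling, and cosine-similarity matching head are illustrative assumptions, and the masked word and masked sentence block pre-training objectives described in the abstract are not shown.

```python
# Minimal sketch of a Siamese hierarchical document encoder (assumed design,
# not the released SMITH implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalDocEncoder(nn.Module):
    """Encodes a document given as a sequence of fixed-length sentence blocks."""

    def __init__(self, vocab_size=30522, dim=256, block_len=32,
                 n_heads=4, sent_layers=4, doc_layers=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Embedding(block_len, dim)
        sent_layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        doc_layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.sent_encoder = nn.TransformerEncoder(sent_layer, sent_layers)
        self.doc_encoder = nn.TransformerEncoder(doc_layer, doc_layers)

    def forward(self, token_ids):
        # token_ids: (batch, n_blocks, block_len) -- the document is pre-split
        # into sentence blocks, so token-level self-attention is quadratic in
        # the block length, not in the total document length.
        b, n_blocks, blk = token_ids.shape
        pos = torch.arange(blk, device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(pos)
        x = x.reshape(b * n_blocks, blk, -1)
        x = self.sent_encoder(x)                  # token-level self-attention
        block_vecs = x.mean(dim=1)                # pool tokens -> block vector
        block_vecs = block_vecs.reshape(b, n_blocks, -1)
        doc = self.doc_encoder(block_vecs)        # block-level self-attention
        return doc.mean(dim=1)                    # pool blocks -> document vector


def siamese_match_score(encoder, doc_a, doc_b):
    """Cosine similarity between two documents encoded by one shared encoder."""
    return F.cosine_similarity(encoder(doc_a), encoder(doc_b), dim=-1)
```

Because self-attention cost in this layout grows quadratically only with the block length and the number of blocks rather than with the total token count, a structure of this kind is what allows the maximum input length to grow from 512 to 2048 tokens, as claimed in the abstract.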


Supplemental Material

3340531.3411908.mp4 (MP4, 111.9 MB)

