ABSTRACT
Many natural language processing and information retrieval problems can be formalized as the task of semantic matching. Existing work in this area has largely focused on matching between short texts (e.g., question answering) or between a short and a long text (e.g., ad-hoc retrieval). Semantic matching between long-form documents, which has many important applications such as news recommendation, related-article recommendation, and document clustering, remains relatively underexplored and needs more research effort. In recent years, self-attention-based models like Transformers and BERT have achieved state-of-the-art performance on text matching. These models, however, are still limited to short text, such as a few sentences or one paragraph, because the computational complexity of self-attention is quadratic in the input text length. In this paper, we address the issue by proposing the Siamese Multi-depth Transformer-based Hierarchical (SMITH) Encoder for long-form document matching. Our model contains several innovations that adapt self-attention models to longer text input. We propose a transformer-based hierarchical encoder to capture document structure information. To better capture sentence-level semantic relations within a document, we pre-train the model with a novel masked sentence block language modeling task in addition to the masked word language modeling task used by BERT. Our experimental results on several benchmark data sets for long-form document matching show that the proposed SMITH model outperforms previous state-of-the-art models, including hierarchical attention, the multi-depth attention-based hierarchical recurrent neural network, and BERT. Compared with BERT-based baselines, our model increases the maximum input text length from 512 tokens to 2048. We will open-source a Wikipedia-based benchmark data set, code, and a pre-trained model to accelerate future research on long-form document matching.
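To make the hierarchical-encoding idea concrete, here is a minimal sketch in PyTorch of a two-level Siamese encoder: tokens attend to each other only within fixed-length sentence blocks, block representations then attend to each other at the document level, and the same encoder embeds both documents so that cosine similarity gives a matching score. The class names, dimensions, mean-pooling readout, and fixed block splitting are illustrative assumptions on our part, not the authors' released SMITH implementation (positional embeddings and the masked sentence block pre-training task are also omitted for brevity).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalDocEncoder(nn.Module):
    """Hypothetical two-level document encoder in the spirit of SMITH."""

    def __init__(self, vocab_size=30522, d_model=256, n_heads=4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        # Level 1: self-attention *within* each sentence block, so the
        # quadratic cost depends on the block length, not the document length.
        sent_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.sent_encoder = nn.TransformerEncoder(sent_layer, num_layers=2)
        # Level 2: self-attention over the sequence of block representations,
        # capturing document structure across blocks.
        doc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.doc_encoder = nn.TransformerEncoder(doc_layer, num_layers=2)

    def forward(self, token_ids):
        # token_ids: (num_blocks, block_len) -- one row per sentence block.
        x = self.tok_emb(token_ids)               # (B, L, D)
        x = self.sent_encoder(x)                  # contextualize within blocks
        blocks = x.mean(dim=1).unsqueeze(0)       # (1, B, D) pool each block
        doc = self.doc_encoder(blocks)            # contextualize across blocks
        return doc.mean(dim=1).squeeze(0)         # (D,) document embedding

def match_score(encoder, doc_a, doc_b):
    """Siamese matching: one shared encoder, cosine similarity as the score."""
    return F.cosine_similarity(encoder(doc_a), encoder(doc_b), dim=0)

# Toy usage: two "documents" of 8 blocks x 32 tokens of random token ids.
enc = HierarchicalDocEncoder().eval()  # eval() disables dropout for a stable demo
doc_a = torch.randint(0, 30522, (8, 32))
doc_b = torch.randint(0, 30522, (8, 32))
with torch.no_grad():
    print(match_score(enc, doc_a, doc_b).item())
```

As rough, illustrative arithmetic: a flat encoder over 2048 tokens pays attention cost proportional to 2048² ≈ 4.2M token pairs, while splitting into 32 blocks of 64 tokens costs about 32·64² ≈ 131K pairs at the sentence level plus 32² ≈ 1K at the document level, which is why the hierarchy extends the usable input length.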