DOI: 10.1145/3340531.3411908 · CIKM Conference Proceedings
Research Article · Open Access

Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching

Published: 19 October 2020

ABSTRACT

Many natural language processing and information retrieval problems can be formalized as the task of semantic matching. Existing work in this area has largely focused on matching between short texts (e.g., question answering) or between a short and a long text (e.g., ad-hoc retrieval). Semantic matching between long-form documents, which has many important applications such as news recommendation, related article recommendation, and document clustering, is relatively less explored and needs more research effort. In recent years, self-attention-based models like Transformers and BERT have achieved state-of-the-art performance in the task of text matching. These models, however, are still limited to short text such as a few sentences or one paragraph due to the quadratic computational complexity of self-attention with respect to input text length. In this paper, we address this issue by proposing the Siamese Multi-depth Transformer-based Hierarchical (SMITH) Encoder for long-form document matching. Our model contains several innovations to adapt self-attention models to longer text input. We propose a Transformer-based hierarchical encoder to capture document structure information. To better capture sentence-level semantic relations within a document, we pre-train the model with a novel masked sentence block language modeling task in addition to the masked word language modeling task used by BERT. Our experimental results on several benchmark data sets for long-form document matching show that the proposed SMITH model outperforms previous state-of-the-art models, including hierarchical attention, the multi-depth attention-based hierarchical recurrent neural network, and BERT. Compared with BERT-based baselines, our model is able to increase the maximum input text length from 512 to 2048 tokens. We will open-source a Wikipedia-based benchmark data set, code, and a pre-trained model to accelerate future research on long-form document matching.
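The key idea in the abstract, a two-level encoder in which token-level self-attention runs only within fixed-length sentence blocks and a second Transformer attends over the resulting block representations of a document, can be illustrated with a short sketch. The PyTorch code below is a minimal, hypothetical rendition of that idea and not the authors' released implementation: the class names, hyperparameters, mean-pooling, and cosine-similarity matching head are illustrative assumptions, and the masked word and masked sentence block pre-training objectives described in the abstract are not shown.

```python
# Minimal sketch of a Siamese hierarchical document encoder (assumed design,
# not the released SMITH implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalDocEncoder(nn.Module):
    """Encodes a document given as a sequence of fixed-length sentence blocks."""

    def __init__(self, vocab_size=30522, dim=256, block_len=32,
                 n_heads=4, sent_layers=4, doc_layers=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Embedding(block_len, dim)
        sent_layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        doc_layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.sent_encoder = nn.TransformerEncoder(sent_layer, sent_layers)
        self.doc_encoder = nn.TransformerEncoder(doc_layer, doc_layers)

    def forward(self, token_ids):
        # token_ids: (batch, n_blocks, block_len) -- the document is pre-split
        # into sentence blocks, so token-level self-attention is quadratic in
        # the block length, not in the total document length.
        b, n_blocks, blk = token_ids.shape
        pos = torch.arange(blk, device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(pos)
        x = x.reshape(b * n_blocks, blk, -1)
        x = self.sent_encoder(x)                  # token-level self-attention
        block_vecs = x.mean(dim=1)                # pool tokens -> block vector
        block_vecs = block_vecs.reshape(b, n_blocks, -1)
        doc = self.doc_encoder(block_vecs)        # block-level self-attention
        return doc.mean(dim=1)                    # pool blocks -> document vector


def siamese_match_score(encoder, doc_a, doc_b):
    """Cosine similarity between two documents encoded by one shared encoder."""
    return F.cosine_similarity(encoder(doc_a), encoder(doc_b), dim=-1)
```

Because self-attention cost in this layout grows quadratically only with the block length and the number of blocks rather than with the total token count, a structure of this kind is what allows the maximum input length to grow from 512 to 2048 tokens, as claimed in the abstract.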


Supplemental Material

3340531.3411908.mp4 (MP4, 111.9 MB)

