
VSR: A Unified Framework for Document Layout Analysis Combining Vision, Semantics and Relations

  • Conference paper
Document Analysis and Recognition – ICDAR 2021 (ICDAR 2021)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12821)

Abstract

Document layout analysis is crucial for understanding document structures. In this task, the vision and semantics of a document, together with the relations between its layout components, all contribute to the understanding process. Although many works have been proposed to exploit this information, their results remain unsatisfactory. NLP-based methods model layout analysis as a sequence labeling task and show limited capability in layout modeling. CV-based methods model layout analysis as a detection or segmentation task, but suffer from inefficient modality fusion and a lack of relation modeling between layout components. To address these limitations, we propose VSR, a unified framework for document layout analysis that combines vision, semantics and relations. VSR supports both NLP-based and CV-based methods. Specifically, we first introduce vision through the document image and semantics through text embedding maps. Modality-specific visual and semantic features are then extracted with a two-stream network and adaptively fused to make full use of their complementary information. Finally, given component candidates, a relation module based on a graph neural network is incorporated to model relations between components and produce the final results. On three popular benchmarks, VSR outperforms previous models by large margins. Code will be released soon.
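
To make the architecture concrete, the sketch below gives one plausible PyTorch form of the two ideas named in the abstract: adaptive fusion of the two streams, and a relation module over component candidates. It is a minimal illustration under our own assumptions (a 1x1-convolution gate for fusion, self-attention as the graph-style message passing); it is not the authors' released implementation.

    import torch
    import torch.nn as nn

    class AdaptiveFusion(nn.Module):
        # Gated fusion of visual and semantic feature maps: the network
        # learns, per channel and per location, how much to trust each
        # modality. This gating form is an assumption, not VSR's code.
        def __init__(self, channels):
            super().__init__()
            self.gate = nn.Sequential(
                nn.Conv2d(2 * channels, channels, kernel_size=1),
                nn.Sigmoid(),
            )

        def forward(self, vis, sem):
            # vis, sem: (N, C, H, W) features from the two streams.
            g = self.gate(torch.cat([vis, sem], dim=1))
            return g * vis + (1.0 - g) * sem  # convex combination

    class RelationModule(nn.Module):
        # Message passing over component candidates on a fully connected
        # graph, realized here with multi-head self-attention plus a
        # residual connection and layer normalization.
        def __init__(self, dim, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, comps):
            # comps: (N, K, dim) features of K component candidates.
            out, _ = self.attn(comps, comps, comps)
            return self.norm(comps + out)

    # Toy shapes only: 256-channel maps, 10 candidates of dimension 256.
    fused = AdaptiveFusion(256)(torch.randn(1, 256, 64, 64),
                                torch.randn(1, 256, 64, 64))
    refined = RelationModule(256)(torch.randn(1, 10, 256))

The per-location gate is what makes such fusion adaptive: it can lean on semantics in dense text regions and on vision around figures and tables, rather than mixing the two streams with a fixed weight.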

Notes

  1. In the rest of this paper, we assume text is available. There are tools to extract text from PDF documents (e.g., PDFMiner [28]) and from document images (e.g., an OCR engine [30]); a minimal usage sketch follows these notes.

  2. dhSegment\(^T\) denotes dhSegment with both the image and text embedding maps as inputs.

  3. A sentence is a group of words or phrases, usually ending with a period, question mark, or exclamation point. For simplicity, we approximate sentences with text lines.

  4. https://icdar2021.org/competitions/competition-on-scientific-literature-parsing/.
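
As note 1 mentions, text can be read directly from born-digital PDFs or recovered from scans via OCR. The sketch below is a minimal illustration, assuming pdfminer.six and pytesseract (a common Python wrapper around the Tesseract engine [30]) are installed; the file names are placeholders.

    from pdfminer.high_level import extract_text  # pip install pdfminer.six
    import pytesseract                            # pip install pytesseract
    from PIL import Image

    # Born-digital PDF: the embedded text layer can be read directly.
    pdf_text = extract_text("paper.pdf")

    # Scanned page image: no text layer, so fall back to OCR.
    ocr_text = pytesseract.image_to_string(Image.open("page.png"))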

References

  1. Aggarwal, M., Sarkar, M., Gupta, H., Krishnamurthy, B.: Multi-modal association based grouping for form structure extraction. In: WACV, pp. 2064–2073 (2020)

  2. Baltrusaitis, T., Ahuja, C., Morency, L.: Multimodal machine learning: a survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 41(2), 423–443 (2019)

  3. Barman, R., Ehrmann, M., Clematide, S., Oliveira, S.A., Kaplan, F.: Combining visual and textual features for semantic segmentation of historical newspapers. CoRR abs/2002.06144 (2020)

  4. BinMakhashen, G.M., Mahmoud, S.A.: Document layout analysis: a comprehensive survey. ACM Comput. Surv. 52(6), 109:1–109:36 (2020)

  5. Chen, K., Seuret, M., Liwicki, M., Hennebert, J., Ingold, R.: Page segmentation of historical document images with convolutional autoencoders. In: ICDAR, pp. 1011–1015 (2015)

  6. Conway, A.: Page grammars and page parsing. A syntactic approach to document layout recognition. In: ICDAR, pp. 761–764 (1993)

  7. Corbelli, A., Baraldi, L., Grana, C., Cucchiara, R.: Historical document digitization through layout analysis and deep content classification. In: ICPR, pp. 4077–4082 (2016)

  8. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT, pp. 4171–4186 (2019)

  9. Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: CVPR, pp. 1933–1941 (2016)

  10. Gatos, B., Louloudis, G., Stamatopoulos, N.: Segmentation of historical handwritten documents into text zones and text lines. In: ICFHR, pp. 464–469 (2014)

  11. Han, J., Chen, H., Liu, N., Yan, C., Li, X.: CNNs-based RGB-D saliency detection via cross-view transfer and multiview fusion. IEEE Trans. Cybern. 48(11), 3171–3183 (2018)

  12. He, D., Cohen, S., Price, B.L., Kifer, D., Giles, C.L.: Multi-scale multi-task FCN for semantic page segmentation and table detection. In: ICDAR, pp. 254–261 (2017)

  13. He, K., Gkioxari, G., Dollár, P., Girshick, R.B.: Mask R-CNN. In: ICCV, pp. 2980–2988 (2017)

  14. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)

  15. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)

  16. Hu, H., Gu, J., Zhang, Z., Dai, J., Wei, Y.: Relation networks for object detection. In: CVPR, pp. 3588–3597 (2018)

  17. Krishnamoorthy, M.S., Nagy, G., Seth, S.C., Viswanathan, M.: Syntactic segmentation and labeling of digitized pages from technical journals. IEEE Trans. Pattern Anal. Mach. Intell. 15(7), 737–747 (1993)

  18. Lee, J., Hayashi, H., Ohyama, W., Uchida, S.: Page segmentation using a convolutional neural network with trainable co-occurrence features. In: ICDAR, pp. 1023–1028 (2019)

  19. Li, K., et al.: Cross-domain document object detection: benchmark suite and method. In: CVPR, pp. 12912–12921 (2020)

  20. Li, M., et al.: DocBank: a benchmark dataset for document layout analysis. In: COLING, pp. 949–960 (2020)

  21. Li, X., Yin, F., Xue, T., Liu, L., Ogier, J., Liu, C.: Instance aware document image segmentation using label pyramid networks and deep watershed transformation. In: ICDAR, pp. 514–519 (2019)

  22. Lin, T., et al.: Feature pyramid networks for object detection. In: CVPR, pp. 936–944 (2017)

  23. Lin, T., RoyChowdhury, A., Maji, S.: Bilinear CNN models for fine-grained visual recognition. In: ICCV, pp. 1449–1457 (2015)

  24. Liu, X., Gao, F., Zhang, Q., Zhao, H.: Graph convolution for multimodal information extraction from visually rich documents. In: NAACL-HLT, pp. 32–39 (2019)

  25. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. CoRR abs/1907.11692 (2019)

  26. Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NeurIPS, pp. 91–99 (2015)

  27. Shilman, M., Liang, P., Viola, P.A.: Learning non-generative grammatical models for document analysis. In: ICCV, pp. 962–969 (2005)

  28. Shinyama, Y.: PDFMiner: Python PDF parser and analyzer. Retrieved on 11 (2015)

  29. Siegel, N., Lourie, N., Power, R., Ammar, W.: Extracting scientific figures with distantly supervised neural networks. In: JCDL, pp. 223–232 (2018)

  30. Smith, R.: An overview of the Tesseract OCR engine. In: ICDAR, pp. 629–633 (2007)

  31. Soto, C., Yoo, S.: Visual detection with context for document layout analysis. In: EMNLP-IJCNLP, pp. 3462–3468 (2019)

  32. Vaswani, A., et al.: Attention is all you need. In: NeurIPS, pp. 5998–6008 (2017)

  33. Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., Bengio, Y.: Graph attention networks. In: ICLR (2018)

  34. Vo, Q.N., Lee, G.: Dense prediction for text line segmentation in handwritten document images. In: ICIP, pp. 3264–3268 (2016)

  35. Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018)

  36. Wick, C., Puppe, F.: Fully convolutional neural networks for page segmentation of historical document images. In: DAS, pp. 287–292 (2018)

  37. Xie, S., Girshick, R.B., Dollár, P., He, K.: Aggregated residual transformations for deep neural networks. In: CVPR, pp. 5987–5995 (2017)

  38. Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., Zhou, M.: LayoutLM: pre-training of text and layout for document image understanding. In: KDD, pp. 1192–1200 (2020)

  39. Yang, X., Yumer, E., Asente, P., Kraley, M., Kifer, D., Giles, C.L.: Learning to extract semantic structure from documents using multimodal fully convolutional neural networks. In: CVPR, pp. 4342–4351 (2017)

  40. Yu, W., Lu, N., Qi, X., Gong, P., Xiao, R.: PICK: processing key information extraction from documents using improved graph learning-convolutional networks. In: ICPR, pp. 4363–4370 (2020)

  41. Zagoris, K., Pratikakis, I., Gatos, B.: Segmentation-based historical handwritten word spotting using document-specific local features. In: ICFHR, pp. 9–14 (2014)

  42. Zhang, P., et al.: TRIE: end-to-end text reading and information extraction for document understanding. In: MM, pp. 1413–1422 (2020)

  43. Zhong, X., Tang, J., Jimeno-Yepes, A.: PubLayNet: largest dataset ever for document layout analysis. In: ICDAR, pp. 1015–1022 (2019)

Author information

Corresponding author

Correspondence to Zhanzhan Cheng.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Zhang, P. et al. (2021). VSR: A Unified Framework for Document Layout Analysis Combining Vision, Semantics and Relations. In: Lladós, J., Lopresti, D., Uchida, S. (eds) Document Analysis and Recognition – ICDAR 2021. ICDAR 2021. Lecture Notes in Computer Science, vol. 12821. Springer, Cham. https://doi.org/10.1007/978-3-030-86549-8_8

  • DOI: https://doi.org/10.1007/978-3-030-86549-8_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86548-1

  • Online ISBN: 978-3-030-86549-8

  • eBook Packages: Computer Science (R0)
