
VSR: A Unified Framework for Document Layout Analysis Combining Vision, Semantics and Relations

  • Conference paper
Document Analysis and Recognition – ICDAR 2021 (ICDAR 2021)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12821)

Abstract

Document layout analysis is crucial for understanding document structures. In this task, the vision and semantics of a document, together with the relations between its layout components, all contribute to the understanding process. Although many works have been proposed to exploit this information, their results remain unsatisfactory. NLP-based methods model layout analysis as a sequence labeling task and show limited capability in layout modeling. CV-based methods model layout analysis as a detection or segmentation task, but suffer from inefficient modality fusion and a lack of relation modeling between layout components. To address these limitations, we propose VSR, a unified framework for document layout analysis that combines vision, semantics and relations. VSR supports both NLP-based and CV-based methods. Specifically, we first introduce vision through the document image and semantics through text embedding maps. Modality-specific visual and semantic features are then extracted with a two-stream network and adaptively fused to make full use of their complementary information. Finally, given component candidates, a relation module based on a graph neural network is incorporated to model relations between components and produce the final results. On three popular benchmarks, VSR outperforms previous models by large margins. Code will be released soon.
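
To make the architecture concrete, the sketch below gives one plausible PyTorch form of the two ideas named in the abstract: adaptive fusion of the two streams, and a relation module over component candidates. It is a minimal illustration under our own assumptions (a 1x1-convolution gate for fusion, self-attention as the graph-style message passing); it is not the authors' released implementation.

    import torch
    import torch.nn as nn

    class AdaptiveFusion(nn.Module):
        # Gated fusion of visual and semantic feature maps: the network
        # learns, per channel and per location, how much to trust each
        # modality. This gating form is an assumption, not VSR's code.
        def __init__(self, channels):
            super().__init__()
            self.gate = nn.Sequential(
                nn.Conv2d(2 * channels, channels, kernel_size=1),
                nn.Sigmoid(),
            )

        def forward(self, vis, sem):
            # vis, sem: (N, C, H, W) features from the two streams.
            g = self.gate(torch.cat([vis, sem], dim=1))
            return g * vis + (1.0 - g) * sem  # convex combination

    class RelationModule(nn.Module):
        # Message passing over component candidates on a fully connected
        # graph, realized here with multi-head self-attention plus a
        # residual connection and layer normalization.
        def __init__(self, dim, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, comps):
            # comps: (N, K, dim) features of K component candidates.
            out, _ = self.attn(comps, comps, comps)
            return self.norm(comps + out)

    # Toy shapes only: 256-channel maps, 10 candidates of dimension 256.
    fused = AdaptiveFusion(256)(torch.randn(1, 256, 64, 64),
                                torch.randn(1, 256, 64, 64))
    refined = RelationModule(256)(torch.randn(1, 10, 256))

The per-location gate is what makes such fusion adaptive: it can lean on semantics in dense text regions and on vision around figures and tables, rather than mixing the two streams with a fixed weight.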

Notes

  1. In the rest of this paper, we assume text is available. There are tools to extract text from PDF documents (e.g., PDFMiner [28]) and from document images (e.g., an OCR engine [30]); a minimal usage sketch follows these notes.

  2. dhSegment\(^T\) denotes dhSegment with both the image and text embedding maps as inputs.

  3. A sentence is a group of words or phrases, usually ending with a period, question mark, or exclamation point. For simplicity, we approximate sentences with text lines.

  4. https://icdar2021.org/competitions/competition-on-scientific-literature-parsing/.
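
As note 1 mentions, text can be read directly from born-digital PDFs or recovered from scans via OCR. The sketch below is a minimal illustration, assuming pdfminer.six and pytesseract (a common Python wrapper around the Tesseract engine [30]) are installed; the file names are placeholders.

    from pdfminer.high_level import extract_text  # pip install pdfminer.six
    import pytesseract                            # pip install pytesseract
    from PIL import Image

    # Born-digital PDF: the embedded text layer can be read directly.
    pdf_text = extract_text("paper.pdf")

    # Scanned page image: no text layer, so fall back to OCR.
    ocr_text = pytesseract.image_to_string(Image.open("page.png"))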

References

  1. Aggarwal, M., Sarkar, M., Gupta, H., Krishnamurthy, B.: Multi-modal association based grouping for form structure extraction. In: WACV, pp. 2064–2073 (2020)

  2. Baltrusaitis, T., Ahuja, C., Morency, L.: Multimodal machine learning: a survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 41(2), 423–443 (2019)

  3. Barman, R., Ehrmann, M., Clematide, S., Oliveira, S.A., Kaplan, F.: Combining visual and textual features for semantic segmentation of historical newspapers. CoRR abs/2002.06144 (2020)

  4. BinMakhashen, G.M., Mahmoud, S.A.: Document layout analysis: a comprehensive survey. ACM Comput. Surv. 52(6), 109:1–109:36 (2020)

  5. Chen, K., Seuret, M., Liwicki, M., Hennebert, J., Ingold, R.: Page segmentation of historical document images with convolutional autoencoders. In: ICDAR, pp. 1011–1015 (2015)

  6. Conway, A.: Page grammars and page parsing. A syntactic approach to document layout recognition. In: ICDAR, pp. 761–764 (1993)

  7. Corbelli, A., Baraldi, L., Grana, C., Cucchiara, R.: Historical document digitization through layout analysis and deep content classification. In: ICPR, pp. 4077–4082 (2016)

  8. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT, pp. 4171–4186 (2019)

  9. Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: CVPR, pp. 1933–1941 (2016)

  10. Gatos, B., Louloudis, G., Stamatopoulos, N.: Segmentation of historical handwritten documents into text zones and text lines. In: ICFHR, pp. 464–469 (2014)

  11. Han, J., Chen, H., Liu, N., Yan, C., Li, X.: CNNs-based RGB-D saliency detection via cross-view transfer and multiview fusion. IEEE Trans. Cybern. 48(11), 3171–3183 (2018)

  12. He, D., Cohen, S., Price, B.L., Kifer, D., Giles, C.L.: Multi-scale multi-task FCN for semantic page segmentation and table detection. In: ICDAR, pp. 254–261 (2017)

  13. He, K., Gkioxari, G., Dollár, P., Girshick, R.B.: Mask R-CNN. In: ICCV, pp. 2980–2988 (2017)

  14. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)

  15. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)

  16. Hu, H., Gu, J., Zhang, Z., Dai, J., Wei, Y.: Relation networks for object detection. In: CVPR, pp. 3588–3597 (2018)

  17. Krishnamoorthy, M.S., Nagy, G., Seth, S.C., Viswanathan, M.: Syntactic segmentation and labeling of digitized pages from technical journals. IEEE Trans. Pattern Anal. Mach. Intell. 15(7), 737–747 (1993)

  18. Lee, J., Hayashi, H., Ohyama, W., Uchida, S.: Page segmentation using a convolutional neural network with trainable co-occurrence features. In: ICDAR, pp. 1023–1028 (2019)

  19. Li, K., et al.: Cross-domain document object detection: benchmark suite and method. In: CVPR, pp. 12912–12921 (2020)

  20. Li, M., et al.: DocBank: a benchmark dataset for document layout analysis. In: COLING, pp. 949–960 (2020)

  21. Li, X., Yin, F., Xue, T., Liu, L., Ogier, J., Liu, C.: Instance aware document image segmentation using label pyramid networks and deep watershed transformation. In: ICDAR, pp. 514–519 (2019)

  22. Lin, T., et al.: Feature pyramid networks for object detection. In: CVPR, pp. 936–944 (2017)

  23. Lin, T., RoyChowdhury, A., Maji, S.: Bilinear CNN models for fine-grained visual recognition. In: ICCV, pp. 1449–1457 (2015)

  24. Liu, X., Gao, F., Zhang, Q., Zhao, H.: Graph convolution for multimodal information extraction from visually rich documents. In: NAACL-HLT, pp. 32–39 (2019)

  25. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. CoRR abs/1907.11692 (2019)

  26. Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NeurIPS, pp. 91–99 (2015)

  27. Shilman, M., Liang, P., Viola, P.A.: Learning non-generative grammatical models for document analysis. In: ICCV, pp. 962–969 (2005)

  28. Shinyama, Y.: PDFMiner: Python PDF parser and analyzer. Retrieved on 11 (2015)

  29. Siegel, N., Lourie, N., Power, R., Ammar, W.: Extracting scientific figures with distantly supervised neural networks. In: JCDL, pp. 223–232 (2018)

  30. Smith, R.: An overview of the Tesseract OCR engine. In: ICDAR, pp. 629–633 (2007)

  31. Soto, C., Yoo, S.: Visual detection with context for document layout analysis. In: EMNLP-IJCNLP, pp. 3462–3468 (2019)

  32. Vaswani, A., et al.: Attention is all you need. In: NeurIPS, pp. 5998–6008 (2017)

  33. Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., Bengio, Y.: Graph attention networks. In: ICLR (2018)

  34. Vo, Q.N., Lee, G.: Dense prediction for text line segmentation in handwritten document images. In: ICIP, pp. 3264–3268 (2016)

  35. Wang, X., Girshick, R.B., Gupta, A., He, K.: Non-local neural networks. In: CVPR, pp. 7794–7803 (2018)

  36. Wick, C., Puppe, F.: Fully convolutional neural networks for page segmentation of historical document images. In: DAS, pp. 287–292 (2018)

  37. Xie, S., Girshick, R.B., Dollár, P., He, K.: Aggregated residual transformations for deep neural networks. In: CVPR, pp. 5987–5995 (2017)

  38. Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., Zhou, M.: LayoutLM: pre-training of text and layout for document image understanding. In: KDD, pp. 1192–1200 (2020)

  39. Yang, X., Yumer, E., Asente, P., Kraley, M., Kifer, D., Giles, C.L.: Learning to extract semantic structure from documents using multimodal fully convolutional neural networks. In: CVPR, pp. 4342–4351 (2017)

  40. Yu, W., Lu, N., Qi, X., Gong, P., Xiao, R.: PICK: processing key information extraction from documents using improved graph learning-convolutional networks. In: ICPR, pp. 4363–4370 (2020)

  41. Zagoris, K., Pratikakis, I., Gatos, B.: Segmentation-based historical handwritten word spotting using document-specific local features. In: ICFHR, pp. 9–14 (2014)

  42. Zhang, P., et al.: TRIE: end-to-end text reading and information extraction for document understanding. In: MM, pp. 1413–1422 (2020)

  43. Zhong, X., Tang, J., Jimeno-Yepes, A.: PubLayNet: largest dataset ever for document layout analysis. In: ICDAR, pp. 1015–1022 (2019)

Author information

Corresponding author

Correspondence to Zhanzhan Cheng.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Zhang, P. et al. (2021). VSR: A Unified Framework for Document Layout Analysis Combining Vision, Semantics and Relations. In: Lladós, J., Lopresti, D., Uchida, S. (eds) Document Analysis and Recognition – ICDAR 2021. ICDAR 2021. Lecture Notes in Computer Science, vol. 12821. Springer, Cham. https://doi.org/10.1007/978-3-030-86549-8_8

  • DOI: https://doi.org/10.1007/978-3-030-86549-8_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86548-1

  • Online ISBN: 978-3-030-86549-8

  • eBook Packages: Computer Science (R0)
