Document cleanup using page frame detection

  • Original Paper
International Journal of Document Analysis and Recognition (IJDAR)

Abstract

When a page of a book is scanned or photocopied, textual noise (extraneous symbols from the neighboring page) and/or non-textual noise (black borders, speckles, ...) appear along the border of the document. Existing document analysis methods can handle non-textual noise reasonably well, whereas textual noise still presents a major issue for document analysis systems. Textual noise may result in undesired text in optical character recognition (OCR) output that needs to be removed afterwards. Existing document cleanup methods try to explicitly detect and remove marginal noise. This paper presents a new perspective for document image cleanup by detecting the page frame of the document. The goal of page frame detection is to find the actual page contents area, ignoring marginal noise along the page border. We use a geometric matching algorithm to find the optimal page frame of structured documents (journal articles, books, magazines) by exploiting their text alignment property. We evaluate the algorithm on the UW-III database. The results show that the error rates are below 4% for each of the performance measures used. Further tests were run on a dataset of magazine pages and on a set of camera-captured document images. To demonstrate the benefits of using page frame detection in practical applications, we choose OCR and layout-based document image retrieval as sample applications. Experiments using a commercial OCR system show that by removing characters outside the computed page frame, the OCR error rate is reduced from 4.3% to 1.7% on the UW-III dataset. The use of page frame detection in a layout-based document image retrieval application decreases the retrieval error rates by 30%.
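
The cleanup step mentioned in the abstract, removing characters outside the computed page frame, can be illustrated with a minimal sketch. It assumes the page frame has already been found and is given as an axis-aligned rectangle. The geometric matching that finds the frame is not shown. The names `Box`, `Char`, and `filter_by_page_frame` are hypothetical, and the rule of keeping a character when the centre of its bounding box lies inside the frame is an assumption made for illustration, not the criterion used in the paper.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle in pixel coordinates (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float


class Char(NamedTuple):
    """One recognized character with its bounding box (hypothetical type)."""
    text: str
    box: Box


def centre_inside(frame: Box, box: Box) -> bool:
    """Return True if the centre of `box` lies within `frame`.

    Assumption: a centre-point test is used here for simplicity.
    """
    cx = (box.x0 + box.x1) / 2
    cy = (box.y0 + box.y1) / 2
    return frame.x0 <= cx <= frame.x1 and frame.y0 <= cy <= frame.y1


def filter_by_page_frame(chars: List[Char], frame: Box) -> List[Char]:
    """Keep only the characters whose centre falls inside the page frame."""
    return [c for c in chars if centre_inside(frame, c.box)]


if __name__ == "__main__":
    # Made-up example: one character inside the frame, one in the left
    # margin (for instance, textual noise from the neighboring page).
    frame = Box(100, 100, 900, 1300)
    chars = [
        Char("A", Box(120, 150, 140, 180)),
        Char("x", Box(10, 150, 30, 180)),
    ]
    kept = filter_by_page_frame(chars, frame)
    print("".join(c.text for c in kept))
```

In this sketch the filter runs on OCR output, so it only discards recognized characters. Cropping the image to the frame before recognition would be an alternative design with the same intent.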

Author information

Corresponding author

Correspondence to Joost van Beusekom.

About this article

Cite this article

Shafait, F., van Beusekom, J., Keysers, D. et al. Document cleanup using page frame detection. IJDAR 11, 81–96 (2008). https://doi.org/10.1007/s10032-008-0071-7

  • DOI: https://doi.org/10.1007/s10032-008-0071-7
