DOI: 10.1145/2037342.2037372
research-article

A design of a preprocessing framework for large database of historical documents

Published: 16 September 2011

ABSTRACT

The objective of document preprocessing is to ease text recognition and document indexing. Analyzing historical documents is challenging because most of them are noisy and exhibit many kinds of degradation. In this paper we propose a preprocessing framework for a large dataset of historical documents. The framework consists of two phases: selection and evaluation. During the selection phase, one or more preprocessing methods are assigned to each book in the database. The evaluation phase then validates the selection results. The experiments are carried out on printed and handwritten documents extracted from the Google-Books and Bayerische Staatsbibliothek databases, respectively. The results obtained during the evaluation are very promising.
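The two-phase idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which preprocessing methods are used or how they are scored, so the toy thresholding methods, the F-measure score against ground truth, the `tolerance` parameter, and all function names below are assumptions made purely for illustration.

```python
from typing import Callable, Dict, List, Sequence

# Assumption: a page is a flat sequence of gray levels (0-255) and a
# preprocessed page is a list of 0/1 values where 1 marks a text pixel.
Page = Sequence[int]
Method = Callable[[Page], List[int]]


def global_threshold(page: Page) -> List[int]:
    """Hypothetical candidate method: fixed threshold at mid-gray."""
    return [1 if g < 128 else 0 for g in page]


def mean_threshold(page: Page) -> List[int]:
    """Hypothetical candidate method: threshold at the page's mean gray level."""
    mean = sum(page) / len(page)
    return [1 if g < mean else 0 for g in page]


def f_measure(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Harmonic mean of precision and recall over text pixels."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_score(method: Method, pages: Sequence[Page],
               truths: Sequence[Sequence[int]]) -> float:
    """Average F-measure of one method over a set of pages."""
    return sum(f_measure(method(p), t) for p, t in zip(pages, truths)) / len(pages)


def select_methods(pages: Sequence[Page], truths: Sequence[Sequence[int]],
                   methods: Dict[str, Method], tolerance: float = 0.02) -> List[str]:
    """Selection phase (assumed form): keep every method whose mean score on
    a book's sample pages is within `tolerance` of the best score."""
    scores = {name: mean_score(m, pages, truths) for name, m in methods.items()}
    best = max(scores.values())
    return [name for name, s in scores.items() if best - s <= tolerance]


def evaluate_selection(selected: Sequence[str], pages: Sequence[Page],
                       truths: Sequence[Sequence[int]],
                       methods: Dict[str, Method]) -> Dict[str, float]:
    """Evaluation phase (assumed form): score the selected methods on pages
    of the same book that were not used during selection."""
    return {name: mean_score(methods[name], pages, truths) for name in selected}


if __name__ == "__main__":
    methods = {"global": global_threshold, "mean": mean_threshold}
    # Toy "book": one sample page for selection, one held-out page for evaluation.
    sample, sample_truth = [[40, 60, 150, 230, 240, 250]], [[1, 1, 1, 0, 0, 0]]
    held_out, held_out_truth = [[30, 140, 235, 245]], [[1, 1, 0, 0]]
    chosen = select_methods(sample, sample_truth, methods)
    print(chosen, evaluate_selection(chosen, held_out, held_out_truth, methods))
```

Returning a list from `select_methods` mirrors the abstract's statement that one or multiple methods may be assigned to a book; scoring on held-out pages keeps the evaluation separate from the data that drove the selection.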

Published in

HIP '11: Proceedings of the 2011 Workshop on Historical Document Imaging and Processing
September 2011
195 pages
ISBN: 9781450309165
DOI: 10.1145/2037342

Copyright © 2011 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 52 of 90 submissions (58%)
