ABSTRACT
The objective of document preprocessing is to ease text recognition or document indexing. Analyzing historical documents is a major challenge because most of them are noisy and exhibit many degradations. In this paper we propose a preprocessing framework for a large dataset of historical documents. The framework consists of two phases: selection and evaluation. During selection, one or more preprocessing methods are assigned to each book in the database. The selection results are then validated during evaluation. The experiments are carried out on printed and handwritten documents drawn from the Google-Books and Bayerische Staatsbibliothek databases, respectively. The results obtained during evaluation are very promising.
Index Terms
- A design of a preprocessing framework for large database of historical documents