research-article

A design of a preprocessing framework for large database of historical documents

Authors:
Ines Ben Messaoud
Laboratoire des Systèmes et Traitement de Signal, LSTS, Ecole Nationale d'Ingénieurs de Tunis, ENIT, Tunis, Tunisia

Haikal El Abed
Technische Universität Braunschweig, Braunschweig, Germany

Volker Märgner
Technische Universität Braunschweig, Braunschweig, Germany

Hamid Amiri
Laboratoire des Systèmes et Traitement de Signal, LSTS, Ecole Nationale d'Ingénieurs de Tunis, ENIT, Tunis, Tunisia

HIP '11: Proceedings of the 2011 Workshop on Historical Document Imaging and Processing, September 2011, Pages 177–183, https://doi.org/10.1145/2037342.2037372

Published: 16 September 2011

ABSTRACT

The objective of document preprocessing is to ease text recognition and document indexing. The analysis of historical documents is challenging because most of them are noisy and exhibit many degradations. In this paper we propose a preprocessing framework for a large dataset of historical documents. The framework consists of two phases: selection and evaluation. During selection, one or more preprocessing methods are assigned to each book in the database. The evaluation phase then validates the selection results. Experiments are carried out on printed and handwritten documents drawn from the Google-Books and Bayerische Staatsbibliothek databases, respectively. The evaluation results are promising.
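
The two-phase structure described in the abstract can be illustrated with a minimal sketch. The abstract does not say how methods are scored or selected, so everything here is an assumption made for illustration: the preprocessing methods are taken to be binarization-style functions, quality is measured by a pixel-level F-measure against ground-truth pages, and the names `global_threshold`, `select_methods`, `evaluate_selection` and the `tolerance` parameter are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Image = List[List[int]]    # grayscale page, pixel values 0..255
Binary = List[List[int]]   # 1 = text pixel, 0 = background
Method = Callable[[Image], Binary]
Page = Tuple[Image, Binary]  # (page image, ground-truth binary image)


def global_threshold(level: int) -> Method:
    """Hypothetical stand-in for a real preprocessing method."""
    def apply(page: Image) -> Binary:
        return [[1 if px < level else 0 for px in row] for row in page]
    return apply


def f_measure(result: Binary, truth: Binary) -> float:
    """Pixel-level F-measure of a binary result against ground truth."""
    tp = fp = fn = 0
    for r_row, t_row in zip(result, truth):
        for r, t in zip(r_row, t_row):
            if r == 1 and t == 1:
                tp += 1
            elif r == 1:
                fp += 1
            elif t == 1:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_score(method: Method, pages: Sequence[Page]) -> float:
    """Average F-measure of one method over a set of pages."""
    return sum(f_measure(method(img), gt) for img, gt in pages) / len(pages)


def select_methods(methods: Dict[str, Method], sample_pages: Sequence[Page],
                   tolerance: float = 0.0) -> List[str]:
    """Selection phase (assumed rule): keep every method whose mean score
    on the book's sample pages is within `tolerance` of the best one."""
    scores = {name: mean_score(m, sample_pages) for name, m in methods.items()}
    best = max(scores.values())
    return sorted(name for name, s in scores.items() if best - s <= tolerance)


def evaluate_selection(methods: Dict[str, Method], selected: Sequence[str],
                       held_out_pages: Sequence[Page]) -> Dict[str, float]:
    """Evaluation phase (assumed rule): score the selected methods on
    pages of the same book that were not used during selection."""
    return {name: mean_score(methods[name], held_out_pages) for name in selected}


# Toy data for one "book": one sample page and one held-out page.
methods = {"t100": global_threshold(100), "t200": global_threshold(200)}
sample = [([[30, 150, 240], [40, 160, 250]], [[1, 1, 0], [1, 1, 0]])]
held_out = [([[20, 180, 230]], [[1, 1, 0]])]

selected = select_methods(methods, sample)
validation = evaluate_selection(methods, selected, held_out)
print(selected, validation)
```

Separating the pages used for selection from those used for evaluation is a design choice of this sketch; it keeps the validation score from simply restating the selection score.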



Published in
HIP '11: Proceedings of the 2011 Workshop on Historical Document Imaging and Processing
September 2011
195 pages
ISBN: 9781450309165
DOI: 10.1145/2037342
Program Chairs:
Bill Barrett, Brigham Young University
Michael S. Brown, National University of Singapore
R. Manmatha, UMass Amherst
Jake Gehring, FamilySearch Data Operations
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
evaluation metrics
ground-truth generation
method selection
preprocessing framework
Acceptance Rates
Overall Acceptance Rate: 52 of 90 submissions, 58%