ABSTRACT
This paper describes the complete indexing process for the registers of a French census dating back more than a hundred years, from image analysis to integration into the information system, in the context of probate genealogy. Each document of interest consists of a table of personal information, from which the cells containing the first name, the surname, and the relation to the head of household must be extracted and recognized. More than 30 million cells were processed, and their content was either integrated directly into the information system or sent to keyers for manual validation, yielding an automation rate of 80% while keeping the error rate below 15% on average. Building on this project, we have started developing a generic platform for processing table-based historical documents, which includes new functionalities and a more generic, user-friendly interface for defining table models.