Automatic indexing of French handwritten census registers for probate geneaology

Authors:
Cédric Sibade, Thomas Retornaz, Thibauld Nion, Romain Lerallut, Christopher Kermorvant

AZiA, Artificial Intelligence and Image Analysis, Paris - France

HIP '11: Proceedings of the 2011 Workshop on Historical Document Imaging and Processing, September 2011, Pages 51–58
https://doi.org/10.1145/2037342.2037352

Published: 16 September 2011

ABSTRACT

This paper describes the complete indexing process for the registers of a French census dating back more than a hundred years, from image analysis to integration into the information system, in the context of probate genealogy. The documents of interest consist of a table of personal information from which the cells containing the first name, the surname, and the relation to the head of household must be extracted and recognized. More than 30 million cells were processed, and their content was either integrated directly into the information system or sent to keyers for manual validation, yielding an automation rate of 80% while keeping the error rate below 15% on average. Based on this project, we have started developing a generic platform for processing table-based historical documents, including new functionalities and a more generic and user-friendly interface for defining table models.
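The trade-off between automation rate and error rate can be illustrated with a small sketch. The abstract does not say how the system decides which cells go to keyers, so this example assumes a simple confidence threshold on the recognizer output. The `Cell` class, the `route` and `rates` functions, the confidence scores, and the sample data are all hypothetical and not taken from the paper. The error rate here is computed over automatically accepted cells only, which is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    text: str          # recognizer's top hypothesis for the cell
    confidence: float  # hypothetical recognizer score in [0, 1]
    truth: str         # reference transcription, available only for evaluation


def route(cells, threshold):
    """Split cells into those accepted automatically and those sent to keyers."""
    accepted = [c for c in cells if c.confidence >= threshold]
    to_keyers = [c for c in cells if c.confidence < threshold]
    return accepted, to_keyers


def rates(cells, threshold):
    """Return (automation rate, error rate among accepted cells)."""
    accepted, _ = route(cells, threshold)
    automation = len(accepted) / len(cells) if cells else 0.0
    errors = sum(1 for c in accepted if c.text != c.truth)
    error = errors / len(accepted) if accepted else 0.0
    return automation, error


# Made-up sample: one confident misrecognition, one low-confidence correct cell.
sample = [
    Cell("MARTIN", 0.95, "MARTIN"),
    Cell("Jean", 0.90, "Jean"),
    Cell("fils", 0.80, "fils"),
    Cell("Durand", 0.70, "Dupond"),
    Cell("Marie", 0.30, "Marie"),
]

for threshold in (0.5, 0.75):
    automation, error = rates(sample, threshold)
    print(f"threshold={threshold}: automation={automation:.0%}, error={error:.0%}")
```

Raising the threshold sends more cells to manual validation, which lowers the automation rate but also tends to lower the error rate among the cells that are accepted without review.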

Index Terms

1. Applied computing
  1. Document management and text processing
    1. Document capture

Published in
HIP '11: Proceedings of the 2011 Workshop on Historical Document Imaging and Processing
September 2011
195 pages
ISBN: 9781450309165
DOI: 10.1145/2037342
Program Chairs: Bill Barrett (Brigham Young University), Michael S. Brown (National University of Singapore), R. Manmatha (UMass Amherst), Jake Gehring (FamilySearch Data Operations)
Copyright © 2011 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
Overall Acceptance Rate: 52 of 90 submissions, 58%