
Mass Digitization of Early Modern Texts With Optical Character Recognition

Authors:
Matthew Christy, Texas A&M University, College Station, TX
Anshul Gupta, Texas A&M University, College Station, TX
Elizabeth Grumbach, Texas A&M University, College Station, TX
Laura Mandell, Texas A&M University, College Station, TX (ORCID: 0000-0002-4013-114X)
Richard Furuta, Texas A&M University, College Station, TX
Ricardo Gutierrez-Osuna, Texas A&M University, College Station, TX

Journal on Computing and Cultural Heritage, Volume 11, Issue 1, Article No. 6, pp. 1–25. https://doi.org/10.1145/3075645

Published: 07 December 2017

Abstract

Optical character recognition (OCR) engines work poorly on texts published with premodern printing technologies. Engaging the key technological contributors from the IMPACT project, an earlier project attempting to solve the OCR problem for early modern and modern texts, the Early Modern OCR Project (eMOP) of Texas A&M received funding from the Andrew W. Mellon Foundation to improve OCR outputs for early modern texts from the Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO) proprietary database products, some 45 million pages in all. Compounding the print problems is the poor quality of the page images in these collections, which would be too time-consuming and expensive to reimage. This article describes eMOP's attempts to OCR 307,000 documents digitized from microfilm to make our cultural heritage available for current and future researchers. We describe the reasoning behind our choices as we undertook the project, based on other relevant studies; the discoveries we made; the data and the system we developed for processing it; the software, algorithms, training procedures, and tools that we developed; and future directions that should be taken for further work in developing OCR engines for cultural heritage materials.


Index Terms

1. Applied computing
  1. Document management and text processing
    1. Document capture
      1. Optical character recognition

Published in
Journal on Computing and Cultural Heritage Volume 11, Issue 1
Special Issue on GCH 2016 and Regular Papers
January 2018
116 pages
ISSN: 1556-4673
EISSN: 1556-4711
DOI: 10.1145/3172938
Editor: Roberto Scopigno, CNR-ISTI, Italy
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 7 December 2017
- Accepted: 1 March 2017
- Revised: 1 February 2017
- Received: 1 April 2016
Author Tags
Machine learning
digital humanities
Qualifiers
- research-article
- Research
- Refereed
