research-article

Mass Digitization of Early Modern Texts With Optical Character Recognition

Published: 07 December 2017

Abstract

Optical character recognition (OCR) engines work poorly on texts published with premodern printing technologies. Engaging the key technological contributors from IMPACT, an earlier project that attempted to solve the OCR problem for early modern and modern texts, the Early Modern OCR Project (eMOP) at Texas A&M received funding from the Andrew W. Mellon Foundation to improve OCR output for early modern texts from the proprietary database products Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO), some 45 million pages in all. Compounding the print problems is the poor quality of the page images in these collections, which would be too time consuming and expensive to reimage. This article describes eMOP's attempts to OCR 307,000 documents digitized from microfilm to make our cultural heritage available to current and future researchers. We describe the reasoning behind our choices, informed by other relevant studies; the discoveries we made; the data and the system we developed for processing it; the software, algorithms, training procedures, and tools that we developed; and future directions for further work on OCR engines for cultural heritage materials.



    • Published in

      Journal on Computing and Cultural Heritage   Volume 11, Issue 1
      Special Issue on GCH 2016 and Regular Papers
      January 2018
      116 pages
      ISSN:1556-4673
      EISSN:1556-4711
      DOI:10.1145/3172938

      Copyright © 2017 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 7 December 2017
      • Accepted: 1 March 2017
      • Revised: 1 February 2017
      • Received: 1 April 2016


      Qualifiers

      • research-article
      • Research
      • Refereed
