Abstract
Optical character recognition (OCR) engines work poorly on texts published with premodern printing technologies. Engaging the key technological contributors from the IMPACT project, an earlier effort to solve the OCR problem for early modern and modern texts, the Early Modern OCR Project (eMOP) at Texas A&M received funding from the Andrew W. Mellon Foundation to improve OCR output for early modern texts from the Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO) proprietary database products, some 45 million pages in all. Compounding the print problems is the poor quality of the page images in these collections, which would be too time-consuming and expensive to reimage. This article describes eMOP's attempts to OCR 307,000 documents digitized from microfilm to make our cultural heritage available to current and future researchers. We describe the reasoning behind our choices, which drew on other relevant studies; the discoveries we made; the data and the system we developed for processing it; the software, algorithms, training procedures, and tools that we developed; and future directions for further work on OCR engines for cultural heritage materials.
Index Terms
- Mass Digitization of Early Modern Texts With Optical Character Recognition