ABSTRACT
We describe efforts to adapt the Tesseract open source OCR engine for multiple scripts and languages. Effort has been concentrated on enabling generic multilingual operation such that negligible customization is required for a new language beyond providing a corpus of text. Although changes were required to various modules, including physical layout analysis and linguistic post-processing, no change was required to the character classifier beyond changing a few limits. The Tesseract classifier has adapted easily to Simplified Chinese. Test results on English, a mixture of European languages, and Russian, taken from a random sample of books, show a reasonably consistent word error rate between 3.72% and 5.78%, and Simplified Chinese has a character error rate of only 3.77%.
Index Terms
- Adapting the Tesseract open source OCR engine for multilingual OCR