Adapting the Tesseract open source OCR engine for multilingual OCR

Authors:
Ray Smith, Google Inc., Mountain View, CA
Daria Antonova, Google Inc., Mountain View, CA
Dar-Shyang Lee, Google Inc., Mountain View, CA

MOCR '09: Proceedings of the International Workshop on Multilingual OCR, July 2009, Article No. 1, Pages 1–8. https://doi.org/10.1145/1577802.1577804

Published: 25 July 2009

ABSTRACT

We describe efforts to adapt the Tesseract open source OCR engine for multiple scripts and languages. Effort has been concentrated on enabling generic multi-lingual operation, such that negligible customization is required for a new language beyond providing a corpus of text. Although changes were required to various modules, including physical layout analysis and linguistic post-processing, no change was required to the character classifier beyond changing a few limits. The Tesseract classifier has adapted easily to Simplified Chinese. Test results on English, a mixture of European languages, and Russian, taken from a random sample of books, show a reasonably consistent word error rate between 3.72% and 5.78%, and Simplified Chinese has a character error rate of only 3.77%.
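The abstract reports word and character error rates but does not say how they were computed. As an illustration only, the sketch below uses one common definition: the Levenshtein edit distance between the ground-truth text and the OCR output, divided by the length of the ground truth, counted in words or in characters. The function names are hypothetical, and the paper's own scoring procedure may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists or strings)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Edit distance normalized by the number of reference units."""
    if not ref_units:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_units, hyp_units) / len(ref_units)


def word_error_rate(reference, hypothesis):
    """Error rate over whitespace-separated words (an assumed tokenization)."""
    return error_rate(reference.split(), hypothesis.split())


def char_error_rate(reference, hypothesis):
    """Error rate over individual characters, as is usual for Chinese text."""
    return error_rate(list(reference), list(hypothesis))
```

Under this definition, a word-level measure suits scripts with explicit word boundaries, while a character-level measure avoids the need to segment words in Simplified Chinese.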

Index Terms

1. Applied computing
  1. Document management and text processing
    1. Document capture
2. Computing methodologies
  1. Machine learning

Published in
MOCR '09: Proceedings of the International Workshop on Multilingual OCR
July 2009
139 pages
ISBN: 9781605586984
DOI: 10.1145/1577802
General Chairs: Venu Govindaraju (University at Buffalo), Prem Natarajan (BBN Technologies)
Program Chairs: Santanu Chaudhury (IIT Delhi), Daniel Lopresti (Lehigh University)
Copyright © 2009 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Tesseract
multi-lingual OCR
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 17 of 34 submissions, 50%