DOI: 10.1145/1577802.1577804
research-article

Adapting the Tesseract open source OCR engine for multilingual OCR

Published: 25 July 2009

ABSTRACT

We describe efforts to adapt the Tesseract open source OCR engine for multiple scripts and languages. Effort has been concentrated on enabling generic multilingual operation, so that a new language requires negligible customization beyond providing a corpus of text. Although changes were required to various modules, including physical layout analysis and linguistic post-processing, no change was required to the character classifier beyond changing a few limits. The Tesseract classifier has adapted easily to Simplified Chinese. Test results on English, a mixture of European languages, and Russian, taken from a random sample of books, show a reasonably consistent word error rate between 3.72% and 5.78%, and Simplified Chinese has a character error rate of only 3.77%.

      • Published in

        MOCR '09: Proceedings of the International Workshop on Multilingual OCR
        July 2009
        139 pages
        ISBN:9781605586984
        DOI:10.1145/1577802

        Copyright © 2009 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 17 of 34 submissions, 50%
