research-article

Topic based language models for OCR correction

Authors:
Anurag Bhardwaj, University at Buffalo, Amherst, NY
Faisal Farooq, University at Buffalo, Amherst, NY
Huaigu Cao, University at Buffalo, Amherst, NY
Venu Govindaraju, University at Buffalo, Amherst, NY
AND '08: Proceedings of the second workshop on Analytics for noisy unstructured text data, July 2008, Pages 107–112
https://doi.org/10.1145/1390749.1390766

Published: 24 July 2008

ABSTRACT

Despite several decades of research in document analysis, recognition of unconstrained handwritten documents is still considered a challenging task. Previous research in this area has shown that word recognizers produce reasonably clean output when used with a restricted lexicon. In the absence of such a lexicon, however, the output of an unconstrained handwritten word recognizer is noisy. The objective of this research is to process the noisy recognizer output and eliminate spurious recognition choices using a topic based language model. We construct a topic based language model for every document using manually categorized training data. A topic categorization sub-system based on a Maximum Entropy model is also trained and is used to generate the topic distribution of a test document. A given test word image is processed by the recognizer, and its word recognition likelihood is refined by incorporating the topic distribution of the document and the topic based language model probability. The proposed method is evaluated on the publicly available IAM dataset, and experimental results show a significant improvement in word recognition accuracy, from 32% to 40%, on a test set of 4033 word images extracted from 70 handwritten document images.
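
The refinement step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact combination rule, so the code below assumes one plausible form: each candidate's recognizer likelihood is multiplied by a topic-mixture word probability, sum over topics t of P(t | document) * P(word | t). The function names, the add-alpha smoothed unigram models, the toy corpus, and all numeric values are hypothetical and are not taken from the paper.

```python
from collections import Counter


def train_topic_unigrams(docs_by_topic, vocab, alpha=1.0):
    """Estimate add-alpha smoothed unigram P(word | topic) from categorized text."""
    models = {}
    for topic, docs in docs_by_topic.items():
        counts = Counter(w for doc in docs for w in doc.split())
        total = sum(counts[w] for w in vocab) + alpha * len(vocab)
        models[topic] = {w: (counts[w] + alpha) / total for w in vocab}
    return models


def rescore(candidates, topic_dist, models):
    """Reweight recognizer likelihoods by the topic-mixture word probability.

    Assumed rule (not confirmed by the source):
        score(w) = P_rec(w | image) * sum_t P(t | doc) * P(w | t)
    Scores are renormalized over the candidate list.
    """
    scores = {}
    for word, rec_likelihood in candidates.items():
        lm_prob = sum(p_t * models[t][word] for t, p_t in topic_dist.items())
        scores[word] = rec_likelihood * lm_prob
    norm = sum(scores.values())
    return {w: s / norm for w, s in scores.items()}


# Hypothetical manually categorized training text.
docs_by_topic = {
    "sports": ["the team won the match", "the coach praised the team"],
    "finance": ["the bank raised the rate", "the market fell as the bank cut rates"],
}
vocab = sorted({w for docs in docs_by_topic.values() for d in docs for w in d.split()})
models = train_topic_unigrams(docs_by_topic, vocab)

# Hypothetical recognizer output for one word image (candidate -> likelihood).
candidates = {"bank": 0.30, "team": 0.35, "rate": 0.35}
# Hypothetical topic distribution, standing in for the Maximum Entropy classifier.
topic_dist = {"sports": 0.1, "finance": 0.9}

rescored = rescore(candidates, topic_dist, models)
best = max(rescored, key=rescored.get)
print(best, rescored)
```

In this toy setup the recognizer alone slightly prefers other candidates, while the finance-weighted topic mixture shifts the ranking toward the word that is frequent in that topic. A real system would substitute the trained topic classifier and language models for the hard-coded dictionaries.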


Index Terms

1. Applied computing
  1. Document management and text processing
    1. Document capture
      1. Optical character recognition


Published in
AND '08: Proceedings of the second workshop on Analytics for noisy unstructured text data
July 2008, 130 pages
ISBN: 9781605581965
DOI: 10.1145/1390749
Conference Chairs: Daniel Lopresti (Lehigh University), Shourya Roy (IBM India Research Lab), Klaus Schulz (University of Munich), L. Venkata Subramaniam (India Research Lab)
Copyright © 2008 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
OCR correction
document analysis
dynamic lexicon
language models
topic categorization
Acceptance Rates
Overall Acceptance Rate: 15 of 22 submissions, 68%