A survey on optical character recognition for Bangla and Devanagari scripts

Sadhana

Abstract

The past few decades have witnessed intensive research on optical character recognition (OCR) for Roman, Chinese, and Japanese scripts. A lot of work has also been reported on OCR for various Indian scripts, such as Devanagari, Bangla, Oriya, Tamil, Telugu, Malayalam, Kannada, Gurmukhi, and Gujarati. In this paper, we present a review of OCR work on Indian scripts, mainly on Bangla and Devanagari, the two most popular scripts in India. We summarize most of the published papers on this topic and analyse the various methodologies and their reported results. Future directions of research in OCR for Indian scripts are also given.




Correspondence to SOUMEN BAG.


Cite this article

BAG, S., HARIT, G. A survey on optical character recognition for Bangla and Devanagari scripts. Sadhana 38, 133–168 (2013). https://doi.org/10.1007/s12046-013-0121-9


  • DOI: https://doi.org/10.1007/s12046-013-0121-9
