Abstract
The past few decades have witnessed intensive research on optical character recognition (OCR) for Roman, Chinese, and Japanese scripts. A lot of work has also been reported on OCR for various Indian scripts, such as Devanagari, Bangla, Oriya, Tamil, Telugu, Malayalam, Kannada, Gurmukhi, and Gujarati. In this paper, we present a review of OCR work on Indian scripts, mainly on Bangla and Devanagari, the two most popular scripts in India. We summarize most of the published papers on this topic and analyse the various methodologies and their reported results. Future directions of research in OCR for Indian scripts are also outlined.
About this article
Cite this article
BAG, S., HARIT, G. A survey on optical character recognition for Bangla and Devanagari scripts. Sadhana 38, 133–168 (2013). https://doi.org/10.1007/s12046-013-0121-9
DOI: https://doi.org/10.1007/s12046-013-0121-9