Abstract
Urdu is a language spoken in the Indian subcontinent by an estimated 130–270 million speakers. At the spoken level, Urdu and Hindi are considered dialects of a single language because of their shared vocabulary and similar grammar. At the written level, however, Urdu is much closer to Arabic because it is written in Nastaliq, the calligraphic style of the Persian–Arabic script. A speaker of Hindi can therefore understand spoken Urdu but may not be able to read it, because Hindi is written in the Devanagari script, whereas a reader of Arabic can read the written words but may not understand spoken Urdu. In this chapter we present an overview of written Urdu. Prior research on handwritten Urdu OCR is very limited. We present what is perhaps the first system for recognizing handwritten Urdu words. On a data set of about 1300 handwritten words, the system achieved an accuracy of 70% for the top choice and 82% for the top three choices.
Copyright information
© 2009 Springer-Verlag London Limited
About this chapter
Cite this chapter
Mukhtar, O., Setlur, S., Govindaraju, V. (2009). Experiments on Urdu Text Recognition. In: Govindaraju, V., Setlur, S. (eds) Guide to OCR for Indic Scripts. Advances in Pattern Recognition. Springer, London. https://doi.org/10.1007/978-1-84800-330-9_8
DOI: https://doi.org/10.1007/978-1-84800-330-9_8
Publisher Name: Springer, London
Print ISBN: 978-1-84800-329-3
Online ISBN: 978-1-84800-330-9
eBook Packages: Computer Science, Computer Science (R0)