Experiments on Urdu Text Recognition

Mukhtar, Omar; Setlur, Srirangaraj; Govindaraju, Venu

doi:10.1007/978-1-84800-330-9_8

Part of the book series: Advances in Pattern Recognition (ACVPR)

Abstract

Urdu is spoken in the Indian subcontinent by an estimated 130–270 million people. At the spoken level, Urdu and Hindi are considered dialects of a single language because of their shared vocabulary and similar grammar. At the written level, however, Urdu is much closer to Arabic because it is written in Nastaliq, the calligraphic style of the Persian–Arabic script. A speaker of Hindi can therefore understand spoken Urdu but may not be able to read it, because Hindi is written in the Devanagari script, whereas a reader of Arabic can read the written words but may not understand spoken Urdu. In this chapter we present an overview of written Urdu. Prior research on handwritten Urdu OCR is very limited. We present (perhaps) the first system for recognizing handwritten Urdu words. On a data set of about 1300 handwritten words, we achieved an accuracy of 70% for the top choice and 82% for the top three choices.
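
The top-choice and top-three figures are top-k accuracies: a word counts as correct if its true label appears among the recognizer's k highest-ranked candidates. A minimal sketch of that metric follows, assuming each word comes with a ranked candidate list. The function name and the toy transliterated data are hypothetical illustrations, not taken from the chapter.

```python
from typing import Sequence


def top_k_accuracy(ranked: Sequence[Sequence[str]],
                   truth: Sequence[str],
                   k: int) -> float:
    """Fraction of samples whose true label is among the first k candidates."""
    if len(ranked) != len(truth):
        raise ValueError("ranked and truth must have the same length")
    if not truth:
        raise ValueError("need at least one sample")
    # A sample is a hit when its true label appears in the top-k slice.
    hits = sum(1 for cands, label in zip(ranked, truth) if label in cands[:k])
    return hits / len(truth)


# Hypothetical toy data: ranked candidates per word image, best first.
ranked = [
    ["kitab", "kitaab", "kabab"],
    ["qalam", "kalam", "alam"],
    ["shahr", "sher", "shehar"],
    ["pani", "paani", "bani"],
]
truth = ["kitab", "kalam", "shehar", "dost"]

top1 = top_k_accuracy(ranked, truth, 1)
top3 = top_k_accuracy(ranked, truth, 3)
```

Top-three accuracy can never be lower than top-choice accuracy on the same data, since every top-choice hit is also a top-three hit.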

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Center for Unified Biometrics and Sensors, University at Buffalo, 14228, Amherst, NY, USA
Omar Mukhtar, Srirangaraj Setlur & Venu Govindaraju

Corresponding author

Correspondence to Omar Mukhtar.

Editor information

Editors and Affiliations

Center of Excellence for Document Analysis & Recognition (CEDAR), 520 Lee Entrance, Amherst, 14228, U.S.A.
Venu Govindaraju & Srirangaraj (Ranga) Setlur

About this chapter

Cite this chapter

Mukhtar, O., Setlur, S., Govindaraju, V. (2009). Experiments on Urdu Text Recognition. In: Govindaraju, V., Setlur, S. (eds) Guide to OCR for Indic Scripts. Advances in Pattern Recognition. Springer, London. https://doi.org/10.1007/978-1-84800-330-9_8

DOI: https://doi.org/10.1007/978-1-84800-330-9_8
Published: 28 August 2009
Publisher Name: Springer, London
Print ISBN: 978-1-84800-329-3
Online ISBN: 978-1-84800-330-9
eBook Packages: Computer Science, Computer Science (R0)
