Experiments on Urdu Text Recognition

  • Chapter
Guide to OCR for Indic Scripts

Part of the book series: Advances in Pattern Recognition (ACVPR)

Abstract

Urdu is a language spoken in the Indian subcontinent by an estimated 130–270 million speakers. At the spoken level, Urdu and Hindi are considered dialects of a single language because of their shared vocabulary and similar grammar. At the written level, however, Urdu is much closer to Arabic because it is written in Nastaliq, a calligraphic style of the Persian–Arabic script. A speaker of Hindi can therefore understand spoken Urdu but may not be able to read written Urdu, because Hindi is written in the Devanagari script, whereas a writer of Arabic can read the written words but may not understand spoken Urdu. In this chapter we present an overview of written Urdu. Prior research in handwritten Urdu OCR is very limited. We present (perhaps) the first system for recognizing handwritten Urdu words. On a data set of about 1300 handwritten words, we achieved an accuracy of 70% for the top choice and 82% for the top three choices.
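The two figures quoted in the abstract are top-choice and top-three accuracies. The sketch below illustrates how such a metric is commonly computed for a word recognizer that returns a ranked list of candidates. It is not the chapter's evaluation code: the function name `top_k_accuracy`, the transliterated words, and the toy rankings are all hypothetical and exist only to make the definition concrete.

```python
from typing import Sequence


def top_k_accuracy(ranked: Sequence[Sequence[str]],
                   truth: Sequence[str],
                   k: int) -> float:
    """Fraction of samples whose true label is among the first k candidates.

    ranked[i] is the recognizer's candidate list for sample i, best first.
    truth[i] is the ground-truth label for sample i.
    """
    if len(ranked) != len(truth):
        raise ValueError("ranked and truth must have the same length")
    if k < 1:
        raise ValueError("k must be at least 1")
    if not truth:
        return 0.0
    hits = sum(1 for candidates, label in zip(ranked, truth)
               if label in candidates[:k])
    return hits / len(truth)


# Hypothetical toy data (not from the chapter): four word images,
# each with three ranked candidates in an ad hoc transliteration.
ranked = [
    ["kitab", "kitaab", "katib"],   # correct at rank 1
    ["qalam", "kalam", "qadam"],    # correct at rank 2
    ["pani", "paani", "bani"],      # correct at rank 3
    ["shehr", "sher", "shahr"],     # correct word not in the list
]
truth = ["kitab", "kalam", "bani", "ghar"]

top1 = top_k_accuracy(ranked, truth, k=1)
top3 = top_k_accuracy(ranked, truth, k=3)
print(f"top-1: {top1:.2f}, top-3: {top3:.2f}")
```

Top-three accuracy can never be lower than top-choice accuracy on the same data, since every top-choice hit is also a top-three hit. This is consistent with the 70% and 82% reported above.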

Author information

Corresponding author

Correspondence to Omar Mukhtar.

Copyright information

© 2009 Springer-Verlag London Limited

About this chapter

Cite this chapter

Mukhtar, O., Setlur, S., Govindaraju, V. (2009). Experiments on Urdu Text Recognition. In: Govindaraju, V., Setlur, S. (eds) Guide to OCR for Indic Scripts. Advances in Pattern Recognition. Springer, London. https://doi.org/10.1007/978-1-84800-330-9_8

  • DOI: https://doi.org/10.1007/978-1-84800-330-9_8

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-84800-329-3

  • Online ISBN: 978-1-84800-330-9

  • eBook Packages: Computer Science, Computer Science (R0)
