Urdu language processing: a survey

Artificial Intelligence Review

Abstract

Natural language processing has been studied far more extensively for Western languages than for Eastern ones, particularly South Asian languages. Western languages are considered resource-rich: core linguistic resources such as corpora, WordNets, dictionaries, and gazetteers, together with their associated tools, are readily available. Most South Asian languages, by contrast, are low-resource languages. Urdu, one of the most widely spoken languages of the subcontinent, is an example: owing to the scarcity of resources, comparatively little work has been conducted on it. The core objective of this paper is to survey the linguistic resources that exist for Urdu language processing (ULP), to highlight the different tasks in ULP, and to discuss the available state-of-the-art techniques. In doing so, the paper describes in detail the recent increase in interest and the progress made in ULP research. First, the available datasets for Urdu are discussed. The characteristics, orthography, and morphology of Urdu are then described, along with resource sharing between Hindi and Urdu. Pre-processing activities such as stop-word removal, diacritics removal, normalization, and stemming are illustrated. State-of-the-art research is reviewed for tasks such as tokenization, sentence boundary detection, part-of-speech tagging, named entity recognition, parsing, and the development of WordNet. In addition, the impact of ULP on application areas such as information retrieval, classification, and plagiarism detection is investigated. Finally, open issues and future directions for this new and dynamic area of research are provided. The goal of this paper is to organize ULP work so that it can serve as a platform for future ULP research.
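As a rough illustration of the pre-processing steps named in the abstract, the following Python sketch chains Unicode normalization, diacritics removal, and stop-word removal. It is not the method of any surveyed system. The function names, the diacritic set (U+064B to U+0652 plus U+0670), the four-word stop list, and the whitespace tokenization are all assumptions made for this example; real Urdu text needs a proper word segmenter and a curated stop-word list.

```python
import unicodedata

# Assumed diacritic set: Arabic-script short-vowel and related marks
# (U+064B..U+0652) plus superscript alef (U+0670). Illustrative only.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)} | {"\u0670"}

# Hypothetical, deliberately tiny stop-word list (escapes avoid look-alike letters).
STOP_WORDS = {
    "\u06C1\u06D2",        # "hai" (is)
    "\u06A9\u0627",        # "ka" (of)
    "\u06A9\u06CC",        # "ki" (of)
    "\u0645\u06CC\u06BA",  # "mein" (in)
}


def normalize_urdu(text: str) -> str:
    """Compose to NFC, then drop the optional diacritics listed above.

    Composing first keeps letters such as alef with madda (U+0622) intact,
    because their combining mark is merged into a single code point.
    """
    composed = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in composed if ch not in DIACRITICS)


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Filter out tokens found in the illustrative stop-word list."""
    return [tok for tok in tokens if tok not in STOP_WORDS]


def preprocess(text: str) -> list[str]:
    """Normalize, split on whitespace (a simplification), and filter."""
    return remove_stop_words(normalize_urdu(text).split())
```

Calling `preprocess` on a short sentence returns its content words with diacritics stripped. Stemming is left out here because it depends on language-specific affix lists.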

Notes

  1. http://www.cle.org.pk/software/langproc/transliterator_tools.htm.

  2. http://www.unicode.org/reports/tr15/.

  3. http://www.unicode.org/reports/tr15/.

  4. http://www.cle.org.pk/software/langproc/urdunormalization.htm.

  5. The diacritics (called zer-e-izafat or hamza-e-izafat) are optional, and are not written in the example given.

  6. http://www.cle.org.pk/software/langproc/POStagset.htm.

Acknowledgments

This work is supported by the Higher Education Commission (HEC), Islamabad, Pakistan.

Author information

Corresponding author

Correspondence to Ali Daud.

Cite this article

Daud, A., Khan, W. & Che, D. Urdu language processing: a survey. Artif Intell Rev 47, 279–311 (2017). https://doi.org/10.1007/s10462-016-9482-x

  • DOI: https://doi.org/10.1007/s10462-016-9482-x
