research-article

Arabic Natural Language Processing: Challenges and Solutions

Published: 01 December 2009

Abstract

The Arabic language presents serious challenges to researchers and developers of natural language processing (NLP) applications for Arabic text and speech. The purpose of this article is to describe some of these challenges and to present some solutions that can guide current and future practitioners in the field of Arabic natural language processing (ANLP). We begin with general features of the Arabic language in Sections 1, 2, and 3, and then move to more specific properties of the language in the rest of the article. In Section 1 we highlight the significance of the Arabic language today and describe its general properties. Section 2 presents the feature of Arabic Diglossia, showing how the sociolinguistic aspects of the Arabic language differ from those of other languages. The stability of Arabic Diglossia and its implications for ANLP applications are discussed, and ways to deal with this problematic property are proposed. Section 3 deals with the properties of the Arabic script and the explosion of ambiguity that results from the absence of short vowel representations and overt case markers in contemporary Arabic texts. In Section 4 we present specific features of the Arabic language, such as its nonconcatenative morphology, its agglutinative character, and its pro-drop property, and the challenges these properties pose to ANLP. We also present solutions that have already been adopted by some pioneering researchers in the field. In Section 5 we point out the lack of formal and explicit grammars of Modern Standard Arabic, which impedes the progress of more advanced ANLP systems. In Section 6 we draw our conclusions.

  • Published in

    ACM Transactions on Asian Language Information Processing  Volume 8, Issue 4
    December 2009
    121 pages
    ISSN:1530-0226
    EISSN:1558-3430
    DOI:10.1145/1644879
    Copyright © 2009 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Published: 1 December 2009
    • Accepted: 1 October 2009
    • Revised: 1 September 2009
    • Received: 1 June 2009
    Published in TALIP Volume 8, Issue 4

    Qualifiers

    • research-article
    • Research
    • Refereed
