Abstract
The Arabic language presents serious challenges to researchers and developers of natural language processing (NLP) applications for Arabic text and speech. This article describes some of these challenges and presents solutions intended to guide current and future practitioners in the field of Arabic natural language processing (ANLP). We begin with general features of the Arabic language in Sections 1, 2, and 3 and then move to more specific properties of the language in the rest of the article. Section 1 highlights the significance of the Arabic language today and describes its general properties. Section 2 presents Arabic diglossia, showing how the sociolinguistic aspects of Arabic differ from those of other languages. We discuss the stability of Arabic diglossia and its implications for ANLP applications, and propose ways to deal with this problematic property. Section 3 deals with the properties of the Arabic script and the explosion of ambiguity that results from the absence of short vowel representations and overt case markers in contemporary Arabic texts. Section 4 presents specific features of the Arabic language, such as its nonconcatenative morphology, its agglutinative character, and its pro-drop property, along with the challenges these properties pose to ANLP. We also present solutions that have already been adopted by some pioneering researchers in the field. Section 5 points out the lack of formal and explicit grammars of Modern Standard Arabic, which impedes the progress of more advanced ANLP systems. Section 6 presents our conclusions.