DOI: 10.1145/564376.564423
Article

Term selection for searching printed Arabic

Published: 11 August 2002

ABSTRACT

Since many Arabic documents are available only in print, automating retrieval from collections of scanned Arabic document images using Optical Character Recognition (OCR) is an interesting problem. Arabic combines rich morphology with a writing system that presents unique challenges to OCR systems. These factors must be considered when selecting terms for automatic indexing. In this paper, alternative choices of indexing terms are explored using both an existing electronic text collection and a newly developed collection built from images of actual printed Arabic documents. Character n-grams or lightly stemmed words were found to typically yield near-optimal retrieval effectiveness, and combining both types of terms resulted in robust performance across a broad range of conditions.
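
To make the two kinds of indexing terms named in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation. The function names, the `w:`/`g:` term prefixes, the choice of n = 3, and the toy stemmer (which strips only the definite article "ال") are all assumptions made for illustration; the abstract does not specify the light stemmer or the n-gram lengths used.

```python
from typing import Callable, List


def char_ngrams(word: str, n: int) -> List[str]:
    """Return the overlapping character n-grams of a word.

    Words no longer than n are returned whole, so short words still
    produce one term.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def toy_light_stem(word: str) -> str:
    """Hypothetical light stemmer: strips only the definite article.

    A real light stemmer removes a larger set of prefixes and suffixes;
    this stand-in exists only to show where stemming fits.
    """
    prefix = "ال"
    if word.startswith(prefix) and len(word) - len(prefix) >= 2:
        return word[len(prefix):]
    return word


def index_terms(
    text: str,
    n: int = 3,
    stem: Callable[[str], str] = toy_light_stem,
) -> List[str]:
    """Combine lightly stemmed words and character n-grams into one term list.

    The 'w:' and 'g:' prefixes (an assumption of this sketch) keep the
    two term types from colliding in a shared index.
    """
    terms: List[str] = []
    for word in text.split():
        terms.append("w:" + stem(word))
        terms.extend("g:" + gram for gram in char_ngrams(word, n))
    return terms
```

Calling `index_terms` on recognized text yields one stemmed-word term plus the word's n-grams for each token. An OCR error inside a word then corrupts only the n-grams that overlap it, which is one intuition for why n-grams can be useful on degraded text.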

Published in

SIGIR '02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
August 2002
478 pages
ISBN: 1581135610
DOI: 10.1145/564376

Copyright © 2002 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

SIGIR '02 paper acceptance rate: 44 of 219 submissions, 20%. Overall acceptance rate: 792 of 3,983 submissions, 20%.
