Article

Term selection for searching printed Arabic

Authors:
Kareem Darwish, University of Maryland, College Park, College Park, MD
Douglas W. Oard, University of Maryland, College Park, College Park, MD

SIGIR '02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
August 2002
Pages 261–268
https://doi.org/10.1145/564376.564423

Published: 11 August 2002

ABSTRACT

Since many Arabic documents are available only in print, automating retrieval from collections of scanned Arabic document images using Optical Character Recognition (OCR) is an interesting problem. Arabic combines rich morphology with a writing system that presents unique challenges to OCR systems. These factors must be considered when selecting terms for automatic indexing. In this paper, alternative choices of indexing terms are explored using both an existing electronic text collection and a newly developed collection built from images of actual printed Arabic documents. Character n-grams or lightly stemmed words were found to typically yield near-optimal retrieval effectiveness, and combining both types of terms resulted in robust performance across a broad range of conditions.
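The two kinds of indexing terms named in the abstract, character n-grams and (lightly stemmed) words, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names `char_ngrams` and `index_terms`, the n-gram lengths 3 and 4, the `w:` / `g3:` prefixes that keep the term types apart, and the `stem` parameter, which defaults to the identity function because the abstract does not specify a stemming procedure.

```python
from typing import Callable, Iterable, List


def char_ngrams(word: str, n: int) -> List[str]:
    """Return the overlapping character n-grams of a word.

    Words no longer than n are returned whole, so short words still
    produce one term.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def index_terms(
    text: str,
    n_values: Iterable[int] = (3, 4),          # assumed lengths, not from the paper
    stem: Callable[[str], str] = lambda w: w,  # placeholder for a light stemmer
) -> List[str]:
    """Combine word terms and character n-gram terms for one text.

    Each term carries a prefix ("w:" for words, "g<n>:" for n-grams) so
    that the two term types do not collide in a shared index.
    """
    terms: List[str] = []
    for word in text.split():
        terms.append("w:" + stem(word))
        for n in n_values:
            terms.extend(f"g{n}:{gram}" for gram in char_ngrams(word, n))
    return terms


if __name__ == "__main__":
    # The four-letter Arabic word for "book" yields two character trigrams.
    print(char_ngrams("كتاب", 3))
    print(index_terms("كتاب", n_values=(3,)))
```

Because n-grams are cut from raw character sequences, a single misrecognized character only corrupts the n-grams that overlap it, which is one plausible reason such terms are often considered for OCR-degraded text. A real light stemmer would be passed in through `stem`.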

Index Terms

Term selection for searching printed Arabic
1. Information systems
  1. Information retrieval

Published in
SIGIR '02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
August 2002
478 pages
ISBN:1581135610
DOI:10.1145/564376
General Chair:
Kalervo Järvelin, University of Tampere, Finland
Program Chairs:
Micheline Beaulieu, University of Sheffield, UK
Ricardo Baeza-Yates, University of Chile, Chile
Sung Hyon Myaeng, Chungnam National University, Korea
Copyright © 2002 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Arabic
OCR
information retrieval
term selection
Qualifiers
- Article
Conference

Acceptance Rates
SIGIR '02 Paper Acceptance Rate: 44 of 219 submissions, 20%
Overall Acceptance Rate: 792 of 3,983 submissions, 20%
Article Metrics
- Total Citations: 34
- Total Downloads: 985
- Downloads (Last 12 months): 2
- Downloads (Last 6 weeks): 1