
Towards effective strategies for monolingual and bilingual information retrieval: Lessons learned from NTCIR-4

Published: 01 June 2005

Abstract

At the NTCIR-4 workshop, Justsystem Corporation (JSC) and Clairvoyance Corporation (CC) collaborated in the cross-language retrieval task (CLIR). Our goal was to evaluate the performance and robustness of our recently developed commercial-grade CLIR systems for English and Asian languages. The main contribution of this article is the investigation of different strategies, their interactions in both monolingual and bilingual retrieval tasks, and their respective contributions to operational retrieval systems in the context of NTCIR-4. We report results of Japanese and English monolingual retrieval and results of Japanese-to-English bilingual retrieval. In the monolingual retrieval analysis, we examine two special properties of the NTCIR experimental design (two levels of relevance and identical queries in multiple languages) and explore how they interact with strategies of our retrieval system, including pseudo-relevance feedback, multi-word term down-weighting, and term weight merging strategies. Our analysis shows that the choice of language (English or Japanese) does not have a significant impact on retrieval performance. Query expansion is slightly more effective with relaxed judgments than with rigid judgments. For better retrieval performance, weights of multi-word terms should be lowered. In the bilingual retrieval analysis, we aim to identify robust strategies that are effective when used alone and when used in combination with other strategies. We examine cross-lingual specific strategies such as translation disambiguation and translation structuring, as well as general strategies such as pseudo-relevance feedback and multi-word term down-weighting. For the shorter title topics, pseudo-relevance feedback is a major performance enhancer, but translation structuring affects retrieval performance negatively when used alone or in combination with other strategies. All of the strategies we examined improve retrieval performance for the longer description topics, with pseudo-relevance feedback and translation structuring as the major contributors.
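To make the pseudo-relevance feedback and multi-word term down-weighting strategies concrete, the sketch below shows one common way to combine them: assume the top-ranked documents are relevant, score their terms over that feedback set, penalize multi-word terms, and append the best candidates to the query. This is a minimal Rocchio-style illustration in Python; the function name, the tf-idf-like scoring formula, and all parameter values are assumptions chosen for exposition, not the actual JSC/CC implementation.

```python
from collections import Counter
from math import log

def prf_expand(query_terms, ranked_docs, k=10, n_expansion=20,
               multiword_weight=0.3):
    """Pseudo-relevance feedback: treat the top-k retrieved documents
    as relevant, score their terms, and append the best ones to the
    query. `multiword_weight` down-weights multi-word terms, in the
    spirit of the down-weighting strategy described in the abstract.
    All values are illustrative, not the paper's settings."""
    tf = Counter()   # term frequency across the feedback set
    df = Counter()   # number of feedback documents containing the term
    for doc in ranked_docs[:k]:        # each doc is a list of tokens
        tf.update(doc)
        df.update(set(doc))
    scores = {}
    for term, f in tf.items():
        score = f * log(1 + k / df[term])   # simple tf * idf-like score
        if " " in term:                     # crude multi-word term test
            score *= multiword_weight
        scores[term] = score
    expansion = [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])
                 if t not in query_terms][:n_expansion]
    return list(query_terms) + expansion
```

Down-weighting rather than discarding multi-word terms keeps their precision benefit while preventing a single long phrase from dominating the expanded query, which matches the abstract's finding that weights of multi-word terms should be lowered.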



        Reviews

        Dagobert Soergel

This substantial paper will be very useful for researchers working in automated information retrieval (IR), but not for a general audience. It describes, in great detail, techniques for both monolingual IR in English and Japanese, and Japanese-English cross-language IR (Japanese queries, English documents). The paper reports on retrieval experiments in the context of NTCIR-4, a Japanese retrieval testing program run by the National Institute of Informatics (NII), much like the Text REtrieval Conference (TREC) run by the National Institute of Standards and Technology (NIST) in the US.

It describes retrieval systems developed in a collaboration between Justsystem Corporation (JSC) and Clairvoyance Corporation (CC). The system uses natural language processing (NLP) techniques, including noun phrase detection, with language-specific extensions, and rich translation resources. It explores issues of noun-phrase weighting, translation weighting, pseudo-relevance feedback, and term-weight merging. The experiments are carefully set up, exploring the interactions of variables through analysis of variance (ANOVA) and reporting statistical significance. A particularly welcome feature is the error analysis, which uses a typology of errors to gain insight into the contribution of various system components to the end result. The results are presented in many tables. The system, testing procedures, and results are all well explained.

There are no earth-shattering results here, but that is true for most papers reporting on IR experiments. There are too many variables influencing retrieval performance; results are often specific to a given context, and grand generalizations are hard to come by. What sets this paper apart is the clear framework used for testing various configurations of system components, and the carefully worked out testing methodology, especially the typology of errors for the failure analysis.

Online Computing Reviews Service
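For readers unfamiliar with the translation structuring the review mentions, the sketch below illustrates the Pirkola-style structured-query idea that work in this area commonly builds on: all dictionary translations of a source-language query term are treated as one synonym set, with member term frequencies summed before weighting. The helper name, the BM25 weighting, the max-df approximation for the set's document frequency, and the parameter values are illustrative assumptions, not the formula used by the JSC/CC system.

```python
from collections import Counter
from math import log

def pirkola_bm25(doc_tokens, query_translations, df, n_docs,
                 avg_len=300.0, k1=1.2, b=0.75):
    """Score one document against a translated query using a
    Pirkola-style structured query. `query_translations` is a list of
    synonym sets, one per source-language term, each holding that
    term's dictionary translations; `df` maps target-language terms to
    document frequencies. Within a set, term frequencies are summed
    and the set df is approximated by the largest member df (a common
    stand-in for the true union df). Weighting here is plain BM25,
    chosen only for illustration."""
    tf = Counter(doc_tokens)
    dl = len(doc_tokens)
    score = 0.0
    for translations in query_translations:
        set_tf = sum(tf[t] for t in translations)   # pool synonym counts
        if set_tf == 0:
            continue
        set_df = max(df.get(t, 0) for t in translations)
        idf = log((n_docs - set_df + 0.5) / (set_df + 0.5) + 1)
        score += idf * set_tf * (k1 + 1) / (
            set_tf + k1 * (1 - b + b * dl / avg_len))
    return score
```

Pooling the translations into one set keeps a single ambiguous translation from dominating the score, which is why structured queries are generally more robust than flat "bag of all translations" queries; whether structuring helps or hurts in a given setting is exactly the kind of question the paper's title- versus description-topic comparison addresses.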
