Comparative study of monolingual and multilingual search models for use with Asian languages

Author:
Jacques Savoy

Université de Neuchâtel, Neuchâtel, Switzerland


ACM Transactions on Asian Language Information Processing, Volume 4, Issue 2, pp. 163–189. https://doi.org/10.1145/1105696.1105701

Published: 01 June 2005

Abstract

Based on the NTCIR-4 test collection, our first objective is to present an overview of the retrieval effectiveness of nine vector-space and two probabilistic models that perform monolingual searches in the Chinese, Japanese, Korean, and English languages. Our second goal is to analyze the relative merits of the various automated and freely available tools used to translate the English-language topics into Chinese, Japanese, or Korean, and then submit the resultant query in order to retrieve pertinent documents written in one of the three Asian languages. We also demonstrate how bilingual searches could be improved by applying both combined query translation strategies and data-fusion approaches. Finally, we address basic problems related to multilingual searches, in which queries written in English are used to search documents written in the English, Chinese, Japanese, and Korean languages.

Index Terms

1. Information systems
  1. Information retrieval

Published in

ACM Transactions on Asian Language Information Processing, Volume 4, Issue 2
June 2005
179 pages
ISSN: 1530-0226
EISSN: 1558-3430
DOI: 10.1145/1105696

Copyright © 2005 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Chinese language
Japanese language
Korean language
Multilingual information retrieval
cross-language information retrieval
natural language processing with Asian languages
results-merging
search engines with Asian languages
Qualifiers
- article
Article Metrics
- Total citations: 35
- Total downloads: 813
- Downloads (last 12 months): 9
- Downloads (last 6 weeks): 2