Abstract
Based on the NTCIR-4 test collection, our first objective is to present an overview of the retrieval effectiveness of nine vector-space and two probabilistic models that perform monolingual searches in the Chinese, Japanese, Korean, and English languages. Our second goal is to analyze the relative merits of the various automated and freely available tools for translating English-language topics into Chinese, Japanese, or Korean, and then to submit the resultant query in order to retrieve pertinent documents written in one of the three Asian languages. We also demonstrate how bilingual searches could be improved by applying both combined query translation strategies and data-fusion approaches. Finally, we address basic problems related to multilingual searches, in which queries written in English are used to search documents written in the English, Chinese, Japanese, and Korean languages.
Index Terms
- Comparative study of monolingual and multilingual search models for use with Asian languages