Comparative study of monolingual and multilingual search models for use with Asian languages

Published: 01 June 2005

Abstract

Using the NTCIR-4 test collection, our first objective is to present an overview of the retrieval effectiveness of nine vector-space and two probabilistic models that perform monolingual searches in Chinese, Japanese, Korean, and English. Our second goal is to analyze the relative merits of various automated, freely available tools for translating English-language topics into Chinese, Japanese, or Korean, with the resulting query then submitted to retrieve pertinent documents written in one of the three Asian languages. We also demonstrate how bilingual searches can be improved by applying both combined query translation strategies and data-fusion approaches. Finally, we address basic problems related to multilingual searches, in which queries written in English are used to search documents written in English, Chinese, Japanese, and Korean.

