Abstract
Cross-lingual information retrieval allows users to query mixed-language collections or to probe for documents written in an unfamiliar language. A major difficulty for cross-lingual information retrieval is the detection and translation of out-of-vocabulary (OOV) terms; for OOV terms in Chinese, another difficulty is segmentation. At NTCIR-4, we explored methods for translating and disambiguating OOV terms when using a Chinese query on an English collection. We have developed a new segmentation-free technique for automatic translation of Chinese OOV terms using the web. We have also investigated the effects of the distance factor and window size when using a hidden Markov model to provide disambiguation. Our experiments show that these methods significantly improve effectiveness; in conjunction with our post-translation query expansion technique, effectiveness approaches that of monolingual retrieval.