Chinese OOV translation and post-translation query expansion in Chinese-English cross-lingual information retrieval

Published: 01 June 2005

Abstract

Cross-lingual information retrieval allows users to query mixed-language collections or to probe for documents written in an unfamiliar language. A major difficulty for cross-lingual information retrieval is the detection and translation of out-of-vocabulary (OOV) terms; for OOV terms in Chinese, another difficulty is segmentation. At NTCIR-4, we explored methods for translating and disambiguating OOV terms when using a Chinese query on an English collection. We have developed a new segmentation-free technique for automatic translation of Chinese OOV terms using the web. We have also investigated the effects of the distance factor and window size when using a hidden Markov model to provide disambiguation. Our experiments show that these methods significantly improve effectiveness; in conjunction with our post-translation query expansion technique, effectiveness approaches that of monolingual retrieval.
