ABSTRACT
The non-English Web is growing at breakneck speed, but available language processing tools are mostly English-based. Taxonomies are a case in point: while there are plenty of commercial and non-commercial taxonomies for the English Web, taxonomies for other languages are either unavailable or of very limited quality. Given that building taxonomies in all non-English languages is prohibitively expensive, it is natural to ask whether existing English taxonomies can be leveraged, possibly via machine translation, to enable information processing tasks in other languages. Preliminary results presented in this paper indicate that the answer is affirmative for query classification, a task that is essential both for understanding user intent, and thus providing better search results, and for better targeting of search-based advertising, the economic underpinning of commercial Web search engines. We propose a robust method for classifying non-English queries against an English taxonomy and classifier using widely available, off-the-shelf machine translation systems. In particular, we show that by viewing the search results in the query's original language as independent sources of information, we can alleviate the impact of poor-quality or erroneous machine translations. Experiments on Chinese queries yield encouraging results.
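The idea of treating search results as independent sources of information can be illustrated with a minimal sketch. It assumes a simple majority vote over per-result labels, which the abstract does not specify; the function names, the toy translation table, and the keyword classifier are all hypothetical stand-ins for a real machine translation system and a real English classifier.

```python
from collections import Counter
from typing import Callable, Iterable


def classify_by_voting(
    results: Iterable[str],
    translate: Callable[[str], str],
    classify: Callable[[str], str],
) -> str:
    """Translate each search result, classify it in English, return the majority label.

    Majority voting is an assumption of this sketch, not the paper's stated method.
    """
    votes = Counter(classify(translate(text)) for text in results)
    if not votes:
        raise ValueError("no search results to vote on")
    return votes.most_common(1)[0][0]


# Hypothetical stand-in for a machine translation system.
TOY_TRANSLATIONS = {
    "结果一": "cheap flights to Beijing",
    "结果二": "airline tickets and flights",
    "结果三": "apple pie recipe",  # simulates an erroneous translation
}


def toy_translate(text: str) -> str:
    return TOY_TRANSLATIONS.get(text, "")


# Hypothetical stand-in for an English taxonomy classifier.
def toy_classify(english_text: str) -> str:
    words = english_text.lower().split()
    if "flights" in words or "airline" in words:
        return "Travel"
    if "recipe" in words:
        return "Food"
    return "Other"


label = classify_by_voting(["结果一", "结果二", "结果三"], toy_translate, toy_classify)
print(label)
```

In this toy run, one of the three results is mistranslated, yet the other two outvote it. This is the sense in which several independently translated results can dampen the effect of a single bad translation.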
Index Terms
- Cross-lingual query classification: a preliminary study