ABSTRACT
We propose a new unified framework for monolingual (MoIR) and cross-lingual information retrieval (CLIR) which relies on the induction of dense real-valued word vectors, known as word embeddings (WE), from comparable data. To this end, we make several important contributions: (1) We present a novel word representation learning model called Bilingual Word Embeddings Skip-Gram (BWESG), the first model able to learn bilingual word embeddings solely on the basis of document-aligned comparable data. BWESG induces a shared cross-lingual embedding space in which words, queries, and documents from both languages may be represented as dense real-valued vectors; (2) We demonstrate a simple yet effective approach to building document embeddings from single word embeddings by utilizing models from compositional distributional semantics; (3) We build novel ad-hoc MoIR and CLIR models which rely on the induced word and document embeddings and the shared cross-lingual embedding space; (4) Experiments on English and Dutch MoIR, as well as English-to-Dutch and Dutch-to-English CLIR, using the benchmark CLEF 2001-2003 collections and queries, demonstrate the utility of our WE-based MoIR and CLIR models. The best results on the CLEF collections are obtained by combining the WE-based approach with a unigram language model. We also report significant improvements of our WE-based framework over the state-of-the-art framework for learning text representations from comparable data based on latent Dirichlet allocation (LDA) on ad-hoc IR tasks.
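To make the core idea behind BWESG concrete, here is a minimal sketch of the merge-and-shuffle construction it rests on: each document-aligned pair is merged into a single pseudo-bilingual document, and a standard skip-gram model is trained over these merged documents so that words from both languages share contexts. The random shuffling, the toy data, and all hyperparameters below are illustrative assumptions rather than the paper's exact configuration, and gensim's Word2Vec is used as a stand-in skip-gram implementation.

```python
import random
from gensim.models import Word2Vec

def merge_and_shuffle(doc_en, doc_nl, seed=0):
    """Merge a document-aligned pair into one pseudo-bilingual document.

    Training skip-gram over such merged documents lets words from both
    languages appear in each other's contexts; plain random shuffling is
    a simplifying assumption here (the paper studies specific merging
    and shuffling strategies).
    """
    merged = list(doc_en) + list(doc_nl)
    random.Random(seed).shuffle(merged)
    return merged

# aligned_pairs: (english_tokens, dutch_tokens) tuples from a
# document-aligned comparable corpus; hypothetical toy data shown here.
aligned_pairs = [
    (["house", "garden", "tree"], ["huis", "tuin", "boom"]),
    (["dog", "cat"], ["hond", "kat"]),
]

corpus = [merge_and_shuffle(en, nl, seed=i)
          for i, (en, nl) in enumerate(aligned_pairs)]

# Standard skip-gram with negative sampling (sg=1); these hyperparameter
# values are illustrative, not the paper's.
model = Word2Vec(corpus, vector_size=100, window=16, sg=1,
                 negative=25, min_count=1, epochs=5)
```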
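Given the induced bilingual word vectors, queries and documents can be embedded by simple composition. The sketch below, continuing from the model trained above, uses vector addition (one of the elementary composition models from the compositional distributional semantics literature the paper draws on) and ranks documents by cosine similarity in the shared space; treating addition plus cosine ranking as the retrieval model is an assumption made for illustration.

```python
import numpy as np

def embed_text(tokens, model):
    """Compose a text embedding by summing its word vectors (vector
    addition); out-of-vocabulary tokens are simply skipped."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.sum(vecs, axis=0) if vecs else np.zeros(model.vector_size)

def rank_documents(query_tokens, docs_tokens, model):
    """Rank documents by cosine similarity between the query embedding
    and each document embedding in the shared cross-lingual space."""
    q = embed_text(query_tokens, model)
    scores = []
    for i, doc in enumerate(docs_tokens):
        d = embed_text(doc, model)
        denom = np.linalg.norm(q) * np.linalg.norm(d)
        scores.append((i, float(q @ d / denom) if denom else 0.0))
    return sorted(scores, key=lambda x: x[1], reverse=True)

# An English query ranking Dutch documents; the same code serves MoIR
# and the reverse CLIR direction, since all vectors live in one space.
ranking = rank_documents(["house", "garden"],
                         [["huis", "tuin"], ["hond", "kat"]], model)
```

Because queries and documents in either language map into the same vector space, monolingual and cross-lingual retrieval reduce to the same nearest-neighbour computation, which is what makes the framework "unified".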
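The abstract notes that the best CLEF results come from combining the WE-based approach with a unigram language model. One common way to realize such a combination is to interpolate a WE similarity score with a smoothed query-likelihood score; the Dirichlet smoothing, the interpolation weight lam, and the implied score normalization below are assumptions, not the paper's reported setup.

```python
import math
from collections import Counter

def lm_score(query_tokens, doc_tokens, coll_counts, coll_len, mu=1000.0):
    """Log query likelihood under a Dirichlet-smoothed unigram language
    model (smoothing in the style of Zhai and Lafferty; mu is an
    assumed value)."""
    doc_counts = Counter(doc_tokens)
    dl = len(doc_tokens)
    score = 0.0
    for t in query_tokens:
        p_coll = coll_counts.get(t, 0) / coll_len
        score += math.log((doc_counts[t] + mu * p_coll) / (dl + mu) + 1e-12)
    return score

def combined_score(we_cosine, lm_log_likelihood, lam=0.5):
    """Linear interpolation of the two retrieval scores; in practice both
    scores should first be normalized to comparable ranges (lam = 0.5 is
    an assumed, untuned value)."""
    return lam * we_cosine + (1.0 - lam) * lm_log_likelihood
```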