ABSTRACT
We propose a new unified framework for monolingual (MoIR) and cross-lingual information retrieval (CLIR) which relies on the induction of dense real-valued word vectors, known as word embeddings (WE), from comparable data. To this end, we make several important contributions: (1) We present a novel word representation learning model called Bilingual Word Embeddings Skip-Gram (BWESG), the first model able to learn bilingual word embeddings solely on the basis of document-aligned comparable data. BWESG induces a shared cross-lingual embedding space in which words, queries, and documents from both languages may be represented as dense real-valued vectors; (2) We demonstrate a simple yet effective approach to building document embeddings from single word embeddings by utilizing models from compositional distributional semantics; (3) We build novel ad-hoc MoIR and CLIR models which rely on the induced word and document embeddings and the shared cross-lingual embedding space; (4) Experiments on English and Dutch MoIR, as well as English-to-Dutch and Dutch-to-English CLIR, using the benchmark CLEF 2001-2003 collections and queries, demonstrate the utility of our WE-based MoIR and CLIR models. The best results on the CLEF collections are obtained by combining the WE-based approach with a unigram language model. We also report significant improvements of our WE-based framework over the state-of-the-art framework for learning text representations from comparable data based on latent Dirichlet allocation (LDA) on ad-hoc IR tasks.
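To make the core idea behind BWESG concrete, here is a minimal sketch of the merge-and-shuffle construction it rests on: each document-aligned pair is merged into a single pseudo-bilingual document, and a standard skip-gram model is trained over these merged documents so that words from both languages share contexts. The random shuffling, the toy data, and all hyperparameters below are illustrative assumptions rather than the paper's exact configuration, and gensim's Word2Vec is used as a stand-in skip-gram implementation.

```python
import random
from gensim.models import Word2Vec

def merge_and_shuffle(doc_en, doc_nl, seed=0):
    """Merge a document-aligned pair into one pseudo-bilingual document.

    Training skip-gram over such merged documents lets words from both
    languages appear in each other's contexts; plain random shuffling is
    a simplifying assumption here (the paper studies specific merging
    and shuffling strategies).
    """
    merged = list(doc_en) + list(doc_nl)
    random.Random(seed).shuffle(merged)
    return merged

# aligned_pairs: (english_tokens, dutch_tokens) tuples from a
# document-aligned comparable corpus; hypothetical toy data shown here.
aligned_pairs = [
    (["house", "garden", "tree"], ["huis", "tuin", "boom"]),
    (["dog", "cat"], ["hond", "kat"]),
]

corpus = [merge_and_shuffle(en, nl, seed=i)
          for i, (en, nl) in enumerate(aligned_pairs)]

# Standard skip-gram with negative sampling (sg=1); these hyperparameter
# values are illustrative, not the paper's.
model = Word2Vec(corpus, vector_size=100, window=16, sg=1,
                 negative=25, min_count=1, epochs=5)
```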
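Given the induced bilingual word vectors, queries and documents can be embedded by simple composition. The sketch below, continuing from the model trained above, uses vector addition (one of the elementary composition models from the compositional distributional semantics literature the paper draws on) and ranks documents by cosine similarity in the shared space; treating addition plus cosine ranking as the retrieval model is an assumption made for illustration.

```python
import numpy as np

def embed_text(tokens, model):
    """Compose a text embedding by summing its word vectors (vector
    addition); out-of-vocabulary tokens are simply skipped."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.sum(vecs, axis=0) if vecs else np.zeros(model.vector_size)

def rank_documents(query_tokens, docs_tokens, model):
    """Rank documents by cosine similarity between the query embedding
    and each document embedding in the shared cross-lingual space."""
    q = embed_text(query_tokens, model)
    scores = []
    for i, doc in enumerate(docs_tokens):
        d = embed_text(doc, model)
        denom = np.linalg.norm(q) * np.linalg.norm(d)
        scores.append((i, float(q @ d / denom) if denom else 0.0))
    return sorted(scores, key=lambda x: x[1], reverse=True)

# An English query ranking Dutch documents; the same code serves MoIR
# and the reverse CLIR direction, since all vectors live in one space.
ranking = rank_documents(["house", "garden"],
                         [["huis", "tuin"], ["hond", "kat"]], model)
```

Because queries and documents in either language map into the same vector space, monolingual and cross-lingual retrieval reduce to the same nearest-neighbour computation, which is what makes the framework "unified".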
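The abstract notes that the best CLEF results come from combining the WE-based approach with a unigram language model. One common way to realize such a combination is to interpolate a WE similarity score with a smoothed query-likelihood score; the Dirichlet smoothing, the interpolation weight lam, and the implied score normalization below are assumptions, not the paper's reported setup.

```python
import math
from collections import Counter

def lm_score(query_tokens, doc_tokens, coll_counts, coll_len, mu=1000.0):
    """Log query likelihood under a Dirichlet-smoothed unigram language
    model (smoothing in the style of Zhai and Lafferty; mu is an
    assumed value)."""
    doc_counts = Counter(doc_tokens)
    dl = len(doc_tokens)
    score = 0.0
    for t in query_tokens:
        p_coll = coll_counts.get(t, 0) / coll_len
        score += math.log((doc_counts[t] + mu * p_coll) / (dl + mu) + 1e-12)
    return score

def combined_score(we_cosine, lm_log_likelihood, lam=0.5):
    """Linear interpolation of the two retrieval scores; in practice both
    scores should first be normalized to comparable ranges (lam = 0.5 is
    an assumed, untuned value)."""
    return lam * we_cosine + (1.0 - lam) * lm_log_likelihood
```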