Research Article
DOI: 10.1145/2766462.2767752

Monolingual and Cross-Lingual Information Retrieval Models Based on (Bilingual) Word Embeddings

Published: 09 August 2015

ABSTRACT

We propose a new unified framework for monolingual (MoIR) and cross-lingual information retrieval (CLIR) which relies on the induction of dense real-valued word vectors, known as word embeddings (WE), from comparable data. To this end, we make several important contributions: (1) We present a novel word representation learning model called Bilingual Word Embeddings Skip-Gram (BWESG), the first model able to learn bilingual word embeddings solely on the basis of document-aligned comparable data; (2) We demonstrate a simple yet effective approach to building document embeddings from single word embeddings by utilizing models from compositional distributional semantics. BWESG induces a shared cross-lingual embedding vector space in which words, queries, and documents may all be represented as dense real-valued vectors; (3) We build novel ad-hoc MoIR and CLIR models which rely on the induced word and document embeddings and the shared cross-lingual embedding space; (4) Experiments on English and Dutch MoIR, as well as on English-to-Dutch and Dutch-to-English CLIR, using the benchmark CLEF 2001-2003 collections and queries demonstrate the utility of our WE-based MoIR and CLIR models. The best results on the CLEF collections are obtained by combining the WE-based approach with a unigram language model. We also report significant improvements of our WE-based framework over the state-of-the-art framework for learning text representations from comparable data based on latent Dirichlet allocation (LDA) in ad-hoc IR tasks.
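
To make the retrieval idea concrete, the following minimal sketch (not the authors' implementation) shows, in Python, how queries and documents can be ranked in a shared embedding space: query and document vectors are composed from (bilingual) word embeddings by addition, documents are scored by cosine similarity to the query, and the embedding score can be interpolated with a unigram language-model score. The additive composition, the interpolation scheme, and the names embeddings and lm_score are illustrative assumptions, not details taken from the paper.

import numpy as np

def embed_text(tokens, embeddings, dim=300):
    # Compose a text vector by summing the embeddings of its in-vocabulary
    # tokens, then length-normalize it; dim must match the embedding size.
    vec = np.zeros(dim)
    for tok in tokens:
        if tok in embeddings:
            vec += embeddings[tok]
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0.0 else vec

def embedding_score(query_tokens, doc_tokens, embeddings):
    # Cosine similarity between the composed query and document vectors.
    q = embed_text(query_tokens, embeddings)
    d = embed_text(doc_tokens, embeddings)
    return float(np.dot(q, d))

def combined_score(query_tokens, doc_tokens, embeddings, lm_score, lam=0.5):
    # One possible way to combine the two signals: linear interpolation of
    # the embedding score with a unigram language-model score supplied by
    # the caller (in practice the two scores must be put on a comparable scale).
    return (lam * embedding_score(query_tokens, doc_tokens, embeddings)
            + (1.0 - lam) * lm_score(query_tokens, doc_tokens))

Because BWESG places words of both languages in one shared space, the same ranking function can serve MoIR and CLIR alike; only the language of the query tokens changes.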

Published in

SIGIR '15: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval
August 2015, 1198 pages
ISBN: 9781450336215
DOI: 10.1145/2766462
Copyright © 2015 ACM
Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

SIGIR '15 paper acceptance rate: 70 of 351 submissions (20%). Overall acceptance rate: 792 of 3,983 submissions (20%).

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader