DOI: 10.1145/2766462.2767780
short-paper

Word Embedding based Generalized Language Model for Information Retrieval

Published: 09 August 2015

ABSTRACT

Word2vec, a state-of-the-art word embedding technique, has gained a lot of interest in the NLP community. Embedding words as vectors makes it possible to retrieve a list of words that are used in contexts similar to those of a given word. In this paper, we focus on using word embeddings to enhance retrieval effectiveness. In particular, we construct a generalized language model in which the mutual independence between a pair of words (say t and t') no longer holds. Instead, we use the vector embeddings of the words to derive the transformation probabilities between words. Specifically, the event of observing a term t in the query from a document d is modeled by two distinct events: generating a different term t', either from the document itself or from the collection, and then transforming it to the observed query term t. The first event, generating an intermediate term from the document, is intended to capture how well a term contextually fits within a document, whereas the second, generating it from the collection, aims to address the vocabulary mismatch problem by taking into account other related terms in the collection. Our experiments, conducted on the standard TREC collection, show that our proposed method yields significant improvements over LM and LDA-smoothed LM baselines.
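
The mixture described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the estimation formulas, so everything specific here is an assumption made for illustration: the transformation probability P(t | t') is taken to be non-negative cosine similarity normalized over the vocabulary, the four components (direct generation from the document, transformation via a document term, transformation via a collection term, direct generation from the collection) are combined linearly with hypothetical weights `lam`, `alpha` and `beta`, and the toy vectors and counts are invented. It also assumes every term has an embedding.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def transform_prob(t, t_prime, vecs):
    """Assumed P(t | t'): clipped cosine similarity, normalized over the vocabulary."""
    def sim(a, b):
        return max(0.0, cosine(vecs[a], vecs[b]))

    z = sum(sim(w, t_prime) for w in vecs)
    return sim(t, t_prime) / z if z > 0 else 0.0


def glm_prob(t, doc_tf, coll_tf, vecs, lam=0.2, alpha=0.3, beta=0.2):
    """Probability of query term t under the sketched four-way mixture."""
    doc_len = sum(doc_tf.values())
    coll_len = sum(coll_tf.values())
    p_doc = doc_tf.get(t, 0) / doc_len      # t generated directly from d
    p_coll = coll_tf.get(t, 0) / coll_len   # t generated directly from C
    # Generate an intermediate term t' from the document, then transform it to t.
    via_doc = sum(transform_prob(t, tp, vecs) * n / doc_len
                  for tp, n in doc_tf.items())
    # Generate an intermediate term t' from the collection, then transform it to t.
    via_coll = sum(transform_prob(t, tp, vecs) * n / coll_len
                   for tp, n in coll_tf.items())
    return (lam * p_doc + alpha * via_doc + beta * via_coll
            + (1.0 - lam - alpha - beta) * p_coll)


def score(query, doc_tf, coll_tf, vecs, **weights):
    """Query log-likelihood: sum of log term probabilities."""
    return sum(math.log(glm_prob(t, doc_tf, coll_tf, vecs, **weights))
               for t in query)


# Invented toy data: 2-d "embeddings", one document, and collection counts.
VECS = {"car": (1.0, 0.0), "automobile": (0.9, 0.1), "banana": (0.0, 1.0)}
DOC = Counter(["car", "car", "car"])
COLL = Counter({"car": 3, "automobile": 1, "banana": 1})

if __name__ == "__main__":
    for term in VECS:
        print(term, round(glm_prob(term, DOC, COLL, VECS), 4))
```

In this toy setup "automobile" and "banana" are both absent from the document and equally frequent in the collection, yet "automobile" receives the higher probability because it lies close to the document term "car" in the embedding space. This is the vocabulary mismatch effect the abstract refers to. With `alpha` and `beta` set to zero, the sketch reduces to a linearly interpolated language model.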


      • Published in

        SIGIR '15: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval
        August 2015
        1198 pages
        ISBN:9781450336215
        DOI:10.1145/2766462

        Copyright © 2015 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        SIGIR '15 paper acceptance rate: 70 of 351 submissions, 20%. Overall acceptance rate: 792 of 3,983 submissions, 20%.
