ABSTRACT
Word2vec, a state-of-the-art word embedding technique, has gained considerable interest in the NLP community. Word vector embeddings make it possible to retrieve, for a given word, a list of words used in similar contexts. In this paper, we focus on using word embeddings to enhance retrieval effectiveness. In particular, we construct a generalized language model in which the assumption of mutual independence between a pair of words (say t and t') no longer holds; instead, we use the vector embeddings of the words to derive transformation probabilities between them. Specifically, the event of observing a query term t from a document d is modeled by two distinct events: generating a different term t', either from the document itself or from the collection, and then transforming it into the observed query term t. The first event, generating an intermediate term from the document, is intended to capture how well a term fits contextually within a document, whereas the second, generating it from the collection, aims to address the vocabulary mismatch problem by taking into account other related terms in the collection. Our experiments, conducted on the standard TREC collection, show that our proposed method yields significant improvements over LM and LDA-smoothed LM baselines.
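As a concrete illustration, such a generalized language model can be written as a mixture over the generation events described above. The following is a hedged sketch in our own notation (the interpolation weights lambda, alpha, beta, the normalized transformation probability P(t | t'), and the neighborhood restriction are illustrative assumptions, not necessarily the paper's exact formulation):

P(t | d) = \lambda \hat{P}(t | d) + \alpha \sum_{t' \in d} P(t | t') \hat{P}(t' | d) + \beta \sum_{t' \in N_t} P(t | t') P(t' | C) + (1 - \lambda - \alpha - \beta) P(t | C)

P(t | t') = \frac{\mathrm{sim}(\vec{t}, \vec{t}')}{\sum_{u} \mathrm{sim}(\vec{u}, \vec{t}')}

Here \hat{P}(. | d) is the maximum-likelihood document model, P(. | C) is the collection model, and sim(., .) could be taken as the cosine similarity between word2vec vectors, normalized so that the transformation probabilities sum to one. The sum over t' in d corresponds to the document-based generation event, while the sum over t' in N_t (a neighborhood of related terms in the collection, restricted for tractability) corresponds to the collection-based event that targets vocabulary mismatch.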