ABSTRACT
Word embeddings, which are low-dimensional vector representations of vocabulary terms that capture the semantic similarity between them, have recently been shown to achieve impressive performance in many natural language processing tasks. The use of word embeddings in information retrieval, however, has only begun to be studied. In this paper, we explore the use of word embeddings to enhance the accuracy of query language models in the ad-hoc retrieval task. To this end, we propose to use word embeddings to incorporate and weight terms that do not occur in the query but are semantically related to the query terms. We describe two embedding-based query expansion models with different assumptions. Since pseudo-relevance feedback methods, which use the top retrieved documents to update the original query model, are well known to be effective, we also develop an embedding-based relevance model, an extension of the effective and robust relevance model approach. In these models, we transform the similarity values obtained from the widely used cosine similarity with a sigmoid function to obtain more discriminative semantic similarity values. We evaluate our proposed methods using three TREC newswire and web collections. The experimental results demonstrate that the embedding-based methods significantly outperform competitive baselines in most cases. The embedding-based methods are also shown to be more robust than the baselines.
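As a rough illustration of the idea described in the abstract, the following Python sketch passes cosine similarities through a sigmoid and uses the result to weight candidate expansion terms that do not occur in the query. It is not the paper's exact formulation: the abstract does not specify the two expansion models, so the additive scoring, the `slope` and `midpoint` parameters, the `top_k` cutoff, and the toy three-dimensional vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sigmoid_similarity(u, v, slope=10.0, midpoint=0.8):
    """Cosine similarity passed through a sigmoid.

    `slope` and `midpoint` are assumed free parameters (not taken from the
    source); a steep slope pushes similarities below the midpoint toward 0
    and those above it toward 1, which spreads the values apart.
    """
    return 1.0 / (1.0 + math.exp(-slope * (cosine(u, v) - midpoint)))


def expansion_weights(query_terms, embeddings, top_k=2):
    """Weight non-query terms by their summed transformed similarity to the query.

    Summing over query terms is one possible assumption about how terms
    combine; the result is normalized so the kept weights sum to 1.
    """
    scores = {}
    for term, vec in embeddings.items():
        if term in query_terms:
            continue  # only terms absent from the query are candidates
        scores[term] = sum(
            sigmoid_similarity(embeddings[q], vec)
            for q in query_terms
            if q in embeddings
        )
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(score for _, score in top)
    return {term: score / total for term, score in top} if total > 0 else {}


# Hypothetical toy embeddings, for illustration only.
toy = {
    "car": [1.0, 0.1, 0.0],
    "automobile": [0.9, 0.2, 0.0],
    "vehicle": [0.8, 0.3, 0.1],
    "banana": [0.0, 0.1, 1.0],
}
weights = expansion_weights({"car"}, toy, top_k=2)
print(weights)
```

In a query language model setting, weights like these would typically be interpolated with the original query model; the interpolation coefficient would be another tunable parameter.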
Index Terms
- Embedding-based Query Language Models