Article

Two-stage language models for information retrieval

Authors:
ChengXiang Zhai, Carnegie Mellon University, Pittsburgh, PA
John Lafferty, Carnegie Mellon University, Pittsburgh, PA

SIGIR '02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, August 2002, Pages 49–56
https://doi.org/10.1145/564376.564387

Published: 11 August 2002

ABSTRACT

The optimal settings of retrieval parameters often depend on both the document collection and the query, and are usually found through empirical tuning. In this paper, we propose a family of two-stage language models for information retrieval that explicitly captures the different influences of the query and document collection on the optimal settings of retrieval parameters. As a special case, we present a two-stage smoothing method that allows us to estimate the smoothing parameters completely automatically. In the first stage, the document language model is smoothed using a Dirichlet prior with the collection language model as the reference model. In the second stage, the smoothed document language model is further interpolated with a query background language model. We propose a leave-one-out method for estimating the Dirichlet parameter of the first stage, and the use of document mixture models for estimating the interpolation parameter of the second stage. Evaluation on five different databases and four types of queries indicates that the two-stage smoothing method with the proposed parameter estimation methods consistently gives retrieval performance that is close to, or better than, the best results achieved using a single smoothing method and exhaustive parameter search on the test data.
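
The two smoothing stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy documents, and the hand-picked values of `mu` (Dirichlet parameter) and `lam` (interpolation weight) are assumptions made for illustration, and the leave-one-out and mixture-model estimators for those parameters are not implemented here. The sketch also assumes the query background model is simply the collection model, and that every query word has nonzero probability under it.

```python
import math
from collections import Counter

def build_model(docs):
    """Maximum-likelihood unigram model over a list of tokenized documents."""
    counts = Counter(w for d in docs for w in d)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def two_stage_prob(word, doc_counts, doc_len, coll_model, query_bg_model, mu, lam):
    """p(word | doc) after both smoothing stages."""
    # Stage 1: Dirichlet prior smoothing, collection model as reference.
    p_stage1 = (doc_counts.get(word, 0) + mu * coll_model.get(word, 0.0)) / (doc_len + mu)
    # Stage 2: linear interpolation with the query background model.
    return (1.0 - lam) * p_stage1 + lam * query_bg_model.get(word, 0.0)

def query_log_likelihood(query, doc, coll_model, query_bg_model, mu, lam):
    """Log-likelihood of the query under the smoothed document model."""
    counts = Counter(doc)
    return sum(
        math.log(two_stage_prob(w, counts, len(doc), coll_model, query_bg_model, mu, lam))
        for w in query
    )

# Toy data (hypothetical); mu and lam are set by hand, not estimated.
docs = [["language", "model", "retrieval"],
        ["smoothing", "language", "model", "model"]]
coll_model = build_model(docs)
query_bg_model = coll_model  # assumption: background model = collection model
scores = [query_log_likelihood(["retrieval"], d, coll_model, query_bg_model,
                               mu=10.0, lam=0.5) for d in docs]
```

With these toy inputs, the first document, which contains the query word, receives the higher score. Because each stage mixes proper distributions, the smoothed probabilities for a document still sum to one over the vocabulary.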

Index Terms

Two-stage language models for information retrieval
1. Information systems
  1. Information retrieval

Published in
SIGIR '02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
August 2002
478 pages
ISBN: 1581135610
DOI: 10.1145/564376
General Chair: Kalervo Järvelin (University of Tampere, Finland)
Program Chairs: Micheline Beaulieu (University of Sheffield, UK), Ricardo Baeza-Yates (University of Chile, Chile), Sung Hyon Myaeng (Chungnam National University, Korea)
Copyright © 2002 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Dirichlet prior
interpolation
leave-one-out
mixture model
parameter estimation
risk minimization
two-stage language models
two-stage smoothing
Qualifiers
- Article
Conference

Acceptance Rates
SIGIR '02 Paper Acceptance Rate: 44 of 219 submissions, 20%
Overall Acceptance Rate: 792 of 3,983 submissions, 20%
Article Metrics
- Total Citations: 106
- Total Downloads: 1,653
- Downloads (Last 12 months): 37
- Downloads (Last 6 weeks): 5