ABSTRACT
The optimal settings of retrieval parameters often depend on both the document collection and the query, and are usually found through empirical tuning. In this paper, we propose a family of two-stage language models for information retrieval that explicitly captures the different influences of the query and document collection on the optimal settings of retrieval parameters. As a special case, we present a two-stage smoothing method that allows us to estimate the smoothing parameters completely automatically. In the first stage, the document language model is smoothed using a Dirichlet prior with the collection language model as the reference model. In the second stage, the smoothed document language model is further interpolated with a query background language model. We propose a leave-one-out method for estimating the Dirichlet parameter of the first stage, and the use of document mixture models for estimating the interpolation parameter of the second stage. Evaluation on five different databases and four types of queries indicates that the two-stage smoothing method with the proposed parameter estimation methods consistently gives retrieval performance that is close to, or better than, the best results achieved using a single smoothing method and exhaustive parameter search on the test data.
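The two smoothing stages described above can be illustrated with a short Python sketch. It is not the authors' implementation: the function names, the dictionary-based models, the toy collection, and the fixed values of `mu` (Dirichlet parameter) and `lam` (interpolation weight) are all assumptions made for illustration, whereas the paper estimates both parameters automatically. The sketch also reuses the collection model as the query background model, which is a simplifying assumption.

```python
import math
from collections import Counter


def two_stage_prob(word, doc_counts, doc_len, coll_model, bg_model, mu, lam):
    """Smoothed p(word | document) under a two-stage scheme.

    Stage 1: Dirichlet-prior smoothing with the collection model as reference.
    Stage 2: linear interpolation with a query background model.
    """
    stage1 = (doc_counts.get(word, 0) + mu * coll_model.get(word, 0.0)) / (doc_len + mu)
    return (1.0 - lam) * stage1 + lam * bg_model.get(word, 0.0)


def query_log_likelihood(query, doc, coll_model, bg_model, mu, lam):
    """Log-likelihood of the query under the smoothed document model.

    Assumes every query word has nonzero probability in at least one model.
    """
    counts = Counter(doc)
    return sum(
        math.log(two_stage_prob(w, counts, len(doc), coll_model, bg_model, mu, lam))
        for w in query
    )


# Toy collection (hypothetical data, for illustration only).
docs = [
    ["language", "model", "retrieval"],
    ["smoothing", "language", "language"],
]
coll_counts = Counter(w for d in docs for w in d)
total = sum(coll_counts.values())
coll_model = {w: c / total for w, c in coll_counts.items()}
bg_model = coll_model  # assumption: background model approximated by the collection model

query = ["smoothing", "language"]
mu, lam = 2.0, 0.1  # hand-picked here; the paper estimates these from data
scores = [query_log_likelihood(query, d, coll_model, bg_model, mu, lam) for d in docs]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking, scores)
```

In this sketch the first stage depends only on the document and the collection, while the second depends only on the background model and `lam`. This mirrors the separation of collection and query influences that the abstract describes.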
Index Terms
- Two-stage language models for information retrieval