DOI: 10.1145/564376.564387
Article

Two-stage language models for information retrieval

Published: 11 August 2002

ABSTRACT

The optimal settings of retrieval parameters often depend on both the document collection and the query, and are usually found through empirical tuning. In this paper, we propose a family of two-stage language models for information retrieval that explicitly captures the different influences of the query and document collection on the optimal settings of retrieval parameters. As a special case, we present a two-stage smoothing method that allows us to estimate the smoothing parameters completely automatically. In the first stage, the document language model is smoothed using a Dirichlet prior with the collection language model as the reference model. In the second stage, the smoothed document language model is further interpolated with a query background language model. We propose a leave-one-out method for estimating the Dirichlet parameter of the first stage, and the use of document mixture models for estimating the interpolation parameter of the second stage. Evaluation on five different databases and four types of queries indicates that the two-stage smoothing method with the proposed parameter estimation methods consistently gives retrieval performance that is close to, or better than, the best results achieved using a single smoothing method and exhaustive parameter search on the test data.
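
The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the toy documents, and the values of `mu` (Dirichlet prior parameter) and `lam` (interpolation weight) are all assumptions chosen for illustration. The sketch also reuses the collection model as the query background model, which is a simplification made here. The leave-one-out and mixture-model estimation of the two parameters is not shown, so both are set by hand.

```python
import math
from collections import Counter


def two_stage_prob(word, doc_counts, doc_len, p_coll, p_query_bg, mu, lam):
    """Two-stage smoothed probability of `word` under one document's model."""
    # Stage 1: Dirichlet-prior smoothing with the collection model as reference.
    stage1 = (doc_counts.get(word, 0) + mu * p_coll[word]) / (doc_len + mu)
    # Stage 2: linear interpolation with the query background model.
    return (1.0 - lam) * stage1 + lam * p_query_bg[word]


def score_query(query, doc, p_coll, p_query_bg, mu, lam):
    """Query log-likelihood under the two-stage smoothed document model."""
    counts = Counter(doc)
    return sum(
        math.log(two_stage_prob(w, counts, len(doc), p_coll, p_query_bg, mu, lam))
        for w in query
    )


# Toy collection (hypothetical data, for illustration only).
docs = [
    ["language", "model", "retrieval"],
    ["retrieval", "query", "query", "model"],
]

# Maximum-likelihood collection language model over all tokens.
coll_counts = Counter(w for d in docs for w in d)
coll_total = sum(coll_counts.values())
p_coll = {w: c / coll_total for w, c in coll_counts.items()}

# Hand-picked parameters; the paper estimates these automatically.
mu, lam = 10.0, 0.3

# Rank documents for a one-word query, best first.
query = ["query"]
ranking = sorted(
    range(len(docs)),
    key=lambda i: score_query(query, docs[i], p_coll, p_coll, mu, lam),
    reverse=True,
)
```

Because each stage mixes proper distributions, the smoothed probabilities still sum to one over the vocabulary. Setting `lam` to 0 reduces the sketch to Dirichlet-prior smoothing alone.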

    • Published in

      SIGIR '02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
      August 2002
      478 pages
      ISBN:1581135610
      DOI:10.1145/564376

      Copyright © 2002 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • Article

      Acceptance Rates

      SIGIR '02 paper acceptance rate: 44 of 219 submissions, 20%
      Overall acceptance rate: 792 of 3,983 submissions, 20%
