Abstract
Interleaving is an increasingly popular technique for evaluating information retrieval systems based on implicit user feedback. While a number of isolated studies have analyzed how this technique agrees with conventional offline evaluation approaches and other online techniques, a complete picture of its efficiency and effectiveness is still lacking. In this paper we extend and combine the body of empirical evidence regarding interleaving, and provide a comprehensive analysis of interleaving using data from two major commercial search engines and a retrieval system for scientific literature. In particular, we analyze the agreement of interleaving with manual relevance judgments and observational implicit feedback measures, estimate the statistical efficiency of interleaving, and explore the relative performance of different interleaving variants. We also show how to learn improved credit-assignment functions for clicks that further increase the sensitivity of interleaving.
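The core technique discussed above can be illustrated with a minimal sketch of Team-Draft interleaving, one of the interleaving variants analyzed in this line of work. This is a simplified illustration, not the paper's exact implementation: the function names and the simple linear credit function are assumptions for the example. Each round, a coin flip decides which ranker picks first; each ranker then contributes its highest-ranked document not yet shown, and clicks on a document are credited to the team that contributed it.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng=random.random):
    """Interleave two rankings with the Team-Draft method.

    Per round, a coin flip (rng() < 0.5) decides which ranker drafts
    first; each ranker then appends its highest-ranked document not
    already in the interleaved list, and the pick joins that ranker's
    team for later credit assignment.
    """
    interleaved, team_a, team_b, seen = [], set(), set(), set()
    all_docs = set(ranking_a) | set(ranking_b)
    while len(seen) < len(all_docs):
        order = [(ranking_a, team_a), (ranking_b, team_b)]
        if rng() >= 0.5:
            order.reverse()
        for ranking, team in order:
            # Draft this ranker's best document not yet shown, if any.
            pick = next((d for d in ranking if d not in seen), None)
            if pick is not None:
                interleaved.append(pick)
                seen.add(pick)
                team.add(pick)
    return interleaved, team_a, team_b

def credit(clicked_docs, team_a, team_b):
    """Simple per-click credit: count clicks on each team's documents.

    Returns a positive value if ranker A is preferred on this
    impression, negative if ranker B is, and zero for a tie.
    """
    a = sum(1 for d in clicked_docs if d in team_a)
    b = sum(1 for d in clicked_docs if d in team_b)
    return a - b
```

Aggregating the sign of `credit` over many impressions yields the interleaving comparison; the abstract's point about learned credit-assignment functions amounts to replacing this uniform per-click credit with a weighting learned to maximize sensitivity.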
Index Terms
- Large-scale validation and analysis of interleaved search evaluation