Abstract
Interleaving is an increasingly popular technique for evaluating information retrieval systems based on implicit user feedback. While a number of isolated studies have analyzed how this technique agrees with conventional offline evaluation approaches and other online techniques, a complete picture of its efficiency and effectiveness is still lacking. In this paper we extend and combine the body of empirical evidence regarding interleaving, and provide a comprehensive analysis of interleaving using data from two major commercial search engines and a retrieval system for scientific literature. In particular, we analyze the agreement of interleaving with manual relevance judgments and observational implicit feedback measures, estimate the statistical efficiency of interleaving, and explore the relative performance of different interleaving variants. We also show how to learn improved credit-assignment functions for clicks that further increase the sensitivity of interleaving.
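The core technique discussed above can be illustrated with a minimal sketch of Team-Draft interleaving, one of the interleaving variants analyzed in this line of work. This is a simplified illustration, not the paper's exact implementation: the function names and the simple linear credit function are assumptions for the example. Each round, a coin flip decides which ranker picks first; each ranker then contributes its highest-ranked document not yet shown, and clicks on a document are credited to the team that contributed it.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng=random.random):
    """Interleave two rankings with the Team-Draft method.

    Per round, a coin flip (rng() < 0.5) decides which ranker drafts
    first; each ranker then appends its highest-ranked document not
    already in the interleaved list, and the pick joins that ranker's
    team for later credit assignment.
    """
    interleaved, team_a, team_b, seen = [], set(), set(), set()
    all_docs = set(ranking_a) | set(ranking_b)
    while len(seen) < len(all_docs):
        order = [(ranking_a, team_a), (ranking_b, team_b)]
        if rng() >= 0.5:
            order.reverse()
        for ranking, team in order:
            # Draft this ranker's best document not yet shown, if any.
            pick = next((d for d in ranking if d not in seen), None)
            if pick is not None:
                interleaved.append(pick)
                seen.add(pick)
                team.add(pick)
    return interleaved, team_a, team_b

def credit(clicked_docs, team_a, team_b):
    """Simple per-click credit: count clicks on each team's documents.

    Returns a positive value if ranker A is preferred on this
    impression, negative if ranker B is, and zero for a tie.
    """
    a = sum(1 for d in clicked_docs if d in team_a)
    b = sum(1 for d in clicked_docs if d in team_b)
    return a - b
```

Aggregating the sign of `credit` over many impressions yields the interleaving comparison; the abstract's point about learned credit-assignment functions amounts to replacing this uniform per-click credit with a weighting learned to maximize sensitivity.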
Index Terms
- Large-scale validation and analysis of interleaved search evaluation