
Large-scale validation and analysis of interleaved search evaluation

Published: 06 March 2012

Abstract

Interleaving is an increasingly popular technique for evaluating information retrieval systems based on implicit user feedback. While a number of isolated studies have analyzed how this technique agrees with conventional offline evaluation approaches and other online techniques, a complete picture of its efficiency and effectiveness is still lacking. In this paper we extend and combine the body of empirical evidence regarding interleaving, and provide a comprehensive analysis of interleaving using data from two major commercial search engines and a retrieval system for scientific literature. In particular, we analyze the agreement of interleaving with manual relevance judgments and observational implicit feedback measures, estimate the statistical efficiency of interleaving, and explore the relative performance of different interleaving variants. We also show how to learn improved credit-assignment functions for clicks that further increase the sensitivity of interleaving.
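For concreteness, below is a minimal Python sketch of the team-draft style of interleaving discussed in the paper, paired with the simplest click credit-assignment rule (one unit of credit per click on a ranker's contributed result); the paper itself studies several interleaving variants and shows how to learn improved credit functions. The function names, document identifiers, and example clicks here are illustrative assumptions, not taken from the paper.

    import random

    def team_draft_interleave(ranking_a, ranking_b):
        """Merge two ranked lists into one interleaved list, recording which
        ranker contributed ("owns") each shown result."""
        interleaved = []
        team = {}  # doc -> "A" or "B"
        a_count = b_count = 0
        while True:
            # The ranker with fewer contributions picks next; ties are
            # broken by a coin flip, as in the team-draft formulation.
            if a_count < b_count or (a_count == b_count and random.random() < 0.5):
                order = [("A", ranking_a), ("B", ranking_b)]
            else:
                order = [("B", ranking_b), ("A", ranking_a)]
            placed = False
            for label, ranking in order:
                # Highest-ranked document from this ranker not yet shown.
                doc = next((d for d in ranking if d not in team), None)
                if doc is not None:
                    interleaved.append(doc)
                    team[doc] = label
                    if label == "A":
                        a_count += 1
                    else:
                        b_count += 1
                    placed = True
                    break
            if not placed:  # both rankings exhausted
                return interleaved, team

    def click_credit(clicked_docs, team):
        """Simplest credit-assignment rule: each click gives one unit of
        credit to the ranker that contributed the clicked result."""
        credit_a = sum(1 for d in clicked_docs if team.get(d) == "A")
        credit_b = sum(1 for d in clicked_docs if team.get(d) == "B")
        return credit_a, credit_b

    # Hypothetical example: two rankers, one observed click on "d3".
    shown, team = team_draft_interleave(["d1", "d2", "d3"], ["d3", "d1", "d4"])
    wins_a, wins_b = click_credit(["d3"], team)  # the ranker owning "d3" is credited

In an interleaving experiment, such per-impression credits are aggregated over many queries to decide which ranker users prefer; the sensitivity of that aggregate statistic is what the paper analyzes and improves.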


Published in

      ACM Transactions on Information Systems, Volume 30, Issue 1 (February 2012), 193 pages
      ISSN: 1046-8188
      EISSN: 1558-2868
      DOI: 10.1145/2094072

      Copyright © 2012 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 6 March 2012
      • Accepted: 1 December 2011
      • Revised: 1 October 2011
      • Received: 1 February 2011
      Published in TOIS, Volume 30, Issue 1
