ABSTRACT
Obtaining judgments from human raters is a vital part of designing search engine evaluations. Today, a discrepancy exists between how judgments are acquired from raters (the training phase) and how the responses are used for retrieval evaluation (the evaluation phase). The discrepancy stems from an inconsistency in how the information is represented in the two phases. During training, raters are asked to provide a relevance score for an individual result in the context of a query, whereas evaluation is performed on ordered lists of search results, with each result's position relative to the other results taken into account. As an alternative to learning to rank from relevance judgments on individual search results, increasing attention has recently turned to the theory and practice of learning from answers to combinatorial questions about sets of search results: during training, raters are asked to rank small sets of results (typically pairs).
We first compare human raters' responses to questions about the relevance of individual results with their responses to questions about the relative relevance of pairs of results. We empirically show that neither type of response can be deduced from the other, and that the added context created when results are shown together changes the raters' evaluation process. Since pairwise judgments relate directly to ranking, we conclude that they are more accurate for that purpose. Going beyond pairs, we show that triplets do not contain significantly more information than pairs for the purpose of measuring statistical preference. Together, these two results establish good stability properties of pairwise comparisons for learning to rank. We further analyze scenarios in which results of varying quality are added as "decoys".
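To make the pointwise-versus-pairwise comparison concrete, here is a minimal sketch of how agreement between the two response types might be measured. The data layout (a dict of pointwise scores and (a, b, winner) triples for pairwise judgments) is a hypothetical choice of ours, and this is an illustration rather than the paper's exact analysis:

```python
def pointwise_pairwise_agreement(pointwise, pairwise):
    """Fraction of pairwise judgments whose winner also has the higher
    pointwise score. `pointwise` maps result id -> relevance score;
    `pairwise` is an iterable of (a, b, winner) triples (hypothetical
    format). Pointwise ties are counted separately as uninformative."""
    agree = disagree = ties = 0
    for a, b, winner in pairwise:
        sa, sb = pointwise[a], pointwise[b]
        if sa == sb:
            ties += 1
        elif (sa > sb) == (winner == a):
            agree += 1
        else:
            disagree += 1
    informative = agree + disagree
    return (agree / informative if informative else float("nan")), ties

# Toy usage: the second pairwise judgment contradicts the pointwise order.
pointwise = {"d1": 3, "d2": 1, "d3": 2}
pairwise = [("d1", "d2", "d1"), ("d2", "d3", "d2")]
rate, ties = pointwise_pairwise_agreement(pointwise, pairwise)  # rate == 0.5
```

An agreement rate well below 1.0 on real judgment data is the kind of evidence that pairwise preferences cannot simply be deduced from pointwise scores.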
A recurring concern in work on pairwise comparison is the quadratic number of pairs in a set of results. Which preferences do we choose to solicit from paid raters? Can we provably avoid a quadratic cost? We employ results from statistical learning theory to show that the quadratic cost can be provably eliminated in certain cases. More precisely, we show that to obtain a ranking in which each element is, on average, O(n/C) positions away from its position in the optimal ranking, it suffices to sample O(nC²) pairs uniformly at random, for any C > 0. We also present an active learning algorithm that samples pairs adaptively, and we conjecture that it provides a further improvement.
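As a rough illustration of the uniform-sampling regime (not the paper's algorithm: the Borda-style win-rate aggregation and the noisy comparison oracle below are assumptions made for this sketch), sampling O(nC²) pairs and ranking by empirical win rate looks like this:

```python
import random
from collections import defaultdict

def rank_from_sampled_pairs(items, prefer, C):
    """Sample O(n * C^2) pairs uniformly at random, query a (possibly
    noisy) preference oracle `prefer(a, b)` (True if a beats b), and
    rank items by empirical win rate. The win-rate (Borda-style)
    aggregation is an illustrative choice, not the paper's method."""
    n = len(items)
    num_samples = int(n * C * C)  # O(nC^2) uniformly sampled pairs
    wins = defaultdict(int)
    appearances = defaultdict(int)
    for _ in range(num_samples):
        a, b = random.sample(items, 2)  # uniform random distinct pair
        appearances[a] += 1
        appearances[b] += 1
        wins[a if prefer(a, b) else b] += 1
    def win_rate(x):  # items never sampled default to rate 0
        return wins[x] / appearances[x] if appearances[x] else 0.0
    return sorted(items, key=win_rate, reverse=True)

# Usage: recover an approximate order from a 10%-noise oracle.
truth = list(range(100))
def noisy_prefer(a, b, p_err=0.1):
    return (a < b) != (random.random() < p_err)
approx = rank_from_sampled_pairs(truth, noisy_prefer, C=5)
```

In this sketch, a larger C buys a tighter approximation (average displacement O(n/C)) at the cost of quadratically more sampled pairs.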