Research article
DOI: 10.1145/1935826.1935850

Ranking from pairs and triplets: information quality, evaluation methods and query complexity

Published: 09 February 2011

ABSTRACT

Obtaining judgments from human raters is a vital part of designing search engine evaluation. Today, a discrepancy exists between how judgments are acquired from raters (the training phase) and how those responses are used for retrieval evaluation (the evaluation phase). The discrepancy arises because the information is represented differently in the two phases: during training, raters are asked to provide a relevance score for an individual result in the context of a query, whereas evaluation is performed on ordered lists of search results, with each result's position relative to the others taken into account. As an alternative to learning to rank from relevance judgments on individual search results, increasing attention has recently turned to the theory and practice of learning from answers to combinatorial questions about sets of search results; that is, during training, raters are asked to rank small sets (typically pairs).

We first compare human rater responses to questions about the relevance of individual results with their responses to questions about the relevance of pairs of results. We show empirically that neither type of response can be deduced from the other, and that the added context created when results are shown together changes the raters' evaluation process. Since pairwise judgments are directly related to ranking, we conclude that they are more accurate for that purpose. Going beyond pairs, we show that triplets do not contain significantly more information than pairs for the purpose of measuring statistical preference. Together, these two results establish good stability properties of pairwise comparisons for learning to rank. We further analyze scenarios in which results of varying quality are added as "decoys".
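The abstract does not describe the experimental protocol; as a rough illustration of the kind of comparison involved, the following sketch derives pairwise preferences from pointwise relevance scores and measures their agreement with directly elicited pairwise judgments. The toy data and all names are hypothetical, not taken from the paper.

```python
from itertools import combinations

# Pointwise relevance scores a rater gave to individual results (hypothetical toy data).
pointwise = {"a": 3, "b": 3, "c": 1}

# Directly elicited pairwise preferences: (i, j) with i < j -> preferred result (toy data).
pairwise = {("a", "b"): "b", ("a", "c"): "a", ("b", "c"): "b"}

agree = total = 0
for i, j in combinations(sorted(pointwise), 2):
    # A preference is implied by the pointwise scores only when they are not tied.
    if pointwise[i] == pointwise[j]:
        continue
    implied = i if pointwise[i] > pointwise[j] else j
    total += 1
    agree += implied == pairwise[(i, j)]

print(f"agreement on non-tied pairs: {agree}/{total}")
```

A low agreement rate in such a comparison would be one symptom of the effect the paper reports: preferences elicited on pairs are not recoverable from scores given to results in isolation.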

A recurring concern in work on pairwise comparisons is the quadratic number of pairs in a set of results. Which preferences should we solicit from paid raters? Can the quadratic cost be provably eliminated? We employ results from statistical learning theory to show that it can in certain cases. More precisely, we show that in order to obtain a ranking in which each element is, on average, O(n/C) positions away from its position in the optimal ranking, one needs to sample O(nC^2) pairs uniformly at random, for any C > 0. We also present an active learning algorithm that samples pairs adaptively, and conjecture that it provides a further improvement.
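The abstract does not spell out the sampling or aggregation procedure; the sketch below is only an assumption of how uniform pair sampling might be used, ranking items by their win counts among roughly O(nC^2) sampled comparisons. The preference oracle, function names, and win-count aggregation rule are illustrative and should not be read as the paper's algorithm.

```python
import random
from itertools import combinations

def rank_from_sampled_pairs(items, prefer, num_pairs, rng=random):
    """Rank items using a uniformly sampled subset of pairwise comparisons.

    prefer(i, j) returns the preferred item; num_pairs would be on the order
    of n * C**2 in the setting described by the abstract.
    """
    all_pairs = list(combinations(items, 2))
    sampled = rng.sample(all_pairs, min(num_pairs, len(all_pairs)))
    wins = {x: 0 for x in items}
    for i, j in sampled:
        wins[prefer(i, j)] += 1
    # Items winning more of the sampled comparisons are ranked higher.
    return sorted(items, key=lambda x: wins[x], reverse=True)

# Toy usage: a noiseless oracle that always prefers the item with the smaller index.
items = list(range(50))
C = 3
ranking = rank_from_sampled_pairs(items, prefer=min, num_pairs=len(items) * C**2)
print(ranking[:10])
```

With only a near-linear number of sampled pairs, the resulting order is approximate rather than exact, which matches the abstract's guarantee of each element being on average O(n/C) positions from its optimal position.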


Supplemental Material

• wsdm2011_radinsky_rfp_01.mov (MOV, 101.4 MB)
• wsdm2011_radinsky_rfp_01.mp4 (MP4, 142.5 MB)



Published in
      WSDM '11: Proceedings of the fourth ACM international conference on Web search and data mining
      February 2011
      870 pages
ISBN: 9781450304931
DOI: 10.1145/1935826

      Copyright © 2011 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

WSDM '11 Paper Acceptance Rate: 83 of 372 submissions, 22%
Overall Acceptance Rate: 498 of 2,863 submissions, 17%

