ABSTRACT
Obtaining judgments from human raters is a vital part of designing search engine evaluations. Today, a discrepancy exists between how judgments are acquired from raters (the training phase) and how the responses are used for retrieval evaluation (the evaluation phase). The discrepancy stems from an inconsistency in how the information is represented in the two phases. During training, raters are asked to provide a relevance score for an individual result in the context of a query, whereas evaluation is performed on ordered lists of search results, with each result's position relative to the other results taken into account. As an alternative to learning to rank from relevance judgments on individual search results, increasing attention has recently turned to the theory and practice of learning from answers to combinatorial questions about sets of search results: during training, raters are asked to rank small sets of results (typically pairs).
We first compare human raters' responses to questions about the relevance of individual results with their responses to questions about the relative relevance of pairs of results. We empirically show that neither type of response can be deduced from the other, and that the added context created when results are shown together changes the raters' evaluation process. Since pairwise judgments relate directly to ranking, we conclude that they are more accurate for that purpose. Going beyond pairs, we show that triplets do not contain significantly more information than pairs for the purpose of measuring statistical preference. Together, these two results establish good stability properties of pairwise comparisons for learning to rank. We further analyze scenarios in which results of varying quality are added as "decoys".
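To make the pointwise-versus-pairwise comparison concrete, here is a minimal sketch of how agreement between the two response types might be measured. The data layout (a dict of pointwise scores and (a, b, winner) triples for pairwise judgments) is a hypothetical choice of ours, and this is an illustration rather than the paper's exact analysis:

```python
def pointwise_pairwise_agreement(pointwise, pairwise):
    """Fraction of pairwise judgments whose winner also has the higher
    pointwise score. `pointwise` maps result id -> relevance score;
    `pairwise` is an iterable of (a, b, winner) triples (hypothetical
    format). Pointwise ties are counted separately as uninformative."""
    agree = disagree = ties = 0
    for a, b, winner in pairwise:
        sa, sb = pointwise[a], pointwise[b]
        if sa == sb:
            ties += 1
        elif (sa > sb) == (winner == a):
            agree += 1
        else:
            disagree += 1
    informative = agree + disagree
    return (agree / informative if informative else float("nan")), ties

# Toy usage: the second pairwise judgment contradicts the pointwise order.
pointwise = {"d1": 3, "d2": 1, "d3": 2}
pairwise = [("d1", "d2", "d1"), ("d2", "d3", "d2")]
rate, ties = pointwise_pairwise_agreement(pointwise, pairwise)  # rate == 0.5
```

An agreement rate well below 1.0 on real judgment data is the kind of evidence that pairwise preferences cannot simply be deduced from pointwise scores.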
A recurring concern in work on pairwise comparison is the quadratic number of pairs in a set of results. Which preferences do we choose to solicit from paid raters? Can we provably avoid a quadratic cost? We employ results from statistical learning theory to show that the quadratic cost can be provably eliminated in certain cases. More precisely, we show that to obtain a ranking in which each element is, on average, O(n/C) positions away from its position in the optimal ranking, it suffices to sample O(nC²) pairs uniformly at random, for any C > 0. We also present an active learning algorithm that samples pairs adaptively, and we conjecture that it provides a further improvement.
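As a rough illustration of the uniform-sampling regime (not the paper's algorithm: the Borda-style win-rate aggregation and the noisy comparison oracle below are assumptions made for this sketch), sampling O(nC²) pairs and ranking by empirical win rate looks like this:

```python
import random
from collections import defaultdict

def rank_from_sampled_pairs(items, prefer, C):
    """Sample O(n * C^2) pairs uniformly at random, query a (possibly
    noisy) preference oracle `prefer(a, b)` (True if a beats b), and
    rank items by empirical win rate. The win-rate (Borda-style)
    aggregation is an illustrative choice, not the paper's method."""
    n = len(items)
    num_samples = int(n * C * C)  # O(nC^2) uniformly sampled pairs
    wins = defaultdict(int)
    appearances = defaultdict(int)
    for _ in range(num_samples):
        a, b = random.sample(items, 2)  # uniform random distinct pair
        appearances[a] += 1
        appearances[b] += 1
        wins[a if prefer(a, b) else b] += 1
    def win_rate(x):  # items never sampled default to rate 0
        return wins[x] / appearances[x] if appearances[x] else 0.0
    return sorted(items, key=win_rate, reverse=True)

# Usage: recover an approximate order from a 10%-noise oracle.
truth = list(range(100))
def noisy_prefer(a, b, p_err=0.1):
    return (a < b) != (random.random() < p_err)
approx = rank_from_sampled_pairs(truth, noisy_prefer, C=5)
```

In this sketch, a larger C buys a tighter approximation (average displacement O(n/C)) at the cost of quadratically more sampled pairs.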