research-article

Are IR Evaluation Measures on an Interval Scale?

Authors:
Marco Ferrante, University of Padua, Padova, Italy
Nicola Ferro, University of Padua, Padova, Italy
Silvia Pontarollo, University of Padua, Padova, Italy

ICTIR '17: Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval, October 2017, Pages 67–74
https://doi.org/10.1145/3121050.3121058

Published: 01 October 2017

ABSTRACT

In this paper, we formally investigate whether IR evaluation measures are on an interval scale, a property needed to safely compute the basic statistics, such as mean and variance, that we routinely use to compare IR systems. We address this question within the framework of the representational theory of measurement, relying on the notion of difference structure, i.e. a total equi-spaced ordering on the system runs. We find that the most popular set-based measures, i.e. precision, recall, and F-measure, are on an interval scale. For rank-based measures, under a strongly top-heavy ordering, we find that only RBP with p = 1/2 is on an interval scale, while RBP for other values of p, AP, DCG, and ERR are not. Moreover, under a weakly top-heavy ordering, none of RBP, AP, DCG, and ERR is on an interval scale.
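
The RBP result can be illustrated numerically. The sketch below is not the paper's proof: it assumes binary relevance, a fixed run length, and that the strongly top-heavy ordering can be approximated by lexicographic comparison of relevance vectors with rank 1 as the most significant position. The function names `rbp` and `gaps` are hypothetical. It enumerates all runs of length 4, scores them with RBP, and collects the differences between consecutive scores; an equi-spaced scale would yield a single distinct difference.

```python
from fractions import Fraction
from itertools import product


def rbp(run, p):
    """Rank-biased precision of a binary run: (1 - p) * sum_i p^(i-1) * r_i."""
    # enumerate() starts at 0, so p**i is the weight of rank i + 1.
    return (1 - p) * sum(r * p**i for i, r in enumerate(run))


def gaps(p, n):
    """Distinct differences between consecutive RBP scores over all 2^n
    binary runs, taken in lexicographic order (assumed stand-in for the
    strongly top-heavy ordering)."""
    runs = sorted(product((0, 1), repeat=n))
    scores = [rbp(run, p) for run in runs]
    return {b - a for a, b in zip(scores, scores[1:])}


# Exact rational arithmetic avoids floating-point noise in the differences.
equi = gaps(Fraction(1, 2), 4)    # a single gap: scores are equi-spaced
uneven = gaps(Fraction(4, 5), 4)  # several different gaps

print("p = 1/2:", sorted(equi))
print("p = 4/5:", sorted(uneven))
```

Under these assumptions, p = 1/2 gives one constant step between adjacent runs, whereas p = 4/5 gives several different steps, which is consistent with the claim that only RBP with p = 1/2 is equi-spaced.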

Index Terms

Information systems → Information retrieval → Evaluation of retrieval results → Retrieval effectiveness

Published in
ICTIR '17: Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval
October 2017
348 pages
ISBN: 9781450344906
DOI: 10.1145/3121050
General Chairs:
Jaap Kamps, University of Amsterdam, The Netherlands
Evangelos Kanoulas, University of Amsterdam, The Netherlands
Maarten de Rijke, University of Amsterdam, The Netherlands
Program Chairs:
Hui Fang, University of Delaware, USA
Emine Yilmaz, University College London, UK
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Badges
- Best Paper
Author Tags
evaluation measures
interval scale
representational theory of measurement
Acceptance Rates
ICTIR '17 paper acceptance rate: 27 of 54 submissions (50%)
Overall acceptance rate: 209 of 482 submissions (43%)