ABSTRACT
In this paper, we formally investigate whether IR evaluation measures are on an interval scale, a property needed to safely compute the basic statistics, such as mean and variance, that we routinely use to compare IR systems. We address this question within the framework of the representational theory of measurement, relying on the notion of a difference structure, i.e. a total, equi-spaced ordering on the system runs. We found that the most popular set-based measures, i.e. precision, recall, and F-measure, are on an interval scale. For rank-based measures under a strongly top-heavy ordering, we found that only RBP with p = 1/2 is on an interval scale, while RBP for other values of p, AP, DCG, and ERR are not. Moreover, under a weakly top-heavy ordering, we found that none of RBP, AP, DCG, and ERR is on an interval scale.
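The equi-spacing idea behind the RBP result can be illustrated with a minimal sketch. It is not the paper's proof and rests on several assumptions: relevance is binary, runs have a fixed depth of 3, the strongly top-heavy ordering is taken to be the lexicographic order on relevance vectors, and RBP is computed with its usual formula (1 - p) Σ r_i p^(i-1). The helper names `rbp` and `gaps` are hypothetical.

```python
from fractions import Fraction
from itertools import product


def rbp(run, p):
    """Rank-biased precision of a binary relevance vector (rank 1 first)."""
    return (1 - p) * sum(r * p**i for i, r in enumerate(run))


def gaps(p, depth=3):
    """Score differences between consecutive runs.

    Runs are all binary vectors of the given depth in lexicographic order;
    using this as the strongly top-heavy ordering is an assumption.
    """
    runs = sorted(product((0, 1), repeat=depth))
    scores = [rbp(run, p) for run in runs]
    return [b - a for a, b in zip(scores, scores[1:])]


half_gaps = gaps(Fraction(1, 2))   # p = 1/2
other_gaps = gaps(Fraction(4, 5))  # p = 0.8, a commonly used value

print(len(set(half_gaps)) == 1)    # True: every step equals 1/8
print(len(set(other_gaps)) == 1)   # False: steps differ in size
```

With p = 1/2 each run's score is the binary fraction spelled out by its relevance vector, so adjacent runs are always the same distance apart. With p = 4/5 the steps vary, and some are even negative, so under these assumptions the scores are not equi-spaced.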
Index Terms
- Are IR Evaluation Measures on an Interval Scale?