DOI: 10.1145/3121050.3121058
research-article
Best Paper

Are IR Evaluation Measures on an Interval Scale?

Published: 01 October 2017

ABSTRACT

In this paper, we formally investigate whether IR evaluation measures are on an interval scale, a property needed to safely compute the basic statistics, such as mean and variance, that we routinely use to compare IR systems. We address this question in the framework of the representational theory of measurement, relying on the notion of difference structure, i.e., a total equi-spaced ordering on the system runs. We found that the most popular set-based measures, i.e., precision, recall, and F-measure, are interval-based. In the case of rank-based measures, using a strongly top-heavy ordering, we found that only RBP with p = 1/2 is on an interval scale, while RBP for other p values, AP, DCG, and ERR are not. Moreover, using a weakly top-heavy ordering, we found that none of RBP, AP, DCG, and ERR is on an interval scale.
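
The equi-spacing claim for RBP can be illustrated with a small sketch. It is not the paper's construction: it assumes binary relevance, fixed-length runs, the usual definition RBP = (1 - p) · Σ r_i · p^(i-1), and it uses plain lexicographic order on relevance vectors as a stand-in for the strongly top-heavy ordering. The function names `rbp` and `gaps` are hypothetical.

```python
from fractions import Fraction
from itertools import product


def rbp(run, p):
    """Rank-biased precision of a binary relevance vector (rank 1 first)."""
    return (1 - p) * sum(r * p**i for i, r in enumerate(run))


def gaps(p, n=3):
    """Differences between consecutive RBP scores over all length-n binary
    runs, taken in lexicographic order (an assumed stand-in for the
    strongly top-heavy ordering mentioned in the abstract)."""
    runs = sorted(product((0, 1), repeat=n))
    scores = [rbp(run, p) for run in runs]
    return [b - a for a, b in zip(scores, scores[1:])]


# Exact rational arithmetic avoids floating-point noise in the comparison.
half_gaps = gaps(Fraction(1, 2))   # every step is the same size
other_gaps = gaps(Fraction(4, 5))  # steps differ, some are even negative
```

With p = 1/2 each run of length n maps to a distinct multiple of 2^(-n), so adjacent runs are always the same distance apart. With p = 4/5 the steps vary in size and sign, which is consistent with the abstract's statement that RBP for other p values is not on an interval scale under this kind of ordering.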


    Published in

      ICTIR '17: Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval
      October 2017
      348 pages
      ISBN: 9781450344906
      DOI: 10.1145/3121050

      Copyright © 2017 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      ICTIR '17 Paper Acceptance Rate: 27 of 54 submissions, 50%
      Overall Acceptance Rate: 209 of 482 submissions, 43%
