A systematic analysis of performance measures for classification tasks

https://doi.org/10.1016/j.ipm.2009.03.002

Abstract

This paper presents a systematic analysis of twenty-four performance measures used across the complete spectrum of Machine Learning classification tasks, i.e., binary, multi-class, multi-labelled, and hierarchical. For each classification task, the study relates a set of changes in a confusion matrix to specific characteristics of the data. The analysis then concentrates on the types of changes to a confusion matrix that do not change a measure and therefore preserve a classifier’s evaluation (measure invariance). The result is a taxonomy of measure invariance with respect to all relevant label distribution changes in a classification problem. This formal analysis is supported by examples of applications where the invariance properties of measures lead to a more reliable evaluation of classifiers. Several case studies from text classification supplement the discussion.

Section snippets

Motivation

Machine Learning (ML) divides classification into binary, multi-class, multi-labelled, and hierarchical tasks. In this work we present a systematic analysis of twenty-four performance measures used in these classification subfields. We focus on how well classes are identified, without reference to computational cost or time. We consider a set of changes in a confusion matrix that correspond to specific characteristics of the data. We then analyze the types of changes that do not change a measure’s value and thus preserve a classifier’s evaluation (measure invariance).

Overview of classification tasks

Supervised ML allows access to the data labels during the algorithm’s training and testing stages. Consider categorical labels, where data entries x1, …, xn have to be assigned to predefined classes C1, …, Cl. Classification then falls into one of the following tasks:

  • Binary: the input is to be classified into one, and only one, of two non-overlapping classes (C1, C2); a sketch of the label formats for the different tasks follows this list. Binary classification is the most popular classification task. Assigned categories can be objective, independent of manual evaluation
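
As an informal illustration (mine, not the paper’s), the sketch below shows one plausible Python encoding of the labels each task assigns. The class names C1…C5 and the child-to-parent tree used for the hierarchical case are assumptions made for the example.

```python
# Illustrative label encodings for the four classification tasks.
# Class names C1..C5 and the tree below are assumptions for the example.

binary_label = "C1"                    # exactly one of two classes {C1, C2}
multi_class_label = "C3"               # exactly one of l > 2 classes
multi_labels = {"C1", "C4"}            # any subset of the l classes
parent = {"C1": None, "C4": "C1", "C5": "C4"}   # hierarchical: child -> parent

def implied_classes(c):
    """A hierarchical assignment implies all ancestor classes."""
    out = [c]
    while parent.get(c) is not None:
        c = parent[c]
        out.append(c)
    return out

print(implied_classes("C5"))  # ['C5', 'C4', 'C1']
```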

Performance measures for classification

The correctness of a classification can be evaluated by computing the number of correctly recognized class examples (true positives), the number of correctly recognized examples that do not belong to the class (true negatives), and the number of examples that were either incorrectly assigned to the class (false positives) or not recognized as class examples (false negatives). These four counts constitute the confusion matrix shown in Table 1 for the case of binary classification.
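
As a minimal sketch (not the authors’ code), the following Python function tallies the four counts from gold and predicted labels; the labels and the choice of C1 as the positive class are assumptions made for the example.

```python
def binary_confusion(y_true, y_pred, positive="C1"):
    """Count tp, fp, fn, tn for binary classification (Table 1 layout)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

y_true = ["C1", "C1", "C2", "C2", "C2"]
y_pred = ["C1", "C2", "C1", "C2", "C2"]
print(binary_confusion(y_true, y_pred))  # (1, 1, 1, 2)
```

The four counts always sum to the number of classified examples, which is the sanity check the Table 1 layout makes explicit.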

Table 2 presents the performance measures for binary classification that are computed from these four counts.

Invariance properties of measures

We focus on the ability of a measure to preserve its value under a change in the confusion matrix. A measure is invariant with respect to a change if its value does not change when the confusion matrix undergoes that change, i.e., invariance indicates that the measure does not detect the change in the confusion matrix. This insensitivity can be beneficial or adverse, depending on the goals of the evaluation.

Consider a case in which invariance to a change in tn is beneficial. Text classification extensively uses Precision and Recall (Sensitivity), neither of which depends on tn.
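
A small numerical sketch (the counts are assumed, not data from the paper) makes the point: Precision and Recall are computed from tp, fp, and fn only, so inflating tn leaves them untouched, while Accuracy changes.

```python
def precision(tp, fp, fn, tn):
    return tp / (tp + fp)            # no tn term

def recall(tp, fp, fn, tn):          # a.k.a. Sensitivity
    return tp / (tp + fn)            # no tn term

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

base = (30, 10, 5, 55)               # assumed counts: tp, fp, fn, tn
flooded = (30, 10, 5, 5000)          # same classifier, far more true negatives

for m in (precision, recall, accuracy):
    print(m.__name__, m(*base), m(*flooded))
# precision: 0.75 in both cases      <- invariant to the change in tn
# recall: 0.857... in both cases     <- invariant to the change in tn
# accuracy: 0.85 vs 0.997...         <- rewarded by the extra easy negatives
```

This is why a flood of easy negative documents, common in text classification, leaves Precision and Recall meaningful while it drives Accuracy toward 1.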

Analysis of invariant properties

To identify similarities among the measures, we compare them according to their invariance and non-invariance properties, shown in Table 7. First, we present measure outliers whose properties markedly set them apart from the others. Two measures hold unique invariance properties: PrecisionG is the only measure invariant under vertical scaling (I7), and Exact Match Ratio is the only measure non-invariant under uniform scaling (I6¯). Another exception is the Retrieval Fscore, which is sensitive to all the changes considered.
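
Reading uniform scaling (I6) as multiplying every cell of the confusion matrix by the same factor k (my reading, consistent with the text above; the counts are assumed), a quick check confirms that the familiar binary measures are invariant under it. Exact rational arithmetic rules out rounding artifacts.

```python
from fractions import Fraction as F

def precision(tp, fp, fn, tn): return F(tp, tp + fp)
def recall(tp, fp, fn, tn):    return F(tp, tp + fn)

def f1(tp, fp, fn, tn):
    p, r = precision(tp, fp, fn, tn), recall(tp, fp, fn, tn)
    return 2 * p * r / (p + r)

def accuracy(tp, fp, fn, tn):
    return F(tp + tn, tp + fp + fn + tn)

cm = (30, 10, 5, 55)
scaled = tuple(3 * v for v in cm)    # uniform scaling: every count times k = 3

for m in (precision, recall, f1, accuracy):
    print(m.__name__, m(*cm) == m(*scaled))   # all True: unchanged under I6
```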

Conclusion and future work

In this study, we have analyzed twenty-four performance measures used across the complete spectrum of Machine Learning classification tasks: binary, multi-class, multi-labelled, and hierarchical. The effects of changes in the confusion matrix on several well-known measures have been studied. In all the cases, we have shown that the evaluation of classification results can depend on the invariance properties of the measures. A few cases required that we additionally consider the monotonicity of the measures.

Acknowledgments

This work has been funded by the Natural Sciences and Engineering Research Council of Canada and the Ontario Centres of Excellence. We thank Elliott Macklovitch for fruitful suggestions on an early draft. We thank anonymous reviewers for helpful comments.
