A systematic analysis of performance measures for classification tasks
Motivation
Machine Learning (ML) divides classification into binary, multi-class, multi-labelled, and hierarchical tasks. In this work we present a systematic analysis of twenty-four performance measures used in these classification subfields. We focus on how well classes are identified, without reference to computation cost or time. We consider a set of changes in a confusion matrix that correspond to specific characteristics of data. We then analyze the type of changes that do not change a measure’s
Overview of classification tasks
Supervised ML allows access to the data labels during the algorithm’s training and testing stages. Consider categorical labels, where data entries have to be assigned to predefined classes. Classification then falls into one of the following tasks:
Binary: the input is to be classified into one, and only one, of two non-overlapping classes. Binary classification is the most popular classification task. Assigned categories can be objective, independent of manual evaluation
Performance measures for classification
The correctness of a classification can be evaluated by computing the number of correctly recognized class examples (true positives), the number of correctly recognized examples that do not belong to the class (true negatives), and the number of examples that either were incorrectly assigned to the class (false positives) or were not recognized as class examples (false negatives). These four counts constitute a confusion matrix, shown in Table 1 for the case of binary classification.
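The four counts described above can be tallied directly from paired true and predicted labels. The following is a minimal sketch, not the authors' code; the names `ConfusionCounts` and `confusion_counts`, the `positive` parameter, and the toy labels are assumptions made for illustration.

```python
from typing import NamedTuple, Sequence


class ConfusionCounts(NamedTuple):
    """Binary confusion matrix counts (hypothetical container)."""
    tp: int  # class examples correctly recognized
    fn: int  # class examples that were not recognized
    fp: int  # examples incorrectly assigned to the class
    tn: int  # non-class examples correctly recognized


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int],
                     positive: int = 1) -> ConfusionCounts:
    """Tally tp, fn, fp, tn for one class treated as 'positive'."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return ConfusionCounts(tp, fn, fp, tn)


# Toy labels, invented for this example only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(counts)
```

For the toy labels, this yields three true positives, one false negative, one false positive, and three true negatives.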
Table 2 presents
Invariance properties of measures
We focus on the ability of a measure to preserve its value under a change in the confusion matrix. A measure is invariant if its value does not change when a confusion matrix changes, i.e. invariance indicates that the measure does not detect the change in the confusion matrix. This inability can be beneficial or adverse, depending on the goals.
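As a rough illustration of this definition, the sketch below checks whether a measure keeps its value when only the number of true negatives changes. It uses the standard formulas Precision = tp / (tp + fp), Recall = tp / (tp + fn), and Accuracy = (tp + tn) / (tp + fn + fp + tn). The helper `is_invariant`, its tolerance, and the counts are assumptions chosen for the example and are not taken from the paper.

```python
def precision(tp: int, fn: int, fp: int, tn: int) -> float:
    return tp / (tp + fp)


def recall(tp: int, fn: int, fp: int, tn: int) -> float:
    return tp / (tp + fn)


def accuracy(tp: int, fn: int, fp: int, tn: int) -> float:
    return (tp + tn) / (tp + fn + fp + tn)


def is_invariant(measure, before: dict, after: dict, tol: float = 1e-12) -> bool:
    """True if the measure has the same value on both confusion matrices."""
    return abs(measure(**before) - measure(**after)) <= tol


# Illustrative counts only: the second matrix differs from the first in tn alone.
base = dict(tp=30, fn=10, fp=20, tn=40)
more_negatives = dict(base, tn=400)

results = {
    m.__name__: is_invariant(m, base, more_negatives)
    for m in (precision, recall, accuracy)
}
print(results)
```

Precision and Recall do not use tn, so they are invariant under this change, whereas Accuracy is not: it moves from 0.7 to roughly 0.93 on these counts.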
Let’s consider a case when invariance to the change of tn is beneficial. Text classification extensively uses Precision and Recall (Sensitivity) which
Analysis of invariant properties
To identify similarities among the measures, we compare them according to their invariance and non-invariance properties shown in Table 7. First, we present outlier measures whose properties differ markedly from those of the others. Two measures hold unique invariance properties: is the only measure invariant under vertical scaling, and Exact Match Ratio is the only measure non-invariant under uniform scaling. Another exception is Retrieval Fscore which is sensitive to all the
Conclusion and future work
In this study, we have analyzed twenty-four performance measures used in the complete spectrum of Machine Learning classification tasks: binary, multi-class, multi-labelled, and hierarchical. Effects of changes in the confusion matrix on several well-known measures have been studied. In all the cases, we have shown that the evaluation of classification results can depend on the invariance properties of the measures. A few cases required that we additionally consider monotonicity of the
Acknowledgments
This work has been funded by the Natural Sciences and Engineering Research Council of Canada and the Ontario Centres of Excellence. We thank Elliott Macklovitch for fruitful suggestions on an early draft. We thank anonymous reviewers for helpful comments.