Enhancing Confusion Entropy (CEN) for binary and multiclass classification

Correction

22 Apr 2021: Delgado R, Núñez-González JD (2021) Correction: Enhancing Confusion Entropy (CEN) for binary and multiclass classification. PLOS ONE 16(4): e0250834. https://doi.org/10.1371/journal.pone.0250834

Abstract

Different performance measures are used to assess the behaviour of classifiers in Machine Learning and to compare them. Many measures have been defined in the literature, among them a measure inspired by Shannon’s entropy named the Confusion Entropy (CEN). In this work we introduce a new measure, MCEN, obtained by modifying CEN to avoid its unwanted behaviour in the binary case, which disqualifies it as a suitable performance measure for classification. We compare MCEN with CEN and other performance measures, presenting analytical results in some particularly interesting cases, as well as some heuristic computational experimentation.

Introduction

Machine Learning is the subfield of Computer Science, and the branch of Artificial Intelligence, whose objective is to develop techniques that allow computers to learn. It has a wide range of applications, such as search engines and pattern recognition. Examples include medical diagnosis, fraud detection, stock market analysis, classification of DNA sequences, recognition of speech and written language, image recognition, games and robotics.

Machine learning tasks are typically grouped into two broad categories: Supervised and Unsupervised Learning. Classification falls into the former, since it deals with some input variables (features or characteristics) and an output variable (the class), and uses an algorithm to infer the class of (that is, to classify) a new case from its known features. Different models are used to build classifiers. Decision Trees (J48, Random Forest), Rules (Decision Table, JRip, ZeroR), Neural Networks (Multilayer Perceptron, Extreme Learning Machines, RBFN), Support Vector Machines, and Bayesian Networks (Naive Bayes, TAN) are some, although not the only, approaches to supervised classification.

Once a classifier is built, a performance measure is needed in order to assess its behaviour and to compare it with other classifiers. In the binary case, in which the class variable has only two labels or classes, there are several classical measures that have been widely used: Accuracy, Sensitivity, Specificity and F-score, to mention only some of the most common. Not all of them allow a natural extension to the multi-class case (more than two labels), and only a few measures have been specially designed for multi-class classification, which is a more complex scenario. Accuracy, by far the simplest and most widespread performance measure in classification, extends its binary definition seamlessly to multi-class classification. Another well-known performance measure, originally introduced in the binary case but extending without problems, is Matthews’ Correlation Coefficient (MCC), introduced by Matthews in [1].
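
For reference, a minimal sketch (Python; function and variable names are ours and purely illustrative) of these classical binary measures, computed from the four cells of a 2 × 2 confusion matrix with class 1 taken as the positive class:

```python
def binary_measures(tp, fn, fp, tn):
    """Classical binary measures from the cells of a 2x2 confusion matrix
    (rows = true class, columns = predicted class, class 1 = positive)."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn) if (tp + fn) > 0 else 0.0   # recall of class 1
    specificity = tn / (tn + fp) if (tn + fp) > 0 else 0.0   # recall of class 2
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    f_score = (2 * precision * sensitivity / (precision + sensitivity)
               if (precision + sensitivity) > 0 else 0.0)
    return accuracy, sensitivity, specificity, f_score

print(binary_measures(tp=40, fn=10, fp=5, tn=45))
```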

In this work, whose seed is [2], we focus on a different performance measure, named Confusion Entropy (CEN), which measures the uncertainty generated by classification and was recently introduced by Wei et al. in [3] as a novel measure for evaluating classifiers based on the concept of Shannon’s entropy. CEN measures the entropy generated by misclassified cases, considering not only how the cases of each fixed class have been misclassified into other classes, but also how the cases of the other classes have been misclassified as belonging to this class, as well as the entropy inside well-classified cases. Given a set of non-negative numbers, say {n1, …, nr}, the Shannon entropy generated by the set can be defined as the sum $-\sum_{i=1}^{r} p_i \log(p_i)$, with $p_i = n_i / \sum_{k=1}^{r} n_k$ and the convention $p_i \log(p_i) = 0$ if $p_i = 0$, where log can be, as usual, the logarithm in base 2.
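
A minimal sketch of this entropy, computed from a set of non-negative counts under the stated convention (zero terms contribute zero; base-2 logarithm by default):

```python
import math

def shannon_entropy(counts, base=2.0):
    """Shannon entropy of a set of non-negative numbers {n1, ..., nr},
    computed from the normalized frequencies; zero frequencies contribute 0."""
    total = sum(counts)
    if total == 0:
        return 0.0
    h = 0.0
    for n in counts:
        p = n / total
        if p > 0:
            h -= p * math.log(p, base)
    return h

print(shannon_entropy([3, 3]))   # 1.0: two equal values give maximum entropy
print(shannon_entropy([6, 0]))   # 0.0: all mass on a single value
```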

CEN is compared in [3] with Accuracy and other measures, showing a relative consistency with them: higher Accuracy tends to result in lower Confusion Entropy. This performance measure, which is more discriminating than Accuracy for evaluating classifiers, especially when the number of cases grows, has also been studied in [4], where the authors show a strong monotone relation between CEN and MCC, and that both MCC and CEN improve over Accuracy.

There are some works in the recent literature using Confusion Entropy. For example, in [5] the authors propose a novel splitting criterion based on CEN for learning decision trees with higher performance; experimental results on some datasets show that this criterion leads to trees with better CEN values without reducing accuracy. The authors of [6] and [7] use CEN, among other performance measures, to compare several common data mining methods on highly imbalanced datasets where the class of interest is rare. Other works propose modifications of this measure, such as [8], in which a Confusion Entropy measure based on a probabilistic confusion matrix is introduced, measuring whether cases are classified into their true classes and separated from the others with high probability. A similar approach to that of [8] is followed in [9] to analyze the probability sensitivity of Gaussian processes in a bankruptcy prediction context, by means of a probabilistic confusion entropy matrix based on the model’s estimated probabilities. In the context of horizontal collaboration, the system global entropy is introduced in [10] analogously to CEN (see also [11] and [12]), and it is used in the collaborative part of a clustering algorithm, which is iterative, with the optimization process continuing as long as the system global entropy is not stable.

Remarkably, CEN has a weakness in the binary case that invalidates it as a suitable performance measure: in some situations CEN takes values larger than one, unlike in the multi-class case, in which CEN ranges between zero and one. CEN is a measure of the “overall” entropy associated with the confusion matrix, which can be thought of as generated by two sources: the entropy within the main diagonal, and the entropy generated by the values outside it, corresponding to misclassification. We will show that CEN is more sensitive to the latter. A second but no less important weakness of CEN is its lack of monotonicity when the overall entropy increases (or decreases) monotonically. Throughout the paper we will show different situations that highlight these issues.

Our aim is to introduce an enhanced CEN measure, which we denote MCEN, and to compare it with CEN, MCC and Accuracy. This new measure will be shown to be highly correlated with CEN. Two aspects deserve to be highlighted:

  1. the definitions of the probabilities involved in the construction of CEN have been modified in MCEN so that they can be interpreted as genuine probabilities,
  2. the weaknesses of CEN in the binary case (out-of-range values and lack of monotonicity) are overcome by MCEN.

The paper is structured as follows: first we introduce the Modified Confusion Entropy MCEN and deal with the multi-dimensional perfectly symmetric and balanced case, which is studied in depth, performing a cross comparison between CEN, MCEN, Accuracy and MCC. The general binary case is treated next, focusing on different families of matrices and carrying out the corresponding cross comparisons. The next part is devoted to the study of the ZA family of confusion matrices. Then, we compare CEN, MCEN, Accuracy and MCC with two recently introduced measures: the Probabilistic Accuracy PACC ([13]) and the Entropy-Modulated Accuracy EMA ([14]). Finally, some experiments performed in the binary setting to compare CEN with MCEN on four real datasets are included in the Supporting Information file. These experiments show that their behaviour is mostly analogous, but when it is not, MCEN is the one that behaves more consistently with the entropy generated by misclassification. The paper finishes with a conclusion section.

Methods

Given a multi-class classifier learned from a training dataset, with N ≥ 2 classes labelled {1, 2, …, N}, we apply it in order to classify cases from a testing dataset, that is, to infer the class of the cases from their known features or characteristics. Since for the cases in the testing dataset we actually know the class to which they belong, we can construct the N × N confusion matrix C = (Ci,j)i,j=1, …, N, which collects the results issued by the classifier over the testing dataset: Ci,j is the number of cases of class i that have been classified as belonging to class j. We denote by S the sum of the values of the matrix, that is, the total number of cases in the testing dataset, $S = \sum_{i,j=1}^{N} C_{i,j}$.
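
As a small illustration (the function and labels below are ours, not part of the paper), the confusion matrix C and the total S can be built from the true and predicted classes of the testing cases as follows:

```python
def confusion_matrix(true_labels, predicted_labels, n_classes):
    """C[i][j] = number of cases of class i+1 classified as class j+1
    (classes labelled 1, ..., N; rows = true class, columns = predicted class)."""
    C = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        C[t - 1][p - 1] += 1
    return C

C = confusion_matrix([1, 1, 2, 2, 2, 3], [1, 2, 2, 2, 3, 3], n_classes=3)
S = sum(sum(row) for row in C)   # total number of cases in the testing dataset
```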

We introduce the notations OUT(C) and IN(C) to denote the Shannon entropy generated by the elements of C outside (respectively, inside) the main diagonal. That is, while IN is the entropy generated by the well-classified cases, OUT is the entropy generated by misclassification.
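
Reusing the shannon_entropy sketch above, IN(C) and OUT(C) can be computed as:

```python
def in_out_entropies(C):
    """IN: Shannon entropy of the diagonal elements (well-classified cases).
    OUT: Shannon entropy of the off-diagonal elements (misclassified cases)."""
    n = len(C)
    diagonal = [C[i][i] for i in range(n)]
    off_diagonal = [C[i][j] for i in range(n) for j in range(n) if i != j]
    return shannon_entropy(diagonal), shannon_entropy(off_diagonal)
```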

In [3] the misclassification probability of classifying class-i cases as being of class j “subject to class j”, denoted by $P^j_{i,j}$, is introduced as
$$P^j_{i,j} = \frac{C_{i,j}}{\sum_{k=1}^{N} C_{j,k} + \sum_{k=1}^{N} C_{k,j}}, \qquad i \neq j, \tag{1}$$
that is, $P^j_{i,j}$ is “almost” the relative frequency of class-i cases that are classified as being of class j among all cases that are of class j or that have been classified as being of class j. But not exactly. The reason is that class-j cases that have been correctly classified, whose number is Cj,j, are counted twice in the denominator.

Analogously, the misclassification probability of classifying class-i cases as being of class j “subject to class i”, with the analogous interpretation, denoted by $P^i_{i,j}$, is defined in the same paper by
$$P^i_{i,j} = \frac{C_{i,j}}{\sum_{k=1}^{N} C_{i,k} + \sum_{k=1}^{N} C_{k,i}}, \qquad i \neq j. \tag{2}$$
Then, the Confusion Entropy associated to class j is defined in [3] by
$$CEN_j = -\sum_{k=1,\, k\neq j}^{N} \Big( P^{j}_{j,k}\, \log_{2(N-1)} P^{j}_{j,k} + P^{j}_{k,j}\, \log_{2(N-1)} P^{j}_{k,j} \Big), \tag{3}$$
with the convention $a \log_b(a) = 0$ if $a = 0$. Finally, the overall Confusion Entropy associated to the confusion matrix C is defined as a convex combination of the Confusion Entropies of the classes as follows:
$$CEN = \sum_{j=1}^{N} P_j\, CEN_j, \tag{4}$$
where the non-negative weights Pj, summing to 1, are
$$P_j = \frac{\sum_{k=1}^{N} C_{j,k} + \sum_{k=1}^{N} C_{k,j}}{2\sum_{k,l=1}^{N} C_{k,l}} = \frac{\sum_{k=1}^{N} \big(C_{j,k} + C_{k,j}\big)}{2S}. \tag{5}$$
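
The following sketch computes CEN according to (1)-(5) as reconstructed above; since the typeset equations are not reproduced in this text, the expressions in the code should be checked against [3]. It returns 1 for the constant 2 × 2 matrix and 0 under perfect classification, in agreement with the properties discussed below.

```python
import math

def cen(C):
    """Confusion Entropy of an N x N confusion matrix C (list of lists),
    following (1)-(5) as reconstructed above; logs are taken in base 2(N-1)."""
    N = len(C)
    S = float(sum(sum(row) for row in C))
    base = 2 * (N - 1)

    def xlog(p):                       # convention: a * log_b(a) = 0 if a = 0
        return p * math.log(p, base) if p > 0 else 0.0

    row = [sum(C[j]) for j in range(N)]
    col = [sum(C[i][j] for i in range(N)) for j in range(N)]

    total = 0.0
    for j in range(N):
        denom = row[j] + col[j]        # counts C[j][j] twice, as in (1)
        if denom == 0:
            continue                   # class j absent everywhere: weight P_j = 0
        P_j = denom / (2.0 * S)        # weight (5)
        cen_j = 0.0
        for k in range(N):
            if k != j:
                cen_j -= xlog(C[j][k] / denom) + xlog(C[k][j] / denom)
        total += P_j * cen_j
    return total

print(cen([[3, 3], [3, 3]]))           # 1.0 for the constant binary matrix
print(cen([[6, 0], [0, 6]]))           # 0.0 under perfect classification
```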

Note that CEN is an invariant measure: if we multiply all the elements of the confusion matrix by a constant, we obtain the same result. The same convenient and useful property holds for Accuracy, MCC and the modified Confusion Entropy measure MCEN, which we will introduce below. As MCC lives in [−1, 1] while Accuracy, CEN and MCEN range in [0, 1], we rescale MCC and introduce MCC* = (1 − MCC)/2. Besides, since Accuracy usually has an inverse relationship with both CEN and MCEN, we choose to consider ACC* = 1 − Accuracy instead of Accuracy itself.
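
For completeness, ACC* and, in the binary case, MCC* can be sketched as follows; the scaling MCC* = (1 − MCC)/2 is our reading of the rescaling described above (it maps [−1, 1] onto [0, 1] and makes MCC* = ACC* in the binary symmetric and balanced case discussed later):

```python
import math

def acc_star(C):
    """ACC* = 1 - Accuracy for a square confusion matrix."""
    S = sum(sum(row) for row in C)
    correct = sum(C[i][i] for i in range(len(C)))
    return 1.0 - correct / S

def mcc_star_binary(C):
    """Scaled Matthews Correlation Coefficient for a 2x2 confusion matrix
    C = [[TP, FN], [FP, TN]]; MCC* = (1 - MCC) / 2 ranges in [0, 1]."""
    (tp, fn), (fp, tn) = C
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return (1.0 - mcc) / 2.0
```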

For N > 2, CEN ranges between 0 and 1: the value 0 is attained with perfect classification (the off-diagonal elements of matrix C being zero), while 1 is attained under complete misclassification with symmetry and balance in C, that is, when all diagonal elements of C are zero and all the off-diagonal elements take the same value. In the binary case (N = 2), although CEN is still 0 with perfect classification and 1 under complete misclassification with symmetry, in intermediate scenarios we can also obtain CEN = 1 and even higher values. That is, in some cases CEN is out of range. See, for example, the confusion matrices in Table 1, which have already been considered in [4]. The lack of monotonicity as the situation moves monotonically from perfect classification to completely symmetric and balanced misclassification, as shown by the sequence of matrices in Table 1, is a serious drawback of CEN in the binary case, and is our main motivation for introducing a modified version of it.

Table 1. Examples in the perfectly symmetric and balanced binary case with S = 12. Only CEN values.

Definition

Instead of (1), we propose to introduce the probability of classifying class-i cases in class j “subject to class j” as
$$\frac{C_{i,j}}{\sum_{k=1}^{N} C_{j,k} + \sum_{k=1}^{N} C_{k,j} - C_{j,j}}, \qquad i \neq j,$$
that is, we overcome the fact that in (1) correctly classified class-j cases are counted twice in the denominator. With this definition, the quantity above is really the relative frequency of class-i cases classified as belonging to class j among all cases that are of class j or that have been classified as being of class j. Analogously, we modify definition (2) in the same sense, introducing
$$\frac{C_{i,j}}{\sum_{k=1}^{N} C_{i,k} + \sum_{k=1}^{N} C_{k,i} - C_{i,i}}, \qquad i \neq j,$$
which is really the relative frequency of class-i cases classified in class j among all cases that are of class i or that have been classified as being of class i.

Next, we modify the definition of the weights in (5) accordingly. Then, we define the modified Confusion Entropy associated to class j as in (3), using the modified probabilities, and the modified overall Confusion Entropy MCEN as in formula (4), that is, as the weighted sum of the per-class values (6). Note that for N > 2 the modified weights sum to 1, so the modified overall Confusion Entropy is also defined as a convex combination of the modified Confusion Entropies of the classes, while in the binary case (N = 2) it is only a conical combination: although the weights are non-negative, they do not necessarily sum to 1 (indeed, their sum is 1 if and only if all the diagonal elements of the confusion matrix C are zero, that is, if all cases have been misclassified).

We see from (4) and (6) that both measures, CEN and MCEN, are decomposable across classes, which makes it easy to assess the effect on the classifier’s behaviour of a simple modification affecting just one class.

We can start with a preliminary comparison of the behaviour of ACC*, MCC*, CEN and MCEN on the toy example in dimension 2 of Table 2. In this example, the baseline confusion matrix is constant, with all its entries equal to 3. First, keeping the total sum equal to S = 12 and the off-diagonal elements fixed, we reduce the entropy IN in Table 2(a). In the baseline case, the diagonal elements form the set {3, 3}, whose entropy is 1 (the maximum value). The corresponding values of IN in case (a) are reported in Table 2, in decreasing order. The same is done in Table 2(b), but in that case the changes are introduced outside the main diagonal. We observe that ACC* remains insensitive to changes in the arrangement of the elements of the matrix, since the sum of the main diagonal remains constant, while MCC* decreases only with decreasing entropy OUT; when IN decreases, its value increases. As far as their interpretation is concerned, both CEN and MCEN measure the overall entropy of the confusion matrix, giving less weight to the IN entropy, that is, the entropy generated by the well-classified cases, than to the OUT entropy, corresponding to misclassification. In this example we observe that their values decrease when IN decreases, keeping its sum constant, and also when OUT is the one that decreases; in this second case the reduction is much more drastic, both for CEN and MCEN, and more sharply for the latter. The main difference between CEN and MCEN in this sense is that the former is more sensitive to changes in the IN entropy than MCEN, and less sensitive than MCEN to changes in OUT (observe the percentages in brackets in Table 2, which are the relative reductions in each measure with respect to the baseline case).

Table 2. Toy example: Binary case with S = 12. (a) Entropy reduction within the main diagonal, IN. (b) Entropy reduction outside the main diagonal, OUT. In brackets, the relative reduction in each measure with respect to the baseline case. Entropy refers to IN in (a) and to OUT in (b).

We can extend this comparison to matrices of type MA, with A = 1, …, 100, for example, whose main diagonal stays constant. Fig 1 shows the behaviour of CEN, MCEN, ACC* and MCC* as OUT increases. We can observe that, indeed, CEN is less correlated with this entropy than MCEN. The same can be observed from the correlation matrix given in Table 3.
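
The exact entries of the family MA are not reproduced in this extract, so the sweep below uses a hypothetical stand-in family (constant diagonal, one off-diagonal entry growing with A so that OUT increases), only to illustrate how the curves of Fig 1 and the Pearson correlations of Table 3 can be obtained, reusing the cen and in_out_entropies sketches from above:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

cen_values, out_values = [], []
for A in range(1, 101):
    C = [[3, 100], [A, 3]]             # hypothetical stand-in for MA, not the paper's family
    _, out_entropy = in_out_entropies(C)
    cen_values.append(cen(C))
    out_values.append(out_entropy)

print(pearson(cen_values, out_values))  # correlation of CEN with the OUT entropy
```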

Fig 1. CEN, MCEN, ACC* and MCC* for matrix MA, as a function of the entropy outside the diagonal.

Table 3. Correlation matrix (Pearson) for the measures over the family of matrices MA, A = 1, …, 100.

Instead, if we consider matrices WA, with A = 1, …, 100, the values outside the main diagonal stay constant. Fig 2 shows the behaviour of CEN, MCEN, ACC* and MCC* as IN increases. CEN shows more correlation with this entropy than MCEN (see Table 4), although IN is less correlated (and in an inverse sense, which could not be appreciated in the toy example of Table 2) than OUT, both with CEN and with MCEN.

Fig 2. CEN, MCEN, ACC* and MCC* for matrix WA, as a function of the entropy inside the diagonal.

Table 4. Correlation matrix (Pearson) for the measures over the family of matrices WA, A = 1, …, 100.

The perfectly symmetric and balanced case

In this section we consider the case in which Ci,j = F for all i, j = 1, …, N with i ≠ j, and Ci,i = T, with T ≥ 0, F > 0; that is, the confusion matrix has T in every diagonal entry and F in every off-diagonal entry.

Proposition 1 In the perfectly symmetric and balanced case, (7) where

Note that ACC*, MCC*, CEN and MCEN depend on the matrix values T and F only through their ratio γ = T/F. In (7) (case N > 2), CEN and MCEN have the same expression, except that CEN depends on δ, which is a function of 2γ, while MCEN depends on the analogous quantity, which is the same function but of γ. Therefore, in what follows we highlight in the notation the dependency of CEN and MCEN on γ.

Corollary 1 In the perfectly symmetric and balanced case, we have that:

  • For any N > 2, CEN, MCEN, ACC* and MCC* are monotonically decreasing functions of γ ≥ 0, and if γ > 0, MCC* < ACC* < CEN < MCEN.
  • Nevertheless, when N = 2, although MCEN and ACC* = MCC* remain monotonically decreasing as functions of γ ≥ 0, CEN does not. Indeed, CEN achieves its global maximum at γ = e/2 − 1 ≈ 0.36, where its value is 2/(e ln 2) ≈ 1.06 > 1. Moreover, there exists γ0 ≈ 5.78 such that

Proof 1 The proofs of both Proposition 1 and Corollary 1 are straightforward, and therefore omitted. However, it is worth mentioning that in order to prove CEN < MCEN in the case N > 2 we use that the auxiliary function f involved in these expressions is strictly decreasing, for any base b > 1 (in our case, b = 2(N − 1) ≥ 4), when x > e. We apply this fact to see that f(x0) > f(x1) with x0 = 2(N − 1) + γ < x1 = 2(N − 1) + 2γ, since x0 ≥ 4 > e.

The same property of the function f allows us to prove that both CEN and MCEN are monotonically decreasing as functions of γ, with x = δ = 2(N − 1) + 2γ and x = 2(N − 1) + γ, respectively, both being > e for any γ ≥ 0. Note that although for N = 2 the expression of CEN as a function of δ is the same as in the case N > 2, the monotone decrease fails, since x = δ = 2 + 2γ < e for γ < (e − 2)/2.

The rest of the proofs are also omitted.

Remark 1 Note that if N = 2, CEN exhibits the unwanted behaviour, not shown by MCEN, of going out of the range [0, 1]; this behaviour disappears for N > 2 (see Figs 3 and 4).
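
A quick numerical check of Remark 1, reusing the cen sketch from the Methods section (and assuming that sketch reproduces definition (4) correctly): scanning γ = T/F for the symmetric and balanced binary matrix with T on the diagonal and F = 1 off the diagonal shows the non-monotone behaviour and values above 1.

```python
# Scan gamma = T/F for the perfectly symmetric and balanced binary matrix
# [[T, F], [F, T]], taking F = 1 and T = gamma.
gammas = [i / 100.0 for i in range(0, 1001)]          # gamma in [0, 10]
values = [cen([[g, 1.0], [1.0, g]]) for g in gammas]

i_max = max(range(len(values)), key=lambda i: values[i])
print("max CEN =", round(values[i_max], 4), "at gamma =", gammas[i_max])  # exceeds 1
print("CEN at gamma = 0:", values[0])                  # 1.0, complete misclassification
print("CEN at gamma = 10:", round(values[-1], 4))      # well below 1
```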

Fig 3. The symmetric case. CEN, MCEN, ACC* and MCC* for γ ∈ [0, 10], with N = 2.

Fig 4. The symmetric case. CEN, MCEN, ACC* and MCC* for γ ∈ [0, 10], with N = 3.

Remark 2 Consider the particular case in which T = F, that is, γ = 1. In other words, the confusion matrix is constant. Then, ACC* = (N − 1)/N and MCC* = 1/2. Moreover, δ = 2N and the analogous quantity for MCEN equals 2N − 1.

If N > 2, and

If N = 2, CEN = 1 and

As a consequence, we can easily check that if N > 2, MCC* < ACC* < CEN < MCEN, with limN→+∞ ACC* = limN→+∞ CEN = limN→+∞ MCEN = 1, while if N = 2, MCC* = ACC* < MCEN < CEN.

The particular pathological case of the matrices ZA will be studied in the multi-class setting, but before that we consider the binary case in some detail.

The general binary case

The binary case (N = 2) can be studied in more detail. We will use the following notation for the confusion matrix in the most general setting, taking class 1 as the reference class:
$$C = \begin{pmatrix} TP & FN \\ FP & TN \end{pmatrix}, \tag{8}$$
where TP (true positives) is the number of class-1 cases that have been correctly classified, and similarly TN (true negatives) for class 2. On the other hand, FP denotes the false positives, that is, the number of class-2 cases that have been misclassified, and FN the false negatives.

Proposition 2 If the confusion matrix C is given by (8), we have that with S = TP + TN + FP + FN, (9)

To carry out a deeper study, we have to consider particular situations; this is what we do in the subsections below, where different particular scenarios are introduced and developed.

The perfectly symmetric and balanced case.

Table 5 below shows some examples of 2 × 2 confusion matrices in which TP = TN = T and FP = FN = F. All of them correspond to S = 12 and have already been considered in [4]. This is a particular case of the setting considered previously, and Proposition 1 and Corollary 1 apply here. We can observe again the anomalous behaviour of CEN, in contrast with the other measures.

Table 5. Examples in the perfectly symmetric and balanced binary case with S = 12.

The symmetric but unbalanced family UA.

Consider the particular case of a confusion matrix of type UA, with A > 0. Both class-1 and class-2 cases are mainly misclassified if A > 1. The entropy outside the main diagonal is 1 and the entropy within the diagonal is 0, regardless of the value of A. When 0 < A < 1, say A = 1/B with B > 1, the matrix UA is equivalent to a rescaled matrix corresponding to an unbalanced scenario in which class 2 is underrepresented and class-1 cases are mainly well classified. Some properties of CEN, MCEN, ACC* and MCC* (see Fig 5) are collected in Proposition 3, which is derived from Proposition 2.

Fig 5. Family UA. CEN, MCEN, ACC* and MCC* for A ∈ (0, 10].

Proposition 3 For confusion matrix UA with A > 0, we have:

As a consequence:

  • CEN(A) < 1 if A < 1, CEN(1) = 1, and CEN(A) > 1 if A > 1,
  • MCEN(A) < 1 and ACC*(A) < MCC*(A) < 1, for all A > 0,
  • MCEN, ACC* and MCC* are monotonically increasing functions of A > 0; CEN is not, and achieves its global maximum, which is > 1, at A ≈ 2.54.

Moreover, there exists A0 ∈ (0, 1) (indeed, A0 ≈ 0.24) such that

The overall entropy associated to the four elements of the confusion matrix increases to 1 when A → +∞ and decreases to 0 when A → 0, and both CEN and MCEN are sensitive to this fact. Note that the lack of monotonicity of CEN(A) as A (and hence the overall entropy) monotonically increases is an anomalous behaviour that MCEN has managed to overcome. Moreover, MCEN ranges between 0 and 1. We can also observe this phenomenon in the examples in Table 6.

The asymmetric family VA.

Consider the particular case of confusion matrices of type VA, with A > 0. This is an asymmetric and unbalanced case in which class 2 is systematically misclassified and is underrepresented if A > 1. Class 1 is also mainly misclassified if A > 1. As A → +∞, the entropy outside the diagonal decreases to zero. The entropy within the diagonal is zero, while the overall entropy of the elements of matrix VA tends to 0 as A → +∞. When 0 < A < 1, with A = 1/B and B > 1, the matrix VA is equivalent to a rescaled matrix corresponding to an almost balanced but asymmetric scenario in which class 1 is mainly well classified but class 2 is not. As B increases (A → 0), the entropy outside the diagonal also drops to zero. Some properties of CEN, MCEN, ACC* and MCC* are given in Proposition 4 (see also Fig 6).

Fig 6. Family VA. CEN, MCEN, ACC* and MCC* for A ∈ (0, 10].

Proposition 4 For confusion matrix VA with A > 0, we have: As a consequence, there exists A1 ∈ (1, 2) (A1 ≈ 1.414) such that:

  • CEN(A) > 1 if 1 < A < A1, CEN(1) = CEN(A1) = 1, and CEN(A) < 1 if A ∉ [1, A1],
  • MCEN(A) < 1, ACC*(A) < 1, MCC*(A) < 1 and MCEN(A) < CEN(A) for all A > 0.

Note that, as in previous cases, CEN(A) does not stay within [0, 1] for all A > 0, while MCEN does. See Fig 6 and some examples in Table 7.

Apart from the fact that CEN is out of range for some values of A, its behaviour is similar to that of MCEN, both decreasing with the entropy, while neither ACC* nor MCC* is sensitive to the decrease of entropy when A → +∞.

The symmetric but unbalanced family XA, r.

Now we introduce the family of confusion matrices XA, r, with A, r > 0. Both class-1 and class-2 cases are mainly misclassified if A, r > 1. The overall entropy of XA, r drops to 0 when A → 0 and, when A → +∞, converges to a limit that in turn converges to 1 as r → +∞. For fixed A > 0, the overall entropy converges to 1 as r → +∞, and as r → 0 it converges to a limit that in turn converges to 0 both when A → 0 and when A → +∞.

When 0 < A < 1, with A = 1/B and B > 1, the matrix XA, r is equivalent to a rescaled matrix. We give some properties of CEN, MCEN, ACC* and MCC* in Proposition 5 below. Moreover, for r = 0.5 and r = 5, Figs 7 and 8 show how the measures evolve as functions of A, while Figs 9 and 10 show their plots as functions of r, for fixed A = 0.5 and A = 10.

Fig 7. Family XA, r. CEN, MCEN, ACC* and MCC* as a function of A > 0 for r = 0.5.

Fig 8. Family XA, r. CEN, MCEN, ACC* and MCC* as a function of A > 0 for r = 5.

Fig 9. Family XA, r. CEN, MCEN, ACC* and MCC* as a function of r > 0 for A = 0.5.

Fig 10. Family XA, r. CEN, MCEN, ACC* and MCC* as a function of r > 0 for A = 10.

Proposition 5 For confusion matrix XA, r with A, r > 0 we have: As a consequence, there exists r0 < 1 (r0 ≈ 0.8) such that, for any r > r0, there exists Ar > 0 with CEN(A) < 1 if A < Ar, CEN(Ar) = 1, and CEN(A) > 1 if A > Ar. If r ≤ r0, then CEN(A) ≤ 1 for any A > 0 and ℓCEN(r) < 1.

On the other hand, for any r > 0,

  • MCEN(A) < 1, ACC*(A) < 1 and MCC*(A) < 1, for all A > 0,
  • MCEN, ACC* and MCC* are monotonically increasing functions of A; CEN is not, and has a global maximum, which is > 1 if r > r0.

Moreover, there exist 0 < r3 < r2 < r1 < r0 < 1 (r3 ≈ 0.13, r2 ≈ 0.15, r1 ≈ 0.23) such that: Finally, for any fixed A > 0, while MCEN, ACC* and MCC* are monotonically increasing functions of r, CEN is not, as can be seen in Figs 9 and 10 for two values of A. Given A > 0, there exists rA > r0 such that CEN(A) > 1 for all r > rA.

Note that, although we do not make it explicit in the notation so as not to overload it, the performance measures depend on both A and r in the case of this doubly indexed family XA, r.

The asymmetric family YA, r.

Finally, we consider another particular doubly indexed family of confusion matrices in the binary case, with the same overall entropy as XA, r, denoted by YA, r, with A, r > 0. Class 2 is underrepresented and mainly misclassified if A, r > 1, while class-1 cases are classified “at random”, that is, a class-1 case has the same probability of being classified into either of the two classes. Although the entropy is the same as for XA, r, we will see that the performance measures behave in a different way for this family of confusion matrices. When 0 < A < 1, with A = 1/B and B > 1, the matrix YA, r is equivalent to a rescaled matrix. In Proposition 6 we give some properties of CEN, MCEN, ACC* and MCC*. See Fig 11 for r = 0.1 and Fig 12 for r = 0.8, and see Fig 13 for a plot of the measures as a function of r, for fixed A = 10.

Fig 11. Family YA, r. CEN, MCEN, ACC* and MCC* as a function of A > 0 for r = 0.1.

Fig 12. Family YA, r. CEN, MCEN, ACC* and MCC* as a function of A > 0 for r = 0.8.

Fig 13. Family YA, r. CEN, MCEN, ACC* and MCC* as a function of r for A = 10.

Proposition 6 For confusion matrix YA, r with A, r > 0 we have: As a consequence, there exists R0 < 1 (R0 ≈ 0.71) such that Moreover, there exist 0 < R1 < R0 < 1 < R2 (R1 ≈ 0.5, R2 ≈ 1.4) such that On the other hand, for any r > 0,

  • MCEN(A) < 1, ACC*(A) < 1 and MCC*(A) < 1, for all A > 0,
  • ACC* and MCC* are monotonically increasing functions of A; CEN is not, and MCEN may or may not be, depending on the value of r,
  • LMCEN(r) < LCEN(r) for all r > 0.

Note that LACC*(r) < LMCC*(r) if and only if .

Improving classification of the minority class while maintaining the imbalance between the classes.

Up to now, we have evaluated binary confusion matrices with different balances of the two classes but with the same classification results. Now let us do just the opposite. To help clarify the utility of MCEN for evaluating improvements in the classification of the minority class while maintaining the same amount of imbalance, we consider two different examples.

Example 1: We introduce a family of confusion matrices indexed by α = 1, 2, …, 101. When α = 1, the corresponding matrix belongs to the family {XA, r} with A = 50 and r = 2. The class imbalance stays fixed. When α = 1, the minority class is classified very badly, classification improves as α increases, and perfect classification is reached when α = 101. Is MCEN able to detect this behaviour? Yes, it is. Unlike what happens with CEN, MCEN (as well as ACC* and MCC*) monotonically decreases as classification of the minority class improves (α increases). CEN incongruously first increases up to α = 18 and only then starts to decrease and behave like the other performance measures (see Fig 14).

Fig 14. Family of Example 1, α = 1, 2, …, 101. CEN, MCEN, ACC* and MCC* as a function of α.

Example 2: A similar phenomenon can be observed with a second family, indexed by β = 1, 2, …, 101 (for β = 1 the corresponding matrix belongs to the family {YA, r} with A = 100 and r = 1). As in Example 1, the class imbalance is constant and, when β = 1, the minority class is classified very badly, with classification improving as β increases up to 101, when perfect classification is reached. MCEN, as well as ACC* and MCC*, monotonically decreases as β increases, while CEN increases up to β = 14 and only then starts to decrease and behave like the other performance measures (see Fig 15).

Fig 15. Family of Example 2, β = 1, 2, …, 101. CEN, MCEN, ACC* and MCC* as a function of β.

The ZA family

As noted in [4], the behaviour of the Confusion Entropy CEN is rather different from that of MCC* and ACC* for the pathological case of the family of confusion matrices ZA = (ai,j)i,j = 1, …, N, with A > 0. We want to study how MCEN behaves when applied to elements of this family.

Proposition 7

In general (N ≥ 2), As a consequence,

  • If N = 2,
    MCEN(ZA) < CEN(ZA) for all A > 0,
    MCEN(ZA) < 1 for all A > 0, and there exists A3 ∈ (1, 2) (A3 ≈ 1.85) such that
  • If N = 3 (we take this case as an example of what happens for N > 2),

In Figs 16 and 17 we can observe this behaviour when N = 2 and N = 3, respectively.

Fig 16. Family ZA. CEN, MCEN, MCC* and ACC* as a function of A > 0 for N = 2.

Fig 17. Family ZA. CEN, MCEN, MCC* and ACC* as a function of A > 0 for N = 3.

Table 8 shows some examples of confusion matrices of the family ZA, first with N = 2, and secondly with N = 4.

Table 8. Examples with different matrices ZA in the cases N = 2 and N = 4.

Note that CEN and MCEN exhibit a very different behaviour compared with ACC* and MCC*, since the former are sensitive to the overall entropy associated to the elements of the matrix. This entropy decreases to log(N² − 1) when A → 0, and drops to 0 when A → +∞.

Comparing with other performance measures

Several works have considered the question of introducing and comparing different performance measures for classification inspired, in one way or another, by Shannon’s entropy. For example, in [13] the authors introduce a novel measure called PACC (Probabilistic Accuracy) in the multi-class setting, making a comparative study of it with other measures such as Accuracy, MCC and CEN, among others.

Besides, the Entropy-Modulated Accuracy (EMA), introduced in [14], is a performance measure for classification tasks based on the concept of perplexity, the latter being defined as the effective number of classes a classifier sees. The authors also introduce the NIT (Normalized Information Transfer) factor, which is a correction of EMA. They compare both EMA and the NIT factor with Accuracy and CEN, rejecting rankings of classifiers based on Accuracy and choosing more meaningful and interpretable classifiers. They show in some examples that MCC is highly correlated with Accuracy, while the rankings obtained with CEN, EMA and the NIT factor are comparable in some cases but disagree in others.

Although PACC, EMA and the NIT factor are useful measures to assess classifiers, in our opinion none of them is completely satisfactory for grading the effectiveness of the classifier learning process, since each reflects some concrete feature of the classification process and is insufficient to cover all the aspects of this complex task; they should therefore be used cautiously and in a complementary way. That is, all the measures suffer from certain weaknesses that become evident in specific, more or less contrived situations. This comment also extends to both CEN and MCEN, although it should be noted that the latter solves the problems shown by CEN in the binary setting, as well as to MCC and Accuracy, the last one having been widely discussed (see, for example, the Introduction section in [14]).

Let us exemplify this fact by going back to the toy example in Table 2. In Table 9 we add the calculated values of PACC* = 1 − PACC and 1/NIT to those of Table 2. We use the NIT factor (inverted to make it comparable with the other measures) instead of EMA, since the probability distribution of the classes in the validation set is not uniform. Note that our confusion matrices are transposed with respect to those in [14], and also that for the NIT factor we use formula (4) of that paper. We have used the corrected definition provided by the authors, who had already acknowledged an erratum in Eq (4) in the comments of https://www.researchgate.net/publication/259743406_100_Classification_Accuracy_Considered_Harmful_The_Normalized_Information_Transfer_Factor_Explains_the_Accuracy_Paradox/.

Table 9. Toy example of Table 2 revisited, adding PACC and the NIT factor.

The behaviour of PACC* shown in Table 9 is consistent with that of MCC*, increasing when the IN entropy decreases (a) and decreasing when OUT decreases (b). However, the behaviour of 1/NIT is consistent with that of CEN and MCEN, decreasing in both cases. Nevertheless, unlike what happens with CEN and MCEN, the NIT factor does not distinguish between scenarios (a) and (b). This is because both EMA and the NIT factor are invariant under permutations of the columns.

Another example is that of the MEG mind reading challenge organized by the PASCAL (Pattern Analysis, Statistical modelling and ComputAtional Learning) network [15], already considered in [14]. We restrict our comparison to the group of the four most outstanding systems, denoted C1 (Huttunen et al.), C2 (Santana et al.), C3 (Jylänki et al.) and C4 (Tu & Sun), since for them, unlike the rest, we could access the confusion matrices in [15]. The results are in Table 10, and from them we see that the most comparable rankings are those given by the NIT factor, CEN and MCEN, showing the clusters {C4, C2} and {C1, C3}, with very small differences inside the clusters, especially the second. The authors of the report [15] were especially interested in the comparison C1 vs. C4, and the 1/NIT factor, as well as CEN and MCEN, gives the same ordering: C4 is better (lower value) than C1, in concordance with the interpretation given in [14].

Table 10. Results for the first four systems of the MEG mind reading challenge. Confusion matrices have been obtained from [15].

One more example shows the variability that appears when performance measures are compared: in Table 11 we see that the NIT factor (equivalently, EMA), unlike the other measures, is not able to distinguish between the classifiers whose confusion matrices are A and B in the binary case, nor between C and D in multi-class classification.

Table 11. Two toy examples. With S = 30 for N = 2, and with S = 40 for N = 3.

Supporting information file: Experiments and results

The advantages of using the Modified Confusion Entropy (MCEN) measure instead of CEN have been tested on different binary classifiers, constructed from four datasets available in the UCI Machine Learning Repository (https://archive.ics.uci.edu). From each dataset we construct and assess eight different classifiers, five of which are Bayesian networks, while the rest are other standard machine learning procedures used in supervised classification problems.

Given the comparisons carried out previously with different examples, we have to acknowledge the impossibility of deciding which of the considered performance measures should be used as a reference when the rankings of classifiers obtained with CEN and MCEN differ. We decided, then, to use the OUT entropy as such a reference when there is a disparity; in case of a tie, we use the IN entropy to break it. This is what we call “the criterion of entropy”.

To compare the rankings obtained from CEN and MCEN with the one obtained by the criterion of entropy, we use both the Hamming distance and the degree of consistency indicator c (see [16]).
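
A sketch of the two indicators used in this comparison; the degree of consistency follows our reading of the pairwise definition in [16] (among pairs of classifiers strictly ordered by both measures, the fraction ordered in the same direction), so it should be checked against that reference:

```python
def hamming_distance(ranking_a, ranking_b):
    """Number of positions at which two rankings (lists of classifier names) differ."""
    return sum(1 for a, b in zip(ranking_a, ranking_b) if a != b)

def degree_of_consistency(scores_f, scores_g):
    """Degree of consistency c between two measures evaluated on the same classifiers
    (dicts classifier -> score), following our reading of [16]."""
    items = list(scores_f)
    agree = disagree = 0
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            d_f = scores_f[items[i]] - scores_f[items[j]]
            d_g = scores_g[items[i]] - scores_g[items[j]]
            if d_f * d_g > 0:
                agree += 1
            elif d_f * d_g < 0:
                disagree += 1
    return agree / (agree + disagree) if (agree + disagree) > 0 else 1.0
```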

The results obtained with all the considered datasets heuristically reinforce the conclusion that MCEN is more correlated with entropy than CEN (see S1 File and Tables A-F in S1 File).

Conclusion

We introduced MCEN as a modification of the original Confusion Entropy performance measure CEN introduced in [3], both for binary and multi-class classification, and proved some of its properties. We compared this measure with CEN, MCC and Accuracy, showing that in the binary case MCEN overcomes the unreliability of CEN in a twofold sense: the departure from the range where it should lie (the interval [0, 1]), and the lack of monotonicity when the entropy increases or decreases. These features make CEN an inappropriate measure in the binary case, and MCEN proves to be a good alternative; we studied different scenarios to highlight this fact. Moreover, while neither Accuracy nor MCC can distinguish among different distributions of the misclassified cases in the confusion matrix, MCEN and CEN have a high level of discrimination.

First, we showed that in the binary case (see Table 2) both CEN and MCEN are sensitive to a decrease in the entropy within the main diagonal, IN, and also to a decrease in the entropy outside the diagonal, OUT, but while CEN is more sensitive than MCEN to IN, the opposite occurs with OUT. By contrast, ACC is insensitive as long as the sum of the diagonal and the total sum remain constant. Secondly, we considered the multi-class perfectly symmetric and balanced case, in which the main diagonal elements are equal to T and the elements outside the diagonal are equal to F; this case was analytically studied in detail, showing that CEN goes out of range in the binary case when γ = T/F ∈ (0, 1).

After that, we considered different particular situations in the binary setting, through the study of some families of confusion matrices. The family UA is symmetric and unbalanced, showing the out-of-range behaviour of CEN for any A > 1, and in addition a lack of monotonicity that contrasts with the behaviour of the overall entropy associated to the elements of the matrix. The family VA is asymmetric and unbalanced, and also shows the out-of-range behaviour of CEN, but only for A in the interval (1, A1), where A1 ≈ 1.4.

Two doubly indexed families have been considered in the binary case. CEN has an anomalous behaviour for the family XA, r, which is symmetric but unbalanced, for r > r0 (with r0 ≈ 0.8), since it is not only out of range from a certain value of A onward, but its limit when A → +∞ is greater than 1 if r > 1, showing lack of monotonicity. The same happens from a certain value of r onward, for fixed A. The family YA, r is also unbalanced but asymmetric. When r is in the interval (R0, 1), with R0 ≈ 0.71, CEN is not only out of range from a certain value of A onward, but its limit when A → +∞ is greater than 1, showing lack of monotonicity. There are also two other intervals of values of r in which CEN > 1 for A in a certain bounded interval.

Besides evaluating binary confusion matrices with the same classification results for the minority class but different balances of the two classes, we compared through two examples the behaviour of MCEN with that of CEN, ACC* and MCC* when evaluating improvements in the classification of the minority class while maintaining the same amount of imbalance. We showed that CEN is the only one that does not decrease monotonically as classification improves, so MCEN proves, also in this sense, to outperform CEN.

Finally, we also considered the multi-class family ZA, which is asymmetric and unbalanced, and observed that in the binary case CEN is out of range for A ∈ (1, A3), with A3 ≈ 1.85.

In all of these examples, MCEN behaves appropriately. Compared with the overall Shannon entropy associated to the set of elements of the confusion matrix, both CEN and MCEN are sensitive to it, but CEN sometimes does not show the same monotone behaviour as the entropy. As for Accuracy and MCC, conveniently scaled, they sometimes show a behaviour in contradiction with Shannon’s entropy, as for the families VA and ZA.

A further comparison has been carried out with the Probabilistic Accuracy (PACC) introduced in [13], and with the Entropy-Modulated Accuracy (EMA) and the Normalized Information Transfer (NIT) factor, both introduced in [14]. We considered different examples in which PACC* = 1 − PACC sometimes behaves consistently with MCC*, increasing when the IN entropy decreases and decreasing when OUT decreases, while 1/NIT behaves in accordance with CEN and MCEN, decreasing in both cases, but with the handicap that, unlike CEN and MCEN, the NIT factor does not distinguish between IN and OUT. But this is not always the case. Actually, no measure seems to be completely satisfactory, since each one reflects a specific characteristic of the classification process, so they should be used in a complementary way and none can be taken as a gold standard to compare the others.

Finally, to make clear the improvement of MCEN over CEN, we carried out experiments consisting of the comparison of the rankings of some classifiers obtained from four different real datasets using both measures. Mostly the classifier orderings match, but when they do not, it is MCEN that agrees most with the criterion of entropy. To see this, we use both the Hamming distance and the degree of consistency indicator c. These results heuristically support the use of MCEN as a better alternative to CEN in the binary case, when a performance measure based on entropy is required.

Supporting information

S1 File. Supporting information: Experiments and results.

Table A. Datasets used in the experiments. Table B. Classifiers used in the experiments. Table C. Results for the Breast cancer dataset. Table D. Results for the SPECT heart dataset. Table E. Results for the Congressional voting dataset. Table F. Results for the MONK's Problems.

https://doi.org/10.1371/journal.pone.0210264.s001

(PDF)

Acknowledgments

This work has been supported by the Ministerio de Economía y Competitividad, Gobierno de España, project ref. MTM2015-67802-P.

The authors wish to thank the anonymous referees for their careful reading and helpful comments, which resulted in an overall improvement of the paper, and especially for drawing their attention to the paper [14]. They are also grateful to the Center for Machine Learning and Intelligent Systems of the Bren School of Information and Computer Science (University of California, Irvine, U.S.A.) for creating and maintaining the UCI Machine Learning Repository.

References

  1. Matthews B.: Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta, Vol 405, Num 2, 442–451 (1975).
  2. Delgado, R., Núñez-González, D.: Enhancing Confusion Entropy (CEN) as measure for evaluating classifiers. In: Graña M. et al. (eds) International Joint Conference SOCO’18-CISIS’18-ICEUTE’18. Advances in Intelligent Systems and Computing, vol 771. Springer, Cham (2019).
  3. Wei J.-M., Yuan X.-Y., Hu Q.-H., Wang S.-Q.: A novel measure for evaluating classifiers. Expert Systems with Applications, Vol 37, 3799–3809 (2010).
  4. Jurman G., Riccadonna S., Furlanello C.: A Comparison of MCC and CEN Error Measures in Multi-Class Prediction. PLOS ONE, Vol 7, Num 8, 1–8 (2012).
  5. Jin, H., Wang, X.-N., Gao, F., Li, J., Wei, J.-M.: Learning Decision Trees using Confusion Entropy. Proceedings of the 2013 International Conference on Machine Learning and Cybernetics, Tianjin, 14-17 July (2013).
  6. Roumani Y.-F., May J.-H., Strum D.-P.: Classifying highly imbalanced ICU data. Health Care Management Science, Vol 16, 119–128 (2013).
  7. Roumani Y.-F., Roumani Y., Nwankpa J.-K., Tanniru M.: Classifying readmissions to a cardiac intensive care unit. Annals of Operations Research, vol. 263(1-2), 429–451 (2018).
  8. Wang X.-N., Wei J.-M., Jin H., Yu G., Zhang H.-W.: Probabilistic Confusion Entropy for Evaluating Classifiers. Entropy, Vol 15, 4969–4992 (2013).
  9. Antunes F., Ribeiro B., Pereira F.: Probabilistic modeling and visualization for bankruptcy prediction. Applied Soft Computing, vol. 60, 831–843 (2017).
  10. Sublime J., Grozavu N., Cabanes G., Bennani Y., Cornuéjols A.: From Horizontal to Vertical Collaborative Clustering using Generative Topographic Maps. International Journal of Hybrid Intelligent Systems, vol. 12(4), 245–256 (2015).
  11. Sublime, J., Matei, B., Murena, P.-A.: Analysis of the influence of diversity in collaborative and multi-view clustering. 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, 4126–4133 (2017).
  12. Sublime J., Matei B., Cabanes G., Grozavu N., Bennani Y., Cornuéjols A.: Entropy based probabilistic collaborative clustering. Pattern Recognition, vol. 72, 144–157 (2017).
  13. Sigdel M., Aygun R.: PACC - A Discriminative and Accuracy Correlated Measure for Assessment of Classification Results. Machine Learning and Data Mining in Pattern Recognition, Vol 7988, LNCS, pp 281–295 (2013).
  14. Valverde-Albacete F.J., Peláez-Moreno C.: 100% Classification Accuracy Considered Harmful: The Normalized Information Transfer Factor Explains the Accuracy Paradox. PLOS ONE, Vol 9, Num 1, 1–10 (2014).
  15. Klami, A., Ramkumar, P., Virtanen, S., Parkkonen, L., Hari, R., Kaski, S.: ICANN/PASCAL2 Challenge: MEG Mind Reading - Overview and Results. In: Klami, A., editor, Proceedings of ICANN/PASCAL2 Challenge: MEG Mind Reading. Espoo, Aalto University Publication series SCIENCE + TECHNOLOGY 29/2011, pp. 3–19. http://urn.fi/URN:ISBN:978-952-60-4456-9
  16. Huang J., Ling C.: Using AUC and Accuracy in Evaluating Learning Algorithms. IEEE Transactions on Knowledge and Data Engineering, vol. 17, 299–310 (2005).