Article

Techniques to Deal with Off-Diagonal Elements in Confusion Matrices

by Inmaculada Barranco-Chamorro *,† and Rosa M. Carrillo-García †
Departamento de Estadística e Investigación Operativa, Facultad de Matemáticas, Universidad de Sevilla, 41012 Sevilla, Spain
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Mathematics 2021, 9(24), 3233; https://doi.org/10.3390/math9243233
Submission received: 24 October 2021 / Revised: 4 December 2021 / Accepted: 10 December 2021 / Published: 14 December 2021
(This article belongs to the Special Issue Models and Methods in Bioinformatics: Theory and Applications)

Abstract:
Confusion matrices are numerical structures that describe the distribution of errors among the classes or categories in a classification process. From a quality perspective, it is of interest to know whether the confusion between the true class A and the class labelled as B is the same as the confusion between the true class B and the class labelled as A. Otherwise, a problem with the classifier, or of identifiability between classes, may exist. In this paper, two statistical methods are considered to deal with this issue. Both focus on the study of the off-diagonal cells of confusion matrices. First, McNemar-type tests of marginal homogeneity are considered, which must be followed by a one versus all study for every category. Second, a Bayesian proposal based on the Dirichlet distribution is introduced, which allows us to assess the probabilities of misclassification in a confusion matrix. Three applications, including a set of omic data, have been carried out using the R software.

1. Introduction

Confusion matrices are the standard way of summarizing the performance of a classification method. This is an issue of crucial interest in a variety of applied scientific disciplines, such as Geostatistics, data mining, text mining, Economics, Biomedicine or Bioinformatics, to cite only a few. A confusion matrix is obtained as a result of applying control sampling to a dataset to which a classifier has been applied. Provided that the qualitative response to be predicted has r ≥ 2 categories, the confusion matrix will be an r × r matrix, where the rows represent the actual or reference classes and the columns the predicted classes (or vice versa). So the diagonal elements correspond to the items properly classified, and the off-diagonal elements to the misclassified ones. If a classifier is fair or unbiased, then the classification errors between two given categories A and B must happen randomly, that is, they are expected to occur with approximately the same relative frequency in each direction. Quite often, this is not the case, and a kind of systematic error occurs in one direction, that is, the observed value in a cell is considerably greater (or smaller) than that of its symmetric cell in the confusion matrix. In this paper, by classification bias, we mean this kind of systematic error, which happens between categories in a specific direction. As for the mechanism causing it, we distinguish:
1.
The classification bias can be due to deficiencies in the classification method. For instance, it is well known [1] that an inappropriate choice of k in the k-nearest neighbor (k-nn) classifier may produce this effect. If this bias is detected, the method of selecting k must be revised;
2.
On the other hand, the classification bias may be caused by the existence of a unidirectional confusion between two or more categories, that is, the classes under consideration are not well separated. If this is detected, additional predictors that help distinguish between these specific classes may need to be incorporated into the classification process. Think, for instance, of a land-use classification problem with two given categories, such as water and rice; the probability of confusing water with rice need not be the same as that of confusing rice with water.
For all the aforementioned reasons, we consider it of interest to pay attention to the structure of misclassifications. In this paper, marginal homogeneity tests are first proposed to identify this problem in a global way. These are based on the Stuart–Maxwell test [2] and the Bhapkar test [3]. If marginal homogeneity is rejected, a One versus All methodology is proposed [4], in which McNemar tests are carried out for every category. Since, in this context, prior information is often available and must be incorporated into the estimation process [5], a Bayesian method based on the Dirichlet-multinomial distribution is developed to estimate the probabilities of confusion between the classes previously detected. To illustrate the use of our proposal, three applications are considered. Application 1 corresponds to the field of Geostatistics [6]. There, a 4 × 4 matrix is considered and studied in detail. Classification bias is detected in two categories. Bayesian estimates of the probabilities of overprediction and underprediction in these categories are given, along with other Bayesian summaries. Application 2 corresponds to a classification problem in text mining, specifically, literary genres [7]. In spite of the large number of categories, r = 10, our strategy allowed us to detect classification bias in several categories and to estimate the associated probabilities [8]. Finally, in Application 3, a really difficult diagnosis problem for Inflammatory Bowel Disease based on omic data is considered [9]. In this case, r = 3, which, as a novelty, allows us to visualize the posterior distributions associated with the different classes. We highlight that a serious problem of overprediction for Crohn's Disease has been detected and estimated.
As for recent works and references dealing with this topic, we highlight that most papers focus on the assessment of the overall accuracy of the classification process, the kappa coefficient, and methods to improve these measurements; see, for instance, [6,10,11] and references therein. Areas in which classification bias and its associated problems are of interest can be seen in [1,12,13,14,15]. However, few papers consider the study of the off-diagonal cells of a confusion matrix. In this sense, the paper by Tsendbazar et al. [16] can be cited, where similarity matrices between classes are proposed as weights for the computation of global accuracy measurements. On the other hand, the problem of inference with misclassified multinomial data from a Bayesian point of view is addressed in [17]. All these references show that this topic is of interest for a better definition of classes and the improvement of the global classification process. The statistical tools proposed here can be used for a better comprehension of the information in a confusion matrix. As computational tools, we highlight that the R software [18] and the packages [19,20,21] have been used.
It is of interest to highlight that the results proposed in this paper can be considered as a new metric to be applied to multi-class classification problems in machine learning [22]. In this sense, the first technique introduced in this paper, marginal homogeneity tests, can be used to detect systematic problems of a classifier. The second one, based on a Bayesian analysis of a confusion matrix, can be used as a micro technique that allows us to compare several classifiers. As a novelty, we highlight that we propose measurements to assess the performance of a classifier along with summaries of the variability of these measurements, which is not usual in machine learning.

2. Materials and Methods

In this section, we first propose considering a confusion matrix (or error matrix) as a statistical tool for the analysis of paired observations.
Let Y and Z be two categorical variables with r ≥ 2 categories. Let Y be the variable that denotes the reference (or actual) categories and Z the predicted classes. As a result of the classification process, the confusion matrix given in Table 1 is obtained, where n_{i,j} denotes the number of observations in the (i, j) cell, for i, j = 1, 2, …, r.
The accuracy of Table 1 is
accuracy = (1 / n_{++}) Σ_{i=1}^{r} n_{i,i},
where n_{++} equals the total number of elements in the table; that is, the accuracy is the proportion of items properly classified. Other common global measurements of the performance of a classifier are the kappa index, sensitivity, specificity, the Matthews correlation coefficient, the F1-score and, for 2 × 2 tables, the area under the ROC curve (AUC) [6]. All of them are global measurements, focusing mainly on the proportion of items properly classified, and do not pay attention to the structure of the off-diagonal elements.
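As a minimal sketch of this computation (the 3 × 3 confusion matrix below is purely illustrative, not taken from the paper), the accuracy is the trace of the matrix divided by the grand total:

```python
import numpy as np

# Hypothetical 3x3 confusion matrix: rows = actual classes, columns = predicted
M = np.array([[50, 3, 2],
              [4, 40, 6],
              [1, 5, 45]])

# accuracy = trace / grand total: the proportion of correctly classified items
accuracy = np.trace(M) / M.sum()
```

The off-diagonal structure, which is the focus of this paper, is ignored by this summary.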
Let us introduce the notation to address the problem at hand. Let us define the probability that (Y, Z) occurs in the cell corresponding to the ith row and the jth column, π_{ij} = P[Y = i, Z = j]. Then {π_{ij}} is the joint probability mass function (pmf) of (Y, Z).
The marginal pmf’s of Y and Z, denoted as { π i + } and { π + j } , respectively, are obtained as:
π_{i+} = Σ_{j=1}^{r} π_{ij},  π_{+j} = Σ_{i=1}^{r} π_{ij},
where
Σ_{i=1}^{r} π_{i+} = Σ_{j=1}^{r} π_{+j} = Σ_{i=1}^{r} Σ_{j=1}^{r} π_{ij} = 1.
{π_{i+}} and {π_{+j}} will be the basis on which to propose marginal homogeneity tests.

3. Marginal Homogeneity

Taking into account that the cells of a confusion matrix can be seen as data for matched pairs of classes, we propose to test whether marginal homogeneity can be assumed between the rows and columns of this matrix, which is equivalent to testing whether the row and column probabilities agree for all the categories, that is:
P[Y = s] = P[Z = s] ⟺ π_{s+} = π_{+s}, s = 1, 2, …, r. (1)
Note that (1) states that the proportion of items classified in the sth class agrees with the proportion of actual or reference items in this class. If this agreement holds for all the categories, this suggests that there do not exist systematic problems of classification (or classification bias) in our confusion matrix. This is the main idea on which our proposal is built.

3.1. 2 × 2 Table

Let us first introduce the method for a 2 × 2 confusion matrix. Here, we propose to apply a McNemar-type test [3] tailored to this context. So, for i = 1, 2, let us consider:
H_0: π_{i+} = π_{+i} vs. H_1: π_{i+} ≠ π_{+i}. (2)
Note that, in a classification problem, one of the variables refers to the actual category and the other one to the predicted class; so, in this context, the null hypothesis H_0 establishes that the probability of predicting the ith class is equal to the proportion of actual elements in that class. This agreement suggests that the performance of our classifier is good. On the other hand, the alternative hypothesis establishes that these probabilities significantly disagree. Therefore, if the null hypothesis is rejected, it can be concluded that there exists significant evidence of problems with this category. Nevertheless, we want to highlight the key feature of the method: this test allows us to focus on the probabilities associated with the off-diagonal elements of a confusion matrix, that is, the probabilities of the wrongly classified or misclassified elements, since (2) is equivalent to:
H_0: π_{12} = π_{21} vs. H_1: π_{12} ≠ π_{21}. (3)
To prove the equivalence between (2) and (3), it is enough to note that
π_{1+} = P[Y = 1] = P[Y = 1, Z = 1] + P[Y = 1, Z = 2],
π_{+1} = P[Z = 1] = P[Y = 1, Z = 1] + P[Y = 2, Z = 1], (4)
and therefore (2) can be reduced to (3).
Test (3) can be solved following an exact approach, based on the binomial test, or an asymptotic one, based on chi-squared type statistics.
Binomial approach. Let us consider the number of misclassifications and a new variable C defined as:
C = 1, if Y = 1 and Z = 2;  C = 0, if Y = 2 and Z = 1. (5)
C is a Bernoulli variable with success probability
π_c = P[C = 1] = P[Y = 1, Z = 2] = π_{1,2}.
The test given in (3) is equivalent to:
H_0: π_c = 0.5 vs. H_1: π_c ≠ 0.5. (6)
Let the statistic T = n_{12} be the number of misclassified observations in the (1, 2) cell. Under the null hypothesis proposed in (6), T follows a binomial distribution, T ∼_{H_0} B(n_{12} + n_{21}, 0.5). Therefore, the binomial test can be applied. Recall that the p-value of the test proposed in (6) is
p-value = 2 min{P_{H_0}[T ≤ n_{12}], P_{H_0}[T ≥ n_{12}]}.
A point of practical interest is that the exact approach allows us to carry out one-sided tests, which can also be solved in terms of the previously cited binomial test. The one-sided tests are
H_0: π_{1,2} ≤ π_{2,1} vs. H_1: π_{1,2} > π_{2,1}  and  H_0: π_{1,2} ≥ π_{2,1} vs. H_1: π_{1,2} < π_{2,1}. (7)
In terms of the variable C, introduced in (5), the one-sided tests proposed in (7) are equivalent to:
H_0: π_C ≤ 0.5 vs. H_1: π_C > 0.5  and  H_0: π_C ≥ 0.5 vs. H_1: π_C < 0.5, (8)
respectively. The interest of these one-sided tests will be seen in the practical applications.
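The exact approach above reduces to a standard binomial test on the off-diagonal counts. A sketch with SciPy (the counts n12 and n21 are illustrative, not taken from the paper):

```python
from scipy.stats import binomtest

n12, n21 = 15, 5  # illustrative off-diagonal counts of a 2x2 confusion matrix

# Two-sided exact McNemar test: under H0, T = n12 ~ B(n12 + n21, 0.5)
p_two_sided = binomtest(n12, n12 + n21, 0.5, alternative="two-sided").pvalue

# One-sided version, e.g. H1: pi_{1,2} > pi_{2,1}
p_greater = binomtest(n12, n12 + n21, 0.5, alternative="greater").pvalue
```

Since the null probability is 0.5, the two-sided p-value equals twice the smaller one-sided tail.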
Asymptotic approach. Under this approach [23], the following statistic can be considered to solve (3):
χ² = (n_{12} − n_{21})² / (n_{21} + n_{12}) ∼ χ²_1,
or the statistic with continuity correction proposed by Edwards [24]
χ²_c = (|n_{12} − n_{21}| − 1)² / (n_{12} + n_{21}) ∼ χ²_1.
In both cases, we have that p-value = P[χ²_1 > χ²_obs], where χ²_obs is the result of applying χ² (or χ²_c) to our observed 2 × 2 confusion matrix.
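The asymptotic statistics are direct to compute; a sketch with the same illustrative counts, with and without the continuity correction:

```python
from scipy.stats import chi2

n12, n21 = 15, 5  # illustrative off-diagonal counts

chi_sq = (n12 - n21) ** 2 / (n12 + n21)              # McNemar chi-squared statistic
chi_sq_c = (abs(n12 - n21) - 1) ** 2 / (n12 + n21)   # with Edwards' continuity correction

p_value = chi2.sf(chi_sq, df=1)      # P[chi2_1 > chi2_obs]
p_value_c = chi2.sf(chi_sq_c, df=1)  # always at least as large as p_value
```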

3.2. General Case

For a confusion matrix resulting from a multi-class classifier, r > 2, the Stuart–Maxwell test [3], also known as the generalized McNemar test, can be considered. This test is aimed at finding evidence of significant differences between the actual and predicted probabilities in any of the categories, specifically
H_0: π_{i+} = π_{+i} for all i = 1, 2, …, r, vs. H_1: ∃ i | π_{i+} ≠ π_{+i}. (9)
This test is based on the paired differences d = (d_1, …, d_{r−1}), where d_s = π_{+s} − π_{s+}. Note that d_r is omitted since Σ_{i=1}^{r} d_i = 0, as a result of Σ_{i=1}^{r} π_{i+} = Σ_{j=1}^{r} π_{+j} = 1. Under the null hypothesis H_0 of marginal homogeneity, it was proven in [3] that E(d) = 0 and the statistic,
χ²_0 = N dᵗ V̂⁻¹ d = (N d)ᵗ (N V̂)⁻¹ (N d) ∼ χ²_{r−1}, (10)
is asymptotically distributed as a chi-square variable with r − 1 degrees of freedom.
In (10), N = n_{++} = Σ_{i,j} n_{i,j} and V̂ is the estimated covariance matrix of the vector N d, whose elements are given by
v̂_{st} = −(π_{st} + π_{ts}), s ≠ t, s, t = 1, …, r − 1,
v̂_{ss} = π_{s+} + π_{+s} − 2π_{ss}, s = 1, …, r − 1.
A similar test was proposed by Bhapkar [3], based on the statistic
χ²_B = N dᵗ V̂⁻¹ d ∼ χ²_{r−1},
where the elements of V ^ are estimated by
v ^ s t = ( π s t + π t s ) ( π + s π s + ) ( π + t π t + ) s t , t , s = 1 , , r 1 v ^ s s = π s + + π + s 2 π s s ( π + s π s + ) 2 t , s = 1 , , r 1 .
Both statistics are related via
χ²_B = χ²_0 / (1 − χ²_0 / N),
and therefore they are equivalent.
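A sketch of how both statistics can be computed for an r × r confusion matrix (the function name is ours; V̂ is built from the sample proportions as above, and the Bhapkar statistic is obtained through the relation between the two):

```python
import numpy as np
from scipy.stats import chi2

def stuart_maxwell(M):
    """Marginal homogeneity tests for an r x r confusion matrix (rows = actual)."""
    M = np.asarray(M, dtype=float)
    r, N = M.shape[0], M.sum()
    p = M / N
    # d_s = pi_{+s} - pi_{s+}; the last component is dropped since they sum to zero
    d = (p.sum(axis=0) - p.sum(axis=1))[: r - 1]
    V = np.empty((r - 1, r - 1))
    for s in range(r - 1):
        for t in range(r - 1):
            V[s, t] = (p[s, :].sum() + p[:, s].sum() - 2.0 * p[s, s]
                       if s == t else -(p[s, t] + p[t, s]))
    chi0 = N * d @ np.linalg.solve(V, d)   # Stuart-Maxwell statistic
    chiB = chi0 / (1.0 - chi0 / N)         # Bhapkar statistic, via the relation above
    return chi0, chiB, chi2.sf(chi0, r - 1), chi2.sf(chiB, r - 1)
```

Both p-values are computed against the chi-squared distribution with r − 1 degrees of freedom; since χ²_B ≥ χ²_0, the Bhapkar p-value is never larger.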

3.3. Post-Hoc Analysis

If the null hypothesis is rejected in the previous tests, we do not know which particular differences between the category probabilities are significant. Our proposal is to use post hoc tests to explore which categories are significantly different while controlling the experiment-wise error rate. To this end, a One versus All approach is proposed. Specifically, for the ith category, with i = 1, …, r, let us consider:
H_{0,i}: π_{i+} = π_{+i} vs. H_{1,i}: π_{i+} ≠ π_{+i}. (11)
Similarly to (4), note that:
π_{i+} = P[Y = i] = P[Y = i, Z = i] + P[Y = i, Z ≠ i] = P[Y = i, Z = i] + Σ_{j ≠ i} P[Y = i, Z = j],
π_{+i} = P[Z = i] = P[Y = i, Z = i] + P[Y ≠ i, Z = i] = P[Y = i, Z = i] + Σ_{j ≠ i} P[Y = j, Z = i].
Therefore, (11) is equivalent to testing:
H_{0,i}: P[Y = i, Z ≠ i] = P[Y ≠ i, Z = i] vs. H_{1,i}: P[Y = i, Z ≠ i] ≠ P[Y ≠ i, Z = i]. (12)
Note that H_{0,i} states that the proportion of elements belonging to the ith class (Y = i) that are classified into other ones (Z ≠ i) must agree with the proportion of elements which belong to the remaining classes (Y ≠ i) and have been wrongly predicted or misclassified in the ith category (Z = i).
To carry out the test proposed in (12), consider the collapsed confusion submatrix given in Table 2.
The McNemar test, given in Section 3.1, can be applied to Table 2 with the test statistic T_i = n_{i+} − n_{ii}, which is distributed under the null hypothesis proposed in (12) as T_i ∼_{H_0} B(n_{i+} + n_{+i} − 2 n_{ii}, 0.5). We highlight that one-sided tests can also be carried out straightforwardly by applying the results in Section 3.1, which allows us to draw conclusions about the specific problems with the categories under consideration.
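The whole post hoc loop can be sketched as follows (the helper name is ours; each category is collapsed into a 2 × 2 table as above and tested exactly, with a Bonferroni-corrected significance level):

```python
import numpy as np
from scipy.stats import binomtest

def one_vs_all_mcnemar(M, alpha=0.05):
    """Exact McNemar test of H_{0,i} for every category of an r x r confusion matrix."""
    M = np.asarray(M)
    r = M.shape[0]
    alpha_corr = alpha / r  # Bonferroni correction over the r post hoc tests
    results = []
    for i in range(r):
        t_i = M[i, :].sum() - M[i, i]     # actual i, predicted elsewhere: n_{i+} - n_{ii}
        other = M[:, i].sum() - M[i, i]   # actual elsewhere, predicted i: n_{+i} - n_{ii}
        pval = binomtest(int(t_i), int(t_i + other), 0.5).pvalue
        results.append((i, int(t_i), int(other), pval, pval < alpha_corr))
    return results
```

One-sided versions follow by passing `alternative="greater"` or `alternative="less"` to `binomtest`.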

4. Bayesian Methodology

In this section, a Bayesian approach, based on the multinomial-Dirichlet model, is proposed to estimate the probabilities of misclassification in the confusion matrix.
Definition 1
(Multinomial distribution). Let r and n be positive integers, and let θ_1, …, θ_r be numbers satisfying 0 ≤ θ_i ≤ 1, i = 1, …, r, and Σ_{i=1}^{r} θ_i = 1. The discrete random vector X = (X_1, …, X_r)ᵗ follows a multinomial distribution with n trials and cell probabilities θ = (θ_1, …, θ_r)ᵗ if the joint probability mass function (pmf) of X is:
f_X(n_1, …, n_r | θ) = P[X_1 = n_1, …, X_r = n_r | θ] = (n! / Π_{j=1}^{r} n_j!) Π_{j=1}^{r} θ_j^{n_j}, (13)
on the set of (n_1, …, n_r) such that each n_j is a nonnegative integer and Σ_{j=1}^{r} n_j = n. (13) is denoted as: (X | n, θ) ∼ Multinomial(n, θ).
Recall that the multinomial distribution is used to describe an experiment consisting of n independent trials, where each trial results in one of r mutually exclusive outcomes. The probability of the jth outcome in every trial is θ_j. For j = 1, …, r, X_j is the count of the number of times the jth outcome happened in the n trials. Some properties of interest for our purposes are listed in the next lemma; additional details can be seen in [25].
Lemma 1.
Let (X | n, θ) ∼ Multinomial(n, θ) with θ = (θ_1, …, θ_r). Then,
1.
The marginal distributions are binomial, X_j ∼ B(n, θ_j), j = 1, …, r;
2.
E(X_j) = n θ_j and Var(X_j) = n θ_j (1 − θ_j), j = 1, …, r;
3.
Cov(X_j, X_k) = −n θ_j θ_k, j ≠ k.
Remark 1.
In the multinomial distribution, all the coordinates of the vector (X_1, …, X_r) are related, since their sum must be n. This fact results in all the pairwise covariances being negative, Cov(X_j, X_k) = −n θ_j θ_k, j ≠ k. Moreover, note that the negative correlation is stronger for variables with higher success probabilities. This makes sense, as the sum of the variables in the vector is constrained to n, so if one starts to get big, the others tend not to. These observations will be of interest in our applications.
Next, the Dirichlet distribution is introduced. Recall that this model is the conjugate prior of the multinomial distribution [26].
Definition 2.
Let θ = (θ_1, …, θ_r) be in the (r − 1)-simplex, that is, θ ∈ {θ_i ≥ 0 : Σ_{j=1}^{r} θ_j = 1}. The random vector θ = (θ_1, …, θ_r) follows a Dirichlet distribution with parameters α = (α_1, …, α_r), α_i > 0, if the joint probability density function (pdf) of θ is
f_θ(θ_1, …, θ_r | α) = (Γ(Σ_{j=1}^{r} α_j) / Π_{j=1}^{r} Γ(α_j)) Π_{j=1}^{r} θ_j^{α_j − 1}, (14)
where Γ(·) is the gamma function. (14) is denoted as θ | α ∼ Dirichlet(α).
Lemma 2.
Let θ | α ∼ Dirichlet(α). If α_i > 1 for all i, then the mode of θ | α is reached at
θ_i = (α_i − 1) / (Σ_{j=1}^{r} α_j − r), i = 1, …, r. (15)
Lemma 3.
Let θ | α ∼ Dirichlet(α). Then
1.
The marginal distributions are Beta distributed,
θ_j ∼ Beta(α_j, α_0 − α_j), with α_0 = Σ_{j=1}^{r} α_j.
2.
The marginal means and variances are
E(θ_j) = α_j / α_0,  Var(θ_j) = α_j (α_0 − α_j) / (α_0² (α_0 + 1)), j = 1, …, r. (16)
The Dirichlet-multinomial model can be applied to a confusion matrix as follows. Note that, in the confusion matrix defined in Table 1, the number of elements in the kth row, denoted as n_{k+}, is fixed (since the rows are the actual or reference categories). Our proposal is to treat every row as a multinomial distribution with n_{k+} trials and r possible outcomes (to be classified into the {1, …, r} classes), whose probabilities are denoted as (θ_{1|k}, …, θ_{r|k}):
(Y_k | n_{k+}, θ_k) ∼ Multinomial(n_{k+}, θ_k), where θ_k = (θ_{1|k}, …, θ_{r|k}),
Y_k = (Y_{1|k}, …, Y_{r|k}) and Y_{j|k} counts the number of elements in the kth reference category classified into the jth class, for j = 1, …, r.
Remark 2.
In terms of the notation introduced in Section 2, θ_{j|k} = P[Z = j | Y = k].
As the prior distribution for θ_k, a Dirichlet distribution is proposed:
(θ_k | α_k) ∼ Dirichlet(α_k).
Given a confusion matrix whose observed rows are denoted by y_k^{obs} = (n_{k,1}, …, n_{k,r}) = (n_{1|k}, …, n_{r|k}), by applying Bayes' Theorem, and since the Dirichlet distribution is a conjugate prior for the multinomial model, the posterior distribution of θ_k is
π(θ_k | y_k^{obs}, α_k) ∝ Π_{j=1}^{r} θ_{j|k}^{n_{j|k} + α_{j|k} − 1},
where ∝ stands for proportional to.
Therefore,
θ_k | y_k^{obs}, α_k ∼ Dirichlet(n_{1|k} + α_{1|k}, …, n_{r|k} + α_{r|k}).
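Conjugacy makes the update trivial: the posterior parameters are the observed row counts plus the prior parameters. A minimal sketch (the function names are ours; a uniform prior is assumed when none is supplied):

```python
import numpy as np

def dirichlet_posterior(row_counts, alpha_prior=None):
    """Posterior Dirichlet parameters for one observed row y_k^obs of the matrix."""
    n = np.asarray(row_counts, dtype=float)
    alpha = np.ones_like(n) if alpha_prior is None else np.asarray(alpha_prior, float)
    return n + alpha  # Dirichlet(n_{1|k} + alpha_{1|k}, ..., n_{r|k} + alpha_{r|k})

def posterior_mean(alpha_post):
    """Bayes estimates under quadratic loss: E(theta_j) = alpha_j / alpha_0."""
    alpha_post = np.asarray(alpha_post, dtype=float)
    return alpha_post / alpha_post.sum()
```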

5. Applications

5.1. Application 1

First, a confusion matrix taken from the fields of Geostatistics and Image Processing [6] is considered. The matrix has four categories (r = 4) and was obtained from an unsupervised classification method applied to a Landsat Thematic Mapper image. It is given in Table 3. The categories, related to land use, are: FallenLeaf, Conifers, Agricultural and Scrub. Rows correspond to the actual classes and columns to the predicted classes. The sample size is n = 434. As a global measurement of classification, we have that the accuracy = 0.74. A certain asymmetry or misclassification is observed in the off-diagonal elements, which suggests the existence of classification bias or significant differences between pairs of categories. Let us formalize these observations.

5.1.1. Marginal Homogeneity

Since we have a 4 × 4 matrix, to test multiple marginal homogeneity, the Stuart–Maxwell or Bhapkar tests must be applied. Summaries are given in Table 4. These are the observed values of the χ² statistics, the degrees of freedom (df) of their asymptotic distributions, r − 1 = 3, and the corresponding p-values (P[χ²_3 > χ²_obs]).
In both tests, we reach the conclusion that there exists significant evidence to reject the null hypothesis of marginal homogeneity. The next step is to look for those categories with serious deficiencies in the classification process. The One versus All methodology proposed in Section 3.3 is applied to every category. The necessary auxiliary submatrices are labelled next as Table 5, Table 6, Table 7 and Table 8.
McNemar tests are applied to Table 5, Table 6, Table 7 and Table 8. The results for two-sided and one-sided tests are given in Table 9.
Remark 3.
In order to properly interpret the p-values in Table 9, the problem of multiple comparisons must be taken into account. For a significance level α = 0.05 and by applying the Bonferroni correction, every test should be carried out at the α′ = α/r = 0.05/4 = 0.0125 significance level. Other corrections could also be applied.
Note that, from the p-values in Table 9, there exists evidence to reject marginal homogeneity for the categories FallenLeaf and Scrub, which correspond to the p-values 1 × 10⁻⁷ and 2.2 × 10⁻⁶ in Table 9, respectively. Let us go into details and consider the following test for FallenLeaf:
H_0: p_{fl} ≥ 0.5 vs. H_1: p_{fl} < 0.5, (17)
where p_{fl} = P[A_FallenLeaf, P_Others]. The p-value of this test is 1 × 10⁻⁷, so H_0 is rejected; that is, there exists significant evidence to reject that the proportion of elements that are actual FallenLeaf and are misclassified into others, p_{fl}, is greater than or equal to 0.5. Therefore, p_{fl} < 0.5 may be supposed. As seen in Section 3.1, the McNemar test allows us to restrict our attention to the (1, 2) and (2, 1) cells of Table 5. So p_{fl} < 0.5 is equivalent to supposing that P[A_Others, P_FallenLeaf] > 0.5; that is, in this case the dominant probability is that of observations which are actual in other categories and are predicted as FallenLeaf, P[A_Others, P_FallenLeaf]. So we may conclude that there exists confusion between the rest of the categories and FallenLeaf, since many more observations are assigned to the FallenLeaf class than actually belong to it. It could be said that there exists an overprediction of observations in the FallenLeaf class.
Analogously, for the Scrub class, the test which corresponds to the p-value 2.2 × 10⁻⁶ is:
H_0: p_s ≤ 0.5 vs. H_1: p_s > 0.5, (18)
with p_s = P[A_Scrub, P_Others].
In this case, we have the opposite situation: since the null hypothesis is rejected, there exists evidence to reject that the probability of being actual Scrub and being classified into other categories, p_s, is less than or equal to 0.5. Therefore, it can be concluded that p_s = P[A_Scrub, P_Others] is the dominant probability. So it can be said that an important part of the actual observations in Scrub give rise to confusion, and an important part of them are predicted in other classes, therefore causing an underprediction misclassification problem.
Since we have detected problems in certain categories, it is of interest to estimate the associated probabilities. This issue is studied in the next section from a Bayesian perspective.

5.1.2. Bayesian Approach

In this subsection, for every category, a uniform prior distribution is considered, which corresponds to the Dirichlet distribution with α_k = (1, …, 1). Given y_k^{obs} as the kth row of Table 3, the posterior distribution is:
θ_k | y_k^{obs}, α_k ∼ Dirichlet(α̃_k), (19)
with α̃_k = (n_{1|k} + 1, …, n_{r|k} + 1), for k = 1, …, 4.
Explicitly, for the category Fallen Leaf, y_1^{obs} = (65, 6, 0, 4), and by applying (19):
θ_1 | y_1^{obs}, α_1 ∼ Dirichlet(66, 7, 1, 5). (20)
From (16), the mean, variance and standard deviation of the posterior marginal distributions are given in Table 10. They are denoted by θ̂_j^b, Var(θ_j | α̃_1) and sd(θ_j | α̃_1), respectively.
The column θ̂_j^b in Table 10 provides the Bayes estimates, under the quadratic loss function, of the conditional probabilities given A_FallenLeaf. We highlight the good estimate obtained in this case, with P̂[P_FallenLeaf | A_FallenLeaf] = 0.83.
Remark 4.
The mode of the posterior distribution can also be given as a Bayesian estimate of the conditional probabilities given A_FallenLeaf. For the distribution in (20), it would be
mode(θ_1 | y_1^{obs}, α̃_1) = (0.867, 0.080, 0.000, 0.053).
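These summaries can be reproduced numerically; a short check of the posterior mean (16) and the componentwise mode formula of Lemma 2 for the Dirichlet(66, 7, 1, 5) posterior:

```python
import numpy as np

alpha_post = np.array([66.0, 7.0, 1.0, 5.0])  # posterior for the Fallen Leaf row
a0, r = alpha_post.sum(), alpha_post.size

mean = alpha_post / a0                                       # Bayes estimates (quadratic loss)
var = alpha_post * (a0 - alpha_post) / (a0**2 * (a0 + 1))    # posterior marginal variances
mode = (alpha_post - 1.0) / (a0 - r)                         # first entry ~0.867
```

Note that Lemma 2 strictly requires α_i > 1 for all i; here α_3 = 1, but applying the formula componentwise still returns the reported value 0.000 for that coordinate.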
Similarly, the Bayesian summaries are obtained for the rest of the categories. They are listed in Table 11 for Actual Conifers, in Table 12 for Actual Agricultural, and in Table 13 for Actual Scrub.

Conclusions

As a summary of previous tables, Table 14 is given with the Bayesian estimates of probabilities in every conditional distribution.
Let us look at these conditional distributions. First, we focus on the fourth column of Table 14, where the conditional probabilities associated with the Actual_Scrub category have been estimated. Note that
P̂[P_Scrub | A_Scrub] = 0.63
is quite low. Moreover, we have that
P̂[P_FallenLeaf | A_Scrub] = 0.17 and P̂[P_Agricultural | A_Scrub] = 0.14.
It could be said that there exists an underprediction of the Scrub category, since observations which are actual Scrub are often misclassified as FallenLeaf or Agricultural. These observations are coherent with the result of test (18).
As for the first column, corresponding to the conditional probabilities in the class A_FallenLeaf, we highlight the good estimate obtained for P̂[P_FallenLeaf | A_FallenLeaf] = 0.83. However, note that, in the first row of Table 14, we have
P̂[P_FallenLeaf | A_Agricultural] = 0.19 and P̂[P_FallenLeaf | A_Scrub] = 0.17,
which are coherent with the results of test (17). It could be said that those elements which are actual in the class FallenLeaf are properly classified, but there exist problems of confusion from other categories to FallenLeaf; specifically, actual Agricultural and Scrub observations are often misclassified as FallenLeaf. Both facts cause an overprediction of the FallenLeaf class.
Next, 95% credible intervals are given: equal-tail intervals, denoted as (q_{2.5%}, q_{97.5%}), and Highest Posterior Density (HPD) intervals, denoted as (HPD_l, HPD_s). Both intervals are obtained from the marginal distributions of the posterior Dirichlet distribution given in (19), using the R software [18] and the package [19]. Table 15 provides these intervals for the posterior distributions given A_FallenLeaf, Table 16 for A_Conifers, Table 17 for A_Agricultural and Table 18 for A_Scrub.
The credible intervals are quite similar. Recall that the HPD intervals are more precise.
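The equal-tail intervals follow directly from the Beta marginals of Lemma 3; a sketch for the Fallen Leaf posterior (HPD intervals require an additional numerical search and are omitted here):

```python
import numpy as np
from scipy.stats import beta

alpha_post = np.array([66.0, 7.0, 1.0, 5.0])  # posterior Dirichlet parameters
a0 = alpha_post.sum()

# 95% equal-tail credible interval of each marginal theta_j ~ Beta(a_j, a0 - a_j)
intervals = np.array([beta.ppf([0.025, 0.975], a_j, a0 - a_j) for a_j in alpha_post])
```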

5.2. Application 2

In this application, a confusion matrix with r = 10 categories is considered, given in Table 19. This matrix was obtained as a result of applying classification processes for literary genres to n = 500 books by using text mining techniques [7]. The categories under consideration are Romance (Rom), Mystery (Mys), Horror (Hor), History (His), Fiction (Fic), Fantasy (Fan), Comedy (Com), Children (Chi), Biographical (Bio) and Adventure (Adv). We have 50 actual observations in every category. The interest of this application is to illustrate the performance of our proposal in a different field, text mining, and with a larger confusion matrix, r = 10.

5.2.1. Marginal Homogeneity

In Table 20, the summaries of applying the Stuart–Maxwell and Bhapkar tests to the confusion matrix of Table 19 are listed. In both tests, the conclusion that there exists significant evidence to reject the null hypothesis of multiple marginal homogeneity is reached, and therefore the One versus All strategy based on the McNemar test is applied to every category listed in Table 19. The most relevant summaries of the one-sided tests are given in Table 21 and Table 22.
From the p-values in Table 21, it can be concluded that the Mystery, Fantasy and Children categories are overpredicted, with problems of classification of some of the other categories into these ones. On the other hand, from the p-values in Table 22, Romance, History, Comedy and Adventure are underpredicted, and actual observations in these categories are misassigned to other ones.
Next, Bayesian techniques are applied, which allow us to assess these observations.

5.2.2. Bayesian Approach

A process similar to the one explained in Application 1 has been followed. That is, a noninformative prior Dirichlet distribution is considered for every category, α_k = (1, …, 1), with k = 1, …, 10. The summary of the Bayesian estimates of the conditional probabilities is provided in Table 23.
For those categories in which H_1: p < 0.5 was accepted, an overprediction problem is expected to happen. From Table 21, these are Mystery, Fantasy and Children. It can be seen in Table 23 that, in these categories, the estimated probability of correct classification is high:
P ^ [ P _ M y s t e r y | A _ M y s t e r y ] = 0.667
P ^ [ P _ F a n t a s y | A _ F a n t a s y ] = 0.617
P ^ [ P _ C h i l d r e n | A _ C h i l d r e n ] = 0.617
Moreover, from the analysis by rows in these categories, we can observe that the estimated probabilities that actual observations in other categories are classified in these ones are high. As an illustration, consider the category M y s t e r y , and note that:
P ^ [ P _ M y s t e r y | A _ H o r ] = 0.15
P ^ [ P _ M y s t e r y | A _ F i c ] = 0.15
P ^ [ P _ M y s t e r y | A _ C o m ] = 0.20
P ^ [ P _ M y s t e r y | A _ A d v ] = 0.167
Equation (21) along with (22)–(25) explain the overprediction of M y s t e r y genre.
A similar analysis can be carried out for F a n t a s y and C h i l d r e n .
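Under the noninformative Dirichlet prior, each Bayes estimate in Table 23 is simply the posterior mean (n_{j|k} + 1)/(n_{k+} + r). As a quick cross-check (a sketch in Python, whereas the paper's computations use R), the A_Mys row of Table 19 reproduces the A_Mys column of Table 23:

```python
# Bayes estimates in Table 23 under the noninformative Dirichlet prior:
# posterior mean (n_{j|k} + 1) / (n_{k+} + r) for each cell of Table 19.
genres = ["Rom", "Mys", "Hor", "His", "Fic", "Fan", "Com", "Chi", "Bio", "Adv"]
counts = [0, 39, 2, 1, 1, 0, 1, 4, 2, 0]        # A_Mys row of Table 19

r, n = len(counts), sum(counts)                  # r = 10 genres, n = 50 books
post_mean = [(c + 1) / (n + r) for c in counts]

for g, m in zip(genres, post_mean):
    print(f"P^[P_{g} | A_Mys] = {m:.3f}")
```

The diagonal entry gives P^[P_Mys | A_Mys] = 40/60 ≈ 0.667, matching Table 23.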
On the other hand, for those categories in which H_1: p > 0.5 was accepted, an underprediction problem is expected. These are Romance, History, Comedy and Adventure, see Table 22. In these categories the estimated probability of right classification is moderate, see P^[P_Gen_i | A_Gen_i] in the diagonal of Table 23. From the analysis by columns in Table 23, note that actual observations in these categories are also classified into other ones with moderate probabilities (around 0.10 or 0.20).
To conclude, take a look at Horror and Fiction. In both cases, actual observations in Horror (or Fiction) are misclassified into Mystery, Fantasy and Children, but each category also receives misclassified actual observations from Com or Adv (for Fiction, from History and Comedy). There exists a balance between both opposite streams, which is not detected by the tests in our proposal.

5.3. Application 3

In this case the confusion matrix given in Table 24 is considered. It is taken from [9] (Figure 1, E). This matrix is obtained as a result of applying an artificial intelligence classification method for the diagnosis of Inflammatory Bowel Disease (IBD) based on fecal multiomics data. The IBDs are Crohn's disease (CD) and Ulcerative Colitis (UC); nonIBD refers to the control group. We chose this example because IBDs are really difficult to diagnose and classify, and their accurate diagnosis is an important issue in Medicine; details can be seen in [9].
Huang et al. proposed in [9] a method with high accuracy for the diagnosis of different types of IBD. Specifically, the accuracy of Table 24 is accuracy = 0.6683, which in this context is considered high. However, a certain asymmetry is observed in the off-diagonal elements of Table 24, which, due to the importance of the problem under consideration, deserves additional analysis.
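The reported accuracy is the trace of the confusion matrix divided by the total count. A quick check in Python (a sketch; the paper's computations were done in R):

```python
# Overall accuracy of Table 24: trace / total = (37 + 19 + 77) / 199.
M = [[37, 1, 15],       # rows = actual (nonIBD, UC, CD), cols = predicted
     [6, 19, 26],
     [15, 3, 77]]

total = sum(sum(row) for row in M)
acc = sum(M[i][i] for i in range(3)) / total
print(f"accuracy = {acc:.4f}")   # 0.6683, as reported in the text
```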

5.3.1. Homogeneity

Similarly to previous applications, the results of applying the multiple marginal homogeneity tests are given in Table 25. The marginal homogeneity is again rejected, so the one versus all strategy is applied, and its summaries are listed in Table 26.
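The Stuart–Maxwell statistic compares the marginal totals of actual and predicted labels. A pure-Python sketch for Table 24, using the standard form χ² = d′S⁻¹d over the first r − 1 categories, reproduces the value in Table 25:

```python
from math import exp

# Stuart-Maxwell marginal homogeneity statistic for Table 24 (pure Python).
# With d_i = row_i - col_i on the first r - 1 categories, chi2 = d' S^{-1} d,
# where S_ii = row_i + col_i - 2 n_ii and S_ij = -(n_ij + n_ji).
M = [[37, 1, 15],
     [6, 19, 26],
     [15, 3, 77]]
rows = [sum(r) for r in M]
cols = [sum(r[j] for r in M) for j in range(3)]
d = [rows[i] - cols[i] for i in range(2)]              # drop last category
S = [[rows[0] + cols[0] - 2 * M[0][0], -(M[0][1] + M[1][0])],
     [-(M[0][1] + M[1][0]), rows[1] + cols[1] - 2 * M[1][1]]]
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
Sinv = [[S[1][1] / det, -S[0][1] / det],
        [-S[1][0] / det, S[0][0] / det]]
chi2 = sum(d[i] * Sinv[i][j] * d[j] for i in range(2) for j in range(2))
pval = exp(-chi2 / 2)    # chi-square upper tail has a closed form for df = 2
print(f"chi2 = {chi2:.3f}, df = 2, p = {pval:.3e}")
```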
The analysis of results in Table 26 shows that:
1. For the control group, nonIBD, there does not exist evidence to reject the null hypothesis of marginal homogeneity. Therefore, we do not detect any systematic errors in this category;
2. For UC, the test
H_0: p_UC ≤ 0.5 vs. H_1: p_UC > 0.5, (26)
with p_UC = P[A_UC, P_Others], is considered. A p-value = 9.7 × 10^{−7} is obtained, and therefore the null hypothesis is rejected, which suggests underprediction of the UC category;
3. For CD, the test
H_0: p_CD ≥ 0.5 vs. H_1: p_CD < 0.5, (27)
with p_CD = P[A_CD, P_Others], is considered. A p-value = 0.0019 is obtained (see Table 26), which suggests overprediction of the CD disease.
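The one versus all p-values in Table 26 can be reproduced with exact binomial tails on the discordant counts: under H_0, the count b = #(A_i → P_Others) is Binomial(b + c, 0.5), where c = #(A_Others → P_i). A Python sketch (the paper uses R):

```python
from math import comb

def mcnemar_exact(b, c):
    # One-sided exact binomial p-values for the discordant counts (b, c):
    # under H0, b is Binomial(b + c, 0.5); tails are inclusive of b.
    n = b + c
    p_less = sum(comb(n, k) for k in range(0, b + 1)) / 2 ** n
    p_greater = sum(comb(n, k) for k in range(b, n + 1)) / 2 ** n
    return p_less, p_greater

M = [[37, 1, 15],      # Table 24, rows = actual, cols = predicted
     [6, 19, 26],
     [15, 3, 77]]
labels = ["nonIBD", "UC", "CD"]
pvals = {}
for i, lab in enumerate(labels):
    b = sum(M[i]) - M[i][i]                        # A_lab -> P_Others
    c = sum(M[j][i] for j in range(3)) - M[i][i]   # A_Others -> P_lab
    pvals[lab] = mcnemar_exact(b, c)
    print(lab, pvals[lab])
```

For UC, b = 32 and c = 4, giving the "Greater" p-value 9.7 × 10⁻⁷ of Table 26; for CD, b = 18 and c = 41, giving the "Less" p-value 0.0019.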
Since in this example we have evidence of problems of misclassification, the next step is to assess the conditional probabilities of interest.

5.3.2. Bayesian Approach

In this case, a noninformative prior distribution is first considered. Since other choices of prior are possible, a sequential use of Bayes' theorem is illustrated later.

Noninformative Prior Distributions

Let us consider a prior Dirichlet distribution with α_j = 1, j = 1, …, 3. As in previous applications, the Bayes estimates of conditional probabilities are obtained along with the variances and standard deviations of the marginal distributions. As a novelty in this application, we highlight that, since we are dealing with r = 3 categories, the posterior distribution associated to each category can be represented in the two-dimensional simplex, which allows a visual analysis of these joint distributions. To obtain the graphical representation in the two-dimensional simplex, 1000 values have been generated by using the R software, a grid has been established and the corresponding contour plots have been displayed.
From results in Table 27, we highlight that
P^[P_nonIBD | A_nonIBD] = 0.6786 and P^[P_CD | A_nonIBD] = 0.2857.
Although in the control group, A_nonIBD, there is no evidence of classification bias, the estimated probability of being classified as CD is relatively high. As for the plot given in Figure 1, note that the joint posterior distribution is quite concentrated and close to the nonIBD vertex. The mode of this posterior distribution can also be given as Bayes estimates of the conditional probabilities; these are:
P^[P_nonIBD | A_nonIBD] = 0.6981, P^[P_UC | A_nonIBD] = 0.0189,
and P^[P_CD | A_nonIBD] = 0.2830. These estimates are quite close to the previous ones.
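These summaries follow from closed-form Dirichlet moments: with α̃_j = n_j + 1 and α̃_0 = Σ_j α̃_j, the posterior mean is α̃_j/α̃_0, the variance is α̃_j(α̃_0 − α̃_j)/(α̃_0²(α̃_0 + 1)), and the mode is (α̃_j − 1)/(α̃_0 − r). A Python sketch for the A_nonIBD row (the paper's computations use R):

```python
# Closed-form posterior summaries for the A_nonIBD row of Table 24 under a
# Dirichlet(1, 1, 1) prior: posterior parameters alpha~_j = n_j + 1.
counts = [37, 1, 15]                          # A_nonIBD row of Table 24
alpha = [c + 1 for c in counts]               # posterior Dirichlet(38, 2, 16)
a0 = sum(alpha)
r = len(alpha)

mean = [a / a0 for a in alpha]
var = [a * (a0 - a) / (a0 ** 2 * (a0 + 1)) for a in alpha]
sd = [v ** 0.5 for v in var]
mode = [(a - 1) / (a0 - r) for a in alpha]

print("mean:", [round(m, 4) for m in mean])   # cf. Table 27
print("sd:  ", [round(s, 4) for s in sd])
print("mode:", [round(m, 4) for m in mode])   # cf. the modal estimates above
```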
In the A_UC category, Table 28, we found that P^[P_UC | A_UC] = 0.3704 is quite low, and P^[P_CD | A_UC] = 0.5000; that is, the estimated probability of an individual with UC being diagnosed as CD is surprisingly high.
As for the joint posterior distribution plotted in Figure 2, we highlight that the area of highest posterior density is closer to the CD vertex than to the UC vertex. This is coherent with the result of test (26), and confirms the underprediction of the UC category in favour of CD.
The mode of this posterior distribution is:
P ^ [ P _ n o n I B D | A _ U C ] = 0.1176 , P ^ [ P _ U C | A _ U C ] = 0.3725 , P ^ [ P _ C D | A _ U C ] = 0.5098 .
Again, these estimates are quite close to the previous ones.
Finally, let us study the CD category in Table 29.
We highlight that the estimated probability of right classification is the highest one, P^[P_CD | A_CD] = 0.7959, and the area of highest posterior density is close to the CD vertex, see Figure 3. The mode of the posterior in Figure 3 is (0.1579, 0.0316, 0.8106).
All these facts allow us to conclude that there exists a serious problem of overprediction of CD and underprediction of UC. To assess this fact, estimates of the conditional probabilities have been given. As for the joint posterior distributions, note that for A_nonIBD and A_CD they are close to the corresponding vertex, as can be seen in Figure 1 and Figure 3 respectively, which is good for a right classification. However, for A_UC the confusion with the category CD is clear, see Figure 2.
For completeness, 95 % credible intervals are given in Appendix A along with results for a uniform discrete prior in r points.

5.3.3. Sequential Use of Bayes Theorem

In this subsection, it is shown that, if new information is available, then Bayes' theorem can be used in a sequential way to update our beliefs. Moreover, our estimates exhibit less variability, as is illustrated next.
Step 1 (confusion matrix M_1). Consider the Dirichlet-multinomial model for every row in an r × r confusion matrix, denoted as M_1. That is, for k = 1, …, r, we have a prior Dirichlet distribution for θ_k,
θ_k | α_k ∼ Dirichlet(α_k),
and Y_k = (Y_{1|k}, …, Y_{r|k}), the kth row with the counts in M_1, is distributed as
Y_k | n_{k+}^{step1}, θ_k ∼ Multinomial(n_{k+}^{step1}, θ_k).
Given y_k^{obs}, the posterior distribution for θ_k is
θ_k | y_k^{obs}, α_k ∼ Dirichlet(n_{1|k}^{step1} + α_{1|k}, …, n_{r|k}^{step1} + α_{r|k}). (28)
Step 2 (confusion matrix M_2). If, in the same classification problem, a new confusion matrix M_2 is obtained from a set of observations independent of those used to build M_1, then the distribution given in (28) can be taken as the prior in Step 2 to get a new posterior distribution. Specifically, let
θ_k | α̃_k ∼ Dirichlet(α̃_k),
with α̃_{j|k} = n_{j|k}^{step1} + α_{j|k}, and let W_k = (W_{1|k}, …, W_{r|k}) be the kth row with the counts in M_2, where
W_k | n_{k+}^{step2}, θ_k ∼ Multinomial(n_{k+}^{step2}, θ_k).
Given w_k^{obs}, the posterior distribution for θ_k is
θ_k | w_k^{obs}, α̃_k ∼ Dirichlet(n_{1|k}^{step2} + α̃_{1|k}, …, n_{r|k}^{step2} + α̃_{r|k}).
To illustrate the sequential method, let us consider M 1 as the matrix given in Table 24 and M 2 the matrix given in Table 30.
The different estimates are listed in the following tables.
We highlight the increase in precision obtained when the new information is incorporated into the estimation process. Note that the standard deviations of the posterior distributions listed in Table 31, Table 32 and Table 33 are smaller than those listed in Table 27, Table 28 and Table 29. This is the main merit of the sequential use of Bayes' theorem.
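The two updating steps reduce to adding counts to the Dirichlet parameters. A Python sketch for the A_nonIBD row (prior, then Table 24, then Table 30) showing the drop in posterior standard deviations:

```python
# Sequential Bayes for the A_nonIBD row: Dirichlet(1, 1, 1) prior, updated
# with Table 24 (step 1) and then with Table 30 (step 2). Each update just
# adds the observed row counts to the current Dirichlet parameters.
def dirichlet_sd(alpha):
    a0 = sum(alpha)
    return [(a * (a0 - a) / (a0 ** 2 * (a0 + 1))) ** 0.5 for a in alpha]

prior = [1, 1, 1]
m1_row = [37, 1, 15]        # A_nonIBD row of Table 24
m2_row = [42, 1, 10]        # A_nonIBD row of Table 30

step1 = [a + y for a, y in zip(prior, m1_row)]    # Dirichlet(38, 2, 16)
step2 = [a + w for a, w in zip(step1, m2_row)]    # Dirichlet(80, 3, 26)

mean2 = [a / sum(step2) for a in step2]
print("step 2 mean:", [round(m, 4) for m in mean2])           # cf. Table 31
print("step 1 sd:", [round(s, 4) for s in dirichlet_sd(step1)])
print("step 2 sd:", [round(s, 4) for s in dirichlet_sd(step2)])
```

Every step-2 standard deviation is smaller than its step-1 counterpart, matching the comparison of Table 31 with Table 27.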

6. Discussion

The aim of this paper is to propose methods to detect classification bias, as well as overprediction and underprediction problems associated to the categories in a confusion matrix. The methods may be applied to confusion matrices obtained as a result of applying supervised learning algorithms, such as logistic regression, linear and quadratic discriminant analysis, naive Bayes, k-nearest neighbors, classification trees, random forests, boosting or support vector machines, among others. First, marginal homogeneity tests are introduced. They are based on techniques for matched pairs of observations tailored to this context. Second, a Bayesian methodology based on the multinomial-Dirichlet distribution is developed, which allows us to confirm these problems and to assess their magnitude by using prior information.
Three applications taken from different peer-reviewed scientific literature have been carried out. They illustrate relevant aspects related to the performance of our proposal, mainly varying the dimension r of the confusion matrix. In all of them, the results obtained have been satisfactory. We consider that these new methods are of interest for a better definition of classes, to improve the performance of classification methods, and to assess the global process of classification.
As for related work, we highlight the results given in [22], where an excellent review of metrics to deal with multi-class classification tasks is given. There, usual indicators such as accuracy, recall, F1-Score, and kappa coefficients, among others, along with their properties, can be found. In this sense, we highlight that the Bayesian results given in Section 4 can be used as a micro method, with the additional merit of providing measurements of the variability of the proposed summaries; in particular, the standard deviation of the posterior distributions can be used.
Carrying out a comparison of the results in our paper with existing metrics may be of interest in future work. Additionally, we intend to study the structure of confusions in more depth, for instance, to analyze whether certain classes have a common confusion structure or not, their relationships, the effect of the sample size, or dealing with unbalanced classes.

Author Contributions

Conceptualization, methodology and writing, I.B.-C.; validation and software, R.M.C.-G. All authors have read and agreed to the published version of the manuscript.

Funding

The research of Rosa M. Carrillo-García has been funded by Grant PI3 “Programa IMUS de Iniciación a la Investigación”, IMUS, Seville, 2021.

Data Availability Statement

References have been given where the confusion matrices used in the applications can be found.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Application 3

In this Appendix, for completeness, 95% credible intervals are given for the setting studied in Section 5.3, that is, Application 3. Also, results for another prior distribution, the Perks prior or discrete uniform prior distribution in r points, are included. Similar results to those for the continuous case are obtained.
Table A1. 95% credible intervals: A_nonIBD.
            q 2.5%      q 97.5%     HPD_l       HPD_s
P_nonIBD    0.5518703   0.7931919   0.5564379   0.7971693
P_UC        0.0044345   0.0971910   0.0008478   0.0837975
P_CD        0.1762997   0.4096195   0.1716235   0.4040643
Table A2. 95% credible intervals: A_UC.
            q 2.5%      q 97.5%     HPD_l       HPD_s
P_nonIBD    0.0547901   0.2302899   0.0477798   0.2195273
P_UC        0.2478722   0.5019668   0.2448073   0.4985839
P_CD        0.3683954   0.6316046   0.3683954   0.6316046
Table A3. 95% credible intervals: A_CD.
            q 2.5%      q 97.5%     HPD_l       HPD_s
P_nonIBD    0.0973247   0.2421973   0.0933429   0.2370862
P_UC        0.0113483   0.0877318   0.0076403   0.0800716
P_CD        0.7111385   0.8692713   0.7155517   0.8728854
These credible intervals are useful to assess the plausible values of the quantities of interest in the problem under consideration.
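Each marginal of a Dirichlet(α̃_1, …, α̃_r) distribution is Beta(α̃_j, α̃_0 − α̃_j), so equal-tail credible intervals come from Beta quantiles. A Monte Carlo sketch in Python using only the standard library (the paper uses the R package HDInterval [19]; the sample size and seed here are arbitrary choices):

```python
import random

# The marginal of theta_1 in the Dirichlet(38, 2, 16) posterior for A_nonIBD
# is Beta(38, 56 - 38) = Beta(38, 18). Monte Carlo equal-tail 95% interval.
random.seed(0)
draws = sorted(random.betavariate(38, 18) for _ in range(200_000))
lo = draws[int(0.025 * len(draws))]
hi = draws[int(0.975 * len(draws))]
print(f"95% equal-tail interval for P[P_nonIBD | A_nonIBD]: ({lo:.4f}, {hi:.4f})")
```

The result approximates the first row of Table A1, (0.5519, 0.7932), up to Monte Carlo error.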
Remark A1
(Perks prior or discrete uniform prior distribution in r points). A similar study to the one conducted in Section 5.3 was carried out by using a discrete uniform prior distribution in r points, that is, α_j = 1/r, j = 1, …, r. Similar results were obtained, which are listed in Table A4.
Table A4. Summaries IBD (discrete uniform prior).
            A_nonIBD   A_UC    A_CD
P_nonIBD    0.691      0.122   0.16
P_UC        0.025      0.372   0.035
P_CD        0.284      0.506   0.806
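Under this prior the posterior mean for each cell is (n_j + 1/r)/(n_{k+} + 1). A short Python check for the A_nonIBD column of Table A4 (a sketch; the paper uses R):

```python
# Perks-type prior alpha_j = 1/r (r = 3): posterior mean (n_j + 1/3) / (n + 1).
counts = [37, 1, 15]               # A_nonIBD row of Table 24
r = 3
n = sum(counts)
mean = [(c + 1 / r) / (n + 1) for c in counts]
print([round(m, 3) for m in mean])   # reproduces the A_nonIBD column of Table A4
```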

References

  1. Goin, J.E. Classification Bias of the k-Nearest Neighbor Algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 1984, PAMI-6, 379–381.
  2. Black, S.; Gonen, M. A Generalization of the Stuart-Maxwell Test. In SAS Conference Proceedings: South-Central SAS Users Group 1997; Applied Logic Associates, Inc.: Houston, TX, USA, 1997.
  3. Sun, X.; Yang, Z. Generalized McNemar's Test for Homogeneity of the Marginal Distributions. In Proceedings of the SAS Global Forum Proceedings, Statistics and Data Analysis, San Antonio, TX, USA, 16–19 March 2008; Volume 382, pp. 1–10.
  4. Hastie, T.; Tibshirani, R.; Friedman, J.H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer: New York, NY, USA, 2009.
  5. Barranco-Chamorro, I.; Luque-Calvo, P.; Jiménez-Gamero, M.; Alba-Fernández, M. A study of risks of Bayes estimators in the generalized half-logistic distribution for progressively type-II censored samples. Math. Comput. Simul. 2017, 137, 130–147.
  6. Congalton, R.G.; Green, K. Assessing the Accuracy of Remotely Sensed Data. Principles and Practices, 3rd ed.; CRC Press: Boca Raton, FL, USA, 2020.
  7. Carrillo-García, R.M. Text Mining: Principios Básicos, Aplicaciones, Técnicas y Casos Prácticos. Master's Thesis, Universidad de Sevilla, Sevilla, Spain, 2021.
  8. Carrillo-García, R.M. Algorithms and Applications in Statistical Data Mining; PI3: Programa IMUS de Iniciación a la Investigación; Instituto de Matemáticas de la Universidad de Sevilla: Sevilla, Spain, 2021.
  9. Huang, Q.; Zhang, X.; Hu, Z. Application of Artificial Intelligence Modeling Technology Based on Multi-Omics in Noninvasive Diagnosis of Inflammatory Bowel Disease. J. Inflamm. Res. 2021, 14, 1933–1943.
  10. Liu, C.; Frazier, P.; Kumar, L. Comparative Assessment of the Measures of Thematic Classification Accuracy. Remote Sens. Environ. 2007, 107, 606–616.
  11. Pontius, R.; Millones, M. Death to Kappa: Birth of quantity disagreement and allocation disagreement for accuracy assessment. Int. J. Remote Sens. 2011, 32, 4407–4429.
  12. Lance, R.F.; Kennedy, M.L.; Leberg, P.L. Classification Bias in Discriminant Function Analyses used to Evaluate Putatively Different Taxa. J. Mammal. 2000, 81, 245–249.
  13. Schmidt, R.L.; Walker, B.S.; Cohen, M.B. Verification and classification bias interactions in diagnostic test accuracy studies for fine-needle aspiration biopsy. Cancer Cytopathol. 2015, 123, 193–201.
  14. Rivas-Ruiz, F.; Pérez-Vicente, S.; González-Ramírez, A. Bias in clinical epidemiological study designs. Allergol. Immunopathol. 2013, 41, 54–59.
  15. Barranco-Chamorro, I.; Muñoz Armayones, S.; Romero-Losada, A.; Romero-Campero, F. Multivariate Projection Techniques to Reduce Dimensionality in Large Datasets. In Smart Data. State-of-the-Art Perspectives in Computing and Applications; CRC Press, Taylor & Francis Group: Boca Raton, FL, USA, 2019.
  16. Tsendbazar, N.; de Bruin, S.; Mora, B.; Schouten, L.; Herold, M. Comparative assessment of thematic accuracy of GLC maps for specific applications using existing reference data. Int. J. Appl. Earth Obs. Geoinf. 2016, 44, 124–135.
  17. Pérez, C.J.; Girón, F.J.; Martín, J.; Ruiz, M.; Rojano, C. Misclassified multinomial data: A Bayesian approach. Rev. Real Acad. Cienc. Exactas Fís. Nat. Ser. A Mat. (RACSAM) 2007, 101, 71–80.
  18. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2021.
  19. Meredith, M.; Kruschke, J. HDInterval: Highest (Posterior) Density Intervals. R Package Version 0.2.2. 2020. Available online: https://cran.r-project.org/web/packages/HDInterval/index.html (accessed on 12 July 2021).
  20. Signorell, A.; Aho, K.; Alfons, A.; Anderegg, N.; Aragon, T.; Arachchige, C.; Arppe, A.; Baddeley, A.; Barton, K.; Bolker, B.; et al. DescTools: Tools for Descriptive Statistics. R Package Version 0.99.44. 2021. Available online: https://cran.r-project.org/web/packages/DescTools/index.html (accessed on 12 December 2021).
  21. Tsagris, M.; Athineou, G. Compositional: Compositional Data Analysis. R Package Version 4.8. 2021. Available online: https://cran.r-project.org/web/packages/Compositional/index.html (accessed on 10 August 2021).
  22. Grandini, M.; Bagli, E.; Visani, G. Metrics for Multi-Class Classification: An Overview. arXiv 2020, arXiv:2008.05756.
  23. McNemar, Q. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 1947, 12, 153–157.
  24. Edwards, A. Note on the "correction for continuity" in testing the significance of the difference between correlated proportions. Psychometrika 1948, 13, 185–187.
  25. Balakrishnan, N.; Johnson, N.L.; Kotz, S. Multinomial Distributions. In Discrete Multivariate Distributions; Series in Probability and Statistics; Wiley: Hoboken, NJ, USA, 1997; Chapter 2.
  26. Kotz, S.; Balakrishnan, N.; Johnson, N.L. Dirichlet and Inverted Dirichlet Distributions. In Continuous Multivariate Distributions: Models and Applications; Series in Probability and Statistics; Wiley: Hoboken, NJ, USA, 2000; Volume 1, Chapter 49; pp. 458–527.
Figure 1. Posterior Dirichlet in A_nonIBD.
Figure 2. Posterior Dirichlet in A_UC.
Figure 3. Posterior Dirichlet in A_CD.
Table 1. Confusion matrix.
                        Z
Y      1           2           ⋯    r−1           r
1      n_{1,1}     n_{1,2}     ⋯    n_{1,r−1}     n_{1,r}
2      n_{2,1}     n_{2,2}     ⋯    n_{2,r−1}     n_{2,r}
⋮
r−1    n_{r−1,1}   n_{r−1,2}   ⋯    n_{r−1,r−1}   n_{r−1,r}
r      n_{r,1}     n_{r,2}     ⋯    n_{r,r−1}     n_{r,r}
Table 2. 2 × 2 table for a given category i.
         Z = i              Z ≠ i
Y = i    n_{ii}             n_{i+} − n_{ii}
Y ≠ i    n_{+i} − n_{ii}    Σ_{k≠i} Σ_{j≠i} n_{kj}
Table 3. Confusion matrix: Land use.
                 P_FallenLeaf   P_Conifers   P_Agricultural   P_Scrub
A_FallenLeaf     65             6            0                4
A_Conifers       4              81           11               7
A_Agricultural   22             5            85               3
A_Scrub          24             8            19               90
Table 4. Marginal Homogeneity (Land use).
                 χ²       df   p-Value
Stuart-Maxwell   11.202   3    0.010680
Bhapkar          11.654   3    0.008667
Table 5. Auxiliary matrix FallenLeaf.
               P_FallenLeaf   P_Others
A_FallenLeaf   65             10
A_Others       50             309
Table 6. Auxiliary matrix Conifers.
             P_Conifers   P_Others
A_Conifers   81           22
A_Others     19           312
Table 7. Auxiliary matrix Agricultural.
                 P_Agricultural   P_Others
A_Agricultural   85               30
A_Others         30               289
Table 8. Auxiliary matrix Scrub.
          P_Scrub   P_Others
A_Scrub   90        51
A_Others  14        279
Table 9. McNemar test for every category.
            FallenLeaf    Conifers    Agricultural   Scrub
Less        1 × 10^{−7}   0.7336454   0.5512891      0.9999994
Greater     1.0000        0.3776143   0.5512891      0.0000022
Two_Sided   2 × 10^{−7}   0.7552287   1.0000000      0.0000045
Table 10. Bayesian summaries in A_FallenLeaf.
                 n_{1+}   α_1   α̃_1   θ̂_j^b       Var(θ_j | α̃_1)   sd(θ_j | α̃_1)
P_FallenLeaf     65       1     66     0.8354430   0.0017185        0.0414545
P_Conifers       6        1     7      0.0886076   0.0010095        0.0317719
P_Agricultural   0        1     1      0.0126582   0.0001562        0.0124990
P_Scrub          4        1     5      0.0632911   0.0007411        0.0272225
Table 11. Bayesian summaries in A_Conifers.
                 n_{2+}   α_2   α̃_2   θ̂_j^b       Var(θ_j | α̃_2)   sd(θ_j | α̃_2)
P_FallenLeaf     4        1     5      0.0467290   0.0004125        0.0203090
P_Conifers       81       1     82     0.7663551   0.0016579        0.0407175
P_Agricultural   11       1     12     0.1121495   0.0009220        0.0303638
P_Scrub          7        1     8      0.0747664   0.0006405        0.0253085
Table 12. Bayesian summaries in A_Agricultural.
                 n_{3+}   α_3   α̃_3   θ̂_j^b       Var(θ_j | α̃_3)   sd(θ_j | α̃_3)
P_FallenLeaf     22       1     23     0.1932773   0.0012993        0.0360464
P_Conifers       5        1     6      0.0504202   0.0003990        0.0199746
P_Agricultural   85       1     86     0.7226891   0.0016701        0.0408666
P_Scrub          3        1     4      0.0336134   0.0002707        0.0164529
Table 13. Bayesian summaries in A_Scrub.
                 n_{4+}   α_4   α̃_4   θ̂_j^b       Var(θ_j | α̃_4)   sd(θ_j | α̃_4)
P_FallenLeaf     24       1     25     0.1724138   0.0009773        0.0312620
P_Conifers       8        1     9      0.0620690   0.0003987        0.0199685
P_Agricultural   19       1     20     0.1379310   0.0008144        0.0285381
P_Scrub          90       1     91     0.6275862   0.0016008        0.0400104
Table 14. Summary of Bayesian estimates of conditional probabilities in the Land use problem.
                 A_FallenLeaf   A_Conifers   A_Agricultural   A_Scrub
P_FallenLeaf     0.835          0.047        0.193            0.172
P_Conifers       0.089          0.766        0.05             0.062
P_Agricultural   0.013          0.112        0.723            0.138
P_Scrub          0.063          0.075        0.034            0.628
Table 15. 95% credible intervals: A_FallenLeaf.
                 q 2.5%      q 97.5%     HPD_l       HPD_s
P_FallenLeaf     0.7466787   0.9081616   0.7530619   0.9129924
P_Conifers       0.0368469   0.1599464   0.0316868   0.1517275
P_Agricultural   0.0003245   0.0461924   0.0000000   0.0376786
P_Scrub          0.0211397   0.1261276   0.0162449   0.1172020
Table 16. 95% credible intervals: A_Conifers.
                 q 2.5%      q 97.5%     HPD_l       HPD_s
P_FallenLeaf     0.0154910   0.0938056   0.0118001   0.0869559
P_Conifers       0.6820994   0.8411858   0.6856855   0.8442327
P_Agricultural   0.0598832   0.1780965   0.0559003   0.1725756
P_Scrub          0.0331458   0.1313375   0.0291373   0.1251047
Table 17. 95% credible intervals: A_Agricultural.
                 q 2.5%      q 97.5%     HPD_l       HPD_s
P_FallenLeaf     0.1277553   0.2685539   0.1246560   0.2648023
P_Conifers       0.0188861   0.0961158   0.0153899   0.0900707
P_Agricultural   0.6392494   0.7990403   0.6418989   0.8013868
P_Scrub          0.0093120   0.0725024   0.0062365   0.0660927
Table 18. 95% credible intervals: A_Scrub.
                 q 2.5%      q 97.5%     HPD_l       HPD_s
P_FallenLeaf     0.1156159   0.2377609   0.1129042   0.2344728
P_Conifers       0.0289745   0.1065309   0.0258846   0.1018195
P_Agricultural   0.0869472   0.1983572   0.0840294   0.1946674
P_Scrub          0.5476286   0.7042094   0.5488392   0.7053518
Table 19. Confusion matrix: Literary genres.
        P_Rom   P_Mys   P_Hor   P_His   P_Fic   P_Fan   P_Com   P_Chi   P_Bio   P_Adv
A_Rom   10      4       3       7       1       2       0       11      11      1
A_Mys   0       39      2       1       1       0       1       4       2       0
A_Hor   0       8       23      1       4       6       1       7       0       0
A_His   0       1       0       18      8       7       1       2       11      2
A_Fic   3       8       2       0       11      4       1       9       11      1
A_Fan   2       0       3       0       3       36      1       5       0       0
A_Com   2       11      7       2       5       3       4       12      3       1
A_Chi   1       4       1       3       1       3       0       36      0       1
A_Bio   2       4       3       2       4       2       1       10      22      0
A_Adv   0       9       6       2       2       8       0       9       2       12
Table 20. Marginal homogeneity (Literary Genres).
                 χ²        df   p-value
Stuart–Maxwell   94.199    9    2.341 × 10^{−16}
Bhapkar          121.179   9    < 2.2 × 10^{−16}
Table 21. Literary Genres: p-values of tests in which H_1: p < 0.5 was accepted.
        Mystery           Fantasy       Children       Biographical
Less    3.781 × 10^{−7}   0.001900827   3 × 10^{−10}   0.09090503
Table 22. Literary Genres: p-values of tests in which H_1: p > 0.5 was accepted.
          Romance            History      Comedy         Adventure
Greater   1.19307 × 10^{−5}  0.03245432   5.2 × 10^{−9}  4.715 × 10^{−7}
Table 23. Summary of Bayesian estimates of conditional probabilities in Literary Genres.
        A_Rom   A_Mys   A_Hor   A_His   A_Fic   A_Fan   A_Com   A_Chi   A_Bio   A_Adv
P_Rom   0.183   0.017   0.017   0.017   0.067   0.05    0.05    0.033   0.05    0.017
P_Mys   0.083   0.667   0.15    0.033   0.15    0.017   0.2     0.083   0.083   0.167
P_Hor   0.067   0.05    0.4     0.017   0.05    0.067   0.133   0.033   0.067   0.117
P_His   0.133   0.033   0.033   0.317   0.017   0.017   0.05    0.067   0.05    0.05
P_Fic   0.033   0.033   0.083   0.15    0.2     0.067   0.1     0.033   0.083   0.05
P_Fan   0.05    0.017   0.117   0.133   0.083   0.617   0.067   0.067   0.05    0.15
P_Com   0.017   0.033   0.033   0.033   0.033   0.033   0.083   0.017   0.033   0.017
P_Chi   0.2     0.083   0.133   0.05    0.167   0.1     0.217   0.617   0.183   0.167
P_Bio   0.2     0.05    0.017   0.2     0.2     0.017   0.067   0.017   0.383   0.05
P_Adv   0.033   0.017   0.017   0.05    0.033   0.017   0.033   0.033   0.017   0.217
Table 24. Confusion matrix: Inflammatory Bowel Disease (IBD).
           P_nonIBD   P_UC   P_CD
A_nonIBD   37         1      15
A_UC       6          19     26
A_CD       15         3      77
Table 25. Marginal homogeneity test (IBD).
                 χ²       df   p-value
Stuart-Maxwell   21.783   2    1.861 × 10^{−5}
Bhapkar          24.461   2    4.88 × 10^{−6}
Table 26. IBD: McNemar test for every category.
            nonIBD      UC             CD
Less        0.2556879   0.9999998864   0.001896853
Greater     0.8379957   0.0000009708   0.999226416
Two_Sided   0.5113758   0.0000019416   0.003793706
Table 27. Bayesian summaries in A_nonIBD.
           n_{1+}   α_1   α̃_1   θ̂_j^b       Var(θ_j | α̃_1)   sd(θ_j | α̃_1)
P_nonIBD   37       1     38     0.6785714   0.0038265        0.0618590
P_UC       1        1     2      0.0357143   0.0006042        0.0245803
P_CD       15       1     16     0.2857143   0.0035804        0.0598363
Table 28. Bayesian summaries in A_UC.
           n_{2+}   α_2   α̃_2   θ̂_j^b       Var(θ_j | α̃_2)   sd(θ_j | α̃_2)
P_nonIBD   6        1     7      0.1296296   0.0020514        0.0452921
P_UC       19       1     20     0.3703704   0.0042399        0.0651147
P_CD       26       1     27     0.5000000   0.0045455        0.0674200
Table 29. Bayesian summaries in A_CD.
           n_{3+}   α_3   α̃_3   θ̂_j^b       Var(θ_j | α̃_3)   sd(θ_j | α̃_3)
P_nonIBD   15       1     16     0.1632653   0.0013799        0.0371470
P_UC       3        1     4      0.0408163   0.0003955        0.0198861
P_CD       77       1     78     0.7959184   0.0016407        0.0405059
Table 30. M_2 matrix (IBD).
           P_nonIBD   P_UC   P_CD
A_nonIBD   42         1      10
A_UC       22         22     7
A_CD       34         10     51
Table 31. Bayesian summaries Step 2 in A_nonIBD.
           n_{1+}^{step2}   α̃_1   α̃_1^{step2}   θ̂_j^b       Var(θ_j | α̃_1^{step2})   sd(θ_j | α̃_1^{step2})
P_nonIBD   42               38     80             0.7339450   0.0017752                0.0421329
P_UC       1                2      3              0.0275229   0.0002433                0.0155988
P_CD       10               16     26             0.2385321   0.0016512                0.0406352
Table 32. Bayesian summaries Step 2 in A_UC.
           n_{2+}^{step2}   α̃_2   α̃_2^{step2}   θ̂_j^b       Var(θ_j | α̃_2^{step2})   sd(θ_j | α̃_2^{step2})
P_nonIBD   22               7      29             0.2761905   0.0018859                0.0434274
P_UC       22               20     42             0.4000000   0.0022642                0.0475831
P_CD       7                27     34             0.3238095   0.0020656                0.0454492
Table 33. Bayesian summaries Step 2 in A_CD.
           n_{3+}^{step2}   α̃_3   α̃_3^{step2}   θ̂_j^b       Var(θ_j | α̃_3^{step2})   sd(θ_j | α̃_3^{step2})
P_nonIBD   34               16     50             0.2590674   0.0009894                0.0314554
P_UC       10               4      14             0.0725389   0.0003468                0.0186223
P_CD       51               78     129            0.6683938   0.0011425                0.0338008