Article

Levels of Confidence and Utility for Binary Classifiers

The University of North Carolina at Charlotte, Charlotte, NC 28223, USA
Stats 2024, 7(4), 1209-1225; https://doi.org/10.3390/stats7040071
Submission received: 20 September 2024 / Revised: 7 October 2024 / Accepted: 11 October 2024 / Published: 17 October 2024
(This article belongs to the Section Data Science)

Abstract

Two performance measures for binary tree classifiers are introduced: the level of confidence and the level of utility. Both measures are probabilities of desirable events in the construction process of a classifier and hence are easily and intuitively interpretable. The statistical estimation of these measures is discussed. The usual maximum likelihood estimators are shown to have upward biases, and an entropy-based bias-reducing methodology is proposed. Along the way, the basic question of appropriate sample sizes at tree nodes is considered.

1. Introduction

The main objective of this article is to propose two performance measures for binary classifiers, the level of confidence and the level of utility. There is no lack of performance measures for tree classifiers in the literature of data science. However, the two proposed measures are probabilities of two desirable events associated with a binary classifier and, as such, imply simple and clear meanings. Furthermore, the proposed measures provide statistical support for considerations in the process of developing a binary classifier in the sense of classic probability and statistics.
Decision trees are a tool of central importance in modern data science, of which binary decision trees are an emblematic case. The discussion in this article has relevance to multinomial decision trees. However, for simplicity and clarity, the primary focus of the discussion below is on binary classifiers at nodes in a tree structure. The construction of a decision tree may be approached with different cultures and logics, each with pros and cons. Interested readers may refer to [1,2] for in-depth discussions on the different cultures of such undertakings. The two main cultures are often termed data science and statistics, respectively, albeit not without overlapping domains. Regardless of the variation in the logic of the construction effort, the core task remains the same. Consider a Bernoulli random variable, $B = B(p_x)$, where $x \in \mathcal{X}^*$ and $\mathcal{X}^*$ is a sample space for a random covariate element $X$. One of the simplest binary tree classifiers may be developed according to the following model.
  • An identically and independently distributed (iid) sample of size $n$ is taken, $\{(B_i, X_i); 1 \le i \le n\}$, where $X_i$ is a random element on $\mathcal{X}^*$ according to some distribution and $B_i$ is, conditioning on $X = x$, a Bernoulli random variable with $p_x$. To construct a binary tree classifier is to find, based on the sample, a partition of $\mathcal{X}^*$, denoted $\mathcal{X} = \{x_j; 1 \le j \le J\}$, such that in each sub-group indexed by $x_j$, $B_{x_j}$, or more simply $B_j$, is a Bernoulli random variable, conditioning on $X = x_j$, with $p_{x_j} = p_j$ and $q_j = 1 - p_j$. Let $N_j = \sum_{i=1}^{n} 1[X_i = x_j]$. $\{N_j; 1 \le j \le J\}$ is a multinomial vector of size $n$ with realization $\{n_j; 1 \le j \le J\}$. The first sample of size $n$, $\{(B_i, X_i); 1 \le i \le n\}$, may be thought of as pairs $(B_i, X_i)$, where $X_i$ is a random element on $\mathcal{X} = \{x_j; 1 \le j \le J\}$ with probability distribution $\boldsymbol{\lambda} = \{\lambda_j; 1 \le j \le J\}$, and $B_i$ is conditionally Bernoulli with $p_j$ given $X_i = x_j$. Let $Y_j = \sum_{i=1}^{n} B_i\,1[X_i = x_j]$ and $\hat{p}_j = Y_j/N_j$ be the frequency and the relative frequency of the sample of size $n$ in the $j$th sub-group. Another iid element, denoted $(B_{n+1}, X_{n+1})$, is to be taken.
  • A tree classifier is defined as follows: given $X_{n+1} = x \in \mathcal{X}$, (a) if $\hat{p}_x > 0.5$, $B_{n+1}$ is projected to be 1 (a success); (b) if $\hat{p}_x < 0.5$, $B_{n+1}$ is projected to be 0 (a failure); or (c) if $\hat{p}_x = 1 - \hat{p}_x = 0.5$, a fair coin is tossed to determine the classification of $B_{n+1}$.
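For concreteness, the sampling-and-majority-vote rule just described can be written in a few lines of Python. The sketch below is illustrative only (the function names and the toy sample are not from the paper): it tabulates $N_j$, $Y_j$ and $\hat{p}_j$ for each observed covariate value and applies rules (a)–(c) to a new observation.

```python
import random
from collections import defaultdict

def fit_node_proportions(sample):
    """Compute N_j, Y_j and the relative frequencies p_hat_j = Y_j / N_j
    for each observed covariate value x_j in an iid sample of (b, x) pairs."""
    counts = defaultdict(lambda: [0, 0])          # x_j -> [N_j, Y_j]
    for b, x in sample:
        counts[x][0] += 1
        counts[x][1] += b
    return {x: y / n for x, (n, y) in counts.items()}

def classify(p_hat, x):
    """Project B_{n+1} given X_{n+1} = x by rules (a)-(c): majority vote,
    with a fair coin toss in the case of a tie."""
    p = p_hat[x]
    if p > 0.5:
        return 1
    if p < 0.5:
        return 0
    return random.randint(0, 1)                   # fair coin on a tie

# toy usage with a hypothetical two-level covariate
sample = [(1, "a"), (1, "a"), (0, "a"), (0, "b"), (1, "b")]
p_hat = fit_node_proportions(sample)
print(p_hat, classify(p_hat, "a"), classify(p_hat, "b"))
```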
There is a long list of issues involved with constructing a classifier as described above, some of which are fundamental and some technical. For a comprehensive discussion, one may refer to, for example, [3]. The volume of methodologies for developing classifiers has increased rapidly in recent decades, but mostly in the realm of data science rather than statistics. There are good reasons why much of the development of classifiers is on the side of data science. One of the most distinctive characteristics of data science, as opposed to statistics, is the highly non-parametric nature of the associated methodologies. Unlike many traditional statistical models, which usually have a low-dimensional data space, data science models are more general, more flexible and more complex. As such, they have a tendency to be over-zealous in dynamically searching for and establishing features based on the sample in the data space. This phenomenon is sometimes known as “heat seeking” to data scientists, which may be thought of as over-fitting in the usual statistical terminology. On top of the said “heat seeking”, there exists a fact that exacerbates the situation: many important quantities of interest in developing and evaluating a classifier depend on the parameter $p^\star = \max\{p, 1-p\}$, and the usual and natural estimator $\hat{p}^\star = \max\{\hat{p}, 1-\hat{p}\}$ of $p^\star$ has an upward bias. This fact may be plainly seen in a very simple setting.
  • Let the binary alphabet be denoted $\mathcal{L} = \{\ell_1, \ell_2\}$ and associated with a probability distribution $P(\ell_1) = p$ and $P(\ell_2) = 1 - p$.
  • Let $p^\star = \max\{p, 1-p\}$ and $p_\star = \min\{p, 1-p\}$, and assume $p^\star > p_\star$.
  • Let the letter corresponding to probability $p^\star$ be denoted $\ell^\star$, that is, $\ell^\star = \arg\max_{\ell \in \mathcal{L}} P(\ell)$. Letter $\ell^\star$ is also referred to as the true letter.
Any reasonable performance measure of a simple classifier based on an iid Bernoulli sample is basically a function of $p^\star$ (and $p_\star$). Therefore, the quality of an estimator of $p^\star$ becomes essential to the quality of the estimator of such a performance measure, which could in turn guide the entire process of constructing a binary classifier. However, a good estimator of $p$ or $q = 1-p$ does not necessarily imply a good estimator of $p^\star = \max\{p, 1-p\}$. For example, the relative frequencies $\hat{p}$ and $\hat{q} = 1 - \hat{p}$ based on an iid Bernoulli sample are uniformly minimum variance unbiased estimators of $p$ and $q = 1-p$; but $\hat{p}^\star = \max\{\hat{p}, 1-\hat{p}\}$ is upwardly biased since the function $f(p) = \max\{p, 1-p\}$ is convex, and hence, by Jensen’s inequality, $E(\max\{\hat{p}, 1-\hat{p}\}) \ge \max\{E(\hat{p}), E(1-\hat{p})\} = p^\star$. In fact, the bias can be quite significant when the sample size $n$ is small. This simple observation has profound implications for the process of constructing a binary classifier, or more generally a binary tree classifier, as every node of a tree resembles a simple binary classifier. The upward bias tends to overestimate $p^\star$, hence exaggerating the confidence in selecting $\hat{\ell}^\star = \arg\max_{\ell \in \mathcal{L}} \hat{P}(\ell)$ as the likely true letter.
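The upward bias implied by Jensen’s inequality is easy to see numerically. The following minimal Monte Carlo sketch (Python; illustrative only, with an arbitrary seed and replication count) estimates $E(\max\{\hat{p}, 1-\hat{p}\}) - p^\star$ for a few sample sizes.

```python
import random

def simulated_bias(p_star, n, reps=50_000, seed=7):
    """Monte Carlo estimate of E[max(p_hat, 1 - p_hat)] - p_star for an
    iid Bernoulli(p_star) sample of size n; positive values show the upward bias."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        y = sum(rng.random() < p_star for _ in range(n))   # number of successes
        p_hat = y / n
        total += max(p_hat, 1.0 - p_hat)
    return total / reps - p_star

for n in (5, 10, 20, 50):
    print(n, round(simulated_bias(0.6, n), 4))   # the bias shrinks with n but stays positive
```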
In this article, several relevant results are presented in the subsections of Section 2. These results may be thought of as belonging to two categories. The first contains motivational arguments leading to the definitions of the level of confidence and the level of utility of a binary classifier. Along the way, a general entropy and a notion of entropic objects, including an entropic binomial distribution, are defined. The second contains considerations of the estimation of the levels of confidence and utility, which includes an introduction to the notion of an entropic maximum likelihood estimator (emle), as opposed to the maximum likelihood estimator (mle). The article then proposes a weighted average of the mle and the emle of $p^\star$ as a bias-alleviating estimator of $p^\star$. Several numerical calculations and simulation studies are also reported in the same section. Finally, the article ends with a few concluding remarks in the last section, including several recommendations to practitioners on how to incorporate the findings of this article into practice.

2. Main Results

2.1. Entropies and Entropic Objects

Consider a general countable alphabet, $\mathcal{L} = \{\ell_k; k \ge 1\}$, along with an associated probability distribution, $\mathbf{p} = \{p_k; k \ge 1\}$. Let $\mathbf{p}^{\downarrow} = \{p_{(k)}; k \ge 1\}$ be the non-increasingly re-arranged $\mathbf{p}$, that is, for every $k$, $k \ge 1$, $p_{(k)} \ge p_{(k+1)}$. A general notion of entropy was first given in [4], but is given below for a self-contained presentation.
Definition 1. 
A function $f(\mathbf{p})$ is referred to as an entropy if $f(\mathbf{p})$ depends on $\mathbf{p}$ only through $\mathbf{p}^{\downarrow}$, that is, $f(\mathbf{p}) = f(\mathbf{p}^{\downarrow})$.
Definition 1 not only defines general entropies but also implies a notion of label-independence. An entropy is a measure that is invariant with regard to the labels of the underlying alphabet. Many well-known entropies studied in the existing literature are of this kind, including Shannon’s entropy $H_s = -\sum_{k \ge 1} p_k \ln p_k$ as in [5], Rényi’s entropy $H_r = \ln\left(\sum_{k \ge 1} p_k^{\alpha}\right)/(1-\alpha)$ for some $\alpha$ where $0 < \alpha < \infty$ and $\alpha \ne 1$ as in [6], the Tsallis entropy $H_t = \left(1 - \sum_{k \ge 1} p_k^{\alpha}\right)/(\alpha - 1)$ for any $\alpha > 1$ as in [7], and the generalized Simpson’s entropy $H_{gs} = \sum_{k \ge 1} p_k^{u}(1 - p_k)^{v}$ for any pair of integers $u \ge 1$ and $v \ge 0$, as in [8,9]. It may be interesting to note that $p_{(k)}$ for any $k$ is an entropy, and in particular, $p^\star = p_{(1)}$ is an entropy.
In the spirit of Definition 1, the adjective “entropic” is adopted to describe objects that are label-independent. For example, a sample of size $n$ from a countable alphabet may be summarized by a multinomial random array $\mathbf{Y} = \{Y_k; k \ge 1\}$, which may be re-arranged non-increasingly into $\mathbf{Y}^{\downarrow} = \{Y_{(k)}; k \ge 1\}$ and referred to as the entropic statistics associated with $\mathbf{Y}$. Similarly, while the elements of $\mathbf{p} = \{p_k; k \ge 1\}$ are multinomial parameters, those of $\mathbf{p}^{\downarrow}$ may be referred to as entropic multinomial parameters. It may be interesting to note that, by the same token, a classifier, or a decision tree, is also entropic in nature and that the exercise of developing a classifier is also entropic.
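As a small illustration of Definition 1, the sketch below (Python; function names are illustrative and not from the paper) computes several of the entropies listed above and checks that their values are unchanged when the same probabilities are attached to different labels.

```python
import math

def shannon(p):            # H_s = -sum_k p_k ln p_k
    return -sum(x * math.log(x) for x in p if x > 0)

def renyi(p, alpha):       # H_r = ln(sum_k p_k^alpha) / (1 - alpha), alpha != 1
    return math.log(sum(x ** alpha for x in p)) / (1.0 - alpha)

def tsallis(p, alpha):     # H_t = (1 - sum_k p_k^alpha) / (alpha - 1)
    return (1.0 - sum(x ** alpha for x in p)) / (alpha - 1.0)

def gen_simpson(p, u, v):  # H_gs = sum_k p_k^u (1 - p_k)^v
    return sum(x ** u * (1.0 - x) ** v for x in p)

p = [0.5, 0.3, 0.2]
relabeled = [0.2, 0.5, 0.3]     # the same probabilities attached to different labels
assert math.isclose(shannon(p), shannon(relabeled))              # label-independence
assert math.isclose(renyi(p, 2.0), renyi(relabeled, 2.0))
assert math.isclose(gen_simpson(p, 2, 1), gen_simpson(relabeled, 2, 1))
print(sorted(p, reverse=True))  # the non-increasing rearrangement; its first entry is p*
```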

2.2. Entropic Binomial Distributions

Consider a Bernoulli population with probability $p$ and an iid sample of size $n$ taken from it. The sample may be summarized by a binomial random variable $Y \sim B(n, p)$ with probability distribution

$$P(y) = P(Y = y) = \frac{n!}{y!\,(n-y)!}\,p^{y}(1-p)^{n-y} \qquad (1)$$

for integer $y$, $0 \le y \le n$. Let $Y^\star = \max\{Y, n - Y\}$. The probability distribution of $Y^\star$ is

$$P^\star(y) = P(Y^\star = y) = \begin{cases} P(Y = y) + P(Y = n - y), & \text{for } y > n - y, \text{ that is, } y > n/2; \\ P(Y = y), & \text{for } y = n - y, \text{ that is, } y = n/2 \end{cases} = \begin{cases} \dfrac{n!}{y!\,(n-y)!}\left[(p^\star)^{y}(1-p^\star)^{n-y} + (p^\star)^{n-y}(1-p^\star)^{y}\right], & \text{for } n/2 < y \le n; \\[6pt] \dfrac{n!}{y!\,(n-y)!}\left[p^\star(1-p^\star)\right]^{n/2}, & \text{for } y = n/2. \end{cases} \qquad (2)$$

$P^\star(y)$ of (2) is referred to as the entropic binomial distribution. It is to be noted that the entropic binomial distribution is parameterized by the entropic parameter, $p^\star$, and not by the binomial parameter $p$. Also to be noted is the fact that all the probabilities in (2) are entropies by Definition 1. Furthermore, it is to be noted that the binomial probability of (1) is defined on a binomial sample space, while (2) is defined on an aggregated binomial sample space. The difference between the two is an important point to be exploited in this article.
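A direct way to work with (2) numerically is to aggregate the two symmetric binomial outcomes, as in the following sketch (Python; illustrative only, following the convention of (2) that the tie value $y = n/2$ receives a single binomial term).

```python
from math import comb

def binom_pmf(y, n, p):
    return comb(n, y) * p ** y * (1 - p) ** (n - y)

def entropic_binom_pmf(y_star, n, p_star):
    """P*(y) of (2): the distribution of Y* = max{Y, n - Y} for Y ~ B(n, p),
    parameterized by the entropic parameter p* = max{p, 1 - p}."""
    if 2 * y_star == n:                      # the tie value y = n/2 (n even)
        return binom_pmf(y_star, n, p_star)
    if n / 2 < y_star <= n:                  # aggregate the two symmetric binomial outcomes
        return binom_pmf(y_star, n, p_star) + binom_pmf(n - y_star, n, p_star)
    return 0.0                               # values below n/2 are impossible for Y*

n, p_star = 7, 0.75
total = sum(entropic_binom_pmf(y, n, p_star) for y in range(n + 1))
assert abs(total - 1.0) < 1e-12              # the aggregated probabilities sum to one
```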
Consider a mixture of several Bernoulli populations, each of which has probability $p_j$, for integers $1 \le j \le J$, with non-negative mixing weights, $\lambda_j$, such that $\sum_{j=1}^{J}\lambda_j = 1$. An iid sample of size $n$ may be summarized into $\{(N_j, Y_j); 1 \le j \le J\}$, where $\{N_j; 1 \le j \le J\}$ is a multinomial vector with size $n$ and category probabilities $\{\lambda_j; 1 \le j \le J\}$, and, given $\{N_j = n_j; 1 \le j \le J\}$, $\{Y_j \mid N_j = n_j; 1 \le j \le J\}$ is a vector of independent binomial random variables with probabilities $\{p_j; 1 \le j \le J\}$. It then follows that the probability distribution of $\{(N_j, Y_j); 1 \le j \le J\}$ is as given below. Writing $\mathbf{N} = \{N_j; 1 \le j \le J\}$, $\mathbf{n} = \{n_j; 1 \le j \le J\}$ as a realization of $\mathbf{N}$, $\mathbf{Y} = \{Y_j; 1 \le j \le J\}$ and $\mathbf{y} = \{y_j; 1 \le j \le J\}$ as a realization of $\mathbf{Y}$,

$$P((\mathbf{N}, \mathbf{Y}) = (\mathbf{n}, \mathbf{y})) = P(\mathbf{Y} = \mathbf{y} \mid \mathbf{N} = \mathbf{n})\,P(\mathbf{N} = \mathbf{n}) = \left[\prod_{j=1}^{J}\frac{n_j!}{y_j!\,(n_j - y_j)!}\,p_j^{y_j}(1-p_j)^{n_j - y_j}\right]\left[\frac{n!}{n_1!\,n_2!\cdots n_J!}\prod_{j=1}^{J}\lambda_j^{n_j}\right] = \frac{n!}{\prod_{j=1}^{J}\left[y_j!\,(n_j - y_j)!\right]}\prod_{j=1}^{J}\lambda_j^{n_j}\,p_j^{y_j}(1-p_j)^{n_j - y_j}. \qquad (3)$$

Let $Y_j^\star = \max\{Y_j, N_j - Y_j\}$ and let $y_j^\star = \max\{y_j, n_j - y_j\}$ be a realization of $Y_j^\star$ for every $j$, $1 \le j \le J$. The probability distribution of $\{(N_j, Y_j^\star); 1 \le j \le J\}$ is as given in (4) below. Let $\mathbf{Y}^\star = \{Y_j^\star; 1 \le j \le J\}$, let $\mathbf{y}^\star = \{y_j^\star; 1 \le j \le J\}$ be a realization of $\mathbf{Y}^\star$, and let $p_j^\star = \max\{p_j, 1-p_j\}$ for every $j$, $1 \le j \le J$. It is easily seen that, by way of (3), assuming $p_j \ne 0.5$ for every $j$ and for $y_j^\star$ with $n_j/2 \le y_j^\star \le n_j$ for all $j$, $1 \le j \le J$,

$$P((\mathbf{N}, \mathbf{Y}^\star) = (\mathbf{n}, \mathbf{y}^\star)) = P(\mathbf{Y}^\star = \mathbf{y}^\star \mid \mathbf{N} = \mathbf{n})\,P(\mathbf{N} = \mathbf{n}) = \frac{n!}{\prod_{j=1}^{J}\left[y_j^\star!\,(n_j - y_j^\star)!\right]}\prod_{j=1}^{J}\lambda_j^{n_j}\left[(p_j^\star)^{y_j^\star}(1-p_j^\star)^{n_j - y_j^\star} + (p_j^\star)^{n_j - y_j^\star}(1-p_j^\star)^{y_j^\star}\,1[y_j^\star > n_j/2]\right]. \qquad (4)$$
Remark 1. 
The probability in (4) is not an entropy in the sense of Definition 1 but a product of entropies, each of which is defined with respect to a Bernoulli sub-population indexed by $j$, $1 \le j \le J$.

2.3. Levels of Confidence and Utility

In the two-stage contemplation of constructing a tree classifier described in Section 1, there are two desirable events, as follows, both of which pertain to the $(n+1)$th observation $(B_{n+1}, X_{n+1})$.
  • $C$ = {$X_{n+1}$ falls into a sub-population (or a tree node), say $X_{n+1} = x_j$ for some $j$, where the classifier correctly identifies the true letter based on the sample of size $n$, $\{(B_i, X_i); 1 \le i \le n\}$}.
  • $U$ = {$B_{n+1}$ is correctly predicted based on the sample of size $n$, $\{(B_i, X_i); 1 \le i \le n\}$}.
Definition 2. 
The probability of $C$, $P(C)$, is referred to as the level of confidence, and the probability of $U$, $P(U)$, is referred to as the level of utility.
To be instructive, consider first the levels of confidence and utility in the case of a single Bernoulli population. Let $e(n)$ be the indicator function of the event that $n$ is an even integer.

$$P(C) = P(\hat{\ell}^\star = \ell^\star) = P(Y > n - Y) + e(n)\,P(Y = n/2)/2 = 1 - P(Y \le n/2) + e(n)\,P(Y = n/2)/2 = \sum_{n/2 < y \le n}\frac{n!}{y!\,(n-y)!}\,(p^\star)^{y}(1-p^\star)^{n-y} + \frac{n!\,e(n)}{2\,[(n/2)!]^2}\,[p^\star(1-p^\star)]^{n/2}, \qquad (5)$$

$$P(U) = p^\star\,P(C) + (1-p^\star)\,(1 - P(C)), \qquad (6)$$

where $Y \sim B(n, p^\star)$. It may be interesting to note that both (5) and (6) are entropies. The proof of the following fact is trivial.
Fact 1. 
Assuming $p^\star > 1 - p^\star$, $\lim_{n \to \infty} P(C) = 1$ and $\lim_{n \to \infty} P(U) = p^\star$.
Table 1 and Table 2 below give the level of confidence and the level of utility, according to (5) and (6), for several combinations of sample size $n$ and $p^\star$. Given a desired level of confidence or utility, an appropriate sample size may be found at every level of $p^\star$. For example, in Table 1, at a desired confidence level of 95% and $p^\star = 0.75$, the minimum sample size is $n = 9$. Similarly, to reach a utility level of 0.70 with $p^\star = 0.75$, a minimum sample of size $n = 5$ is required. In practice, however, $p^\star$ is unknown, and therefore either an empirical value must be assumed or $p^\star$ must be estimated in order to judge whether a sample size is adequate. The estimation of $p^\star$ is discussed in the next subsection.
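The following sketch (Python; function names are illustrative and not from the paper) evaluates (5) and (6) and searches for the smallest sample size reaching a target level of confidence, in the spirit of the sample-size reading of Table 1 and Table 2.

```python
from math import comb

def confidence_level(n, p_star):
    """P(C) of (5) for a single Bernoulli population observed n times."""
    pc = sum(comb(n, y) * p_star ** y * (1 - p_star) ** (n - y)
             for y in range(n // 2 + 1, n + 1))          # P(Y > n - Y)
    if n % 2 == 0:                                       # e(n): add half the tie probability
        pc += 0.5 * comb(n, n // 2) * (p_star * (1 - p_star)) ** (n // 2)
    return pc

def utility_level(n, p_star):
    """P(U) of (6)."""
    pc = confidence_level(n, p_star)
    return p_star * pc + (1 - p_star) * (1 - pc)

def min_n_for_confidence(p_star, target, n_max=1000):
    """Smallest n whose level of confidence reaches the target, if any."""
    for n in range(1, n_max + 1):
        if confidence_level(n, p_star) >= target:
            return n
    return None

print(min_n_for_confidence(0.75, 0.95))    # -> 9, matching the example discussed for Table 1
print(round(utility_level(5, 0.75), 2))    # about 0.70; compare with Table 2
```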
In the case of a tree structure, where there are $J \ge 2$ nodes, the forms of $P(C)$ and $P(U)$ are slightly more complex. First, let it be noted that $\mathbf{N} = \{N_1, \dots, N_J\}$ is a multinomial random vector with fixed size $n$ and multinomial probabilities $\boldsymbol{\lambda} = \{\lambda_1, \dots, \lambda_J\}$, that is,

$$P(\mathbf{N} = \{n_1, \dots, n_J\}) = \binom{n}{n_1, \dots, n_J}\prod_{j=1}^{J}\lambda_j^{n_j}. \qquad (7)$$

However, the marginal distribution of each $N_j$ is binomial, that is,

$$P(N_j = n_j) = \binom{n}{n_j}\lambda_j^{n_j}(1-\lambda_j)^{n - n_j} \qquad (8)$$

subject to $0 \le n_j \le n$. Therefore, for each $j$, $1 \le j \le J$, letting $C_j = \{\hat{\ell}_j^\star = \ell_j^\star\}$ be the event that the true letter at the $j$th node is correctly identified,

$$P(C_j) = P(\hat{\ell}_j^\star = \ell_j^\star) = \sum_{m=1}^{n} P(\hat{\ell}_j^\star = \ell_j^\star \mid N_j = m)\,P(N_j = m) = \sum_{m=1}^{n}\left[1 - P(Y_j \le m/2) + e(m)\,P(Y_j = m/2)/2\right]\frac{n!}{m!\,(n-m)!}\,\lambda_j^{m}(1-\lambda_j)^{n-m} \qquad (9)$$

where $Y_j \sim B(m, p_j^\star)$;

$$P(C) = \sum_{j=1}^{J}\lambda_j\,P(C_j), \qquad (10)$$

and

$$P(U) = \sum_{j=1}^{J}\lambda_j\left[p_j^\star\,P(C_j) + (1-p_j^\star)\,(1 - P(C_j))\right] \qquad (11)$$

where $P(C_j)$ is as in (9).
Example 1. 
Suppose a binary tree classifier has $J = 2$ nodes. The three parameters of the binary classifier are $\lambda$, $p_1^\star$, and $p_2^\star$, where $\lambda$ gives the partition weights, $\lambda_1 = \lambda$ and $\lambda_2 = 1 - \lambda$, and $p_1^\star$ and $p_2^\star$ are the maximum probabilities in the two partitions, respectively. By (9),

$$P(C_1) = \sum_{m=1}^{n}\left[1 - P(Y_1 \le m/2) + e(m)\,P(Y_1 = m/2)/2\right]\frac{n!}{m!\,(n-m)!}\,\lambda^{m}(1-\lambda)^{n-m},$$
$$P(C_2) = \sum_{m=1}^{n}\left[1 - P(Y_2 \le m/2) + e(m)\,P(Y_2 = m/2)/2\right]\frac{n!}{m!\,(n-m)!}\,(1-\lambda)^{m}\lambda^{n-m},$$
$$P(C) = \lambda\,P(C_1) + (1-\lambda)\,P(C_2), \qquad (12)$$

$$P(U) = \lambda\left[p_1^\star\,P(C_1) + (1-p_1^\star)(1 - P(C_1))\right] + (1-\lambda)\left[p_2^\star\,P(C_2) + (1-p_2^\star)(1 - P(C_2))\right]. \qquad (13)$$
Table 3, Table 4, Table 5, Table 6, Table 7 and Table 8 show calculated levels of confidence and utility for several combinations of the underlying parameters, $\{\lambda, p_1^\star, p_2^\star\}$, according to (12) and (13).
Consider two possible splits of the Bernoulli population with probability $p^\star = 0.70$, along with a no-split alternative, as follows.
1. Split A: $\lambda = 0.9$, $p_1^\star = 0.75$ (hence $p_2^\star = 0.75$);
2. Split B: $\lambda = 0.2$, $p_1^\star = 0.75$ (hence $p_2^\star = 0.6875$);
3. No Split: $\lambda = 1$ and $p^\star = 0.70$.
The levels of confidence as functions of the sample size $n$ are plotted in Figure 1, where the thick solid curve corresponds to No Split, the thin solid curve to Split A, and the dashed curve to Split B. Similarly, Figure 2 shows the three curves for the levels of utility. The fact that these curves cross, converge and dominate one another provides a basis for contemplation in the process of constructing and evaluating classifiers.
Calculations of $P(C)$ and $P(U)$ for more general cases may be carried out easily according to (10) and (11).
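Such calculations can be scripted directly from (9)–(11) as printed, as in the sketch below (Python; illustrative only, with the empty-node case $m = 0$ excluded from the sum, as in (9)). The sketch is applied to Splits A and B above as an example.

```python
from math import comb

def node_confidence(n, lam, p_star):
    """P(C_j) of (9): the node's true letter is correctly identified, averaging
    over the binomial node size N_j ~ B(n, lam); m = 0 is excluded as in (9)."""
    total = 0.0
    for m in range(1, n + 1):
        pc_m = sum(comb(m, y) * p_star ** y * (1 - p_star) ** (m - y)
                   for y in range(m // 2 + 1, m + 1))
        if m % 2 == 0:
            pc_m += 0.5 * comb(m, m // 2) * (p_star * (1 - p_star)) ** (m // 2)
        total += pc_m * comb(n, m) * lam ** m * (1 - lam) ** (n - m)
    return total

def tree_levels(n, lams, p_stars):
    """Overall P(C) and P(U) of (10) and (11) for a J-node classifier."""
    pcs = [node_confidence(n, l, p) for l, p in zip(lams, p_stars)]
    pc = sum(l * c for l, c in zip(lams, pcs))
    pu = sum(l * (p * c + (1 - p) * (1 - c))
             for l, p, c in zip(lams, p_stars, pcs))
    return pc, pu

# Splits A and B from the text, evaluated at a sample size of n = 30
print(tree_levels(30, [0.9, 0.1], [0.75, 0.75]))       # Split A
print(tree_levels(30, [0.2, 0.8], [0.75, 0.6875]))     # Split B
```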

2.4. MLE and Entropic MLE

In practice, both the confidence and utility levels are unknown and therefore need to be estimated. Consider first the case of a homogeneous Bernoulli population. Since $P(C)$ and $P(U)$ are functions of $p^\star$, their estimation boils down to that of $p^\star = \max\{p, 1-p\}$. Perhaps the most natural estimator of $p^\star$ is the maximum likelihood estimator (mle) under the binomial distribution in (1),

$$\hat{p}^\star = \max\{\hat{p}, 1 - \hat{p}\} \qquad (14)$$

where $\hat{p} = Y/n$, and the corresponding mles of $P(C)$ and $P(U)$ in (5) and (6) are

$$\hat{P}(C) = \sum_{n/2 < y \le n}\frac{n!}{y!\,(n-y)!}\,(\hat{p}^\star)^{y}(1-\hat{p}^\star)^{n-y} + \frac{n!/2}{[(n/2)!]^2}\,[\hat{p}^\star(1-\hat{p}^\star)]^{n/2}\,e(n), \qquad (15)$$

$$\hat{P}(U) = 1 - \hat{P}(C)\,(1-\hat{p}^\star) - \hat{p}^\star\,(1 - \hat{P}(C)). \qquad (16)$$
The following three facts collectively indicate a tendency of over-estimation by (15) and (16).
Fact 2. 
Suppose $p \in (0, 1)$. Then $E(\hat{p}^\star) > p^\star$.
A proof of Fact 2 is given in Section 1.
Fact 3. 
The confidence level, $P(C)$ of (5), is an increasing function of $p^\star$.
Proof. 
It is best to prove the fact in two separate cases: $n$ odd and $n$ even. Assuming $n$ is odd, rewriting (5) gives

$$P(C) = \sum_{n/2 < y \le n}\frac{n!}{y!\,(n-y)!}(p^\star)^{y}(1-p^\star)^{n-y} + \frac{n!/2}{[(n/2)!]^2}[p^\star(1-p^\star)]^{n/2}\,1[n \text{ is even}] = \sum_{y=(n+1)/2}^{n}\frac{n!}{y!\,(n-y)!}(p^\star)^{y}(1-p^\star)^{n-y} = P(Y \ge (n+1)/2) = P(2Y \ge n+1) = P(Y \ge (n-Y)+1) = P(Y > n - Y) \qquad (17)$$

where $Y$ is a binomial random variable with distribution $B(n, p^\star)$. Noting that $p^\star > 0.5$ by assumption and that the event $\{Y > n - Y\}$ is the event of “more successes than failures in a sample of size $n$”, it follows that $P(C)$ increases as $p^\star$ does.
Similarly, assuming $n$ is even,

$$P(C) = P(Y \ge (n+1)/2) + P(Y = n/2) - \frac{n!/2}{[(n/2)!]^2}[p^\star(1-p^\star)]^{n/2} = P(Y \ge n/2) - \frac{n!/2}{[(n/2)!]^2}[p^\star(1-p^\star)]^{n/2} = P(2Y \ge n) - \frac{n!/2}{[(n/2)!]^2}[p^\star(1-p^\star)]^{n/2} = P(Y \ge n - Y) - \frac{n!/2}{[(n/2)!]^2}[p^\star(1-p^\star)]^{n/2} \qquad (18)$$

where $Y$ is a binomial random variable with distribution $B(n, p^\star)$. Noting that $p^\star > 0.5$ by assumption and that the event $\{Y \ge n - Y\}$ is the event of “no fewer successes than failures”, it follows that $P(Y \ge n - Y)$ increases as $p^\star$ does. On the other hand, the negative term in (18) is a strictly decreasing function of $p^\star$ for $p^\star \in [0.5, 1)$. It follows that $P(C)$ is increasing in $p^\star$. □
Fact 4. 
The utility level, $P(U)$ of (6), is an increasing function of $p^\star$.
Proof. 
Noting $P(U) = P(C)\,p^\star + (1 - P(C))(1 - p^\star)$ of (6), taking the derivative of $P(U)$ with respect to $p^\star$, and letting $P'(C)$ denote the derivative of $P(C)$ with respect to $p^\star$,

$$P'(U) = P'(C)\,p^\star + P(C) - P'(C)\,(1-p^\star) - (1 - P(C)) = P'(C)\,(2p^\star - 1) + (2P(C) - 1). \qquad (19)$$

Noting $P'(C) > 0$ as shown in Fact 3, $p^\star > 0.5$ (and hence $2p^\star > 1$) and $P(C) \ge 0.5$ (and hence $2P(C) \ge 1$), it follows that $P'(U) > 0$ for $p^\star \in (0.5, 1)$. □
On the other hand, let the maximizing value of $p^\star$ in the likelihood of the entropic binomial distribution (2) be referred to as the entropic maximum likelihood estimator (emle), denoted $\tilde{p}^\star$, and let $g(\tilde{p}^\star)$, for any function $g(\cdot)$, be referred to as the emle of $g(p^\star)$. $\tilde{p}^\star$ tends to underestimate $p^\star$ with smaller samples. However, it provides an opportunity to offset the upward bias of the mle, $\hat{p}^\star$, in various ways, for example, by means of a weighted average

$$\hat{p}_w^\star = w\,\hat{p}^\star + (1-w)\,\tilde{p}^\star \qquad (20)$$

where $w$, $0 \le w \le 1$, may be data-based. More specifically, the following is the proposed estimator of $p^\star$ in this article, with $w = \hat{p}^\star$:

$$\hat{p}_w^\star = \hat{p}^\star\,\hat{p}^\star + (1 - \hat{p}^\star)\,\tilde{p}^\star = \tilde{p}^\star + \hat{p}^\star(\hat{p}^\star - \tilde{p}^\star), \qquad (21)$$

which may be viewed as an under-estimator, $\tilde{p}^\star$, with a non-negative correction term, $\hat{p}^\star(\hat{p}^\star - \tilde{p}^\star)$.
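Since the entropic likelihood (2) has no simple closed-form maximizer in general, the emle can be obtained numerically. The sketch below (Python; illustrative only, using a plain grid search over $[0.5, 1]$) computes $\tilde{p}^\star$ and the proposed estimator (21), and is run on the data used in Example 2 below ($Y = 5$, $n = 7$).

```python
from math import comb

def entropic_likelihood(p_star, y_star, n):
    """Likelihood of the entropic binomial distribution (2) at p*."""
    if 2 * y_star == n:
        return comb(n, y_star) * (p_star * (1 - p_star)) ** (n // 2)
    return comb(n, y_star) * (p_star ** y_star * (1 - p_star) ** (n - y_star)
                              + p_star ** (n - y_star) * (1 - p_star) ** y_star)

def emle(y_star, n, grid=20_000):
    """Entropic mle: maximize (2) over p* in [0.5, 1] on a fine grid."""
    return max((0.5 + 0.5 * k / grid for k in range(grid + 1)),
               key=lambda p: entropic_likelihood(p, y_star, n))

def proposed_estimator(y, n):
    """The weighted average (21): p~* + p^*(p^* - p~*), with weight w = p^*."""
    p_hat_star = max(y / n, 1 - y / n)
    p_tilde_star = emle(max(y, n - y), n)
    return p_tilde_star + p_hat_star * (p_hat_star - p_tilde_star)

# the data used in Example 2: Y = 5 successes in n = 7 trials
print(round(emle(5, 7), 4), round(proposed_estimator(5, 7), 4))
```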
Example 2. 
Suppose an iid Bernoulli sample yields $Y = 5$ and $n - Y = 2$. The likelihood of the binomial distribution (1) is proportional to $p^{y}(1-p)^{n-y}$, the dashed curve given in Figure 3; the mle of $p$ is $\hat{p} = 5/7 = 0.7143$, as the dashed arrow points to, and the mles of $P(C)$ and $P(U)$, by (15) and (16), are, respectively,

$$\hat{P}(C) = \sum_{y=4}^{7}\frac{7!}{y!\,(7-y)!}(5/7)^{y}(2/7)^{7-y} = 0.8917,$$
$$\hat{P}(U) = 1 - 0.8917\,(1 - 5/7) - (5/7)(1 - 0.8917) = 1 - 0.8917\,(2/7) - (5/7)(0.1083) = 0.6679.$$

The entropic likelihood of the entropic binomial distribution (2) is represented by the solid curve in Figure 3. The entropic maximum likelihood estimator (emle) of $p^\star$ is $\tilde{p}^\star = 0.6667$, as the solid arrow points at, and the emles of $P(C)$ and $P(U)$, by (15) and (16), are, respectively,

$$\tilde{P}(C) = \sum_{y=4}^{7}\frac{7!}{y!\,(7-y)!}(0.6667)^{y}(1-0.6667)^{7-y} = 0.8267,$$
$$\tilde{P}(U) = 1 - 0.8267\,(1-0.6667) - 0.6667\,(1-0.8267) = 0.6089.$$

By (21), (15) and (16), $\hat{p}_w^\star = 0.7000$, and

$$\hat{P}_w(C) = \sum_{y=4}^{7}\frac{7!}{y!\,(7-y)!}(0.7)^{y}(1-0.7)^{7-y} = 0.8740,$$
$$\hat{P}_w(U) = 1 - 0.8740\,(1-0.7) - 0.7\,(1-0.8740) = 0.6496.$$
It may be interesting to note that, in this case,

$$\tilde{p}^\star \le \hat{p}_w^\star \le \hat{p}^\star, \qquad (22)$$
$$\tilde{P}(C) \le \hat{P}_w(C) \le \hat{P}(C), \quad \text{and} \qquad (23)$$
$$\tilde{P}(U) \le \hat{P}_w(U) \le \hat{P}(U). \qquad (24)$$
It is to be mentioned that the qualitative difference between the two likelihood functions in Figure 3 holds in general: the values of emles are lower than those of mles. It is important to understand that the binomial distribution leads to an mle of $p$, and then one of $p^\star = \max\{p, 1-p\}$, while the entropic binomial distribution estimates $p^\star$ directly via the likelihood of $Y^\star$. The inequalities in (22)–(24) are deterministically true in general, which provides an opportunity for a reduction in the biases of $\hat{p}^\star$, $\hat{P}(C)$ and $\hat{P}(U)$.

2.5. Several Numerical Studies

Several simulation studies were conducted, and the results are presented in Tables 16 and 17 for the bias and the mean squared error (MSE) of $\hat{p}^\star$, $\tilde{p}^\star$, and $\hat{p}_w^\star$, respectively. Each simulated case is based on ten thousand random samples. Results are tabulated for combinations of four $p^\star$ values $\{0.6, 0.7, 0.8, 0.9\}$ crossed with ten values of $n$ from 5 to 50. The MSE values are quite stable across the three estimators in question; therefore, the comparison is mainly of the biases. It is observed that the simulated bias of the emle, $\tilde{p}^\star$, is significantly lower than that of the mle, $\hat{p}^\star$, for smaller values of $p^\star$ and smaller samples, and the bias of the proposed weighted-average estimator, $\hat{p}_w^\star$, falls in between. This is a fact well suggested by Facts 2–4. The weighted average, $\hat{p}_w^\star$, seems to do better than $\hat{p}^\star$ across the board. In the studied range of $p^\star$, the bias of $\hat{p}_w^\star$ seems to be controlled when the sample size reaches $n = 20$ or $n = 30$, as evidenced by the fact that the simulated bias is under 2% and 1%, respectively, while the bias of $\hat{p}^\star$ becomes controlled only when the sample size reaches $n = 40$ to $n = 50$. The observed difference between the required sample sizes is the advantage of the proposed estimator. However, it is also observed that the sample size required to reach a given precision, say a bias under 2% or 1%, depends on the value of $p^\star$: the smaller $p^\star$ is, the larger the sample required. In that sense, the biases tabulated in Tables 16 and 17 do not tell the whole story. In fact, the first part of Table 9 contains the biases for a very small value of $p^\star = 0.51$, which shows that the mle, $\hat{p}^\star$, needs $n = 200$ to have a bias under 2% and $n = 500$ to have a bias under 1%. It is once again seen that the proposed estimator, $\hat{p}_w^\star$, performs much better.
Several simulation results for estimating the levels of confidence and utility in the case of a single Bernoulli population are given in Tables 18 and 19. It is observed that the biases are quite large across the board for smaller values of $p^\star$ and smaller samples, albeit much smaller for the estimators of $P(U)$ than for those of $P(C)$. For example, when $p^\star = 0.6$, for the biases in Table 18 to be under 1%, a node size of $n = 300$ is needed; when $p^\star = 0.7$, a node size of $n = 70$ is needed. It may be interesting to note that, in Table 19, in estimating $P(U)$ when $p^\star = 0.6$, a node size of only $n = 70$ is needed for a bias under 1%; when $p^\star = 0.7$, a node size of only $n = 20$ is needed. An increasing trend in bias, as $p^\star$ decreases, is clearly observed. To give a more complete picture, the second part of Table 9 gives the biases of the mle and the proposed estimators of $P(C)$ and $P(U)$ at $p^\star = 0.51$. It is clearly seen that the mle $\hat{P}(C)$ of $P(C)$ performs very poorly, while the proposed estimator $\hat{P}_w(C)$ does much better, though not necessarily satisfactorily. It may be interesting to note that, in estimating $P(U)$, the mle and the proposed estimator are comparable in bias as the sample size varies. One could say that, for lack of a better term, $P(U)$ is easier to estimate than $P(C)$.
The simulation studies reported above give a glimpse of how $p^\star$, the level of confidence and the level of utility may be estimated, within a very limited scope, mostly focusing on a single Bernoulli population. However, more general and more complex situations may easily be studied in similar manners.
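As an indication of how such a simulation study can be set up, the following sketch (Python; illustrative only, with an arbitrary seed and ten thousand replications as in the text) estimates the bias of the plug-in mle of $P(C)$ obtained by inserting $\hat{p}^\star$ into (5).

```python
import random
from math import comb

def confidence_level(n, p_star):
    """P(C) of (5)."""
    pc = sum(comb(n, y) * p_star ** y * (1 - p_star) ** (n - y)
             for y in range(n // 2 + 1, n + 1))
    if n % 2 == 0:
        pc += 0.5 * comb(n, n // 2) * (p_star * (1 - p_star)) ** (n // 2)
    return pc

def plugin_bias(p_star, n, reps=10_000, seed=3):
    """Monte Carlo bias of the plug-in estimator of P(C) obtained by inserting
    the mle p_hat* into (5), based on `reps` simulated samples of size n."""
    rng = random.Random(seed)
    truth = confidence_level(n, p_star)
    total = 0.0
    for _ in range(reps):
        y = sum(rng.random() < p_star for _ in range(n))
        total += confidence_level(n, max(y / n, 1 - y / n))
    return total / reps - truth

for n in (20, 50, 100):
    print(n, round(plugin_bias(0.6, n), 4))
```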
Example 3. 
One of the most popular illustrative examples of a decision tree in data science involves predicting whether a randomly selected golfer goes to play the game under a set of weather conditions. A sample of n = 14 is given in Table 10.
Many questions about this data set may be asked. For illustration purposes, let the question be: if only one covariate is used, which of the four, among Outlook, Temp, Humidity and Wind, is the best predictor? Obviously, the sample size is too small to convey any meaningful reliability of the results, but this concern is set aside for the purpose of illustration. Several key statistics for each of the four factors are tabulated in Table 11, Table 12, Table 13 and Table 14. The statistics include $n_j$, $y_j$, $\hat{\lambda}_j$, $\hat{p}_j^\star$, $\hat{P}(C_j)$, and $\hat{P}(U_j)$ for each $j$, specifically noting that (9)–(11) are the basis of the plug-in estimators.
The estimated overall levels of confidence and utility are tabulated for each of the four covariates in Table 15. For comparison, the estimated Gini’s information impurities for the respective covariates are also tabulated. By the estimated levels of confidence and utility, Humidity is the best predictor, followed by Wind and then Outlook, and Temp is the worst. Incidentally, the estimated Gini’s information impurities support the same ranking.

3. Summary

This article proposes two performance measures, $P(C)$ and $P(U)$, that are the probabilities of two desirable label-invariant events in the sampling/development process of a binary tree classifier. They are referred to as the level of confidence and the level of utility of a binary classifier. A core component of these measures is the larger of the two probabilities in a Bernoulli trial, that is, $p^\star = \max\{p, 1-p\}$. Several properties of $p^\star$, $P(C)$ and $P(U)$ are discussed, as is the estimation of these quantities. However, let it be noted that, although $P(C)$ and $P(U)$ are the measures of central interest in this article, the estimation of $p^\star$ is important in its own right, since it could be a key element in evaluating many aspects of a binary classifier beyond those considered in this article.
One of the most distinct features identified in this article is the upward bias of the usual mle of $p^\star$, namely $\hat{p}^\star$. This bias may be significant, and it increases as $p^\star$ decreases toward 0.5 for a fixed $n$. Because of that, the mles of $P(C)$ and $P(U)$, namely $\hat{P}(C)$ and $\hat{P}(U)$, have the same issue, though to different extents. To control these biases within a reasonable bound, for example 1% or 2%, the required sample size may need to be very large.
In terms of practice, several recommendations are made below, which may provide some useful guidance.
  • Small-sample considerations are important because, in developing a tree classifier, the perpetual question is whether to go further into the next layer, regardless of the macro modeling logic one may use; as splitting continues, the node sample sizes shrink toward zero. One of the most important questions is therefore whether a node’s sample size is sufficiently large to be statistically meaningful. To answer this question, the best approach is to have a prior empirical judgment on the range of $p^\star$. If a range $[p_a, p_b)$, where $0.5 < p_a < p_b \le 1$, is judged reasonable, then $p_a$ may be used to determine the appropriate sample size via Formulas (5) and (6) at a given desired level, say 95% for $P(C)$ and another practically chosen level for $P(U)$, noting that $P(U)$ has a ceiling, that is, $P(U) \le p^\star$, according to Fact 1.
  • If no sufficient prior knowledge exists for $p^\star$, then a preliminary estimate for it is needed. The proposed estimator in (20), $\hat{p}_w^\star$, is preferred to the usual mle, $\hat{p}^\star$. The estimated $p^\star$ is then used in Formulas (5) and (6) to produce estimated levels of confidence and utility, which in turn could give baseline information for further adjustments, such as pruning or further splitting. Of course, such estimation requires a reasonable sample size. A recommended initial minimum sample size, according to Table 16 and a reasonable range $[0.6, 0.9)$, is $n = 40$ if $\hat{p}^\star$ is used and $n = 20$ if $\hat{p}_w^\star$ is used.
  • For a given binary tree classifier with $J \ge 2$ leaves or nodes, both the level of confidence and the level of utility may be estimated after $p_j^\star$ is estimated for each and every $j$, $1 \le j \le J$. Formulas (10) and (11) may be used, with the mle of $\lambda_j$, $\hat{\lambda}_j = n_j/n$, and the mle or the proposed estimator of $P(C_j)$ and $P(U_j)$ for each and every $j$, $1 \le j \le J$, to produce estimates of the overall levels of confidence and utility. Noting that both (10) and (11) are $\lambda$-weighted averages, the overall level of confidence or the overall level of utility may be negatively affected if individual nodes have particularly low $P(C_j)$ or $P(U_j)$ for some $j$, $1 \le j \le J$. If individual nodes are found to be low in confidence or utility, some repair or adjustment may be called for.
The main objective of this paper is to add two measures of performance to the literature on the development and evaluation of binary tree classifiers. The measures have intuitive and simple probabilistic meanings. As such, some basic questions, like the relationship between the parameters and sample sizes, may be naturally considered and described in the style of classic statistics. However, it must be noted that this proposal is not meant to replace or take away anything from the collection of methodologies in modern data science. It is hoped that the discussion of this article serves as a starting point for much more to come as data science advances and evolves.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Breiman, L. Statistical modeling: The two cultures (with discussion). Stat. Sci. 2001, 16, 199–231.
  2. Kass, R.E. The Two Cultures: Statistics and Machine Learning in Science. Obs. Stud. 2021, 7, 135–144.
  3. Tan, P.-N.; Steinbach, M.; Karpatne, A.; Kumar, V. Introduction to Data Mining, 2nd ed.; Pearson: London, UK, 2018.
  4. Zhang, Z. Entropy-Based Statistics and Their Applications. Entropy 2023, 25, 936.
  5. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423, 623–656.
  6. Rényi, A. On measures of information and entropy. Berkeley Symp. Math. Stat. Probab. 1961, 4.1, 547–561.
  7. Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487.
  8. Simpson, E.H. Measurement of diversity. Nature 1949, 163, 688.
  9. Zhang, Z.; Zhou, J. Re-parameterization of multinomial distribution and diversity indices. J. Stat. Plan. Inference 2010, 140, 1731–1738.
Figure 1. Confidence levels of competing splits.
Figure 2. Utility levels of competing splits.
Figure 3. Likelihood and entropic likelihood.
Table 1. Confidence levels, $P(C)$, as a function of $n$ and $p^\star$.

$p^\star$   0.50  0.55  0.60  0.65  0.70  0.75  0.80  0.85  0.90  0.95  1.00
n = 1       0.50  0.55  0.60  0.65  0.70  0.75  0.80  0.85  0.90  0.95  1.00
n = 2       0.50  0.55  0.60  0.65  0.70  0.75  0.80  0.85  0.90  0.95  1.00
n = 3       0.50  0.57  0.65  0.72  0.78  0.84  0.90  0.94  0.97  0.99  1.00
n = 4       0.50  0.57  0.65  0.72  0.78  0.84  0.90  0.94  0.97  0.99  1.00
n = 5       0.50  0.59  0.68  0.76  0.84  0.90  0.94  0.97  0.99  1.00  1.00
n = 6       0.50  0.59  0.68  0.76  0.84  0.90  0.94  0.97  0.99  1.00  1.00
n = 7       0.50  0.61  0.71  0.80  0.87  0.93  0.97  0.99  1.00  1.00  1.00
n = 8       0.50  0.61  0.71  0.80  0.87  0.93  0.97  0.99  1.00  1.00  1.00
n = 9       0.50  0.62  0.73  0.83  0.90  0.95  0.98  0.99  1.00  1.00  1.00
n = 10      0.50  0.62  0.73  0.83  0.90  0.95  0.98  0.99  1.00  1.00  1.00
n = 15      0.50  0.65  0.79  0.89  0.95  0.98  1.00  1.00  1.00  1.00  1.00
n = 20      0.50  0.67  0.81  0.91  0.97  0.99  1.00  1.00  1.00  1.00  1.00
n = 25      0.50  0.69  0.85  0.94  0.98  1.00  1.00  1.00  1.00  1.00  1.00
n = 30      0.50  0.71  0.86  0.95  0.99  1.00  1.00  1.00  1.00  1.00  1.00
n = 35      0.50  0.72  0.89  0.97  0.99  1.00  1.00  1.00  1.00  1.00  1.00
n = 40      0.50  0.74  0.90  0.97  1.00  1.00  1.00  1.00  1.00  1.00  1.00
n = 45      0.50  0.75  0.91  0.98  1.00  1.00  1.00  1.00  1.00  1.00  1.00
n = 50      0.50  0.76  0.92  0.98  1.00  1.00  1.00  1.00  1.00  1.00  1.00
n = 60      0.50  0.78  0.94  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00
n = 70      0.50  0.80  0.95  0.99  1.00  1.00  1.00  1.00  1.00  1.00  1.00
n = 80      0.50  0.81  0.96  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
n = 90      0.50  0.83  0.97  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
n = 100     0.50  0.84  0.98  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
n = 200     0.50  0.92  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
n = 300     0.50  0.96  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00
Table 2. Utility levels, $P(U)$, as a function of $n$ and $p^\star$.

$p^\star$   0.50  0.55  0.60  0.65  0.70  0.75  0.80  0.85  0.90  0.95  1.00
n = 1       0.50  0.51  0.52  0.55  0.58  0.63  0.68  0.75  0.82  0.91  1.00
n = 2       0.50  0.51  0.52  0.55  0.58  0.63  0.68  0.75  0.82  0.91  1.00
n = 3       0.50  0.51  0.53  0.57  0.61  0.67  0.74  0.81  0.88  0.94  1.00
n = 4       0.50  0.51  0.53  0.57  0.61  0.67  0.74  0.81  0.88  0.94  1.00
n = 5       0.50  0.51  0.54  0.58  0.63  0.70  0.77  0.83  0.89  0.95  1.00
n = 6       0.50  0.51  0.54  0.58  0.63  0.70  0.77  0.83  0.89  0.95  1.00
n = 7       0.50  0.51  0.54  0.59  0.65  0.71  0.78  0.84  0.90  0.95  1.00
n = 8       0.50  0.51  0.54  0.59  0.65  0.71  0.78  0.84  0.90  0.95  1.00
n = 9       0.50  0.51  0.54  0.60  0.66  0.73  0.79  0.85  0.90  0.95  1.00
n = 10      0.50  0.51  0.54  0.60  0.66  0.73  0.79  0.85  0.90  0.95  1.00
n = 15      0.50  0.52  0.56  0.62  0.68  0.74  0.80  0.85  0.90  0.95  1.00
n = 20      0.50  0.52  0.56  0.62  0.69  0.75  0.80  0.85  0.90  0.95  1.00
n = 25      0.50  0.52  0.56  0.63  0.69  0.75  0.80  0.85  0.90  0.95  1.00
n = 30      0.50  0.52  0.57  0.64  0.70  0.75  0.80  0.85  0.90  0.95  1.00
n = 35      0.50  0.52  0.58  0.64  0.70  0.75  0.80  0.85  0.90  0.95  1.00
n = 40      0.50  0.52  0.58  0.64  0.70  0.75  0.80  0.85  0.90  0.95  1.00
n = 45      0.50  0.53  0.58  0.64  0.70  0.75  0.80  0.85  0.90  0.95  1.00
n = 50      0.50  0.53  0.58  0.65  0.70  0.75  0.80  0.85  0.90  0.95  1.00
n = 60      0.50  0.53  0.59  0.65  0.70  0.75  0.80  0.85  0.90  0.95  1.00
n = 70      0.50  0.53  0.59  0.65  0.70  0.75  0.80  0.85  0.90  0.95  1.00
n = 80      0.50  0.53  0.59  0.65  0.70  0.75  0.80  0.85  0.90  0.95  1.00
n = 90      0.50  0.53  0.59  0.65  0.70  0.75  0.80  0.85  0.90  0.95  1.00
n = 100     0.50  0.53  0.59  0.65  0.70  0.75  0.80  0.85  0.90  0.95  1.00
n = 200     0.50  0.54  0.60  0.65  0.70  0.75  0.80  0.85  0.90  0.95  1.00
n = 300     0.50  0.55  0.60  0.65  0.70  0.75  0.80  0.85  0.90  0.95  1.00
Table 3. Confidence level, $P(C)$, with $J = 2$ and $\lambda = 0.5$.

$(p_1^\star, p_2^\star)$   (0.6, 0.6)  (0.6, 0.7)  (0.6, 0.8)  (0.6, 0.9)  (0.7, 0.7)  (0.8, 0.8)  (0.9, 0.9)
n = 5      0.5358  0.5752  0.6107  0.6419  0.6145  0.6855  0.7480
n = 10     0.6045  0.6577  0.7017  0.7360  0.7109  0.7989  0.8675
n = 20     0.6574  0.7252  0.7726  0.8017  0.7930  0.8879  0.9460
n = 30     0.6941  0.7691  0.8136  0.8351  0.8441  0.9331  0.9761
n = 40     0.7232  0.8015  0.8410  0.8561  0.8798  0.9588  0.9890
n = 50     0.7475  0.8268  0.8609  0.8712  0.9061  0.9742  0.9949
n = 100    0.8303  0.8999  0.9137  0.9151  0.9696  0.9971  0.9999
n = 200    0.9130  0.9546  0.9565  0.9565  0.9961  1.0000  1.0000
n = 300    0.9524  0.9759  0.9762  0.9762  0.9994  1.0000  1.0000
Table 4. Confidence level, $P(C)$, with $J = 2$ and $\lambda = 0.75$.

$(p_1^\star, p_2^\star)$   (0.6, 0.6)  (0.6, 0.7)  (0.6, 0.8)  (0.6, 0.9)  (0.7, 0.7)  (0.8, 0.8)  (0.9, 0.9)
n = 5      0.5392  0.5500  0.5601  0.5694  0.6341  0.7159  0.7820
n = 10     0.6215  0.6384  0.6536  0.6670  0.7447  0.8347  0.8921
n = 20     0.6925  0.7159  0.7356  0.7515  0.8371  0.9159  0.9514
n = 30     0.7338  0.7614  0.7832  0.7993  0.8819  0.9448  0.9688
n = 40     0.7650  0.7956  0.8183  0.8337  0.9093  0.9596  0.9781
n = 50     0.7900  0.8230  0.8459  0.8602  0.9275  0.9686  0.8941
n = 100    0.8679  0.9062  0.9259  0.9339  0.9664  0.9882  0.9963
n = 200    0.9314  0.9686  0.9792  0.9811  0.9877  0.9978  0.9998
n = 300    0.9565  0.9884  0.9935  0.9939  0.9944  0.9996  1.0000
Table 5. Confidence level, $P(C)$, with $J = 2$ and $\lambda = 0.90$.

$(p_1^\star, p_2^\star)$   (0.6, 0.6)  (0.6, 0.7)  (0.6, 0.8)  (0.6, 0.9)  (0.7, 0.7)  (0.8, 0.8)  (0.9, 0.9)
n = 5      0.5902  0.5917  0.5931  0.5944  0.7156  0.8140  0.8785
n = 10     0.6579  0.6611  0.6641  0.6669  0.8047  0.8906  0.9261
n = 20     0.7432  0.7487  0.7537  0.7581  0.8952  0.9466  0.9577
n = 30     0.7955  0.8024  0.8085  0.8138  0.9336  0.9633  0.9699
n = 40     0.8311  0.8391  0.8459  0.8517  0.9515  0.9702  0.9762
n = 50     0.8573  0.8660  0.8737  0.8795  0.9606  0.9740  0.9801
n = 100    0.9249  0.9365  0.9453  0.9514  0.9743  0.9833  0.9895
n = 200    0.9621  0.9764  0.9852  0.9898  0.9830  0.9917  0.9964
n = 300    0.9715  0.9868  0.9944  0.9974  0.9880  0.9956  0.9986
Table 6. Utility level, $P(U)$, with $J = 2$ and $\lambda = 0.5$.

$(p_1^\star, p_2^\star)$   (0.6, 0.6)  (0.6, 0.7)  (0.6, 0.8)  (0.6, 0.9)  (0.7, 0.7)  (0.8, 0.8)  (0.9, 0.9)
n = 5      0.5072  0.5265  0.5592  0.6028  0.5458  0.6113  0.6984
n = 10     0.5209  0.5526  0.6001  0.6574  0.5844  0.6794  0.7940
n = 20     0.5315  0.5743  0.6321  0.6942  0.6172  0.7328  0.8568
n = 30     0.5388  0.5882  0.6493  0.7099  0.6376  0.7600  0.8809
n = 40     0.5446  0.5983  0.6560  0.7179  0.6519  0.7753  0.8912
n = 50     0.5495  0.6060  0.6670  0.7227  0.6624  0.7845  0.8959
n = 100    0.5661  0.6269  0.6822  0.7330  0.6878  0.7983  0.9000
n = 200    0.5826  0.6405  0.6913  0.7413  0.6984  0.8000  0.9000
n = 300    0.5905  0.6451  0.6952  0.7452  0.6998  0.8000  0.9000
n = ∞      0.6000  0.6500  0.7000  0.7500  0.7000  0.8000  0.9000
Table 7. Utility level, $P(U)$, with $J = 2$ and $\lambda = 0.75$.

$(p_1^\star, p_2^\star)$   (0.6, 0.6)  (0.6, 0.7)  (0.6, 0.8)  (0.6, 0.9)  (0.7, 0.7)  (0.8, 0.8)  (0.9, 0.9)
n = 5      0.5078  0.5037  0.5035  0.5067  0.5536  0.6296  0.7256
n = 10     0.5243  0.5311  0.5436  0.5608  0.5979  0.7008  0.8137
n = 20     0.5385  0.5522  0.5731  0.5988  0.6348  0.7495  0.8611
n = 30     0.5468  0.5636  0.5880  0.6166  0.6528  0.7669  0.8751
n = 40     0.5530  0.5721  0.5988  0.6287  0.6637  0.7757  0.8825
n = 50     0.5580  0.5790  0.6071  0.6375  0.6710  0.7811  0.8873
n = 100    0.5736  0.6000  0.6305  0.6560  0.6866  0.7929  0.8970
n = 200    0.5863  0.6162  0.6450  0.6711  0.6949  0.7987  0.8998
n = 300    0.5913  0.6216  0.6485  0.6738  0.6978  0.7997  0.9000
n = ∞      0.6000  0.6250  0.6500  0.6750  0.7000  0.8000  0.9000
Table 8. Utility level, $P(U)$, with $J = 2$ and $\lambda = 0.90$.

$(p_1^\star, p_2^\star)$   (0.6, 0.6)  (0.6, 0.7)  (0.6, 0.8)  (0.6, 0.9)  (0.7, 0.7)  (0.8, 0.8)  (0.9, 0.9)
n = 5      0.5180  0.5106  0.5037  0.4973  0.5862  0.6884  0.8028
n = 10     0.5516  0.5277  0.5250  0.5233  0.6219  0.7343  0.8408
n = 20     0.5486  0.5495  0.5521  0.5564  0.6581  0.7680  0.8662
n = 30     0.5591  0.5622  0.5675  0.5747  0.6734  0.7780  0.8759
n = 40     0.5662  0.5705  0.5774  0.5861  0.6806  0.7821  0.8810
n = 50     0.5715  0.5765  0.5843  0.5940  0.6842  0.7844  0.8841
n = 100    0.5850  0.5922  0.6024  0.6140  0.6897  0.7900  0.8916
n = 200    0.5924  0.6019  0.6137  0.6258  0.6932  0.7950  0.8971
n = 300    0.5943  0.6050  0.6171  0.6287  0.6952  0.7974  0.8989
n = ∞      0.6000  0.6100  0.6200  0.6300  0.7000  0.8000  0.9000
Table 9. Biases with very small $p^\star$.

Bias of    $\hat p^\star$  $\tilde p^\star$  $\hat p_w^\star$  $\hat P(C)$  $\hat P_w(C)$  $\hat P(U)$  $\hat P_w(U)$
$p^\star$  0.51    0.51    0.51    0.51    0.51    0.51    0.51
n = 100    0.0307  0.0121  0.0220  0.1847  0.1242  0.0268  0.0207
n = 200    0.0192  0.0062  0.0130  0.1518  0.0915  0.0186  0.0141
n = 300    0.0142  0.0042  0.0094  0.1284  0.0711  0.0148  0.0113
n = 400    0.0112  0.0023  0.0069  0.1094  0.0502  0.0124  0.0092
n = 500    0.0093  0.0016  0.0056  0.0951  0.0375  0.0108  0.0080
n = 600    0.0080  0.0009  0.0046  0.0826  0.0257  0.0096  0.0071
n = 700    0.0069  0.0005  0.0038  0.0710  0.0146  0.0086  0.0063
n = 800    0.0061  0.0001  0.0032  0.0608  0.0049  0.0079  0.0057
Table 10. Golfing and weather.

Outlook    Temp   Humidity   Windy   Play Golf
Rainy      Hot    High       False   No
Rainy      Hot    High       True    No
Overcast   Hot    High       False   Yes
Sunny      Mild   High       False   Yes
Sunny      Cool   Normal     False   Yes
Sunny      Cool   Normal     True    No
Overcast   Cool   Normal     True    Yes
Rainy      Mild   High       False   No
Rainy      Cool   Normal     False   Yes
Sunny      Mild   Normal     False   Yes
Rainy      Mild   Normal     True    Yes
Overcast   Mild   High       True    Yes
Overcast   Hot    Normal     False   Yes
Sunny      Mild   High       True    No
Table 11. Outlook with $J = 3$.

Outlook            Rainy    Overcast   Sunny
$n_j$              5        4          5
$y_j$              2        4          3
$\hat\lambda_j$    5/14     4/14       5/14
$\hat p_j^\star$   3/5      4/4        3/5
$\hat P(C_j)$      0.8418   0.7397     0.8418
$\hat P(U_j)$      0.5684   0.7397     0.5684
Table 12. Temperature with $J = 3$.

Temp               Hot      Mild     Cool
$n_j$              4        6        4
$y_j$              2        4        3
$\hat\lambda_j$    4/14     6/14     4/14
$\hat p_j^\star$   2/4      4/6      3/4
$\hat P(C_j)$      0.6426   0.9408   0.7268
$\hat P(U_j)$      0.5000   0.6470   0.6134
Table 13. Humidity with $J = 2$.

Humidity           High     Normal
$n_j$              7        7
$y_j$              3        6
$\hat\lambda_j$    7/14     7/14
$\hat p_j^\star$   4/7      6/7
$\hat P(C_j)$      0.9246   0.9915
$\hat P(U_j)$      0.5606   0.8511
Table 14. Wind with $J = 2$.

Windy              False    True
$n_j$              8        6
$y_j$              6        3
$\hat\lambda_j$    8/14     6/14
$\hat p_j^\star$   6/8      3/6
$\hat P(C_j)$      0.9916   0.8547
$\hat P(U_j)$      0.7458   0.5000
Table 15. Estimated levels of confidence and utility.

Weather                             Outlook   Temp     Humidity   Windy
$\hat P(C)$                         0.8126    0.7945   0.9581     0.9329
$\hat P(U)$                         0.6173    0.5954   0.7059     0.6405
$\hat g_\lambda$ (Gini impurity)    0.7143    0.7976   0.3673     0.4286
Table 16. Biases in estimating $p^\star$.

Bias of      $\hat p^\star$                     | $\tilde p^\star$                     | $\hat p_w^\star$
$p^\star$    0.6      0.7      0.8      0.9     | 0.6      0.7      0.8      0.9      | 0.6      0.7      0.8      0.9
n = 5        0.1021   0.0468   0.0158   0.0037  | 0.0468   -0.0014  -0.0141  -0.0077  | 0.0781   0.0284   0.0048   -0.0001
n = 6        0.0743   0.0278   0.0076   0.0025  | -0.0016  -0.0370  -0.0364  -0.0136  | 0.0490   0.0063   -0.0070  -0.0028
n = 7        0.0740   0.0282   0.0073   0.0020  | 0.0231   -0.0111  -0.0160  -0.0055  | 0.0544   0.0137   -0.0008  -0.0004
n = 8        0.0578   0.0183   0.0048   0.0013  | 0.0056   -0.0212  -0.0165  -0.0036  | 0.0385   0.0038   -0.0029  -0.0004
n = 9        0.0578   0.0184   0.0040   0.0008  | -0.0202  -0.0434  -0.0301  -0.0070  | 0.0292   -0.0036  -0.0078  -0.0018
n = 10       0.0471   0.0125   0.0024   0.0008  | 0.0058   -0.0161  -0.0101  -0.0013  | 0.0310   0.0016   -0.0022  0.0000
n = 20       0.0212   0.0031   0.0010   0.0007  | -0.0138  -0.0140  -0.0019  0.0006   | 0.0067   -0.0039  -0.0002  0.0006
n = 30       0.0122   0.0015   0.0007   0.0005  | -0.0098  -0.0052  0.0002   0.0027   | -0.0013  0.0005   0.0005   0.0005
n = 40       0.0078   0.0009   0.0008   0.0005  | -0.0125  -0.0031  0.0007   0.0005   | -0.0011  -0.0008  0.0007   0.0005
n = 50       0.0052   0.0007   0.0006   0.0004  | -0.0093  -0.0009  0.0006   0.0004   | -0.0012  0.0000   0.0006   0.0004
Table 17. Mean squared errors in estimating $p^\star$.

MSE of       $\hat p^\star$                     | $\tilde p^\star$                    | $\hat p_w^\star$
$p^\star$    0.6      0.7      0.8      0.9     | 0.6      0.7      0.8      0.9     | 0.6      0.7      0.8      0.9
n = 5        0.0250   0.0237   0.0232   0.0162  | 0.0320   0.0360   0.0358   0.0223  | 0.0282   0.0277   0.0276   0.0183
n = 6        0.0274   0.0237   0.0224   0.0144  | 0.0274   0.0386   0.0413   0.0244  | 0.0230   0.0263   0.0271   0.0171
n = 7        0.0193   0.0192   0.0189   0.0122  | 0.0209   0.0270   0.0271   0.0159  | 0.0197   0.0219   0.0217   0.0133
n = 8        0.0183   0.0191   0.0180   0.0109  | 0.0215   0.0292   0.0273   0.0138  | 0.0181   0.0218   0.0209   0.0118
n = 9        0.0150   0.1063   0.0157   0.0097  | 0.0200   0.0321   0.0146   0.0308  | 0.0147   0.0202   0.0198   0.0111
n = 10       0.0147   0.0161   0.0147   0.0088  | 0.0175   0.0233   0.0199   0.0100  | 0.0150   0.0184   0.0165   0.0093
n = 20       0.0076   0.0093   0.0078   0.0044  | 0.0114   0.0143   0.0092   0.0044  | 0.0086   0.0111   0.0083   0.0044
n = 30       0.0056   0.0067   0.0052   0.0029  | 0.0086   0.0054   0.0053   0.0029  | 0.0064   0.0074   0.0029   0.0029
n = 40       0.0045   0.0051   0.0039   0.0022  | 0.0071   0.0064   0.0040   0.0022  | 0.0054   0.0056   0.0040   0.0022
n = 50       0.0038   0.0041   0.0031   0.0018  | 0.0057   0.0046   0.0032   0.0018  | 0.0045   0.0043   0.0031   0.0018
Table 18. Biases in estimating $P(C)$.

Bias of      $\hat P(C)$                          | $\tilde P(C)$                        | $\hat P_w(C)$
$p^\star$    0.6      0.7      0.8      0.9       | 0.6      0.7      0.8      0.9      | 0.6      0.7      0.8      0.9
n = 20       -0.0004  -0.0610  -0.0213  -0.0014   | -0.1192  -0.1183  -0.0311  -0.0016  | -0.0479  -0.0834  -0.0251  -0.0015
n = 30       -0.0318  -0.0460  -0.0067  -0.0001   | -0.1149  -0.0712  -0.0082  -0.0001  | -0.0685  -0.0562  -0.0073  -0.0001
n = 40       -0.0479  -0.0309  -0.0020  -0.0000   | -0.1423  -0.0489  -0.0023  -0.0000  | -0.0872  -0.0382  -0.0022  -0.0000
n = 50       -0.0561  -0.0199  -0.0006  0.0000    | -0.1301  -0.0279  -0.0007  0.0000   | -0.0877  -0.0232  -0.0007  0.0000
n = 60       -0.0594  -0.0125  -0.0002  -0.0000   | -0.1258  -0.0169  -0.0002  -0.0000  | -0.0878  -0.0143  -0.0002  -0.0000
n = 70       -0.0599  -0.0079  -0.0000  -0.0000   | -0.1245  -0.0104  -0.0000  -0.0000  | -0.0875  -0.0089  -0.0000  -0.0000
n = 80       -0.0595  -0.0049  -0.0000  -0.0000   | -0.1116  -0.0060  -0.0000  -0.0000  | -0.0820  -0.0054  -0.0000  -0.0000
n = 90       -0.0570  -0.0031  -0.0000  -0.0000   | -0.1023  -0.0036  -0.0000  -0.0000  | -0.0766  -0.0033  -0.0000  -0.0000
n = 100      -0.0531  -0.0020  -0.0000  -0.0000   | -0.0993  -0.0025  -0.0000  -0.0000  | -0.0729  -0.0022  -0.0000  -0.0000
n = 200      -0.0222  -0.0000  -0.0000  0.0000    | -0.0337  -0.0000  -0.0000  0.0000   | -0.0272  -0.0000  -0.0000  0.0000
n = 300      -0.0074  -0.0000  -0.0000  0.0000    | -0.0096  -0.0000  -0.0000  0.0000   | -0.0084  -0.0000  -0.0000  0.0000
Table 19. Biases in estimating $P(U)$.

Bias of      $\hat P(U)$                          | $\tilde P(U)$                        | $\hat P_w(U)$
$p^\star$    0.6      0.7      0.8      0.9       | 0.6      0.7      0.8      0.9      | 0.6      0.7      0.8      0.9
n = 20       0.0408   0.0024   -0.0043  0.0000    | 0.0252   -0.0065  -0.0062  0.0000   | 0.0302   -0.0033  -0.0054  0.0000
n = 30       0.0250   -0.0030  -0.0011  0.0006    | 0.0129   -0.0077  -0.0015  0.0006   | 0.0180   -0.0055  -0.0013  0.0006
n = 40       0.0158   -0.0033  0.0002   0.0005    | 0.0055   -0.0058  0.0001   0.0005   | 0.0090   -0.0048  0.0002   0.0005
n = 50       0.0098   -0.0024  0.0004   0.0005    | 0.0017   -0.0036  0.0004   0.0004   | 0.0048   -0.0030  0.0004   0.0005
n = 60       0.0060   -0.0014  0.0006   0.0004    | -0.0014  -0.0021  0.0006   0.0004   | 0.0016   -0.0018  0.0006   0.0004
n = 70       0.0033   -0.0008  0.0005   0.0005    | -0.0028  -0.0012  0.0005   0.0005   | -0.0006  -0.0010  0.0005   0.0005
n = 80       0.0013   -0.0004  0.0005   0.0004    | -0.0038  -0.0006  0.0005   0.0004   | -0.0038  -0.0005  0.0005   0.0004
n = 90       -0.0001  -0.0000  0.0005   0.0004    | -0.0046  -0.0001  0.0005   0.0004   | -0.0028  -0.0001  0.0005   0.0004
n = 100      -0.0007  -0.0002  0.0004   0.0004    | -0.0046  -0.0003  0.0004   0.0004   | -0.0032  -0.0003  0.0004   0.0004
n = 200      -0.0021  -0.0002  0.0002   0.0002    | -0.0030  -0.0002  0.0002   0.0002   | -0.0027  -0.0002  0.0002   0.0002
n = 300      -0.0010  -0.0003  0.0002   0.0002    | -0.0011  -0.0003  0.0002   0.0002   | -0.0011  -0.0003  0.0002   0.0002