Article

An Optimization Approach of Deriving Bounds between Entropy and Error from Joint Distribution: Case Study for Binary Classifications

1 NLPR/LIAMA, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
2 College of Mathematics and Information Science, Hebei University, Baoding 071002, China
* Author to whom correspondence should be addressed.
Entropy 2016, 18(2), 59; https://doi.org/10.3390/e18020059
Submission received: 3 December 2015 / Revised: 3 February 2016 / Accepted: 4 February 2016 / Published: 19 February 2016
(This article belongs to the Special Issue Information Theoretic Learning)

Abstract:
In this work, we propose a new approach for deriving the bounds between entropy and error from a joint distribution through an optimization procedure. A specific case study is given for binary classifications. Two basic types of classification errors are investigated, namely, the Bayesian and non-Bayesian errors. Non-Bayesian errors are considered because most classifiers result in non-Bayesian solutions. For both types of errors, we derive the closed-form relations between each bound and its error components. When Fano’s lower bound in a diagram of “Error Probability vs. Conditional Entropy” is realized with this approach, its interpretations are enlarged to include non-Bayesian errors and the two situations concerning the independence properties of the variables. A new upper bound for the Bayesian error is derived with respect to the minimum prior probability, which is generally tighter than Kovalevskij’s upper bound.

1. Introduction

In information theory, the relations between entropy and error probability are among the fundamental results. One milestone is Fano’s inequality (also known as Fano’s lower bound on the error probability of decoders), which was originally proposed by Fano in 1952 but formally published in 1961 [1]. It is well known that Fano’s inequality plays a critical role in deriving other theorems and criteria in information theory [2,3,4]. However, within the research community, there is no wide agreement on who first developed the upper bound for the error probability [5]. According to [6,7], Kovalevskij [8] was recognized as the first to derive the upper bound of the error probability in relation to entropy in 1965. Later, several researchers, such as Chu and Chueh in 1966 [9], Tebbe and Dwyer in 1968 [10], and Hellman and Raviv in 1970 [11], independently developed upper bounds.
The lower and upper bounds of error probability have been a long-standing topic in studies on information theory [6,7,12,13,14,15,16,17,18,19,20,21]. However, we consider two issues that have received less attention in these studies:
  • What are the closed-form relations between each bound and error components in a diagram of entropy and error probability?
  • What are the lower and upper bounds in terms of the non-Bayesian errors if a non-Bayesian rule is applied in the information processing?
The first issue implies a need for a complete set of interpretations of the bounds in relation to joint distributions, so that both the error probability and its error components are known for a deeper understanding. We will discuss the reasons for this need in the later sections of this paper. Up to now, most existing studies have derived the bounds through inequalities without using joint distribution information. Therefore, their bounds are not described by a generic relation to joint distributions, and the associated error component information cannot be obtained. Several significant studies have derived Fano’s bound from joint distributions, but through different means [16,20,21]; none of them showed the explicit relations to error components. Regarding the second issue, to the best of our knowledge, no study in the open literature has addressed the bounds in terms of non-Bayesian errors. We will define the Bayesian and non-Bayesian errors in Section 3. The non-Bayesian errors are also of importance because most classifications are realized within this category.
The issues above form the motivation behind this work. We take binary classifications as the problem background since they are common and easily understood from daily-life experience. Moreover, we intentionally restrict the settings to a binary state and Shannon entropy definitions for the case study, in the expectation that the central principle of the approach is best highlighted by simple examples. The novel contributions of the present work are given from the following three aspects:
  • A new approach is proposed for deriving bounds directly through the optimization process based on a joint distribution, which is significantly different from all other existing approaches. One advantage of using the approach is the closed-form expressions to the bounds and their error components.
  • A new upper bound in a diagram of “Error Probability vs. Conditional Entropy” for the Bayesian errors is derived with a closed-form expression in the binary state, which has not been reported before. The new bound is generally tighter than Kovalevskij’s upper bound. Fano’s lower bound receives novel interpretations.
  • A comparison study on the bounds in terms of the Bayesian and non-Bayesian errors is made in the binary state. The bounds for non-Bayesian errors are explored for the first time in information theory and imply a significant role in the study of machine learning and classification applications.
In the first aspect, we also conduct the actual derivation using a symbolic software tool, which presents a standard and comprehensive solution in the approach. The rest of this paper is organized as follows. In Section 2, we present related works on the bounds. For a problem background of binary classifications, several related definitions are given in Section 3. The bounds are given and discussed for the Bayesian and non-Bayesian errors in Section 4 and Section 5, respectively. Interpretations to some key points are presented in Section 6. We summarize the work in Section 7 and present some discussions in Section 8. The source code from using symbolic software for the derivation is included in Figure A1 and Figure A2.

2. Related Works

Two important bounds are introduced first; they form the baselines for comparison with the new bounds. Both were derived from inequality conditions [1,8]. Suppose the random variables X and Y represent input and output messages (out of m possible messages), and the conditional entropy H(X|Y) represents the average amount of information lost on X when Y is given [22]. Fano’s lower bound for the error probability [1,22] is given in the form:
H(X|Y) \le H(P_e) + P_e \log_2 (m - 1),
where P e is the error probability (sometimes, also called error rate or error for short), and H ( P e ) is the binary entropy function defined by [23]:
H(P_e) = -P_e \log_2 P_e - (1 - P_e) \log_2 (1 - P_e).
The base of the logarithm is two so that the units are bits.
The upper bound for the error probability is given by Kovalevskij [8] in a piecewise linear form [10]:
H(X|Y) \le \log_2 k + k(k+1)\left(\log_2 \frac{k+1}{k}\right)\left(P_e - \frac{k-1}{k}\right), \quad k < m, \ m \ge 2,
where k is a positive integer number, but defined to be smaller than m. For a binary classification ( m = 2 ), Fano–Kovalevskij bounds become:
H^{-1}(P_e) = G(H(X|Y)) \le P_e \le \frac{H(X|Y)}{2},
where H^{-1}(P_e) denotes the inverse function of H(P_e), which has no closed-form expression. Hence, we write it in a functional form, G(H(X|Y)), in terms of the variable H(X|Y). Feder and Merhav [24] depicted the bounds of Equation (4) and interpreted two specific points from the background of data compression problems.
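Since G(·) has no closed form, it must be evaluated numerically. The following sketch (ours, not from the paper; the sample value h = 0.8 is arbitrary) inverts the binary entropy by bisection and prints Fano’s lower bound together with Kovalevskij’s upper bound H(X|Y)/2 for the binary case:

```python
import numpy as np

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), clipped so H(0) = H(1) = 0."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def fano_lower_bound(h, tol=1e-10):
    """Invert the binary entropy on [0, 0.5] by bisection:
    the smallest error probability compatible with H(X|Y) = h (m = 2)."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_entropy(mid) < h:
            lo = mid          # entropy too small -> error must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)

h = 0.8                                  # an assumed value of H(X|Y) in bits
print(fano_lower_bound(h), h / 2)        # lower bound ~0.2430, upper bound 0.4
```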
Studies from different perspectives have been reported on the bounds between error probability and entropy. A first difference lies in the entropy definitions, such as Shannon entropy in [12,14,25,26] and Rényi entropy in [6,7,15]. A second difference is the selection of bound relations, such as “P_e vs. H(X|Y)” in [12,24], “H(X|Y) vs. P_e” in [6,7,14,15,20], “P_e vs. MI(X;Y)” in [27,28], and “NMI(X;Y) vs. A” in [25], where A is the accuracy rate, and MI(X;Y) and NMI(X;Y) are the mutual information and normalized mutual information between variables X and Y, respectively. Another important line of study concerns the tightness of bounds; several investigations [17,19,20,29] have reported improvements of bound tightness. Recently, a study in [26] suggested that an upper bound from the Bayesian errors should be added, which is generally neglected in bound analysis.

3. Binary Classifications and Related Definitions

Classifications can be viewed as one component of pattern recognition systems [30]. Figure 1 shows a schematic diagram of such systems. The first unit of the system is termed representation in the present problem background, but is called an encoder in the communication background. This unit processes the tasks of feature selection or feature extraction. The second unit is called classification, or a classifier in applications. Three sets of variables are involved in the system, namely, the target variable T, the feature variables X, and the prediction variable Y. While T and Y are univariate discrete random variables representing the labels of the samples, X can be a high-dimensional random variable, either discrete, continuous, or a combination of both.
In this work, binary classifications are considered as a case study because they are the most fundamental in applications; multi-class classifications are sometimes processed by binary classifiers [31]. In this section, we present several definitions necessary for the case study. Let x be a random sample to be classified, satisfying x ∈ X ⊆ R^d, which lies in a d-dimensional feature space. The true (or target) state t of x is within the finite set of two classes, t ∈ T = {t_1, t_2}, and the prediction (or output) state y = f(x) is within the two classes, y ∈ Y = {y_1, y_2}, where f is a classification function. Let p(t_i) be the prior probability of class t_i and p(x|t_i) be the conditional probability density function (or conditional probability) of x given that it belongs to class t_i.
Definition 1. 
(Bayesian error in binary classification) In a binary classification, the Bayesian error, denoted by P e , is defined by [30]:
P_e = \int_{R_2} p(x|t_1)\, p(t_1)\, dx + \int_{R_1} p(x|t_2)\, p(t_2)\, dx,
where R i is the decision region for class t i . The two regions are determined by the Bayesian rule:
\text{Decide } R_1 \ \text{if } \frac{p(x|t_1)\,p(t_1)}{p(x|t_2)\,p(t_2)} \ge 1; \qquad \text{Decide } R_2 \ \text{if } \frac{p(x|t_1)\,p(t_1)}{p(x|t_2)\,p(t_2)} < 1.
In statistical classifications, the Bayesian error is the theoretically lowest probability of error [30].
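As a concrete illustration of Equation (5) (our own example, not from the paper), take two hypothetical univariate Gaussian class-conditional densities with equal variances and equal priors; the Bayesian boundary is then the midpoint of the two means, and the error components reduce to Gaussian tail probabilities:

```python
from scipy.stats import norm

# Assumed class-conditional densities: p(x|t1) = N(0,1), p(x|t2) = N(2,1),
# with equal priors p(t1) = p(t2) = 0.5.
p1, p2 = 0.5, 0.5
mu1, mu2, sigma = 0.0, 2.0, 1.0
x_b = 0.5 * (mu1 + mu2)            # Bayesian decision boundary for equal priors

e1 = p1 * norm.sf(x_b, loc=mu1, scale=sigma)    # class-1 mass falling in R2
e2 = p2 * norm.cdf(x_b, loc=mu2, scale=sigma)   # class-2 mass falling in R1
print(e1, e2, e1 + e2)                          # P_e = e1 + e2 ~ 0.1587
```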
Definition 2. 
(Non-Bayesian error) The non-Bayesian error, denoted by P E , is defined to be any error which is larger than the Bayesian error, that is:
P E > P e ,
for the given information of p ( t i ) and p ( x | t i ) .
Remark 1. 
Based on the definitions above, for the given joint distribution, the Bayesian error is unique, but the non-Bayesian errors are multiple. Figure 2 shows the Bayesian decision boundary, x b , on a univariate feature variable x for equal priors. The Bayesian error is P e = e 1 + e 2 . Any other decision boundary different from x b will generate the non-Bayesian error for P E > P e .
In a binary classification, the joint distribution, p ( t , y ) = p ( t = t i , y = y j ) = p i j , is given in a general form of:
p_{11} = p_1 - e_1, \quad p_{12} = e_1, \quad p_{21} = e_2, \quad p_{22} = p_2 - e_2,
where p 1 = p ( t 1 ) and p 2 = p ( t 2 ) are the prior probabilities of Class 1 and Class 2, respectively; their associated errors (also called error components) are denoted by e 1 and e 2 . Figure 3 shows a graphic diagram of the probability transformation between target variable T and prediction variable Y via their joint distribution p ( t , y ) in a binary classification. The constraints in Equation (8) are given by [30]:
0 < p_1 < 1, \quad 0 < p_2 < 1, \quad p_1 + p_2 = 1, \quad 0 \le e_1 \le p_1, \quad 0 \le e_2 \le p_2.
In this work, we use e to denote error probability, or error variable, for representing either the Bayesian error or non-Bayesian error. They are calculated from the same formula:
e = e_1 + e_2 = \begin{cases} P_e & \text{if } e \text{ is the minimum}, \\ P_E & \text{otherwise}. \end{cases}
Definition 3. 
(Minimum and maximum error bounds in binary classifications) Classifications suggest the minimum error bound as:
( P E ) m i n = ( P e ) m i n = 0 ,
where the subscript min denotes the minimum value. The maximum error bound for the Bayesian error in binary classifications is [26]:
( P e ) m a x = p m i n = min { p 1 , p 2 } ,
where the symbol min denotes a minimum operation. For the non-Bayesian error, its maximum error bound becomes
( P E ) m a x = 1 .
Equations (11)–(13) describe the initial ranges of the Bayesian and non-Bayesian errors, respectively. While they share the same minimum, their maxima are always different.
Remark 2. 
For a given set of joint distributions in the bound studies, one may fail to tell whether it is the solution obtained by the Bayesian rule. Only when e > p_min can we say that the set corresponds to a non-Bayesian solution.
In a binary classification, the conditional entropy, H ( T | Y ) , is calculated from the joint distribution in Equation (8):
H(T|Y) = H(T) - MI(T;Y) = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - e_1 \log_2 \frac{e_1}{(p_2 + e_1 - e_2)\,p_1} - e_2 \log_2 \frac{e_2}{(p_1 - e_1 + e_2)\,p_2} - (p_1 - e_1) \log_2 \frac{p_1 - e_1}{(p_1 - e_1 + e_2)\,p_1} - (p_2 - e_2) \log_2 \frac{p_2 - e_2}{(p_2 + e_1 - e_2)\,p_2},
where H ( T ) is a binary entropy of the random variable T, and M I ( T ; Y ) is mutual information between variables T and Y.
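For numerical work, Equation (14) is most conveniently evaluated directly from the joint distribution. A minimal sketch (our own helper, assuming the parametrization of Equation (8)); the printed example is the setting used later for Point B′ in Section 6:

```python
import numpy as np

def cond_entropy(p1, e1, e2):
    """H(T|Y) of Equation (14), in bits, computed from the joint distribution
    p11 = p1-e1, p12 = e1, p21 = e2, p22 = p2-e2 with p2 = 1-p1."""
    p2 = 1.0 - p1
    joint = np.array([[p1 - e1, e1],
                      [e2, p2 - e2]])
    py = joint.sum(axis=0)                      # marginal of the prediction Y
    with np.errstate(divide='ignore', invalid='ignore'):
        terms = joint * np.log2(joint / py)
    return -np.nansum(terms)                    # 0*log(0) terms contribute 0

print(cond_entropy(0.55, 0.10, 0.30))           # ~0.9710
```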
Remark 3. 
When a joint distribution p(t,y) is given, its associated conditional entropy H(T|Y) is uniquely determined. However, for a given H(T|Y), one generally cannot recover a unique solution for p(t,y); multiple solutions exist, as shown later in this work.
Definition 4. 
(Admissible point, admissible set, and their properties in a diagram of entropy and error probability) In a given diagram of entropy and error probability, if a point in the diagram can be realized from a non-empty set of joint distributions for the given classification information, it is defined to be an admissible point. Otherwise, it is a non-admissible point. All admissible points form an admissible set (or admissible region(s)), which is enclosed by the bounds (also called the boundary). If every point located on the boundary is admissible (or non-admissible), we call this admissible set closed (or open). If only a partial portion of the boundary points is admissible, the set is said to be partially closed. For an admissible point with the given conditions, if it is realized by only a unique joint distribution, it is called a one-to-one mapping point. If more than one joint distribution is associated with the same admissible point, it is called a one-to-many mapping point.
We consider that classifications present an exemplary justification for raising the first issue in Section 1 about the bound studies. The main reason behind the issue is that the single index of error probability may not be sufficient for dealing with classification problems. For example, when processing class-imbalance problems [32,33], we need to distinguish error types. In other words, for the same error probability e (or even the same admissible point), we are required to know the error components e_1 and e_2 as well. Suppose one encounters a medical diagnosis problem, where p_1 (say, p_1 = 0.98) represents the majority class of healthy persons (labeled negative, or −1, in Figure 3), and p_2 (= 0.02) the minority class of abnormal persons (labeled positive, or 1). A class-imbalance problem is then formed. While e_1 (also called the type I error) may be tolerable (say, e_1 = 0.01), e_2 (the type II error) seems intolerable (say, e_2 = 0.01), because abnormal persons are then considered to be “healthy”. In class-imbalance problems, the performance measure based on error probability alone may become useless. For example, a classification having e = e_2 = p_2 = 0.02 appears to perform well, yet it is not a reasonable solution, since all abnormal persons are missed. Hence, from either a theoretical or an application viewpoint, it is necessary to establish relations between bounds and joint distributions, which can provide error-type information within the error probability for better interpretation of the bounds.

4. Lower and Upper Bounds for Bayesian Errors

In this work, we select the bound relations between entropy and error probability. The bounds and their associated error components are given by the following two theorems in the context of binary classifications.
Theorem 1. 
(Lower bound and associated error components) The lower bound in a diagram of “ P e vs. H ( T | Y ) ” and the associated error components with constraints Equations (9) and (12) are given by:
P_e \ge \max\{0,\ G_1(H(T|Y))\},
\text{for } G_1^{-1}(P_e) = -P_e \log_2 P_e - (1 - P_e)\log_2(1 - P_e), \quad P_e = e_1 + e_2 \le p_{min},
(e_1, e_2) = \begin{cases} (0.5,\ 0) \ \text{or} \ (0,\ 0.5), & \text{if } P_e = 0.5, \\ \left( \dfrac{P_e(1 - p_1 - P_e)}{1 - 2P_e},\ \dfrac{P_e(p_1 - P_e)}{1 - 2P_e} \right), & \text{otherwise}, \end{cases}
where H ( T | Y ) is the conditional entropy of T when given Y, and G 1 is called the lower bound function (or lower bound). However, one can only achieve the closed-form solution on its inverse function, G 1 1 ( · ) , not on G 1 ( · ) itself.
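A quick numerical sanity check (ours; the values p_1 = 0.7 and P_e = 0.2 ≤ p_min are arbitrary feasible choices) confirms that the components in Equation (15c) attain Fano’s curve, i.e., H(T|Y) at that split equals G_1^{-1}(P_e):

```python
import numpy as np

def cond_entropy(p1, e1, e2):
    # H(T|Y), in bits, from the joint distribution of Equation (8)
    p2 = 1.0 - p1
    joint = np.array([[p1 - e1, e1], [e2, p2 - e2]])
    py = joint.sum(axis=0)
    with np.errstate(divide='ignore', invalid='ignore'):
        return -np.nansum(joint * np.log2(joint / py))

p1, Pe = 0.7, 0.2                       # assumed prior and error with Pe <= p_min
e1 = Pe * (1 - p1 - Pe) / (1 - 2 * Pe)  # Equation (15c)
e2 = Pe * (p1 - Pe) / (1 - 2 * Pe)
fano = -Pe * np.log2(Pe) - (1 - Pe) * np.log2(1 - Pe)   # G1^{-1}(Pe)
print(cond_entropy(p1, e1, e2), fano)   # both ~0.7219: the bound is attained
```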
Proof. 
Based on Equation (14), the lower bound function is derived from the following definition:
G_1^{-1}(e) = \arg\max_e H(T|Y), \quad \text{subject to Equations (9) and (12)},
where we take e as the input variable in the derivations. Equation (16) describes the function of the maximum H(T|Y) with respect to e, and the function needs to satisfy the general constraints on joint distributions in Equation (9). H(T|Y) appears to be governed by the four variables p_i and e_i in Equation (14). However, only two independent parameter variables determine the solutions of Equations (14) and (16). The reduction from four variables to two is due to the two specific constraints imposed between the parameters, that is, p_1 + p_2 = 1 and e_1 + e_2 = e. When we set p_1 and e_1 as the two independent variables, Equation (16) is equivalent to solving the following problem:
G_1^{-1}(p_1, e_1) = \arg\max_{e = P_e} H(T|Y), \quad \text{subject to Equations (9) and (12)}.
G_1^{-1}(p_1, e_1) is a continuous and differentiable function with respect to the two variables. A differential approach is applied analytically to search for the critical points of the optimization in Equation (17). We obtain the two partial-derivative equations below and set them to zero:
\frac{\partial H(T|Y)}{\partial e_1} = \log_2 \frac{(p_1 - e_1)(P_e - e_1)(1 + 2e_1 - p_1 - P_e)^2}{e_1\,(1 + e_1 - p_1 - P_e)(p_1 + P_e - 2e_1)^2} = 0, \qquad \frac{\partial H(T|Y)}{\partial p_1} = \log_2 \frac{(p_1 - 2e_1 + P_e)(1 + e_1 - p_1 - P_e)}{(p_1 - e_1)(1 + 2e_1 - p_1 - P_e)} = 0.
By solving them simultaneously, we obtain the three pairs of the critical points through analytical derivations:
e_1 = \frac{P_e(1 - p_1 - P_e)}{1 - 2P_e}, \quad p_1 = \frac{P_e + 2e_1 P_e - e_1 - P_e^2}{P_e},
e_1 = \frac{p_1(p_1 + P_e - 1)}{2p_1 - 1}, \quad p_1 = \frac{1 - P_e}{2} + e_1 + \frac{1}{2}\sqrt{1 + P_e^2 + 4e_1^2 - 4e_1 P_e - 2P_e},
e_1 = \frac{p_1(p_1 + P_e - 1)}{2p_1 - 1}, \quad p_1 = \frac{1 - P_e}{2} + e_1 - \frac{1}{2}\sqrt{1 + P_e^2 + 4e_1^2 - 4e_1 P_e - 2P_e}.
The highest order of each variable, e_1 and p_1, in Equation (18) is four. However, the squared ratio appearing in the first function of Equation (18), \left(\frac{1 + 2e_1 - p_1 - P_e}{p_1 + P_e - 2e_1}\right)^2, degenerates the total solution order from four to three. Therefore, the three pairs of critical points exhibit a complete set of possible solutions to the problem in Equation (17). The final solution should be the pair(s) that satisfies both the maximum of H(T|Y) with respect to e_1 for the given e = P_e and the constraints of Equations (9) and (12). Due to the high complexity and nonlinearity of the second-order partial derivatives of H(T|Y), it seems intractable to examine the three pairs analytically for the final solution.
To overcome this difficulty, we apply a symbolic software tool, Maple™ 9.5 (a registered trademark of Waterloo Maple, Inc.), for a semi-analytical solution to the problem (see the Maple code in Figure A1). For simplicity and without loss of generality in classifications, we consider p_1 and P_e to be known constants in the function. The concavity of H(T|Y) with respect to e_1 in the ranges defined by Equation (19a) is confirmed numerically by varying p_1 and P_e. Hence, a maximum of H(T|Y) is always attained among the possible solutions given by the critical points. Among them, only Equation (19a) satisfies the constraints and is therefore the final solution. When e_1 is set, the expression for e_2 follows as shown in Equation (15c). The singular case is given separately, and the solution (e_1, e_2) = (0, 0.5) is obtained when p_2 is used in the error expressions.  ☐
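The actual derivation is carried out with the Maple code of Figure A1. As an independent cross-check (not the authors’ code), the critical point of Equation (19a) can also be verified with SymPy by differentiating H(T|Y) symbolically and confirming that the gradient vanishes there for sample values of p_1 and P_e:

```python
import sympy as sp

p1, e1, Pe = sp.symbols('p1 e1 Pe', positive=True)

# Joint distribution of Equation (8) with e2 = Pe - e1 and p2 = 1 - p1
p11, p12, p21, p22 = p1 - e1, e1, Pe - e1, 1 - p1 - Pe + e1
py1, py2 = p11 + p21, p12 + p22
H = -sum(pij * sp.log(pij / pyj, 2)
         for pij, pyj in [(p11, py1), (p21, py1), (p12, py2), (p22, py2)])

dH_de1 = sp.diff(H, e1)       # corresponds to the first line of Equation (18)
dH_dp1 = sp.diff(H, p1)       # corresponds to the second line of Equation (18)

e1_star = Pe * (1 - p1 - Pe) / (1 - 2 * Pe)   # candidate from Equation (19a)

# Both partial derivatives should evaluate to (numerically) zero at e1_star
for vals in ({p1: 0.7, Pe: 0.2}, {p1: 0.55, Pe: 0.3}):
    point = {**vals, e1: e1_star.subs(vals)}
    print(sp.N(dH_de1.subs(point)), sp.N(dH_dp1.subs(point)))
```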
Remark 4. 
Theorem 1 achieves the same lower bound found by Fano [1] (Figure 4), which holds in general for finite alphabets (or multiclass classifications). One specific relation to Fano’s bound is given by the marginal probability (see (2-144) in [2]):
p(y) = \left(1 - P_e,\ \frac{P_e}{m - 1},\ \ldots,\ \frac{P_e}{m - 1}\right),
which is termed sharp for attaining equality in Equation (1) [2]. We call Fano’s bound an exact lower bound because every point on it is sharp. The sharp conditions in terms of error components in Equation (15c) are a special case of the study in [20], and can be derived directly from their Theorem 1.
Theorem 2. 
(Upper bound and associated error components) The upper bound and the associated error components with constraints Equations (9) and (12) are given by:
P_e \le \min\{p_{min},\ G_2(H(T|Y))\},
\text{for } G_2^{-1}(P_e) = -p_{min}\log_2\frac{p_{min}}{P_e + p_{min}} - P_e\log_2\frac{P_e}{P_e + p_{min}},
\text{and } P_e = e_1 + e_2 \le p_{min}, \quad e_i = P_e,\ e_j = 0, \ \text{for } p_i \ge p_j,\ i \ne j,\ i, j = 1, 2,
where G 2 is called the upper bound function (or upper bound). The closed-form solution can be achieved only on its inverse function, G 2 1 ( · ) .
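The closed form in Equation (21b) can be checked numerically against a direct evaluation of H(T|Y) (our own sketch; it assumes p_1 ≥ p_2, so that p_min = p_2 and, following Equation (21c), the entire error is carried by e_1):

```python
import numpy as np

def cond_entropy(p1, e1, e2):
    # H(T|Y), in bits, from the joint distribution of Equation (8)
    p2 = 1.0 - p1
    joint = np.array([[p1 - e1, e1], [e2, p2 - e2]])
    py = joint.sum(axis=0)
    with np.errstate(divide='ignore', invalid='ignore'):
        return -np.nansum(joint * np.log2(joint / py))

def g2_inv(Pe, p_min):
    # Closed-form G2^{-1}(Pe) of Equation (21b)
    return (-p_min * np.log2(p_min / (Pe + p_min))
            - Pe * np.log2(Pe / (Pe + p_min)))

p1, Pe = 0.8, 0.1                       # assumed priors and error, Pe <= p_min
p_min = min(p1, 1 - p1)
print(g2_inv(Pe, p_min), cond_entropy(p1, e1=Pe, e2=0.0))   # both ~0.2755
```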
Proof. 
The upper bound function is obtained from solving the following equation:
G_2^{-1}(p_1, e_1) = \arg\min_{e = P_e} H(T|Y), \quad \text{subject to Equations (9) and (12)}.
Because the concavity of H(T|Y) with respect to e_1 holds, as discussed in the proof of Theorem 1, the possible solutions for e_1 are located at the two end points, that is, either e_1 = 0 or e_1 = P_e. We take the point that produces the smaller H(T|Y) and satisfies the constraints as the final solution. The solution from the Maple code in Figure A2 confirms the closed-form expressions in Equation (21).  ☐
Remark 5. 
Theorem 2 describes a novel set of upper bounds which is in general tighter than Kovalevskij’s bound [8] for binary classifications (Figure 4). For example, when p_min = 0.2 is given, the upper bound defined in Equation (21) consists of a curve “OC” plus a line “CC′”. Kovalevskij’s upper bound, given by the line “OCA”, is sharp only at Point O and Point C. The solution in Equation (21c) confirms an advantage of using the proposed optimization approach in the derivations, namely that a closed-form expression of the exact bound can be achieved.
In comparison, Kovalevskij’s upper bound described in Equation (3) holds in general for multiclass classifications. However, it lacks a general relation to error components like Equation (21c), although that relation is restricted to the binary state. To distinguish it from Kovalevskij’s upper bound, we also call G_2 a curved upper bound. The new linear upper bound, (P_e)_max = p_min, gives the maximum error for Bayesian decisions in binary classifications [26], which is also equivalent to the solution of a blind guess when using the maximum-likelihood decision [30]. If p_1 = p_2, the upper bound becomes a single curved one.
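The tightness claim can be probed numerically: in the binary case Kovalevskij’s bound reads P_e ≤ H(T|Y)/2, so the new bound is tighter wherever G_2^{-1}(P_e) ≥ 2P_e on 0 < P_e ≤ p_min. A small check (ours, for an assumed p_min = 0.2):

```python
import numpy as np

def g2_inv(Pe, p_min):
    # Closed-form G2^{-1}(Pe) of Equation (21b)
    return (-p_min * np.log2(p_min / (Pe + p_min))
            - Pe * np.log2(Pe / (Pe + p_min)))

p_min = 0.2
Pe = np.linspace(1e-4, p_min, 200)
# Kovalevskij (m = 2): Pe <= H/2, i.e. H >= 2*Pe; the new bound requires
# H >= G2^{-1}(Pe).  The new bound is tighter wherever G2^{-1}(Pe) >= 2*Pe.
print(bool(np.all(g2_inv(Pe, p_min) >= 2 * Pe - 1e-12)))    # True
```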
Remark 6. 
The lower and upper bounds defined by Equations (15) and (21) form a closed admissible region in the diagram of “P_e vs. H(T|Y)”. The shape of the admissible region changes depending on the single parameter p_min.

5. Lower and Upper Bounds for Non-Bayesian Errors

In classification problems, the Bayesian errors can be realized only if one has exact information about all the probability distributions of the classes. This assumption is generally impossible to meet in real applications. In addition, various classifiers are designed with non-Bayesian rules or result in non-Bayesian errors, from conventional decision trees, artificial neural networks, and support vector machines [30] to the emerging deep learning models [34]. Therefore, the analysis of non-Bayesian errors presents significant interest in classification studies, although conventional information theory does not distinguish the error types.
Definition 5. 
(Label-switching in binary classifications) In binary classifications, a label-switching operation is an exchange between two labels. Suppose the original joint distribution is denoted by:
p A ( t , y ) : p 11 = a , p 12 = b , p 21 = c , p 22 = d .
A label-switching operation will change the prediction labels in Figure 3 to y_1 = 1 and y_2 = −1, and generate the following joint distribution:
p B ( t , y ) : p 11 = b , p 12 = a , p 21 = d , p 22 = c .
Proposition 1. 
(Invariant property from label-switching) The related entropy measures, including H(T), H(Y), MI(T;Y), and H(T|Y), are invariant to labels, i.e., unchanged by a label-switching operation in binary classifications. However, the error e will change to 1 − e.
Proof. 
Substituting the two sets of joint distributions in Equation (23) into each entropy measure formula respectively, one can obtain the same results. The error change is obvious.  ☐
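Proposition 1 is easy to confirm numerically. In the sketch below (ours, using an arbitrary hypothetical joint distribution), swapping the two columns of p(t,y), as in Equation (23), leaves H(T), H(Y), MI(T;Y), and H(T|Y) unchanged while turning the error e into 1 − e:

```python
import numpy as np

def entropy_measures(joint):
    """Return (H(T), H(Y), MI(T;Y), H(T|Y)) in bits for a 2x2 joint p(t,y)."""
    joint = np.asarray(joint, dtype=float)
    pt, py = joint.sum(axis=1), joint.sum(axis=0)
    H = lambda p: -np.sum(p[p > 0] * np.log2(p[p > 0]))
    Ht, Hy, Hty = H(pt), H(py), H(joint.ravel())
    mi = Ht + Hy - Hty
    return Ht, Hy, mi, Ht - mi

pA = np.array([[0.55, 0.15], [0.05, 0.25]])   # an assumed joint distribution
pB = pA[:, ::-1]                              # label-switched version (Eq. (23))
print(entropy_measures(pA))
print(entropy_measures(pB))                   # identical entropy measures
print(pA[0, 1] + pA[1, 0], pB[0, 1] + pB[1, 0])   # errors: e = 0.2 and 1-e = 0.8
```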
Theorem 3. 
(Lower bound and upper bound for non-Bayesian error without information of p 1 and p 2 ) In a context of binary classifications, when information about p 1 and p 2 is unknown (say, before classifications), the lower bound and upper bound for the non-Bayesian error with constraints Equations (9) and (13) are given by:
G_1(H(T|Y)) \le P_E \le 1 - G_1(H(T|Y)),
\text{for } G_1^{-1}(P_E) = -P_E \log_2 P_E - (1 - P_E)\log_2(1 - P_E), \quad P_E = e_1 + e_2 \le 1,
(e_1, e_2) = \begin{cases} (0.5,\ 0) \ \text{or} \ (0,\ 0.5), & \text{if } p_1 = p_2 = P_E = 0.5, \\ \left( \dfrac{P_E(1 - p_1 - P_E)}{1 - 2P_E},\ \dfrac{P_E(p_1 - P_E)}{1 - 2P_E} \right), & \text{if } (1 - p_1 - P_E)(p_1 - P_E) \ge 0, \\ \left( \dfrac{p_1(p_1 + P_E - 1)}{2p_1 - 1},\ \dfrac{(1 - p_1)(p_1 - P_E)}{2p_1 - 1} \right), & \text{otherwise}, \end{cases}
where we call the upper bound in Equation (24a), 1 − G_1(H(T|Y)), the general upper bound (or mirrored lower bound); it is a mirror of Fano’s lower bound with the mirror axis along P_E = 0.5. Both bounds share the same expressions for calculating the associated error components in Equation (24c). When P_E ≤ 0.5, the components e_1 and e_2 correspond to the lower bound; otherwise, they correspond to the upper bound.
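For reference, the case selection in Equation (24c) can be written as a small helper (ours; a direct transcription of the three branches, assuming 0 < p_1 < 1 and 0 < P_E < 1). The two calls reproduce the component splits used in Equations (35) and (36) of Section 6:

```python
def error_components(p1, PE):
    """Error components (e1, e2) of Equation (24c) for given p1 and P_E."""
    if p1 == 0.5 and PE == 0.5:
        return 0.5, 0.0                        # the other choice is (0.0, 0.5)
    if (1 - p1 - PE) * (p1 - PE) >= 0:         # Fano-sharp branch
        e1 = PE * (1 - p1 - PE) / (1 - 2 * PE)
        e2 = PE * (p1 - PE) / (1 - 2 * PE)
    else:                                      # independence (MI = 0) branch
        e1 = p1 * (p1 + PE - 1) / (2 * p1 - 1)
        e2 = (1 - p1) * (p1 - PE) / (2 * p1 - 1)
    return e1, e2

print(error_components(0.250, 0.400))   # (0.175, 0.225)
print(error_components(0.300, 0.400))   # (0.225, 0.175)
```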
Proof. 
When the error probability is relaxed by Equation (13), all possible solutions in Equation (19) are applicable, each within its particular range. Suppose an admissible point is located on the lower bound with P_E ≤ 0.5. By a label-switching operation, one obtains the mirrored admissible point at 1 − P_E ≥ 0.5, which lies on the mirrored lower bound. Proposition 1 implies that both points share the same value of H(T|Y). Because P_E is the smallest error for the given conditional entropy H(T|Y), its mirrored point is the largest, thereby creating the general upper bound.  ☐
Remark 7. 
Han and Verdú [16] achieved Fano’s bound from the joint distributions by including the independence condition p_ij = p(t_i)p(y_j) [2]. This condition only leads to the last set of error equations in Equation (24c), not to the complete sets. In addition, that set is only applicable to the non-Bayesian errors, not to the Bayesian ones, except for the special case in Equation (20). Equation (24c) again confirms the advantage of using the optimization in the derivations, which achieves the complete sets of solutions describing Fano’s bound for non-Bayesian errors.
Remark 8. 
The bounds in Equation (24) are derived with p_1 and P_E treated as given; however, they exist even when one does not have such information. In this situation, Fano’s lower bound, its mirrored bound, and the axis of P_E form an admissible region, denoted by the boundary “O F A F′ D O” in Figure 5. The axis of P_E encloses the region, but only Points O and D on it are admissible. Hence, the admissible region is partially closed.
Theorem 4. 
(Admissible region(s) for non-Bayesian error with known information of p 1 and p 2 ) In binary classifications, when information about p 1 and p 2 is known, a closed admissible region for the non-Bayesian error with constraints Equations (9) and (13) is generally formed (Figure 5) by Fano’s lower bound, the general upper bound, the curved upper bound G 2 1 ( · ) , the mirrored upper bound of G 2 1 ( · ) , and the upper bound H ( T | Y ) m a x . For the H ( T | Y ) m a x bound, its associated error components are given by:
\text{for } H(T|Y) = H(T|Y)_{max} = H(p_{min}), \quad (e_1, e_2) = \begin{cases} (0.25,\ 0.25), & \text{if } p_1 = p_2 = P_E = 0.5, \\ \left( \dfrac{p_1(1 - p_1 - P_E)}{1 - 2p_1},\ \dfrac{P_E(1 - p_1) - p_1(1 - p_1)}{1 - 2p_1} \right), & \text{otherwise}. \end{cases}
Proof. 
Following the proof of Theorem 3, one can obtain the mirrored upper bound of G_2^{-1}(·). The upper bound H(T|Y)_max is calculated from the condition H(T|Y) ≤ H(T) [2]. For the given p_1 and p_2, H(T|Y)_max is a constant. Because H(T|Y)_max also implies the minimization of MI(T;Y) in Equation (14), its associated error components can be obtained from the following equivalent relation (see (11) in [35]):
MI(T;Y) = 0 \iff \frac{p_{11}}{p_{21}} = \frac{p_{12}}{p_{22}}.
 ☐
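The components of Equation (25) can be checked numerically (our own sketch; the sample values p_1 = 0.9 and P_E = 0.5 reproduce the Point A setting given in Section 6): the resulting joint distribution has zero mutual information, so H(T|Y) attains H(T) = H(p_min).

```python
import numpy as np

def mutual_info(joint):
    """MI(T;Y) in bits for a 2x2 joint distribution p(t,y)."""
    pt = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    with np.errstate(divide='ignore', invalid='ignore'):
        return np.nansum(joint * np.log2(joint / (pt @ py)))

p1, PE = 0.9, 0.5                                  # assumed prior and error
e1 = p1 * (1 - p1 - PE) / (1 - 2 * p1)             # Equation (25)
e2 = (PE * (1 - p1) - p1 * (1 - p1)) / (1 - 2 * p1)
joint = np.array([[p1 - e1, e1], [e2, 1 - p1 - e2]])
print(joint)                # [[0.45 0.45], [0.05 0.05]]: the Point A setting
print(mutual_info(joint))   # ~0: T and Y are independent, H(T|Y) = H(p_min)
```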
Remark 9. 
Equations (25) and (26) equivalently represent a zero value of the mutual information, which indicates no correlation [30], or statistical independence [2], between the two variables T and Y.
Remark 10. 
When the information of p_1 and p_2 is known, the admissible region(s) is much more compact than when such information is unavailable. The shape of the admissible region(s) depends fully on the single parameter p_min. For example, if p_min = 0.1, the area is enclosed by the four-curve-one-line boundary “O F F′ D A′ O” in Figure 5. However, if p_1 = p_2 = 0.5, two admissible areas are formed; they are “O F A O” and “D F′ A D”, respectively.

6. Classification Interpretations to Some Key Points

For a better understanding of the theoretical insights between the bounds and errors, some key points shown in Figure 4 and Figure 5 are discussed. Those key points may hold special features in classifications. Novel interpretations are expected from the following discussions.
Point O: This point represents a zero value of H ( T | Y ) . It also suggests an exact classification without any error ( P e = P E = 0 ) by a specific setting of the joint distribution:
p 11 = p 1 , p 12 = 0 , p 21 = 0 , p 22 = p 2 .
This point is always admissible and independent of error types.
Point A: This point corresponds to the maximum value H(T|Y) = 1 for class-balanced classifications (p_1 = p_2). Three specific classification settings can be obtained to represent this point. The two settings from Equation (24c) actually correspond to no classification at all:
p 11 = 1 / 2 , p 12 = 0 , p 21 = 1 / 2 , p 22 = 0 , or p 11 = 0 , p 12 = 1 / 2 , p 21 = 0 , p 22 = 1 / 2 .
They also indicate zero information [36] from the classification decisions. The other setting is a random guessing from Equation (25):
p 11 = 1 / 4 , p 12 = 1 / 4 , p 21 = 1 / 4 , p 22 = 1 / 4 .
For the Bayesian errors, this point is always included by both Fano’s bound and Kovalevskij’s bound. However, according to the upper bound defined in Equation (21a), this point is non-admissible whenever p_1 ≠ p_2 holds. For the non-Bayesian errors, the point is either admissible or non-admissible depending on the given information about p_1 and p_2. This example suggests that the admissibility of a point generally relies on the given information in classifications.
Point D: This point occurs for the non-Bayesian classifications in a form of:
p 11 = 0 , p 12 = p 1 , p 21 = p 2 , p 22 = 0 .
In this case, one can exchange the labels for a perfect classification.
Point B: This point is located at the corner formed by the curved and linear upper bounds, with H(T|Y) = 0.8 and e = 0.4. Apart from Point O, this is another point obtained from Equation (21) that lies on Kovalevskij’s upper bound. The point can be realized from either Bayesian or non-Bayesian classifications. Suppose p_1 > p_2 = 0.4 for the Bayesian classifications. One then achieves Point B by Equation (21):
p 11 = 0 . 2 , p 12 = 0 . 4 , p 21 = 0 . 0 , p 22 = 0 . 4 ,
for a one-to-one mapping. In other words, this point is uniquely determined by Equation (31) and corresponds only to p_min = 0.4 within the Bayesian classifications. If non-Bayesian classifications are considered, this point becomes a one-to-many mapping with p_min ≤ 0.4. For example, one can obtain another setting of the joint distribution by first solving H(p_min) = 0.8 for p_min = 0.2430. Then, by substituting the relations p_2 = p_min and P_E = 0.4 into Equation (25), one obtains the error components e_1 = 0.2312 and e_2 = 0.1688 for the new setting of the joint distribution at Point B.
Point B becomes non-admissible when p m i n = 0 . 5 (Figure 4), which means no joint distribution exists to satisfy Equation (9). In this situation, we can understand why the new upper bound is generally tighter than Kovalevskij’s upper bound.
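The numbers quoted above for the non-Bayesian realization of Point B can be reproduced as follows (our own sketch using SciPy’s root finder): solve H(p_min) = 0.8 for p_min, then apply Equation (25) with p_2 = p_min and P_E = 0.4.

```python
import numpy as np
from scipy.optimize import brentq

H = lambda p: -p * np.log2(p) - (1 - p) * np.log2(1 - p)   # binary entropy

p_min = brentq(lambda p: H(p) - 0.8, 1e-6, 0.5)   # H(p_min) = 0.8 on (0, 0.5)
p1, PE = 1 - p_min, 0.4                           # p2 = p_min, as in the text
e1 = p1 * (1 - p1 - PE) / (1 - 2 * p1)            # Equation (25)
e2 = (PE * (1 - p1) - p1 * (1 - p1)) / (1 - 2 * p1)
print(p_min, e1, e2)                              # ~0.2430, ~0.2312, ~0.1688
```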
Point B′: This point, with H(T|Y) = 0.9710 and e = 0.4 in the diagram (Figure 4), is located exactly on the lower bound and produces a one-to-many mapping for either the Bayesian or the non-Bayesian errors. One specific setting in terms of the Bayesian errors is:
p 11 = 0 . 6 , p 12 = 0 . 0 , p 21 = 0 . 4 , p 22 = 0 . 0 ,
which suggests zero information from classifications. More settings can be obtained from Equation (15). For example, if given p 1 = 0 . 55 , p 2 = 0 . 45 and P e = 0 . 4 , one can have:
p 11 = 0 . 45 , p 12 = 0 . 10 , p 21 = 0 . 30 , p 22 = 0 . 15 .
Another setting is for the balanced error components:
p 11 = 0 . 3 , p 12 = 0 . 2 , p 21 = 0 . 2 , p 22 = 0 . 3 .
The non-Bayesian errors enlarge the set of one-to-many mappings for an admissible point due to the relaxed condition of Equation (13). Equation (24c) is applicable for deriving a specific setting when p_1 and e are given. For example, two settings can be obtained:
if p_1 = 0.250, P_E = 0.400, then p_{11} = 0.075, p_{12} = 0.175, p_{21} = 0.225, p_{22} = 0.525,
if p_1 = 0.300, P_E = 0.400, then p_{11} = 0.075, p_{12} = 0.225, p_{21} = 0.175, p_{22} = 0.525,
for representing the same Point B′.
Remark 11. 
One can observe that Equations (35) and (36) lead to zero mutual information, whereas Equations (33) and (34) do not. These observations reveal new interpretations of Fano’s bound in association with two situations of the independence properties of the variables, which have not been reported before.
Points E and E′: All points located on the general upper bound, like Point E, correspond to settings from the non-Bayesian errors. If a point is located on the lower bound, say E′, it can represent settings from either the Bayesian or the non-Bayesian errors, depending on the given information in the classifications. Points E and E′ form mirrored points. Their settings can be connected by the relation in Equation (23), but need not be. For example, one specific setting for Point E with p_1 = 0.3 and p_2 = 0.7 is:
p 11 = 0 . 0 , p 12 = 0 . 3 , p 21 = 0 . 0 , p 22 = 0 . 7 ,
and the other for Point E′ with p_1 = 0.8 and p_2 = 0.2 is:
p_{11} = 20/30, \quad p_{12} = 4/30, \quad p_{21} = 5/30, \quad p_{22} = 1/30.
They are mirrored to each other but have no label-switching relation.
Points A and A′: When P_E = 0.5 and p_min = 0.1, Points A and A′ form a pair of end points for the given conditions. Supposing p_1 = 0.9 and p_2 = 0.1, one can get the specific setting for Point A′ from Equation (21c):
p_{11} = 0.4, \quad p_{12} = 0.5, \quad p_{21} = 0.0, \quad p_{22} = 0.1,
and one for Point A from Equation (25):
p 11 = 0 . 45 , p 12 = 0 . 45 , p 21 = 0 . 05 , p 22 = 0 . 05 .
Points Q and R: These two points are special because of their positions in the diagrams. For either type of error, both points are non-admissible, because no joint distribution exists in binary classifications that can represent them.

7. Summary

This work investigates the lower and upper bounds between entropy and error probability. An optimization approach is proposed for deriving the bound functions from a joint distribution. As a preliminary work, we consider binary classifications for a case study. Through this approach, Fano’s lower bound receives novel interpretations. A new upper bound is derived and shown to be generally tighter than Kovalevskij’s upper bound. The closed-form relations between the bounds and their error components are presented. The analytical results lead to a better understanding of the sharp conditions of the bounds in terms of error components. Because classifications involve either Bayesian or non-Bayesian errors, we demonstrate the bounds comparatively for both types of errors.
We recognize that analytical tractability is an issue for the proposed approach. Fortunately, a symbolic software tool is helpful for solving complex problems with different semi-analytical means (such as in [37,38]). The semi-analytical solution used in this work refers to the analytical derivation of all possible solutions combined with the numerical verification of the final solution(s).

8. Discussions

To emphasize the importance of the study, we present discussions below from the perspective of machine learning in the context of big-data classifications. We consider that binary classifications will be one of the key techniques for implementing a divide-and-conquer strategy [39] to process large quantities of data efficiently. Class-imbalance problems with extremely skewed ratios are mostly formed from a one-against-other division scheme [31] in binary classifications. Researchers and users are, of course, concerned with the error components by type for performance evaluation [32]. Knowledge of the bounds in relation to error components is therefore desirable for both theoretical and application purposes.
From the viewpoint of machine learning, the bounds derived in this work provide a basic link between the learning targets of error and entropy in the related studies. Error-based learning is more conventional because of its compatibility with our daily-life intuitions, such as “trial and error”. Significant studies have been reported under this category. In comparison, information-based learning [40] is relatively new and uncommon in some applications, such as classifications. Entropy is not a concept closely related to our intuition in decision making; this is one of the reasons why the learning target is chosen mainly based on error rather than on entropy. However, error is an empirical concept, whereas entropy is theoretical and general [41]. In [35], we demonstrated that entropy can deal with both the notions of error and reject in abstaining classifications. Information-based learning [40] presents a promising and wider perspective for exploring and interpreting learning mechanisms.
When considering all sides of the issues stemming from machine learning studies, we believe that “what to learn” is the primary problem. However, it seems that more investigations have focused on the issue of “how to learn”, which should be regarded as a second-level problem. Moreover, in comparison with the long-standing yet still active theme of feature selection, little study has been done from the perspective of learning target selection. We propose that this theme should be emphasized in the study of machine learning. Hence, the relations studied in this work are fundamental and crucial, to the extent that researchers using either error-based or entropy-based approaches can reach a better understanding of the counterpart approach.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grants No. 61273196 and 61573348 for Bao-Gang Hu, and under Grant No. 60903089 for Hong-Jie Xing. The first version of this work, entitled “Analytical bounds between entropy and error probability in binary classifications”, appeared as arXiv:1205.6602v1 [cs.IT] on 30 May 2012. Thanks to T. Uyematsu and the anonymous reviewers for their valuable comments and suggestions.

Author Contributions

Bao-Gang Hu proposed the core concepts, derived the theorems, implemented the Maple code, and wrote the paper. Hong-Jie Xing provided comments and proofread the paper. Both authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix

Figure A1. Maple code for deriving the lower bound.
Figure A2. Maple code for deriving the upper bound.

References

  1. Fano, R.M. Transmission of Information: A Statistical Theory of Communication. Am. J. Phys. 1961. [Google Scholar] [CrossRef]
  2. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; John Wiley: New York, NY, USA, 2006. [Google Scholar]
  3. Verdú, S. Fifty years of Shannon theory. IEEE Trans. Inf. Theory 1998, 44, 2057–2078. [Google Scholar] [CrossRef]
  4. Yeung, R.W. A First Course in Information Theory; Kluwer Academic: London, UK, 2002. [Google Scholar]
  5. Golic, J.D. Comment on “Relations between entropy and error probability”. IEEE Trans. Inf. Theory 1999. [Google Scholar] [CrossRef]
  6. Vajda, I.; Zvárová, J. On generalized entropies, Bayesian decisions and statistical diversity. Kybernetika 2007, 43, 675–696. [Google Scholar]
  7. Morales, D.; Vajda, I. Generalized information criteria for optimal Bayes decisions. Kybernetika 2012, 48, 714–749. [Google Scholar]
  8. Kovalevskij, V.A. The Problem of Character Recognition from the Point of View of Mathematical Statistics. In Character Readers and Pattern Recognition; Spartan: New York, NY, USA, 1968; pp. 3–30. [Google Scholar]
  9. Chu, J.T.; Chueh, J.C. Inequalities between information measures and error probability. J. Frankl. Inst. 1966, 282, 121–125. [Google Scholar] [CrossRef]
  10. Tebbe, D.L.; Dwyer, S.J. Uncertainty and probability of error. IEEE Trans. Inf. Theory 1968, 16, 516–518. [Google Scholar] [CrossRef]
  11. Hellman, M.E.; Raviv, J. Probability of error, equivocation, and the Chernoff bound. IEEE Trans. Inf. Theory 1970, 16, 368–372. [Google Scholar] [CrossRef]
  12. Chen, C.H. Theoretical comparison of a class of feature selection criteria in pattern recognition. IEEE Trans. Comput. 1971, 20, 1054–1056. [Google Scholar] [CrossRef]
  13. Ben-Bassat, M.; Raviv, J. Rényi’s entropy and the probability of error. IEEE Trans. Inf. Theory 1978, 24, 324–330. [Google Scholar] [CrossRef]
  14. Golić, J.D. On the relationship between the information measures and the Bayes probability of error. IEEE Trans. Inf. Theory 1987, 35, 681–690. [Google Scholar] [CrossRef]
  15. Feder, M.; Merhav, N. Relations between entropy and error probability. IEEE Trans. Inf. Theory 1994, 40, 259–266. [Google Scholar] [CrossRef]
  16. Han, T.S.; Verdú, S. Generalizing the Fano inequality. IEEE Trans. Inf. Theory 1994, 40, 1247–1251. [Google Scholar]
  17. Poor, H.V.; Verdú, S. A Lower bound on the probability of error in multihypothesis testing. IEEE Trans. Inf. Theory 1995, 41, 1992–1994. [Google Scholar] [CrossRef]
  18. Harremoës, P.; Topsøe, F. Inequalities between entropy and index of coincidence derived from information diagrams. IEEE Trans. Inf. Theory 2001, 47, 2944–2960. [Google Scholar] [CrossRef]
  19. Erdogmus, D.; Principe, J.C. Lower and upper bounds for misclassification probability based on Renyi’s information. J. VLSI Signal Process. 2004, 37, 305–317. [Google Scholar] [CrossRef]
  20. Ho, S.-W.; Verdú, S. On the interplay between conditional entropy and error probability. IEEE Trans. Inf. Theory 2010, 56, 5930–5942. [Google Scholar] [CrossRef]
  21. Liang, X.-B. A note on Fano’s inequality. In Proceedings of the 2011 45th Annual Conference on Information Sciences and Systems, Baltimore, MD, USA, 23–25 March 2011.
  22. Fano, R.M. Fano inequality. Scholarpedia 2008. [Google Scholar] [CrossRef]
  23. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423. [Google Scholar] [CrossRef]
  24. Feder, M.; Merhav, N. Universal prediction of individual sequences. IEEE Trans. Inf. Theory 1992, 38, 1258–1270. [Google Scholar] [CrossRef]
  25. Wang, Y.; Hu, B.-G. Derivations of normalized mutual information in binary classifications. In Proceedings of the 6th International Conference on Fuzzy Systems and Knowledge Discovery, Tianjin, China, 14–16 August 2009; pp. 155–163.
  26. Hu, B.-G. What are the differences between Bayesian classifiers and mutual-information classifiers? IEEE Trans. Neural Net. Learn. Syst. 2014, 25, 249–264. [Google Scholar]
  27. Eriksson, T.; Kim, S.; Kang, H.-G.; Lee, C. An information-theoretic perspective on feature selection in speaker recognition. IEEE Signal Process. Lett. 2005, 12, 500–503. [Google Scholar] [CrossRef]
  28. Fisher, J.W.; Siracusa, M.; Tieu, K. Estimation of signal information content for classification. In Proceedings of the IEEE DSP Workshop, Marco Island, FL, USA, 4–7 January 2009; pp. 353–358.
  29. Taneja, I.J. Generalized error bounds in pattern recognition. Pattern Recognit. Lett. 1985, 3, 361–368. [Google Scholar] [CrossRef]
  30. Duda, R.O.; Hart, P.E.; Stork, D. Pattern Classification, 2nd ed.; John Wiley: New York, NY, USA, 2001. [Google Scholar]
  31. Cristianini, N.; Shawe-Taylor, J. An Introduction to Support Vector Machines and other Kernel-based Learning Methods; Cambridge University Press: London, UK, 2000. [Google Scholar]
  32. He, H.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284. [Google Scholar]
  33. Sun, Y.M.; Wong, A.K.C.; Kamel, M.S. Classification of imbalanced data: A review. Int. J. Pattern Recognit. Artif. Intell. 2009, 23, 687–719. [Google Scholar] [CrossRef]
  34. LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
  35. Hu, B.-G.; He, R.; Yuan, X.-T. Information-theoretic measures for objective evaluation of classifications. Acta Autom. Sin. 2012, 38, 1160–1173. [Google Scholar] [CrossRef]
  36. Mackay, D.J.C. Information Theory, Inference, and Learning Algorithms; Cambridge University Press: Cambridge, UK, 2003. [Google Scholar]
  37. Subramanian, V.R.; White, R.E. Symbolic solutions for boundary value problems using Maple. Comput. Chem. Eng. 2000, 24, 2405–2416. [Google Scholar] [CrossRef]
  38. Temimi, H.; Ansari, A.R. A semi-analytical iterative technique for solving nonlinear problems. Comput. Math. Appl. 2011, 61, 203–210. [Google Scholar] [CrossRef]
  39. Jordan, M.I. On statistics, computation and scalability. Bernoulli 2013, 19, 1378–1390. [Google Scholar] [CrossRef]
  40. Principe, J.C. Information Theoretic Learning: Renyi’s Entropy and Kernel Perspectives; Springer: New York, NY, USA, 2010. [Google Scholar]
  41. Hu, B.-G. Information theory and its relation to machine learning. In Proceedings of the 2015 Chinese Intelligent Automation Conference; Springer-Verlag: Berlin/Heidelberg, Germany, 2015; pp. 1–11. [Google Scholar]
Figure 1. Schematic diagram of the pattern recognition systems (adapted from Figure 1.7 in [30]).
Figure 2. Bayesian decision boundary x_b for equal priors p(t_i) in a binary classification (adapted from Figure 2.17 in [30]).
Figure 3. Graphic diagram of the probability transformation between variables T and Y in a binary classification (or channel). Instead of using conditional probability p(y|t), joint probability distributions p(t,y) are applied to describe the channel.
Figure 4. Plot of bounds in a “P_e vs. H(T|Y)” diagram.
Figure 5. Plot of bounds in a “P_E vs. H(T|Y)” diagram.
