
# On Normalized Mutual Information: Measure Derivations and Properties

1 Department of Mechanical Engineering, University of Minnesota, Minneapolis, MN 55455, USA
2 Department of Industrial & Systems Engineering, University of Minnesota, Minneapolis, MN 55455, USA
Entropy 2017, 19(11), 631; https://doi.org/10.3390/e19110631
Received: 26 October 2017 / Revised: 12 November 2017 / Accepted: 20 November 2017 / Published: 22 November 2017
(This article belongs to the Special Issue Entropy: From Physics to Information Sciences and Geometry)

## Abstract

Starting with a new formulation for the mutual information (MI) between a pair of events, this paper derives alternative upper bounds and extends those to the case of two discrete random variables. Normalized mutual information (NMI) measures are then obtained from those bounds, emphasizing the use of least upper bounds. Conditional NMI measures are also derived for three different events and three different random variables. Since the MI formulation for a pair of events is always nonnegative, it can properly be extended to include weighted MI and NMI measures for pairs of events or for random variables that are analogous to the well-known weighted entropy. This weighted MI is generalized to the case of continuous random variables. Such weighted measures have the advantage over previously proposed measures of always being nonnegative. A simple transformation is derived for the NMI, such that the transformed measures have the value-validity property necessary for making various appropriate comparisons between values of those measures. A numerical example is provided.

## 1. Introduction

Originating with the classic and profoundly influential work by Shannon , the mutual information between discrete random variables X and Y, also referred to as transinformation (e.g., Reza ), is defined as
$$I(X;Y) = \sum_{i=1}^{I} \sum_{j=1}^{J} p(x_i, y_j) \log\left( \frac{p(x_i, y_j)}{p(x_i)\, p(y_j)} \right) \tag{1}$$
where $p ( x i , y j )$, $p ( x i )$, and $p ( y j )$ denote the joint and marginal probabilities and where the natural (base-e) logarithm will be used throughout this paper, although the base-2 logarithm is often used in information theory (with $log 2 a = log e a / log e 2$). Similarly, the conditional mutual information between X and Y given another random variable Z is defined as
$$I(X;Y|Z) = \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} p(x_i, y_j, z_k) \log\left( \frac{p(x_i, y_j, z_k)\, p(z_k)}{p(x_i, z_k)\, p(y_j, z_k)} \right) \tag{2}$$
(e.g.,  (p. 153);  (pp. 34–35);  (p. 23)).
The $I(X;Y)$ in (1) also follows from the Kullback-Leibler divergence or “statistical distance” of the probability distribution $\{p(x_i, y_j)\}$ from the corresponding independence distribution $\{p(x_i)\, p(y_j)\}$, but this does not apply to (2). The fundamental measures in (1) and (2) also lead to various entropies and inequalities. For example, it follows from (1) that
$$I(X;Y) = H(X) - H(X|Y) \tag{3}$$
where H(X) is the entropy of X and $H(X|Y)$ is called the conditional entropy of X given Y. An inequality such as $H(X) \ge H(X|Y)$ follows from (3) and the fact that $I(X;Y) \ge 0$. The $I(X;Y)$ and $I(X;Y|Z)$ are often defined via entropies as in (3) rather than directly as in (1) and (2) (e.g.,  (p. 139);  (p. 31)).
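The definition in (1) and the entropy identity $I(X;Y) = H(X) - H(X|Y)$ can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names and the 2 × 3 joint table are hypothetical illustrations, not data from this paper.

```python
import math

def marginals(joint):
    """Row sums p(x_i) and column sums p(y_j) of a joint probability table."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return px, py

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X;Y) as in (1), using the natural logarithm."""
    px, py = marginals(joint)
    return sum(
        pij * math.log(pij / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pij in enumerate(row)
        if pij > 0
    )

def conditional_entropy_x_given_y(joint):
    """H(X|Y) = -sum_ij p(x_i, y_j) log p(x_i | y_j)."""
    _, py = marginals(joint)
    return -sum(
        pij * math.log(pij / py[j])
        for row in joint
        for j, pij in enumerate(row)
        if pij > 0
    )

# Hypothetical 2 x 3 joint distribution (entries sum to 1).
joint = [[0.30, 0.10, 0.05],
         [0.05, 0.20, 0.30]]
px, py = marginals(joint)
mi = mutual_information(joint)
mi_from_entropies = entropy(px) - conditional_entropy_x_given_y(joint)
print(mi, mi_from_entropies)
```

The two printed values should agree up to floating-point rounding, and both are nonnegative.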
Since $I(X;Y)$ and $I(X;Y|Z)$ do not generally have fixed upper bounds, it is sometimes preferable to normalize those measures such that $0 \le I^*(X;Y) \le 1$ and $0 \le I^*(X;Y|Z) \le 1$. Those normalized variants, especially $I^*(X;Y)$, have been used for various purposes in a wide variety of situations such as measuring association (correlation) between X and Y (e.g., [8,9];  (pp. 230–238);  (pp. 83–85)), similarity or performance in cluster analysis used in pattern recognition and data mining (e.g., [12,13]), non-linear dependence between X and Y using histogram-based estimation (e.g., [14,15]), and measuring performance for classifier evaluation (e.g., ) and of image fusion (e.g., ).
As a clarification of the notation used throughout this paper, $I(X;Y)$, $I^*(X;Y)$, and related symbols are used so as to be consistent with the standard notation used in information theory. Of course, neither I nor I* is strictly a function of X or Y, but of the probability distribution $\{p(x_i, y_j)\}$ (and $p(x_i) = \sum_{j=1}^{J} p(x_i, y_j)$ and $p(y_j) = \sum_{i=1}^{I} p(x_i, y_j)$). Similarly, p is used for both the joint probability and the marginal probabilities instead of $p_{XY}$, $p_X$, and $p_Y$. When necessary for the sake of clarity, $I(\{p(x_i, y_j)\})$ and $I^*(\{p(x_i, y_j)\})$ are sometimes used.
As discussed in this paper, there are any number of ways of normalizing $I(X;Y)$ and $I(X;Y|Z)$, some of which result in important and unique properties. The analysis presented here is based on an alternative formulation of the mutual information between individual events $X = x_i$ and $Y = y_j$. This fundamental formulation also provides a convenient basis for introducing appropriate weighted variants of the normalized mutual information measures that, besides the probabilities, incorporate certain weights that are associated with the random variables. Furthermore, this paper discusses the important requirement that such measures need to take on numerical values that are indeed reasonable throughout the [0, 1]-interval. A simple transformation is derived to meet this requirement.

## 2. Mutual Information and Upper Bounds

#### 2.1. Pairwise Measure

The $I ( X ; Y )$ in (1) is a weighted mean of $I ( x i ; y j ) = log [ p ( x i , y j ) / p ( x i ) p ( y j ) ]$, which is considered to be a measure of the mutual information between the two events $X = x i$ and $Y = y j$ or the information conveyed by $Y = y j$ about $X = x i$ (e.g.,  (pp. 104–105);  (pp. 138–140)). This $I ( x i ; y j )$ has also been referred to as the self-mutual information for the event pair ( (p. 33)).
One of the limitations of this $I(x_i; y_j)$ is that it is not necessarily nonnegative. However, an alternative nonnegative measure can be defined, starting with the well-known inequality $\log c \le c - 1$ for all c > 0 (e.g.,  (p. 106)). Setting $c = b/a$ for $a > 0$ and $b > 0$ and multiplying each side of the inequality by a/b gives
$$f(a, b) = \frac{a}{b} \log\left( \frac{a}{b} \right) - \frac{a}{b} + 1 \ge 0 \tag{4}$$
with $f(a, b) = 0$ if, and only if, $a = b$. The function f is strictly convex in $a/b$ since the second-order derivative $d^2 f(a, b) / d(a/b)^2 = b/a > 0$. Then, setting $a = p(x_i, y_j)$ and $b = p(x_i)\, p(y_j)$ in (4) results in
$$I(x_i; y_j) = \frac{p(x_i, y_j)}{p(x_i)\, p(y_j)} \log\left( \frac{p(x_i, y_j)}{p(x_i)\, p(y_j)} \right) - \frac{p(x_i, y_j)}{p(x_i)\, p(y_j)} + 1 \ge 0 \tag{5}$$
which is proposed as a new measure of the mutual information between the two events $X = x i$ and $Y = y j$.
The following properties of $I ( x i ; y j )$ for all $X = x i$ and $Y = y j$ follow immediately from the definition in (5):
(i) $I(x_i; y_j) \ge 0$.
(ii) $I(x_i; y_j) = 0$ if, and only if, the events $X = x_i$ and $Y = y_j$ are independent.
(iii) $I(x_i; y_j) = I(y_j; x_i)$, i.e., I is symmetric in the events $X = x_i$ and $Y = y_j$.
(iv) $\sum_{i=1}^{I} \sum_{j=1}^{J} p(x_i)\, p(y_j)\, I(x_i; y_j) = I(X;Y)$ in (1).
Note that $I(x_i; y_j)$ is also defined when $p(x_i, y_j) = 0$ (with $I(x_i; y_j) = 1$) in the limiting sense that $a \log a \to 0$ as $a \to 0$.
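Properties (i) and (iv), together with the limiting value at an empty cell, can be illustrated with a short numerical sketch. The helper names and the 2 × 2 table (which deliberately contains one zero cell) are hypothetical.

```python
import math

def f(a, b):
    """f(a, b) from (4); at a = 0 the limit a*log(a) -> 0 gives the value 1."""
    if a == 0:
        return 1.0
    r = a / b
    return r * math.log(r) - r + 1.0

def pairwise_mi(joint):
    """Table of I(x_i; y_j) from (5), plus the marginals used to build it."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    table = [[f(pij, px[i] * py[j]) for j, pij in enumerate(row)]
             for i, row in enumerate(joint)]
    return table, px, py

def classical_mi(joint, px, py):
    """I(X;Y) from (1), for comparison with property (iv)."""
    return sum(pij * math.log(pij / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, pij in enumerate(row) if pij > 0)

# Hypothetical table with one empty cell, p(x_2, y_1) = 0.
joint = [[0.4, 0.1],
         [0.0, 0.5]]
table, px, py = pairwise_mi(joint)
weighted = sum(px[i] * py[j] * table[i][j]
               for i in range(len(px)) for j in range(len(py)))
reference = classical_mi(joint, px, py)
print(table)
print(weighted, reference)
```

Every entry of `table` is nonnegative, the entry for the empty cell equals 1, and the weighted mean with weights $p(x_i)\, p(y_j)$ matches $I(X;Y)$.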
Upper bounds on $I(x_i; y_j)$ in (5) can readily be determined from the fact that $\log[p(x_i, y_j) / p(x_i)\, p(y_j)]$ is a strictly increasing function of $p(x_i, y_j) / p(x_i)\, p(y_j)$. Then, since $p(x_i, y_j) \le p(x_i)$, it follows from (5) that
$$I(x_i; y_j) \le U_y(x_i; y_j) = \frac{p(x_i, y_j)}{p(x_i)\, p(y_j)} \left[ \log\left( \frac{1}{p(y_j)} \right) - 1 \right] + 1 \tag{6}$$
and, since $p(x_i, y_j) \le p(y_j)$,
$$I(x_i; y_j) \le U_x(x_i; y_j) = \frac{p(x_i, y_j)}{p(x_i)\, p(y_j)} \left[ \log\left( \frac{1}{p(x_i)} \right) - 1 \right] + 1 \tag{7}$$
with the least upper bound being $\min\{U_x(x_i; y_j), U_y(x_i; y_j)\}$.
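As a quick check of the two pairwise bounds, the sketch below evaluates the measure in (5) and both bounds for every cell of a small table and confirms that the smaller bound is never exceeded. It assumes that the first bound replaces the ratio inside the logarithm by $1/p(y_j)$ and the second by $1/p(x_i)$; the function name and the table are hypothetical.

```python
import math

def pairwise_measure_and_bounds(pij, pi, pj):
    """Return I(x_i; y_j) from (5) together with U_x and U_y."""
    r = pij / (pi * pj)
    info = r * math.log(r) - r + 1.0 if r > 0 else 1.0
    u_y = r * (math.log(1.0 / pj) - 1.0) + 1.0  # from p(x_i, y_j) <= p(x_i)
    u_x = r * (math.log(1.0 / pi) - 1.0) + 1.0  # from p(x_i, y_j) <= p(y_j)
    return info, u_x, u_y

# Hypothetical 2 x 3 joint distribution.
joint = [[0.30, 0.10, 0.05],
         [0.05, 0.20, 0.30]]
px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]

results = []
for i, row in enumerate(joint):
    for j, pij in enumerate(row):
        info, u_x, u_y = pairwise_measure_and_bounds(pij, px[i], py[j])
        results.append((i, j, info, min(u_x, u_y)))
        print(i, j, round(info, 4), round(min(u_x, u_y), 4))
```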

#### 2.2. Mean Measures

The mutual information between X (or the set of events ${ X = x i : i = 1 , … , I }$) and the specific event $Y = y j$ can logically be defined as the following weighted mean of $I ( x i ; y j )$ in (5) for i = 1, …, I:
$$I(X; y_j) = \sum_{i=1}^{I} p(x_i)\, I(x_i; y_j) = \sum_{i=1}^{I} \frac{p(x_i, y_j)}{p(y_j)} \log\left( \frac{p(x_i, y_j)}{p(x_i)\, p(y_j)} \right) \tag{8}$$
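or, in terms of conditional probabilities,
$$I(X; y_j) = \sum_{i=1}^{I} p(x_i | y_j) \log\left( \frac{p(x_i | y_j)}{p(x_i)} \right) = -\sum_{i=1}^{I} p(x_i | y_j) \log p(x_i) - H(X | y_j) \tag{9}$$
where $H(X | y_j) = -\sum_{i=1}^{I} p(x_i | y_j) \log p(x_i | y_j)$ is the conditional entropy of X given $Y = y_j$. The $I(x_i; Y)$ can similarly be defined as $I(x_i; Y) = \sum_{j=1}^{J} p(y_j)\, I(x_i; y_j)$. It follows from the properties of $I(x_i; y_j)$ that $I(X; y_j) \ge 0$ with equality when X and Y are independent.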
Hamming ( (pp. 140–141)) uses the term conditional mutual information for $I ( X ; y j )$, but this term is commonly reserved for a measure such as $I ( X ; Y | Z )$ in (2). From the first expression in (8), it seems that it is most appropriate to call $I ( X ; y j )$ the mutual information between X and $Y = y j$ or the information about X conveyed by the occurrence of the event $Y = y j$.
From the expression in (8) and the upper bounds on $I(x_i; y_j)$ in (6) and (7), upper bounds on $I(X; y_j)$ are given by
$$U_y(X; y_j) = \sum_{i=1}^{I} p(x_i)\, U_y(x_i; y_j) = -\log p(y_j) \tag{10}$$
$$U_x(X; y_j) = \sum_{i=1}^{I} p(x_i)\, U_x(x_i; y_j) = -\sum_{i=1}^{I} p(x_i | y_j) \log p(x_i) \tag{11}$$
Note that the bound in (11) is equal to the first term of $I(X; y_j)$ in (9). The same bounds are also obtained by setting $p(x_i, y_j) \le p(x_i)$ and $p(x_i, y_j) \le p(y_j)$ in the term $\log[p(x_i, y_j) / p(x_i)\, p(y_j)]$ of (8).
The $I ( X ; Y )$ in (1) is the following weighted mean of $I ( x i ; y j )$ in (5) or of $I ( X ; y j )$ in (8):
$$I(X;Y) = \sum_{i=1}^{I} \sum_{j=1}^{J} p(x_i)\, p(y_j)\, I(x_i; y_j) = \sum_{j=1}^{J} p(y_j)\, I(X; y_j) \tag{12}$$
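Upper bounds on $I(X;Y)$ are then obtained from (10)–(12) as
$$U_y(X;Y) = \sum_{j=1}^{J} p(y_j)\, U_y(X; y_j) = -\sum_{j=1}^{J} p(y_j) \log p(y_j) = H(Y) \tag{13}$$
$$U_x(X;Y) = \sum_{j=1}^{J} p(y_j)\, U_x(X; y_j) = -\sum_{i=1}^{I} p(x_i) \log p(x_i) = H(X) \tag{14}$$
The same bounds are also obtained by setting $p(x_i, y_j) \le p(x_i)$ and $p(x_i, y_j) \le p(y_j)$ in the term $\log[p(x_i, y_j) / p(x_i)\, p(y_j)]$ in (1).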
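The mean measures and their bounds can be traced numerically. The sketch below computes $I(X; y_j)$ for each column of a hypothetical table, using the conditional-probability form $\sum_i p(x_i|y_j) \log[p(x_i|y_j)/p(x_i)]$, and checks that averaging the per-event bounds over $p(y_j)$ gives $H(Y)$ and $H(X)$. All names are illustrative, and the table is assumed to have strictly positive marginals.

```python
import math

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def event_information(joint, j):
    """I(X; y_j) with its bounds U_x(X; y_j) and U_y(X; y_j)."""
    px = [sum(row) for row in joint]
    pyj = sum(row[j] for row in joint)
    cond = [row[j] / pyj for row in joint]  # p(x_i | y_j)
    info = sum(c * math.log(c / px[i]) for i, c in enumerate(cond) if c > 0)
    u_x = -sum(c * math.log(px[i]) for i, c in enumerate(cond))
    u_y = -math.log(pyj)
    return info, u_x, u_y

# Hypothetical 2 x 3 joint distribution.
joint = [[0.30, 0.10, 0.05],
         [0.05, 0.20, 0.30]]
px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]

per_event = [event_information(joint, j) for j in range(len(py))]
mi = sum(py[j] * per_event[j][0] for j in range(len(py)))
mean_u_x = sum(py[j] * per_event[j][1] for j in range(len(py)))
mean_u_y = sum(py[j] * per_event[j][2] for j in range(len(py)))
print(per_event)
print(mi, mean_u_x, entropy(px), mean_u_y, entropy(py))
```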

#### 2.3. Conditional Measures

In the case of three random variables X, Y, and Z with conditional probabilities $p ( x i | z k )$, $p ( y j | z k )$, and $p ( x i , y j | z k )$ with k = 1, …, K, one can define the mutual information between the events $X = x i$ and $Y = y j$ conditional on the event $Z = z k$ by setting $a = p ( x i , y j | z k )$ and $b = p ( x i | z k ) p ( y j | z k )$ in (4) so that
$$I(x_i; y_j | z_k) = \frac{p(x_i, y_j | z_k)}{p(x_i | z_k)\, p(y_j | z_k)} \left[ \log\left( \frac{p(x_i, y_j | z_k)}{p(x_i | z_k)\, p(y_j | z_k)} \right) - 1 \right] + 1 \tag{15}$$
where $I(x_i; y_j | z_k) \ge 0$ with equality only under conditional independence, i.e., if, and only if, $p(x_i, y_j | z_k) = p(x_i | z_k)\, p(y_j | z_k)$.
The mutual information between X and Y given $Z = z k$ can then be defined as the following weighted mean of $I ( x i ; y j | z k )$
$$I(X;Y | z_k) = \sum_{i=1}^{I} \sum_{j=1}^{J} p(x_i | z_k)\, p(y_j | z_k)\, I(x_i; y_j | z_k) = \sum_{i=1}^{I} \sum_{j=1}^{J} p(x_i, y_j | z_k) \log\left( \frac{p(x_i, y_j | z_k)}{p(x_i | z_k)\, p(y_j | z_k)} \right) \tag{16}$$
The conditional mutual information of X and Y given Z as defined in (2) follows from (16) as
$$I(X;Y|Z) = \sum_{k=1}^{K} p(z_k)\, I(X;Y | z_k) \tag{17}$$
Since $\log(a/b)$ is a strictly increasing function of $a/b$, the following upper bounds on $I(x_i; y_j | z_k)$ in (15) are obtained from $p(x_i, y_j | z_k) \le p(x_i | z_k)$ and $p(x_i, y_j | z_k) \le p(y_j | z_k)$:
$$I(x_i; y_j | z_k) \le U_y(x_i; y_j | z_k) = \frac{p(x_i, y_j | z_k)}{p(x_i | z_k)\, p(y_j | z_k)} \left[ \log\left( \frac{1}{p(y_j | z_k)} \right) - 1 \right] + 1 \tag{18}$$
$$I(x_i; y_j | z_k) \le U_x(x_i; y_j | z_k) = \frac{p(x_i, y_j | z_k)}{p(x_i | z_k)\, p(y_j | z_k)} \left[ \log\left( \frac{1}{p(x_i | z_k)} \right) - 1 \right] + 1 \tag{19}$$
From (16), (18), and (19), upper bounds on $I(X;Y | z_k)$ are given by
$$U_y(X;Y | z_k) = \sum_{i=1}^{I} \sum_{j=1}^{J} p(x_i | z_k)\, p(y_j | z_k)\, U_y(x_i; y_j | z_k) = -\sum_{j=1}^{J} p(y_j | z_k) \log p(y_j | z_k) = H(Y | z_k) \tag{20}$$
$$U_x(X;Y | z_k) = \sum_{i=1}^{I} \sum_{j=1}^{J} p(x_i | z_k)\, p(y_j | z_k)\, U_x(x_i; y_j | z_k) = -\sum_{i=1}^{I} p(x_i | z_k) \log p(x_i | z_k) = H(X | z_k) \tag{21}$$
From (17), (20), and (21), upper bounds on $I(X;Y|Z)$ become
$$U_y(X;Y|Z) = \sum_{k=1}^{K} p(z_k)\, U_y(X;Y | z_k) = H(Y|Z) \tag{22}$$
$$U_x(X;Y|Z) = \sum_{k=1}^{K} p(z_k)\, U_x(X;Y | z_k) = H(X|Z) \tag{23}$$
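The conditional bounds can be checked in the same way. The sketch below computes $I(X;Y|Z)$ directly from (2) for a hypothetical 2 × 2 × 2 joint table and compares it with the conditional entropies $H(X|Z)$ and $H(Y|Z)$; the function name, the index layout `p[i][j][k]`, and the table values are assumptions made for illustration.

```python
import math

def conditional_mi_and_bounds(p):
    """Return I(X;Y|Z), H(X|Z), H(Y|Z) for a joint table p[i][j][k]."""
    n_x, n_y, n_z = len(p), len(p[0]), len(p[0][0])
    pz = [sum(p[i][j][k] for i in range(n_x) for j in range(n_y))
          for k in range(n_z)]
    pxz = [[sum(p[i][j][k] for j in range(n_y)) for k in range(n_z)]
           for i in range(n_x)]
    pyz = [[sum(p[i][j][k] for i in range(n_x)) for k in range(n_z)]
           for j in range(n_y)]
    cmi = sum(
        p[i][j][k] * math.log(p[i][j][k] * pz[k] / (pxz[i][k] * pyz[j][k]))
        for i in range(n_x) for j in range(n_y) for k in range(n_z)
        if p[i][j][k] > 0
    )
    h_x_given_z = -sum(pxz[i][k] * math.log(pxz[i][k] / pz[k])
                       for i in range(n_x) for k in range(n_z) if pxz[i][k] > 0)
    h_y_given_z = -sum(pyz[j][k] * math.log(pyz[j][k] / pz[k])
                       for j in range(n_y) for k in range(n_z) if pyz[j][k] > 0)
    return cmi, h_x_given_z, h_y_given_z

# Hypothetical joint distribution of (X, Y, Z); entries sum to 1.
p = [[[0.20, 0.05], [0.05, 0.10]],
     [[0.05, 0.15], [0.10, 0.30]]]
cmi, h_xz, h_yz = conditional_mi_and_bounds(p)
print(cmi, h_xz, h_yz)
```

For any valid table the first printed value lies between 0 and the smaller of the other two.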

## 3. Normalizations

Let I denote any one of the mutual information measures in (5), (8), (12), and (15)–(17) with its derived upper bounds $U_x$ and $U_y$ and let
$$I^* = \frac{I}{U}, \qquad 0 \le I^* \le 1 \tag{24}$$
denote a normalized form of I. Either $U = U_x$ or $U = U_y$ would satisfy (24). However, there exist infinitely many potential candidates for U in (24), as represented by the α-order arithmetic mean
$$U_\alpha = \left( \frac{U_x^\alpha + U_y^\alpha}{2} \right)^{1/\alpha}, \qquad -\infty < \alpha < \infty \tag{25}$$
where $\alpha$ is some real-valued parameter. For any given (fixed) $U_x$ and $U_y$, $U_\alpha$ is a nondecreasing function of $\alpha$ and is strictly increasing unless $U_x = U_y$ ( (pp. 16–18)). Other means could also be considered, such as the logarithmic mean and Stolarsky means [20,21]. See also .
Particularly well-known members of $U_\alpha$ are the following:
$$U_{-\infty} = \min\{U_x, U_y\}, \quad U_{-1} = \frac{2 U_x U_y}{U_x + U_y}, \quad U_0 = \sqrt{U_x U_y}, \quad U_1 = \frac{U_x + U_y}{2}, \quad U_2 = \sqrt{\frac{U_x^2 + U_y^2}{2}}, \quad U_\infty = \max\{U_x, U_y\} \tag{26}$$
in increasing order of magnitude unless $U_x = U_y$ (when they are all equal). All $I^*$ from (24)–(26) are symmetric in X and Y. In its strong favor, $I^* = I / U_{-\infty}$ is the only member of (24) and (25) that is always capable of attaining the maximal value of 1.
In the case of $I(X;Y)$ with $U_x = H(X)$ and $U_y = H(Y)$ in (13) and (14), the most apparent normalized candidates are perhaps the following :
$$I^*_{(1)}(X;Y) = \frac{I(X;Y)}{\min\{H(X), H(Y)\}}, \quad I^*_{(2)}(X;Y) = \frac{I(X;Y)}{[H(X) + H(Y)]/2}, \quad I^*_{(3)}(X;Y) = \frac{I(X;Y)}{\max\{H(X), H(Y)\}} \tag{27}$$
Horibe  proved that $1 - I^*_{(3)}(X;Y)$ is a distance metric so that $I^*_{(3)}(X;Y)$ becomes a (normalized) similarity metric . The $I^*_{(2)}(X;Y)$ gives equal weight to H(X) and H(Y), as does $I(X;Y) / \sqrt{H(X) H(Y)}$ (see also $U_{-1}$ and $U_2$ in (26)). This $I(X;Y) / \sqrt{H(X) H(Y)}$, which has been suggested by Strehl and Ghosh , is somewhat analogous to the correlation coefficient $\rho = \mathrm{Cov}(X, Y) / \sqrt{\mathrm{Var}(X)\, \mathrm{Var}(Y)}$. As stated above with respect to $I / U_{-\infty}$ from (24) and (26), the $I^*_{(1)}(X;Y)$ in (27) is the single normalized $I(X;Y)$ that is always capable of attaining the value of 1.
As a further explanation of the last statement, consider the condition that either (a) for each i (i = 1, …, I), $p ( x i , y j ) > 0$ for at most one j or (b) for each j $( j = 1 , … , J )$, $p ( x i , y j ) > 0$ for at most one i. In terms of a contingency table with row variable X and column variable Y so that $p ( x i , y j )$ is the probability in row i and column j, this condition means that either (a) each row or (b) each column contains at most one nonzero $p ( x i , y j )$. The term “at most” is not needed if all of the marginal probabilities $p ( x i )$ and $p ( y j )$ are nonzero. For any given marginal distributions ${ p ( x i ) }$ and ${ p ( y j ) }$, this condition is clearly the one for which the mutual information (dependence, association) is at its maximum. No other ${ p ( x i , y j ) }$ distribution could plausibly or intuitively produce a larger $I ( X ; Y )$. Under this condition, irrespective of the values of I and J (dimensions of the $I × J$ contingency table), $I ( X ; Y ) = min { H ( X ) , H ( Y ) }$ so that $I ( 1 ) ∗ ( X ; Y ) = 1$, whereas all of the other normalized variants of $I ( X ; Y )$, including $I ( 2 ) ∗ ( X ; Y )$ and $I ( 3 ) ∗ ( X ; Y )$ in (27), take on values that are necessarily less than 1. Assuming that all of the marginal probabilities are nonzero, it is only when I = J (square contingency tables) that those other normalized variants of $I ( X ; Y )$ are able to attain their upper bound of 1.
Consequently, unless there are particular or compelling reasons to the contrary, normalizations of mutual information measures ought to be based on the least upper bounds. Thus, for the general formulation in (24), $U = \min\{U_x, U_y\}$ should be the standard normalizing factor so that
$$I^* = \frac{I}{\min\{U_x, U_y\}} \tag{28}$$
This will ensure that the attainable maximal value of $I^*$ is 1, irrespective of the marginal probabilities and the dimensions I and J.
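The argument for the least upper bound can be illustrated on a non-square table in which each row has a single nonzero cell, so that X determines Y. The sketch below assumes that the α-order mean has the usual power-mean form $[(U_x^\alpha + U_y^\alpha)/2]^{1/\alpha}$ with its standard limiting cases; the function names and the 3 × 2 table are hypothetical.

```python
import math

def power_mean(ux, uy, alpha):
    """alpha-order mean of two positive bounds, with the usual limiting cases."""
    if alpha == -math.inf:
        return min(ux, uy)
    if alpha == math.inf:
        return max(ux, uy)
    if alpha == 0:
        return math.sqrt(ux * uy)
    return ((ux ** alpha + uy ** alpha) / 2) ** (1 / alpha)

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical 3 x 2 table: each row has exactly one nonzero cell.
joint = [[0.2, 0.0],
         [0.3, 0.0],
         [0.0, 0.5]]
px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]
mi = sum(pij * math.log(pij / (px[i] * py[j]))
         for i, row in enumerate(joint)
         for j, pij in enumerate(row) if pij > 0)
hx, hy = entropy(px), entropy(py)

orders = (-math.inf, -1, 0, 1, 2, math.inf)
normalized = {a: mi / power_mean(hx, hy, a) for a in orders}
for a in orders:
    print(a, round(normalized[a], 4))
```

Only the normalization by the minimum reaches 1 here; the other choices give smaller values because H(X) exceeds H(Y) for this table.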

## 4. Weighted Mutual Information

The idea of weighted entropy introduced by Belis and Guiasu  and Guiasu ( (Chapter 4)) and extended to include weighted divergence (;  (pp. 33–91)) has also been formulated for mutual information as
$$I_w(X;Y) = \sum_{i=1}^{I} \sum_{j=1}^{J} w(x_i, y_j)\, p(x_i, y_j) \log\left( \frac{p(x_i, y_j)}{p(x_i)\, p(y_j)} \right) \tag{29}$$
where $w ( x i , y j )$ are some nonnegative weights that are associated with the random variables X and Y [28,29,30]. Of course, when $w ( x i , y j ) = 1$ for all i and j, (29) reduces to (1).
A limitation of this $I w ( X ; Y )$ and perhaps one reason for its relatively limited use is the fact that $I w ( X ; Y )$ is not necessarily nonnegative [30,31]. One way to overcome this limitation is to restrict the weighting function w, such that it depends only on one of the two variables X and Y . However, such a restriction is not necessary if the weighted mutual information measures are based on the nonnegative $I ( x i ; y j )$ in (5), such that
$$I_w(x_i; y_j) = w(x_i, y_j)\, I(x_i; y_j) \ge 0, \qquad i = 1, \ldots, I, \quad j = 1, \ldots, J \tag{30}$$
and from which
$$I_w(X; y_j) = \sum_{i=1}^{I} p(x_i)\, I_w(x_i; y_j) = \sum_{i=1}^{I} w(x_i, y_j) \frac{p(x_i, y_j)}{p(y_j)} \left[ \log\left( \frac{p(x_i, y_j)}{p(x_i)\, p(y_j)} \right) - 1 \right] + \sum_{i=1}^{I} w(x_i, y_j)\, p(x_i) \tag{31}$$
and
$$I_w(X;Y) = \sum_{j=1}^{J} p(y_j)\, I_w(X; y_j) = \sum_{i=1}^{I} \sum_{j=1}^{J} w(x_i, y_j)\, p(x_i, y_j) \left[ \log\left( \frac{p(x_i, y_j)}{p(x_i)\, p(y_j)} \right) - 1 \right] + \sum_{i=1}^{I} \sum_{j=1}^{J} w(x_i, y_j)\, p(x_i)\, p(y_j) \tag{32}$$
all of which are nonnegative. When $w(x_i, y_j) = 1$ for all i and j, (30)–(32) reduce to (5), (8), and (12), respectively.
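The difference between the two weighted formulations shows up as soon as the weights favor cells where the joint probability falls below the product of the marginals. The sketch below uses a hypothetical 2 × 2 table and hypothetical weights that put all weight on the off-diagonal cells; the function name is illustrative.

```python
import math

def weighted_measures(joint, w):
    """Return the earlier weighted MI (29) and the nonnegative form (32)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    earlier, proposed = 0.0, 0.0
    for i, row in enumerate(joint):
        for j, pij in enumerate(row):
            b = px[i] * py[j]
            if pij > 0:
                log_ratio = math.log(pij / b)
                earlier += w[i][j] * pij * log_ratio
                proposed += w[i][j] * (pij * (log_ratio - 1.0) + b)
            else:
                proposed += w[i][j] * b  # limiting value of the cell term
    return earlier, proposed

# Hypothetical table and weights.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
weights = [[0.0, 1.0],
           [1.0, 0.0]]
earlier, proposed = weighted_measures(joint, weights)
print(earlier, proposed)

unit = [[1.0, 1.0], [1.0, 1.0]]
print(weighted_measures(joint, unit))
```

With these weights the earlier form is negative while the form built on (5) stays positive; with unit weights the two coincide and equal $I(X;Y)$.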
In the case of conditional mutual information measures, nonnegative weighted equivalents can be derived by starting with
$$I_w(x_i; y_j | z_k) = w(x_i, y_j, z_k)\, I(x_i; y_j | z_k) \tag{33}$$
for i = 1, …, I, j = 1, …, J, k = 1, …, K, and for $I(x_i; y_j | z_k)$ in (15). The equivalents for (16) and (17) are then obtained from (33) as
$$I_w(X;Y | z_k) = \sum_{i=1}^{I} \sum_{j=1}^{J} p(x_i | z_k)\, p(y_j | z_k)\, I_w(x_i; y_j | z_k) \tag{34}$$
and
$$I_w(X;Y|Z) = \sum_{k=1}^{K} p(z_k)\, I_w(X;Y | z_k) \tag{35}$$
These weighted conditional measures are nonnegative since $I ( x i ; y j | z k ) ≥ 0$ and it is assumed that the weights $w ( x i , y j , z k ) ≥ 0$ for i = 1, …, I, j = 1, …, J, and k = 1, …, K.
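Upper bounds and normalizations for the weighted measures in (30)–(35) can be derived in the same way as done above for their unweighted equivalents. Consider, for example, the $I_w(X;Y)$ in (32). Since the log( ) is a strictly increasing function and since $p(x_i, y_j) \le p(x_i)$, it follows from (32) that
$$I_w(X;Y) \le U_{wy}(X;Y) = -\sum_{i=1}^{I} \sum_{j=1}^{J} w(x_i, y_j)\, p(x_i, y_j) \left[ \log p(y_j) + 1 \right] + \sum_{i=1}^{I} \sum_{j=1}^{J} w(x_i, y_j)\, p(x_i)\, p(y_j) \tag{36}$$
and, since $p(x_i, y_j) \le p(y_j)$,
$$I_w(X;Y) \le U_{wx}(X;Y) = -\sum_{i=1}^{I} \sum_{j=1}^{J} w(x_i, y_j)\, p(x_i, y_j) \left[ \log p(x_i) + 1 \right] + \sum_{i=1}^{I} \sum_{j=1}^{J} w(x_i, y_j)\, p(x_i)\, p(y_j) \tag{37}$$
For $w(x_i, y_j) = 1$ for all i and j, these upper bounds reduce to those in (13) and (14). The normalized form of $I_w(X;Y)$ with 1 as the attainable maximum value is given by
$$I_w^*(X;Y) = \frac{I_w(X;Y)}{\min\{U_{wx}(X;Y), U_{wy}(X;Y)\}} \tag{38}$$
with the denominator terms defined by (36) and (37). If all $w(x_i, y_j)$ are the same, not necessarily 1, (38) becomes the $I^*_{(1)}(X;Y)$ in (27).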
In the case when X and Y are continuous random variables, equivalent mutual information measures to all of those introduced above for the discrete case can be obtained by substituting probability density functions f for all of the probabilities p( ), and by substituting definite integrals for the summations. Thus, the equivalent of $I w ( x i , y j )$ in (30) and (5) becomes
$$I_w(x; y) = w(x, y) \frac{f(x, y)}{f(x)\, f(y)} \left[ \log\left( \frac{f(x, y)}{f(x)\, f(y)} \right) - 1 \right] + w(x, y) \ge 0 \tag{39}$$
and the equivalent of $I_w(X;Y)$ in (32) becomes
$$I_w(X;Y) = \int_x \int_y f(x)\, f(y)\, I_w(x; y)\, dx\, dy = \int_x \int_y w(x, y)\, f(x, y) \log\left( \frac{f(x, y)}{f(x)\, f(y)} \right) dx\, dy + \int_x \int_y w(x, y) \left[ f(x)\, f(y) - f(x, y) \right] dx\, dy \tag{40}$$
where the integrals are over the entire range of values of X and Y. When $w(x, y) = 1$ for all x and y, the nonnegative $I_w(X;Y)$ in (40) reduces to the well-known mutual information
$$I(X;Y) = \int_x \int_y f(x, y) \log\left( \frac{f(x, y)}{f(x)\, f(y)} \right) dx\, dy \tag{41}$$
which is also nonnegative since $I w ( X ; Y )$ is nonnegative for all $w ( x , y ) ≥ 0$.
However, mutual information measures in the continuous case, such as those in (39)–(41), cannot generally be normalized to the [0, 1]-interval unless particular constraints are imposed. Such continuous measures do not generally have fixed upper bounds. If, for example, X and Y have the joint normal distribution with correlation coefficient $\rho$, then $I(X;Y) = -\log \sqrt{1 - \rho^2}$ (e.g.,  (pp. 282–283)), which increases without an upper bound as ρ → 1.
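The unbounded growth in the bivariate normal case is easy to see numerically. The sketch below evaluates the standard closed form $-\tfrac{1}{2} \log(1 - \rho^2)$, in nats, for a few correlation values; the function name is illustrative.

```python
import math

def gaussian_mi(rho):
    """Mutual information (nats) of a bivariate normal pair with correlation rho."""
    return -0.5 * math.log(1.0 - rho * rho)

correlations = (0.0, 0.5, 0.9, 0.99, 0.999)
values = [gaussian_mi(rho) for rho in correlations]
for rho, value in zip(correlations, values):
    print(rho, value)
```

The values start at 0 for ρ = 0 and keep increasing as ρ approaches 1, so no fixed constant can serve as a normalizing upper bound.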

## 5. Value Validity

#### 5.1. Value-Validity Consideration

Let $I^*$ stand for any one of the normalized mutual information measures discussed above, and let $i_a^*$, $i_b^*$, etc. stand for its numerical values for different probability distributions. It is then generally of interest to make different types of comparisons between such numerical values. While there may be no particular reason to doubt the validity of size (order) comparisons such as $i_a^* > i_b^*$, more specific comparisons such as the difference comparisons $i_a^* - i_b^* > i_c^* - i_d^*$ or $i_a^* - i_b^* = k (i_c^* - i_d^*)$ (for constant k) may require certain restrictions or modifications on $I^*$ in order to be valid. The same type of validity requirement would apply to the interpretations of absolute values of $I^*$.
Although there are different types of validity that are used in measurement theory ( (Chapter 4)), value validity of a measure is used here to mean that all of the potential values of the measure provide true or realistic representations of the extent of the attribute being measured as supported by a generally acceptable criterion or condition. Such an analysis has been done for the normalized entropy , but a different approach is needed for the mutual information between two or more random variables.
One approach is to consider the binary random variables $X = x_1, x_2$ and $Y = y_1, y_2$ with all marginal probabilities equal to 1/2 and with the following joint probability distribution $\{p_{ij}^\alpha\}$:
$$p_{11}^\alpha = p_{22}^\alpha = \frac{1 + \alpha}{4}, \qquad p_{12}^\alpha = p_{21}^\alpha = \frac{1 - \alpha}{4} \tag{42}$$
where $\alpha \in [0, 1]$ is a real-valued parameter. Furthermore, consider the normalized mutual information measure $I^*$ in (28) that takes on values between 0 and 1, inclusive. Then, for any joint distribution $\{p(x_i, y_j)\}$ with i = 1, …, I and j = 1, …, J, there is a value of $\alpha$ for which the following equality holds:
$$I^*(\{p(x_i, y_j)\}) = I^*(\{p_{ij}^\alpha\}) = g(\alpha) \tag{43}$$
where g is a single-valued function of $α$. As a consequence of (43), the value validity of $I ∗$ for any ${ p ( x i , y j ) }$ can be considered based on ${ p i j α }$.
The $I^*$ takes on its extremal values when $\alpha = 0$ and $\alpha = 1$ with
$$I^*(\{p_{ij}^0\}) = 0, \qquad I^*(\{p_{ij}^1\}) = 1 \tag{44}$$
where $\{p_{ij}^0\}$ corresponds to the statistical independence condition and $\{p_{ij}^1\}$ corresponds to the complete dependence condition ($p_{11}^1 = p_{22}^1 = 1/2$ and $p_{12}^1 = p_{21}^1 = 0$). The probability distribution $\{p_{ij}^\alpha\} = (p_{11}^\alpha, p_{12}^\alpha, p_{21}^\alpha, p_{22}^\alpha)$ can be considered as a point (or vector) in four-dimensional Euclidean space with Cartesian coordinates $p_{11}^\alpha, \ldots, p_{22}^\alpha$. Then, first, the $\{p_{ij}^\alpha\}$ in (42) is seen to be the weighted mean of $\{p_{ij}^0\}$ and $\{p_{ij}^1\}$ as follows:
$$p_{ij}^\alpha = \alpha\, p_{ij}^1 + (1 - \alpha)\, p_{ij}^0, \qquad i, j = 1, 2 \tag{45}$$
Second, with $I v ∗$ denoting a normalized mutual information measure that has the value-validity property and in terms of the Euclidean distance d( ), the following equality between distance ratios is propounded as a logical relationship:
$$\frac{\left| I_v^*(\{p_{ij}^\alpha\}) - I_v^*(\{p_{ij}^0\}) \right|}{\left| I_v^*(\{p_{ij}^1\}) - I_v^*(\{p_{ij}^0\}) \right|} = \frac{d(\{p_{ij}^\alpha\}, \{p_{ij}^0\})}{d(\{p_{ij}^1\}, \{p_{ij}^0\})} \tag{46}$$
Since $d(\{p_{ij}^\alpha\}, \{p_{ij}^0\}) = \{2[(1 + \alpha)/4 - 1/4]^2 + 2[(1 - \alpha)/4 - 1/4]^2\}^{1/2} = \alpha/2$ and $d(\{p_{ij}^1\}, \{p_{ij}^0\}) = 1/2$, (46) can be expressed as
$$I_v^*(\{p_{ij}^\alpha\}) = \alpha\, I_v^*(\{p_{ij}^1\}) + (1 - \alpha)\, I_v^*(\{p_{ij}^0\}) \tag{47}$$
and, with $I_v^*(\{p_{ij}^0\}) = 0$ and $I_v^*(\{p_{ij}^1\}) = 1$ as in (44), (47) reduces to
$$I_v^*(\{p_{ij}^\alpha\}) = \alpha \tag{48}$$
The value-validity condition in (47) and (48) is also a logical implication from (45). That is, $I v ∗ ( { p i j α } )$ in (47) as a weighted mean of $I v ∗ ( { p i j 1 } )$ and $I v ∗ ( { p i j 0 } )$ is equivalent to the weighted mean of the probabilities in (45).
In the case when $\alpha = 1/2$, $p_{ij}^{1/2} = (p_{ij}^0 + p_{ij}^1)/2$ and $|p_{ij}^{1/2} - p_{ij}^0| = |p_{ij}^{1/2} - p_{ij}^1|$ for all i and j, so that the distance $d(\{p_{ij}^{1/2}\}, \{p_{ij}^0\}) = d(\{p_{ij}^{1/2}\}, \{p_{ij}^1\})$. Then $I_v^*(\{p_{ij}^{1/2}\}) = 1/2$ from (48), which is clearly the only logical value for a measure that can vary from 0 for $\{p_{ij}^0\}$ to 1 for $\{p_{ij}^1\}$. However, as discussed next for the normalized mutual information measures $I^*$ in (28), $I^*(\{p_{ij}^{1/2}\}) \ll 1/2$ and $I^*(\{p_{ij}^\alpha\})$ does not meet the value-validity condition in (48) without some required correction.
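The size of the shortfall at the midpoint can be seen by evaluating $I^*(X;Y)$ on the 2 × 2 family directly. The sketch below assumes that the family has cells $(1 + \alpha)/4$ on the diagonal and $(1 - \alpha)/4$ off the diagonal, which is what the distance calculation above implies, and that the normalization uses the minimum of the two entropies. Function names are illustrative.

```python
import math

def joint_alpha(alpha):
    """2 x 2 family with uniform marginals, indexed by alpha in [0, 1]."""
    a, b = (1 + alpha) / 4, (1 - alpha) / 4
    return [[a, b], [b, a]]

def nmi_min(joint):
    """I(X;Y) divided by min{H(X), H(Y)}."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = sum(pij * math.log(pij / (px[i] * py[j]))
             for i, row in enumerate(joint)
             for j, pij in enumerate(row) if pij > 0)
    hx = -sum(p * math.log(p) for p in px if p > 0)
    hy = -sum(p * math.log(p) for p in py if p > 0)
    return mi / min(hx, hy)

curve = {alpha: nmi_min(joint_alpha(alpha))
         for alpha in (0.0, 0.25, 0.5, 0.75, 1.0)}
for alpha, value in curve.items():
    print(alpha, round(value, 4))
```

At α = 0.5 the uncorrected measure is roughly 0.19, far below the value of 0.5 that the value-validity condition calls for.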

#### 5.2. Value-Validity Corrections of $I ∗$

Values of the normalized mutual information measures collectively included in $I^*$ in (28) and specifically defined in (5)–(14) have been computed for the joint probability distribution $\{p_{ij}^\alpha\}$ in (42) and for different values of $\alpha$. For each pair of upper bounds $U_x$ and $U_y$ in (6)–(14), $U_x = U_y$ since all of the marginal probabilities for the distribution in (42) equal 1/2. The results are summarized in Table 1.
It is clear from these results that all of the $I ∗$ measures fail to comply with the value-validity condition in (48). Their values are substantially smaller than the α-values, implying that those measures substantially understate the true extent of the normalized mutual information attribute or characteristic. The absolute extent of this understatement is greatest around the true midrange $( α ≈ 0.5 )$, while the relative understatement is greatest at the lower end (smaller α-values).
Rather than rejecting these $I^*$ measures because of their lack of value validity and hence their restricted utility, they can be corrected or modified so as to comply with the requirement in (48) by the use of the relationship in (43). Thus, with $I^*$ denoting any one of the measures in Table 1 and for any given joint probability distribution $\{p(x_i, y_j)\}$ for $i = 1, \ldots, I$ and $j = 1, \ldots, J$, the value of $\alpha$ can be determined so as to comply with the equality in (43). The solution $\alpha = I_C^*(\{p(x_i, y_j)\})$ then becomes the corrected value of $I^*$. Formally stated,
$$I_C^*(\{p(x_i, y_j)\}) = h[I^*(\{p(x_i, y_j)\})] = \alpha \tag{49}$$
where h is the inverse function of g in (43). This corrected $I_C^*$ will necessarily comply with the value-validity condition in (48).
For any given distribution $\{p(x_i, y_j)\}$, the corrected value $I_C^*(\{p(x_i, y_j)\})$ of $I^*(\{p(x_i, y_j)\})$ can be obtained by using a computer search algorithm to find the value of $\alpha$ for which $I^*(\{p(x_i, y_j)\}) = I^*(\{p_{ij}^\alpha\})$ for the $\{p_{ij}^\alpha\}$ in (42). The resulting $\alpha$-value, which can be determined to any degree of accuracy, is then the corrected $I_C^*(\{p(x_i, y_j)\})$. Alternatively, the function g in (43) and hence h in (49) may be determined analytically, such that
$$h[I^*(\{p_{ij}^\alpha\})] = \alpha \tag{50}$$
It is desirable that the function h be relatively simple and convenient to use rather than being a complex expression that is derived from some model or curve-fitting program.
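The computer search mentioned above can be as simple as a bisection, since the normalized measure increases with α on the 2 × 2 family. The sketch below is one possible implementation, not the procedure used to produce the tables in this paper. It assumes the family has cells $(1 \pm \alpha)/4$, for which $I^*(X;Y)$ has the closed form $[(1 + \alpha) \log(1 + \alpha) + (1 - \alpha) \log(1 - \alpha)] / (2 \log 2)$; the names `g` and `corrected_by_search` are illustrative.

```python
import math

def g(alpha):
    """I*(X;Y) for the 2 x 2 family with cells (1 + alpha)/4 and (1 - alpha)/4."""
    def xlogx(x):
        return x * math.log(x) if x > 0 else 0.0
    return (xlogx(1 + alpha) + xlogx(1 - alpha)) / (2 * math.log(2))

def corrected_by_search(i_star, tol=1e-12):
    """Bisection for the alpha in [0, 1] with g(alpha) = i_star."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) < i_star:
            lo = mid  # the target lies to the right of mid
        else:
            hi = mid  # the target lies at or to the left of mid
    return (lo + hi) / 2

alpha_for_031 = corrected_by_search(0.31)
print(alpha_for_031)
```

Bisection is chosen here only because it needs no derivative and cannot leave the [0, 1] interval; any root finder for a monotone function would serve equally well.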
Consider the data in Table 2 for $I^*(X;Y)$ based on the distribution $\{p_{ij}^\alpha\}$ in (42) and different values of $\alpha$. After exploring various forms of h in (50), two potential candidates for h are presented in Table 2 as approximations to the equality in (50). The square-root function mentioned in  does provide quite respectable approximations, i.e., $\sqrt{I^*(\{p_{ij}^\alpha\})} \approx \alpha$. In fact, for the fitted model $\hat{\alpha} = \sqrt{I^*(\{p_{ij}^\alpha\})}$ and the data in Table 2, the coefficient of determination, when properly computed , is found to be $R^2 = 1 - \sum (\alpha - \sqrt{I^*})^2 / \sum (\alpha - \bar{\alpha})^2 = 0.97$.
However, a superior approximation can be achieved by means of regression analysis. Thus, for the simple model $(1 - \alpha) = (1 - \sqrt{I^*(\{p_{ij}^\alpha\})})^\beta$ and for the data in Table 2, the estimated $\beta = 1.2315$, or, when rounded off to a nearby simple fraction, $\beta = 11/9$. The resulting function h in (50), i.e., $1 - (1 - \sqrt{I^*(\{p_{ij}^\alpha\})})^{11/9}$, is seen from the results in Table 2 to satisfy (50) to a high degree of approximation. In fact, if the data are rounded off to the second decimal place, which is clearly sufficient for most practical purposes, it is seen from Table 2 that the equality in (50) holds exactly.
Consequently, it follows from (49) that the corrected value of $I^*$ becomes
$$I_C^*(\{p(x_i, y_j)\}) = 1 - \left( 1 - \sqrt{I^*(\{p(x_i, y_j)\})} \right)^{11/9} \tag{51}$$
which complies with the value-validity condition in (48) to a high degree of approximation. Although the final part of the analysis has been based specifically on the normalized form of $I ( X ; Y )$ in (1), the value-validity correction in (51) is also applicable to the other normalized mutual information measures, such as $I ∗ ( x i ; y j )$ and $I ∗ ( X ; y j )$, as discussed above and subject to the normalization in (28). This proposition is supported by the fact that, as indicated by the data in Table 1, all of the normalized measures deviate from the value-validity condition in (48) to a comparable extent.
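The closed-form correction can be compared against the 2 × 2 family on a grid of α-values. The sketch below assumes that the transformation is $1 - (1 - \sqrt{I^*})^{11/9}$, which is the reading under which the values $I^*(X;Y) = 0.31$ and $I_C^*(X;Y) = 0.63$ reported in Section 5.3 agree, and that the family has cells $(1 \pm \alpha)/4$. Function names are illustrative.

```python
import math

def g(alpha):
    """I*(X;Y) for the 2 x 2 family with cells (1 + alpha)/4 and (1 - alpha)/4."""
    def xlogx(x):
        return x * math.log(x) if x > 0 else 0.0
    return (xlogx(1 + alpha) + xlogx(1 - alpha)) / (2 * math.log(2))

def corrected_nmi(i_star):
    """Value-validity correction: 1 - (1 - sqrt(I*))**(11/9)."""
    return 1.0 - (1.0 - math.sqrt(i_star)) ** (11.0 / 9.0)

grid = [n / 10 for n in range(11)]
deviations = [abs(corrected_nmi(g(alpha)) - alpha) for alpha in grid]
for alpha, dev in zip(grid, deviations):
    print(alpha, round(corrected_nmi(g(alpha)), 4), round(dev, 4))

example = corrected_nmi(0.31)
print(round(example, 2))
```

On this grid the corrected value stays within 0.01 of α, and an uncorrected value of 0.31 maps to 0.63 after rounding to two decimals.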

#### 5.3. Numerical Example

Table 3 gives the real sample results of United States Senate elections (for four different years) based on data given by Reynolds ( (p. 2)). Here $X = x_1$ is the event that a vote is for a Democratic candidate and $X = x_2$ that it is for a Republican candidate. The variable Y refers to the three parties with which the voters were identified. Based on the sample probability distribution $\{p(x_i, y_j)\}$ in Table 3, the values of the various normalized mutual information measures have been computed, as presented in Table 3. The normalizations have all been based on the least upper bounds as in (28). The values of both the (uncorrected) measures $I^*$ and the value-validity corrected measures $I_C^*$ from (51) are given in Table 3.
The information measures may in this case be considered as measures of association (dependence, correlation) between the two categorical variables X and Y. Thus, from the overall measure $I C ∗ ( X ; Y ) = 0.63$, one can justifiably make the interpretation that there is a “somewhat high” or “substantial” degree of association between the party identification or affiliation of voters and of candidates. Or, in information-theory terminology, the (amount of) information about the vote (X) obtained by knowing the voters’ party identification is “somewhat high”. A similar numerical result is obtained from Cramér’s coefficient of association V (e.g.,  (p. 47)), with $V = 0.62$ for the ${ p ( x i , y j ) }$ distribution in Table 3. However, a very different and misleading result and interpretation would be obtained if based on the $I ∗ ( X ; Y ) = 0.31$ in Table 3.
A more detailed explanation about the association between X and Y can be gleaned from the $I C ∗ ( x i , y j )$ and $I C ∗ ( X ; y j )$ in Table 3. Both of the values $I C ∗ ( x i , y 2 )$ for i = 1, 2 and $I C ∗ ( X ; y 2 )$ show that relatively little information about the vote (X) is obtained from knowing that a (randomly selected) voter was an Independent $( Y = y 2 )$. Due to the value-validity property of $I C ∗$ and from the results that $I C ∗ ( x i ; y j ) ≥ 5 I C ∗ ( x i , y 2 )$ for $i = 1 , 2$ and $j = 1 , 3$, and $I C ∗ ( X ; y j ) ≥ 5 I C ∗ ( X ; y 2 )$ for $j = 1 , 3$, it is permissible to infer that at least five times as much information about the vote (X) is obtained by knowing that a voter was a Democrat than if the voter was an Independent. The same inference applies to the voter being a Republican versus an Independent. As another observation, the largest pairwise $I C ∗ ( x i ; y j )$ corresponds to the events $X = x 2$ (vote was for Republican candidate) and $Y = y 3$ (voter was a Republican) with $I C ∗ ( x 2 ; y 3 ) = 0.81$. That is, the event $Y = y 3$ provides a “very large” amount of information about the event $X = x 2$. This is significantly (about 23%) more than the $I C ∗ ( x 1 ; y 1 ) = 0.66$ (indicating a somewhat greater party loyalty by Republicans).
The $I_C^*(X; y_j)$ and $I_C^*(x_i; y_j)$, especially perhaps $I_C^*(X; y_j)$, are likely to be particularly useful when Y is an explanatory variable and X is a response variable. In this example, with $I_C^*(X; y_1) = 0.65$ and $I_C^*(X; y_3) = 0.77$, it can be concluded that the information about the vote (X) gained by knowing that a (randomly chosen) voter was a Republican was somewhat larger (by nearly 20%) than by knowing that the voter was a Democrat. Of course, for a small $2 \times 3$ table, as in Table 3, some of the above observations or results are rather apparent in general terms from the probabilities in Table 3, but the use of $I_C^*$ provides a means of quantifying those observations (results).

## 6. Conclusions

For the potential normalizations of various mutual information measures that are discussed in this paper, the least upper bounds have been emphasized, as for $I^*$ in (28). This provides $I^*$ with the desirable property that its upper limit of 1 can always be attained for any marginal probability distributions of the random variables X and Y (and Z in the conditional case), and for any dimensions I and J (and K in the conditional case). Such a property is generally required of any measure of association for categorical variables (e.g.,  (Chapter 33)). In the case of $I^*(X;Y)$ (i.e., $I_{(1)}^*(X;Y)$ in (27)), another normalized form could be considered, such as $I(X;Y) / \min\{\log I, \log J\}$, but this measure can only attain the value 1 when the smallest of the entropies H(X) and H(Y) involves equal (uniform) marginal probabilities.
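The contrast between the two normalizations can be illustrated with a small hypothetical example that is not taken from the paper: a $2 \times 2$ joint distribution in which X and Y are perfectly dependent but have non-uniform marginals (0.8, 0.2). The sketch below assumes $I^*(X;Y) = I(X;Y)/\min\{H(X), H(Y)\}$ for the entropy-based form; the function names and the distribution are illustrative only.

```python
import math

# Hypothetical joint distribution: perfect dependence, non-uniform marginals.
P = [
    [0.8, 0.0],
    [0.0, 0.2],
]


def marginals(joint):
    """Return the row (X) and column (Y) marginal distributions."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return px, py


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mutual_information(joint):
    """I(X;Y) = sum over cells of p_ij * log(p_ij / (p_i+ * p_+j))."""
    px, py = marginals(joint)
    return sum(
        p * math.log(p / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )


px, py = marginals(P)
mi = mutual_information(P)

# Normalization by the smaller marginal entropy (least upper bound).
by_entropy = mi / min(entropy(px), entropy(py))

# Normalization by the smaller log-dimension, min{log I, log J}.
by_dimension = mi / min(math.log(len(px)), math.log(len(py)))

print(f"I / min(H(X), H(Y))     = {by_entropy:.4f}")
print(f"I / min(log I, log J)   = {by_dimension:.4f}")
```

In this example the entropy-based form equals 1 (up to floating-point error), while the dimension-based form stays below 1 because the marginals are not uniform.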
It has also been emphasized above that, for comparisons other than size (order) comparisons such as $I^*(X_1; Y_1) > I^*(X_2; Y_2)$ for pairs of random variables $(X_1, Y_1)$ and $(X_2, Y_2)$, it is required that a measure have the value-validity property. Otherwise, results and conclusions may be incorrect and misleading. A simple transformation or correction of $I^*$ into $I_C^*$ provides for such a requirement. This more informative measure $I_C^*$ permits its numerical values to be properly interpreted as to their absolute magnitudes and to be compared, so as to truly represent the attribute (characteristic) being measured.
Besides the fact that it is preferable and more convenient to interpret and compare results that vary over a fixed interval such as [0, 1], a clear advantage of using $I_C^*$ (or $I^*$) over I is that $I_C^*$ (or $I^*$) controls or adjusts for the size of a data set. This makes it possible to compare the results for data sets of varying size (dimension). Such control (adjustment) can be achieved directly by using the normalizing denominator $\min\{\log I, \log J\}$, or indirectly, as argued in this paper, via the marginal probability distributions of the random variables.

## Conflicts of Interest

The author declares no conflict of interest.

## References

1. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423, 623–656.
2. Reza, F.M. An Introduction to Information Theory; McGraw-Hill: New York, NY, USA, 1961.
3. Hamming, R.W. Coding and Information Theory; Prentice-Hall: Englewood Cliffs, NJ, USA, 1980.
4. Han, T.S.; Kobayashi, K. Mathematics of Information and Coding; American Mathematical Society: Providence, RI, USA, 2002.
5. Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley: New York, NY, USA, 2006.
6. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
7. MacKay, D.J.C. Information Theory, Inference, and Learning Algorithms; Cambridge University Press: Cambridge, UK, 2003.
8. Horibe, Y. Entropy and correlation. IEEE Trans. Syst. Man Cybern. 1985, SMC-15, 641–642.
9. Kvålseth, T.O. Entropy and correlation: Some comments. IEEE Trans. Syst. Man Cybern. 1987, SMC-17, 517–519.
10. Wickens, T.D. Multiway Contingency Tables Analysis for the Social Sciences; Lawrence Erlbaum: Hillsdale, NJ, USA, 1989.
11. Tang, W.; He, H.; Tu, X.M. Applied Categorical and Count Data Analysis; CRC Press: Boca Raton, FL, USA, 2012.
12. Pfitzer, D.; Leibbrandt, R.; Powers, D. Characterization and evaluation of similarity measures of pairs of clusterings. Knowl. Inf. Syst. 2009, 19, 361–394.
13. Yang, Y.; Ma, Z.; Yang, Y.; Nie, F.; Shen, H.T. Multitask spectral clustering by exploring intertask correlation. IEEE Trans. Cybern. 2015, 45, 1069–1080.
14. Jain, N.; Murthy, C.A. A new estimate of mutual information based measure of dependence between two variables: Properties and fast implementation. Int. J. Mach. Learn. Cybern. 2015, 7, 857–875.
15. Reshef, D.N.; Reshef, Y.A.; Finucane, H.K.; Grossman, S.R.; McVean, G.; Turnbaugh, P.J.; Lander, E.S.; Mitzenmacher, M.; Sabeti, P.C. Detecting novel associations in large data sets. Science 2011, 334, 1518–1524.
16. Hu, B.-G. Information measure toolbox for classifier evaluation on open source software scilab. In Proceedings of the 2009 IEEE International Workshop on Open-Source Software for Scientific Computing (OSSC-2009), Guiyang, China, 18–20 September 2009; pp. 179–184.
17. Hossny, M.; Nahavandi, S.; Creighton, D. Comments on ‘Information measure for performance of image fusion’. Electron. Lett. 2009, 44, 1066–1067.
18. Hardy, G.H.; Littlewood, J.E.; Pólya, G. Inequalities; Cambridge University Press: Cambridge, UK, 1934.
19. Beckenbach, E.F.; Bellman, R. Inequalities; Springer: Heidelberg, Germany, 1971.
20. Stolarsky, K.B. Generalizations of the logarithmic mean. Math. Mag. 1975, 48, 87–92.
21. Ebanks, B. Looking for a few good means. Am. Math. Mon. 2012, 119, 658–669.
22. Chen, S.; Ma, B.; Zhang, K. On the similarity metric and the distance metric. Theor. Comput. Sci. 2009, 410, 2365–2376.
23. Strehl, A.; Ghosh, J. Cluster ensembles - A knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 2002, 3, 583–617.
24. Belis, M.; Guiasu, S. A quantitative-qualitative measure in cybernetic systems. IEEE Trans. Inf. Theory 1968, 14, 593–594.
25. Guiasu, S. Information Theory with Applications; McGraw-Hill: New York, NY, USA, 1977.
26. Taneja, H.C.; Tuteja, R.K. Characterization of a quantitative-qualitative measure of relative information. Inf. Sci. 1984, 33, 217–222.
27. Kapur, J.N. Measures of Information and Their Applications; Wiley Eastern: New Delhi, India, 1994.
28. Luan, H.; Qi, F.; Xue, Z.; Chen, L.; Shen, D. Multimodality image registration by maximization of quantitative-qualitative measures of mutual information. Pattern Recognit. 2008, 41, 285–298.
29. Schaffernicht, E.; Gross, H.-M. Weighted mutual information for feature selection. In Proceedings of the 21st International Conference on Artificial Neural Networks, Part II, Espoo, Finland, 14–17 June 2011; pp. 181–188.
30. Pocock, A.C. Feature Selection via Joint Likelihood. Ph.D. Thesis, School of Computer Science, University of Manchester, Manchester, UK, 2012.
31. Kvålseth, T.O. The relative useful information measure: Some comments. Inf. Sci. 1991, 56, 35–38.
32. Hand, D.J. Measurement Theory and Applications; Wiley: Chichester, UK, 2004.
33. Kvålseth, T.O. Entropy evaluation based on value validity. Entropy 2014, 16, 4855–4873.
34. Kvålseth, T.O. Association measures for nominal categorical variables. In International Encyclopedia of Statistical Science; Lovric, M., Ed.; Springer: Berlin/Heidelberg, Germany, 2011; Part 1; pp. 61–64.
35. Kvålseth, T.O. Cautionary note about $R^2$. Am. Stat. 1985, 39, 279–285.
36. Reynolds, H.T. The Analysis of Cross-Classification; The Free Press: New York, NY, USA, 1977.
37. Kendall, M.; Stuart, A. The Advanced Theory of Statistics, Volume 2: Inference and Relationships, 4th ed.; Charles Griffin: London, UK, 1979.

Table 1. Values of the normalized forms of the measures in (1), (5), and (8) for the probability distribution $\{p_{ij}^{\alpha}\}$ in (42) with differing $\alpha$-values.

| $I^*$ | $\alpha = 0.1$ | $\alpha = 0.3$ | $\alpha = 0.5$ | $\alpha = 0.7$ | $\alpha = 0.9$ |
|---|---|---|---|---|---|
| $I^*(x_1; y_1) = I^*(x_2; y_2)$ | 0.01 | 0.07 | 0.20 | 0.42 | 0.77 |
| $I^*(x_1; y_2) = I^*(x_2; y_1)$ | 0.01 | 0.06 | 0.18 | 0.37 | 0.69 |
| $I^*(X; y_1) = I^*(X; y_2)$ | 0.01 | 0.07 | 0.19 | 0.39 | 0.71 |
| $I^*(X; Y)$ | 0.01 | 0.07 | 0.19 | 0.39 | 0.71 |

Table 2. Values of $I^*$ for $I^*(X;Y) = I(X;Y) / \min\{H(X), H(Y)\}$ and the distribution $\{p_{ij}^{\alpha}\}$ in (42) with differing $\alpha$-values, as well as the corresponding values for two different functions h satisfying (50), approximately.

| $\alpha$ | $I^*(\{p_{ij}^{\alpha}\})$ | $\sqrt{I^*(\{p_{ij}^{\alpha}\})}$ | $1 - \left(1 - \sqrt{I^*(\{p_{ij}^{\alpha}\})}\right)^{11/9}$ |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0.1 | 0.0072 | 0.0849 | 0.1027 |
| 0.2 | 0.0291 | 0.1706 | 0.2044 |
| 0.3 | 0.0659 | 0.2567 | 0.3041 |
| 0.4 | 0.1187 | 0.3445 | 0.4033 |
| 0.5 | 0.1887 | 0.4344 | 0.5017 |
| 0.6 | 0.2781 | 0.5274 | 0.5999 |
| 0.7 | 0.3902 | 0.6247 | 0.6981 |
| 0.8 | 0.5310 | 0.7287 | 0.7970 |
| 0.9 | 0.7136 | 0.8447 | 0.8974 |
| 1 | 1 | 1 | 1 |

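
As a consistency check on Table 2, the following sketch recomputes the last two columns from the tabulated $I^*$ values. It assumes that those columns correspond to $h(I^*) = \sqrt{I^*}$ and $h(I^*) = 1 - (1 - \sqrt{I^*})^{11/9}$; this reading is inferred from the numbers themselves, and the names used are illustrative.

```python
import math

# Rows of Table 2 as tabulated: (alpha, I*, second h column, third h column).
TABLE_2 = [
    (0.0, 0.0, 0.0, 0.0),
    (0.1, 0.0072, 0.0849, 0.1027),
    (0.2, 0.0291, 0.1706, 0.2044),
    (0.3, 0.0659, 0.2567, 0.3041),
    (0.4, 0.1187, 0.3445, 0.4033),
    (0.5, 0.1887, 0.4344, 0.5017),
    (0.6, 0.2781, 0.5274, 0.5999),
    (0.7, 0.3902, 0.6247, 0.6981),
    (0.8, 0.5310, 0.7287, 0.7970),
    (0.9, 0.7136, 0.8447, 0.8974),
    (1.0, 1.0, 1.0, 1.0),
]


def h_sqrt(i_star):
    """ASSUMED first transformation: square root of I*."""
    return math.sqrt(i_star)


def h_corrected(i_star):
    """ASSUMED second transformation: 1 - (1 - sqrt(I*))**(11/9)."""
    return 1.0 - (1.0 - math.sqrt(i_star)) ** (11.0 / 9.0)


# Largest absolute gap between recomputed and tabulated values.
max_err = 0.0
for alpha, i_star, col_sqrt, col_corrected in TABLE_2:
    err = max(
        abs(h_sqrt(i_star) - col_sqrt),
        abs(h_corrected(i_star) - col_corrected),
    )
    max_err = max(max_err, err)
    print(f"alpha={alpha:.1f}  h1={h_sqrt(i_star):.4f}  "
          f"h2={h_corrected(i_star):.4f}  |gap|={err:.5f}")

print(f"largest gap: {max_err:.5f}")
```

Because the $I^*$ column is itself rounded to four decimals, small gaps in the last decimal place are to be expected. The third column also stays close to $\alpha$ itself over the whole range.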

Table 3. United States (U.S.) Senate election results in terms of sample probabilities (proportions) $p(x_i, y_j)$ for candidate vote (X) and voters’ party identification (Y) (sample size N = 2843). Source: Reynolds ( (p. 2)).

| Vote (X) \ Party identification (Y) | Democrat $(y_1)$ | Independent $(y_2)$ | Republican $(y_3)$ | Total |
|---|---|---|---|---|
| Democrat $(x_1)$ | 0.39 | 0.11 | 0.04 | 0.54 |
| Republican $(x_2)$ | 0.07 | 0.12 | 0.27 | 0.46 |
| Total | 0.46 | 0.23 | 0.31 | 1.00 |

Corresponding values for the normalized mutual information measures defined in the text:

| Measure | $j = 1$ | $j = 2$ | $j = 3$ |
|---|---|---|---|
| $I^*(x_1; y_j)$ | 0.35 | 0.01 | 0.46 |
| $I_C^*(x_1; y_j)$ | 0.66 | 0.12 | 0.75 |
| $I^*(x_2; y_j)$ | 0.33 | 0.01 | 0.55 |
| $I_C^*(x_2; y_j)$ | 0.65 | 0.13 | 0.81 |
| $I^*(X; y_j)$ | 0.33 | 0.01 | 0.49 |
| $I_C^*(X; y_j)$ | 0.65 | 0.13 | 0.77 |

Overall: $I^*(X; Y) = 0.31$; $I_C^*(X; Y) = 0.63$.