Abstract
Mutual information is one of the essential building blocks of information theory. However, it is finitely defined only for distributions in a subclass of the general class of all distributions on a joint alphabet. The unboundedness of mutual information prevents its potential utility from being extended to the general class. This is in fact a void in the foundation of information theory that needs to be filled. This article proposes a family of generalized mutual information whose members are indexed by a positive integer n, with the nth member being the mutual information of nth order. The mutual information of the first order coincides with Shannon’s, which may or may not be finite. It is established (a) that each mutual information of an order greater than 1 is finitely defined for all distributions of two random elements on a joint countable alphabet, and (b) that each and every member of the family enjoys all the utilities of a finite Shannon’s mutual information.
Keywords:
mutual information; Shannon’s entropy; conditional distribution of total collision; generalized entropy; generalized mutual information
MSC:
primary 60E10; secondary 94A15, 82B30
1. Introduction and Summary
This article proposes a family of generalized mutual information whose members are indexed by a positive integer n, with the nth member being the mutual information of nth order. The mutual information of the first order coincides with Shannon’s, which may or may not be finite. It is however established that each mutual information of an order greater than 1 is finitely defined for all distributions of two random elements on a joint countable alphabet, and that each and every member of the family enjoys several important utilities of a finite Shannon’s mutual information.
Let $Z$ be a random element on a countable alphabet $\mathcal{Z} = \{z_k; k \ge 1\}$ with an associated distribution $p = \{p_k; k \ge 1\}$. Let the cardinality of the support of $p$ on $\mathcal{Z}$ be denoted $K = \sum_{k \ge 1} 1[p_k > 0]$, where $1[\cdot]$ is the indicator function. $K$ is possibly finite or infinite. Let $\mathcal{P}$ denote the family of all distributions on $\mathcal{Z}$. Let $(X, Y)$ be a pair of random elements on a joint countable alphabet $\mathcal{X} \times \mathcal{Y} = \{(x_i, y_j); i \ge 1, j \ge 1\}$ with an associated joint probability distribution $p_{X,Y} = \{p_{i,j}; i \ge 1, j \ge 1\}$, and let the two marginal distributions be respectively denoted $p_X = \{p_{i,\cdot}; i \ge 1\}$ and $p_Y = \{p_{\cdot,j}; j \ge 1\}$. Let $\mathcal{P}_{X,Y}$ denote the family of all distributions on $\mathcal{X} \times \mathcal{Y}$. Shannon [1] offers two fundamental building blocks of information theory: Shannon’s entropy $H(Z) = -\sum_{k \ge 1} p_k \log p_k$, where the logarithm is 2-based, and mutual information $MI(X, Y) = H(X) + H(Y) - H(X, Y)$, where $H(X)$, $H(Y)$, and $H(X, Y)$ are the entropies respectively defined with the distributions $p_X$, $p_Y$, and $p_{X,Y}$.
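For a finite joint alphabet, both quantities can be computed directly from the joint probability table. The sketch below is a minimal illustration (the 2 × 3 table is an arbitrary example, not taken from the text): it evaluates the marginals, the three entropies, and $MI(X, Y)$ with base-2 logarithms.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (base 2) of a probability array; 0*log(0) is treated as 0."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_information(pxy):
    """MI(X,Y) = H(X) + H(Y) - H(X,Y) for a joint probability matrix pxy."""
    pxy = np.asarray(pxy, dtype=float)
    px = pxy.sum(axis=1)   # marginal distribution of X
    py = pxy.sum(axis=0)   # marginal distribution of Y
    return entropy(px) + entropy(py) - entropy(pxy)

# An arbitrary 2 x 3 joint distribution, used only for illustration.
pxy = np.array([[0.20, 0.10, 0.10],
                [0.05, 0.25, 0.30]])
print(mutual_information(pxy))                                          # > 0: X and Y are dependent
print(mutual_information(np.outer(pxy.sum(axis=1), pxy.sum(axis=0))))   # ~ 0: product of the marginals
```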
Mutual information plays a central role in the theory and the practice of modern data science for three basic reasons. First, the definition of $MI(X, Y)$ does not rely on any metrization on an alphabet, nor does it require the letters of the alphabet to be ordinal. This generality allows it to be defined and used in data spaces beyond the real coordinate space $\mathbb{R}^d$, where random variables (as opposed to random elements) reside. Second, when $X$ and $Y$ are random variables assuming real values, that is, the joint alphabet is metrized, $MI(X, Y)$ captures linear as well as any non-linear stochastic association between $X$ and $Y$. See Chapter 5 of [2] for examples. Third, it offers a single-valued index measure for the stochastic association between two random elements; more specifically, $MI(X, Y) \ge 0$ for any probability distribution of $X$ and $Y$ on a joint alphabet, and $MI(X, Y) = 0$ if and only if $X$ and $Y$ are independent, under a wide class of general probability distributions.
However, mutual information $MI(X, Y)$, in its current form, may not be finitely defined for joint distributions in a subclass of $\mathcal{P}_{X,Y}$, partially due to the fact that any or all of the three Shannon’s entropies in the linear combination $H(X) + H(Y) - H(X, Y)$ may be unbounded. The said unboundedness prevents the potential utility of mutual information from being fully realized, and hence there is a deficiency of $MI(X, Y)$, which leaves a void in $\mathcal{P}_{X,Y}$. (More detailed arguments are provided in Section 2 below.) This article introduces a family of generalized mutual information indexed by a positive integer $n$, denoted $\{MI_n(X, Y); n \ge 1\}$, each of whose members, $MI_n(X, Y)$, is referred to as the $n$th order mutual information. All members of the family are finitely defined for each and every $p_{X,Y} \in \mathcal{P}_{X,Y}$, except $MI_1(X, Y) = MI(X, Y)$, and all of them preserve the utilities of Shannon’s mutual information when it is finite.
The said deficiency of $MI(X, Y)$ is due to the fact that Shannon’s entropy may not be finite for “thick-tailed” distributions (with $p_k$ decaying slowly in $k$) in $\mathcal{P}$. To address the deficiency of $MI(X, Y)$, the issue of unboundedness of Shannon’s entropy on a subset of $\mathcal{P}$ must be addressed, through some generalization in one way or the other. The effort to generalize Shannon’s entropy has been long and extensive in the existing literature. The main perspective of the generalization in the existing literature is based on axiomatic characterizations of Shannon’s entropy. Interested readers may refer to [3,4] for details and references therein. In a nutshell, with respect to the functional form, $H = -\sum_{k \ge 1} p_k \log p_k$ is, under certain desirable axioms, for example those of [5,6], uniquely determined up to a multiplicative constant; if the strong additivity axiom is relaxed to one of the weaker versions, say $\alpha$-additivity or composability, then the entropy may take other forms, which give rise to Rényi’s entropy [7] and the Tsallis entropy [8]. However, all such generalization efforts do not seem to lead to an information measure on a joint alphabet that would possess all the desirable properties of $MI(X, Y)$, in particular $MI(X, Y) = 0$ if and only if $X$ and $Y$ are independent, which is supported by an argument via the Kullback–Leibler divergence [9].
Toward repairing the said deficiency of $MI(X, Y)$, a new perspective of generalizing Shannon’s entropy is introduced in this article. In the new perspective, instead of searching for alternative forms of $H$ under weaker axiomatic conditions, it is sought to apply Shannon’s entropy not to the original underlying distribution $p$ but to distributions induced by $p$. One particular set of such induced distributions is a family, $\{p(n); n \ge 1\}$, each of whose members is referred to as a conditional distribution of total collision (CDOTC) indexed by $n$. It is shown that Shannon’s entropy defined with every CDOTC induced by any $p \in \mathcal{P}$ is bounded above, provided that $n \ge 2$. The boundedness of the generalized entropy allows mutual information to be defined with the CDOTC of any degree $n \ge 2$ for any $p_{X,Y} \in \mathcal{P}_{X,Y}$. The resulting mutual information is referred to as the $n$th order mutual information index and is denoted $MI_n(X, Y)$, which is shown to possess all the desired properties of $MI(X, Y)$ but with boundedness guaranteed. The main results are given and established in Section 3, after several motivating arguments for the generalization of mutual information in Section 2.
2. Generalization Motivated
To further motivate the generalization of mutual information in this article, let the definition of mutual information be considered in a broader perspective. Inherited from the Kullback–Leibler divergence, mutual information on a joint alphabet, $MI(X, Y) = H(X) + H(Y) - H(X, Y)$, is unbounded for a large subclass of distributions in $\mathcal{P}_{X,Y}$. Example 1 below demonstrates the existence of such a subclass of joint distributions.
Example 1.
Let $p_X = \{p_k; k \ge 1\}$ be a probability distribution with $p_k > 0$ for every $k$ but unbounded entropy. Let $p_{X,Y} = \{p_{i,j}\}$ be such that $p_{i,j} = p_i$ for all $i = j$ and $p_{i,j} = 0$ for all $i \ne j$, hence $p_Y = p_X$ and $H(X, Y) = H(X) = H(Y) = \infty$. Then $MI(X, Y) = \infty$.
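For concreteness, one standard instance of a distribution with $p_k > 0$ for all $k$ and unbounded entropy (an illustrative choice, not specified in the example above) is $p_k \propto 1/(k(\log k)^2)$ for $k \ge 2$. The sketch below shows the entropy of truncated, renormalized versions of this distribution growing without bound as the truncation point increases.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Truncate the tail weights w_k = 1/(k (log k)^2), k = 2, ..., K, and renormalize.
# The full series of weights has a finite sum, so the limit is a valid distribution,
# yet the entropies of the truncated versions grow without bound as K increases.
for K in (10**3, 10**5, 10**7):
    k = np.arange(2, K + 1, dtype=float)
    w = 1.0 / (k * np.log(k) ** 2)
    print(K, round(entropy(w / w.sum()), 3))
```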
One of the most attractive properties of mutual information is that $MI(X, Y) \ge 0$ for all joint distributions $p_{X,Y} \in \mathcal{P}_{X,Y}$, and $MI(X, Y) = 0$ if and only if the two random elements $X$ and $Y$ are independent. However, the utility of mutual information goes beyond a mere indication of whether it is zero or not. The magnitude of mutual information is also of essential importance, although Shannon did not elaborate on that in his landmark paper [1]. The said importance is perhaps best illustrated by the notion of the standardized mutual information, defined as $\kappa(X, Y) = MI(X, Y)/H(X, Y)$, and Theorem 1 below.
Remark 1.
There are several variants of standardized mutual information proposed in the existing literature. Interested readers may refer to [10,11,12,13]. Not all variants of the standardized mutual information have the properties given in Theorem 1. A summary of standardized mutual information is found in Chapter 5 of [2].
However, before stating Theorem 1, Definition 1 below is needed.
Definition 1.
Random elements $X$ and $Y$ are said to have a one-to-one correspondence, or to be one-to-one corresponded, under a joint probability distribution $p_{X,Y} = \{p_{i,j}\}$ on $\mathcal{X} \times \mathcal{Y}$, if:
- for every $i$ satisfying $p_{i,\cdot} > 0$, there exists a unique $j$ such that $p_{i,j} = p_{i,\cdot}$, and
- for every $j$ satisfying $p_{\cdot,j} > 0$, there exists a unique $i$ such that $p_{i,j} = p_{\cdot,j}$.
Theorem 1.
Let $(X, Y)$ be a pair of random elements on alphabet $\mathcal{X} \times \mathcal{Y}$ with joint distribution $p_{X,Y}$ such that $H(X, Y) < \infty$. Then:
- $0 \le \kappa(X, Y) \le 1$,
- $\kappa(X, Y) = 0$ if and only if $X$ and $Y$ are independent, and
- $\kappa(X, Y) = 1$ if and only if $X$ and $Y$ are one-to-one corresponded.
A proof of Theorem 1 can be found on page 159 of [2]. Theorem 1 essentially maps the independence of $X$ and $Y$ (the strongest form of unrelatedness) to $\kappa(X, Y) = 0$, one-to-one correspondence (the strongest form of relatedness) to $\kappa(X, Y) = 1$, and everything else in between. In so doing, the magnitude of mutual information is utilized in measuring the degree of dependence in pairs of random elements, which could lead to all sorts of practical tools for evaluating, ranking, and selecting variables in data space.
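The three parts of Theorem 1 can be checked numerically with the form $\kappa(X, Y) = MI(X, Y)/H(X, Y)$ given above. The sketch below (the three joint tables are arbitrary examples) evaluates $\kappa$ for an independent pair, a one-to-one corresponded pair, and an intermediate case.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def kappa(pxy):
    """Standardized mutual information MI(X,Y) / H(X,Y), assuming H(X,Y) < infinity."""
    pxy = np.asarray(pxy, dtype=float)
    mi = entropy(pxy.sum(axis=1)) + entropy(pxy.sum(axis=0)) - entropy(pxy)
    return mi / entropy(pxy)

cases = {
    "independent": np.outer([0.3, 0.7], [0.4, 0.6]),      # kappa = 0
    "one-to-one":  np.diag([0.2, 0.3, 0.5]),              # kappa = 1
    "in between":  np.array([[0.35, 0.05],
                             [0.10, 0.50]]),              # 0 < kappa < 1
}
for name, pxy in cases.items():
    print(name, round(kappa(pxy), 4))
```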
It is important to note that the condition $H(X, Y) < \infty$ is essential in Theorem 1 since, obviously, without it $\kappa(X, Y)$ may not be well defined. In fact, if $H(X, Y) < \infty$ is not imposed, then even observing reasonable conventions, such as $c/\infty = 0$ for any finite constant $c$ and $\infty/\infty = 1$, the statements of Theorem 1 may not be true. To see this, consider the following constructed example.
Example 2.
Let $p_X = \{p_i; i \ge 1\}$ be a probability distribution with $p_i > 0$ for every $i$ but unbounded entropy. Let $p_{X,Y} = \{p_{i,j}; i \ge 1, j = 1, 2\}$ be such that
$$p_{i,1} = p_i \, 1[i = 1] \quad \text{and} \quad p_{i,2} = p_i \, 1[i \ge 2],$$
hence $p_{\cdot,1} = p_1$, $p_{\cdot,2} = 1 - p_1$, and $H(X, Y) = H(X) = \infty$. $X$ and $Y$ are obviously not independent, and
$$MI(X, Y) = H(Y) = -p_1 \log p_1 - (1 - p_1)\log(1 - p_1) < \infty.$$
It follows that $MI(X, Y) > 0$ but, in this case, $\kappa(X, Y) = MI(X, Y)/H(X, Y) = 0$. Therefore Part 2 of Theorem 1 fails.
Example 2 indicates that mutual information in its current form is deprived of the potential utility of Theorem 1 for a large class of joint distributions and therefore leaves much to be desired.
Another argument for the generalization of mutual information can be made from a statistical perspective. In practice, mutual information is often to be estimated from sample data. For statistical inference to be meaningful, the estimand needs to exist, i.e., $MI(X, Y) < \infty$. More specifically, in testing the hypothesis of independence between $X$ and $Y$, $H_0: p_{X,Y} \in \mathcal{P}_0$, where $\mathcal{P}_0$ is the subclass of all joint distributions for independent $X$ and $Y$ on $\mathcal{X} \times \mathcal{Y}$, $MI(X, Y)$ needs to be finitely defined in an open neighborhood of $\mathcal{P}_0$ in $\mathcal{P}_{X,Y}$, or else the logical framework of statistical inference is not well supported. Let $\mathcal{P}_\infty$ be the subclass of $\mathcal{P}_{X,Y}$ such that $MI(X, Y) = \infty$. In general, it can be shown that $\mathcal{P}_\infty$ is dense in $\mathcal{P}_{X,Y}$ with respect to the $p$-norm for any $p \ge 1$. Specifically, for any $p_{X,Y} \in \mathcal{P}_0$, there exists a sequence of distributions $\{q_\varepsilon\} \subset \mathcal{P}_\infty$ such that $\|q_\varepsilon - p_{X,Y}\|_p \to 0$ as $\varepsilon \to 0$. See Example 3 below.
Example 3.
Let $p_{X,Y} = \{p_{i,j}\}$ where $p_{i,j} = 1/4$ for all $(i, j)$ such that $1 \le i \le 2$ and $1 \le j \le 2$. Obviously $X$ and $Y$ are independent under $p_{X,Y}$, that is, $p_{X,Y} \in \mathcal{P}_0$. Let $q_\varepsilon = \{q_{i,j}\}$ be constructed based on $p_{X,Y}$ as follows.
Remove an arbitrarily small quantity $\varepsilon/4$, where $\varepsilon \in (0, 1)$, away from each of the four positive probabilities in $p_{X,Y}$ so each becomes $(1 - \varepsilon)/4$ for all $(i, j)$ with $1 \le i \le 2$ and $1 \le j \le 2$, such that a total mass of $\varepsilon$ is removed. Extend the range of $(X, Y)$ to $i \ge 3$ and $j \ge 3$, and allocate the mass $\varepsilon$ over the extended range according to
$$q_{i,j} = \frac{c\,\varepsilon}{i(\log i)^2}\,1[j = i], \quad i \ge 3,$$
where $c$ is such that $\sum_{i \ge 3} c/(i(\log i)^2) = 1$. Under the constructed $q_\varepsilon$, for any $\varepsilon \in (0, 1)$, $X$ and $Y$ are not independent, and the corresponding mutual information is
$$MI_{q_\varepsilon}(X, Y) \ge \sum_{i \ge 3} q_{i,i} \log \frac{q_{i,i}}{q_{i,\cdot}\,q_{\cdot,i}} = \sum_{i \ge 3} q_{i,i} \log \frac{1}{q_{i,i}} = \infty.$$
However, noting that $\|q_\varepsilon - p_{X,Y}\|_p \to 0$ as $\varepsilon \to 0$ for any $p \ge 1$, it follows that members of $\mathcal{P}_\infty$ may be found arbitrarily close to $p_{X,Y} \in \mathcal{P}_0$, and hence $\mathcal{P}_\infty$ is dense in $\mathcal{P}_{X,Y}$.
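A numeric sketch of this construction (using the same illustrative diagonal allocation as above): for a fixed small $\varepsilon$, the 1-norm distance between $q_\varepsilon$ and $p_{X,Y}$ stays at $2\varepsilon$, while the mutual information of finite truncations of $q_\varepsilon$ keeps increasing as the truncation lengthens, consistent with $MI_{q_\varepsilon}(X, Y) = \infty$.

```python
import numpy as np

def mi_from_cells(cells):
    """Mutual information (base 2) from a sparse list of nonzero cells (i, j, prob)."""
    rows, cols = {}, {}
    for i, j, p in cells:
        rows[i] = rows.get(i, 0.0) + p
        cols[j] = cols.get(j, 0.0) + p
    return sum(p * np.log2(p / (rows[i] * cols[j])) for i, j, p in cells)

def q_eps_cells(eps, K):
    """Nonzero cells of a truncated q_eps: the 2x2 block carries (1-eps)/4 per cell and
    the diagonal i = 3..K carries total mass eps, allocated proportionally to
    1/(i (log i)^2) and renormalized after truncation (illustrative allocation)."""
    i = np.arange(3, K + 1, dtype=float)
    tail = 1.0 / (i * np.log(i) ** 2)
    tail = eps * tail / tail.sum()
    cells = [(a, b, (1.0 - eps) / 4.0) for a in (1, 2) for b in (1, 2)]
    cells += [(int(k), int(k), float(t)) for k, t in zip(i, tail)]
    return cells

eps = 0.01
for K in (10, 10**3, 10**6):
    print(K, round(mi_from_cells(q_eps_cells(eps, K)), 4))  # increases with K, diverging as K grows
# The 1-norm distance between q_eps and the 2x2 uniform p_{X,Y} is 2*eps, independent of K.
```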
All things considered, it is therefore desirable to have a mutual information measure, say $MI^{*}(X, Y)$, or for that matter a family of such measures indexed by a positive integer $n$, such that $MI^{*}(X, Y) < \infty$ for all distributions in $\mathcal{P}_{X,Y}$, along with an accordingly defined standardized mutual information measure $\kappa^{*}(X, Y)$ such that the utility of Theorem 1 is preserved, with $\kappa^{*}(X, Y)$ in place of $\kappa(X, Y)$, for all distributions in $\mathcal{P}_{X,Y}$.
3. Main Results
Given $p = \{p_k; k \ge 1\} \in \mathcal{P}$ and a positive integer $n$, consider the experiment of drawing an identically and independently distributed (iid) sample of size $n$. Let $E_n$ denote the event that all $n$ observations of the sample take on a same letter in $\mathcal{Z}$, and let $E_n$ be referred to as the event of total collision. The conditional probability, given $E_n$, that the total collision occurs at letter $z_k$ is
$$p_k(n) = \frac{p_k^n}{\sum_{i \ge 1} p_i^n}. \qquad (1)$$
It is clear that $p(n) = \{p_k(n); k \ge 1\}$ is a probability distribution induced from $p$. For each $n$, $p(n)$ of (1) is the conditional distribution of total collision (CDOTC) with $n$ particles. Shannon’s entropy defined with $p(n)$, namely $H_n(Z) = -\sum_{k \ge 1} p_k(n) \log p_k(n)$, is referred to as the generalized entropy of order $n$; in particular, $H_1(Z) = H(Z)$.
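A direct numerical rendering of (1) (a sketch; the probability vector is an arbitrary example): raise each probability to the $n$th power and renormalize. As $n$ increases, the CDOTC concentrates its mass on the more probable letters.

```python
import numpy as np

def cdotc(p, n):
    """Conditional distribution of total collision of order n: p_k(n) = p_k^n / sum_i p_i^n."""
    w = np.asarray(p, dtype=float) ** n
    return w / w.sum()

p = np.array([0.5, 0.25, 0.125, 0.125])   # an arbitrary example distribution
print(cdotc(p, 1))   # n = 1 returns p itself
print(cdotc(p, 2))   # mass shifts toward the more probable letters
print(cdotc(p, 5))
```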
Remark 2.
It is to be noted that, given a $p \in \mathcal{P}$, $p(n)$ of (1) is a special member of the family of the escort distributions introduced by [14]. The escort distributions are a useful tool in thermodynamics. Interested readers may refer to [15] for a concise introduction.
Lemma 1.
For each $n$, $n \ge 1$, $p$ and $p(n)$ uniquely determine each other.
Proof.
By (1), $p(n)$ is determined by $p$. Conversely, for all $k \ge 1$,
$$p_k = \frac{(p_k(n))^{1/n}}{\sum_{i \ge 1} (p_i(n))^{1/n}}, \qquad (2)$$
so $p$ is determined by $p(n)$. The lemma follows. □
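The inversion in (2) can be checked numerically (a minimal sketch with an arbitrary example vector): taking $n$th roots of the CDOTC and renormalizing recovers the original $p$.

```python
import numpy as np

def cdotc(p, n):
    w = np.asarray(p, dtype=float) ** n
    return w / w.sum()

def invert_cdotc(pn, n):
    """Recover p from its CDOTC of order n via (2): p_k = p_k(n)^(1/n) / sum_i p_i(n)^(1/n)."""
    w = np.asarray(pn, dtype=float) ** (1.0 / n)
    return w / w.sum()

p = np.array([0.4, 0.3, 0.2, 0.1])
for n in (2, 3, 7):
    assert np.allclose(invert_cdotc(cdotc(p, n), n), p)
print("p is recovered from p(n) for each n tested")
```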
Lemma 2.
For each $n$, $n \ge 2$, and for any $p \in \mathcal{P}$, $H_n(Z) < \infty$.
Proof.
Write $\eta_n = \sum_{i \ge 1} p_i^n$, so that $p_k(n) = p_k^n/\eta_n$ and $0 < \eta_n \le 1$. Noting that $-x \log x \le (\log_2 e)/e$ for all $x \in (0, 1]$, and therefore $-p_k^n \log p_k \le p_k\,(-p_k \log p_k) \le p_k (\log_2 e)/e$ for all $k \ge 1$ when $n \ge 2$,
$$H_n(Z) = -\sum_{k \ge 1} \frac{p_k^n}{\eta_n} \log \frac{p_k^n}{\eta_n} = \log \eta_n + \frac{n}{\eta_n} \sum_{k \ge 1} \left(-p_k^n \log p_k\right) \le \log \eta_n + \frac{n \log_2 e}{e\,\eta_n} < \infty.$$
The lemma follows. □
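Numerically, the contrast between $H(Z)$ and $H_2(Z)$ can be seen on truncations of a thick-tailed distribution such as $p_k \propto 1/(k(\log k)^2)$ (an illustrative choice): the plain entropies keep growing with the truncation point, while the order-2 entropies stay essentially constant, consistent with Lemma 2.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def cdotc(p, n):
    w = np.asarray(p, dtype=float) ** n
    return w / w.sum()

for K in (10**3, 10**5, 10**7):
    k = np.arange(2, K + 1, dtype=float)
    p = 1.0 / (k * np.log(k) ** 2)
    p /= p.sum()
    print(K, "H:", round(entropy(p), 3), "H_2:", round(entropy(cdotc(p, 2)), 3))
```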
On the joint alphabet $\mathcal{X} \times \mathcal{Y}$ with distribution $p_{X,Y} = \{p_{i,j}\}$, consider the associated CDOTC for an $n$, $n \ge 1$, and all pairs $(i, j)$ such that $i \ge 1$ and $j \ge 1$,
$$p_{i,j}(n) = \frac{p_{i,j}^n}{\sum_{i \ge 1} \sum_{j \ge 1} p_{i,j}^n}. \qquad (3)$$
Let $p_{X,Y}(n) = \{p_{i,j}(n); i \ge 1, j \ge 1\}$. It is to be noted that $p_{X,Y}(n) \in \mathcal{P}_{X,Y}$. The two marginal distributions of (3) are $\{p_{i,\cdot}(n); i \ge 1\}$ and $\{p_{\cdot,j}(n); j \ge 1\}$, respectively, where
$$p_{i,\cdot}(n) = \sum_{j \ge 1} p_{i,j}(n) = \frac{\sum_{j \ge 1} p_{i,j}^n}{\sum_{i \ge 1} \sum_{j \ge 1} p_{i,j}^n}, \qquad (4)$$
$$p_{\cdot,j}(n) = \sum_{i \ge 1} p_{i,j}(n) = \frac{\sum_{i \ge 1} p_{i,j}^n}{\sum_{i \ge 1} \sum_{j \ge 1} p_{i,j}^n}. \qquad (5)$$
Lemma 3.
$p_{i,j} = p_{i,\cdot}\, p_{\cdot,j}$ for all pairs $(i, j)$ if and only if $p_{i,j}(n) = p_{i,\cdot}(n)\, p_{\cdot,j}(n)$ for all pairs $(i, j)$.
Proof.
For each positive integer $n$, if $p_{i,j} = p_{i,\cdot}\, p_{\cdot,j}$ for all pairs $(i, j)$, $i \ge 1$ and $j \ge 1$, then
$$p_{i,j}(n) = \frac{(p_{i,\cdot}\, p_{\cdot,j})^n}{\sum_{i \ge 1} \sum_{j \ge 1} (p_{i,\cdot}\, p_{\cdot,j})^n} = \frac{p_{i,\cdot}^n}{\sum_{i \ge 1} p_{i,\cdot}^n} \cdot \frac{p_{\cdot,j}^n}{\sum_{j \ge 1} p_{\cdot,j}^n},$$
where the two factors of the last expression above are respectively a function of $i$ alone and a function of $j$ alone, and they are in fact $p_{i,\cdot}(n)$ and $p_{\cdot,j}(n)$. Conversely, if $p_{i,j}(n) = p_{i,\cdot}(n)\, p_{\cdot,j}(n)$ for all pairs $(i, j)$, then by (2), $p_{i,j} \propto (p_{i,j}(n))^{1/n} = (p_{i,\cdot}(n))^{1/n} (p_{\cdot,j}(n))^{1/n}$, again a product of a function of $i$ alone and a function of $j$ alone.
The lemma immediately follows the factorization theorem. □
For each $n$, $n \ge 1$, let $H_n(X, Y)$, $H_n(X)$, and $H_n(Y)$ be Shannon’s entropies defined with the joint CDOTC, as in (3), and the marginal distributions as in (4) and (5), respectively. Let
$$MI_n(X, Y) = H_n(X) + H_n(Y) - H_n(X, Y). \qquad (6)$$
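Putting (3) through (6) together numerically (a sketch; the joint table is an arbitrary example): raise the joint probabilities to the $n$th power, renormalize, take the marginals of the result, and combine the three entropies.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def joint_cdotc(pxy, n):
    """Joint CDOTC (3): p_ij(n) = p_ij^n / sum_{i,j} p_ij^n."""
    w = np.asarray(pxy, dtype=float) ** n
    return w / w.sum()

def mi_n(pxy, n):
    """nth order mutual information (6): H_n(X) + H_n(Y) - H_n(X,Y), with all three
    entropies computed from the joint CDOTC and its marginals (4) and (5)."""
    q = joint_cdotc(pxy, n)
    return entropy(q.sum(axis=1)) + entropy(q.sum(axis=0)) - entropy(q)

pxy = np.array([[0.20, 0.10, 0.10],
                [0.05, 0.25, 0.30]])   # an arbitrary example joint distribution
for n in (1, 2, 3):
    print(n, round(mi_n(pxy, n), 4))   # n = 1 reproduces Shannon's MI(X,Y)
```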
Theorem 2.
For every $n \ge 2$ and any $p_{X,Y} \in \mathcal{P}_{X,Y}$,
- $0 \le MI_n(X, Y) < \infty$,
- $MI_n(X, Y) = 0$ if and only if $X$ and $Y$ are independent.
Proof.
In Part 1, $0 \le MI_n(X, Y) \le H_n(X, Y) < \infty$, since $MI_n(X, Y)$ is a mutual information defined with the joint distribution (3) and $H_n(X, Y) < \infty$ by Lemma 2 applied to the countable joint alphabet. Part 2 follows Lemma 3 and the fact that $MI_n(X, Y)$ is a mutual information. □
Let
$$\kappa_n(X, Y) = \frac{MI_n(X, Y)}{H_n(X, Y)} \qquad (7)$$
be referred to as the $n$th order standardized mutual information, and write $\kappa_1(X, Y) = \kappa(X, Y)$. Let $(X_n, Y_n)$ be a pair of random elements on $\mathcal{X} \times \mathcal{Y}$ distributed according to the induced joint distribution (3) with index value $n$.
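A sketch of (7) on the same three test cases used for Theorem 1 (arbitrary example tables): for every $n \ge 2$, $\kappa_n$ evaluates to 0 for the independent pair, to 1 for the one-to-one corresponded pair, and to a value strictly between 0 and 1 for the intermediate case, in line with Corollary 1 below.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def kappa_n(pxy, n):
    """nth order standardized mutual information (7): MI_n(X,Y) / H_n(X,Y)."""
    q = np.asarray(pxy, dtype=float) ** n
    q = q / q.sum()
    h_joint = entropy(q)
    return (entropy(q.sum(axis=1)) + entropy(q.sum(axis=0)) - h_joint) / h_joint

cases = {
    "independent": np.outer([0.3, 0.7], [0.4, 0.6]),
    "one-to-one":  np.diag([0.2, 0.3, 0.5]),
    "in between":  np.array([[0.35, 0.05],
                             [0.10, 0.50]]),
}
for name, pxy in cases.items():
    print(name, [round(kappa_n(pxy, n), 4) for n in (2, 3, 4)])
```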
Lemma 4.
$X$ and $Y$ have a one-to-one correspondence if and only if $X_n$ and $Y_n$ have one.
Proof.
If $X$ and $Y$ have a one-to-one correspondence, then for each $i$ with $p_{i,\cdot} > 0$, there is a unique $j$ such that $p_{i,j} > 0$ and, for all other $j$, $p_{i,j} = 0$. By (3), $p_{i,j}(n) > 0$ for that same $j$ and, for all other $j$, $p_{i,j}(n) = 0$; the symmetric statement holds with the roles of $i$ and $j$ exchanged. That is, $X_n$ and $Y_n$ have a one-to-one correspondence.
Conversely, if $X_n$ and $Y_n$ have a one-to-one correspondence, then for each $i$ with $p_{i,\cdot}(n) > 0$, there is a unique $j$ such that $p_{i,j}(n) > 0$ and, for all other $j$, $p_{i,j}(n) = 0$. On the other hand, by (2),
$$p_{i,j} = \frac{(p_{i,j}(n))^{1/n}}{\sum_{i \ge 1} \sum_{j \ge 1} (p_{i,j}(n))^{1/n}},$$
it follows that $p_{i,j} > 0$ for that same $j$ and, for all other $j$, $p_{i,j} = 0$; again the symmetric statement holds with $i$ and $j$ exchanged. That is, $X$ and $Y$ have a one-to-one correspondence. □
Corollary 1.
For every $n \ge 2$ and any $p_{X,Y} \in \mathcal{P}_{X,Y}$,
- $0 \le \kappa_n(X, Y) \le 1$,
- $\kappa_n(X, Y) = 0$ if and only if $X$ and $Y$ are independent, and
- $\kappa_n(X, Y) = 1$ if and only if $X$ and $Y$ are one-to-one corresponded.
Proof.
By Lemma 3, $X$ and $Y$ are independent if and only if $X_n$ and $Y_n$ are. By Lemma 4, $X$ and $Y$ are one-to-one corresponded if and only if $X_n$ and $Y_n$ are. Since $H_n(X, Y) < \infty$ by Lemma 2, the statement of Corollary 1 follows directly from Theorem 1 applied to $(X_n, Y_n)$. □
Theorem 2 and Corollary 1 together fill the void in $\mathcal{P}_{X,Y}$ left behind by $MI(X, Y)$ and $\kappa(X, Y)$.
4. Concluding Remarks
The main results of this article may be summarized as follows. A family of generalized mutual information indexed by a positive integer $n$ is proposed. The member corresponding to $n = 1$ is Shannon’s mutual information for a given joint distribution $p_{X,Y}$ on a joint countable alphabet $\mathcal{X} \times \mathcal{Y}$. The other members of the family correspond to other integer values of $n$. They are also Shannon’s mutual information, defined not with $p_{X,Y}$ but with induced distributions based on the given $p_{X,Y}$. These induced distributions are called conditional distributions of total collision (CDOTC), which collectively form a special subset of the more general family of escort distributions often studied in nonextensive thermodynamics. The main motivation of the generalized mutual information is to resolve the issue that the standard mutual information is not finitely defined for all distributions on a countable joint alphabet $\mathcal{X} \times \mathcal{Y}$, which leaves the utility of mutual information realized only on a fraction of $\mathcal{P}_{X,Y}$.
On a more specific and finer level, the following facts are established.
- There is a one-to-one correspondence between each CDOTC $p(n)$ and the given distribution $p$ on a countable alphabet, and hence each CDOTC is a characteristic representation of the original distribution $p$. One of the implications of this fact is that understanding the underlying $p$ is equivalent to understanding any one of its CDOTCs. It can be shown that a CDOTC of order greater than 1 is much easier to estimate than $p$ with sparse data.
- Each generalized mutual information of order greater than 1 is guaranteed to be finite. This result essentially guarantees the validity of statistically testing the null hypothesis of independence of two discrete random elements, as it guarantees the existence of the (generalized) mutual information anywhere in the alternative space of dependent joint distributions.
- It is shown that a particular form of standardized mutual information, $\kappa_n(X, Y)$, defined with any CDOTC of order greater than 1, preserves the zero-to-one scale with independence on one end and total dependence on the other, a scale that is enjoyed by Shannon’s standardized mutual information only when the underlying entropies are finite.
In short, the family of conditional distributions of total collision embeds the underlying probability distribution as a special member, and the family of generalized mutual information embeds Shannon’s mutual information as a special member. Consequently, the stochastic association on joint alphabets can be measured by not only one index but by a host of indices, which collectively offer a much extended space to study stochastic dependence in information theory.
Funding
This research received no external funding.
Conflicts of Interest
The author declares no conflict of interest.
References
- Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423, 623–656. [Google Scholar] [CrossRef]
- Zhang, Z. Statistical Implications of Turing’s Formula; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2017. [Google Scholar]
- Csiszár, I. Axiomatic characterizations of information measures. Entropy 2008, 10, 261–273. [Google Scholar] [CrossRef]
- Amigó, J.M.; Balogh, S.G.; Hernández, S. A brief review of generalized entropies. Entropy 2018, 20, 813. [Google Scholar] [CrossRef]
- Khinchin, A.I. Mathematical Foundations of Information Theory; Dover: New York, NY, USA, 1957. [Google Scholar]
- Chakrabarti, C.G.; Chakrabarty, I. Shannon entropy: Axiomatic characterization and application. Int. J. Math. Math. Sci. 2005, 17, 2847–2854. [Google Scholar] [CrossRef]
- Rényi, A. On measures of information and entropy. In Volume 1: Contributions to the Theory of Statistics, Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 20 June–30 July 1960; University of California Press: Berkeley, CA, USA, 1961; pp. 547–561. [Google Scholar]
- Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487. [Google Scholar] [CrossRef]
- Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
- Kvalseth, T.O. Entropy and correlation: Some comments. IEEE Trans. Syst. Man Cybern. 1987, 17, 517–519. [Google Scholar] [CrossRef]
- Strehl, A.; Ghosh, J. Cluster ensembles—A knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 2002, 3, 583–617. [Google Scholar]
- Yao, Y.Y. Information-theoretical measures for knowledge discovery and data mining. In Entropy Measures, Maximum Entropy Principle and Emerging Applications, 1st ed.; Karmeshu, Ed.; Springer: Berlin, Germany, 2003; pp. 115–136. [Google Scholar]
- Vinh, N.X.; Epps, J.; Bailey, J. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. J. Mach. Learn. Res. 2010, 11, 2837–2854. [Google Scholar]
- Beck, C.; Schlögl, F. Thermodynamics of Chaotic Systems: An Introduction; Cambridge University Press: Cambridge, UK, 1993. [Google Scholar]
- Matsuzoe, H. A sequence of escort distributions and generalizations of expectations on q-exponential family. Entropy 2017, 19, 7. [Google Scholar] [CrossRef]