1. Introduction
Minimum mean squared error estimation, prediction, and smoothing [1], whether implemented as point estimation, batch least squares, recursive least squares [2], Kalman filtering [3], numerically stable square root filters, recursive least squares lattice structures [4], or stochastic gradient algorithms [3], are staples of signal processing applications. However, even though stochastic gradient algorithms are the workhorses in the inner workings of machine learning, it has been argued that mean squared error does not capture the performance of a learning agent [5]. We begin to address this assertion here and show the close relationship between mean squared estimation error and information theoretic quantities such as differential entropy and mutual information. We consider the problem of estimating a random scalar signal $X_k$ (the extension to vectors will be obvious to the reader) from the perhaps noisy measurements $Y_j$, based on a minimum mean squared error cost function; the problem is filtering when the measurements extend up to the current time $k$, smoothing when measurements beyond time $k$ are also available, and prediction when $X_k$ must be estimated from measurements up to an earlier time only.
Information theoretic quantities such as entropy, entropy rate, information gain, and relative entropy are often used to understand the performance of intelligent agents in learning applications [6,7]. In such applications, these information theoretic quantities are used to determine what information can be learned from sequences with different properties. Information theory has also been used to examine what is happening within neural networks through the Information Bottleneck Theory of Deep Learning [5,8,9]. One of the challenges in all of these research efforts is the necessity to obtain estimates of mutual information, which often requires great skill and effort [8,9,10,11].
A newer quantity called mutual information gain or loss has recently been introduced and shown to provide new insights into the process of agent learning [12]. We build on expressions for mutual information gain that involve ratios of mean squared errors and establish that minimum mean squared error (MMSE) estimation, prediction, and smoothing are directly connected to mutual information gain or loss for sequences modeled by many probability distributions of interest. Thus, mean squared error, which is often relatively easy to calculate, can be employed to obtain changes in mutual information as we progress through a system modeled as a Markov chain. The key quantity in establishing these relationships is the log ratio of entropy powers.
We begin in Section 2 by establishing the fundamental information quantities of interest and setting the notation. In Section 3, we review information theoretic quantities that have been defined and used in some agent learning analyses in the literature. Some prior work with similar results, but based on the minimax entropy of the estimation error, is discussed in Section 4. The following section, Section 5, introduces the key tool in our development, the log ratio of entropy powers, and derives its expression in terms of mutual information gain. In Section 6, the log ratio of entropy powers is used to characterize the performance of MMSE smoothing, prediction, and filtering in terms of ratios of entropy powers and mutual information gain. For many probability distributions of interest, we are able to substitute the MMSE into the entropy power expressions, as shown in Section 7. A simple fixed lag smoothing example that illustrates the power of the approach is presented in Section 8. Section 9 presents some properties and families of distributions that commonly occur in applications and that have desirable characterizations and implications, such as sufficient statistics; lists are given of distributions that satisfy the log ratio of entropy powers property, possess these properties, and fall into the classes of interest. Final discussions of the results and future research directions are presented in Section 10.
2. Differential Entropy, Mutual Information, and Entropy Rate: Definitions and Notation
Given a continuous random variable $X$ with probability density function $p(x)$, the differential entropy is defined as
$h(X) = -\int_{-\infty}^{\infty} p(x) \log p(x)\, dx,$
where we assume $X$ has the variance $\mathrm{var}(X) = \sigma_X^2$. The differential entropy of a Gaussian sequence with mean zero and variance $\sigma^2$ is given by [13]
$h(X) = \frac{1}{2} \log 2\pi e \sigma^2 .$
An important quantity for investigating structure and randomness is the differential entropy rate [13]
$\bar{h}(X) = \lim_{N \to \infty} \frac{1}{N}\, h(X_1, X_2, \ldots, X_N),$
which is the long-term average differential entropy in bits/symbol for the sequence being studied. The differential entropy rate is a simple indicator of randomness that has been used in agent learning papers [6,7].
An alternative definition of the differential entropy rate is [13]
$\bar{h}(X) = \lim_{N \to \infty} h(X_N \mid X_{N-1}, X_{N-2}, \ldots, X_1),$
which, for the Gaussian process, yields
$\bar{h}(X) = \frac{1}{2} \log 2\pi e \sigma_\infty^2,$
where $\sigma_\infty^2$ is the minimum mean squared error of the best estimate given the infinite past, expressible as
$\sigma_\infty^2 = \frac{1}{2\pi e}\, 2^{2\bar{h}(X)} \le \sigma_X^2$ (6)
with $\sigma_X^2$ and $\bar{h}(X)$ being the variance and differential entropy rate of the original sequence, respectively [13]. In addition to defining the entropy power (6), this equation shows that the entropy power is the minimum variance that can be associated with the not-necessarily-Gaussian differential entropy $\bar{h}(X)$.
In his landmark 1948 paper [14], Shannon defined the entropy power (also called entropy rate power) to be the power in a Gaussian white noise limited to the same band as the original ensemble and having the same entropy. He then used the entropy power in bounding the capacity of certain channels and for specifying a lower bound on the rate distortion function of a source. Shannon gave the quantity $\frac{1}{2\pi e} 2^{2h(X)}$ the notation $Q$, which operationally is the power in a Gaussian process with the same differential entropy as the original random variable $X$ [14]. Note that the original random variable or process does not need to be Gaussian. Whatever the form of $p(x)$ for the original process, the entropy power can be defined as in Equation (6). In the following, we use $h(X)$ for both differential entropy and differential entropy rate unless a clear distinction is needed to reduce confusion.
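To make the entropy power in Equation (6) concrete, the short Python sketch below (an illustration, not part of the original development) uses the closed-form differential entropies provided by scipy.stats to compute the entropy power and compare it with the variance for a Gaussian and a Laplacian of equal variance; scipy returns differential entropy in nats, so the exponential form of the entropy power is used.

```python
import numpy as np
from scipy import stats

def entropy_power(h_nats: float) -> float:
    """Entropy power Q = exp(2h)/(2*pi*e) for a differential entropy h given in nats."""
    return np.exp(2.0 * h_nats) / (2.0 * np.pi * np.e)

sigma2 = 2.0                                              # common variance for both examples
gaussian = stats.norm(scale=np.sqrt(sigma2))
laplacian = stats.laplace(scale=np.sqrt(sigma2 / 2.0))    # Laplacian variance = 2*scale**2

for name, rv in [("Gaussian", gaussian), ("Laplacian", laplacian)]:
    Q = entropy_power(rv.entropy())                       # scipy returns entropy in nats
    print(f"{name}: variance = {rv.var():.4f}, entropy power Q = {Q:.4f}")

# The Gaussian attains Q equal to its variance; the Laplacian gives Q < variance,
# consistent with the entropy power being the minimum variance for a given h(X).
```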
The differential entropy is defined for continuous amplitude random variables and processes, and it is the appropriate quantity for studying signals such as speech, audio, and biological signals. However, unlike discrete entropy, differential entropy can be negative or infinite, and it is changed by scaling and similar transformations. This is why mutual information is often the better choice for investigating learning applications.
In particular, for continuous random variables $X$ and $Y$ with probability density functions $p(x)$ and $p(y)$ and joint density $p(x,y)$, the mutual information between $X$ and $Y$ is [13,15]
$I(X;Y) = \int \int p(x,y) \log \frac{p(x,y)}{p(x)\, p(y)}\, dx\, dy .$
Mutual information is always greater than or equal to zero and is not impacted by scaling or similar transformations. Mutual information is the principal information theoretic indicator employed in this work.
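For jointly Gaussian random variables, the mutual information has the closed form $I(X;Y) = -\frac{1}{2}\log(1-\rho^2)$, which makes the scale invariance easy to verify numerically. The sketch below (an illustration with assumed covariance values, not an excerpt from the paper) contrasts the behavior of mutual information and differential entropy under scaling.

```python
import numpy as np

def gaussian_mi_bits(cov):
    """I(X;Y) in bits for a zero-mean jointly Gaussian pair with 2x2 covariance 'cov'."""
    rho_sq = cov[0, 1] ** 2 / (cov[0, 0] * cov[1, 1])
    return -0.5 * np.log2(1.0 - rho_sq)

def gaussian_entropy_bits(variance):
    """Differential entropy in bits of a scalar Gaussian with the given variance."""
    return 0.5 * np.log2(2.0 * np.pi * np.e * variance)

cov = np.array([[1.0, 0.8],
                [0.8, 2.0]])                    # assumed covariance of (X, Y)
c = 10.0                                        # rescale Y -> c*Y
cov_scaled = np.array([[1.0, c * 0.8],
                       [c * 0.8, c * c * 2.0]])

print(gaussian_mi_bits(cov), gaussian_mi_bits(cov_scaled))     # identical: MI is scale invariant
print(gaussian_entropy_bits(cov[1, 1]),
      gaussian_entropy_bits(cov_scaled[1, 1]))                 # differ by log2(c) bits
```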
3. Agent Learning and Mutual Information Gain
In agent learning, based on some observations of the environment, we develop an understanding of the structure of the environment, formulate models of this structure, and study any remaining apparent randomness or unpredictability [6,7]. Studies of agent learning have made use of the information theoretic ideas in Section 2 and have created variations on those ideas to capture particular characteristics that are distinct to agent learning problems. These expressions and related results are discussed in detail in Gibson [12]. The development of mutual information within the Information Bottleneck Theory of Deep Learning is covered in Tishby [5] and Shwartz-Ziv and LeCun [16].
The agent learning literature explores the broad ideas of unpredictability and apparent randomness [6,7]. Toward this end, it is common to investigate the total Shannon entropy of length-$N$ sequences $X^N = (X_1, X_2, \ldots, X_N)$, given by
$H(X^N) = H(X_1, X_2, \ldots, X_N),$
as a function of $N$ to characterize learning. The name total Shannon entropy is appropriate, since it is not the usual per-component entropy of interest in lossless source coding [13], for example.
In association with the idea of learning or discerning structure in an environment, the entropy gain, as defined in the literature, is the difference between the entropies of length-$N$ and length-$(N-1)$ sequences as [7]
$H(X^N) - H(X^{N-1}).$ (9)
Equation (9) was derived and studied much earlier by Shannon [14], not as an entropy gain, but as a conditional entropy. In particular, Shannon [14] defined the conditional entropy of the next symbol when the $N-1$ preceding symbols are known as
$H(X_N \mid X_{N-1}, X_{N-2}, \ldots, X_1) = H(X^N) - H(X^{N-1}),$ (10)
which is exactly Equation (9); so, the entropy gain from the agent learning literature is simply the conditional entropy expression developed by Shannon in 1948.
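The identity in Equations (9) and (10) holds for differential entropies as well, and it can be checked directly for a process with a known joint density. The sketch below (an illustration, not from the paper) uses a stationary Gaussian AR(1) process, for which both the entropy gain and Shannon's conditional entropy are available in closed form.

```python
import numpy as np

def ar1_covariance(n, a, q):
    """Covariance of n consecutive samples of a stationary Gaussian AR(1) process
    X(k+1) = a*X(k) + W(k) with Var(W) = q."""
    idx = np.arange(n)
    return (q / (1.0 - a * a)) * a ** np.abs(idx[:, None] - idx[None, :])

def gaussian_joint_entropy_bits(cov):
    """Differential entropy in bits of a zero-mean Gaussian vector with covariance 'cov'."""
    n = cov.shape[0]
    return 0.5 * np.log2(((2.0 * np.pi * np.e) ** n) * np.linalg.det(cov))

a, q, N = 0.9, 1.0, 8
entropy_gain = (gaussian_joint_entropy_bits(ar1_covariance(N, a, q))
                - gaussian_joint_entropy_bits(ar1_covariance(N - 1, a, q)))

# For an AR(1) process, conditioning on the preceding samples leaves a residual
# variance of q, so the conditional entropy is available in closed form.
conditional_entropy = 0.5 * np.log2(2.0 * np.pi * np.e * q)
print(entropy_gain, conditional_entropy)      # the two values agree
```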
A recently introduced quantity, mutual information gain, allows for a more detailed parsing of what is happening in the learning process than observing changes in entropy [12]. Even though a relative entropy between two probability densities has been called the information gain in the agent learning literature [6,7], it is evident from Equations (9) and (10) that it is just a conditional entropy [12]. Thus, the nomenclature which defined information gain in terms of this conditional entropy is misleading.
In terms of information gain, the quantity of interest is the mutual information between the overall sequence and the growing history of the past, given by
$I(X^N; X_{N-1}, X_{N-2}, \ldots, X_1) = H(X^N) - H(X_N \mid X_{N-1}, \ldots, X_1),$ (11)
where the conditional entropy $H(X_N \mid X_{N-1}, \ldots, X_1)$ is defined in Equation (9). The mutual information in Equation (11) is a much more direct measure of information gained than the entropy gain as a function of $N$, and it includes the entropy gain from agent learning as a natural component. We can obtain more insight by expanding Equation (11) using the chain rule for mutual information [13] as
$I(X^N; X_{N-1}, \ldots, X_1) = \sum_{n=1}^{N-1} I(X^N; X_n \mid X_{n-1}, \ldots, X_1).$ (12)
Since $I(X^N; X_n \mid X_{n-1}, \ldots, X_1) \ge 0$, we see that $I(X^N; X_{N-1}, \ldots, X_1)$ is nondecreasing in $N$. However, what do these individual terms in Equation (12) mean? The sequence $X^N$ should be considered the input sequence to be analyzed, with the block length $N$ large but finite. The first term in the sum, $I(X^N; X_1)$, is the mutual information between the input sequence $X^N$ and the first sample $X_1$. The next term, $I(X^N; X_2 \mid X_1)$, is the mutual information between the input sequence $X^N$ and the predicted value of $X_2$, given the prior value $X_1$. Therefore, we can characterize the change in mutual information with increasing knowledge of the past history of the sequence as a sum of conditional mutual informations $I(X^N; X_n \mid X_{n-1}, \ldots, X_1)$ [12].
We denote the mutual information $I(X^N; X_{N-1}, \ldots, X_1)$ in Equation (11) as the total mutual information gain and the conditional mutual informations $I(X^N; X_n \mid X_{n-1}, \ldots, X_1)$ in Equation (12) as the incremental mutual information gains. We utilize these terms in the following developments.
5. Log Ratio of Entropy Powers
We can use the definition of the entropy power in Equation (6) to express the logarithm of the ratio of two entropy powers in terms of their respective differential entropies as [20]
$\frac{1}{2} \log \frac{Q_{X_1}}{Q_{X_2}} = h(X_1) - h(X_2),$ (17)
where $Q_{X_1}$ represents the entropy power associated with the differential entropy $h(X_1)$, and similarly for $Q_{X_2}$. The conditional version of Equation (6) is [13]
$Q_{X \mid Y} = \frac{1}{2\pi e}\, 2^{2 h(X \mid Y)},$ (18)
from which we can express Equation (17) in terms of the entropy powers at the outputs of successive stages in a signal processing Markov chain $X \to Y \to Z$ that satisfies the data processing inequality as
$\frac{1}{2} \log \frac{Q_{X \mid Z}}{Q_{X \mid Y}} = h(X \mid Z) - h(X \mid Y).$ (19)
It is important to notice that many signal processing systems satisfy the Markov chain property, and thus the data processing inequality, so Equation (19) is potentially very useful and insightful.
We can expand our insights if we add and subtract $h(X)$ on the right-hand side of Equation (19); we then obtain an expression in terms of the difference in mutual information between the two successive stages as
$\frac{1}{2} \log \frac{Q_{X \mid Z}}{Q_{X \mid Y}} = \left[ h(X \mid Z) - h(X) \right] - \left[ h(X \mid Y) - h(X) \right] = I(X;Y) - I(X;Z).$ (20)
From the entropy power in Equation (18) and the data processing inequality, we know that both expressions in Equations (19) and (20) are greater than or equal to zero. Thus, from this result, we see that we can now associate a change in mutual information as data passes through a Markov chain with the log ratio of entropy powers.
These results are from [20] and extend the data processing inequality by providing a new characterization of the mutual information gain or loss between stages in terms of the entropy powers of the two stages. Since differential entropies are difficult to calculate, it is useful to have expressions for the entropy power at two stages and then use Equations (19) and (20) to find the difference in differential entropy and mutual information between these stages.
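A minimal numerical check of Equations (19) and (20), under the assumption of a purely Gaussian two-stage Markov chain $X \to Y \to Z$ formed by adding independent Gaussian noise at each stage (the variances are chosen arbitrarily for illustration):

```python
import numpy as np

# Two-stage Gaussian Markov chain X -> Y -> Z with Y = X + N1 and Z = Y + N2,
# where N1 and N2 are independent zero-mean Gaussian noises. For jointly
# Gaussian variables the conditional entropy powers equal the MMSE variances.
sx2, n1, n2 = 4.0, 1.0, 2.0                        # assumed variances

var_x_given_y = sx2 * n1 / (sx2 + n1)              # MMSE of estimating X from Y
var_x_given_z = sx2 * (n1 + n2) / (sx2 + n1 + n2)  # MMSE of estimating X from Z

log_ratio = 0.5 * np.log2(var_x_given_z / var_x_given_y)   # left side of Equation (20)

i_xy = 0.5 * np.log2(1.0 + sx2 / n1)               # I(X;Y) in bits
i_xz = 0.5 * np.log2(1.0 + sx2 / (n1 + n2))        # I(X;Z) in bits

print(log_ratio, i_xy - i_xz)   # equal and nonnegative, as the data processing inequality requires
```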
To get some idea of how useful Equation (20) can be, we turn to a few special cases. In many signal processing operations, a Gaussian assumption is accurate and can provide deep insights. Thus, considering two i.i.d. Gaussian distributions with zero mean and variances $\sigma_1^2$ and $\sigma_2^2$, we see directly that $Q_1 = \sigma_1^2$ and $Q_2 = \sigma_2^2$, so
$\frac{1}{2} \log \frac{Q_1}{Q_2} = \frac{1}{2} \log \frac{\sigma_1^2}{\sigma_2^2} = h(X_1) - h(X_2),$
which satisfies Equation (17) exactly.
We can also consider the MMSE error variances in a Markov chain when $X$, $Y$, and $Z$ are jointly Gaussian, with the error variances at successive stages denoted as $\sigma_{X \mid Y}^2$ and $\sigma_{X \mid Z}^2$; then
$\frac{1}{2} \log \frac{\sigma_{X \mid Z}^2}{\sigma_{X \mid Y}^2} = I(X;Y) - I(X;Z).$
Perhaps surprisingly, this result holds for two i.i.d. Laplacian distributions with variances $\sigma_1^2$ and $\sigma_2^2$ [21], since their corresponding entropy powers are $Q_1 = \frac{e}{\pi} \sigma_1^2$ and $Q_2 = \frac{e}{\pi} \sigma_2^2$, respectively, so we form
$\frac{1}{2} \log \frac{Q_1}{Q_2} = \frac{1}{2} \log \frac{\sigma_1^2}{\sigma_2^2}.$
Since $h(X_1) - h(X_2) = \frac{1}{2} \log \frac{\sigma_1^2}{\sigma_2^2}$ for the Laplacian distribution, the Laplacian distribution also satisfies Equations (17) through (20) exactly [20].
Using mean squared errors or variances in Equations (17) through (20) is accurate for many other distributions as well. It is straightforward to show that Equation (17) holds with equality when the differential entropy takes the form
$h(X) = \frac{1}{2} \log \left( c\, \sigma_X^2 \right),$ (24)
with $c$ a constant that does not depend on $\sigma_X^2$, so the entropy powers can be replaced by the mean squared error for the Gaussian, Laplacian, logistic, Cauchy, uniform, symmetric triangular, exponential, and Rayleigh distributions. Equation (24) is of the same form as the differential entropies of the distributions considered in [17,18] for the minimax error entropy estimate. Note here that we can work directly with MMSE estimates.
Therefore, Equations (17) through (20) are satisfied with equality when the variance is substituted for the entropy power for several distributions of significant interest in applications, and it is the log ratio of entropy powers that enables the use of the mean squared error to calculate the loss or gain in mutual information at each stage.
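The substitution of variances for entropy powers can be verified numerically for a subset of the scale families listed above using the closed-form entropies in scipy.stats; the sketch below checks that the entropy difference between two members of each family equals one-half the log ratio of their variances (the Cauchy case is omitted because its variance is not finite, and the symmetric triangular case requires an extra shape parameter).

```python
import numpy as np
from scipy import stats

def entropy_difference_matches_variance_ratio(family, scale_1, scale_2):
    """Check h(X1) - h(X2) = 0.5*ln(var1/var2) for two members of a scale family."""
    d1, d2 = family(scale=scale_1), family(scale=scale_2)
    lhs = d1.entropy() - d2.entropy()            # differential entropies in nats
    rhs = 0.5 * np.log(d1.var() / d2.var())
    return bool(np.isclose(lhs, rhs))

families = [("Gaussian", stats.norm), ("Laplacian", stats.laplace),
            ("logistic", stats.logistic), ("uniform", stats.uniform),
            ("exponential", stats.expon), ("Rayleigh", stats.rayleigh)]

for name, family in families:
    print(name, entropy_difference_matches_variance_ratio(family, 1.0, 3.0))   # True for each
```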
7. Entropy Power and MSE
We know from Section 2 that the entropy power is the minimum variance that can be associated with a differential entropy $h(X)$. The key insight into relating mean squared error and mutual information comes from considering the (apparently not so special) cases of random variables whose differential entropy has the form in Equation (24) and the log ratio of entropy powers. In these cases, we do not have to explicitly calculate the entropy power, since we can use the variance or mean squared error in the log ratio of entropy power expressions to find the mutual information gain or loss for these distributions. Thus, all of the results in Section 6, in terms of the log ratio of entropy powers, can be expressed as ratios of variances or mean squared errors for continuous random variables with differential entropies of the form in Equation (24). In the following, we use the more notationally bulky $\mathrm{Var}(X_k \mid Y^{k+L})$, with $Y^{k+L} = \{ Y_j,\ j \le k+L \}$, for the conditional variances rather than the simpler notation $\sigma^2$, since the $\sigma$ symbol could be confused as indicating a Gaussian assumption, which is not needed.
In particular, for the smoothing problem, we can rewrite Equation (33) as
$\frac{1}{2} \log \frac{\mathrm{Var}(X_k \mid Y^{k})}{\mathrm{Var}(X_k \mid Y^{k+L})} = I(X_k; Y^{k+L}) - I(X_k; Y^{k}),$ (43)
and the decrease in MSE in terms of the change in mutual information as
$\mathrm{Var}(X_k \mid Y^{k+L}) = \mathrm{Var}(X_k \mid Y^{k})\, 2^{-2 \left[ I(X_k; Y^{k+L}) - I(X_k; Y^{k}) \right]},$ (44)
both for a smoothing lag of $L$. Here, we see that the rate of decrease in the MMSE is exponentially related to the mutual information gain due to smoothing.
Rewriting the results for prediction in terms of variances, we see that Equation (39) becomes
$\frac{1}{2} \log \frac{\mathrm{Var}(X_{k+L} \mid Y^{k})}{\mathrm{Var}(X_k \mid Y^{k})} = I(X_k; Y^{k}) - I(X_{k+L}; Y^{k}),$
and that the growth in MMSE with increasing lookahead is
$\mathrm{Var}(X_{k+L} \mid Y^{k}) = \mathrm{Var}(X_k \mid Y^{k})\, 2^{\,2 \left[ I(X_k; Y^{k}) - I(X_{k+L}; Y^{k}) \right]}.$
Thus, as the lookahead $L$ in prediction is increased, the conditional error variance grows exponentially.
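As a concrete illustration of this exponential behavior, consider the simplest case of predicting a stationary Gaussian AR(1) process from its own (noiseless) past rather than from noisy measurements; the parameter values below are assumed for illustration only.

```python
import numpy as np

a, q = 0.9, 1.0                                  # assumed AR(1) coefficient and innovation variance
sigma2_x = q / (1.0 - a * a)                     # stationary variance of the process

for L in range(1, 6):
    pred_var = sigma2_x * (1.0 - a ** (2 * L))   # MMSE of the L-step predictor a**L * X(k)
    mi_bits = 0.5 * np.log2(sigma2_x / pred_var) # mutual information with the past, in bits
    print(L, round(pred_var, 4), round(mi_bits, 4))

# As the lookahead L grows, the prediction error variance rises toward sigma2_x and
# the mutual information between X(k+L) and the observed past decays toward zero.
```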
For the filtering problem, we have the two differential entropies, $h(X_k)$ and $h(X_{k+1})$, in Equation (42), in addition to the mutual information expressions. However, for wide sense stationary random processes with differential entropies of the form shown in Equation (24), the two variances are equal, so $h(X_k) = h(X_{k+1})$ and the difference in the two differential entropies is zero. This simplifies Equation (42) to
$\frac{1}{2} \log \frac{\mathrm{Var}(X_{k+1} \mid Y^{k+1})}{\mathrm{Var}(X_k \mid Y^{k})} = I(X_k; Y^{k}) - I(X_{k+1}; Y^{k+1}),$
which, if the error variance is monotonically nonincreasing, is less than or equal to zero, as shown. Rewriting this last result in terms of increasing mutual information, we have
$\mathrm{Var}(X_{k+1} \mid Y^{k+1}) = \mathrm{Var}(X_k \mid Y^{k})\, 2^{-2 \left[ I(X_{k+1}; Y^{k+1}) - I(X_k; Y^{k}) \right]}.$
Thus, we have related the mean squared error from estimators to the change in the gain or loss of mutual information.
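A small utility function captures this relationship; it is a sketch of the computation implied by the expressions above, valid for distributions whose differential entropy has the form of Equation (24).

```python
import numpy as np

def mutual_information_change_bits(var_before, var_after):
    """Change in mutual information, in bits, implied by a change in MMSE.

    Applies to distributions whose differential entropy has the form of Equation (24),
    so that the entropy power can be replaced by the variance. A positive value
    means information has been gained (the MMSE has decreased)."""
    return 0.5 * np.log2(var_before / var_after)

# Example: an estimator update that reduces the error variance from 1.0 to 0.6
print(mutual_information_change_bits(1.0, 0.6))   # about 0.368 bits gained
```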
It is important to recognize the power of the expressions in this section. They allow us to obtain the mutual information gain or loss by using the variances of MMSE estimators, the latter of which are easily calculated in comparison to direct calculations of differential entropy or mutual information. There is no need to utilize techniques to approximately compute differential entropies or mutual information, which are fraught with difficulties. See Hudson [10] and Kraskov [11].
8. Fixed Lag Smoothing Example
To provide a concrete example of the preceding results in Section 7, we consider finding the mutual information gain using the results of a fixed lag MMSE smoothing problem for a simple first-order system model with noisy observations. Fixed lag smoothing is a popular approach, since measurements $Y_j$ at time instants $j \le k+L$ are used to estimate the value of $X_k$; that is, measurements up to $L$ samples ahead of the present time $k$ are used to estimate $X_k$ [22].
A first-order autoregressive (AR) system model is given by [23]
$X_{k+1} = a X_k + W_k,$ (49)
where $|a| < 1$ and $W_k$ is a stationary, Gaussian, zero mean, white process with variance $q$. The observation model is expressed as
$Y_k = X_k + V_k,$ (50)
where $V_k$ is zero mean Gaussian noise with variance $r$.
For this problem, we compute the steady state errors in fixed-lag smoothing as a function of the smoothing lag. From Chirarattananon and Anderson [22], the fixed-lag smoother error covariance can be expressed in terms of the Kalman filter gain, the filter a priori error covariance, and the filter a posteriori error covariance. Therefore, the steady state expression for the fixed lag smoothing error covariance $\Sigma_L$ as a function of the lag $L$ is as follows (details are available in [22,24] and are not included here):
$\Sigma_L = P - K M \sum_{i=1}^{L} \left[ a (1-K) \right]^{2i},$ (51)
where the components from the Kalman filter are
$K = \frac{M}{M + r}$ (52)
and
$M = a^2 P + q,$ (53)
with $P$ the filtering or a posteriori estimation error variance, $M$ the a priori filter error variance, and $K$ the Kalman filter gain.
Note that $M$ is $\mathrm{Var}(X_k \mid Y^{k-1})$ from Section 7 and $P$ is $\mathrm{Var}(X_k \mid Y^{k})$, in order to simplify the notation for this example.
Given $a$, $q$, and $r$, $P$ can be computed in the steady state case as the positive root of the following quadratic equation:
$a^2 P^2 + \left[ q + r (1 - a^2) \right] P - q r = 0.$
Then $\Sigma_L$, $K$, and $M$ can be evaluated using Equations (51), (52) and (53), respectively.
The asymptotic expression for the smoothing error covariance as $L$ gets large is given by
$\Sigma_\infty = P - K M \frac{\left[ a(1-K) \right]^2}{1 - \left[ a(1-K) \right]^2}.$ (56)
This result can be used to determine what value should be selected for the maximum delay to obtain near-asymptotic performance.
Example: We now consider a specific case of the scalar models in Equations (49) and (50), with the values of $a$, $q$, and $r$ chosen so that the ratio of $q$ to $r$ corresponds to an accurate model for the AR process but a very noisy observation signal.
Table 1 lists the smoothing error covariance as a function of the smoothing lag $L$ using Equation (51), with the remaining columns obtained as described in the following. The result for $L \to \infty$ comes from Equation (56).
We obtained the third column in the table, labeled “Incremental MI Gain”, from Equation (43), where, for simplicity of notation, the conditional error variances are replaced by the smoothing error covariances $\Sigma_{L-1}$ and $\Sigma_L$, so that the incremental MI gain is the change in mutual information from lag $L-1$ to lag $L$. The fourth column is obtained for a specific $L$ by adding all values of the incremental MI gain up to and including that lag. For example, summing the incremental MI gains through the indicated lag yields the total MI gain of 0.24795 shown in the table.
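To show how such a table can be generated, the sketch below implements the scalar steady-state Kalman filter relations and the fixed-lag smoother covariance in the geometric-sum form of Equation (51) above, and it accumulates the incremental mutual information gains of Equation (43). The parameter values are illustrative placeholders, not the values used to produce Table 1.

```python
import numpy as np

# Illustrative parameter values (not those used for Table 1).
a, q, r = 0.95, 0.01, 1.0      # AR coefficient, process noise variance, measurement noise variance

# Steady-state a posteriori filter variance P: positive root of the scalar Riccati quadratic.
b = q + r * (1.0 - a * a)
P = (-b + np.sqrt(b * b + 4.0 * a * a * q * r)) / (2.0 * a * a)
M = a * a * P + q              # a priori filter error variance
K = M / (M + r)                # steady-state Kalman gain
beta = a * (1.0 - K)

def smoothing_variance(L):
    """Steady-state fixed-lag smoothing error variance for lag L (L = 0 is filtering)."""
    return P - K * M * sum(beta ** (2 * i) for i in range(1, L + 1))

previous, total_gain = smoothing_variance(0), 0.0
for L in range(1, 11):
    current = smoothing_variance(L)
    incremental_gain = 0.5 * np.log2(previous / current)   # incremental MI gain in bits
    total_gain += incremental_gain
    print(L, round(current, 5), round(incremental_gain, 5), round(total_gain, 5))
    previous = current

sigma_inf = P - K * M * beta ** 2 / (1.0 - beta ** 2)       # asymptotic smoothing variance
print("asymptotic smoothing variance:", round(sigma_inf, 5))
```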
The asymptotic reduction in MSE due to smoothing as $L$ gets large follows from Equation (56), and the corresponding mutual information gain is 0.2505 bits.
Therefore, we are able to obtain statements concerning the gain or loss of mutual information by calculating the much more directly available quantity, the minimum mean squared smoothing error.