1. Introduction
Minimum mean squared error estimation, prediction, and smoothing [1], whether implemented as point estimation, batch least squares, recursive least squares [2], Kalman filtering [3], numerically stable square root filters, recursive least squares lattice structures [4], or stochastic gradient algorithms [3], are staples of signal processing applications. However, even though stochastic gradient algorithms are the workhorses in the inner workings of machine learning, it has been argued that mean squared error does not capture the performance of a learning agent [5]. We begin to address this assertion here and show the close relationship between mean squared estimation error and information theoretic quantities such as differential entropy and mutual information. We consider the problem of estimating a random scalar signal $X_k$ (the extension to vectors will be obvious to the reader) from the perhaps noisy measurements $Y_j$, based on a minimum mean squared error cost function; the problem is filtering when the measurements extend up to the current time $k$, smoothing when measurements beyond time $k$ are also available, and prediction when $X_k$ must be estimated from measurements up to an earlier time only.
Information theoretic quantities such as entropy, entropy rate, information gain, and relative entropy are often used to understand the performance of intelligent agents in learning applications [6,7]. In such applications, these information theoretic quantities are used to determine what information can be learned from sequences with different properties. Information theory has also been used to examine what is happening within neural networks through the Information Bottleneck Theory of Deep Learning [5,8,9]. One of the challenges in all of these research efforts is the necessity to obtain estimates of mutual information, which often requires great skill and effort [8,9,10,11].
A newer quantity called mutual information gain or loss has recently been introduced and shown to provide new insights into the process of agent learning [12]. We build on expressions for mutual information gain that involve ratios of mean squared errors and establish that minimum mean squared error (MMSE) estimation, prediction, and smoothing are directly connected to mutual information gain or loss for sequences modeled by many probability distributions of interest. Thus, mean squared error, which is often relatively easy to calculate, can be employed to obtain changes in mutual information as we progress through a system modeled as a Markov chain. The key quantity in establishing these relationships is the log ratio of entropy powers.
We begin in Section 2 by establishing the fundamental information quantities of interest and setting the notation. In Section 3, we review information theoretic quantities that have been defined and used in some agent learning analyses in the literature. Some prior work with similar results, but based on the minimax entropy of the estimation error, is discussed in Section 4. The following section, Section 5, introduces the key tool in our development, the log ratio of entropy powers, and derives its expression in terms of mutual information gain. In Section 6, the log ratio of entropy powers is used to characterize the performance of MMSE smoothing, prediction, and filtering in terms of ratios of entropy powers and mutual information gain. For many probability distributions of interest, we are able to substitute the MMSE into the entropy power expressions, as shown in Section 7. A simple fixed lag smoothing example that illustrates the power of the approach is presented in Section 8. Section 9 presents some properties and families of distributions that commonly occur in applications and that have desirable characterizations and implications, such as sufficient statistics; lists are given of distributions that satisfy the log ratio of entropy powers property, possess these properties, and fall into the classes of interest. Final discussions of the results and future research directions are presented in Section 10.
2. Differential Entropy, Mutual Information, and Entropy Rate: Definitions and Notation
Given a continuous random variable $X$ with probability density function $p(x)$, the differential entropy is defined as
$h(X) = -\int_{-\infty}^{\infty} p(x) \log p(x)\, dx,$
where we assume $X$ has the variance $\mathrm{var}(X) = \sigma_X^2$. The differential entropy of a Gaussian sequence with mean zero and variance $\sigma^2$ is given by [13]
$h(X) = \frac{1}{2} \log 2\pi e \sigma^2 .$
An important quantity for investigating structure and randomness is the differential entropy rate [13]
$\bar{h}(X) = \lim_{N \to \infty} \frac{1}{N}\, h(X_1, X_2, \ldots, X_N),$
which is the long-term average differential entropy in bits/symbol for the sequence being studied. The differential entropy rate is a simple indicator of randomness that has been used in agent learning papers [6,7].
An alternative definition of the differential entropy rate is [13]
$\bar{h}(X) = \lim_{N \to \infty} h(X_N \mid X_{N-1}, X_{N-2}, \ldots, X_1),$
which, for the Gaussian process, yields
$\bar{h}(X) = \frac{1}{2} \log 2\pi e \sigma_\infty^2,$
where $\sigma_\infty^2$ is the minimum mean squared error of the best estimate given the infinite past, expressible as
$\sigma_\infty^2 = \frac{1}{2\pi e}\, 2^{2\bar{h}(X)} \le \sigma_X^2$ (6)
with $\sigma_X^2$ and $\bar{h}(X)$ being the variance and differential entropy rate of the original sequence, respectively [13]. In addition to defining the entropy power (6), this equation shows that the entropy power is the minimum variance that can be associated with the not-necessarily-Gaussian differential entropy $\bar{h}(X)$.
In his landmark 1948 paper [14], Shannon defined the entropy power (also called entropy rate power) to be the power in a Gaussian white noise limited to the same band as the original ensemble and having the same entropy. He then used the entropy power in bounding the capacity of certain channels and for specifying a lower bound on the rate distortion function of a source. Shannon gave the quantity $\frac{1}{2\pi e} 2^{2h(X)}$ the notation $Q$, which operationally is the power in a Gaussian process with the same differential entropy as the original random variable $X$ [14]. Note that the original random variable or process does not need to be Gaussian. Whatever the form of $p(x)$ for the original process, the entropy power can be defined as in Equation (6). In the following, we use $h(X)$ for both differential entropy and differential entropy rate unless a clear distinction is needed to reduce confusion.
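To make the entropy power in Equation (6) concrete, the short Python sketch below (an illustration, not part of the original development) uses the closed-form differential entropies provided by scipy.stats to compute the entropy power and compare it with the variance for a Gaussian and a Laplacian of equal variance; scipy returns differential entropy in nats, so the exponential form of the entropy power is used.

```python
import numpy as np
from scipy import stats

def entropy_power(h_nats: float) -> float:
    """Entropy power Q = exp(2h)/(2*pi*e) for a differential entropy h given in nats."""
    return np.exp(2.0 * h_nats) / (2.0 * np.pi * np.e)

sigma2 = 2.0                                              # common variance for both examples
gaussian = stats.norm(scale=np.sqrt(sigma2))
laplacian = stats.laplace(scale=np.sqrt(sigma2 / 2.0))    # Laplacian variance = 2*scale**2

for name, rv in [("Gaussian", gaussian), ("Laplacian", laplacian)]:
    Q = entropy_power(rv.entropy())                       # scipy returns entropy in nats
    print(f"{name}: variance = {rv.var():.4f}, entropy power Q = {Q:.4f}")

# The Gaussian attains Q equal to its variance; the Laplacian gives Q < variance,
# consistent with the entropy power being the minimum variance for a given h(X).
```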
The differential entropy is defined for continuous amplitude random variables and processes, and it is the appropriate quantity for studying signals such as speech, audio, and biological signals. However, unlike discrete entropy, differential entropy can be negative or infinite, and it is changed by scaling and similar transformations. This is why mutual information is often the better choice for investigating learning applications.
In particular, for continuous random variables $X$ and $Y$ with probability density functions $p(x)$ and $p(y)$ and joint density $p(x,y)$, the mutual information between $X$ and $Y$ is [13,15]
$I(X;Y) = \int \int p(x,y) \log \frac{p(x,y)}{p(x)\, p(y)}\, dx\, dy .$
Mutual information is always greater than or equal to zero and is not impacted by scaling or similar transformations. Mutual information is the principal information theoretic indicator employed in this work.
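For jointly Gaussian random variables, the mutual information has the closed form $I(X;Y) = -\frac{1}{2}\log(1-\rho^2)$, which makes the scale invariance easy to verify numerically. The sketch below (an illustration with assumed covariance values, not an excerpt from the paper) contrasts the behavior of mutual information and differential entropy under scaling.

```python
import numpy as np

def gaussian_mi_bits(cov):
    """I(X;Y) in bits for a zero-mean jointly Gaussian pair with 2x2 covariance 'cov'."""
    rho_sq = cov[0, 1] ** 2 / (cov[0, 0] * cov[1, 1])
    return -0.5 * np.log2(1.0 - rho_sq)

def gaussian_entropy_bits(variance):
    """Differential entropy in bits of a scalar Gaussian with the given variance."""
    return 0.5 * np.log2(2.0 * np.pi * np.e * variance)

cov = np.array([[1.0, 0.8],
                [0.8, 2.0]])                    # assumed covariance of (X, Y)
c = 10.0                                        # rescale Y -> c*Y
cov_scaled = np.array([[1.0, c * 0.8],
                       [c * 0.8, c * c * 2.0]])

print(gaussian_mi_bits(cov), gaussian_mi_bits(cov_scaled))     # identical: MI is scale invariant
print(gaussian_entropy_bits(cov[1, 1]),
      gaussian_entropy_bits(cov_scaled[1, 1]))                 # differ by log2(c) bits
```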
3. Agent Learning and Mutual Information Gain
In agent learning, based on some observations of the environment, we develop an understanding of the structure of the environment, formulate models of this structure, and study any remaining apparent randomness or unpredictability [6,7]. Studies of agent learning have made use of the information theoretic ideas in Section 2 and have created variations on those ideas to capture particular characteristics that are distinct to agent learning problems. These expressions and related results are discussed in detail in Gibson [12]. The development of mutual information within the Information Bottleneck Theory of Deep Learning is covered in Tishby [5] and Shwartz-Ziv and LeCun [16].
The agent learning literature explores the broad ideas of unpredictability and apparent randomness [6,7]. Toward this end, it is common to investigate the total Shannon entropy of length-$N$ sequences $X^N = (X_1, X_2, \ldots, X_N)$, given by
$H(X^N) = H(X_1, X_2, \ldots, X_N),$
as a function of $N$ to characterize learning. The name total Shannon entropy is appropriate, since it is not the usual per-component entropy of interest in lossless source coding [13], for example.
In association with the idea of learning or discerning structure in an environment, the entropy gain, as defined in the literature, is the difference between the entropies of length-$N$ and length-$(N-1)$ sequences as [7]
$H(X^N) - H(X^{N-1}).$ (9)
Equation (9) was derived and studied much earlier by Shannon [14], not as an entropy gain, but as a conditional entropy. In particular, Shannon [14] defined the conditional entropy of the next symbol when the $N-1$ preceding symbols are known as
$H(X_N \mid X_{N-1}, X_{N-2}, \ldots, X_1) = H(X^N) - H(X^{N-1}),$ (10)
which is exactly Equation (9); so, the entropy gain from the agent learning literature is simply the conditional entropy expression developed by Shannon in 1948.
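The identity in Equations (9) and (10) holds for differential entropies as well, and it can be checked directly for a process with a known joint density. The sketch below (an illustration, not from the paper) uses a stationary Gaussian AR(1) process, for which both the entropy gain and Shannon's conditional entropy are available in closed form.

```python
import numpy as np

def ar1_covariance(n, a, q):
    """Covariance of n consecutive samples of a stationary Gaussian AR(1) process
    X(k+1) = a*X(k) + W(k) with Var(W) = q."""
    idx = np.arange(n)
    return (q / (1.0 - a * a)) * a ** np.abs(idx[:, None] - idx[None, :])

def gaussian_joint_entropy_bits(cov):
    """Differential entropy in bits of a zero-mean Gaussian vector with covariance 'cov'."""
    n = cov.shape[0]
    return 0.5 * np.log2(((2.0 * np.pi * np.e) ** n) * np.linalg.det(cov))

a, q, N = 0.9, 1.0, 8
entropy_gain = (gaussian_joint_entropy_bits(ar1_covariance(N, a, q))
                - gaussian_joint_entropy_bits(ar1_covariance(N - 1, a, q)))

# For an AR(1) process, conditioning on the preceding samples leaves a residual
# variance of q, so the conditional entropy is available in closed form.
conditional_entropy = 0.5 * np.log2(2.0 * np.pi * np.e * q)
print(entropy_gain, conditional_entropy)      # the two values agree
```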
A recently introduced quantity, mutual information gain, allows for a more detailed parsing of what is happening in the learning process than observing changes in entropy [12]. Even though a relative entropy between two probability densities has been called the information gain in the agent learning literature [6,7], it is evident from Equations (9) and (10) that it is just a conditional entropy [12]. Thus, the nomenclature which defined information gain in terms of this conditional entropy is misleading.
In terms of information gain, the quantity of interest is the mutual information between the overall sequence and the growing history of the past, given by
$I(X^N; X_{N-1}, X_{N-2}, \ldots, X_1) = H(X^N) - H(X_N \mid X_{N-1}, \ldots, X_1),$ (11)
where the conditional entropy $H(X_N \mid X_{N-1}, \ldots, X_1)$ is defined in Equation (9). The mutual information in Equation (11) is a much more direct measure of information gained than the entropy gain as a function of $N$, and it includes the entropy gain from agent learning as a natural component. We can obtain more insight by expanding Equation (11) using the chain rule for mutual information [13] as
$I(X^N; X_{N-1}, \ldots, X_1) = \sum_{n=1}^{N-1} I(X^N; X_n \mid X_{n-1}, \ldots, X_1).$ (12)
Since $I(X^N; X_n \mid X_{n-1}, \ldots, X_1) \ge 0$, we see that $I(X^N; X_{N-1}, \ldots, X_1)$ is nondecreasing in $N$. However, what do these individual terms in Equation (12) mean? The sequence $X^N$ should be considered the input sequence to be analyzed, with the block length $N$ large but finite. The first term in the sum, $I(X^N; X_1)$, is the mutual information between the input sequence $X^N$ and the first sample $X_1$. The next term, $I(X^N; X_2 \mid X_1)$, is the mutual information between the input sequence $X^N$ and the predicted value of $X_2$, given the prior value $X_1$. Therefore, we can characterize the change in mutual information with increasing knowledge of the past history of the sequence as a sum of conditional mutual informations $I(X^N; X_n \mid X_{n-1}, \ldots, X_1)$ [12].
We denote the mutual information $I(X^N; X_{N-1}, \ldots, X_1)$ in Equation (11) as the total mutual information gain and the conditional mutual informations $I(X^N; X_n \mid X_{n-1}, \ldots, X_1)$ in Equation (12) as the incremental mutual information gains. We utilize these terms in the following developments.
5. Log Ratio of Entropy Powers
We can use the definition of the entropy power in Equation (6) to express the logarithm of the ratio of two entropy powers in terms of their respective differential entropies as [20]
$\frac{1}{2} \log \frac{Q_{X_1}}{Q_{X_2}} = h(X_1) - h(X_2),$ (17)
where $Q_{X_1}$ represents the entropy power associated with the differential entropy $h(X_1)$, and similarly for $Q_{X_2}$. The conditional version of Equation (6) is [13]
$Q_{X \mid Y} = \frac{1}{2\pi e}\, 2^{2 h(X \mid Y)},$ (18)
from which we can express Equation (17) in terms of the entropy powers at the outputs of successive stages in a signal processing Markov chain $X \to Y \to Z$ that satisfies the data processing inequality as
$\frac{1}{2} \log \frac{Q_{X \mid Z}}{Q_{X \mid Y}} = h(X \mid Z) - h(X \mid Y).$ (19)
It is important to notice that many signal processing systems satisfy the Markov chain property, and thus the data processing inequality, so Equation (19) is potentially very useful and insightful.
We can expand our insights if we add and subtract $h(X)$ on the right-hand side of Equation (19); we then obtain an expression in terms of the difference in mutual information between the two successive stages as
$\frac{1}{2} \log \frac{Q_{X \mid Z}}{Q_{X \mid Y}} = \left[ h(X \mid Z) - h(X) \right] - \left[ h(X \mid Y) - h(X) \right] = I(X;Y) - I(X;Z).$ (20)
From the entropy power in Equation (18) and the data processing inequality, we know that both expressions in Equations (19) and (20) are greater than or equal to zero. Thus, from this result, we see that we can now associate a change in mutual information as data passes through a Markov chain with the log ratio of entropy powers.
These results are from [20] and extend the data processing inequality by providing a new characterization of the mutual information gain or loss between stages in terms of the entropy powers of the two stages. Since differential entropies are difficult to calculate, it is useful to have expressions for the entropy power at two stages and then use Equations (19) and (20) to find the difference in differential entropy and mutual information between these stages.
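A minimal numerical check of Equations (19) and (20), under the assumption of a purely Gaussian two-stage Markov chain $X \to Y \to Z$ formed by adding independent Gaussian noise at each stage (the variances are chosen arbitrarily for illustration):

```python
import numpy as np

# Two-stage Gaussian Markov chain X -> Y -> Z with Y = X + N1 and Z = Y + N2,
# where N1 and N2 are independent zero-mean Gaussian noises. For jointly
# Gaussian variables the conditional entropy powers equal the MMSE variances.
sx2, n1, n2 = 4.0, 1.0, 2.0                        # assumed variances

var_x_given_y = sx2 * n1 / (sx2 + n1)              # MMSE of estimating X from Y
var_x_given_z = sx2 * (n1 + n2) / (sx2 + n1 + n2)  # MMSE of estimating X from Z

log_ratio = 0.5 * np.log2(var_x_given_z / var_x_given_y)   # left side of Equation (20)

i_xy = 0.5 * np.log2(1.0 + sx2 / n1)               # I(X;Y) in bits
i_xz = 0.5 * np.log2(1.0 + sx2 / (n1 + n2))        # I(X;Z) in bits

print(log_ratio, i_xy - i_xz)   # equal and nonnegative, as the data processing inequality requires
```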
To get some idea of how useful Equation (20) can be, we turn to a few special cases. In many signal processing operations, a Gaussian assumption is accurate and can provide deep insights. Thus, considering two i.i.d. Gaussian distributions with zero mean and variances $\sigma_1^2$ and $\sigma_2^2$, we see directly that $Q_1 = \sigma_1^2$ and $Q_2 = \sigma_2^2$, so
$\frac{1}{2} \log \frac{Q_1}{Q_2} = \frac{1}{2} \log \frac{\sigma_1^2}{\sigma_2^2} = h(X_1) - h(X_2),$
which satisfies Equation (17) exactly.
We can also consider the MMSE error variances in a Markov chain when $X$, $Y$, and $Z$ are jointly Gaussian, with the error variances at successive stages denoted as $\sigma_{X \mid Y}^2$ and $\sigma_{X \mid Z}^2$; then
$\frac{1}{2} \log \frac{\sigma_{X \mid Z}^2}{\sigma_{X \mid Y}^2} = I(X;Y) - I(X;Z).$
Perhaps surprisingly, this result holds for two i.i.d. Laplacian distributions with variances $\sigma_1^2$ and $\sigma_2^2$ [21], since their corresponding entropy powers are $Q_1 = \frac{e}{\pi} \sigma_1^2$ and $Q_2 = \frac{e}{\pi} \sigma_2^2$, respectively, so we form
$\frac{1}{2} \log \frac{Q_1}{Q_2} = \frac{1}{2} \log \frac{\sigma_1^2}{\sigma_2^2}.$
Since $h(X_1) - h(X_2) = \frac{1}{2} \log \frac{\sigma_1^2}{\sigma_2^2}$ for the Laplacian distribution, the Laplacian distribution also satisfies Equations (17) through (20) exactly [20].
Using mean squared errors or variances in Equations (17) through (20) is accurate for many other distributions as well. It is straightforward to show that Equation (17) holds with equality when the differential entropy takes the form
$h(X) = \frac{1}{2} \log \left( c\, \sigma_X^2 \right),$ (24)
with $c$ a constant that does not depend on $\sigma_X^2$, so the entropy powers can be replaced by the mean squared error for the Gaussian, Laplacian, logistic, Cauchy, uniform, symmetric triangular, exponential, and Rayleigh distributions. Equation (24) is of the same form as the differential entropies of the distributions considered in [17,18] for the minimax error entropy estimate. Note here that we can work directly with MMSE estimates.
Therefore, Equations (17) through (20) are satisfied with equality when the variance is substituted for the entropy power for several distributions of significant interest in applications, and it is the log ratio of entropy powers that enables the use of the mean squared error to calculate the loss or gain in mutual information at each stage.
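The substitution of variances for entropy powers can be verified numerically for a subset of the scale families listed above using the closed-form entropies in scipy.stats; the sketch below checks that the entropy difference between two members of each family equals one-half the log ratio of their variances (the Cauchy case is omitted because its variance is not finite, and the symmetric triangular case requires an extra shape parameter).

```python
import numpy as np
from scipy import stats

def entropy_difference_matches_variance_ratio(family, scale_1, scale_2):
    """Check h(X1) - h(X2) = 0.5*ln(var1/var2) for two members of a scale family."""
    d1, d2 = family(scale=scale_1), family(scale=scale_2)
    lhs = d1.entropy() - d2.entropy()            # differential entropies in nats
    rhs = 0.5 * np.log(d1.var() / d2.var())
    return bool(np.isclose(lhs, rhs))

families = [("Gaussian", stats.norm), ("Laplacian", stats.laplace),
            ("logistic", stats.logistic), ("uniform", stats.uniform),
            ("exponential", stats.expon), ("Rayleigh", stats.rayleigh)]

for name, family in families:
    print(name, entropy_difference_matches_variance_ratio(family, 1.0, 3.0))   # True for each
```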
7. Entropy Power and MSE
We know from Section 2 that the entropy power is the minimum variance that can be associated with a differential entropy $h(X)$. The key insight into relating mean squared error and mutual information comes from considering the (apparently not so special) cases of random variables whose differential entropy has the form in Equation (24) and the log ratio of entropy powers. In these cases, we do not have to explicitly calculate the entropy power, since we can use the variance or mean squared error in the log ratio of entropy power expressions to find the mutual information gain or loss for these distributions. Thus, all of the results in Section 6, in terms of the log ratio of entropy powers, can be expressed as ratios of variances or mean squared errors for continuous random variables with differential entropies of the form in Equation (24). In the following, we use the more notationally bulky $\mathrm{Var}(X_k \mid Y^{k+L})$, with $Y^{k+L} = \{ Y_j,\ j \le k+L \}$, for the conditional variances rather than the simpler notation $\sigma^2$, since the $\sigma$ symbol could be confused as indicating a Gaussian assumption, which is not needed.
In particular, for the smoothing problem, we can rewrite Equation (33) as
$\frac{1}{2} \log \frac{\mathrm{Var}(X_k \mid Y^{k})}{\mathrm{Var}(X_k \mid Y^{k+L})} = I(X_k; Y^{k+L}) - I(X_k; Y^{k}),$ (43)
and the decrease in MSE in terms of the change in mutual information as
$\mathrm{Var}(X_k \mid Y^{k+L}) = \mathrm{Var}(X_k \mid Y^{k})\, 2^{-2 \left[ I(X_k; Y^{k+L}) - I(X_k; Y^{k}) \right]},$ (44)
both for a smoothing lag of $L$. Here, we see that the rate of decrease in the MMSE is exponentially related to the mutual information gain due to smoothing.
Rewriting the results for prediction in terms of variances, we see that Equation (39) becomes
$\frac{1}{2} \log \frac{\mathrm{Var}(X_{k+L} \mid Y^{k})}{\mathrm{Var}(X_k \mid Y^{k})} = I(X_k; Y^{k}) - I(X_{k+L}; Y^{k}),$
and that the growth in MMSE with increasing lookahead is
$\mathrm{Var}(X_{k+L} \mid Y^{k}) = \mathrm{Var}(X_k \mid Y^{k})\, 2^{\,2 \left[ I(X_k; Y^{k}) - I(X_{k+L}; Y^{k}) \right]}.$
Thus, as the lookahead $L$ in prediction is increased, the conditional error variance grows exponentially.
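As a concrete illustration of this exponential behavior, consider the simplest case of predicting a stationary Gaussian AR(1) process from its own (noiseless) past rather than from noisy measurements; the parameter values below are assumed for illustration only.

```python
import numpy as np

a, q = 0.9, 1.0                                  # assumed AR(1) coefficient and innovation variance
sigma2_x = q / (1.0 - a * a)                     # stationary variance of the process

for L in range(1, 6):
    pred_var = sigma2_x * (1.0 - a ** (2 * L))   # MMSE of the L-step predictor a**L * X(k)
    mi_bits = 0.5 * np.log2(sigma2_x / pred_var) # mutual information with the past, in bits
    print(L, round(pred_var, 4), round(mi_bits, 4))

# As the lookahead L grows, the prediction error variance rises toward sigma2_x and
# the mutual information between X(k+L) and the observed past decays toward zero.
```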
For the filtering problem, we have the two differential entropies, $h(X_k)$ and $h(X_{k+1})$, in Equation (42), in addition to the mutual information expressions. However, for wide sense stationary random processes with differential entropies of the form shown in Equation (24), the two variances are equal, so $h(X_k) = h(X_{k+1})$ and the difference in the two differential entropies is zero. This simplifies Equation (42) to
$\frac{1}{2} \log \frac{\mathrm{Var}(X_{k+1} \mid Y^{k+1})}{\mathrm{Var}(X_k \mid Y^{k})} = I(X_k; Y^{k}) - I(X_{k+1}; Y^{k+1}),$
which, if the error variance is monotonically nonincreasing, is less than or equal to zero, as shown. Rewriting this last result in terms of increasing mutual information, we have
$\mathrm{Var}(X_{k+1} \mid Y^{k+1}) = \mathrm{Var}(X_k \mid Y^{k})\, 2^{-2 \left[ I(X_{k+1}; Y^{k+1}) - I(X_k; Y^{k}) \right]}.$
Thus, we have related the mean squared error from estimators to the change in the gain or loss of mutual information.
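A small utility function captures this relationship; it is a sketch of the computation implied by the expressions above, valid for distributions whose differential entropy has the form of Equation (24).

```python
import numpy as np

def mutual_information_change_bits(var_before, var_after):
    """Change in mutual information, in bits, implied by a change in MMSE.

    Applies to distributions whose differential entropy has the form of Equation (24),
    so that the entropy power can be replaced by the variance. A positive value
    means information has been gained (the MMSE has decreased)."""
    return 0.5 * np.log2(var_before / var_after)

# Example: an estimator update that reduces the error variance from 1.0 to 0.6
print(mutual_information_change_bits(1.0, 0.6))   # about 0.368 bits gained
```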
It is important to recognize the power of the expressions in this section. They allow us to obtain the mutual information gain or loss by using the variances of MMSE estimators, the latter of which are easily calculated in comparison to direct calculations of differential entropy or mutual information. There is no need to utilize techniques to approximately compute differential entropies or mutual information, which are fraught with difficulties. See Hudson [10] and Kraskov [11].
8. Fixed Lag Smoothing Example
To provide a concrete example of the preceding results in Section 7, we consider finding the mutual information gain using the results of a fixed lag MMSE smoothing problem for a simple first-order system model with noisy observations. Fixed lag smoothing is a popular approach, since measurements $Y_j$ at time instants $j \le k+L$ are used to estimate the value of $X_k$; that is, measurements up to $L$ samples ahead of the present time $k$ are used to estimate $X_k$ [22].
A first-order autoregressive (AR) system model is given by [23]
$X_{k+1} = a X_k + W_k,$ (49)
where $|a| < 1$ and $W_k$ is a stationary, Gaussian, zero mean, white process with variance $q$. The observation model is expressed as
$Y_k = X_k + V_k,$ (50)
where $V_k$ is zero mean Gaussian noise with variance $r$.
For this problem, we compute the steady state errors in fixed-lag smoothing as a function of the smoothing lag. From Chirarattananon and Anderson [22], the fixed-lag smoother error covariance can be expressed in terms of the Kalman filter gain, the filter a priori error covariance, and the filter a posteriori error covariance. Therefore, the steady state expression for the fixed lag smoothing error covariance $\Sigma_L$ as a function of the lag $L$ is as follows (details are available in [22,24] and are not included here):
$\Sigma_L = P - K M \sum_{i=1}^{L} \left[ a (1-K) \right]^{2i},$ (51)
where the components from the Kalman filter are
$K = \frac{M}{M + r}$ (52)
and
$M = a^2 P + q,$ (53)
with $P$ the filtering or a posteriori estimation error variance, $M$ the a priori filter error variance, and $K$ the Kalman filter gain.
Note that $M$ is $\mathrm{Var}(X_k \mid Y^{k-1})$ from Section 7 and $P$ is $\mathrm{Var}(X_k \mid Y^{k})$, in order to simplify the notation for this example.
Given $a$, $q$, and $r$, $P$ can be computed in the steady state case as the positive root of the following quadratic equation:
$a^2 P^2 + \left[ q + r (1 - a^2) \right] P - q r = 0.$
Then $\Sigma_L$, $K$, and $M$ can be evaluated using Equations (51), (52) and (53), respectively.
The asymptotic expression for the smoothing error covariance as $L$ gets large is given by
$\Sigma_\infty = P - K M \frac{\left[ a(1-K) \right]^2}{1 - \left[ a(1-K) \right]^2}.$ (56)
This result can be used to determine what value should be selected for the maximum delay to obtain near-asymptotic performance.
Example: We now consider a specific case of the scalar models in Equations (49) and (50), with the values of $a$, $q$, and $r$ chosen so that the ratio of $q$ to $r$ corresponds to an accurate model for the AR process but a very noisy observation signal.
Table 1 lists the smoothing error covariance as a function of the smoothing lag $L$ using Equation (51), with the remaining columns obtained as described in the following. The result for $L \to \infty$ comes from Equation (56).
We obtained the third column in the table, labeled “Incremental MI Gain”, from Equation (43), where, for simplicity of notation, the conditional error variances are replaced by the smoothing error covariances $\Sigma_{L-1}$ and $\Sigma_L$, so that the incremental MI gain is the change in mutual information from lag $L-1$ to lag $L$. The fourth column is obtained for a specific $L$ by adding all values of the incremental MI gain up to and including that lag. For example, summing the incremental MI gains through the indicated lag yields the total MI gain of 0.24795 shown in the table.
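To show how such a table can be generated, the sketch below implements the scalar steady-state Kalman filter relations and the fixed-lag smoother covariance in the geometric-sum form of Equation (51) above, and it accumulates the incremental mutual information gains of Equation (43). The parameter values are illustrative placeholders, not the values used to produce Table 1.

```python
import numpy as np

# Illustrative parameter values (not those used for Table 1).
a, q, r = 0.95, 0.01, 1.0      # AR coefficient, process noise variance, measurement noise variance

# Steady-state a posteriori filter variance P: positive root of the scalar Riccati quadratic.
b = q + r * (1.0 - a * a)
P = (-b + np.sqrt(b * b + 4.0 * a * a * q * r)) / (2.0 * a * a)
M = a * a * P + q              # a priori filter error variance
K = M / (M + r)                # steady-state Kalman gain
beta = a * (1.0 - K)

def smoothing_variance(L):
    """Steady-state fixed-lag smoothing error variance for lag L (L = 0 is filtering)."""
    return P - K * M * sum(beta ** (2 * i) for i in range(1, L + 1))

previous, total_gain = smoothing_variance(0), 0.0
for L in range(1, 11):
    current = smoothing_variance(L)
    incremental_gain = 0.5 * np.log2(previous / current)   # incremental MI gain in bits
    total_gain += incremental_gain
    print(L, round(current, 5), round(incremental_gain, 5), round(total_gain, 5))
    previous = current

sigma_inf = P - K * M * beta ** 2 / (1.0 - beta ** 2)       # asymptotic smoothing variance
print("asymptotic smoothing variance:", round(sigma_inf, 5))
```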
The asymptotic reduction in MSE due to smoothing as $L$ gets large follows from Equation (56), and the corresponding mutual information gain is 0.2505 bits.
Therefore, we are able to obtain statements concerning the gain or loss of mutual information by calculating the much more directly available quantity, the minimum mean squared smoothing error.