The Entropy Gain of Linear Systems and Some of Its Implications

We study the increase in per-sample differential entropy rate of random sequences and processes after being passed through a non-minimum-phase (NMP) discrete-time, linear time-invariant (LTI) filter G. For LTI discrete-time filters and random processes, it has long been established, by Theorem 14 in Shannon's seminal paper, that this entropy gain, G(G), equals the integral of log|G(e^{jω})|. In this note, we first show that Shannon's Theorem 14 does not hold in general. Then, we prove that, when comparing the input differential entropy to that of the entire (longer) output of G, the entropy gain equals G(G). We show that the entropy gain between equal-length input and output sequences is upper bounded by G(G) and arises if and only if there exists an output additive disturbance with finite differential entropy (no matter how small) or a random initial state. Unlike what happens with linear maps, the entropy gain in this case depends on the distribution of all the signals involved. We illustrate some of the consequences of these results by presenting their implications in three different problems. Specifically: we give conditions for equality in an information inequality of importance in networked control problems; we extend to a much broader class of sources the existing results on the rate-distortion function for non-stationary Gaussian sources; and we make an observation on the capacity of auto-regressive Gaussian channels with feedback.


Introduction
We study the difference between the differential entropy rate of a random process u_1^∞ = {u_1, u_2, . . .} entering a discrete-time linear time-invariant (LTI) system G and the differential entropy rate of its (possibly noisy) output y_1^∞, as depicted in Figure 1.

Figure 1. A causal, stable, linear and time-invariant system G with input and output processes, initial state, and output disturbance.
Recall that the differential entropy rate of a random process x_1^∞ is given by h̄(x_1^∞) ≜ lim_{n→∞} n^{−1} h(x_1, x_2, . . . , x_n), provided the limit exists, where h(x_1, . . . , x_n) = E[− log f(x_1, . . . , x_n)] is the differential entropy of the ensemble x_1, . . . , x_n with probability density function (PDF) f [1]. The system G is supposed to satisfy the following:

Assumption 1. The LTI system G in Figure 1 is causal and stable, and the first sample g_0 of its impulse response has unit magnitude.
In this general setup, G may have a random initial state vector x_0 ∈ R^p, p ∈ N, and a real-valued random output disturbance z_1^∞. Our main purpose is to characterize the limit

    G(G, x_0, u_1^∞, z_1^∞) ≜ lim_{n→∞} (1/n) ( h(y_1^n) − h(u_1^n) ),   (1)

evaluating the possible effect produced by x_0 and z_1^∞. This difference can be interpreted as the entropy gain (entropy amplification or entropy boost) introduced by the filter G and (as is apparent from the other arguments of G) by the statistics of x_0, u_1^∞, and z_1^∞. We shall refer to the special case in which x_0 and z_1^∞ are both zero (or deterministic) as the noise-less case, and write G(G, 0, u_1^∞, 0) accordingly.

The earliest reference related to this problem corresponds to a noise-less continuous-time counterpart considered by Shannon. In his seminal 1948 paper [2], Shannon gave a formula for the change in differential entropy per degree of freedom that a continuous-time random process u_c, band-limited to a frequency range [0, B) (in Hz), experiences after passing through an LTI continuous-time filter G_c (without considering a random initial state or an output disturbance). Such entropy per degree of freedom is defined in terms of uniformly taken samples as

    h̄(u_c) ≜ lim_{n→∞} (1/n) h(u_c(T), u_c(2T), . . . , u_c(nT)),   (2)

with T ≜ 1/(2B). In this formula, if the LTI filter has frequency response G_c(ξ) (with ξ in Hz), then the resulting differential entropy rate of the output process y_c is given by the following theorem:

Theorem 1 (Reference [2], Theorem 14). If an ensemble having an entropy h̄(u_c) per degree of freedom in band B is passed through a filter with characteristic G_c(ξ), the output ensemble has an entropy

    h̄(y_c) = h̄(u_c) + (2/B) ∫_0^B log|G_c(ξ)| dξ.   (3)
Shannon arrived at (3) by arguing that an LTI filter can be seen as a linear operator that selectively scales its input signal along infinitely many frequencies, each of them representing an orthogonal component of the source. He then obtained the result by writing down the determinant of the Jacobian of this operator as the product of the squared frequency response magnitude of the filter over n frequency bands, applying logarithm, dividing by n, and then taking the limit as n tends to infinity.

Remark 1.
There is a factor of two in excess in the integral on the right-hand side (RHS) of (3). To see this, consider a filter with a constant gain a over [0, B) (i.e., a simple multiplicative factor). In such a case, the entropy rate of y_c should exceed that of u_c by log|a| [1]. However, (3) yields an entropy gain equal to 2 log|a|. This error arises because the determinant of the Jacobian of the transformation is actually the product of |G_c| over the n frequency bands considered in Shannon's argument. The same excess factor of two is also present in the entropy losses appearing in Reference [2], Table 1.
Theorem 14 in Reference [2] has found application in works ranging from traditional themes, such as linear prediction [3] and source coding [4], to molecular communication systems [5,6].
The available literature treating the phenomenon itself of the entropy gain (loss, boost, or amplification) induced by LTI systems seems to be rather scarce. This is not surprising given that (3) was published in Reference [2], Theorem 14, the work which gave birth to Information Theory.
The next publication concerned with this problem is Reference [7], which follows a time-domain analysis for the corresponding discrete-time problem. In this approach, one can obtain y_1^n ≜ {y(1), y(2), . . . , y(n)} as a function of u_1^n, for every n ∈ N, and evaluate the difference between the limits h̄(y_1^∞) and h̄(u_1^∞) obtained by letting n → ∞. More precisely, for an LTI discrete-time filter G with impulse response g_0^∞ = {g_0, g_1, . . .}, we can write

    y_1^n = G_n u_1^n,   (4)

where G_n ∈ R^{n×n} is the lower triangular Toeplitz matrix having [g_0 g_1 · · · g_{n−1}]^T as its first column, and
where we adopt the boldface notation y_1^n for the column vector [y(1) y(2) · · · y(n)]^T, to avoid the abuse of notation incurred by treating the sequence y_1^n as a vector (and because, in a column vector, the samples are ordered from top to bottom); the random vector u_1^n is defined likewise. From this, it is clear (see, e.g., the corollary after Theorem 8.6.4 in Reference [1]) that

    h(y_1^n) = h(u_1^n) + log|det(G_n)|,   (5)

where det(G_n) (or simply det G_n) stands for the determinant of G_n. This result is utilized in Reference [7] to show that a stable minimum-phase LTI system G produces no entropy gain if and only if the first sample of its impulse response has unit magnitude.
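The determinant identity underlying (5) is easy to check numerically. The following sketch (Python with numpy; the impulse response is an illustrative choice, not one taken from the paper) builds the lower triangular Toeplitz matrix G_n and verifies that det(G_n) = g_0^n, so the noise-less, equal-length entropy gain log|det G_n| = n log|g_0| depends on g_0 alone:

```python
import numpy as np

def toeplitz_lower(g, n):
    """Lower triangular Toeplitz matrix G_n with [g_0 ... g_{n-1}]^T as first column."""
    G = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1):
            if i - j < len(g):
                G[i, j] = g[i - j]
    return G

g = [0.5, 1.0, -2.0]   # hypothetical impulse response (g_0 = 0.5), for illustration
n = 8
G = toeplitz_lower(g, n)

# G_n is lower triangular with g_0 on its diagonal, so det(G_n) = g_0^n and the
# equal-length entropy gain log|det G_n| = n log|g_0| depends only on g_0.
assert np.isclose(np.linalg.det(G), g[0] ** n)
```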
In Reference [8], p. 568, the entropy gain of a discrete-time LTI system G (the noise-less version of the setup depicted in Figure 1) is found to be

    h̄(y_1^∞) − h̄(u_1^∞) = (1/2π) ∫_{−π}^{π} log|G(e^{jω})| dω,   (6)

where y_1^∞ is the filter's discrete-time output process (without the effect of a random initial state or an output disturbance) and h̄(y_1^∞) ≜ lim_{n→∞} (1/n) h(y_1^n).
This result was obtained starting from the fact that, for a Gaussian stationary process u_1^∞ with power spectral density (PSD) S_u(e^{jω}),

    h̄(u_1^∞) = (1/2) log(2πe) + (1/4π) ∫_{−π}^{π} log S_u(e^{jω}) dω.   (7)

If u_1^∞ enters a discrete-time LTI system with frequency response G(e^{jω}), then the PSD of its output y_1^∞ is S_y(e^{jω}) = S_u(e^{jω}) |G(e^{jω})|²; thus, it is argued that (6) follows for Gaussian stationary inputs. Reference [8] then extends the result to non-Gaussian inputs with a proof sketch which uses a time-domain relation like (4) to point out that the filter is a linear operator and, as such, the differential entropy of its output exceeds that of its input by a quantity that is independent of the input distribution. (It is worth noting that (6) is the discrete-time equivalent of (3) (without its erroneous factor of 2), which follows directly from the correspondence between sampled band-limited continuous-time systems and discrete-time systems.) It is in Reference [9], Section II-C, where, for the first time, it is shown that, for a stationary Gaussian input u_1^∞, the full entropy gain predicted by (6) takes place if the system output y_1^∞ is contaminated by an additive output disturbance of length p and positive definite covariance matrix, where p is the order of G(z).
The integral (1/2π) ∫_{−π}^{π} log|G(e^{jω})| dω can be related to the structure of the filter G. It is well known (from Jensen's formula) that if G has a causal and stable rational transfer function G(z) and an impulse response whose first sample is g_0 = lim_{z→∞} G(z), then

    (1/2π) ∫_{−π}^{π} log|G(e^{jω})| dω = log|g_0| + Σ_{i:|ρ_i|>1} log|ρ_i|,   (8)

where {ρ_i} are the zeros of G(z) (see, e.g., References [10,11]). This provides a straightforward formula to evaluate (1/2π) ∫_{−π}^{π} log|G(e^{jω})| dω for a given LTI filter with rational transfer function G(z). When combined with (6), this equation also reveals that if the entropy gain G(u_1^∞, y_1^∞) is negative (i.e., if it corresponds to an entropy loss), then |g_0| < 1 (with the corresponding change of variables, this is the case in all the examples given by Shannon in Reference [2], Table 1). More importantly, (8) allows us to concentrate, without loss of generality, on LTI systems G(z) whose first impulse-response sample has unit magnitude, as required by Assumption 1. Under the latter condition, (8) shows that the entropy gain is greater than zero if and only if G(z) has zeros outside the unit disk D ≜ {ρ ∈ C : |ρ| ≤ 1}. A system with zeros outside D is said to be non-minimum phase (NMP); conversely, a system with all its zeros inside D is said to be minimum phase (MP) [11].
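Jensen's formula (8) can be verified numerically. The sketch below (Python with numpy; the filter is a hypothetical FIR example of ours, not one from the paper) compares the frequency-domain integral with log|g_0| plus the log-magnitudes of the NMP zeros:

```python
import numpy as np

# Hypothetical FIR example G(z) = 1 + 0.5 z^{-1} - 3 z^{-2}; its zeros are the
# roots of the numerator polynomial z^2 + 0.5 z - 3, i.e., 1.5 and -2 (both NMP).
g = np.array([1.0, 0.5, -3.0])
zeros = np.roots(g)
rhs = np.log(abs(g[0])) + sum(np.log(abs(z)) for z in zeros if abs(z) > 1)

# (1/2pi) * integral of log|G(e^{jw})| over [-pi, pi], via the rectangle rule,
# which is spectrally accurate for smooth periodic integrands.
w = 2 * np.pi * np.arange(4096) / 4096
Gw = g[0] + g[1] * np.exp(-1j * w) + g[2] * np.exp(-2j * w)
lhs = np.mean(np.log(np.abs(Gw)))

assert abs(lhs - rhs) < 1e-9   # both sides equal log(1.5) + log(2) = log 3
```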

Main Contributions of this Paper
The main contributions of this paper can be summarized as follows: 1.
Our first main result is showing that (6) and (3) do not hold for a large class of filters and inputs. To see this, notice that

    |det G_n| = |g_0|^n = 1,   (9)

which, in view of (5), is equivalent to h(y_1^n) = h(u_1^n), ∀n ∈ N. In turn, this implies that h̄(y_1^∞) − h̄(u_1^∞) = 0, regardless of whether G(z) (i.e., the polynomial g_0 + g_1 z^{−1} + · · ·) has zeros with magnitude greater than one (choose, for example, g_0 = 1, g_1 = 2, and g_k = 0 for k ≥ 2). This reveals that (6) holds if and only if G(z) is MP. But (6) and (3) are equivalent (correcting for the excess factor of 2 discussed in Remark 1); thus, Theorem 14 in Reference [2] also does not hold for a class of continuous-time filters. However, the transfer function G_c(s) of a band-limited continuous-time filter G_c is defined only for imaginary values of s (because the bilateral Laplace transform of sin(t)/t converges only on the imaginary axis), so one cannot classify such filters as MP or NMP. Instead, we consider a class of continuous-time filters limited to the frequencies in the band [0, B), where B > 0 is in [Hz], defined by having a unit-impulse response of the form

    g(t) = Σ_{i=0}^{η} g_i φ_i(t),   (10)

for some absolutely summable sequence of real-valued coefficients {g_i}_{i=0}^{η}, η = 1, 2, . . ., where the sinc functions are

    φ_k(t) ≜ sinc(2Bt − k), with sinc(x) ≜ sin(πx)/(πx).   (11)

Since every such g satisfies g(k/(2B)) = 0 for k < 0, it makes sense to refer to such filters as "sample-wise causal". For this class of band-limited filters, we show that Theorem 14 holds if and only if the z-transform of {g_i}_{i=0}^{η} is MP:

Theorem 2. Suppose G_c is a low-pass continuous-time filter with unit-impulse response as in (10). Let the continuous-time random input of G_c be

    u_c(t) = Σ_{k=1}^{∞} u(k) φ_k(t),   (12)

for some random sequence {u(k)}_{k=1}^∞, with φ_k as in (11), and denote its output as y_c. Then,

    h̄(y_c) ≤ h̄(u_c) + (1/B) ∫_0^B log|G_c(ξ)| dξ,   (13)

with equality if and only if the polynomial g_0 + g_1 z^{−1} + g_2 z^{−2} + · · · has no roots outside the unit circle.
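The disagreement described above can be seen numerically for the example g_0 = 1, g_1 = 2. The following sketch (Python with numpy) checks that the equal-length map produces no entropy gain (log|det G_n| = 0), while the frequency integral in (6) equals log 2 because G(z) = 1 + 2 z^{−1} has a zero at z = −2:

```python
import numpy as np

n = 64
G = np.eye(n) + 2.0 * np.eye(n, k=-1)     # G_n for g_0 = 1, g_1 = 2

# Equal-length, noise-less case: h(y_1^n) - h(u_1^n) = log|det G_n| = 0 ...
sign, logdet = np.linalg.slogdet(G)
assert abs(logdet) < 1e-9

# ... yet the frequency integral in (6) equals log 2 > 0, by Jensen's formula,
# because of the zero of G(z) = 1 + 2 z^{-1} at z = -2, outside the unit circle.
w = 2 * np.pi * np.arange(4096) / 4096
integral = np.mean(np.log(np.abs(1.0 + 2.0 * np.exp(-1j * w))))
assert abs(integral - np.log(2.0)) < 1e-9
```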

2.
We show that (1/2π) ∫_{−π}^{π} log|G(e^{jω})| dω actually corresponds to the entropy gain introduced by G, provided one adopts the new notion of effective differential entropy (rate) of y_1^∞ proposed in this paper, defined next.

Definition 1 (The Effective Differential Entropy). Let y ∈ R^ℓ be a random vector. If y can be written as a linear transformation y = Su, for some u ∈ R^n (n ≤ ℓ) with bounded differential entropy and S ∈ R^{ℓ×n}, then the effective differential entropy of y is defined as

    h̃(y) ≜ h(Ay),   (14)

where S = A^T T C is an SVD of S, with A^T ∈ R^{ℓ×n} having orthonormal columns, T ∈ R^{n×n} diagonal, and C ∈ R^{n×n} unitary.
We can now state our second main result, the proof of which is in Appendix A:

Theorem 3. Let u_1^∞ be the input of an LTI system G with transfer function G(z) without zeros on the unit circle and with an absolutely summable unit-impulse response {g_i}_{i=0}^{η−1}, with η = ∞ if G has an infinite impulse response. Denote the output of G as y_1^∞. Suppose h(u_1^n) < ∞ for every finite n. Then,

    lim_{n→∞} (1/n) ( h̃(y_1^{n+η}(u_1^n)) − h(u_1^n) ) = (1/2π) ∫_{−π}^{π} log|G(e^{jω})| dω,   (15)

where y_1^{n+η}(u_1^n) denotes the entire response of G to the input u_1^n.
Theorem 3 states that, when one considers the full-length output of a system, the entire entropy gain predicted by (6) is introduced, in the effective sense, by the system itself. Section 4 provides a geometrical description of the phenomenon behind Definition 1 and Theorem 3.
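For a Gaussian input, the effective differential entropy of Definition 1 can be computed in closed form, which gives a quick numerical illustration of the effective entropy gain. In the sketch below (Python with numpy), S is an arbitrary tall matrix standing in for the full-response map; it is our own illustrative choice, not a convolution matrix from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, l = 3, 6
S = rng.standard_normal((l, n))    # tall matrix standing in for the full response
U, sv, Vt = np.linalg.svd(S, full_matrices=False)   # S = U @ diag(sv) @ Vt
A = U.T                            # rows of A span the n-dimensional support of y

# u ~ N(0, I_n): h(u) = (n/2) log(2 pi e). Then A y = diag(sv) @ Vt @ u is a
# Gaussian vector in R^n, and its differential entropy is the effective
# differential entropy of y = S u.
h_u = 0.5 * n * np.log(2 * np.pi * np.e)
cov_Ay = (np.diag(sv) @ Vt) @ (np.diag(sv) @ Vt).T
h_eff = 0.5 * np.log((2 * np.pi * np.e) ** n * np.linalg.det(cov_Ay))

# The effective entropy gain is exactly the log-product of singular values of S.
assert np.isclose(h_eff - h_u, np.sum(np.log(sv)))
```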

3.
We show that (1/2π) ∫_{−π}^{π} log|G(e^{jω})| dω is a tight upper bound on the entropy gain of G (as defined in (1)) when the output is contaminated by some additional additive signal, such as a random initial state (represented by x_0 in Figure 1) or an output disturbance (such as z_1^∞ in Figure 1), with sufficiently many degrees of freedom (a condition formally stated in Assumption 2 below). Moreover, we show that an entropy gain equal to this upper bound can appear even when the disturbance or random initial state has an infinitesimally small variance. To the best of our knowledge, the latter phenomenon has been discussed in the literature first (and only) in Reference [9], Section II-C, for Gaussian stationary inputs and an LTI filter. We go beyond that result by explicitly and fully characterizing the entropy gain of LTI systems for a large class of inputs which are not necessarily Gaussian nor stationary. We refer to this class as entropy-balanced processes, formally specified in the following definition:

Definition 2. A random process {v(k)}_{k=1}^∞ is said to be entropy balanced if the following two conditions are satisfied: (i) its sample variances σ²_{v(n)} are finite for every finite n, and (ii) for every ν ∈ N and for every sequence of matrices {Φ_n}_{n=ν+1}^∞, Φ_n ∈ R^{(n−ν)×n}, with orthonormal rows,

    lim_{n→∞} (1/n) ( h(Φ_n v_1^n) − h(v_1^n) ) = 0.   (17)

The second condition guarantees that projecting an entropy-balanced process onto any subspace having finitely fewer dimensions yields a process with the same differential entropy rate. The entropy gain induced by finite-length output disturbances is characterized by our next theorem.
Theorem 4. In the system of Figure 1, let G satisfy Assumption 1 and suppose that u_1^∞ is entropy balanced. Suppose the random output disturbance z_1^∞ is such that z(i) = 0, ∀i > κ, and that |h(z_1^κ)| < ∞. Let κ̄ ≜ min{κ, m}, where m is the number of NMP zeros of G(z). Then,

    G(G, 0, u_1^∞, z_1^∞) = Σ_{i=1}^{κ̄} log|ρ_i| ≤ (1/2π) ∫_{−π}^{π} log|G(e^{jω})| dω,   (18)

where ρ_1, . . . , ρ_m denote the NMP zeros of G(z), sorted in order of decreasing magnitude, with equality in the inequality if and only if κ ≥ m.
The proof is presented in Section 6.4, and we provide geometrical insight explaining the phenomenon underlying Definition 2 and Theorem 4 in Section 5.1.
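The mechanism of Theorem 4 can be sketched for a Gaussian input and the single-NMP-zero example G(z) = 1 + 2 z^{−1} (a back-of-the-envelope illustration of ours under Gaussian assumptions, not the general proof). With u_1^n ~ N(0, I) and a disturbance of variance eps on the first output sample only (κ = 1 = m), the per-sample entropy gain admits a closed form via the matrix determinant lemma, and it approaches log 2 for arbitrarily small eps:

```python
import numpy as np

# G(z) = 1 + 2 z^{-1}: one NMP zero (z = -2), so m = 1. Disturbance on the first
# output sample only (kappa = 1), variance eps; input u_1^n ~ N(0, I). Then
#   gain(n) = (1/2n) log det(G_n G_n^T + eps * e_1 e_1^T)
#           = (1/2n) log(1 + eps * ||G_n^{-1} e_1||^2)        (since det G_n = 1),
# and forward substitution on G_n x = e_1 gives x_k = (-2)^k (indexing from 0),
# so ||G_n^{-1} e_1||^2 = (4^n - 1) / 3.
def gain(n, eps):
    return 0.5 / n * np.log1p(eps * (4.0 ** n - 1.0) / 3.0)

n = 400
assert gain(n, 0.0) == 0.0                       # no disturbance: no entropy gain
assert abs(gain(n, 1e-6) - np.log(2.0)) < 0.05   # tiny disturbance: gain -> log 2
```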

4.
We illustrate the relevance of the results summarized above by applying them to three problems in three areas, namely: (a) Networked Control: We show that equality holds in the inequality stated in Reference [12], Lemma 3.2 (a fundamental piece for the performance limitation results further developed in Reference [13]), under very general conditions. In addition, we extend the validity of a related equality for the perfect-feedback case, given by Reference [14], Theorem 14, for Gaussian signals, to the much larger class of entropy-balanced processes.
(b) The rate-distortion function for non-stationary Gaussian sources: This problem has been previously solved in References [15-17]. We provide a simpler proof based upon the results described above. This proof extends the result stated in References [16,17] to a broader class of non-stationary sources. (c) Gaussian channel capacity with feedback: We show that capacity results based on using a short random sequence as channel input and relying on a feedback filter which boosts the entropy rate of the end-to-end channel noise (such as the one proposed in Reference [9]) crucially depend upon the complete absence of any additional disturbance anywhere in the system. Specifically, we show that the information rate of such capacity-achieving schemes drops to zero in the presence of any such additional disturbance. As a consequence, the relevance of characterizing the robust (i.e., in the presence of disturbances) feedback capacity of Gaussian channels, which appears to be a fairly unexplored problem, becomes evident.

Paper Outline
The remainder of this paper begins with some necessary definitions and preliminary results in Section 2. It continues with our detailed exposition, in Section 3, of why Shannon's reasoning fails to yield the right expression for the entropy gain. We present an intuitive discussion leading to the definition of effective differential entropy in Section 4, which ends with the proof of Theorem 3. Section 5 gives a geometric interpretation of how an arbitrarily small additive perturbation is able to boost the differential entropy rate of the process coming out of an NMP LTI filter. This exposition aids understanding, and justifies the introduction, of entropy-balanced random processes, which are also characterized there. Sections 6 and 7 contain our results for the entropy gain produced by an output disturbance and a random initial state, respectively. Our illustrative application results are presented in Section 8, followed by our conclusions in Section 9. Except when presented right after a statement or in its own section, all proofs are given in Appendix B.

Notation
The sets of natural, real, and complex numbers are denoted N, R, and C, respectively. For a complex x, ℜ{x} is the real part of x. For a set S, the indicator function 1_S(x) equals 1 if x ∈ S and 0 otherwise. For any LTI system G, the transfer function G(z) corresponds to the z-transform of the impulse response g_0, g_1, . . ., i.e., G(z) = Σ_{i=0}^∞ g_i z^{−i}. For a transfer function G(z), we denote by G_n ∈ R^{n×n} the lower triangular Toeplitz matrix having [g_0 · · · g_{n−1}]^T as its first column. We write x_1^n as a shorthand for the sequence {x_1, . . . , x_n} and, when convenient, we write x_1^n in vector form as x_1^n ≜ [x_1 x_2 · · · x_n]^T, where (·)^T denotes transposition. Random scalars are denoted using non-italic characters, such as x, and random vectors using non-italic boldface characters, such as x. The notation x ⊥⊥ y means that x and y are independent. If x and z are conditionally independent given y, we write x ←→ y ←→ z. For matrices, we use upper-case boldface symbols, such as A. We write λ_i(A) to denote the i-th eigenvalue of A, sorted in increasing magnitude. If A ∈ C^{m×n}, A^H is its conjugate transpose, and σ_i(A) denotes its i-th singular value, sorted in increasing order. We define σ_min(A) ≜ σ_1(A) and σ_max(A) ≜ σ_{min{m,n}}(A). The term A_{i,j} denotes the entry at the intersection of the i-th row and the j-th column. If A ∈ C^{m×n}, then A^T and A^* denote the transpose and conjugate transpose of A, respectively. We write [A]_{i_1}^{i_2}, with 1 ≤ i_1 ≤ i_2 ≤ m, to refer to the matrix formed by selecting the rows i_1 to i_2 of A. Likewise, for 1 ≤ j_1 ≤ j_2 ≤ n, _{j_1}^{j_2}A is the matrix built with columns j_1 to j_2 of A. The expression _{m_1}[A]_{m_2} corresponds to the square sub-matrix along the main diagonal of A, with its top-left and bottom-right corners on A_{m_1,m_1} and A_{m_2,m_2}, respectively. A diagonal matrix whose entries are the elements in a set D (wherein elements may be repeated) is denoted as diag D.
If A ∈ R n×m 1 and B ∈ R n×m 2 , we write [A|B] ∈ R n×(m 1 +m 2 ) to denote the augmented matrix built by placing the columns of A followed by those of B.

Mutual Information and Differential Entropy
Let x, y, and z be random variables with joint PDF f_{x,y,z}, and marginal PDFs f_x, f_y, and f_z, respectively. The mutual information between x and y is defined as

    I(x; y) ≜ ∫ f_{x,y}(x, y) log ( f_{x,y}(x, y) / ( f_x(x) f_y(y) ) ) dx dy.

The conditional mutual information between x and y given z is defined as

    I(x; y | z) ≜ ∫ f_{x,y,z}(x, y, z) log ( f_{x,y|z}(x, y | z) / ( f_{x|z}(x | z) f_{y|z}(y | z) ) ) dx dy dz,

where f_{x,y|z} is the joint PDF of x and y given z, and f_{x|z}, f_{y|z} are defined likewise. The conditional differential entropy of x given y is defined as

    h(x | y) ≜ − ∫ f_{x,y}(x, y) log( f_{x|y}(x | y) ) dx dy.
From these definitions, it is easy to verify the following properties (Reference [1]):
• Non-negativity: I(x; y) ≥ 0, with equality if and only if x and y are independent.
• Chain rule: I(x; y, z) = I(x; y) + I(x; z | y).
• Relationship with entropy: I(x; y) = h(x) − h(x | y).
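These properties can be sanity-checked in closed form for jointly Gaussian variables; e.g., for unit-variance (x, y) with correlation coefficient rho (a standard textbook example, not one from the paper):

```python
import numpy as np

rho = 0.8   # correlation coefficient of jointly Gaussian, unit-variance (x, y)
h_x = 0.5 * np.log(2 * np.pi * np.e)                          # h(x)
h_x_given_y = 0.5 * np.log(2 * np.pi * np.e * (1 - rho**2))   # h(x | y)
I_xy = -0.5 * np.log(1 - rho**2)                              # known closed form

# Relationship with entropy: I(x; y) = h(x) - h(x | y).
assert np.isclose(I_xy, h_x - h_x_given_y)
```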

System Model and Assumptions
Consider the discrete-time system depicted in Figure 1. In this setup, the block G satisfies Assumption 1.
It is worth noting that there is no loss of generality in considering g_0 = 1, since one can otherwise write G(z) as G(z) = g_0 · (G(z)/g_0); thus, the entropy gain introduced by G(z) would be log|g_0| plus the entropy gain due to G(z)/g_0 (in agreement with (6)), which has an impulse response whose first sample equals 1.
The following assumption is made about the output disturbance z_1^∞:

Assumption 2. The disturbance z_1^∞ is independent of u_1^∞ and belongs to a κ-dimensional linear subspace, for some finite κ ∈ N. This subspace is spanned by the κ orthonormal columns of a matrix Φ ∈ R^{|N|×κ} (where |N| stands for the countably infinite size of N), such that |h(Φ^T z_1^∞)| < ∞. Moreover, z_1^∞ = Φ s_1^κ, where the random vector s_1^κ ≜ Φ^T z_1^∞ has finite differential entropy, its covariance matrix K_{s_1^κ} satisfies λ_max(K_{s_1^κ}) < ∞, and it is independent of u_1^∞.

Revisiting Theorem 14 in Reference [2]
In this section, after presenting the proof of Theorem 2, we develop Shannon's approach into a more detailed and formal exposition. This allows us to explain why, for some of the continuous-time filters considered in Theorem 2, the approach chosen by Shannon to prove Theorem 14 in Reference [2] is unable to predict the correct value of the entropy gain.

Proof of Theorem 2
To begin with, the Fourier transform of φ_k is

    Φ_k(ξ) = (1/2B) e^{−jπkξ/B} 1_{[−B,B)}(ξ).

It is easy to verify that the functions φ_k satisfy the orthogonality property

    ∫_{−∞}^{∞} φ_k(t) φ_l(t) dt = (1/2B) δ_{k,l},

and that φ_k(l/(2B)) = 0 for every integer l ≠ k. Notice that u(k) = u_c(k/(2B)), k ∈ N.
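The orthogonality of the sinc basis can be checked numerically. The sketch below (Python with numpy, taking 2B = 1 so that φ_k(t) = sinc(t − k) under numpy's convention sinc(x) = sin(πx)/(πx)) approximates the inner products on a long, finely spaced grid:

```python
import numpy as np

t = np.arange(-2000.0, 2000.0, 0.05)   # long, finely spaced integration grid
dt = t[1] - t[0]

def inner(k, l):
    """Riemann-sum approximation of the integral of phi_k * phi_l over the line."""
    return np.sum(np.sinc(t - k) * np.sinc(t - l)) * dt

assert abs(inner(3, 3) - 1.0) < 1e-3   # unit norm (1/2B = 1 here)
assert abs(inner(3, 7)) < 1e-3         # orthogonality for k != l
```

The residual error is due to truncating the integration interval; the sinc tails decay only like 1/t.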
The output of G_c sampled at time t = l/(2B), l ∈ N, is

    y(l) ≜ y_c(l/(2B)) = Σ_{k=1}^{l} g_{l−k} u(k),

with u(k) = 0 for k ≤ 0. This means that the output samples y_1^∞ are the discrete-time convolution between u_1^∞ and the filter coefficients {g_i}_{i=0}^{η}. Therefore, the matrix relation (4) holds, and we obtain that h̄(y_c) = h̄(u_c) + log|g_0|.
The frequency response of G_c is given by

    G_c(ξ) = Σ_{i=0}^{η} g_i e^{−jπiξ/B} 1_{[−B,B)}(ξ),   (30)

where ξ is in [Hz]. This means that

    (1/2π) ∫_{−π}^{π} log|G(e^{jω})| dω = (1/2B) ∫_{−B}^{B} log|G_c(ξ)| dξ = (1/B) ∫_0^B log|G_c(ξ)| dξ,

where the last equality holds because G_c(ξ) is conjugate symmetric. Thus, in view of (8) and since h̄(y_c) = h̄(u_c) + log|g_0|, the entropy gain introduced by G_c does not exceed the integral on the right-hand side of (13), with equality if and only if the z-transform of {g_i}_{i=0}^{η} has no zeros outside the unit circle, concluding the proof.

Formalizing Shannon's Argument
In the approach followed by Shannon, it is argued that the entropy gain is the limit as n → ∞ of n^{−1} Σ_{r=0}^{n−1} log|G_c(ξ_r)| over uniformly spaced frequencies ξ_0, . . . , ξ_{n−1}. Here, we show that this summation corresponds to log|det(G̃_n)|, where G̃_n is an n-by-n circulant Toeplitz matrix. Moreover, the sequences of Hermitian matrices {G_n G_n^*}_{n=1}^∞ and {G̃_n G̃_n^*}_{n=1}^∞ are asymptotically equivalent (as defined in Reference [18], Section 2.3), which would yield lim_{n→∞} n^{−1} log|det(G_n)| = lim_{n→∞} n^{−1} log|det(G̃_n)| if the eigenvalues of G_n G_n^* were bounded between constants 0 < ζ_m < ζ_M < ∞ for all n ∈ N. However, if G(z) (the z-transform of {g_k}_{k=0}^∞) has NMP zeros, then G_n G_n^* has eigenvalues tending to zero exponentially as n → ∞, which precludes these two limits from coinciding.
To prove the above claims, we first apply the change of variable ω ≜ πξ/B, with which (30) becomes

    G_c(Bω/π) = Σ_{i=0}^{η} g_i e^{−jωi} = G(e^{jω}),   (32)

where G(e^{jω}) is the frequency response of the discrete-time filter G with unit-impulse response {g_i}_{i=0}^{η}, and ω is in radians per second. Now, following Shannon's approach, we uniformly sample G(e^{jω}) at the n frequencies

    ω_r ≜ r·2π/n if r/n ≤ 0.5,   ω_r ≜ r·2π/n − 2π if r/n > 0.5,   r = 0, 1, . . . , n − 1,   (33)

which, from (32), yields the spectral samples

    G(e^{jω_r}) = Σ_{i=0}^{η} g_i e^{−jω_r i},   r = 0, 1, . . . , n − 1.   (34)

We will cast the reason why (3) fails to coincide with the correct expression for the entropy gain provided by (5) as a disagreement between the asymptotic behavior of the logarithm of the determinant of two sequences of asymptotically equivalent matrices. For that purpose, notice that the samples in (34) coincide with the eigenvalues of the circulant matrix [18]

    G̃_n ≜ U_n^H diag{G(e^{jω_0}), G(e^{jω_1}), . . . , G(e^{jω_{n−1}})} U_n,   (35)

where U_n ∈ C^{n×n} is the n-point discrete Fourier transform (DFT) matrix, defined as [U_n]_{k,l} ≜ n^{−1/2} e^{−j2πkl/n}. From Reference [18], Lemma 4.5, the first column of G̃_n is given by g̃_{n,k} ≜ Σ_{i∈N_0 : k+ni ≤ η} g_{k+ni}, corresponding to the (possibly) aliased impulse response g_0, g_1, . . . , g_η as a result of sampling in frequency.
We can now see that the discrepancy between the entropy gain predicted by (3) and that given by (5) is the disagreement between the following limits:

    lim_{n→∞} (1/n) log|det(G_n)| = log|g_0|,
    lim_{n→∞} (1/n) log|det(G̃_n)| = (1/2π) ∫_{−π}^{π} log|G(e^{jω})| dω,   (37)

where, due to (8), the expressions on the two right-hand sides differ if and only if G(z) has NMP zeros. According to Reference [18], Lemma 4.6, the sequences {G_n}_{n=1}^∞ and {G̃_n}_{n=1}^∞ are asymptotically equivalent, which is written as G_n ∼ G̃_n. Then, from Reference [18], Theorem 2.1, the Hermitian matrices satisfy G_n G_n^* ∼ G̃_n G̃_n^*, which, from Reference [18], Theorem 2.4, implies that

    lim_{n→∞} (1/n) Σ_{i=1}^n f(λ_i(G_n G_n^*)) = lim_{n→∞} (1/n) Σ_{i=1}^n f(λ_i(G̃_n G̃_n^*))

for any function f continuous over a finite interval [ζ_m, ζ_M] containing the eigenvalues of both matrix sequences. However, when G(z) has m NMP zeros, Lemma 7 (in Section 6.3) establishes that exactly m singular values of G_n tend to zero exponentially as n → ∞. Crucially, log(·) is discontinuous (unbounded) at 0, which precludes the limits in (37) from coinciding.
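The disagreement in (37) is easy to reproduce numerically for the example g = (1, 2): the circulant (frequency-sampled) log-determinant per sample tends to log 2, while the Toeplitz one equals log|g_0| = 0 exactly, for every n. A sketch in Python with numpy:

```python
import numpy as np

g = np.array([1.0, 2.0])
n = 64

# Circulant matrix ~G_n: its eigenvalues are the DFT of the (here, un-aliased,
# since n > 2) impulse response, i.e., the spectral samples G(e^{j w_r}).
first_col = np.zeros(n)
first_col[: len(g)] = g
circ = np.mean(np.log(np.abs(np.fft.fft(first_col))))   # (1/n) log|det ~G_n|

# Toeplitz matrix G_n: lower triangular with g_0 = 1 on the diagonal, so
# (1/n) log|det G_n| = log|g_0| = 0 for every n.
toep = np.log(abs(g[0]))

assert abs(circ - np.log(2.0)) < 1e-6   # frequency-sampled value -> log 2
assert toep == 0.0                      # true Toeplitz value: 0
```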

The Effective Differential Entropy
Theorem 3 establishes that the effective differential entropy rate of the entire or complete output of an LTI system exceeds that of the (shorter) input sequence by the RHS of (15). This section provides a geometrical interpretation of this problem and intuition about the effective differential entropy already introduced in Definition 1.
Consider the random vectors u ≜ [u_1 u_2]^T and y ≜ [y_1 y_2 y_3]^T related via

    y = G̃_2 u,   G̃_2 ≜ [ 1 0 ; 2 1 ; 0 2 ],   (40)

i.e., y is the entire response to u of the FIR filter with impulse response g_0 = 1, g_1 = 2. Applying the conventional definition of differential entropy of a random sequence, we would have that h(y_1, y_2, y_3) = h(y_1, y_2) + h(y_3 | y_1, y_2) = −∞, because y_3 is a deterministic function of y_1 and y_2: indeed, y_3 = 2y_2 − 4y_1. In other words, the problem lies in that, although the output is a three-dimensional vector, it only has two degrees of freedom, i.e., it is restricted to a 2-dimensional subspace of R^3. This is illustrated in Figure 2, where the set [0, 1] × [0, 1] is shown (coinciding with the u-v plane), together with its image through G̃_2 (as defined in (40)). As can be seen in this figure, the image of the square [0, 1]² through G̃_2 is a 2-dimensional rhombus over which {y_1, y_2, y_3} distributes uniformly. Since the intuitive notion of differential entropy of an ensemble of random variables relates to the size of the region spanned by the associated random vector (and determines how difficult it is to compress it in a lossy fashion with a given precision), one could argue that the differential entropy of {y_1, y_2, y_3}, far from being −∞, should be somewhat larger than that of {u_1, u_2} (since the rhombus G̃_2 [0, 1]² has a larger area than [0, 1]²). So, what does it mean that (and why should) h(y_1, y_2, y_3) = −∞? Simply put, the differential entropy relates to the volume spanned by the support of the probability density function. For y in our example, the latter (three-dimensional) volume is clearly zero.
From the above discussion, the comparison between the differential entropies of y ∈ R^3 and u ∈ R^2 of our previous example should take into account that y actually lives in a two-dimensional subspace of R^3. Indeed, since multiplication by a unitary matrix does not alter differential entropies, we could consider the differential entropy of

    ỹ ≜ Q̃ y,

where Q̃^T is the 3 × 2 matrix with orthonormal columns in the SVD of G̃_2, and q̃ is a unit-norm vector orthogonal to the columns of Q̃^T (and thus orthogonal to y, as well). We are now able to compute the differential entropy in R^2 of ỹ, corresponding to the rotated version of y such that its support is now aligned with R^2.
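The numbers in this example can be checked directly. A sketch (Python with numpy) of the 3 × 2 map G̃_2, its singular values, and the linear constraint tying y_3 to y_1 and y_2:

```python
import numpy as np

G2 = np.array([[1.0, 0.0],
               [2.0, 1.0],
               [0.0, 2.0]])
sv = np.linalg.svd(G2, compute_uv=False)

# Area of the image of the unit square: prod(sv) = sqrt(det(G2^T G2)) = sqrt(21),
# so for u uniform on [0,1]^2 (h(u) = 0), the effective differential entropy of
# y = G2 u is log(sqrt(21)) > 0, rather than -infinity.
assert np.isclose(np.prod(sv), np.sqrt(21.0))

# y_3 is determined by y_1 and y_2 (y_3 = 2 y_2 - 4 y_1): the support is planar.
u = np.random.default_rng(1).random((2, 1000))
y = G2 @ u
assert np.allclose(y[2], 2 * y[1] - 4 * y[0])
```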
The preceding discussion motivates the use of a modified version of the notion of differential entropy for a random vector y ∈ R n which considers the number of dimensions actually spanned by y instead of its length.
It is worth mentioning that Shannon's differential entropy of a vector y ∈ R^ℓ, whose support's ℓ-volume is greater than zero, arises from considering it as the difference between its (absolute) entropy and that of a random variable uniformly distributed over an ℓ-dimensional, unit-volume region of R^ℓ. More precisely, if in this case the probability density function (PDF) of y = [y_1 y_2 · · · y_ℓ]^T is Riemann integrable, then (Reference [1], Theorem 9.3.1)

    h(y) = lim_{Δ→0} ( H_ℓ(y^Δ) + ℓ log Δ ),   (44)

where y^Δ is the discrete-valued random vector resulting when y is quantized using an ℓ-dimensional uniform quantizer with ℓ-cubic quantization cells of volume Δ^ℓ. However, if we consider a variable y whose support belongs to an n-dimensional subspace of R^ℓ, n < ℓ (i.e., y = Su = A^T T C u, as in Definition 1), then the entropy of its quantized version in R^ℓ, say H_ℓ(y^Δ), is distinct from H_n((Ay)^Δ), the entropy of Ay quantized in R^n. Moreover, it turns out that, in general,

    lim_{Δ→0} ( H_ℓ(y^Δ) − H_n((Ay)^Δ) ) ≠ 0,   (45)

despite the fact that A has orthonormal rows. Thus, the definition given by (44) does not yield consistent results for the case wherein the dimension of the support of a random vector (i.e., its number of degrees of freedom) is smaller than its length (the mentioned inconsistency refers to (45)), which reveals that the asymptotic behavior of H_ℓ(y^Δ) changes if y is rotated.
(If this were not the case, then we could redefine (44) replacing ℓ by n, in a spirit similar to the one behind Rényi's d-dimensional entropy [19].) To see this, consider the case in which u ∈ R distributes uniformly over [0, 1] and y = [1 1]^T u / √2. Clearly, y distributes uniformly over the unit-length segment connecting the origin with the point (1, 1)/√2. Then, since this segment intersects approximately √2/Δ square quantization cells of side Δ, over which y^Δ distributes (approximately) uniformly,

    lim_{Δ→0} ( H_2(y^Δ) + log Δ ) = log √2.

On the other hand, since, in this case, Ay = u, we have that

    lim_{Δ→0} ( H_1((Ay)^Δ) + log Δ ) = h(u) = 0.

Thus, the d-dimensional entropy would not generally be equal to the effective differential entropy, that is,

    h̃(y) = h(Ay) = 0 ≠ log √2.

The latter example further illustrates why the notion of effective entropy is appropriate in the setup considered in this section, where the effective dimension of the random sequences does not coincide with their length (it is easy to verify that the effective entropy of y does not change if one rotates y in R^2).
We finish this section with an example to illustrate the usefulness of the notion of effective differential entropy beyond the context of entropy gain.

Application Example: Shannon Lower Bound
The rate-distortion function (RDF) R(D) is the infimum, among all codes, of the expected number of bits per sample necessary to reconstruct a given random source with distortion not greater than D [1]. Let the source and reconstruction be the vectors x_1^ℓ and x_1^ℓ + v_1^ℓ, respectively, and suppose the distortion is assessed using the mean-squared error D = (1/ℓ) E[‖v_1^ℓ‖₂²]. Then, restricting our attention to uniquely-decodable codes (Reference [1], p. 105), the Shannon Lower Bound (SLB) [20] establishes that

    R(D) ≥ (1/ℓ) h(x_1^ℓ) − (1/2) log(2πeD),   (49)

provided h(x_1^ℓ) is bounded. Therefore, if x_1^ℓ is the entire forced response of an FIR filter G of order p to an input u_1^n, then ℓ = n + p and h(x_1^ℓ) is minus infinity, which precludes one from using (49). We will show next that, in this case, the SLB can still be stated by using the effective differential entropy h̃(x_1^ℓ) instead of h(x_1^ℓ). Following Definition 1, we can write the source vector as x_1^ℓ = S u_1^n = A^T T C u_1^n, and let H ≜ [A^T | Ā^T]^T be a unitary matrix, which means that Ā A^T = 0_{p×n}. Then,

    ℓ R(D) ≥ I(x_1^ℓ; x_1^ℓ + v_1^ℓ)   (a)
          = I(H x_1^ℓ; H x_1^ℓ + H v_1^ℓ)
          ≥ I(A x_1^ℓ; A x_1^ℓ + A v_1^ℓ)
          ≥ h(A x_1^ℓ) − h(A v_1^ℓ)   (b)
          = h̃(x_1^ℓ) − h(A v_1^ℓ),   (c)

where (a) stems from Reference [1], Theorems 5.4.1 and 5.5.1 and Equations (10.58)-(10.61), (b) holds because conditioning does not increase entropy, and (c) is from the definition of effective differential entropy. Finally, since E[‖A v_1^ℓ‖²] ≤ E[‖v_1^ℓ‖²] = ℓD, the maximum-entropy property of the Gaussian distribution yields h(A v_1^ℓ) ≤ (n/2) log(2πeℓD/n), and, therefore, R(D) ≥ (1/ℓ) h̃(x_1^ℓ) − (n/(2ℓ)) log(2πeℓD/n).

Entropy-Balanced Processes: Geometric Interpretation and Properties
In the first part of this section, we provide a geometric interpretation of the effect that a non-minimum phase LTI system has on its input random process. This will give an intuitive meaning to the notion of an entropy-balanced random process (introduced in Definition 2 above) and provide insights into why and how the entropy gain defined in (1) arises as a consequence of an output random disturbance or a random initial state (the themes of Sections 6 and 7, respectively).
The second part of this section identifies several entropy-balanced processes and establishes two properties satisfied by this class of processes.

Geometric Interpretation
We begin our discussion with a simple example.

Example 1. Suppose that G in Figure 1 is a finite impulse response (FIR) filter with impulse response g_0 = 1, g_1 = 2, g_i = 0 for all i ≥ 2. Notice that this choice yields G(z) = 1 + 2z^{-1} = (z + 2)/z; thus, G(z) has one non-minimum phase zero, at z = −2. The associated matrix G_n for n = 3 is the lower-triangular Toeplitz matrix

G_3 = [1 0 0; 2 1 0; 0 2 1],
whose determinant is clearly one (indeed, all its eigenvalues are 1). Hence, as discussed in the introduction, h(G_3 u_1^3) = h(u_1^3); thus, G_3 (and G_n, in general) does not introduce an entropy gain by itself. However, an interesting phenomenon becomes evident by looking at the SVD of G_3, given by G_3 = Q_3^T D_3 R_3, where Q_3 and R_3 are unitary matrices and D_3 ≜ diag{d_1, d_2, d_3}. In this case, D_3 = diag{0.19394, 1.90321, 2.70928}; thus, one of the singular values of G_3 is much smaller than the others (although the product of all singular values equals 1, as expected). As will be shown in Section 6, for a stable G(z), such uneven distribution of singular values arises only when G(z) has non-minimum phase zeros. The effect of this can be visualized by looking at the image of the cube [0, 1]^3 through G_3, shown in Figure 3. If the input u_1^3 were uniformly distributed over this cube (of unit volume), then G_3 u_1^3 would distribute uniformly over the unit-volume parallelepiped depicted in Figure 3; hence, h(G_3 u_1^3) = h(u_1^3). However, if a random disturbance z_1^3 = sΦ were added to G_3 u_1^3, with s a scalar random variable and Φ ∈ R^{3×1}, the effect would be to "thicken" the support over which the resulting random vector y_1^3 = G_3 u_1^3 + z_1^3 is distributed, along the direction pointed by Φ. If Φ is aligned with the direction along which the support of G_3 u_1^3 is thinnest (given by q_{3,1}, the first row of Q_3), then the resulting support would have its volume significantly increased, which can be associated with a large increase in the differential entropy of y_1^3 with respect to u_1^3. Indeed, a relatively small variance of s and an approximately aligned Φ would still produce a significant entropy gain.
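The singular values quoted in Example 1 can be reproduced numerically (a minimal sketch using NumPy; `G3` is the 3 × 3 convolution matrix of the impulse response g_0 = 1, g_1 = 2):

```python
import numpy as np

# Numerical check of Example 1: G3 is the lower-triangular Toeplitz
# (convolution) matrix of the impulse response g0 = 1, g1 = 2.
G3 = np.array([[1.0, 0.0, 0.0],
               [2.0, 1.0, 0.0],
               [0.0, 2.0, 1.0]])

d = np.sort(np.linalg.svd(G3, compute_uv=False))  # ascending singular values
print(d)           # ≈ [0.19394, 1.90321, 2.70928]
print(np.prod(d))  # ≈ 1.0, consistent with |det G3| = 1
```

The uneven spread of the three values, despite their product being exactly 1, is the phenomenon discussed in the text.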
The above example suggests that the entropy gain from u_1^n to y_1^n appears as a combination of two factors. The first of these is the uneven way in which the random vector G_n u_1^n is distributed over R^n. The second factor is the alignment of the disturbance vector z_1^n with respect to the span of the subset {q_{n,i}}_{i∈Ω_n} of rows of Q_n associated with the smallest singular values of G_n, indexed by the elements in the set Ω_n. As we shall discuss in the next section, if G has m non-minimum phase zeros, then, as n increases, there will be m singular values of G_n going to zero exponentially. Since the product of the singular values of G_n equals 1 for all n, it follows that ∏_{i∉Ω_n} d_{n,i} must grow exponentially with n, where d_{n,i} is the i-th diagonal entry of D_n. This implies that G_n u_1^n expands with n along the span of {q_{n,i}}_{i∉Ω_n}, compensating its shrinkage along the span of {q_{n,i}}_{i∈Ω_n}, thus keeping h(G_n u_1^n) = h(u_1^n) for all n. Thus, as n grows, any small disturbance distributed over the span of {q_{n,i}}_{i∈Ω_n}, added to G_n u_1^n, will keep the support of the resulting distribution from shrinking along this subspace. Consequently, the expansion of G_n u_1^n with n along the span of {q_{n,i}}_{i∉Ω_n} is no longer compensated, yielding an entropy increase proportional to log(∏_{i∉Ω_n} d_{n,i}). The above analysis allows one to anticipate a situation in which no entropy gain would take place even when some singular values of G_n tend to zero as n → ∞. Since the increase in entropy is made possible by the fact that, as n grows, the support of the distribution of G_n u_1^n shrinks along the span of {q_{n,i}}_{i∈Ω_n}, no such entropy gain should arise if the support of the distribution of the input u_1^n expands accordingly along the directions pointed by the rows {r_{n,i}}_{i∈Ω_n} of R_n.
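The exponential shrinkage described above can be observed directly (a sketch for the filter of Example 1, which has one NMP zero of magnitude 2):

```python
import numpy as np

# Sketch: for g = [1, 2] (one NMP zero of magnitude 2), the smallest singular
# value of the n x n convolution matrix G_n decays like 2^{-n}, while the
# product of all singular values remains |det G_n| = 1.
def conv_matrix(g, n):
    # lower-triangular Toeplitz matrix of the impulse response g
    G = np.zeros((n, n))
    for k, gk in enumerate(g):
        G += gk * np.eye(n, k=-k)
    return G

ratios = []
for n in (5, 10, 20):
    d = np.linalg.svd(conv_matrix([1.0, 2.0], n), compute_uv=False)
    ratios.append(d.min() * 2.0 ** n)
    print(n, d.min(), abs(np.prod(d) - 1.0) < 1e-6)
print(ratios)  # d_min * 2^n stays bounded, i.e., d_min ~ 2^{-n}
```

Since the product of all singular values is pinned to 1, the remaining singular values must grow exponentially in compensation, which is exactly the uncompensated expansion exploited by an output disturbance.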
An example of such a situation can be easily constructed as follows: Let G(z) in Figure 1 have non-minimum phase zeros and suppose that u_1^∞ is generated as G^{-1} ũ_1^∞, where ũ_1^∞ is an i.i.d. random process with bounded entropy rate. Since the determinant of G_n^{-1} equals 1 for all n, we have that h(u_1^n) = h(ũ_1^n) for all n. On the other hand, G_n u_1^n = ũ_1^n, so the entropy rate of the output equals that of the input; thus, no entropy gain appears. The preceding discussion reveals that the entropy gain produced by G in the situation shown in Figure 1 depends on the distribution of the input and on the support and distribution of the disturbance. This stands in stark contrast with the well-known fact that the increase in differential entropy produced by an invertible linear operator depends only on its Jacobian, and not on the statistics of the input [2]. We have also seen that the distribution of a random process along the different directions within the Euclidean space which contains it plays a key role, as well. This motivates the need to specify a class of random processes which distribute more or less evenly over all directions. This is precisely the intuitive meaning of an entropy-balanced process.
The following section identifies a large family of processes belonging to this class, as well as two properties which greatly expand this family.

Characterization of Entropy-Balanced Processes
We have defined the notion of an "entropy-balanced" process in Section 1.1. In words, the first condition in this definition guarantees that the orthogonal projection of an entropy-balanced process onto any ν-dimensional linear subspace has a differential entropy whose magnitude remains bounded or grows at most sub-linearly with n. The second condition states that the projection of an entropy-balanced process v_1^∞ onto any linear subspace having ν fewer dimensions has the same differential entropy rate as the original process. This condition is equivalent to requiring that every unitary transformation on v_1^n yields a random sequence y_1^n such that lim_{n→∞} (1/n) h(y_{n−ν+1}^n | y_1^{n−ν}) = 0. This property of the resulting random sequence y_1^n means that one cannot predict its last ν samples with arbitrary accuracy by using its previous n − ν samples, even as n goes to infinity.
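The intuition can be sketched numerically for the simplest member of the class (all names below are illustrative, not from the paper): an i.i.d. Gaussian vector spreads its entropy evenly, so the differential entropy of its projection onto ν orthonormal directions does not depend on which directions are chosen.

```python
import numpy as np

# Sketch: for u ~ N(0, sigma2*I_n), the differential entropy of a projection
# onto nu orthonormal rows A is 0.5*log det(2*pi*e*sigma2*A A^T), which is
# the same for every choice of directions.
rng = np.random.default_rng(0)
n, nu, sigma2 = 8, 3, 1.0

def proj_entropy(A, sigma2):
    K = sigma2 * (A @ A.T)
    return 0.5 * np.log(np.linalg.det(2 * np.pi * np.e * K))

Q, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthonormal basis
h1 = proj_entropy(Q[:nu, :], sigma2)               # arbitrary nu directions
h2 = proj_entropy(np.eye(n)[:nu, :], sigma2)       # coordinate directions
print(h1, h2)  # equal
```

A process whose support shrinks exponentially along some directions (such as the output of an NMP filter) fails exactly this test.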
We now characterize a large family of entropy-balanced random processes and establish some of their properties. Although intuition may suggest that most random processes (such as i.i.d. or stationary processes) should be entropy balanced, that statement seems rather difficult to prove. In the following, we show that the entropy-balanced condition is met by i.i.d. processes with per-sample probability density function (PDF) being uniform, piece-wise constant or Gaussian. It is also shown that adding to an entropy-balanced process an independent random process yields another entropy-balanced process, and that filtering an entropy-balanced process by a stable and minimum-phase filter yields an entropy-balanced process, as well. The proofs can be found in Appendix B.

Lemma 2.
Let u_1^∞ be a random process with independent elements satisfying Condition i) in Definition 2, in which each u_i is distributed according to a (possibly different) piece-wise constant PDF such that each interval where this PDF is constant has measure less than θ and greater than ε, for some constants 0 < ε < θ < ∞. Then, u_1^∞ is entropy balanced.

Lemma 3. Let u_1^∞ be an entropy-balanced process and let v_1^∞ be a random process independent of u_1^∞. Then, u_1^∞ + v_1^∞ is also entropy balanced.
The proof of Lemma 3 is on page 33. The working behind this lemma can be interpreted intuitively by noting that adding to a random process another independent random process can only increase the "spread" of the distribution of the former, which tends to balance the entropy of the resulting process along all dimensions of the Euclidean space. In addition, it follows from Lemma 3 that all i.i.d. processes having a per-sample PDF which can be constructed by convolving uniform, piece-wise constant or Gaussian PDFs as many times as required are entropy balanced. It also implies that one can have non-stationary processes which are entropy balanced, since Lemma 3 imposes no requirements on the process v_1^∞. The next lemma shows that filtering by a stable and minimum-phase LTI filter preserves the entropy-balanced condition of its input.

Lemma 4. Let u_1^∞ be an entropy-balanced process and G an LTI stable and minimum-phase filter. Then, the output w_1^∞ ≜ G u_1^∞ is also an entropy-balanced process.
This result implies that any stable auto-regressive moving-average process constructed from entropy-balanced innovations is also entropy balanced, provided the coefficients of the averaging and regression correspond to a stable MP filter.
The last lemma of this section states a crucial property of entropy-balanced processes (the proof is in Appendix B, page 34).

Lemma 5. Let u_1^∞ be an entropy-balanced process. Consider a disturbance z_1^∞ satisfying Assumption 2 and define y_1^∞ ≜ u_1^∞ + z_1^∞. Then, the differential entropy rate of y_1^∞ equals that of u_1^∞.

We finish this section by pointing out two examples of processes which are not entropy balanced, namely the output of an NMP filter driven by an entropy-balanced input and the output of an unstable filter driven by an entropy-balanced input. The first of these cases plays a central role in the next section.

Entropy Gain Due to External Disturbances
In this section, we formalize the ideas which were qualitatively outlined in the previous section. Specifically, for the system shown in Figure 1, we will characterize the entropy gain G(G, x_0, u_1^∞, z_1^∞) defined in (1) for the case in which the initial state x_0 is zero (or deterministic) and there exists a random disturbance z_1^∞ (possibly of infinite length) which satisfies Assumption 2.

Input Disturbances Do Not Produce Entropy Gain
In this section, we show that random disturbances satisfying Assumption 2, when added to the input u_1^∞ (i.e., before G), do not introduce entropy gain. This result can be obtained from Lemma 5, as stated in the following theorem:

Theorem 5 (Input Disturbances do not Introduce Entropy Gain). Let G and z_1^∞ satisfy Assumptions 1 and 2, respectively. Suppose that u_1^∞ is entropy balanced and consider the output y_1^∞ = G(u_1^∞ + z_1^∞). Then, the differential entropy rate of y_1^∞ equals that of u_1^∞.

Proof. From Lemma 5, the differential entropy rate of u_1^∞ equals that of u_1^∞ + z_1^∞. The proof is completed by recalling that G yields no entropy gain for its input u_1^∞ + z_1^∞, because this corresponds to the noise-less scenario.

The Entropy Gain Introduced by Output Disturbances when G is MP is Zero
The results from the previous section yield the following corollary, which states that an LTI system with transfer function G(z) without zeros outside the unit circle (i.e., an MP transfer function) cannot introduce entropy gain.
Corollary 1 (Minimum Phase Filters do not Introduce Entropy Gain). Consider the system shown in Figure 1, wherein the input u_1^∞ is an entropy-balanced random process and the output disturbance z_1^∞ satisfies Assumption 2. Besides Assumption 1, suppose that G(z) is minimum phase. Then, the differential entropy rate of y_1^∞ equals that of u_1^∞.

Proof. Since G(z) is minimum phase and stable, the result follows directly from Lemmas 4 and 5.

The Entropy Gain Introduced by Output Disturbances when G(z) is NMP
We show here that the entropy gain of an LTI system with transfer function G(z) and an output disturbance is at most the sum of the logarithms of the magnitudes of the zeros of G(z) outside the unit circle.
The following lemma will be instrumental for that purpose.
Lemma 6. Consider the system in Figure 1, and suppose that z_1^∞ satisfies Assumption 2 and that the input process u_1^∞ is entropy balanced. Let G_n = Q_n^T D_n R_n be the SVD of G_n, where D_n = diag{d_{n,1}, . . . , d_{n,n}} holds the singular values of G_n, with d_{n,1} ≤ d_{n,2} ≤ · · · ≤ d_{n,n}, such that |det G_n| = 1 for all n. Let m be the number of these singular values which tend to zero exponentially as n → ∞. Then,

lim_{n→∞} (1/n)(h(y_1^n) − h(u_1^n)) = lim_{n→∞} (1/n)( h([D_n]_1^m R_n u_1^n + [Q_n]_1^m z_1^n) − ∑_{i=1}^m log d_{n,i} ).   (64)

The proof of this lemma can be found on page 34, in Appendix B. Lemma 6 leaves the need to characterize the asymptotic behavior of the singular values of G_n. This is accomplished in the following lemma, which relates these singular values to the zeros of G(z). It is a generalization of the unnumbered lemma in the proof of Reference [16], Theorem 1 (restated in Appendix C as Lemma A3), which holds for FIR transfer functions, to the case of infinite-impulse-response (IIR) transfer functions (i.e., transfer functions having poles).
Lemma 7. The m smallest singular values of G_n decay exponentially with n; specifically, d_{n,l} = α_{n,l} |ρ_l|^{−n} for l ∈ {1, . . . , m}, where the elements in the sequence {α_{n,l}} are positive and increase or decrease at most polynomially with n.
(The proof of this lemma can be found in Appendix B, page 36.) Lemma 6 also precisely formulates the geometric idea outlined in Section 5.1. To see this, notice that no entropy gain is obtained if the output disturbance vector z_1^n becomes orthogonal (with probability 1) to the space spanned by the first m rows of Q_n sufficiently fast as n → ∞. Recalling from Assumption 2 that z_1^∞ = Φs, where the matrix Φ has κ orthonormal columns of infinite length, such an orthogonality condition can be formally stated by defining κ_n as the dimension of the projection of the support of z_1^n onto the span of the first m rows of Q_n. If κ_n = 0 for all n, then the disturbance is not able to fill the subspace along which G_n u_1^n is shrinking exponentially. Indeed, in that case h([D_n]_1^m R_n u_1^n + [Q_n]_1^m z_1^n) = ∑_{i=1}^m log d_{n,i} + h([R_n]_1^m u_1^n); the latter sum cancels out the one on the RHS of (64), while lim_{n→∞} (1/n) h([R_n]_1^m u_1^n) = 0 since u_1^∞ is entropy balanced. On the contrary (and loosely speaking), if the projection of the support of z_1^n onto the subspace spanned by the first m rows of Q_n is of dimension m (i.e., if κ_n = m) for all n, then h([D_n]_1^m R_n u_1^n + [Q_n]_1^m z_1^n) remains bounded for all n, and the limit lim_{n→∞} (1/n)(−∑_{i=1}^m log d_{n,i}) on the RHS of (64) yields the largest possible entropy gain. Notice that −∑_{i=1}^m log d_{n,i} = ∑_{i=m+1}^n log d_{n,i} (because det(G_n) = 1); thus, this entropy gain stems from the uncompensated expansion of G_n u_1^n along the space spanned by the rows of [Q_n]_{m+1}^n. Beyond these extreme cases (i.e., for general values of κ̲ and κ̄), the following theorem provides tight bounds on the entropy gain.

Theorem 6. In the system of Figure 1, suppose that u_1^∞ is entropy balanced and that G(z) and z_1^∞ satisfy Assumptions 1 and 2, respectively, where the zeros {ρ_i}_{i=1}^p of G(z) satisfy |ρ_1| ≥ · · · ≥ |ρ_m| > 1 ≥ |ρ_{m+1}| ≥ · · · ≥ |ρ_p|. For each n ∈ N, let Q_n^T ∈ R^{n×n} be the unitary matrix holding the left singular vectors of G_n ∈ R^{n×n} (as in Lemma 6), where G_n is as defined in (4).

1.
Then, 0 ≤ lim_{n→∞} (1/n)(h(y_1^n) − h(u_1^n)) ≤ ∑_{i=1}^m log |ρ_i|. The bounds on both extremes are tight. Moreover, the lower bound is reached if κ̄_∞ = 0.
Proof. See Appendix B, page 37.
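The extreme case κ_n = m can be checked in closed form in the Gaussian case (a sketch with hypothetical parameters; the formula below follows from the matrix determinant lemma, since the singular values of G_n multiply to 1):

```python
import numpy as np

# Sketch (Gaussian case, hypothetical parameters): with u ~ N(0, I_n) and a
# rank-one Gaussian output disturbance of std. dev. delta aligned with the
# weakest left singular direction of G_n, the entropy gain is exactly
#   h(y_1^n) - h(u_1^n) = 0.5 * log(1 + delta^2 / d_{n,1}^2),
# because det(G_n G_n^T) = 1. For G(z) = 1 + 2 z^{-1}, d_{n,1} ~ c * 2^{-n},
# so the per-sample gain tends to log 2, the log-magnitude of the NMP zero.
def d_min(n):
    # smallest singular value of G_n, computed stably as 1/sigma_max(G_n^{-1});
    # G_n^{-1} is lower-triangular Toeplitz with entries (-2)^(i-j)
    i = np.arange(n)
    Ginv = np.tril((-2.0) ** (i[:, None] - i[None, :]))
    return 1.0 / np.linalg.norm(Ginv, 2)

delta = 1e-3  # disturbance standard deviation: arbitrarily small
vals = []
for n in (10, 40, 160):
    gain = 0.5 * np.log1p((delta / d_min(n)) ** 2)
    vals.append(gain / n)
print(vals)  # increases towards log(2) ≈ 0.693
```

Note that delta only shifts the transient: for any fixed delta > 0, the per-sample gain still converges to the upper bound of Theorem 6.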
The next technical result is very useful for finding conditions under which the requirements of point 2 in Theorem 6 are satisfied (the proof is in Appendix B, page 39).

Lemma 8.
Let F be an FIR LTI causal system of order m such that the m zeros of F(z) are NMP, and let F_n = Q_n^T D_n R_n be an SVD of F_n, for every n ∈ {m, m + 1, . . .}. For each κ ∈ {1, . . . , n}, define κ_n as above and κ̄ ≜ min{m, κ}. Then, lim_{n→∞} κ_n = κ̄.

Now, we can prove Theorem 4.

Proof of Theorem 4
Factorize G(z) as G(z) = F(z)G̃(z), where G̃(z) is stable and minimum phase and F(z) is a stable FIR transfer function containing all the m non-minimum-phase zeros of G(z). Letting ũ_1^n ≜ G̃_n u_1^n, we have that h(y_1^n) = h(F_n ũ_1^n + z_1^n), that h(ũ_1^n) = h(u_1^n), and that ũ_1^∞ is entropy balanced (from Lemma 4). This means that the entropy gain of G due to the output disturbance z_1^∞ corresponds to the entropy gain of F due to the same output disturbance.

Entropy Gain Due to a Random Initial State
Here, we analyze the scenario illustrated by Figure 1 for the case in which there exists a random initial state x 0 independent of the input u ∞ 1 , and zero (or deterministic) output disturbance.
The treatment of an initial state of the LTI system G requires one first to define an internal model for it. For this purpose, in this section, we consider the state-space realization of G in the Kalman canonical form, given by (76) (see, e.g., Reference [21] or Reference [22], Chapter 6), where the column state vectors x_co(k), x_c̄o(k), x_cō(k), x_c̄ō(k) are, respectively, controllable and observable, non-controllable and observable, controllable and non-observable, and non-controllable and non-observable. There is no loss of generality in choosing this state-space representation, because every state-space representation consistent with a rational transfer function G(z) can be written in this form (Reference [22], Theorem 6.7). Since our interest is in the effect of the random initial state of G on its output, we only need to consider the observable subsystem within (76) and without its input, where ỹ is the natural response of G to its initial state x_o(0), and x_co ∈ R^p and x_c̄o ∈ R^q. We shall decompose ỹ as ỹ = ỹ_co + ỹ_c̄o, where ỹ_c̄o and ỹ_co are the natural responses of G to the initial states [0_{1×p} x_c̄o(0)^T]^T and [x_co(0)^T 0_{1×q}]^T, respectively. The natural-response component ỹ_co can be generated by a minimal state-space representation of G(z), without the effect of its input u. Now, we can state and prove the main result of this section:

Theorem 7. Suppose G satisfies Assumption 1 and u_1^∞ is entropy balanced. Assume that x_o(0) (the observable part of the initial state of G) is independent of the input u_1^∞, that |h(x_o(0))| < ∞, and that tr{K_{x_o(0)}} < ∞. Then, the entropy gain equals ∑_{i=1}^m log |ρ_i|.

Proof. Both G and u_1^∞ satisfy the conditions of Theorem 6. Thus, as in its statement, we write G(z) = F(z)G̃(z), where G̃(z) is stable and minimum phase and F(z) is a stable FIR transfer function with only the m non-minimum-phase zeros of G(z).
Defining w_1^n ≜ G̃_n u_1^n, we have y_1^n = F_n w_1^n + ỹ_1^n, with ỹ_1^n ⊥⊥ w_1^n. In addition, the fact that G is stable guarantees that the sample second moment of ỹ_1^∞ decays exponentially, which means that ỹ_1^∞ satisfies Assumption 2. Thus, the conditions of Lemma 6 are met considering G_n = F_n, where now F_n = Q_n^T D_n R_n is the SVD of F_n, with d_{n,1} ≤ d_{n,2} ≤ · · · ≤ d_{n,n}. Consequently, the proof would be completed by showing that the corresponding limit yields the claimed gain. Recalling (78), let us decompose [ỹ_co]_1^n as the sum of F_n P̃_n x_co(0) and P_n x_co(0), where P̃_n, P_n ∈ R^{n×(p+q)}; these two sequences are, respectively, the natural responses of G̃ and F to the controllable and observable initial state x_co(0), while [ỹ_c̄o]_1^n is the natural response of G to the non-controllable and observable initial state x_c̄o(0). Then, a chain of inequalities follows in which (a) is due to the entropy-power inequality [1] and (b) holds because conditioning does not increase entropy and [ỹ_c̄o]_1^n is a deterministic function of x_c̄o(0). Let the SVD of [Q_n]_1^m (F_n P̃_n + P_n) be S_n T_n H_n, where S_n ∈ R^{m×m} is unitary, T_n = diag{t_1, t_2, . . . , t_m} holds the singular values of [Q_n]_1^m (F_n P̃_n + P_n), and H_n ∈ R^{m×p} has orthonormal rows. Substituting this SVD into (87), we obtain a differential entropy which is bounded because |h(x_o)| < ∞ and tr{K_{x_o}} < ∞, which implies (thanks to Proposition A1) that |h(H_n x_co, x_c̄o)| < ∞ and, by the chain rule of entropy, that |h(H_n x_co(0) | x_c̄o(0))| < ∞, because |h(x_c̄o(0))| < ∞ (again from Proposition A1). Thus, in view of (89) and (84), all that remains to be proved is that (92) and (93) hold. To prove (93), recall that the entries in the diagonal matrix [D_n]_1^m decay exponentially with n. On the other hand, the rows of [R_n]_1^m are orthonormal. Finally, the fact that G̃ is stable implies that the p + q columns of P̃_n have norms which are bounded for all n. These three observations readily yield that (93) holds.
To prove that (92) holds, write the rational transfer function of G̃ (described by (80)) accordingly, with m̃ ≜ p − m. The coefficients in the numerator of G(z) are related to those of F(z) and G̃(z) by the convolution g_k = ∑_{i=0}^k f_i ã_{k−i}, where ã_0 = f_0 = 1. Denote the natural response of F (up to time n) to its initial state x_F(0) (which is a linear function of x_co(0)) as ÿ_1^n ≜ P_n x_co(0).
Let w̃_1^n ≜ P̃_n x_co(0) be the natural response of G̃ to its initial state x_co(0). Following the structure of (80), w̃(k) can be written in terms of the state x_co, which satisfies (79). Considering a minimal state-space representation of F, it can be seen that the natural response of F to its own initial state x_F(0) can be written as ÿ(k) = ỹ_co(k) − w̃(k) − "the effect of f_1, . . . , f_{k−1}".
Theorem 7 allows us to formalize the effect that the presence or absence of a random initial state has on the entropy gain using arguments similar to those utilized in Section 6.

Some Implications
The purpose of this section is to illustrate how the results obtained in the previous section can be applied to other problems. To do so, we present next some of the implications of these results on three different problems previously addressed in the literature, namely finding the rate-distortion function for non-stationary processes, an inequality in networked control theory, and the feedback capacity of Gaussian stationary channels. The common feature in these three problems is that, in all of them, non-minimum phase transfer functions play a role (either explicitly or implicitly).

Networked Control
The analysis developed in Reference [13] considers an LTI system P within a noisy feedback loop, as the one depicted in Figure 4. In this scheme, C represents a causal feedback channel which combines the output of P with an exogenous (noise) random process c ∞ 1 to generate its output. The process c ∞ 1 is assumed independent of the initial state of P, represented by the random vector x 0 , which has finite differential entropy. Figure 4. (Left): LTI system P within a noisy feedback loop. (Right): equivalent system when the feedback channel is noiseless and has unit gain.

For this system, it is shown in Reference [13], Theorem 4.2, that the entropy rate of y_1^∞ exceeds that of u_1^∞ by at least lim_{n→∞} (1/n) I(x_0; y_1^n), where I(x_0; y_1^n) is the mutual information (see Reference [1], Section 8.5) between x_0 and y_1^n, with equality if w is a deterministic function of v. Furthermore, it is shown in Reference [12], Lemma 3.2, that, if |h(x_0)| < ∞ and the steady-state variance of system P remains asymptotically bounded as k → ∞, then lim_{n→∞} (1/n) I(x_0; y_1^n) ≥ ∑_i log max{1, |p_i|}, where {p_i} are the poles of P. Thus, for the (simplest) case in which w = v, the output y_1^∞ is the result of filtering u_1^∞ by the filter G = 1/(1 − P) (as shown in Figure 4, right), and the resulting entropy rate of y_1^∞ will exceed that of u_1^∞ only if there is a random initial state with bounded differential entropy (see (114a)). Moreover, if w = v and G(z) is stable, (114) (as well as Reference [13], Lemma 4.3) implies that this entropy gain is lower bounded by the right-hand side (RHS) of (8), which is greater than zero if and only if G is NMP. However, neither [12] nor [13] provides conditions under which this lower bound is reached.
In Reference [14], Theorem 14, it is shown that, when there is perfect feedback (i.e., when v = w), as in Figure 4 (right), with P being the concatenation of a stabilizing LTI controller and an LTI plant, and assuming that u_1^∞ is Gaussian i.i.d. and the initial state is Gaussian, then (115) holds. Notice that this implies reaching equality in both (114a) and (114b). By using the results obtained in Section 7, we show next that equality holds in (114b) provided the feedback channel satisfies the following assumption:

Assumption 3. The feedback channel in Figure 4 can be written in terms of two transfer functions A and B and the exogenous process c, where:

1.
A and B are stable rational transfer functions such that AB is biproper, ABP has the same unstable poles as P, and the feedback AB stabilizes the plant P. 2.
We also extend Reference [14], Theorem 14, to situations including a feedback channel satisfying Assumption 3. For the perfect-feedback case, this extends the validity of (115) to a much larger class of distributions for u_1^∞. An illustration of the class of feedback channels satisfying this assumption is depicted at the top of Figure 5. Trivial examples of channels satisfying Assumption 3 are a Gaussian additive channel preceded and followed by linear operators [23]. Indeed, when F is an LTI system with a strictly causal transfer function, the feedback channel that satisfies Assumption 3 is widely known as a noise shaper with input pre- and post-filters, used in, e.g., References [24–27].

Theorem 8. In the networked control system of Figure 4, suppose that the feedback channel satisfies Assumption 3, that the plant P(z) has poles {p_i}_{i=1}^p, and that the input u_1^∞ is entropy balanced. If the random initial states of AB and P, namely s_0 ∈ R^q and x_0 ∈ R^p, respectively, are independent, have finite variance, and |h(x_0)| < ∞, then equality holds in (114b).

Proof. Let P(z) = N(z)/G(z) and T(z) ≜ A(z)B(z) = Γ(z)/Θ(z).
We will first show that the output y_1^n can be written as in (118), where G̃ is the stable LTI system with biproper and MP transfer function G̃(z), with s_0 ∈ R^q, x_0 ∈ R^p, and [x_0^T s_0^T]^T being the random initial states of T, G, and G̃, respectively, and ũ ≜ u + Bc (120) (see Figure 5, bottom). The matrices satisfy P̃_n ∈ R^{n×(p+q)} and P_n ∈ R^{n×p}. From Figure 5, it is clear that the transfer function from ũ to y is G̃(z), validating the first term on the RHS of (118). In addition, it is evident that the initial state of G̃ is a linear combination of x_0 and s_0, justifying the term P̃_n [x_0^T s_0^T]^T as the natural response of G̃. Thus, it is only left to prove that the initial state of G is x_0. For that purpose, let G(z) = 1 − ∑_{i=1}^p g_i z^{−i} and N(z) = ∑_{i=1}^p n_i z^{−i}, and define the corresponding internal variables. Then, the recursion corresponding to P(z) reveals that the initial state of P(z) corresponds to x_0. But, from (121), o is also the output of G̃ to the input ũ, which means that the initial state of G is x_0. Now, using (118), we have a chain of equalities in which the first holds because s_0 ⊥⊥ x_0, with ū_1^n ≜ G̃_n ũ_1^n + P̃_n s_0, and the last holds since the first sample of the unit-impulse response of G̃ is 1. Since u_1^∞ is entropy balanced, G̃(z) is biproper, stable, and MP, and both c̃_1^∞ and P̃_n s_0 have finite variance, it follows from Lemmas 3 and 4 that ū_1^∞ is entropy balanced, as well. Thus, the proof of the first claim is completed by direct application of Theorem 7.
For the second claim, we have a first chain of equalities in which (a) holds because the first sample of the unit-impulse response of G̃ is g̃_0 = lim_{z→∞} G̃(z) = 1. Then, in the subsequent chain, (a) holds because G̃ũ is entropy balanced (from Lemma 4) and P̃_n s_0 has finite variance, allowing us to apply Proposition A3, while (b) follows from (128), revealing that the gap in (114a) is exactly h(c_1^∞). In addition, in the perfect-feedback scenario, Theorem 8 extends the validity of (115) from the Gaussian i.i.d. u and Gaussian x_0 considered in Reference [14], Theorem 14, to an entropy-balanced u and an x_0 with finite variance and finite differential entropy.

Rate Distortion Function for Non-Stationary Processes
In this section, we obtain a simpler proof of a result by Gray, Hashimoto and Arimoto [15][16][17], which compares the rate distortion function (RDF) of a non-stationary auto-regressive Gaussian process x ∞ 1 (of a certain class to be defined shortly) to that of a corresponding stationary version, under MSE distortion. Our proof is based upon the ideas developed in the previous sections, and extends the class of non-stationary sources for which the results in Reference [15][16][17] are valid.
(A block diagram associated with the construction of x is presented in Figure 6.) Figure 6. Block-diagram representation of how the non-stationary source x_1^∞ is built and then reconstructed as y = x + u.
Define the rate-distortion functions for these two sources in the standard manner, where, for each n, the minima are taken over all the conditional probability density functions f_{u_1^n | x_1^n} and f_{ũ_1^n | x̃_1^n} yielding E[‖u_1^n‖²]/n ≤ D and E[‖ũ_1^n‖²]/n ≤ D, respectively. The above rate-distortion functions have been characterized in References [15–17] for the case in which w_1^∞ is an i.i.d. Gaussian process. In particular, it is explicitly stated in References [16,17] that, for that case, (142) holds. We will next provide an alternative and simpler proof of this result and extend its validity to general (not-necessarily stationary) Gaussian w_1^∞, using the entropy-gain properties of non-minimum phase filters established in Section 6. Indeed, the approach in References [15–17] is based upon asymptotically-equivalent Toeplitz matrices in terms of the signals' covariance matrices. This restricts w_1^∞ to be Gaussian and i.i.d. and A(z) to be an all-pole unstable transfer function, so that the only non-stationarity allowed is that arising from unstable poles. For instance, a cyclo-stationary innovation followed by an unstable filter A(z) would yield a source which cannot be treated using Gray and Hashimoto's approach. By contrast, the reasoning behind our proof lets w_1^∞ be any entropy-balanced Gaussian process with bounded differential entropy rate, and then lets the source be A w, with A(z) having unstable poles (and possibly zeros and stable poles, as well).
The statement is as follows: Theorem 9. Let w_1^∞ be any Gaussian entropy-balanced process with bounded differential entropy rate, and let x_1^∞ and x̃_1^∞ be as defined in (138) and (139), respectively. Then, (142) holds.
Thanks to the ideas developed in the previous sections, it is possible to give an intuitive outline of the proof of this theorem (given in Appendix B, page 40) by using a sequence of block diagrams. More precisely, consider the diagrams shown in Figure 7. In the top diagram in this figure, suppose that y = C x + u realizes the RDF for the non-stationary source x. The sequence u is independent of x, and the linear filter C(z) is such that the error (y − x) ⊥ ⊥ y (a necessary condition for minimum MSE optimality). The filter B(z) is the Blaschke product of A(z) (see (A83) in Appendix B) (a stable, NMP filter with unit frequency response magnitude such thatx = B x).
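The defining property of the Blaschke product B(z) can be verified numerically for a single real zero (a sketch; `rho` is a hypothetical NMP zero location):

```python
import numpy as np

# Sketch: for a real zero rho with |rho| > 1, the Blaschke factor
#   B(z) = (z - rho) / (rho*z - 1)
# is stable (pole at 1/rho, inside the unit circle), NMP (zero at rho), and
# all-pass: |B(e^{jw})| = 1 for every frequency w, as required of B(z) above.
rho = 2.0
w = np.linspace(-np.pi, np.pi, 1001)
z = np.exp(1j * w)
B = (z - rho) / (rho * z - 1)
print(np.abs(B).min(), np.abs(B).max())  # both equal to 1
```

Because |B(e^{jω})| = 1, filtering by B(z) changes neither the power spectral density nor the MSE distortion, which is what allows the NMP source to be traded for a stationary one in the argument above.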
If one moves the filter B(z) towards the source, then the middle diagram in Figure 7 is obtained. By doing this, the stationary source x̃ appears with an additive error signal ũ that has the same asymptotic variance as u, reconstructed as ỹ = Cx̃ + ũ. From the invertibility of B(z), it also follows that the mutual information rate between x̃ and ỹ equals that between x and y. Thus, the channel ỹ = Cx̃ + ũ has the same rate and distortion as the channel y = Cx + u.
However, if one now adds a short disturbance d to the error signal ũ (as depicted in the bottom diagram of Figure 7), then the resulting additive error term ū = ũ + d will be independent of x̃ and will have the same asymptotic variance as ũ. Nonetheless, the differential entropy rate of ū will exceed that of ũ by the RHS of (142). This makes the mutual information rate between x̃ and ȳ less than that between x̃ and ỹ by the same amount. Hence, R_x̃(D) is at most R_x(D) minus the RHS of (142).

The Feedback Channel Capacity of (Non-White) Gaussian Channels
Consider a non-white additive Gaussian channel of the form y_k = x_k + z_k, where the input x is subject to the average power constraint lim sup_{n→∞} (1/n) ∑_{k=1}^n E[x_k²] ≤ P, and z_1^∞ is a stationary Gaussian process. The feedback information capacity of this channel is realized by a Gaussian input x and is given by C_FB = lim_{n→∞} max (1/(2n)) log(det K_{y_1^n} / det K_{z_1^n}), where K_{x_1^n} is the covariance matrix of x_1^n and, for every k ∈ N, the input x_k is allowed to depend upon the channel outputs y_1^{k−1} (since there exists a causal, noise-less feedback channel with one-step delay).
In Reference [9], it was shown that, if z is an auto-regressive moving-average process of M-th order, then C_FB can be achieved by the scheme shown in Figure 8. In this system, B is a strictly causal and stable finite-order filter and v_1^∞ is Gaussian, with v_k = 0 for all k > M and v_1^M having a positive-definite covariance matrix K_{v_1^M}.
Figure 8. Block-diagram representation of a non-white Gaussian channel y = x + z and the coding scheme considered in Reference [9].
Here, we use the ideas developed in Section 6 to show that the information rate achieved by the capacity-achieving scheme proposed in Reference [9] drops to zero if there exists any additive disturbance of length at least M and finite differential entropy affecting the output, no matter how small.
To see this, notice that, in this case and for all n > M, I(v_1^M; y_1^n) = h((I_n + B_n) z_1^n + v_1^n) − h((I_n + B_n) z_1^n) = h((I_n + B_n) z_1^n + v_1^n) − h(z_1^n), since det(I_n + B_n) = 1. From Theorem 4, this gap between differential entropies is precisely the entropy gain introduced by I_n + B_n to an input z_1^n when the output is affected by the disturbance v_1^M. Thus, from Theorem 4, the capacity of this scheme corresponds to (1/2π) ∫_{−π}^{π} log|1 + B(e^{jω})| dω = ∑_i log|ζ_i|, where {ζ_i} are the non-minimum-phase zeros of 1 + B(z), which is precisely the result stated in Reference [9], Theorem 4.1.
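The fact that det(I_n + B_n) = 1 is immediate from the structure of B_n (a sketch; the taps below are hypothetical):

```python
import numpy as np

# Sketch: B(z) strictly causal means its n x n convolution matrix B_n is
# strictly lower triangular, so I_n + B_n is unit lower triangular and
# det(I_n + B_n) = 1: the map z -> (I_n + B_n) z cannot change differential
# entropy by itself. (The taps below are hypothetical.)
n = 6
b = [0.8, -0.3]  # taps b_1, b_2 of a strictly causal B(z) (b_0 = 0)
Bn = np.zeros((n, n))
for k, bk in enumerate(b, start=1):
    Bn += bk * np.eye(n, k=-k)
print(np.linalg.det(np.eye(n) + Bn))  # 1.0 (up to rounding)
```

All of the information rate of the scheme therefore comes from the entropy gain produced by the short disturbance v_1^M, which is exactly what makes it fragile to any additional output disturbance.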
However, if the output is now affected by an additive disturbance d_1^∞ not passing through B(z), such that d_k = 0 for all k > M and |h(d_1^M)| < ∞, then we will have lim_{n→∞} (1/n) (h((I_n + B_n) z_1^n + v_1^n + d_1^n) − h((I_n + B_n) z_1^n + d_1^n)) = 0, which follows directly from applying Theorem 4 to each of the differential entropies. Notice that this result holds irrespective of how small the power of the disturbance may be.
Thus, the capacity-achieving scheme proposed in Reference [9] (and further studied in Reference [28]), although of groundbreaking theoretical importance, would yield zero rate in any practical situation, since in every physically implemented scheme, signals are unavoidably affected by some amount of noise.

Conclusions
We have provided an intuitive explanation and a rigorous characterization of the entropy gain of a linear time-invariant (LTI) system, defined as the difference between the differential entropy rates of its output and input random signals. The continuous-time version of this problem, considered by Shannon in Theorem 14 of his 1948 landmark paper, involves an LTI system G_c band-limited to B [Hz]. For this scenario, we restricted our attention to systems for which the samples of the unit-impulse response, taken (2B)^{−1} seconds apart, correspond to the unit-impulse response g_0, g_1, … of a causal and stable discrete-time system G. We showed that the entropy gain in this case is log|g_0|, which implies that, for this class of systems, Shannon's Theorem 14 holds if and only if G_c has a corresponding discrete-time G that is minimum phase (MP).
For the discrete-time case, we introduced a new notion referred to as the effective differential entropy, which quantifies the amount of uncertainty in vector signals that are confined to subspaces of lower dimensionality than that of the signals themselves. (This is not possible with the conventional notion of differential entropy, which simply diverges to minus infinity.) It turns out that the difference in effective differential entropy rate between an n-length input to an LTI discrete-time system with frequency response G(e^{jω}) and its full-length output, as n tends to infinity, equals (1/2π) ∫_{−π}^{π} log|G(e^{jω})| dω. When comparing input and output sequences of equal length, our analysis revealed that, in the absence of external random disturbances, the entropy gain of a discrete-time LTI system G with unit-impulse response g_0, g_1, … is simply log|g_0|. An entropy gain greater than log|g_0| can be obtained only if a random signal is added to the output of G and if such output process has statistical properties that make it susceptible to the added random signal. In order to characterize the role of G, its input has been assumed to be entropy balanced (EB), a notion introduced herein. Crucially, the differential entropy rate of an EB process is not altered by such added random signals. EB processes constitute a large family that includes Gaussian processes with bounded, non-vanishing variance. We also showed that (i) the sum of an EB process and any bounded-variance process is EB, too, and (ii) passing an EB process through a stable MP filter yields an EB process. When the input is EB, we showed that if G has NMP zeros ρ_1, ρ_2, …, ρ_m, then the largest possible entropy gain is log|g_0| + Σ_{i=1}^{m} log|ρ_i|, which equals (1/2π) ∫_{−π}^{π} log|G(e^{jω})| dω. This upper bound is achieved by adding a finite-length output disturbance with finite variance and bounded differential entropy if and only if its length is at least m, no matter how tiny its variance may be.
The same entropy gain is also obtained if G has a random initial state with bounded differential entropy and finite variance.
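The identity underlying this characterization, namely that (1/2π) ∫ log|G(e^{jω})| dω decomposes as log|g_0| plus the log-magnitudes of the NMP zeros (Jensen's formula), can be checked numerically. The FIR filter below is purely illustrative: G(z) = 1 − 2.5 z^{−1} + z^{−2} has zeros at 2 (non-minimum phase) and 0.5, so both quantities should equal log 2.

```python
import numpy as np

g = np.array([1.0, -2.5, 1.0])  # illustrative FIR filter G(z) = 1 - 2.5 z^-1 + z^-2
zeros = np.roots(g)             # zeros of G in the z-plane: 2.0 and 0.5
nmp_gain = np.log(abs(g[0])) + sum(np.log(abs(r)) for r in zeros if abs(r) > 1)

# Numerically evaluate (1/2π) ∫ log|G(e^{jω})| dω on a fine frequency grid.
w = np.linspace(-np.pi, np.pi, 2**14, endpoint=False)
Gw = np.polyval(g, np.exp(1j * w)) / np.exp(1j * w * (len(g) - 1))  # G(e^{jω})
integral = np.mean(np.log(np.abs(Gw)))

print(nmp_gain, integral)  # both ≈ log 2 ≈ 0.6931
```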
We used these fundamental insights about when the entropy gain arises in order to establish a new and more general proof of the quadratic rate-distortion function for non-stationary Gaussian sources. Moreover, we demonstrated that the information rate of the capacity-achieving scheme proposed in Reference [9] for the auto-regressive Gaussian channel with feedback drops to zero in the presence of any additive disturbance of sufficient (finite) length in the channel input or output, no matter how small it may be. This has crucial implications in any physical setup, where noise is unavoidable.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Proof of Theorem 3
The total length of the output y will grow with the length n of the input if G is FIR, and will be infinite if G is IIR. Letting η + 1 be the length of the impulse response of G in the FIR case, we define the output-length function ℓ(n) as the length of y when the input is u_1^n. It is also convenient to define the sequence of matrices {Ḡ_n}_{n=1}^∞, where Ḡ_n ∈ R^{ℓ(n)×n} is Toeplitz with [Ḡ_n]_{i,j} = 0 for all i < j and [Ḡ_n]_{i,j} = g_{i−j} for all i ≥ j. This allows one to write the entire output of a causal LTI filter G with impulse response {g_k}_{k=0}^{η} to an input u_1^n as y_1^{ℓ(n)} = Ḡ_n u_1^n. Let the SVD of Ḡ_n be Ḡ_n = Q̄_n^T D̄_n R̄_n, where Q̄_n ∈ R^{n×ℓ(n)} has orthonormal rows, D̄_n ∈ R^{n×n} is diagonal with positive elements, and R̄_n ∈ R^{n×n} is unitary.
The effective differential entropy of y_1^{ℓ(n)}(u_1^n) exceeds the differential entropy of u_1^n by log det D̄_n (see (A3)). Since R̄_n is unitary, it follows that det D̄_n² = det(Ḡ_n^T Ḡ_n), which from (A3) means that this excess equals (1/2) log det(Ḡ_n^T Ḡ_n). The product Ḡ_n^T Ḡ_n is a symmetric Toeplitz matrix, with its first column, [h_0 h_1 ⋯ h_{n−1}]^T, given by h_i = Σ_{k=0}^{n} g_k g_{k−i}. Thus, the sequence {h_i}_{i=0}^{n−1} corresponds to samples 0 to n − 1 of those resulting from the complete convolution g ∗ g⁻, even when the filter G is IIR, where g⁻ denotes the time-reversed (possibly infinitely long) response g. Consequently, and since G(z) has no zeros on the unit circle and g is absolutely summable, we can use Grenander and Szegő's theorem [29] and Reference [18], Theorem 4.2, to obtain (A6). In order to finish the proof, we divide (A5) by n, take the limit as n → ∞, and replace (A6) in the latter.
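Two facts from this proof can be checked numerically on a small example: that det D̄_n² = det(Ḡ_n^T Ḡ_n), and that (1/n) Σ log d̄_i converges to the Szegő limit (1/2π) ∫ log|G(e^{jω})| dω. The FIR coefficients below are illustrative (the filter has an NMP zero at z = 2, so the limit is log 2).

```python
import numpy as np

g = np.array([1.0, -2.5, 1.0])  # illustrative FIR impulse response, NMP zero at z = 2
n, eta = 400, len(g) - 1

# Full (n+eta) x n Toeplitz convolution matrix mapping u_1^n to the complete output.
Gbar = np.zeros((n + eta, n))
for k, gk in enumerate(g):
    Gbar += gk * np.eye(n + eta, n, k=-k)

s = np.linalg.svd(Gbar, compute_uv=False)        # singular values: diagonal of D̄_n
sign, logdet = np.linalg.slogdet(Gbar.T @ Gbar)  # log det(Ḡ_nᵀ Ḡ_n)

print(2 * np.sum(np.log(s)) - logdet)  # ≈ 0: det D̄_n² = det(Ḡ_nᵀ Ḡ_n)
print(np.sum(np.log(s)) / n)           # ≈ log 2 ≈ 0.693 (Szegő limit)
```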
Proof of Lemma 2. Let {b_{i,ℓ}}_{ℓ=1}^∞ be the intervals (bins) in R where the sample u(i) has constant PDF. Define the discrete random process c_1^∞, where c(i) = ℓ if and only if u(i) ∈ b_{i,ℓ}. Let y_{ν+1}^n ≜ Φ_n u_1^n, where Φ_n ∈ R^{(n−ν)×n} has orthonormal rows. Then, we obtain (A9), where the inequality is due to the fact that c_1^n and y_{ν+1}^n are deterministic functions of u_1^n; hence, c_1^n ↔ u_1^n ↔ y_{ν+1}^n forms a Markov chain. Subtracting h(u_1^n) from (A9), we obtain the desired bound, where the last equality follows from Lemma A1 (in Appendix C), whose conditions are met because, given c_1^n, the sequence u_1^n has independent entries, each of them distributed uniformly over a possibly different interval with finite and positive measure. The opposite inequality is obtained by following the same steps as in the proof of Lemma A1, from (A124) onwards, which completes the proof.
Proof of Lemma 3. Let y_1^n ≜ [Ψ_n^T | Φ_n^T]^T w_1^n, where [Ψ_n^T | Φ_n^T]^T ∈ R^{n×n} is a unitary matrix and where Ψ_n ∈ R^{ν×n} and Φ_n ∈ R^{(n−ν)×n} have orthonormal rows. Then, we obtain (A14). We can lower bound h(y_1^ν | y_{ν+1}^n) as follows: (a) holds because conditioning does not increase entropy, (b) is from the fact that u_1^n ⊥⊥ v_1^n, and (c) follows from the chain rule of entropy. Substituting this result into (A14), dividing by n, taking the limit as n → ∞, and recalling that u_1^∞ is entropy balanced, we conclude that lim_{n→∞} (1/n)(h(Φ_n w_1^n) − h(w_1^n)) ≤ 0. The opposite bound on h(y_1^ν | y_{ν+1}^n) can be obtained from h(y_1^ν | y_{ν+1}^n) ≤ h(y_1^ν) ≤ h(Ψ_n (w_G)_1^n), where (w_G)_1^n is a jointly Gaussian sequence with the same second-order moments as w_1^n. Therefore, h(Ψ_n (w_G)_1^n) = (1/2) log((2πe)^ν det(Ψ_n K_{w_1^n} Ψ_n^T)) ≤ (ν/2) log(2πe λ_max(K_{w_1^n})). But w_1^n satisfies the assumptions of Proposition A2; thus, lim_{n→∞} n^{−1} log(λ_max(K_{w_1^n})) = 0. Therefore, lim_{n→∞} n^{−1} h(Ψ_n (w_G)_1^n) ≤ 0, which, substituted in (A14), yields the opposite inequality. Hence, w_1^∞ satisfies Condition ii) of Definition 2. Since w_1^∞ also satisfies Condition i) of Definition 2, it follows that w_1^∞ is entropy balanced, completing the proof.
Proof of Lemma 4. Pick any ν ∈ N and let y_1^n ≜ [Ψ_n^T | Φ_n^T]^T w_1^n, where [Ψ_n^T | Φ_n^T]^T ∈ R^{n×n} is a unitary matrix and the matrices Ψ_n ∈ R^{ν×n} and Φ_n ∈ R^{(n−ν)×n} have orthonormal rows. Since w_1^n = G_n u_1^n, we have that h(Φ_n w_1^n) = h(Φ_n G_n u_1^n). Let Φ_n G_n = A_n Σ_n B_n be the SVD of Φ_n G_n, where A_n ∈ R^{(n−ν)×(n−ν)} is an orthogonal matrix, B_n ∈ R^{(n−ν)×n} has orthonormal rows, and Σ_n ∈ R^{(n−ν)×(n−ν)} is a diagonal matrix with the singular values of Φ_n G_n. Hence, (A23) follows. The singular values of Φ_n G_n satisfy (A24); thus, from (A24) and the Cauchy eigenvalue interlacing theorem [30], the stated bounds hold. Recalling that G is minimum phase (which guarantees that its singular values change at most polynomially with n, due to Lemma 7), we conclude that (1/n) log det Σ_n vanishes as n → ∞. Substituting back into (A23), we arrive at the result, where (a) holds because u_1^∞ is entropy balanced. This completes the proof.
Proof of Lemma 5. Let {Ψ_n}_{n=1}^∞ be a sequence of matrices, each Ψ_n ∈ R^{κ×n} with orthonormal rows spanning a subspace of R^n that contains the span of the columns of [Φ]_1^n. For each n ∈ N, let Ψ̄_n ∈ R^{(n−κ)×n} be such that H_n ≜ [Ψ_n^T | Ψ̄_n^T]^T is a unitary matrix. Then, the result follows, where the last equality holds because u_1^∞ is entropy balanced and y_1^∞ is entropy balanced (from Lemma 3). This completes the proof.
Proof of Lemma 6. Since Q_n is unitary, we obtain (A37), where (a) follows from the chain rule of differential entropy. It only remains to show that the limit of (1/n) h(w_{m+1}^n | w_1^m) as n → ∞ equals the entropy rate of u_1^∞. We will do this by deriving lower and upper bounds that converge to the same expression as n → ∞.
A lower bound for h(w_{m+1}^n | w_1^m) can be obtained by noticing that (A42) holds, where (a) follows from the fact that conditioning on more information does not increase differential entropy, (b) is due to the fact that h(x + a) = h(x) for any constant a, (c) holds because z̃_1^∞ ⊥⊥ v_1^∞, (d) is a direct application of the chain rule of differential entropy, and (e) stems from (A34) and the fact that det(D_n R_n) = 1. On the other hand, (A43) holds. Then, by inserting (A43) and (A42) in (A37), dividing by n, and taking the limit n → ∞, we obtain (A45), where the last equality is a consequence of the fact that u_1^∞ is entropy balanced (specifically, from Proposition A3).
We now derive an upper bound for h(w_{m+1}^n | w_1^m). Defining the random vector x_{m+1}^n and noting that D_n is diagonal, we can write the bound in terms of _{m+1}[D_n]_n ≜ diag{d_{n,m+1}, d_{n,m+2}, …, d_{n,n}}. Therefore, the upper bound follows, where K_{A_n x_{m+1}^n} and K_{A_n (_{m+1}[D_n]_n)^{−1} z̃_{m+1}^n} are the covariance matrices of A_n x_{m+1}^n and A_n (_{m+1}[D_n]_n)^{−1} z̃_{m+1}^n, respectively, and where the last inequality follows from [31]. The boundedness of λ_max(K_{x_{m+1}^n}) and λ_max(K_{z̃_{m+1}^n}), together with the assumption that u_1^∞ is entropy balanced, then yields lim_{n→∞} (1/n) h(w_{m+1}^n | w_1^m) = h̄(u_1^∞), which coincides with the lower bound found in (A45), completing the proof.
Proof of Lemma 7. The transfer function G(z) can be factored as G(z) = G̃(z)F(z), where G̃(z) is stable and minimum phase and F(z) is stable with all the non-minimum phase zeros of G(z), both being biproper rational functions. From Lemma A2 (in Appendix C), in the limit as n → ∞, the eigenvalues of G̃_n^T G̃_n are lower and upper bounded by λ_min(G̃^T G̃) and λ_max(G̃^T G̃), respectively. Let G̃_n = Q̃_n^T D̃_n R̃_n and F_n = Q_n^T D_n R_n be the SVDs of G̃_n and F_n, respectively, with d̃_{n,1} ≤ d̃_{n,2} ≤ ⋯ ≤ d̃_{n,n} and d_{n,1} ≤ d_{n,2} ≤ ⋯ ≤ d_{n,n} being the diagonal entries of the diagonal matrices D̃_n and D_n, respectively. Denoting the i-th row of R_n by r_{n,i}^T, we have from the Courant–Fischer theorem [30] that the stated bound holds. Likewise, the corresponding opposite bound follows. The result now follows directly from Lemma A3 (in Appendix C).
Proof of Theorem 6. To begin with, the entropy power inequality [1] gives h(y_1^n) = h(G_n u_1^n + z_1^n) ≥ h(G_n u_1^n) = h(u_1^n) + n log|g_0|, proving the lower bound in (70). To obtain the other bounds on the entropy gain of G_n, we will use Lemma 6. Recalling the structure of z_1^∞ specified in Assumption 2, the random vector whose differential entropy appears on the RHS of (64) takes the corresponding form. Notice that, for every n ≥ κ, the columns of the matrix [Q_n]_1^m [Φ]_1^κ ∈ R^{m×κ} span a space of dimension κ_n ∈ {0, 1, …, κ̄}, with κ̄ ≜ min{m, κ}. If κ_n = 0, and if that is the case for every n ≥ κ, the lower bound in (70) is reached by inserting the latter expression into (64) and invoking Lemma 7.
Otherwise, decompose [Q_n]_1^m [Φ]_1^κ as A_n^T T_n B_n, where A_n ∈ R^{κ_n×m} has orthonormal rows, T_n ≜ diag{t_1(n), …, t_{κ_n}(n)}, and B_n ∈ R^{κ_n×κ} has orthonormal rows. Construct a unitary matrix H_n ∈ R^{m×m} such that (A63) holds, where A_n ∈ R^{κ_n×m} is as before and Ā_n ∈ R^{(m−κ_n)×m} has orthonormal rows whose span is the orthogonal complement of that of A_n. From (A63) and (A60), we obtain (A66), where the indicator function 1_{m}(κ_n) = 1 if κ_n = m and 0 otherwise. The first differential entropy on the RHS of (A66) can be lower bounded as in (A67), where (a) is from the entropy power inequality [1], (b) holds because s_1^κ ⊥⊥ u_1^n, and (c) is from Proposition A1. An upper bound can be obtained as in (A68), where (a) holds because conditioning does not increase entropy, (b) is because a Gaussian distribution maximizes the differential entropy for a given covariance matrix, and (c) is due to Reference [31]. Notice that u_1^∞ satisfies the requirements of Proposition A2, implying that lim_{n→∞} n^{−1} λ_max(K_{u_1^n}) = 0. Thus, since t_{κ_n}(n) ≤ 1, it follows from (A67), (A68), and (A66) that the corresponding bound holds. For the last differential entropy on the RHS of (A66), notice that (A71) holds, with the factor being unitary, Σ_n ∈ R^{(m−κ_n)×(m−κ_n)} being diagonal, and W_n ∈ R^{(m−κ_n)×n} having orthonormal rows. We can then conclude that (A73) holds, with equality in (a) and (b) if and only if Ā_n = [I_{m−κ_n} | 0] and A_n = [0 | I_{m−κ_n}], respectively. Substituting this into (A71), and then the latter into (A70), we arrive at the result. Substituting the upper bound from this equation and from (A68) into (A66), and the latter into (64), exploiting the fact that u_1^∞ is entropy balanced (which ensures that u_1^∞ satisfies Condition i) in Definition 2) and invoking Lemma 7, yields the upper bound in (70).
Doing the same substitutions, but with the lower bounds in (A73) and (A67), and using the assumption that lim_{n→∞} (1/n) log(t_1(n)) = 0, gives the lower bound in (71). This completes the proof.
Proof of Lemma 8. We first consider the case κ = m and show that lim_{n→∞} σ_min(_1[Q_n]_m) > 0, where now Q_n^T is the left unitary matrix in the SVD F_n = Q_n^T D_n R_n. We prove this by contradiction: suppose the contrary. Then, there exists a sequence of unit-norm vectors {v_n}_{n=1}^∞, with v_n ∈ R^m for all n, such that (A75) holds. For each n ∈ N, define the n-length unit-norm image vectors t_n^T ≜ v_n^T [Q_n]_1^m. Then, the stated equality holds, where the last step follows from the fact that, by construction, t_n^T is in the span of the first m rows of Q_n, together with the fact that Q_n is unitary (which implies that [Q_n]_{m+1}^n t_n = 0). Since the top m entries in D_n decay exponentially as n increases, we have the bound in which ζ_n is a finite-order polynomial in n (from Lemma A3 in Appendix C). Moreover, min_ω |F(e^{jω})|² > 0 (A78) (the inequality is strict because all the zeros of F(z) are strictly outside the unit disk). Recall that ‖t_n‖ = 1; thus, from (A75), lim_{n→∞} ‖[t_n]_1^m‖ = 0 and lim_{n→∞} ‖[t_n]_{m+1}^n‖ = 1, which yields a limit strictly greater than zero, contradicting (A77). Therefore, (A80) holds. Now, consider an arbitrary κ ≥ 1. It then follows from (A80) that lim_{n→∞} κ_n = κ̄. This completes the proof.
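The exponential decay of the smallest singular values of F_n invoked in this proof is easy to observe numerically. A minimal sketch, assuming the illustrative NMP factor F(z) = 1 − 2 z^{−1} (a single zero at z = 2): the inverse filter has an exponentially growing impulse response, so σ_min(F_n) shrinks roughly like 2^{−n}.

```python
import numpy as np

def lower_toeplitz(f, n):
    """n x n lower-triangular Toeplitz matrix of a causal FIR filter."""
    T = np.zeros((n, n))
    for k, fk in enumerate(f):
        if k < n:
            T += fk * np.eye(n, k=-k)
    return T

f = [1.0, -2.0]  # F(z) = 1 - 2 z^{-1}: a single NMP zero at z = 2

# Smallest singular value of F_n for two values of n.
sig = {n: np.linalg.svd(lower_toeplitz(f, n), compute_uv=False)[-1] for n in (15, 30)}

# Doubling n shrinks the smallest singular value by several orders of magnitude.
print(sig[15], sig[30], sig[30] / sig[15])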
Proof of Theorem 9. Denote the Blaschke product [11] of A(z) by B(z), which clearly satisfies (A84), where b_0 is the first sample in the impulse response of B(z). Notice that (A84) implies that, in the limit as n → ∞, B(z) acts as a unitary operator, yielding outputs with uniformly bounded variance. Since B(z) has only stable poles and its zeros coincide exactly with the poles of A(z), it follows that B(z)A(z) is an MP stable transfer function. Thus, the asymptotically stationary process x̃_1^∞ defined in (139) can be constructed accordingly, where B_n is a Toeplitz lower-triangular matrix with its main-diagonal entries equal to b_0. Since w_1^∞ is entropy balanced, so is x̃_1^∞, thanks to Lemma 4.
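The defining all-pass property of a Blaschke factor, unit magnitude everywhere on the unit circle, is behind the claim that B(z) acts asymptotically as a unitary operator. A quick numerical check for a single, hypothetical real pole location a:

```python
import numpy as np

a = 0.7  # hypothetical real pole of A(z), |a| < 1
w = np.linspace(-np.pi, np.pi, 1000)
z = np.exp(1j * w)

# Elementary Blaschke factor: swaps the zero at a with one at 1/a, all-pass.
Bz = (z - a) / (1 - a * z)

print(np.max(np.abs(np.abs(Bz) - 1.0)))  # ≈ 0: |B(e^{jω})| = 1 for all ω
```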
The fact that B(z) is biproper with b_0 as in (A85) implies that, for any u_1^n with finite differential entropy, (A87) holds, which will be utilized next. For any given n ≥ m, suppose that C(z) is chosen and x_1^n and u_1^n are distributed so as to minimize I(x_1^n; C_n x_1^n + u_1^n) subject to the constraint E[‖y_1^n − x_1^n‖²] ≤ D (i.e., x_1^n, u_1^n is a realization of R_{x,n}(D)), yielding the reconstruction y_1^n. Since we are considering mean-squared error distortion, it follows that, for rate-distortion optimality, u_1^n must be jointly Gaussian with x_1^n. In addition, there is no loss of rate-distortion optimality if u_1^∞ is entropy balanced (otherwise, it would have a lower entropy rate than its entropy-balanced counterpart, which differs from the former only in a finite number of samples and has the same asymptotic MSE). From these vectors, define ũ_1^n, x̃_1^n, and ỹ_1^n, where d_1^n is a zero-mean Gaussian vector independent of (ũ_1^n, x̃_1^n) with finite differential entropy and finite variance such that d_k = 0 for all k > m. Then, we have the chain of equalities (the change of variables and the steps in this chain are represented by the block diagrams shown in Figure 7) in which (a) follows from B_n being invertible, (b) is due to the fact that ỹ_1^n = P_n x̃_1^n + ũ_1^n, and (c) holds because u_1^n ⊥⊥ x_1^n. The equality (d) stems from h(ũ_1^n) = h(u_1^n) − nG (see (A87)). Equality holds in (e) because x̃_1^n ⊥⊥ (ũ_1^n, d_1^n) and in (f) because of (A91). But from Theorem 4, and since u_1^∞ is entropy balanced, lim_{n→∞} (1/n)(h(ũ_1^n + d_1^m) − h(u_1^n)) = 0. From Lemma 3, and because u_1^∞ is entropy balanced, so is ỹ_1^∞. This guarantees, from Lemma 5, that lim_{n→∞} n^{−1}[h(ỹ_1^n) − h(ỹ_1^n + d_1^n)] = 0. Thus, R_{x,n}(D) = lim_{n→∞} (1/n) I(x̃_1^n; ȳ_1^n) + G ≥ R_{x̃,n}(D) + G.
At the same time, the distortion for the source x̃_1^n, when reconstructed as ȳ_1^n, satisfies the stated bound, where (a) holds because d_1^n = d_1^m is bounded, and (b) is due to the fact that, in the limit, B(z) is a unitary operator. Recalling the definitions of R_x(D) and R_x̃(D), we conclude that lim_{n→∞} (1/n) I(x̃_1^n; ȳ_1^n) ≥ R_{x̃,n}(D); therefore, R_x(D) ≥ R_x̃(D) + G. In order to complete the proof, it suffices to show that R_x(D) − R_x̃(D) ≤ Σ_{i=1}^{m} log|p_i|. For this purpose, consider now the (asymptotically) stationary source x̃_1^n, and suppose that ŷ_1^n = x̃_1^n + u_1^n realizes R_{x̃,n}(D). Again, x̃_1^n and u_1^n will be jointly Gaussian, satisfying ŷ_1^n ⊥⊥ u_1^n (the latter condition is required for minimum-MSE optimality). From this, one can propose an alternative realization in which the error sequence is ũ_1^n ≜ B_n u_1^n, yielding an output ỹ_1^n = x̃_1^n + ũ_1^n with ỹ_1^n ⊥⊥ ũ_1^n. Then, n R_{x̃,n}(D) = I(x̃_1^n; ŷ_1^n) satisfies the stated chain, where (a) follows by recalling that ŷ_1^n = x̃_1^n + u_1^n and because ŷ_1^n ⊥⊥ u_1^n, (b) stems from (A87), (c) is a consequence of ỹ_1^n ⊥⊥ ũ_1^n, and (d) follows from the fact that ỹ_1^n = x̃_1^n + ũ_1^n. Finally, (e) holds because B_n is invertible for all n. Since, asymptotically as n → ∞, the distortion yielded by ỹ_1^n for the non-stationary source x_1^n is the same as the one obtained when x̃_1^n is reconstructed as ŷ_1^n (recall (A84)), we conclude that R_x(D) − R_x̃(D) ≤ Σ_{i=1}^{m} log|p_i|, completing the proof.

Appendix C. Technical Lemmas and Propositions
Proposition A1. Let the random vector s_1^κ have finite differential entropy, and suppose its covariance matrix K_{s_1^κ} satisfies λ_max(K_{s_1^κ}) < ∞. Then, (A112) holds for any unitary matrix A ∈ R^{κ×κ} and i = 1, 2, …, κ. Proof. Define r_1^κ ≜ A s_1^κ. Since A is unitary, it follows that h(r_1^κ) = h(s_1^κ) and that K_{r_1^κ} and K_{s_1^κ} have the same eigenvalues. Therefore, the stated bound holds, where (a) holds because a Gaussian distribution yields the largest differential entropy for a given covariance matrix, (b) is from the fact that det(K_{s_1^i}) = ∏_{k=1}^{i} λ_k(K_{s_1^i}), and (c) is due to the Cauchy interlacing theorem [30]. This proves the upper bound in (A112). For the lower bound, we have the corresponding chain, where (a) stems from the fact that h(a, b) ≤ h(a) + h(b) and (b) follows from (A113). This completes the proof.
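Step (c) of this proof rests on the Cauchy interlacing theorem: the eigenvalues of a principal submatrix of a symmetric matrix interlace those of the full matrix. A quick numeric check on a randomly generated covariance matrix (sizes and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
K = A @ A.T  # random symmetric positive semi-definite matrix

lam = np.sort(np.linalg.eigvalsh(K))           # eigenvalues of K (ascending)
mu = np.sort(np.linalg.eigvalsh(K[:5, :5]))    # eigenvalues of a principal submatrix

# Cauchy interlacing: lam[i] <= mu[i] <= lam[i+1] (small tolerance for rounding).
ok = all(lam[i] - 1e-9 <= mu[i] <= lam[i + 1] + 1e-9 for i in range(5))
print(ok)  # True
```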

Proposition A3.
Let v_1^∞ be an entropy-balanced random process. Then, (A117) holds for each ν ∈ N and for every sequence of matrices {Ψ_n}_{n=ν}^∞, Ψ_n ∈ R^{ν×n}, with orthonormal rows. To see this, notice that, for every Ψ_n ∈ R^{ν×n} with orthonormal rows, there exists a matrix Φ_n ∈ R^{(n−ν)×n} with orthonormal rows which are also orthogonal to those of Ψ_n. This means that the matrix [Ψ_n^T | Φ_n^T]^T ∈ R^{n×n} is unitary; thus, the stated chain holds, where (a) holds due to the chain rule of differential entropy and (b) follows because conditioning does not increase differential entropy. Therefore, h(Ψ_n v_1^n) ≥ h(v_1^n) − h(Φ_n v_1^n). Dividing this by n, taking the limit as n → ∞, and recalling that v_1^∞ satisfies (17) yields (A117).
We will now prove that lim_{n→∞} (1/n) h(Ψ_n v_1^n) ≤ 0, with the inequality due to the fact that Ψ_n has orthonormal rows. But v_1^∞ meets the requirements of Proposition A2; thus, lim_{n→∞} (1/n) h(Ψ_n v_1^n) ≤ lim_{n→∞} (ν/2n) log(2πe λ_max(K_{v_1^n})) = 0. The proof is completed by combining this result with (A117).
Lemma A1. Let u_1^∞ be a random process with independent elements, where each element u_i is uniformly distributed over a possibly different interval [−a_i/2, a_i/2], such that a_max > a_i > a_min > 0 for all i ∈ N, for some positive and bounded a_min < a_max. Then, u_1^∞ is entropy balanced.
Proof. Without loss of generality, we can assume that a_i ≥ 1 for all i (otherwise, we could scale the input by 1/a_min, which would scale the output by the same proportion, increasing the input entropy by n log(1/a_min) and the output entropy by (n − ν) log(1/a_min), without changing the result). The input vector u_1^n is confined to an n-box U_n (the support of u_1^n) of volume V_n(U_n) = ∏_{i=1}^n a_i and has entropy log(∏_{i=1}^n a_i). This support is an n-box which contains (n choose k)·2^{n−k} k-boxes of different k-volume. Each of these k-boxes is determined by fixing n − k entries of u_1^n to ±a_i/2 and letting the remaining k entries sweep freely over [−a_i/2, a_i/2]. Thus, the k-volume of each k-box is the product of the k support sizes a_i of the associated free-sweeping entries. But recalling that a_i ≥ 1 for all i, the volume of each k-box can be upper bounded by ∏_{i=1}^n a_i. With this, the added volume of all the k-boxes contained in the original n-box can be upper bounded as in (A120). We now use this result to upper bound the entropy rate of y_{ν+1}^n. Let y_1^n ≜ [Ψ_n^T | Φ_n^T]^T u_1^n, where [Ψ_n^T | Φ_n^T]^T ∈ R^{n×n} is a unitary matrix and where Ψ_n ∈ R^{ν×n} and Φ_n ∈ R^{(n−ν)×n} have orthonormal rows. From this definition, y_{ν+1}^n will be distributed over a finite region Y_{ν+1}^n ⊆ R^{n−ν}, corresponding to the projection onto the (n−ν)-dimensional span of the rows of Φ_n. Hence, h(y_{ν+1}^n) is upper bounded by the entropy of a uniformly distributed vector over the same support, i.e., by log V_{n−ν}(Y_{ν+1}^n), where V_{n−ν}(Y_{ν+1}^n) is the (n−ν)-dimensional volume of this support. In turn, V_{n−ν}(Y_{ν+1}^n) is upper bounded by the sum of the volumes of all the boxes of dimension at least n − ν contained in the n-box in which u_1^n is confined, which we already denoted by V_{n−ν}(U_n), and which is upper bounded as in (A120).
Therefore, (A122) holds. Recalling that h(u_1^n) = log(∏_{i=1}^n a_i), dividing by n, and taking the limit as n → ∞ yields (A123). On the other hand, (A124) holds, where (a) follows because [Ψ_n^T | Φ_n^T]^T is an orthogonal matrix. Letting (y_G)_1^ν be the jointly Gaussian sequence with the same second-order moments as y_1^ν, and recalling that the Gaussian distribution maximizes differential entropy for a given covariance, we obtain the upper bound h(y_1^ν) ≤ h((y_G)_1^ν) = (1/2) log((2πe)^ν det(Ψ_n diag{σ²_{u_1}, …, σ²_{u_n}} Ψ_n^T)) ≤ (ν/2) log(2πe max_i σ²_{u_i}), where the equality follows since the {u_i}_{i=1}^n are independent, and the last inequality stems from the fact that Ψ_n ∈ R^{ν×n} has orthonormal rows. Since max{σ²_{u_i}}_{i=1}^n is bounded for all n, we obtain, by substituting (A125) into (A124), that lim_{n→∞} (1/n)(h(y_{ν+1}^n) − h(u_1^n)) ≥ 0. The combination of this with (A123) yields lim_{n→∞} (1/n)(h(y_{ν+1}^n) − h(u_1^n)) = 0, satisfying Condition ii) in Definition 2. The proof is completed by noting that u_1^∞ also satisfies Condition i) in Definition 2.
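The counting step in this proof uses the fact that an n-box has (n choose k)·2^{n−k} k-dimensional faces: choose which k coordinates sweep freely, then fix each of the remaining n − k at one of its two endpoints. This is easy to verify for a small n:

```python
from math import comb

n = 3
counts = [comb(n, k) * 2 ** (n - k) for k in range(n + 1)]

# For the 3-cube: 8 vertices, 12 edges, 6 faces, 1 solid; 27 = 3^n in total,
# matching the binomial expansion of (1 + 2)^n.
print(counts, sum(counts))  # [8, 12, 6, 1] 27
```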
Lemma A2. Let A(z) be a causal, finite-order, stable and strictly minimum-phase rational transfer function with impulse response a 0 , a 1 , . . . such that a 0 = 1. Then, lim n→∞ λ 1 (A n A T n ) > 0 and lim n→∞ λ n (A n A T n ) < ∞.
Proof of Lemma A2. The fact that lim_{n→∞} λ_n(A_n A_n^T) is upper bounded follows directly from the fact that A(z) is a stable transfer function. On the other hand, A_n A_n^T is positive definite, with lim_{n→∞} λ_1(A_n A_n^T) ≥ 0. Suppose that lim_{n→∞} λ_1(A_n A_n^T) = 0. If this were true, then it would hold that lim_{n→∞} λ_n(A_n^{−1} A_n^{−T}) = ∞. But A_n^{−1} is the lower-triangular Toeplitz matrix associated with A^{−1}(z), which is stable (since A(z) is minimum phase), implying that lim_{n→∞} λ_n(A_n^{−1} A_n^{−T}) < ∞, thus leading to a contradiction. This completes the proof.
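A small numerical illustration of Lemma A2, assuming the illustrative minimum-phase filter A(z) = 1 + 0.5 z^{−1} (zero at −0.5, inside the unit circle): since A^{−1}(z) is stable, ‖A_n^{−1}‖ stays bounded (here below 2), so λ_1(A_n A_n^T) stays above 1/4 no matter how large n grows.

```python
import numpy as np

def lower_toeplitz(a, n):
    """n x n lower-triangular Toeplitz matrix of a causal FIR filter."""
    T = np.zeros((n, n))
    for k, ak in enumerate(a):
        if k < n:
            T += ak * np.eye(n, k=-k)
    return T

a = [1.0, 0.5]  # A(z) = 1 + 0.5 z^{-1}, minimum phase (zero at -0.5)

lam_mins = []
for n in (50, 200):
    A_n = lower_toeplitz(a, n)
    lam_mins.append(np.linalg.eigvalsh(A_n @ A_n.T)[0])  # smallest eigenvalue

# ||A_n^{-1}||_2 <= sum_k 0.5^k < 2, hence lam_min >= 1/4 for every n.
print(lam_mins)  # both ≈ 0.25, bounded away from zero
```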
We re-state here (for completeness and convenience) the unnumbered lemma in the proof of Reference [16], Theorem 1, as follows: Lemma A3. Let the transfer function G(z) satisfy Assumption 1 and suppose it has no poles. Then, the bound stated therein holds, with the elements in the sequence {α_{n,ℓ}} being positive and increasing or decreasing at most polynomially with n.