Bounds on the Sum-Rate of MIMO Causal Source Coding Systems with Memory under Spatio-Temporal Distortion Constraints

In this paper, we derive lower and upper bounds on the OPTA (optimal performance theoretically attainable) of a two-user multi-input multi-output (MIMO) causal encoding and causal decoding problem. Each user's source model is a multidimensional Markov source driven by an additive i.i.d. noise process, subject to three classes of spatio-temporal distortion constraints. To characterize the lower bounds, we use state augmentation techniques and a data processing theorem, which recovers a variant of the rate distortion function known in the literature as the nonanticipatory ϵ-entropy, or the sequential or nonanticipative RDF. We derive lower bound characterizations for a system driven by an i.i.d. Gaussian noise process, which we solve using a semidefinite programming (SDP) algorithm for all three classes of distortion constraints. We obtain closed-form solutions when the system's noise is possibly non-Gaussian for both users, and when only one of the users is described by a source model driven by a Gaussian noise process. To obtain the upper bounds, we use the best linear forward test-channel realization, which coincides with the optimal test-channel realization when the system is driven by a Gaussian noise process, and apply a sequential causal DPCM-based scheme with a feedback loop followed by a scaled ECDQ scheme; this leads to upper bounds with certain performance guarantees. Then, we use the linear forward test channel as a benchmark to obtain upper bounds on the OPTA when the system is driven by an additive i.i.d. non-Gaussian noise process. We support our framework with various simulation studies.


Problem Statement
We consider the two-user causal encoding and causal decoding setup illustrated in Figure 1. In this setup, users 1 and 2 are modeled by the following discrete-time time-invariant multidimensional Markov processes:

x^i_{t+1} = A_i x^i_t + w^i_t, t = 0, 1, . . . , i = 1, 2, (1)

where x^1_t ∈ R^{p_1}, x^2_t ∈ R^{p_2}, with p_1 not necessarily equal to p_2, (A_1, A_2) are known constant matrices of appropriate dimensions, and (w^1_t, w^2_t) are additive i.i.d., possibly non-Gaussian, noise processes with zero mean and covariance matrices Σ_{w^i} ⪰ 0, i = 1, 2, independent of x^i_0, i = 1, 2, and of each other for all t ≥ 0. The initial states are given by x^i_0 ∼ (0; Σ_{x^i_0}), i = 1, 2.

Figure 1. System model. The encoder receives information from two users that do not interact from a dynamical system perspective, but they are allowed to allocate bits between them and across the dimensions. The compression is done causally, whereas the clocks of the encoder and the decoder are assumed to be synchronized.
The goal is to characterize the performance of the setup in Figure 1 under various distortion metrics when the encoder compresses information causally, whereas the lossless compression between the encoder and the decoder is done in one shot, assuming their clocks are synchronized.
First, we apply state space augmentation [1] to the state-space models in (1) to transform them into a single augmented state-space model:

x_{t+1} = A x_t + w_t, t = 0, 1, . . . , (2)

where x_{t+1} = [x^{1,T}_{t+1}, x^{2,T}_{t+1}]^T ∈ R^{p_1+p_2}, A is a block diagonal matrix, and w_t is an additive i.i.d. possibly non-Gaussian noise process with w_t ∼ (0; Σ_w), where (A, Σ_w) are of the form

A = blockdiag(A_1, A_2), Σ_w = blockdiag(Σ_{w^1}, Σ_{w^2}). (3)

We note that the operation in (3) can be denoted compactly as A = A_1 ⊕ A_2 and, similarly, Σ_w = Σ_{w^1} ⊕ Σ_{w^2} (see the notation section for "⊕").

System Operation: At each time instant t, the encoder observes the augmented state x_t and generates a data packet of instantaneous rate R_t taking values in {1, . . . , 2^{R_t}}. At time t, the packet is sent over a noiseless channel with rate R_t. At each time t, the decoder receives the packet and constructs an estimate y_t of x_t. We assume that the clocks of the encoder and decoder are synchronized. Formally, the encoder (E) and the decoder (D) are specified by a sequence of measurable functions {(f_t, g_t) : t ∈ N_0}.

Generalizations
It should be noted that the setup in Figure 1 can be generalized to any finite number of users. The only change will appear in the number of state-space equations and the dimension of the vectors and matrices in the augmented state-space representation of (2).
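As a small numerical illustration of this augmentation step, the block-diagonal construction A = A_1 ⊕ A_2 (and Σ_w = Σ_{w^1} ⊕ Σ_{w^2}) can be sketched as follows. The matrices below are hypothetical stand-ins, not parameters from the paper, and the sketch extends to any finite number of users by appending further blocks.

```python
import numpy as np

def direct_sum(*blocks):
    """Block-diagonal direct sum of square matrices: A_1 (+) A_2 (+) ..."""
    sizes = [b.shape[0] for b in blocks]
    n = sum(sizes)
    out = np.zeros((n, n))
    i = 0
    for b, s in zip(blocks, sizes):
        out[i:i + s, i:i + s] = b
        i += s
    return out

# Hypothetical per-user dynamics (p1 = 2, p2 = 1) and noise covariances.
A1 = np.array([[0.5, 0.1],
               [0.0, 0.3]])
A2 = np.array([[0.8]])
Sigma_w1 = np.eye(2)
Sigma_w2 = np.array([[0.5]])

A = direct_sum(A1, A2)                  # augmented dynamics, (p1+p2) x (p1+p2)
Sigma_w = direct_sum(Sigma_w1, Sigma_w2)

def step(x, rng):
    """One transition x_{t+1} = A x_t + w_t of the augmented Markov source
    (Gaussian noise is used here purely for illustration)."""
    w = rng.multivariate_normal(np.zeros(A.shape[0]), Sigma_w)
    return A @ x + w
```

Because the off-diagonal blocks of A and Σ_w are zero, the two users evolve independently inside the single augmented recursion, which is exactly what makes the augmented model equivalent to the pair of models in (1).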
Next, we explain the setup of two users that are correlated (in their states). In this scenario, users 1 and 2 are modeled by the following discrete-time time-invariant multidimensional Markov processes:

x^1_{t+1} = A_{11} x^1_t + A_{12} x^2_t + w^1_t, t = 0, 1, . . . ,
x^2_{t+1} = A_{21} x^1_t + A_{22} x^2_t + w^2_t, t = 0, 1, . . . ,
where x^1_t ∈ R^{p_1}, x^2_t ∈ R^{p_2}, with p_1 not necessarily equal to p_2, (A_{11}, A_{12}, A_{21}, A_{22}) are known constant matrices of appropriate dimensions, and all other assumptions remain the same as in the user models of (1). The single augmented state-space model is now obtained as in (2), with A replaced by the full block matrix

Ā = [A_{11} A_{12}; A_{21} A_{22}],

in which A_{11}, A_{22} are square matrices but (A_{12}, A_{21}) may be rectangular matrices (if p_1 ≠ p_2). We will not treat this case separately in our paper because it follows in a straightforward manner by replacing the block diagonal matrix A everywhere with the full block matrix Ā. Clearly, this case too can be generalized to any finite number of users with appropriate modifications to the state-space models.

Distortion Constraints
In this work, we consider three types of distortion constraints, articulated as follows: (i) a per-dimension (spatial) distortion constraint on the asymptotically averaged total (across time) MMSE covariance matrix; (ii) an asymptotically averaged total (across time and space) distortion constraint; (iii) a covariance matrix distortion constraint.
Next, we give the definition of each distortion constraint and explain some of their utility in multi-user systems.
A per-dimension (spatial) distortion constraint imposed on the covariance distortion matrix Σ_Δ ≜ limsup_{n→∞} (1/(n+1)) Σ_{t=0}^{n} E{(x_t − y_t)(x_t − y_t)^T}, where Σ_Δ ⪰ 0, is defined as follows:

[Σ_Δ]_{ii} ≤ D_{ii}, i = 1, . . . , p_1 + p_2, (8)

where D_{ii} ∈ [0, D^{max}_{ii}] are given diagonal entries of the positive semidefinite matrix D ⪰ 0, with trace(D) ≡ D, D ∈ [0, D^{max}]. Note that under this distortion constraint it trivially holds that limsup_{n→∞} (1/(n+1)) Σ_{t=0}^{n} E||x_t − y_t||²_2 ≤ D.

Utility: The choice of per-dimension distortion constraints is arguably more realistic in various networked systems. For instance, one use of such hard constraints can be found in multivariable feedback control systems, also called centralized multi-input multi-output (MIMO) systems [2] (see Figure 2). In such networks, one may wish to minimize a temporally total performance criterion or satisfy some total fidelity constraint; in addition, however, it is often required that the resource allocation to the different nodes (or variables) never exceeds certain performance thresholds when the demands of data transmission within the communication link allow only a limited rate. The difficulty is that the variables interact. Some variables may be more important for certain applications, according to the demands of the system or the quality of service, which is why they need hard constraints.

Figure 2. Centralized multivariable multi-input multi-output (MIMO) control system.

An asymptotically averaged total (across time and space) distortion constraint is defined as follows:

limsup_{n→∞} (1/(n+1)) Σ_{t=0}^{n} E||x_t − y_t||²_2 ≤ D_T, (9)

where D_T ∈ [0, D^{max}_T], with D_T not necessarily equal to D.

Utility: The asymptotically averaged total (across time and space) distortion constraint allows the distortion to be shared or allocated arbitrarily among the transmit dimensions.
The combination of the per-dimension distortion constraint with the averaged total distortion constraint ensures a total allocated distortion budget in the system that depends on the allowable (by design) distortion budget at each dimension (or user).
A covariance matrix distortion constraint is a generalization of the per-dimension distortion constraint, defined by

Σ_Δ ⪯ D_cov, (10)

where D_cov ≻ 0.

Utility: In recent years, there has been a shift from conventional MSE distortion constraints (scalar-valued target distortions) to covariance matrix distortions in the areas of multiterminal and distributed source coding [3][4][5][6][7] and signal processing [7][8][9]. The argument for considering covariance distortion constraints, despite their difficulty, is their generality and flexibility in formulating new problems. One practical example is wireless ad hoc microphones that transmit to receiver(s) over a MIMO channel. In such setups, the receiver(s) may need to perform beamforming or some multi-channel Wiener filtering variant. In both cases, one needs to know the covariance matrix of, e.g., the error signal (the covariance distortion matrix) to perform the desired signal enhancement. For example, if the quality of one of the signals is too poor, this could harm the overall signal enhancement, and one therefore needs to trade off the bits correctly among the microphones. In an ad hoc microphone array, the different signals are naturally correlated, which adds an interesting interplay between them that goes beyond MSE distortion.
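To make the three constraint classes concrete, the following sketch evaluates each of them on the same empirical error trajectory. The error samples and the bounds D_ii, D_T and D_cov are hypothetical, and a diagonal D_cov is chosen so that the covariance-order constraint visibly subsumes the per-dimension one.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 3            # augmented dimension p1 + p2
n = 20000        # horizon used to approximate the limsup time average

# Hypothetical reconstruction errors e_t = x_t - y_t (stand-in for a real coder).
errors = rng.normal(scale=[0.5, 0.7, 0.3], size=(n, p))

# Time-averaged distortion covariance Sigma_Delta ~ (1/n) sum_t e_t e_t^T.
Sigma_Delta = errors.T @ errors / n

# (i) per-dimension constraints: [Sigma_Delta]_ii <= D_ii
D_ii = np.array([0.30, 0.55, 0.15])
per_dim_ok = np.all(np.diag(Sigma_Delta) <= D_ii)

# (ii) averaged total constraint: trace(Sigma_Delta) <= D_T
D_T = 1.0
total_ok = np.trace(Sigma_Delta) <= D_T

# (iii) covariance constraint: Sigma_Delta <= D_cov in the positive semidefinite order
D_cov = np.diag(D_ii)
cov_ok = np.all(np.linalg.eigvalsh(D_cov - Sigma_Delta) >= 0)
```

Note that (iii) with this diagonal D_cov implies (i), since the diagonal entries of the PSD matrix D_cov − Σ_Δ are non-negative; the converse fails in general because (i) says nothing about the off-diagonal correlations of Σ_Δ.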

Operational Interpretations
In this subsection, we use the three types of distortion constraints introduced in Section 1.2 to state the corresponding operational definitions for which we study lower and upper bounds in this paper. Definition 1 (Causal RDF subject to (8)). The operational causal RDF under per-dimension distortion constraints is defined as follows: where D_{ii} ∈ [0, D^{max}_{ii}] and D ∈ [0, D^{max}].
Definition 2 (Causal RDF subject to (8) and (9)). The operational causal RDF under joint per-dimension and asymptotically averaged total distortion constraints is defined as follows: where D* = min{D_T, D}.
Interplay between Definitions 1 and 2. Clearly, Definition 1 is a lower bound to Definition 2 because its set of feasible solutions is larger. Note that, in general, the asymptotically averaged total distortion constraint in (12) is active when D_T ≤ D; otherwise, it is a trivial constraint and (12) is equivalent to the optimization problem of (11). This observation will be illustrated via a simulation study in the sequel of the paper.

Definition 3 (Causal RDF subject to (10)). The operational causal RDF under covariance matrix distortion constraints is defined as follows: where D_cov ≻ 0.

Literature Review. In information theory, causal coding and causal decoding, also termed zero-delay coding (see, e.g., [10][11][12][13][14]), does not rely on the traditional construction based on random codebooks, which in turn allows asymptotically large source vector dimensions [15], to establish achievability of a certain (non-causal) rate-distortion performance. Indeed, the optimal rate-distortion performance for causal source coding (with the clocks of the encoder and decoder synchronized) is hard to compute, and bounds are often derived in the literature. For example, lower and upper bounds on the operational causal RDF subject solely to the distortion constraint in (9) (or the more stringent per-instant distortion constraint E{||x_t − y_t||²_2} ≤ D_t, ∀t) have already been studied extensively for various special cases of the setup of Figure 1; see, e.g., [11,14,16,17] and the references therein. In this work, we study new problems related to the causal RDF for the general multi-user source coding setup of Figure 1 under various classes of distortion constraints whose utility (partly explained in Section 1.2) has not been studied in the literature so far. These bounds are established using tools from information theory, convex optimization and causal MMSE estimation.

Contributions
In this paper we obtain the following results for the setup of Figure 1.

• Characterization and computation of the true lower bounds on (11)-(13) when the users' augmented source model is driven by a Gaussian noise process (Lemma 3, Theorem 2).
• Analytical lower bounds on (11)-(13) when the users' augmented source model is driven by an additive i.i.d. noise process (including both additive Gaussian and non-Gaussian noise) (Theorem 3). As a consequence, we also obtain analytical lower bounds when only one of the users' source models is driven by a Gaussian noise process (Corollary 2).

• Characterization and computation of achievable bounds on (11)-(13) when the users' augmented source model is driven by a Gaussian noise process (Theorem 4).
Machinery and tools. The information theoretic rate distortion definitions used to obtain the lower bounds in this paper are derived using a data processing theorem (Theorem 1) that reveals the "suitable" information measure to use. The steady-state characterization of the lower bounds in Lemma 3 is derived using inequalities from matrix algebra and a convexity argument that allows the use of Jensen's inequality. To derive lower bounds beyond additive i.i.d. Gaussian noise processes, we use the fact that the characterizations of the lower bounds for the Gaussian case are in fact the characterizations obtained for the best linear coding policies (Lemma 4); hence, these can serve as a benchmark to derive lower bounds beyond the Gaussian noise process by leveraging certain trace/determinant inequalities, most importantly Minkowski's determinant inequality ([18], Exercise 12.13) and the entropy power inequality (EPI) [19]. The upper bounds on the OPTA when the system's noise is Gaussian are derived using a causal sequential DPCM-based scheme with a feedback loop, equivalent to the scheme first derived in [14], followed by an ECDQ scheme that uses vector quantization. The upper bounds on the OPTA when the system's noise is additive i.i.d. non-Gaussian are obtained using precisely the same device used to obtain the lower bounds, i.e., we use the linear test-channel realization that achieves similar upper bounds for the Gaussian case and then, using a Shannon lower bound (SLB) type argument (Theorem 5), we obtain the desired results.
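The ECDQ building block can be illustrated in its simplest scalar, subtractive-dither form. This is a simplified stand-in for the vector-quantized ECDQ used for the upper bounds, with a hypothetical step size; it exhibits the key property that the end-to-end error is uniformly distributed, independent of the source, with variance Δ²/12.

```python
import numpy as np

rng = np.random.default_rng(1)
Delta = 0.25                                          # quantizer step size (hypothetical)
x = rng.normal(size=100_000)                          # source samples
u = rng.uniform(-Delta / 2, Delta / 2, size=x.shape)  # subtractive dither (common randomness)

q = Delta * np.round((x + u) / Delta)  # uniform quantization of the dithered input
y = q - u                              # decoder subtracts the same dither
err = y - x                            # end-to-end reconstruction error

# Dithered-quantizer property: err = q - (x + u) is Uniform(-Delta/2, Delta/2)
# and statistically independent of x, hence E[err^2] = Delta**2 / 12.
```

It is this source-independent, additive-noise behavior that lets the ECDQ stage be analyzed as a test channel inside the DPCM feedback loop.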
The paper is structured as follows. In Section 2, we characterize and compute lower bounds on the OPTA of (11)-(13). In Section 3, we characterize and compute achievable upper bounds on the OPTA of (11)-(13). We draw conclusions and discuss future directions in Section 4.

Lower Bounds
In this section, we first choose a suitable information measure that will be used to derive a lower bound on Definitions 1-3. This information measure is a variant of directed information subject to some conditional independence constraints. Then, we obtain lower bounds on Definitions 1-3 for jointly Gaussian Markov processes and for Markov processes driven by additive i.i.d., possibly non-Gaussian, noise processes.
First, we write the joint distribution of the communication system of Figure 1, i.e., from the two users described by the augmented state {x_t : t ∈ N_0^n} to the augmented output of the MMSE decoder {y_t : t ∈ N_0^n}. In particular, the joint distribution induced by the joint process of states, packets and decoder outputs admits the decomposition in (14), which means that the augmented state "source" process x_t and the decoder's output process y_t satisfy the following conditional independence constraint: given the past outputs y^{t−1} and the packets received up to time t, the output y_t is conditionally independent of the state process x^t, a.s.
For (14) we state the following technical remark.
We next prove a data processing theorem.
Theorem 1 (Data processing theorem). Provided the decomposition of the joint distribution in (14) holds, the augmented state-space representation of the system in Figure 1 admits the following data processing inequalities: where I(x n ; n ||y n−1 ) assuming I(x t ; t | t−1 , y t−1 ) < ∞, ∀t, and I(x t ; y t |y t−1 ) < ∞, ∀t.
Proof. We first prove (i).
≡ I(x^n; ·^n || y^{n−1}), where (a) follows because conditioning reduces entropy [19]; (b) follows because of the non-negativity of the discrete entropy [19]; (c) follows by definition. Next, we prove (ii). This can be shown as follows: where (d) follows from an adaptation of ([20], Lemma 3.3) to processes, with the second term being zero because of the conditional independence constraint (16); (e) follows by the chain rule of conditional mutual information (again an adaptation of ([20], Lemma 3.3)), which decomposes the conditional mutual information in two different ways; (f) follows because an adaptation of ([20], Lemma 3.3) can be applied to the remaining conditional mutual information term. This can be shown as follows.
where each h(·) is assumed to be finite for every t, and (20) follows from the conditional independence constraint in (15). From the non-negativity of conditional mutual information [19], the result follows. Finally, from (19) we have (21) and (22), where (22) follows by applying the method of differences in (21). The result follows because (22) is by definition non-negative. We note that if the corresponding mutual information term at t = 0 also appeared in the cancellations, then this would have been a telescoping sum of the series. This completes the derivation.
We note that Theorem 1 differs from the data processing theorem derived in ([21], Lemma 1) in that we assume the conditional independence constraint (15) instead of the conditional independence constraint that the source given its own past is independent of the past outputs and packets, i.e., the source process is not allowed to have access, via feedback, to the previous output symbols y^{t−1}. This technical difference results in the mutual information in (18) being subject to conditional independence constraints, instead of the well-known directed information [22].
Before we introduce the information theoretic definitions that correspond to lower bounds on (11)-(13), we formally show the construction of I(x^n; y^n).

Source. The augmented source process {x_t : t ∈ N_0} induces the sequence of conditional distributions {P(dx_t | x^{t−1}), t ∈ N_0^n}. Since no initial information is assumed, the distribution at t = 0 is P(dx_0). In addition, by Bayes' rule we obtain P(dx^n) = ⊗_{t=0}^{n} P(dx_t | x^{t−1}).

Reproduction or "test channel". The reproduction process {y_t : t ∈ N_0^n}, parameterized by x^t, induces the sequence of conditional distributions, known as test channels, {P(dy_t | y^{t−1}, x^t), t ∈ N_0^n}. At t = 0, no initial state information is assumed, hence P(dy_0 | y^{−1}, x_0) = P(dy_0 | x_0). In addition, by Bayes' rule we obtain →Q(dy^n | x^n) = ⊗_{t=0}^{n} P(dy_t | y^{t−1}, x^t). From ([23], Remark 1), it can be shown that the sequences of conditional distributions {P(dx_t | x^{t−1}) : t ∈ N_0^n} and {P(dy_t | y^{t−1}, x^t) : t ∈ N_0^n} uniquely define the families of conditional distributions on X^n and Y^n parameterized by x^n ∈ X^n, respectively, given by the joint distribution in (23). In addition, from (23), we can uniquely define the Y^n-marginal distribution and the conditional distributions {P(dy_t | y^{t−1}) : t ∈ N_0^n}.
Given the above construction of distributions, we can formally introduce the information measure using relative entropy, as in (24), where (a) follows by the definition of relative entropy between P(dx^n, dy^n) and the product distribution P(dx^n) × P(dy^n); (b) is due to the Radon–Nikodym theorem ([23], Appendixes A and C); (c) is due to the chain rule of relative entropy; (d) follows by definition.
We now state, as a definition, the lower bounds on (11)-(13).

Definition 4 (Lower bounds).
Using the previous construction of distributions and the information measure of (24), we can define the following lower bounds on (11)-(13).
(1) The sum-rate subject to the per-dimension distortion constraint is defined as follows: with
(2) The sum-rate subject to the joint distortion constraints is defined as follows: where D* = min{D, D_T}.
(3) The sum-rate subject to the covariance matrix distortion constraint is defined as follows: where D_cov ≻ 0.
Next, we stress some technical remarks related to the new information theoretic measures in Definition 4, which can be obtained using known results in the literature, together with some known lower bounds that use the same objective function as (26)-(30).

Remark 2 (Comments on Definition 4).
It can be shown that the infimization problems (26), (28) and (30), in contrast to their operational counterparts (11)-(13), are convex with respect to their test channels. This can be shown following, for instance, the techniques of [23]. By the structural properties of the test channel derived in ([24], Section 4), if the source is first-order Markov, i.e., with distribution P(dx_t | x_{t−1}), t ∈ N_0^n, then the test-channel distribution is of the form P(dy_t | y^{t−1}, x_t), t ∈ N_0^n. Finally, combining this structural result with ([25], Theorem 1.8.6), it can be shown that if x^n is Gaussian, then a jointly Gaussian process {(x_t, y_t) : t ∈ N_0} achieves a smaller value of the information rates, and if x^n is Gaussian and Markov, then the infimum in (26), (28) and (30) can be restricted to test-channel distributions which are Gaussian, of the form P(dy_t | y^{t−1}, x_t).
We recall that when the distortion constraint set contains only (9), its finite-time-horizon counterpart, or the per-instant distortion constraint E{||x_t − y_t||²_2} ≤ D_t, ∀t, we end up with the well-known nonanticipatory ϵ-entropy [26], also found in the literature as the sequential or nonanticipative RDF [27,28]. The nonanticipatory ϵ-entropy has received significant interest during the last twenty years in an anthology of papers (see, e.g., [11,16,24,[29][30][31]]) due to its utility in control-related and delay-constrained applications. Moreover, the characterizations in (29) and (30) do not appear to be manageable using standard techniques, and no closed-form statements are available for the general RDF in the literature. For this reason, we will seek only non-tight bounds.
In view of the above, in the sequel we characterize and compute lower bounds on Definitions 1-3 for Gauss-Markov processes and for Markov models driven by additive i.i.d. noise processes.

Characterization and Computation of Jointly Gaussian Processes
In this section, we assume that the augmented joint process {(x_t, y_t) : t ∈ N_0} is jointly Gaussian. We use this assumption first to characterize and then to compute (26), (28) and (30) optimally.
We first use the following helpful lemma. We omit the proof because it is already derived in other papers; see, e.g., [14,24]. The only modification is the use of the augmented joint process {(x_t, y_t) : t ∈ N_0^n}.

Lemma 1.
Consider the class of conditionally Gaussian test channels {P * (dy t |y t−1 , x t ) : t ∈ N n 0 }. Then, the following statements hold.
(1) Any candidate of {P*(dy_t | y^{t−1}, x_t) : t ∈ N_0^n} can be realized by the recursion in (31). Moreover, the innovations process {I_t ∈ R^{p_1+p_2} : t ∈ N_0^n} of (31) is an orthogonal process.
(2) The pairs {(Σ_{t|t}, Σ_{t|t−1}) : t ∈ N_0^n} satisfy the vector-valued equations in (32), where Σ_{t|t} ⪰ 0 and Σ_{t|t−1} ⪰ 0.
(3) Using MMSE estimation via the vector-valued KF recursions of (32), a finite-dimensional characterization of the lower bounds is obtained.

We note that one could directly study the finite-dimensional characterizations of Lemma 1 and try to come up with numerical solutions. However, it is much more insightful to use instead the identification of the design parameters {(H_t, Σ_{v_t}) : t ∈ N_0^n} of the test-channel realization in (31). This approach is already carried out in [14,24], hence we state it without proof. Note, however, that compared to [14,24], which assume distortion constraints like (9) (or the per-time-instant counterpart of (9), i.e., E||x_t − y_t||²_2 ≤ D_t, ∀t), here we assume augmented state-space models and various spatio-temporal distortion constraints, namely, per-dimension, joint per-dimension and averaged total distortion constraints, and a covariance matrix distortion constraint.
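The vector-valued KF recursions invoked above take the standard prediction/update form. The sketch below iterates them to (numerical) steady state for hypothetical (A, Σ_w) and hypothetical test-channel design parameters (H, Σ_v), which stand in for the paper's (H_t, Σ_{v_t}).

```python
import numpy as np

# Hypothetical augmented dynamics and test-channel design parameters;
# these are illustrative, not values taken from the text.
A = np.diag([0.9, 0.5, 0.7])
Sigma_w = np.eye(3)
H = np.eye(3)
Sigma_v = 0.5 * np.eye(3)

def kf_covariances(Sigma0, steps=200):
    """Standard Kalman filter covariance recursions:
       prediction  Sigma_{t|t-1} = A Sigma_{t-1|t-1} A^T + Sigma_w
       update      Sigma_{t|t}   = Sigma_{t|t-1} - Sigma_{t|t-1} H^T
                                   (H Sigma_{t|t-1} H^T + Sigma_v)^{-1} H Sigma_{t|t-1}
    Returns the final filtering and prediction error covariances."""
    Sff = Sigma0
    for _ in range(steps):
        Sp = A @ Sff @ A.T + Sigma_w                        # Sigma_{t|t-1}
        G = Sp @ H.T @ np.linalg.inv(H @ Sp @ H.T + Sigma_v)
        Sff = Sp - G @ H @ Sp                               # Sigma_{t|t}
    return Sff, Sp

Sff, Sp = kf_covariances(np.eye(3))
```

For a stable A, these recursions converge to the time-invariant pair (Σ_{∞|∞}, Σ_{∞|∞−1}) that underpins the steady-state characterizations used later in the paper.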
Next, we give some technical remarks related to Lemma 2.

Remark 3 (Special case and technical remarks).
(1) Clearly, if in the forward test-channel realization with additive noise we set the block diagonal matrix A = 0 (the null matrix), then we recover the classical forward test-channel realization for a vector memoryless Gaussian source subject to an MSE distortion (see, e.g., ([32], Chapter 4.5), ([15], Chapter 9.7)), and the coefficients in (37) reduce to (42). Moreover, the characterizations in (38)-(40) change in that Λ_t = Σ_w. Clearly, (42) can be seen as reverse-waterfilling design parameters.
(2) Compared to (1), we note that H_t in (37) should not be confused with a positive semidefinite matrix defined via the usual quadratic form [33]; instead, it can possibly be a non-symmetric matrix which, however, has only real non-negative eigenvalues. This observation is important because it means that, in general, the design variables (∆_t, Λ_t) do not commute, unlike in the classical reverse-waterfilling problems for memoryless multivariate Gaussian random variables or i.i.d. processes (see, e.g., [19,25]).
(3) The realization (36) is the optimal realization among all realizations for this problem because the KF is the optimal causal MMSE estimator. Beyond Gaussian processes, and when the noise is zero-mean, uncorrelated and white (in our setup these properties hold), the optimal realization for Gaussian processes becomes the best linear realization (see, e.g., ([34], §3.2) or ([35], p. 130)), and similarly the corresponding characterizations in (38)-(40) are the best linear characterizations. By "best linear" realization and characterizations, we mean that there may be non-linear realizations, and hence non-linear-based characterizations, that outperform the best linear ones.
(4) The characterization of (39) is different from the characterization obtained in ([16], Theorem 1, (25e)), which uses weighted distortion constraints. The former optimization problem imposes hard constraints, whereas the latter imposes soft constraints via weights.
Nonetheless, an interesting open question is whether there exists a set of weights which, when imposed as a weighted total distortion constraint, yields the same per-dimension distortion.
(5) It should also be stressed that per-dimension constraints on the diagonal entries of ∆_t are not the same as constraints on the eigenvalues of ∆_t. This further means that, even for this class of distortion constraints, it is still possible to have rate-distortion resource allocation (i.e., a type of reverse-waterfilling optimization).

Remark 4 (Convexity). The optimization problems in (38) and (39) are convex because the objective function is linear and the constraints are affine and positive semidefinite. Thus, the problem can be solved numerically using convex programming software (see, e.g., [36]) or via the more challenging KKT conditions, which are first-order necessary conditions for global optimality ([37], Chapter 5.3). The latter yield certain non-linear matrix Riccati equations that need to be solved in order to construct a reverse-waterfilling algorithm.
Remark 5 (Existence of Solution). A sufficient condition for the existence of a solution with a finite value in (38)-(40) is to consider the strict LMI constraint 0 ≺ ∆_t ⪯ Λ_t, which ensures that the objective function is bounded. The strict LMI ensures that ∆_t ≻ 0, which further means that D > 0, D_T > 0 and D_cov ≻ 0.

Proof. See Appendix A.
It should be remarked that, instead of the derivation based on a convexity argument in Lemma 3, one can assume that the optimal minimizer P(dy_t | y^{t−1}, x_t) that achieves (43)-(45) is time-invariant and that the output distribution P(dy_t | y^{t−1}) is also time-invariant with a unique invariant distribution; see, e.g., ([14], Theorem 3). Moreover, the optimal linear forward test channel that achieves the lower bounds in (43)-(45) corresponds to the time-invariant version of the realization (36), with the corresponding time-invariant scaling coefficients of (37). From Lemma 3, the following corollary can be immediately obtained.
Proof. This is immediate from the derivation of Lemma 3.
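For intuition, in the scalar case the time-invariant characterization admits a well-known closed form from the nonanticipative-RDF literature cited above. The sketch below is a hedged illustration of that scalar special case only, not of the multidimensional bounds (43)-(45) themselves.

```python
import numpy as np

def scalar_causal_rdf(a, sigma_w2, D):
    """Steady-state causal (nonanticipative) RDF of the scalar Gauss-Markov source
    x_{t+1} = a*x_t + w_t, w_t ~ N(0, sigma_w2), under the MSE constraint
    E(x_t - y_t)^2 <= D.  With time-invariant Delta = D, the predicted error
    variance is Lambda = a^2*D + sigma_w2, and the rate is 0.5*log2(Lambda/Delta),
    clipped at zero (well-known closed form; see the references in the text)."""
    Lam = a ** 2 * D + sigma_w2
    return max(0.0, 0.5 * np.log2(Lam / D))
```

Note that the rate vanishes at D = σ_w²/(1 − a²), the stationary variance of the source when |a| < 1, since there Λ/Δ = a² + (1 − a²) = 1.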
In what follows, we show that the lower bounds in Lemma 3 are semidefinite representable, thus, they can be readily computed.

Theorem 2 (Computation of the lower bounds in Lemma 3). Consider the variable Q
Then, the following semidefinite programming representations hold.
Next, we stress some comments on the semidefinite representation of the lower bounds in Theorem 2.

Remark 6 (Comments on Theorem 2).
(1) We note that a characterization similar to those derived in Theorem 2 (subject to the distortion constraint (9), or the per-instant distortion constraint E{||x_t − y_t||²_2} ≤ D_t, ∀t, for a special case of the setup in Figure 1) was recently derived in ([16], Equation (27)). The log-determinant convex optimization problems in Theorem 2 are widely used in systems and control theory because they can deal efficiently with LMIs [38].
(2) Recently, the efficiency of the SDP algorithm in solving linear and non-linear optimization problems has attracted experts from the field of information theory, who noticed its usefulness in solving distributed source coding problems (see, e.g., [3,39]). Such log-determinant problems, when solved using the semidefinite programming method, are known to have polynomial worst-case complexity (see, e.g., [40]). In addition, for an interior point method such as the SDP approach, the most computationally expensive step is the Cholesky factorization involved in the Newton steps.
(3) On the other hand, due to its complexity, the SDP approach for high-dimensional systems is often time consuming, whereas for very large scale systems it is occasionally impossible to obtain numerical solutions. Hence, one could consider alternative methods that sacrifice, for instance, the optimality of the SDP algorithm but gain in scalability and reduce the complexity. The most computationally efficient way to solve such problems and, additionally, to gain some insight from the solution, is via the well-known reverse-waterfilling algorithm ([19], Theorem 10.3.3), which is, however, hard to construct and compute because one needs to employ and solve complicated KKT conditions [37]. Such an effort was recently made for multivariate Gauss-Markov processes under per-instant, averaged total and asymptotically averaged total distortion constraints in [24,41].
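As a concrete instance of the reverse-waterfilling alternative mentioned in item (3), the memoryless Gaussian vector case (the A = 0 special case of Remark 3) can be solved by bisection on the water level. This is a sketch of the textbook algorithm ([19], Theorem 10.3.3), not of the KKT-based scheme required for the Markov case.

```python
import numpy as np

def reverse_waterfill(eigs, D_total, tol=1e-12):
    """Classical reverse-waterfilling for a memoryless Gaussian vector source
    with covariance eigenvalues `eigs`: allocate D_i = min(theta, lambda_i)
    so that sum_i D_i = D_total, giving rate R = sum_i 0.5*log2(lambda_i/D_i).
    Requires 0 < D_total <= sum(eigs).  Bisection on the water level theta."""
    eigs = np.asarray(eigs, dtype=float)
    lo, hi = 0.0, float(eigs.max())
    while hi - lo > tol:
        theta = 0.5 * (lo + hi)
        if np.minimum(theta, eigs).sum() < D_total:
            lo = theta           # water level too low: total distortion too small
        else:
            hi = theta
    D_i = np.minimum(theta, eigs)
    rate = 0.5 * np.log2(eigs / np.maximum(D_i, tol)).sum()
    return rate, D_i
```

Components with small variance are "flooded" (D_i = λ_i, zero rate), while the rest share a common distortion level θ, which is exactly the resource-allocation behavior the per-dimension constraints can override.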
Next, we perform some numerical illustrations using the semidefinite representations of Theorem 2. We also compare (48) and (50) to the known expression obtained only for the asymptotically averaged total MSE distortion constraint in ([16], Equation (27)). We note that the SDP algorithm for (48)-(51) is implemented using the CVX platform [36].
Based on this numerical study, we observe that for distortion levels in (0, D* = D], R^{LB}_{joint}(D*) ≥ R^{LB}_{pd}(D), whereas for values of D_T greater than D* we observe that R^{LB}_{joint}(D*) = R^{LB}_{pd}(D), because the asymptotically averaged total MSE distortion constraint is inactive. This observation verifies our comment in Section 1.2 regarding the connection between (11) and (12). Clearly, at high rates (or high resolution) we observe that R^{LB}_{joint}(D*) ≈ R^{LB}(D_T).
Another interesting observation (illustrated in Figure 4) can be made if, in the same example, we allocate the total budget of per-dimension distortion equally, i.e., D_ii = D_jj, ∀ i ≠ j.

Example 2 (Covariance matrix distortion constraint). For the system in (1), we assume that user 1 is described by an R²-valued time-invariant Markov source driven by an i.i.d. Gaussian noise process with parameters (A_1, Σ_{w^1}), whereas user 2 is described by an R-valued time-invariant Markov source driven by an i.i.d. Gaussian noise process with parameters (A_2, Σ_{w^2}). The augmented state space model (2) is generated by A = A_1 ⊕ A_2 and, similarly, Σ_w = Σ_{w^1} ⊕ Σ_{w^2}. For this example, we assume a covariance matrix distortion constraint of the form (56), where γ > 0 is the positive correlation coefficient between the distortion matrix components (i.e., the diagonal entries) and is chosen such that D_cov ≻ 0.
In Figure 5 we demonstrate a comparison of R^LB(D_cov) evaluated for several different values of γ. One interesting observation is that higher distortion correlation in (56) leads to fewer bits, up to a γ_max ≈ 0.53 beyond which the value of R^LB(D_cov) remains unchanged. Another interesting observation is that for negative correlation γ, the SDP approximation does not return a numerical solution. However, this is not the case in general (see, e.g., ([42], Example 1)). Using the same simulation study, we can arrive at an interesting connection between the approximations in (51) and (48). In particular, if in (56) we restrict the matrix distortion constraint only to the main diagonal elements (i.e., exactly like the per-dimension constraints), then we obtain the plot of Figure 6, which clearly demonstrates that R^LB(D_cov) = R^LB_pd(D). In fact, restricting the covariance matrix distortion constraint of (56) to the per dimension distortion constraint is as if we optimize over a solution space in which γ is allowed to take any value in R. As a result, the feasible set of solutions is larger when the constraint set is subject to per-dimension distortion constraints rather than the covariance matrix distortion constraint.
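As a side computation, the requirement D_cov ≻ 0 caps how much distortion correlation can be injected into the constraint. A minimal sketch, with a hypothetical parametrization (uniform off-diagonal entries γ) and hypothetical diagonal values that are not those of Example 2, finding the largest admissible γ by bisection on the smallest eigenvalue:

```python
import numpy as np

def max_gamma(diag_entries, hi=None, tol=1e-9):
    """Largest gamma >= 0 keeping D_cov = diag(d) + gamma*(J - I) positive
    semidefinite, found by bisection on the smallest eigenvalue."""
    d = np.asarray(diag_entries, dtype=float)
    n = d.size
    off = np.ones((n, n)) - np.eye(n)   # uniform off-diagonal pattern
    def is_psd(g):
        # eigvalsh returns eigenvalues in ascending order
        return np.linalg.eigvalsh(np.diag(d) + g * off)[0] >= -1e-12
    lo, hi = 0.0, float(d.sum()) if hi is None else float(hi)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if is_psd(mid) else (lo, mid)
    return lo

# For diagonal (2, 1) the 2x2 matrix [[2, g], [g, 1]] is PSD iff g <= sqrt(2).
g_max = max_gamma([2.0, 1.0])
```

This only checks feasibility of the constraint set; the γ_max ≈ 0.53 reported above is the point where the SDP value saturates, which is a different quantity.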

Analytical Lower Bounds for Markov Sources Driven by Additive i.i.d. Noise Processes
In this subsection, we derive analytical lower bounds on (11)-(13) when the source model describing the behavior of user 1 or user 2 is driven by an additive i.i.d. possibly non-Gaussian noise process.
We first give a lemma which will facilitate the derivation of our lower bounds. We only consider the case of RDFs subject to per-dimension distortion constraints because the other classes of distortion constraints follow similarly.

Lemma 4 (Rate-distortion bounds).
For the augmented source model describing the behavior of users 1, 2 in (3), the following inequalities hold assuming distortion constraints in the class of (8), where (a) holds with equality if the augmented state space model described in (2) is jointly Gaussian and the optimal minimizer, i.e., P*(dy_t | y^{t−1}, x_t), of R^LB_pd(D) is conditionally Gaussian. Equality in (a) holds trivially at D_max.
Proof. The RHS inequality follows from Theorem 1 and (43), whereas the LHS inequality follows from the fact that the constraint set of R^LB_pd(D) is larger than the constraint set of R^{LB,linear}_pd(D), which is restricted to linear coding policies. Now, under the specific augmented source model in (3), and using Lemma 2, (1), we obtain R^{LB,linear}_pd(D) defined as in (58); these are the best linear coding policies because the KF algorithm is the best linear causal MSE estimator beyond additive Gaussian noise processes (see the discussion in Remark 3, (3)). Clearly, if the augmented source in (3) is jointly Gaussian and the optimal realization of R^LB_pd(D) is conditionally Gaussian, then the system model is jointly Gaussian and the optimal policies are linear, given by the forward linear test channel realization obtained in (36); hence the LHS inequality holds with equality.

Remark 7 (On Lemma 4). We note that Lemma 4 holds if we assume RDFs with distortion constraints in the class of (9) or (10).
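The proof invokes the Kalman filter as the best linear causal MSE estimator. As an illustration of the steady-state quantities such a filter produces, a scalar prediction-error Riccati iteration is sketched below; the parameters a, q, r are hypothetical and are not taken from the paper:

```python
def steady_state_riccati(a, q, r, iters=500):
    """Iterate the scalar prediction-error Riccati recursion
        P <- a^2 * (P - P^2 / (P + r)) + q
    (process gain a, process noise variance q, measurement noise
    variance r) until it settles at the steady-state error covariance."""
    P = q
    for _ in range(iters):
        P = a * a * (P - P * P / (P + r)) + q
    return P

# Hypothetical stable system: the iteration converges to a fixed point P.
P = steady_state_riccati(a=0.9, q=1.0, r=0.5)
```

The fixed point P of this recursion is the steady-state prediction error covariance that the linear coding policies of (58) are built on in the scalar case.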
where |D_cov| ∈ (0, ·].

The following technical remarks can be made regarding Theorem 3.

Remark 8.
(1) Note that if in Theorem 3 we allow h(w) = −∞, then the analytical lower bound expressions take a negative finite value or −∞, which cannot be the case (an RDF is, by definition, non-negative). A way to include the case h(w) = −∞ in our lower bound expressions is to set the objective functions in (59)-(61) to be [log(·)]^+. This means that whenever h(w) = −∞, the analytical lower bound expression will be zero. (2) The analytical lower bounds in (59)-(61) do not correspond to the best linear forward test channel realization of Lemma 3 (see (46)), which is also the optimal policy under the assumption of an MMSE decoder when the system's noise is purely Gaussian (see Remark 3, (3)). Moreover, it is not clear which realization achieves them in the same way that the bounds in Lemma 3 are achieved for Gaussian processes. (3) If in Theorem 3 we assume that users 1, 2 have source models described by Markov processes driven by additive Gaussian noise processes, then from the EPI (see, e.g., ([43], Equation (7))) N(w) = |Σ_w|^{1/(p_1+p_2)} and (59)-(61) change accordingly.
(4) One can choose to further bound (61) using the inequality |D_cov|^{−1/(p_1+p_2)} ≥ (p_1+p_2)/trace(D_cov), obtaining a further lower bound that coincides with the lower bound in (59) (see also the discussion in Example 2). Such a lower bound would mean that we extend the set of feasible solutions corresponding to the initial problem statement (13) to resemble the initial problem statement of (11), which cannot be the case in general. Our bound in (61) encapsulates the off diagonal elements of the distortion covariance matrix D_cov, hence it is an appropriate lower bound for the specific problem.
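The inequality used in item (4) is the AM-GM inequality applied to the eigenvalues of D_cov (the geometric mean of the eigenvalues is at most their arithmetic mean). A quick numerical check on a hypothetical positive definite matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
B = rng.standard_normal((n, n))
D_cov = B @ B.T + np.eye(n)   # hypothetical positive definite matrix

# |D_cov|^{1/n} <= trace(D_cov)/n  <=>  |D_cov|^{-1/n} >= n/trace(D_cov)
gm = np.linalg.det(D_cov) ** (1.0 / n)   # geometric mean of eigenvalues
am = np.trace(D_cov) / n                 # arithmetic mean of eigenvalues
assert gm <= am + 1e-12
```

Equality holds only when all eigenvalues of D_cov coincide, which is why the bound in (61) is strictly tighter than (59) for generic distortion covariance matrices.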
In what follows, we give a numerical simulation where we compare the solution of R^LB_joint(D*) (the lower bound achieved by the optimal coding policies when the system is driven by additive i.i.d. Gaussian noise processes), computed via the SDP representation of (50), with the lower bound obtained in (60) when the system's noise is also Gaussian.
Example 3 (Comparison of (50) with (60) for jointly Gaussian processes). We consider the same input data assumed in Example 1 for users 1, 2. Then, we proceed to compute the true lower bound of (50) and the lower bound obtained in (60).
Our simulation study in Figure 7 shows that at high rates the performance of the two bounds is almost identical, whereas at moderate and low rates we observe a gap that remains constant when D_T ≥ D, i.e., when the asymptotically averaged total distortion constraint is inactive. The same behavior is expected for systems of larger dimension (larger scale optimization problems), with a possibly increased gap at moderate and low rates depending on the structure of the block diagonal matrices A and Σ_w.

Next, we state a corollary of Theorem 3.
Corollary 2 (Analytical bounds when users 1, 2 are not specified by the same additive noise process).
One can deduce the following for Corollary 2.
Remark 9. Corollary 2 will give similar analytical lower bounds (with appropriate modifications) if instead of user 1, we assume that the source model of user 2 is driven by a Gaussian noise process. The additional assumption on the covariance matrix of the noise process in both users is imposed because otherwise we cannot guarantee that the key series of inequalities (65) will be satisfied.

Upper Bounds
In this section we consider encoding the augmented vector-valued Markov source modeled by (3) using a sequential causal DPCM scheme with a feedback loop followed by an ECDQ. The scheme relies on the linear forward test channel realization of the bounds in Lemma 2. The precursor of the DPCM-based scheme with a feedback loop is [14], whereas ECDQ is a classical source coding approach with standard performance guarantees in information theory (see, e.g., [44]). The ECDQ scheme is utilized to bound the rate performance of the DPCM scheme. This approach furnishes achievable (upper) bounds on the operational causal RDFs in (11)-(13).

DPCM with Feedback Loop
First, we briefly describe the sequential causal DPCM scheme with feedback loop introduced in ([14], Figure 2) (see also [45]). Observe that because the augmented source is modeled as a first-order multidimensional Markov process, the sequential causal coding is precisely equivalent to a predictive coding paradigm (see, e.g., [14,46]).
At each time instant t, the encoder (or innovations' encoder) performs the linear operation (66), where at t = 0 we assume initial data x_0, and y_{t−1} denotes the conditional-expectation estimate of x_{t−1} given the previous quantized symbols. (Note that the residual process x̃_t retains temporal correlation, since it subtracts the error of x_t given all previous quantized symbols and not the infinite past x^t_{−∞} of the source. Hence, x̃_t is only an estimate of the true innovations process, and this causes part of the sub-optimality of this scheme.) Then, by means of an R^{p_1+p_2}-valued MMSE quantizer operating at a rate R_t, we generate the quantized reconstruction ỹ_t of the residual source x̃_t, given by ỹ_t = y_t − A y_{t−1}. We then send the data packet corresponding to ỹ_t over the channel. At the decoder, we receive it and recover the quantized symbol ỹ_t of x̃_t.
Then, we generate the estimate y_t using the linear operation (67). Combining (66) and (67), we obtain (68). From (68), we can immediately deduce that the error between x_t and y_t is equal to the quantization error introduced between x̃_t and ỹ_t, which means that the MSE distortion at each instant of time satisfies (69). In addition, the covariance matrix Σ_∆ is given by (70). A pictorial view of the DPCM scheme with feedback loop is given in Figure 8.
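The key feedback-loop identity, that the end-to-end error x_t − y_t equals the quantization error of the residual, can be checked with a minimal scalar simulation. The uniform quantizer and the source parameters below are hypothetical stand-ins (the paper's scheme is vector-valued with an MMSE quantizer):

```python
import numpy as np

def quantize(x, step=0.5):
    # Uniform mid-tread quantizer standing in for the MMSE quantizer.
    return step * np.round(x / step)

rng = np.random.default_rng(1)
a, T = 0.9, 1000
x = np.zeros(T)
for t in range(1, T):              # scalar Markov source x_t = a*x_{t-1} + w_t
    x[t] = a * x[t - 1] + rng.standard_normal()

y = np.zeros(T)
resid_err = np.zeros(T)
y_prev = 0.0
for t in range(T):
    x_res = x[t] - a * y_prev      # innovations encoder: residual source
    y_res = quantize(x_res)        # quantized residual
    y[t] = y_res + a * y_prev      # decoder reconstruction with feedback
    resid_err[t] = x_res - y_res
    y_prev = y[t]

# End-to-end error equals the residual quantization error at every t.
assert np.allclose(x - y, resid_err)
```

The final assertion is exactly the scalar version of the deduction from (68): the feedback loop confines the distortion to the quantization noise of the residual, so the error never accumulates over time.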

Bounding (11)-(13) via A DPCM-based ECDQ for Gaussian Noise Processes
In this subsection, we bound the rate performance of the DPCM scheme described in Section 3.1 in the infinite time horizon, using a scheme that utilizes the steady-state linear forward test-channel realization achieving the lower bounds of Lemma 3. Essentially, in this scheme we replace the quantization noise with additive Gaussian noise of the same second moments (see, e.g., [47] or [44] (Chapter 5) and the references therein).
Recall that the steady-state linear forward test-channel realization of the lower bounds in Lemma 3 is written as in (71), whereas the steady-state reverse-waterfilling parameters (H, Σ_v) are given by (72). The forward test-channel realization of (71) is illustrated in Figure 9. Before we proceed, we point out the following important technical remarks on the realization (71) and the coefficients (72).

Remark 10 (Observations on (71) and (72)).
(1) The linear forward test channel realization with additive noise in (71) is equivalent to the steady-state realization in (46) because for both it can be shown that the MSE distortion constraint is achieved (i.e., v_t ∼ N(0; Σ_v), Σ_v = H∆ = ∆H^T ⪰ 0). Moreover, this realization is equivalent to, but simpler to build than, the forward test channel realization introduced in [14], in which non-singular matrices and diagonalization by congruence are assumed (see ([14], Theorem 4)).
(2) In the test channel realization of Figure 9, a reverse-waterfilling in the spatial dimension is possible when we assume asymptotically averaged total MSE distortion constraints, similar to ([14], Theorem 4). This reverse-waterfilling is dictated by the rank of the matrix H. To make this point clear, if H is full rank, then all spatial dimensions in the system are active, whereas if H is rank deficient, then some dimensions are inactive (these dimensions form the null space and the nullity of H) and for these dimensions the rate is zero; hence they can be excluded from the realization of Figure 9. In the sequel, we present simulations where we study the reverse-waterfilling in the spatial domain under a certain distortion constraint studied in this paper.
Pre/Post Filtered ECDQ with multiplicative factors for augmented multivariate Gauss-Markov sources and spatial reverse-waterfilling. First, we consider a rank(H)-dimensional lattice quantizer Q_rank(H)(·) [48] such that (73), where z_t ∈ R^rank(H) is a random dither vector generated both at the encoder and the decoder, independent of the input signals x̃_t and of the previous realizations of the dither, uniformly distributed over the basic Voronoi cell of the rank(H)-dimensional lattice quantizer Q_rank(H)(·) such that v^c_t ∼ Unif(0; Σ_{v^c_t}). At the encoder, the lattice quantizer quantizes H x̃_t + z_t, that is, it computes Q_rank(H)(H x̃_t + z_t), where x̃_t is given by (66). Then, the encoder applies conditional entropy coding to the output of the quantizer and transmits the output of the entropy coder. At the decoder, the coded bits are received and the output of the quantizer is reconstructed, i.e., Q_rank(H)(H x̃_t + z_t). Then, the decoder generates an estimate by subtracting z_t from the quantizer's output and multiplying the result by I_rank(H) (I_rank(H) denotes the identity matrix with dimensions according to the rank of H; this identity matrix can be excluded, but we include it here for completeness).

Performance. The coding rate at each instant of time, given by the conditional entropy of the MMSE quantizer, is bounded as in (74), where v^c_t ∈ R^rank(H) is the (uniform) coding noise in the ECDQ scheme and v_t is the corresponding Gaussian counterpart; (a) follows because the two random vectors v^c_t, v_t have the same second moments, hence we can use the identity for D(x||x'); (c) follows because the divergence of the coding noise from Gaussianity is less than or equal to (rank(H)/2) log(2πe G_rank(H)) [47], where G_rank(H) is the dimensionless normalized second moment of the lattice. Moreover, the per-instant coding rate is lower bounded by I(H x̃_t; H x̃_t + v_t), where (*) follows from the realization (71), (**) follows from the fact that x̃_t and ỹ_t (obtained by (67)) are independent of y^{t−1}, and (***) is a consequence of the data processing inequality, since (H x̃_t + v_t) ↔ x̃_t ↔ H x̃_t.
Under the assumption that the clocks of the entropy encoder and entropy decoder in the ECDQ scheme are synchronized, the total coding rate is obtained as in (75), where (e) follows from (74) and (f) follows from the derivation of Lemma 2.
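The ECDQ derivation rests on the standard subtractive-dither property: after the decoder subtracts the dither, the effective coding noise is uniform over the lattice cell and independent of the input. A scalar (integer-lattice) stand-in sketch with a hypothetical step size:

```python
import numpy as np

rng = np.random.default_rng(2)
step = 1.0
x = rng.standard_normal(100_000)               # arbitrary input signal
z = rng.uniform(-step / 2, step / 2, x.shape)  # dither, shared by both ends

q = step * np.round((x + z) / step)            # lattice quantization of x + z
x_hat = q - z                                  # decoder subtracts the dither

err = x_hat - x
# Coding noise is uniform on the cell, hence bounded by step/2 ...
assert np.max(np.abs(err)) <= step / 2 + 1e-12
# ... with second moment step^2/12 (G_1 = 1/12 for the scalar lattice).
assert abs(np.mean(err ** 2) - step ** 2 / 12) < 2e-3
```

Step (a) of the bound exploits exactly this: the uniform coding noise v^c_t can be compared against a Gaussian v_t with matched second moments, at a divergence cost of at most (rank(H)/2) log(2πe G_rank(H)).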
The previous analysis yields the following theorem.
Theorem 4 (Achievability bound on (11)-(13)). Suppose that ∆_t = ∆, ∀t, and assume that the source models (1) of users 1, 2 are driven by Gaussian noise processes. Then, the augmented state space source model in (3) ensures achievability bounds (76)-(78) on (11)-(13). Proof. Under the conditions of the theorem and the ECDQ scheme that leads to (75), the RHS terms in (76)-(78) are all constants. Then, taking the limit on both sides of (76)-(78) and then the infimum over the appropriate (minimization) constraint sets, the result follows.
We wish to point out the following for Theorem 4.

Remark 11 (Comments on Theorem 4).
(1) The ECDQ that leads to (75) is not the same as the standard symmetric ECDQ scheme for scalar-valued processes, i.e., where the coefficient H breaks into pre- and post-additive noise channel scalings that tune the MSE distortion (see, e.g., [44,47]). In our pre/post scaled ECDQ scheme we take asymmetric coefficients based on the realization of Figure 9. This leads to a coarser lattice than the one used for the unscaled ECDQ (for details see, for instance, [44]). (2) Since the upper bound essentially relies on the obtained lower bound for all of (76)-(78), similar observations can be made. For instance, if D < D_T, then (77) recovers (76), i.e., the asymptotically averaged total distortion constraint is inactive. Moreover, we cannot claim tightness of the achievability bound in (78) because the lower bound is already non-tight.
Next, we give an example where we compare, for various distortion levels, the RL gap between the achievability bound obtained in Theorem 4 and the lower bound obtained in Theorem 2 for the operational causal RDF with joint distortion constraints.
Example 4 (RL gap of achievability and lower bounds). In this example, we plot lower and upper bounds on the operational causal RDF subject to joint distortion constraints using the bounds derived in (50) and (77), respectively. We first consider the same input data assumed in Example 1 for users 1, 2. Then, we proceed to compute the lower bound via (50) and the achievability bound in (77). After the first numerical study, we consider another one in which we only change the Gaussian noise process for both users 1, 2. Using the data of Example 1, and the same D_T and D_ii, i = 1, 2, 3, 4, 5, we obtain the plots of Figure 10. For this study, we have used a Schläfli lattice D_5 with a dimensionless normalized second moment G_5 ≈ 0.0756 (for details on this lattice see, e.g., [48]). In this example H is always full rank and the RL gap is constant at 0.9218 bits/augmented vector. For the second study, we obtain the plots of Figure 11. Here we have used the dimensionless normalized second moment of a Schläfli lattice D_5 for the full rank case and, for the rank deficient cases, a Schläfli lattice D_4 with a dimensionless normalized second moment G_4 ≈ 0.0766. Similar to the first study, when H is full rank the RL gap is 0.9218 bits/augmented vector, whereas for H rank deficient the RL gap is 0.7754 bits/augmented vector.
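The constant RL gaps reported in Example 4 are the divergence-from-Gaussianity term of the ECDQ bound, (rank(H)/2) log₂(2πe G_rank(H)) bits per augmented vector. Plugging in the stated second moments of the Schläfli lattices reproduces the stated numbers:

```python
import math

def rl_gap(rank_H, G):
    """ECDQ rate-loss term: (rank(H)/2)*log2(2*pi*e*G) bits/augmented vector."""
    return 0.5 * rank_H * math.log2(2 * math.pi * math.e * G)

gap_full = rl_gap(5, 0.0756)   # D5 lattice, full-rank H:       ~0.9218 bits
gap_def = rl_gap(4, 0.0766)    # D4 lattice, rank-deficient H:  ~0.7754 bits
```

This confirms that the gap in Figures 10 and 11 is entirely the lattice shaping loss and does not grow with the distortion level.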

Bounding (11)-(13) via a DPCM-based ECDQ for Non-Gaussian Noise Processes
Similar to Lemma 4, where linear policies are the benchmark used to derive lower bounds on (11)-(13), in this subsection we derive upper bounds on (11)-(13) using the linear test channel realization in Figure 9 and the DPCM-based ECDQ scheme of Sections 3.1 and 3.2.
Next, we state the following theorem.
Funding: This research was funded by KAW Foundation and the Swedish Foundation for Strategic Research.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Notation
The following notation is used in this manuscript:

Symbol : Description
 : The set {0, . . . , n}, where n ∈ N_0
x : Random variable
X : Alphabet of the random variable x
x^t_r : The sequence of random variables (x_r, x_{r+1}, . . . , x_t), (r, t) ∈ Z × Z, r ≤ t
x^t_r : Sequence of realizations of the random variables, where x^t_r ∈ X^t_r
X^t_r : ×_{k=r}^{t} X_k, with X_t = X
P(dx) : The distribution of the RV x on X
P(dy|x) : The conditional distribution of the RV y given x
x ∈ R^p : Column vector
x^T ∈ R^{1×p} : Row vector
K ∈ R^{p×p} : Square real matrix
K^T ∈ R^{p×p} : Transpose of the square real matrix K
K_ii : Diagonal elements of matrix K
|K| : Determinant of K
rank(K) : Rank of K
trace(K) : Trace of K
µ_{K,i} : The i-th eigenvalue of matrix K
Σ_x : The covariance of a random vector x
Σ_x ≻ 0 : Positive definite covariance matrix Σ_x
Σ_x ⪰ 0 : Positive semidefinite covariance matrix Σ_x
0 : Null matrix
I_p : Identity matrix of dimension p
H(·) : Discrete entropy
h(·) : Differential entropy
D(x||x') : KL divergence of the probability distribution P(x) with respect to the probability distribution P(x')
x ∼ N(0; Σ) : Gaussian random vector x with zero mean and covariance Σ
x ∼ Unif(0; Σ) : Uniformly distributed random vector x with zero mean and covariance Σ
h_G(·) : Gaussian differential entropy
R_G(·) : Gaussian information RDF
N(x) : The entropy power of the random vector x
|| · ||_2 : Euclidean norm
E{·} : Expectation operator
[·]^+ : max{0, ·}
A ↔ B ↔ C : A, B, C form a Markov chain