Source Coding with a Causal Helper

Abstract: A multi-terminal network is considered in which an encoder, assisted by a side-information-aided helper, describes a memoryless source to a receiver. The encoder provides a non-causal one-shot description of the source to both the helper and the receiver. The helper, which has access to causal side-information, describes the source to the receiver sequentially, sending a sequence of causal descriptions that depend on the message conveyed by the encoder and on the side-information subsequence it has observed so far. The receiver reconstructs the source causally, producing at each time unit an estimate of the current source symbol based on what it has received so far. Given a reconstruction fidelity measure and a maximal allowed distortion, we derive the rates-distortion region for this setting and express it in terms of an auxiliary random variable. When the source and side-information are drawn from an independent identically distributed Gaussian law and the fidelity measure is the squared-error distortion, we show that, for the evaluation of the rates-distortion region, it suffices to choose the auxiliary random variable to be jointly Gaussian with the source and side-information pair.


Introduction
In the classical problem of source coding with decoder side information, the source and side information are generated by independent drawings $(X_k, Y_k)$ of the pair $(X, Y) \sim P_{XY}$. The encoder forms a description of the source sequence $X^n = (X_1, \ldots, X_n)$ using a map $f^{(n)}: \mathcal{X}^n \to \{1, \ldots, 2^{nR}\}$, while the decoder forms its reconstruction $\hat{X}^n$ depending on both the side-information sequence $Y^n$ and the index $T \in \{1, \ldots, 2^{nR}\}$ conveyed by the encoder. In their seminal work [1], Wyner and Ziv derived the rate-distortion function for this setting, given a fidelity measure, when $\hat{X}^n$ may depend on $Y^n$ in an arbitrary manner. Yet, Wyner-Ziv source coding with non-causal decoder side-information involves binning, the implementation of which is complex.
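For jointly Gaussian $(X, Y)$ with squared-error distortion, the Wyner-Ziv rate-distortion function admits the well-known closed form $R_{\mathrm{WZ}}(D) = \max\{0, \tfrac{1}{2}\log(\sigma^2_{X|Y}/D)\}$. The following minimal Python sketch, not part of [1] but a standard illustration, evaluates it; the correlation coefficient and the distortion grid are illustrative choices.

```python
import numpy as np

def gaussian_wyner_ziv_rate(var_x, rho, D):
    """R_WZ(D) for jointly Gaussian (X, Y) under squared-error distortion.
    rho is the correlation coefficient of (X, Y); rates are in bits."""
    var_x_given_y = var_x * (1.0 - rho ** 2)      # MMSE of X given Y
    if D >= var_x_given_y:
        return 0.0                                # side information suffices
    return 0.5 * np.log2(var_x_given_y / D)

for D in (0.05, 0.10, 0.25):
    print(f"D = {D}: R_WZ = {gaussian_wyner_ziv_rate(1.0, 0.8, D):.3f} bits")
```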
Successive refinement for the Wyner-Ziv problem with side-information $(Y, Z)$ is a variant of the Wyner-Ziv model in which the encoder provides a two-layer description of the source sequence $X^n$ to a pair of decoders. Decoder 1, which obtains just the coarse description layer, has available as non-causal side-information the memoryless vector $Z^n$, while Decoder 2, which obtains both description layers, has available as non-causal side-information the memoryless vector $Y^n$. It is further assumed that the reconstruction formed by Decoder 2 should have a smaller distortion than that formed by Decoder 1. Such a model has been considered in [2], wherein a complete characterization of the rates-distortion region was obtained for the case where $Z$ is stochastically degraded with respect to $Y$ given $X$, i.e., $X \leftrightarrow Y \leftrightarrow Z$ forms a Markov chain.
The work [3] studies an extension of the model of [2] in which a conference link of given capacity allows unidirectional cooperation between Decoder 1 and Decoder 2, i.e., Decoder 1 also functions as a helper. The results in [3] are partially tight in the sense that the characterization of the encoder rates is conclusive; the remaining gap lies in the characterization of the helper's rate.
Thus, with non-causal side-information at both decoders, successive refinement for the Wyner-Ziv problem with a helper remains unsolved.
Motivated by practical delay-constrained sequential source coding with decoder side information, Weissman and El Gamal considered in [4] a scheme with causal side information at the decoder, where the sequence of reconstructions $\hat{X}^n = (\hat{X}_1, \ldots, \hat{X}_n)$ is formed sequentially in a causal manner according to $\hat{X}_k = \hat{X}_k(T, Y^k)$, and derived the corresponding rate-distortion function. Similarly to [1], the rate-distortion function in [4] is expressed in terms of an auxiliary random variable, leaving its optimal choice an open issue that depends on the specifics of the model. For the setting where $(X, Y)$ are Gaussian, the authors in [4] compute an upper bound on the rate-distortion function by choosing the auxiliary random variable to be jointly Gaussian with $X$, while leaving open the question of whether this choice is optimal.
With the vision that modern network design will support the use of cooperation links in favor of reduced encoding/decoding complexity and relaxed network deployment constraints, this work considers an extension of the model of [4], involving a causal helper and causal side information at the decoder, described as follows. The component $\{X_k\}_{k=1}^{\infty}$ of a trivariate independent identically distributed (IID) source $\{(X_k, Y_k, Z_k)\}_{k=1}^{\infty}$ is observed by the encoder, while the source component $\{Y_k\}_{k=1}^{\infty}$ is observed by the helper. The encoder and the helper together describe the length-$n$ source sequence $X^n$ to the decoder, subject to a given fidelity, in two steps. First, given $X^n$, the encoder sends a rate-$R$ description of it to both the helper and the decoder. Then, for each source symbol $Y_k$, the helper sends to the decoder a causal description depending on the message it received from the encoder and on the source subsequence $Y^k$ it has observed so far, with the aggregate rate not exceeding $R_{\mathrm{h}}$. The decoder, which observes the source component $\{Z_k\}_{k=1}^{\infty}$ causally, uses, at each time unit $k$, the descriptions it has received so far and the source subsequence $Z^k$ to form its reconstruction $\hat{X}_k$ of the source symbol $X_k$. Given a fidelity measure and a maximal allowed distortion, the goal is to determine the set of all rate pairs $(R, R_{\mathrm{h}})$ that satisfy the distortion constraint.
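To make the timing structure concrete, the following toy Python sketch traces the message flow of the model just described: a one-shot encoder message $T$, per-symbol helper descriptions $V_k = f_{1,k}(T, Y^k)$, and causal reconstructions $\hat{X}_k = g_k(T, V^k, Z^k)$. The specific quantizer and combining weights are hypothetical placeholders chosen only to make the sketch runnable; they carry no optimality claim.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
X = rng.normal(size=n)                 # source seen by the encoder
Y = X + 0.5 * rng.normal(size=n)       # helper's causal side information
Z = X + 1.0 * rng.normal(size=n)       # decoder's causal side information

# One-shot (non-causal) encoder message T, sent to helper and decoder.
T = round(float(X.mean()), 1)          # hypothetical rate-R description

V_hist, X_hat = [], []
for k in range(n):
    # Helper: causal description V_k = f_{1,k}(T, Y^k); 1-bit placeholder.
    V_hist.append(1.0 if Y[k] > T else -1.0)
    # Decoder: causal estimate of X_k from (T, V^k, Z^k); weights are ad hoc.
    X_hat.append(T + 0.4 * np.mean(V_hist) + 0.3 * (Z[k] - T))

print("empirical distortion:", np.mean((X - np.array(X_hat)) ** 2))
```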
Causal decoder side information has also been considered in the context of successive refinement. With the aim of reducing encoder/decoder complexity, a two-layer description model with successive refinement was considered in [5], under the setting that the side information is available causally at each of the decoders. A single-letter characterization of the rates-distortion region is obtained in [5], irrespective of the relative ordering of the side-information quality at the decoders. Furthermore, the direct part in [5] demonstrates that, similarly to [4], with causal side-information at the decoders the optimal code avoids binning, hence its implementation is practically appealing. The extension of the model of [5] with a causal helper has recently been studied in [6].

Problem Formulation
Formally, our problem can be stated as follows. A discrete memoryless source (DMS) $(\mathcal{X}, P_X)$ is an infinite sequence $\{X_i\}_{i=1}^{\infty}$ of independent copies of a random variable $X$ taking values in the finite set $\mathcal{X}$ with generic law $P_X$. Similarly, a triple source $(\mathcal{X}\mathcal{Y}\mathcal{Z}, P_{XYZ})$ is an infinite sequence of independent copies of the triplet of random variables $(X, Y, Z)$, taking values in the finite sets $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{Z}$, respectively, with generic joint law $P_{XYZ}$. Since our goal is to encode the source $X$, let $\hat{\mathcal{X}}$ denote any finite reconstruction alphabet and let $d: \mathcal{X} \times \hat{\mathcal{X}} \to [0, \infty)$ be a single-letter distortion measure. The vector distortion measure is defined as
$$d(x^n, \hat{x}^n) = \frac{1}{n} \sum_{k=1}^{n} d(x_k, \hat{x}_k).$$
A system diagram appears in Figure 1.
Definition 1. An $(n, M, M_{\mathrm{h}}, D)$ code for the source $X$ with causal side-information (SI) $(Y, Z)$ and causal helper consists of:

1. An encoder mapping
$$f^{(n)}: \mathcal{X}^n \to \{1, \ldots, M\}; \qquad (1)$$

2. A unidirectional conference between the helper and the decoder, consisting of a sequence of causal descriptions
$$V_k = f^{(n)}_{1,k}(T, Y^k), \qquad k = 1, \ldots, n,$$
where $T = f^{(n)}(X^n)$ and the descriptions $V^n$ take at most $M_{\mathrm{h}}$ values in total;

3. A sequence of reconstruction mappings $g^{(n)}_k$, $k = 1, \ldots, n$, producing the causal estimates $\hat{X}_k = g^{(n)}_k(T, V^k, Z^k)$, whose expected distortion satisfies
$$\mathbb{E}\left[\frac{1}{n} \sum_{k=1}^{n} d\big(X_k, \hat{X}_k\big)\right] \le D. \qquad (4)$$

The rate tuple $(R, R_{\mathrm{h}})$ of the code is $\big(\tfrac{1}{n}\log M, \tfrac{1}{n}\log M_{\mathrm{h}}\big)$. Given a non-negative distortion $D$, the tuple $(R, R_{\mathrm{h}})$ is said to be $D$-achievable for $X$ with causal SI $(Y, Z)$ if, for any $\delta > 0$, $\epsilon > 0$, and sufficiently large $n$, there exists an $(n, \exp[n(R + \delta)], \exp[n(R_{\mathrm{h}} + \delta)], D + \epsilon)$ code for the source $X$ with causal SI $(Y, Z)$ and causal helper. The collection of all $D$-achievable rate tuples is the achievable source-coding region and is denoted by $\mathcal{R}(D)$.
In this work, we provide a single-letter characterization of $\mathcal{R}(D)$. In contrast to [5], a consequence of Definition 1 is that $\mathcal{R}(D)$ may depend on the joint law $P_{XYZ}$ of the triple, not only through the marginal laws $P_{XY}$ and $P_{XZ}$. This is due to the fact that the decoder acquires, in addition to its private side information $Z$, some additional side information on $Y$ via the conference link. As a result, the expectation in (4), which takes into account the mapping $g^{(n)}_k(\cdot)$, is taken over the joint law $P_{XYZ}$ and not just over the marginal law $P_{XZ}$, as is the case in [5].
Finally, although not over finite alphabets, the Gaussian source is of particular interest to us. This is a memoryless source where $(X, Y, Z)$ are centered and jointly Gaussian, with each pair $(X_i, Y_i)$ drawn such that $P_{XY}$ satisfies
$$Y_i = \sqrt{\rho}\, X_i + W_i, \qquad (6)$$
where $\rho > 0$ is a fixed constant, and $X_i \sim \mathcal{N}(0, 1)$ and $W_i \sim \mathcal{N}(0, 1)$ are mutually independent. Moreover, $Z_i$ is drawn according to
$$Z_i = a X_i + b Y_i + T_i,$$
where $a$ and $b$ are real numbers and $T_i \sim \mathcal{N}(0, 1)$ is independent of $(X_i, Y_i)$. Furthermore, in this case, the fidelity measure will be the squared-error distortion $d(x, \hat{x}) = (x - \hat{x})^2$, in which case we may restrict the reproduction functions $g^{(n)}_k$ to be the MMSE estimates of $X_k$ given the descriptions $(T, V^k)$ and the side-information $Z^k$. That is, $\hat{X}_k = \mathbb{E}\big[X_k \,\big|\, T, V^k, Z^k\big]$. In the Gaussian network setting, our focus will be on determining the optimal choice of the auxiliary random variable by means of which the rates-distortion region is defined. Specifically, we will show that choosing it to be jointly Gaussian with $X$ is optimal.
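As a sanity check on the Gaussian model above, the following sketch draws samples from $Y_i = \sqrt{\rho} X_i + W_i$ and $Z_i = a X_i + b Y_i + T_i$ and computes the linear estimate of $X$ from $(Y, Z)$, which coincides with the MMSE estimate for jointly Gaussian variables; the parameter values are illustrative.

```python
import numpy as np

rho, a, b = 0.8, 0.6, 0.3          # illustrative model parameters
rng = np.random.default_rng(1)
m = 200_000

X = rng.normal(size=m)
W = rng.normal(size=m)
T = rng.normal(size=m)
Y = np.sqrt(rho) * X + W           # helper observation
Z = a * X + b * Y + T              # decoder observation

# For jointly Gaussian vectors the MMSE estimate of X from (Y, Z) is linear:
# E[X | Y, Z] = Sigma_xo @ inv(Sigma_oo) @ (Y, Z).
obs = np.vstack([Y, Z])
Sigma_oo = np.cov(obs)                                   # 2x2 cov of (Y, Z)
Sigma_xo = np.array([np.cov(X, Y)[0, 1], np.cov(X, Z)[0, 1]])
coef = np.linalg.solve(Sigma_oo, Sigma_xo)
X_mmse = coef @ obs

print("empirical MMSE:", np.mean((X - X_mmse) ** 2))
```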

Main Results
Given a maximal allowed distortion $D$, define $\mathcal{R}^*(D)$ to be the set of all rate pairs $(R, R_{\mathrm{h}})$ for which there exist random variables $(U, V)$, taking values in finite alphabets $\mathcal{U}$, $\mathcal{V}$, respectively, such that:

1. $U \leftrightarrow X \leftrightarrow (Y, Z)$ forms a Markov chain;
2. Conditioned on $U$, $X \leftrightarrow Y \leftrightarrow V$ forms a Markov chain;
3. There exist deterministic maps $\hat{f}$ such that $\hat{X} = \hat{f}(U, V, Z)$ satisfies the distortion constraint
$$\mathbb{E}\big[d(X, \hat{X})\big] \le D; \qquad (10)$$
4. The alphabets $\mathcal{U}$, $\mathcal{V}$ satisfy the standard cardinality bounds;
5. The rates $R$ and $R_{\mathrm{h}}$ satisfy the lower bounds (12a) and (12b).

Our first result is a single-letter characterization of the rates-distortion region.

Theorem 1. $\mathcal{R}(D) = \mathcal{R}^*(D)$.

Remark 1. The converse holds as well for the setting where the causal helper and the reconstructor benefit from causal disclosure, i.e., are cognizant of the past realizations of the source sequence, and hence are allowed to depend also on $X^{k-1}$ when forming $f^{(n)}_{1,k}$ and $\hat{X}_k$, respectively. That is,
$$V_k = f^{(n)}_{1,k}(T, Y^k, X^{k-1}), \qquad \hat{X}_k = g^{(n)}_k(T, V^k, Z^k, X^{k-1}).$$

The Gaussian Setting with $Z = \emptyset$
To simplify the presentation, we first consider the Gaussian setting with $Z = \emptyset$. In this case, the region $\mathcal{R}(D)$ is defined as the set of rate pairs $(R, R_{\mathrm{h}})$ for which there exist real-valued random variables $(U, V)$ such that the rate bounds (12a) and (12b) hold, $\sigma^2_{X|UV} \le D$, and, conditioned on $U$, $X \leftrightarrow Y \leftrightarrow V$ forms a Markov chain.
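When the auxiliaries are taken jointly Gaussian with $(X, Y)$, the distortion constraint $\sigma^2_{X|UV} \le D$ reduces to a conditional variance computed by a Schur complement. The sketch below illustrates the computation for one hypothetical covariance choice ($U$ a noisy observation of $X$, $V$ a noisy observation of $Y$); the numbers are illustrative, not an evaluation of the region.

```python
import numpy as np

def cond_var(Sigma, target, given):
    """Conditional variance of the `target` coordinate given the `given`
    coordinates of a zero-mean Gaussian vector (Schur complement)."""
    S_oo = Sigma[np.ix_(given, given)]
    S_to = Sigma[np.ix_([target], given)]
    return Sigma[target, target] - (S_to @ np.linalg.solve(S_oo, S_to.T))[0, 0]

# Hypothetical jointly Gaussian (X, U, V): U = X + N1 (var s1) and
# V = Y + N2 (var s2) with Y = sqrt(rho) X + W; coordinates ordered X, U, V.
rho, s1, s2 = 0.8, 0.5, 0.7
Sigma = np.array([
    [1.0,          1.0,          np.sqrt(rho)],
    [1.0,          1.0 + s1,     np.sqrt(rho)],
    [np.sqrt(rho), np.sqrt(rho), rho + 1.0 + s2],
])
print("sigma^2_{X|UV} =", cond_var(Sigma, target=0, given=[1, 2]))
```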
Our second result characterizes the optimal choice of $P_{XU}$ in the Gaussian setting.

Theorem 2. For the evaluation of $\mathcal{R}(D)$ when $(X, Y)$ are Gaussian, it suffices to assume that $(X, U)$ are jointly Gaussian.
Proof. For the treatment to follow, let us define the Gaussian channel $Y = \sqrt{\rho}\, X + W$, where $X$ has an arbitrary law with $\mathbb{E}[X^2] \le 1$, $W \sim \mathcal{N}(0, 1)$ is independent of $X$, and $\rho > 0$ is the channel signal-to-noise ratio. We shall denote by $G^{(\rho)}_{Y|X}$ the conditional law of $Y$ given $X$ for this additive Gaussian model. Furthermore, the notation $(U_n, X_n, Y_n) \sim P_{U_n X_n} G^{(\rho)}_{Y_n|X_n}$ will imply that $(U_n, X_n) \sim P_{U_n X_n}$ and $Y_n = \sqrt{\rho}\, X_n + W_n$ with $W_n \sim \mathcal{N}(0, 1)$ independent of $(U_n, X_n)$. Using this notation, the law $P_{UXYV}$ defining $\mathcal{R}(D)$ (see also (45) ahead) may equivalently be expressed as
$$P_{UXYV} = P_{XU}\, G^{(\rho)}_{Y|X}\, P_{V|YU}. \qquad (16)$$
Henceforth, we denote a law which factors as in (16) by $X \leftrightarrow Y \leftrightarrow V \,|\, U$, i.e., conditioned on $U$, $X \leftrightarrow Y \leftrightarrow V$ forms a Markov chain and, furthermore, $P_{Y|X} = G^{(\rho)}_{Y|X}$ independently of the rest. Denote by $\mathcal{O}_K$ the region defined by (12a) and (12b) subject to constraint (15), where the subscript $K$ indicates the covariance constraint (15). The region $\mathcal{O}_K$ is a closed convex set.
In line with [7,8], we define a $\lambda$-parametrized family of functions related to the sum rate associated with $\mathcal{R}(D)$. Fix some $\lambda > 1$, and consider the minimization of the $\lambda$-sum rate defined in (17). Observe the chain of bounds (18), where (a) follows using the lower bounds (12a) and (12b). Since the marginal law of $X$ in our model (6) is Gaussian, the differential entropy $h(X)$ is fixed; thus, for the minimization of (18) over a law of the form (16), we define in (19) a functional $s_\lambda$ of $P_{XU}$ (i.e., of the conditional law $P_{X|U}$). The minimum in (17) may then be expressed as in (20). Henceforth, the set of laws $P_{UXYV}$ defined by (16) which satisfy $\sigma^2_{X|UV} \le D$ will be denoted by $\mathcal{Q}$ and will be referred to as the feasible set.
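The role of the $\lambda$-sum rate is the standard scalarization of a convex region: sweeping the supporting-line slope $\lambda$ traces the lower boundary. The toy sketch below illustrates this mechanism with purely hypothetical boundary curves; it is not an evaluation of (17).

```python
import numpy as np

# Hypothetical convex trade-off between the encoder rate R and the helper
# rate Rh, parametrized by t; these curves are placeholders, not (17).
t = np.linspace(1e-3, 0.999, 1000)
R = -np.log2(t)                  # encoder rate decreases along t
Rh = -np.log2(1.0 - t / 2.0)     # helper rate increases along t

# Sweeping lambda picks out supporting points of the lower boundary,
# exactly as the lambda-sum-rate scalarization does for a convex region.
for lam in (1.5, 2.0, 4.0):
    i = int(np.argmin(R + lam * Rh))
    print(f"lambda = {lam}: (R, Rh) = ({R[i]:.3f}, {Rh[i]:.3f})")
```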
As shown below, with a proper choice of $\lambda$, the functional (19) exhibits a "pair grouping" property with respect to the input $X$, in the sense that the value of $s_\lambda$ does not increase under this operation. Having established that, we follow the same steps as in the proof of ([9] Theorem 9) to establish that the objective (20) is attained when $P_{X|U}$ is Gaussian. More specifically:

• Lemma 1 shows that the value of $s_\lambda$ "improves" under the pair grouping operation.

• With a proper time-sharing of two distributions attaining the infimum in (20) and satisfying the extremal property defined in (23) ahead, Lemma 2 proves that the pair grouping operation exhibits Gaussianity in the sense of Bernstein's characterization.
Assume that the infimum in (20) is attained, and let $\mathcal{P}$ denote the subset of laws $P_{XU}$ achieving the minimum. Suppose further that there exists a law $P_{XU} \in \mathcal{P}$ satisfying the extremal property (23). Denote the value of the LHS of (23) by $g^*(\rho)$.

Lemma 3. There exists a sequence $\{(X_n, U_n)\}$ such that, for each $n \ge 1$, $(X_n, U_n)$ is feasible, and there exists a feasible law $P_{X^* U^*}$ on $\mathbb{R} \times \mathcal{U}$ such that, where the tuple $(U_{1,n}, X_{1,n}, Y_{1,n})$ and the tuple $(U_{2,n}, X_{2,n}, Y_{2,n})$ are two independent copies of $(U_n, X_n, Y_n) \sim P_{U_n X_n} G^{(\rho)}_{Y_n|X_n}$, we have
$$\liminf_{n \to \infty} I\big(X_{1,n} + X_{2,n};\, X_{1,n} - X_{2,n} \,\big|\, Y_n, U_n = u\big) = 0 \quad \text{for } P_{U^*} \times P_{U^*}\text{-a.e. } u.$$
By Bernstein's characterization, the vanishing of this mutual information forces the limiting conditional law of $X$ given $U$ to be Gaussian. Since the marginal of $X$ is Gaussian, with no loss of generality we may choose $(X, U)$ to be jointly Gaussian, which establishes our claim in Theorem 2.
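Bernstein's characterization states that if $X_1, X_2$ are independent and $X_1 + X_2$ is independent of $X_1 - X_2$, then both variables are Gaussian. A finite-order shadow of this fact is that $\mathrm{Cov}\big((X_1+X_2)^2, (X_1-X_2)^2\big)$ vanishes exactly when the fourth moment matches the Gaussian value $3\mu_2^2$. The following Monte Carlo sketch (an illustration, not part of the proof) shows the statistic vanishing for Gaussian samples and not for uniform or Laplace ones.

```python
import numpy as np

rng = np.random.default_rng(2)
m = 1_000_000

def square_cov(sample):
    """Cov of the squared sum and squared difference of two IID copies;
    equals (mu4 - 3*mu2^2)/2, which is zero iff the fourth moment is
    Gaussian -- a finite-order probe of Bernstein's characterization."""
    x1, x2 = sample(m), sample(m)
    s, d = (x1 + x2) / np.sqrt(2), (x1 - x2) / np.sqrt(2)
    return np.cov(s ** 2, d ** 2)[0, 1]

print("gaussian:", square_cov(lambda m: rng.normal(size=m)))     # ~ 0
print("uniform :", square_cov(lambda m: rng.uniform(-1, 1, m)))  # < 0
print("laplace :", square_cov(lambda m: rng.laplace(size=m)))    # > 0
```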

The Gaussian Setting with Decoder Side-Information Z
In the Gaussian setting where the decoder side-information $Z$ is non-void, using our previous notation, $Z$ is formed as
$$Z = a X + b Y + T$$
for a pair of real numbers $(a, b)$, where $T \sim \mathcal{N}(0, 1)$ is independent of $(X, Y)$ and $U$. For arbitrary $X$ with $\mathbb{E}[X^2] \le 1$, let us define by $\tilde{G}^{(\rho,a,b)}_{YZ|X}$ the conditional law obtained by forming the pair $(Y, Z)$, when given $X$, via the additive independent Gaussian pair $(W, T)$ as described above. The rates-distortion region $\mathcal{R}(D)$ is given by Theorem 1, with the distortion constraint now reading $\sigma^2_{X|UVZ} \le D$, and it is evaluated over a law of the form
$$P_{UXYZV} = P_{XU}\, \tilde{G}^{(\rho,a,b)}_{YZ|X}\, P_{V|YU}. \qquad (31)$$
Consequently, for the minimization of (20), we consider the analogous functional under a law of the form (31). Next, in Lemma 1, for $P_{XU}$ a law on $\mathcal{X} \times \mathcal{U}$ and $P_{UXYZ} = P_{XU} \tilde{G}^{(\rho,a,b)}_{YZ|X}$, we let $(U_1, X_1, Y_1, Z_1)$ and $(U_2, X_2, Y_2, Z_2)$ be two independent copies of $(U, X, Y, Z)$.
Upon defining $(X_-, X_+)$ and $(Y_-, Y_+)$ as before, we also define
$$Z_- = \frac{Z_1 - Z_2}{\sqrt{2}}, \qquad Z_+ = \frac{Z_1 + Z_2}{\sqrt{2}}.$$
It can be verified that
$$Z_\pm = a X_\pm + b Y_\pm + T_\pm,$$
where $T_-$ and $T_+$ are independent. Furthermore, the pair $(T_-, T_+)$ is independent of $(U, X_-, X_+, Y_-, Y_+)$ and is equal in distribution to the pair $(T_1, T_2)$. Thus, the simultaneous unitary transformation $(Y_1, Y_2) \to (Y_-, Y_+)$ and $(Z_1, Z_2) \to (Z_-, Z_+)$ preserves the Gaussian nature of the channel and factors accordingly. Observe that, with the choice of $\mathbf{U} = (U_1, U_2)$ and $\mathbf{V} = (V_1, V_2)$, where $(U_1, X_1, Y_1, Z_1, V_1)$ and $(U_2, X_2, Y_2, Z_2, V_2)$ are two independent copies of $(\tilde{U}, \tilde{X}, \tilde{Y}, \tilde{Z}, \tilde{V})$ such that $\tilde{X} \leftrightarrow \tilde{Y} \leftrightarrow \tilde{V} \,|\, \tilde{U}$ and $\sigma^2_{\tilde{X}|\tilde{U}\tilde{V}\tilde{Z}} \le D$, i.e., $P_{\tilde{U}\tilde{X}\tilde{Y}\tilde{Z}\tilde{V}}$ is feasible for the minimization of (20), the resulting pair of laws is feasible as well. Thus, Lemma 1 holds for this setting with decoder side-information $Z$ as well.
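The claim that the simultaneous rotation preserves the channel can be checked numerically. The sketch below draws two independent copies of $(X, Y, Z)$ under the model $Y = \sqrt{\rho}X + W$, $Z = aX + bY + T$ (parameter values illustrative), rotates them, and verifies that the implied noises $T_\pm = Z_\pm - aX_\pm - bY_\pm$ have unit variance and are uncorrelated with each other and with the rotated inputs.

```python
import numpy as np

rho, a, b = 0.8, 0.6, 0.3          # illustrative model parameters
rng = np.random.default_rng(3)
m = 500_000

def draw_copy():
    X = rng.normal(size=m)
    Y = np.sqrt(rho) * X + rng.normal(size=m)   # Y = sqrt(rho) X + W
    Z = a * X + b * Y + rng.normal(size=m)      # Z = aX + bY + T
    return X, Y, Z

X1, Y1, Z1 = draw_copy()
X2, Y2, Z2 = draw_copy()                        # second independent copy

def rotate(p1, p2):
    return (p1 - p2) / np.sqrt(2), (p1 + p2) / np.sqrt(2)

Xm, Xp = rotate(X1, X2)
Ym, Yp = rotate(Y1, Y2)
Zm, Zp = rotate(Z1, Z2)

# Implied channel noises after the rotation; should be ~N(0,1), independent.
Tm = Zm - a * Xm - b * Ym
Tp = Zp - a * Xp - b * Yp
print("var(T-), var(T+):", Tm.var(), Tp.var())        # both ~ 1
print("corr(T-, T+):    ", np.corrcoef(Tm, Tp)[0, 1]) # ~ 0
print("corr(T-, X+):    ", np.corrcoef(Tm, Xp)[0, 1]) # ~ 0
```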

Proof of Theorem 1
Converse: Assume that the pair $(R, R_{\mathrm{h}})$ is $D$-achievable. For $j = 1, \ldots, n$, with the convention $Y^0 = Z^0 = \emptyset$, define $U_j = (T, Y^{j-1}, Z^{j-1})$ and $V_j = f^{(n)}_{1,j}(T, Y^j)$. The rate $R$ is lower bounded as follows:
$$nR \ge H(T) \ge I(X^n; T) \overset{(a)}{=} \sum_{k=1}^{n} I(X_k; T, X^{k-1}) \overset{(b)}{\ge} \sum_{k=1}^{n} I(X_k; U_k).$$
Here, (a) follows since $X^n$ is memoryless, and (b) follows since $X_k \leftrightarrow (T, X^{k-1}) \leftrightarrow (Y^{k-1}, Z^{k-1})$ forms a Markov chain.
We may now lower bound $R_{\mathrm{h}}$ as in (38), where (c) follows since $V_j$ is a deterministic function of $(T, Y^j)$.
Next, (d) follows by equality (37), and we may express (38) as in (39). With regard to the expected distortion, we may write (40), where step (e) is justified as follows. Since $V_1, \ldots, V_{k-1}$ are deterministic functions of $(T, Y^{k-1})$, which is contained in $U_k$, the chain $X_k \leftrightarrow (U_k, V_k, Z_k) \leftrightarrow (V_1, \ldots, V_{k-1})$ is Markov; that is, given $(U_k, V_k, Z_k)$, the tuple $(T, V_1, \ldots, V_k, Z^k)$ carries no information about $X_k$ beyond $(U_k, V_k, Z_k)$. As a consequence, Lemma 1 in ([12] Section II.B) guarantees the existence of a reconstruction that is a function of $(U_k, V_k, Z_k)$ whose distortion is no larger. This observation, interpreted as the "data processing inequality" for estimation, has already been made in ([12] Lemma 1). Furthermore, by (1) and the memoryless property of the sequence $(X_k, Y_k, Z_k)$, $k = 1, \ldots, n$, one can verify the Markov relation $U_k \leftrightarrow X_k \leftrightarrow (Y_k, Z_k)$, which implies the Markov relation $U \leftrightarrow X \leftrightarrow (Y, Z)$. Similarly, the definitions of $U_k$ and $V_k$ and the memoryless property of $(X_k, Y_k)$, $k = 1, \ldots, n$, imply that, conditioned on $U_k$, $X_k \leftrightarrow Y_k \leftrightarrow V_k$ forms a Markov chain.
Thus, conditioned on $U$, $X \leftrightarrow Y \leftrightarrow V$ forms a Markov chain; hence the factorization (44) holds, where $P_{Y|X}$ denotes the conditional law induced by the marginal law $P_{XY}$. The combination of (40)-(42) and (44), together with the latter Markov relations, establishes the converse.
We shall now obtain an alternative characterization of the lower bound (38). For a law $P_{UXZYV} = P_U P_{X|U} P_{ZY|X} P_{V|YU}$ and its induced conditional law $P_{XZYV|U}$, let $Q(P_{XZYV|U}, \cdot)$ be defined as in (45), and let $\tilde{Q}(P_{XZYV|U}, \cdot)$ denote the lower convex envelope of $Q(P_{XZYV|U}, \cdot)$. Define then $Q_s(P_{UXZYV}, \cdot)$ accordingly; by ([13] Section III.C, Lemma 1), $Q_s(P_{UXZYV}, \cdot)$ is convex. Note that (46) holds, and the integrand on the RHS of (46) is the entropy of the scalar quantizer of $V_k = f^{(n)}_{1,k}(T, Y^k)$. Consequently, we may lower bound the RHS of (46) as in (47), where in (a) we used the definition of $Q$ and in (b) its convexity. The lower bound (47) may be interpreted as follows. Fix $\rho: \mathcal{U} \to \mathbb{R}_+$ and consider, for each $u \in \mathcal{U}$, time sharing of at most two scalar quantizers for the "source" $P_{V|U=u}$ attaining a distortion level $\rho(u)$. The optimal helper time-shares side-information-dependent scalar quantizers of $V_k$ (at most two per side-information symbol $U_k$), while the reconstruction at the decoder is a function of $(U, V, Z)$.

Direct: To establish the achievability of $\mathcal{R}^*(D)$, consider the following codebook construction. The codebook $\mathcal{A} = \{\mathbf{u}_1, \ldots, \mathbf{u}_M\}$, $\mathbf{u}_k \in \mathcal{U}^n$, is obtained by drawing the $n$-length sequences $\mathbf{u}_k$ independently from $T_\delta^{P_U}$. (For the definition of $T_\delta^{P_U}$, the set of $\delta$-strongly typical $n$-sequences corresponding to a marginal law $P_U$, and a few properties of these sequences, see [14-16].)
Given the source sequence $\mathbf{x}$, $f^{(n)}(\mathbf{x})$ is defined as follows.

1. If $\mathbf{x} \in T_\delta^{P_X}$, the encoder searches for the first sequence $\mathbf{u}_k \in \mathcal{A}$ that is jointly typical with $\mathbf{x}$ with respect to $P_{XU}$ and sets $f^{(n)}(\mathbf{x}) = k$; if no such sequence exists, an encoding error is declared.
Given $f^{(n)}(\mathbf{x}) = k$, the helper forms the sequence of descriptions $V_i = f^{(n)}_{1,i}(k, Y^i)$, $i = 1, \ldots, n$.

Decoding: Given $f^{(n)}(\mathbf{x}) = k$ as well as the sub-sequence $V_1, \ldots, V_i$, the decoder forms the reconstruction $\hat{X}_i$ as a function of $(\mathbf{u}_k, V^i, Z^i)$. Given that $(\mathbf{U}, \mathbf{X})$ are jointly typical, since $(X, Y, Z)$ is memoryless, the Markov lemma guarantees that, for large $n$, with high probability $(\mathbf{U}, \mathbf{X}, \mathbf{Y}, \mathbf{Z})$ are also jointly typical. Since $X \leftrightarrow (U, Y) \leftrightarrow V$ forms a Markov chain, by the Markov lemma, with high probability, $(\mathbf{X}, \mathbf{V})$ as well as $(\mathbf{X}, \mathbf{V}, \mathbf{Z})$ are jointly typical. Thus, with high probability, $(\mathbf{X}, \hat{\mathbf{X}})$ are jointly typical, hence the distortion constraint (10) is fulfilled for large $n$. That the sequence $V_1, \ldots, V_n$ can be described at a conditional entropy rate satisfying (12b) and (47) can be established along the same lines as the proof of the direct part of Theorem 2 in ([13] Section III.C). Finally, a standard error-probability analysis verifies that, with high probability, $(\mathbf{U}, \mathbf{X})$ are jointly typical as long as (12a) holds.
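The codebook construction of the direct part can be illustrated with a small binary example. In the sketch below, the alphabet sizes, rates, and the test based on empirical joint types are simplifications chosen for readability, not the strong-typicality machinery of [14-16]: codewords are drawn from the typical set of a binary $P_U$, and encoding searches for the first jointly typical codeword.

```python
import numpy as np

rng = np.random.default_rng(4)
n, M, delta = 2000, 64, 0.05
P_U = np.array([0.5, 0.5])                  # binary auxiliary law
P_X_given_U = np.array([[0.9, 0.1],         # row u: P_{X|U}(. | u)
                        [0.2, 0.8]])

def draw_typical_u():
    """Rejection-sample a u-sequence whose type is delta-close to P_U."""
    while True:
        u = rng.choice(2, size=n, p=P_U)
        if abs(u.mean() - P_U[1]) <= delta:
            return u

codebook = [draw_typical_u() for _ in range(M)]

# Generate a source sequence correlated with codeword 0, so that the
# encoder should succeed and output index 0.
u_true = codebook[0]
x = (rng.random(n) < P_X_given_U[u_true, 1]).astype(int)

def jointly_typical(u, x):
    """Compare the empirical joint type of (u, x) with P_U * P_{X|U}."""
    for uu in (0, 1):
        for xx in (0, 1):
            emp = np.mean((u == uu) & (x == xx))
            if abs(emp - P_U[uu] * P_X_given_U[uu, xx]) > delta:
                return False
    return True

match = next((k for k, u in enumerate(codebook) if jointly_typical(u, x)),
             None)                          # None models an encoding error
print("encoder output f(x) =", match)
```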

Proof of Lemma 1
Define, as in (21), the unitary transformation
$$X_- = \frac{X_1 - X_2}{\sqrt{2}}, \qquad X_+ = \frac{X_1 + X_2}{\sqrt{2}},$$
and similarly $(Y_-, Y_+)$ and $(W_-, W_+)$. Then $Y_\pm = \sqrt{\rho}\, X_\pm + W_\pm$, where $W_-$ and $W_+$ are independent and the pair $(W_-, W_+)$ is equal in distribution to the pair $(W_1, W_2)$. Thus, the unitary transformation $(Y_1, Y_2) \to (Y_-, Y_+)$ preserves the Gaussian nature of the channel and factors according to (53) ahead. To show (22a), consider the sequence of identities (50); moreover, to show (22b), consider the sequence of identities (51). Starting with (50), consider the difference (52) under a law of the form (53), i.e., such that $X_- \leftrightarrow Y_- \leftrightarrow V \,|\, (U, Y_+)$ forms a Markov chain (see also ([9] Section VI.A, Lemma 4)).
We distinguish between two cases. (1) If $I(X_-; Y_+|UV) = 0$, the non-negativity of mutual information implies that the expression (52) is non-negative, hence the inequalities (63) ahead hold for any $\lambda > 1$. (2) If $I(X_-; Y_+|UV) > 0$, we first prove that, for the set of laws which are feasible for the optimization problem (20), the expression (52) is strictly positive.
Observe that, with the choice of $\mathbf{U} = (U_1, U_2)$ and $\mathbf{V} = (V_1, V_2)$, where $(U_1, X_1, Y_1, V_1)$ and $(U_2, X_2, Y_2, V_2)$ are two independent copies of $(\tilde{U}, \tilde{X}, \tilde{Y}, \tilde{V})$ such that $\tilde{X} \leftrightarrow \tilde{Y} \leftrightarrow \tilde{V} \,|\, \tilde{U}$ and $\sigma^2_{\tilde{X}|\tilde{U}\tilde{V}} \le D$, i.e., $P_{\tilde{U}\tilde{X}\tilde{Y}\tilde{V}} \in \mathcal{Q}$, the resulting law is feasible. Thus, the unitary transformation $(Y_1, Y_2) \to (Y_-, Y_+)$ picks a pair of independent copies of a law $P_{\tilde{U}\tilde{X}\tilde{Y}\tilde{V}} \in \mathcal{Q}$ and "creates" a pair of laws, $X_- \leftrightarrow Y_- \leftrightarrow V \,|\, (U, Y_+)$ and $X_+ \leftrightarrow Y_+ \leftrightarrow V \,|\, (U, Y_-)$, which factor jointly according to (53) and hence are symmetric w.r.t. the inputs $X_-$ and $X_+$. We shall denote the latter set of laws by $\mathcal{P}^*$ and, as shown above, $\mathcal{P}^* \subseteq \mathcal{Q}$.
Remark 2. Since the image of the map (55) is a convex set, a standard dimensionality reduction argument can be used to establish the existence of a law $P_{XU}$, where $P_U$ is supported on a finite set, that achieves any point in the image of the map (55) (see ([9] Section IV, Remark 2)).
By rate-distortion theory, the constraint $\sigma^2_{X_-|U,V} \le D$ with both $U$ and $V$ non-void implies (56). Now, for any $P_{UX_-X_+Y_-Y_+V}$ as per (49), conditioned on $(U, Y_+)$, the random variable $V$ is dependent on $Y_-$, i.e., $I(Y_-; V|U Y_+) > 0$. As a consequence, since the mutual information $I(X_-; Y_-)$ is strictly positive and, conditioned on $(U, Y_+)$, $X_- \leftrightarrow Y_- \leftrightarrow V$ forms a Markov chain, (57) holds. Since a law of the form (53) dictates (58), the combination of (56)-(58) yields (59). Thus, the conditional mutual information $I(X_-; V|U)$ is non-increasing under the conditioning on $(Y_-, Y_+)$.
On the other hand, since both mappings $(X_1, X_2) \to (X_-, X_+)$ and $(Y_1, Y_2) \to (Y_-, Y_+)$ are unitary, we obtain the sequence of identities culminating in (66). Here, (a) follows since, conditioned on $U$, $X_1$ and $X_2$ are independent, and (b) follows since, conditioned on $U$, $Y_1$ and $Y_2$ are independent.
Moreover, an argument similar to that leading to the conclusion that (52) is non-negative establishes that (66) is non-negative.

Proof of Lemma 2
Consider the tuple $(X_+, X_-, Y_+, Y_-, U)$ as defined in Lemma 1, constructed from independent copies of $(U, X, Y) \sim P_{XU} G^{(\rho)}_{Y|X}$. Under the transformation (21), $X_+$ and $X_-$ preserve the variance of $X_i$, $i = 1, 2$, and $W_+$ and $W_-$ preserve the variance of $W_i$, $i = 1, 2$. Furthermore, as shown in the proof of Lemma 1, the unitary transformation $(Y_1, Y_2) \to (Y_-, Y_+)$ picks a pair of independent copies of a law $P_{UXYV} \in \mathcal{Q}$ and "creates" a pair of laws, $X_- \leftrightarrow Y_- \leftrightarrow V \,|\, (U, Y_+)$ and $X_+ \leftrightarrow Y_+ \leftrightarrow V \,|\, (U, Y_-)$, which factor jointly according to (53) and hence are symmetric w.r.t. the inputs $X_-$ and $X_+$, and both are in the feasible set. Similarly, the unitary transformation $(Y_1, Y_2) \to (Y_-, Y_+)$ picks a pair of independent copies of a law $P_{UXYV} \in \mathcal{Q}$ and "creates" a pair of laws, $X_- \leftrightarrow Y_- \leftrightarrow V \,|\, (U, X_+)$ and $X_+ \leftrightarrow Y_+ \leftrightarrow V \,|\, (U, X_-)$, which are both in the feasible set (see the factorization (61)).
The combination of identity (73) with inequality (24b) proves (26). Since $\tilde{X} \in \mathbb{R}$ while $\tilde{U} \in \{+, -\} \times \mathbb{R} \times \mathcal{U} \times \mathcal{U}$, a dimensionality reduction argument as in Remark 2 establishes the existence of a law $P_{X'U'}$, where $U'$ has finite support, such that (25) and (26) are fulfilled; hence $(U', X')$ is in the feasible set.