Finite-Length Analyses for Source and Channel Coding on Markov Chains

We study finite-length bounds for source coding with side information for Markov sources and channel coding for channels with conditional Markovian additive noise. For this purpose, we propose two criteria for finite-length bounds: one is asymptotic optimality, and the other is efficient computability of the bound. Then, we derive finite-length upper and lower bounds on the coding length in both settings so that their computational complexity is low. To discuss the first criterion, we derive large deviation bounds, moderate deviation bounds, and second order bounds for these two topics, and show that these finite-length bounds achieve asymptotic optimality in these senses. For this discussion, we introduce several kinds of information measures for transition matrices.


I. INTRODUCTION
Recently, finite-length analyses for coding problems have been attracting considerable attention [1]. This paper focuses on finite-length analyses for source coding with side-information for Markov sources and for channel coding for channels with conditional Markovian additive noise. Although the main purpose of this paper is finite-length analyses, it also develops a unified approach to investigate these topics, including the asymptotic analyses. Since this discussion spans many subtopics, we explain them separately in the introduction.

A. Two criteria for finite-length bounds
To explain the motivations of this paper, we start with two criteria for finite-length bounds, although the problems treated in this paper are not restricted to channel coding. Until now, many types of finite-length achievability bounds have been proposed. For example, Verdú and Han derived a finite-length bound by using the information spectrum approach in order to derive the general formula [3] (see also [4]), which we call the information-spectrum bound. One of the authors and Nagaoka derived a bound (for the classical-quantum channel) by relating the error probability to binary hypothesis testing [5, Remark 15] (see also [6]), which we call the hypothesis testing bound. Polyanskiy et al. derived the RCU (random coding union) bound and the DT (dependence testing) bound [1].^1 Also, Gallager's bound [7] is known as an efficient bound for deriving the exponential decreasing rate.
Here, we focus on two important criteria for finite-length bounds: (C1) Computational complexity of the bound, and (C2) Asymptotic optimality of the bound. First, we consider the first criterion, i.e., the computational complexity of the bound. For the BSC, the computational complexity of the RCU bound is O(n^2) and that of the DT bound is O(n) [8]. However, the computational complexities of these bounds are much larger for general DMCs or channels with memory. It is known that the hypothesis testing bound can be described as a linear program (e.g., see [9], [10]), and it can be efficiently computed under certain symmetry. However, the number of variables in the linear program grows exponentially in the block length, and it is difficult to compute in general. The computation of the information-spectrum bound depends on the evaluation of a tail probability. The information-spectrum bound is less operational than the hypothesis testing bound in the sense of the hierarchy introduced in [9], and the computational complexity of the former is much smaller than that of the latter. However, the computation of a tail probability is still not so easy unless the channel is a DMC. For DMCs, the computational complexity of Gallager's bound is O(1) since the Gallager function is an additive quantity for DMCs. However, this is not the case if there is memory. Consequently, no bound that is efficiently computable for the Markov chain has been available so far. The situation is the same for source coding with side-information.

Parts of this paper were presented at the 51st Allerton Conference and the 2014 Information Theory and Applications Workshop. The first author is with the Graduate School of Mathematics, Nagoya University, Japan. He is also with the Center for Quantum Technologies, National University of Singapore, Singapore (e-mail: masahito@math.nagoya-u.ac.jp). The second author is with the Department of Computer and Information Sciences, Tokyo University of Agriculture and Technology, Japan (e-mail: shunwata@cc.tuat.ac.jp). When the main part of this paper was done, he was with the Department of Information Science and Intelligent Systems, University of Tokushima, Japan.

Manuscript received ; revised

^1 A bound slightly looser (the coefficients are worse) than the DT bound can be derived from the hypothesis testing bound of [5].
Next, let us consider the second criterion, i.e., asymptotic optimality. So far, three kinds of asymptotic regimes have been studied in information theory [1], [2], [12], [13], [14], [15], [16]:
• The large deviation regime, in which the error probability ε asymptotically behaves like e^{-nr} for some r > 0;
• The moderate deviation regime, in which ε asymptotically behaves like e^{-n^{1-2t} r} for some r > 0 and t ∈ (0, 1/2); and
• The second order regime, in which ε is a constant.
We shall claim that a good finite-length bound should be asymptotically optimal in at least one of the above-mentioned three regimes. In fact, the information spectrum bound, the hypothesis testing bound, and the DT bound are asymptotically optimal in the moderate deviation regime and the second order regime; the Gallager bound is asymptotically optimal in the large deviation regime; and the RCU bound is asymptotically optimal in all the regimes. Recently, for DMCs, Yang and Meng derived an efficiently computable bound for low density parity check (LDPC) codes [17], which is asymptotically optimal in the moderate deviation regime and the second order regime.

B. Main Contribution for Finite-Length Analysis
To derive finite-length achievability bounds for these problems, we basically use exponential type bounds. In source coding with side-information, the exponential type upper bounds on the error probability P_e(M_n) for a given message size M_n are described by using conditional Rényi entropies (cf. Lemma 13 and Lemma 14). Here, H^↑_{1+θ}(X^n|Y^n) is the conditional Rényi entropy introduced by Arimoto [18], which we shall call the upper conditional Rényi entropy (cf. (12)). On the other hand, H^↓_{1+θ}(X^n|Y^n) is the conditional Rényi entropy introduced in [19], which we shall call the lower conditional Rényi entropy. Although there are several other definitions of conditional Rényi entropies, we will only use these two in this paper; see [20], [21] for an extensive review of conditional Rényi entropies.
Although the above-mentioned conditional Rényi entropies are additive for i.i.d. random variables, they are not additive for Markov chains, which makes it difficult to derive finite-length bounds for Markov chains. In general, it is not easy to evaluate the conditional Rényi entropies for Markov chains. Thus, we consider two assumptions on transition matrices (see Assumption 1 and Assumption 2 of Section II). It should be noted that, without Assumption 1, even the conditional entropy rate is difficult to evaluate. Under Assumption 1, we introduce the lower conditional Rényi entropy for transition matrices H^{↓,W}_{1+θ}(X|Y) (cf. (47)). Then, we evaluate the lower conditional Rényi entropy for the Markov chain in terms of its transition matrix counterpart. More specifically, we derive the approximation H^↓_{1+θ}(X^n|Y^n) = n H^{↓,W}_{1+θ}(X|Y) + O(1), where an explicit form of the O(1) term is also derived. This evaluation gives finite-length bounds under Assumption 1. Under a more restrictive assumption, i.e., Assumption 2, we also introduce the upper conditional Rényi entropy for a transition matrix H^{↑,W}_{1+θ}(X|Y) (cf. (55)). Then, we evaluate the upper conditional Rényi entropy for the Markov chain in terms of its transition matrix counterpart. More specifically, we derive the approximation H^↑_{1+θ}(X^n|Y^n) = n H^{↑,W}_{1+θ}(X|Y) + O(1), where an explicit form of the O(1) term is also derived. This evaluation gives finite-length bounds that are tighter than those obtained under Assumption 1.
We also derive converse bounds by using the change of measure argument for Markov chains developed by the authors in the accompanying paper on information geometry [22], [23]. For this purpose, we further introduce a two-parameter conditional Rényi entropy and its transition matrix counterpart (cf. (18) and (59)). This novel information measure includes the lower conditional Rényi entropy and the upper conditional Rényi entropy as special cases. To clarify the relation among the bounds based on these quantities, we numerically calculate the upper and lower bounds on the optimal coding rate in source coding with a Markovian source in Figs. 3 and 4. Thanks to the second criterion (C2), this calculation shows that our finite-length bounds are very close to the optimal value. Although this numerical calculation contains the case with the huge size n = 1 × 10^5, the calculation is not so difficult because its computational complexity behaves as O(1). That is, this calculation shows the advantage of the first criterion (C1).
Here, we would like to remark on terminologies. There are a few ways to express exponential type bounds. In statistics and large deviation theory, we usually use the cumulant generating function (CGF) to describe exponents. In information theory, we use the Gallager function or the Rényi entropies. Although these three terminologies are essentially the same and are related by changes of variables, the CGF and the Gallager function are convenient for some calculations since they have good properties such as convexity. However, they are merely mathematical functions. On the other hand, the Rényi entropies are information measures that include Shannon's information measures as special cases. Thus, the Rényi entropies are intuitively familiar in the field of information theory. The Rényi entropies also have the advantage that two types of bounds (e.g., (157) and (166)) can be expressed in a unified manner. For these reasons, we state our main results in terms of the Rényi entropies, while we use the CGF and the Gallager function in the proofs. For the readers' convenience, the relation between the Rényi entropies and the corresponding CGFs is summarized in Appendices A and B.

C. Main Contribution for Channel Coding
It is known that there is an intimate relationship between channel coding and source coding with side-information (e.g., [24], [25], [26]). In particular, for an additive channel, the error probability of channel coding by a linear code can be related to the corresponding source coding problem with side-information [24]. Chen et al. also showed that the error probability of source coding with side-information by a linear encoder can be related to the error probability of a dual channel coding problem and vice versa [27] (see also [28]). Since those dual channels can be regarded as additive channels conditioned on state-information, we call those channels conditional additive channels. A regular channel [29] is known as a similar symmetric channel.
In this paper, we mainly discuss conditional additive channels, in which the additive noise obeys a distribution conditioned on additional output information, and we propose a method to convert a regular channel into a conditional additive channel so that our treatment covers regular channels. Additionally, we show that the BPSK-AWGN channel is included in the class of conditional additive channels. Thus, by using the aforementioned duality between channel coding and source coding with side-information, we can evaluate the error probability of channel coding for regular channels.
For the same reason as in source coding with side-information, we impose two assumptions, Assumption 1 and Assumption 2, on the noise process of a conditional additive channel. It should be noted that the Gilbert-Elliott channel [30], [31] with state-information available at the receiver can be regarded as a conditional additive channel such that the noise process is a Markov chain satisfying both Assumption 1 and Assumption 2 (see Example 6). Thus, we believe that Assumption 1 and Assumption 2 are quite reasonable assumptions.

D. Asymptotic bounds and asymptotic optimality for finite-length bounds
For asymptotic analyses in the large deviation and the moderate deviation regimes, we derive the characterizations by using our finite-length achievability and converse bounds, which implies that our finite-length bounds are tight in the large deviation regime and the moderate deviation regime. We also derive the second order rate. Although the second order rate can be derived by applying the central limit theorem to the information spectrum bound, the variance involves a limit with respect to the block length because of the memory. In this paper, we derive a single-letter form of the variance by using the conditional Rényi entropy for transition matrices.
As we will see in Theorems 11, 12, 13, 14, 22, 23, 24, and 25, our asymptotic results have the same forms as their counterparts in the i.i.d. case (cf. [7], [1], [2], [12], [13], [14]) when the information measures for distributions in the i.i.d. case are replaced by the information measures for transition matrices introduced in this paper.
To see the asymptotic optimality of the finite-length bounds, we summarize the relation between the asymptotic results and the finite-length bounds in Table I. In the table, the computational complexity of the finite-length bounds is also described. "Solved*" indicates that those problems are solved up to the critical rates. "Ass. 1" and "Ass. 2" indicate that those problems are solved under Assumption 1 or Assumption 2. "O(1)" indicates that both the achievability part and the converse part of those asymptotic results are derived from our finite-length achievability and converse bounds whose computational complexities are O(1). "Tail" indicates that both the achievability part and the converse part of those asymptotic results are derived from information-spectrum type achievability and converse bounds whose computational complexities depend on the computational complexities of tail probabilities.
Exact computations of tail probabilities are difficult in general, though they may be feasible for a simple case such as the i.i.d. case. One way to approximately compute tail probabilities is to use the Berry-Esséen theorem [34, Theorem 16.5.1] or its variants [35]. This direction of research is still continuing [36], [37], and an evaluation of the constant was done in [37], though it is not clear how tight it is. If we can derive a tight Berry-Esséen type bound for the Markov chain, we can derive a finite-length bound that is asymptotically tight in the second order regime. However, the approximation errors of Berry-Esséen type bounds converge only at the rate 1/√n, and they cannot be applied when ε is rather small. Even in cases where exact computations of tail probabilities are possible, the information-spectrum type bounds are looser than the exponential type bounds when ε is rather small, and we need to use appropriate bounds depending on the size of ε. In fact, this observation was explicitly clarified in [38] for random number generation with side-information. Consequently, we believe that our exponential type finite-length bounds are very useful. It should also be noted that, for source coding with side-information and channel coding for regular channels, even the first order results had not been revealed as far as the authors know, and they are clarified in this paper.

E. Related Works on Markov chains
Since related works concerning the finite-length analysis have been reviewed in Section I-A, we only review related works concerning the asymptotic analysis here. There are some studies on Markov chains for the large deviation regime [39], [40], [41]. The derivation in [39] uses the Markov type method. A drawback of this method is that it involves a term that stems from the number of types, which is not important for the asymptotic analysis but is crucial for the finite-length analysis. Our achievability is derived by an approach similar to that in [40], [41], i.e., the Perron-Frobenius theorem, but our derivation separates the single-shot part from the evaluation of the Rényi entropy, and thus is more transparent. Also, the converse part of [40], [41] is based on the Shannon-McMillan-Breiman limiting theorem and does not yield finite-length bounds.
For the second order regime, Polyanskiy et al. studied the second order rate (dispersion) of the Gilbert-Elliott channel [42]. Tomamichel and Tan studied the second order rate of channel coding with state-information such that the state-information may be a general source, and derived a formula for the Markov chain as a special case [32]. Kontoyiannis studied second order variable-length source coding for the Markov chain [43]. In [44], Kontoyiannis and Verdú derived the second order rate of lossless source coding under the overflow probability criterion.
For channel coding in the i.i.d. case, Scarlett et al. derived a saddle-point approximation, which unifies all three regimes [45], [46].

F. Organization of Paper
In Section II, we introduce the information measures and their properties that will be used in Section III and Section IV. Then, source coding with side-information and channel coding are discussed in Section III and Section IV, respectively. As we mentioned above, we state our main results in terms of the Rényi entropies, and we use the CGFs and the Gallager function in the proofs. We explain how to cover the continuous case in Remarks 1 and 6. In Appendices A and B, the relation between the Rényi entropies and the corresponding CGFs is summarized. The relation between the Rényi entropies and the Gallager function is explained as necessary. Proofs of some technical results are also given in the remaining appendices.

G. Notations
For a set X, the set of all distributions on X is denoted by P(X). The set of all sub-normalized non-negative functions on X is denoted by P̄(X). The cumulative distribution function of the standard Gaussian random variable is denoted by Φ. Throughout the paper, the base of the logarithm is e.

II. INFORMATION MEASURES
Since this paper discusses second order optimality, we need to discuss the central limit theorem for Markovian processes. For this purpose, one usually employs advanced mathematical methods from probability theory. For example, the paper [47, Theorem 4] showed the Markov version of the central limit theorem by using a martingale stopping technique. Lalley [48] employed regular perturbation theory of operators on an infinite dimensional space [49, Ch. 7, #1, Ch. 4, #3, and Ch. 3, #5]. The papers [50], [51], [52, Lemma 1.5 of Chapter 1] employed the spectral measure, although it is hard to calculate the spectral measure in general even in the finite-state case. Further, the papers [50], [53], [54], [55] showed the central limit theorem by using the asymptotic variance, but they did not give any computable expression of the asymptotic variance without an infinite sum. In summary, to derive the central limit theorem with a variance of computable form, these papers need to use very advanced mathematics beyond calculus and linear algebra.
To overcome this problem, we employ the method used in our recent paper [23]. The paper [23] employed a method based on the cumulant generating function for transition matrices, which is the Perron eigenvalue of a specific non-negative-entry matrix. Since a Perron eigenvalue can be explained within the framework of linear algebra, the method can be described with elementary mathematics. To employ this method, we need to define the information measures in a way similar to the cumulant generating function for transition matrices. That is, we define the information measures for transition matrices, e.g., the conditional Rényi entropy for transition matrices, by using Perron eigenvalues.
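Since a Perron eigenvalue of a finite non-negative matrix can indeed be computed with elementary linear algebra, the following minimal numerical sketch (our own illustration, not code from [23]) obtains it by power iteration on a primitive matrix:

```python
# Hypothetical sketch: computing the Perron (largest positive) eigenvalue of a
# matrix with non-negative entries by power iteration, using only elementary
# linear algebra. All names are ours, chosen for illustration.

def perron_eigenvalue(M, iters=1000):
    """M is a square matrix given as a list of rows; for a primitive matrix,
    the normalization constant of power iteration converges to the Perron
    eigenvalue."""
    n = len(M)
    v = [1.0] * n
    lam = 1.0
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(w)           # converges to the Perron eigenvalue
        v = [x / lam for x in w]
    return lam

# A stochastic matrix has Perron eigenvalue 1.
P = [[0.9, 0.1],
     [0.4, 0.6]]
print(perron_eigenvalue(P))  # ≈ 1.0
```

The same routine is reused below for the tilted matrices that define the information measures for transition matrices.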
Fortunately, these information measures for transition matrices are very useful even for large deviation type evaluations and finite-length bounds. For example, our recent paper [23] derived finite-length bounds for simple hypothesis testing for Markov chains by using the cumulant generating function for transition matrices. Therefore, using these information measures for transition matrices, this paper derives finite-length bounds for source coding and channel coding with Markov chains, and discusses their asymptotic bounds of large deviation, moderate deviation, and second order type.
Since they are natural extensions of the information measures for the single-shot setting, we first review the information measures for the single-shot setting in Section II-A. Next, we introduce the information measures for transition matrices in Section II-B. Then, we show that the information measures for Markov chains can be approximated by the information measures for the transition matrices generating those Markov chains in Section II-C.

A. Information measures for Single-Shot Setting
In this section, we introduce conditional Rényi entropies for the single-shot setting. For a more detailed review of conditional Rényi entropies, see [21]. For a correlated random variable (X, Y) on X × Y with probability distribution P_XY and a marginal distribution Q_Y on Y, we introduce the conditional Rényi entropy of order 1+θ relative to Q_Y as

H_{1+θ}(P_XY|Q_Y) := -(1/θ) log Σ_{x,y} P_XY(x,y)^{1+θ} Q_Y(y)^{-θ},

where θ ∈ (−1, 0) ∪ (0, ∞). The conditional Rényi entropy of order 0 relative to Q_Y is defined by the limit with respect to θ. When Y is a singleton, it is nothing but the ordinary Rényi entropy, and it is denoted by H_{1+θ}(X) = H_{1+θ}(P_X) throughout the paper. One of the important special cases of H_{1+θ}(P_XY|Q_Y) is the case Q_Y = P_Y, where P_Y is the marginal of P_XY. We shall call this special case the lower conditional Rényi entropy of order 1+θ and denote it by

H^↓_{1+θ}(X|Y) := H_{1+θ}(P_XY|P_Y) = -(1/θ) log Σ_{x,y} P_XY(x,y)^{1+θ} P_Y(y)^{-θ}.

We have the following property, which follows from the correspondence between the conditional Rényi entropy and the cumulant generating function (cf. Appendix B).
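As a quick numerical illustration of these definitions, the following sketch (our own; the helper names are hypothetical) evaluates H_{1+θ}(P_XY|Q_Y) and the lower conditional Rényi entropy directly from a joint distribution stored as a dictionary:

```python
import math

# Illustrative sketch (not the paper's code): the conditional Rényi entropy of
# order 1+θ relative to Q_Y, and the "lower" version obtained with Q_Y = P_Y.

def cond_renyi(P_XY, Q_Y, theta):
    """H_{1+θ}(P_XY|Q_Y) = -(1/θ) log Σ_{x,y} P_XY(x,y)^{1+θ} Q_Y(y)^{-θ}."""
    s = sum(P_XY[(x, y)] ** (1 + theta) * Q_Y[y] ** (-theta)
            for (x, y) in P_XY)
    return -math.log(s) / theta

def lower_cond_renyi(P_XY, theta):
    """H↓_{1+θ}(X|Y): the special case Q_Y = P_Y (the marginal of P_XY)."""
    P_Y = {}
    for (x, y), p in P_XY.items():
        P_Y[y] = P_Y.get(y, 0.0) + p
    return cond_renyi(P_XY, P_Y, theta)

# Sanity check: for uniform X independent of Y, the value is log|X| for any θ.
P = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
print(lower_cond_renyi(P, 0.5))  # ≈ log 2
```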

Lemma 1
We have the identities (9) and (11). Proof: (9) follows from the relation in (293) and the fact that the first-order derivative of the cumulant generating function is the expectation. (11) follows from (293), (9), and (294).
The other important special case of H_{1+θ}(P_XY|Q_Y) is the measure maximized over Q_Y. We shall call this special case the upper conditional Rényi entropy of order 1+θ and denote it by

H^↑_{1+θ}(X|Y) := max_{Q_Y} H_{1+θ}(P_XY|Q_Y) = -((1+θ)/θ) log Σ_y ( Σ_x P_XY(x,y)^{1+θ} )^{1/(1+θ)},

where the maximum is attained by Q_Y(y) ∝ (Σ_x P_XY(x,y)^{1+θ})^{1/(1+θ)}. For this measure, we also have properties similar to Lemma 1, which will be proved in Appendix C.
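A corresponding sketch (ours) for the maximized measure, assuming Arimoto's closed form −((1+θ)/θ) log Σ_y (Σ_x P_XY(x,y)^{1+θ})^{1/(1+θ)} for the upper conditional Rényi entropy:

```python
import math

# Hedged sketch assuming Arimoto's closed form for the maximized (upper)
# conditional Rényi entropy; the helper name is ours.

def upper_cond_renyi(P_XY, theta):
    """H↑_{1+θ}(X|Y) = -((1+θ)/θ) log Σ_y (Σ_x P_XY(x,y)^{1+θ})^{1/(1+θ)}."""
    ys = {y for (_, y) in P_XY}
    s = 0.0
    for y in ys:
        inner = sum(p ** (1 + theta) for (x, yy), p in P_XY.items() if yy == y)
        s += inner ** (1.0 / (1 + theta))
    return -(1 + theta) / theta * math.log(s)

# For uniform X independent of Y, the upper and lower measures coincide: log|X|.
P = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
print(upper_cond_renyi(P, 0.5))  # ≈ log 2
```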
We can also derive explicit forms of the conditional Rényi entropies of order 0.

Lemma 4
We have the explicit forms of H^↓_0(X|Y) and H^↑_0(X|Y).

The derivative d(θ H^↑_{1+θ}(X|Y))/dθ is monotonically decreasing. Thus, we can define the inverse function θ(a).^{13} Since R(a) is a monotonically increasing function of a, we can define the inverse function a(R). For θ H^↓_{1+θ}(X|Y), for the same reason as above, we can define the inverse functions θ(a) and a(R).

^{13} Throughout the paper, the notations θ(a) and a(R) are reused for several inverse functions. Although the meanings of those notations are obvious from the context, we occasionally put a superscript Q, ↓, or ↑ to emphasize that those inverse functions are induced from the corresponding conditional Rényi entropies. This definition is related to the Legendre transform of the concave function θ ↦ θ H^↓_{1+θ}(X|Y).
Remark 1 Here, we discuss the possibility of extension to the continuous case. Since the entropy of a continuous random variable diverges, we cannot extend these information quantities to the case when X is continuous. However, it is possible to extend them to the case when Y is continuous but X is a discrete finite set. In this case, we prepare a general measure µ (like the Lebesgue measure) on Y and probability density functions p_Y and q_Y such that the distributions P_Y and Q_Y are given as p_Y(y)µ(dy) and q_Y(y)µ(dy), respectively. Then, it is sufficient to replace Σ_y, P_XY(x,y), and Q_Y(y) by ∫_Y µ(dy), P_{X|Y}(x|y) p_Y(y), and q_Y(y), respectively. Hence, in the n-fold i.i.d. case, these information measures are given as n times the original information measures.
One might consider extending the information quantities for transition matrices given in the next subsection to this continuous case. However, this is not so easy because it requires a continuous extension of the Perron eigenvalue.

B. Information Measures for Transition Matrix
Let {W(x,y|x′,y′)}_{((x,y),(x′,y′)) ∈ (X×Y)^2} be an ergodic and irreducible transition matrix. The purpose of this section is to introduce transition matrix counterparts of the measures in Section II-A. For this purpose, we first need to introduce some assumptions on transition matrices:

Assumption 1 (Non-Hidden) We say that a transition matrix W is non-hidden (with respect to Y) if

W(y|y′) := Σ_x W(x,y|x′,y′)

is well defined, i.e., independent of x′, for every x′ ∈ X and y, y′ ∈ Y. This condition is equivalent to the existence of the following decomposition of W(x,y|x′,y′):

W(x,y|x′,y′) = V(x|x′,y,y′) W(y|y′)

for some conditional distribution V.

Assumption 2 (Strongly Non-Hidden) We say that a transition matrix W is strongly non-hidden (with respect to Y) if, for every θ ∈ (−1, ∞) and y, y′ ∈ Y,

W_θ(y|y′) := Σ_x W(x,y|x′,y′)^{1+θ}

is well defined, i.e., the right-hand side of (39) is independent of x′.
Assumption 1 requires (39) to hold only for θ = 0, and thus Assumption 2 implies Assumption 1. However, Assumption 2 is a strictly stronger condition than Assumption 1. For example, let us consider the case where the transition matrix has a product form, i.e., W(x,y|x′,y′) = W(x|x′)W(y|y′). In this case, Assumption 1 is obviously satisfied. However, Assumption 2 is not satisfied in general.
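Both assumptions are easy to check numerically for a finite alphabet. The following sketch (ours; for Assumption 2 it only spot-checks a few values of θ rather than all θ ∈ (−1, ∞)) verifies the product-form example above:

```python
# Illustrative sketch: numerically checking the (strongly) non-hidden
# conditions. W maps ((x, y), (x', y')) -> probability; all names are ours.

def is_non_hidden(W, X, Y, tol=1e-9):
    """Assumption 1: Σ_x W(x,y|x',y') must not depend on x'."""
    for y in Y:
        for yp in Y:
            sums = [sum(W[((x, y), (xp, yp))] for x in X) for xp in X]
            if max(sums) - min(sums) > tol:
                return False
    return True

def is_strongly_non_hidden(W, X, Y, thetas=(-0.5, 0.5, 1.0, 2.0), tol=1e-9):
    """Assumption 2: Σ_x W(x,y|x',y')^{1+θ} must not depend on x'
    (spot-checked at a few θ only)."""
    for th in thetas:
        for y in Y:
            for yp in Y:
                sums = [sum(W[((x, y), (xp, yp))] ** (1 + th) for x in X)
                        for xp in X]
                if max(sums) - min(sums) > tol:
                    return False
    return True

# Product form W(x,y|x',y') = P(x|x') Q(y|y'): satisfies Assumption 1 but,
# in general, not Assumption 2.
P = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}   # P(x|x')
Q = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.4, (1, 1): 0.6}   # Q(y|y')
X = Yset = (0, 1)
W = {((x, y), (xp, yp)): P[(x, xp)] * Q[(y, yp)]
     for x in X for y in Yset for xp in X for yp in Yset}
print(is_non_hidden(W, X, Yset))           # True
print(is_strongly_non_hidden(W, X, Yset))  # False
```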
Remark 2 Assumption 2 has another expression as follows: Assumption 2 holds if and only if, for every x′ ≠ x̃′, there exists a permutation π_{x′;x̃′} on X such that W(x,y|x̃′,y′) = W(π_{x′;x̃′}(x),y|x′,y′). Now, we fix an element x_0 ∈ X and transform the sequence of random numbers accordingly; that is, essentially, the transition matrix of this case can be written in terms of the transition matrix with x′ = x_0, i.e., by using a positive-entry matrix. Since the "if" part is trivial, we show the "only if" part as follows. By noting (38), Assumption 2 can be rephrased as the condition that (39) does not depend on x′ for every θ ∈ (−1, ∞), which holds if and only if the above permutations exist.

The following are non-trivial examples satisfying Assumption 1 and Assumption 2.
Example 1 Suppose that X = Y is a module. Let P and Q be transition matrices on X. Then, the transition matrix W given by (41) satisfies Assumption 1. Furthermore, if the transition matrix P(z|z′) can be written as (42) for a permutation π_{z′} and a distribution P_Z on X, then the transition matrix W defined by (41) satisfies Assumption 2 as well.
Example 2 Suppose that X is a module, and W is (strongly) non-hidden with respect to Y. Let Q be a transition matrix on Z = X. Then, the resulting transition matrix is (strongly) non-hidden with respect to Y × Z.
The following is also an example satisfying Assumption 2, which describes the noise process of an important class of channels with memory (cf. Example 6).
Example 3 Let X = Y = {0, 1}, let the state transition matrix be given by P(1|0) = q_0 and P(0|1) = q_1 for some 0 < q_0, q_1 < 1, and let the conditional noise distribution be given by P_{X|Y}(1|0) = p_0 and P_{X|Y}(1|1) = p_1 for some 0 < p_0, p_1 < 1, so that W(x,y|x′,y′) = P(y|y′) P_{X|Y}(x|y). By choosing π_{x′;x̃′} to be the identity, this transition matrix satisfies the condition given in Remark 2, which is equivalent to Assumption 2.
First, we introduce information measures under Assumption 1. In order to define a transition matrix counterpart of (7), let us introduce the following tilted matrix:

W̃_θ(x,y|x′,y′) := W(x,y|x′,y′)^{1+θ} W(y|y′)^{-θ}.

Here, we should note that the tilted matrix W̃_θ is not normalized, i.e., it is not a transition matrix. Let λ_θ be the Perron-Frobenius eigenvalue of W̃_θ and P̃_{θ,XY} be its normalized eigenvector. Then, we define the lower conditional Rényi entropy for W by

H^{↓,W}_{1+θ}(X|Y) := -(1/θ) log λ_θ,

where θ ∈ (−1, 0) ∪ (0, ∞). For θ = 0, we define the lower conditional Rényi entropy for W by the limit H^W(X|Y) := lim_{θ→0} H^{↓,W}_{1+θ}(X|Y), and we just call it the conditional entropy for W. In fact, the definition of H^W(X|Y) above coincides with the conditional entropy computed from the stationary distribution P_{0,XY} of W (cf. [58, Eq. (30)]). For θ = −1, H^{↓,W}_0(X|Y) is also defined by taking the limit. When Y is a singleton, the Rényi entropy H^W_{1+θ}(X) for W is defined as a special case of H^{↓,W}_{1+θ}(X|Y). As a counterpart of (11), we also define the corresponding variance V^W(X|Y).

Remark 3 When the transition matrix W satisfies Assumption 2, H^{↓,W}_{1+θ}(X|Y) can be written as

H^{↓,W}_{1+θ}(X|Y) = -(1/θ) log λ′_θ,

where λ′_θ is the Perron-Frobenius eigenvalue of the |Y| × |Y| matrix with entries W_θ(y|y′)W(y|y′)^{-θ}. In fact, the left Perron-Frobenius eigenvector Q̃_θ of this matrix induces a left eigenvector of W̃_θ with the same eigenvalue, which implies that λ′_θ is the Perron-Frobenius eigenvalue of W̃_θ. Consequently, we can evaluate H^{↓,W}_{1+θ}(X|Y) by calculating the Perron-Frobenius eigenvalue of a |Y| × |Y| matrix instead of an |X||Y| × |X||Y| matrix when W satisfies Assumption 2.
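To illustrate, the following sketch (ours) builds the tilted matrix for a Gilbert-Elliott-style chain W(x,y|x′,y′) = P(y|y′)P(x|y), computes λ_θ by power iteration, and checks the sanity case p_0 = p_1 = p, where the rate should reduce to the unconditional Rényi entropy of a Bernoulli(p) source:

```python
import math

# Hedged sketch (ours, assuming the tilted-matrix construction
# Ŵ_θ(x,y|x',y') = W(x,y|x',y')^{1+θ} W(y|y')^{-θ} and
# H↓,W_{1+θ}(X|Y) = -(1/θ) log λ_θ with λ_θ its Perron eigenvalue).

def perron(M, iters=2000):
    n = len(M)
    v = [1.0] * n
    lam = 1.0
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(w)
        v = [x / lam for x in w]
    return lam

def lower_cond_renyi_rate(P_state, P_noise, theta):
    """H↓,W_{1+θ}(X|Y) for W(x,y|x',y') = P_state[y][y'] * P_noise[y][x]."""
    states = [(x, y) for x in (0, 1) for y in (0, 1)]
    M = [[(P_state[y][yp] * P_noise[y][x]) ** (1 + theta)
          * P_state[y][yp] ** (-theta)
          for (xp, yp) in states] for (x, y) in states]
    return -math.log(perron(M)) / theta

# Sanity check: with p0 = p1 = p the noise is i.i.d. Bern(p), so the rate
# must equal the Rényi entropy -(1/θ) log(p^{1+θ} + (1-p)^{1+θ}).
q0, q1, p = 0.1, 0.2, 0.3
P_state = {0: {0: 1 - q0, 1: q1}, 1: {0: q0, 1: 1 - q1}}   # P(y|y')
P_noise = {0: {0: 1 - p, 1: p}, 1: {0: 1 - p, 1: p}}       # P(x|y)
h = lower_cond_renyi_rate(P_state, P_noise, 0.5)
print(h, -math.log(p ** 1.5 + (1 - p) ** 1.5) / 0.5)
```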
Next, we introduce information measures under Assumption 2. In order to define a transition matrix counterpart of (12), let us introduce the following |Y| × |Y| matrix:

K_θ(y|y′) := W_θ(y|y′)^{1/(1+θ)},

where W_θ is defined by (39). Let κ_θ be the Perron-Frobenius eigenvalue of K_θ. Then, we define the upper conditional Rényi entropy for W by

H^{↑,W}_{1+θ}(X|Y) := -((1+θ)/θ) log κ_θ,

where θ ∈ (−1, 0) ∪ (0, ∞). For θ = −1 and θ = 0, H^{↑,W}_{1+θ}(X|Y) is defined by taking the limit. We have the following properties, which will be proved in Appendix F.
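A matching sketch (ours) for the upper measure, assuming the |Y| × |Y| matrix K_θ has entries W_θ(y|y′)^{1/(1+θ)} and H^{↑,W}_{1+θ}(X|Y) = −((1+θ)/θ) log κ_θ. For the Gilbert-Elliott-style chain with p_0 = p_1 = p, the |Y| × |Y| computation returns the same Bernoulli Rényi entropy as the |X||Y| × |X||Y| one:

```python
import math

# Hedged sketch (ours): under Assumption 2 we form
# W_θ(y|y') = Σ_x W(x,y|x',y')^{1+θ} (independent of x'), the |Y|x|Y| matrix
# K_θ(y|y') = W_θ(y|y')^{1/(1+θ)}, and H↑,W_{1+θ}(X|Y) = -((1+θ)/θ) log κ_θ.

def perron(M, iters=2000):
    n = len(M)
    v = [1.0] * n
    lam = 1.0
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = max(w)
        v = [x / lam for x in w]
    return lam

def upper_cond_renyi_rate(P_state, P_noise, theta):
    ys = (0, 1)
    K = [[sum((P_state[y][yp] * P_noise[y][x]) ** (1 + theta) for x in (0, 1))
          ** (1.0 / (1 + theta))
          for yp in ys] for y in ys]
    return -(1 + theta) / theta * math.log(perron(K))

# With p0 = p1 = p, the upper rate also reduces to the Rényi entropy of a
# Bern(p) source, -(1/θ) log(p^{1+θ} + (1-p)^{1+θ}).
q0, q1, p = 0.1, 0.2, 0.3
P_state = {0: {0: 1 - q0, 1: q1}, 1: {0: q0, 1: 1 - q1}}
P_noise = {0: {0: 1 - p, 1: p}, 1: {0: 1 - p, 1: p}}
print(upper_cond_renyi_rate(P_state, P_noise, 0.5))
```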

Lemma 5 We have the following identities.

Now, let us introduce a transition matrix counterpart of (18). For this purpose, we introduce the following |Y| × |Y| matrix N_{θ,θ′}. Let ν_{θ,θ′} be the Perron-Frobenius eigenvalue of N_{θ,θ′}. Then, we define the two-parameter conditional Rényi entropy in terms of ν_{θ,θ′}.

Remark 4 Although we defined H^{↓,W}_{1+θ}(X|Y) and H^{↑,W}_{1+θ}(X|Y) by (47) and (55), respectively, we can alternatively define these measures in the same spirit as the single-shot setting by introducing a transition matrix counterpart of H_{1+θ}(P_XY|Q_Y), defined relative to a transition matrix V on Y in place of the marginal W(y|y′). By using this measure with V = W(y|y′), we obviously recover H^{↓,W}_{1+θ}(X|Y). Furthermore, under Assumption 2, we can show that the maximum of this measure over V coincides with H^{↑,W}_{1+θ}(X|Y) (see Appendix G for the proof), where the maximum is taken over all transition matrices V satisfying Y²_W ⊂ Y²_V. Next, we investigate some properties of the information measures introduced in this section. The following lemma is proved in Appendix H.
We have the following.^{16}

^{16} Although we can also define these measures when the condition on V is not satisfied (see [22] for the details), for our purpose of defining H^{↓,W}_{1+θ}(X|Y) and H^{↑,W}_{1+θ}(X|Y), the other cases are irrelevant.
From Statement 1 of Lemma 6, the derivative d(θ H^{↓,W}_{1+θ}(X|Y))/dθ is monotonically decreasing. Thus, we can define the inverse function θ(a) for a < a ≤ ā, where ā is the limit of this derivative as θ → ∞. Since R(a) is a monotonically increasing function of a, we can define the inverse function a(R). For θ H^{↑,W}_{1+θ}(X|Y), for the same reason, we can define the inverse functions θ(a) and a(R) by (70). Here, the first equality in (71) follows from (66). By using these inverse functions, we can prove the following.

Lemma 7
The function θ(R) defined in (67) satisfies the following relation. Furthermore, we can show the following.

Lemma 8
The function θ(a(R)) defined by (70) satisfies the following relation for the corresponding range of R. Proof: See Appendix I.
Remark 5 As we can find from (49), (51), and Lemma 5, both of the conditional Rényi entropies expand as H^W(X|Y) − (θ/2) V^W(X|Y) + o(θ) around θ = 0. Thus, the difference between these measures appears significantly only when |θ| is rather large. For the transition matrix of Example 3 with q_0 = q_1 = 0.1, p_0 = 0.1, and p_1 = 0.4, we plot the values of the information measures in Fig. 1. Although the values at θ = −1 coincide in Fig. 1, note that the values at θ = −1 may differ in general.
In Example 1, we mentioned that the transition matrix W in (41) satisfies Assumption 2 when the transition matrix P is given by (42). In this case, we can find that H^{↓,W}_{1+θ}(X|Y) = H^{↑,W}_{1+θ}(X|Y), i.e., the two kinds of conditional Rényi entropies coincide. Now, let us consider the asymptotic behavior of H^{↓,W}_{1+θ}(X|Y) around θ = 0. When θ(a) is close to 0, we have an expansion of θ(a); taking the derivative, (67) implies an expansion of a(R). Hence, when R is close to H^W(X|Y), we obtain the expansions (81) and (82). Furthermore, (81) and (82) imply a corresponding expansion of the exponent.

C. Information Measures for Markov Chain
Let (X, Y) be the Markov chain induced by the transition matrix W and some initial distribution P_{X_1 Y_1}. Now, we show how the information measures introduced in Section II-B are related to the conditional Rényi entropy rates. First, we introduce the following lemma, which gives finite upper and lower bounds on the lower conditional Rényi entropy.
Lemma 9 Suppose that transition matrix W satisfies Assumption 1. Let v θ be the eigenvector of W T θ with respect to the Perron-Frobenius eigenvalue λ θ such that min x,y v θ (x, y) = 1.¹⁷ Let w θ (x, y) := P X1Y1 (x, y) 1+θ P Y1 (y) −θ . Then, for every n ≥ 1, we have the stated bounds, where ⟨v θ |w θ ⟩ is defined as Σ x,y v θ (x, y)w θ (x, y).
Proof: It follows from (297) and Lemma 26. From Lemma 9, we have the following.
Theorem 1 Suppose that transition matrix W satisfies Assumption 1. For any initial distribution, we have the following. We also have the following asymptotic evaluation of the variance, which follows from Lemma 27 in Appendix A.
Theorem 2 Suppose that transition matrix W satisfies Assumption 1. For any initial distribution, we have the following. Theorem 2 is practically important since the limit of the variance can be described by a single-letter characterized quantity. A method to calculate V W (X|Y ) can be found in [23].
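The quantities in Lemma 9 and Theorems 1-2 hinge on the Perron-Frobenius eigenvalue and eigenvector of a tilted matrix. As a minimal illustration (not the method of [23]), the following Python sketch runs power iteration on a hypothetical 2×2 nonnegative matrix standing in for W T θ , with the eigenvector normalized as in Lemma 9 so that its minimum entry equals 1:

```python
import numpy as np

def perron_frobenius(M, iters=10_000, tol=1e-12):
    """Power iteration for the Perron-Frobenius eigenvalue and eigenvector
    of a nonnegative irreducible matrix M (here standing in for W_theta^T)."""
    v = np.ones(M.shape[0])
    lam = 0.0
    for _ in range(iters):
        w = M @ v
        lam_new = np.linalg.norm(w)
        v_new = w / lam_new
        if abs(lam_new - lam) < tol:
            lam, v = lam_new, v_new
            break
        lam, v = lam_new, v_new
    return lam, v / v.min()  # normalize as in Lemma 9: min entry equals 1

# hypothetical 2x2 nonnegative matrix used only for illustration
W_theta_T = np.array([[0.9, 0.3],
                      [0.1, 0.7]])
lam, v = perron_frobenius(W_theta_T)
# the columns sum to 1, so the Perron-Frobenius eigenvalue is 1 and v = (3, 1)
```

Since the illustrative matrix is column-stochastic, λ = 1 serves as a quick sanity check on the iteration.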
Next, we show the lemma that gives finite upper and lower bounds on the upper conditional Rényi entropy in terms of the upper conditional Rényi entropy for the transition matrix.
Lemma 10 Suppose that transition matrix W satisfies Assumption 2. Let v θ be the eigenvector of K T θ with respect to the Perron-Frobenius eigenvalue κ θ such that min y v θ (y) = 1. Let w θ be the |Y|-dimensional vector defined by (95). Then, we have the stated bounds. Proof: See Appendix J. From Lemma 10, we have the following.
Theorem 3 Suppose that transition matrix W satisfies Assumption 2. For any initial distribution, we have the following. Finally, we show the lemma that gives finite upper and lower bounds on the two-parameter conditional Rényi entropy in terms of the two-parameter conditional Rényi entropy for the transition matrix.
Lemma 11 Suppose that transition matrix W satisfies Assumption 2. Let v θ,θ ′ be the eigenvector of N T θ,θ ′ with respect to the Perron-Frobenius eigenvalue ν θ,θ ′ such that min y v θ,θ ′ (y) = 1. Let w θ,θ ′ be the |Y|-dimensional vector defined below. Then, we have the stated bounds for θ > 0 and for θ < 0. Proof: We can write the quantity as a sum of two terms. The second term is evaluated by Lemma 10. The first term can be evaluated in almost the same manner as Lemma 10.
From Lemma 11, we have the following.
Theorem 4 Suppose that transition matrix W satisfies Assumption 2. For any initial distribution, we have

III. SOURCE CODING WITH FULL SIDE-INFORMATION
In this section, we investigate the source coding with side-information. We start this section by showing the problem setting in Section III-A. Then, we review and introduce some single-shot bounds in Section III-B. We derive finite-length bounds for the Markov chain in Section III-C. Then, in Sections III-F and III-E, we show the asymptotic characterization for the large deviation regime and the moderate deviation regime by using those finite-length bounds. We also derive the second order rate in Section III-D.
The results shown in this section are summarized in Table II. The checkmarks indicate that the tight asymptotic bounds (large deviation, moderate deviation, and second order) can be obtained from those bounds. The marks * indicate that the large deviation bound can be derived up to the critical rate. The computational complexity "Tail" indicates that the computational complexities of those bounds depend on the computational complexities of tail probabilities. It should be noted that Theorem 8 is derived from a special case (Q Y = P Y ) of Theorem 5. The asymptotically optimal choice is the one that corresponds to Corollary 1. Under Assumption 1, we can derive the bound of the Markov case only for that special choice of Q Y , while under Assumption 2, we can derive the bound of the Markov case for the optimal choice of Q Y .

A. Problem Formulation
A code Ψ = (e, d) consists of one encoder e : X → {1, . . ., M } and one decoder d : {1, . . ., M } × Y → X . The decoding error probability is defined accordingly. For notational convenience, we introduce the infimum of error probabilities under the condition that the message size is M . For theoretical simplicity, we focus on a randomized choice of our encoder. For this purpose, we employ a randomized hash function F from X to {1, . . ., M }. A randomized hash function F is called two-universal when Pr{F (x) = F (x ′ )} ≤ 1/M for any distinct x and x ′ [60]; the so-called bin coding [61] is an example of a two-universal hash function. In the following, we denote the set of two-universal hash functions by F. Given an encoder f as a function from X to {1, . . ., M }, we define the decoder d f as the optimal decoder via the argmin rule. Then, we denote the code (f, d f ) by Ψ(f ), and we bound the error probability P s [Ψ(F )] averaged over the random function F by using only the property of two-universality. In order to consider the worst case of such schemes, we introduce the quantity in (114). When we consider the n-fold extension, the source code and related quantities are denoted with the superscript (n). For example, the quantities in (112) and (114) are written as P (n) s (M ) and P(n) s (M ), respectively. Instead of evaluating them, we are often interested in evaluating the minimum message size for given 0 ≤ ε < 1.
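To illustrate the two-universality condition Pr{F (x) = F (x ′ )} ≤ 1/M, here is a small Python sketch of bin coding realized as a random linear hash over GF(2); the block length n = 6 and hash length m = 3 are arbitrary choices for illustration:

```python
import random

def random_linear_hash(n, m, rng):
    """Sample F(x) = A x over GF(2) for a uniformly random binary m x n
    matrix A; for distinct x, x' the collision probability is exactly 2**-m."""
    A = [[rng.randrange(2) for _ in range(n)] for _ in range(m)]
    def F(x):
        return tuple(sum(a * b for a, b in zip(row, x)) % 2 for row in A)
    return F

rng = random.Random(0)
n, m, trials = 6, 3, 2000      # hash into M = 2**m = 8 bins
x, xp = (1, 0, 1, 1, 0, 0), (0, 1, 1, 0, 0, 1)   # two distinct inputs

collisions = 0
for _ in range(trials):
    F = random_linear_hash(n, m, rng)
    if F(x) == F(xp):
        collisions += 1
rate = collisions / trials     # should be close to 1/M = 0.125
```

Because each row of A hits the nonzero vector x ⊕ x ′ with probability 1/2 independently, the collision probability is exactly 2⁻ᵐ = 1/M, so this family satisfies the two-universality condition with equality.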

B. Single Shot Bounds
In this section, we review existing single-shot bounds and also show novel converse bounds. For the information measures used below, see Section II.
By using the standard argument of the information-spectrum approach, we have the following achievability bound.

Lemma 12 (Lemma 7.2.1 of [4])
The following bound holds. Although Lemma 12 is useful in the second-order regime, it is known not to be tight in the large deviation regime. By using the large deviation technique of Gallager, we have the following exponential-type achievability bound.

Lemma 13 ([62])
The following bound holds:¹⁸ Ps (M ) ≤ inf … Although Lemma 13 is known to be tight in the large deviation regime for i.i.d. sources, H ↑ 1+θ (X|Y ) for Markov chains can only be evaluated under the strongly non-hidden assumption. For this reason, even though the following bound is looser than Lemma 13, it is useful to have another bound in terms of H ↓ 1+θ (X|Y ), which can be evaluated for Markov chains under the non-hidden assumption.

Lemma 14
The following bound holds. Proof: To derive this bound, we change the variable in (118) as θ = θ ′ /(1 − θ ′ ). Then, −1 ≤ θ ′ ≤ 0, and we have the claimed bound, where we used Lemma 28 in Appendix C. When Y is a singleton, we have the following bound, which is tighter than Lemma 13.

Lemma 15 ((2.39) [63])
The following bound holds. For the converse part, we first have the following bound, which is very close to the operational definition of source coding with side-information.

Lemma 16 ([64])
Let {Ω y } y∈Y be a family of subsets Ω y ⊂ X , and let Ω = ∪ y∈Y Ω y × {y}. Then, for any Q Y ∈ P(Y), the following bound holds. Since Lemma 16 is so close to the operational definition, it is not easy to evaluate directly. Thus, by slightly weakening Lemma 16, we derive another bound, which is more tractable for evaluation.
Theorem 5 For any Q Y ∈ P(Y), we have the bound below, where R = log M , and θ(a) = θ Q (a) and a(R) = a Q (R) are the inverse functions defined in (29) and (32), respectively.
Proof: See Appendix K. In particular, by taking Q Y = P (1+θ(a(R))) Y in Theorem 5, we have the following.
Remark 6 Here, it is worthwhile to discuss the possibility of extension to the continuous case. As explained in Remark 1, we can define the information quantities for the case when Y is continuous but X is a discrete finite set. The discussions in this subsection still hold even in this continuous case. In particular, in the n-i.i.d. extension with this continuous setting, Lemma 13 and Corollary 1 hold when the information measures are replaced by n times the single-shot information measures.

C. Finite-Length Bounds for Markov Source
In this subsection, we derive several finite-length bounds for the Markovian source in a computable form. Unfortunately, it is not easy to evaluate how tight those bounds are from their formulas alone. Their tightness will be discussed by considering the asymptotic limit in the remaining subsections of this section. Since we assume irreducibility for the transition matrix describing the Markov chain, the following bounds hold for any initial distribution.
To derive a lower bound on − log Ps (M n ) in terms of the Rényi entropy of the transition matrix, we substitute the formula for the Rényi entropy given in Lemma 9 into Lemma 14. Then, we can derive the following achievability bound.
When Y is a singleton, from Lemma 15 and a special case of Lemma 9, we have the following achievability bound.
Theorem 7 (Direct, Singleton) Let R := (1/n) log M n . Then, for every n ≥ 1, we have the bound below. To derive an upper bound on − log P s (M n ) in terms of the Rényi entropy of the transition matrix, we substitute the formula for the Rényi entropy given in Lemma 9 into Theorem 5. Then, we have the following converse bound. Theorem 8 (Converse, Ass. 1) Suppose that transition matrix W satisfies Assumption 1. Let R := (1/n) log M n . For any H W (X|Y ) < R < H ↓,W 0 (X|Y ), we have the bound below, where θ(a) = θ ↓ (a) and a(R) = a ↓ (R) are the inverse functions defined by (67) and (70), respectively, and δ(•) and δ(•) are given by (90) and (91), respectively.
Proof: We first use (125) of Theorem 5 for Q Y n = P Y n and Lemma 9. Then, we restrict the range of θ as −1 < θ < θ(a(R)) and set ϑ = (θ(a(R)) − θ)/(1 + θ). Then, we have the assertion of the theorem. Next, we derive tighter bounds under Assumption 2. To derive a lower bound on − log Ps (M n ) in terms of the Rényi entropy of the transition matrix, we substitute the formula for the Rényi entropy in Lemma 10 into Lemma 13. Then, we have the following achievability bound. Theorem 9 (Direct, Ass. 2) Suppose that transition matrix W satisfies Assumption 2. Let R := (1/n) log M n . Then we have the bound below, where ξ(θ) is given by (98).
Finally, to derive an upper bound on − log P s (M n ) in terms of the Rényi entropy for the transition matrix, we substitute the formula for the Rényi entropy in Lemma 11 into Theorem 5 with an appropriate choice of Q Y n . Then, we can derive the following converse bound.

D. Second Order
By applying the central limit theorem to Lemma 12 (cf. [65, Theorem 27.4, Example 27.6]) and Lemma 17 for Q Y = P Y , and by using Theorem 2, we have the following.
Theorem 11 Suppose that transition matrix W on X × Y satisfies Assumption 1. For arbitrary ε ∈ (0, 1), we have the following. Proof: The relevant quantity divided by √n asymptotically obeys the normal distribution with average 0 and variance V W (X|Y ), where we use Theorem 2 to show that the limit of the variance is given by V W (X|Y ). Substituting this in Lemma 12, we have (145). On the other hand, substituting M = e nH W (X|Y )+√nR and γ = nH W (X|Y ) + √nR + n 1/4 in Lemma 17 for Q Y = P Y , we have (146). Combining (145) and (146), we have the statement of the theorem.
From the above theorem, the (first-order) compression limit of source coding with side-information for a Markov source under Assumption 1 is given by H W (X|Y ) for any ε ∈ (0, 1).²⁰ In the next subsections, we consider the asymptotic behavior of the error probability when the rate is larger than the compression limit H W (X|Y ) in the moderate deviation regime and the large deviation regime, respectively.
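As a hedged illustration of how the second-order result is used in practice, the following Python sketch evaluates the normal approximation n H + √(n V ) Φ⁻¹(ε) to the optimal log message size suggested by Theorem 11; the values of H, V, and ε below are hypothetical:

```python
from math import sqrt
from statistics import NormalDist

def second_order_log_M(n, H, V, eps):
    """Normal approximation to the optimal log message size suggested by
    Theorem 11: log M(n, eps) ~ n*H + sqrt(n*V) * Phi^{-1}(eps), where H is
    the conditional entropy rate and V the variance of Theorem 2."""
    return n * H + sqrt(n * V) * NormalDist().inv_cdf(eps)

# hypothetical values of H (nats/symbol), V, and target error probability
approx = second_order_log_M(n=10_000, H=0.5, V=0.2, eps=1e-3)
# since eps < 1/2 the Gaussian quantile is negative, so the approximation
# sits below the first-order term n*H
```

For ε < 1/2 the correction term pushes the achievable size below nH, matching the intuition that small error probabilities require compressing less aggressively.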

E. Moderate Deviation
From Theorem 6 and Theorem 8, we have the following.
Theorem 12 Suppose that transition matrix W satisfies Assumption 1. For arbitrary t ∈ (0, 1/2) and δ > 0, we have the following. Proof: We apply Theorem 6 and Theorem 8 to the case with the moderate deviation scaling. For the achievability part, from (88) and Theorem 6, we have the desired exponent. To prove the converse part, we fix arbitrary s > 0 and choose θ suitably. Remark 7 In the literature [13], [67], the moderate deviation results are stated for ǫ n such that ǫ n → 0 and nǫ 2 n → ∞ instead of n −t for t ∈ (0, 1/2). Although the former is slightly more general than the latter, we employ the latter formulation in Theorem 12 since the order of convergence is clearer. In fact, n −t in Theorem 12 can be replaced by a general ǫ n without modifying the argument of the proof.

F. Large Deviation
From Theorem 6 and Theorem 8, we have the following.
Theorem 13 Suppose that transition matrix W satisfies Assumption 1. For H W (X|Y ) < R, we have (156). On the other hand, for H W (X|Y ) < R < H ↓,W 0 (X|Y ), we have (157). Proof: The achievability bound (156) follows from Theorem 6. The converse part (157) is proved from Theorem 8 as follows. We first fix s > 0 and −1 < θ < θ(a(R)). Then, Theorem 8 implies a corresponding finite-length bound. By taking the limit s → 0 and θ → θ(a(R)), we obtain (157). Thus, (157) is proved. The alternative expression (158) is derived via Lemma 8. Under Assumption 2, from Theorem 9 and Theorem 10, we have the following tighter bound.
Theorem 14 Suppose that transition matrix W satisfies Assumption 2. For H W (X|Y ) < R, we have (165). On the other hand, for H W (X|Y ) < R < H ↑,W 0 (X|Y ), we have (166). Proof: The achievability bound (165) follows from Theorem 9. The converse part (166) is proved from Theorem 10 as follows. We first fix s > 0 and −1 < θ < θ(a(R)). Then, Theorem 10 implies a corresponding finite-length bound. By taking the limit s → 0 and θ → θ(a(R)), we obtain (166). Thus, (166) is proved. The alternative expression (167) is derived via Lemma 8.

Remark 8
For R ≤ R cr , where R cr is the critical rate (cf. (72) for the definition of R(a)), we can rewrite the lower bound in (165) as a supremum expression (cf. Lemma 8). Thus, the lower bound and the upper bound coincide up to the critical rate.
Remark 9 When Y is a singleton, from Theorem 7 and a special case of (157), we can derive the corresponding bounds for H W (X) < R < H W 0 (X). Thus, we can recover the results in [39], [40] by our approach.

G. Numerical Example
In this section, to demonstrate the utility of our finite-length bounds, we numerically evaluate the achievability bound in Theorem 7 and a special case of the converse bound in Theorem 8 for singleton Y. Thanks to criterion (C2), our numerical calculation shows that our upper finite-length bound is very close to our lower finite-length bound when the size n is sufficiently large. Thanks to criterion (C1), we could calculate both bounds even for the huge size n = 1 × 10 5 because the calculation complexity behaves as O(1).
We consider the binary transition matrix W given by Fig. 2. In this case, the stationary distribution and the entropy rate can be written in terms of h(•), the binary entropy function, and the Perron-Frobenius eigenvalue of the tilted transition matrix has a closed form. From these calculations, we can evaluate the bounds in Theorem 7 and Theorem 8. For p = 0.1, q = 0.2, the bounds are plotted in Fig. 3 for the fixed error probability ε = 10 −3 . Although there is a gap between the achievability bound and the converse bound for rather small n, the gap is less than approximately 5% of the entropy rate for n larger than 10000. We also plotted the bounds in Fig. 4 for the fixed block length n = 10000 and varying ε. The gap between the achievability bound and the converse bound remains approximately 5% of the entropy rate even for ε as small as 10 −10 .
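The calculation sketched above can be reproduced numerically. The following Python sketch assumes a common convention for the matrix of Fig. 2, namely that W flips state 0 with probability p and state 1 with probability q, and evaluates the stationary distribution, the entropy rate, and the Rényi entropy rate of order 1 + θ obtained from the Perron-Frobenius eigenvalue of the entrywise tilted matrix (all in nats):

```python
import numpy as np

def h(x):
    """Binary entropy in nats."""
    return -x * np.log(x) - (1 - x) * np.log(1 - x)

p, q = 0.1, 0.2
# assumed convention for Fig. 2: W flips state 0 w.p. p and state 1 w.p. q;
# columns are indexed by the previous state z'
W = np.array([[1 - p, q],
              [p, 1 - q]])
pi = np.array([q, p]) / (p + q)      # stationary distribution
H = pi[0] * h(p) + pi[1] * h(q)      # entropy rate (nats/symbol)

# Renyi entropy rate of order 1 + theta from the entrywise tilted matrix
theta = 0.5
W_theta = W ** (1 + theta)
lam = np.linalg.eigvals(W_theta).real.max()  # Perron-Frobenius eigenvalue
H_renyi = -np.log(lam) / theta
```

For θ > 0 the Rényi rate comes out below the Shannon entropy rate, as expected from the monotonicity of the Rényi entropy in its order.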

IV. CHANNEL CODING
In this section, we investigate the channel coding with a conditional additive channel. The former part of this section discusses general properties of the channel coding with a conditional additive channel. The latter part of this section discusses properties of the channel coding when the conditional additive noise of the channel is Markovian. We start this section by showing the problem setting in Section IV-A by introducing a conditional additive channel. Section IV-B gives a canonical method to convert a regular channel to a conditional additive channel. Section IV-C gives a method to convert a BPSK-AWGN channel to a conditional additive channel. Then, we show some single-shot achievability bounds in Section IV-D, and single-shot converse bounds in Section IV-E.
As the latter part, we derive finite-length bounds for the Markov noise channel in Section IV-F. Then, in Sections IV-I and IV-H, we show the asymptotic characterization for the large deviation regime and the moderate deviation regime by using those finite-length bounds. We also derive the second order rate in Section IV-G. The results shown in this section for the Markovian conditional additive noise are summarized in Table III. The checkmarks indicate that the tight asymptotic bounds (large deviation, moderate deviation, and second order) can be obtained from those bounds. The marks * indicate that the large deviation bound can be derived up to the critical rate. The computational complexity "Tail" indicates that the computational complexities of those bounds depend on the computational complexities of tail probabilities. It should be noted that Theorem 18 is derived from a special case (Q Y = P Y ) of Theorem 16. The asymptotically optimal choice is Q . Under Assumption 1, we can derive the bound of the Markov case only for that special choice of Q Y , while under Assumption 2, we can derive the bound of the Markov case for the optimal choice of Q Y . Furthermore, Theorem 18 is not asymptotically tight in the large deviation regime in general, but it is tight if Y is a singleton, i.e., the channel is additive. It should also be noted that Theorem 20 does not imply Theorem 18 even for the additive channel case since Assumption 2 restricts the structure of transition matrices even when Y is a singleton.

A. Formulation for conditional additive channel

1) Single-shot case: We first present the problem formulation in the single shot setting. For a channel P B|A (b|a) with input alphabet A and output alphabet B, a channel code Ψ = (e, d) consists of one encoder e : {1, . . ., M } → A and one decoder d : B → {1, . . ., M }. The average decoding error probability is defined accordingly. For notational convenience, we introduce the error probability under the condition that the message size is M . Assume that the input alphabet A is the same set as the output alphabet B and that they equal an additive group X . When the transition matrix P B|A (b|a) is given as P X (b − a) by using a distribution P X on X , the channel is called additive.
To extend the concept of additive channel, we consider the case when the input alphabet A is an additive group X and the output alphabet B is the product set X × Y. When the transition matrix P B|A (x, y|a) is given as P XY (x − a, y) by using a distribution P XY on X × Y, the channel is called conditional additive. In this paper, we are exclusively interested in the conditional additive channel. As explained in Subsection IV-B, a channel is a conditional additive channel if and only if it is a regular channel in the sense of [29]. When we need to explicitly express the underlying distribution of the noise, we denote the average decoding error probability by P c [Ψ|P XY ].
2) n-fold extension: When we consider the n-fold extension, the channel code is denoted with subscript n such as Ψ n = (e n , d n ). The error probabilities given in (188) and (189) are written with the superscript (n). Instead of evaluating the error probability P (n) c (M n ) for given M n , we are also interested in evaluating the maximum message size. When the channel is given as a conditional distribution, the channel is specified accordingly. For the code construction, we investigate linear codes. For an (n, k) linear code C n ⊂ A n , there exists a parity check matrix f n : A n → A n−k such that the kernel of f n is C n . That is, given a parity check matrix f n : A n → A n−k , we define the encoder I Ker(fn) : C n → A n as the embedding of the kernel Ker(f n ), and use the decoder d fn defined by the argmin rule. Here, we employ a randomized choice of a parity check matrix. In particular, instead of a general two-universal hash function, we focus on linear two-universal hash functions, because the linearity is required in the above relation with source coding. So, denoting the set of linear two-universal hash functions from A n to A n−k by F l , we introduce the corresponding quantity. Taking the infimum over all linear codes associated with F n (cf. (113)), we obviously have the corresponding inequality. When we consider the error probability for conditionally additive channels, we use the notation Pc (n, k|P XY ) so that the underlying distribution of the noise is explicit. We are also interested in characterizing the optimal rate for given 0 ≤ ε ≤ 1.

B. Conversion from regular channel to conditional additive channel
This subsection shows that a channel is a regular channel in the sense of [29] if and only if it is conditional additive. Then, we see that a binary erasure symmetric channel is an example of a regular channel.
We assume that the input alphabet A has an additive group structure. Let P X be a distribution on the output alphabet B. Let π a be a representation of the group A on B, and let G = {π a : a ∈ A}. A regular channel [29] is defined accordingly. The set of all orbits constitutes a disjoint partition of B. A set of representatives of the orbits is denoted by B, and let ̟ : B → B be the map to the representatives.
Let B = X × Y and P X = P XY for some joint distribution on X × Y. Now, we consider a conditional additive channel, whose transition matrix P B|A (x, y|a) is given as P XY (x − a, y). When the group action is given by π a (x, y) = (x − a, y), the above conditional additive channel is given as a regular channel. In this case, there are |Y| orbits, and the size of each orbit is |X |. This fact shows that any conditional additive channel is written as a regular channel.
Conversely, we show that any regular channel is written as a conditional additive channel. For this purpose, we convert a regular channel to a conditional additive channel as follows.
We first explain the construction for the single shot channel. For random variable X ∼ P X , let Y = B and Y = ̟( X) be the random variable describing the representatives of the orbits. For each orbit Orb(y), fix an element 0 y ∈ Orb(y). Then, let Stb(0 y ) be the stabilizer subgroup of 0 y .²¹ Let A/Stb(0 y ) be a set of coset representatives of the coset A/Stb(0 y ), and let ι y be the map to the coset representatives. Then, we can define the stated bijective map. Let X = A and P X|Y (•|y) be the distribution on A defined by (202). When the output from the real channel is b, the output from the virtual channel is defined by (203), where A ′ is randomly chosen from Stb(0 ̟(b) ).

Theorem 15
The virtual channel defined by (203) is the conditional additive channel such that the output is given by (a + X, Y ) for (X, Y ) ∼ P XY , where P XY is defined from P Y and P X|Y of (202).
Proof: When the input to the real channel is a, note that the output can be written as π −a ( X), where X ∼ P X . By noting that Y = ̟(π −a ( X)) ∼ P Y , the output of the virtual channel is written as in (205), where (205) follows from the stated fact, and we set A ′′ accordingly. Since A ′′ is the uniform random variable on Stb(0 Y ), the joint distribution of (ι Y ( X) + A ′′ , Y ) is P XY . Thus, we have the statement of the theorem.
Similarly, for the n-fold extension, we can also construct the virtual conditional additive channel. More precisely, for Xn ∼ P Xn , we set Y n = ̟( Xn ) = (̟( X1 ), . . ., ̟( Xn )) and define the noise process accordingly. Since the conversion to the virtual channel in (203) is reversible, we can assume that the channel is conditional additive from the beginning without loss of generality.

C. Conversion of BPSK-AWGN Channel to Conditional Additive Channel
Although we only consider finite input/output sources and channels throughout the paper, in order to demonstrate the utility of the conditional additive channel framework, let us consider the additive white Gaussian noise (AWGN) channel with binary phase shift keying (BPSK) in this section. Let A = {0, 1} be the input alphabet of the channel, and let B = R be the output alphabet of the channel. For an input a ∈ A and Gaussian noise Z with mean 0 and variance σ 2 , the output of the channel is given by B = (−1) a + Z. Then, the conditional probability density function of this channel is given as in (216). The relations (216) and (220) show that the AWGN channel with BPSK is given as a conditional additive channel in the above sense. By noting this observation, as explained in Remark 6, the single-shot achievability bounds in Section III-B are also valid for continuous Y . Also, the discussions for the single-shot converse bounds in Subsection IV-E hold even for continuous Y . So, the bounds in Subsections IV-D and IV-E are also applicable to the BPSK-AWGN channel.
In particular, in the n-fold memoryless extension of the BPSK-AWGN channel, the information measures for the noise distribution are given as n times the single-shot information measures for the noise distribution. Even in this case, the upper and lower bounds in Subsections IV-D and IV-E are applicable by replacing the information measures by n times the single-shot information measures. Therefore, we obtain finite-length upper and lower bounds on the optimal coding length for the memoryless BPSK-AWGN channel. Furthermore, even when the additive noise is not Gaussian, if the probability density function p Z of the additive noise Z satisfies the symmetry p Z (z) = p Z (−z), the BPSK channel with the additive noise Z can be converted to a conditional additive channel in the same way.
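The sign-flip structure described above can be checked by simulation. In the following Python sketch, the virtual output takes Y = |B| as the orbit representative under the sign-flip action and X as the sign bit of B XORed with the input a; the conditional additive property is reflected in the fact that the statistics of X do not depend on a. The parameter values are illustrative:

```python
import random

def bpsk_awgn(a, sigma, rng):
    """Real channel output B = (-1)**a + Z with Z ~ N(0, sigma^2)."""
    return (-1) ** a + rng.gauss(0.0, sigma)

def to_virtual(a, b):
    """Virtual conditional-additive output: Y = |b| is the orbit
    representative under the sign-flip action, and the additive noise
    component X is the sign bit of b XORed with the input a."""
    sign_bit = 0 if b >= 0 else 1
    return sign_bit ^ a, abs(b)        # (noise X, side information Y)

rng = random.Random(1)
sigma, trials = 1.0, 50_000
# the law of the virtual noise X must not depend on the input a
flips = {}
for a in (0, 1):
    flips[a] = sum(to_virtual(a, bpsk_awgn(a, sigma, rng))[0]
                   for _ in range(trials)) / trials
```

For σ = 1, both empirical flip rates should be close to P(Z < −1) ≈ 0.159, and, more importantly, close to each other.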

D. Achievability Bound Derived by Source Coding with Side-Information
In this subsection, we give a code for a conditional additive channel constructed from a code of source coding with side-information in a canonical way. In this construction, we see that the decoding error probability of the channel code equals that of the source code.
When the channel is given as the conditional additive channel with conditional additive noise distribution P X n Y n as in (191) and X = A is the finite field F q , we can construct a linear channel code from a source code with full side-information whose encoder and decoder are f n and d n as follows. That is, we assume linearity for the source encoder f n . Let C n (f n ) be the kernel of the linear encoder f n of the source code. Suppose that the sender sends a codeword c n ∈ C n (f n ) and (c n + X n , Y n ) is received. Then, the receiver computes the syndrome f n (c n + X n ) = f n (X n ), estimates X n from f n (X n ) and Y n , and subtracts the estimate from c n + X n . That is, we choose the channel decoder dn as dn (x ′n , y n ) := x ′n − d n (f n (x ′n ), y n ). Decoding succeeds in this channel coding if and only if d n (f n (X n ), Y n ) equals X n . Thus, the error probability of this channel code coincides with that of the source code for the correlated source (X n , Y n ). In summary, we have the following lemma, which was first pointed out in [27].

Lemma 18 ([27, (19)])
Given a linear encoder f n and a decoder d n for source coding with side-information with distribution P X n Y n , let I Ker(fn) and dn be the channel encoder and decoder induced from (f n , d n ). Then, the error probability of channel coding for the conditionally additive channel with noise distribution P X n Y n satisfies the corresponding equality. Furthermore,²² taking the infimum for F n chosen to be a linear two-universal hash function, we also have (223). By using this observation and the results in Section III-B, we can derive the achievability bounds. By using the conversion argument in Section IV-B, we can also construct a channel code for a regular channel from a source code with full side-information. Although the following bounds are just specializations of known bounds for conditional additive channels, we review these bounds here to clarify the correspondence between the bounds in source coding with side-information and channel coding.
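A toy Python sketch of the syndrome construction behind Lemma 18, using a small hypothetical parity-check matrix over GF(2) (F 2 plays the role of A): the syndrome of c n + X n depends only on the noise X n , which is exactly the property used above.

```python
import itertools
import numpy as np

# a small hypothetical parity-check matrix over GF(2): f(x) = H x mod 2
H = np.array([[1, 1, 0, 1],
              [0, 1, 1, 1]])

def f(x):
    return tuple(H.dot(x) % 2)

# the code C(f) is the kernel of the linear map f
code = [np.array(c) for c in itertools.product([0, 1], repeat=4)
        if f(np.array(c)) == (0, 0)]

# the key property: for any codeword c, the syndrome of the channel
# output c + X reveals only the noise X
noise = np.array([0, 0, 1, 0])
ok = all(f((c + noise) % 2) == f(noise) for c in code)
```

With two independent parity checks on length-4 vectors, the kernel has dimension 2, i.e., four codewords, and the syndrome is invariant under adding any of them.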
From Lemma 12 and (223), we have the following.

Lemma 19 ([3])
The following bound holds. From Lemma 13 and (223), we have the following exponential-type bound.

Lemma 20 ([7])
The following bound holds. From Lemma 14 and (223), we have the following slightly looser exponential bound.

Lemma 21 ([4], [68])
The following bound holds.²³ When Y is a singleton, i.e., the virtual channel is additive, we have the following special case of Lemma 20.

E. Converse Bound
In this subsection, we show some converse bounds. The following is the information-spectrum type converse shown in [5].

Lemma 23 ([5, Lemma 4])
For any code Ψ n = (e n , d n ) and any output distribution Q B n ∈ P(B n ), we have (228). When a channel is a conditional additive channel, we have (229). By taking the output distribution Q B n as in (230) for some Q Y n ∈ P(Y n ), we have the following bound.
Lemma 24 When a channel is a conditional additive channel, for any distribution Q Y n ∈ P(Y n ), we have the following bound. Proof: By noting (229) and (230), the first term of the right hand side of (228) can be rewritten accordingly, which implies the statement of the lemma. By a similar argument as in Theorem 5, we can also derive the following converse bound.

Theorem 16
For any Q Y n ∈ P(Y n ), we have the bound below, where R = n log |A| − log M n , and θ(a) and a(R) are the inverse functions defined in (29) and (32), respectively.
Proof: See Appendix L.

F. Finite-Length Bound for Markov Noise Channel
From this section, we address the conditional additive channel whose conditional additive noise is subject to a Markov chain. Here, the input alphabet A n equals the additive group X n = F n q and the output alphabet B n is X n × Y n . That is, the transition matrix describing the channel is given by using a transition matrix W on X × Y and an initial distribution Q. As in Section II-B, we consider two assumptions on the transition matrix W of the noise process (X, Y), i.e., Assumption 1 and Assumption 2. We also use the same notations as in Section II-B.
Example 6 (Gilbert-Elliott channel with state-information available at the receiver) The Gilbert-Elliott channel [30], [31] is characterized by a channel state Y n on Y n = {0, 1} n and an additive noise X n on X n = {0, 1} n . The noise process (X n , Y n ) is a Markov chain induced by the transition matrix W introduced in Example 3. For the channel input a n , the channel output is given by (a n + X n , Y n ) when the state-information is available at the receiver. Thus, this channel can be regarded as a conditional additive channel, and the transition matrix of the noise process satisfies Assumption 2.
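A minimal Python sketch of the Gilbert-Elliott noise process in Example 6, under the (assumed) reading that the state Y flips with probability q0 or q1 depending on the current state and the additive noise X has crossover probability p0 or p1 depending on the state; the parameter values follow Example 3 as quoted in Remark 5:

```python
import random

def gilbert_elliott_noise(n, p0, p1, q0, q1, rng):
    """Sample the noise process (X_i, Y_i): the state Y flips 0 -> 1 with
    probability q0 and 1 -> 0 with probability q1 (an assumed reading of
    Example 3), and the additive noise X is 1 with probability p0 in state 0
    and with probability p1 in state 1."""
    y, out = 0, []
    for _ in range(n):
        p = p0 if y == 0 else p1
        x = 1 if rng.random() < p else 0
        out.append((x, y))
        flip = q0 if y == 0 else q1
        if rng.random() < flip:
            y = 1 - y
    return out

rng = random.Random(7)
noise = gilbert_elliott_noise(100_000, p0=0.1, p1=0.4, q0=0.1, q1=0.1, rng=rng)
# with q0 = q1 the two states are equally likely in the stationary regime,
# so the marginal crossover rate is roughly (p0 + p1) / 2 = 0.25
rate = sum(x for x, _ in noise) / len(noise)
```

Because the state is hidden in X alone but revealed by Y, the pair (X, Y) is a Markov chain even though X by itself is not, which is why the state-information at the receiver makes the channel conditionally additive.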
Proofs of the following bounds are almost the same as those in Section III-C, and thus omitted. From Lemma 21 and Lemma 9, we can derive the following achievability bound.
Theorem 17 (Direct, Ass. 1) Suppose that the transition matrix W of the conditional additive noise satisfies Assumption 1. Let R := ((n − k)/n) log |A|. Then we have the bound below. From Theorem 16 for Q Y n = P Y n and Lemma 9, we have the following converse bound.
Theorem 18 (Converse, Ass. 1) Suppose that the transition matrix W of the conditional additive noise satisfies Assumption 1. Let R := ((n − k)/n) log |A|. If H W (X|Y ) < R < H ↓,W 0 (X|Y ), then we have the bound below, where θ(a) = θ ↓ (a) and a(R) = a ↓ (R) are the inverse functions defined by (67) and (70), respectively. Next, we derive tighter bounds under Assumption 2. From Lemma 20 and Lemma 10, we have the following achievability bound.
By using Theorem 16 for an appropriate choice of Q Y n and Lemma 11, we can derive the following converse bound.
Theorem 20 (Converse, Ass. 2) Suppose that the transition matrix W of the conditional additive noise satisfies Assumption 2. Let R := ((n − k)/n) log |A|. If H W (X|Y ) < R < H ↑,W 0 (X|Y ), then we have the bound below, where θ(a) = θ ↑ (a) and a(R) = a ↑ (R) are the inverse functions defined by (71) and (73), respectively. Finally, when Y is a singleton, i.e., the channel is additive, we can derive the following achievability bound from Lemma 22.
Theorem 21 (Direct, Singleton) Let R := ((n − k)/n) log |A|. Then we have the bound below. Remark 10 Our treatment of the Markovian conditional additive channel covers Markovian regular channels because a Markovian regular channel can be reduced to a Markovian conditional additive channel as follows. Let X = { Xn } ∞ n=1 be a Markov chain on B whose distribution is given by a transition matrix W and an initial distribution Q. Let (X, Y) = {(X n , Y n )} ∞ n=1 be the noise process of the conditional additive channel derived from the noise process X of the regular channel by the argument of Section IV-B. Since we can write its transition probabilities explicitly, the process (X, Y) is also a Markov chain. Thus, the regular channel given by X is reduced to the conditional additive channel given by (X, Y).

G. Second Order
To discuss the asymptotic performance, we introduce the quantity below. Theorem 22 Suppose that the transition matrix W of the conditional additive noise satisfies Assumption 1. For arbitrary ε ∈ (0, 1), we have the following. Proof: It can be proved exactly in the same manner as Theorem 11.
From the above theorem, the (first-order) capacity of the conditional additive channel under Assumption 1 is given by this quantity for every 0 < ε < 1. In the next subsections, we consider the asymptotic behavior of the error probability when the rate is smaller than the capacity, in the moderate deviation regime and the large deviation regime, respectively.
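As an informal illustration of how such second-order results are evaluated at finite block length, the following sketch uses the standard normal-approximation form R(n, ε) ≈ C + sqrt(V/n) · Φ⁻¹(ε), where Φ⁻¹ is the inverse of the standard normal CDF. The constants C and V below are illustrative placeholders (not values computed from the theorems of this paper).

```python
from statistics import NormalDist
from math import log, sqrt

def second_order_rate(C, V, n, eps):
    """Normal-approximation rate (nats/use): R(n, eps) ~ C + sqrt(V/n) * Phi^{-1}(eps).

    C and V are assumed given; Phi^{-1}(eps) < 0 for eps < 1/2, so the rate
    approaches the capacity C from below as n grows.
    """
    return C + sqrt(V / n) * NormalDist().inv_cdf(eps)

# Illustrative numbers: binary alphabet, hypothetical noise entropy rate and dispersion
C = log(2) - 0.3835
V = 0.2
for n in (100, 1000, 10000):
    print(n, second_order_rate(C, V, n, 1e-3))
```

The penalty term sqrt(V/n) · |Φ⁻¹(ε)| shrinks as 1/sqrt(n), which is the back-off from capacity that the second-order analysis quantifies.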

H. Moderate Deviation
From Theorem 17 and Theorem 18, we have the following.
Theorem 23 Suppose that the transition matrix W of the conditional additive noise satisfies Assumption 1. For arbitrary t ∈ (0, 1/2) and δ > 0, we have

Proof: It can be proved in exactly the same manner as Theorem 12.

I. Large Deviation
From Theorem 17 and Theorem 18, we have the following.
Theorem 24 Suppose that the transition matrix W of the conditional additive noise satisfies Assumption 1. For H^W(X|Y) < R, we have

On the other hand, for H^W(X|Y) < R < H_0^{↓,W}(X|Y), we have

Proof: It can be proved in exactly the same manner as Theorem 13.

Under Assumption 2, from Theorem 19 and Theorem 20, we have the following tighter bound.
Theorem 25 Suppose that the transition matrix W of the conditional additive noise satisfies Assumption 2. For H^W(X|Y) < R, we have

On the other hand, for H^W(X|Y) < R < H_0^{↑,W}(X|Y), we have

Proof: It can be proved in exactly the same manner as Theorem 14.

When Y is a singleton, i.e., the channel is additive, from Theorem 21 and (263), we have the following.

2) Transition Matrix: Let {W(z|z′)}_{(z,z′)∈Z²} be an ergodic and irreducible transition matrix, and let P be its stationary distribution. For a function g : Z × Z → R, let

E[g] := Σ_{z,z′} P(z′) W(z|z′) g(z, z′). (278)

We also introduce the following tilted matrix:

W_ρ(z|z′) := W(z|z′) e^{ρ g(z,z′)}. (279)

Let λ_ρ be the Perron–Frobenius eigenvalue of W_ρ. Then, the CGF for W with generator g is defined by φ(ρ) := log λ_ρ.

Lemma 25 The function φ(ρ) is a convex function of ρ, and it is strictly convex iff φ″(0) > 0.
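Since the CGF is defined through the Perron–Frobenius eigenvalue of the tilted matrix (279), it can be evaluated numerically. The following sketch, with a hypothetical two-state transition matrix W and generator g, computes φ(ρ), and checks φ(0) = 0, φ′(0) = E[g] from (278), and the convexity asserted in Lemma 25.

```python
import numpy as np

# Hypothetical 2-state ergodic transition matrix; W[z, zp] = W(z|z'), columns sum to 1
W = np.array([[0.9, 0.2],
              [0.1, 0.8]])
# Hypothetical generator g(z, z')
g = np.array([[0.0, 1.0],
              [1.0, 0.5]])

# Stationary distribution P: Perron eigenvector of W, normalized to sum to 1
evals, evecs = np.linalg.eig(W)
P = np.real(evecs[:, np.argmax(np.real(evals))])
P = P / P.sum()

def cgf(rho):
    """phi(rho) = log of the Perron-Frobenius eigenvalue of the tilted matrix (279)."""
    W_rho = W * np.exp(rho * g)  # W_rho(z|z') = W(z|z') e^{rho g(z,z')}
    return np.log(np.max(np.real(np.linalg.eigvals(W_rho))))

# E[g] as in (278)
E_g = sum(P[zp] * W[z, zp] * g[z, zp] for z in range(2) for zp in range(2))

h = 1e-5
print(cgf(0.0))                           # ~ 0, since W is stochastic
print((cgf(h) - cgf(-h)) / (2 * h), E_g)  # central difference at 0 matches E[g]

# Convexity (Lemma 25): second finite differences of phi are nonnegative
vals = np.array([cgf(r) for r in np.linspace(-1.0, 1.0, 41)])
assert np.all(np.diff(vals, 2) >= -1e-9)
```

The identity φ′(0) = E[g] is the standard first-derivative property of the tilted-matrix CGF, which is what makes φ useful for the deviation analyses in this paper.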
Lemma 26 Let v_ρ be the eigenvector of W_ρ^T with respect to the Perron–Frobenius eigenvalue λ_ρ such that min_z v_ρ(z) = 1, and let w_ρ(z) := P_{Z_1}(z) e^{ρ g(z)}. Then, we have (290).

From this, we can also find the relationship between the inverse functions (cf. (29) and (275)). Thus, the inverse function defined in (32) also satisfies the same relation. Then, we have

It should be noted that φ(ρ, ρ′) is a CGF for fixed ρ′, but φ(ρ, ρ) cannot be treated as a CGF.
Then, the variance defined in (51) satisfies

C. Proof of Lemma 2
We use the following lemma.
Statement 6 can be proved by modifying the proof of Statement 8 of Lemma 3 to the case of a transition matrix, in a manner similar to Statement 3 of the present lemma.

Fig. 3. A comparison of the bounds for p = 0.1, q = 0.2, and ε = 10⁻³. The horizontal axis is the block length n and the vertical axis is the rate R (nats). The red curve is the achievability bound in Theorem 7, the blue curve is the converse bound in Theorem 8, and the purple line is the entropy H^W(X).

Fig. 4. A comparison of the bounds for p = 0.1, q = 0.2, and n = 10000. The horizontal axis is −log₁₀(ε), and the vertical axis is the rate R (nats). The red curve is the achievability bound in Theorem 7, the blue curve is the converse bound in Theorem 8, and the purple line is the entropy H^W(X).
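The entropy H^W(X) shown as the purple line in Figs. 3 and 4 is the entropy rate of a two-state Markov chain. A minimal sketch, assuming (our reading of the plot parameters) that p = W(1|0) and q = W(0|1) are the two flip probabilities:

```python
import numpy as np

def binary_entropy(r):
    """Binary entropy in nats."""
    return -r * np.log(r) - (1 - r) * np.log(1 - r)

def markov_entropy_rate(p, q):
    """Entropy rate H^W(X) in nats of a two-state chain with W(1|0) = p, W(0|1) = q.

    The stationary distribution is (q/(p+q), p/(p+q)), and the entropy rate is the
    stationary average of the per-state transition entropies.
    """
    pi0 = q / (p + q)
    pi1 = p / (p + q)
    return pi0 * binary_entropy(p) + pi1 * binary_entropy(q)

print(markov_entropy_rate(0.1, 0.2))  # ~ 0.3835 nats for the parameters of Figs. 3-4
```

Both figures show the achievability and converse bounds converging toward this entropy rate as n grows (Fig. 3) or as ε varies (Fig. 4).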
Now, to define a conditional additive channel, we choose Y := R₊, and define the probability density function p_Y on Y with respect to the Lebesgue measure and the conditional distribution P_{X|Y}(x|y) for y ∈ R₊. When we define b := (−1)^x y ∈ R for x ∈ {0, 1} and y ∈ R₊, we have

p_{XY|A}(y, x|a) = (1/(√(2π) σ)) e^{−(y − (−1)^{a+x})² / (2σ²)}.
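The conditional additivity of this Gaussian example can be checked numerically: the density above depends on a and x only through a + x mod 2, so the pair (x ⊕ a, y) is distributed independently of the input a. A small sketch (σ and the evaluation grid are illustrative choices):

```python
import numpy as np

sigma = 0.7  # illustrative noise standard deviation

def p_xy_given_a(y, x, a):
    """Density p_{XY|A}(y, x|a) = (1/(sqrt(2*pi)*sigma)) exp(-(y-(-1)^{a+x})^2/(2*sigma^2))."""
    return (np.exp(-(y - (-1.0) ** (a + x)) ** 2 / (2 * sigma ** 2))
            / (np.sqrt(2 * np.pi) * sigma))

ys = np.linspace(0.0, 5.0, 501)  # grid on Y = R_+
for x in (0, 1):
    # Relabeling the noise bit as X xor A makes the density input-independent:
    d0 = p_xy_given_a(ys, x ^ 0, 0)  # input a = 0
    d1 = p_xy_given_a(ys, x ^ 1, 1)  # input a = 1
    assert np.allclose(d0, d1)
print("conditional additivity verified on the grid")
```

This is exactly the reduction used throughout this section: the channel output is decomposed into a magnitude y and a sign bit, and the sign bit acts additively on the input.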

Theorem 19 (Direct, Ass. 2) Suppose that the transition matrix W of the conditional additive noise satisfies Assumption 2. Let R := ((n−k)/n) log |A|. Then we have

TABLE I: SUMMARY OF ASYMPTOTIC RESULTS AND FINITE-LENGTH BOUNDS TO DERIVE ASYMPTOTIC RESULTS

1) For fixed Q_Y, θH_{1+θ}(P_{XY}|Q_Y) is a concave function of θ, and it is strictly concave iff Var[log(Q_Y(Y)/P_{XY}(X, Y))] > 0.
2) For fixed Q_Y, H_{1+θ}(P_{XY}|Q_Y) is a monotonically decreasing function of θ.
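These two properties can be verified numerically, assuming (our reading of the definition) the conditional Rényi entropy H_{1+θ}(P_{XY}|Q_Y) := −(1/θ) log Σ_{x,y} P_{XY}(x,y)^{1+θ} Q_Y(y)^{−θ}; then θH_{1+θ} is minus a CGF, hence concave. The distributions below are illustrative.

```python
import numpy as np

# Illustrative joint distribution P_XY[x, y] and marginal-like Q_Y[y]
P_XY = np.array([[0.3, 0.2],
                 [0.1, 0.4]])
Q_Y = np.array([0.5, 0.5])

def theta_H(theta):
    """theta * H_{1+theta}(P_XY | Q_Y) = -log sum P_XY^{1+theta} Q_Y^{-theta}
    (assumed definition)."""
    return -np.log(np.sum(P_XY ** (1 + theta) * Q_Y[np.newaxis, :] ** (-theta)))

# Property 1: theta * H_{1+theta} is concave in theta (second differences <= 0)
thetas = np.linspace(-0.5, 2.0, 51)
vals = np.array([theta_H(t) for t in thetas])
assert np.all(np.diff(vals, 2) <= 1e-9)

# Property 2: H_{1+theta} itself is monotonically decreasing in theta
Hs = [theta_H(t) / t for t in np.linspace(0.1, 2.0, 20)]
assert all(Hs[i] >= Hs[i + 1] - 1e-12 for i in range(len(Hs) - 1))
print("concavity and monotonicity hold for this example")
```

Here Var[log(Q_Y(Y)/P_{XY}(X, Y))] > 0, so the concavity is strict, consistent with the lemma.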
for sufficiently large θ, which contradicts the fact that (40) does not depend on x′. Thus, the largest values of {W(x|x′, y′)}_{x∈X} and {W(x|x̃′, y′)}_{x∈X} must coincide. By repeating this argument for the second largest values of {W(x|x′, y′)}_{x∈X} and {W(x|x̃′, y′)}_{x∈X}, and so on, we find that Assumption 2 implies that for every x′ ≠ x̃′, there exists a permutation π_x

TABLE II: SUMMARY OF THE BOUNDS FOR SOURCE CODING WITH FULL SIDE-INFORMATION.

TABLE III: SUMMARY OF THE FINITE-LENGTH BOUNDS FOR CHANNEL CODING.
By applying the central limit theorem (cf. [65, Theorem 27.4, Example 27.6]) to Lemma 19 and Lemma 24 for Q_{Y^n} = P_{Y^n}, and by using Theorem 2, we have the following.