Article

Information Rates for Channels with Fading, Side Information and Adaptive Codewords

School of Computation, Information and Technology, Technical University of Munich (TUM), 80333 Munich, Germany
Entropy 2023, 25(5), 728; https://doi.org/10.3390/e25050728
Submission received: 3 March 2023 / Revised: 26 March 2023 / Accepted: 22 April 2023 / Published: 27 April 2023
(This article belongs to the Special Issue Wireless Networks: Information Theoretic Perspectives III)

Abstract
Generalized mutual information (GMI) is used to compute achievable rates for fading channels with various types of channel state information at the transmitter (CSIT) and receiver (CSIR). The GMI is based on variations of auxiliary channel models with additive white Gaussian noise (AWGN) and circularly-symmetric complex Gaussian inputs. One variation uses reverse channel models with minimum mean square error (MMSE) estimates that give the largest rates but are challenging to optimize. A second variation uses forward channel models with linear MMSE estimates that are easier to optimize. Both model classes are applied to channels where the receiver is unaware of the CSIT and for which adaptive codewords achieve capacity. The forward model inputs are chosen as linear functions of the adaptive codeword’s entries to simplify the analysis. For scalar channels, the maximum GMI is then achieved by a conventional codebook, where the amplitude and phase of each channel symbol are modified based on the CSIT. The GMI increases by partitioning the channel output alphabet and using a different auxiliary model for each partition subset. The partitioning also helps to determine the capacity scaling at high and low signal-to-noise ratios. A class of power control policies is described for partial CSIR, including an MMSE policy for full CSIT. Several examples of fading channels with AWGN illustrate the theory, focusing on on-off fading and Rayleigh fading. The capacity results generalize to block fading channels with in-block feedback, including capacity expressions in terms of mutual and directed information.

1. Introduction

The capacity of fading channels is a topic of interest in wireless communications [1,2,3,4]. Fading refers to model variations over time, frequency, and space. A common approach to track fading is to insert pilot symbols into transmit symbol strings, have receivers estimate fading parameters via the pilot symbols, and have the receivers share their estimated channel state information (CSI) with the transmitters. The CSI available at the receiver (CSIR) and transmitter (CSIT) may be different and imperfect.
Information-theoretic studies on fading channels distinguish between average (ergodic) and outage capacity, causal and non-causal CSI, symbol and rate-limited CSI, and different qualities of CSIR and CSIT that are coarsely categorized as no, perfect, or partial. We refer to [5] for a review of the literature up to 2008. We here focus exclusively on average capacity and causal CSIT as introduced in [6]. Codes for such CSIT, or more generally for noisy feedback [7], are based on Shannon strategies, also called codetrees ([8], Chapter 9.4), or adaptive codewords ([9], Section 4.1). (The term “adaptive codeword” was suggested to the author by J. L. Massey.) Adaptive codewords are usually implemented by a conventional codebook and by modifying the codeword symbols as a function of the CSIT. This approach is optimal for some channels [10] and will be our main interest.

1.1. Block Fading

A model that accounts for the different time scales of data transmission (e.g., nanoseconds) and channel variations (e.g., milliseconds) is block fading [11,12]. Such fading has the channel parameters constant within blocks of L symbols and varying across blocks. A basic setup is as follows.
  • The fading is described by a state process $S_{H1}, S_{H2}, \dots$, independent of the transmitter messages and channel noise. The subscript “H” emphasizes that the states $S_{Hi}$ may be hidden from the transceivers.
  • Each receiver sees a state process $S_{R1}, S_{R2}, \dots$ where $S_{Ri}$ is a noisy function of $S_{Hi}$ for all $i$.
  • Each transmitter sees a state process $S_{T1}, S_{T2}, \dots$ where $S_{Ti}$ is a noisy function of $S_{Hi}$ for all $i$.
The state processes may be modeled as memoryless [11,12] or governed by a Markov chain [13,14,15,16,17,18,19,20,21]. The memoryless models are particular cases of Shannon’s model [6]. For scalar channels, $S_{Hi}$ is usually a complex number $H_i$. Similarly, for vector or multi-input, multi-output (MIMO) channels with $M$- and $N$-dimensional inputs and outputs, respectively, $S_{Hi}$ is an $N \times M$ matrix $\mathbf{H}_i$.
Consider, for example, a point-to-point channel with block-fading and complex-alphabet inputs $X_{i\ell}$ and outputs
$$Y_{i\ell} = H_i X_{i\ell} + Z_{i\ell}$$
where the index $i$, $i = 1, \dots, n$, enumerates the blocks and the index $\ell$, $\ell = 1, \dots, L$, enumerates the symbols of each block. The additive white Gaussian noise (AWGN) $Z_{11}, Z_{12}, \dots$ is a sequence of independent and identically distributed (i.i.d.) random variables that have a common circularly-symmetric complex Gaussian (CSCG) distribution.

1.2. CSI and In-Block Feedback

The motivation for modeling CSI as independent of the messages is simplicity. If one uses only pilot symbols to estimate the $H_i$ in (1), for example, then the independence is valid, and the capacity analysis may be tractable. However, to improve performance, one can implement data and parameter estimation jointly, and one can actively adjust the transmit symbols $X_{i\ell}$ using past received symbols $Y_{ik}$, $k = 1, \dots, \ell-1$, if in-block feedback is available. (Across-block feedback does not increase capacity if the state processes are memoryless; see ([22], Remark 16).) An information theory for such feedback was developed in [22], where a challenge is that code design is based on adaptive codewords that are more sophisticated than conventional codewords.
For example, suppose the CSIR is $S_{Ri} = H_i$. Then, one might expect that CSCG signaling is optimal, and the capacity is an average of $\log(1 + \mathrm{SNR})$ terms, where SNR is a signal-to-noise ratio. However, this simplification is based on constraints, e.g., that the CSIT is a function of the CSIR and that the $X_{i\ell}$ cannot influence the CSIT. The former constraint can be realistic, e.g., if the receiver quantizes a pilot-based estimate of $H_i$ and sends the quantization bits to the transmitter via a low-latency and reliable feedback link. On the other hand, the latter constraint is unrealistic in general.

1.3. Auxiliary Models

This paper’s primary motivation is to further develop information theory for adaptive codewords. To gain insight, it is helpful to have achievable rates with log ( 1 + SNR ) terms. A common approach to obtain such expressions is to lower bound the channel mutual information I ( X ; Y ) as follows.
Suppose X is continuous and consider two conditional densities: the density p ( x | y ) and an auxiliary density q ( x | y ) . We will refer to such densities as reverse models; similarly, p ( y | x ) and q ( y | x ) are called forward models. One may write the differential entropy of X given Y as
$$h(X|Y) = \mathrm{E}\left[-\log p(X|Y)\right] = \underbrace{\mathrm{E}\left[-\log q(X|Y)\right]}_{\text{average cross-entropy}} - \underbrace{\mathrm{E}\left[\log \frac{p(X|Y)}{q(X|Y)}\right]}_{\text{average divergence}\ \ge\ 0}$$
where the first expectation in (2) is an average cross-entropy, and the second is an average informational divergence, which is non-negative. Several criteria affect the choice of q ( x | y ) : the cross-entropy should be simple enough to admit theoretical or numerical analysis, e.g., by Monte Carlo simulation; the cross-entropy should be close to h ( X | Y ) ; and the cross-entropy should suggest suitable transmitter and receiver structures.
We illustrate how reverse and forward auxiliary models have been applied to bound mutual information. Assume that E X = E Y = 0 for simplicity.
Reverse Model: Consider the reverse density that models X , Y as jointly CSCG:
$$q(x|y) = \frac{1}{\pi \sigma_L^2} \exp\left(-|x - \hat{x}_L|^2/\sigma_L^2\right)$$
where $\hat{X}_L = \left(\mathrm{E}[XY^*]/\mathrm{E}[|Y|^2]\right) Y$ and
$$\sigma_L^2 = \mathrm{E}\left[|X - \hat{X}_L|^2\right] = \mathrm{E}[|X|^2] - \frac{\left|\mathrm{E}[XY^*]\right|^2}{\mathrm{E}[|Y|^2]}$$
is the mean square error (MSE) of the estimate X ^ L . In fact, X ^ L is the linear estimate with the minimum MSE (MMSE), and σ L 2 is the linear MMSE (LMMSE) which is independent of Y = y ; see Section 2.5. The bound in (2) gives
$$h(X|Y) \le \log\left(\pi e\, \sigma_L^2\right).$$
Thus, if $X$ is CSCG, then we have the desired form
$$I(X;Y) = h(X) - h(X|Y) \ge \log\left(1 + \frac{|h|^2\, \mathrm{E}[|X|^2]}{\sigma^2}\right)$$
where the parameters $h$ and $\sigma^2$ are
$$h = \frac{\mathrm{E}[YX^*]}{\mathrm{E}[|X|^2]}, \quad \sigma^2 = \mathrm{E}\left[|Y - hX|^2\right].$$
The bound (6) is apparently due to Pinsker [23,24,25] and is widely used in the literature; see e.g., [18,26,27,28,29,30,31,32,33,34,35,36,37,38]. The bound is usually related to channels p ( y | x ) with additive noise but (2)–(6) show that it applies generally. The extension to vector channels is given in Section 2.7 below.
Forward Model: A more flexible approach is to choose the reverse density as
$$q(x|y) = \frac{p(x)\, q(y|x)^s}{q(y)}$$
where $q(y|x)$ is a forward auxiliary model (not necessarily a density), $s \ge 0$ is a parameter to be optimized, and
$$q(y) = \int_{\mathbb{C}} p(x)\, q(y|x)^s\, dx.$$
Inserting (8) into (2) we compute
$$I(X;Y) \ge \max_{s \ge 0}\ \mathrm{E}\left[\log \frac{q(Y|X)^s}{q(Y)}\right].$$
The right-hand side (RHS) of (10) is called a generalized mutual information (GMI) [39,40] and has been applied to problems in information theory [41], wireless communications [42,43,44,45,46,47,48,49,50,51], and fiber-optic communications [52,53,54,55,56,57,58,59,60,61]. For example, the bounds (6) and (10) are the same if s = 1 and
$$q(y|x) = \exp\left(-|y - hx|^2/\sigma^2\right)$$
where $h$ and $\sigma^2$ are given by (7). Note that (11) is not a density unless $\sigma^2 = 1/\pi$, but $q(x|y)$ is a density. (We require $q(x|y)$ to be a density to apply the divergence bound in (2).)
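For concreteness, here is a short check of this claim (a calculation added for illustration; it is consistent with (45) and (49) below). For CSCG $X$ with $\mathrm{E}[|X|^2] = P$, inserting (11) into (9) with $s = 1$ gives
$$q(y) = \frac{\sigma^2}{\sigma^2 + |h|^2 P} \exp\left(-\frac{|y|^2}{\sigma^2 + |h|^2 P}\right)$$
so that
$$\mathrm{E}\left[\log \frac{q(Y|X)}{q(Y)}\right] = \log\left(1 + \frac{|h|^2 P}{\sigma^2}\right) + \frac{\mathrm{E}[|Y|^2]}{\sigma^2 + |h|^2 P} - \frac{\mathrm{E}\left[|Y - hX|^2\right]}{\sigma^2} = \log\left(1 + \frac{|h|^2 P}{\sigma^2}\right),$$
where the last step uses $\mathrm{E}[|Y - hX|^2] = \sigma^2$ and $\mathrm{E}[|Y|^2] = \sigma^2 + |h|^2 P$, both of which follow from (7).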
We compare the two approaches. The bound (5) is simple to apply and works well since the choices (7) give the maximal GMI for CSCG X; see Proposition 1 below. However, there are limitations: one must use continuous X, the auxiliary model q ( y | x ) is fixed as (11), and the bound does not show how to design the receiver. Instead, the GMI applies to continuous/discrete/mixed X and has an operational interpretation: the receiver uses q ( y | x ) rather than p ( y | x ) to decode. The framework of such mismatched receivers appeared in ([62], Exercise 5.22); see also [63].

1.4. Refined Auxiliary Models

The two approaches above can be refined in several ways, and we review selected variations in the literature.
Reverse Models: The model $q(x|y)$ can be different for each $Y = y$, e.g., one may choose $X$ as Gaussian with mean $\mathrm{E}[X|Y=y]$ and variance
$$\mathrm{Var}[X|Y=y] = \mathrm{E}\left[|X|^2 \,\middle|\, Y=y\right] - \left|\mathrm{E}[X|Y=y]\right|^2$$
and where
$$q(x|y) = \frac{1}{\pi\, \mathrm{Var}[X|Y=y]} \exp\left(-\frac{\left|x - \mathrm{E}[X|Y=y]\right|^2}{\mathrm{Var}[X|Y=y]}\right).$$
Inserting (13) in (2) we have the bound
$$h(X|Y) \le \mathrm{E}\left[\log\left(\pi e\, \mathrm{Var}[X|Y]\right)\right]$$
which improves (5) in general, since $\mathrm{Var}[X|Y=y]$ is the MMSE of $X$ given the event $Y=y$. In other words, we have $\mathrm{Var}[X|Y=y] \le \sigma_L^2$ for all $Y=y$ and the following bound improves (6) for CSCG $X$:
$$I(X;Y) \ge \mathrm{E}\left[\log \frac{\mathrm{E}[|X|^2]}{\mathrm{Var}[X|Y]}\right].$$
In fact, the bound (15) was derived in ([50], Section III.B) by optimizing the GMI in (10) over all forward models of the form
$$q(y|x) = \exp\left(-\left|\tilde{g}_y - \tilde{f}_y\, x\right|^2\right)$$
where f ˜ y , g ˜ y depend on y; see also [47,48,49]. We provide a simple proof. By inserting (16) into (8) and (9), absorbing the s parameter in f ˜ y and g ˜ y , and completing squares, one can equivalently optimize over all reverse densities of the form
$$q(x|y) = \exp\left(-\left|g_y - f_y\, x\right|^2 + h_y\right)$$
where $|f_y|^2 = \pi e^{h_y}$ so that $q(x|y)$ is a density. We next bound the cross-entropy as
$$\mathrm{E}\left[-\log q(X|Y=y)\right] = \mathrm{E}\left[\left|g_y/f_y - X\right|^2\right] |f_y|^2 - h_y \ge \mathrm{Var}[X|Y=y]\, \pi e^{h_y} - h_y$$
with equality if $g_y/f_y = \mathrm{E}[X|Y=y]$; see Section 2.5. The RHS of (18) is minimized by $\mathrm{Var}[X|Y=y]\, \pi e^{h_y} = 1$, so the best choice for $f_y, g_y, h_y$ gives the bound (14).
Remark 1.
The model (16) uses generalized nearest-neighbor decoding, improving the rules proposed in [42,43,44]. The authors of [50] pointed out that (6) and (15) use the LMMSE and MMSE, respectively; see ([50], Equation (87)).
Remark 2.
A corresponding forward model can be based on (8) and (13), namely
$$q(y|x)^s = \frac{q(x|y)}{p(x)}, \quad q(y) = 1.$$
Remark 3.
The RHS of (15) has a more complicated form than the RHS of (6) due to the outer expectation and conditional variance, and this makes optimizing X challenging when there is CSIR and CSIT. Also, if p ( y | x ) is known, then it seems sensible to numerically compute p ( y ) and I ( X ; Y ) directly, e.g., via Monte Carlo or numerical integration.
Remark 4.
Decoding rules for discrete X can be based on decision theory as well as estimation theory; see ([64], Equation (11)).
Forward Models: Refinements of (11) appear in the optical fiber literature where the non-linear Schrödinger equation describes wave propagation [52]. Such channels exhibit complicated interactions of attenuation, dispersion, nonlinearity, and noise, and the channel density is too challenging to compute. One thus resorts to capacity lower bounds based on GMI and Monte Carlo simulation. The simplest models are memoryless, and they work well if chosen carefully. For example, the paper [52] used auxiliary models of the form
$$q(y|x) = \exp\left(-|y - hx|^2/\sigma_{|x|}^2\right)$$
where $h$ accounts for attenuation and self-phase modulation, and where the noise variance $\sigma_{|x|}^2$ depends on $|x|$. Also, $X$ was chosen to have concentric rings rather than a CSCG density. Subsequent papers applied progressively more sophisticated models with memory to better approximate the actual channel; see [53,54,55,56,57,58,59]. However, the rate gains over the model (20) are minor (≈12%) for 1000 km links, and the newer models do not suggest practical receiver structures.
A related application is short-reach fiber-optic systems that use direct detection (DD) receivers [65] with photodiodes. The paper [60] showed that sampling faster than the symbol rate increases the DD capacity. However, spectrally efficient filtering gives the channel a long memory, motivating auxiliary models q ( y | x ) with reduced memory to simplify GMI computations [61,66]. More generally, one may use channel-shortening filters [67,68,69] to increase the GMI.
Remark 5.
The ultimate GMI is I ( X ; Y ) , and one can compute this quantity numerically for the channels considered in this paper. We are motivated to focus on forward auxiliary models q ( y | x ) to understand how to improve information rates for more complex channels. For instance, simple q ( y | x ) let one understand properties of optimal codes, see Lemma 3, and they suggest explicit power control policies, see Theorem 2.
Remark 6.
The paper [37] (see also ([2], Equation (3.3.45)) and ([70], Equation (6))) derives two capacity lower bounds for massive MIMO channels. These bounds are designed for problems where the fading parameters have small variance so that, in effect, σ 2 in (7) is small. We will instead encounter cases where σ 2 grows in proportion to E | X | 2 and the RHS of (6) quickly saturates as E | X | 2 grows; see Remark 20.

1.5. Organization

This paper is organized as follows. Section 2 defines notation and reviews basic results. Section 3 develops two results for the GMI of scalar auxiliary models with AWGN:
  • Proposition 1 in Section 3.1 states a known result, namely that the RHS of (6) is the maximum GMI for the AWGN auxiliary model (11) and a CSCG X.
  • Lemma 1 in Section 3.2 generalizes Proposition 1 by partitioning the channel output alphabet into K subsets, K 1 . We use K = 2 to establish capacity properties at high and low SNR.
Section 4 and Section 5 apply the GMI to channels with CSIT and CSIR.
  • Section 4.3 treats adaptive codewords and develops structural properties of their optimal distribution.
  • Lemma 2 in Section 4.4 generalizes Proposition 1 to MIMO channels and adaptive codewords. The receiver models each transmit symbol as a weighted sum of the entries of the corresponding adaptive symbol.
  • Lemma 3 in Section 4.5 states that the maximum GMI for scalar channels, an AWGN auxiliary model, adaptive codewords with jointly CSCG entries, and K = 1 is achieved by using a conventional codebook where each symbol is modified based on the CSIT.
  • Lemma 4 in Section 4.6 extends Lemma 3 to MIMO channels, including diagonal or parallel channels.
  • Theorem 1 in Section 5.1 generalizes Lemma 3 to include CSIR; we use this result several times in Section 6.
  • Lemma 5 in Section 5.3 generalizes Lemmas 1 and 2 by partitioning the channel output alphabet.
Section 6, Section 7 and Section 8 apply the GMI to fading channels with AWGN and illustrate the theory for on-off and Rayleigh fading.
  • Lemma 6 in Section 6 gives a general capacity upper bound.
  • Section 6.5 introduces a class of power control policies for full CSIT. Theorem 2 develops the optimal policy with an MMSE form.
  • Theorem 3 in Section 6.6 provides a quadratic waterfilling expression for the GMI with partial CSIR.
Section 9 develops theory for block fading channels with in-block feedback (or in-block CSIT) that is a function of the CSIR and past channel inputs and outputs.
  • Theorem 4 in Section 9.2 generalizes Lemma 4 to MIMO block fading channels;
  • Section 9.3 develops capacity expressions in terms of directed information;
  • Section 9.4 specializes the capacity to fading channels with AWGN and delayed CSIR;
  • Proposition 3 generalizes Proposition 2 to channels with special CSIR and CSIT.
Section 10 concludes the paper. Finally, Appendix A, Appendix B, Appendix C, Appendix D, Appendix E, Appendix F and Appendix G provide results on special functions, GMI calculations, and proofs.

2. Preliminaries

2.1. Basic Notation

Let $1(\cdot)$ be the indicator function that takes on the value 1 if its argument is true and 0 otherwise. Let $\delta(\cdot)$ be the Dirac generalized function with $\int_{\mathcal{X}} \delta(x) f(x)\, dx = f(0) \cdot 1(0 \in \mathcal{X})$. For $x \in \mathbb{R}$, define $(x)^+ = \max(0, x)$. The complex-conjugate, absolute value, and phase of $x \in \mathbb{C}$ are written as $x^*$, $|x|$, and $\arg(x)$, respectively. We write $j = \sqrt{-1}$ and $\bar{\epsilon} = 1 - \epsilon$.
Sets are written with calligraphic font, e.g., S = { 1 , , n } and the cardinality of S is | S | . The complement of S in T is S c where T is understood from the context.

2.2. Vectors and Matrices

Column vectors are written as $\underline{x} = [x_1, \dots, x_M]^T$ where $M$ is the dimension, and $T$ denotes transposition. The complex-conjugate transpose (or Hermitian) of $\underline{x}$ is written as $\underline{x}^\dagger$. The Euclidean norm of $\underline{x}$ is $\|\underline{x}\|$. Matrices are written with bold letters such as $\mathbf{A}$. The letter $\mathbf{I}$ denotes the identity matrix. The determinant and trace of a square matrix $\mathbf{A}$ are written as $\det \mathbf{A}$ and $\mathrm{tr}\, \mathbf{A}$, respectively.
A singular value decomposition (SVD) is $\mathbf{A} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^\dagger$ where $\mathbf{U}$ and $\mathbf{V}$ are unitary matrices and $\boldsymbol{\Sigma}$ is a rectangular diagonal matrix with the singular values of $\mathbf{A}$ on the diagonal. The square matrix $\mathbf{A}$ is positive semi-definite if $\underline{x}^\dagger \mathbf{A} \underline{x} \ge 0$ for all $\underline{x}$. The notation $\mathbf{A} \preceq \mathbf{B}$ means that $\mathbf{B} - \mathbf{A}$ is positive semi-definite. Similarly, $\mathbf{A}$ is positive definite if $\underline{x}^\dagger \mathbf{A} \underline{x} > 0$ for all $\underline{x}$, and we write $\mathbf{A} \prec \mathbf{B}$ if $\mathbf{B} - \mathbf{A}$ is positive definite.

2.3. Random Variables

Random variables are written with uppercase letters, such as X, and their realizations with lowercase letters, such as x. We write the distribution of discrete X with alphabet X = { 0 , , n 1 } as P X = [ P X ( 0 ) , , P X ( n 1 ) ] . The density of a real- or complex-valued X is written as p X . Mixed discrete-continuous distributions are written using mixtures of densities and Dirac- δ functions.
Conditional distributions and densities are written as P X | Y and p X | Y , respectively. We usually drop subscripts if the argument is a lowercase version of the random variable, e.g., we write p ( y | x ) for p Y | X ( y | x ) . One exception is that we consistently write the distributions P S R ( . ) and P S T ( . ) of the CSIR and CSIT with the subscript to avoid confusion with power notation.

2.4. Second-Order Statistics

The expectation and variance of the complex-valued random variable X are E X and Var X = E | X E X | 2 , respectively. The correlation coefficient of X 1 and X 2 is ρ = E U 1 U 2 * where
$$U_i = \left(X_i - \mathrm{E}[X_i]\right)\big/\sqrt{\mathrm{Var}[X_i]}$$
for i = 1 , 2 . We say that X 1 and X 2 are fully correlated if ρ = e j ϕ for some real ϕ . Conditional expectation and variance are written as E X | A = a and
$$\mathrm{Var}\left[X \mid A = a\right] = \mathrm{E}\left[\left(X - \mathrm{E}[X \mid A=a]\right)\left(X - \mathrm{E}[X \mid A=a]\right)^* \,\middle|\, A = a\right].$$
The expressions E X | A , Var X | A are random variables that take on the values E X | A = a , Var X | A = a if A = a .
The expectation and covariance matrix of the random column vector $\underline{X} = [X_1, \dots, X_M]^T$ are $\mathrm{E}[\underline{X}]$ and $\mathbf{Q}_{\underline{X}} = \mathrm{E}\left[(\underline{X} - \mathrm{E}[\underline{X}])(\underline{X} - \mathrm{E}[\underline{X}])^\dagger\right]$, respectively. We write $\mathbf{Q}_{\underline{X},\underline{Y}}$ for the covariance matrix of the stacked vector $[\underline{X}^T\ \underline{Y}^T]^T$. We write $\mathbf{Q}_{\underline{X}|\underline{Y}=\underline{y}}$ for the covariance matrix of $\underline{X}$ conditioned on the event $\underline{Y} = \underline{y}$. $\mathbf{Q}_{\underline{X}|\underline{Y}}$ is a random matrix that takes on the matrix value $\mathbf{Q}_{\underline{X}|\underline{Y}=\underline{y}}$ when $\underline{Y} = \underline{y}$.
We often consider CSCG random variables and vectors. A CSCG X ̲ has density
$$p(\underline{x}) = \frac{\exp\left(-\underline{x}^\dagger\, \mathbf{Q}_{\underline{X}}^{-1}\, \underline{x}\right)}{\pi^M \det \mathbf{Q}_{\underline{X}}}$$
and we write $\underline{X} \sim \mathcal{CN}(\underline{0}, \mathbf{Q}_{\underline{X}})$.

2.5. MMSE and LMMSE Estimation

Assume that $\mathrm{E}[\underline{X}] = \mathrm{E}[\underline{Y}] = \underline{0}$. The MMSE estimate of $\underline{X}$ given the event $\underline{Y} = \underline{y}$ is the vector $\hat{\underline{X}}(\underline{y})$ that minimizes
$$\mathrm{E}\left[\left\|\underline{X} - \hat{\underline{X}}(\underline{y})\right\|^2 \,\middle|\, \underline{Y} = \underline{y}\right].$$
Direct analysis gives ([71], Chapter 4)
$$\hat{\underline{X}}(\underline{y}) = \mathrm{E}\left[\underline{X} \,\middle|\, \underline{Y} = \underline{y}\right]$$
$$\mathrm{E}\left[\|\underline{X} - \hat{\underline{X}}\|^2\right] = \mathrm{E}\left[\|\underline{X}\|^2\right] - \mathrm{E}\left[\|\hat{\underline{X}}\|^2\right]$$
$$\mathbf{Q}_{\underline{X} - \hat{\underline{X}}} = \mathbf{Q}_{\underline{X}} - \mathbf{Q}_{\hat{\underline{X}}}$$
$$\mathrm{E}\left[(\underline{X} - \hat{\underline{X}})\, \underline{Y}^\dagger\right] = \mathbf{0}$$
where the last identity is called the orthogonality principle.
The LMMSE estimate of $\underline{X}$ given $\underline{Y}$ with invertible $\mathbf{Q}_{\underline{Y}}$ is the vector $\hat{\underline{X}}_L = \mathbf{C}\underline{Y}$ where $\mathbf{C}$ is chosen to minimize $\mathrm{E}\left[\|\underline{X} - \hat{\underline{X}}_L\|^2\right]$. We compute
$$\hat{\underline{X}}_L = \mathrm{E}\left[\underline{X}\, \underline{Y}^\dagger\right] \mathbf{Q}_{\underline{Y}}^{-1}\, \underline{Y}$$
and we also have the properties (22)–(24) with $\hat{\underline{X}}$ replaced by $\hat{\underline{X}}_L$. Moreover, if $\underline{X}$ and $\underline{Y}$ are jointly CSCG, then the MMSE and LMMSE estimators coincide, and the orthogonality principle (24) implies that the error $\underline{X} - \hat{\underline{X}}$ is independent of $\underline{Y}$, i.e., we have
$$\mathrm{E}\left[(\underline{X} - \hat{\underline{X}})(\underline{X} - \hat{\underline{X}})^\dagger \,\middle|\, \underline{Y} = \underline{y}\right] = \mathrm{E}\left[\underline{X}\,\underline{X}^\dagger \,\middle|\, \underline{Y} = \underline{y}\right] - \mathrm{E}\left[\underline{X}\,\underline{Y}^\dagger\right] \mathbf{Q}_{\underline{Y}}^{-1}\, \underline{y}\,\underline{y}^\dagger\, \mathbf{Q}_{\underline{Y}}^{-1}\, \mathrm{E}\left[\underline{Y}\,\underline{X}^\dagger\right] = \mathbf{Q}_{\underline{X}} - \mathbf{Q}_{\hat{\underline{X}}}.$$

2.6. Entropy, Divergence, and Information

Entropies of random vectors with densities p are written as
$$h(\underline{X}) = \mathrm{E}\left[-\log p(\underline{X})\right], \quad h(\underline{X}|\underline{Y}) = \mathrm{E}\left[-\log p(\underline{X}|\underline{Y})\right]$$
where we use logarithms to the base e for analysis. The informational divergence of the densities p and q is
$$D(p\,\|\,q) = \mathrm{E}\left[\log \frac{p(\underline{X})}{q(\underline{X})}\right]$$
and $D(p\,\|\,q) \ge 0$ with equality if and only if $p = q$ almost everywhere. The mutual information of $\underline{X}$ and $\underline{Y}$ is
$$I(\underline{X};\underline{Y}) = D\left(p_{\underline{X},\underline{Y}} \,\big\|\, p_{\underline{X}}\, p_{\underline{Y}}\right) = \mathrm{E}\left[\log \frac{p(\underline{Y}|\underline{X})}{p(\underline{Y})}\right].$$
The average mutual information of $\underline{X}$ and $\underline{Y}$ conditioned on $\underline{Z}$ is $I(\underline{X};\underline{Y}|\underline{Z})$. We write strings as $X^L = (X_1, X_2, \dots, X_L)$ and use the directed information notation (see [9,72])
$$I(X^L \to Y^L \mid Z) = \sum_{\ell=1}^{L} I\left(X^\ell; Y_\ell \,\middle|\, Y^{\ell-1}, Z\right)$$
$$I(X^L \to Y^L \,\|\, Z^L \mid W) = \sum_{\ell=1}^{L} I\left(X^\ell; Y_\ell \,\middle|\, Y^{\ell-1}, Z^\ell, W\right)$$
where $Y^0 = 0$.

2.7. Entropy and Information Bounds

The expression (2) applies to random vectors. Choosing $q(\underline{x}|\underline{y})$ as the conditional density where the $\underline{X}, \underline{Y}$ are modeled as jointly CSCG we obtain a generalization of (5):
$$h(\underline{X}|\underline{Y}) \le \log \frac{\det\left(\pi e\, \mathbf{Q}_{\underline{X},\underline{Y}}\right)}{\det\left(\pi e\, \mathbf{Q}_{\underline{Y}}\right)} = \log \det\left(\pi e \left[\mathbf{Q}_{\underline{X}} - \mathrm{E}\left[\underline{X}\,\underline{Y}^\dagger\right] \mathbf{Q}_{\underline{Y}}^{-1}\, \mathrm{E}\left[\underline{Y}\,\underline{X}^\dagger\right]\right]\right).$$
The vector generalization of (6) for CSCG $\underline{X}$ is
$$I(\underline{X};\underline{Y}) = h(\underline{X}) - h(\underline{X}|\underline{Y}) \ge \log \det\left(\left[\mathbf{Q}_{\underline{X}} - \mathrm{E}\left[\underline{X}\,\underline{Y}^\dagger\right] \mathbf{Q}_{\underline{Y}}^{-1}\, \mathrm{E}\left[\underline{Y}\,\underline{X}^\dagger\right]\right]^{-1} \mathbf{Q}_{\underline{X}}\right) \overset{(a)}{=} \log \det\left(\mathbf{I} + \mathbf{Q}_{\underline{Z}}^{-1}\, \mathbf{H}\, \mathbf{Q}_{\underline{X}}\, \mathbf{H}^\dagger\right)$$
where (cf. (7))
$$\mathbf{H} = \mathrm{E}\left[\underline{Y}\,\underline{X}^\dagger\right] \mathbf{Q}_{\underline{X}}^{-1}, \quad \mathbf{Q}_{\underline{Z}} = \mathbf{Q}_{\underline{Y}} - \mathbf{H}\, \mathbf{Q}_{\underline{X}}\, \mathbf{H}^\dagger$$
and step $(a)$ in (30) follows by the Woodbury identity
$$\left(\mathbf{A} + \mathbf{B}\mathbf{C}\mathbf{D}\right)^{-1} = \mathbf{A}^{-1} - \mathbf{A}^{-1}\mathbf{B}\left(\mathbf{C}^{-1} + \mathbf{D}\mathbf{A}^{-1}\mathbf{B}\right)^{-1}\mathbf{D}\mathbf{A}^{-1}$$
and the Sylvester identity
$$\det\left(\mathbf{I} + \mathbf{A}\mathbf{B}\right) = \det\left(\mathbf{I} + \mathbf{B}\mathbf{A}\right).$$
We also have vector generalizations of (14) and (15):
$$h(\underline{X}|\underline{Y}) \le \mathrm{E}\left[\log \det\left(\pi e\, \mathbf{Q}_{\underline{X}|\underline{Y}}\right)\right]$$
$$I(\underline{X};\underline{Y}) \ge \mathrm{E}\left[\log \frac{\det \mathbf{Q}_{\underline{X}}}{\det \mathbf{Q}_{\underline{X}|\underline{Y}}}\right] \quad \text{for CSCG } \underline{X}.$$
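The following short numeric sketch (added for illustration, not from the paper) checks step $(a)$ of (30) for a randomly generated joint covariance of a CSCG pair $(\underline{X}, \underline{Y})$.

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 2, 3

# Random valid covariance of the stacked vector [X; Y] (an assumption for illustration).
A = rng.standard_normal((M + N, M + N)) + 1j * rng.standard_normal((M + N, M + N))
Q = A @ A.conj().T
Q_X, Q_Y = Q[:M, :M], Q[M:, M:]
R_XY = Q[:M, M:]                                      # E[X Y^dagger]

H   = R_XY.conj().T @ np.linalg.inv(Q_X)              # H = E[Y X^dagger] Q_X^{-1}
Q_Z = Q_Y - H @ Q_X @ H.conj().T                      # Q_Z = Q_Y - H Q_X H^dagger

lhs = np.linalg.inv(Q_X - R_XY @ np.linalg.solve(Q_Y, R_XY.conj().T)) @ Q_X
rhs = np.eye(N) + np.linalg.solve(Q_Z, H @ Q_X @ H.conj().T)
print(np.log(np.linalg.det(lhs).real), np.log(np.linalg.det(rhs).real))  # equal
```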

2.8. Capacity and Wideband Rates

Consider the complex-alphabet AWGN channel with output $Y = X + Z$ and noise $Z \sim \mathcal{CN}(0,1)$. The capacity with the block power constraint $\frac{1}{n}\sum_{i=1}^{n} |X_i|^2 \le P$ is
$$C(P) = \max_{\mathrm{E}[|X|^2] \le P} I(X;Y) = \log(1+P).$$
The low SNR regime (small $P$) is known as the wideband regime [73]. For well-behaved channels such as AWGN channels, the minimum $E_b/N_0$ and the slope $S$ of the capacity vs. $E_b/N_0$ in bits/(3 dB) at the minimum $E_b/N_0$ are (see ([73], Equation (35)) and ([73], Theorem 9))
$$\left.\frac{E_b}{N_0}\right|_{\min} = \frac{\log 2}{C'(0)}, \quad S = \frac{2\,[C'(0)]^2}{-C''(0)}$$
where $C'(P)$ and $C''(P)$ are the first and second derivatives of $C(P)$ (measured in nats) with respect to $P$, respectively. For example, the wideband derivatives for (36) are $C'(0) = 1$ and $C''(0) = -1$ so that the wideband values (37) are
$$\left.\frac{E_b}{N_0}\right|_{\min} = \log 2, \quad S = 2.$$
The minimal $E_b/N_0$ is usually stated in decibels, for example $10\log_{10}(\log 2) \approx -1.59$ dB. An extension of the theory to general channels is described in ([74], Section III).
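A minimal finite-difference check (added for illustration, not from the paper) of (37) for the AWGN capacity $C(P) = \log(1+P)$ in nats:

```python
import numpy as np

C = np.log1p                                   # C(P) = log(1 + P)
d = 1e-4
C1 = (C(d) - C(0.0)) / d                       # C'(0)  ~  1
C2 = (C(2 * d) - 2 * C(d) + C(0.0)) / d ** 2   # C''(0) ~ -1

Eb_N0_min = np.log(2) / C1                     # log 2, i.e. about -1.59 dB
S = 2 * C1 ** 2 / (-C2)                        # 2 bits/(3 dB)
print(10 * np.log10(Eb_N0_min), S)
```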
Remark 7.
A useful method is flash signaling, where one sends with zero energy most of the time. In particular, we will consider the CSCG flash density
$$p(x) = (1-p)\,\delta(x) + p\, \frac{e^{-|x|^2/(P/p)}}{\pi (P/p)}$$
where $0 < p \le 1$ so that the average power is $\mathrm{E}[|X|^2] = P$. Note that flash signaling is defined in ([73], Definition 2) as a family of distributions satisfying a particular property as $P \to 0$. We use the terminology informally.

2.9. Uniformly-Spaced Quantizer

Consider a uniformly-spaced scalar quantizer $q_u(\cdot)$ with $B$ bits, domain $[0, \infty)$, and reconstruction points
$$s \in \left\{\Delta/2,\ 3\Delta/2,\ \dots,\ \Delta/2 + (2^B - 1)\Delta\right\}$$
where $\Delta > 0$. The quantization intervals are
$$\mathcal{I}(s) = \begin{cases} \left[\,s - \frac{\Delta}{2},\ s + \frac{\Delta}{2}\,\right), & s < s_{\max} \\ \left[\,s - \frac{\Delta}{2},\ \infty\,\right), & s = s_{\max} \end{cases}$$
where $s_{\max} = \Delta/2 + (2^B - 1)\Delta$. We will consider $B = 0, 1, \dots, \infty$. For $B = \infty$ we choose $q_u(x) = x$.
Suppose one applies the quantizer to the non-negative random variable G with density p ( g ) to obtain S T = q u ( G ) . Let P S T and P S T | G be the probability mass functions of S T without and with conditioning on G, respectively. We have
$$P_{S_T|G}(s|g) = 1\left(g \in \mathcal{I}(s)\right), \quad P_{S_T}(s) = \int_{\mathcal{I}(s)} p(g)\, dg$$
and using Bayes’ rule, we obtain
$$p(g|s) = \begin{cases} p(g)/P_{S_T}(s), & g \in \mathcal{I}(s) \\ 0, & \text{else.} \end{cases}$$
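An illustrative numpy sketch of this quantizer (not from the paper; the choices $B = 2$, $\Delta = 0.5$, and an exponential gain density $p(g) = e^{-g}$, $g \ge 0$, are assumptions for the example):

```python
import numpy as np

B, Delta = 2, 0.5
points = Delta / 2 + Delta * np.arange(2 ** B)   # reconstruction points
s_max = points[-1]

def q_u(g):
    """Uniformly-spaced quantizer: [s - Delta/2, s + Delta/2) maps to s,
    and everything above s_max - Delta/2 maps to s_max."""
    idx = np.minimum(np.floor(np.asarray(g) / Delta).astype(int), 2 ** B - 1)
    return points[idx]

# P_{S_T}(s): integrate p(g) over I(s); the last interval extends to infinity.
edges = np.append(points - Delta / 2, np.inf)
P_ST = np.exp(-edges[:-1]) - np.exp(-edges[1:])
print(q_u([0.1, 0.6, 5.0]))    # -> [0.25, 0.75, 1.75]
print(P_ST, P_ST.sum())        # probabilities sum to 1
```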

3. Generalized Mutual Information

We re-derive the GMI in the usual way, where one starts with the forward model $q(y|x)$ rather than the reverse density $q(x|y)$ in (8). Consider the joint density $p(x,y)$ and define $q(y)$ as in (9) for $s \ge 0$. Note that neither $q(y|x)$ nor $q(y)$ must be densities. The GMI is defined in [39] to be $\max_{s \ge 0} I_s(X;Y)$ where (see the RHS of (10))
$$I_s(X;Y) = \mathrm{E}\left[\log \frac{q(Y|X)^s}{q(Y)}\right]$$
and where the expectation is with respect to p ( x , y ) . The GMI is a lower bound on the mutual information since
$$I_s(X;Y) = I(X;Y) - D\left(p_{X,Y} \,\big\|\, p_Y\, q_{X|Y}\right).$$
Moreover, by using Gallager’s derivation of error exponents, but without modifying his “s” variable, the GMI I s ( X ; Y ) is achievable with a mismatched decoder that uses q ( y | x ) for its decoding metric [39].

3.1. AWGN Forward Model with CSCG Inputs

A natural metric is based on the AWGN auxiliary channel $Y_a = hX + Z$ where $h$ is a channel parameter and $Z \sim \mathcal{CN}(0, \sigma^2)$ is independent of $X$, i.e., we have the auxiliary model (here a density)
$$q(y|x) = \frac{1}{\pi \sigma^2} \exp\left(-|y - hx|^2/\sigma^2\right)$$
where $h$ and $\sigma^2$ are to be optimized. A natural input is $X \sim \mathcal{CN}(0, P)$ so that (9) is
$$q(y) = \frac{\pi \sigma^2/s}{(\pi \sigma^2)^s} \cdot \frac{\exp\left(-\dfrac{|y|^2}{\sigma^2/s + |h|^2 P}\right)}{\pi\left(\sigma^2/s + |h|^2 P\right)}.$$
We have the following result, see [43] that considers channels of the form (1) and ([47], Proposition 1) that considers general p ( y | x ) .
Proposition 1.
The maximum GMI (42) for the channel p ( y | x ) , a CSCG input X with variance P > 0 , and the auxiliary model (44) with σ 2 > 0 is
$$I_1(X;Y) = \log\left(1 + \frac{|\tilde{h}|^2 P}{\tilde{\sigma}^2}\right)$$
where s = 1 and (cf. (7))
$$\tilde{h} = \mathrm{E}[YX^*]/P$$
$$\tilde{\sigma}^2 = \mathrm{E}\left[|Y - \tilde{h}X|^2\right] = \mathrm{E}[|Y|^2] - |\tilde{h}|^2 P.$$
The expectations are with respect to the actual density p ( x , y ) .
Proof. 
The GMI (42) for the model (44) is
$$I_s(X;Y) = \log\left(1 + \frac{|h|^2 P}{\sigma^2/s}\right) + \frac{\mathrm{E}[|Y|^2]}{\sigma^2/s + |h|^2 P} - \frac{\mathrm{E}\left[|Y - hX|^2\right]}{\sigma^2/s}.$$
Since (49) depends only on the ratio σ 2 / s one may as well set s = 1 . Thus, choosing h = h ˜ and σ 2 = σ ˜ 2 gives (46).
Next, consider $Y_a = \tilde{h}X + \tilde{Z}$ where $\tilde{Z} \sim \mathcal{CN}(0, \tilde{\sigma}^2)$ is independent of $X$. We have
$$\mathrm{E}[|Y_a|^2] = \mathrm{E}[|Y|^2]$$
$$\mathrm{E}\left[|Y_a - \tilde{h}X|^2\right] = \mathrm{E}\left[|Y - \tilde{h}X|^2\right].$$
In other words, the second-order statistics for the two channels with outputs Y (the actual channel output) and Y a are the same. But the GMI (46) is the mutual information I ( X ; Y a ) . Using (43) and (49), for any s, h and σ 2 we have
$$I(X;Y_a) = \log\left(1 + \frac{|\tilde{h}|^2 P}{\tilde{\sigma}^2}\right) \ge I_s(X;Y_a) = I_s(X;Y)$$
and equality holds if h = h ˜ and σ 2 / s = σ ˜ 2 . □
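A minimal Monte Carlo sketch of Proposition 1 (added for illustration, not from the paper), assuming the on-off fading channel of Section 3.3 as the actual channel: it estimates $\tilde{h}$ and $\tilde{\sigma}^2$ per (47)–(48) from samples, evaluates the closed form (46), and checks it against a direct sample average of $\log\left(q(Y|X)/q(Y)\right)$ with $s = 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, P = 500_000, 4.0

def cscg(n, var):
    return np.sqrt(var / 2) * (rng.standard_normal(n) + 1j * rng.standard_normal(n))

# Actual channel (the on-off fading example of Section 3.3): Y = H X + Z.
X = cscg(n, P)
H = np.sqrt(2) * rng.integers(0, 2, n)
Y = H * X + cscg(n, 1.0)

# (47), (48): h_tilde = E[Y X*]/P and sigma_tilde^2 = E[|Y|^2] - |h_tilde|^2 P.
h = np.mean(Y * np.conj(X)) / P
sig2 = np.mean(np.abs(Y) ** 2) - np.abs(h) ** 2 * P

gmi_closed = np.log1p(np.abs(h) ** 2 * P / sig2)          # (46)

# Direct evaluation of (42) with s = 1, the model (44), and q(y) from (45).
log_q_y_given_x = -np.abs(Y - h * X) ** 2 / sig2 - np.log(np.pi * sig2)
log_q_y = (-np.abs(Y) ** 2 / (sig2 + np.abs(h) ** 2 * P)
           - np.log(np.pi * (sig2 + np.abs(h) ** 2 * P)))
gmi_mc = np.mean(log_q_y_given_x - log_q_y)

print(gmi_closed, gmi_mc, np.log1p(P / (2 + P)))          # last value is (78)
```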
Remark 8.
The rate (46) is the same as the RHS of (6).
Remark 9.
Proposition 1 generalizes to vector models and adaptive input symbols; see Section 4.4.
Remark 10.
The estimate h ˜ is the MMSE estimate of h:
$$\tilde{h} = \arg\min_{h}\ \mathrm{E}\left[|Y - hX|^2\right]$$
and σ ˜ 2 is the variance of the error. To see this, expand
$$\mathrm{E}\left[|Y - hX|^2\right] = \mathrm{E}\left[\left|(Y - \tilde{h}X) + (\tilde{h} - h)X\right|^2\right] = \tilde{\sigma}^2 + |\tilde{h} - h|^2 P$$
where the final step follows by the definition of h ˜ in (47).
Remark 11.
Suppose that $h$ is an estimate other than (53). Then if $\mathrm{E}[|Y|^2] > \mathrm{E}\left[|Y - hX|^2\right]$ we may choose
$$\sigma^2/s = \frac{|h|^2 P \cdot \mathrm{E}\left[|Y - hX|^2\right]}{\mathrm{E}[|Y|^2] - \mathrm{E}\left[|Y - hX|^2\right]}$$
and the GMI (49) simplifies to
$$I_s(X;Y) = \log \frac{\mathrm{E}[|Y|^2]}{\mathrm{E}\left[|Y - hX|^2\right]}.$$
Remark 12.
The LM rate (for “lower bound to the mismatch capacity”) improves the GMI for some $q(y|x)$ [40,75]. The LM rate replaces $q(y|x)$ with $q(y|x)\, e^{t(x)/s}$ for some function $t(\cdot)$ and permits optimizing $s$ and $t(\cdot)$; see ([41], Section 2.3.2). For example, if $p(y|x)$ has the form $q(y|x)^s\, e^{t(x)}$ then the LM rate can be larger than the GMI; see [76,77].

3.2. CSIR and K-Partitions

We consider two generalizations of Proposition 1. The first is for channels with a state S R known at the receiver but not at the transmitter. The second expands the class of CSCG auxiliary models. The motivation is to obtain more precise models under partial CSIR, especially to better deal with channels at high SNR and with high rates. We here consider discrete S R and later extend to continuous S R .
CSIR: Consider the average GMI
$$I_1(X;Y|S_R) = \sum_{s_R} P_{S_R}(s_R)\, I_1(X;Y|S_R = s_R)$$
where I 1 ( X ; Y | S R = s R ) is the usual GMI where all densities are conditioned on S R = s R . The parameters (47) and (48) for the event S R = s R are now
$$\tilde{h}(s_R) = \frac{\mathrm{E}\left[YX^* \,\middle|\, S_R = s_R\right]}{\mathrm{E}\left[|X|^2 \,\middle|\, S_R = s_R\right]}$$
$$\tilde{\sigma}^2(s_R) = \mathrm{E}\left[|Y - \tilde{h}(s_R)X|^2 \,\middle|\, S_R = s_R\right].$$
The GMI (57) is thus
$$I_1(X;Y|S_R) = \sum_{s_R} P_{S_R}(s_R) \log\left(1 + \frac{|\tilde{h}(s_R)|^2 P}{\tilde{\sigma}^2(s_R)}\right).$$
K-Partitions: Let { Y k : k = 1 , , K } be a K-partition of Y and define the auxiliary model
$$q(y|x) = \frac{1}{\pi \sigma_k^2}\, e^{-|y - h_k x|^2/\sigma_k^2}, \quad y \in \mathcal{Y}_k.$$
Observe that q ( y | x ) is not necessarily a density. We choose X CN ( 0 , P ) so that (9) becomes (cf. (45))
$$q(y) = \frac{\pi \sigma_k^2/s}{(\pi \sigma_k^2)^s} \cdot \frac{\exp\left(-\dfrac{|y|^2}{\sigma_k^2/s + |h_k|^2 P}\right)}{\pi\left(\sigma_k^2/s + |h_k|^2 P\right)}, \quad y \in \mathcal{Y}_k.$$
Define the events $E_k = \{Y \in \mathcal{Y}_k\}$ for $k = 1, \dots, K$. We have
$$I_s(X;Y) = \sum_{k=1}^{K} \Pr[E_k] \cdot \mathrm{E}\left[\log \frac{q(Y|X)^s}{q(Y)} \,\middle|\, E_k\right]$$
and inserting (61) and (62) we have the following lemma.
Lemma 1.
The GMI (42) for the channel p ( y | x ) , s = 1 , a CSCG input X with variance P, and the auxiliary model (61) is (see (49))
$$I_1(X;Y) = \sum_{k=1}^{K} \Pr[E_k]\left[\log\left(1 + \frac{|h_k|^2 P}{\sigma_k^2}\right) + \frac{\mathrm{E}\left[|Y|^2 \,\middle|\, E_k\right]}{\sigma_k^2 + |h_k|^2 P} - \frac{\mathrm{E}\left[|Y - h_k X|^2 \,\middle|\, E_k\right]}{\sigma_k^2}\right].$$
Remark 13.
K-partitioning formally includes (57) as a special case by including S R as part of the receiver’s “overall” channel output Y ˜ = [ Y , S R ] . For example, one can partition Y ˜ as { Y ˜ s R : s R S R } where Y ˜ s R = Y × { s R } .
Remark 14.
The models (16) and (61) suggest building receivers based on adaptive Gaussian statistics. However, we are motivated to introduce (61) to prove capacity scaling results. For this purpose, we will use K = 2 with the partition
$$E_1 = \{|Y|^2 < t_R\}, \quad E_2 = \{|Y|^2 \ge t_R\}$$
and h 1 = 0 , σ 1 2 = 1 . The GMI (64) thus has only the k = 2 term and it remains to choose h 2 , σ 2 2 , and t R .
Remark 15.
One can generalize Lemma 1 and partition X × Y rather than Y only. However, the q ( y ) in (62) might not have a CSCG form.
Remark 16.
Define P k = E | X | 2 | E k and choose the LMMSE auxiliary models with
$$h_k = \mathrm{E}\left[YX^* \,\middle|\, E_k\right]/P_k$$
$$\sigma_k^2 = \mathrm{E}\left[|Y - h_k X|^2 \,\middle|\, E_k\right] = \mathrm{E}\left[|Y|^2 \,\middle|\, E_k\right] - |h_k|^2 P_k$$
for k = 1 , , K . The expression (64) is then
$$I_1(X;Y) = \sum_{k=1}^{K} \Pr[E_k]\left[\log\left(1 + \frac{|h_k|^2 P}{\mathrm{E}\left[|Y|^2 \,\middle|\, E_k\right] - |h_k|^2 P_k}\right) - \frac{|h_k|^2 (P - P_k)}{\mathrm{E}\left[|Y|^2 \,\middle|\, E_k\right] + |h_k|^2 (P - P_k)}\right].$$
Remark 17.
The LMMSE-based GMI (68) reduces to the GMI of Proposition 1 by choosing the trivial partition with $K = 1$ and $\mathcal{Y}_1 = \mathcal{Y}$. However, the GMI (68) may not be optimal for $K \ge 2$. What can be said is that the phase of $h_k$ in (64) should be the same as the phase of $\mathrm{E}\left[YX^* \,\middle|\, E_k\right]$ for all $k$. We thus have $K$ two-dimensional optimization problems, one for each pair $(|h_k|, \sigma_k^2)$, $k = 1, \dots, K$.
Remark 18.
Suppose we choose a different auxiliary model for each $Y = y$, i.e., consider $K \to \infty$. The reverse density GMI uses the auxiliary model (19) which gives the RHS of (15):
$$I_1(X;Y) = \int_{\mathbb{C}} p(y) \log \frac{P}{\mathrm{Var}[X|Y=y]}\, dy.$$
Instead, the suboptimal (68) is the complicated expression
$$I_1(X;Y) = \int_{\mathbb{C}} p(y) \left[\log\left(1 + \frac{\left|\mathrm{E}[X|Y=y]\right|^2 (P/P_y)}{\mathrm{Var}[X|Y=y]}\right) - \frac{\left|\mathrm{E}[X|Y=y]\right|^2 (P/P_y - 1)}{\mathrm{Var}[X|Y=y] + \left|\mathrm{E}[X|Y=y]\right|^2 (P/P_y)}\right] dy$$
where $P_y = \mathrm{E}\left[|X|^2 \,\middle|\, Y=y\right]$. We show how to compute these GMIs in Appendix C.

3.3. Example: On-Off Fading

Consider the channel $Y = HX + Z$ where $H, X, Z$ are mutually independent, $P_H(0) = P_H(\sqrt{2}) = 1/2$, and $Z \sim \mathcal{CN}(0,1)$. The channel exhibits particularly simple fading, giving basic insight into more realistic fading models. We consider two basic scenarios: full CSIR and no CSIR.
Full CSIR: Suppose S R = H and
$$q(y|x,h) = p(y|x,h) = \frac{1}{\pi \sigma^2}\, e^{-|y - hx|^2/\sigma^2}$$
which corresponds to having (58) and (59) as
$$\tilde{h}(0) = 0, \quad \tilde{h}(\sqrt{2}) = \sqrt{2}, \quad \tilde{\sigma}^2(0) = \tilde{\sigma}^2(\sqrt{2}) = 1.$$
The GMI (60) with X CN ( 0 , P ) thus gives the capacity
$$C(P) = \frac{1}{2} \log\left(1 + 2P\right).$$
The wideband values (37) are
E b N 0 min = log 2 , S = 1 .
Compared with (38), the minimal $E_b/N_0$ is the same as without fading, namely $-1.59$ dB. However, fading reduces the capacity slope $S$; see the dashed curve in Figure 1.
No CSIR: Suppose S R = 0 and X CN ( 0 , P ) and consider the densities
$$p(y|x) = \frac{e^{-|y|^2}}{2\pi} + \frac{e^{-|y - \sqrt{2}\,x|^2}}{2\pi}$$
$$p(y) = \frac{e^{-|y|^2}}{2\pi} + \frac{e^{-|y|^2/(1+2P)}}{2\pi(1+2P)}.$$
The mutual information can be computed by numerical integration or by Monte Carlo integration:
$$I(X;Y) \approx \frac{1}{N} \sum_{i=1}^{N} \log \frac{p_{Y|X}(y_i|x_i)}{p_Y(y_i)}$$
where the RHS of (77) converges to I ( X ; Y ) for long strings x N , y N sampled from p ( x , y ) . The results for X CN ( 0 , P ) are shown in Figure 1 as the curve labeled “ I ( X ; Y ) Gauss”.
Next, Proposition 1 gives $\tilde{h} = 1/\sqrt{2}$, $\tilde{\sigma}^2 = 1 + P/2$, and
$$I_1(X;Y) = \log\left(1 + \frac{P}{2 + P}\right).$$
The wideband values (37) are
E b N 0 min = log 4 , S = 2 / 3
so the minimal E b / N 0 is 1.42 dB and the capacity slope S has decreased further. Moreover, the rate saturates at large SNR at 1 bit per channel use.
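A minimal Monte Carlo sketch (added for illustration, not from the paper) of the estimate (77) for this no-CSIR on-off fading channel, using the densities (75) and (76); the $K = 1$ GMI (78) is printed for comparison.

```python
import numpy as np

rng = np.random.default_rng(4)
N, P = 500_000, 4.0

# Sample (x_i, y_i) from the actual channel: H in {0, sqrt(2)} equiprobable.
X = np.sqrt(P / 2) * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
H = np.sqrt(2) * rng.integers(0, 2, N)
Z = np.sqrt(0.5) * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
Y = H * X + Z

# Densities (75) and (76) for CSCG X with variance P.
p_y_given_x = (np.exp(-np.abs(Y) ** 2) + np.exp(-np.abs(Y - np.sqrt(2) * X) ** 2)) / (2 * np.pi)
p_y = (np.exp(-np.abs(Y) ** 2) / (2 * np.pi)
       + np.exp(-np.abs(Y) ** 2 / (1 + 2 * P)) / (2 * np.pi * (1 + 2 * P)))

I_mc = np.mean(np.log(p_y_given_x / p_y))   # Monte Carlo estimate (77)
print(I_mc, np.log1p(P / (2 + P)))          # exact-MI estimate vs. the GMI (78)
```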
The “$I(X;Y)$ Gauss” curve in Figure 1 suggests that the no-CSIR capacity approaches the full-CSIR capacity for large SNR. To prove this, consider the $K = 2$ partition specified in Remark 14 with $h_1 = 0$, $h_2 = \sqrt{2}$, and $\sigma_2^2 = 1$. Since we are not using LMMSE auxiliary models, we must compute the GMI using the general expression (64), which is
$$I_1(X;Y) = \Pr[E_2]\left[\log(1+2P) + \frac{\mathrm{E}\left[|Y|^2 \,\middle|\, E_2\right]}{1+2P} - \mathrm{E}\left[|Y - \sqrt{2}\,X|^2 \,\middle|\, E_2\right]\right].$$
In Appendix B.1, we show that choosing $t_R = P^{\lambda_R} + b$ where $0 < \lambda_R < 1$ and $b$ is a real constant makes all terms behave as desired as $P$ increases:
$$\Pr[E_2] \to 1/2, \quad \frac{\mathrm{E}\left[|Y|^2 \,\middle|\, E_2\right]}{1+2P} \to 1, \quad \mathrm{E}\left[|Y - \sqrt{2}\,X|^2 \,\middle|\, E_2\right] \to 1.$$
The GMI (80) of Lemma 1 thus gives the maximal value (73) for large P:
$$\lim_{P \to \infty}\ \left[\frac{1}{2}\log(1+2P) - I_1(X;Y)\right] = 0.$$
Figure 1 shows the behavior of I 1 ( X ; Y ) for K = 2 , λ R = 0.4 , and b = 3 . Effectively, at large SNR, the receiver can estimate H accurately, and one approaches the full-CSIR capacity.
Remark 19.
For on-off fading, one may compute I ( X ; Y ) directly and use the densities (75) and (76) to decode. Nevertheless, the partitioning of Lemma 1 helps prove the capacity scaling (82).
Consider next the reverse density GMI (69) and the forward model GMI (70). Appendix C.1 shows how to compute $\mathrm{E}[X|Y=y]$, $\mathrm{E}\left[|X|^2 \mid Y=y\right]$, and $\mathrm{Var}[X|Y=y]$, and Figure 1 plots the GMIs as the curves labeled “rGMI” and “GMI, $K = \infty$”, respectively. The rGMI curve gives the best possible rates for AWGN auxiliary models, as shown in Section 1.4. The results also show that the large-$K$ GMI (70) is worse than the $K = 1$ GMI at low SNR but better than the $K = 2$ GMI of Remark 14.
Finally, the curve labeled “$I(X;Y)$ Gauss” in Figure 1 suggests that the minimal $E_b/N_0$ is 1.42 dB even for the capacity-achieving distribution. However, we know from ([73], Theorem 1) that flash signaling (39) can approach the minimal $E_b/N_0$ of $-1.59$ dB. For example, the flash rates $I(X;Y)$ with $p = 0.05$ are plotted in Figure 1. Unfortunately, the wideband slope is $S = 0$ ([73], Theorem 17), and one requires very large flash powers (very small $p$) to approach $-1.59$ dB.
Remark 20.
As stated in Remark 6, the paper [37] (see also [2,70]) derives two capacity lower bounds. These bounds are the same for our problem, and they are derived using the following steps (see ([37], Lemmas 3 and 4)):
$$I(X;Y) = I(X, S_H; Y) - I(S_H; Y \mid X) \ge I(X; Y \mid S_H) - I(S_H; Y \mid X).$$
Now consider Y = H X + Z where H , X , Z are mutually independent, S H = H , Var Z = 1 , and X CN ( 0 , P ) . We have
$$I(X;Y \mid H) \ge \mathrm{E}\left[\log\left(1 + |H|^2 P\right)\right]$$
$$I(H;Y \mid X) = h(Y \mid X) - h(Z) \le \log\left(\pi e \left(1 + \mathrm{Var}[H]\, P\right)\right) - h(Z)$$
where (84) and (85) follow by (5), in the latter case with the roles of X and Y reversed. The bound (85) works well if Var H is small, as for massive MIMO with “channel hardening”. However, for our on-off fading model, the bound (83) is
$$I(X;Y) \ge \mathrm{E}\left[\log\left(1 + |H|^2 P\right)\right] - \log\left(1 + \mathrm{Var}[H]\, P\right) = \frac{1}{2}\log(1+2P) - \log(1 + P/2)$$
which is worse than the $K = 1$ and $K = \infty$ GMIs and is not shown in Figure 1.

4. Channels with CSIT

This section studies Shannon’s channel with side information, or state, known causally at the transmitter [5,6]. We begin by treating general channels and then focus mainly on complex-alphabet channels. The capacity expression has a random variable A that is either a list (for discrete-alphabet states) or a function (for continuous-alphabet states). We refer to A as an adaptive symbol of an adaptive codeword.

4.1. Model

The problem is specified by the functional dependence graph (FDG) in Figure 2. The model has a message M, a CSIT string S T n , and a noise string Z n . The variables M, S T n , Z n are mutually statistically independent, and S T n and Z n are strings of i.i.d. random variables with the same distributions as S T and Z, respectively. S T n is available causally at the transmitter, i.e., the channel input X i , i = 1 , , n , is a function of M and the sub-string S T i . The receiver sees the channel outputs
Y i = f ( X i , S T i , Z i )
for some function f ( . ) and i = 1 , 2 , , n .
Each A i represents a list of possible choices of X i at time i. More precisely, suppose that S T has alphabet S T = { 0 , 1 , , ν 1 } and define the adaptive symbol
A = X ( 0 ) , , X ( ν 1 )
whose entries have alphabet X . Here S T = s T means that X ( s T ) is transmitted, i.e., we have X = X ( S T ) . If S T has a continuous alphabet, we make A a function rather than a list, and we may again write X = X ( S T ) . Some authors therefore write A as X ( . ) . (Shannon in [6] denoted our A and X as the respective X and x.)
Remark 21.
The conventional choice for $A$ if $\mathcal{X} = \mathbb{C}$ is
$$A = \left[\sqrt{P(0)}\, e^{j\phi(0)},\ \dots,\ \sqrt{P(\nu-1)}\, e^{j\phi(\nu-1)}\right] \cdot U$$
where U has E | U | 2 = 1 , P ( s T ) = E | X ( s T ) | 2 , and ϕ ( s T ) is a phase shift. The interpretation is that U represents the symbol of a conventional codebook without CSIT, and these symbols are scaled and rotated. In other words, one separates the message-carrying U from an adaptation due to S T via
$$X = \sqrt{P(S_T)}\, e^{j\phi(S_T)}\, U.$$
Remark 22.
One may define the channel by the functional relation (87), by p ( y | a ) , or by p ( y | x , s T ) ; see Shannon’s emphasis in ([6], Theorem); see ([22], Remark 3). We generally prefer to use p ( y | a ) since we interpret A as a channel input.
Remark 23.
One can add feedback and let X i be a function of ( M , S T i , Y i 1 ) , but feedback does not increase the capacity if the state and noise processes are memoryless ([22], Section V).
Remark 24.
The model (87) permits block fading and MIMO transmission by choosing X i and Y i as vectors [11,78].

4.2. Capacity

The capacity of the model under study is (see [6])
C = max A I ( A ; Y )
where A [ S T , X ] Y forms a Markov chain. One may limit attention to A with cardinality | A | satisfying (see ([22], Equation (56)), [79], ([80], Theorem 1))
$$|\mathcal{A}| \le \min\left\{|\mathcal{Y}|,\ 1 + |\mathcal{S}_T|\left(|\mathcal{X}| - 1\right)\right\}.$$
As usual, for the cost function c ( x , y ) and the average block cost constraint
$$\frac{1}{n}\sum_{i=1}^{n} \mathrm{E}\left[c(X_i, Y_i)\right] \le P$$
the unconstrained maximization in (90) becomes a constrained maximization over the $A$ for which $\mathrm{E}[c(X,Y)] \le P$. Also, a simple upper bound on the capacity is
$$C(P) \le \max_{A:\ \mathrm{E}[c(X,Y)] \le P} I(A; Y, S_T) \overset{(a)}{=} \max_{X(S_T):\ \mathrm{E}[c(X(S_T),Y)] \le P} I(X; Y \mid S_T)$$
where step ( a ) follows by the independence of A and S T . This bound is tight if the receiver knows S T .
Remark 25.
The chain rule for mutual information gives
$$I(A;Y) = I\left(X(0) \cdots X(\nu-1);\ Y\right)$$
$$= \sum_{s_T=0}^{\nu-1} I\left(X(s_T);\ Y \,\middle|\, X(0), \dots, X(s_T - 1)\right).$$
The RHS of (94) suggests treating the channel as a multi-input, single-output (MISO) channel, and the expression (95) suggests using multi-level coding with multi-stage decoding [81]. For example, one may use polar coded modulation [82,83,84] with Honda-Yamamoto shaping [85,86].
Remark 26.
For $\mathcal{X} = \mathbb{C}$ and the conventional adaptive symbol (88), we compute $I(A;Y) = I(U;Y)$ and
$$C(P) = \max_{P(S_T),\, \phi(S_T):\ \mathrm{E}[c(X(S_T),Y)] \le P} I(U;Y).$$

4.3. Structure of the Optimal Input Distribution

Let $\mathcal{A}$ be the alphabet of $A$ and let $\mathcal{X} = \mathbb{C}$, i.e., we have $\mathcal{A} = \mathbb{C}^\nu$ for discrete $S_T$. Consider the expansions
$$p(y|a) = \sum_{s_T} P_{S_T}(s_T)\, p\left(y \mid x(s_T), s_T\right), \quad p(y) = \int_{\mathcal{A}} p(a)\, p(y|a)\, da$$
$$= \sum_{s_T} P_{S_T}(s_T) \int_{\mathbb{C}} p\left(x(s_T)\right)\, p\left(y \mid x(s_T), s_T\right)\, dx(s_T).$$
Observe that p ( y ) , and hence h ( Y ) , depends only on the marginals p ( x ( s T ) ) of A; see ([80], Section III). So define the set of densities having the same marginals as A:
$$\mathcal{P}(A) = \left\{p(\tilde{a}) : p(\tilde{x}(s_T)) = p(x(s_T)) \text{ for all } s_T \in \mathcal{S}_T\right\}.$$
This set is convex, since for any $p^{(1)}(a), p^{(2)}(a) \in \mathcal{P}(A)$ and $0 \le \lambda \le 1$ we have
$$\lambda\, p^{(1)}(a) + (1-\lambda)\, p^{(2)}(a) \in \mathcal{P}(A).$$
Moreover, for fixed p ( y ) , the expression I ( A ; Y ) is a convex function of p ( a | y ) , and p ( a | y ) = p ( a ) p ( y | a ) / p ( y ) is a linear function of p ( a ) . Maximizing I ( A ; Y ) over P ( A ) is thus the same as minimizing the concave function h ( Y | A ) over the convex set P ( A ) . An optimal p ( a ) is thus an extreme of P ( A ) . Some properties of such extremes are developed in [87,88].
For example, consider | S T | = 2 and X = S T = { 0 , 1 } , for which (91) states that at most | A | = 3 adaptive symbols need have positive probability (and at most | A | = 2 adaptive symbols if | Y | = 2 ). Suppose the marginals have P X ( 0 ) ( 0 ) = 1 / 2 , P X ( 1 ) ( 0 ) = 3 / 4 and consider the matrix notation
$$\mathbf{P}_A = \begin{bmatrix} P_A(0,0) & P_A(0,1) \\ P_A(1,0) & P_A(1,1) \end{bmatrix}$$
where we write P A ( x 1 , x 2 ) for P A ( [ x 1 , x 2 ] ) . The optimal P A must then be one of the two extremes
$$\mathbf{P}_A = \begin{bmatrix} 1/2 & 0 \\ 1/4 & 1/4 \end{bmatrix}, \quad \mathbf{P}_A = \begin{bmatrix} 1/4 & 1/4 \\ 1/2 & 0 \end{bmatrix}.$$
For the first P A , the codebook has the property that if X ( 0 ) = 0 then X ( 1 ) = 0 while if X ( 0 ) = 1 then X ( 1 ) is uniformly distributed over X = { 0 , 1 } .
Next, consider | S T | = 2 and marginals P X ( 0 ) , P X ( 1 ) that are uniform over X = { 0 , 1 , , | X | 1 } . This case was treated in detail in ([80], Section VI.A), see also [89], and we provide a different perspective. A classic theorem of Birkhoff [90] ensures that the extremes of P ( A ) are the | X | ! distributions P A for which the | X | × | X | matrix
$$\mathbf{P}_A = \begin{bmatrix} P_A(0,0) & \cdots & P_A(0, |\mathcal{X}|-1) \\ \vdots & \ddots & \vdots \\ P_A(|\mathcal{X}|-1, 0) & \cdots & P_A(|\mathcal{X}|-1, |\mathcal{X}|-1) \end{bmatrix}$$
is a permutation matrix multiplied by 1 / | X | . For example, for | X | = 2 we have the two extremes
$$\mathbf{P}_A = \frac{1}{2}\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad \mathbf{P}_A = \frac{1}{2}\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}.$$
The permutation property means that $X(s_T)$ is a function of $X(0)$, i.e., the encoding simplifies to a conventional codebook as in Remark 21 with uniformly-distributed $U$ and a permutation $\pi_{s_T}(\cdot)$ indexed by $s_T$ such that $X(S_T) = \pi_{S_T}(U)$. For example, for the first $\mathbf{P}_A$ in (101) we may choose $X(S_T) = U$, which is independent of $S_T$. On the other hand, for the second $\mathbf{P}_A$ in (101) we may choose $X(S_T) = U \oplus S_T$ where $\oplus$ denotes addition modulo-2.
For | S T | > 2 , the geometry of P ( A ) is more complicated; see ([80], Section VI.B). For example, consider X = { 0 , 1 } and suppose the marginals P X ( s T ) , s T S T , are all uniform. Then the extremes include P A related to linear codes and their cosets, e.g., two extremes for | S T | = 3 are related to the repetition code and single parity check code:
$$P_A(a) = 1/2, \quad a \in \{[0,0,0],\ [1,1,1]\}$$
$$P_A(a) = 1/4, \quad a \in \{[0,0,0],\ [0,1,1],\ [1,0,1],\ [1,1,0]\}.$$
This observation motivates concatenated coding, where the message is first encoded by an outer encoder followed by an inner code that is the coset of a linear code. The transmitter then sends the entries at position S T of the inner codewords, which are vectors of dimension | S T | . We do not know if there are channels for which such codes are helpful.
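The following minimal numpy sketch (added for illustration, not from the paper) generates adaptive symbols according to the two constructions just described: the permutation extreme $X(S_T) = U \oplus S_T$ for binary alphabets, and a coset-code extreme for $|\mathcal{S}_T| = 3$ based on the single parity check code; in both cases, each entry of $A$ has a uniform marginal as required.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Permutation extreme: conventional uniform codebook symbol U, and X(s_T) = U xor s_T.
U = rng.integers(0, 2, n)
A = np.stack([U ^ 0, U ^ 1], axis=1)        # adaptive symbol A = [X(0), X(1)]
print(A[:, 0].mean(), A[:, 1].mean())       # both marginals close to uniform (1/2)

# Coset-code extreme for |S_T| = 3: the entries of A form a codeword of the
# single parity check code, so each marginal P_{X(s_T)} is again uniform.
bits = rng.integers(0, 2, (n, 2))
A3 = np.concatenate([bits, bits[:, :1] ^ bits[:, 1:2]], axis=1)
print(A3.mean(axis=0))
```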

4.4. Generalized Mutual Information

Consider the vector channel $p(\underline{y}|\underline{x})$ with input set $\mathcal{X} = \mathbb{C}^M$ and output set $\mathcal{Y} = \mathbb{C}^N$. The GMI for adaptive symbols is $\max_{s \ge 0} I_s(A; \underline{Y})$ where
$$I_s(A; \underline{Y}) = \mathrm{E}\left[\log \frac{q(\underline{Y}|A)^s}{q(\underline{Y})}\right]$$
and the expectation is with respect to p ( a , y ̲ ) . Suppose the auxiliary model is q ( y ̲ | a ) and define
$$q(\underline{y}) = \int_{\mathcal{A}} p(a)\, q(\underline{y}|a)^s\, da.$$
The GMI again provides a lower bound on the mutual information since (cf. (43))
$$I_s(A; \underline{Y}) = I(A; \underline{Y}) - D\left(p_{A,\underline{Y}} \,\big\|\, p_{\underline{Y}}\, q_{A|\underline{Y}}\right)$$
where $q(a|\underline{y}) = p(a)\, q(\underline{y}|a)^s / q(\underline{y})$ is a reverse channel density.
We next study reverse and forward models as in Section 1.3 and Section 1.4. Suppose the entries X ̲ ( s T ) of A are jointly CSCG.
Reverse Model: We write A ̲ when we consider A to be a column vector that stacks the X ̲ ( s T ) . Consider the following reverse density motivated by (13):
$$q(\underline{a}|\underline{y}) = \frac{\exp\left(-\left(\underline{a} - \mathrm{E}[\underline{A}|\underline{Y}=\underline{y}]\right)^\dagger \mathbf{Q}_{\underline{A}|\underline{Y}=\underline{y}}^{-1} \left(\underline{a} - \mathrm{E}[\underline{A}|\underline{Y}=\underline{y}]\right)\right)}{\pi^{\nu M} \det \mathbf{Q}_{\underline{A}|\underline{Y}=\underline{y}}}.$$
A corresponding forward model is $q(\underline{y}|a) = q(a|\underline{y})/p(a)$ and the GMI with $s = 1$ becomes (cf. (35))
$$I_1(A; \underline{Y}) = \mathrm{E}\left[\log \frac{\det \mathbf{Q}_{\underline{A}}}{\det \mathbf{Q}_{\underline{A}|\underline{Y}}}\right].$$
To simplify, one may focus on adaptive symbols as in (89):
X ̲ = Q X ̲ ( S T ) 1 / 2 · U ̲
where U ̲ CN ( 0 ̲ , I ) and the Q X ̲ ( s T ) are covariance matrices. We thus have I ( A ; Y ̲ ) = I ( U ̲ ; Y ̲ ) (cf. (96)) and using (105) but with A ̲ replaced with U ̲ we obtain
$$I_1(A; \underline{Y}) = \mathrm{E}\left[-\log \det \mathbf{Q}_{\underline{U}|\underline{Y}}\right].$$
Forward Model: Perhaps the simplest forward model is q ( y ̲ | a ) = p ( y ̲ | x ̲ ( s T ) ) for some fixed value s T S T . One may interpret this model as having the receiver assume that S T = s T . A natural generalization of this idea is as follows: define the auxiliary vector
$$\bar{\underline{X}} = \sum_{s_T} \mathbf{W}(s_T)\, \underline{X}(s_T)$$
where the W ( s T ) are M × M complex matrices, i.e., X ¯ ̲ is a linear function of the entries of A = [ X ̲ ( s T ) : s T S T ] . For example, the matrices might be chosen based on P S T ( . ) . However, observe that X ¯ ̲ is independent of S T . Now define the auxiliary model
q ( y ̲ | a ) = q ( y ̲ | x ¯ ̲ )
where we abuse notation by using the same q ( . ) . The expression (103) becomes
$$q(\underline{y}) = \int_{\mathcal{A}} p(a)\, q(\underline{y}|a)^s\, da = \int_{\mathbb{C}^M} p(\bar{\underline{x}})\, q(\underline{y}|\bar{\underline{x}})^s\, d\bar{\underline{x}}.$$
Remark 27.
We often consider S T to be a discrete set, but for CSCG channels we also consider S T = C so that the sum over S T in (109) is replaced by an integral over C .
We now specialize further by choosing the auxiliary channel Y ̲ a = H X ¯ ̲ + Z ̲ where H is an N × M complex matrix, Z ̲ is an N-dimensional CSCG vector that is independent of X ¯ ̲ and has invertible covariance matrix Q Z ̲ , and H and Q Z ̲ are to be optimized. Further choose A = [ X ̲ ( s T ) : s T S T ] whose entries are jointly CSCG with correlation matrices
$$\mathbf{R}(s_{T1}, s_{T2}) = \mathrm{E}\left[\underline{X}(s_{T1})\, \underline{X}(s_{T2})^\dagger\right].$$
Since X ¯ ̲ in (109) is independent of S T , we have
$$q(\underline{y}|a) = \frac{\exp\left(-\left(\underline{y} - \mathbf{H}\bar{\underline{x}}\right)^\dagger \mathbf{Q}_{\underline{Z}}^{-1} \left(\underline{y} - \mathbf{H}\bar{\underline{x}}\right)\right)}{\pi^N \det \mathbf{Q}_{\underline{Z}}}.$$
Moreover, X ¯ ̲ is CSCG so (110) is
$$q(\underline{y}) = \frac{\pi^N \det\left(\mathbf{Q}_{\underline{Z}}/s\right)}{\left(\pi^N \det \mathbf{Q}_{\underline{Z}}\right)^s} \cdot \frac{\exp\left(-\underline{y}^\dagger \left(\mathbf{Q}_{\underline{Z}}/s + \mathbf{H}\, \mathbf{Q}_{\bar{\underline{X}}}\, \mathbf{H}^\dagger\right)^{-1} \underline{y}\right)}{\pi^N \det\left(\mathbf{Q}_{\underline{Z}}/s + \mathbf{H}\, \mathbf{Q}_{\bar{\underline{X}}}\, \mathbf{H}^\dagger\right)}$$
where
$$\mathbf{Q}_{\bar{\underline{X}}} = \sum_{s_{T1}, s_{T2}} \mathbf{W}(s_{T1})\, \mathbf{R}(s_{T1}, s_{T2})\, \mathbf{W}(s_{T2})^\dagger.$$
We have the following generalization of Proposition 1.
Lemma 2.
The maximum GMI (102) for the channel $p(\underline{y}|a)$, an adaptive vector $A = [\underline{X}(s_T) : s_T \in \mathcal{S}_T]$ that has jointly CSCG entries, an $\bar{\underline{X}}$ as in (109) with $\mathbf{Q}_{\bar{\underline{X}}} \succ \mathbf{0}$, and the auxiliary model (111) with $\mathbf{Q}_{\underline{Z}} \succ \mathbf{0}$ is
$$I_1(A; \underline{Y}) = \log \det\left(\mathbf{I} + \mathbf{Q}_{\tilde{\underline{Z}}}^{-1}\, \tilde{\mathbf{H}}\, \mathbf{Q}_{\bar{\underline{X}}}\, \tilde{\mathbf{H}}^\dagger\right)$$
where (cf. (31))
$$\tilde{\mathbf{H}} = \mathrm{E}\left[\underline{Y}\, \bar{\underline{X}}^\dagger\right] \mathbf{Q}_{\bar{\underline{X}}}^{-1}$$
$$\mathbf{Q}_{\tilde{\underline{Z}}} = \mathbf{Q}_{\underline{Y}} - \tilde{\mathbf{H}}\, \mathbf{Q}_{\bar{\underline{X}}}\, \tilde{\mathbf{H}}^\dagger.$$
The expectation is with respect to the actual channel with joint distribution/density p ( a , y ̲ ) .
Proof. 
See Appendix D. □
Remark 28.
Since X ¯ ̲ is a function of A, the rate (112) can alternatively be derived by using I ( A ; Y ̲ ) I ( X ¯ ̲ ; Y ̲ ) and applying the bound (30) with X ̲ replaced with X ¯ ̲ .
Remark 29.
The estimate H ˜ is the MMSE estimate of H :
$$\tilde{\mathbf{H}} = \arg\min_{\mathbf{H}}\ \mathrm{E}\left[\left\|\underline{Y} - \mathbf{H}\bar{\underline{X}}\right\|^2\right]$$
and Q Z ̲ ˜ is the resulting covariance matrix of the error. To see this, expand (cf. (54))
$$\mathrm{E}\left[\left\|\underline{Y} - \mathbf{H}\bar{\underline{X}}\right\|^2\right] = \mathrm{E}\left[\left\|\left(\underline{Y} - \tilde{\mathbf{H}}\bar{\underline{X}}\right) + \left(\tilde{\mathbf{H}} - \mathbf{H}\right)\bar{\underline{X}}\right\|^2\right] = \mathrm{E}\left[\left\|\underline{Y} - \tilde{\mathbf{H}}\bar{\underline{X}}\right\|^2\right] + \mathrm{tr}\left(\left(\tilde{\mathbf{H}} - \mathbf{H}\right) \mathbf{Q}_{\bar{\underline{X}}} \left(\tilde{\mathbf{H}} - \mathbf{H}\right)^\dagger\right)$$
where the final step follows by the definition of H ˜ in (113).
Remark 30.
Suppose that $\mathbf{H}$ is an estimate other than (115). Generalizing (55), if $\mathbf{Q}_{\underline{Y}} \succ \mathbf{Q}_{\bar{\underline{Z}}}$ we may choose
$$\mathbf{Q}_{\underline{Z}}/s = \left(\mathbf{H}\, \mathbf{Q}_{\bar{\underline{X}}}\, \mathbf{H}^\dagger\right)^{1/2} \left(\mathbf{Q}_{\underline{Y}} - \mathbf{Q}_{\bar{\underline{Z}}}\right)^{-1/2} \mathbf{Q}_{\bar{\underline{Z}}} \left(\mathbf{Q}_{\underline{Y}} - \mathbf{Q}_{\bar{\underline{Z}}}\right)^{-1/2} \left(\mathbf{H}\, \mathbf{Q}_{\bar{\underline{X}}}\, \mathbf{H}^\dagger\right)^{1/2}$$
where
$$\mathbf{Q}_{\bar{\underline{Z}}} = \mathrm{E}\left[\left(\underline{Y} - \mathbf{H}\bar{\underline{X}}\right)\left(\underline{Y} - \mathbf{H}\bar{\underline{X}}\right)^\dagger\right].$$
Appendix D shows that (102) then simplifies to (cf. (56))
$$I_s(A; \underline{Y}) = \log \det\left(\mathbf{Q}_{\bar{\underline{Z}}}^{-1}\, \mathbf{Q}_{\underline{Y}}\right).$$
Remark 31.
The GMI (112) does not depend on the scaling of $\bar{\underline{X}}$ since this is absorbed in $\tilde{\mathbf{H}}$. For example, one can choose the weighting matrices in (109) so that $\mathrm{E}\left[\|\bar{\underline{X}}\|^2\right] = P$.

4.5. Optimal Codebooks for CSCG Forward Models

The following Lemma maximizes the GMI for scalar channels and A with CSCG entries without requiring A to have the form (89). Nevertheless, this form is optimal, and we refer to ([10], page 2013) and Section 6.4 for similar results. In the following, let U ( s T ) CN ( 0 , 1 ) for all s T .
Lemma 3.
The maximum GMI (102) for the channel p ( y | a ) , an adaptive symbol A with jointly CSCG entries, the forward model (111), and with fixed P ( s T ) = E | X ( s T ) | 2 is
$$I_1(A;Y) = \log\left(1 + \frac{\tilde{P}}{\mathrm{E}[|Y|^2] - \tilde{P}}\right)$$
where, writing $X(s_T) = \sqrt{P(s_T)}\, U(s_T)$ for all $s_T$, we have
$$\tilde{P} = \left(\mathrm{E}\left[\left|\mathrm{E}\left[Y\, U(S_T)^* \,\middle|\, S_T\right]\right|\right]\right)^2.$$
This GMI is achieved by choosing fully-correlated symbols:
$$X(s_T) = \sqrt{P(s_T)}\, e^{j\phi(s_T)}\, U$$
and X ¯ = c U for some non-zero constant c and a common U CN ( 0 , 1 ) , and where
$$\phi(s_T) = -\arg\left(\mathrm{E}\left[Y\, U(s_T)^* \,\middle|\, S_T = s_T\right]\right).$$
Proof. 
See Appendix E. □
Remark 32.
The expression (121) is based on (A58) in Appendix E and can alternatively be written as P ˜ = | h ˜ | 2 P ¯ where
h ˜ = E Y X ¯ * / P ¯ .
Remark 33.
The power levels P ( s T ) may be optimized, usually under a constraint such as E P ( S T ) P .
Remark 34.
By the Cauchy-Schwarz inequality, we have
$$\left(\mathrm{E}\left[\left|\mathrm{E}\left[Y\, U(S_T)^* \,\middle|\, S_T\right]\right|\right]\right)^2 \le \mathrm{E}[|Y|^2].$$
Furthermore, equality holds if and only if | Y U ( s T ) * | is a constant for each s T , but this case is not interesting.

4.6. Forward Model GMI for MIMO Channels

The following lemma generalizes Lemma 3 to MIMO channels without claiming a closed-form expression for the optimal GMI. Let U ̲ ( s T ) CN ( 0 ̲ , I ) for all s T .
Lemma 4.
A GMI (102) for the channel p ( y ̲ | a ) , an adaptive vector A with jointly CSCG entries, the auxiliary model (111), and with fixed Q X ̲ ( s T ) is given by (112) that we write as
$$I_1(A; \underline{Y}) = \log \frac{\det \mathbf{Q}_{\underline{Y}}}{\det\left(\mathbf{Q}_{\underline{Y}} - \tilde{\mathbf{D}}\tilde{\mathbf{D}}^\dagger\right)}.$$
where for M × M unitary V R ( s T ) we have
$$\tilde{\mathbf{D}} = \mathrm{E}\left[\mathbf{U}_T(S_T)\, \boldsymbol{\Sigma}(S_T)\, \mathbf{V}_R(S_T)\right]$$
and U T ( s T ) and Σ ( s T ) are N × N unitary and N × M rectangular diagonal matrices, respectively, of the SVD
$$\mathrm{E}\left[\underline{Y}\, \underline{U}(s_T)^\dagger \,\middle|\, S_T = s_T\right] = \mathbf{U}_T(s_T)\, \boldsymbol{\Sigma}(s_T)\, \mathbf{V}_T(s_T)^\dagger$$
for all s T , and the V T ( s T ) are M × M unitary matrices. The GMI (124) is achieved by choosing the symbols (cf. (122) and (A87) below):
$$\underline{X}(s_T) = \mathbf{Q}_{\underline{X}(s_T)}^{1/2}\, \mathbf{V}_T(s_T)\, \underline{U}$$
and X ¯ ̲ = C U ̲ for some invertible M × M matrix C and a common M-dimensional vector U ̲ CN ( 0 ̲ , I ) . One may maximize (124) over the unitary V R ( s T ) .
Proof. 
See Appendix G. □
Using Lemma 4, the theory for MISO channels with N = 1 is similar to the scalar case of Lemma 3; see Remark 35 below. However, optimizing the GMI is more difficult for N > 1 because one must optimize over the unitary matrices V R ( s T ) in (125); see Remark 36 below.
Remark 35.
Consider N = 1 in which case one may set U T ( s T ) = 1 and (126) is a 1 × M vector where Σ ( s T ) has as the only non-zero singular value
$$\sigma(s_T) = \left\|\mathrm{E}\left[Y\, \underline{U}(s_T)^\dagger \,\middle|\, S_T = s_T\right]\right\| = \left(\sum_{m=1}^{M} \left|\mathrm{E}\left[Y\, U_m(s_T)^* \,\middle|\, S_T = s_T\right]\right|^2\right)^{1/2}.$$
The absolute value of the scalar (125) is maximized by choosing V R ( s T ) = I for all s T to obtain (cf. (121))
$$\tilde{\mathbf{D}}\tilde{\mathbf{D}}^\dagger = \left(\mathrm{E}\left[\sigma(S_T)\right]\right)^2.$$
Remark 36.
Consider $M = 1$ in which case one may set $\mathbf{V}_T(s_T) = 1$ and (126) is an $N \times 1$ vector where $\boldsymbol{\Sigma}(s_T)$ has as the only non-zero singular value
$$\sigma(s_T) = \left\|\mathrm{E}\left[\underline{Y}\, U(s_T)^* \,\middle|\, S_T = s_T\right]\right\| = \left(\sum_{n=1}^{N} \left|\mathrm{E}\left[Y_n\, U(s_T)^* \,\middle|\, S_T = s_T\right]\right|^2\right)^{1/2}.$$
We should now find the V R ( s T ) = e j ϕ R ( s T ) that minimize the determinant in the denominator of (124) where (see (125))
$$\tilde{\mathbf{D}} = \mathrm{E}\left[\underline{u}_T(S_T)\, \sigma(S_T)\, e^{j\phi_R(S_T)}\right]$$
and where each u ̲ T ( s T ) is one of the columns of the N × N unitary matrix U T ( s T ) .
Remark 37.
Consider M = N and the product channel
$$p(\underline{y}|a) = \prod_{m=1}^{M} p\left(y_m \,\middle|\, [x_m(s_T) : s_T \in \mathcal{S}_T]\right)$$
where x m ( s T ) is the m’th entry of x ̲ ( s T ) . We choose Q X ̲ ( s T ) as diagonal with diagonal entries P m ( s T ) , m = 1 , , M . Also choosing V R ( s T ) = I makes the matrix D ˜ D ˜ diagonal with the diagonal entries (cf. (121) where M = N = 1 )
$$\left(\sum_{s_T} P_{S_T}(s_T) \left|\mathrm{E}\left[Y_m\, U_m(s_T)^* \,\middle|\, S_T = s_T\right]\right|\right)^2$$
for m = 1 , , M . The GMI (124) is thus (cf. (120))
$$I_1(A; \underline{Y}) = \sum_{m=1}^{M} \log \frac{\mathrm{E}[|Y_m|^2]}{\mathrm{E}[|Y_m|^2] - \left(\mathrm{E}\left[\left|\mathrm{E}\left[Y_m\, U_m(S_T)^* \,\middle|\, S_T\right]\right|\right]\right)^2}.$$
Remark 38.
For general p ( y ̲ | a ) , one might wish to choose diagonal Q X ̲ ( s T ) and a product model
$$q(\underline{y}|a) = \prod_{m=1}^{M} q_m\left(y_m \,\middle|\, \bar{x}_m\right)$$
where the $q_m(\cdot)$ are scalar AWGN channels
$$q_m(y|x) = \frac{1}{\pi \sigma_m^2} \exp\left(-|y - h_m x|^2/\sigma_m^2\right)$$
with possibly different $h_m$ and $\sigma_m^2$ for each $m$. Consider also
$$\bar{X}_m = \sum_{s_T} w_m(s_T)\, X_m(s_T)$$
for some complex weights w m ( s T ) , i.e., X ¯ m is a weighted sum of entries from the list [ X m ( s T ) : s T S T ] . The maximum GMI is now the same as (134) but without requiring the actual channel to have the form (132).
Remark 39.
If the actual channel is Y ̲ = H X ̲ + Z ̲ then
$$\mathrm{E}\left[\underline{Y}\, \underline{U}(s_T)^\dagger \,\middle|\, S_T = s_T\right] = \mathrm{E}\left[\mathbf{H}\, \underline{X}(s_T)\, \underline{U}(s_T)^\dagger \,\middle|\, S_T = s_T\right] = \mathrm{E}\left[\mathbf{H} \,\middle|\, S_T = s_T\right] \mathbf{Q}_{\underline{X}(s_T)}^{1/2}$$
where the final step follows because $\underline{U}(S_T) - S_T - \mathbf{H}$ forms a Markov chain. The expression (135) is useful because it separates the effects of the channel and the transmitter.
Remark 40.
Combining Remarks 37 and 39, suppose the actual channel is Y ̲ = H X ̲ + Z ̲ with M = N and where H is diagonal with diagonal entries H m , m = 1 , , M . The GMI (124) is then (cf. (134))
$$I_1(A; \underline{Y}) = \sum_{m=1}^{M} \log \frac{\mathrm{E}[|Y_m|^2]}{\mathrm{E}[|Y_m|^2] - \left(\mathrm{E}\left[\left|\mathrm{E}\left[H_m \sqrt{P_m(S_T)} \,\middle|\, S_T\right]\right|\right]\right)^2}$$
where $\mathrm{E}[|Y_m|^2] = 1 + \mathrm{E}\left[|H_m|^2\, P_m(S_T)\right]$.

5. Channels with CSIR and CSIT

Shannon’s model includes CSIR [11]. The FDG is shown in Figure 3 where there is a hidden state S H , the CSIR S R and CSIT S T are functions of S H , and the receiver sees the channel outputs
$$[\, Y_i,\, S_{R,i} \,] = [\, f(X_i, S_{H,i}, Z_i),\, S_{R,i} \,]$$
for some function $f(\cdot)$ and $i = 1, 2, \dots, n$. (By defining $S_H = [S_{H1}, Z_H]$ and calling $S_{H1}$ the hidden channel state, we can include the case where $S_R$ and $S_T$ are noisy functions of $S_{H1}$.) As before, $M$, $S_H^n$, $Z^n$ are mutually statistically independent, and $S_H^n$ and $Z^n$ are i.i.d. strings of random variables with the same distributions as $S_H$ and $Z$, respectively. Observe that we have changed the notation by writing $Y$ for only part of the channel output. The new $Y$ (without the $S_R$) is usually called the "channel output".

5.1. Capacity and GMI

We begin with scalar channels for which (90) is
$$C = \max_{A}\, I(A;\, Y, S_R) = \max_{A}\, I(A;\, Y \,|\, S_R)$$
where A and S R are independent.
Reverse Model: The expression (108) with the adaptive symbol (88) is
$$I_1(A;\, Y, S_R) = \mathrm{E}\big[\, -\log \mathrm{Var}[\, U \,|\, Y, S_R \,] \,\big].$$
Forward Model: Consider the expansion
$$I_1(A;\, Y \,|\, S_R) = \int_{\mathcal{S}_R} p(s_R)\, I_1(A;\, Y \,|\, S_R = s_R)\, \mathrm{d}s_R$$
where $I_1(A;\, Y \,|\, S_R = s_R)$ is the GMI (102) with all densities conditioned on $S_R = s_R$. We choose the forward model
$$q(y\,|\,a, s_R) = \frac{1}{\pi\,\sigma(s_R)^2}\, \exp\!\left( -\frac{|\, y - h(s_R)\,\bar{x}(s_R) \,|^2}{\sigma(s_R)^2} \right)$$
where, similar to (109), we define
$$\bar{X}(s_R) = \sum_{s_T} w(s_T, s_R)\, X(s_T)$$
for complex weights $w(s_T, s_R)$, i.e., $\bar{X}(s_R)$ is a weighted sum of entries from the list $A = [\, X(s_T) : s_T \in \mathcal{S}_T \,]$. We have the following straightforward generalization of Lemma 3.
Theorem 1.
The maximum GMI (140) for the channel $p(y\,|\,a, s_R)$, an adaptive symbol $A$ with jointly CSCG entries, the model (141), and fixed $P(s_T) = \mathrm{E}[\,|X(s_T)|^2\,]$ is
$$I_1(A;\, Y \,|\, S_R) = \mathrm{E}\!\left[ \log\!\left( 1 + \frac{\tilde{P}(S_R)}{\mathrm{E}[\,|Y|^2 \,|\, S_R\,] - \tilde{P}(S_R)} \right) \right]$$
where for all s R S R we have
P ˜ ( s R ) = E E Y U ( S T ) * S T , S R = s R 2 .
Remark 41.
To establish Theorem 1, the receiver may choose $\bar{X} = \sqrt{P}\, U$ to be independent of $s_R$. Alternatively, the receiver may choose $\bar{X}(s_R) = \sqrt{\mathrm{E}[\,|X|^2 \,|\, S_R = s_R\,]}\; U$. Both choices give the same GMI since the expectation in (144) does not depend on the scaling of $\bar{X}$; see Remark 31.
Remark 42.
The partition idea of Lemmas 1 and 5 carries over to Theorem 1. We may generalize (143) as
$$I_1(A;\, Y \,|\, S_R) = \int_{\mathcal{S}_R} p(s_R) \sum_{k=1}^{K} \Pr\big[\mathcal{E}_k \,\big|\, S_R = s_R\big] \left( \log\!\left( 1 + \frac{|h_k(s_R)|^2 P}{\sigma_k^2(s_R)} \right) + \frac{\mathrm{E}\big[\,|Y|^2 \,\big|\, \mathcal{E}_k, S_R = s_R\,\big]}{\sigma_k^2(s_R) + |h_k(s_R)|^2 P} - \frac{\mathrm{E}\big[\,|Y - h_k(s_R)\sqrt{P}\, U|^2 \,\big|\, \mathcal{E}_k, S_R = s_R\,\big]}{\sigma_k^2(s_R)} \right) \mathrm{d}s_R$$
where the X ( s T ) , s T S T , are given by (122) and the h k ( s R ) and σ k 2 ( s R ) , k = 1 , , K , s R S R , can be optimized.
Remark 43.
One is usually interested in the optimal power control policy $P(s_T)$ under the constraint $\mathrm{E}[\,P(S_T)\,] \le P$. Taking the derivative of (143) with respect to $\sqrt{P(s_T)}$ and setting it to zero, we obtain
$$\mathrm{E}\!\left[ \frac{\mathrm{E}[\,|Y|^2\,|\,S_R\,]\; \tilde{P}'(S_R) - \tilde{P}(S_R)\; \mathrm{E}[\,|Y|^2\,|\,S_R\,]'}{\mathrm{E}[\,|Y|^2\,|\,S_R\,]\,\big( \mathrm{E}[\,|Y|^2\,|\,S_R\,] - \tilde{P}(S_R) \big)} \right] = 2\lambda\, \sqrt{P(s_T)}\; P_{S_T}(s_T)$$
where $\tilde{P}'(S_R)$ and $\mathrm{E}[\,|Y|^2\,|\,S_R\,]'$ denote derivatives with respect to $\sqrt{P(s_T)}$. We use (146) below to derive power control policies.
Remark 44.
A related model is a compound channel where p ( y | a , s R ) is indexed by the parameter s R ([91], Chapter 4). The problem is to find the maximum worst-case reliable rate if the transmitter does not know s R . Alternatively, the transmitter must send its message to all | S R | receivers indexed by s R S R . A compound channel may thus be interpreted as a broadcast channel with a common message.

5.2. CSIT@R

An interesting specialization of Shannon’s model is when the receiver knows S T and can determine X ( S T ) . We refer to this scenario as CSIT@R. The model was considered in ([10], Section II) when S T is a function of S R . More generally, suppose S T is a function of [ Y , S R ] . The capacity (138) then simplifies to (see ([10], Proposition 1))
$$C \stackrel{(a)}{=} \max_{A}\, I(A;\, Y, S_T \,|\, S_R) \stackrel{(b)}{=} \max_{A}\, I(X;\, Y \,|\, S_R, S_T) \stackrel{(c)}{=} \sum_{s_T} P_{S_T}(s_T)\, \max_{X(s_T)}\, I\big(X(s_T);\, Y \,\big|\, S_R, S_T = s_T\big)$$
where step $(a)$ follows because $S_T$ is a function of $[Y, S_R]$; step $(b)$ follows because $A$ and $(S_R, S_T)$ are independent, $X$ is a function of $[A, S_T]$, and $A - [S_T, X] - Y$ forms a Markov chain; and step $(c)$ follows because one may optimize $X(s_T)$ separately for each $s_T \in \mathcal{S}_T$.
As discussed in [10], a practical motivation for this model is when the CSIT is based on error-free feedback from the receiver to the transmitter. In this case, where S T is a function of S R , the expression (144) becomes
$$\tilde{P}(s_R) = \big|\, \mathrm{E}\big[\, Y\, U(s_T)^* \,\big|\, S_R = s_R \,\big] \,\big|^2.$$
Remark 45.
The insight that one can replace adaptive symbols A with channel inputs X when X is a function of A and past Y appeared for two-way channels in ([9], Section 4.2.3) and networks in ([22], Section V.A), ([72], Section IV.F).

5.3. MIMO Channels and K-Partitions

We consider generalizations to MIMO channels and K-partitions as in Section 3.2.
MIMO Channels: Consider the average GMI
$$I_1(A;\, \underline{Y} \,|\, S_R) = \int_{\mathcal{S}_R} p(s_R)\, I_1(A;\, \underline{Y} \,|\, S_R = s_R)\, \mathrm{d}s_R$$
and choose the parameters (113) and (114) for the event $S_R = s_R$. We have
$$\tilde{H}(s_R) = \mathrm{E}\big[\, \underline{Y}\, \bar{\underline{X}}^\dagger \,\big|\, S_R = s_R \,\big]\; \mathrm{E}\big[\, \bar{\underline{X}}\, \bar{\underline{X}}^\dagger \,\big|\, S_R = s_R \,\big]^{-1}$$
$$Q_{\tilde{\underline{Z}}}(s_R) = \mathrm{E}\big[\, \underline{Y}\, \underline{Y}^\dagger \,\big|\, S_R = s_R \,\big] - \tilde{H}(s_R)\; \mathrm{E}\big[\, \bar{\underline{X}}\, \bar{\underline{X}}^\dagger \,\big|\, S_R = s_R \,\big]\; \tilde{H}(s_R)^\dagger$$
and the GMI (149) is (cf. (60) and (112))
$$I_1(A;\, \underline{Y} \,|\, S_R) = \mathrm{E}\Big[ \log\det\Big( \mathrm{I} + Q_{\tilde{\underline{Z}}}(S_R)^{-1}\, \tilde{H}(S_R)\, Q_{\bar{\underline{X}}}\, \tilde{H}(S_R)^\dagger \Big) \Big].$$
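The conditional moments in (150) and (151) can be estimated directly from channel samples. The following Python sketch (an illustration, not the paper's code) does this for a hypothetical 2x2 channel whose CSIR $S_R$ takes two values and selects the conditional mean of the channel matrix; the input is CSCG with $Q_{\bar{\underline{X}}} = (P/M)\,\mathrm{I}$ and $\bar{\underline{X}} = \underline{X}$ (no CSIT), and (152) is evaluated by averaging over the two CSIR values.

```python
import numpy as np

rng = np.random.default_rng(0)
M = N = 2                     # transmit / receive dimensions (illustrative)
P = 4.0                       # total power, split equally: Q_Xbar = (P/M) I
n = 100_000                   # Monte Carlo samples

def crandn(*shape):           # CN(0,1) samples
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

# Hypothetical two-state CSIR: S_R selects the conditional mean of H;
# the receiver does not know the residual perturbation.
H_mean = np.array([[[1.0, 0.2], [0.1, 0.8]], [[0.3, 0.0], [0.0, 1.2]]])
s_R = rng.integers(0, 2, size=n)
H = H_mean[s_R] + 0.3 * crandn(n, N, M)

X = np.sqrt(P / M) * crandn(n, M)                       # CSCG input, no CSIT
Y = np.einsum('kij,kj->ki', H, X) + crandn(n, N)        # Y = H X + Z

def corr(A, B):               # empirical E[A B^dagger]
    return np.einsum('ki,kj->ij', A, B.conj()) / A.shape[0]

Q_Xbar = (P / M) * np.eye(M)
gmi = 0.0
for s in (0, 1):
    idx = (s_R == s)
    Ys, Xs = Y[idx], X[idx]
    Htil = corr(Ys, Xs) @ np.linalg.inv(corr(Xs, Xs))           # cf. (150)
    Qz = corr(Ys, Ys) - Htil @ corr(Xs, Xs) @ Htil.conj().T     # cf. (151)
    arg = np.eye(N) + np.linalg.inv(Qz) @ Htil @ Q_Xbar @ Htil.conj().T
    gmi += idx.mean() * np.linalg.slogdet(arg)[1]               # cf. (152), nats
print(f"forward-model GMI: {gmi:.3f} nats/use")
```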
K-Partitions: Let { Y ̲ k : k = 1 , , K } be a K-partition of Y ̲ and define the events E k = { Y ̲ Y ̲ k } for k = 1 , , K . As in Remark 13, K-partitioning formally includes (149) as a special case by including S R as part of the receiver’s “overall” channel output Y ˜ ̲ = [ Y ̲ , S R ] . The following lemma generalizes Lemma 1.
Lemma 5.
A GMI with s = 1 for the channel p ( y ̲ | a ) is
$$I_1(A;\, \underline{Y}) = \sum_{k=1}^{K} \Pr[\mathcal{E}_k] \left( \log\det\big( \mathrm{I} + Q_{\underline{Z}_k}^{-1} H_k Q_{\bar{\underline{X}}} H_k^\dagger \big) + \mathrm{E}\Big[\, \underline{Y}^\dagger \big( Q_{\underline{Z}_k} + H_k Q_{\bar{\underline{X}}} H_k^\dagger \big)^{-1} \underline{Y} \,\Big|\, \mathcal{E}_k \Big] - \mathrm{E}\Big[\, \big(\underline{Y} - H_k \bar{\underline{X}}\big)^\dagger Q_{\underline{Z}_k}^{-1} \big(\underline{Y} - H_k \bar{\underline{X}}\big) \,\Big|\, \mathcal{E}_k \Big] \right)$$
where the H k and Q Z ̲ k , k = 1 , , K , can be optimized.
Remark 46.
For scalars the GMI (153) is
$$I_1(A;\, Y) = \sum_{k=1}^{K} \Pr[\mathcal{E}_k] \left( \log\!\left( 1 + \frac{|h_k|^2 \bar{P}}{\sigma_k^2} \right) + \frac{\mathrm{E}\big[\,|Y|^2 \,\big|\, \mathcal{E}_k\,\big]}{\sigma_k^2 + |h_k|^2 \bar{P}} - \frac{\mathrm{E}\big[\,|Y - h_k \bar{X}|^2 \,\big|\, \mathcal{E}_k\,\big]}{\sigma_k^2} \right)$$
which is the same as (64) except that $\bar{X}, \bar{P}$ replace $X, P$. If we follow (66) and (67) then (154) becomes (68), but with
$$h_k = \mathrm{E}\big[\, Y \bar{X}^* \,\big|\, \mathcal{E}_k \,\big] \big/ P_k, \qquad P_k = \mathrm{E}\big[\, |\bar{X}|^2 \,\big|\, \mathcal{E}_k \,\big].$$
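As a numerical illustration of (154) with the LMMSE choices (155) (a sketch under assumed parameters, not a result from the paper), the following Monte Carlo estimate uses on-off fading with no CSIR and no CSIT, $\bar{X} = X$, and a two-set partition of the output by the magnitude threshold $t = \sqrt{1+P}$; the $K=1$ GMI is printed for comparison (cf. (174) in Section 6.2).

```python
import numpy as np

rng = np.random.default_rng(1)
n, P = 500_000, 10.0
H = np.sqrt(2.0) * rng.integers(0, 2, n)          # on-off fading, P_H(0)=P_H(sqrt(2))=1/2
X = np.sqrt(P / 2) * (rng.standard_normal(n) + 1j * rng.standard_normal(n))
Z = (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)
Y = H * X + Z

Pbar = P                                          # E|Xbar|^2 with Xbar = X
t = np.sqrt(1.0 + P)                              # magnitude threshold (illustrative)
events = [np.abs(Y) <= t, np.abs(Y) > t]          # 2-partition of the output

gmi = 0.0
for Ek in events:
    pk = Ek.mean()
    Pk = np.mean(np.abs(X[Ek]) ** 2)              # E[|Xbar|^2 | E_k]
    hk = np.mean(Y[Ek] * X[Ek].conj()) / Pk       # cf. (155)
    sk2 = np.mean(np.abs(Y[Ek]) ** 2) - np.abs(hk) ** 2 * Pk   # LMMSE noise variance
    term = (np.log(1 + np.abs(hk) ** 2 * Pbar / sk2)
            + np.mean(np.abs(Y[Ek]) ** 2) / (sk2 + np.abs(hk) ** 2 * Pbar)
            - np.mean(np.abs(Y[Ek] - hk * X[Ek]) ** 2) / sk2)  # cf. (154)
    gmi += pk * term

print(f"2-partition GMI : {gmi:.3f} nats/use")
muH2, varH = 0.5, 0.5                             # |E[H]|^2 and Var[H] for on-off fading
print(f"1-partition GMI : {np.log(1 + P * muH2 / (1 + P * varH)):.3f} nats/use")
```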
Remark 47.
Consider Remark 14 and choose K = 2 , h 1 = 0 , σ 1 2 = 1 . The GMI (154) then has only the k = 2 term, and it again remains to select h 2 , σ 2 2 , and t R .
Remark 48.
If we define
$$Q_{\bar{\underline{X}}}^{(k)} = \mathrm{E}\big[\, \bar{\underline{X}}\, \bar{\underline{X}}^\dagger \,\big|\, \mathcal{E}_k \,\big], \qquad Q_{\underline{Y}}^{(k)} = \mathrm{E}\big[\, \underline{Y}\, \underline{Y}^\dagger \,\big|\, \mathcal{E}_k \,\big]$$
and choose the LMMSE auxiliary models with
$$H_k = \mathrm{E}\big[\, \underline{Y}\, \bar{\underline{X}}^\dagger \,\big|\, \mathcal{E}_k \,\big]\; \big( Q_{\bar{\underline{X}}}^{(k)} \big)^{-1}$$
$$Q_{\underline{Z}_k} = Q_{\underline{Y}}^{(k)} - H_k\, Q_{\bar{\underline{X}}}^{(k)}\, H_k^\dagger$$
for $k = 1, \dots, K$, then the expression (153) is (cf. (68))
$$I_1(A;\, \underline{Y}) = \sum_{k=1}^{K} \Pr[\mathcal{E}_k] \left( \log\det\big( \mathrm{I} + Q_{\underline{Z}_k}^{-1} H_k Q_{\bar{\underline{X}}} H_k^\dagger \big) - \mathrm{tr}\Big[ \big( Q_{\underline{Y}}^{(k)} + H_k D_{\bar{\underline{X}}}^{(k)} H_k^\dagger \big)^{-1} H_k D_{\bar{\underline{X}}}^{(k)} H_k^\dagger \Big] \right)$$
where $D_{\bar{\underline{X}}}^{(k)} = Q_{\bar{\underline{X}}} - Q_{\bar{\underline{X}}}^{(k)}$.
Remark 49.
We may proceed as in Remark 18 and consider large K. These steps are given in Appendix F.

6. Fading Channels with AWGN

This section treats scalar, complex-alphabet, AWGN channels with CSIR for which the channel output is
$$[\, Y, S_R \,] = [\, H X + Z,\, S_R \,]$$
where $H, A, Z$ are mutually independent, $\mathrm{E}[\,|H|^2\,] = 1$, and $Z \sim \mathcal{CN}(0, 1)$. The capacity under the power constraint $\mathrm{E}[\,|X|^2\,] \le P$ is (cf. (138))
$$C(P) = \max_{A:\, \mathrm{E}[|X|^2] \le P}\, I(A;\, Y \,|\, S_R).$$
However, the optimization in (160) is often intractable, and we desire expressions with log ( 1 + SNR ) terms to gain insight. We develop three such expressions: an upper bound and two lower bounds. It will be convenient to write G = | H | 2 .
Capacity Upper Bound: We state this bound as a lemma since we use it to prove Proposition 2 below.
Lemma 6.
The capacity (160) is upper bounded as
$$C(P) \le \max\, \mathrm{E}\big[\, \log\big( 1 + G\, P(S_T) \big) \,\big]$$
where the maximization is over $P(S_T)$ with $\mathrm{E}[\,P(S_T)\,] = P$.
Proof. 
Consider the steps
$$I(A;\, Y \,|\, S_R) \le I(A;\, Y, S_T, H \,|\, S_R) \stackrel{(a)}{=} I(A;\, Y \,|\, S_R, S_T, H) = h(Y \,|\, S_R, S_T, H) - h(Z) \stackrel{(b)}{\le} \mathrm{E}\big[\, \log \mathrm{Var}[\, Y \,|\, S_R, S_T, H \,] \,\big]$$
where step $(a)$ holds because $A$ and $[S_R, S_T, H]$ are independent, and step $(b)$ follows from the entropy bound
$$h(Y \,|\, B = b) \le \log\big( \pi e\, \mathrm{Var}[\, Y \,|\, B = b \,] \big)$$
applied with $B = [S_R, S_T, H]$. Finally, we compute $\mathrm{Var}[\, Y \,|\, S_R, S_T, H \,] = 1 + G\, P(S_T)$. □
Reverse Model GMI: Consider the adaptive symbol (88) and the GMI (139). We expand the variances in (139) as
$$\mathrm{Var}[\, U \,|\, Y = y, S_R = s_R \,] = \mathrm{E}[\, |U|^2 \,|\, Y = y, S_R = s_R \,] - \big|\, \mathrm{E}[\, U \,|\, Y = y, S_R = s_R \,] \,\big|^2.$$
Appendix C shows that one may write
$$\mathrm{E}[\, U \,|\, Y = y, S_R = s_R \,] = \int_{\mathbb{C}\times\mathcal{S}_T} p(h, s_T \,|\, y, s_R)\; \frac{h^* \sqrt{P(s_T)}\, e^{-j\phi(s_T)}\, y}{1 + |h|^2 P(s_T)}\; \mathrm{d}s_T\, \mathrm{d}h$$
and
$$\mathrm{E}[\, |U|^2 \,|\, Y = y, S_R = s_R \,] = \int_{\mathbb{C}\times\mathcal{S}_T} p(h, s_T \,|\, y, s_R) \left( \frac{1}{1 + |h|^2 P(s_T)} + \frac{|h|^2 P(s_T)\, |y|^2}{\big( 1 + |h|^2 P(s_T) \big)^2} \right) \mathrm{d}s_T\, \mathrm{d}h.$$
We use the expressions (164) and (165) to compute achievable rates by numerical integration. For example, suppose that S T = 0 and S R = H , i.e., we have full CSIR and no CSIT. The averaging density is then
$$p(h, s_T \,|\, y, s_R) = \delta(h - s_R)\, \delta(s_T)$$
and the variance simplifies to the capacity-achieving form
$$\mathrm{Var}[\, U \,|\, Y = y, S_R = h \,] = \frac{1}{1 + |h|^2 P}.$$
Forward Model GMI: A forward model GMI is given by Theorem 1 where
P ˜ ( s R ) = E E H P ( S T ) S T , S R = s R 2
$$\mathrm{E}[\, |Y|^2 \,|\, S_R = s_R \,] = 1 + \mathrm{E}[\, G\, P(S_T) \,|\, S_R = s_R \,]$$
so that (143) becomes
$$I_1(A;\, Y \,|\, S_R) = \mathrm{E}\!\left[ \log\!\left( 1 + \frac{\tilde{P}(S_R)}{1 + \mathrm{E}[\, G\, P(S_T) \,|\, S_R \,] - \tilde{P}(S_R)} \right) \right].$$
Remark 50.
Jensen’s inequality implies that the denominator in (168) is greater than or equal to
1 + Var G P ( S T ) S R .
Equality requires that for all S R = s R we have
P ˜ ( s R ) = E G P ( S T ) S R = s R 2
which is valid if H is a function of [ S R , S T ] , for example. However, if there is channel uncertainty after conditioning on [ S R , S T ] then P ˜ ( s R ) is usually smaller than the RHS of (170).
Remark 51.
Consider $S_R = H$ or $S_R = H\sqrt{P(S_T)}$. For both cases, $H$ is a function of $[S_R, S_T]$ and the denominator in (168) is the variance (169). In fact, for $S_R = H\sqrt{P(S_T)}$, the expression (169) takes on the minimal value 1. This CSIR is thus the best possible; see Proposition 2.
Remark 52.
For MIMO channels we replace (159) with
$$[\, \underline{Y}, S_R \,] = [\, \mathsf{H}\, \underline{X} + \underline{Z},\, S_R \,]$$
where $\mathsf{H}, A, \underline{Z}$ are mutually independent and $\underline{Z} \sim \mathcal{CN}(\underline{0}, \mathrm{I})$. One usually considers the constraint $\mathrm{E}[\, \|\underline{X}\|^2 \,] \le P$.
Remark 53.
The model (171) includes block fading. For example, choosing M = N and H = H I gives scalar block fading. Moreover, the capacity per symbol without in-block feedback is the same as for the M = N = 1 case except that P is replaced with P / M ; see [11] and Section 9.

6.1. CSIR and CSIT Models

We study two classes of CSIR, as shown in Table 1. The first class has full (or "perfect") CSIR, by which we mean either $S_R = H$ or $S_R = H\sqrt{P(S_T)}$. The motivation for studying the latter case is that it models block fading channels with long blocks, where the receiver estimates $H\sqrt{P(S_T)}$ using pilot symbols and the number of pilot symbols is much smaller than the block length [10]. Moreover, this CSIR achieves the upper bound (161); see Proposition 2 below.
We coarsely categorize the CSIT as follows:
  • Full CSIT: S T = H ;
  • CSIT@R: $S_T = q_u(G)$ where $q_u(\cdot)$ is the quantizer of Section 2.9 with $B = 0, 1, \dots$;
  • Partial CSIT: S T is not known exactly at the receiver.
The capacity of the CSIT@R models is given by log ( 1 + SNR ) expressions [10,92]; see also [93]. The partial CSIT model is interesting because achieving capacity generally requires adaptive codewords and closed-form capacity expressions are unavailable. The GMI lower bound of Theorem 1 and Remark 42 and the capacity upper bound of Lemma 6 serve as benchmarks.
The partial CSIR models have S R being a lossy function of H. For example, a common model is based on LMMSE channel estimation with
$$H = \bar{\epsilon}\, S_R + \epsilon\, Z_R$$
where $0 \le \epsilon \le 1$ and $S_R, Z_R$ are uncorrelated. The CSIT is categorized as above, except that we consider $S_T = f_T(S_R)$ for some function $f_T(\cdot)$ rather than $S_T = q_u(G)$.
To illustrate the theory, we study two types of fading: one with discrete H and one with continuous H, namely
  • Section 7: on-off fading with $P_H(0) = P_H(\sqrt{2}) = 1/2$;
  • Section 8: Rayleigh fading with $H \sim \mathcal{CN}(0, 1)$.
For on-off fading we have $p(g) = \frac{1}{2}\delta(g) + \frac{1}{2}\delta(g - 2)$, and for Rayleigh fading we have $p(g) = e^{-g}\cdot 1(g \ge 0)$.
Remark 54.
For channels with partial CSIR, we will study the GMI for partitions with K = 1 and K = 2 . The full CSIT model has received relatively little attention in the literature, perhaps because CSIR is usually more accurate than CSIT ([5], Section 4.2.3).

6.2. No CSIR, No CSIT

Without CSIR or CSIT, the channel is a classic memoryless channel [94] for which the capacity (160) becomes the usual expression with $S_R = 0$ and $A = X$. For CSCG $X$ and $U = X / \sqrt{\mathrm{E}[\,|X|^2\,]}$, the reverse and forward model GMIs (139) and (168) are, respectively,
$$I_1(X;\, Y) = \mathrm{E}\big[\, -\log \mathrm{Var}[\, U \,|\, Y \,] \,\big]$$
$$I_1(X;\, Y) = \log\!\left( 1 + \frac{P\, |\mathrm{E}[H]|^2}{1 + P\, \mathrm{Var}[H]} \right).$$
For example, the forward model GMI is zero if E H = 0 .
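For on-off fading (see Section 7), the reverse-model GMI (173) can be estimated by Monte Carlo using (164) and (165): with $S_R = 0$ and $S_T = 0$ the averaging density reduces to a two-point posterior over $h \in \{0, \sqrt{2}\}$. The sketch below is an illustration (assuming a single power level $P$ and no phase adaptation), not the paper's code; it also prints the forward-model GMI (174).

```python
import numpy as np

rng = np.random.default_rng(2)
n, P = 400_000, 10.0
h_vals = np.array([0.0, np.sqrt(2.0)])            # on-off fading states
g_vals = h_vals ** 2

H = h_vals[rng.integers(0, 2, n)]
X = np.sqrt(P / 2) * (rng.standard_normal(n) + 1j * rng.standard_normal(n))
Z = (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)
Y = H * X + Z

# Posterior p(h | y) over the two fading states (no CSIR, no CSIT)
lik = np.array([np.exp(-np.abs(Y) ** 2 / (1 + g * P)) / (np.pi * (1 + g * P))
                for g in g_vals])                 # shape (2, n)
post = 0.5 * lik
post /= post.sum(axis=0)

# Conditional mean and second moment of U = X / sqrt(P), cf. (164)-(165)
EU = sum(post[i] * (np.sqrt(P) * h_vals[i] / (1 + g_vals[i] * P)) * Y
         for i in range(2))
EU2 = sum(post[i] * (1 / (1 + g_vals[i] * P)
                     + g_vals[i] * P * np.abs(Y) ** 2 / (1 + g_vals[i] * P) ** 2)
          for i in range(2))
var_U = EU2 - np.abs(EU) ** 2

I_rev = np.mean(-np.log(var_U))                                      # cf. (173)
I_fwd = np.log(1 + P * np.abs(h_vals.mean()) ** 2 / (1 + P * 0.5))   # cf. (174)
print(f"reverse GMI {I_rev:.3f} nats, forward GMI {I_fwd:.3f} nats")
```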

6.3. Full CSIR, CSIT@R

Consider the full CSIR models with S R = H and CSIT@R. The capacity is given by log ( 1 + SNR ) expressions that we review.
First, the capacity with B = 0 (no CSIT) is
$$C(P) = \mathrm{E}\big[\, \log( 1 + G P ) \,\big] = \int_0^\infty p(g)\, \log( 1 + g P )\, \mathrm{d}g.$$
The wideband derivatives are (see (37))
$$C'(0) = \mathrm{E}[G] = 1, \qquad C''(0) = -\mathrm{E}[G^2]$$
so that the wideband values (37) are (see ([73], Theorem 13))
$$\left( \frac{E_b}{N_0} \right)_{\min} = \log 2, \qquad S = \frac{2}{\mathrm{E}[G^2]}.$$
The minimal $E_b/N_0$ is the same as without fading, namely $-1.59$ dB. However, Jensen's inequality gives $\mathrm{E}[G^2] \ge \mathrm{E}[G]^2 = 1$ with equality if and only if $G = 1$. Thus, fading reduces the capacity slope $S$.
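As a numerical illustration (not from the paper), the following evaluates (175)–(177) for Rayleigh fading, where $G$ is unit-mean exponential; it returns $(E_b/N_0)_{\min} \approx -1.59$ dB and slope $S = 1$, half the slope of the non-fading AWGN channel.

```python
import numpy as np
from scipy.integrate import quad

p_g = lambda g: np.exp(-g)                 # Rayleigh fading: G ~ Exp(1)

def C(P):                                  # capacity (175), full CSIR, no CSIT
    return quad(lambda g: p_g(g) * np.log(1 + g * P), 0, np.inf)[0]

EG  = quad(lambda g: g * p_g(g), 0, np.inf)[0]          # = 1
EG2 = quad(lambda g: g * g * p_g(g), 0, np.inf)[0]      # = 2
print("C(P) at P = 10 :", C(10.0), "nats")
print("Eb/N0_min      :", 10 * np.log10(np.log(2) / EG), "dB")   # cf. (177)
print("wideband slope :", 2 / EG2)                                # cf. (177)
```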
More generally, the capacity with full CSIR and S T = q u ( G ) is (see [10])
$$C(P) = \max_{P(S_T):\, \mathrm{E}[P(S_T)] \le P}\, \mathrm{E}\big[\, \log\big( 1 + G\, P(S_T) \big) \,\big] = \max_{P(S_T):\, \mathrm{E}[P(S_T)] \le P}\, \int_{\mathcal{S}_T}\!\int_0^\infty p(g, s_T)\, \log\big( 1 + g\, P(s_T) \big)\, \mathrm{d}g\, \mathrm{d}s_T.$$
To optimize the power levels $P(s_T)$, consider the Lagrangian
$$\mathrm{E}\big[\, \log\big( 1 + G\, P(S_T) \big) \,\big] + \lambda\big( P - \mathrm{E}[\, P(S_T) \,] \big)$$
where $\lambda \ge 0$ is a Lagrange multiplier. Taking the derivative with respect to $P(s_T)$, we have
$$\lambda = \mathrm{E}\!\left[ \frac{G}{1 + G\, P(s_T)} \,\middle|\, S_T = s_T \right] = \int_0^\infty p(g \,|\, s_T)\, \frac{g}{1 + g\, P(s_T)}\, \mathrm{d}g$$
as long as $P(s_T) \ge 0$. If this equation cannot be satisfied, choose $P(s_T) = 0$. Finally, set $\lambda$ so that $\mathrm{E}[\, P(S_T) \,] = P$.
For example, consider $B = \infty$ and $S_T = G$. We then have $p(g \,|\, s_T) = \delta(g - s_T)$ and therefore
$$P(g) = \left[ \frac{1}{\lambda} - \frac{1}{g} \right]^+$$
where $\lambda$ is chosen so that $\mathrm{E}[\, P(G) \,] = P$. The capacity (178) is then (see ([95], Equation (7)))
$$C(P) = \int_\lambda^\infty p(g)\, \log( g / \lambda )\, \mathrm{d}g.$$
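A water-filling sketch for Rayleigh fading (illustrative; the power budget $P = 10$ is an arbitrary choice): $\lambda$ is found by bisection so that $\mathrm{E}[P(G)] = P$, and the capacity is evaluated both via (178) and via (182) as a consistency check.

```python
import numpy as np
from scipy.integrate import quad

p_g = lambda g: np.exp(-g)                         # Rayleigh fading: G ~ Exp(1)
P = 10.0

def avg_power(lam):                                # E[(1/lam - 1/G)^+]
    return quad(lambda g: p_g(g) * (1 / lam - 1 / g), lam, np.inf)[0]

lo, hi = 1e-6, 100.0
for _ in range(60):                                # avg_power is decreasing in lam
    lam = 0.5 * (lo + hi)
    lo, hi = (lo, lam) if avg_power(lam) < P else (lam, hi)

C_178 = quad(lambda g: p_g(g) * np.log(1 + g * (1 / lam - 1 / g)), lam, np.inf)[0]
C_182 = quad(lambda g: p_g(g) * np.log(g / lam), lam, np.inf)[0]
print(f"lambda = {lam:.4f}, C = {C_178:.4f} = {C_182:.4f} nats/symbol")
```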
Consider now the quantizer q u ( . ) of Section 2.9 with B = 1 . We have two equations for λ , namely
$$\lambda = \int_0^\Delta \frac{p(g)}{P_{S_T}(\Delta/2)} \cdot \frac{g}{1 + g\, P(\Delta/2)}\, \mathrm{d}g$$
$$\lambda = \int_\Delta^\infty \frac{p(g)}{P_{S_T}(3\Delta/2)} \cdot \frac{g}{1 + g\, P(3\Delta/2)}\, \mathrm{d}g.$$
Observe the following for (183) and (184):
  • both P ( Δ / 2 ) and P ( 3 Δ / 2 ) decrease as λ increases;
  • the maximal $\lambda$ permitted by (183) is $\mathrm{E}[\, G \,|\, G \le \Delta \,]$, which is obtained with $P(\Delta/2) = 0$;
  • the maximal $\lambda$ permitted by (184) is $\mathrm{E}[\, G \,|\, G \ge \Delta \,]$, which is obtained with $P(3\Delta/2) = 0$.
Thus, if $\mathrm{E}[\, G \,|\, G \ge \Delta \,] > \mathrm{E}[\, G \,|\, G \le \Delta \,]$, then at $P$ below some threshold we have $P(\Delta/2) = 0$ and $P(3\Delta/2) = P / P_{S_T}(3\Delta/2)$. The capacity in nats per symbol at low power and for fixed $\Delta$ is thus
$$C(P) = \int_\Delta^\infty p(g)\, \log\big( 1 + g\, P(3\Delta/2) \big)\, \mathrm{d}g \approx P\, \mathrm{E}[\, G \,|\, G \ge \Delta \,] - \frac{P^2}{2\, P_{S_T}(3\Delta/2)}\, \mathrm{E}[\, G^2 \,|\, G \ge \Delta \,]$$
where we used
$$\log(1 + x) \approx x - \frac{x^2}{2}$$
for small $x$. The wideband values (37) are
$$\left( \frac{E_b}{N_0} \right)_{\min} = \frac{\log 2}{\mathrm{E}[\, G \,|\, G \ge \Delta \,]}$$
$$S = \frac{2\, P_{S_T}(3\Delta/2)\, \mathrm{E}[\, G \,|\, G \ge \Delta \,]^2}{\mathrm{E}[\, G^2 \,|\, G \ge \Delta \,]}.$$
One can thus make the minimum $E_b/N_0$ approach zero if one can make $\mathrm{E}[\, G \,|\, G \ge \Delta \,]$ as large as desired by increasing $\Delta$.
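The effect of $\Delta$ on the wideband values (187) and (188) can be seen numerically. The sketch below (for Rayleigh fading; the listed $\Delta$ values are arbitrary) computes the conditional moments by quadrature.

```python
import numpy as np
from scipy.integrate import quad

p_g = lambda g: np.exp(-g)                      # Rayleigh fading: G ~ Exp(1)

for Delta in (0.5, 1.0, 2.0, 4.0):
    pr_hi = quad(p_g, Delta, np.inf)[0]                            # P_{S_T}(3*Delta/2)
    EG_hi = quad(lambda g: g * p_g(g), Delta, np.inf)[0] / pr_hi   # E[G | G >= Delta]
    EG2_hi = quad(lambda g: g * g * p_g(g), Delta, np.inf)[0] / pr_hi
    ebn0_db = 10 * np.log10(np.log(2) / EG_hi)                     # cf. (187), in dB
    slope = 2 * pr_hi * EG_hi ** 2 / EG2_hi                        # cf. (188)
    print(f"Delta={Delta:3.1f}: Eb/N0_min={ebn0_db:6.2f} dB, slope={slope:.3f}")
```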
Remark 55.
Consider the MIMO model (171) with $S_R = \mathsf{H}$. Suppose the CSIT is $S_T = f_T(S_R)$ for some function $f_T(\cdot)$. The capacity (178) generalizes to
$$C(P) = \max_{\underline{X}(S_T):\, \mathrm{E}[\|\underline{X}(S_T)\|^2] \le P}\, I\big( \underline{X};\, \mathsf{H}\underline{X} + \underline{Z} \,\big|\, \mathsf{H}, S_T \big) = \max_{\mathsf{Q}(S_T):\, \mathrm{E}[\mathrm{tr}\,\mathsf{Q}(S_T)] \le P}\, \mathrm{E}\Big[ \log\det\Big( \mathrm{I} + \mathsf{H}\, \mathsf{Q}(S_T)\, \mathsf{H}^\dagger \Big) \Big].$$
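For the special case of no CSIT (constant $f_T$) and i.i.d. $\mathcal{CN}(0,1)$ entries of $\mathsf{H}$, it is well known that the isotropic input $\mathsf{Q} = (P/M)\,\mathrm{I}$ maximizes (189), so the capacity becomes $\mathrm{E}[\log\det(\mathrm{I} + (P/M)\,\mathsf{H}\mathsf{H}^\dagger)]$. A minimal Monte Carlo sketch (dimensions and $P$ chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
M = N = 4
P = 10.0
n = 20_000

def crandn(*shape):
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

H = crandn(n, N, M)                               # i.i.d. Rayleigh fading realizations
A = np.eye(N) + (P / M) * H @ H.conj().transpose(0, 2, 1)
C = np.mean(np.linalg.slogdet(A)[1])              # E[log det(I + (P/M) H H^dagger)]
print(f"ergodic {N}x{M} capacity, no CSIT: {C:.2f} nats/use")
```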

6.4. Full CSIR, Partial CSIT

Consider first the full CSIR $S_R = H\sqrt{P(S_T)}$ and then the less informative $S_R = H$.
$S_R = H\sqrt{P(S_T)}$: We have the following capacity result, which implies this CSIR is the best possible since one can achieve the same rate as if the receiver sees both $H$ and $S_T$; see the first step of (162). We could thus have classified this model as CSIT@R.
Proposition 2
(see ([10], Proposition 3)). The capacity of the channel (159) with $S_R = H\sqrt{P(S_T)}$ and general $S_T$ is
$$C(P) = \max_{P(S_T):\, \mathrm{E}[P(S_T)] \le P}\, \int_{\mathbb{C}} p(s_R)\, \log\big( 1 + |s_R|^2 \big)\, \mathrm{d}s_R = \max_{P(S_T):\, \mathrm{E}[P(S_T)] \le P}\, \mathrm{E}\big[\, \log\big( 1 + G\, P(S_T) \big) \,\big].$$
Proof. 
Achievability follows by Theorem 1 with Remark 51. The converse is given by Lemma 6. □
Remark 56.
Proposition 2 gives an upper bound and (thus) a target rate when the receiver has partial CSIR. For example, we will use the K-partition idea of Lemma 1 (see also Remark 46) to approach the upper bound for large SNR.
Remark 57.
Proposition 2 partially generalizes to block-fading channels; see Proposition 3 in Section 9.5.
$S_R = H$: The capacity is (138) with
$$I(A;\, Y \,|\, H) = \mathrm{E}\!\left[ \log \frac{p(Y \,|\, A, H)}{p(Y \,|\, H)} \right]$$
where $\mathrm{E}[\,|X|^2\,] \le P$ and where
$$p(y \,|\, a, h) = \int_{\mathbb{C}} p(s_T \,|\, h)\; \frac{e^{-|y - h\, x(s_T)|^2}}{\pi}\; \mathrm{d}s_T$$
and
$$p(y \,|\, h) = \int_{\mathbb{C}} p(s_T \,|\, h) \int_{\mathcal{A}} p(a)\, p(y \,|\, a, h, s_T)\, \mathrm{d}a\, \mathrm{d}s_T = \int_{\mathbb{C}} p(s_T \,|\, h) \int_{\mathbb{C}} p(x(s_T))\; \frac{e^{-|y - h\, x(s_T)|^2}}{\pi}\; \mathrm{d}x(s_T)\, \mathrm{d}s_T.$$
For example, if each entry $X(s_T)$ of $A$ is CSCG with variance $P(s_T)$ then
$$p(y \,|\, h) = \int_{\mathbb{C}} p(s_T \,|\, h)\; \frac{\exp\!\big( -|y|^2 / (1 + g\, P(s_T)) \big)}{\pi\, \big( 1 + g\, P(s_T) \big)}\; \mathrm{d}s_T.$$
In general, one can compute I ( A ; Y | H ) numerically by using (190)–(192), but the calculations are hampered if the integrals in (191) and (192) do not simplify.
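For a finite CSIT alphabet, the integrals in (191) and (192) become sums, and (190) can be estimated by Monte Carlo. The sketch below is an illustration (not the paper's setup): on-off fading with $S_R = H$, a hypothetical noisy one-bit CSIT that equals the on/off state with probability $1-\epsilon$, and independent CSCG entries $X(0), X(1)$ with powers $P(0), P(1)$; all numerical values are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400_000
eps = 0.1                                   # hypothetical CSIT error probability
P_pow = np.array([2.0, 18.0])               # P(0), P(1); average power 10
h_vals = np.array([0.0, np.sqrt(2.0)])      # on-off fading, S_R = H

B = rng.integers(0, 2, n)                   # true on/off state index, H = h_vals[B]
flip = rng.random(n) < eps
T = np.where(flip, 1 - B, B)                # noisy one-bit CSIT S_T
H = h_vals[B]
G = H ** 2

# independent CSCG entries X(0), X(1) of the adaptive symbol A
Xa = np.sqrt(P_pow / 2) * (rng.standard_normal((n, 2)) + 1j * rng.standard_normal((n, 2)))
X = Xa[np.arange(n), T]                     # transmitted symbol X(S_T)
Z = (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)
Y = H * X + Z

# p(s_T | h): probability that the CSIT reports t given the true state b
p_t_given_b = np.array([[1 - eps, eps], [eps, 1 - eps]])
w = p_t_given_b[B]                          # shape (n, 2)

# cf. (191): p(y | a, h) = sum_t p(t|h) exp(-|y - h x(t)|^2) / pi
p_y_ah = np.sum(w * np.exp(-np.abs(Y[:, None] - H[:, None] * Xa) ** 2) / np.pi, axis=1)
# cf. (192): p(y | h) = sum_t p(t|h) exp(-|y|^2/(1+g P(t))) / (pi (1+g P(t)))
den = 1 + G[:, None] * P_pow[None, :]
p_y_h = np.sum(w * np.exp(-np.abs(Y[:, None]) ** 2 / den) / (np.pi * den), axis=1)

print(f"I(A;Y|H) ~ {np.mean(np.log(p_y_ah / p_y_h)):.3f} nats/use")
```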
For the reverse model GMI (139), the averaging density in (164) and (165) is here
$$p(h, s_T \,|\, y, s_R) = \delta(h - s_R)\; \frac{p(s_T \,|\, h)\, p(y \,|\, h, s_T)}{p(y \,|\, h)}.$$
We use numerical integration to compute the GMI.
To obtain more insight, we state the forward model rates of Theorem 1 and Remark 51 as a Corollary.
Corollary 1.
An achievable rate for the fading channels (159) with S R = H and partial CSIT is the forward model GMI
I 1 ( A ; Y | H ) = E log 1 +