Amplitude Constrained MIMO Channels: Properties of Optimal Input Distributions and Bounds on the Capacity †

In this work, the capacity of multiple-input multiple-output channels that are subject to constraints on the support of the input is studied. The paper consists of two parts. The first part focuses on the general structure of capacity-achieving input distributions. Known results are surveyed and several new results are provided. With regard to the latter, it is shown that the support of a capacity-achieving input distribution is a small set in both a topological and a measure-theoretic sense. Moreover, explicit conditions on the channel input space and the channel matrix are found such that the support of a capacity-achieving input distribution is concentrated on the boundary of the input space only. The second part of this paper surveys known bounds on the capacity and provides several novel upper and lower bounds for channels with arbitrary constraints on the support of the channel input symbols. As an immediate practical application, the special case of multiple-input multiple-output channels with amplitude constraints is considered. The bounds are shown to be within a constant gap of the capacity if the channel matrix is invertible and to be tight in the high-amplitude regime for arbitrary channel matrices. Moreover, in the regime of high amplitudes, it is shown that the capacity scales linearly with the minimum of the number of transmit and receive antennas, similar to the case of average power-constrained inputs.


Introduction
While the capacity of a multiple-input multiple-output (MIMO) channel with an average power constraint is well understood [1], there is surprisingly little known about the capacity of the more practically relevant case in which the channel inputs are subject to amplitude constraints. Shannon was the first to consider a channel that is constrained in its amplitude [2]. In that paper, he derived corresponding upper and lower bounds and showed that, in the low-amplitude regime, the capacity behaves like that of a channel with an average power constraint. The next major contribution to this problem was a seminal paper of Smith [3] published in 1971. Smith showed that, for the single-input single-output (SISO) Gaussian noise channel with an amplitude-constrained input, the capacity-achieving inputs are discrete with finite support. In [4], this result is extended to peak-power-constrained quadrature Gaussian channels. Using the approach of Shamai [4], it is shown in [5] that the input distribution that achieves the capacity of a MIMO channel with an identity channel matrix and a Euclidean norm constraint on the input vector is discrete. Even though the optimal input distribution is known to be discrete, very little is known about the number or the optimal positions of the corresponding constellation points. To the best of our knowledge, the only case for which the input distribution is precisely known is considered in [6], where it is shown for the Gaussian SISO channel with an amplitude constraint that two mass points are optimal for amplitude values smaller than 1.665 and three for amplitude values up to 2.786. Finally, it has been shown very recently that the number of mass points in the support of the capacity-achieving input distribution of a SISO channel is of order O(A^2), with A the amplitude constraint.
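Smith's small-amplitude regime is easy to probe numerically. The following sketch is our own illustration (unit-variance noise assumed, function names hypothetical): it computes the mutual information achieved by the two-mass-point input X ∈ {−A, +A} and compares it with the average-power capacity (1/2) log₂(1 + A²):

```python
import numpy as np

def two_point_mi(A, y_lim=12.0, dy=1e-3):
    """I(X;Y) in bits for Y = X + Z, Z ~ N(0,1), X uniform on {-A, +A}
    (by Smith's result, this input is capacity-achieving for A < 1.665)."""
    y = np.arange(-y_lim, y_lim, dy)
    phi = lambda t: np.exp(-t ** 2 / 2) / np.sqrt(2 * np.pi)
    f_y = 0.5 * phi(y - A) + 0.5 * phi(y + A)        # output density
    h_y = -np.sum(f_y * np.log2(f_y + 1e-300)) * dy  # differential entropy h(Y)
    return h_y - 0.5 * np.log2(2 * np.pi * np.e)     # I(X;Y) = h(Y) - h(Z)

A = 1.0
mi = two_point_mi(A)                 # rate of the amplitude-constrained input
awgn = 0.5 * np.log2(1 + A ** 2)     # capacity under average power P = A^2
```

For A = 1, the two-point input already comes close to the average-power capacity of 0.5 bits, in line with Shannon's observation for the low-amplitude regime.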
Based on a dual capacity expression, McKellips derived in [7] an upper bound on the capacity of a SISO channel that is subject to an amplitude constraint. The bound is asymptotically tight; that is, for amplitude values that tend to infinity. By cleverly choosing an auxiliary channel output distribution in the dual capacity expression, the authors of [8] sharpened McKellips' upper bound and extended it to parallel MIMO channels with a Euclidean norm constraint on the input. The SISO version of the upper bound in [8] has been further sharpened in [9] by yet another choice of auxiliary output distribution. In [10], asymptotic lower and upper bounds for a 2 × 2 MIMO channel are presented and the gap between the bounds is specified.
In this work, we make progress on this open problem by deriving several new upper and lower bounds that hold for channels with arbitrary constraints on the support of the channel input distribution and then apply them to the practically relevant special case of MIMO channels that are subject to amplitude constraints.

Contributions and Paper Organization
The remainder of the paper is organized as follows. The problem is stated in Section 2. In Section 3, we study properties of input distributions that achieve the capacity of input-constrained MIMO channels. The section reviews known results on the structure of optimal input distributions and presents several new results. In particular, Theorem 3 shows that the support of a capacity-achieving input distribution must necessarily be a small set both topologically and measure theoretically. Moreover, Theorem 8 characterizes conditions on the channel input space as well as on the channel matrix such that the support of the optimal input distribution is concentrated on the boundary of the channel input space.
In Section 4, we derive novel upper and lower bounds on the capacity of a MIMO channel that is subject to an arbitrary constraint on the support of the input. In particular, three families of upper bounds are proposed, which are based on: (i) the maximum entropy principle (see the bound in Theorem 9); (ii) the dual capacity characterization (see the bound in Theorem 10); and (iii) a relationship between mutual information and the minimum mean square error that is known as the I-MMSE relationship (see the bound in Theorem 11). On the other hand, Section 4 provides three different lower bounds. The first one is given in Theorem 12 and is based on the entropy power inequality. The second one (see Theorem 13) is based on a generalization of the celebrated Ozarow-Wyner bound [11] to the MIMO case. The third lower bound (see Theorem 14) is based on Jensen's inequality and depends on the characteristic function of the channel input distribution.
In Section 5, we evaluate the performance of our bounds by studying MIMO channels with invertible channel matrices. In particular, Theorem 17 states that our upper and lower bounds are within n log₂(ρ) bits, where ρ is the packing efficiency and n the number of transmit and receive antennas. For diagonal channel matrices, it is then shown (see Theorem 18) that the Cartesian product of simple pulse-amplitude modulation (PAM) constellations achieves the capacity to within 1.64n bits.
Section 6 is devoted to MIMO channels with arbitrary channel matrices. It is shown that, in the regime of high amplitudes, similar to the case of average power-constrained channel inputs, the capacity scales linearly with the minimum of the number of transmit and receive antennas.
In Section 7, our upper and lower bounds are applied to the SISO case, which are then compared with bounds known from the literature. Finally, Section 8 concludes the paper. Note that parts of the results in this paper were also published in [12].

Notation
Vectors are denoted as bold lowercase letters, random vectors as bold uppercase letters, and matrices as bold uppercase sans serif letters (e.g., x, X, X). For any deterministic vector x ∈ R^n, n ∈ N, we denote the Euclidean norm of x by ‖x‖. For some random X ∈ supp(X) ⊆ R^n and any p > 0, we define

‖X‖_p := (E[‖X‖^p])^(1/p), (1)

where supp(X) denotes the support of X. Note that for p ≥ 1, the quantity in Equation (1) is a norm. For any matrix H, H^T denotes its transpose, whereas Tr(H) denotes its trace. The n × n identity matrix is represented as I_n. Let S be a subset of R^n. Then, Vol(S) := ∫_S dx denotes its volume. Moreover, the boundary of S is denoted as ∂S.
We define an n-dimensional ball of radius r ∈ R_+ centered at x ∈ R^n as the set B_x(r) := {y ∈ R^n : ‖x − y‖ ≤ r}.
Recall that, for any x ∈ R^n and r ∈ R_+,

Vol(B_x(r)) = π^(n/2) r^n / Γ(n/2 + 1),

where Γ(z) denotes the gamma function. For any matrix H ∈ R^{k×n} and some S ⊂ R^n, we define HS := {y ∈ R^k : y = Hx, x ∈ S}.
Note that for an invertible H ∈ R^{n×n}, we have Vol(HS) = |det(H)| Vol(S), with det(H) the determinant of H. We define the maximum and minimum radius of a set S ⊂ R^n that contains the origin as

r_max(S) := max_{x ∈ S} ‖x‖ and r_min(S) := max{r ∈ R_+ : B_0(r) ⊆ S}.

For a given vector a = (a_1, . . . , a_n) ∈ R^n_+, we define

Box(a) := {x ∈ R^n : |x_i| ≤ a_i, i = 1, . . . , n}.

The entropy of any discrete random object X is denoted as H(X), whereas h(X) (i.e., the differential entropy) is used whenever X is continuous. The mutual information between two random objects X and Y is denoted as I(X; Y), and N(m, C) denotes the multivariate normal distribution with mean vector m and covariance matrix C. Finally, log⁺_a(x) := max{log_a(x), 0} for any base a > 0, Q(x), x ∈ R, denotes the Q-function, and δ_x(y) the Kronecker delta, which is one for x = y and zero otherwise.
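For concreteness, the geometric quantities above can be evaluated in a few lines of code; a minimal sketch (helper names are ours; the radii of Box(a) follow from elementary geometry, since the farthest point of the box is a corner and the largest inscribed ball touches the nearest face):

```python
import math

def ball_volume(n, r):
    """Vol(B_x(r)) = pi^(n/2) * r^n / Gamma(n/2 + 1)."""
    return math.pi ** (n / 2) * r ** n / math.gamma(n / 2 + 1)

def box_radii(a):
    """r_max and r_min of Box(a): corner distance and nearest-face distance."""
    r_max = math.sqrt(sum(ai ** 2 for ai in a))   # attained at a corner of the box
    r_min = min(a)                                # largest ball inside the box
    return r_max, r_min
```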

Problem Statement
Consider a MIMO system with n_t ∈ N transmit and n_r ∈ N receive antennas. The corresponding n_r-dimensional channel output for a single channel use is of the form

Y = HX + Z

for some fixed channel matrix H ∈ R^{n_r×n_t} (considering a real-valued channel model is without loss of generality). Here and hereafter, we assume Z ∼ N(0, I_{n_r}) is independent of the channel input X ∈ R^{n_t} and H is known to both the transmitter and the receiver. Now, let X ⊂ R^{n_t} be a convex and compact channel input space that contains the origin (i.e., the length-n_t zero vector) and let F_X denote the cumulative distribution function of X. The capacity of this channel is given by

C(X, H) := max_{F_X : X ∈ X} I(X; HX + Z). (2)

As of the writing of this paper, the capacity of this channel is unknown and we are interested in finding novel lower and upper bounds. Even though most of the results in this paper hold for arbitrary convex and compact X, we are mainly interested in two important special cases: (i) per-antenna amplitude constraints, i.e., X = Box(a) for some given a = (A_1, . . . , A_{n_t}) ∈ R^{n_t}_+; and (ii) an n_t-dimensional amplitude constraint, i.e., X = B_0(A) for some given A ∈ R_+.

Remark 1. Note that determining the capacity of a MIMO channel with average per-antenna power constraints is also still an open problem and has been solved for some special cases only [13][14][15][16][17].
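A single channel use of the model above is straightforward to simulate; the following sketch uses illustrative dimensions and constraint values (n_t = 2, n_r = 3) and samples inputs from both special cases (i) and (ii):

```python
import numpy as np

rng = np.random.default_rng(0)
nt, nr = 2, 3
H = rng.standard_normal((nr, nt))    # fixed channel matrix, known to both ends

# (i) per-antenna amplitude constraints: X in Box(a)
a = np.array([1.0, 2.0])
X_box = rng.uniform(-a, a)

# (ii) n_t-dimensional amplitude constraint: X in B_0(A), sampled uniformly
A = 1.5
v = rng.standard_normal(nt)
X_ball = A * rng.uniform() ** (1 / nt) * v / np.linalg.norm(v)

Z = rng.standard_normal(nr)          # Z ~ N(0, I_nr), independent of X
Y = H @ X_box + Z                    # one channel use
```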

Properties of an Optimal Input Distribution
Unlike the special cases of real and complex-valued SISO channels (i.e., n_t = n_r = 1), the structure of the capacity-achieving input distribution, denoted as F_X, is in general not known. To motivate why in this paper we seek novel upper and lower bounds on the capacity (Equation (2)) instead of trying to solve the optimization problem directly, in this section we first summarize properties that optimal input distributions must possess, which demonstrate how complicated the optimization problem actually is. Note that, whereas an optimal input distribution always exists, it does not necessarily need to be unique.

Necessary and Sufficient Conditions for Optimality
To study properties of an optimal input distribution, we need the notion of a point of increase of a probability distribution.

Definition 1. (Points of Increase of a Distribution)
A point x ∈ R n , n ∈ N, is said to be a point of increase of a given probability distribution F X if for any open set A ⊂ R n containing x, F X (A) > 0.
The following result provides necessary and sufficient conditions for the optimality of a channel input distribution.

Theorem 1. Let F_X be some given channel input distribution and let E(F_X) ⊂ X denote the set of points of increase of F_X. Then, the following holds:

• F_X is capacity-achieving if and only if the Karush-Kuhn-Tucker (KKT) conditions [3,18]

h(x; F_X) ≤ C(X, H) for all x ∈ X, (3a)
h(x; F_X) = C(X, H) for all x ∈ E(F_X), (3b)

are satisfied, where

h(x; F_X) := ∫_{R^{n_r}} f_{Y|X}(y|x) log₂( f_{Y|X}(y|x) / f_Y(y) ) dy,

with f_Y being the probability density of the channel output induced by the channel input X ∼ F_X.
• F_X is unique and symmetric if H is left invertible [18].
• F_Y (i.e., the channel output distribution) is unique [18,19].
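The KKT conditions can be verified numerically in the scalar case. The sketch below is our own illustration: it takes h(x; F_X) to be the marginal information density ∫ f_{Y|X}(y|x) log₂(f_{Y|X}(y|x)/f_Y(y)) dy and uses the input X uniform on {−1, +1}, which is known to be optimal for A = 1 < 1.665, so the conditions should hold up to numerical precision:

```python
import numpy as np

A, dy = 1.0, 1e-3
y = np.arange(-10.0, 10.0, dy)
phi = lambda t: np.exp(-t ** 2 / 2) / np.sqrt(2 * np.pi)  # N(0,1) density
f_y = 0.5 * phi(y - A) + 0.5 * phi(y + A)                 # output density f_Y

def h_marginal(x):
    """Marginal information density h(x; F_X) in bits, by numerical integration."""
    f = phi(y - x)
    return np.sum(f * np.log2((f + 1e-300) / (f_y + 1e-300))) * dy

C = 0.5 * (h_marginal(A) + h_marginal(-A))   # = I(X; Y) for this input
grid = np.linspace(-A, A, 41)
vals = [h_marginal(x) for x in grid]         # should never exceed C on X
```

Equality holds at the two mass points, while h(x; F_X) stays below C in the interior, as the KKT conditions require for an optimal input.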

General Structure of Capacity-Achieving Input Distributions
Theorem 1 can be used to find general properties of the support of a capacity-achieving input distribution, which we will do in this subsection.

Remark 2.
Fully characterizing an input distribution that achieves the capacity of a general MIMO channel with per-antenna or an n_t-dimensional amplitude constraint is still an open problem. To the best of our knowledge, the only general method available for showing that discrete channel inputs are optimal was developed by Smith in [3] for the amplitude- and variance-constrained Gaussian SISO channel. Since then, it has also been used to characterize the optimal input distribution of several other SISO channels (see, for instance, [4,[20][21][22][23][24]). The method relies on the following series of steps:

1.
Towards a contradiction, it is assumed that the set of points of increase E (F X ) is infinite.

2.
The assumption in Step 1 is then used to establish a certain property of the function h(x; F_X) on the input space X, for example, by showing that h(x; F_X) has an analytic continuation to C. Then, by means of the Identity Theorem of complex analysis and the Bolzano-Weierstrass Theorem [25], Smith was able to show that h(x; F_X) must be constant.

3.
By using either the Fourier or the Laplace transform of h(x; F_X) together with the property of h(x; F_X) established in Step 2, a new property of the channel output distribution F_Y is established. For example, Smith was able to show that F_Y must be constant.

4.
A conclusion from Step 3 is used to reach a contradiction. The contradiction implies that E(F_X) must be finite. For example, to reach a contradiction, Smith used the fact that the channel output distribution F_Y results from a convolution with a Gaussian probability density and therefore cannot be constant.

Remark 3.
Under the restriction that the output space, Y, of a Gaussian SISO channel is finite and the channel input space, X , is subject to an amplitude constraint, Witsenhausen has shown in [26] that the capacity-achieving input distribution is discrete with the number of mass points bounded as |X | ≤ |Y |. The approach of Witsenhausen, however, does not use the variational technique of Smith and relies on arguments from convex analysis instead.
According to Remark 2, assuming in the MIMO case that E(F_X) is of infinite cardinality does not help (or at least it is not clear how this assumption should be used) in showing that the capacity-achieving input distribution is discrete and finite. However, by using the weaker assumption that E(F_X) contains a non-empty open subset in conjunction with the following version of the Identity Theorem, we can show that the support of the optimal input distribution is a small set in a certain topological sense.

Theorem 2. (Identity Theorem for Real-Analytic Functions [27]) For some n ∈ N, let U be a subset of R^n and f, g : U → R be two real-analytic functions that agree on a set A ⊆ U. Then, f and g agree on R^n if one of the following two conditions is satisfied:

(i) A contains a non-empty open subset of U;
(ii) A is of positive Lebesgue measure.

Furthermore, for n = 1, it suffices for A to be an arbitrary set with an accumulation point.
We also need the definitions of a dense and a nowhere dense set.

Definition 2.
(Dense and Nowhere Dense Sets) A subset A ⊂ X is said to be dense in the set X if every element x ∈ X either belongs to A or is an accumulation point of A. A subset A ⊂ X is said to be nowhere dense if for every nonempty open subset U ⊂ X , the intersection A ∩ U is not dense in U .
With Theorem 2 at our disposal, we are now able to prove the following result on the structure of the support of the optimal input distribution. Theorem 3. The set of points of increase E (F X ) of an optimal input distribution F X is a nowhere dense subset of X that is of Lebesgue measure zero.
Proof. It is not difficult to show that h(x; F X ) is a real-analytic function on R n t ([18] Proposition 5). Now, in order to prove the result, we follow a series of steps similar to those outlined in Remark 2. Towards a contradiction, assume that the set of points of increase E (F X ) of F X is not a nowhere dense subset of X . Then, according to Definition 2, there exists an open set U ⊂ X such that E (F X ) ∩ U is dense in U .
By using the KKT condition in Equation (3b), we have that h(x; F X ) is constant on the intersection E (F X ) ∩ U . Thus, as E (F X ) ∩ U is dense in U , it follows by the properties of continuous functions (real-analytic functions are continuous) that h(x; F X ) is also constant on U . Moreover, as U is an open set, Theorem 2 implies that h(x; F X ) must also be constant on R n t . This, however, leads to a contradiction as h(x; F X ) cannot be constant on all of R n t , which can be shown by taking the Fourier transform of h(x; F X ) and solving for the probability density f Y (y) of the channel output (the reader is referred to [3] for details). Therefore, we conclude that E (F X ) is a nowhere dense subset of X .
Showing that E (F X ) has Lebesgue measure zero follows along similar lines by assuming that E (F X ) is a set of positive measure. Then, Property (ii) of Theorem 2 can be used to conclude that h(x; F X ) must be zero on all of U . This again leads to a contradiction, which implies that E (F X ) must be of Lebesgue measure zero.

Remark 4.
Note that if X = B_0(A) for some A ∈ R_+ and h(x; F_X) is orthogonally equivariant (i.e., it only depends on ‖x‖), then E(F_X) can be written as a union of concentric spheres. That is,

E(F_X) = ∪_{j ∈ I} C(A_j), (4)

with C(A_j) := {x ∈ R^{n_t} : ‖x‖ = A_j} for some A_j ∈ R_+. To see this, let

g(x) := h(x; F_X) − C(X, H)

and observe that if x ∈ E(F_X), then g(x) = 0 by the KKT condition in Equation (3b). Combining this with the symmetry of the function x ↦ g(‖x‖), we have Equation (4), where I is possibly of infinite cardinality. (We know that it is an abuse of notation to use the same letter for the functions x ↦ g(x) and x ↦ g(‖x‖) even if they are different; it is an attempt to say in a compact way that g is orthogonally equivariant.) In fact, I has finite cardinality. To see this, note that, if g(x) is real-analytic, then so is g(‖x‖). However, as r ↦ g(r) is a non-zero real-analytic function on R, it can have at most finitely many zeros on an interval. As an example, consider the special case n_r = n_t = n with H = I_n. Then, the union in Equation (4) implies that the cardinality of E(F_X) is uncountable and that discrete inputs are in general not optimal. Therefore, Theorem 3 can generally not be improved in the sense that, for n > 1, statements about the cardinality of E(F_X) cannot be made. Note, however, that the magnitude ‖X‖ of X is discrete. An example of the corresponding optimal input distribution for the case of n = 2 is given in Figure 1.

Figure 1. An example of the support of an optimal input distribution for the special case n_t = n_r = n = 2.
Even though Theorem 3 does not allow us to conclude that the optimal input distribution of an arbitrary MIMO channel is discrete and finite, for the special case of a SISO channel we have the following partial result.

Theorem 4. (Optimal Input Distribution of a SISO Channel [3,6,28]) For some fixed h ∈ R and A ∈ R_+, consider the SISO channel Y = hX + Z with input space X = [−A, A]. Let F_X be an input distribution that achieves the capacity, C(X, h), of that channel. Then, F_X is discrete with finitely many mass points.

Theorem 4 can now be used to also address the special cases of multiple-input single-output (MISO) and single-input multiple-output (SIMO) channels.

Theorem 5. Let Y = h^T X + Z be a MISO channel with channel matrix h^T ∈ R^{n_t} and some optimal input X* ∈ X ⊂ R^{n_t}. Then, the distribution of h^T X* is discrete with finitely many mass points. On the other hand, let Y = hX + Z be a SIMO channel with channel matrix h ∈ R^{n_r}. Then, the optimal input X ∈ X ⊂ R has a discrete distribution with finitely many mass points.

Proof. For the MISO case, the capacity can be expressed as

C(X, h^T) = max_{F_S : S = h^T X, X ∈ X} I(S; S + Z). (5)

Using Theorem 4, we have that the maximizing distribution in Equation (5), F_S with S := h^T X, is discrete with finitely many mass points.

For the SIMO case, the channel input distribution is discrete as a SIMO channel can be transformed into a SISO channel. Thus, let A ∈ R_+ be finite. Then, the capacity of the SIMO channel can be expressed as

C(X, h) = max_{F_X : X ∈ X} I(X; ‖h‖X + Z). (6)

Again, it follows from Theorem 4 that the mutual information in Equation (6) is maximized by a channel input distribution, F_X, that is discrete with finitely many mass points. This concludes the proof.

Remark 5.
Note that in the MISO case, we do not claim F_X to be discrete with finitely many mass points. To illustrate the difficulty, let h^T = [1, −1] so that

h^T X = X_1 − X_2.

As X_1 and X_2 can be arbitrarily correlated, we cannot rule out cases in which X_1 = X − D and X_2 = X − 2D, with D a discrete random variable and X of arbitrary distribution. Then, h^T X = (X − D) − (X − 2D) = D is discrete, while clearly the distribution of the input X is not discrete.
Note that, in general, it can be shown that the capacity-achieving input distribution is discrete if the optimization problem in Equation (2) can be reformulated as an optimization over one-dimensional distributions. This, for example, has been done in [5] for parallel channels with a total amplitude constraint.
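The construction in Remark 5 can be verified by simulation; a sketch with an illustrative choice of distributions (X uniform on [0, 1], D uniform on {0, 1, 2, 3}):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
X = rng.uniform(0.0, 1.0, n)             # continuous "common part"
D = rng.integers(0, 4, n).astype(float)  # discrete random variable
X1, X2 = X - D, X - 2 * D                # correlated, neither component discrete
S = X1 - X2                              # effective input h^T X with h^T = [1, -1]
```

Although X_1 and X_2 are continuous, their difference h^T X = D takes only four values.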

Properties of Capacity-Achieving Input Distributions in the Small (But Not Vanishing) Amplitude Regime
In this subsection, we study properties of capacity-achieving input distribution in the regime of small amplitudes. To that end, we will need the notion of a subharmonic function.

Definition 3. (Subharmonic Function) Let f be a real-valued function that is twice continuously differentiable on an open set G ⊆ R^n. Then, f is called subharmonic on G if its Laplacian is nonnegative on G, i.e., if Δf(x) ≥ 0 for all x ∈ G.
We use the following theorem, which states that a subharmonic function always attains its maximum on the boundary of its domain.

Theorem 6. (Maximum Principle of Subharmonic Functions [29]) Let G ⊂ R^n be a connected open set. If f : G → R is subharmonic and attains a global maximum in the interior of G, then f is constant on G.
In addition to Theorem 6, we need the following result that has been proven in [30].

Lemma 1.
Let ℓ(y) denote the likelihood function of the output of a MIMO channel and let A_y denote the Hessian matrix of log_e ℓ(y). Then, the Laplacian (i.e., the trace of A_y) admits the closed form derived in [30].

Theorem 7. Suppose that r_max^2(HX) ≤ log₂(e). Then, x ↦ h(x; F_X) is a subharmonic function for every F_X.
Proof. Let F_X be arbitrary and observe that h(x; F_X) can be expressed in terms of the likelihood function of Lemma 1. With this expression in hand, the Laplacian of h(x; F_X) with respect to x can be bounded from below as in Equation (8), where (a) follows from Equation (7) and the chain rule for the Hessian; (b) from the well-known inequality Tr(CD)² ≤ Tr(C)² Tr(D)², which holds for C and D positive semi-definite; and (c) from one further elementary inequality. Thus, according to the assumption that r_max^2(HX) ≤ log₂(e), the right-hand side of Equation (8) is nonnegative, which proves the result.

Now, knowing that h(x; F_X) is a subharmonic function allows us to characterize the support of an optimal input distribution of a MIMO channel, provided that the radius of the channel input space X is sufficiently small.

Theorem 8. Let F_X be a capacity-achieving input distribution and r_max^2(HX) ≤ log₂(e). Then, E(F_X) ⊆ ∂X.

Proof. From the KKT conditions in Equation (3), we know that, if x ∈ E(F_X), then x is a maximizer of h(x; F_X). According to Theorem 7, we also know that h(x; F_X) is subharmonic. Hence, from the Maximum Principle of Subharmonic Functions (i.e., Theorem 6), it follows that E(F_X) ⊆ ∂X.
Combining Theorem 8 with the observations made in Remark 4 leads to the following corollary.
We conclude this section by noting that for the special case n t = n r = n with H = I n , the exact value of A such that E (F X ) = C(A) has been characterized in terms of an integral equation in [31], which is approximately equal to 1.5 √ n.

Upper and Lower Bounds on the Capacity
The considerations in the previous section have shown that characterizing the structure of an optimal channel input distribution is a challenging question in itself that we could only partially answer. A full characterization, however, is a necessary prerequisite to narrow down the search space in Equation (2) to one that is tractable. Except for some special cases (i.e., special choices of X), optimizing over the most general space of input distributions, which consists of all n_t-dimensional probability distributions F_X with X ∈ X, is prohibitive. (Note that Dytso et al. [32] summarized methods of how to optimize functionals over the space of probability distributions that are constrained in their support.) Thus, as of the writing of this paper, there is little hope of being able to solve the problem in Equation (2) in full generality, so in this section we propose novel lower and upper bounds on the capacity C(X, H). Nevertheless, these bounds will allow us to better understand how the capacity of such MIMO channels behaves.
Towards this end, in Section 4.1, we provide four upper bounds. The first is based on an upper bound on the differential entropy of a random vector that is constrained in its pth moment; the second and third bounds are based on duality arguments; and the fourth on the relationship between mutual information and the minimum mean square error (MMSE), the I-MMSE relationship for short, known from [33]. The three lower bounds proposed in Section 4.2, on the other hand, are based on the celebrated entropy power inequality, on a generalization of the Ozarow-Wyner capacity bound taken from [11], and on Jensen's inequality.

Upper Bounds
To establish our first upper bound on Equation (2), we need the following result ( [11] Th. 1).

Lemma 2. (Maximum Entropy Under pth Moment Constraint) Let n ∈ N and p ∈ (0, ∞) be arbitrary. Then, for any U ∈ R^n such that h(U) < ∞ and ‖U‖_p < ∞, we have

h(U) ≤ n log₂( k_{n,p} ‖U‖_p ),

with k_{n,p} the explicit maximum-entropy constant given in [11].

Theorem 9. (Moment Upper Bound) For any channel input space X and any fixed channel matrix H, we have the following bound, where x̄ ∈ HX is chosen such that ‖x̄‖ = r_max(HX).
Proof. Expressing Equation (2) in terms of differential entropies results in a chain of inequalities in which (a) follows from Lemma 2 together with the fact that h(Z) = (n_r/2) log₂(2πe); and (b) from the monotonicity of the logarithm. Now, notice that E[‖HX + Z‖^p] is linear in F_X and therefore attains its maximum at an extreme point of the set F_X := {F_X : X ∈ X} (i.e., the set of all cumulative distribution functions of X). As a matter of fact [26], the extreme points of F_X are given by the set of degenerate distributions on X; that is, {F_X(y) = δ_x(y), y ∈ X}_{x ∈ X}. This allows us to conclude that

max_{F_X : X ∈ X} E[‖HX + Z‖^p] = max_{x ∈ X} E[‖Hx + Z‖^p]. (9)

Observe that the Euclidean norm is a convex function, which is therefore maximized at the boundary of the set HX. Combining this with Equation (9) and taking the infimum over p > 0 completes the proof.
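The extreme-point step of the proof can be illustrated for p = 2 in the scalar case, where E[(X + Z)²] = E[X²] + 1 is linear in F_X and hence maximized by a degenerate input at the boundary; a Monte Carlo sketch with illustrative values:

```python
import numpy as np

A = 2.0
rng = np.random.default_rng(3)
Z = rng.standard_normal(1_000_000)
m_boundary = np.mean((A + Z) ** 2)      # degenerate input at x = A (boundary)
X_unif = rng.uniform(-A, A, Z.size)     # a non-degenerate competitor
m_uniform = np.mean((X_unif + Z) ** 2)  # = E[X^2] + 1 = A^2/3 + 1 < A^2 + 1
```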
The following theorem, Theorem 10, provides two alternative upper bounds, given in Equations (10) and (11), that are based on duality arguments.
Proof. Using duality bounds, it has been shown in [8] that, for any centered n-dimensional ball of radius r ∈ R_+, the maximum max_{F_X : X ∈ B_0(r)} I(X; X + Z) is upper bounded as in Equation (12), where c_n(r) is an explicit constant specified in [8]. The bound in Equation (10) then follows from a chain of inequalities in which (a) follows from enlarging the optimization domain; and (b) from using the upper bound in Equation (12). This proves Equation (10).
To show the upper bound in Equation (11), we proceed with an alternative upper bound to Equation (13): where the (in)equalities follow from: (a) enlarging the optimization domain; (b) single-letterizing the mutual information; (c) choosing individual amplitude constraints (A 1 , . . . , A n r ) =: a ∈ R n r + such that Box(a) = Box(HX ); and (d) using the upper bound in Equation (12) for n = 1. This concludes the proof.
As mentioned at the beginning of the section, another simple technique for deriving upper bounds on the capacity is to use the I-MMSE relationship [33]

I(X; √γ X + Z) = (log₂(e)/2) ∫_0^γ E[ ‖X − E[X | √s X + Z]‖² ] ds. (14)

For any γ ≥ 0, the quantity E[‖X − E[X | √γ X + Z]‖²] is known as the MMSE of estimating X from the noisy observation √γ X + Z. An important fact that will be useful is that the conditional expectation E[X | √γ X + Z] is the best estimator in the sense that it minimizes the mean square error over all measurable functions f : R^{n_r} → R^{n_t}; that is, for any Y ∈ R^{n_r} and X ∈ R^{n_t},

E[ ‖X − E[X | Y]‖² ] ≤ E[ ‖X − f(Y)‖² ]. (15)

In the proof of Theorem 11, the (in)equalities follow from: (a) using that I(HX; HX + Z) = I(X; HX + Z) for any fixed H; (b) using the I-MMSE relationship in Equation (14); and (c) using the property that the conditional expectation minimizes the mean square error (i.e., Equation (15)). Now, notice that the right-hand side can be further bounded as in Equation (16), where (a) follows from max_{F_X : X ∈ X} E[‖HX‖²] = max_{x ∈ X} ‖Hx‖² (the same argument as used in the proof of Theorem 9); and (b) from the definition of r_max^2(HX). Since the suboptimal estimator f is arbitrary, we can choose it to minimize the upper bound in Equation (16). Towards this end, we need the optimization result in Equation (17), which is easy to show. Combining Equation (16) with Equation (17), we obtain an upper bound on the capacity of the form n_r/2 + (n_r/2) log₂(·), as stated in Theorem 11. This concludes the proof.
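The I-MMSE relationship in Equation (14) can be sanity-checked in the scalar Gaussian special case, where both sides are available in closed form: for X ∼ N(0, 1), mmse(s) = 1/(1 + s) and I(X; √γ X + Z) = (1/2) ln(1 + γ) nats. A sketch (working in nats to avoid the log₂(e) factor):

```python
import numpy as np

g = 3.0
s = np.linspace(0.0, g, 100_001)
mmse = 1.0 / (1.0 + s)                 # MMSE of X ~ N(0,1) at SNR s
ds = s[1] - s[0]
# trapezoidal rule for (1/2) * int_0^g mmse(s) ds
I_from_mmse = 0.5 * ds * (mmse[0] / 2 + mmse[-1] / 2 + mmse[1:-1].sum())
I_closed = 0.5 * np.log(1 + g)         # mutual information in nats
```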

Corollary 2.
For any channel input space X and any fixed channel matrix H, the following holds.

Proof. The corollary follows by upper bounding Equation (16) using the fact that r_max^2(HX) ≤ ‖H‖² r_max^2(X).
A further upper bound follows from the fact that a Gaussian input maximizes differential entropy under a covariance constraint:

C(X, H) ≤ max_{F_X : X ∈ X} (1/2) log₂ det( I_{n_r} + H K_X H^T ), (18)

with K_X the covariance matrix of the channel input. While Equation (18) is a valid upper bound, as of the writing of this paper, it is not clear how to perform an optimization over covariance matrices of random variables with bounded support. One possibility to avoid this is to use the inequality between the arithmetic and the geometric mean and bound the determinant by the trace:

det( I_{n_r} + H K_X H^T ) ≤ ( 1 + Tr(H K_X H^T)/n_r )^{n_r}. (19)

However, combining Equation (19) with Equation (18) is merely a special case of the moment upper bound of Theorem 9 for p = 2. Therefore, the estimators in Theorem 11 are chosen to obtain a non-trivial upper bound that avoids the optimization over covariance matrices.
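The determinant-trace step in Equation (19) rests on the AM-GM inequality for the eigenvalues of a positive semi-definite matrix, det(A) ≤ (Tr(A)/n)^n, which is easy to check numerically:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
B = rng.standard_normal((n, n))
A_mat = np.eye(n) + B @ B.T           # PSD matrix of the form I + H K H^T
det_A = float(np.linalg.det(A_mat))
amgm = (np.trace(A_mat) / n) ** n     # (Tr(A)/n)^n, the AM-GM upper bound
```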
In Section 5, we present a comparison of the upper bounds of Theorems 9-11 by means of a simple example.

Lower Bounds
A classical approach to bound a mutual information from below is to use the entropy power inequality (EPI).

Theorem 12. (EPI Lower Bound) For any channel input space X and any fixed channel matrix H,

C(X, H) ≥ max_{F_X : X ∈ X} (n_r/2) log₂( 1 + 2^{(2/n_r) h(HX)} / (2πe) ). (20)

Moreover, if n_t = n_r = n, H ∈ R^{n×n} is invertible, and X is uniformly distributed over X, then

C(X, H) ≥ (n/2) log₂( 1 + ( |det(H)| Vol(X) )^{2/n} / (2πe) ). (21)

Proof. By means of the EPI,

2^{(2/n_r) h(HX+Z)} ≥ 2^{(2/n_r) h(HX)} + 2^{(2/n_r) h(Z)},

which, together with h(Z) = (n_r/2) log₂(2πe), yields Equation (20).

To show the lower bound in Equation (21), all we need is to recall that h(HX) = h(X) + log₂ |det(H)|, which is maximized for X uniformly distributed over X. However, if X is uniformly drawn from X, we have h(X) = log₂ Vol(X). Inserting this into Equation (20) yields Equation (21).

The considerations in Section 3 suggest that a channel input distribution that maximizes Equation (2) might be discrete. Therefore, there is a need for lower bounds that, unlike the bounds in Theorem 12, rely on discrete inputs.

Theorem 13. (Ozarow-Wyner Type Lower Bound) Let X_D ∈ supp(X_D) ⊂ R^{n_t} be a discrete random vector of finite entropy, g : R^{n_r} → R^{n_t} a measurable function, and p > 0. Furthermore, let K_p be a set of continuous random vectors, independent of X_D, such that for every U ∈ K_p we have h(U) < ∞, ‖U‖_p < ∞, and the condition in Equation (22) holds for all x_i, x_j ∈ supp(X_D), i ≠ j. Then, the mutual information I(X_D; Y) is bounded from below as in the generalized Ozarow-Wyner bound, with the gap terms G_{1,p} and G_{2,p} defined in Equations (23) and (24), respectively, and k_{n_t,p} as defined in Lemma 2.
Proof. The proof is identical to that of ([11] Theorem 2). To make the manuscript more self-contained, we repeat it here. Let U and X_D be statistically independent. Then, the mutual information I(X_D; Y) can be lower bounded as in Equation (25). Here, (a) follows from the data processing inequality, as X_D + U → X_D → Y forms a Markov chain in that order; and (b) from the assumption in Equation (22). By using Lemma 2, the last term in Equation (25) can be bounded from above in terms of k_{n_t,p}. Combining this bound with Equation (25) results in the claimed lower bound, with G_{1,p} and G_{2,p} as defined in Equations (23) and (24), respectively. Maximizing the right-hand side over all U ∈ K_p, measurable functions g : R^{n_r} → R^{n_t}, and p > 0 proves the bound.
Interestingly, the bound of Theorem 13 holds for arbitrary channels and is therefore not restricted to MIMO channels. The interested reader is referred to [11] for details.
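Returning to the EPI-based bound of Theorem 12: for uniform inputs it is fully explicit and cheap to evaluate. The sketch below assumes the bound takes the standard EPI form C(X, H) ≥ (n/2) log₂(1 + (|det(H)| Vol(X))^{2/n}/(2πe)), which is our reading of Equation (21), and uses illustrative numbers:

```python
import numpy as np

def epi_lower_bound(H, vol_X):
    """Assumed form: (n/2) * log2(1 + (|det H| * Vol(X))^(2/n) / (2*pi*e))."""
    n = H.shape[0]
    eff = (abs(np.linalg.det(H)) * vol_X) ** (2.0 / n)
    return 0.5 * n * np.log2(1.0 + eff / (2.0 * np.pi * np.e))

H = np.diag([2.0, 0.5])          # invertible 2 x 2 example, |det H| = 1
a = np.array([3.0, 3.0])         # X = Box(a), Vol(X) = prod(2 * a_i) = 36
bound = epi_lower_bound(H, float(np.prod(2 * a)))
```

As expected, the bound grows when the volume of the input space (or |det H|) grows.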
We conclude the section by providing a lower bound that is based on Jensen's inequality and holds for arbitrary inputs.

Theorem 14. (Jensen's Inequality Lower Bound) For any channel input space X and any fixed channel matrix H, the capacity admits a lower bound in two equivalent versions, the second expressed via the characteristic function, where X′ is an independent copy of X and φ_X denotes the characteristic function of X.
Proof. To show the lower bound, we follow an approach of Dytso et al. [34]. Note that, by Jensen's inequality, we obtain the bound in Equation (27). Now, evaluating the integral in Equation (27) results in Equation (28), where (a) follows from the independence of X and X′ together with Tonelli's Theorem ([35] Chapter 5.9); (b) from completing a square; and (c) from the translation invariance of the Lebesgue integral, which gives

∫_{R^{n_r}} e^{−‖y − c‖²} dy = ∫_{R^{n_r}} e^{−‖y‖²} dy = π^{n_r/2}

for any shift c ∈ R^{n_r} (here depending on H(X − X′)). Combining Equation (27) with Equation (28) and subtracting h(Z) = (n_r/2) log₂(2πe) completes the proof of the first version of the bound.
To show the second version, observe that

where (e) follows from the property that the characteristic function of a sum of independent random vectors is equal to the product of their characteristic functions; (f) from the behavior of the characteristic function under a linear transformation; (g) from the fact that X and X̄ have the same characteristic function; and (h) from the fact that the characteristic function is Hermitian. This completes the proof.

Note that this bound is within (min(n_r, n_t)/2) log_2(2e) bits of the power-constrained capacity (1/2) log_2 det(I_{n_r} + H K_X H^T).
In Section 3, we discuss that the distributions that maximize mutual information in n t -dimensions are typically singular, which means that they are concentrated on a set of Lebesgue measure zero. Singular distributions generally do not have a probability density, whereas the characteristic function always exists. This is why the version of Jensen's inequality lower bound in Theorem 14 that is based on the characteristic function of the channel input is especially useful for amplitude-constrained MIMO channels.
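The Monte Carlo sketch below shows how a sampling-based variant of such a Jensen-type bound can be evaluated for an arbitrary input distribution, discrete or singular, since only samples of X enter the computation. The constant (n_r/2) log_2(2/e) comes from our own re-derivation of the same Jensen step (lower bounding h(Y) by the Rényi entropy of order 2, with unit-variance noise); it need not match the elided constants of Theorem 14 exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

def jensen_lower_bound(H, sampler, n_samples=200_000):
    """Monte Carlo evaluation of a Jensen-type lower bound
    C >= (n_r / 2) * log2(2 / e) - log2 E[exp(-||H(X - X')||^2 / 4)],
    where X' is an independent copy of X and the noise has unit variance."""
    n_r = H.shape[0]
    X, Xp = sampler(n_samples), sampler(n_samples)   # X and an independent copy
    d = (X - Xp) @ H.T                               # rows are H(X - X')
    mean_exp = np.exp(-np.sum(d ** 2, axis=1) / 4.0).mean()
    return max(0.0, (n_r / 2) * np.log2(2 / np.e) - np.log2(mean_exp))

# Example: 2x2 channel with inputs uniform over the box [-A, A]^2
H = np.array([[1.0, 0.3], [0.0, 0.8]])
A = 10.0
uniform_box = lambda m: rng.uniform(-A, A, size=(m, 2))
print(jensen_lower_bound(H, uniform_box))
```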

Invertible Channel Matrices
Consider the symmetric case of n_t = n_r = n antennas with H ∈ R^{n×n} invertible. In this section, we evaluate some of the lower and upper bounds proposed in the previous section for the special case in which H is also diagonal, and then characterize the gap to the capacity for arbitrary invertible channel matrices.

Diagonal Channel Matrices
Suppose the channel inputs are subject to per-antenna or an n-dimensional amplitude constraint. Then, the duality upper bound C_Dual,2(X, H) of Theorem 10 takes on the following form.

Theorem 15. (Upper Bounds)
Let H = diag(h_11, ..., h_nn) ∈ R^{n×n} be fixed. If X = Box(a) for some a = (A_1, ..., A_n) ∈ R^n_+, then the bound in Equation (29) holds. Moreover, if X = B_0(A) for some A ∈ R_+, then the bound in Equation (30) holds.

Proof. The bound in Equation (29) immediately follows from Theorem 10 by observing that Box(HBox(a)) = Box(Ha). The bound in Equation (30) follows analogously.

Theorem 16. (Lower Bounds)

Proof. For given values B_i ∈ R_+, i = 1, ..., n, let the components of X = (X_1, ..., X_n) be independent, with X_i uniformly distributed over the interval [−B_i, B_i]. Thus, the expected value appearing in the bound of Theorem 14 can be written as

Now, if X̄ is an independent copy of X, it can be shown that the expected value on the right-hand side of Equation (34) has the explicit form

with ϕ as defined in Equation (32). Finally, optimizing over all b = (B_1, ..., B_n) ∈ X results in the bound in Equation (31). The bound in Equation (33) follows by inserting |det(H)| = |∏_{i=1}^n h_ii| into Equation (21), which concludes the proof.
In Figure 2, the upper bounds of Theorems 9 and 15 and the lower bounds of Theorem 16 are depicted for a diagonal 2 × 2 MIMO channel with per-antenna amplitude constraints. It turns out that the moment upper bound and the EPI lower bound perform well in the small amplitude regime while the duality upper bound and Jensen's inequality lower bound perform well in the high amplitude regime. Interestingly, for this specific example, the duality upper bound and Jensen's lower bound are asymptotically tight.

Gap to the Capacity
Our first result provides an upper bound on the gap between the capacity in Equation (2) and the lower bound in Equation (21).

Theorem 17.
Let H ∈ R^{n×n} be of full rank and ρ(X, H) := Vol(B_0(r_max(HX))) / Vol(HX). Then,

Proof. For notational convenience, denote the volume of an n-dimensional ball of radius r > 0 by V_n(r) := Vol(B_0(r)) = V_n(1) r^n = π^{n/2} r^n / Γ(n/2 + 1). Now, observe that, by choosing p = 2, the upper bound of Theorem 9 can further be upper bounded as

where (a) follows since k_{n,2} = √(2πe/n); and (b) since E[‖Z‖^2] = n. Therefore, the gap between Equation (21) and the moment upper bound of Theorem 9 can be upper bounded as follows:

where (a) is due to the fact that r_max(HX) is the radius of an n-dimensional ball; (b) follows from the inequality (1 + cx)/(1 + x) ≤ c for c ≥ 1 and x ∈ R_+; and (c) follows from using Stirling's approximation for the Gamma function.

The term ρ(X, H) is referred to as the packing efficiency of the set HX. In the following proposition, we present the packing efficiencies for important special cases.

Proposition 1. (Packing Efficiencies)
Let H ∈ R^{n×n} be of full rank, A ∈ R_+, and a := (A_1, ..., A_n) ∈ R^n_+. Then, the packing efficiencies ρ(Box(a), I_n), ρ(B_0(A), H), and ρ(Box(a), H) are given by, or bounded as in, Equations (35)–(38).

Proof. The packing efficiency in Equation (35) follows immediately. Note that, since H is invertible, we have Vol(HB_0(A)) = |det(H)| Vol(B_0(A)), which results in Equation (36). To show Equation (37), observe that

The proof of Equation (37) is concluded by observing that Vol(I_n Box(a)) = ∏_{i=1}^n A_i. Finally, observe that Box(a) ⊂ B_0(‖a‖) implies r_max(HBox(a)) ≤ r_max(HB_0(‖a‖)), so that

which is the bound in Equation (38).
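For concreteness, the ball-volume formula from the proof of Theorem 17 and the packing efficiency of a box under the identity channel can be computed directly. Since Equation (35) itself is elided above, the normalization below (Box(a) = [−A_1, A_1] × ... × [−A_n, A_n], so Vol(Box(a)) = ∏_i 2A_i) is our own assumption.

```python
import math

def ball_volume(n, r):
    """V_n(r) = pi^(n/2) * r^n / Gamma(n/2 + 1), as in the proof of Theorem 17."""
    return math.pi ** (n / 2) * r ** n / math.gamma(n / 2 + 1)

def packing_efficiency_box_identity(a):
    """rho(Box(a), I_n) = Vol(B_0(r_max(Box(a)))) / Vol(Box(a)),
    using r_max(Box(a)) = ||a||_2 and Vol(Box(a)) = prod_i 2*A_i."""
    n = len(a)
    r = math.sqrt(sum(A ** 2 for A in a))
    return ball_volume(n, r) / math.prod(2 * A for A in a)

# Square box in n = 2 with A_1 = A_2 = 1: rho = (pi * 2) / 4 = pi / 2
print(packing_efficiency_box_identity([1.0, 1.0]))
```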
We conclude this section by characterizing the gap to the capacity when H is diagonal and the channel input space is the Cartesian product of n PAM constellations. In this context, PAM(N, A) refers to the set of N ∈ N equidistant PAM constellation points with amplitude constraint A ∈ R_+ (see Figure 3 for an illustration), whereas X ∼ PAM(N, A) means that X is uniformly distributed over PAM(N, A) [11].

Figure 3. Example of a pulse-amplitude modulation constellation with N = 4 points and amplitude constraint A (i.e., PAM(4, A)), where ∆ := A/(N − 1) denotes half the Euclidean distance between two adjacent constellation points. In the case N is odd, 0 is a constellation point.
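A PAM(N, A) constellation as defined above can be generated in one line; the helper below is purely illustrative.

```python
import numpy as np

def pam(N, A):
    """PAM(N, A): N equidistant points in [-A, A]; half the distance between
    adjacent points is Delta = A / (N - 1)."""
    return np.linspace(-A, A, N)

print(pam(4, 3.0))   # four points with Delta = 1; an odd N would include 0
```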
Proof. Since the channel matrix is diagonal, letting the channel input X be such that its elements X_i, i = 1, ..., n, are independent, we have that

and observe that half the Euclidean distance between any pair of adjacent points in PAM(N_i, A_i) is equal to ∆_i := A_i/(N_i − 1) (see Figure 3), i = 1, ..., n. To lower bound the mutual information I(X_i; h_ii X_i + Z_i), we use the bound of Theorem 13 with p = 2 and n_t = 1. Thus, for a continuous random variable U that is uniformly distributed over the interval [−∆_i, ∆_i) and independent of X_i, we have that

Now, note that the entropy term in Equation (41) can be lower bounded as

where we have used that ⌊x⌋ ≥ x/2 for every x ≥ 1. On the other hand, the last term in Equation (41) can be upper bounded by upper bounding its argument as follows:

where (a) follows from using that X_i and U are independent, and E[...]. Combining Equations (41), (42), and (43) results in the gap in Equation (39).
The proof of the capacity gap in Equation (40) follows along similar lines, which concludes the proof.
We are also able to determine the gap to the capacity for a general invertible channel matrix.

Theorem 19.
For any X and any invertible H,

Proof. Let X be uniformly distributed over a set of N points constructed from the shifted n-dimensional cubic lattice x̂ + Z^n, where x̂ ∈ HX is chosen such that ‖x̂‖_2 = r_max(HX), and scaled such that the set is contained in the input space X. Note that the minimum distance between points in X is given by

Now, we compute the difference between the moment upper bound of Theorem 9 and the Ozarow-Wyner lower bound of Theorem 13:

where (a) follows from Theorem 9 by choosing p = 2; and (b) from the bound ⌊x⌋ ≥ x/2 for x > 1. The next step in the proof consists of bounding the gap term, which requires upper bounding the terms in Equations (23) and (24) individually. Towards this end, choose p = 2 and let U be a random vector that is uniformly distributed over a ball of radius r = d_min(supp(X))/2. Thus, for Equation (23), it follows that

where (a) follows by choosing g(Y) = H^{−1}Y; (b) by using E[‖U‖_2^2] = n r^2/(n + 2), where r = d_min(supp(X))/2 is the radius of the n-dimensional ball; (c) from dropping the floor function in the expression for the minimum distance; (d) by expanding the squared norm using that ‖x̂‖_2 = r_max(HX); and (e) from using the bound

On the other hand, the term G_{2,p}(U) can be bounded from above as follows ([36] Appendix L):

Combining these two bounds with the one in Equation (44) provides the result.

Arbitrary Channel Matrices
For an arbitrary MIMO channel with an average power constraint, it is well known that the capacity is achieved by a singular value decomposition (SVD) of the channel matrix (i.e., H = UΛV^T) along with considering the equivalent channel model

To provide lower bounds for channels with amplitude constraints and SVD precoding, we need the following lemma.

Lemma 3.
For any given orthogonal matrix V ∈ R^{n_t×n_t} and constraint vector a = (A_1, ..., A_{n_t}) ∈ R^{n_t}_+, there exists a distribution F_X of X such that X̃ = V^T X is uniformly distributed over Box(a). Moreover, the components X̃_1, ..., X̃_{n_t} of X̃ are mutually independent, with X̃_i uniformly distributed over [−A_i, A_i], i = 1, ..., n_t.
Proof. Suppose that X̃ is uniformly distributed over Box(a); that is, the density of X̃ is of the form

Since V is orthogonal, we have VX̃ = X and, by the change-of-variables theorem,

for x ∈ VBox(a).
Therefore, such a distribution of X exists.
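Lemma 3 is easy to verify numerically: generate X̃ uniform over Box(a), set X = VX̃, and check that V^T X reproduces X̃ componentwise. The dimensions, amplitudes, and the randomly generated orthogonal V below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

n_t = 3
a = np.array([1.0, 2.0, 0.5])
V, _ = np.linalg.qr(rng.standard_normal((n_t, n_t)))  # a random orthogonal matrix
Xt = rng.uniform(-a, a, size=(100_000, n_t))          # Xt ~ Unif(Box(a)), rows = samples
X = Xt @ V.T                                          # X = V @ Xt, sample by sample
back = X @ V                                          # V^T X
print(np.abs(back - Xt).max())                        # numerically zero
```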

Theorem 20. (Lower Bounds with SVD Precoding)
Let H ∈ R n r ×n t be fixed, n min := min(n r , n t ), and X = Box(a) for some a = (A 1 , . . . , A n t ) ∈ R n t + . Furthermore, let σ i , i = 1, . . . , n min , be the ith singular value of H.
Proof. Performing the SVD, the expected value in Theorem 14 can be written as

By Lemma 3, there exists a distribution F_X such that the components of X̃ are independent and uniformly distributed. Since Λ is a diagonal matrix, we can use Theorem 16 to arrive at Equation (45). Note that by Lemma 3 there also exists a distribution on X such that X̃ is uniform over Box(a) ⊂ R^{n_t} and ΛX̃ is uniform over ΛBox(a) ⊂ R^{n_min}, respectively. Therefore, by the EPI lower bound given in Equation (20), we obtain

which is exactly the expression in Equation (46). This concludes the proof.
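Under the assumptions above, the EPI branch of the bound is straightforward to evaluate: X̃ uniform over Box(a) makes ΛX̃ uniform over a box with side lengths 2σ_iA_i, whose differential entropy plugs into the standard vector EPI lower bound I ≥ (n/2) log_2(1 + 2^{2h/n}/(2πe)). Since the exact form of Equation (46) is elided above, the constants here follow this standard form and should be read as a sketch.

```python
import math

def epi_lower_bound_svd(singular_values, a):
    """EPI-type lower bound with SVD precoding: Xt ~ Unif(Box(a)) gives
    h(Lambda Xt) = sum_i log2(2 * s_i * A_i) over the n_min active dimensions,
    and the vector EPI then yields (n_min/2) * log2(1 + 2^(2h/n_min)/(2*pi*e))."""
    s = [x for x in singular_values if x > 0]
    n_min = len(s)
    h = sum(math.log2(2 * si * Ai) for si, Ai in zip(s, a))  # in bits
    return (n_min / 2) * math.log2(1 + 2 ** (2 * h / n_min) / (2 * math.pi * math.e))

# 3x1 example in the spirit of Figure 4: a single nonzero singular value
print(epi_lower_bound_svd([2.0], [5.0]))
```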

Remark 8.
Notice that choosing the optimal b for the lower bound in Equation (45) is an amplitude allocation problem, which is reminiscent of waterfilling in the average power constraint case. It would be interesting to study whether the bound in Equation (45) is connected to what is called mercury/waterfilling in [37,38].
In Figure 4, the lower bounds of Theorem 20 are compared to the moment upper bound of Theorem 9 for the special case of a 3 × 1 MIMO channel. Similar to the example presented in Figure 2, the EPI lower bound performs well in the low amplitude regime, while Jensen's inequality lower bound performs well in the high amplitude regime.
We conclude this section by showing that for an arbitrary channel input space X , in the large amplitude regime the capacity pre-log is given by min(n r , n t ).
Theorem 21. Let X be arbitrary and H ∈ R^{n_r×n_t} fixed. Then, lim_{r_min(X)→∞} C(X, H) / log_2 r_min(X) = min(n_r, n_t).
Proof. Notice that there always exist a ∈ R^{n_t}_+ and c ∈ R_+ such that Box(a) ⊆ X ⊂ cBox(a). Thus, without loss of generality, we can consider X = Box(a), a = (A, ..., A), for sufficiently large A ∈ R_+. To prove the result, we therefore start by enlarging the constraint set of the bound in Equation (11): Box(HBox(a)) ⊆ B_0(r_max(HBox(a)))

where r := √(n_t) A (∑_{i=1}^{n_min} σ_i^2)^{1/2} and ā := (r/√(n_min), ..., r/√(n_min)) ∈ R^{n_min}_+. Therefore, by using the upper bound in Equation (11), it follows that

Moreover, lim_{A→∞} C(Box(a), H)

Next, using the EPI lower bound in Equation (46), we have that lim_{A→∞} C_EPI(Box(a), Λ)

This concludes the proof.

The SISO Case
In this section, we apply the upper and lower bounds presented in the previous sections to the special case of a SISO channel that is subject to an amplitude constraint (i.e., X = [−A, A] for some A ∈ R_+) and compare them with the state of the art. More precisely, we are interested in upper and lower bounds on the capacity

Without loss of generality, we assume h = 1 in all that follows.

Upper and Lower Bounds
As a starting point for our comparisons, the following theorem summarizes bounds on the capacity in Equation (47) that are known from the literature. The bounds are all based on the duality approach that we generalized in Section 4 to the MIMO case.
where W(A) := (1/2) log(e σ^2(A) + 1). Now, we apply the moment upper bound of Theorem 9 to the SISO case.
where the expected value has the explicit form

The proof is concluded by observing that the function f(a) defined above is increasing in a.
The following theorem establishes the EPI and the Jensen lower bound of Section 4.2 assuming the channel input symbols are uniformly distributed.
Proof. The lower bound in Equation (52) which concludes the proof.
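For reference, the uniform-input EPI lower bound in the SISO case has a simple closed form obtained by plugging h(X) = log_2(2A) into the EPI; whether its constants coincide exactly with those of Equation (52) cannot be checked against the excerpt above, so treat this as a sketch. The average-power capacity with P = A^2 serves as a simple upper reference, since |X| ≤ A implies E[X^2] ≤ A^2.

```python
import math

def siso_epi_lower_bound(A, sigma=1.0):
    """C >= (1/2) * log2(1 + 2 A^2 / (pi * e * sigma^2)) for X ~ Unif[-A, A],
    obtained by plugging h(X) = log2(2A) into the EPI lower bound."""
    return 0.5 * math.log2(1 + 2 * A ** 2 / (math.pi * math.e * sigma ** 2))

def awgn_capacity_power(A, sigma=1.0):
    """Capacity with an average power constraint P = A^2 (an upper reference)."""
    return 0.5 * math.log2(1 + A ** 2 / sigma ** 2)

for A in (0.5, 2.0, 10.0):
    print(A, round(siso_epi_lower_bound(A), 3), "<=", round(awgn_capacity_power(A), 3))
```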
Restricting the channel inputs to be discrete allows for another set of lower bounds on Equation (47).

High and Low Amplitude Asymptotics
In this subsection, we study how the capacity in Equation (47) behaves in the regimes of high and low amplitudes.

Conclusions
In this work, we studied the capacity of MIMO channels with bounded input spaces. Several new properties of input distributions that achieve the capacity of such channels have been provided. In particular, it was shown that the support of a capacity-achieving channel input distribution is a set that is small in both a topological and a measure-theoretic sense. In addition, it was shown that, if the radius of the underlying channel input space X is small enough, then the support of a corresponding capacity-achieving input distribution must necessarily be a subset of the boundary of X. As the considerations on the input distribution demonstrated that determining the capacity is a very challenging problem, we proposed several new upper and lower bounds, which are shown to be tight in the high amplitude regime. An interesting future direction would be to study generalizations of our techniques to wireless optical MIMO channels [42] and other channels such as the wiretap channel [43].