## 1. Introduction

The Brascamp-Lieb inequality and its reverse [1] concern the optimality of Gaussian functions in a certain type of integral inequality. (Not to be confused with the “variance Brascamp-Lieb inequality” (cf. [2,3,4]), which generalizes the Poincaré inequality.) These inequalities have been generalized in various ways since their discovery nearly 40 years ago. A modern formulation due to Barthe [5] may be stated as follows:

**Brascamp-Lieb Inequality and Its Reverse** ([5], Theorem 1).

Let E, ${E}_{1}$, …, ${E}_{m}$ be Euclidean spaces and ${\mathbf{B}}_{i}:E\to {E}_{i}$ be linear maps. Let ${({c}_{i})}_{i=1}^{m}$ and D be positive real numbers. Then, the Brascamp-Lieb inequality:

$$\int_{E}\prod_{i=1}^{m}{f}_{i}^{{c}_{i}}({\mathbf{B}}_{i}\mathbf{x})\,\mathrm{d}\mathbf{x}\le D\prod_{i=1}^{m}{\left(\int_{{E}_{i}}{f}_{i}\right)}^{{c}_{i}}\tag{1}$$

for all nonnegative measurable functions ${f}_{i}$ on ${E}_{i}$, $i=1,\dots ,m$, holds if and only if it holds whenever ${f}_{i}$, $i=1,\dots ,m$, are centered Gaussian functions (a centered Gaussian function is of the form $\mathbf{x}\mapsto exp(r-{\mathbf{x}}^{\top}\mathbf{A}\mathbf{x})$, where $\mathbf{A}$ is a positive semidefinite matrix and $r\in \mathbb{R}$). Similarly, for F a positive real number, the reverse Brascamp-Lieb inequality, also known as Barthe’s inequality (${\mathbf{B}}_{i}^{\ast}$ denotes the adjoint of ${\mathbf{B}}_{i}$):

$$\int_{E}\underset{\mathbf{x}={\sum}_{i=1}^{m}{c}_{i}{\mathbf{B}}_{i}^{\ast}{\mathbf{x}}_{i}}{sup}\prod_{i=1}^{m}{f}_{i}^{{c}_{i}}({\mathbf{x}}_{i})\,\mathrm{d}\mathbf{x}\ge F\prod_{i=1}^{m}{\left(\int_{{E}_{i}}{f}_{i}\right)}^{{c}_{i}}\tag{2}$$

for all nonnegative measurable functions ${f}_{i}$ on ${E}_{i}$, $i=1,\dots ,m$, holds if and only if it holds for all centered Gaussian functions.

For surveys on the history of both the Brascamp-Lieb inequality and Barthe’s inequality and their applications, see, e.g., [6,7]. The Brascamp-Lieb inequality can be seen as a generalization of several other inequalities, including Hölder’s inequality, the sharp Young inequality, the Loomis-Whitney inequality, the entropy power inequality (cf. [6] or the survey paper [8]), hypercontractivity and the logarithmic Sobolev inequality [9]. Furthermore, the Prékopa-Leindler inequality can be seen as a special case of Barthe’s inequality. Due in part to their utility in establishing impossibility bounds, these functional inequalities have attracted much attention in information theory [10,11,12,13,14,15,16,17], theoretical computer science [18,19,20,21,22] and statistics [23,24,25,26,27,28], to name only a small subset of the literature. Over the years, various proofs of these inequalities have been proposed [1,29,30,31,32,33,34]. Among these, Lieb’s elegant proof [29], which is very close to one of the techniques used in this paper, employs a doubling trick that capitalizes on the rotational invariance property of the Gaussian function: if f is a one-dimensional centered Gaussian function, then:

$$f(x)f(y)=f\left(\frac{x+y}{\sqrt{2}}\right)f\left(\frac{x-y}{\sqrt{2}}\right).\tag{3}$$
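The identity is elementary: expanding the exponents gives ${(x+y)}^{2}/2+{(x-y)}^{2}/2={x}^{2}+{y}^{2}$. As an illustrative numerical check (the Gaussian parameters below are arbitrary, not from the paper):

```python
import math

# Check the rotational invariance (doubling) identity for a centered Gaussian
#   f(x) f(y) = f((x + y)/sqrt(2)) * f((x - y)/sqrt(2)),
# where f(x) = exp(r - a x^2); the parameters a, r are arbitrary.
a, r = 0.7, 0.3
f = lambda x: math.exp(r - a * x * x)

s = math.sqrt(2.0)
for x, y in [(0.5, -1.2), (2.0, 3.1), (-0.4, 0.9)]:
    assert abs(f(x) * f(y) - f((x + y) / s) * f((x - y) / s)) < 1e-9
```

The same computation underlies the doubling trick used later: the product of two independent copies of a Gaussian is invariant under the rotation $(x,y)\mapsto ((x+y)/\sqrt{2},(x-y)/\sqrt{2})$.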

Since (1) and (2) have the same structure modulo the direction of the inequality, a common viewpoint is to consider (1) and (2) as dual inequalities. This viewpoint successfully captures the geometric aspects of (1) and (2). Indeed, it is known that:

as long as $D,F<\infty $ [5]. Moreover, both D and F are equal to one under Ball’s geometric condition [35]: ${E}_{1}$, …, ${E}_{m}$ are of dimension one, and:

$$\sum_{i=1}^{m}{c}_{i}{\mathbf{B}}_{i}^{\ast}{\mathbf{B}}_{i}\tag{5}$$

is the identity matrix. While fruitful, this “dual” viewpoint does not fully explain the asymmetry between the forward and the reverse inequalities: there is a sup in (2), but not in (1).

This paper explores a different viewpoint. In particular, we propose a single inequality that unifies (1) and (2). Accordingly, we reverse both sides of (2) to make the direction of the inequality consistent with (1). To be concrete, let us first observe that (1) and (2) can be respectively restated in the following more symmetrical forms (with certain symbols changed):

For all nonnegative functions g and ${f}_{1},\dots ,{f}_{m}$ such that:

we have:

For all nonnegative measurable functions ${g}_{1},\dots ,{g}_{l}$ and f such that:

we have:

Note that in both cases, the optimal choice of one function (f or g) can be explicitly computed from the constraints, hence the conventional formulations in (1) and (2). Generalizing further, we can consider the following problem: Let $\mathcal{X}$, ${\mathcal{Y}}_{1},\dots ,{\mathcal{Y}}_{m}$, ${\mathcal{Z}}_{1},\dots ,{\mathcal{Z}}_{l}$ be measurable spaces. Consider measurable maps ${\varphi}_{j}:\mathcal{X}\to {\mathcal{Y}}_{j}$, $j=1,\dots ,m$, and ${\psi}_{i}:\mathcal{X}\to {\mathcal{Z}}_{i}$, $i=1,\dots ,l$. Let ${b}_{1},\dots ,{b}_{l}$ and ${c}_{1},\dots ,{c}_{m}$ be nonnegative real numbers. Let ${\nu}_{1},\dots ,{\nu}_{l}$ be measures on ${\mathcal{Z}}_{1},\dots ,{\mathcal{Z}}_{l}$ and ${\mu}_{1},\dots ,{\mu}_{m}$ be measures on ${\mathcal{Y}}_{1},\dots ,{\mathcal{Y}}_{m}$, respectively. What is the smallest $D>0$ such that for all nonnegative ${f}_{1},\dots ,{f}_{m}$ on ${\mathcal{Y}}_{1},\dots ,{\mathcal{Y}}_{m}$ and ${g}_{1},\dots ,{g}_{l}$ on ${\mathcal{Z}}_{1},\dots ,{\mathcal{Z}}_{l}$ satisfying:

we have:

Except for the special case of $l=1$ (resp. $m=1$), it is generally not possible to deduce from (10) a simple expression for the optimal choice of ${g}_{i}$ (resp. ${f}_{j}$) in terms of the rest of the functions. We will refer to (11) as a forward-reverse Brascamp-Lieb inequality.

One of the motivations for considering multiple functions on both sides of (11) comes from multiuser information theory: independently of, but almost simultaneously with, the discovery of the Brascamp-Lieb inequality in mathematical physics, in the late 1970s, information theorists including Ahlswede, Gács and Körner [36,37] invented the image-size technique for proving strong converses in source and channel networks. An image-size inequality is a characterization of the tradeoff of the measures of certain sets connected by given random transformations (channels); we refer the interested reader to [37] for expositions on the image-size problem. Although this is not how they are treated in [36,37], an image-size inequality can essentially be obtained from a functional inequality similar to (11) by taking the functions to be (roughly speaking) the indicator functions of sets. In the case of (10), the forward channels ${\varphi}_{1},\dots ,{\varphi}_{m}$ and the reverse channels ${\psi}_{1},\dots ,{\psi}_{l}$ degenerate into deterministic functions. In this paper, motivated by information theoretic applications similar to those of the image-size problems, we consider further generalizations of (11) to the case of random transformations. Since the functional inequality is not restricted to indicator functions, it is strictly stronger than the corresponding image-size inequality. As a side remark, [38] uses functional inequalities that are variants of (11), together with reverse hypercontractivity machinery, to improve upon the image-size plus blowing-up machinery of [39], and shows that the non-indicator-function generalization is crucial for achieving the optimal scaling of the second-order rate expansion.

Of course, to justify the proposal of (11), we must also prove that (11) enjoys certain nice mathematical properties; this is the main goal of the present paper. Specifically, we focus on two aspects of (11): its equivalent entropic formulation and Gaussian optimality.

In the mathematical literature, e.g., [32,36,40,41,42,43,44,45,46], it is known that certain integral inequalities are equivalent to inequalities involving relative entropies. In particular, Carlen, Loss and Lieb [47] and Carlen and Cordero-Erausquin [32] proved that the Brascamp-Lieb inequality is equivalent to the superadditivity of relative entropy. In this paper, we prove that the forward-reverse Brascamp-Lieb inequality (11) also has an entropic formulation, which turns out to be very close to the rate region of certain multiuser information theory problems (we will clarify the difference in the text). In fact, Ahlswede, Csiszár and Körner [37,39] essentially derived image-size inequalities from similar entropic inequalities. Because of the reverse part, the proof of the equivalence of (11) and the corresponding entropic inequality is more involved than the forward case considered in [32] beyond the case of finite $\mathcal{X}$, ${\mathcal{Y}}_{j}$, ${\mathcal{Z}}_{i}$, and certain machinery from min-max theory appears necessary. In particular, the proof involves a novel use of the Legendre-Fenchel duality theory. Next, we give a basic version of our main result on the functional-entropic duality (more general versions will be given later). In order to streamline its presentation, all formal definitions of notation are postponed to Section 2.

**Theorem 1** (Dual formulation of the forward-reverse Brascamp-Lieb inequality).

Assume that:

- (i)
m and l are positive integers; $d\in \mathbb{R}$, $\mathcal{X}$ is a compact metric space;

- (ii)
${b}_{i}\in (0,\infty )$, ${\nu}_{i}$ is a finite Borel measure on a Polish space ${\mathcal{Z}}_{i}$, and ${Q}_{{Z}_{i}|X}$ is a random transformation from $\mathcal{X}$ to ${\mathcal{Z}}_{i}$, for each $i=1,\dots ,l$;

- (iii)
${c}_{j}\in (0,\infty )$, ${\mu}_{j}$ is a finite Borel measure on a Polish space ${\mathcal{Y}}_{j}$, and ${Q}_{{Y}_{j}|X}$ is a random transformation from $\mathcal{X}$ to ${\mathcal{Y}}_{j}$, for each $j=1,\dots ,m$;

- (iv)
For any ${({P}_{{Z}_{i}})}_{i=1}^{l}$ such that ${\sum}_{i=1}^{l}D({P}_{{Z}_{i}}\parallel {\nu}_{i})<\infty $, there exists ${P}_{X}$ such that ${P}_{X}\to {Q}_{{Z}_{i}|X}\to {P}_{{Z}_{i}}$, $i=1,\dots ,l$ and ${\sum}_{j=1}^{m}D({P}_{{Y}_{j}}\parallel {\mu}_{j})<\infty $, where ${P}_{X}\to {Q}_{{Y}_{j}|X}\to {P}_{{Y}_{j}}$, $j=1,\dots ,m$.

Then, the following two statements are equivalent:

- 1.
If the nonnegative continuous functions $({g}_{i})$, $({f}_{j})$ are bounded away from zero and satisfy:

then:

- 2.
For any $({P}_{{Z}_{i}})$ such that $D({P}_{{Z}_{i}}\parallel {\nu}_{i})<\infty $, $i=1,\dots ,l$ (this assumption is not essential if we adopt the convention that the infimum in (14) is $+\infty $ when it runs over an empty set),

where ${P}_{X}\to {Q}_{{Y}_{j}|X}\to {P}_{{Y}_{j}}$, $j=1,\dots ,m$, and the infimum is over ${P}_{X}$ such that ${P}_{X}\to {Q}_{{Z}_{i}|X}\to {P}_{{Z}_{i}}$, $i=1,\dots ,l$.

Next, in a similar vein to the proverbial result that “Gaussian functions are optimal” for the forward or the reverse Brascamp-Lieb inequality, we show in this paper that Gaussian functions are also optimal for the forward-reverse Brascamp-Lieb inequality, particularized to the case of Gaussian reference measures and linear maps. The proof scheme is based on the rotational invariance (3), which can be traced back, in the functional setting, to Lieb [29]. More specifically, we use a variant for the entropic setting introduced by Geng and Nair [48], thereby taking advantage of the dual formulation of Theorem 1.

**Theorem** **2.** Consider ${b}_{1},\dots ,{b}_{l},{c}_{1},\dots ,{c}_{m},D\in (0,\infty )$. Let ${E}_{1},\dots ,{E}_{l},{E}^{1},\dots ,{E}^{m}$ be Euclidean spaces, and let ${\mathbf{B}}_{ji}:{E}_{i}\to {E}^{j}$ be a linear map for each $i\in \{1,\dots ,l\}$ and $j\in \{1,\dots ,m\}$. Then, for all continuous functions ${f}_{j}:{E}^{j}\to [0,+\infty )$, ${g}_{i}:{E}_{i}\to [0,\infty )$ satisfying:

we have:

if and only if (16) holds for all centered Gaussian functions ${f}_{1},\dots ,{f}_{m},{g}_{1},\dots ,{g}_{l}$ satisfying (15).

As mentioned, in the literature on the forward or the reverse Brascamp-Lieb inequalities, it is known that a certain geometric condition (5) ensures that the best constant equals one. Now, for the forward-reverse inequality, there is a simple example where the best constant equals one:

**Example** **1.** Let l be a positive integer, and let $\mathbf{M}:={({m}_{ji})}_{1\le j\le l,1\le i\le l}$ be an orthogonal matrix. For any nonnegative continuous functions ${({f}_{j})}_{j=1}^{l}$ and ${({g}_{i})}_{i=1}^{l}$ on $\mathbb{R}$ such that:

we have:

The rest of the paper is organized as follows: Section 2 defines the notation and reviews some basic theory of convex duality. Section 3 proves Theorem 1 and also presents its extensions to the settings of noncompact spaces or general reverse channels. Section 4 proves the Gaussian optimality in the entropic formulation, with the caveat that a certain “non-degeneracy” assumption is imposed to ensure the existence of extremizers. At the end of Section 4, we give a proof sketch of Example 1 and also propose a generalization of the example. To completely prove Theorem 2, in Appendix F, we use a limiting argument to drop the non-degeneracy assumption and apply the equivalence between the functional and entropic formulations.

## 2. Review of the Legendre-Fenchel Duality Theory

Our proof of the equivalence of the functional and the entropic inequalities uses the Legendre-Fenchel duality theory, a topic from convex analysis. Before getting into that, a recap of some basics on the duality of topological vector spaces seems appropriate. Unless otherwise indicated, we assume Polish spaces and Borel measures. Recall that a Polish space is a complete separable metric space. It enjoys several nice properties that we use heavily in this section, including the Prokhorov theorem and the Riesz-Kakutani theorem. Of course, the Polish space assumption covers the cases of Euclidean and discrete spaces (endowed with the Hamming metric, which induces the discrete topology, making every function on the discrete set continuous), among others. Readers interested only in discrete spaces may refer to the (much simpler) argument in [49] based on the KKT condition.

**Notation** **1.** Let $\mathcal{X}$ be a topological space.

${C}_{c}(\mathcal{X})$ denotes the space of continuous functions on $\mathcal{X}$ with compact support;

${C}_{0}(\mathcal{X})$ denotes the space of all continuous functions f on $\mathcal{X}$ that vanish at infinity (i.e., for any $\u03f5>0$, there exists a compact set $\mathcal{K}\subseteq \mathcal{X}$ such that $|f(x)|<\u03f5$ for $x\in \mathcal{X}\backslash \mathcal{K}$);

${C}_{b}(\mathcal{X})$ denotes the space of bounded continuous functions on $\mathcal{X}$;

$\mathcal{M}(\mathcal{X})$ denotes the space of finite signed Borel measures on $\mathcal{X}$;

$\mathcal{P}(\mathcal{X})$ denotes the space of probability measures on $\mathcal{X}$.

We consider ${C}_{c}$, ${C}_{0}$ and ${C}_{b}$ as topological vector spaces, with the topology induced by the sup norm. The following theorem, usually attributed to Riesz, Markov and Kakutani, is well known in functional analysis and can be found in, e.g., [50,51].

**Theorem 3** (Riesz-Markov-Kakutani).

If $\mathcal{X}$ is a locally compact, σ-compact Polish space, then the dual (the dual of a topological vector space consists of all continuous linear functionals on that space; it is naturally also a topological vector space (with the weak${}^{\ast}$ topology)) of both ${C}_{c}(\mathcal{X})$ and ${C}_{0}(\mathcal{X})$ is $\mathcal{M}(\mathcal{X})$.

**Remark** **1.** The dual space of ${C}_{b}(\mathcal{X})$ can be strictly larger than $\mathcal{M}(\mathcal{X})$, since it also contains those linear functionals that depend on the “limit at infinity” of a function $f\in {C}_{b}(\mathcal{X})$ (originally defined for those f that do have a limit at infinity and then extended to the whole of ${C}_{b}(\mathcal{X})$ by the Hahn-Banach theorem; see, e.g., [50]). Of course, any $\mu \in \mathcal{M}(\mathcal{X})$ is a continuous linear functional on ${C}_{0}(\mathcal{X})$ or ${C}_{c}(\mathcal{X})$, given by:

where f is a function in ${C}_{0}(\mathcal{X})$ or ${C}_{c}(\mathcal{X})$. As is well known, Theorem 3 states that the converse is also true under mild regularity assumptions on the space. Thus, we can view measures as continuous linear functionals on a certain function space (in fact, some authors prefer to construct measure theory by defining a measure as a linear functional on a suitable function space; see Lax [50] or Bourbaki [52]); this justifies the shorthand notation:

which we employ in the rest of the paper. This viewpoint is the most natural for our setting, since in the proof of the equivalent formulation of the forward-reverse Brascamp-Lieb inequality, we shall use the Hahn-Banach theorem to show the existence of certain linear functionals.

**Definition** **1.** Let $\mathsf{\Lambda}:{C}_{b}(\mathcal{X})\to (-\infty ,+\infty ]$ be a lower semicontinuous, proper convex function. Its Legendre-Fenchel transform ${\mathsf{\Lambda}}^{\ast}:{C}_{b}{(\mathcal{X})}^{\ast}\to (-\infty ,+\infty ]$ is given by:

Let $\nu $ be a nonnegative finite Borel measure on a Polish space $\mathcal{X}$, and define the convex functional on ${C}_{b}(\mathcal{X})$:

Then, note that the relative entropy has the following alternative definition: for any $\mu \in \mathcal{M}(\mathcal{X})$,

which agrees with the more familiar definition $D(\mu \parallel \nu ):=\mu (log\frac{\mathrm{d}\mu}{\mathrm{d}\nu})$ when $\nu $ is a probability measure, by the Donsker-Varadhan formula (cf. [53], Lemma 6.2.13). If $\mu $ is not a probability measure, then $D(\mu \parallel \nu )$ as defined in (24) is $+\infty $.
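On a finite alphabet, the Donsker-Varadhan characterization underlying (24) can be checked directly; the sketch below (with two arbitrary distributions, not from the paper) verifies that $\mu (f)-log\,\nu ({e}^{f})$ is maximized at $f=log\frac{\mathrm{d}\mu}{\mathrm{d}\nu}$, with maximum value $D(\mu \parallel \nu )$:

```python
import numpy as np

# Finite-alphabet sketch of the Donsker-Varadhan formula:
#   D(mu || nu) = sup_f [ mu(f) - log nu(e^f) ],
# with the supremum attained at f = log(dmu/dnu).  mu, nu are arbitrary.
mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.5, 0.3])
D = float(np.sum(mu * np.log(mu / nu)))        # relative entropy D(mu||nu)

f_opt = np.log(mu / nu)                        # the optimal "test function"
val = float(mu @ f_opt - np.log(nu @ np.exp(f_opt)))
assert abs(val - D) < 1e-12

rng = np.random.default_rng(0)
for _ in range(100):                           # any other f can only do worse
    f = rng.normal(size=3)
    assert mu @ f - np.log(nu @ np.exp(f)) <= D + 1e-12
```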

Given a bounded linear operator $T:{C}_{b}(\mathcal{Y})\to {C}_{b}(\mathcal{X})$, the dual operator ${T}^{\ast}:{C}_{b}{(\mathcal{X})}^{\ast}\to {C}_{b}{(\mathcal{Y})}^{\ast}$ is defined by:

for any ${\mu}_{X}\in {C}_{b}{(\mathcal{X})}^{\ast}$. Since $\mathcal{P}(\mathcal{X})\subseteq \mathcal{M}(\mathcal{X})\subseteq {C}_{b}{(\mathcal{X})}^{\ast}$, T is said to be a conditional expectation operator if ${T}^{\ast}P\in \mathcal{P}(\mathcal{Y})$ for any $P\in \mathcal{P}(\mathcal{X})$. An operator ${T}^{\ast}$ defined as the dual of a conditional expectation operator T is said, in a slight abuse of terminology, to be a random transformation from $\mathcal{X}$ to $\mathcal{Y}$.

For example, in the notation of Theorem 1, if $g\in {C}_{b}(\mathcal{Y})$ and ${Q}_{Y|X}$ is a random transformation from $\mathcal{X}$ to $\mathcal{Y}$, the quantity ${Q}_{Y|X}(g)$ is a function on $\mathcal{X}$, defined by taking the conditional expectation. Furthermore, if ${P}_{X}\in \mathcal{P}(\mathcal{X})$, we write ${P}_{X}\to {Q}_{Y|X}\to {P}_{Y}$ to indicate that ${P}_{Y}\in \mathcal{P}(\mathcal{Y})$ is the measure induced on $\mathcal{Y}$ by applying ${Q}_{Y|X}$ to ${P}_{X}$.
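On a finite alphabet, a conditional expectation operator T is simply multiplication by a row-stochastic matrix ${Q}_{Y|X}$, and both the adjoint relation and the property ${T}^{\ast}\mathcal{P}(\mathcal{X})\subseteq \mathcal{P}(\mathcal{Y})$ can be checked directly (an illustrative sketch with arbitrary data):

```python
import numpy as np

# Finite-alphabet sketch: the conditional expectation operator T sends a
# function g on Y to (T g)(x) = E[g(Y) | X = x], i.e. T g = Q @ g for a
# row-stochastic matrix Q(y|x); its dual T* acts on measures by
# (T* mu)(y) = sum_x mu(x) Q(y|x).  The data below are arbitrary.
rng = np.random.default_rng(3)
Q = rng.random((4, 3))
Q /= Q.sum(axis=1, keepdims=True)        # row-stochastic: a channel from X to Y

g = rng.random(3)                        # a function on Y
mu = rng.random(4)
mu /= mu.sum()                           # a probability measure on X

Tg = Q @ g                               # T g, a function on X
Tstar_mu = Q.T @ mu                      # T* mu, a measure on Y

assert abs(float(mu @ Tg) - float(Tstar_mu @ g)) < 1e-12  # <Tg, mu> = <g, T* mu>
assert abs(float(Tstar_mu.sum()) - 1.0) < 1e-12           # T* maps P(X) into P(Y)
```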

**Remark** **2.** From the viewpoint of category theory (see, for example, [54,55]), ${C}_{b}$ is a functor from the category of topological spaces to the category of topological vector spaces. It is contravariant because for any continuous $\varphi :\mathcal{X}\to \mathcal{Y}$ (a morphism between topological spaces), we have ${C}_{b}(\varphi ):{C}_{b}(\mathcal{Y})\to {C}_{b}(\mathcal{X})$, $u\mapsto u\circ \varphi $, where $u\circ \varphi $ denotes the composition of the two continuous functions; that is, the functor reverses the arrows (the morphisms). On the other hand, $\mathcal{M}$ is a covariant functor, with $\mathcal{M}(\varphi ):\mathcal{M}(\mathcal{X})\to \mathcal{M}(\mathcal{Y})$, $\mu \mapsto \mu \circ {\varphi}^{-1}$, where $\mu \circ {\varphi}^{-1}(\mathcal{B}):=\mu ({\varphi}^{-1}(\mathcal{B}))$ for any Borel measurable $\mathcal{B}\subseteq \mathcal{Y}$. “Duality” itself is a contravariant functor on the category of topological vector spaces (note the reversal of arrows in Figure 1). Moreover, ${C}_{b}{(\mathcal{X})}^{\ast}=\mathcal{M}(\mathcal{X})$ and ${C}_{b}{(\varphi )}^{\ast}=\mathcal{M}(\varphi )$ if $\mathcal{X}$ and $\mathcal{Y}$ are compact metric spaces and $\varphi :\mathcal{X}\to \mathcal{Y}$ is continuous. Definition 2 below can therefore be viewed as the special case where φ is a projection map.

**Definition** **2.** Suppose $\varphi :{\mathcal{Z}}_{1}\times {\mathcal{Z}}_{2}\to {\mathcal{Z}}_{1},\phantom{\rule{0.166667em}{0ex}}({z}_{1},{z}_{2})\mapsto {z}_{1}$ is the projection onto the first coordinate.

${C}_{b}(\varphi ):{C}_{b}({\mathcal{Z}}_{1})\to {C}_{b}({\mathcal{Z}}_{1}\times {\mathcal{Z}}_{2})$ is called a canonical map, whose action is almost trivial: it sends a function of ${z}_{1}$ to itself, viewed as a function of $({z}_{1},{z}_{2})$.

$\mathcal{M}(\varphi ):\mathcal{M}({\mathcal{Z}}_{1}\times {\mathcal{Z}}_{2})\to \mathcal{M}({\mathcal{Z}}_{1})$ is called marginalization; it simply takes a joint distribution to its marginal distribution.
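In the finite-alphabet case, the adjunction between the canonical map and marginalization, $\langle {C}_{b}(\varphi )g,\mu \rangle =\langle g,\mathcal{M}(\varphi )\mu \rangle $, reduces to an exchange of summations (an illustrative sketch with arbitrary data):

```python
import numpy as np

# Finite-alphabet sketch: C_b(phi) lifts a function g on Z1 to the function
# (z1, z2) -> g(z1), while M(phi) sends a joint measure mu on Z1 x Z2 to its
# first marginal; the pairing identity is just an exchange of summations.
rng = np.random.default_rng(1)
mu = rng.random((4, 5))                          # a finite measure on Z1 x Z2
g = rng.random(4)                                # a function on Z1

lifted = np.broadcast_to(g[:, None], mu.shape)   # C_b(phi) g, on Z1 x Z2
marginal = mu.sum(axis=1)                        # M(phi) mu, on Z1

assert abs(float(np.sum(lifted * mu)) - float(g @ marginal)) < 1e-12
```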

The Fenchel-Rockafellar duality (see [40], Theorem 1.9, or [56] in the case of finite-dimensional vector spaces) usually refers to the $k=1$ special case of the following result.

**Theorem** **4.** Assume that A is a topological vector space whose dual is ${A}^{\ast}$. Let ${\mathsf{\Theta}}_{j}:A\to \mathbb{R}\cup \{+\infty \}$ be convex, $j=0,1,\dots ,k$, for some positive integer k. Suppose there exist ${({u}_{j})}_{j=1}^{k}$ and ${u}_{0}:=-({u}_{1}+\dots +{u}_{k})$ such that:

and ${\mathsf{\Theta}}_{0}$ is upper semicontinuous at ${u}_{0}$. Then:

For completeness, we provide a proof of this result, which is based on the Hahn-Banach theorem (Theorem 5) and is similar to the proof of [40], Theorem 1.9.

**Proof.** Let ${m}_{0}$ denote the right side of (27). The ≤ part of (27) follows trivially from the (weak) min-max inequality, since:

It remains to prove the ≥ part, and we may assume without loss of generality that ${m}_{0}>-\infty $. Note that (26) also implies that ${m}_{0}<+\infty $. Define the convex sets:

These sets are nonempty because of (26). Furthermore, ${C}_{0}$ has a nonempty interior by the assumption that ${\mathsf{\Theta}}_{0}$ is upper semicontinuous at ${u}_{0}$. Thus, the Minkowski sum:

is a convex set with a nonempty interior. Moreover, $C\cap B=\varnothing $. By the Hahn-Banach theorem (Theorem 5), there exists $(\ell ,s)\in {A}^{\ast}\times \mathbb{R}$ such that:

for any $m\le {m}_{0}$ and $({u}_{j},{r}_{j})\in {C}_{j}$, $j=0,\dots ,k$. From (30), we see that (32) can only hold when $s\ge 0$. Moreover, from (26) and the upper semicontinuity of ${\mathsf{\Theta}}_{0}$ at ${u}_{0}$, we see that ${\sum}_{j=0}^{k}{u}_{j}$ in (32) can take any value in a neighborhood of $0\in A$; hence, $s\ne 0$. Thus, dividing both sides of (32) by s and setting $\ell \leftarrow -\ell /s$, we see that:

which establishes the ≥ part of (27). ☐
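For intuition, here is a one-dimensional $k=1$ instance of Theorem 4 in the standard Fenchel-Rockafellar form $\mathrm{inf}_{u}[{\mathsf{\Theta}}_{0}(u)+{\mathsf{\Theta}}_{1}(u)]=\mathrm{sup}_{\ell}[-{\mathsf{\Theta}}_{0}^{\ast}(-\ell )-{\mathsf{\Theta}}_{1}^{\ast}(\ell )]$ (cf. [40], Theorem 1.9). The quadratic pair below is an arbitrary choice, with both conjugates computed by hand:

```python
import numpy as np

# Primal: inf_u [T0(u) + T1(u)] with T0(u) = (u - 1)^2 / 2, T1(u) = u^2 / 2.
# The minimum is 1/4, attained at u = 1/2.
u = np.linspace(-5.0, 5.0, 100001)
primal = float(np.min(0.5 * (u - 1.0) ** 2 + 0.5 * u ** 2))

# Dual: sup_l [-T0*(-l) - T1*(l)] with T0*(l) = l + l^2/2 and T1*(l) = l^2/2
# (computed by hand), i.e. sup_l [l - l^2] = 1/4, attained at l = 1/2.
l = np.linspace(-5.0, 5.0, 100001)
dual = float(np.max(-((-l) + 0.5 * l ** 2) - 0.5 * l ** 2))

assert abs(primal - 0.25) < 1e-6       # primal and dual values agree
assert abs(dual - 0.25) < 1e-6
```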

**Theorem 5** (Hahn-Banach).

Let C and B be convex, nonempty, disjoint subsets of a topological vector space A.

- 1.
If the interior of C is nonempty, then there exists $\ell \in {A}^{\ast}$, $\ell \ne 0$, such that:

- 2.
If A is locally convex, B is compact and C is closed, then there exists $\ell \in {A}^{\ast}$ such that:

**Remark** **3.** The assumption in Theorem 5 that C has a nonempty interior is only needed in the infinite-dimensional case. However, even if A in Theorem 4 is finite dimensional, the assumption in Theorem 4 that ${\mathsf{\Theta}}_{0}$ is upper semicontinuous at ${u}_{0}$ is still necessary, because this assumption is used not only in applying the Hahn-Banach theorem, but also in concluding that $s\ne 0$ in (32).

## 4. Gaussian Optimality

Recall that the conventional Brascamp-Lieb inequality and its reverse ((1) and (2)) state that centered Gaussian functions exhaust such inequalities; in particular, verifying those inequalities reduces to a finite-dimensional optimization problem (only the covariance matrices of these Gaussian functions are to be optimized). In this section, we show that similar results hold for the forward-reverse Brascamp-Lieb inequality as well. Our proof uses the rotational invariance argument mentioned in Section 1. Since the forward-reverse Brascamp-Lieb inequality has dual representations (Theorem 7), in principle, the rotational invariance argument can be applied either to the functional representation (as in Lieb’s paper [29]) or to the entropic representation (as in Geng-Nair [48]). Here, we adopt the latter approach. We first consider a certain “non-degenerate” case where the existence of an extremizer is guaranteed. Gaussian optimality in the general case then follows by a limiting argument (Appendix F), establishing Theorem 2.

#### 4.1. Non-Degenerate Forward Channels

This subsection focuses on the following case:

**Assumption** **1.** Fix Lebesgue measures ${({\mu}_{j})}_{j=1}^{m}$ and Gaussian measures ${({\nu}_{i})}_{i=1}^{l}$ on $\mathbb{R}$;

non-degenerate (Definition 3 below) linear Gaussian random transformation ${({P}_{{Y}_{j}|\mathbf{X}})}_{j=1}^{m}$ (where $\mathbf{X}:=({X}_{1},\dots ,{X}_{l})$) associated with conditional expectation operators ${({T}_{j})}_{j=1}^{m}$;

${({S}_{i})}_{i=1}^{l}$ are induced by coordinate projections;

positive $({c}_{j})$ and $({b}_{i})$.

**Definition** **3.** We say $({Q}_{{\mathbf{Y}}_{1}|\mathbf{X}},\dots ,{Q}_{{\mathbf{Y}}_{m}|\mathbf{X}})$ is non-degenerate if each ${Q}_{{\mathbf{Y}}_{j}|\mathbf{X}=\mathbf{0}}$ is an ${n}_{j}$-dimensional Gaussian distribution with an invertible covariance matrix.

Given Borel measures ${P}_{{X}_{i}}$ on $\mathbb{R}$, $i=1,\dots ,l$, define:

where the infimum is over Borel measures ${P}_{\mathbf{X}}$ that have $({P}_{{X}_{i}})$ as marginals. Note that (75) is well defined, since the first term cannot be $+\infty $ under the non-degeneracy assumption, and the second term cannot be $-\infty $. The aim of this subsection is to prove the following:

**Theorem** **8.** The supremum ${sup}_{({P}_{{X}_{i}})}{F}_{0}(({P}_{{X}_{i}}))$, taken over Borel measures ${P}_{{X}_{i}}$ on $\mathbb{R}$, $i=1,\dots ,l$, is achieved by some Gaussian ${({P}_{{X}_{i}})}_{i=1}^{l}$, in which case the infimum in (75) is achieved by some Gaussian ${P}_{\mathbf{X}}$.

Naturally, one would expect that Gaussian optimality can be established when ${({\mu}_{j})}_{j=1}^{m}$ and ${({\nu}_{i})}_{i=1}^{l}$ are either Gaussian or Lebesgue. We made the assumption that the former is Lebesgue and the latter is Gaussian so that certain technical conditions can be justified more easily. More precisely, the following observation shows that we can regularize the distributions by a second moment constraint for free:

**Proposition** **1.** ${sup}_{({P}_{{X}_{i}})}{F}_{0}(({P}_{{X}_{i}}))$ is finite, and there exist ${\sigma}_{i}^{2}\in (0,\infty )$, $i=1,\dots ,l$, such that it equals:

**Proof.** When ${\mu}_{j}$ is Lebesgue and ${P}_{{Y}_{j}|\mathbf{X}}$ is non-degenerate, $D({P}_{{Y}_{j}}\parallel {\mu}_{j})=-h({P}_{{Y}_{j}})\le -h({P}_{{Y}_{j}}|\mathbf{X})$ is bounded above (in terms of the variance of the additive noise of ${P}_{{Y}_{j}|\mathbf{X}}$). Moreover, $D({P}_{{X}_{i}}\parallel {\nu}_{i})\ge 0$ when ${\nu}_{i}$ is Gaussian, so ${sup}_{({P}_{{X}_{i}})}{F}_{0}(({P}_{{X}_{i}}))<\infty $. Further, choosing $({P}_{{X}_{i}})=({\nu}_{i})$ and using the covariance matrix to lower bound the first term in (75) shows that ${sup}_{({P}_{{X}_{i}})}{F}_{0}(({P}_{{X}_{i}}))>-\infty $.

To see (76), notice that:

where ${\nu}_{i}^{\prime}$ is a Gaussian distribution with the same first and second moments as ${X}_{i}\sim {P}_{{X}_{i}}$. Thus, $D({P}_{{X}_{i}}\parallel {\nu}_{i})$ is bounded below by some function of the second moment of ${X}_{i}$, which tends to ∞ as the second moment of ${X}_{i}$ tends to ∞. Moreover, as argued in the preceding paragraph, the first term in (75) is bounded above by a constant depending only on $({P}_{{Y}_{j}|\mathbf{X}})$. Thus, we can choose ${\sigma}_{i}^{2}>0$, $i=1,\dots ,l$, large enough that if $\mathbb{E}\left[{X}_{i}^{2}\right]>{\sigma}_{i}^{2}$ for some i, then ${F}_{0}(({P}_{{X}_{i}}))<{sup}_{({P}_{{X}_{i}})}{F}_{0}(({P}_{{X}_{i}}))$, irrespective of the choices of ${P}_{{X}_{1}},\dots ,{P}_{{X}_{i-1}},{P}_{{X}_{i+1}},\dots ,{P}_{{X}_{l}}$. These ${\sigma}_{1},\dots ,{\sigma}_{l}$ are as desired in the proposition. ☐
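The coercivity used in this proof can be made explicit in the simplest case: taking ${\nu}_{i}$ to be the standard Gaussian, the relative entropy of a centered Gaussian with second moment ${s}^{2}$ has a closed form that grows without bound in ${s}^{2}$ (an illustrative check, with the standard-Gaussian normalization as an added assumption):

```python
import numpy as np

# With nu = N(0, 1) and P = N(0, s2), the relative entropy has the closed form
#   D(P || nu) = (s2 - 1 - log s2) / 2,
# which is nonnegative and grows without bound as the second moment s2 grows;
# this is the coercivity used to justify the constraint E[X_i^2] <= sigma_i^2.
kl = lambda s2: 0.5 * (s2 - 1.0 - np.log(s2))

s2 = np.array([0.25, 1.0, 4.0, 100.0, 1e6])
vals = kl(s2)
assert np.all(vals >= -1e-12)          # D >= 0, with equality iff s2 = 1
assert abs(float(kl(1.0))) < 1e-12
assert np.all(np.diff(vals[1:]) > 0)   # strictly increasing beyond s2 = 1
```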

The non-degeneracy assumption ensures that the supremum is achieved:

**Proposition** **2.** Under Assumption 1,

- 1.
For any ${({P}_{{X}_{i}})}_{i=1}^{l}$, the infimum in (75) is attained by some Borel ${P}_{\mathbf{X}}$.

- 2.
If ${({P}_{{Y}_{j}|{X}^{l}})}_{j=1}^{m}$ are non-degenerate (Definition 3), then the supremum in (76) is achieved by some Borel ${({P}_{{X}_{i}})}_{i=1}^{l}$.

The proof of Proposition 2 is given in Appendix E. Having taken care of the existence of the extremizers, we turn to the tensorization properties, which are the crux of the proof:

**Lemma** **1.** Fix $({P}_{{X}_{i}^{(1)}})$, $({P}_{{X}_{i}^{(2)}})$, $({\mu}_{j})$, $({T}_{j})$, $({c}_{j})\in {[0,\infty )}^{m}$, and let ${S}_{j}$ be induced by coordinate projections. Then:

where, for each j,

on the left side, and:

on the right side, $t=1,2$.

**Proof.** We only need to prove the nontrivial ≥ part. For any ${P}_{{\mathbf{X}}^{(1,2)}}$ on the left side, choose ${P}_{{\mathbf{X}}^{(t)}}$ on the right side by marginalization. Then:

for each j. ☐

We are now ready to show the main result of this section.

**Proof of Theorem 8.** Assume that $({P}_{{X}_{i}^{(1)}})$ and $({P}_{{X}_{i}^{(2)}})$ are maximizers of ${F}_{0}$ (possibly equal). Let ${P}_{{X}_{i}^{(1,2)}}:={P}_{{X}_{i}^{(1)}}\times {P}_{{X}_{i}^{(2)}}$. Define:

Define $({Y}_{j}^{+})$ and $({Y}_{j}^{-})$ analogously. Then, ${Y}_{j}^{+}|\{{\mathbf{X}}^{+}={\mathbf{x}}^{+},{\mathbf{X}}^{-}={\mathbf{x}}^{-}\}\sim {Q}_{{Y}_{j}|\mathbf{X}={\mathbf{x}}^{+}}$ is independent of ${\mathbf{x}}^{-}$, and ${Y}_{j}^{-}|\{{\mathbf{X}}^{+}={\mathbf{x}}^{+},{\mathbf{X}}^{-}={\mathbf{x}}^{-}\}\sim {Q}_{{Y}_{j}|\mathbf{X}={\mathbf{x}}^{-}}$ is independent of ${\mathbf{x}}^{+}$.

Next, we perform the same algebraic expansion as in the proof of tensorization:

where:

(84) uses Lemma 1.

(86) holds because of the Markov chain ${Y}_{j}^{+}-{\mathbf{X}}^{+}-{Y}_{j}^{-}$ (for any coupling).

In (87), we selected a particular coupling ${P}_{{\mathbf{X}}^{+}{\mathbf{X}}^{-}}$, constructed as follows: first, select an optimal coupling ${P}_{{\mathbf{X}}^{+}}$ for the given marginals $({P}_{{X}_{i}^{+}})$. Then, for any ${\mathbf{x}}^{+}={({x}_{i}^{+})}_{i=1}^{l}$, let ${P}_{{\mathbf{X}}^{-}|{\mathbf{X}}^{+}={\mathbf{x}}^{+}}$ be an optimal coupling of $({P}_{{X}_{i}^{-}|{X}_{i}^{+}={x}_{i}^{+}})$ (for a justification that the optimal couplings ${P}_{{\mathbf{X}}^{-}|{\mathbf{X}}^{+}={\mathbf{x}}^{+}}$ can be selected so that ${P}_{{\mathbf{X}}^{-}|{\mathbf{X}}^{+}}$ is indeed a regular conditional probability distribution, see [7]). With this construction, it is apparent that ${X}_{i}^{+}-{\mathbf{X}}^{+}-{X}_{i}^{-}$ forms a Markov chain, and hence:

(88) holds because the coupling above was constructed optimally.

(89) holds because $({P}_{{X}_{i}^{(t)}})$ maximizes ${F}_{0}$, $t=1,2$.

Thus, equality is attained throughout the expansions above. Using the differentiation technique as in the case of the forward inequality, for almost all $({b}_{i})$, $({c}_{j})$, we have:

where (92) holds because, by symmetry, we can perform the algebraic expansions in a different way to show that $({P}_{{X}_{i}^{-}})$ is also a maximizer of ${F}_{0}$. Then, $I({X}_{i}^{+};{X}_{i}^{-})=D({P}_{{X}_{i}^{-}|{X}_{i}^{+}}\parallel {\nu}_{i}|{P}_{{X}_{i}^{+}})-D({P}_{{X}_{i}^{-}}\parallel {\nu}_{i})=0$, which, combined with $I({X}_{i}^{(1)};{X}_{i}^{(2)})=0$, shows that ${X}_{i}^{(1)}$ and ${X}_{i}^{(2)}$ are Gaussian with the same covariance. Lastly, using Lemma 1 and the doubling trick, one can show that the optimal coupling is also Gaussian. ☐

#### 4.2. Analysis of Example 1 Using Gaussian Optimality

We note that Example 1 is a rather simple setting, where (17) can be proven by integrating the two sides of (18) and applying a change of variables, noting that the absolute value of the Jacobian equals one. Nevertheless, it is illuminating to give an alternative proof using the Gaussian optimality result, as a proof of concept. In this section, we only give a proof sketch in which certain “technicalities” are not justified. Details of the justifications are deferred to Appendix F.
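For a concrete instance of this change-of-variables argument, take $l=2$, $\mathbf{M}$ a rotation and every ${f}_{j}$, ${g}_{i}$ equal to the standard Gaussian density; the pointwise constraint in Example 1 then holds with equality by rotational invariance, and both products of integrals equal one (an illustrative check, not part of the proof):

```python
import numpy as np

# l = 2 instance: M is a rotation (orthogonal, |det M| = 1) and every f_j, g_i
# is the standard Gaussian density phi.  Since |Mx| = |x|, we have
#   g_1(x_1) g_2(x_2) = f_1((Mx)_1) f_2((Mx)_2)  for every x,
# and each function integrates to 1.  The angle theta is arbitrary.
theta = 0.7
M = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

phi = lambda t: np.exp(-0.5 * t ** 2) / np.sqrt(2.0 * np.pi)

rng = np.random.default_rng(2)
for _ in range(50):
    x = rng.normal(size=2)
    y = M @ x
    assert abs(phi(x[0]) * phi(x[1]) - phi(y[0]) * phi(y[1])) < 1e-12
```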

**Proof sketch for the claim in Example 1.** By duality (Theorem 7), it suffices to prove the corresponding entropic inequality. The Gaussian optimality result in Theorem 8 assumed Gaussian reference measures on the output and non-degenerate forward channels in order to simplify the proof of the existence of minimizers; however, supposing that Gaussian optimality extends beyond those technical conditions, we see that it suffices to prove that for any centered Gaussian $({P}_{{X}_{i}})$,

where the supremum is over Gaussian ${P}_{{X}^{l}}$ with the marginals ${P}_{{X}_{1}},\dots ,{P}_{{X}_{l}}$, and ${Y}_{j}:={\sum}_{i=1}^{l}{m}_{ji}{X}_{i}$. Let ${a}_{i}:=\mathbb{E}\left[{X}_{i}^{2}\right]$, and choose ${P}_{{X}^{l}}={\prod}_{i=1}^{l}{P}_{{X}_{i}}$; we see that (93) holds if:

where $({a}_{i})$ are the eigenvalues and ${\left({\sum}_{i=1}^{l}{m}_{ji}^{2}{a}_{i}\right)}_{j=1}^{l}$ are the diagonal entries of the matrix:

Therefore, (94) holds. ☐
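The eigenvalue/diagonal comparison invoked above can be sanity-checked numerically: with $\mathbf{M}$ orthogonal and $\mathbf{A}=\mathrm{diag}({a}_{1},\dots ,{a}_{l})$, the matrix $\mathbf{M}\mathbf{A}{\mathbf{M}}^{\top}$ has eigenvalues $({a}_{i})$ and diagonal entries ${\sum}_{i}{m}_{ji}^{2}{a}_{i}$, and Hadamard's inequality for positive semidefinite matrices gives ${\prod}_{i}{a}_{i}\le {\prod}_{j}{\sum}_{i}{m}_{ji}^{2}{a}_{i}$. A sketch with illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)
l = 4
# Random orthogonal M via QR decomposition of a Gaussian matrix.
M, _ = np.linalg.qr(rng.standard_normal((l, l)))

# A point a^l in the probability simplex (illustrative values).
a = rng.random(l)
a /= a.sum()

S = M @ np.diag(a) @ M.T                 # PSD matrix with eigenvalues a_i
eigs = np.sort(np.linalg.eigvalsh(S))
diag = np.diag(S)                        # j-th entry is sum_i m_ji^2 a_i

assert np.allclose(eigs, np.sort(a))     # eigenvalues are (a_i)
assert np.allclose(diag, (M**2) @ a)     # diagonal entries
# Hadamard: product of eigenvalues (= det S) <= product of diagonal entries.
assert np.prod(eigs) <= np.prod(diag) + 1e-12
```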

A generalization of Example 1 is as follows.

**Proposition** **3.** For any orthogonal matrix $\mathbf{M}:={({m}_{ji})}_{1\le j\le l,1\le i\le l}$ with nonzero entries, there exists a neighborhood $\mathcal{U}$ of the uniform probability vector $(\frac{1}{l},\dots ,\frac{1}{l})$, such that for any $({b}_{1},\dots ,{b}_{l})$ and $({c}_{1},\dots ,{c}_{l})$ in $\mathcal{U}$, the best constant D in the FR-BL inequality (16) equals $exp(H({c}^{l})-H({b}^{l}))$, where $H(\cdot)$ is the entropy functional.

The proposition generalizes the claim in Example 1. Indeed, observe that there is no loss of generality in assuming that $({b}_{1},\dots ,{b}_{l})$ and $({c}_{1},\dots ,{c}_{l})$ are probability vectors, since by dimensional analysis, we see that the best constant is infinite unless ${\sum}_{i=1}^{l}{b}_{i}={\sum}_{j=1}^{l}{c}_{j}$; and it is also clear that the best constant is invariant when each ${b}_{i}$ and ${c}_{j}$ is multiplied by the same positive number. Moreover, any orthogonal matrix can be approximated by a sequence of orthogonal $\mathbf{M}$ with nonzero entries, for which the neighborhood $\mathcal{U}$ shrinks, but always contains the uniform probability vector $(\frac{1}{l},\dots ,\frac{1}{l})$.

**Proof sketch for Proposition 3.** Note that, along the same lines as (94), the best constant in the FR-BL inequality equals:

where, without loss of generality, we assume ${a}^{l}\in \Delta $ lies in the probability simplex. We first observe that if the positive semidefinite constraint $\mathbf{A}\succeq \mathbf{0}$ in (96) were absent, then the sup in the denominator of (96) would equal ${\prod}_{j=1}^{l}{c}_{j}^{{c}_{j}}$, and consequently, (96) would equal $exp(H({c}^{l})-H({b}^{l}))$ for any ${b}^{l},{c}^{l}\in \Delta $, not necessarily close to the uniform probability vector. Indeed, fixing ${\mathbf{A}}_{ii}={a}_{i}$, $i=1,\dots ,l$, the linear map from the off-diagonal entries of $\mathbf{A}$ to the diagonal entries of ${\mathbf{MAM}}^{\top}$ is onto the space of l-vectors whose entries sum to one; the proof of the surjectivity can be reduced to checking the fact that the only diagonal matrix that commutes with $\mathbf{M}$ is a multiple of the identity matrix. Then, the sup in the denominator is achieved when ${\left[\mathbf{M}\mathbf{A}{\mathbf{M}}^{\top}\right]}_{jj}={c}_{j}$, $j=1,\dots ,l$, which is independent of ${a}^{l}$.
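The two linear-algebra facts used here, that a diagonal matrix commuting with an $\mathbf{M}$ having nonzero entries must be a multiple of the identity, and that (with the diagonal of $\mathbf{A}$ fixed) varying the off-diagonal entries of $\mathbf{A}$ moves the diagonal of $\mathbf{M}\mathbf{A}{\mathbf{M}}^{\top}$ onto the trace-preserving affine space, can be checked numerically. A sketch with an illustrative random orthogonal matrix:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
l = 4
M, _ = np.linalg.qr(rng.standard_normal((l, l)))  # random orthogonal matrix
assert np.min(np.abs(M)) > 1e-6  # generically, all entries are nonzero

# Fact 1: D M = M D with D = diag(d) forces d_j m_ji = d_i m_ji, so all d_i
# are equal when every m_ji != 0.  Check: the linear map d -> D M - M D has
# a one-dimensional nullspace (the constants), i.e., rank l - 1.
L = np.zeros((l * l, l))
for k in range(l):
    E = np.zeros((l, l)); E[k, k] = 1.0
    L[:, k] = (E @ M - M @ E).ravel()
assert np.linalg.matrix_rank(L, tol=1e-9) == l - 1

# Fact 2: with diag(A) fixed, the map from the off-diagonal entries of a
# symmetric A to diag(M A M^T) has rank l - 1: its image is the subspace of
# vectors summing to zero, since the trace is preserved by conjugation.
pairs = list(itertools.combinations(range(l), 2))
J = np.zeros((l, len(pairs)))
for k, (i, j) in enumerate(pairs):
    E = np.zeros((l, l)); E[i, j] = E[j, i] = 1.0
    J[:, k] = np.diag(M @ E @ M.T)   # q-th entry is 2 m_qi m_qj
assert np.allclose(J.sum(axis=0), 0.0)          # columns sum to zero
assert np.linalg.matrix_rank(J, tol=1e-9) == l - 1
```

Combined with the fixed trace ${\sum}_{i}{a}_{i}=1$, rank $l-1$ of this map is exactly the surjectivity onto the affine space of vectors summing to one claimed in the proof.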

Next, we argue that the constraint $\mathbf{A}\succeq \mathbf{0}$ in (96) is not active when ${b}^{l}$ and ${c}^{l}$ are close to the uniform vector. Denote by $\mathcal{U}(t)$ the set of l-vectors whose distance (say in total variation) to the uniform vector $(\frac{1}{l},\dots ,\frac{1}{l})$ is at most t. Observe that:

There exists $t>0$ such that for every ${a}^{l}\in \mathcal{U}(t)$,

which follows by continuity and the fact that when ${a}^{l}$ is uniform, the sup in (97) is achieved at the strictly positive definite $\mathbf{A}={l}^{-1}\mathbf{I}$.

When ${b}^{l}={c}^{l}=(\frac{1}{l},\dots ,\frac{1}{l})$ is the uniform probability vector, (96) equals one, which is uniquely achieved by ${a}^{l}=(\frac{1}{l},\dots ,\frac{1}{l})$. To see the uniqueness, take $\mathbf{A}$ to be diagonal in the denominator and observe that the denominator is strictly bigger than the numerator when the diagonal entries of $\mathbf{M}\mathbf{A}{\mathbf{M}}^{\top}$ are not a permutation of ${a}^{l}$. Then, since the extreme value of a continuous function is achieved on a compact set, we can find $\epsilon >0$ such that:

for any ${a}^{l}\notin \mathcal{U}(t/2)$.

Finally, by continuity, we can choose $s\in (0,t/2)$ small enough such that for any ${b}^{l},{c}^{l}\in \mathcal{U}(s)$,

Taking the neighborhood $\mathcal{U}(s)$ proves the claim. ☐