A Forward-Reverse Brascamp-Lieb Inequality: Entropic Duality and Gaussian Optimality

Inspired by the forward and the reverse channels from the image-size characterization problem in network information theory, we introduce a functional inequality that unifies both the Brascamp-Lieb inequality and Barthe’s inequality, which is a reverse form of the Brascamp-Lieb inequality. For Polish spaces, we prove its equivalent entropic formulation using the Legendre-Fenchel duality theory. Capitalizing on the entropic formulation, we elaborate on a “doubling trick” used by Lieb and Geng-Nair to prove the Gaussian optimality in this inequality for the case of Gaussian reference measures.


Introduction
The Brascamp-Lieb inequality and its reverse [1] concern the optimality of Gaussian functions in a certain type of integral inequality. (Not to be confused with the "variance Brascamp-Lieb inequality" (cf. [2][3][4]), which generalizes the Poincaré inequality.) These inequalities have been generalized in various ways since their discovery nearly 40 years ago. A modern formulation due to Barthe [5] may be stated as follows ([5] Theorem 1). Let E, E_1, ..., E_m be Euclidean spaces and B_i : E → E_i be linear maps. Let (c_i)_{i=1}^m and D be positive real numbers. Then, the Brascamp-Lieb inequality:

Brascamp-Lieb Inequality and Its Reverse
for all nonnegative measurable functions f_i on E_i, i = 1, ..., m, holds if and only if it holds whenever f_i, i = 1, ..., m, are centered Gaussian functions (a centered Gaussian function is of the form x ↦ exp(r − x^T A x), where A is a positive semidefinite matrix and r ∈ R). Similarly, for F a positive real number, the reverse Brascamp-Lieb inequality, also known as Barthe's inequality (B_i^* denotes the adjoint of B_i), for all nonnegative measurable functions f_i on E_i, i = 1, ..., m, holds if and only if it holds for all centered Gaussian functions.
For surveys on the history of both the Brascamp-Lieb inequality and Barthe's inequality and their applications, see, e.g., [6,7]. The Brascamp-Lieb inequality can be seen as a generalization of several other inequalities, including Hölder's inequality, the sharp Young inequality, the Loomis-Whitney inequality, the entropy power inequality (cf. [6] or the survey paper [8]), hypercontractivity and the logarithmic Sobolev inequality [9]. Furthermore, the Prékopa-Leindler inequality can be seen as a special case of Barthe's inequality. Due in part to their utility in establishing impossibility bounds, these functional inequalities have attracted much attention in information theory [10][11][12][13][14][15][16][17], theoretical computer science [18][19][20][21][22] and statistics [23][24][25][26][27][28], to name only a small subset of the literature. Over the years, various proofs of these inequalities have been proposed [1,[29][30][31][32][33][34]. Among these, Lieb's elegant proof [29], which is very close to one of the techniques that will be used in this paper, employs a doubling trick that capitalizes on the rotational invariance property of the Gaussian function: if f is a one-dimensional centered Gaussian function, then f(x)f(y) = f((x+y)/√2) f((x−y)/√2). (3) Since (1) and (2) have the same structure modulo the direction of the inequality, a common viewpoint is to consider (1) and (2) as dual inequalities. This viewpoint successfully captures the geometric aspects of (1) and (2). Indeed, a precise relation between the best constants D and F is known as long as D, F < ∞ [5]. Moreover, both D and F are equal to one under Ball's geometric condition [35]: E_1, ..., E_m are one-dimensional, and ∑_{i=1}^m c_i B_i^* B_i is the identity matrix. (5) While fruitful, this "dual" viewpoint does not fully explain the asymmetry between the forward and the reverse inequalities: there is a sup in (2), but not in (1). This paper explores a different viewpoint. In particular, we propose a single inequality that unifies (1) and (2).
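The rotational-invariance identity (3) above follows by expanding the exponents: for f(x) = exp(r − ax²), both sides equal exp(2r − a(x² + y²)). A minimal Python sketch (function names are ours) verifying it numerically:

```python
import math

def gaussian(r, a):
    """Centered one-dimensional Gaussian function x -> exp(r - a*x^2), a >= 0."""
    return lambda x: math.exp(r - a * x * x)

def check_rotation_invariance(r, a, x, y):
    """Check f(x)*f(y) == f((x+y)/sqrt 2) * f((x-y)/sqrt 2) up to float error."""
    f = gaussian(r, a)
    s, d = (x + y) / math.sqrt(2), (x - y) / math.sqrt(2)
    # s^2 + d^2 = x^2 + y^2, so both products share the same exponent.
    return abs(f(x) * f(y) - f(s) * f(d)) < 1e-12 * max(1.0, f(x) * f(y))
```

This is exactly the invariance that the doubling trick exploits: evaluating a product of Gaussian functions at a rotated pair of points leaves the product unchanged.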
Accordingly, we should reverse both sides of (2) to make the inequality sign consistent with (1). To be concrete, let us first observe that (1) and (2) can be respectively restated in the following more symmetrical forms (with changes of certain symbols): • For all nonnegative functions g and f_1, ..., f_m such that: we have: • For all nonnegative measurable functions g_1, ..., g_l and f such that: we have: Note that in both cases, the optimal choice of one function (f or g) can be explicitly computed from the constraints, hence the conventional formulations in (1) and (2). Generalizing further, we can consider the following problem: Let X, Y_1, ..., Y_m, Z_1, ..., Z_l be measurable spaces. Consider measurable maps φ_j : X → Y_j, j = 1, ..., m, and ψ_i : X → Z_i, i = 1, ..., l. Let b_1, ..., b_l and c_1, ..., c_m be nonnegative real numbers. Let ν_1, ..., ν_l be measures on Z_1, ..., Z_l and µ_1, ..., µ_m be measures on Y_1, ..., Y_m, respectively. What is the smallest D > 0 such that for all nonnegative f_1, ..., f_m on Y_1, ..., Y_m and g_1, ..., g_l on Z_1, ..., Z_l satisfying: we have: Except for the special case of l = 1 (resp. m = 1), it is generally not possible to deduce a simple expression from (10) for the optimal choice of g_i (resp. f_j) in terms of the rest of the functions. We will refer to (11) as a forward-reverse Brascamp-Lieb inequality.
One of the motivations for considering multiple functions on both sides of (11) comes from multiuser information theory: independently, but almost simultaneously with the discovery of the Brascamp-Lieb inequality in mathematical physics, in the late 1970s, information theorists including Ahlswede, Gács and Körner [36,37] invented the image-size technique for proving strong converses in source and channel networks. An image-size inequality is a characterization of the tradeoff of the measures of certain sets connected by given random transformations (channels); we refer the interested readers to [37] for expositions on the image-size problem. Although this is not the way they are treated in [36,37], an image-size inequality can essentially be obtained from a functional inequality similar to (11) by taking the functions to be (roughly speaking) the indicator functions of sets. In the case of (10), the forward channels φ_1, ..., φ_m and the reverse channels ψ_1, ..., ψ_l degenerate into deterministic functions. In this paper, motivated by information theoretic applications similar to those of the image-size problems, we will consider further generalizations of (11) to the case of random transformations. Since the functional inequality is not restricted to indicator functions, it is strictly stronger than the corresponding image-size inequality. As a side remark, [38] uses functional inequalities that are variants of (11) together with a reverse hypercontractivity machinery to improve the image-size plus blowing-up machinery of [39] and shows that the non-indicator function generalization is crucial for achieving the optimal scaling of the second-order rate expansion.
Of course, to justify the proposal of (11), we must also prove that (11) enjoys certain nice mathematical properties; this is the main goal of the present paper. Specifically, we focus on two aspects of (11): equivalent entropic formulation and Gaussian optimality.
In the mathematical literature, e.g., [32,36,[40][41][42][43][44][45][46], it is known that certain integral inequalities are equivalent to inequalities involving relative entropies. In particular, Carlen, Loss and Lieb [47] and Carlen and Cordero-Erausquin [32] proved that the Brascamp-Lieb inequality is equivalent to the superadditivity of relative entropy. In this paper, we prove that the forward-reverse Brascamp-Lieb inequality (11) also has an entropic formulation, which turns out to be very close to the rate region of certain multiuser information theory problems (but we will clarify the difference in the text). In fact, Ahlswede, Csiszár and Körner [37,39] essentially derived image-size inequalities from similar entropic inequalities. Because of the reverse part, the proof of the equivalence of (11) and the corresponding entropic inequality is more involved than in the forward case considered in [32]: beyond the case of finite X, Y_j, Z_i, certain machinery from min-max theory appears necessary. In particular, the proof involves a novel use of the Legendre-Fenchel duality theory. Next, we give a basic version of our main result on the functional-entropic duality (more general versions will be given later). In order to streamline its presentation, all formal definitions of notation are postponed to Section 2.
Theorem 1 (Dual formulation of the forward-reverse Brascamp-Lieb inequality). Assume that: (i) m and l are positive integers, d ∈ R, and X is a compact metric space; (ii) b_i ∈ (0, ∞), ν_i is a finite Borel measure on a Polish space Z_i, and Q_{Z_i|X} is a random transformation from X to Z_i, for each i = 1, ..., l; (iii) c_j ∈ (0, ∞), µ_j is a finite Borel measure on a Polish space Y_j, and Q_{Y_j|X} is a random transformation from X to Y_j, for each j = 1, ..., m; (iv) for any (P_{Z_i})_{i=1}^l, there exists P_X such that P_X → Q_{Z_i|X} → P_{Z_i}, i = 1, ..., l. Then, the following two statements are equivalent: 1. If the nonnegative continuous functions (g_i), (f_j) are bounded away from zero and satisfy (12), then (13) holds.

2.
For any (P_{Z_i}) such that D(P_{Z_i} ‖ ν_i) < ∞, i = 1, ..., l (this assumption is not essential if we adopt the convention that the infimum in (14) is +∞ when it runs over an empty set), the entropic inequality (14) holds, where P_X → Q_{Y_j|X} → P_{Y_j}, j = 1, ..., m, and the infimum is over P_X such that P_X → Q_{Z_i|X} → P_{Z_i}, i = 1, ..., l.
Next, in a vein similar to the proverbial result that "Gaussian functions are optimal" for the forward or the reverse Brascamp-Lieb inequality, we show in this paper that Gaussian functions are also optimal for the forward-reverse Brascamp-Lieb inequality, particularized to the case of Gaussian reference measures and linear maps. The proof scheme is based on rotational invariance (3), which can be traced back in the functional setting to Lieb [29]. More specifically, we use a variant for the entropic setting introduced by Geng and Nair [48], thereby taking advantage of the dual formulation of Theorem 1.
As mentioned, in the literature on the forward or the reverse Brascamp-Lieb inequalities, it is known that a certain geometric condition (5) ensures that the best constant equals one. Now, for the forward-reverse inequality, there is a simple example where the best constant equals one: Example 1. Let l be a positive integer, and let M := (m_{ji})_{1≤j,i≤l} be an orthogonal matrix. For any nonnegative continuous functions (f_j)_{j=1}^l and (g_i)_{i=1}^l on R such that: we have: The rest of the paper is organized as follows: Section 2 defines the notation and reviews some basic theory of convex duality. Section 3 proves Theorem 1 and also presents its extensions to the settings of noncompact spaces or general reverse channels. Section 4 proves the Gaussian optimality in the entropic formulation, with the caveat that a certain "non-degenerate" assumption is imposed to ensure the existence of extremizers. At the end of Section 4, we give a proof sketch of Example 1 and also propose a generalization of the example. To completely prove Theorem 2, in Appendix F, we use a limiting argument to drop the non-degenerate assumption and apply the equivalence between the functional and entropic formulations.
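Example 1 can be probed numerically. Under one natural reading of the pointwise constraint in the example (the displayed equations are lost in this copy), products of standard Gaussian functions achieve equality on both sides, since an orthogonal M preserves Euclidean norms. The sketch below (variable names are ours) checks this for a random orthogonal matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
# Random orthogonal matrix via QR decomposition of a Gaussian matrix.
M, _ = np.linalg.qr(rng.standard_normal((4, 4)))

def prod_f(z):
    """Product of standard Gaussian functions evaluated at the rotated point Mz."""
    return np.exp(-0.5 * np.sum((M @ z) ** 2))

def prod_g(z):
    """Product of standard Gaussian functions evaluated at z itself."""
    return np.exp(-0.5 * np.sum(z ** 2))

z = rng.standard_normal(4)
# |Mz| = |z| for orthogonal M, so the two products coincide pointwise;
# integrating either side gives the same value (2*pi)^{l/2}.
assert abs(prod_f(z) - prod_g(z)) < 1e-12
```

This is consistent with the claim that the best constant in this example is one: with Gaussian choices, the pointwise constraint and the integral inequality both hold with equality.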

Review of the Legendre-Fenchel Duality Theory
Our proof of the equivalence of the functional and the entropic inequalities uses the Legendre-Fenchel duality theory, a topic from convex analysis. Before getting into that, a recap of some basics on the duality of topological vector spaces seems appropriate. Unless otherwise indicated, we assume Polish spaces and Borel measures. Recall that a Polish space is a complete separable metric space. It enjoys several nice properties that we use heavily in this section, including the Prokhorov theorem and the Riesz-Kakutani theorem. Of course, the Polish space assumption covers the cases of Euclidean and discrete spaces (endowed with the Hamming metric, which induces the discrete topology, making every function on the discrete set continuous), among others. Readers interested in discrete spaces only may refer to the (much simpler) argument in [49] based on the KKT condition. Notation 1. Let X be a topological space.
• C_c(X) denotes the space of continuous functions on X with compact support; • C_0(X) denotes the space of all continuous functions f on X that vanish at infinity (i.e., for any ε > 0, there exists a compact set K ⊆ X such that |f(x)| < ε for x ∈ X \ K); • C_b(X) denotes the space of bounded continuous functions on X; • M(X) denotes the space of finite signed Borel measures on X; • P(X) denotes the space of probability measures on X.
We consider C c , C 0 and C b as topological vector spaces, with the topology induced from the sup norm. The following theorem, usually attributed to Riesz, Markov and Kakutani, is well known in functional analysis and can be found in, e.g., [50,51].

Theorem 3 (Riesz-Markov-Kakutani)
If X is a locally compact, σ-compact Polish space, then the dual (the dual of a topological vector space consists of all continuous linear functionals on that space, which is naturally also a topological vector space (with the weak* topology)) of both C_c(X) and C_0(X) is M(X). Remark 1. The dual space of C_b(X) can be strictly larger than M(X), since it also contains those linear functionals that depend on the "limit at infinity" of a function f ∈ C_b(X) (originally defined for those f that do have a limit at infinity and then extended to the whole of C_b(X) by the Hahn-Banach theorem; see, e.g., [50]).
Of course, any µ ∈ M(X) is a continuous linear functional on C_0(X) or C_c(X), given by f ↦ ∫ f dµ, where f is a function in C_0(X) or C_c(X). As is well known, Theorem 3 states that the converse is also true under mild regularity assumptions on the space. Thus, we can view measures as continuous linear functionals on a certain function space (in fact, some authors prefer to construct measure theory by defining a measure as a linear functional on a suitable function space; see Lax [50] or Bourbaki [52]); this justifies the shorthand notation µ(f) := ∫ f dµ, which we employ in the rest of the paper. This viewpoint is the most natural for our setting, since in the proof of the equivalent formulation of the forward-reverse Brascamp-Lieb inequality, we shall use the Hahn-Banach theorem to show the existence of certain linear functionals.
Let Λ : C_b(X) → (−∞, +∞] be a lower semicontinuous, proper convex function. Its Legendre-Fenchel transform Λ* : C_b(X)* → (−∞, +∞] is given by Λ*(ℓ) := sup_{f ∈ C_b(X)} {ℓ(f) − Λ(f)}. Let ν be a nonnegative finite Borel measure on a Polish space X, and define the convex functional on C_b(X): Λ(f) := log ∫ exp(f) dν. (23) Then, note that the relative entropy has the following alternative definition: D(µ ‖ ν) := Λ*(µ) = sup_{f ∈ C_b(X)} {µ(f) − log ν(exp(f))} (24) for any µ ∈ M(X), which agrees with the more familiar definition D(µ ‖ ν) := µ(log dµ/dν) when ν is a probability measure, by the Donsker-Varadhan formula (cf. [53] Lemma 6.2.13). If µ is not a probability measure, then D(µ ‖ ν) as defined in (24) is +∞.
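On a finite alphabet, the Donsker-Varadhan variational formula in (24) can be verified directly: the supremum is attained at f = log(dµ/dν), and every other f gives a smaller value. A small numerical sketch (assuming nothing beyond the formula itself):

```python
import numpy as np

mu = np.array([0.5, 0.3, 0.2])   # a probability measure
nu = np.array([0.2, 0.3, 0.5])   # the reference measure

kl = float(np.sum(mu * np.log(mu / nu)))   # D(mu || nu) computed directly

def dv_value(f):
    """Donsker-Varadhan objective mu(f) - log nu(exp f)."""
    return float(np.dot(mu, f) - np.log(np.dot(nu, np.exp(f))))

f_star = np.log(mu / nu)                   # the optimal f (up to additive constants)
assert abs(dv_value(f_star) - kl) < 1e-12  # supremum is attained at f_star

rng = np.random.default_rng(1)
for _ in range(100):                       # any other f is suboptimal
    assert dv_value(rng.standard_normal(3)) <= kl + 1e-12
```

The same variational characterization is what extends D(· ‖ ν) to arguments that are not probability measures (where the supremum is +∞).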
Given a bounded linear operator T : C_b(Y) → C_b(X), its dual T* : C_b(X)* → C_b(Y)* is defined by (T* µ_X)(g) := µ_X(Tg) for any g ∈ C_b(Y) and µ_X ∈ C_b(X)*. Since P(X) ⊆ M(X) ⊆ C_b(X)*, T is said to be a conditional expectation operator if T* P ∈ P(Y) for any P ∈ P(X). The operator T* is defined as the dual of a conditional expectation operator T and, in a slight abuse of terminology, is said to be a random transformation from X to Y. For example, in the notation of Theorem 1, if g ∈ C_b(Y) and Q_{Y|X} is a random transformation from X to Y, the quantity Q_{Y|X}(g) is a function on X, defined by taking the conditional expectation. Furthermore, if P_X ∈ P(X), we write P_X → Q_{Y|X} → P_Y to indicate that P_Y ∈ P(Y) is the measure induced on Y by applying Q_{Y|X} to P_X.
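In the finite-alphabet case, a conditional expectation operator T is multiplication by a row-stochastic matrix, and its dual T* acts by the transpose; the defining identity (T*P)(g) = P(Tg) and the fact that T* maps P(X) into P(Y) can be checked directly. A minimal sketch (names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
# Row-stochastic kernel: Q[x, y] plays the role of Q_{Y|X}(y|x).
Q = rng.random((3, 4))
Q /= Q.sum(axis=1, keepdims=True)

def T(g):
    """Conditional expectation operator: (Tg)(x) = sum_y Q(y|x) g(y)."""
    return Q @ g

def T_star(p):
    """Dual operator: pushes a measure on X forward to a measure on Y."""
    return Q.T @ p

p = rng.random(3); p /= p.sum()     # a probability measure on X
g = rng.standard_normal(4)          # a bounded function on Y

assert abs(np.dot(T_star(p), g) - np.dot(p, T(g))) < 1e-12   # (T*p)(g) = p(Tg)
assert abs(T_star(p).sum() - 1.0) < 1e-12                    # T* sends P(X) into P(Y)
```

Writing P_X → Q_{Y|X} → P_Y then just means P_Y = T_star(P_X) in this finite model.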

Remark 2.
From the viewpoint of category theory (see, for example, [54,55]), C_b is a contravariant functor from the category of topological spaces to the category of topological vector spaces: any continuous φ : X → Y (a morphism between topological spaces) induces the map C_b(φ) : C_b(Y) → C_b(X), g ↦ g ∘ φ (note the reversal of arrows in Figure 1). Moreover, C_b(X)* = M(X) and C_b(φ)* = M(φ) if X and Y are compact metric spaces and φ : X → Y is continuous. Definition 2 can therefore be viewed as the special case where φ is the projection map. Figure 1. Diagrams for Theorem 1.
The map C_b(Z_i) → C_b(Z_1 × Z_2) induced by the coordinate projection is called a canonical map, whose action is almost trivial: it sends a function of z_i to itself, but viewed as a function of (z_1, z_2).
Its dual M(Z_1 × Z_2) → M(Z_i) is called marginalization, which simply takes a joint distribution to a marginal distribution.
The Fenchel-Rockafellar duality (see [40] Theorem 1.9, or [56] in the case of finite dimensional vector spaces) usually refers to the k = 1 special case of the following result.

Theorem 4.
Assume that A is a topological vector space whose dual is A*. Let Θ_j : A → R ∪ {+∞}, j = 0, 1, ..., k, for some positive integer k. Suppose there exist some (u_j)_{j=1}^k and u_0 := −(u_1 + · · · + u_k) such that (26) holds and Θ_0 is upper semicontinuous at u_0. Then, (27) holds. For completeness, we provide a proof of this result, which is based on the Hahn-Banach theorem (Theorem 5) and is similar to the proof of [40] Theorem 1.9.
Proof. Let m_0 be the right side of (27). The ≤ part of (27) follows trivially from the (weak) min-max inequality since: It remains to prove the ≥ part, and we may assume without loss of generality that m_0 > −∞. Note that (26) also implies that m_0 < +∞. Define convex sets: Observe that these are nonempty sets because of (26). Furthermore, C_0 has a nonempty interior by the assumption that Θ_0 is upper semicontinuous at u_0. Thus, the Minkowski sum C := C_0 + C_1 + · · · + C_k is a convex set with a nonempty interior. Moreover, C ∩ B = ∅. By the Hahn-Banach theorem (Theorem 5), there exists (ℓ, s) ∈ A* × R such that (32) holds for any m ≤ m_0 and (u_j, r_j) ∈ C_j, j = 0, ..., k. From (30), we see that (32) can only hold when s ≥ 0. Moreover, from (26) and the upper semicontinuity of Θ_0 at u_0, we see that the ∑_{j=0}^k u_j in (32) can take any value in a neighborhood of 0 ∈ A; hence, s > 0. Thus, dividing both sides of (32) by s and setting ℓ ← −ℓ/s, we obtain the ≥ part of (27).
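Since the displayed statement of Theorem 4 is lost in this copy, a one-dimensional k = 1 instance of the classical Fenchel-Rockafellar identity inf_x [f(x) + g(x)] = sup_y [−f*(−y) − g*(y)] may help fix ideas. Here f(x) = x²/2 and g(x) = (x − 1)²/2, whose conjugates f*(y) = y²/2 and g*(y) = y²/2 + y are computed by hand:

```python
import numpy as np

# Primal problem: minimize f(x) + g(x) over the real line (grid search).
xs = np.linspace(-5, 5, 200001)
primal = np.min(0.5 * xs**2 + 0.5 * (xs - 1) ** 2)   # minimized at x = 1/2

# Dual problem: maximize -f*(-y) - g*(y), using the hand-computed conjugates.
ys = np.linspace(-5, 5, 200001)
dual = np.max(-(0.5 * ys**2) - (0.5 * ys**2 + ys))   # maximized at y = -1/2

# Strong duality: both values equal 1/4.
assert abs(primal - 0.25) < 1e-8 and abs(dual - 0.25) < 1e-8
```

The proof above plays the same game in infinite dimensions, with the Hahn-Banach separation supplying the dual optimizer ℓ.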
Theorem 5 (Hahn-Banach). Let C and B be convex, nonempty disjoint subsets of a topological vector space A.

1.
If the interior of C is non-empty, then there exists ℓ ∈ A*, ℓ ≠ 0, such that: 2.
If A is locally convex, B is compact and C is closed, then there exists ℓ ∈ A* such that: Remark 3. The assumption in Theorem 5 that C has a nonempty interior is only necessary in the infinite-dimensional case. However, even if A in Theorem 4 is finite-dimensional, the assumption in Theorem 4 that Θ_0 is upper semicontinuous at u_0 is still necessary, because this assumption was not only used in applying Hahn-Banach, but also in concluding that s > 0 in (32).

The Entropic-Functional Duality
In this section, we prove Theorem 1 and some of its generalizations.

Compact X
We first state a duality theorem for the case of compact spaces to streamline the proof. Later, we show that the argument can be extended to a particular non-compact case (Theorem 1 is not included in the conference paper [49], but was announced in the conference presentation). Our proof based on the Legendre-Fenchel duality (Theorem 4) was inspired by the proof of the Kantorovich duality in the theory of optimal transportation (see [40] Chapter 1, where the idea was credited to Brenier).
Recall from Section 2 that a random transformation (a mapping between probability measures) is formally the dual of a conditional expectation operator. Suppose that Q_{Y_j|X} = T_j^*, j = 1, ..., m, and Q_{Z_i|X} = S_i^*, i = 1, ..., l. Proof of Theorem 1. We can safely assume d = 0 below without loss of generality (since otherwise, we can always substitute µ_1 ← exp(d/c_1) µ_1).

1)⇒2)
This is the nontrivial direction, which relies on certain (strong) min-max type results. In Theorem 4, put (in (36), u ≤ 0 means that u is pointwise non-positive): Then, for each j = 1, ..., m, set: as a convention. Observe that: • Θ_j is convex: indeed, given arbitrary u_0 and u_1, suppose that v_0 and v_1 respectively achieve the infimum in (38) for u_0 and u_1 (if the infimum is not achievable, the argument still goes through by an approximation and limiting argument). Then, for any α ∈ [0, 1], v_α := (1 − α)v_0 + αv_1 is feasible for u_α := (1 − α)u_0 + αu_1. Thus, the convexity of Θ_j follows from the convexity of the functional in (23). • Θ_j is proper: otherwise, for any P_X and P_{Y_j} := T_j^* P_X, the expression in (42) would equal +∞, which contradicts the assumption that ∑_{j=1}^m c_j D(P_{Y_j} ‖ µ_j) < ∞ in the theorem; here, the definition of D(· ‖ µ_j) is extended using the Donsker-Varadhan formula (that is, it is infinite when the argument is not a probability measure).
Finally, for the given (P_{Z_i})_{i=1}^l, choose: Notice that: where P_X is such that S_i^* P_X = P_{Z_i}, i = 1, ..., l, whose existence is guaranteed by the assumption of the theorem. This also shows that Θ_{m+1} > −∞.
Invoking Theorem 4 (where the u_j in Theorem 4 can be chosen as the constant function u_j ≡ 1, j = 1, ..., m + 1): where v^m denotes the collection of the functions v_1, ..., v_m, and similarly for w^l. Note that the left side of (46) is exactly the right side of (14). For any ε > 0, invoking (13) with f_j := exp((1/c_j) v_j), j = 1, ..., m, and the corresponding choices of g_i, i = 1, ..., l, we upper bound the left side of (47) by: where the last step follows by the Donsker-Varadhan formula. Therefore, (14) is established since ε > 0 is arbitrary.

2)⇒1)
Since ν_i is finite and g_i is bounded by assumption, we have ν_i(g_i) < ∞. The claim is trivially true when ν_i(g_i) = 0 for some i, so we will assume below that ν_i(g_i) > 0 for each i. Then, for any ε > 0, the following holds, where: • (51) uses the Donsker-Varadhan formula, and we have chosen P_X and P_{Y_j} := T_j^* P_X, j = 1, ..., m, such that: • (52) also follows from the Donsker-Varadhan formula.
The result follows since ε > 0 can be arbitrary.

Remark 4.
Condition (iv) in the theorem imposes a rather strong assumption on (S_i): for simplicity, consider the case where |X|, |Z_i| < ∞. Then, Condition (iv) assumes that for any (P_{Z_i}), there exists P_X such that P_{Z_i} = S_i^* P_X. This assumption is certainly satisfied when (S_i) are induced by coordinate projections; the case of l = 1 and P_{Z|X} being a reverse erasure channel gives a simple example where P_{Z|X} is not a deterministic map.
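One plausible finite-alphabet reading of the reverse erasure channel example (the specific channel matrix below is our illustrative choice): the extra input symbol makes the channel non-deterministic, yet every P_Z still has a preimage under S*, so Condition (iv) holds.

```python
import numpy as np

# A "reverse erasure"-style channel S from X = {0, 1, e} to Z = {0, 1}:
# symbols 0 and 1 pass through deterministically, while the extra
# symbol e is mapped to a fair coin, so S is not a deterministic map.
S = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])

def preimage(p_z):
    """A P_X with S* P_X = P_Z: put no mass on the extra symbol e."""
    return np.array([p_z[0], p_z[1], 0.0])

for p_z in ([0.3, 0.7], [1.0, 0.0], [0.5, 0.5]):
    p_z = np.array(p_z)
    assert np.allclose(S.T @ preimage(p_z), p_z)   # Condition (iv) is satisfied
```

Surjectivity of S* onto P(Z) is all that Condition (iv) asks for; determinism of the channel is not needed.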
Next, we give a generalization of Theorem 1, which alleviates the restriction on (S_i): Theorem 6. Theorem 1 continues to hold if Condition (iv) therein is weakened to (54), and the conclusion of the theorem is replaced by the equivalence of the following two statements: 1.
For any nonnegative continuous functions (g_i), (f_j) bounded away from zero and such that: we have: In Appendix A, we show that Theorem 6 indeed recovers Theorem 1 for the more restricted class of random transformations.
Proof. Here, we mention the parts of the proof that need to be changed: upon specifying (f_j) and (g_i) right after (47), we select (g_i) such that: Then, in lieu of (59), we upper-bound the left side of (47) by: which establishes the 1)⇒2) part. For the other direction, for each i ∈ {1, 2, ..., l}, define: Then, following essentially the same proof as that for Θ_j in (38), we see that Λ_i is proper convex and: Moreover, let: Then, Λ*_{l+1}(π) = −∑_i b_i (S_i^* π)(log g_i). Using the Legendre-Fenchel duality, we see that for any ε > 0, where: • To see (67), we note that the sup in (66) can be restricted to π that are probability measures, since otherwise, the relative entropy terms in (66) are +∞ by the definition via the Donsker-Varadhan formula. Then, we select P_X such that (67) holds.
• In (68), we have chosen P_X such that: and then applied the assumption (54). The result follows since ε > 0 can be arbitrary. Remark 5. The infimum in (14) is in fact achievable: for any (P_{Z_i}), there exists a P_X that minimizes ∑_{j=1}^m c_j D(P_{Y_j} ‖ µ_j) subject to the constraints S_i^* P_X = P_{Z_i}, i = 1, ..., l, where P_{Y_j} := T_j^* P_X, j = 1, ..., m. Indeed, since the singleton {P_{Z_i}} is weak*-closed and S_i^* is weak*-continuous (generally, if T : A → B is a continuous map between two topological vector spaces, then T* : B* → A* is a weak*-continuous map between the dual spaces: indeed, if y_n → y is a weak*-convergent sequence in B*, meaning y_n(b) → y(b) for any b ∈ B, then we must have T* y_n(a) = y_n(Ta) → y(Ta) = T* y(a) for any a ∈ A, meaning that T* y_n converges to T* y in the weak* topology), the set ∩_{i=1}^l (S_i^*)^{-1}(P_{Z_i}) is weak*-closed in M(X); hence, its intersection with P(X) is weak*-compact in P(X), because P(X) is weak*-compact by (a simple version, for the setting of a compact underlying space X, of) the Prokhorov theorem [57]. Moreover, by the weak*-lower semicontinuity of D(· ‖ µ_j) (easily seen from the variational formula/Donsker-Varadhan formula of the relative entropy, cf. [58]) and the weak*-continuity of T_j^*, j = 1, ..., m, we see that ∑_{j=1}^m c_j D(T_j^* P_X ‖ µ_j) is weak*-lower semicontinuous in P_X, and hence, the existence of a minimizing P_X is established.

Remark 6.
Abusing the terminology from min-max theory, Theorem 1 may be interpreted as a "strong duality" result, which establishes the equivalence of two optimization problems. The 1)⇒2) part is the non-trivial direction, which requires regularity on the spaces. In contrast, the 2)⇒1) direction can be thought of as a "weak duality", which establishes only a partial relation, but holds for more general spaces.

Noncompact X
Our proof of 1)⇒2) in Theorem 1 makes use of the Hahn-Banach theorem and hence relies crucially on the fact that the measure space is the dual of the function space. Naively, one might want to extend the proof to the case of locally compact X by considering C_0(X) instead of C_b(X), so that the dual space is still M(X). However, this would not work: consider the case when X = Z_1 × · · · × Z_l and each S_i is the canonical map. Then, Θ_{m+1}(u) as defined in (43) is +∞ unless u ≡ 0 (because u ∈ C_0(X) requires that u vanish at infinity); thus, Θ*_{m+1} ≡ 0. Luckily, we can still work with C_b(X); in this case, ℓ ∈ C_b(X)* may not be a measure, but we can decompose it as ℓ = π + R, where π ∈ M(X) and R is a linear functional "supported at infinity". Below, we use the techniques in [40] (Chapter 1.3) to prove a particular extension of Theorem 1 to a non-compact case.

• The assumption that X is a compact metric space is relaxed to the assumption that it is a locally compact and σ-compact Polish space;
• S_i, i = 1, ..., l, are canonical maps (see Definition 2).
Proof. The proof of the "weak duality" part 2)⇒1) still works in the noncompact case, so we only need to explain what changes need to be made in the proof of the 1)⇒2) part. Let Θ_0 be defined as before, in (36). Then, for any ℓ ∈ C_b(X)*, Θ*_0(ℓ) is zero if ℓ is nonnegative (in the sense that ℓ(u) ≥ 0 for every u ≥ 0), and +∞ otherwise. This means that when computing the infimum on the left side of (27), we only need to take into account those nonnegative ℓ. Next, let Θ_{m+1} be also defined as before. Then, directly from the definition, we have (73) for any ℓ ∈ C_b(X)*. Generally, the condition in the first line of (73) does not imply that ℓ is a measure. However, if ℓ is also nonnegative, then using a technical result in [40] Lemma 1.25, we can further simplify to (74). This further shows that when we compute the left side of (27), the infimum can be taken over ℓ that are couplings of (P_{Z_i}). In particular, if ℓ is a probability measure, then Θ*_j(ℓ) = c_j D(T_j^* ℓ ‖ µ_j) still holds with the Θ_j defined in (38), j = 1, ..., m. Thus, the rest of the proof can proceed as before.

Remark 7.
The second assumption is made in order to achieve (74) in the proof.

Gaussian Optimality
Recall that the conventional Brascamp-Lieb inequality and its reverse ( (1) and (2)) state that centered Gaussian functions exhaust such inequalities, and in particular, verifying those inequalities is reduced to a finite dimensional optimization problem (only the covariance matrices in these Gaussian functions are to be optimized). In this section, we show that similar results hold for the forward-reverse Brascamp-Lieb inequality, as well. Our proof uses the rotational invariance argument mentioned in Section 1. Since the forward-reverse Brascamp-Lieb inequality has dual representations (Theorem 7), in principle, the rotational invariance argument can be applied either to the functional representation (as in Lieb's paper [29]) or the entropic representation (as in Geng-Nair [48]). Here, we adopt the latter approach. We first consider a certain "non-degenerate" case where the existence of an extremizer is guaranteed. Then, Gaussian optimality in the general case follows by a limiting argument (Appendix F), establishing Theorem 2.

Non-Degenerate Forward Channels
This subsection focuses on the following case: Assumption 1.
Given Borel measures P_{X_i} on R, i = 1, ..., l, define: where the infimum is over Borel measures P_X that have (P_{X_i}) as marginals. Note that (75) is well defined, since the first term cannot be +∞ under the non-degenerate assumption, and the second term cannot be −∞. The aim of this subsection is to prove the following: Theorem 8. The supremum sup_{(P_{X_i})} F_0((P_{X_i})), where the supremum is over Borel measures P_{X_i} on R, i = 1, ..., l, is achieved by some Gaussian (P_{X_i})_{i=1}^l, in which case the infimum in (75) is achieved by some Gaussian P_X.
Naturally, one would expect that Gaussian optimality can be established when (µ_j)_{j=1}^m and (ν_i)_{i=1}^l are either Gaussian or Lebesgue. We made the assumption that the former are Lebesgue and the latter are Gaussian so that certain technical conditions can be justified more easily. More precisely, the following observation shows that we can regularize the distributions by a second moment constraint for free: Proposition 1. sup_{(P_{X_i})} F_0((P_{X_i})) is finite, and there exist σ_i^2 ∈ (0, ∞), i = 1, ..., l, such that it equals (76). Proof. When µ_j is Lebesgue and P_{Y_j|X} is non-degenerate, D(P_{Y_j} ‖ µ_j) = −h(P_{Y_j}) ≤ −h(Y_j|X) is bounded above (in terms of the variance of the additive noise of P_{Y_j|X}). Moreover, D(P_{X_i} ‖ ν_i) ≥ 0 when ν_i is Gaussian, so sup_{(P_{X_i})} F_0((P_{X_i})) < ∞. Further, choosing (P_{X_i}) = (ν_i) and using the covariance matrix to lower bound the first term in (75) shows that sup_{(P_{X_i})} F_0((P_{X_i})) > −∞. To see (76), notice that: where ν̃_i is a Gaussian distribution with the same first and second moments as X_i ∼ P_{X_i}. Thus, D(P_{X_i} ‖ ν_i) is bounded below by some function of the second moment of X_i, which tends to ∞ as the second moment of X_i tends to ∞. Moreover, as argued in the preceding paragraph, the first term in (75) is bounded above by some constant depending only on (P_{Y_j|X}). Thus, we can choose σ_i^2 > 0, i = 1, ..., l, large enough such that if E[X_i^2] > σ_i^2 for some i, then F_0((P_{X_i})) < sup_{(P_{X_i})} F_0((P_{X_i})), irrespective of the choices of P_{X_1}, ..., P_{X_{i−1}}, P_{X_{i+1}}, ..., P_{X_l}. Then, these σ_1, ..., σ_l are as desired in the proposition.
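The lower bound used in this proof can be made explicit in the simplest case: D(N(0, a) ‖ N(0, 1)) = (a − 1 − ln a)/2, which grows without bound in the second moment a. A quick numerical check:

```python
import math

def kl_gaussian(a):
    """D( N(0, a) || N(0, 1) ) = (a - 1 - ln a)/2 for variance a > 0."""
    return 0.5 * (a - 1.0 - math.log(a))

# The divergence grows monotonically with the second moment beyond a = 1 ...
vals = [kl_gaussian(a) for a in (2.0, 10.0, 100.0, 1e4)]
assert all(x < y for x, y in zip(vals, vals[1:]))

# ... and is unbounded, which is what licenses the second moment truncation.
assert kl_gaussian(1e4) > 1e3
```

This is why mass with very large second moment can be excluded "for free" when maximizing F_0.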
The non-degenerate assumption ensures that the supremum is achieved:

1.
For any (P_{X_i})_{i=1}^l, the infimum in (75) is attained by some Borel P_X.

2.
If (P_{Y_j|X^l})_{j=1}^m are non-degenerate (Definition 3), then the supremum in (76) is achieved by some Borel (P_{X_i})_{i=1}^l. The proof of Proposition 2 is given in Appendix E. After taking care of the existence of the extremizers, we get to the tensorization properties, which are the crux of the proof: Fix (ν_i), (µ_j), (T_j), and (c_j) ∈ [0, ∞)^m, and let (S_i) be induced by coordinate projections. Then: where for each j, on the left side and: on the right side, t = 1, 2.
Proof. We only need to prove the nontrivial ≥ part. For any P_{X^{(1,2)}} on the left side, choose P_{X^{(t)}}, t = 1, 2, on the right side by marginalization. Then, the difference between the two sides decomposes into terms that are ≥ 0 for each j.
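The ≥ direction reflects the superadditivity of relative entropy under product reference measures, D(P_{X1 X2} ‖ µ ⊗ µ) ≥ D(P_{X1} ‖ µ) + D(P_{X2} ‖ µ) (the gap is the mutual information between the coordinates). A finite-alphabet check (the joint distribution below is an arbitrary illustrative choice):

```python
import numpy as np

def kl(p, q):
    """Relative entropy D(p || q) for finite distributions."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

mu = np.array([0.5, 0.5])
# A correlated joint distribution on {0,1}^2 with uniform marginals.
P12 = np.array([[0.4, 0.1],
                [0.1, 0.4]])
P1, P2 = P12.sum(axis=1), P12.sum(axis=0)

joint = kl(P12.ravel(), np.outer(mu, mu).ravel())
# Superadditivity: the joint divergence dominates the sum of marginal ones.
assert joint >= kl(P1, mu) + kl(P2, mu) - 1e-12
```

Here the marginal terms vanish while the joint divergence is strictly positive, so the inequality is strict whenever the coordinates are dependent.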
We are now ready to show the main result of this section.
Proof of Theorem 8.

1.
Assume that (P_{X_i}^{(t)}), t = 1, 2, are maximizers of F_0. Define: Next, we perform the same algebraic expansion as in the proof of tensorization: where: • (84) uses Lemma 1.
In (87), we selected a particular instance of coupling P X + X − , constructed as follows: first, we select an optimal coupling P X + for given marginals (P X + i ). Then, for any x + = (x + i ) l i=1 , let P X − |X + =x + be an optimal coupling of (P X − i |X + i =x + i ) (for a justification that we can select optimal coupling P X − |X + =x + in a way that P X − |X + is indeed a regular conditional probability distribution, see [7]). With this construction, it is apparent that X + i − X + − X − i , and hence: • (88) is because in the above, we have constructed the coupling optimally.
• (89) is because (P (t) 3. Thus, in the expansions above, equalities are attained throughout. Using the differentiation technique as in the case of forward inequality, for almost all (b i ), (c j ), we have: where (92) is because by symmetry, we can perform the algebraic expansions in a different way to show that (P X − i ) is also a maximizer of F 0 . Then, are Gaussian with the same covariance. Lastly, using Lemma 1 and the doubling trick, one can show that the optimal coupling is also Gaussian.

Analysis of Example 1 Using Gaussian Optimality
We note that Example 1 is a rather simple setting, where (17) can be proven by integrating the two sides of (18) and applying a change of variables, noting that the absolute value of the Jacobian equals one. Nevertheless, it is illuminating to give an alternative proof using the Gaussian optimality result, as a proof of concept. In this section, we give only a proof sketch in which certain "technicalities" are not justified. Details of the justifications are deferred to Appendix F.
Proof sketch for the claim in Example 1. By duality (Theorem 7), it suffices to prove the corresponding entropic inequality. The Gaussian optimality result in Theorem 8 assumed Gaussian reference measures on the output and non-degenerate forward channels in order to simplify the proof of the existence of minimizers; however, supposing that Gaussian optimality extends beyond those technical conditions, we see that it suffices to prove (93) for any centered Gaussian $(P_{X_i})$, where the supremum is over Gaussian $P_{X^l}$ with the marginals $P_{X_1}, \ldots, P_{X_l}$ and $Y_j := \sum_{i=1}^l m_{ji} X_i$. Let $a_i := E[X_i^2]$, and choose $P_{X^l} = \prod_{i=1}^l P_{X_i}$; we see that (93) holds if: where $(a_i)$ are the eigenvalues and $(\sum_{i=1}^l m_{ji}^2 a_i)_{j=1}^l$ are the diagonal entries of the matrix: Therefore, (94) holds.
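The comparison between the eigenvalues $(a_i)$ and the diagonal entries of the conjugated matrix in the last step is an instance of Hadamard's inequality for positive semidefinite matrices, $\det A \le \prod_j A_{jj}$: taking logarithms, $\sum_i \log a_i \le \sum_j \log\,(M \operatorname{diag}(a) M^\top)_{jj}$ for orthogonal $M$. A randomized sanity check (NumPy; a sketch for illustration, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
l = 5
for _ in range(100):
    # Random orthogonal M via QR, random positive eigenvalues a.
    M, _ = np.linalg.qr(rng.standard_normal((l, l)))
    a = rng.uniform(0.1, 10.0, size=l)
    S = M @ np.diag(a) @ M.T            # positive definite, eigenvalues a
    # Hadamard: det(S) = prod(a) <= product of the diagonal entries of S.
    assert np.all(np.diag(S) > 0)
    assert np.log(np.diag(S)).sum() >= np.log(a).sum() - 1e-9
```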
A generalization of Example 1 is as follows. The proposition generalizes the claim in Example 1. Indeed, observe that there is no loss of generality in assuming that $(b_1, \ldots, b_l)$ and $(c_1, \ldots, c_l)$ are probability vectors, since by dimensional analysis, the best constant is infinite unless $\sum_{i=1}^l b_i = \sum_{j=1}^l c_j$; and it is also clear that the best constant is invariant when each $b_i$ and $c_j$ is multiplied by the same positive number. Moreover, any orthogonal matrix can be approximated by a sequence of orthogonal matrices $M$ with nonzero entries, for which the neighborhood $U$ shrinks but always contains the uniform probability vector $(\frac{1}{l}, \ldots, \frac{1}{l})$.
Proof sketch for Proposition 3. Along the same lines as (94), the best constant in the FR-BL inequality equals: where, without loss of generality, we assumed $a^l \in \Delta$ is in the probability simplex. We first observe that if the positive semidefinite constraint $A \succeq 0$ in (96) were absent, then the sup in the denominator of (96) would equal $\prod_{j=1}^l c_j^{c_j}$, and consequently, (96) would equal $\exp(H(c^l) - H(b^l))$, for any $b^l, c^l \in \Delta$ not necessarily close to the uniform probability vector. Indeed, fixing $A_{ii} = a_i$, $i = 1, \ldots, l$, the linear map from the off-diagonal entries of $A$ to the diagonal entries of $MAM^\top$ is onto the space of $l$-vectors whose entries sum to one; the proof of surjectivity can be reduced to checking that the only diagonal matrix commuting with $M$ is a multiple of the identity matrix. Then, the sup in the denominator is achieved when $(MAM^\top)_{jj} = c_j$, $j = 1, \ldots, l$, which is independent of $a^l$.
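The unconstrained value of the denominator is an instance of Gibbs' inequality: over probability vectors $x$, $\sum_j c_j \log x_j$ is maximized at $x = c$ with value $\log \prod_j c_j^{c_j}$, since the gap is exactly $-D(c \| x) \le 0$. A quick randomized check (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
c = np.array([0.5, 0.3, 0.2])           # probability vector (c_j)
best = float((c * np.log(c)).sum())     # value at x = c: log prod c_j^{c_j}

# Gibbs' inequality: sum_j c_j log x_j <= sum_j c_j log c_j on the simplex,
# with equality iff x = c (the gap equals -D(c || x)).
for _ in range(1000):
    x = rng.dirichlet(np.ones(3))       # random point of the simplex
    assert (c * np.log(x)).sum() <= best + 1e-9
```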
Next, we argue that the constraint $A \succeq 0$ in (96) is not active when $b^l$ and $c^l$ are close to the uniform vector. Denote by $U(t)$ the set of $l$-vectors whose distance (say, in total variation) to the uniform vector $(\frac{1}{l}, \ldots, \frac{1}{l})$ is at most $t$. Observe that:

1. There exists $t > 0$ such that for every $a^l \in U(t)$, which follows by continuity and the fact that, when $a^l$ is uniform, the sup in (97) is achieved at the strictly positive definite $A = l^{-1} I$.

2. When $b^l = c^l = (\frac{1}{l}, \ldots, \frac{1}{l})$ is the uniform probability vector, (96) equals one, and this is uniquely achieved by $a^l = (\frac{1}{l}, \ldots, \frac{1}{l})$. To see the uniqueness, take $A$ to be diagonal in the denominator and observe that the denominator is strictly bigger than the numerator when the diagonal of $MAM^\top$ is not a permutation of $a^l$. Then, since a continuous function attains its extreme values on a compact set, we can find $\delta > 0$ such that: for any $a^l \notin U(t/2)$.

3. Finally, by continuity, we can choose $s \in (0, t/2)$ small enough such that for any $b^l, c^l \in U(s)$. Taking the neighborhood $U(s)$ proves the claim.

Relation to Hypercontractivity and Its Reverses
As alluded to before and illustrated by Figure 2, the forward-reverse Brascamp-Lieb inequality generalizes several other inequalities from functional analysis and information theory; a more complete discussion on these relationships can be found in [7]. In this section, we focus on hypercontractivity and show how its three cases all follow from Theorem 1. Among these, the case in Section 5.3 can be regarded as an instance of the forward-reverse inequality that cannot be reduced to either the forward or the reverse inequality alone. It is also interesting to note that, from the viewpoint of the forward-reverse Brascamp-Lieb inequality, in each of the three special cases, there ought to be three functions involved in the functional formulation; however, the optimal choice of one function can be computed from the other two. Therefore, the conventional functional formulations of the three cases of hypercontractivity involve only two functions, making it non-obvious to find a unifying inequality.


Figure 2. The forward-reverse Brascamp-Lieb inequality generalizes several other functional and information-theoretic inequalities: its forward part specializes to the strong data processing inequality [36] and to reverse hypercontractivity with one negative parameter (110), while its reverse part specializes to hypercontractivity (103) and to reverse hypercontractivity with positive parameters (106). For more discussion of these relations, see the extended version [7].

Hypercontractivity

Fix a joint probability distribution $Q_{Y_1 Y_2}$ and nonnegative continuous functions $F_1$ and $F_2$ on $\mathcal{Y}_1$ and $\mathcal{Y}_2$, respectively, both bounded away from zero. In Theorem 1, take $l \leftarrow 1$, $m \leftarrow 2$, $b_1 \leftarrow 1$, $d \leftarrow 0$, and let $T_1$ and $T_2$ be the canonical maps (Definition 2). The measure spaces and the random transformations are as shown in Figure 3.

Figure 3. Diagram for hypercontractivity.

The constraint (12) translates to: and the optimal choice of $g_1$ is when the equality is achieved. We thus obtain the equivalence between: and: By a standard dense-subspace argument, we see that it is inconsequential that $F_1$ and $F_2$ in (103) are not assumed to be continuous, nor bounded away from zero. It is also easy to see that the nonnegativity of $F_1$ and $F_2$ is inconsequential for (103).
This equivalence can also be obtained from Theorem 1. By Hölder's inequality, (103) is equivalent to a statement about the norm of the associated linear operator, hence the name hypercontractivity. This equivalent formulation of hypercontractivity was shown in [44] using a different proof via the method of types/typicality, which requires $|\mathcal{Y}_1|, |\mathcal{Y}_2| < \infty$. In contrast, the proof based on the nonnegativity of relative entropy removes this constraint, allowing one to prove Nelson's Gaussian hypercontractivity from the information-theoretic formulation (see [7]).
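On the two-point space, the operator-norm formulation can be sampled directly. For the noise operator $T_\rho$ on $\{-1, +1\}$ with the uniform measure, Bonami's inequality gives $\|T_\rho f\|_q \le \|f\|_p$ whenever $\rho^2 \le (p-1)/(q-1)$. The sketch below checks this with $p = 2$, $q = 4$, $\rho = 1/2$ (strictly inside the hypercontractive regime); it is a classical special case used for illustration, not the general statement of (103):

```python
import numpy as np

rng = np.random.default_rng(2)
p, q, rho = 2.0, 4.0, 0.5      # rho**2 = 0.25 <= (p-1)/(q-1) = 1/3

def norm(v, r):
    """L^r norm under the uniform measure on {-1, +1}."""
    return np.mean(np.abs(v) ** r) ** (1.0 / r)

for _ in range(1000):
    f = rng.standard_normal(2)  # values f(-1), f(+1): an arbitrary function
    # Noise operator: T_rho f(x) = E[f(Y) | X = x], where Y = X with
    # probability (1 + rho)/2 and Y = -X with probability (1 - rho)/2.
    Tf = np.array([((1 + rho) * f[0] + (1 - rho) * f[1]) / 2,
                   ((1 - rho) * f[0] + (1 + rho) * f[1]) / 2])
    assert norm(Tf, q) <= norm(f, p) + 1e-9   # Bonami: hypercontraction
```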

Reverse Hypercontractivity (Positive Parameters)
By "positive parameters" we mean that the $b_1$ and $b_2$ in (107) are positive. Let $Q_{Z_1 Z_2}$ be a given joint probability distribution, and let $G_1$ and $G_2$ be nonnegative functions on $\mathcal{Z}_1$ and $\mathcal{Z}_2$, respectively, both bounded away from zero. In Theorem 1, take $l \leftarrow 2$, $m \leftarrow 1$, and let $S_1$ and $S_2$ be the canonical maps (Definition 2). The measure spaces and the random transformations are as shown in Figure 4. Note that the constraint (12) translates to: and the equality case yields the optimal choice of $f_1$ for (13). By Theorem 1, we thus obtain the equivalence between: and: Note that in this setup, if $\mathcal{Z}_1$ and $\mathcal{Z}_2$ are finite, then Condition (iv) in Theorem 1 is equivalent to: The equivalent formulations of reverse hypercontractivity were observed in [59], where the proof is based on the method of types.
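For a concrete instance of (106), the two-function form of reverse hypercontractivity on the doubly symmetric binary source states that $E[f(Z_1) g(Z_2)] \ge \|f\|_p \|g\|_q$ for positive $f, g$ and $0 < p, q < 1$ with $(1-p)(1-q) \ge \rho^2$ (Mossel-Oleszkiewicz-Sen). A randomized sketch with $\rho = 1/2$ and $p = q = 0.4$ (the parameter choice here is ours, for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
rho, p, q = 0.5, 0.4, 0.4      # (1-p)(1-q) = 0.36 >= rho**2 = 0.25

# DSBS(rho): uniform bits with P(Z1 = Z2) = (1 + rho)/2; joint pmf below.
P = np.array([[(1 + rho) / 4, (1 - rho) / 4],
              [(1 - rho) / 4, (1 + rho) / 4]])

def norm(v, r):
    # For r < 1 this is still the "norm" appearing in reverse
    # hypercontractivity (a concave power mean on positive functions).
    return np.mean(v ** r) ** (1.0 / r)

for _ in range(1000):
    f = rng.uniform(0.1, 5.0, 2)
    g = rng.uniform(0.1, 5.0, 2)
    corr = float(f @ P @ g)    # E[f(Z1) g(Z2)] under the joint pmf P
    assert corr >= norm(f, p) * norm(g, q) - 1e-9
```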

Reverse Hypercontractivity (One Negative Parameter)
By "one negative parameter" we mean that $b_1$ is positive and $-c_2$ is negative in (111). In Theorem 1, take $l \leftarrow 1$, $m \leftarrow 2$, $c_1 \leftarrow 1$, $d \leftarrow 0$. Let $\mathcal{Y}_1 = \mathcal{X} = (\mathcal{Z}_1, \mathcal{Y}_2)$, and let $S_1$ and $T_2$ be the canonical maps (Definition 2). Suppose that $Q_{Z_1 Y_2}$ is a given joint probability distribution, and set $\mu_1 \leftarrow Q_{Z_1 Y_2}$, $\nu_1 \leftarrow Q_{Z_1}$, $\mu_2 \leftarrow Q_{Y_2}$ in Theorem 1. Suppose that $F$ and $G$ are arbitrary nonnegative continuous functions on $\mathcal{Y}_2$ and $\mathcal{Z}_1$, respectively, which are bounded away from zero. Take the remaining choices in Theorem 1 accordingly. The measure spaces and the random transformations are as shown in Figure 5.

Figure 5. Diagram for reverse hypercontractivity with one negative parameter.
The constraint (12) translates to: Note that (13) translates to: for all $F$, $G$ and $f_1$ satisfying (108). It suffices to verify (109) for the optimal choice $f_1 = GF$, so (109) reduces to: By Theorem 1, (110) is equivalent to: Inequality (110) is called reverse hypercontractivity with a negative parameter in [45], where the entropic version (111) is established for $|\mathcal{Z}_1|, |\mathcal{Y}_2| < \infty$ using the method of types. Multiterminal extensions of (110) and (111) (called reverse Brascamp-Lieb-type inequalities with negative parameters in [45]) can also be recovered from Theorem 1 in the same fashion, i.e., we move all negative parameters to the other side of the inequality so that all parameters become positive.
In summary, from the viewpoint of Theorem 1, the results in Sections 5.1-5.3 are degenerate special cases, in the sense that in any of the three cases, the optimal choice of one of the functions in (13) can be explicitly expressed in terms of the other functions; hence, this "hidden function" disappears in (103), (106) or (110).

Appendix B.

i takes values. Note that since $P_{X_i}$ is assumed to be absolutely continuous, the set of "dyadic points" has measure zero: Since $P^{(n)}_{X_i} \to P_{X_i}$ weakly and the assumption in the preceding paragraph precluded any positive mass on the quantization boundaries under $P_{X_i}$, for each $k \ge 1$, there exists some $n := n_k$ large enough such that: for each $i$ and $w \in \mathcal{W}_{[k]}$. Now, define a coupling $P^{(n)}_{W_{[k]}}$ as follows: Observe that (A9) is a well-defined probability measure because of (A8) and indeed has the required marginals. Moreover, by the triangle inequality, we have the following bound on the total variation distance: Next, construct $P^{(n)}_X$ (we use $P|_A$ to denote the restriction of a probability measure $P$ to a measurable set $A$, that is, $P|_A(B) := P(A \cap B)$ for any measurable $B$): Observe that $P^{(n)}_X$ defined in (A11) is compatible with the $P^{(n)}_{W_{[k]}}$ defined in (A9) and indeed has marginals $(P^{(n)}_{X_i})$. Since $n := n_k$ can be made increasing in $k$, we have constructed the desired sequence $(P^{(n_k)}_X)_{k=1}^\infty$ converging weakly to $P_X$. Indeed, for any bounded open dyadic cube $A$ (that is, a cube whose corners have coordinates that are multiples of $2^{-k}$ for some integer $k$), using (A10) and the assumption (A7), we conclude: Moreover, since bounded open dyadic cubes form a countable basis of the topology of $\mathbb{R}^l$, (A12) actually holds for any open set $A$: write $A$ as a countable union of dyadic cubes, use the continuity of measure to pass to a finite disjoint union, and then apply (A12), as desired.

Appendix C. Upper Semicontinuity of the Infimum
Using Lemma A1 in Appendix B, we prove the following result, which will be used in Appendix E.
Corollary A1. Consider non-degenerate $(P_{Y_j|X})$. For each $n \ge 1$ and $i = 1, \ldots, l$, let $P^{(n)}_{X_i}$ be a Borel measure on $\mathbb{R}$ whose second moment is bounded by $\sigma_i^2 < \infty$. Assume that $P^{(n)}_{X_i}$ converges weakly to some absolutely continuous $P_{X_i}$ for each $i$. Then: Proof. By passing to a convergent subsequence, we may assume that the limit on the left side of (A13) exists. For any coupling $P_X$ of $(P_{X_i})$, by invoking Lemma A1 and passing to a subsequence, we find a sequence of couplings $P^{(n)}_X$ of $(P^{(n)}_{X_i})$ that converges weakly to $P_X$. It is known that, under a moment constraint, the differential entropy of the output distribution of a non-degenerate Gaussian channel is weakly continuous in the input distribution (see, e.g., [48]), and (A13) follows since $P_X$ was arbitrarily chosen.
Therefore, since the Gaussian distribution maximizes differential entropy under a second moment upper bound, we have: Since $\lim_{r \to \infty} \sup_n p_n(r) = 0$ by (A15) and Chebyshev's inequality, (A19) implies that: The desired result follows from (A17), (A20) and the decomposition:
$$h(X_n) = p_n(r)\, h(X_n \mid X_n > r) + (1 - p_n(r))\, h(X_n \mid X_n \le r) + h(p_n(r)),$$
where $h(p_n(r))$ denotes the binary entropy of $p_n(r)$.
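The decomposition in the last display is the chain rule for the two-set partition $\{X_n > r\}$ versus $\{X_n \le r\}$. It can be checked in closed form on a toy example, say $X \sim \mathrm{Uniform}[0,1]$ with $r = 1/2$, where every term is explicit:

```python
import math

# X ~ Uniform[0,1], r = 1/2, so h(X) = 0 and P(X > r) = 1/2.
p = 0.5
h_above = math.log(0.5)   # X | {X > 1/2} ~ Uniform(1/2, 1]: h = log(1/2)
h_below = math.log(0.5)   # X | {X <= 1/2} ~ Uniform[0, 1/2]: h = log(1/2)
hb = -p * math.log(p) - (1 - p) * math.log(1 - p)   # binary entropy = log 2

# Chain rule: h(X) = p h(X | X > r) + (1 - p) h(X | X <= r) + h_b(p).
h = p * h_above + (1 - p) * h_below + hb
assert abs(h - 0.0) < 1e-12   # matches h(Uniform[0,1]) = 0
```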
Appendix E. Proof of Proposition 2

1.
For any $\varepsilon > 0$, by the continuity of measure, there exists $K > 0$ such that: By the union bound, whenever $P_X$ is a coupling of $(P_{X_i})$. Now, let $P^{(n)}_X$, where $P_{Y_j} := T_j^* P_X$, $j = 1, \ldots, m$. The sequence $(P^{(n)}_X)$ is tight by (A23). Thus, invoking Prokhorov's theorem and passing to a subsequence, we may assume that $(P^{(n)}_X)$ converges weakly to some $P_X$. Therefore, $P^{(n)}_{Y_j}$ converges weakly to $P_{Y_j}$, and by the semicontinuity property in Lemma A2, we have: establishing that $P_X$ is an infimizer.

2.
Suppose $(P^{(n)}_{X_i})$, where $(\sigma_i)$ is as in Proposition 1, and: The regularization on the covariance implies that for each $i$, $(P^{(n)}_{X_i})_{n \ge 1}$ is a tight sequence. Thus, upon the extraction of subsequences, we may assume that for each $i$, $(P^{(n)}_{X_i})_{n \ge 1}$ converges to some $P_{X_i}$. We have the moment bound: where $X_i \sim P_{X_i}$ and $X^{(n)}_i \sim P^{(n)}_{X_i}$. Then, by Lemma A2, Under the covariance regularization and the non-degeneracy assumption, we showed in Proposition 1 that the value of (76) cannot be $+\infty$ or $-\infty$. This implies that we can assume (by passing to a subsequence) that $P^{(n)}_{X_i} \ll \lambda$, $i = 1, \ldots, l$, since otherwise $F((P_{X_i})) = -\infty$. Moreover, since is bounded above under the non-degeneracy assumption, the sequence must also be bounded from above, which implies, using (A30), that: In particular, we have $P_{X_i} \ll \lambda$ for each $i$. Now, Corollary A1 shows that: Thus, (A30) and (A32) show that $(P_{X_i})$ is in fact a maximizer.

Appendix F. Gaussian Optimality in Degenerate Cases: A Limiting Argument
This section proves Theorem 2. We first give a proof for the choice of parameters in Example 1, merely for the sake of notational simplicity, and then discuss how to extend the argument. The proof will be based on Theorem 8, which assumes non-degenerate forward channels and Gaussian measures on the outputs of the reverse channels. To that end, we adopt an approximation argument. For each $j = 1, \ldots, l$, define the linear operator $T_j^\varepsilon$ by: for any measurable function $\varphi$ on $\mathbb{R}$, where $N \sim \mathcal{N}(0, \varepsilon)$. Let $\gamma_1^\varepsilon := \mathcal{N}(0, \varepsilon^{-1})$, and note that the density of $\sqrt{2\pi/\varepsilon}\, \gamma_1^\varepsilon$ converges pointwise to that of the Lebesgue measure.
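Writing $\varepsilon$ for the mollification parameter, the normalization can be checked pointwise: the density of $\mathcal{N}(0, \varepsilon^{-1})$ at $x$ is $\sqrt{\varepsilon/2\pi}\, e^{-\varepsilon x^2/2}$, so multiplying by $\sqrt{2\pi/\varepsilon}$ leaves $e^{-\varepsilon x^2/2}$, which tends to $1$ (the Lebesgue density) as $\varepsilon \to 0$. A one-line numeric confirmation:

```python
import math

def scaled_density(x: float, eps: float) -> float:
    """sqrt(2*pi/eps) times the N(0, 1/eps) density at x."""
    pdf = math.sqrt(eps / (2 * math.pi)) * math.exp(-eps * x * x / 2)
    return math.sqrt(2 * math.pi / eps) * pdf   # equals exp(-eps * x**2 / 2)

# Pointwise convergence to the Lebesgue density 1 as eps -> 0.
for x in (0.0, 1.0, 5.0):
    assert abs(scaled_density(x, 1e-8) - 1.0) < 1e-6
```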
Lemma A3. For any $\varepsilon > 0$, let $(T_j^\varepsilon)$ be defined as in (A33). Then, for any Borel $P_{X_i} \ll \lambda$, $i = 1, \ldots, l$,
$$\sum_{i=1}^l D(P_{X_i} \,\|\, \gamma_1^\varepsilon) - \frac{l}{2} \log \frac{2\pi}{\varepsilon} \;\ge\; \inf_{P_{X^l} :\, S_i^* P_{X^l} = P_{X_i}} \Big\{ - \sum_{j=1}^l h\big(T_j^{\varepsilon\,*} P_{X^l}\big) \Big\}. \quad \text{(A34)}$$
Proof. By Theorem 8, it suffices to prove (A34) when each $P_{X_i}$ is Gaussian, and from (A34), it is easy to see that it suffices to consider centered Gaussians. Let $P_{X_i} = \mathcal{N}(0, a_i)$, $i = 1, \ldots, l$. We can upper bound the right side of (A34) by taking $P_{X^l} = P_{X_1} \times \cdots \times P_{X_l}$ instead of the infimum, so it suffices to verify the resulting inequality. Put:
$$\tilde{g}_1 := \exp(\eta_\varepsilon)\, g_1, \quad \text{(A45)}$$
$$\tilde{g}_i := g_i, \quad i = 2, \ldots, l. \quad \text{(A46)}$$
Then, $(\tilde{g}_i)$ and $(f_j)$ satisfy the constraint (A36) for any $\varepsilon > 0$. By applying the monotone convergence theorem and then Lemma A4, we obtain a statement which violates the hypothesis (A38), as desired.

Appendix F.2. Proof of Theorem 2
The limiting argument can be extended to the vector case to prove Theorem 2. Specifically, for each $j = 1, \ldots, m$, define $T_j^\varepsilon$ as in (A33), except that $N \sim \mathcal{N}(0, \varepsilon I)$, where $I$ is the identity matrix whose dimension is clear from the context (equal to $\dim(E_j)$ here), and let $P_{Y_j|X_1 \ldots X_l}$ be the dual operator. For each $i = 1, \ldots, l$, let $\nu_i^\varepsilon := (2\pi/\varepsilon)^{\frac{1}{2}\dim(E_i)} \cdot \mathcal{N}(0, \varepsilon^{-1} I)$, whose density converges pointwise to that of $\nu_i^0$, defined as the Lebesgue measure on $E_i$. Define: where the supremum is over nonnegative continuous functions $f_1, \ldots, f_m$ and $g_1, \ldots, g_l$ such that the summands in (A49) are finite and: $c_j (T_j^\varepsilon \log f_j)(x_1, \ldots, x_l)$, $\forall x_1, \ldots, x_l$.
However, Theorem 8, based on the rotational invariance of the Gaussian measure, extends to the vector case, so for any $\varepsilon > 0$,
$$\sup_{P_{X_1}, \ldots, P_{X_l}} F_0^\varepsilon(P_{X_1}, \ldots, P_{X_l}) = \sup_{P_{X_1}, \ldots, P_{X_l} \text{ c.G.}} F_0^\varepsilon(P_{X_1}, \ldots, P_{X_l}),$$
where c.G. means that the supremum on the right side is over centered Gaussian measures. The fact that centered distributions exhaust the supremum follows easily from the definition of $F_0^\varepsilon$. Moreover, from the definitions, it is easy to see that $F_0^\varepsilon$ is monotonically decreasing in $\varepsilon$, and in particular: $\sup_{P_{X_1}, \ldots, P_{X_l} \text{ c.G.}}$