Open Access Article

A Forward-Reverse Brascamp-Lieb Inequality: Entropic Duality and Gaussian Optimality

1. Department of Electrical Engineering, Princeton University, Princeton, NJ 08544, USA
2. Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720-1770, USA
3. Renaissance Technologies, LLC, 600 Route 25A, East Setauket, NY 11733, USA
* Author to whom correspondence should be addressed.
Entropy 2018, 20(6), 418; https://doi.org/10.3390/e20060418
Received: 30 March 2018 / Revised: 25 May 2018 / Accepted: 25 May 2018 / Published: 30 May 2018
(This article belongs to the Special Issue Entropy and Information Inequalities)

Abstract

Inspired by the forward and the reverse channels from the image-size characterization problem in network information theory, we introduce a functional inequality that unifies both the Brascamp-Lieb inequality and Barthe’s inequality, which is a reverse form of the Brascamp-Lieb inequality. For Polish spaces, we prove its equivalent entropic formulation using the Legendre-Fenchel duality theory. Capitalizing on the entropic formulation, we elaborate on a “doubling trick” used by Lieb and Geng-Nair to prove the Gaussian optimality in this inequality for the case of Gaussian reference measures.
Keywords: Brascamp-Lieb inequality; hypercontractivity; functional-entropic duality; Gaussian optimality; network information theory; image size characterization

1. Introduction

The Brascamp-Lieb inequality and its reverse [1] concern the optimality of Gaussian functions in a certain type of integral inequality. (Not to be confused with the “variance Brascamp-Lieb inequality” (cf. [2,3,4]), which generalizes the Poincaré inequality). These inequalities have been generalized in various ways since their discovery, nearly 40 years ago. A modern formulation due to Barthe [5] may be stated as follows:
Brascamp-Lieb Inequality and Its Reverse 
([5] Theorem 1). Let E, E 1 , …, E m be Euclidean spaces and B i : E E i be linear maps. Let ( c i ) i = 1 m and D be positive real numbers. Then, the Brascamp-Lieb inequality:
i = 1 m f i c i ( B i x ) d x D i = 1 m f i ( x i ) d x i c i ,
for all nonnegative measurable functions f i on E i , i = 1 , , m , holds if and only if it holds whenever f i , i = 1 , , m are centered Gaussian functions (a centered Gaussian function is of the form x exp ( r x A x ) , where A is a positive semidefinite matrix and r R ). Similarly, for F a positive real number, the reverse Brascamp-Lieb inequality, also known as Barthe’s inequality ( B i denotes the adjoint of B i ),
sup ( y i ) : i = 1 m c i B i y i = x i = 1 m f i c i ( y i ) d x F i = 1 m f i ( y i ) d y i c i ,
for all nonnegative measurable functions f i on E i , i = 1 , , m , holds if and only if it holds for all centered Gaussian functions.
For surveys on the history of both the Brascamp-Lieb inequality and Barthe’s inequality and their applications, see, e.g., [6,7]. The Brascamp-Lieb inequality can be seen as a generalization of several other inequalities, including Hölder’s inequality, the sharp Young inequality, the Loomis-Whitney inequality, the entropy power inequality (cf. [6] or the survey paper [8]), hypercontractivity and the logarithmic Sobolev inequality [9]. Furthermore, the Prékopa-Leindler inequality can be seen as a special case of Barthe’s inequality. Due in part to their utility in establishing impossibility bounds, these functional inequalities have attracted much attention in information theory [10,11,12,13,14,15,16,17], theoretical computer science [18,19,20,21,22] and statistics [23,24,25,26,27,28], to name only a small subset of the literature. Over the years, various proofs of these inequalities have been proposed [1,29,30,31,32,33,34]. Among these, Lieb’s elegant proof [29], which is very close to one of the techniques that will be used in this paper, employs a doubling trick that capitalizes on the rotational invariance property of the Gaussian function: if f is a one-dimensional Gaussian function, then:
$$f(x) f(y) = f\!\left( \frac{x - y}{\sqrt{2}} \right) f\!\left( \frac{x + y}{\sqrt{2}} \right). \tag{3}$$
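As a quick numerical sanity check (our illustration, not part of the paper), the rotational invariance identity (3) can be verified directly for a one-dimensional centered Gaussian function $f(x) = \exp(-a x^2)$, for an arbitrary choice of $a > 0$:

```python
import math
import random

def f(x, a=0.7):
    # one-dimensional centered Gaussian function exp(-a*x^2)
    return math.exp(-a * x * x)

random.seed(0)
for _ in range(1000):
    x = random.uniform(-4, 4)
    y = random.uniform(-4, 4)
    lhs = f(x) * f(y)
    rhs = f((x - y) / math.sqrt(2)) * f((x + y) / math.sqrt(2))
    # ((x-y)^2 + (x+y)^2)/2 = x^2 + y^2, so both sides agree exactly
    assert math.isclose(lhs, rhs, rel_tol=1e-9)
```

The identity holds because the quadratic form $x^2 + y^2$ is invariant under the rotation $(x, y) \mapsto ((x-y)/\sqrt{2}, (x+y)/\sqrt{2})$.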
Since (1) and (2) have the same structure modulo the direction of the inequality, a common viewpoint is to consider (1) and (2) as dual inequalities. This viewpoint successfully captures the geometric aspects of (1) and (2). Indeed, it is known that:
$$D \cdot F = 1 \tag{4}$$
as long as $D, F < \infty$ [5]. Moreover, both $D$ and $F$ are equal to one under Ball's geometric condition [35]: $E_1$, …, $E_m$ are one-dimensional, and:
$$\sum_{i=1}^{m} c_i B_i^{\star} B_i = I \tag{5}$$
is the identity matrix. While fruitful, this “dual” viewpoint does not fully explain the asymmetry between the forward and the reverse inequalities: there is a sup in (2), but not in (1).
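The geometric condition can be made concrete with a small numerical illustration (ours, not the paper's): in $\mathbb{R}^2$, take $m = 2$, $c_1 = c_2 = 1$, and let $B_1, B_2$ be the projections onto an orthonormal pair of directions, so that $\sum_i c_i B_i^{\star} B_i = I$. For the Gaussian inputs $f_i(t) = e^{-t^2}$, both sides of (1) then agree with $D = 1$:

```python
import math

theta = 0.3  # any angle; (u1, u2) is an orthonormal basis of R^2
u1 = (math.cos(theta), math.sin(theta))
u2 = (-math.sin(theta), math.cos(theta))

def f(t):
    return math.exp(-t * t)

# Midpoint Riemann sum for the left side of (1):
# integral of f(<u1,x>) * f(<u2,x>) over R^2 (truncated to [-L,L]^2)
h, L = 0.05, 6.0
n = int(2 * L / h)
lhs = 0.0
for i in range(n):
    x1 = -L + (i + 0.5) * h
    for j in range(n):
        x2 = -L + (j + 0.5) * h
        lhs += f(u1[0] * x1 + u1[1] * x2) * f(u2[0] * x1 + u2[1] * x2) * h * h

# Right side of (1) with D = 1: product of the one-dimensional
# integrals of exp(-t^2), each equal to sqrt(pi)
rhs = math.sqrt(math.pi) ** 2
assert abs(lhs - rhs) < 1e-6  # equality, up to discretization error
```

Here the integrand collapses to $e^{-\|x\|^2}$ because the directions are orthonormal, which is exactly why equality (rather than a strict inequality) is attained under the geometric condition.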
This paper explores a different viewpoint. In particular, we propose a single inequality that unifies (1) and (2). Accordingly, we should reverse both sides of (2) to make the inequality sign consistent with (1). To be concrete, let us first observe that (1) and (2) can be respectively restated in the following more symmetrical forms (with changes of certain symbols):
  • For all nonnegative functions $g$ and $f_1, \dots, f_m$ such that:
$$g(x) \le \prod_{j=1}^{m} f_j^{c_j}(B_j x), \quad \forall x, \tag{6}$$
    we have:
$$\int_{E} g \le D \prod_{j=1}^{m} \left( \int_{E_j} f_j \right)^{c_j}. \tag{7}$$
  • For all nonnegative measurable functions $g_1, \dots, g_l$ and $f$ such that:
$$\prod_{i=1}^{l} g_i^{b_i}(z_i) \le f\!\left( \sum_{i=1}^{l} b_i B_i^{\star} z_i \right), \quad \forall z_1, \dots, z_l, \tag{8}$$
    we have:
$$\prod_{i=1}^{l} \left( \int_{E_i} g_i \right)^{b_i} \le D \int_{E} f. \tag{9}$$
Note that in both cases, the optimal choice of one function ($f$ or $g$) can be explicitly computed from the constraints, hence the conventional formulations in (1) and (2). Generalizing further, we can consider the following problem: let $X$, $Y_1, \dots, Y_m$, $Z_1, \dots, Z_l$ be measurable spaces. Consider measurable maps $\phi_j \colon X \to Y_j$, $j = 1, \dots, m$, and $\psi_i \colon X \to Z_i$, $i = 1, \dots, l$. Let $b_1, \dots, b_l$ and $c_1, \dots, c_m$ be nonnegative real numbers. Let $\nu_1, \dots, \nu_l$ be measures on $Z_1, \dots, Z_l$ and $\mu_1, \dots, \mu_m$ be measures on $Y_1, \dots, Y_m$, respectively. What is the smallest $D > 0$ such that for all nonnegative $f_1, \dots, f_m$ on $Y_1, \dots, Y_m$ and $g_1, \dots, g_l$ on $Z_1, \dots, Z_l$ satisfying:
$$\prod_{i=1}^{l} g_i^{b_i}(\psi_i(x)) \le \prod_{j=1}^{m} f_j^{c_j}(\phi_j(x)), \quad \forall x, \tag{10}$$
we have:
$$\prod_{i=1}^{l} \left( \int g_i \,\mathrm{d}\nu_i \right)^{b_i} \le D \prod_{j=1}^{m} \left( \int f_j \,\mathrm{d}\mu_j \right)^{c_j}? \tag{11}$$
Except for the special case of $l = 1$ (resp. $m = 1$), it is generally not possible to deduce from (10) a simple expression for the optimal choice of $g_i$ (resp. $f_j$) in terms of the rest of the functions. We will refer to (11) as a forward-reverse Brascamp-Lieb inequality.
One of the motivations for considering multiple functions on both sides of (11) comes from multiuser information theory: independently, but almost simultaneously with the discovery of the Brascamp-Lieb inequality in mathematical physics, in the late 1970s, information theorists including Ahlswede, Gács and Körner [36,37] invented the image-size technique for proving strong converses in source and channel networks. An image-size inequality is a characterization of the tradeoff between the measures of certain sets connected by given random transformations (channels); we refer the interested reader to [37] for an exposition of the image-size problem. Although this is not how they are treated in [36,37], image-size inequalities can essentially be obtained from functional inequalities similar to (11) by taking the functions to be (roughly speaking) the indicator functions of sets. In the case of (10), the forward channels $\phi_1, \dots, \phi_m$ and the reverse channels $\psi_1, \dots, \psi_l$ degenerate into deterministic maps. In this paper, motivated by information-theoretic applications similar to those of the image-size problems, we consider further generalizations of (11) to the case of random transformations. Since the functional inequality is not restricted to indicator functions, it is strictly stronger than the corresponding image-size inequality. As a side remark, [38] uses functional inequalities that are variants of (11), together with reverse hypercontractivity machinery, to improve on the image-size plus blowing-up machinery of [39], and shows that the non-indicator-function generalization is crucial for achieving the optimal scaling in the second-order rate expansion.
Of course, to justify the proposal of (11), we must also prove that (11) enjoys certain nice mathematical properties; this is the main goal of the present paper. Specifically, we focus on two aspects of (11): equivalent entropic formulation and Gaussian optimality.
In the mathematical literature, e.g., [32,36,40,41,42,43,44,45,46], it is known that certain integral inequalities are equivalent to inequalities involving relative entropies. In particular, Carlen, Loss and Lieb [47] and Carlen and Cordero-Erausquin [32] proved that the Brascamp-Lieb inequality is equivalent to the superadditivity of relative entropy. In this paper, we prove that the forward-reverse Brascamp-Lieb inequality (11) also has an entropic formulation, which turns out to be very close to the rate region of certain multiuser information theory problems (but we will clarify the difference in the text). In fact, Ahlswede, Csiszár and Körner [37,39] essentially derived image-size inequalities from similar entropic inequalities. Because of the reverse part, the proof of the equivalence of (11) and the corresponding entropic inequality is more involved than the forward case considered in [32] beyond the case of finite $X$, $Y_j$, $Z_i$, and certain machinery from min-max theory appears necessary. In particular, the proof involves a novel use of the Legendre-Fenchel duality theory. Next, we give a basic version of our main result on the functional-entropic duality (more general versions will be given later). In order to streamline its presentation, all formal definitions of notation are postponed to Section 2.
Theorem 1 (Dual formulation of the forward-reverse Brascamp-Lieb inequality).
Assume that:
(i) 
$m$ and $l$ are positive integers; $d \in \mathbb{R}$; $X$ is a compact metric space;
(ii) 
$b_i \in (0, \infty)$, $\nu_i$ is a finite Borel measure on a Polish space $Z_i$, and $Q_{Z_i|X}$ is a random transformation from $X$ to $Z_i$, for each $i = 1, \dots, l$;
(iii) 
$c_j \in (0, \infty)$, $\mu_j$ is a finite Borel measure on a Polish space $Y_j$, and $Q_{Y_j|X}$ is a random transformation from $X$ to $Y_j$, for each $j = 1, \dots, m$;
(iv) 
For any $(P_{Z_i})_{i=1}^{l}$ such that $\sum_{i=1}^{l} D(P_{Z_i} \| \nu_i) < \infty$, there exists $P_X$ such that $P_X \to Q_{Z_i|X} \to P_{Z_i}$, $i = 1, \dots, l$, and $\sum_{j=1}^{m} D(P_{Y_j} \| \mu_j) < \infty$, where $P_X \to Q_{Y_j|X} \to P_{Y_j}$, $j = 1, \dots, m$.
Then, the following two statements are equivalent:
1.
If the nonnegative continuous functions $(g_i)$, $(f_j)$ are bounded away from zero and satisfy:
$$\sum_{i=1}^{l} b_i\, Q_{Z_i|X}(\log g_i) \le \sum_{j=1}^{m} c_j\, Q_{Y_j|X}(\log f_j), \tag{12}$$
then:
$$\prod_{i=1}^{l} \left( \int g_i \,\mathrm{d}\nu_i \right)^{b_i} \le \exp(d) \prod_{j=1}^{m} \left( \int f_j \,\mathrm{d}\mu_j \right)^{c_j}. \tag{13}$$
2.
For any $(P_{Z_i})$ such that $D(P_{Z_i} \| \nu_i) < \infty$ (this assumption is not essential if we adopt the convention that the infimum in (14) is $+\infty$ when it runs over an empty set), $i = 1, \dots, l$,
$$\sum_{i=1}^{l} b_i D(P_{Z_i} \| \nu_i) + d \ge \inf_{P_X} \sum_{j=1}^{m} c_j D(P_{Y_j} \| \mu_j) \tag{14}$$
where $P_X \to Q_{Y_j|X} \to P_{Y_j}$, $j = 1, \dots, m$, and the infimum is over $P_X$ such that $P_X \to Q_{Z_i|X} \to P_{Z_i}$, $i = 1, \dots, l$.
Next, in a similar vein as the proverbial result that “Gaussian functions are optimal” for the forward or the reverse Brascamp-Lieb inequality, we show in this paper that Gaussian functions are also optimal for the forward-reverse Brascamp-Lieb inequality, particularized to the case of Gaussian reference measures and linear maps. The proof scheme is based on rotational invariance (3), which can be traced back in the functional setting to Lieb [29]. More specifically, we use a variant for the entropic setting introduced by Geng and Nair [48], thereby taking advantage of the dual formulation of Theorem 1.
Theorem 2.
Consider $b_1, \dots, b_l, c_1, \dots, c_m, D \in (0, \infty)$. Let $E_1', \dots, E_l', E_1, \dots, E_m$ be Euclidean spaces, and let $B_{ji} \colon E_i' \to E_j$ be a linear map for each $i \in \{1, \dots, l\}$ and $j \in \{1, \dots, m\}$. Then, for all continuous functions $f_j \colon E_j \to [0, +\infty)$, $g_i \colon E_i' \to [0, \infty)$ satisfying:
$$\prod_{i=1}^{l} g_i^{b_i}(x_i) \le \prod_{j=1}^{m} f_j^{c_j}\!\left( \sum_{i=1}^{l} B_{ji} x_i \right), \quad \forall x_1, \dots, x_l, \tag{15}$$
we have:
$$\prod_{i=1}^{l} \left( \int g_i \right)^{b_i} \le D \prod_{j=1}^{m} \left( \int f_j \right)^{c_j}, \tag{16}$$
if and only if for all centered Gaussian functions f 1 , , f m , g 1 , , g l satisfying (15), we have (16).
As mentioned, in the literature on the forward or the reverse Brascamp-Lieb inequalities, it is known that a certain geometric condition (5) ensures that the best constant equals one. Now, for the forward-reverse inequality, there is a simple example where the best constant equals one:
Example 1.
Let $l$ be a positive integer, and let $M := (m_{ji})_{1 \le j \le l,\, 1 \le i \le l}$ be an orthogonal matrix. For any nonnegative continuous functions $(f_j)_{j=1}^{l}$, $(g_i)_{i=1}^{l}$ on $\mathbb{R}$ such that:
$$\prod_{i=1}^{l} g_i(x_i) \le \prod_{j=1}^{l} f_j\!\left( \sum_{i=1}^{l} m_{ji} x_i \right), \quad \forall x^l \in \mathbb{R}^l, \tag{17}$$
we have:
$$\prod_{i=1}^{l} \int g_i(x)\,\mathrm{d}x \le \prod_{j=1}^{l} \int f_j(x)\,\mathrm{d}x. \tag{18}$$
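Example 1 can be checked numerically in a simple instance (our illustration, not the paper's proof): for a rotation matrix $M$ and the Gaussian choices $g_i(x) = f_j(x) = e^{-x^2}$, the constraint (17) holds with equality because $M$ preserves Euclidean norms, and both sides of (18) equal $(\sqrt{\pi})^l$:

```python
import math
import random

l = 2
t = 0.8
M = [[math.cos(t), -math.sin(t)],
     [math.sin(t), math.cos(t)]]  # orthogonal (rotation) matrix

def g(x):
    return math.exp(-x * x)
f = g

random.seed(1)
# Constraint (17): prod_i g_i(x_i) <= prod_j f_j((Mx)_j); here with equality,
# since M orthogonal implies sum_i x_i^2 = sum_j (Mx)_j^2
for _ in range(1000):
    x = [random.uniform(-3, 3) for _ in range(l)]
    Mx = [sum(M[j][i] * x[i] for i in range(l)) for j in range(l)]
    lhs = math.prod(g(xi) for xi in x)
    rhs = math.prod(f(yj) for yj in Mx)
    assert math.isclose(lhs, rhs, rel_tol=1e-9)

# Both sides of (18): each one-dimensional integral equals sqrt(pi)
h, L_ = 0.01, 8.0
integral = sum(g(-L_ + (k + 0.5) * h) * h for k in range(int(2 * L_ / h)))
assert math.isclose(integral, math.sqrt(math.pi), rel_tol=1e-6)
```

Of course, this only exercises the equality case for Gaussian inputs; the content of Example 1 is that (18) holds for all admissible $(f_j)$, $(g_i)$.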
The rest of the paper is organized as follows: Section 2 defines the notation and reviews some basic theory of convex duality. Section 3 proves Theorem 1 and also presents its extensions to the settings of noncompact spaces or general reverse channels. Section 4 proves the Gaussian optimality in the entropic formulation, with the caveat that a certain "non-degeneracy" assumption is imposed to ensure the existence of extremizers. At the end of Section 4, we give a proof sketch of Example 1 and also propose a generalization of the example. To completely prove Theorem 2, in Appendix F, we use a limiting argument to drop the non-degeneracy assumption and apply the equivalence between the functional and entropic formulations.

2. Review of the Legendre-Fenchel Duality Theory

Our proof of the equivalence of the functional and the entropic inequalities uses the Legendre-Fenchel duality theory, a topic from convex analysis. Before getting into that, a recap of some basics on the duality of topological vector spaces seems appropriate. Unless otherwise indicated, we assume Polish spaces and Borel measures. Recall that a Polish space is a separable, completely metrizable topological space. It enjoys several nice properties that we use heavily in this section, including the Prokhorov theorem and the Riesz-Kakutani theorem. Of course, the Polish space assumption covers the cases of Euclidean and discrete spaces (endowed with the Hamming metric, which induces the discrete topology, making every function on the discrete set continuous), among others. Readers interested in discrete spaces only may refer to the (much simpler) argument in [49] based on the KKT condition.
Notation 1.
Let $X$ be a topological space.
  • $C_c(X)$ denotes the space of continuous functions on $X$ with compact support;
  • $C_0(X)$ denotes the space of all continuous functions $f$ on $X$ that vanish at infinity (i.e., for any $\epsilon > 0$, there exists a compact set $K \subseteq X$ such that $|f(x)| < \epsilon$ for $x \in X \setminus K$);
  • $C_b(X)$ denotes the space of bounded continuous functions on $X$;
  • $M(X)$ denotes the space of finite signed Borel measures on $X$;
  • $P(X)$ denotes the space of probability measures on $X$.
We consider $C_c$, $C_0$ and $C_b$ as topological vector spaces, with the topology induced from the sup norm. The following theorem, usually attributed to Riesz, Markov and Kakutani, is well known in functional analysis and can be found in, e.g., [50,51].
Theorem 3 (Riesz-Markov-Kakutani).
If $X$ is a locally compact, σ-compact Polish space, the dual (the dual of a topological vector space consists of all continuous linear functionals on that space; it is naturally also a topological vector space, with the weak$^{\star}$ topology) of both $C_c(X)$ and $C_0(X)$ is $M(X)$.
Remark 1.
The dual space of C b ( X ) can be strictly larger than M ( X ) , since it also contains those linear functionals that depend on the “limit at infinity” of a function f C b ( X ) (originally defined for those f that do have a limit at infinity and then extended to the whole C b ( X ) by the Hahn-Banach theorem; see, e.g., [50]).
Of course, any $\mu \in M(X)$ is a continuous linear functional on $C_0(X)$ or $C_c(X)$, given by:
$$f \mapsto \int f \,\mathrm{d}\mu \tag{19}$$
where $f$ is a function in $C_0(X)$ or $C_c(X)$. As is well known, Theorem 3 states that the converse is also true under mild regularity assumptions on the space. Thus, we can view measures as continuous linear functionals on a certain function space (in fact, some authors prefer to construct measure theory by defining a measure as a linear functional on a suitable function space; see Lax [50] or Bourbaki [52]); this justifies the shorthand notation:
$$\mu(f) := \int f \,\mathrm{d}\mu \tag{20}$$
which we employ in the rest of the paper. This viewpoint is the most natural for our setting since in the proof of the equivalent formulation of the forward-reverse Brascamp-Lieb inequality, we shall use the Hahn-Banach theorem to show the existence of certain linear functionals.
Definition 1.
Let $\Lambda \colon C_b(X) \to (-\infty, +\infty]$ be a lower semicontinuous, proper convex function. Its Legendre-Fenchel transform $\Lambda^{\star} \colon C_b(X)^{\star} \to (-\infty, +\infty]$ is given by:
$$\Lambda^{\star}(\ell) := \sup_{u \in C_b(X)} \left[ \ell(u) - \Lambda(u) \right]. \tag{21}$$
Let $\nu$ be a nonnegative finite Borel measure on a Polish space $X$, and define the convex functional on $C_b(X)$:
$$\Lambda(f) := \log \nu(\exp(f)) \tag{22}$$
$$= \log \int \exp(f) \,\mathrm{d}\nu. \tag{23}$$
Then, note that the relative entropy has the following alternative definition: for any $\mu \in M(X)$,
$$D(\mu \| \nu) := \sup_{f \in C_b(X)} \left[ \mu(f) - \Lambda(f) \right] \tag{24}$$
which agrees with the more familiar definition $D(\mu \| \nu) := \mu\!\left( \log \frac{\mathrm{d}\mu}{\mathrm{d}\nu} \right)$ when $\mu$ is a probability measure, by the Donsker-Varadhan formula (cf. [53] Lemma 6.2.13). If $\mu$ is not a probability measure, then $D(\mu \| \nu)$ as defined in (24) is $+\infty$.
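For a finite alphabet, the variational definition (24) is easy to verify numerically (our illustration, not from the paper): the supremum is attained at $f = \log \frac{\mathrm{d}\mu}{\mathrm{d}\nu}$, where the objective equals the usual relative entropy, while any other $f$ gives a smaller value:

```python
import math
import random

mu = [0.5, 0.3, 0.2]   # probability measure
nu = [0.2, 0.5, 0.3]   # reference probability measure

def objective(f):
    # mu(f) - log nu(exp(f)), cf. (24)
    return sum(m * fi for m, fi in zip(mu, f)) - math.log(
        sum(n * math.exp(fi) for n, fi in zip(nu, f)))

kl = sum(m * math.log(m / n) for m, n in zip(mu, nu))

# The optimal f = log(dmu/dnu) attains D(mu||nu) ...
f_star = [math.log(m / n) for m, n in zip(mu, nu)]
assert math.isclose(objective(f_star), kl, rel_tol=1e-12)

# ... and every other f is dominated (Donsker-Varadhan)
random.seed(2)
for _ in range(1000):
    f = [random.uniform(-5, 5) for _ in mu]
    assert objective(f) <= kl + 1e-12
```

At $f = \log \frac{\mathrm{d}\mu}{\mathrm{d}\nu}$ we have $\nu(e^f) = \nu(\mathrm{d}\mu/\mathrm{d}\nu) = 1$, so the log-normalizer vanishes and the objective reduces to $\mu(\log \frac{\mathrm{d}\mu}{\mathrm{d}\nu})$.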
Given a bounded linear operator $T \colon C_b(Y) \to C_b(X)$, the dual operator $T^{\star} \colon C_b(X)^{\star} \to C_b(Y)^{\star}$ is defined in terms of:
$$T^{\star}\mu_X \colon C_b(Y) \to \mathbb{R}; \quad f \mapsto \mu_X(Tf), \tag{25}$$
for any $\mu_X \in C_b(X)^{\star}$. Since $P(X) \subseteq M(X) \subseteq C_b(X)^{\star}$, we say that $T$ is a conditional expectation operator if $T^{\star}P \in P(Y)$ for any $P \in P(X)$. The operator $T^{\star}$ is defined as the dual of a conditional expectation operator $T$ and, in a slight abuse of terminology, is said to be a random transformation from $X$ to $Y$.
For example, in the notation of Theorem 1, if $g \in C_b(Y)$ and $Q_{Y|X}$ is a random transformation from $X$ to $Y$, the quantity $Q_{Y|X}(g)$ is a function on $X$, defined by taking the conditional expectation. Furthermore, if $P_X \in P(X)$, we write $P_X \to Q_{Y|X} \to P_Y$ to indicate that $P_Y \in P(Y)$ is the measure induced on $Y$ by applying $Q_{Y|X}$ to $P_X$.
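In the finite-alphabet case, a conditional expectation operator is just a row-stochastic matrix acting on functions, and its dual acts on measures; the defining relation $(T^{\star}\mu_X)(f) = \mu_X(Tf)$ in (25) can then be checked directly (our illustration, with an arbitrary stochastic matrix):

```python
# Q[x][y] = Q_{Y|X}(y|x): a row-stochastic matrix (the random transformation)
Q = [[0.7, 0.3],
     [0.1, 0.9],
     [0.5, 0.5]]

def T(f):
    # conditional expectation operator: (Tf)(x) = sum_y Q(y|x) f(y)
    return [sum(q * fy for q, fy in zip(row, f)) for row in Q]

def T_star(P_X):
    # dual operator on measures: P_Y(y) = sum_x P_X(x) Q(y|x)
    return [sum(p * row[y] for p, row in zip(P_X, Q)) for y in range(len(Q[0]))]

P_X = [0.2, 0.5, 0.3]
f = [1.5, -0.7]

# (T* P_X)(f) == P_X(T f), i.e., E_{P_Y}[f(Y)] == E_{P_X}[ E[f(Y)|X] ]
lhs = sum(p * fy for p, fy in zip(T_star(P_X), f))
rhs = sum(p * tf for p, tf in zip(P_X, T(f)))
assert abs(lhs - rhs) < 1e-12

# T* maps probability measures to probability measures
assert abs(sum(T_star(P_X)) - 1.0) < 1e-12
```

This is exactly the tower property of conditional expectation, written in the duality language of (25).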
Remark 2.
From the viewpoint of category theory (see, for example, [54,55]), $C_b$ is a functor from the category of topological spaces to the category of topological vector spaces, which is contra-variant because for any continuous $\phi \colon X \to Y$ (a morphism between topological spaces), we have $C_b(\phi) \colon C_b(Y) \to C_b(X)$, $u \mapsto u \circ \phi$, where $u \circ \phi$ denotes the composition of the two continuous functions; that is, the arrows (the morphisms) are reversed. On the other hand, $M$ is a covariant functor and $M(\phi) \colon M(X) \to M(Y)$, $\mu \mapsto \mu \circ \phi^{-1}$, where $\mu \circ \phi^{-1}(B) := \mu(\phi^{-1}(B))$ for any Borel measurable $B \subseteq Y$. "Duality" itself is a contra-variant functor on the category of topological vector spaces (note the reversal of arrows in Figure 1). Moreover, $C_b(X)^{\star} = M(X)$ and $C_b(\phi)^{\star} = M(\phi)$ if $X$ and $Y$ are compact metric spaces and $\phi \colon X \to Y$ is continuous. Definition 2 can therefore be viewed as the special case where $\phi$ is the projection map:
Definition 2.
Suppose $\phi \colon Z_1 \times Z_2 \to Z_1$, $(z_1, z_2) \mapsto z_1$ is the projection to the first coordinate.
  • $C_b(\phi) \colon C_b(Z_1) \to C_b(Z_1 \times Z_2)$ is called a canonical map, whose action is almost trivial: it sends a function of $z_1$ to itself, but viewed as a function of $(z_1, z_2)$.
  • $M(\phi) \colon M(Z_1 \times Z_2) \to M(Z_1)$ is called marginalization, which simply takes a joint distribution to a marginal distribution.
The Fenchel-Rockafellar duality (see [40] Theorem 1.9, or [56] in the case of finite dimensional vector spaces) usually refers to the k = 1 special case of the following result.
Theorem 4.
Assume that $A$ is a topological vector space whose dual is $A^{\star}$. Let $\Theta_j \colon A \to \mathbb{R} \cup \{+\infty\}$, $j = 0, 1, \dots, k$, for some positive integer $k$. Suppose there exist some $(u_j)_{j=1}^{k}$ and $u_0 := -(u_1 + \dots + u_k)$ such that:
$$\Theta_j(u_j) < \infty, \quad j = 0, \dots, k \tag{26}$$
and $\Theta_0$ is upper semicontinuous at $u_0$. Then:
$$-\inf_{\ell \in A^{\star}} \sum_{j=0}^{k} \Theta_j^{\star}(\ell) = \inf_{u_1, \dots, u_k \in A} \left[ \Theta_0\!\left( -\sum_{j=1}^{k} u_j \right) + \sum_{j=1}^{k} \Theta_j(u_j) \right]. \tag{27}$$
For completeness, we provide a proof of this result, which is based on the Hahn-Banach theorem (Theorem 5) and is similar to the proof of [40] Theorem 1.9.
Proof. 
Let $m_0$ be the right side of (27). The ≤ part of (27) follows trivially from the (weak) min-max inequality, since:
$$m_0 = \inf_{u_0, \dots, u_k \in A} \sup_{\ell \in A^{\star}} \left[ \sum_{j=0}^{k} \Theta_j(u_j) - \ell\!\left( \sum_{j=0}^{k} u_j \right) \right] \ge \sup_{\ell \in A^{\star}} \inf_{u_0, \dots, u_k \in A} \left[ \sum_{j=0}^{k} \Theta_j(u_j) - \ell\!\left( \sum_{j=0}^{k} u_j \right) \right] = -\inf_{\ell \in A^{\star}} \sum_{j=0}^{k} \Theta_j^{\star}(\ell). \tag{28}$$
It remains to prove the ≥ part, and it suffices to assume without loss of generality that $m_0 > -\infty$. Note that (26) also implies that $m_0 < +\infty$. Define the convex sets:
$$C_j := \{ (u, r) \in A \times \mathbb{R} \colon r > \Theta_j(u) \}, \quad j = 0, \dots, k; \tag{29}$$
$$B := \{ (0, m) \in A \times \mathbb{R} \colon m \le m_0 \}. \tag{30}$$
Observe that these are nonempty sets because of (26). Furthermore, $C_0$ has a nonempty interior by the assumption that $\Theta_0$ is upper semicontinuous at $u_0$. Thus, the Minkowski sum:
$$C := C_0 + \dots + C_k \tag{31}$$
is a convex set with a nonempty interior. Moreover, $C \cap B = \emptyset$. By the Hahn-Banach theorem (Theorem 5), there exists a nonzero $(\ell, s) \in A^{\star} \times \mathbb{R}$ such that:
$$s\, m \le \ell\!\left( \sum_{j=0}^{k} u_j \right) + s \sum_{j=0}^{k} r_j \tag{32}$$
for any $m \le m_0$ and $(u_j, r_j) \in C_j$, $j = 0, \dots, k$. From (30), we see that (32) can only hold when $s \ge 0$. Moreover, from (26) and the upper semicontinuity of $\Theta_0$ at $u_0$, we see that the $\sum_{j=0}^{k} u_j$ in (32) can take any value in a neighborhood of $0 \in A$; hence, $s \ne 0$. Thus, by dividing both sides of (32) by $s$ and setting $\bar{\ell} := -\ell / s$, we see that:
$$m_0 \le \inf_{u_0, \dots, u_k \in A} \left[ -\bar{\ell}\!\left( \sum_{j=0}^{k} u_j \right) + \sum_{j=0}^{k} \Theta_j(u_j) \right] = -\sum_{j=0}^{k} \Theta_j^{\star}(\bar{\ell}) \tag{33}$$
which establishes ≥ in (27). ☐
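In finite dimensions, the duality of Theorem 4 can be checked numerically. Here is a sketch (ours, under our reading of the sign conventions in (27)) for $A = \mathbb{R}$, $k = 2$, and quadratic $\Theta_j(u) = (u - a_j)^2/2$, whose Legendre-Fenchel transform is $\Theta_j^{\star}(\ell) = \ell^2/2 + a_j \ell$:

```python
def frange(lo, hi, n):
    # n+1 evenly spaced grid points on [lo, hi]
    step = (hi - lo) / n
    return [lo + i * step for i in range(n + 1)]

a = [0.4, -1.1, 2.3]  # parameters a_0, a_1, a_2 (arbitrary)

def theta(j, u):
    return 0.5 * (u - a[j]) ** 2

def theta_star(j, l):
    # conjugate of a shifted quadratic: sup_u [l*u - (u - a_j)^2 / 2]
    return 0.5 * l * l + a[j] * l

# Left side of (27): -inf over l of the sum of the conjugates
lhs = -min(sum(theta_star(j, l) for j in range(3)) for l in frange(-10, 10, 4000))

# Right side of (27): inf over (u_1, u_2) of
# Theta_0(-u_1 - u_2) + Theta_1(u_1) + Theta_2(u_2)
rhs = min(theta(0, -(u1 + u2)) + theta(1, u1) + theta(2, u2)
          for u1 in frange(-6, 6, 600) for u2 in frange(-6, 6, 600))

# closed form for this instance: (a_0 + a_1 + a_2)^2 / 6
exact = sum(a) ** 2 / 6
assert abs(lhs - exact) < 1e-4 and abs(rhs - exact) < 1e-3
```

Both grid minimizations converge to $(a_0 + a_1 + a_2)^2 / 6$, matching the two sides of (27) for this instance.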
Theorem 5 (Hahn-Banach).
Let C and B be convex, nonempty disjoint subsets of a topological vector space A.
1.
If the interior of $C$ is non-empty, then there exists $\ell \in A^{\star}$, $\ell \ne 0$, such that:
$$\sup_{u \in B} \ell(u) \le \inf_{u \in C} \ell(u). \tag{34}$$
2.
If $A$ is locally convex, $B$ is compact and $C$ is closed, then there exists $\ell \in A^{\star}$ such that:
$$\sup_{u \in B} \ell(u) < \inf_{u \in C} \ell(u). \tag{35}$$
Remark 3.
The assumption in Theorem 5 that $C$ has a nonempty interior is only necessary in the infinite-dimensional case. However, even if $A$ in Theorem 4 is finite dimensional, the assumption in Theorem 4 that $\Theta_0$ is upper semicontinuous at $u_0$ is still necessary, because this assumption was used not only in applying Hahn-Banach, but also in concluding that $s \ne 0$ in (32).

3. The Entropic-Functional Duality

In this section, we prove Theorem 1 and some of its generalizations.

3.1. Compact X

We first state a duality theorem for the case of compact spaces to streamline the proof. Later, we show that the argument can be extended to a particular non-compact case (Theorem 1 is not included in the conference paper [49], but was announced in the conference presentation). Our proof based on the Legendre-Fenchel duality (Theorem 4) was inspired by the proof of the Kantorovich duality in the theory of optimal transportation (see [40] Chapter 1, where the idea was credited to Brenier).
Recall from Section 2 that a random transformation (a mapping between probability measures) is formally the dual of a conditional expectation operator. Suppose $Q_{Y_j|X} = T_j^{\star}$, $j = 1, \dots, m$, and $Q_{Z_i|X} = S_i^{\star}$, $i = 1, \dots, l$.
Proof of Theorem 1. 
We can safely assume $d = 0$ below without loss of generality (since otherwise, we can always substitute $\exp(d/c_1)\,\mu_1$ for $\mu_1$).
1)⇒2) 
This is the nontrivial direction, which relies on certain (strong) min-max type results. In Theorem 4, put (in (36), $u \le 0$ means that $u$ is pointwise non-positive):
$$\Theta_0 \colon u \in C_b(X) \mapsto \begin{cases} 0 & u \le 0; \\ +\infty & \text{otherwise.} \end{cases} \tag{36}$$
Then,
$$\Theta_0^{\star} \colon \pi \in M(X) \mapsto \begin{cases} 0 & \pi \ge 0; \\ +\infty & \text{otherwise.} \end{cases} \tag{37}$$
For each $j = 1, \dots, m$, set:
$$\Theta_j(u) := c_j \inf \log \mu_j\!\left( \exp\!\left( \tfrac{1}{c_j} v \right) \right) \tag{38}$$
where the infimum is over $v \in C_b(Y_j)$ such that $u = T_j v$; if there is no such $v$, then $\Theta_j(u) := +\infty$ as a convention. Observe that:
  • $\Theta_j$ is convex: indeed, given arbitrary $u_0$ and $u_1$, suppose that $v_0$ and $v_1$ respectively achieve the infimum in (38) for $u_0$ and $u_1$ (if the infimum is not achieved, the argument still goes through by an approximation and limiting argument). Then, for any $\alpha \in [0, 1]$, $v_{\alpha} := (1 - \alpha) v_0 + \alpha v_1$ satisfies $u_{\alpha} = T_j v_{\alpha}$, where $u_{\alpha} := (1 - \alpha) u_0 + \alpha u_1$. Thus, the convexity of $\Theta_j$ follows from the convexity of the functional in (23);
  • $\Theta_j(u) > -\infty$ for any $u \in C_b(X)$. Otherwise, for any $P_X$ and $P_{Y_j} := T_j^{\star} P_X$, we have:
$$D(P_{Y_j} \| \mu_j) = \sup_{v} \{ P_{Y_j}(v) - \log \mu_j(\exp(v)) \} \tag{39}$$
$$= \sup_{v} \{ P_X(T_j v) - \log \mu_j(\exp(v)) \} \tag{40}$$
$$= \sup_{u \in C_b(X)} \left[ P_X(u) - \tfrac{1}{c_j} \Theta_j(c_j u) \right] \tag{41}$$
$$= +\infty \tag{42}$$
    which contradicts the assumption that $\sum_{j=1}^{m} c_j D(P_{Y_j} \| \mu_j) < \infty$ in the theorem;
  • From Steps (39)–(41), we see that $\Theta_j^{\star}(\pi) = c_j D(T_j^{\star} \pi \| \mu_j)$ for any $\pi \in M(X)$, where the definition of $D(\cdot \| \mu_j)$ is extended using the Donsker-Varadhan formula (that is, it is infinite when the argument is not a probability measure).
Finally, for the given $(P_{Z_i})_{i=1}^{l}$, choose:
$$\Theta_{m+1} \colon u \in C_b(X) \mapsto \begin{cases} \sum_{i=1}^{l} P_{Z_i}(w_i) & \text{if } u = \sum_{i=1}^{l} S_i w_i \text{ for some } w_i \in C_b(Z_i); \\ +\infty & \text{otherwise.} \end{cases} \tag{43}$$
Notice that:
  • $\Theta_{m+1}$ is convex;
  • $\Theta_{m+1}$ is well defined (that is, the choice of $(w_i)$ in (43) is inconsequential). Indeed, if $(w_i)_{i=1}^{l}$ is such that $\sum_{i=1}^{l} S_i w_i = 0$, then:
$$\sum_{i=1}^{l} P_{Z_i}(w_i) = \sum_{i=1}^{l} (S_i^{\star} P_X)(w_i) = \sum_{i=1}^{l} P_X(S_i w_i) = P_X\!\left( \sum_{i=1}^{l} S_i w_i \right) = 0, \tag{44}$$
    where $P_X$ is such that $S_i^{\star} P_X = P_{Z_i}$, $i = 1, \dots, l$, whose existence is guaranteed by the assumption of the theorem. This also shows that $\Theta_{m+1} > -\infty$.
  • We compute:
$$\Theta_{m+1}^{\star}(\pi) := \sup_{u} \{ \pi(u) - \Theta_{m+1}(u) \} = \sup_{w_1, \dots, w_l} \left[ \pi\!\left( \sum_{i=1}^{l} S_i w_i \right) - \sum_{i=1}^{l} P_{Z_i}(w_i) \right] = \sup_{w_1, \dots, w_l} \sum_{i=1}^{l} \left[ (S_i^{\star}\pi)(w_i) - P_{Z_i}(w_i) \right] = \begin{cases} 0 & \text{if } S_i^{\star}\pi = P_{Z_i},\ i = 1, \dots, l; \\ +\infty & \text{otherwise.} \end{cases} \tag{45}$$
Invoking Theorem 4 (where the $u_j$ in Theorem 4 can be chosen as the constant function $u_j \equiv 1$, $j = 1, \dots, m+1$):
$$\inf_{\pi \colon \pi \ge 0,\ S_i^{\star}\pi = P_{Z_i}} \sum_{j=1}^{m} c_j D(T_j^{\star} \pi \| \mu_j) = -\inf_{v^m, w^l \colon \sum_{j=1}^{m} T_j v_j + \sum_{i=1}^{l} S_i w_i \ge 0} \left[ \sum_{j=1}^{m} c_j \log \mu_j\!\left( \exp\!\left( \tfrac{1}{c_j} v_j \right) \right) + \sum_{i=1}^{l} P_{Z_i}(w_i) \right] \tag{46}$$
where $v^m$ denotes the collection of the functions $v_1, \dots, v_m$, and similarly for $w^l$. Note that the left side of (46) is exactly the right side of (14). For any $\epsilon > 0$, choose $v_j \in C_b(Y_j)$, $j = 1, \dots, m$, and $w_i \in C_b(Z_i)$, $i = 1, \dots, l$, such that $\sum_{j=1}^{m} T_j v_j + \sum_{i=1}^{l} S_i w_i \ge 0$ and:
$$\epsilon - \sum_{j=1}^{m} c_j \log \mu_j\!\left( \exp\!\left( \tfrac{1}{c_j} v_j \right) \right) - \sum_{i=1}^{l} P_{Z_i}(w_i) > \inf_{\pi \colon \pi \ge 0,\ S_i^{\star}\pi = P_{Z_i}} \sum_{j=1}^{m} c_j D(T_j^{\star} \pi \| \mu_j). \tag{47}$$
Now, invoking (13) with $f_j := \exp\!\left( \tfrac{1}{c_j} v_j \right)$, $j = 1, \dots, m$, and $g_i := \exp\!\left( -\tfrac{1}{b_i} w_i \right)$, $i = 1, \dots, l$, we upper bound the left side of (47) by:
$$\epsilon - \sum_{i=1}^{l} b_i \log \nu_i(g_i) + \sum_{i=1}^{l} b_i P_{Z_i}(\log g_i) \le \epsilon + \sum_{i=1}^{l} b_i D(P_{Z_i} \| \nu_i) \tag{48}$$
where the last step follows by the Donsker-Varadhan formula. Therefore, (14) is established since ϵ > 0 is arbitrary.
2)⇒1) 
Since $\nu_i$ is finite and $g_i$ is bounded by assumption, we have $\nu_i(g_i) < \infty$, $i = 1, \dots, l$. Moreover, (13) is trivially true when $\nu_i(g_i) = 0$ for some $i$, so we will assume below that $\nu_i(g_i) \in (0, \infty)$ for each $i$. Define $P_{Z_i}$ by:
$$\frac{\mathrm{d}P_{Z_i}}{\mathrm{d}\nu_i} = \frac{g_i}{\nu_i(g_i)}, \quad i = 1, \dots, l. \tag{49}$$
Then, for any $\epsilon > 0$,
$$\sum_{i=1}^{l} b_i \log \nu_i(g_i) = \sum_{i=1}^{l} b_i \left[ P_{Z_i}(\log g_i) - D(P_{Z_i} \| \nu_i) \right] \tag{50}$$
$$< \sum_{j=1}^{m} c_j P_{Y_j}(\log f_j) + \epsilon - \sum_{j=1}^{m} c_j D(P_{Y_j} \| \mu_j) \tag{51}$$
$$\le \epsilon + \sum_{j=1}^{m} c_j \log \mu_j(f_j) \tag{52}$$
where:
  • (51) uses the Donsker-Varadhan formula, and we have chosen $P_X$ and $P_{Y_j} := T_j^{\star} P_X$, $j = 1, \dots, m$, such that:
$$\sum_{i=1}^{l} b_i D(P_{Z_i} \| \nu_i) > \sum_{j=1}^{m} c_j D(P_{Y_j} \| \mu_j) - \epsilon; \tag{53}$$
  • (52) also follows from the Donsker-Varadhan formula.
The result follows since ϵ > 0 can be arbitrary.
Remark 4.
Condition (iv) in the theorem imposes a rather strong assumption on $(S_i)$: for simplicity, consider the case where $|X|, |Z_i| < \infty$. Then, Condition (iv) assumes that for any $(P_{Z_i})$, there exists $P_X$ such that $P_{Z_i} = S_i^{\star} P_X$. This assumption is certainly satisfied when the $(S_i)$ are induced by coordinate projections; the case of $l = 1$ and $P_{Z|X}$ being a reverse erasure channel gives a simple example where $P_{Z|X}$ is not a deterministic map.
Next, we give a generalization of Theorem 1, which alleviates the restriction on ( S i ) :
Theorem 6.
Theorem 1 continues to hold if Condition (iv) therein is weakened to the following:
  • For any $P_X$ such that $D(S_i^{\star} P_X \| \nu_i) < \infty$, $i = 1, \dots, l$, there exists $\tilde{P}_X$ such that $S_i^{\star} \tilde{P}_X = S_i^{\star} P_X$ for each $i$ and $\sum_{j=1}^{m} c_j D(T_j^{\star} \tilde{P}_X \| \mu_j) < \infty$.
and the conclusion of the theorem will be replaced by the equivalence of the following two statements:
1.
For any nonnegative continuous functions $(g_i)$, $(f_j)$ bounded away from zero and such that:
$$\sum_{i=1}^{l} b_i S_i \log g_i \le \sum_{j=1}^{m} c_j T_j \log f_j \tag{54}$$
we have:
$$\inf_{(\tilde{g}_i) \colon \sum_{i=1}^{l} b_i S_i \log \tilde{g}_i \ge \sum_{i=1}^{l} b_i S_i \log g_i} \prod_{i=1}^{l} \nu_i^{b_i}(\tilde{g}_i) \le \exp(d) \prod_{j=1}^{m} \mu_j^{c_j}(f_j). \tag{55}$$
2.
For any $P_X$ such that $D(S_i^{\star} P_X \| \nu_i) < \infty$, $i = 1, \dots, l$,
$$\sum_{i=1}^{l} b_i D(S_i^{\star} P_X \| \nu_i) + d \ge \inf_{\tilde{P}_X \colon S_i^{\star} \tilde{P}_X = S_i^{\star} P_X} \sum_{j=1}^{m} c_j D(T_j^{\star} \tilde{P}_X \| \mu_j). \tag{56}$$
In Appendix A, we show that Theorem 6 indeed recovers Theorem 1 for the more restricted class of random transformations.
Proof. 
Here, we mention the parts of the proof that need to be changed: upon specifying $(f_j)$ and $(g_i)$ right after (47), we select $(\tilde{g}_i)$ such that:
$$\sum_{i=1}^{l} b_i S_i \log \tilde{g}_i \ge \sum_{i=1}^{l} b_i S_i \log g_i \tag{57}$$
$$\sum_{i=1}^{l} b_i \log \nu_i(\tilde{g}_i) \le \sum_{j=1}^{m} c_j \log \mu_j(f_j) + \epsilon. \tag{58}$$
Then, in lieu of (48), we upper-bound the left side of (47) by:
$$2\epsilon - \sum_{i=1}^{l} b_i \log \nu_i(\tilde{g}_i) + \sum_{i=1}^{l} b_i P_{Z_i}(\log \tilde{g}_i) \le 2\epsilon + \sum_{i=1}^{l} b_i D(P_{Z_i} \| \nu_i) \tag{59}$$
which establishes the 1)⇒2) part. For the other direction, for each $i \in \{1, 2, \dots, l\}$, define:
$$\Lambda_i(u) := \inf_{\tilde{g}_i > 0 \colon\, b_i S_i \log \tilde{g}_i = u} b_i \log \nu_i(\tilde{g}_i). \tag{60}$$
Then, following essentially the same proof as that for $\Theta_j$ in (38), we see that $\Lambda_i$ is proper convex and:
$$\Lambda_i^{\star}(\pi) = b_i D(S_i^{\star} \pi \| \nu_i). \tag{61}$$
Moreover, let:
$$\Lambda_{l+1}(u) := \begin{cases} 0 & \text{if } u = -\sum_{i=1}^{l} b_i S_i \log g_i; \\ +\infty & \text{otherwise.} \end{cases} \tag{62}$$
Then, $\Lambda_{l+1}^{\star}(\pi) = -\sum_{i=1}^{l} b_i (S_i^{\star} \pi)(\log g_i)$. Using the Legendre-Fenchel duality, we see that for any $\epsilon > 0$,
$$\inf_{(\tilde{g}_i) \colon \sum_{i=1}^{l} b_i S_i \log \tilde{g}_i \ge \sum_{i=1}^{l} b_i S_i \log g_i} \sum_{i=1}^{l} b_i \log \nu_i(\tilde{g}_i)$$
$$= \inf_{u_1, \dots, u_{l+1}} \left[ \Theta_0\!\left( -\sum_{i=1}^{l+1} u_i \right) + \sum_{i=1}^{l+1} \Lambda_i(u_i) \right] \tag{63}$$
$$= \sup_{\pi} \left[ -\Theta_0^{\star}(\pi) - \sum_{i=1}^{l+1} \Lambda_i^{\star}(\pi) \right] \tag{64}$$
$$= \sup_{\pi \ge 0} \left[ -\sum_{i=1}^{l+1} \Lambda_i^{\star}(\pi) \right] \tag{65}$$
$$= \sup_{\pi \ge 0} \left[ \sum_{i=1}^{l} b_i (S_i^{\star} \pi)(\log g_i) - \sum_{i=1}^{l} b_i D(S_i^{\star} \pi \| \nu_i) \right] \tag{66}$$
$$\le \sum_{i=1}^{l} b_i (S_i^{\star} P_X)(\log g_i) - \sum_{i=1}^{l} b_i D(S_i^{\star} P_X \| \nu_i) + \epsilon \tag{67}$$
$$\le \sum_{j=1}^{m} c_j (T_j^{\star} \tilde{P}_X)(\log f_j) - \sum_{j=1}^{m} c_j D(T_j^{\star} \tilde{P}_X \| \mu_j) + 2\epsilon \tag{68}$$
$$\le 2\epsilon + \sum_{j=1}^{m} c_j \log \mu_j(f_j) \tag{69}$$
where:
  • To see (67), we note that the sup in (66) can be restricted to $\pi$ that is a probability measure, since otherwise, the relative entropy terms in (66) are $+\infty$ by their definition via the Donsker-Varadhan formula. Then, we select $P_X$ such that (67) holds.
  • In (68), we have chosen $\tilde{P}_X$ such that:
$$S_i^{\star} \tilde{P}_X = S_i^{\star} P_X, \quad 1 \le i \le l; \tag{70}$$
$$\sum_{i=1}^{l} b_i D(S_i^{\star} P_X \| \nu_i) > \sum_{j=1}^{m} c_j D(T_j^{\star} \tilde{P}_X \| \mu_j) - \epsilon, \tag{71}$$
    and then applied the assumption (54). The result follows since $\epsilon > 0$ can be arbitrary. ☐
Remark 5.
The infimum in (14) is in fact achieved: for any $(P_{Z_i})$, there exists a $P_X$ that minimizes $\sum_{j=1}^{m} c_j D(P_{Y_j} \| \mu_j)$ subject to the constraints $S_i^{\star} P_X = P_{Z_i}$, $i = 1, \dots, l$, where $P_{Y_j} := T_j^{\star} P_X$, $j = 1, \dots, m$. Indeed, since the singleton $\{P_{Z_i}\}$ is weak$^{\star}$-closed and $S_i^{\star}$ is weak$^{\star}$-continuous (generally, if $T \colon A \to B$ is a continuous map between two topological vector spaces, then $T^{\star} \colon B^{\star} \to A^{\star}$ is a weak$^{\star}$-continuous map between the dual spaces. Indeed, if $y_n \to y$ is a weak$^{\star}$-convergent sequence in $B^{\star}$, meaning $y_n(b) \to y(b)$ for any $b \in B$, then we must have $T^{\star} y_n(a) = y_n(Ta) \to y(Ta) = T^{\star} y(a)$ for any $a \in A$, meaning that $T^{\star} y_n$ converges to $T^{\star} y$ in the weak$^{\star}$ topology), the set $\bigcap_{i=1}^{l} (S_i^{\star})^{-1} \{P_{Z_i}\}$ is weak$^{\star}$-closed in $M(X)$; hence, its intersection with $P(X)$ is weak$^{\star}$-compact in $P(X)$, because $P(X)$ is weak$^{\star}$-compact by (a simple version, for the setting of a compact underlying space $X$, of) the Prokhorov theorem [57]. Moreover, by the weak$^{\star}$-lower semicontinuity of $D(\cdot \| \mu_j)$ (easily seen from the variational formula/Donsker-Varadhan formula for the relative entropy, cf. [58]) and the weak$^{\star}$-continuity of $T_j^{\star}$, $j = 1, \dots, m$, we see that $\sum_{j=1}^{m} c_j D(T_j^{\star} P_X \| \mu_j)$ is weak$^{\star}$-lower semicontinuous in $P_X$, and hence, the existence of a minimizing $P_X$ is established.
Remark 6.
Abusing the terminology from min-max theory, Theorem 1 may be interpreted as a “strong duality” result, which establishes the equivalence of two optimization problems. The 1)⇒2) part is the non-trivial direction, which requires regularity on the spaces. In contrast, the 2)⇒1) direction can be thought of as a “weak duality”, which establishes only a partial relation, but holds for more general spaces.

3.2. Noncompact X

Our proof of 1)⇒2) in Theorem 1 makes use of the Hahn-Banach theorem and hence relies crucially on the fact that the measure space is the dual of the function space. Naively, one might want to extend the proof to the case of locally compact X by considering C_0(X) instead of C_b(X), so that the dual space is still M(X). However, this would not work: consider the case when X = Z_1 × ⋯ × Z_l and each S_i is the canonical map. Then, Θ_{m+1}(u) as defined in (43) is +∞ unless u ≤ 0 (because u ∈ C_0(X) requires that u vanishes at infinity); thus, Θ_{m+1}* ≡ 0. Luckily, we can still work with C_b(X); in this case, an element ℓ ∈ C_b(X)* may not be a measure, but we can decompose it as ℓ = π + R, where π ∈ M(X) and R is a linear functional “supported at infinity”. Below, we use the techniques in [40] (Chapter 1.3) to prove a particular extension of Theorem 1 to a non-compact case.
Theorem 7.
Theorem 1 still holds if
  • The assumption that X is a compact metric space is relaxed to the assumption that it is a locally compact and σ-compact Polish space;
  • X = ∏_{i=1}^l Z_i and S_i: C_b(Z_i) → C_b(X), i = 1, …, l are canonical maps (see Definition 2).
Proof. 
The proof of the “weak duality” part 2)⇒1) still works in the noncompact case, so we only need to explain what changes need to be made in the proof of the 1)⇒2) part. Let Θ_0 be defined as before, in (36). Then, for any ℓ ∈ C_b(X)*,
Θ_0*(ℓ) = sup_{u ≤ 0} ℓ(u)
which is zero if ℓ is nonnegative (in the sense that ℓ(u) ≥ 0 for every u ≥ 0), and +∞ otherwise. This means that when computing the infimum on the left side of (27), we only need to take into account those nonnegative ℓ.
Next, let Θ_{m+1} also be defined as before. Then, directly from the definition, we have:
Θ_{m+1}*(ℓ) = 0 if ℓ(Σ_{i=1}^l S_i w_i) = Σ_{i=1}^l P_{Z_i}(w_i), ∀w_i ∈ C_b(Z_i), i = 1, …, l; and Θ_{m+1}*(ℓ) = +∞ otherwise,     (73)
for any ℓ ∈ C_b(X)*. Generally, the condition in the first line of (73) does not imply that ℓ is a measure. However, if ℓ is also nonnegative, then using a technical result in [40] (Lemma 1.25), we can further simplify:
Θ_{m+1}*(ℓ) = 0 if ℓ ∈ M(X) and S_iℓ = P_{Z_i}, i = 1, …, l; and Θ_{m+1}*(ℓ) = +∞ otherwise.     (74)
This further shows that when we compute the left side of (27), the infimum can be taken over ℓ which are couplings of (P_{Z_i}). In particular, if ℓ is a probability measure, then Θ_j*(ℓ) = c_j D(T_jℓ ‖ μ_j) still holds with the Θ_j defined in (38), j = 1, …, m. Thus, the rest of the proof can proceed as before. ☐
Remark 7.
The second assumption is made in order to achieve (74) in the proof.

4. Gaussian Optimality

Recall that the conventional Brascamp-Lieb inequality and its reverse ((1) and (2)) state that centered Gaussian functions exhaust such inequalities, and in particular, verifying those inequalities is reduced to a finite dimensional optimization problem (only the covariance matrices in these Gaussian functions are to be optimized). In this section, we show that similar results hold for the forward-reverse Brascamp-Lieb inequality, as well. Our proof uses the rotational invariance argument mentioned in Section 1. Since the forward-reverse Brascamp-Lieb inequality has dual representations (Theorem 7), in principle, the rotational invariance argument can be applied either to the functional representation (as in Lieb’s paper [29]) or the entropic representation (as in Geng-Nair [48]). Here, we adopt the latter approach. We first consider a certain “non-degenerate” case where the existence of an extremizer is guaranteed. Then, Gaussian optimality in the general case follows by a limiting argument (Appendix F), establishing Theorem 2.

4.1. Non-Degenerate Forward Channels

This subsection focuses on the following case:
Assumption 1.
  • Fix Lebesgue measures ( μ j ) j = 1 m and Gaussian measures ( ν i ) i = 1 l on R ;
  • non-degenerate (Definition 3 below) linear Gaussian random transformations (P_{Y_j|X})_{j=1}^m (where X := (X_1, …, X_l)) associated with conditional expectation operators (T_j)_{j=1}^m;
  • ( S i ) i = 1 l are induced by coordinate projections;
  • positive ( c j ) and ( b i ) .
Definition 3.
We say (Q_{Y_1|X}, …, Q_{Y_m|X}) is non-degenerate if each Q_{Y_j|X=0} is an n_j-dimensional Gaussian distribution with an invertible covariance matrix.
Given Borel probability measures P_{X_i} on ℝ, i = 1, …, l, define:
F_0((P_{X_i})) := inf_{P_X} { Σ_{j=1}^m c_j D(P_{Y_j} ‖ μ_j) − Σ_{i=1}^l b_i D(P_{X_i} ‖ ν_i) }     (75)
where the infimum is over Borel measures P_X that have (P_{X_i}) as marginals. Note that (75) is well defined since the first term cannot be +∞ under the non-degeneracy assumption, and the second term cannot be −∞. The aim of this subsection is to prove the following:
Theorem 8.
sup_{(P_{X_i})} F_0((P_{X_i})), where the supremum is over Borel probability measures P_{X_i} on ℝ, i = 1, …, l, is achieved by some Gaussian (P_{X_i})_{i=1}^l, in which case the infimum in (75) is achieved by some Gaussian P_X.
Naturally, one would expect that Gaussian optimality can be established when ( μ j ) j = 1 m and ( ν i ) i = 1 l are either Gaussian or Lebesgue. We made the assumption that the former is Lebesgue and the latter is Gaussian so that certain technical conditions can be justified more easily. More precisely, the following observation shows that we can regularize the distributions by a second moment constraint for free:
Proposition 1.
sup ( P X i ) F 0 ( ( P X i ) ) is finite and there exist σ i 2 ( 0 , ) , i = 1 , , l such that it equals:
sup_{(P_{X_i}): E[X_i²] ≤ σ_i², i = 1, …, l} F_0((P_{X_i})).     (76)
Proof. 
When μ_j is Lebesgue and P_{Y_j|X} is non-degenerate, D(P_{Y_j} ‖ μ_j) = −h(P_{Y_j}) ≤ −h(Y_j|X) is bounded above (in terms of the variance of the additive noise of P_{Y_j|X}). Moreover, D(P_{X_i} ‖ ν_i) ≥ 0 when ν_i is Gaussian, so sup_{(P_{X_i})} F_0((P_{X_i})) < ∞. Further, choosing (P_{X_i}) = (ν_i) and using the covariance matrix to lower bound the first term in (75) shows that sup_{(P_{X_i})} F_0((P_{X_i})) > −∞.
To see (76), notice that:
D(P_{X_i} ‖ ν_i) = D(P_{X_i} ‖ ν_i′) + E[ı_{ν_i′‖ν_i}(X_i)] = D(P_{X_i} ‖ ν_i′) + D(ν_i′ ‖ ν_i) ≥ D(ν_i′ ‖ ν_i)     (77)
where ν_i′ is a Gaussian distribution with the same first and second moments as X_i ∼ P_{X_i}. Thus, D(P_{X_i} ‖ ν_i) is bounded below by some function of the second moment of X_i, which tends to ∞ as the second moment of X_i tends to ∞. Moreover, as argued in the preceding paragraph, the first term in (75) is bounded above by some constant depending only on (P_{Y_j|X}). Thus, we can choose σ_i² > 0, i = 1, …, l large enough such that if E[X_i²] > σ_i² for some i, then F_0((P_{X_i})) < sup_{(P_{X_i})} F_0((P_{X_i})), irrespective of the choices of P_{X_1}, …, P_{X_{i−1}}, P_{X_{i+1}}, …, P_{X_l}. Then, these σ_1, …, σ_l are as desired in the proposition. ☐
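The decomposition (77) is easy to check numerically: since log(dν′/dν) is a quadratic polynomial, its expectation under P depends only on the first two moments of P, which ν′ matches. A minimal sanity check in Python (the mixture distribution, grid, and variances are arbitrary illustrative choices):

```python
import numpy as np

x = np.linspace(-30.0, 30.0, 400001)
dx = x[1] - x[0]

def gauss(t, m, s2):
    return np.exp(-(t - m) ** 2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)

# P: a non-Gaussian two-component Gaussian mixture playing the role of P_{X_i}.
p = 0.3 * gauss(x, -2.0, 1.0) + 0.7 * gauss(x, 1.5, 0.5)

# nu': the Gaussian with the same first and second moments as P.
mean = np.sum(x * p) * dx
var = np.sum((x - mean) ** 2 * p) * dx
nu_prime = gauss(x, mean, var)

# nu: the reference Gaussian measure.
nu = gauss(x, 0.0, 4.0)

def D(f, g):
    """Relative entropy D(f || g) in nats, by Riemann sum on the grid."""
    return np.sum(np.where(f > 0, f * np.log(np.maximum(f, 1e-300) / g), 0.0)) * dx

lhs = D(p, nu)
rhs = D(p, nu_prime) + D(nu_prime, nu)
print(lhs, rhs)  # the two sides agree up to quadrature error
```

The agreement of the two sides reflects exactly the moment-matching argument in the proof above.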
The non-degenerate assumption ensures that the supremum is achieved:
Proposition 2.
Under Assumption 1,
1.
For any ( P X i ) i = 1 l , the infimum in (75) is attained by some Borel P X .
2.
If (P_{Y_j|X})_{j=1}^m are non-degenerate (Definition 3), then the supremum in (76) is achieved by some Borel (P_{X_i})_{i=1}^l.
The proof of Proposition 2 is given in Appendix E. After taking care of the existence of the extremizers, we get into the tensorization properties, which are the crux of the proof:
Lemma 1.
Fix (P_{X_i^{(1)}}), (P_{X_i^{(2)}}), (μ_j), (T_j), (c_j) ∈ [0, ∞)^m, and let the S_i be induced by coordinate projections. Then:
inf_{P_{X^{(1,2)}}: S_iP_{X^{(1,2)}} = P_{X_i^{(1)}} × P_{X_i^{(2)}}, ∀i} Σ_{j=1}^m c_j D(P_{Y_j^{(1,2)}} ‖ μ_j^{⊗2}) = Σ_{t=1,2} inf_{P_{X^{(t)}}: S_iP_{X^{(t)}} = P_{X_i^{(t)}}, ∀i} Σ_{j=1}^m c_j D(P_{Y_j^{(t)}} ‖ μ_j)
where for each j,
P_{Y_j^{(1,2)}} := T_j^{⊗2} P_{X^{(1,2)}}
on the left side and:
P_{Y_j^{(t)}} := T_j P_{X^{(t)}}
on the right side, t = 1, 2.
Proof. 
We only need to prove the nontrivial ≥ part. For any P X ( 1 , 2 ) on the left side, choose P X ( t ) on the right side by marginalization. Then:
D(P_{Y_j^{(1,2)}} ‖ μ_j^{⊗2}) − Σ_t D(P_{Y_j^{(t)}} ‖ μ_j) = I(Y_j^{(1)}; Y_j^{(2)}) ≥ 0
for each j. ☐
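The identity used in this proof (the relative entropy of the joint with respect to μ_j^{⊗2} exceeds the sum of the marginal terms by exactly the mutual information) can be checked numerically on finite alphabets. A small sketch, with an arbitrary alphabet size and random distributions:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 5

# A random joint distribution for (Y_j^(1), Y_j^(2)) on a k x k alphabet,
# and a reference probability measure mu_j on the k-letter alphabet.
P = rng.random((k, k))
P /= P.sum()
mu = rng.random(k)
mu /= mu.sum()

def D(p, q):
    p, q = np.ravel(p), np.ravel(q)
    return float(np.sum(p * np.log(p / q)))

P1, P2 = P.sum(axis=1), P.sum(axis=0)   # the two marginals
mu2 = np.outer(mu, mu)                  # mu_j tensor mu_j

gap = D(P, mu2) - (D(P1, mu) + D(P2, mu))
MI = D(P, np.outer(P1, P2))             # mutual information I(Y^(1); Y^(2))
print(gap, MI)                          # equal, and nonnegative
```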
We are now ready to show the main result of this section.
Proof of Theorem 8. 
  • Assume that (P_{X_i^{(1)}}) and (P_{X_i^{(2)}}) are maximizers of F_0 (possibly equal). Let P_{X_i^{(1,2)}} := P_{X_i^{(1)}} × P_{X_i^{(2)}}. Define:
    X^+ := (1/√2)(X^{(1)} + X^{(2)});
    X^− := (1/√2)(X^{(1)} − X^{(2)}).
    Define (Y_j^+) and (Y_j^−) analogously. Then, Y_j^+ | {X^+ = x^+, X^− = x^−} ∼ Q_{Y_j|X = x^+} is independent of x^−, and Y_j^− | {X^+ = x^+, X^− = x^−} ∼ Q_{Y_j|X = x^−} is independent of x^+.
  • Next, we perform the same algebraic expansion as in the proof of tensorization:
    Σ_{t=1}^2 F_0((P_{X_i^{(t)}})_{i=1}^l)
    = Σ_{t=1}^2 inf_{P_{X^{(t)}}: S_iP_{X^{(t)}} = P_{X_i^{(t)}}} { Σ_j c_j D(P_{Y_j^{(t)}} ‖ μ_j) − Σ_i b_i D(P_{X_i^{(t)}} ‖ ν_i) }     (83)
    = inf_{P_{X^{(1,2)}}: S_i^{⊗2}P_{X^{(1,2)}} = P_{X_i^{(1,2)}}} { Σ_j c_j D(P_{Y_j^{(1,2)}} ‖ μ_j^{⊗2}) − Σ_i b_i D(P_{X_i^{(1,2)}} ‖ ν_i^{⊗2}) }     (84)
    = inf_{P_{X^+X^−}: S_i^{⊗2}P_{X^+X^−} = P_{X_i^+X_i^−}} { Σ_j c_j D(P_{Y_j^+Y_j^−} ‖ μ_j^{⊗2}) − Σ_i b_i D(P_{X_i^+X_i^−} ‖ ν_i^{⊗2}) }     (85)
    ≤ inf_{P_{X^+X^−}: S_i^{⊗2}P_{X^+X^−} = P_{X_i^+X_i^−}} { Σ_j c_j [D(P_{Y_j^+} ‖ μ_j) + D(P_{Y_j^−|X^+} ‖ μ_j | P_{X^+})] − Σ_i b_i [D(P_{X_i^+} ‖ ν_i) + D(P_{X_i^−|X_i^+} ‖ ν_i | P_{X_i^+})] }     (86)
    ≤ Σ_j c_j [D(P_{Y_j^+} ‖ μ_j) + D(P_{Y_j^−|X^+} ‖ μ_j | P_{X^+})] − Σ_i b_i [D(P_{X_i^+} ‖ ν_i) + D(P_{X_i^−|X^+} ‖ ν_i | P_{X^+})]     (87)
    = F_0((P_{X_i^+})_{i=1}^l) + ∫ F_0((P_{X_i^−|X^+})_{i=1}^l) dP_{X^+}     (88)
    ≤ Σ_{t=1}^2 F_0((P_{X_i^{(t)}})_{i=1}^l)     (89)
    where:
    • (84) uses Lemma 1.
    • (86) is because of the Markov chain Y_j^+ − X^+ − Y_j^− (for any coupling).
    • In (87), we selected a particular instance of the coupling P_{X^+X^−}, constructed as follows: first, we select an optimal coupling P_{X^+} for the given marginals (P_{X_i^+}). Then, for any x^+ = (x_i^+)_{i=1}^l, let P_{X^−|X^+ = x^+} be an optimal coupling of (P_{X_i^−|X_i^+ = x_i^+}) (for a justification that we can select the optimal coupling P_{X^−|X^+ = x^+} in a way that P_{X^−|X^+} is indeed a regular conditional probability distribution, see [7]). With this construction, it is apparent that X_i^− − X_i^+ − X^+ is a Markov chain, and hence:
      D(P_{X_i^−|X_i^+} ‖ ν_i | P_{X_i^+}) = D(P_{X_i^−|X^+} ‖ ν_i | P_{X^+}).     (90)
    • (88) is because in the above, we have constructed the coupling optimally.
    • (89) is because ( P X i ( t ) ) maximizes F 0 , t = 1 , 2 .
  • Thus, in the expansions above, equalities are attained throughout. Using the differentiation technique as in the case of the forward inequality, for almost all (b_i), (c_j), we have:
    D(P_{X_i^−|X_i^+} ‖ ν_i | P_{X_i^+}) = D(P_{X_i^+} ‖ ν_i)     (91)
    = D(P_{X_i^−} ‖ ν_i), ∀i,     (92)
    where (92) is because, by symmetry, we can perform the algebraic expansions in a different way to show that (P_{X_i^−}) is also a maximizer of F_0. Then, I(X_i^+; X_i^−) = D(P_{X_i^−|X_i^+} ‖ ν_i | P_{X_i^+}) − D(P_{X_i^−} ‖ ν_i) = 0, which, combined with I(X_i^{(1)}; X_i^{(2)}) = 0, shows that X_i^{(1)} and X_i^{(2)} are Gaussian with the same covariance. Lastly, using Lemma 1 and the doubling trick, one can show that the optimal coupling is also Gaussian.

4.2. Analysis of Example 1 Using Gaussian Optimality

We note that Example 1 is a rather simple setting, where (17) can be proven by integrating the two sides of (18) and applying the change of variables, noting that the absolute value of the Jacobian equals one. Nevertheless, it is illuminating to give an alternative proof using the Gaussian optimality result, as a proof of concept. In this section, we only give a proof sketch where certain “technicalities” are not justified. Details of the justifications are deferred to Appendix F.
Proof sketch for the claim in Example 1. 
By duality (Theorem 7), it suffices to prove the corresponding entropic inequality. The Gaussian optimality result in Theorem 8 assumed Gaussian reference measures on the output and non-degenerate forward channels in order to simplify the proof of the existence of minimizers; however, supposing that Gaussian optimality extends beyond those technical conditions, we see that it suffices to prove that for any centered Gaussian ( P X i ) ,
Σ_{i=1}^l h(P_{X_i}) ≤ sup_{P_{X^l}} Σ_{j=1}^l h(P_{Y_j})     (93)
where the supremum is over Gaussian P_{X^l} with the marginals P_{X_1}, …, P_{X_l} and Y_j := Σ_{i=1}^l m_{ji} X_i. Let a_i := E[X_i²], and choose P_{X^l} = ∏_{i=1}^l P_{X_i}; we see that (93) holds if:
Σ_{i=1}^l log a_i ≤ Σ_{j=1}^l log(Σ_{i=1}^l m_{ji}² a_i), ∀a_i > 0, i = 1, …, l,     (94)
where (a_i) are the eigenvalues and (Σ_{i=1}^l m_{ji}² a_i)_{j=1}^l are the diagonal entries of the matrix:
M diag((a_i)_{1≤i≤l}) M^⊤.     (95)
Therefore, (94) holds by Hadamard's inequality (the determinant of a positive semidefinite matrix is at most the product of its diagonal entries). ☐
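Inequality (94) is straightforward to test numerically: for orthogonal M, the eigenvalues of M diag(a) M^⊤ are exactly the a_i, while its diagonal entries are Σ_i m_ji² a_i. A quick check with a random orthogonal matrix (obtained via QR; the dimension and ranges are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
l = 6

# A random orthogonal matrix M (via QR) and random positive (a_i).
M, _ = np.linalg.qr(rng.standard_normal((l, l)))
a = rng.uniform(0.1, 10.0, size=l)

S = M @ np.diag(a) @ M.T                 # eigenvalues of S are exactly (a_i)
lhs = float(np.sum(np.log(a)))           # sum_i log a_i = log det S
rhs = float(np.sum(np.log(np.diag(S))))  # sum_j log sum_i m_ji^2 a_i
print(lhs, rhs)                          # Hadamard's inequality: lhs <= rhs
```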
A generalization of Example 1 is as follows.
Proposition 3.
For any orthogonal matrix M := (m_{ji})_{1≤j,i≤l} with nonzero entries, we claim that there exists a neighborhood U of the uniform probability vector (1/l, …, 1/l), such that for any (b_1, …, b_l) and (c_1, …, c_l) in U, the best constant D in the FR-BL inequality (16) equals exp(H(c^l) − H(b^l)), where H(·) is the entropy functional.
The proposition generalizes the claim in Example 1. Indeed, observe that there is no loss of generality in assuming that (b_1, …, b_l) and (c_1, …, c_l) are probability vectors, since by dimensional analysis, we see that the best constant is infinite unless Σ_{i=1}^l b_i = Σ_{j=1}^l c_j; and it is also clear that the best constant is invariant when each b_i and c_j is multiplied by the same positive number. Moreover, any orthogonal matrix can be approximated by a sequence of orthogonal M with nonzero entries, for which the neighborhood U shrinks, but always contains the uniform probability vector (1/l, …, 1/l).
Proof sketch for Proposition 3. 
Note that along the same lines as (94), the best constant in the FR-BL inequality equals:
D = sup_{a^l ∈ Δ} [ ∏_{i=1}^l a_i^{b_i} / sup_{A ⪰ 0: A_{ii} = a_i} ∏_{j=1}^l (M A M^⊤)_{jj}^{c_j} ]     (96)
where, without loss of generality, we assumed a^l is in the probability simplex Δ. We first observe that if the positive semidefinite constraint A ⪰ 0 in (96) were nonexistent, then the sup in the denominator in (96) would equal ∏_{j=1}^l c_j^{c_j}, and consequently, (96) would equal exp(H(c^l) − H(b^l)), for any b^l, c^l ∈ Δ not necessarily close to the uniform probability vector. Indeed, fixing A_{ii} = a_i, i = 1, …, l, the linear map from the off-diagonal entries of A to the diagonal entries of M A M^⊤ is onto the space of l-vectors whose entries sum to one; the proof of the surjectivity can be reduced to checking the fact that the only diagonal matrices that commute with M are the multiples of the identity matrix. Then, the sup in the denominator is achieved when (M A M^⊤)_{jj} = c_j, j = 1, …, l, which is independent of a^l.
Next, we argue that the constraint A ⪰ 0 in (96) is not active when b^l and c^l are close to the uniform vector. Denote by U(t) the set of l-vectors whose distance (say in total variation) to the uniform vector (1/l, …, 1/l) is at most t. Observe that:
  • There exists t > 0 such that for every a^l ∈ U(t),
    sup_{A ⪰ 0: A_{ii} = a_i} ∏_{j=1}^l (M A M^⊤)_{jj} = l^{−l},     (97)
    which follows by continuity and the fact that when a^l is uniform, the sup in (97) is achieved at the strictly positive definite A = l^{−1} I.
  • When b^l = c^l = (1/l, …, 1/l) is the uniform probability vector, (96) equals one, which is uniquely achieved by a^l = (1/l, …, 1/l). To see the uniqueness, take A to be diagonal in the denominator and observe that the denominator is strictly bigger than the numerator when the diagonals of M A M^⊤ are not a permutation of a^l. Then, since the extreme value of a continuous function is achieved on a compact set, we can find ϵ > 0 such that:
    ∏_{i=1}^l a_i^{1/l} / sup_{A ⪰ 0: A_{ii} = a_i} ∏_{j=1}^l (M A M^⊤)_{jj}^{1/l} < 1 − ϵ     (98)
    for any a^l ∉ U(t/2).
  • Finally, by continuity, we can choose s ∈ (0, t/2) small enough such that for any b^l, c^l ∈ U(s),
    ∏_{i=1}^l a_i^{b_i} / sup_{A ⪰ 0: A_{ii} = a_i} ∏_{j=1}^l (M A M^⊤)_{jj}^{c_j} < 1 − ϵ/2, ∀a^l ∉ U(t/2);     (99)
    sup_{A ⪰ 0: A_{ii} = a_i} ∏_{j=1}^l (M A M^⊤)_{jj}^{c_j} = sup_{A: A_{ii} = a_i} ∏_{j=1}^l (M A M^⊤)_{jj}^{c_j}, ∀a^l ∈ U(t/2);     (100)
    exp(H(c^l) − H(b^l)) > 1 − ϵ/2.     (101)
Taking the neighborhood U ( s ) proves the claim. ☐
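For l = 2 and a rotation matrix (which has nonzero entries whenever the angle is not a multiple of π/2), the claim at the uniform point b^l = c^l = (1/2, 1/2) can be probed by brute force: a grid search over the simplex and over the admissible off-diagonal entry of A indicates that the ratio in (96) peaks at 1 near a = (1/2, 1/2). This is only an illustrative numerical check, not a proof; the angle and grid resolutions are arbitrary choices.

```python
import numpy as np

theta = 0.7  # rotation angle; all entries of M(theta) are nonzero
c, s = np.cos(theta), np.sin(theta)

def denom(a1, a2):
    """Grid approximation of sup over PSD A with diag (a1, a2) of
    prod_j (M A M^T)_{jj}^{1/2} (exponents c_j = 1/2)."""
    bound = np.sqrt(a1 * a2)  # PSD constraint: |off-diagonal| <= sqrt(a1 a2)
    rho = np.linspace(-bound, bound, 4001)
    d1 = c * c * a1 - 2 * c * s * rho + s * s * a2   # (M A M^T)_{11}
    d2 = s * s * a1 + 2 * c * s * rho + c * c * a2   # (M A M^T)_{22}
    prod = np.where((d1 > 0) & (d2 > 0), d1 * d2, 0.0)
    return float(np.sqrt(prod.max()))

ratios = []
for a1 in np.linspace(0.01, 0.99, 99):
    a2 = 1.0 - a1
    ratios.append(np.sqrt(a1 * a2) / denom(a1, a2))

best_D = max(ratios)  # the sup over the probability simplex in (96)
print(best_D)         # approximately 1, attained near a = (1/2, 1/2)
```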

5. Relation to Hypercontractivity and Its Reverses

As alluded to before and illustrated by Figure 2, the forward-reverse Brascamp-Lieb inequality generalizes several other inequalities from functional analysis and information theory; a more complete discussion on these relationships can be found in [7]. In this section, we focus on hypercontractivity and show how its three cases all follow from Theorem 1. Among these, the case in Section 5.3 can be regarded as an instance of the forward-reverse inequality that cannot be reduced to either the forward or the reverse inequality alone. It is also interesting to note that, from the viewpoint of the forward-reverse Brascamp-Lieb inequality, in each of the three special cases, there ought to be three functions involved in the functional formulation; however, the optimal choice of one function can be computed from the other two. Therefore, the conventional functional formulations of the three cases of hypercontractivity involve only two functions, making it non-obvious to find a unifying inequality.

5.1. Hypercontractivity

Fix a joint probability distribution Q_{Y_1Y_2} and nonnegative continuous functions F_1 and F_2 on Y_1 and Y_2, respectively, both bounded away from zero. In Theorem 1, take l ← 1, m ← 2, b_1 ← 1, d ← 0, f_1 ← F_1^{1/c_1}, f_2 ← F_2^{1/c_2}, ν_1 ← Q_{Y_1Y_2}, μ_1 ← Q_{Y_1}, μ_2 ← Q_{Y_2}. Furthermore, put Z_1 = X = (Y_1, Y_2), and let T_1 and T_2 be the canonical maps (Definition 2). The measure spaces and the random transformations are as shown in Figure 3.
The constraint (12) translates to:
g_1(y_1, y_2) ≤ F_1(y_1) F_2(y_2), ∀y_1, y_2     (102)
and the optimal choice of g 1 is when the equality is achieved. We thus obtain the equivalence between:
‖F_1‖_{1/c_1} ‖F_2‖_{1/c_2} ≥ E[F_1(Y_1) F_2(Y_2)], ∀F_1 ∈ L^{1/c_1}(Q_{Y_1}), F_2 ∈ L^{1/c_2}(Q_{Y_2})     (103)
and:
∀P_{Y_1Y_2}: D(P_{Y_1Y_2} ‖ Q_{Y_1Y_2}) ≥ c_1 D(P_{Y_1} ‖ Q_{Y_1}) + c_2 D(P_{Y_2} ‖ Q_{Y_2}).     (104)
By a standard dense-subspace argument, we see that it is inconsequential that F 1 and F 2 in (103) are not assumed to be continuous, nor bounded away from zero. It is also easy to see that the nonnegativity of F 1 and F 2 is inconsequential for (103).
This equivalence can also be obtained from Theorem 1. By Hölder's inequality, (103) is equivalent to saying that the norm of the linear operator sending F_1 ∈ L^{1/c_1}(Q_{Y_1}) to E[F_1(Y_1)|Y_2 = ·] ∈ L^{1/(1−c_2)}(Q_{Y_2}) does not exceed one. The interesting case is 1/(1−c_2) > 1/c_1, hence the name hypercontractivity. The equivalent formulation of hypercontractivity was shown in [44] using a different proof via the method of types/typicality, which requires that |Y_1|, |Y_2| < ∞. In contrast, the proof based on the nonnegativity of relative entropy removes this constraint, allowing one to prove Nelson's Gaussian hypercontractivity from the information-theoretic formulation (see [7]).
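For a concrete sanity check of (103), consider a doubly symmetric binary source with correlation ρ. It is classical (and taken as given here, not established in this section) that the two-function inequality holds when ρ² ≤ (1/c_1 − 1)(1/c_2 − 1). A random search for violations at illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(2)

rho = 0.3  # correlation of the doubly symmetric binary source (DSBS)
Q = np.array([[(1 + rho) / 4, (1 - rho) / 4],
              [(1 - rho) / 4, (1 + rho) / 4]])
qy1, qy2 = Q.sum(axis=1), Q.sum(axis=0)   # both marginals are uniform

c1 = c2 = 2.0 / 3.0
# Classical DSBS condition (assumed, not proven here):
assert rho ** 2 <= (1 / c1 - 1) * (1 / c2 - 1)

def norm(F, marg, p):
    return float(np.sum(marg * F ** p)) ** (1.0 / p)

worst = 0.0
for _ in range(5000):
    F1 = rng.uniform(0.01, 10.0, 2)
    F2 = rng.uniform(0.01, 10.0, 2)
    lhs = float(F1 @ Q @ F2)              # E[F1(Y1) F2(Y2)]
    rhs = norm(F1, qy1, 1 / c1) * norm(F2, qy2, 1 / c2)
    worst = max(worst, lhs / rhs)
print(worst)  # stays <= 1, matching (103)
```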

5.2. Reverse Hypercontractivity (Positive Parameters)

By “positive parameters” we mean the b 1 and b 2 in (107) are positive.
Let Q_{Z_1Z_2} be a given joint probability distribution, and let G_1 and G_2 be nonnegative functions on Z_1 and Z_2, respectively, both bounded away from zero. In Theorem 1, take l ← 2, m ← 1, c_1 ← 1, d ← 0, g_1 ← G_1^{1/b_1}, g_2 ← G_2^{1/b_2}, μ_1 ← Q_{Z_1Z_2}, ν_1 ← Q_{Z_1}, ν_2 ← Q_{Z_2}. Furthermore, put Y_1 = X = (Z_1, Z_2), and let S_1 and S_2 be the canonical maps (Definition 2). The measure spaces and the random transformations are as shown in Figure 4.
Note that the constraint (12) translates to:
f_1(z_1, z_2) ≥ G_1(z_1) G_2(z_2), ∀z_1, z_2,     (105)
and the equality case yields the optimal choice of f 1 for (13). By Theorem 1, we thus obtain the equivalence between:
‖G_1‖_{1/b_1} ‖G_2‖_{1/b_2} ≤ E[G_1(Z_1) G_2(Z_2)], ∀G_1, G_2     (106)
and:
∀P_{Z_1}, P_{Z_2}, ∃P_{Z_1Z_2} with the given marginals: D(P_{Z_1Z_2} ‖ Q_{Z_1Z_2}) ≤ b_1 D(P_{Z_1} ‖ Q_{Z_1}) + b_2 D(P_{Z_2} ‖ Q_{Z_2}).     (107)
Note that in this setup, if Z_1 and Z_2 are finite, then Condition (iv) in Theorem 1 is equivalent to Q_{Z_1} × Q_{Z_2} ≪ Q_{Z_1Z_2}. The equivalent formulations of reverse hypercontractivity were observed in [59], where the proof is based on the method of types.
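An analogous numerical check for (106) on a doubly symmetric binary source: the reverse inequality is classically known to hold when ρ² ≤ (1 − 1/b_1)(1 − 1/b_2), a fact taken as given here rather than derived in this section. Random search for violations of the ≥ direction:

```python
import numpy as np

rng = np.random.default_rng(3)

rho = 0.2
Q = np.array([[(1 + rho) / 4, (1 - rho) / 4],
              [(1 - rho) / 4, (1 + rho) / 4]])
qz1, qz2 = Q.sum(axis=1), Q.sum(axis=0)

b1 = b2 = 2.0  # exponents 1/b1 = 1/b2 = 1/2 < 1
# Reverse condition for the DSBS (assumed, not proven here):
assert rho ** 2 <= (1 - 1 / b1) * (1 - 1 / b2)

def norm(G, marg, p):
    return float(np.sum(marg * G ** p)) ** (1.0 / p)

worst = np.inf
for _ in range(5000):
    G1 = rng.uniform(0.01, 10.0, 2)
    G2 = rng.uniform(0.01, 10.0, 2)
    lhs = float(G1 @ Q @ G2)              # E[G1(Z1) G2(Z2)]
    rhs = norm(G1, qz1, 1 / b1) * norm(G2, qz2, 1 / b2)
    worst = min(worst, lhs / rhs)
print(worst)  # stays >= 1, matching (106)
```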

5.3. Reverse Hypercontractivity (One Negative Parameter)

By “one negative parameter” we mean the b 1 is positive and c 2 is negative in (111).
In Theorem 1, take l ← 1, m ← 2, c_1 ← 1, d ← 0. Let Y_1 = X = (Z_1, Y_2), and let S_1 and T_2 be the canonical maps (Definition 2). Suppose that Q_{Z_1Y_2} is a given joint probability distribution, and set μ_1 ← Q_{Z_1Y_2}, ν_1 ← Q_{Z_1}, μ_2 ← Q_{Y_2} in Theorem 1. Suppose that F and G are arbitrary nonnegative continuous functions on Y_2 and Z_1, respectively, which are bounded away from zero. Take g_1 ← G^{1/b_1} and f_2 ← F^{1/c_2} in Theorem 1. The measure spaces and the random transformations are as shown in Figure 5.
The constraint (12) translates to:
f_1(z_1, y_2) ≥ G(z_1) F(y_2), ∀z_1, y_2.     (108)
Note that (13) translates to:
‖G‖_{1/b_1} ≤ Q_{Y_2Z_1}(f_1) · Q_{Y_2}^{c_2}(F^{1/c_2})     (109)
for all F, G and f_1 satisfying (108). It suffices to verify (109) for the optimal choice f_1 = GF, so (109) is reduced to:
‖F‖_{1/c_2} ‖G‖_{1/b_1} ≤ E[F(Y_2) G(Z_1)], ∀F, G.     (110)
By Theorem 1, (110) is equivalent to:
∀P_{Z_1}, ∃P_{Z_1Y_2} with Z_1-marginal P_{Z_1}: D(P_{Z_1Y_2} ‖ Q_{Z_1Y_2}) ≤ b_1 D(P_{Z_1} ‖ Q_{Z_1}) + (−c_2) D(P_{Y_2} ‖ Q_{Y_2}).     (111)
Inequality (110) is called reverse hypercontractivity with a negative parameter in [45], where the entropic version (111) is established for |Z_1|, |Y_2| < ∞ using the method of types. Multiterminal extensions of (110) and (111) (called the reverse Brascamp-Lieb type inequality with negative parameters in [45]) can also be recovered from Theorem 1 in the same fashion, i.e., we move all negative parameters to the other side of the inequality so that all parameters become positive.
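A degenerate but instructive sanity check of (110): when Z_1 and Y_2 are independent, the inequality reduces to the power-mean inequality, since ‖H‖_p ≤ E[H] for any exponent p ≤ 1, including the negative exponent 1/c_2. The marginals below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)

# Independent Z1 and Y2 (a degenerate case of (110)): with b1 >= 1 and c2 < 0,
# E[F]E[G] >= ||F||_{1/c2} ||G||_{1/b1} since both exponents 1/c2 < 0 and
# 1/b1 <= 1 give "norms" lying below the mean (power-mean inequality).
b1, c2 = 2.0, -1.0
qz = np.array([0.4, 0.6])   # marginal of Z1
qy = np.array([0.3, 0.7])   # marginal of Y2

def norm(H, marg, p):       # (E[H^p])^(1/p), defined also for p < 0
    return float(np.sum(marg * H ** p)) ** (1.0 / p)

worst = np.inf
for _ in range(5000):
    G = rng.uniform(0.01, 10.0, 2)
    F = rng.uniform(0.01, 10.0, 2)
    lhs = float(np.sum(qz * G) * np.sum(qy * F))  # E[F(Y2) G(Z1)]
    rhs = norm(F, qy, 1 / c2) * norm(G, qz, 1 / b1)
    worst = min(worst, lhs / rhs)
print(worst)  # >= 1
```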
In summary, from the viewpoint of Theorem 1, the results in Section 5.1, Section 5.2 and Section 5.3 are degenerate special cases, in the sense that in any of the three cases, the optimal choice of one of the functions in (13) can be explicitly expressed in terms of the other functions; hence, this “hidden function” disappears in (103), (106) or (110).

Author Contributions

All the authors have contributed to the problem formulation, refinement, structuring or editing of the paper. Most of the sections were written by J.L. Parts of the sections on the existence of the minimizer and the Gaussian optimality were written by T.A.C.

Acknowledgments

This work was supported in part by NSF Grants CCF-1528132, CCF-0939370 (Center for Science of Information), CCF-1319299, CCF-1319304, CCF-1350595 and AFOSR FA9550-15-1-0180. Jingbo Liu would like to thank Elliott H. Lieb for teaching the Brascamp-Lieb inequality, as well as some techniques used in this paper in his graduate class.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Recovering Theorem 1 from Theorem 6 as a Special Case

Assume that P_X ↦ (P_{Z_i}) is surjective. Let 1_{Z_i} denote the constant one function on Z_i. Define:
C := { (w_i): w_i ∈ C_b(Z_i), i = 1, …, l, Σ_{i=1}^l inf_{z_i} w_i(z_i) ≥ 0 },     (A1)
which is a closed convex cone in C_b(Z_1) × ⋯ × C_b(Z_l). Given (g_i), we show that Σ_{i=1}^l b_i S_i log g̃_i ≥ Σ_{i=1}^l b_i S_i log g_i implies:
(b_i log g̃_i − b_i log g_i)_{i=1}^l ∈ C.     (A2)
Indeed, we can verify that the dual cone:
C* := { (π_i): Σ_{i=1}^l π_i(w_i) ≥ 0, ∀(w_i) ∈ C } = { λ(P_{Z_1}, …, P_{Z_l}): λ ≥ 0 }.     (A3)
Under the surjectivity assumption, we see:
Σ_{i=1}^l π_i(b_i log g̃_i − b_i log g_i) ≥ 0, ∀(π_i) ∈ C*.     (A4)
Now, if (A2) is not true, by the Hahn-Banach theorem (Theorem 5), we find π i M ( Z i ) , i = 1 , , l such that:
Σ_{i=1}^l π_i(b_i log g̃_i − b_i log g_i) < inf_{(w_i) ∈ C} Σ_{i=1}^l π_i(w_i),     (A5)
so the right side of (A5) is not −∞. Since C is a cone containing the origin, the right side of (A5) hence must be nonnegative, and we conclude that (π_i) ∈ C*. However, then, (A5) contradicts (A4).

Appendix B. Existence of Weakly-Convergent Couplings

This section proves an auxiliary result which will be used in Appendix C.
Lemma A1.
Suppose that for each i = 1, …, l and n ≥ 1, P_{X_i}^{(n)} is a Borel probability measure on ℝ, and P_{X_i}^{(n)} converges weakly to some absolutely continuous (with respect to the Lebesgue measure) P_{X_i} as n → ∞. If P_X is a coupling of (P_{X_i})_{1≤i≤l}, then, upon extraction of a subsequence, there exist couplings P_X^{(n)} of (P_{X_i}^{(n)})_{1≤i≤l} that converge weakly to P_X as n → ∞.
Proof. 
For each integer k ≥ 1, define the random variable W_i[k] := φ_k(X_i), where φ_k: ℝ → ℝ ∪ {e} is the following “dyadic quantization function”:
φ_k: x ↦ 2^{−k}⌊2^k x⌋ if |x| ≤ k and x ∉ 2^{−k}ℤ; e otherwise,     (A6)
and let W[k] := (W_i[k])_{i=1}^l. Denote by 𝒲[k] := 2^{−k}{−k2^k, …, k2^k − 1} ∪ {e} the set from which W_i[k] takes values. Note that since P_{X_i} is assumed to be absolutely continuous, the set of “dyadic points” has measure zero:
P_{X_i}(∪_{k=1}^∞ 2^{−k}ℤ) = 0, i = 1, …, l.     (A7)
Since P_{X_i}^{(n)} → P_{X_i} weakly and the assumption in the preceding paragraph precluded any positive mass on the quantization boundaries under P_{X_i}, for each k ≥ 1, there exists some n := n_k large enough such that:
P_{W_i[k]}^{(n)}(w) ≥ (1 − 1/k) P_{W_i[k]}(w),     (A8)
for each i and w ∈ 𝒲[k]. Now, define a coupling P_{W[k]}^{(n)} compatible with the (P_{W_i[k]}^{(n)})_{i=1}^l induced by (P_{X_i}^{(n)})_{i=1}^l, as follows:
P_{W[k]}^{(n)} := (1 − 1/k) P_{W[k]} + k^{l−1} ∏_{i=1}^l (P_{W_i[k]}^{(n)} − (1 − 1/k) P_{W_i[k]}).     (A9)
Observe that (A9) is a well-defined probability measure because of (A8), and indeed has marginals (P_{W_i[k]}^{(n)})_{i=1}^l. Moreover, by the triangle inequality, we have the following bound on the total variation distance:
‖P_{W[k]}^{(n)} − P_{W[k]}‖_TV ≤ 2/k.     (A10)
Next, construct P_X^{(n)} (we use P|_A to denote the restriction of a probability measure P to a measurable set A, that is, P|_A(B) := P(A ∩ B) for any measurable B):
P_X^{(n)} := Σ_{w^l ∈ 𝒲[k] × ⋯ × 𝒲[k]} [ P_{W[k]}^{(n)}(w^l) / ∏_{i=1}^l P_{W_i[k]}^{(n)}(w_i) ] ∏_{i=1}^l P_{X_i}^{(n)}|_{φ_k^{−1}(w_i)}.     (A11)
Observe that P X ( n ) defined in (A11) is compatible with the P W [ k ] ( n ) defined in (A9) and indeed has marginals ( P X i ( n ) ) i = 1 l . Since n : = n k can be made increasing in k, we have constructed the desired sequence ( P X ( n k ) ) k = 1 converging weakly to P X . Indeed, for any bounded open dyadic cube (that is, a cube whose corners have coordinates being multiples of 2 k where k is some integer) A , using (A10) and the assumption (A7), we conclude:
lim inf_{k→∞} P_X^{(n_k)}(A) ≥ P_X(A).     (A12)
Moreover, since bounded open dyadic cubes form a countable basis of the topology of ℝ^l, (A12) actually holds for any open set A: write A as a countable union of dyadic cubes, use the continuity of measure to pass to a finite disjoint union, and then apply (A12), as desired. ☐
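The mixture construction (A9) is concrete enough to test directly. For l = 2 and a finite alphabet, the following sketch builds perturbed marginals satisfying (A8), forms the coupling (A9) with coefficient k^{l−1} = k, and confirms that it is a probability measure with the prescribed marginals, within total variation 2/k of the target coupling as in (A10). The alphabet size and randomness are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(5)
k, m = 10, 4  # mixture parameter k and alphabet size m; here l = 2

# Target coupling P_W with marginals P1, P2.
P = rng.random((m, m))
P /= P.sum()
P1, P2 = P.sum(axis=1), P.sum(axis=0)

# Perturbed marginals satisfying (A8): Q_i(w) >= (1 - 1/k) P_i(w) entrywise.
Q1 = (1 - 1 / k) * P1 + (1 / k) * rng.dirichlet(np.ones(m))
Q2 = (1 - 1 / k) * P2 + (1 / k) * rng.dirichlet(np.ones(m))

# The coupling (A9): for l = 2 the coefficient k^{l-1} equals k.
R = (1 - 1 / k) * P + k * np.outer(Q1 - (1 - 1 / k) * P1,
                                   Q2 - (1 - 1 / k) * P2)

tv = np.abs(R - P).sum()  # total variation (unnormalized sum), cf. (A10)
print(R.min(), tv)        # nonnegative entries; tv <= 2/k
```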

Appendix C. Upper Semicontinuity of the Infimum

Using Lemma A1 in Appendix B, we prove the following result, which will be used in Appendix E.
Corollary A1.
Consider non-degenerate (P_{Y_j|X}). For each n ≥ 1 and i = 1, …, l, let P_{X_i}^{(n)} be a Borel probability measure on ℝ whose second moment is bounded by σ_i² < ∞. Assume that P_{X_i}^{(n)} converges weakly to some absolutely continuous P_{X_i} for each i. Then:
lim sup_{n→∞} inf_{P_X: S_iP_X = P_{X_i}^{(n)}} Σ_{j=1}^m c_j D(T_jP_X ‖ μ_j) ≤ inf_{P_X: S_iP_X = P_{X_i}} Σ_{j=1}^m c_j D(T_jP_X ‖ μ_j).     (A13)
Proof. 
By passing to a convergent subsequence, we may assume that the limit on the left side of (A13) exists. For any coupling P X of ( P X i ) , by invoking Lemma A1 and passing to a subsequence, we find a sequence of couplings P X ( n ) of ( P X i ( n ) ) that converges weakly to P X . It is known that under a moment constraint, the differential entropy of the output distribution of a non-degenerate Gaussian channel enjoys weak continuity in the input distribution (see, e.g., [48] Proposition 18, [60] Theorem 7, or [61] Theorem 1 and Theorem 2). Thus:
lim_{n→∞} Σ_{j=1}^m c_j D(T_jP_X^{(n)} ‖ μ_j) = Σ_{j=1}^m c_j D(T_jP_X ‖ μ_j)     (A14)
and (A13) follows since P X was arbitrarily chosen. ☐

Appendix D. Weak Semicontinuity of Differential Entropy under a Moment Constraint

This section proves the following result, which will be used in Appendix E.
Lemma A2.
Suppose ( P X n ) is a sequence of distributions on R d converging weakly to P X , and:
E[X_n X_n^⊤] ⪯ Σ     (A15)
for all n. Then
lim sup_{n→∞} h(X_n) ≤ h(X).     (A16)
Remark A1.
The result fails without the condition (A15). Furthermore, related results when the weak convergence is replaced with pointwise convergence of density functions and certain additional constraints were shown in [61] (Theorem 1 and Theorem 2) (see also the proof of [48] (Theorem 5)). Those results are not applicable here since the density functions of X n do not converge pointwise. They are applicable for the problems discussed in [48] because the density functions of the output of the Gaussian random transformation enjoy many nice properties due to the smoothing effect of the “good kernel”.
Proof. 
It is well known that in metric spaces and for probability measures, the relative entropy is weakly lower semicontinuous (cf. [58]). This fact and a scaling argument immediately show that, for any r > 0 ,
lim sup_{n→∞} h(X_n | ‖X_n‖ ≤ r) ≤ h(X | ‖X‖ ≤ r).     (A17)
Let p_n(r) := P[‖X_n‖ > r]; then, (A15) implies:
E[X_n X_n^⊤ | ‖X_n‖ > r] ⪯ (1/p_n(r)) Σ.     (A18)
Therefore, since the Gaussian distribution maximizes differential entropy given a second moment upper bound, we have:
h(X_n | ‖X_n‖ > r) ≤ (1/2) log((2πe)^d |p_n(r)^{−1} Σ|).     (A19)
Since lim_{r→∞} sup_n p_n(r) = 0 by (A15) and Chebyshev's inequality, (A19) implies that:
lim_{r→∞} sup_n p_n(r) h(X_n | ‖X_n‖ > r) = 0.     (A20)
The desired result follows from (A17), (A20) and the fact that:
h(X_n) = p_n(r) h(X_n | ‖X_n‖ > r) + (1 − p_n(r)) h(X_n | ‖X_n‖ ≤ r) + h(p_n(r)),     (A21)
where h(·) in the last term denotes the binary entropy function.
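The decomposition (A21) is an exact identity, which the following sketch confirms by quadrature for a standard Gaussian and r = 1; the grid and truncation range are arbitrary numerical choices, and h(p) denotes the binary entropy:

```python
import numpy as np

# X ~ N(0,1), r = 1: check (A21),
# h(X) = p h(X | |X|>r) + (1-p) h(X | |X|<=r) + h(p).
x = np.linspace(-12.0, 12.0, 1_200_001)
dx = x[1] - x[0]
f = np.exp(-x ** 2 / 2.0) / np.sqrt(2.0 * np.pi)
r = 1.0

def cond_h(mask):
    """Differential entropy of X conditioned on the event, plus its mass."""
    mass = float(np.sum(f[mask]) * dx)
    g = f[mask] / mass
    return -float(np.sum(g * np.log(g)) * dx), mass

h_all = -float(np.sum(f * np.log(f)) * dx)   # should be 0.5*log(2*pi*e)
h_out, p = cond_h(np.abs(x) > r)
h_in, q = cond_h(np.abs(x) <= r)
hb = -p * np.log(p) - q * np.log(q)          # binary entropy h(p), p + q = 1
lhs, rhs = h_all, p * h_out + q * h_in + hb
print(lhs, rhs)                              # the two sides agree
```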

Appendix E. Proof of Proposition 2

  • For any ϵ > 0 , by the continuity of measure, there exists K > 0 such that:
    P_{X_i}([−K, K]) ≥ 1 − ϵ/l, i = 1, …, l.     (A22)
    By the union bound,
    P_X([−K, K]^l) ≥ 1 − ϵ     (A23)
    whenever P_X is a coupling of (P_{X_i}). Now, let P_X^{(n)}, n = 1, 2, …, be such that:
    lim_{n→∞} Σ_{j=1}^m c_j D(P_{Y_j}^{(n)} ‖ μ_j) = inf_{P_X} Σ_{j=1}^m c_j D(P_{Y_j} ‖ μ_j)     (A24)
    where P Y j : = T j P X , j = 1 , , m . The sequence ( P X ( n ) ) is tight by (A23). Thus, invoking the Prokhorov theorem and by passing to a subsequence, we may assume that ( P X ( n ) ) converges weakly to some P X . Therefore, P Y j ( n ) converges to P Y j weakly, and by the semicontinuity property in Lemma A2, we have:
    Σ_{j=1}^m c_j D(P_{Y_j} ‖ μ_j) ≤ lim_{n→∞} Σ_{j=1}^m c_j D(P_{Y_j}^{(n)} ‖ μ_j)     (A25)
    establishing that P X is an infimizer.
  • Suppose (P_{X_i}^{(n)})_{1≤i≤l, n≥1} is such that E[X_i²] ≤ σ_i² for X_i ∼ P_{X_i}^{(n)}, where (σ_i) is as in Proposition 1 and:
    lim_{n→∞} F_0((P_{X_i}^{(n)})_{i=1}^l) = sup_{(P_{X_i}): E[X_i²] ≤ σ_i²} F_0((P_{X_i})_{i=1}^l).     (A26)
    The regularization on the covariance implies that for each i, ( P X i ( n ) ) n 1 is a tight sequence. Thus, upon the extraction of subsequences, we may assume that for each i, ( P X i ( n ) ) n 1 converges to some P X i . We have the moment bound:
    E[X_i²] = lim_{K→∞} E[min{X_i², K}]     (A27)
    = lim_{K→∞} lim_{n→∞} E[min{(X_i^{(n)})², K}]     (A28)
    ≤ σ_i²     (A29)
    where X_i ∼ P_{X_i} and X_i^{(n)} ∼ P_{X_i}^{(n)}. Then, by Lemma A2,
    Σ_i b_i D(P_{X_i} ‖ ν_i) ≤ lim_{n→∞} Σ_i b_i D(P_{X_i}^{(n)} ‖ ν_i).     (A30)
    Under the covariance regularization and the non-degeneracy assumption, we showed in Proposition 1 that the value of (76) cannot be +∞ or −∞. This implies that we can assume (by passing to a subsequence) that P_{X_i}^{(n)} ≪ λ, i = 1, …, l, since otherwise F_0((P_{X_i}^{(n)})) = −∞. Moreover, since (Σ_j c_j D(P_{Y_j}^{(n)} ‖ μ_j))_{n≥1} is bounded above under the non-degeneracy assumption, the sequence (Σ_i b_i D(P_{X_i}^{(n)} ‖ ν_i))_{n≥1} must also be bounded from above, which implies, using (A30), that:
    Σ_i b_i D(P_{X_i} ‖ ν_i) < ∞.     (A31)
    In particular, we have P_{X_i} ≪ λ for each i. Now, Corollary A1 shows that:
    inf P X : S i P X = P X i j c j D ( T j P X μ j ) lim n inf P X : S i P X = P X i ( n ) j c j D ( T j P X μ j )
    Thus, (A30) and (A32) show that ( P X i ) is in fact a maximizer.

Appendix F. Gaussian Optimality in Degenerate Cases: A Limiting Argument

This section proves Theorem 2. We first give a proof for the choice of parameters in Example 1, merely for the sake of notational simplicity, and then discuss how to extend the argument.

Appendix F.1. Proof of the Claim in Example 1

The proof will be based on Theorem 8, which assumes non-degenerate forward channels and Gaussian measures on the outputs of the reverse channels. To that end, we will adopt an approximation argument. For each $j = 1,\dots,l$, define the linear operator $T_j^\epsilon$ by:
\[ (T_j^\epsilon\phi)(x_1,\dots,x_l) := E\Big[\phi\Big(\sum_{i=1}^l m_{ji}x_i + N_\epsilon\Big)\Big] \]
for any measurable function $\phi$ on $\mathbb{R}$, where $N_\epsilon \sim \mathcal{N}(0,\epsilon)$. Let $\gamma_1^\epsilon := \mathcal{N}(0,\epsilon^{-1})$, and note that the density of $(2\pi/\epsilon)^{1/2}\gamma_1^\epsilon$ converges pointwise to that of the Lebesgue measure.
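As a quick numerical sanity check (an illustration added here, not part of the original argument), the density of $(2\pi/\epsilon)^{1/2}\gamma_1^\epsilon$ at $x$ simplifies algebraically to $e^{-\epsilon x^2/2}$, which increases monotonically to 1 (the Lebesgue density) as $\epsilon \downarrow 0$:

```python
import math

def scaled_density(x, eps):
    # (2*pi/eps)^{1/2} times the density of N(0, 1/eps) at x;
    # algebraically this equals exp(-eps * x**2 / 2)
    return (math.sqrt(2 * math.pi / eps)
            * math.sqrt(eps / (2 * math.pi)) * math.exp(-eps * x * x / 2))

for x in (0.0, 1.0, 5.0):
    vals = [scaled_density(x, eps) for eps in (1.0, 0.1, 1e-3, 1e-9)]
    # monotone increase as eps shrinks, approaching the Lebesgue density 1
    assert all(vals[k] <= vals[k + 1] + 1e-12 for k in range(len(vals) - 1))
    assert abs(vals[-1] - 1.0) < 1e-6
```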
Lemma A3.
For any $\epsilon > 0$, let $(T_j^\epsilon)$ be defined as in (A33). Then, for any Borel $P_{X_i} \ll \lambda$, $i = 1,\dots,l$,
\[ \sum_{i=1}^l D(P_{X_i}\|\gamma_1^\epsilon) - \frac{l}{2}\log\frac{2\pi}{\epsilon} \ge \inf_{P_{X^l}\colon S_i P_{X^l} = P_{X_i}} \Big(-\sum_{j=1}^l h(T_j^\epsilon P_{X^l})\Big). \]
Proof. 
By Theorem 8, it suffices to prove (A34) when each $P_{X_i}$ is Gaussian, and from (A34), it is easy to see that it further suffices to treat centered Gaussians. Let $P_{X_i} = \mathcal{N}(0,a_i)$, $i = 1,\dots,l$. We can upper bound the right side of (A34) by taking $P_{X^l} = P_{X_1}\times\dots\times P_{X_l}$ instead of the infimum, so it suffices to prove that:
\[ \frac{\epsilon}{2}\sum_{i=1}^l a_i - \frac{1}{2}\sum_{i=1}^l \log a_i \ge -\frac{1}{2}\sum_{j=1}^l \log\Big(\sum_{i=1}^l m_{ji}^2 a_i + \epsilon\Big) \]
for any $\epsilon, a_1,\dots,a_l \in (0,\infty)$. This is implied by the $\epsilon = 0$ case, which we proved in (94), since the left side is nondecreasing and the right side nonincreasing in $\epsilon$. ☐
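The scalar inequality above can be probed numerically. The snippet below is a sanity check only, not part of the proof: the matrix $(m_{ji})$ of Example 1 is not reproduced in this appendix, so a $2\times 2$ rotation matrix serves as a hypothetical stand-in, for which the $\epsilon = 0$ case holds by AM–GM (its squared entries sum to 1 along every row and column).

```python
import math
import random

def lhs(a, eps):
    # (eps/2) * sum(a_i) - (1/2) * sum(log a_i)
    return 0.5 * eps * sum(a) - 0.5 * sum(math.log(ai) for ai in a)

def rhs(M, a, eps):
    # -(1/2) * sum_j log( sum_i m_ji^2 * a_i + eps )
    return -0.5 * sum(
        math.log(sum(M[j][i] ** 2 * a[i] for i in range(len(a))) + eps)
        for j in range(len(M)))

# hypothetical stand-in for the matrix of Example 1: a 2x2 rotation
c, s = math.cos(0.7), math.sin(0.7)
M = [[c, s], [-s, c]]

random.seed(0)
for _ in range(1000):
    a = [random.uniform(0.1, 10.0) for _ in range(2)]
    for eps in (0.0, 0.01, 1.0):
        assert lhs(a, eps) >= rhs(M, a, eps) - 1e-12
```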
By the duality of the forward-reverse Brascamp-Lieb inequality (Theorem 7), we conclude from Lemma A3 that:
Lemma A4.
For any $\epsilon > 0$ and nonnegative continuous $(f_j)$, $(g_i)$ satisfying:
\[ \sum_{i=1}^l \log g_i(x_i) \le \sum_{j=1}^l (T_j^\epsilon \log f_j)(x^l), \quad \forall x^l \in \mathbb{R}^l, \]
we have:
\[ \Big(\frac{2\pi}{\epsilon}\Big)^{l/2} \prod_{i=1}^l \int g_i \,\mathrm{d}\gamma_1^\epsilon \le \prod_{j=1}^l \int f_j(x)\,\mathrm{d}x. \]
Now, suppose that the claim in Example 1 is not true; then there are nonnegative continuous $(f_j)$ and $(g_i)$ satisfying (17) while:
\[ \prod_{i=1}^l \int g_i(x)\,\mathrm{d}x > \prod_{j=1}^l \int f_j(x)\,\mathrm{d}x. \]
By the standard approximation argument, we can assume, without loss of generality, that:
\[ g_i(x) = 0, \quad \forall x\colon |x|\ge R,\ 1\le i\le l; \]
\[ f_j(x) \ge \delta e^{-x^2}, \quad 1\le j\le l, \]
for some $R$ sufficiently large and $\delta > 0$ sufficiently small. Note that for any $x^l \in [-R,R]^l$,
\[ \sum_{i=1}^l m_{ji}x_i \in [-lR, lR]. \]
Since $\log f_j$ is uniformly continuous on $[-2lR, 2lR]$ for each $j$ and since we assumed (A40), we have:
\[ \liminf_{\epsilon\downarrow 0}\, \inf_{x^l\in[-R,R]^l} \Big\{ \sum_{j=1}^l (T_j^\epsilon\log f_j)(x^l) - \sum_{j=1}^l (T_j^0\log f_j)(x^l) \Big\} \ge 0. \]
However, since we assumed (17) and (A39), we must also have:
\[ \liminf_{\epsilon\downarrow 0} \eta_\epsilon \ge 0, \]
where:
\[ \eta_\epsilon := \inf_{x^l\in\mathbb{R}^l} \Big\{ \sum_{j=1}^l (T_j^\epsilon\log f_j)(x^l) - \sum_{i=1}^l \log g_i(x_i) \Big\}. \]
Put:
\[ \tilde{g}_1^\epsilon := \exp(\eta_\epsilon)\, g_1, \]
\[ \tilde{g}_i^\epsilon := g_i, \quad i = 2,\dots,l. \]
Then, $(\tilde{g}_i^\epsilon)$ and $(f_j)$ satisfy the constraint (A36) for any $\epsilon > 0$. By applying the monotone convergence theorem and then Lemma A4,
\[ \prod_{i=1}^l \int g_i(x_i)\,\mathrm{d}x_i \le \liminf_{\epsilon\downarrow 0} \Big(\frac{2\pi}{\epsilon}\Big)^{l/2} \prod_{i=1}^l \int \tilde{g}_i^\epsilon\,\mathrm{d}\gamma_1^\epsilon \le \prod_{j=1}^l \int f_j(x)\,\mathrm{d}x, \]
which violates the hypothesis (A38), as desired.

Appendix F.2. Proof of Theorem 2

The limiting argument can be extended to the vector case to prove Theorem 2. Specifically, for each $j = 1,\dots,m$, define $T_j^\epsilon$ the same way as (A33) except that $N_\epsilon \sim \mathcal{N}(0,\epsilon I)$, where $I$ is the identity matrix whose dimension is clear from the context (equal to $\dim(E_j)$ here), and let $P_{Y_j|X_1\cdots X_l}^\epsilon$ be the dual operator. For each $i = 1,\dots,l$, let $\nu_i^\epsilon := (2\pi/\epsilon)^{\frac{1}{2}\dim(E_i)} \cdot \mathcal{N}(0,\epsilon^{-1}I)$, whose density converges pointwise to that of $\nu_i^0$, defined as the Lebesgue measure on $E_i$. Define:
\[ d_\epsilon := \sup \Big\{ \sum_{i=1}^l b_i \log \nu_i^\epsilon(g_i) - \sum_{j=1}^m c_j \log \mu_j(f_j) \Big\} \]
where the supremum is over nonnegative continuous functions $f_1,\dots,f_m$ and $g_1,\dots,g_l$ such that the summands in (A49) are finite and:
\[ \sum_{i=1}^l b_i \log g_i(x_i) \le \sum_{j=1}^m c_j (T_j^\epsilon \log f_j)(x_1,\dots,x_l), \quad \forall x_1,\dots,x_l. \]
The same limiting argument (A38)–(A48), extended to the vector case, shows that:
\[ d_0 \le \liminf_{\epsilon\downarrow 0} d_\epsilon. \]
Next, define $F_0^\epsilon(\cdot)$ for $(\mu_j)$, $(\nu_i^\epsilon)$ and $P_{Y_j|X_1\cdots X_l}^\epsilon$, similarly to (75). The entropic⇒functional argument shows that:
\[ d_\epsilon \le \sup_{P_{X_1},\dots,P_{X_l}} F_0^\epsilon(P_{X_1},\dots,P_{X_l}). \]
However, Theorem 8, which is based on the rotational invariance of the Gaussian measure, can be extended to the vector case, so for any $\epsilon > 0$,
\[ \sup_{P_{X_1},\dots,P_{X_l}} F_0^\epsilon(P_{X_1},\dots,P_{X_l}) = \sup_{P_{X_1},\dots,P_{X_l}\ \mathrm{c.G.}} F_0^\epsilon(P_{X_1},\dots,P_{X_l}), \]
where c.G. means that the supremum on the right side is over centered Gaussian measures. The fact that centered distributions exhaust the supremum follows easily from the definition of $F_0$. Moreover, from the definitions, it is easy to see that $F_0^\epsilon$ is monotonically decreasing in $\epsilon$, and in particular:
\[ \sup_{P_{X_1},\dots,P_{X_l}\ \mathrm{c.G.}} F_0^\epsilon(P_{X_1},\dots,P_{X_l}) \le \sup_{P_{X_1},\dots,P_{X_l}\ \mathrm{c.G.}} F_0^0(P_{X_1},\dots,P_{X_l}). \]
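For orientation (a summary added here; the inequality directions are as reconstructed in this appendix), the preceding displays combine into a single chain:
\[ d_0 \le \liminf_{\epsilon\downarrow 0} d_\epsilon \le \liminf_{\epsilon\downarrow 0}\, \sup_{P_{X_1},\dots,P_{X_l}} F_0^\epsilon(P_{X_1},\dots,P_{X_l}) = \liminf_{\epsilon\downarrow 0}\, \sup_{\mathrm{c.G.}} F_0^\epsilon \le \sup_{\mathrm{c.G.}} F_0^0. \]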
To finish the proof with the above chain of inequalities, it only remains to show that the right side of (A54) equals the supremum in (A49) with $(f_j)$, $(g_i)$ taken over centered Gaussian functions. This follows by similar steps as the proof of the functional⇒entropic part of Theorem 1. We briefly mention how the idea works: let $A$ be the linear space defined as the Cartesian product of $\mathbb{R}$ and the set of $n\times n$ symmetric matrices. Let $\Lambda(\cdot)$ be the convex functional on $A$ defined by:
\[ \Lambda(r,M) := \ln \int \exp(r + x^\top M x)\,\mathrm{d}x = \begin{cases} r + \frac{n}{2}\ln\pi - \frac{1}{2}\ln\det(-M), & M \prec 0, \\ +\infty, & \text{otherwise}. \end{cases} \]
The dual space of $A$ is itself, and the convex conjugate $\Lambda^*$ is given by:
\[ \Lambda^*(s,H) = \sup_{r,\, M\prec 0} \{ sr + \mathrm{Tr}(HM) - \Lambda(r,M) \}. \]
Then, $\Lambda^*(s,H) = +\infty$ if $s \ne 1$, and:
\[ \Lambda^*(1,H) = \sup_{M\prec 0} \Big\{ \mathrm{Tr}(HM) - \frac{n}{2}\ln\pi + \frac{1}{2}\ln\det(-M) \Big\}. \]
The supremum in (A58) equals $+\infty$ if $H$ is not positive-semidefinite. However, if $H$ is positive-definite, the supremum equals $-\frac{1}{2}\ln\big((2\pi e)^n |H|\big)$, which is equal to the relative entropy between $\mathcal{N}(0,H)$ and the Lebesgue measure (supremum achieved at $M = -(2H)^{-1}$). Since the proof of Theorem 1, in essence, only uses the duality between convex functionals, the same algebraic steps therein also establish the desired matrix optimization identity.
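The closed-form value of $\Lambda^*(1,H)$ can be verified numerically. The sketch below (an illustration, not part of the proof) checks the value at the claimed maximizer $M = -(2H)^{-1}$ and, since the objective is concave on the negative-definite cone (linear trace term plus a log-determinant), that random negative-definite $M$ never exceed it:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
A = rng.normal(size=(n, n))
H = A @ A.T + 0.1 * np.eye(n)              # a random positive-definite H

def objective(M, H):
    # Tr(HM) - (n/2) ln(pi) + (1/2) ln det(-M), for negative-definite M
    n = H.shape[0]
    return (np.trace(H @ M) - 0.5 * n * np.log(np.pi)
            + 0.5 * np.log(np.linalg.det(-M)))

M_star = -0.5 * np.linalg.inv(H)           # claimed maximizer M = -(2H)^{-1}
closed_form = -0.5 * np.log((2 * np.pi * np.e) ** n * np.linalg.det(H))
assert abs(objective(M_star, H) - closed_form) < 1e-9

# concavity: no random negative-definite M should beat the stationary point
for _ in range(200):
    B = rng.normal(size=(n, n))
    M = -(B @ B.T + 1e-3 * np.eye(n))
    assert objective(M, H) <= closed_form + 1e-9
```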

References

  1. Brascamp, H.J.; Lieb, E.H. Best constants in Young’s inequality, its converse, and its generalization to more than three functions. Adv. Math. 1976, 20, 151–173. [Google Scholar] [CrossRef]
  2. Brascamp, H.J.; Lieb, E.H. On extensions of the Brunn-Minkowski and Prékopa-Leindler theorems, including inequalities for log concave functions, and with an application to the diffusion equation. J. Funct. Anal. 1976, 22, 366–389. [Google Scholar] [CrossRef]
  3. Bobkov, S.G.; Ledoux, M. From Brunn-Minkowski to Brascamp-Lieb and to logarithmic Sobolev inequalities. Geom. Funct. Anal. 2000, 10, 1028–1052. [Google Scholar] [CrossRef]
  4. Cordero-Erausquin, D. Transport inequalities for log-concave measures, quantitative forms and applications. arXiv, 2015; arXiv:1504.06147. [Google Scholar]
  5. Barthe, F. On a reverse form of the Brascamp-Lieb inequality. Invent. Math. 1998, 134, 335–361. [Google Scholar] [CrossRef]
  6. Bennett, J.; Carbery, A.; Christ, M.; Tao, T. The Brascamp-Lieb inequalities: finiteness, structure and extremals. Geom. Funct. Anal. 2008, 17, 1343–1415. [Google Scholar] [CrossRef]
  7. Liu, J.; Courtade, T.A.; Cuff, P.; Verdú, S. Information theoretic perspectives on Brascamp-Lieb inequality and its reverse. arXiv, 2017; arXiv:1702.06260. [Google Scholar]
  8. Gardner, R. The Brunn-Minkowski inequality. Bull. Am. Math. Soc. 2002, 39, 355–405. [Google Scholar] [CrossRef]
  9. Gross, L. Logarithmic Sobolev inequalities. Am. J. Math. 1975, 97, 1061–1083. [Google Scholar] [CrossRef]
  10. Erkip, E.; Cover, T.M. The efficiency of investment information. IEEE Trans. Inf. Theory 1998, 44, 1026–1040. [Google Scholar] [CrossRef]
  11. Courtade, T. Outer bounds for multiterminal source coding via a strong data processing inequality. In Proceedings of the IEEE International Symposium on Information Theory, Istanbul, Turkey, 7–12 July 2013; pp. 559–563. [Google Scholar]
  12. Polyanskiy, Y.; Wu, Y. Dissipation of information in channels with input constraints. IEEE Trans. Inf. Theory 2016, 62, 35–55. [Google Scholar] [CrossRef]
  13. Polyanskiy, Y.; Wu, Y. A Note on the Strong Data-Processing Inequalities in Bayesian Networks. Available online: http://arxiv.org/pdf/1508.06025v1.pdf (accessed on 25 August 2015).
  14. Liu, J.; Cuff, P.; Verdú, S. Key capacity for product sources with application to stationary Gaussian processes. IEEE Trans. Inf. Theory 2016, 62, 984–1005. [Google Scholar]
  15. Liu, J.; Cuff, P.; Verdú, S. Secret key generation with one communicator and a one-shot converse via hypercontractivity. In Proceedings of the IEEE International Symposium on Information Theory, Hong Kong, China, 14–19 June 2015; pp. 710–714. [Google Scholar]
  16. Xu, A.; Raginsky, M. Converses for distributed estimation via strong data processing inequalities. In Proceedings of the IEEE International Symposium on Information Theory, Hong Kong, China, 14–19 June 2015; pp. 2376–2380. [Google Scholar]
  17. Kamath, S.; Anantharam, V. On non-interactive simulation of joint distributions. arXiv, 2015; arXiv:1505.00769. [Google Scholar]
  18. Kahn, J.; Kalai, G.; Linial, N. The influence of variables on Boolean functions. In Proceedings of the 29th Annual Symposium on Foundations of Computer Science, White Plains, NY, USA, 24–26 October 1988; pp. 68–80. [Google Scholar]
  19. Ganor, A.; Kol, G.; Raz, R. Exponential separation of information and communication. In Proceedings of the 2014 IEEE 55th Annual Symposium on Foundations of Computer Science (FOCS), Philadelphia, PA, USA, 18–21 October 2014; pp. 176–185. [Google Scholar]
  20. Dvir, Z.; Hu, G. Sylvester-Gallai for arrangements of subspaces. arXiv, 2014; arXiv:1412.0795. [Google Scholar]
  21. Braverman, M.; Garg, A.; Ma, T.; Nguyen, H.L.; Woodruff, D.P. Communication lower bounds for statistical estimation problems via a distributed data processing inequality. arXiv, 2015; arXiv:1506.07216. [Google Scholar]
  22. Garg, A.; Gurvits, L.; Oliveira, R.; Wigderson, A. Algorithmic aspects of Brascamp-Lieb inequalities. arXiv, 2016; arXiv:1607.06711. [Google Scholar]
  23. Talagrand, M. On Russo’s approximate zero-one law. Ann. Probab. 1994, 22, 1576–1587. [Google Scholar] [CrossRef]
  24. Friedgut, E.; Kalai, G.; Naor, A. Boolean functions whose Fourier transform is concentrated on the first two levels. Adv. Appl. Math. 2002, 29, 427–437. [Google Scholar] [CrossRef]
  25. Bourgain, J. On the distribution of the Fourier spectrum of Boolean functions. Isr. J. Math. 2002, 131, 269–276. [Google Scholar] [CrossRef]
  26. Mossel, E.; O’Donnell, R.; Oleszkiewicz, K. Noise stability of functions with low influences: Invariance and optimality. Ann. Math. 2010, 171, 295–341. [Google Scholar] [CrossRef]
  27. Garban, C.; Pete, G.; Schramm, O. The Fourier spectrum of critical percolation. Acta Math. 2010, 205, 19–104. [Google Scholar] [CrossRef]
  28. Duchi, J.C.; Jordan, M.; Wainwright, M.J. Local privacy and statistical minimax rates. In Proceedings of the IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), Berkeley, CA, USA, 26–29 October 2013; pp. 429–438. [Google Scholar]
  29. Lieb, E.H. Gaussian kernels have only Gaussian maximizers. Invent. Math. 1990, 102, 179–208. [Google Scholar] [CrossRef]
  30. Barthe, F. Optimal Young’s inequality and its converse: A simple proof. Geom. Funct. Anal. 1998, 8, 234–242. [Google Scholar] [CrossRef]
  31. Barthe, F.; Cordero-Erausquin, D. Inverse Brascamp-Lieb inequalities along the Heat equation. In Geometric Aspects of Functional Analysis; Lecture Notes in Mathematics; Springer: Berlin/Heidelberg, Germany, 2004; Volume 1850, pp. 65–71. [Google Scholar]
  32. Carlen, E.A.; Cordero-Erausquin, D. Subadditivity of the entropy and its relation to Brascamp-Lieb type inequalities. Geom. Funct. Anal. 2009, 19, 373–405. [Google Scholar] [CrossRef]
  33. Barthe, F.; Cordero-Erausquin, D.; Ledoux, M.; Maurey, B. Correlation and Brascamp-Lieb inequalities for Markov semigroups. Int. Math. Res. Notices 2011, 2011, 2177–2216. [Google Scholar] [CrossRef]
  34. Lehec, J. Short probabilistic proof of the Brascamp-Lieb and Barthe theorems. Can. Math. Bull. 2014, 57, 585–587. [Google Scholar] [CrossRef]
  35. Ball, K. Volumes of sections of cubes and related problems. In Geometric Aspects of Functional Analysis; Springer: Berlin/Heidelberg, Germany, 1989; pp. 251–260. [Google Scholar]
  36. Ahlswede, R.; Gács, P. Spreading of sets in product spaces and hypercontraction of the Markov operator. Ann. Probab. 1976, 4, 925–939. [Google Scholar] [CrossRef]
  37. Csiszár, I.; Körner, J. Information Theory: Coding Theorems for Discrete Memoryless Systems, 2nd ed.; Cambridge University Press: Cambridge, UK, 2011. [Google Scholar]
  38. Liu, J.; van Handel, R.; Verdú, S. Beyond the Blowing-Up Lemma: Sharp Converses via Reverse Hypercontractivity. In Proceedings of the IEEE International Symposium on Information Theory, Aachen, Germany, 25–30 June 2017; pp. 943–947. [Google Scholar]
  39. Ahlswede, R.; Gács, P.; Körner, J. Bounds on conditional probabilities with applications in multi-user communication. Probab. Theory Relat. Fields 1976, 34, 157–177. [Google Scholar] [CrossRef]
  40. Villani, C. Topics in Optimal Transportation; American Mathematical Soc.: Providence, RI, USA, 2003; Volume 58. [Google Scholar]
  41. Atar, R.; Merhav, N. Information-theoretic applications of the logarithmic probability comparison bound. IEEE Trans. Inf. Theory 2015, 61, 5366–5386. [Google Scholar] [CrossRef]
  42. Radhakrishnan, J. Entropy and counting. In Kharagpur Golden Jubilee Volume; Narosa: New Delhi, India, 2001. [Google Scholar]
  43. Madiman, M.M.; Tetali, P. Information inequalities for joint distributions, with interpretations and applications. IEEE Trans. Inf. Theory 2010, 56, 2699–2713. [Google Scholar] [CrossRef]
  44. Nair, C. Equivalent Formulations of Hypercontractivity Using Information Measures; International Zurich Seminar: Zurich, Switzerland, 2014. [Google Scholar]
  45. Beigi, S.; Nair, C. Equivalent characterization of reverse Brascamp-Lieb type inequalities using information measures. In Proceedings of the IEEE International Symposium on Information Theory, Barcelona, Spain, 10–15 July 2016. [Google Scholar]
  46. Bobkov, S.G.; Götze, F. Exponential integrability and transportation cost related to Logarithmic Sobolev inequalities. J. Funct. Anal. 1999, 163, 1–28. [Google Scholar] [CrossRef]
  47. Carlen, E.A.; Lieb, E.H.; Loss, M. A sharp analog of Young’s inequality on SN and related entropy inequalities. J. Geom. Anal. 2004, 14, 487–520. [Google Scholar] [CrossRef]
  48. Geng, Y.; Nair, C. The capacity region of the two-receiver Gaussian vector broadcast channel with private and common messages. IEEE Trans. Inf. Theory 2014, 60, 2087–2104. [Google Scholar] [CrossRef]
  49. Liu, J.; Courtade, T.A.; Cuff, P.; Verdú, S. Brascamp-Lieb inequality and its reverse: An information theoretic view. In Proceedings of the IEEE International Symposium on Information Theory, Barcelona, Spain, 10–15 July 2016; pp. 1048–1052. [Google Scholar]
  50. Lax, P.D. Functional Analysis; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2002. [Google Scholar]
  51. Tao, T. 245B, Notes 12: Continuous Functions on Locally Compact Hausdorff Spaces. Available online: https://terrytao.wordpress.com/2009/03/02/245b-notes-12-continuous-functions-on-locally-compact-hausdorff-spaces/ (accessed on 2 March 2009).
  52. Bourbaki, N. Intégration; (Chaps. I-IV, Actualités Scientifiques et Industrielles, no. 1175); Hermann: Paris, France, 1952. [Google Scholar]
  53. Dembo, A.; Zeitouni, O. Large Deviations Techniques and Applications; Springer: Berlin, Germany, 2009; Volume 38. [Google Scholar]
  54. Lane, S.M. Categories for the Working Mathematician; Springer: New York, NY, USA, 1978. [Google Scholar]
  55. Hatcher, A. Algebraic Topology; Tsinghua University Press: Beijing, China, 2002. [Google Scholar]
  56. Rockafellar, R.T. Convex Analysis; Princeton University Press: Princeton, NJ, USA, 2015. [Google Scholar]
  57. Prokhorov, Y.V. Convergence of random processes and limit theorems in probability theory. Theory Probab. Its Appl. 1956, 1, 157–214. [Google Scholar] [CrossRef]
  58. Verdú, S. Information Theory; In preparation; 2018. [Google Scholar]
  59. Kamath, S. Reverse hypercontractivity using information measures. In Proceedings of the 53rd Annual Allerton Conference on Communications, Control and Computing, Champaign, IL, USA, 30 September–2 October 2015; pp. 627–633. [Google Scholar]
  60. Wu, Y.; Verdú, S. Functional properties of minimum mean-square error and mutual information. IEEE Trans. Inf. Theory 2012, 58, 1289–1301. [Google Scholar] [CrossRef]
  61. Godavarti, M.; Hero, A. Convergence of differential entropies. IEEE Trans. Inf. Theory 2004, 50, 171–176. [Google Scholar] [CrossRef]
Figure 1. Diagrams for Theorem 1.
Figure 2. The forward-reverse Brascamp-Lieb inequality generalizes several other functional inequalities/information theoretic inequalities. For more discussions on these relations, see the extended version [7].
Figure 3. Diagram for hypercontractivity.
Figure 4. Diagram for reverse hypercontractivity.
Figure 5. Diagram for reverse hypercontractivity with one negative parameter.