Gaussian Optimality for Derivatives of Differential Entropy Using Linear Matrix Inequalities

Let Z be a standard Gaussian random variable, X be independent of Z, and t be a strictly positive scalar. For the derivatives in t of the differential entropy of X + √tZ, McKean noticed that Gaussian X achieves the extreme for the first and second derivatives, among distributions with a fixed variance, and he conjectured that this holds for derivatives of all orders. This conjecture implies that the signs of the derivatives alternate. Recently, Cheng and Geng proved that this alternation holds for the first four orders. In this work, we employ the technique of linear matrix inequalities to show that: firstly, Cheng and Geng’s method may not generalize to higher orders; secondly, when the probability density function of X + √tZ is log-concave, McKean’s conjecture holds for orders up to at least five. As a corollary, we also recover Toscani’s result on the sign of the third derivative of the entropy power of X + √tZ, using a much simpler argument.

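There have been numerous generalizations of the EPI. In [4], Costa considered the case where X is perturbed by an independent standard Gaussian Z, and showed that the entropy power $N(X + \sqrt{t}Z)$ is concave in t for t > 0:
$$\frac{d^2}{dt^2} N(X + \sqrt{t}Z) \le 0.$$
Toscani [5] further showed that $\frac{d^3}{dt^3} N(X + \sqrt{t}Z) \ge 0$ when the probability density function of $X + \sqrt{t}Z$ is log-concave. For the differential entropy itself, Cheng and Geng [9] conjectured that the signs of the derivatives alternate:
Conjecture 1 ([9]). $(-1)^{n-1} \frac{d^n}{dt^n} h(X + \sqrt{t}Z) \ge 0$ for t > 0 and n ≥ 1.
Recall that the first derivative $\frac{d}{dt} h(X + \sqrt{t}Z)$ is one half of the Fisher information $J(X + \sqrt{t}Z)$. The above conjecture is equivalent to hypothesizing that the Fisher information of $X + \sqrt{t}Z$ is completely monotone, thus admitting a very simple characterization using the Laplace Transform [10]: there exists a finite Borel measure µ(·) such that
$$J(X + \sqrt{t}Z) = \int_0^{\infty} e^{-tx}\, d\mu(x). \tag{1}$$
Back in 1966, McKean [7] also studied the derivatives in t of $h(X + \sqrt{t}Z)$, and noticed that Gaussian X achieves the minimum of $\frac{d}{dt} h(X + \sqrt{t}Z)$ and $-\frac{d^2}{dt^2} h(X + \sqrt{t}Z)$, subject to Var(X) = σ². Then, McKean implicitly made the following conjecture that Gaussian optimality holds generally:
Conjecture 2 ([7]). Subject to Var(X) = σ², Gaussian X with variance σ² achieves the minimum of $(-1)^{n-1} \frac{d^n}{dt^n} h(X + \sqrt{t}Z)$ for t > 0 and n ≥ 1.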
Hence, McKean's conjecture implies the one by Cheng and Geng. Compared with the progress made by Cheng and Geng [9] on Conjecture 1, there has been little progress on Conjecture 2. Most of the existing results are on the second derivative of the differential entropy (or the mutual information), and on generalizing the EPI to other settings. For example: Guo et al. [11] represent the derivatives in the signal-to-noise ratio of the mutual information in terms of the minimum mean-square estimation error, building on de Bruijn's identity [2]; Wibisono and Jog [12] study the mutual information along the density flow defined by the heat equation and show that it is a convex function of time if the initial distribution is log-concave; Wang and Madiman [13] recover the proof of the EPI via rearrangements; Courtade [14] generalizes Costa's EPI to non-Gaussian additive perturbations; and König and Smith [15] propose a quantum version of the EPI.
In this paper, we work on Conjecture 2. The main contributions are to show that Conjecture 2 holds for orders up to at least five under the log-concavity condition, and to introduce the technique of linear matrix inequalities for this problem.
The paper is organized as follows: in Section 2, we obtain the formulae for the derivatives of the differential entropy h(X + √tZ) (Theorem 1) and show that McKean's conjecture holds for orders up to at least five under the log-concavity condition (Corollary 1). As a corollary, we recover Toscani's result [5] on the third derivative of the entropy power, using a much simpler argument based on the Cauchy-Schwarz inequality. In Section 3, we introduce the linear matrix inequality approach, and transform the above two conjectures into feasibility checks of semidefinite programming problems.
With this approach, we can easily obtain the coefficients in Theorem 1. Then, we show that the direct generalization of the method by Cheng and Geng might not work for orders higher than four for proving Conjecture 1. In Section 4, we prove the main theorem of Section 2.

Main Results
We first introduce the notation that is used throughout this paper. For single-variate functions, we use $\frac{d}{d\,\cdot}$ for the derivative; for multi-variate functions, we use $\frac{\partial}{\partial\,\cdot}$ for the partial derivative. To simplify the notation, for a general single-variate function g(y), we also use $g'(y)$, $g''(y)$ and $g'''(y)$ to represent the first, second and third derivatives, respectively; and $g^{(n)}(y)$ denotes the n-th derivative for n ≥ 1.
In the rest of the paper, let Z be a standard Gaussian random variable, and X be independent of Z. Denote $Y_t := X + \sqrt{t}Z$. According to [4,16], $Y_t$ has nice properties: the probability density function $f(y, t)$ of $Y_t$ exists, is strictly positive and infinitely differentiable; and the differential entropy $h(Y_t)$ exists. Denote $f_n := \frac{\partial^n}{\partial y^n} f(y, t)$ and $T_n := \frac{\partial^n}{\partial y^n} \log f(y, t)$, where it is understood that $f_n$ and $T_n$ are functions of (y, t). We also present some properties of $f(y, t)$ in the following lemma. The proof can be found in, say, [2,16] and Propositions 1 and 2 in [9].

Lemma 1.
For t > 0, the probability density function $f(y, t)$ satisfies the following properties: (1) The heat equation holds: $\frac{\partial}{\partial t} f(y, t) = \frac{1}{2} \frac{\partial^2}{\partial y^2} f(y, t)$. (3) The expectation of the product of the $T_i$, $E[\prod_i T_i]$, exists, and $\lim_{|y| \to \infty} f \prod_i T_i = 0$, ∀t > 0.
In Lemma 1, part (3), in writing $E[\prod_i T_i]$, we think of each $T_i$ as a function of $(Y_t, t)$. Notice that, given X and Z, the differential entropy $h(X + \sqrt{t}Z)$ is a function of t. The formulae for the first and second derivatives of $h(X + \sqrt{t}Z)$ are presented in the following lemma. According to Stam [2], the first equality is due to de Bruijn, and the right-hand side is actually the Fisher information (page 671 of [17]); the second one is due to McKean [7], Toscani [8] and Villani [6]; the Gaussian optimality is due to McKean [7].

Lemma 2.
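For the first and second derivatives of the differential entropy $h(X + \sqrt{t}Z)$, the following expressions hold for t > 0:
$$h'(X + \sqrt{t}Z) = \frac{1}{2} E\left[\left(\frac{f_1}{f}\right)^2\right], \qquad h''(X + \sqrt{t}Z) = -\frac{1}{2} E\left[\left(\frac{f_2}{f} - \frac{f_1^2}{f^2}\right)^2\right].$$
Subject to Var(X) = σ², Gaussian X with variance σ² minimizes $h'(X + \sqrt{t}Z)$ and $-h''(X + \sqrt{t}Z)$.
By standard manipulations, one has
$$T_1 = \frac{f_1}{f}, \qquad T_2 = \frac{f_2}{f} - \frac{f_1^2}{f^2}, \tag{5}$$
and similarly for higher orders. Thus, it is straightforward to rewrite the derivatives as
$$2h'(X + \sqrt{t}Z) = E[T_1^2], \tag{6}$$
$$-2h''(X + \sqrt{t}Z) = E[T_2^2]. \tag{7}$$
For the third and fourth derivatives, one can refer to Theorems 1 and 2 in [9], where they were represented by the $f_i$. Notice that these representations are not unique, and the ones in [9] are sufficient for identifying the signs. Instead, in Theorem 1, we use the $T_i$, and this will facilitate our proof of the Gaussian optimality in Corollary 1.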

Theorem 1.
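For t > 0, the derivatives of the differential entropy $h(X + \sqrt{t}Z)$ of orders three, four and five can be expressed as expectations of polynomials in the $T_i$: Equation (8) for $2h^{(3)}$, Equation (9) for $-2h^{(4)}$ and Equation (10) for $2h^{(5)}$. For the third order, one such representation is
$$2h^{(3)}(X + \sqrt{t}Z) = E\left[T_3^2 + 2(-T_2)^3\right]. \tag{8}$$
The proof of this theorem is left to Section 4. The existence of such expressions and how to obtain the coefficients are left to Section 3, where the method of linear matrix inequalities is introduced.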

Log-Concave Case
Lemma 2 already ensures the optimality of Gaussians, subject to Var(X) = σ 2 , for the first and second derivatives. For higher ones, we do not know if we can show the optimality based on the expressions in Theorem 1. Here, we impose the constraint of log-concavity on f (y, t) and summarize the results in Corollaries 1-3.
A nonnegative function f(·) is logarithmically concave (or log-concave for short) if its domain is convex and it satisfies the inequality
$$f(\theta x + (1 - \theta) y) \ge f(x)^{\theta} f(y)^{1 - \theta}$$
for all x, y in the domain and 0 < θ < 1. If f is strictly positive, this is equivalent to saying that the logarithm of the function is concave (Section 2.5 of [18]). In our case, assuming that $f(y, t)$ is log-concave in y is equivalent to $T_2 \le 0$.
Examples of log-concave distributions include the Gaussian, exponential, Laplace, and the Gamma with parameter larger than one. Notice that, if the probability density function of X is log-concave, then so is that of X + √ tZ (Section 3.5.2 of [18]).
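The defining inequality can be probed numerically. The following is a minimal sketch, not part of the original argument: it tests the logarithmic form of the inequality on a finite grid, so it can refute log-concavity but cannot prove it. The function names, the grid, and the bimodal mixture used as a counterexample are illustrative assumptions; normalizing constants are dropped because they do not affect log-concavity.

```python
import math

def is_log_concave_on_grid(log_f, points, thetas, tol=1e-9):
    """Test log f(theta*x + (1-theta)*y) >= theta*log f(x) + (1-theta)*log f(y)
    for all grid points x, y and all listed theta (a necessary check only)."""
    for x in points:
        for y in points:
            for theta in thetas:
                mid = theta * x + (1.0 - theta) * y
                lhs = log_f(mid)
                rhs = theta * log_f(x) + (1.0 - theta) * log_f(y)
                if lhs < rhs - tol:
                    return False
    return True

def log_gaussian(y):
    # log of exp(-y^2 / 2), unnormalized standard Gaussian density
    return -0.5 * y * y

def log_laplace(y):
    # log of exp(-|y|), unnormalized Laplace density
    return -abs(y)

def log_bimodal(y):
    # hypothetical counterexample: equal mixture of N(3, 1) and N(-3, 1), unnormalized
    return math.log(0.5 * math.exp(-0.5 * (y - 3.0) ** 2)
                    + 0.5 * math.exp(-0.5 * (y + 3.0) ** 2))

points = [-4.0 + 0.5 * k for k in range(17)]   # grid on [-4, 4]
thetas = [0.1 * k for k in range(1, 10)]       # theta in (0, 1)

gaussian_ok = is_log_concave_on_grid(log_gaussian, points, thetas)
laplace_ok = is_log_concave_on_grid(log_laplace, points, thetas)
bimodal_ok = is_log_concave_on_grid(log_bimodal, points, thetas)
```

For the mixture, the check already fails at x = 3, y = -3, θ = 1/2, where the density at the midpoint is far below the geometric mean of the endpoint values.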
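Corollary 1.
If the probability density function of $X + \sqrt{t}Z$ is log-concave, then, subject to Var(X) = σ², Gaussian X with variance σ² achieves the minimum of $(-1)^{n-1} h^{(n)}(X + \sqrt{t}Z)$ for t > 0 and n = 3, 4, 5.
Proof. Let $X_G$ be Gaussian with mean µ and variance σ². The probability density function of $X_G + \sqrt{t}Z$ is that of a Gaussian with mean µ and variance σ² + t. The key observation is that the second derivative of the logarithm in the Gaussian case is the constant
$$T_{2,G} = -\frac{1}{\sigma^2 + t}.$$
Hence, from Equation (2), the derivatives of the differential entropy in the Gaussian case are
$$(-1)^{n-1}\, 2h^{(n)}(X_G + \sqrt{t}Z) = \frac{(n-1)!}{(\sigma^2 + t)^n} = (n-1)!\, \left(E[-T_{2,G}]\right)^n.$$
Now, if one can show the following chain of inequalities:
$$(-1)^{n-1}\, 2h^{(n)}(X + \sqrt{t}Z) \overset{(a)}{\ge} (n-1)!\, E[(-T_2)^n] \overset{(b)}{\ge} (n-1)!\, \left(E[-T_2]\right)^n \overset{(c)}{\ge} (n-1)!\, \left(E[-T_{2,G}]\right)^n, \tag{11}$$
then one is done. For inequality (b), the log-concavity condition, namely $T_2 \le 0$, suffices: since $-T_2 \ge 0$ and $x \mapsto x^n$ is convex on [0, ∞), Jensen's inequality applies.
Inequality (c) can be proved using Lemma 2. Notice that
$$E[T_2] = \int f\, \frac{\partial T_1}{\partial y}\, dy = \Big[f\, T_1\Big]_{-\infty}^{\infty} - \int f_1 T_1\, dy = -\int f_1 T_1\, dy,$$
where the last equality is due to Lemma 1. Now, from Equation (5), $f_1 = f\, T_1$, so that
$$E[-T_2] = E[T_1^2]. \tag{12}$$
Combining this with Lemma 2, one has
$$E[-T_2] = 2h'(X + \sqrt{t}Z) \ge 2h'(X_G + \sqrt{t}Z) = E[-T_{2,G}].$$
This part is finished by noticing that $E[-T_{2,G}] > 0$ from Equation (2). For inequality (a), we show each case of n using Theorem 1 and the condition $T_2 \le 0$. For n = 3,
$$2h^{(3)}(X + \sqrt{t}Z) = E\left[T_3^2 + 2(-T_2)^3\right] \ge 2E[(-T_2)^3],$$
where the inequality is due to $T_3^2 \ge 0$. The cases n = 4 and n = 5 are handled in the same manner, starting from Equations (9) and (10). Now, the proof is finished.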
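The Gaussian benchmark used in this proof can be checked directly. The sketch below is an illustration under the standard closed form $h = \frac{1}{2}\log(2\pi e(\sigma^2 + t))$ for a Gaussian with variance σ² + t; the function names are hypothetical. It differentiates $\frac{1}{2}\log(\sigma^2 + t)$ repeatedly in exact arithmetic and confirms that $(-1)^{n-1}\, 2h^{(n)}$ equals $(n-1)!/(\sigma^2 + t)^n$.

```python
import math
from fractions import Fraction

def gaussian_entropy(t, var):
    """h(X_G + sqrt(t) Z) for Gaussian X_G with variance var."""
    return 0.5 * math.log(2.0 * math.pi * math.e * (var + t))

def gaussian_entropy_derivative_coeff(n):
    """Return c_n such that the n-th derivative in t equals c_n / (var + t)^n."""
    coeff = Fraction(1, 2)            # first derivative: (1/2) / (var + t)
    for k in range(1, n):
        coeff *= -k                   # d/dt (var + t)^(-k) = -k (var + t)^(-k-1)
    return coeff

def gaussian_bound(n):
    """Numerator of (-1)^(n-1) * 2 * h^(n) in the Gaussian case."""
    return (-1) ** (n - 1) * 2 * gaussian_entropy_derivative_coeff(n)

# Compare the first derivative with a central finite difference.
var, t, eps = 2.0, 1.0, 1e-4
finite_difference = (gaussian_entropy(t + eps, var)
                     - gaussian_entropy(t - eps, var)) / (2.0 * eps)
closed_form = float(gaussian_entropy_derivative_coeff(1)) / (var + t)
```

In particular `gaussian_bound(5)` evaluates to 24 = 4!, the constant that appears in the fifth-order bound.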
The following corollary deals with the fifth-order case in [9], under the log-concavity assumption. The proof follows directly from Corollary 1 and Equation (2).

Corollary 2.
If the probability density function of $X + \sqrt{t}Z$ is log-concave, then the fifth derivative of the differential entropy is strictly positive: $h^{(5)}(X + \sqrt{t}Z) > 0$.
Regarding the entropy power, it is already known that $N'(X + \sqrt{t}Z) \ge 0$ from the connection with Fisher information, and $N''(X + \sqrt{t}Z) \le 0$ according to [4]. For the third derivative, Toscani showed that $N^{(3)}(X + \sqrt{t}Z) \ge 0$, under the log-concavity assumption. Here, we simplify Toscani's proof, using a Cauchy-Schwarz argument.

Corollary 3.
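If the probability density function of $X + \sqrt{t}Z$ is log-concave, then the third derivative of the entropy power is nonnegative:
$$N^{(3)}(X + \sqrt{t}Z) \ge 0.$$
Proof. For brevity, let $h := h(X + \sqrt{t}Z)$, and, similarly, we omit the arguments for higher orders. Since the entropy power is proportional to $e^{2h}$, routine manipulations yield that
$$N^{(3)} = N \left(2h''' + 12 h' h'' + 8 h'^3\right).$$
Thus, it suffices to show $2h''' + 12 h' h'' + 8 h'^3 \ge 0$. We express each term in the form of the $T_i$: according to Lemma 2 and Equation (12), $2h' = E[-T_2]$; from Equation (7), $-2h'' = E[T_2^2]$; and from Theorem 1, $2h''' \ge 2E[(-T_2)^3]$. Now, under the log-concavity condition, namely $T_2 \le 0$, from the Cauchy-Schwarz inequality for random variables, we have:
$$E[T_2^2]^2 = E\left[(-T_2)^{1/2} (-T_2)^{3/2}\right]^2 \le E[-T_2]\, E[(-T_2)^3].$$
Thus, we have
$$2h''' + 12 h' h'' + 8 h'^3 \ge \frac{2E[T_2^2]^2}{E[-T_2]} - 3E[-T_2]\, E[T_2^2] + E[-T_2]^3 = \frac{\left(E[T_2^2] - E[-T_2]^2\right)\left(2E[T_2^2] - E[-T_2]^2\right)}{E[-T_2]}.$$
The proof is finished by noticing that $E[T_2^2] \ge E[-T_2]^2 \ge 0$, which implies that the right-hand side is nonnegative.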
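The algebra in this proof can be sanity-checked in exact arithmetic. The sketch below is illustrative only and assumes the shorthand a = 2h', b = -2h'', c = 2h''' (these symbols and the function names are not from the paper). With $N \propto e^{2h}$, the chain rule gives $N'''/N = c - 3ab + a^3$, and substituting the Cauchy-Schwarz bound $c \ge 2b^2/a$ gives the factored lower bound $(b - a^2)(2b - a^2)/a$.

```python
from fractions import Fraction

def entropy_power_third_derivative_ratio(a, b, c):
    """Return N'''/N for a = 2h', b = -2h'', c = 2h'''.

    N'''/N = 2h''' + 12 h'h'' + 8 h'^3 = c - 3ab + a^3.
    """
    return c - 3 * a * b + a ** 3

def cauchy_schwarz_lower_bound(a, b):
    """Value of N'''/N when c is replaced by its lower bound 2 b^2 / a."""
    return (b - a * a) * (2 * b - a * a) / a

# Gaussian case with sigma^2 + t = s: a = 1/s, b = 1/s^2, c = 2/s^3.
s = Fraction(3)
gaussian_ratio = entropy_power_third_derivative_ratio(1 / s, 1 / s ** 2, 2 / s ** 3)

# Arbitrary illustrative values with b >= a^2.
a, b = Fraction(1, 2), Fraction(1, 3)
ratio_at_bound = entropy_power_third_derivative_ratio(a, b, 2 * b * b / a)
factored_bound = cauchy_schwarz_lower_bound(a, b)
```

The Gaussian ratio vanishes, as expected, because the entropy power of a Gaussian is linear in t.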

Linear Matrix Inequalities
In this section, we introduce the method of linear matrix inequalities (LMI), and transform the proof of Conjectures 1 and 2 to the feasibility problem of LMI. This transformation also enables us to find the right coefficients in Theorem 1.
Recall that, in [9], the authors first obtained the fourth derivative in terms of the $f_i$ (Equation (13); Equation (27) in [9]). Then, with some equalities (from integration by parts), they showed that this derivative can be expressed as the negative of a sum of squares (Equation (14); Theorem 2 in [9]). Hence, the fourth derivative is nonpositive. The sum of squares has a natural connection with positive semidefinite matrices. The right-hand side of Equation (14) can be written as $-E[u^T F u]$, where u is the column vector whose coordinates are ratios such as $f_4/f$ and $f_1^2 f_2/f^3$, and F is a positive semidefinite matrix. Thus, the method in [9] is actually to verify the existence of a suitable positive semidefinite matrix F. This can be cast as the feasibility of a linear matrix inequality.
A linear matrix inequality (Chapter 2 of [18]) has the form
$$F(x, y) := F_0 + \sum_{i=1}^{I} x_i F_i + \sum_{j=1}^{J} y_j G_j \succeq 0, \tag{15}$$
where the m × m symmetric matrices $F_0$, $F_i$, $G_j$, i = 1, ..., I, j = 1, ..., J are given, the variables $x_i$ are real and the $y_j$ are nonnegative, and the notation $F(x, y) \succeq 0$ means that $F(x, y)$ is positive semidefinite. The feasibility problem refers to identifying if there exists a set of $x_i$ and $y_j$ such that $F(x, y)$ is positive semidefinite.
To reformulate the method used by Cheng and Geng [9] as an LMI feasibility problem, using the fourth derivative as an illustrative example, the main idea is: first, transform the original expression of the derivative to the form $-2h^{(4)}(X + \sqrt{t}Z) = E[u^T F_0 u]$. Then, transform the equalities resulting from integration by parts to the form $E[u^T F_i u] = 0$, i = 1, ..., I. Finally, try to find a set of variables $x_i \in \mathbb{R}$ such that $F_0 + \sum_i x_i F_i \succeq 0$, which implies $-2h^{(4)} = E[u^T (F_0 + \sum_i x_i F_i) u] \ge 0$. One can notice that there is no matrix $G_j$ in the above statement. This is mainly because only equalities were available in [9]. When one imposes inequality constraints, for example $T_2 \le 0$, as in this paper, then one will be able to construct matrices $G_j$.
Before we proceed to introduce the details on constructing those matrices, the following observations are clear regarding Equation (13): (a) the sum-order of derivatives for each entry of u is four, for example, the sum-order of $f_1^2 f_2 / f^3$ is 1 × 2 + 2 = 4; (b) the highest order of a single term in the entries of u is four, namely $f_4 / f$; (c) the sum-order of each entry in the fourth derivative is eight, which is twice that of u.
In the following, we take the fourth derivative as an example, and show how to construct these matrices $F_0$ (Section 3.3), $F_i$ (Sections 3.1 and 3.2), and $G_j$ (Section 3.4). We decide to use the $T_k$ as the entries of u, instead of the $f_k$, the motivation for which is clear from the proof of Corollary 1 and the desire to exploit the assumption $T_2 \le 0$. Based on the above observation and the expressions in Equation (5), our vector u is
$$u = \left(T_4,\; T_3 T_1,\; T_2^2,\; T_2 T_1^2,\; T_1^4\right)^T.$$
Thus, $F_0$, $F_i$, $G_j$ are 5 × 5 symmetric matrices. Here, we mention that the expressions appearing as coordinates in u correspond to the integer partitions of four.
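To make the feasibility question concrete, here is a small sketch that evaluates $F(x, y) = F_0 + \sum_i x_i F_i + \sum_j y_j G_j$ for a given candidate and tests positive semidefiniteness exactly. It is a verifier for a proposed solution, not a solver, and it is not the convex optimization package used in the paper. The PSD test relies on the fact that a symmetric matrix is positive semidefinite if and only if all of its principal minors are nonnegative, which is practical only for small matrices. The 2 × 2 matrices at the end are toy data chosen for illustration.

```python
from fractions import Fraction
from itertools import combinations

def determinant(matrix):
    """Exact determinant by Gaussian elimination over the rationals."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    n = len(rows)
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if rows[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            rows[col], rows[pivot] = rows[pivot], rows[col]
            det = -det
        det *= rows[col][col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            for c in range(col, n):
                rows[r][c] -= factor * rows[col][c]
    return det

def is_psd(matrix):
    """Symmetric matrix is PSD iff every principal minor is nonnegative."""
    n = len(matrix)
    for size in range(1, n + 1):
        for idx in combinations(range(n), size):
            minor = [[matrix[i][j] for j in idx] for i in idx]
            if determinant(minor) < 0:
                return False
    return True

def lmi_value(F0, Fs, xs, Gs=(), ys=()):
    """Evaluate F(x, y) = F0 + sum_i x_i F_i + sum_j y_j G_j with y_j >= 0."""
    if any(y < 0 for y in ys):
        raise ValueError("the multipliers y_j must be nonnegative")
    n = len(F0)
    out = [[Fraction(F0[i][j]) for j in range(n)] for i in range(n)]
    for coeff, M in list(zip(xs, Fs)) + list(zip(ys, Gs)):
        for i in range(n):
            for j in range(n):
                out[i][j] += Fraction(coeff) * M[i][j]
    return out

# Toy (hypothetical) data: F(x) = [[1, 1], [1, x]] is PSD exactly when x >= 1.
F0 = [[1, 1], [1, 0]]
F1 = [[0, 0], [0, 1]]
```

For example, `is_psd(lmi_value(F0, [F1], [2]))` holds, while the candidate x = 1/2 is rejected; a solver would search over x (and y) instead of checking one candidate.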
The organization of this section is as follows: Sections 3.1-3.3 deal with the sign of the fourth derivative with only equality constraints (see Conjecture 1); Section 3.4 further incorporates the inequality constraints, namely T 2 ≤ 0; Section 3.5 shows the manipulation for the optimality of Gaussian inputs (see Conjecture 2). In Section 3.6, we consider the sign and the Gaussian optimality for the fifth derivative.

Matrices F i from Multiple Representations
The matrices $F_i$ are such that $E[u^T F_i u] = 0$. A trivial case is to notice that different products of the form u(i)u(j) may map to the same term, for example
$$T_2^2 T_1^4 = u(3)u(5) = u(4)u(4).$$
That is, $T_2^2 T_1^4$ admits multiple representations as u(i)u(j). It is easy to construct a corresponding matrix $F_1$ such that $u^T F_1 u = 0$: for instance, take $F_1(3,5) = F_1(5,3) = 1$, $F_1(4,4) = -2$ and all other entries zero, so that $u^T F_1 u = 2u(3)u(5) - 2u(4)^2 = 0$. For the fourth derivative, only one term has multiple representations. There is none for the third derivative, and three for the fifth ($F_1$, $F_2$ and $F_3$ in Section 3.6).
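The counts of terms with multiple representations can be reproduced by brute force. The following sketch assumes that the entries of u are listed in the order of the integer partitions generated below (largest part first), which is an illustrative convention; a tuple such as (3, 1) stands for $T_3 T_1$.

```python
from collections import defaultdict
from itertools import combinations_with_replacement

def partitions(n, largest=None):
    """Yield the integer partitions of n as non-increasing tuples."""
    if largest is None:
        largest = n
    if n == 0:
        yield ()
        return
    for part in range(min(n, largest), 0, -1):
        for rest in partitions(n - part, part):
            yield (part,) + rest

def multiple_representations(order):
    """Map each monomial equal to u(i)u(j) for several pairs i <= j to those pairs."""
    u = list(partitions(order))
    pairs_by_monomial = defaultdict(list)
    for i, j in combinations_with_replacement(range(len(u)), 2):
        monomial = tuple(sorted(u[i] + u[j], reverse=True))
        pairs_by_monomial[monomial].append((i + 1, j + 1))  # 1-based, as in the text
    return {m: p for m, p in pairs_by_monomial.items() if len(p) > 1}

fourth_order = multiple_representations(4)
```

Here `fourth_order` contains the single monomial (2, 2, 1, 1, 1, 1), that is $T_2^2 T_1^4$, with the index pairs (3, 5) and (4, 4).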

Matrices F i from Integration by Parts
The equalities of the type E[u T F i u] = 0 used in [9] are from integration by parts. Here, we list them one by one.
Notice that there are fifteen possible terms with sum-order eight and highest order at most four; we index them from 1 to 15 and denote the resulting vector as w.
These terms are arranged in the order such that the first (fourteen) terms can be expressed as u(i)u(j) for some i and j, while the last term(s) cannot be. We call the first terms the quadratic part w qua , and the last term(s) the non-quadratic part w non . Thus, w = (w qua , w non ).
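The split of w into its quadratic and non-quadratic parts can also be enumerated mechanically. This sketch is an illustration: it lists the monomials of sum-order 2k with no factor above $T_k$ and separates those that can be written as a product of two entries of u. The ordering it produces is that of the partition generator and need not match the indexing used in the paper.

```python
def partitions(n, largest):
    """Yield the integer partitions of n with parts at most `largest`."""
    if n == 0:
        yield ()
        return
    for part in range(min(n, largest), 0, -1):
        for rest in partitions(n - part, part):
            yield (part,) + rest

def split_w(order):
    """Return (w_qua, w_non) for the derivative of the given order."""
    u = list(partitions(order, order))
    # Every product u(i)u(j), written as a sorted tuple of derivative orders.
    quadratic = {tuple(sorted(a + b, reverse=True)) for a in u for b in u}
    w = list(partitions(2 * order, order))
    w_qua = [m for m in w if m in quadratic]
    w_non = [m for m in w if m not in quadratic]
    return w_qua, w_non

w_qua, w_non = split_w(4)
```

For the fourth derivative this gives fourteen quadratic terms and the single non-quadratic term (3, 3, 2), that is $T_3^2 T_2$.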
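It is not difficult to conclude that, to avoid repetition, one only needs to perform integration by parts on the entries whose highest-order term has power one. These entries are (eight in total):
$$T_4 T_3 T_1,\quad T_4 T_2^2,\quad T_4 T_2 T_1^2,\quad T_4 T_1^4,\quad T_3 T_2^2 T_1,\quad T_3 T_2 T_1^3,\quad T_3 T_1^5,\quad T_2 T_1^6.$$
Taking $T_4 T_3 T_1$ as an example, one can show that (Equation (18), see the end of this subsection)
$$E\left[2 T_4 T_3 T_1 + T_3^2 T_2 + T_3^2 T_1^2\right] = 0.$$
In addition, this can be written as $E[c_1^T w] = 0$, where $c_1 \in \mathbb{R}^{15}$ collects the coefficients of the entries of w. There are eight equalities in total and hence there are vectors $c_1, \ldots, c_8$. We put each $c_i$ as the i-th row of $C \in \mathbb{R}^{8 \times 15}$, and write those equalities as
$$E[Cw] = 0. \tag{17}$$
The entries can be found in Equations (18)-(25). We need to extract matrices F from these eight equalities $E[Cw] = 0$, such that $E[u^T F u] \equiv 0$. The main problem is that $c_k^T w$ may contain entries that are not expressible as u(i)u(j). In particular, for the fourth derivative, this happens when $c_k(15) \ne 0$. One needs to do some work to cancel these entries. The general method, which can also be used in higher-order cases, is stated below:
1. Firstly, since $w = (w_{qua}, w_{non})$, we separate the blocks of C accordingly,
$$E\left[\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & 0 \end{pmatrix} \begin{pmatrix} w_{qua} \\ w_{non} \end{pmatrix}\right] = 0,$$
where the rows of C that involve $w_{non}$ are collected in the top blocks.
2. Secondly, each row of $C_{21}$ gives an equality that involves only quadratic terms, and hence directly a matrix $F_i$. In particular, for the first row of $C_{21}$, namely $E[T_4 T_2 T_1^2 + T_3^2 T_1^2 + 2 T_3 T_2^2 T_1 + T_3 T_2 T_1^3] = 0$ (Equation (20)), the matrix is
$$F_2 = \begin{pmatrix} 0 & 0 & 0 & 1 & 0 \\ 0 & 2 & 2 & 1 & 0 \\ 0 & 2 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}.$$
Notice a scaling of a factor of two is added here just for conciseness, and this does not affect the feasibility of (15). Similarly, the other five matrices, corresponding to the remaining rows of $C_{21}$, are obtained from Equations (21)-(25).
3. Thirdly, for $C_{11}$ and $C_{12}$, the equalities are $E[C_{11} w_{qua} + C_{12} w_{non}] = 0$. Notice $w_{non}$ cannot be expressed in a quadratic form. Supposing that we can find a column vector z such that $z^T C_{12} = 0$, then $E[z^T C_{11} w_{qua}] = E[z^T (C_{11} w_{qua} + C_{12} w_{non})] = 0$. The vector z actually lies in the null space of $C_{12}^T$, and it suffices to find a basis. One way is to do the QR decomposition:
$$C_{12} = Q \begin{pmatrix} U \\ 0 \end{pmatrix},$$
where Q is orthogonal and U is upper-triangular. The null space of $C_{12}^T$ has dimension equal to the number of rows of the zero block above, and the last columns of Q form a basis. In particular, for the fourth derivative,
$$C_{12} = \begin{pmatrix} 1 \\ 2 \end{pmatrix} = \frac{1}{\sqrt{5}} \begin{pmatrix} 1 & -2 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} \sqrt{5} \\ 0 \end{pmatrix}.$$
Hence, one takes z as the second column of Q, which is (after scaling for conciseness) $z^T = (-2, 1)$. Then, one calculates $z^T C_{11} w_{qua} = -4 T_4 T_3 T_1 + T_4 T_2^2 - 2 T_3^2 T_1^2 + T_3 T_2^2 T_1$, and the corresponding matrix $F_8$ (scaled by a factor of two) is
$$F_8 = \begin{pmatrix} 0 & -4 & 1 & 0 & 0 \\ -4 & -4 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \end{pmatrix}.$$
The rest of this subsection is devoted to calculating the equalities obtained from integration by parts. This is similar to that in [9], except in the form of the $T_i$. To begin, we need the following lemma.
Lemma 3. Let A be a linear combination of products of the $T_i$; then, for n ≥ 2,
$$E[A\, T_n] = -E\left[T_{n-1} \left(\frac{\partial A}{\partial y} + A\, T_1\right)\right].$$
Proof. From calculus,
$$E[A\, T_n] = \int f A\, \frac{\partial T_{n-1}}{\partial y}\, dy \overset{(a)}{=} -\int T_{n-1}\, \frac{\partial (f A)}{\partial y}\, dy = -\int T_{n-1} \left(f \frac{\partial A}{\partial y} + f_1 A\right) dy \overset{(b)}{=} -E\left[T_{n-1} \left(\frac{\partial A}{\partial y} + A\, T_1\right)\right],$$
where (a) is due to Lemma 1, and (b) is due to Equation (5).
Now, using Lemma 3, one obtains the following equalities:
$$E\left[2 T_4 T_3 T_1 + T_3^2 T_2 + T_3^2 T_1^2\right] = 0, \tag{18}$$
$$E\left[T_4 T_2^2 + 2 T_3^2 T_2 + T_3 T_2^2 T_1\right] = 0, \tag{19}$$
$$E\left[T_4 T_2 T_1^2 + T_3^2 T_1^2 + 2 T_3 T_2^2 T_1 + T_3 T_2 T_1^3\right] = 0, \tag{20}$$
$$E\left[T_4 T_1^4 + 4 T_3 T_2 T_1^3 + T_3 T_1^5\right] = 0, \tag{21}$$
$$E\left[3 T_3 T_2^2 T_1 + T_2^4 + T_2^3 T_1^2\right] = 0, \tag{22}$$
$$E\left[2 T_3 T_2 T_1^3 + 3 T_2^3 T_1^2 + T_2^2 T_1^4\right] = 0, \tag{23}$$
$$E\left[T_3 T_1^5 + 5 T_2^2 T_1^4 + T_2 T_1^6\right] = 0, \tag{24}$$
$$E\left[7 T_2 T_1^6 + T_1^8\right] = 0. \tag{25}$$
With these equalities, matrix (17) can be constructed.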
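The integration-by-parts equalities can be generated mechanically. The sketch below is illustrative and assumes the identity $E[A\, T_n] = -E[T_{n-1}(\partial A/\partial y + A\, T_1)]$ for n ≥ 2, which follows from integrating by parts with vanishing boundary terms and $f_1 = f\, T_1$. Polynomials in the $T_k$ are stored as dictionaries from exponent tuples $(e_1, \ldots, e_K)$ to integer coefficients; the helper names and this representation are assumptions of the example. The last lines combine two identities with the weights z = (-2, 1) quoted in the text and confirm that the non-quadratic term $T_3^2 T_2$ cancels.

```python
from collections import defaultdict

K = 5  # track T_1 .. T_5; a monomial is a tuple of exponents (e_1, ..., e_K)

def T(k, power=1):
    """The monomial T_k^power."""
    exps = [0] * K
    exps[k - 1] = power
    return {tuple(exps): 1}

def add(p, q, scale=1):
    """Return p + scale * q, dropping zero coefficients."""
    out = defaultdict(int, p)
    for m, c in q.items():
        out[m] += scale * c
    return {m: c for m, c in out.items() if c != 0}

def mul(p, q):
    """Product of two polynomials."""
    out = defaultdict(int)
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            out[tuple(a + b for a, b in zip(m1, m2))] += c1 * c2
    return {m: c for m, c in out.items() if c != 0}

def d_dy(p):
    """Differentiate in y, using dT_k/dy = T_{k+1} and the product rule."""
    out = defaultdict(int)
    for m, c in p.items():
        if m[K - 1] > 0:
            raise ValueError("increase K to differentiate T_K")
        for k in range(K - 1):
            if m[k] > 0:
                new = list(m)
                new[k] -= 1
                new[k + 1] += 1
                out[tuple(new)] += c * m[k]
    return {m: c for m, c in out.items() if c != 0}

def integration_by_parts_identity(A, n):
    """Return P = A T_n + T_{n-1} (dA/dy + A T_1), for which E[P] = 0."""
    inner = add(d_dy(A), mul(A, T(1)))
    return add(mul(A, T(n)), mul(T(n - 1), inner))

eq_from_T4T3T1 = integration_by_parts_identity(mul(T(3), T(1)), 4)
eq_from_T4T2sq = integration_by_parts_identity(T(2, 2), 4)
# z = (-2, 1): the coefficient of T_3^2 T_2 is 1 in the first identity, 2 in the second.
combination = add(eq_from_T4T2sq, eq_from_T4T3T1, scale=-2)
```

The resulting `combination` is $-4 T_4 T_3 T_1 + T_4 T_2^2 - 2 T_3^2 T_1^2 + T_3 T_2^2 T_1$, which only involves products of two entries of u.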

Matrix F 0 from the Derivative
Suppose we have already obtained the fourth derivative in the form (see Equation (30) later)
$$-2h^{(4)}(X + \sqrt{t}Z) = E\left[d_1^T w_{qua} + d_2^T w_{non}\right],$$
where $d_1 \in \mathbb{R}^{14}$, $d_2 \in \mathbb{R}^{1}$. Then, similar to $F_8$, we can find the matrix $F_0$ such that $-2h^{(4)}(X + \sqrt{t}Z) = E[u^T F_0 u]$. To cancel the non-quadratic term $d_2^T w_{non}$, we solve for $z_2$ such that $z_2^T C_{12} = d_2^T$ (the solution $z_2$ should exist, otherwise it is not possible to find a quadratic form and the LMI approach fails). Then, since $E[C_{11} w_{qua} + C_{12} w_{non}] = 0$, we have
$$-2h^{(4)}(X + \sqrt{t}Z) = E\left[d_1^T w_{qua} + z_2^T C_{12} w_{non}\right] = E\left[(d_1^T - z_2^T C_{11})\, w_{qua}\right].$$
Now, $F_0$ can be constructed from $d_1^T - z_2^T C_{11}$. The details are as follows. First, we need to express the derivative using the entries of w. This can be done recursively using the following lemma.

Lemma 4. Let A be a linear combination of terms of products of the T i . The following equalities hold:
The proof is left to Appendix A. Now, with Equation (7), $-2h''(X + \sqrt{t}Z) = E[T_2^2]$, and Equation (28), one can easily obtain the third derivative (Equation (29)) and, from it, the fourth derivative (Equation (30)). For the fourth derivative, one solves for $z_2$ such that $z_2^T C_{12} = d_2^T$. The resulting coefficient vector for $w_{qua}$ has nonzero entries at locations [1, 3, 7, 10, 11], with values [1, 6, 9, 7, 1], respectively, and $F_0$ (scaled by a factor of two) is constructed from it. By the end of this subsection, it is easy to see that Cheng and Geng's method can be reformulated as identifying if there exist $x_1, \ldots, x_8 \in \mathbb{R}$ such that
$$F_0 + \sum_{i=1}^{8} x_i F_i \succeq 0.$$
We use the convex optimization package [19] to identify the feasibility of the above LMI problem, and it turns out to be feasible, as it should be according to Equation (14).
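The recursive step from one derivative to the next can be sketched as follows. The rule used here is derived directly from the heat equation and is an assumption of this example, not necessarily the exact form of Lemma 4: with vanishing boundary terms, $\frac{d}{dt} E[A] = E\left[\frac{1}{2} \frac{\partial^2 A}{\partial y^2} + \frac{\partial A}{\partial t}\right]$, and $\frac{\partial T_k}{\partial t} = \frac{1}{2} \frac{\partial^k}{\partial y^k}(T_2 + T_1^2)$. Polynomials in the $T_k$ are dictionaries from exponent tuples to coefficients, and the helper names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

K = 6  # track T_1 .. T_6; a monomial is a tuple of exponents (e_1, ..., e_K)

def T(k, power=1):
    exps = [0] * K
    exps[k - 1] = power
    return {tuple(exps): Fraction(1)}

def add(p, q):
    out = defaultdict(Fraction, p)
    for m, c in q.items():
        out[m] += c
    return {m: c for m, c in out.items() if c != 0}

def scale(p, s):
    return {m: c * s for m, c in p.items() if c * s != 0}

def mul(p, q):
    out = defaultdict(Fraction)
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            out[tuple(a + b for a, b in zip(m1, m2))] += c1 * c2
    return {m: c for m, c in out.items() if c != 0}

def d_dy(p):
    """Differentiate in y, using dT_k/dy = T_{k+1} and the product rule."""
    out = defaultdict(Fraction)
    for m, c in p.items():
        if m[K - 1] > 0:
            raise ValueError("increase K to differentiate T_K")
        for k in range(K - 1):
            if m[k] > 0:
                new = list(m)
                new[k] -= 1
                new[k + 1] += 1
                out[tuple(new)] += c * m[k]
    return {m: c for m, c in out.items() if c != 0}

def d_dt_T(k):
    """dT_k/dt = (1/2) d^k/dy^k (T_2 + T_1^2), a consequence of the heat equation."""
    p = add(T(2), T(1, 2))
    for _ in range(k):
        p = d_dy(p)
    return scale(p, Fraction(1, 2))

def d_dt_expectation(A):
    """Polynomial B with d/dt E[A] = E[B], before any integration by parts."""
    out = scale(d_dy(d_dy(A)), Fraction(1, 2))
    for m, c in A.items():
        for k in range(K):
            if m[k] > 0:
                rest = list(m)
                rest[k] -= 1
                partial = {tuple(rest): c * m[k]}   # dA/dT_{k+1} for this monomial
                out = add(out, mul(partial, d_dt_T(k + 1)))
    return out

# Starting from -2h'' = E[T_2^2], this gives a polynomial for -2h'''.
minus_two_h3 = d_dt_expectation(T(2, 2))
```

The output is $T_3^2 + 2 T_2 T_4 + 2 T_2^3 + 2 T_1 T_2 T_3$; integrating $T_2 T_4$ by parts then reduces its expectation to $E[-T_3^2 + 2 T_2^3]$, so that $2h''' = E[T_3^2 + 2(-T_2)^3]$.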

Matrices G j from Log-Concavity
Recall that, in [9], there is no matrix G j , since there is no inequality constraint. In this paper, we consider the log-concave case T 2 ≤ 0, thus introducing inequality constraints.
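For the fourth order, $T_2 \le 0$ actually implies that the following entries in w are nonpositive:
$$T_2^3 T_1^2,\quad T_2 T_1^6,\quad T_2 T_3^2.$$
It is clear that the powers of $T_2$ are odd, and the others are even.
To transform these nonpositive terms into matrices $G_j$, the first two terms, $T_2^3 T_1^2$ and $T_2 T_1^6$, are trivial, since they can be expressed by u(i)u(j) directly: $T_2^3 T_1^2 = u(3)u(4)$ and $T_2 T_1^6 = u(4)u(5)$. For the term $T_2 T_3^2$, the idea is similar to the third part in Section 3.2. One first finds $z_3 \in \mathbb{R}^2$ such that $z_3^T C_{12} w_{non} = T_2 T_3^2$, namely $z_3^T C_{12} = 1$. The solution is $z_3^T = (0, 1/2)$. Then,
$$E[T_2 T_3^2] = E[z_3^T C_{12} w_{non}] = -E[z_3^T C_{11} w_{qua}] = -\frac{1}{2} E\left[T_4 T_2^2 + T_3 T_2^2 T_1\right],$$
which is a quadratic form in u and gives $G_3$. At this point, we are done with the procedure for calculating all these matrices $F_0$, the $F_i$ and the $G_j$. To show that the fourth derivative is nonpositive, it suffices to find a set of variables $x_i \in \mathbb{R}$ and $y_j \ge 0$ such that
$$F_0 + \sum_{i=1}^{8} x_i F_i + \sum_{j=1}^{3} y_j G_j \succeq 0.$$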

Remark 2.
The matrix $G_2$ is actually redundant, since we know that $E[T_2 T_1^6] \equiv -\frac{1}{7} E[T_1^8] \le 0$, which is already included in the matrices $F_i$ (in particular, matrix $F_7$ in Section 3.2). Including $G_2$ will not affect the feasibility check.

Matrix F̃ 0 for Gaussian Optimality
However, to show the optimality of the Gaussian, the above formulation is not enough. According to inequality (a) in Equation (11), it would suffice to show that
$$(-1)^{n-1}\, 2h^{(n)}(X + \sqrt{t}Z) - (n-1)!\, E[(-T_2)^n] \ge 0.$$
Thus, one needs to calculate the matrix $\tilde{F}_0$ such that
$$(-1)^{n-1}\, 2h^{(n)}(X + \sqrt{t}Z) - (n-1)!\, E[(-T_2)^n] = E[u^T \tilde{F}_0 u].$$
The procedure is the same as that in Section 3.3.
In particular, for the fourth derivative, since n = 4 is even, we directly have the quadratic form $(-T_2)^4 = u(3)u(3)$. It is straightforward to construct the matrix $\tilde{F}_0$ (scaled by a factor of two) from $F_0$ and this term. Again, we use the convex optimization package [19] to check the feasibility. It turns out to be feasible and the solution helps us to identify the coefficients in Equation (9).

Fifth Derivative
For the fifth derivative, we omit the details of the manipulations since they are routine, and just provide the matrices here. For brevity, we only list the nonzero entries of the upper-triangular part of each symmetric matrix. These matrices (with scaling) are $F_0$ and $F_1, \ldots, F_{16}$. For the sign of the fifth derivative, we used the convex optimization package [19] to solve the LMI problem
$$F_0 + \sum_{i=1}^{16} x_i F_i \succeq 0,$$
but could not find a feasible solution $x_1, \ldots, x_{16} \in \mathbb{R}$. This suggests to us that a direct generalization of Cheng and Geng's method may not work for the fifth derivative.
Instead, if we consider the log-concavity constraint $T_2 \le 0$ and check the optimality of Gaussian inputs, then we have a new matrix $\tilde{F}_0$ (similar to Section 3.5) and several matrices $G_1, \ldots, G_5$. Now, one would like to find $x_1, \ldots, x_{16} \in \mathbb{R}$ and $y_1, \ldots, y_5 \in \mathbb{R}_+$ such that
$$\tilde{F}_0 + \sum_{i=1}^{16} x_i F_i + \sum_{j=1}^{5} y_j G_j \succeq 0.$$
This can be solved by the convex optimization package [19]. Again, the solution helps us to arrive at Equation (10).

Proof of Theorem 1
Proof. For the third derivative, according to Equation (29), we have For the fourth derivative, according to Equation (30): Adding multiples of the left-hand sides of the equations: where (a) is due to Equation (19), and (b) is due to Equation (22).
For the fifth derivative, For each term above on the right-hand side: According to Equation (28), For the second term, Then, adding multiples of the left-hand sides of Equations (35)-(37), we have 2h (5)

On the Derivatives
We are not able to say anything conclusive about the sign of the fifth derivative of the differential entropy $h(X + \sqrt{t}Z)$. If we impose the log-concavity condition, namely $T_2 \le 0$, then twice the fifth derivative is at least $4! \times E[(-T_2)^5]$. This motivates us to consider the following problem: without additional constraints, what are the values $c_5 > 0$ such that
$$2h^{(5)}(X + \sqrt{t}Z) \ge c_5\, E[(-T_2)^5]\,?$$
If one finds such a value $c_5$, then so long as $E[(-T_2)^5] \ge 0$, the sign of the fifth derivative is determined. This condition is much weaker than $T_2 \le 0$.
For the computational part, one only needs to construct the matrix $\tilde{F}_0$ such that $2h^{(5)}(X + \sqrt{t}Z) - c_5\, E[(-T_2)^5] = E[u^T \tilde{F}_0 u]$, and then solve the problem (see Section 3.6 for the matrices $F_i$)
$$\tilde{F}_0 + \sum_{i=1}^{16} x_i F_i \succeq 0.$$
It turns out that $c_5 = 0.13$ works, while $c_5 = 0.125$ fails. The authors guess that every $c_5 \in [0.13, 24]$ works, but, at the moment, can only partly confirm this with limited simulation.
Notice that the third derivative of the entropy power $N(X + \sqrt{t}Z)$ was shown to be nonnegative under the log-concavity condition [5], and we recover this in Corollary 3. We also considered the fourth derivative, but failed to obtain the sign because we were unable to apply the Cauchy-Schwarz inequality as we did for the third derivative.

Possible Proofs
To prove Conjecture 1, besides the method proposed in [9], we are also considering the following ways: the first one is constructive and inspired by Equation (1). Given a random variable X, if we can construct a proper measure µ(·) such that Equation (1) holds, then one proves Conjecture 1. However, this is difficult even when X is binary symmetric, which is a very simple random variable.
The second one is recursive. Suppose one can find a formula for the n-th derivative such that then it is clear that However, this fails for n = 2 (see Equation (7) and Theorem 1). Instead, one may expect that and then If further one can show that E[−B 2 1 + B 2 k n +1 ] = E[−C 2 k n +1 ] for some C k n +1 , then one finishes the proof. Notice here that a clever observation is needed for this way to work.

Applications
The topic of Gaussian optimality has wide applications, for example in [20,21]. In this work, besides the Gaussian optimality, we also have some new observations. In [11], the derivatives in the signal-to-noise ratio (snr) of $I(X; \sqrt{\mathrm{snr}}\, X + Z)$ are studied. In particular, the first four derivatives are obtained in the language of the minimum mean-square error (Equations (69)-(72) in Corollary 1 of [11]). However, it is not clear whether some of these derivatives are signed or not.
With some standard manipulations, it is not difficult to show that
$$I(X; \sqrt{\mathrm{snr}}\, X + Z) = h\left(X + \frac{1}{\sqrt{\mathrm{snr}}} Z\right) + \frac{1}{2} \log \mathrm{snr} - \frac{1}{2} \log(2\pi e).$$
By letting t = 1/snr, one can easily connect the minimum mean-square error formulae in [11] with the signs of the derivatives of $h(X + \sqrt{t}Z)$ in t. The verification of Conjectures 1 and 2 would imply the bounding and extremal properties of Equations (69)-(72) in [11], and thus deepen our understanding of the minimum mean-square error estimation under the additive-Gaussian setting.
In addition, notice that the probability density function $f(y, t)$ of $Y_t = X + \sqrt{t}Z$ is the solution of the heat equation $\frac{\partial}{\partial t} f(y, t) = \frac{1}{2} \frac{\partial^2}{\partial y^2} f(y, t)$ with the initial condition that $f(y, 0) = f_X(y)$. Hence, Conjectures 1 and 2, if true, reveal the properties of the differential entropy of functions that satisfy the heat equation. For more results related to diffusion equations, one may refer to [22].
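The change of variable between snr and t can be checked in the one case where everything is available in closed form. The sketch below is an illustration assuming Gaussian X with variance `var` (the function names are hypothetical): it compares the classical formula $I = \frac{1}{2}\log(1 + \mathrm{snr} \cdot \sigma^2)$ with $h(X + \sqrt{t}Z) - \frac{1}{2}\log(2\pi e\, t)$ evaluated at t = 1/snr, which is the same identity after scaling the channel output by $1/\sqrt{\mathrm{snr}}$.

```python
import math

def mutual_information_gaussian(snr, var):
    """I(X; sqrt(snr) X + Z) for Gaussian X with variance var, standard Gaussian Z."""
    return 0.5 * math.log(1.0 + snr * var)

def entropy_gaussian_plus_noise(t, var):
    """h(X + sqrt(t) Z) for Gaussian X with variance var."""
    return 0.5 * math.log(2.0 * math.pi * math.e * (var + t))

def mutual_information_from_entropy(snr, var):
    """Use I = h(X + sqrt(t) Z) - (1/2) log(2 pi e t) with t = 1/snr."""
    t = 1.0 / snr
    return entropy_gaussian_plus_noise(t, var) - 0.5 * math.log(2.0 * math.pi * math.e * t)

# Largest discrepancy between the two expressions over a few snr values.
max_gap = max(
    abs(mutual_information_gaussian(snr, 2.0) - mutual_information_from_entropy(snr, 2.0))
    for snr in (0.1, 1.0, 10.0)
)
```

The two expressions agree up to floating-point error, so derivatives in snr translate into derivatives in t through the chain rule with t = 1/snr.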

Conclusions
In this paper, we studied two conjectures on the derivatives of the differential entropy of a general random variable with added Gaussian noise. Regarding the conjecture on the signs of the derivatives made by Cheng and Geng, we introduced the linear matrix inequality approach to provide evidence that their original method might not generalize to orders higher than four. Instead, we considered imposing an additional constraint, namely the log-concavity assumption, and showed the optimality of Gaussian random variables for orders three, four and five. Thus, we made progress on McKean's conjecture, under a mild condition.