A Log-Det Heuristics for Covariance Matrix Estimation: The Analytic Setup

This paper studies a new nonconvex optimization problem aimed at recovering high-dimensional covariance matrices with a low rank plus sparse structure. The objective is composed of a smooth nonconvex loss and a nonsmooth composite penalty. A number of structural analytic properties of the new heuristics are presented and proven, thus providing the necessary framework for further investigating the statistical applications. In particular, the first and the second derivative of the smooth loss are obtained, its local convexity range is derived, and the Lipschitzianity of its gradient is shown. This opens the path to solve the described problem via a proximal gradient algorithm.


Introduction
The estimation of large covariance or precision matrices is a relevant challenge nowadays, due to the increasing availability, in many fields, of datasets composed of a large number of variables $p$ compared to the sample size $n$. The relevance of this topic is attested by several recent books [1][2][3] and comprehensive reviews [4][5][6]. In this paper, we assume for the $p \times p$ covariance matrix $\Sigma^*$ a low rank plus sparse decomposition, that is,
$$\Sigma^* = L^* + S^*, \qquad (1)$$
where $L^* = BB' = U_L \Lambda_L U_L'$, $U_L$ is a $p \times r$ matrix such that $U_L' U_L = I_r$, $\Lambda_L$ is an $r \times r$ diagonal matrix, and $S^*$ is element-wise sparse, i.e., it contains only $s \ll \frac{p(p-1)}{2}$ off-diagonal non-zero elements. Since [7] proposed their approximate factor model, structure (1) has become the reference model for many high-dimensional covariance matrix estimators, like POET [8].
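As a concrete illustration, the following minimal Python sketch builds a covariance matrix with structure (1); the dimension, the rank, and the magnitude of the sparse entries are arbitrary choices made here for illustration only and are not taken from the paper.

```python
# Illustrative construction of Sigma* = L* + S* with L* = B B' of rank r
# and S* sparse and positive definite (diagonally dominant by construction).
import numpy as np

rng = np.random.default_rng(0)
p, r = 50, 3

B = rng.normal(size=(p, r))          # p x r loadings
L_star = B @ B.T                     # low rank part, rank r

S_star = np.eye(p)                   # sparse part: identity plus a few off-diagonal entries
for i, j in rng.choice(p, size=(10, 2)):
    if i != j:
        S_star[i, j] = S_star[j, i] = 0.05

Sigma_star = L_star + S_star         # target covariance matrix with structure (1)
print(np.linalg.matrix_rank(L_star), np.all(np.linalg.eigvalsh(Sigma_star) > 0))
```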
The recovery of structure (1) is a statistical problem of primary relevance. Ref. [7] proposed to consistently estimate $L^*$ (as $p \to \infty$) by means of principal component analysis (PCA, see [9]), assuming that the eigenvalues of $L^*$ diverge with the dimension $p$ while the eigenvalues of $S^*$ remain bounded. Ref. [8] proposes to estimate $L^*$ by the top $r$ principal components of the sample covariance matrix $\Sigma_n$ (as $p \to \infty$) and to estimate $S^*$ by thresholding their orthogonal complement. In [10], $L^*$ and $S^*$ are recovered by nuclear norm plus $l_1$ penalization, that is, by computing
$$(\hat L, \hat S) = \arg\min_{L \succeq 0,\, S \succ 0}\ \mathcal{L}(L, S) + \mathcal{P}(L, S), \qquad (2)$$
where $\mathcal{L}(L, S)$ is a smooth loss function, $\mathcal{P}(L, S)$ is a nonsmooth penalty function, $L \succeq 0$ denotes positive semidefiniteness for $L$, and $S \succ 0$ denotes positive definiteness for $S$. In particular, denoting by $\lambda_i(M)$, $i = 1, \ldots, p$, the eigenvalues of a $p \times p$ matrix $M$ sorted in descending order, $\mathcal{L}(L, S) = \frac{1}{2}\|\Sigma_n - (L + S)\|_F^2$ and $\mathcal{P}(L, S) = \psi\|L\|_* + \rho\|S\|_1$, where $\|L\|_* = \sum_{i=1}^{p}\lambda_i(L)$ (the nuclear norm of $L$), $\|S\|_1 = \sum_{j \ne i}|S_{ij}|$ (the $l_1$ norm of $S$), and $\psi$ and $\rho$ are non-negative threshold parameters.
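The loss and penalty entering (2) are straightforward to evaluate numerically. The sketch below is an illustrative implementation only: the function names are ours, `psi` and `rho` are placeholders for the threshold parameters, and the $l_1$ penalty is restricted to off-diagonal entries as in the definition above.

```python
# Illustrative evaluation of the Frobenius loss and composite penalty in (2).
import numpy as np

def frobenius_loss(Sigma_n, L, S):
    return 0.5 * np.linalg.norm(Sigma_n - (L + S), 'fro') ** 2

def penalty(L, S, psi, rho):
    nuclear = np.linalg.svd(L, compute_uv=False).sum()   # ||L||_* = sum of singular values
    l1_off = np.abs(S - np.diag(np.diag(S))).sum()       # ||S||_1 over off-diagonal entries
    return psi * nuclear + rho * l1_off
```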
The nuclear norm was first proposed in [11] as an alternative to PCA. Ref. [12] provides a proof that $\psi\|L\|_* + \rho\|S\|_1$ is the tightest convex relaxation of the original nonconvex penalty $\psi\,\mathrm{rk}(L) + \rho\|S\|_0$. Ref. [13] proves that $l_1$ norm minimization provides the sparsest solution to most large underdetermined linear systems, while [14] proves that nuclear norm minimization provides guaranteed rank minimization under a set of linear equality constraints. Ref. [15] shows that $l_1$ norm minimization selects the best linear model in a wide range of situations. The nuclear norm has also been used to solve large matrix completion problems, as in [16][17][18] and [19]. Nuclear norm plus $l_1$ norm minimization was first exploited in [20] to provide a robust version of PCA under grossly corrupted or missing data.
The pair of estimators (2) derived in [10] is named ALCE (ALgebraic Covariance Estimator). Although ALCE has many desirable statistical properties, there is room to further improve it by replacing $\frac{1}{2}\|\Sigma_n - (L + S)\|_F^2$ with a different loss. The Frobenius loss in fact optimizes the entry-by-entry performance of $\hat\Sigma$, while a loss able to explicitly control the quality of spectrum estimation may be desirable. In this paper, we consider the loss
$$\mathcal{L}(L, S) = \frac{1}{2}\log\det\big(I_p + \Delta_n\Delta_n'\big), \qquad (3)$$
where $\Delta_n = \Sigma - \Sigma_n$ and $\Sigma = L + S$. Heuristics (3) is controlled by the individual singular values of $\Delta_n$, because
$$\frac{1}{2}\log\det\big(I_p + \Delta_n\Delta_n'\big) = \frac{1}{2}\sum_{i=1}^{p}\log\big(1 + \sigma_i(\Delta_n)^2\big),$$
where $\sigma_i(\Delta_n)$, $i = 1, \ldots, p$, are the singular values of $\Delta_n$; therefore, it is better suited for the estimation of the underlying spectrum.
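The link between (3) and the singular values of $\Delta_n$ can be verified numerically. The following snippet is only an illustrative check with an arbitrary random symmetric $\Delta_n$; it confirms that the log-determinant equals the sum of the $\log(1 + \sigma_i^2)$ terms.

```python
# Numerical check of the singular-value representation of the loss (3).
import numpy as np

rng = np.random.default_rng(1)
p = 20
Delta = rng.normal(size=(p, p))
Delta = (Delta + Delta.T) / 2                       # symmetric, as Sigma and Sigma_n are

loss_logdet = 0.5 * np.linalg.slogdet(np.eye(p) + Delta @ Delta.T)[1]

sv = np.linalg.svd(Delta, compute_uv=False)
loss_sv = 0.5 * np.log1p(sv ** 2).sum()

print(np.isclose(loss_logdet, loss_sv))             # True: the loss depends only on the singular values
```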
To the best of our knowledge, the mathematical properties of (3) have not been extensively studied. Analogously to the univariate context ($p = 1$), (3) is not a convex function. According to recent works like [21], nonconvex problems may be approached either by searching for approximate solutions instead of global solutions, or by exploiting the geometric structure of the objective function. Furthermore, in this case, the idea of restricting the analysis to the convexity region of the objective, a region that may be indefinitely extended (see the concept of Extendable Local Strong Convexity in [22]), is the key to applying, for instance, existing proximal gradient algorithms for convex functions (see [23]). For this reason, in this paper we calculate the first and second derivatives of (3), we derive its range of local convexity, and we prove the Lipschitz continuity of (3) and of its gradient. This opens the path to using the usual proximal gradient algorithms (see [23]) to solve problem (2) with $\mathcal{L}(L, S)$ as in (3).

Analytic Setup
We consider the objective function
$$\phi(L, S) = \mathcal{L}(L, S) + \mathcal{P}(L, S),$$
where $\mathcal{L}(L, S) = \frac{1}{2}\log\det(I_p + \Delta_n\Delta_n')$ is the smooth part of $\phi(L, S)$ and $\mathcal{P}(L, S) = \psi\|L\|_* + \rho\|S\|_1$ is the non-smooth (but convex) part of $\phi(L, S)$.

First, we calculate the derivative of the smooth component $\mathcal{L}(L, S)$ with respect to $L$ and $S$, which is
$$\frac{\partial \mathcal{L}(L, S)}{\partial L} = \frac{\partial \mathcal{L}(L, S)}{\partial S} = \big(I_p + \Delta_n\Delta_n'\big)^{-1}\Delta_n. \qquad (6)$$

Proof. Let us consider two generic $p \times p$ matrices $L$ and $S$, their sum $\Sigma = L + S$, and the matrix $\Delta_n = \Sigma - \Sigma_n$. Let us define the matrix function $\varphi(\Sigma) = I_p + \Delta_n\Delta_n'$ and the function $\phi(\Sigma) = \log\det\varphi(\Sigma)$. We denote by $e_i$ the $i$-th canonical basis vector, by $e_i^l = \delta_{il}$ its $l$-th element, and by $\sigma_{ij}$ the $ij$ entry of $\Sigma$. Then, following [24], for each $i, j = 1, \ldots, p$, we can write
$$\frac{\partial \phi(\Sigma)}{\partial \sigma_{ij}} = \mathrm{tr}\!\left(\varphi(\Sigma)^{-1}\,\frac{\partial \varphi(\Sigma)}{\partial \sigma_{ij}}\right).$$
Therefore, since $\partial \Delta_n / \partial \sigma_{ij} = e_i e_j'$,
$$\frac{\partial \varphi(\Sigma)}{\partial \sigma_{ij}} = e_i e_j'\Delta_n' + \Delta_n e_j e_i'.$$
Since for $A$, $B$, $C$ conformable matrices $\mathrm{tr}(ABC) = \mathrm{tr}(BCA) = \mathrm{tr}(CAB)$, we obtain
$$\frac{\partial \phi(\Sigma)}{\partial \sigma_{ij}} = \big[\Delta_n'\varphi(\Sigma)^{-1}\big]_{ji} + \big[\varphi(\Sigma)^{-1}\Delta_n\big]_{ij}.$$
Finally, considering that $\Delta_n$ and $\varphi(\Sigma)$ are symmetric, the two summands coincide. To sum up,
$$\frac{\partial \mathcal{L}(L, S)}{\partial \Sigma} = \frac{1}{2}\,\frac{\partial \phi(\Sigma)}{\partial \Sigma} = \varphi(\Sigma)^{-1}\Delta_n = \big(I_p + \Delta_n\Delta_n'\big)^{-1}\Delta_n,$$
and the same expression holds for the derivatives with respect to $L$ and $S$, because $\Sigma = L + S$. □

In the following, we explicit the second derivative of $\mathcal{L}(L, S) = \frac{1}{2}\log\det\varphi(\Sigma)$, with $\varphi(\Sigma) = I_p + \Delta_n\Delta_n'$ and $\Sigma = L + S$:
$$\frac{\partial^2 \mathcal{L}(L, S)}{\partial \sigma_{hk}\,\partial \sigma_{ij}} = \Big[\varphi(\Sigma)^{-1} e_h e_k' - \varphi(\Sigma)^{-1}\big(e_h e_k'\Delta_n' + \Delta_n e_k e_h'\big)\varphi(\Sigma)^{-1}\Delta_n\Big]_{ij}. \qquad (9)$$
Moreover, if $\Sigma = \Sigma_n$, we get $\Delta_n = 0$ and $\varphi(\Sigma) = I_p$, so that
$$\frac{\partial^2 \mathcal{L}(L, S)}{\partial \sigma_{hk}\,\partial \sigma_{ij}} = \delta_{ih}\,\delta_{jk},$$
that is, the Hessian of $\mathcal{L}(L, S)$ at $\Sigma = \Sigma_n$ is the identity.

Proof. From [24], we write
$$\frac{\partial^2 \mathcal{L}(L, S)}{\partial \sigma_{hk}\,\partial \sigma_{ij}} = \frac{\partial}{\partial \sigma_{hk}}\big[\varphi(\Sigma)^{-1}\Delta_n\big]_{ij},$$
and we recall that
$$\frac{\partial \varphi(\Sigma)^{-1}}{\partial \sigma_{hk}} = -\,\varphi(\Sigma)^{-1}\,\frac{\partial \varphi(\Sigma)}{\partial \sigma_{hk}}\,\varphi(\Sigma)^{-1}. \qquad (12)$$
Then, we can calculate
$$\frac{\partial}{\partial \sigma_{hk}}\big[\varphi(\Sigma)^{-1}\Delta_n\big]_{ij} = \big[\varphi(\Sigma)^{-1} e_h e_k'\big]_{ij} + \Big[\frac{\partial \varphi(\Sigma)^{-1}}{\partial \sigma_{hk}}\,\Delta_n\Big]_{ij}.$$
The second summand can be derived from (12) as
$$\Big[-\,\varphi(\Sigma)^{-1}\big(e_h e_k'\Delta_n' + \Delta_n e_k e_h'\big)\varphi(\Sigma)^{-1}\Delta_n\Big]_{ij}.$$
Equation (9) is consequently proved. □
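A quick finite-difference check of the gradient formula (6) can be carried out as follows. The snippet is a sketch only (the dimension and the random matrices are arbitrary) and is not part of the original derivation.

```python
# Finite-difference verification of (6): d/dSigma [0.5 log det(I + Delta Delta')] = (I + Delta Delta')^{-1} Delta.
import numpy as np

rng = np.random.default_rng(2)
p = 8
Sigma_n = rng.normal(size=(p, p)); Sigma_n = Sigma_n @ Sigma_n.T     # a PSD "sample covariance"
Sigma = rng.normal(size=(p, p));   Sigma = (Sigma + Sigma.T) / 2     # a symmetric candidate Sigma

def loss(Sigma):
    D = Sigma - Sigma_n
    return 0.5 * np.linalg.slogdet(np.eye(p) + D @ D.T)[1]

D = Sigma - Sigma_n
grad_analytic = np.linalg.solve(np.eye(p) + D @ D.T, D)              # (I + DD')^{-1} D

eps = 1e-6
grad_numeric = np.zeros((p, p))
for i in range(p):
    for j in range(p):
        E = np.zeros((p, p)); E[i, j] = eps
        grad_numeric[i, j] = (loss(Sigma + E) - loss(Sigma - E)) / (2 * eps)

print(np.max(np.abs(grad_analytic - grad_numeric)))                  # small: the two gradients agree entrywise
```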

Local Convexity
The aim of this section is to determine the range of convexity of $\mathcal{L}(L, S) = \frac{1}{2}\log\det(I_p + \Delta_n\Delta_n')$, $\Delta_n = \Sigma - \Sigma_n$, with respect to the positive semidefinite matrix $\Delta_n\Delta_n'$. In the univariate context, the function $\frac{1}{2}\log(1 + x^2)$ is convex if and only if $|x| \le 1$. In the multivariate context, it is therefore reasonable to suppose that a similar condition on $\Delta_n\Delta_n'$ ensures local convexity. A proof can be given by showing the positive definiteness of the Hessian of $\mathcal{L}(L, S)$ for some range of $\Delta_n\Delta_n'$. In other words, we need to show that there exists a positive $\delta$ such that, whenever $\|\Delta_n\Delta_n'\|_2 < \delta$, the function $\frac{1}{2}\log\det(I_p + \Delta_n\Delta_n')$ is convex.

Lemma 1.
Given $0 < \mu \le \frac{1}{3p}$, we have that the function
$$f(A) = \log\det\Big(\frac{1}{3p\mu}\,I_p + AA^*\Big)$$
is convex on the set $C_\mu = \{A \,|\, A \text{ is a real } p \times p \text{ matrix},\ \|A\|_2 \le \mu\}$, where $\|A\|_2$ denotes the spectral norm of $A$.
Proof. We proceed using the criterion of convexity along segments, estimating the second derivative with respect to $t$ of
$$g(t) = f\big(B + t\Lambda\big), \qquad \Lambda = A - B, \quad t \in [0, 1],$$
for arbitrary $A, B \in C_\mu$. Let us recall that
$$\frac{d}{dt}\log\det G(t) = \mathrm{tr}\big(G(t)^{-1}G'(t)\big), \qquad (15)$$
where $G(t)$ is a differentiable square matrix-valued function and (15) holds for those values of $t$ for which $G(t)$ is invertible. Furthermore, we have as well
$$\frac{d}{dt}\,A(t)^{-1} = -\,A(t)^{-1}A'(t)A(t)^{-1} \qquad (16)$$
for any differentiable square matrix-valued function $A(t)$ (see [25] or [26], e.g., for how to prove these identities). Calling $Q(t) = B + t\Lambda$ and $R = \Lambda B^* + B\Lambda^*$, we have $g(t) = \log\det G(t)$ with
$$G(t) = \frac{1}{3p\mu}\,I_p + Q(t)Q(t)^* = \frac{1}{3p\mu}\,I_p + t^2\Lambda\Lambda^* + tR + BB^*.$$
Thus, applying (15) and (16), we get
$$g'(t) = \mathrm{tr}\big(G(t)^{-1}G'(t)\big), \qquad G'(t) = 2t\,\Lambda\Lambda^* + R = \Lambda Q(t)^* + Q(t)\Lambda^*,$$
and
$$g''(t) = 2\,\mathrm{tr}\big(G(t)^{-1}\Lambda\Lambda^*\big) - \mathrm{tr}\big(G(t)^{-1}G'(t)\,G(t)^{-1}G'(t)\big). \qquad (17)$$
Convexity will follow once we have proven that (17) is non-negative for every $t \in [0, 1]$ and every $A$ and $B$ in $C_\mu$.

Due to the circularity of the trace function, we also have
$$\mathrm{tr}\big(G^{-1}\Lambda\Lambda^*\big) = \mathrm{tr}\big(G^{-1/2}\Lambda\Lambda^*G^{-1/2}\big), \qquad \mathrm{tr}\big(G^{-1}G'G^{-1}G'\big) = \mathrm{tr}\big((G^{-1/2}G'G^{-1/2})^2\big),$$
that is,
$$g''(t) = 2\,\mathrm{tr}\big(G^{-1/2}\Lambda\Lambda^*G^{-1/2}\big) - \mathrm{tr}\big((G^{-1/2}G'G^{-1/2})^2\big). \qquad (19)$$
We recall that $G(t)$ is self-adjoint (and positive definite), so that, denoting by $H$ the matrix $G^{-1/2}\Lambda$ and by $K$ the matrix $G^{-1/2}G'G^{-1/2}$, we get that (19) can be written as
$$g''(t) = 2\,\mathrm{tr}(HH^*) - \mathrm{tr}(K^2). \qquad (21)$$
We also recall that $\mathrm{Tr}(AB^*)$ induces a scalar product to which the trace norm is attached:
$$\|A\|_{tr} = \sum_{i=1}^{p}\sigma_i(A),$$
where $\sigma_i(A)$ are the singular values of $A$. In particular, we have $\|A\|_2 \le \|A\|_{tr} \le p\,\|A\|_2$ for every $A$. Now, from (21), convexity can be checked by proving that
$$\mathrm{tr}(K^2) \le 2\,\mathrm{tr}(HH^*). \qquad (22)$$
Let us consider $K$. Since $G'(t) = \Lambda Q(t)^* + Q(t)\Lambda^*$, we have
$$K = H W^* + W H^*, \qquad W = G(t)^{-1/2}Q(t).$$
Notice that the spectral norm $\|\cdot\|_2$ is self-adjoint, that is $\|M^*\|_2 = \|M\|_2$ for every $M$ (see e.g., [27]). Then, since $K$ is self-adjoint,
$$\mathrm{tr}(K^2) \le \|K\|_2\,\|K\|_{tr} \le p\,\|K\|_2^2, \qquad \|K\|_2 \le 2\,\|H\|_2\,\|W\|_2.$$
Thus,
$$\mathrm{tr}(K^2) \le 4p\,\|W\|_2^2\,\|H\|_2^2 \le 4p\,\|W\|_2^2\,\mathrm{tr}(HH^*).$$
Assume now that $A, B \in C_\mu$: $\|A\|_2 \le \mu$, $\|B\|_2 \le \mu$. We deduce that $\|\Lambda\|_2 \le 2\mu$ and $\|Q(t)\|_2 \le \mu$ for every $t \in [0, 1]$, and, due to the structure of $G(t) = \frac{1}{3p\mu}I_p + Q(t)Q(t)^*$, we also have $\|G(t)^{-1}\|_2 \le 3p\mu$. Finally, we have
$$\|W\|_2^2 \le \|G(t)^{-1}\|_2\,\|Q(t)\|_2^2 \le 3p\mu\cdot\mu^2 \le \mu^2 \le \frac{1}{9p^2}.$$
Going back to (22), we have
$$\mathrm{tr}(K^2) \le \frac{4p}{9p^2}\,\mathrm{tr}(HH^*) = \frac{4}{9p}\,\mathrm{tr}(HH^*) \le 2\,\mathrm{tr}(HH^*),$$
so that $g''(t) \ge 0$, since $0 < \mu \le \frac{1}{3p}$. □
By means of a simple change of variable, the following result can be proven.

Lemma 2.
For any $\delta > 0$, the function
$$f_\delta(A) = \log\det\big(\delta^{-2}I_p + AA^*\big)$$
is convex on the closed ball $C_\delta = \{A \,|\, A \text{ is a real } p \times p \text{ matrix},\ \|A\|_2 \le \frac{1}{3\delta p}\}$. Indeed, applying Lemma 1 with $\mu = \frac{1}{3p}$ and substituting $\delta A$ for $A$ yields the claim, since $\log\det(\delta^{-2}I_p + AA^*) = -2p\log\delta + \log\det\big(I_p + (\delta A)(\delta A)^*\big)$.
In conclusion, even though the function $\log\det(I_p + A)$ is always concave, Lemma 2 shows that the function $\log\det(\delta^{-2}I_p + AA^*)$ can be made locally convex on any ball centered at $0$, just by choosing a suitable $\delta$.
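The local nature of this convexity can also be observed numerically. The sketch below estimates the second derivative of $t \mapsto \log\det(I_p + Q(t)Q(t)')$ along a segment, for matrices whose spectral norm equals the Lemma 1 radius $1/(3p)$ and, for comparison, for much larger matrices; the dimension and scales are illustrative choices, not values from the paper.

```python
# Illustrative curvature check: convex inside the small ball, possibly not outside it.
import numpy as np

rng = np.random.default_rng(3)
p = 10

def second_derivative(A, B, t, h=1e-4):
    def g(s):
        Q = B + s * (A - B)
        return np.linalg.slogdet(np.eye(p) + Q @ Q.T)[1]
    return (g(t + h) - 2 * g(t) + g(t - h)) / h ** 2

for scale in [1.0 / (3 * p), 5.0]:                  # inside the Lemma 1 ball vs far outside
    A = rng.normal(size=(p, p)); A *= scale / np.linalg.norm(A, 2)
    B = rng.normal(size=(p, p)); B *= scale / np.linalg.norm(B, 2)
    curv = min(second_derivative(A, B, t) for t in np.linspace(0.05, 0.95, 19))
    print(scale, curv)   # non-negative for the small scale; may turn negative for the large one
```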

Lemma 3.
The function $\mathcal{L}(L, S) = \frac{1}{2}\ln\det(I_p + \Delta_n\Delta_n')$ is Lipschitz continuous in Euclidean norm with Lipschitz constant equal to 1.

Proof. Let us recall that $L$ and $S$ are two generic $p \times p$ matrices, $\Sigma = L + S$ is their sum, and $\Delta_n = \Sigma - \Sigma_n$. We reconsider the matrix function $\varphi(\Sigma) = I_p + \Delta_n\Delta_n'$ and the function $\phi(\Sigma) = \log\det\varphi(\Sigma)$. We recall from (6) that
$$\frac{\partial}{\partial \Sigma}\,\frac{1}{2}\log\det\varphi(\Sigma) = \varphi(\Sigma)^{-1}\Delta_n.$$
Given two vectors $u, v \in \mathbb{R}^p$, let us define the Euclidean inner product $\langle u, v\rangle = u'v$ and denote by $|v|$ the Euclidean norm of $v \in \mathbb{R}^p$. Since $\Delta_n$ is symmetric, we consider
$$\langle \varphi(\Sigma)v, v\rangle = |v|^2 + |\Delta_n v|^2.$$
Then we have, via Cauchy–Schwarz,
$$2\,|v|\,|\Delta_n v| \le |v|^2 + |\Delta_n v|^2 = \langle \varphi(\Sigma)v, v\rangle \le |\varphi(\Sigma)v|\,|v|,$$
so that $4|\Delta_n v|^2 \le |\varphi(\Sigma)v|^2$ for every $v \in \mathbb{R}^p$. Noticing that $\varphi(\Sigma)$ is invertible and plugging $v = \varphi(\Sigma)^{-1}w$ into the previous inequality, we obtain $4|\Delta_n\varphi(\Sigma)^{-1}w|^2 \le |w|^2$ for every $w \in \mathbb{R}^p$. Now recall (see [25], p. 312) that the spectral norm of a matrix $A$, $\|A\|_2$, can be computed also via the equality
$$\|A\|_2 = \sup_{|v| = 1}|Av|,$$
and that the spectral norm is self-adjoint (again see [25], p. 309), that is $\|A'\|_2 = \|A\|_2$. Summing up, we have proved that
$$\big\|\varphi(\Sigma)^{-1}\Delta_n\big\|_2 = \big\|\Delta_n\varphi(\Sigma)^{-1}\big\|_2 \le \frac{1}{2}. \qquad (28)$$
This means that the gradient of $\log\det\varphi(\Sigma)$, which equals $2\varphi(\Sigma)^{-1}\Delta_n$, is uniformly bounded in spectral norm by 1 and, since $\Sigma \mapsto \log\det\varphi(\Sigma)$ is a smooth function, the Lipschitz condition is satisfied with Lipschitz constant equal to 1. We have proven that the function is Lipschitz continuous.

Now, we prove that the gradient map
$$F(\Delta_n) = \varphi^{-1}(\Sigma)\Delta_n = \big(I_p + \Delta_n\Delta_n'\big)^{-1}\Delta_n$$
is Lipschitz continuous with Lipschitz constant equal to $\frac{5}{4}$. Fix a direction $\Xi$ with $\|\Xi\|_2 \le 1$ and, for any $\varepsilon > 0$, consider $F(\Delta_n + \varepsilon\,\Xi)$.
Calling $\Psi = I_p + \Delta_n\Delta_n'$, we have
$$F(\Delta_n + \varepsilon\,\Xi) = \big(I_p + (\Delta_n + \varepsilon\,\Xi)(\Delta_n + \varepsilon\,\Xi)'\big)^{-1}(\Delta_n + \varepsilon\,\Xi),$$
so that, recalling that
$$\frac{d}{d\varepsilon}\,G(\varepsilon)^{-1} = -\,G(\varepsilon)^{-1}\,\frac{dG(\varepsilon)}{d\varepsilon}\,G(\varepsilon)^{-1},$$
we develop in the powers of $\varepsilon$. A tedious but simple computation yields
$$F(\Delta_n + \varepsilon\,\Xi) = F(\Delta_n) + \varepsilon\Big[\Psi^{-1}\Xi - \Psi^{-1}\big(\Xi\Delta_n' + \Delta_n\Xi'\big)\Psi^{-1}\Delta_n\Big] + O(\varepsilon^2).$$
The previous computations for the Lipschitzianity of $\log\det(I_p + \Delta_n\Delta_n')$ gave us (see (28)) that $\|\Psi^{-1}\Delta_n\|_2 \le \frac{1}{2}$. It is also easy to check that $\|\Psi^{-1}\|_2 \le 1$, and that $\|I_p - \Delta_n'\Psi^{-1}\Delta_n\|_2 \le 1$. Putting all together, and writing
$$\Psi^{-1}\Xi - \Psi^{-1}\big(\Xi\Delta_n' + \Delta_n\Xi'\big)\Psi^{-1}\Delta_n = \Psi^{-1}\Xi\big(I_p - \Delta_n'\Psi^{-1}\Delta_n\big) - \Psi^{-1}\Delta_n\,\Xi'\,\Psi^{-1}\Delta_n,$$
we get
$$\Big\|\Psi^{-1}\Xi\big(I_p - \Delta_n'\Psi^{-1}\Delta_n\big) - \Psi^{-1}\Delta_n\,\Xi'\,\Psi^{-1}\Delta_n\Big\|_2 \le \|\Xi\|_2 + \frac{1}{4}\,\|\Xi\|_2 = \frac{5}{4}\,\|\Xi\|_2,$$
so that we have proven that the directional derivative of the gradient is bounded in every direction by $\frac{5}{4}$, i.e., the gradient is Lipschitz as a function from $M(p)$ to $M(p)$, the vector space of $p \times p$ real matrices. □
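The bound of Lemma 3 on the gradient map $F(\Delta_n) = (I_p + \Delta_n\Delta_n')^{-1}\Delta_n$ can also be probed empirically. The following sketch (with arbitrary random symmetric matrices, not data from the paper) computes the ratio $\|F(\Delta_1) - F(\Delta_2)\|_2 / \|\Delta_1 - \Delta_2\|_2$ over random pairs, which should stay below $5/4$.

```python
# Empirical probe of the Lipschitz bound 5/4 for the gradient map F.
import numpy as np

rng = np.random.default_rng(4)
p = 15

def F(D):
    return np.linalg.solve(np.eye(p) + D @ D.T, D)

worst = 0.0
for _ in range(200):
    D1 = rng.normal(size=(p, p)); D1 = (D1 + D1.T) / 2       # symmetric Delta_1
    D2 = D1 + 0.1 * rng.normal(size=(p, p)); D2 = (D2 + D2.T) / 2
    ratio = np.linalg.norm(F(D1) - F(D2), 2) / np.linalg.norm(D1 - D2, 2)
    worst = max(worst, ratio)

print(worst)   # stays below 5/4 in these trials
```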

Discussion
In this paper, we have proved that the loss $\frac{1}{2}\log\det(I_p + \Delta_n\Delta_n')$ has good analytic properties for the purpose of optimization, provided that the matrix $\Delta_n$ fulfills certain conditions. As a consequence, by [23,28] and the supplement of [10], it follows that our analytic setup can provide a numerical solution to the problem
$$\min_{L \succeq 0,\, S \succ 0}\ \frac{1}{2}\log\det\big(I_p + \Delta_n\Delta_n'\big) + \psi\|L\|_* + \rho\|S\|_1 \qquad (33)$$
by using proximal gradient algorithms. The local convexity of $\log\det(I_p + \Delta_n\Delta_n')$ is the key to applying first-order methods to solve (33). Following [23,28] and the supplement of [10], we derive the solution scheme reported in Algorithm 1. Such an algorithm may be applied in many fields, like economics, finance, biology, genetics, health, climatology, and social science, among others. In future research, we plan to properly develop the selection of the threshold parameters, to study how local convexity may cope with the random nature of the sample error matrix $\Delta_n$, and to establish the consistency of the solution pair of (33).
Algorithm 1: Pseudocode to solve problem (33) given any input covariance matrix $\Sigma_n$.

4. Set $L_{ALCE} = Y_t$ and $S_{ALCE} = Z_t$.
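Since only the final step of Algorithm 1 is reproduced above, the following Python sketch illustrates one way a proximal gradient scheme for (33) could look, assuming a fixed step size, singular value thresholding as the proximal operator of the nuclear norm, and entrywise soft-thresholding of the off-diagonal entries for the $l_1$ penalty. It is not the paper's Algorithm 1: the iterate names (`Y`, `Z`), the initialization, and the stopping rule are placeholders, and the positive (semi)definiteness constraints and any safeguard keeping the iterates inside the local convexity region are omitted.

```python
# Hedged sketch of a generic proximal gradient loop for problem (33).
import numpy as np

def svt(M, tau):
    """Singular value thresholding: prox of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft_threshold_offdiag(M, tau):
    """Entrywise soft-thresholding of the off-diagonal entries: prox of tau * off-diagonal l1 norm."""
    T = np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)
    np.fill_diagonal(T, np.diag(M))
    return T

def proximal_gradient(Sigma_n, psi, rho, step=0.5, n_iter=200):
    p = Sigma_n.shape[0]
    Y, Z = np.zeros((p, p)), np.diag(np.diag(Sigma_n))               # initial low rank and sparse iterates
    for _ in range(n_iter):
        Delta = (Y + Z) - Sigma_n
        G = np.linalg.solve(np.eye(p) + Delta @ Delta.T, Delta)      # gradient (6) of the smooth loss
        Y = svt(Y - step * G, step * psi)                            # prox step on the low rank part
        Z = soft_threshold_offdiag(Z - step * G, step * rho)         # prox step on the sparse part
    return Y, Z                                                      # step 4: L_ALCE = Y_t, S_ALCE = Z_t

# Example usage (psi and rho are arbitrary placeholders):
# L_hat, S_hat = proximal_gradient(Sigma_n, psi=0.1, rho=0.05)
```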