Abstract
This paper studies a new nonconvex optimization problem aimed at recovering high-dimensional covariance matrices with a low rank plus sparse structure. The objective is composed of a smooth nonconvex loss and a nonsmooth composite penalty. A number of structural analytic properties of the new heuristic are presented and proven, thus providing the framework needed to investigate its statistical applications. In particular, the first and second derivatives of the smooth loss are obtained, its range of local convexity is derived, and the Lipschitz continuity of the loss and of its gradient is shown. These results open the path to solving the described problem via a proximal gradient algorithm.
1. Introduction
The estimation of large covariance or precision matrices is a pressing challenge nowadays, due to the increasing availability, in many fields, of datasets with a large number of variables p compared to the sample size n. The relevance of this topic is attested by several recent books [1,2,3] and comprehensive reviews [4,5,6]. In this paper, we assume for the covariance matrix a low rank plus sparse decomposition, that is
where , is a matrix such that , is a diagonal matrix, and is element-wise sparse, i.e. it contains only off-diagonal non-zero elements. Since [7] proposed their approximate factor model, structure (1) has become the reference model for many high-dimensional covariance matrix estimators, like POET [8].
The recovery of structure (1) is a statistical problem of primary relevance. Ref. [7] proposed to consistently estimate (as ) by means of principal component analysis (PCA, see [9]), assuming that the eigenvalues of diverge with the dimension p while the eigenvalues of remain bounded. Ref. [8] proposes to estimate by the top r principal components of the sample covariance matrix (as ) and to estimate by thresholding their orthogonal complement. In [10], and are recovered by nuclear norm plus penalization, that is, by computing
where is a smooth loss function, is a nonsmooth penalty function, denotes positive semidefiniteness for and denotes positive definiteness for . In particular, denoting by , , the eigenvalues of a matrix sorted in descending order, , , where (the nuclear norm of ), (the norm of ), and and are non-negative threshold parameters.
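For concreteness, both penalty terms can be evaluated directly from a spectral decomposition and from the matrix entries. The Python/NumPy sketch below assumes the standard definitions recalled above (the nuclear norm as the sum of singular values, which equals the sum of eigenvalues for a positive semidefinite matrix, and the l1 penalty taken over the off-diagonal entries); the parameter names psi and rho are illustrative placeholders for the two thresholds.

```python
import numpy as np

def nuclear_norm(L):
    """Sum of the singular values of L (for symmetric positive semidefinite L,
    this coincides with the sum of its eigenvalues)."""
    return np.linalg.svd(L, compute_uv=False).sum()

def l1_offdiag(S):
    """Sum of the absolute off-diagonal entries of S (whether the diagonal is
    penalized is an assumption made here for illustration)."""
    return np.abs(S).sum() - np.abs(np.diag(S)).sum()

def composite_penalty(L, S, psi, rho):
    """Composite penalty of the form appearing in (2):
    psi times the nuclear norm plus rho times the l1 norm."""
    return psi * nuclear_norm(L) + rho * l1_offdiag(S)
```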
The nuclear norm was first proposed in [11] as an alternative to PCA. Ref. [12] proves that is the tightest convex relaxation of the original non-convex penalty . Ref. [13] proves that norm minimization provides the sparsest solution to most large underdetermined linear systems, while [14] proves that nuclear norm minimization guarantees rank minimization under a set of linear equality constraints. Ref. [15] shows that norm minimization selects the best linear model in a wide range of situations. The nuclear norm has also been used to solve large matrix completion problems, as in [16,17,18,19]. Nuclear norm plus norm minimization was first exploited in [20] to provide a robust version of PCA under grossly corrupted or missing data.
The pair of estimators (2) derived in [10] is named ALCE (ALgebraic Covariance Estimator). Although ALCE has many desirable statistical properties, there is room to improve it further by replacing with a different loss. Indeed, the Frobenius loss optimizes the entry-by-entry performance of , while a loss able to explicitly control the quality of spectrum estimation may be preferable. In this paper, we consider the loss
where , and . The heuristic (3) is controlled by the individual singular values of , because
and, therefore, it is better suited for the estimation of the underlying spectrum.
To the best of our knowledge, the mathematical properties of (3) have not been extensively studied. Analogously to the univariate context (), (3) is not a convex function. According to recent works such as [21], nonconvex problems may be approached either by searching for approximate solutions instead of global solutions, or by exploiting the geometric structure of the objective function. In our case, the idea of restricting the analysis to the convexity region of the objective, a region that may be indefinitely extended (see the concept of Extendable Local Strong Convexity in [22]), is the key to applying, for instance, existing proximal gradient algorithms for convex functions (see [23]). For this reason, in this paper we calculate the first and second derivatives of (3), we derive its range of local convexity, and we prove the Lipschitz continuity of (3) and of its gradient. This opens the path to using standard proximal gradient algorithms (see [23]) to solve problem (2) with as in (3).
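For reference, a composite (proximal) gradient update of the kind analyzed in [23] combines a gradient step on the smooth loss with the proximal operator of the nonsmooth penalty. The sketch below is generic: grad_smooth, prox_penalty, and the step size eta are abstract placeholders rather than the specific quantities of problem (2).

```python
import numpy as np

def proximal_gradient_step(X, grad_smooth, prox_penalty, eta):
    """One composite gradient update in the sense of [23]:
    a forward (gradient) step on the smooth loss followed by a
    backward (proximal) step on the nonsmooth convex penalty."""
    return prox_penalty(X - eta * grad_smooth(X), eta)
```

For problem (2), where the penalty separates over the two matrix variables, the proximal map splits into a singular value thresholding step for the nuclear norm and an entrywise soft thresholding step for the l1 norm.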
2. Analytic Setup
We consider the objective function
where is the smooth part of and is the non-smooth (but convex) part of . First, we calculate the derivative of the smooth component with respect to and , which is
Proof.
Let us consider two generic matrices and , their sum , and the matrix . Let us define the matrix function and the function . We denote by the i-th canonical basis vector, by its l-th element, and by the entry of . Then, following [24], for each , we can write
Therefore,
Since, for conformable matrices , ,
we get
Finally, considering that
we get
To sum up,
□
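A closed-form gradient such as (6) can be validated numerically by comparing directional derivatives with central finite differences along random symmetric directions. The checker below is a generic sketch: loss and grad are assumed callables returning the scalar loss and its gradient matrix, not the specific expressions derived above.

```python
import numpy as np

def check_gradient(loss, grad, X, n_dirs=10, h=1e-6, tol=1e-4):
    """Compare the directional derivative <grad(X), D> with a central
    finite-difference quotient of the scalar loss along random symmetric D."""
    rng = np.random.default_rng(0)
    G = grad(X)
    for _ in range(n_dirs):
        D = rng.standard_normal(X.shape)
        D = (D + D.T) / 2                     # symmetric direction
        D /= np.linalg.norm(D)                # unit Frobenius norm
        analytic = np.sum(G * D)              # Frobenius inner product <grad(X), D>
        numeric = (loss(X + h * D) - loss(X - h * D)) / (2 * h)
        if abs(analytic - numeric) > tol:
            return False
    return True
```

Such a check only probes directional derivatives, but it is usually enough to catch sign or transposition errors in matrix-calculus derivations.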
In the following, we make explicit the second derivative of , with and :
Moreover, if , we get
that is,
3. Local Convexity
The aim of this section is to determine the range of convexity of , , with respect to the positive semidefinite matrix . In the univariate context, the function is convex if and only if . In the multivariate context, it is therefore reasonable to expect that a similar condition on ensures local convexity. A proof can be given by showing the positive definiteness of the Hessian of for some range of . In other words, we need to show that there exists a positive such that, whenever , the function is convex.
Lemma 1.
Given , we have that the function
is convex on the set where denotes the spectral norm of .
Proof.
We proceed by using the convexity criterion, estimating the second derivative with respect to t of
Let us recall that
where is a differentiable square matrix-valued function and (15) holds for those values of t for which is invertible.
Furthermore, we also have
for any differentiable square matrix-valued function  (see, e.g., [25] or [26] for a proof of these identities).
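Identities of this kind are easy to check numerically. The sketch below verifies the classical log-determinant and trace derivative formulas, namely d/dt log det F(t) = tr(F(t)^{-1} F'(t)) and d/dt tr G(t) = tr(G'(t)); we take these standard forms as an assumption about the precise statements of (15) and (16).

```python
import numpy as np

rng = np.random.default_rng(1)
n, t0, h = 5, 0.3, 1e-6
C = rng.standard_normal((n, n))
C = C @ C.T                        # symmetric positive semidefinite direction

def F(t):
    """Matrix-valued curve F(t) = I + t*C, invertible for t >= 0, with F'(t) = C."""
    return np.eye(n) + t * C

# assumed form of (15): d/dt log det F(t) = tr(F(t)^{-1} F'(t))
lhs = (np.log(np.linalg.det(F(t0 + h))) - np.log(np.linalg.det(F(t0 - h)))) / (2 * h)
rhs = np.trace(np.linalg.solve(F(t0), C))
assert abs(lhs - rhs) < 1e-4

# assumed form of (16): d/dt tr G(t) = tr(G'(t)), here with G = F
lhs_tr = (np.trace(F(t0 + h)) - np.trace(F(t0 - h))) / (2 * h)
assert abs(lhs_tr - np.trace(C)) < 1e-6
```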
Calling , we see that and where
and .
Convexity will follow once we have proven that (17) is non-negative for every and every and in .
Due to the cyclic property of the trace function, we also have
that is
This can be written as
We recall that is self-adjoint, so that, denoting by the matrix and by the matrix , we get that (19) can be written as
We also recall that induces a scalar product to which the trace norm is attached:
where are the singular values of . In particular we have for every . Now from (21) convexity can be checked as
Let us consider
We have
Notice that the spectral norm is self-adjoint, that is for every (see e.g., [27]). Then
that is
Thus,
Assume now that : , . We deduce that and due to the structure of we also have
Finally, we have
Going back to (22) we have
since . □
By means of a simple change of variable, the following result can be proven.
Lemma 2.
For any the function
is convex on the closed ball .
In conclusion, even though the function is always concave, Lemma 2 shows that the function can be made locally convex on any ball centered at 0 by choosing a suitable .
4. Lipschitz-Continuity
In this section, we prove the Lipschitz continuity of the smooth function and of its gradient (see (6)).
Lemma 3.
The function is Lipschitz continuous in Euclidean norm with Lipschitz constant equal to 1:
Proof.
Let us recall that and are two generic matrices, is their sum, and . We reconsider the matrix function and the function . We recall from (6) that
Given two vectors , let us define the Euclidean inner product . We consider
where is the Euclidean norm of . Then we have
via the Cauchy-Schwarz inequality, for any . Now, choosing , we have
for every . Noticing that is invertible and plugging , , into the previous inequality, we obtain
Now recall (see [25], p. 312) that the spectral norm of a matrix , , can also be computed via the equality
and that the spectral norm is self-adjoint (see again [25], p. 309), that is, . Summing up, we have proved that
This means that the gradient of is uniformly bounded; since is a smooth function, the Lipschitz condition is satisfied with Lipschitz constant equal to 1:
□
We have proven that the function
is Lipschitz continuous.
Now, we prove that the function is Lipschitz continuous.
Lemma 4.
The function is Lipschitz continuous with Lipschitz constant equal to :
with and fix , for any .
Proof.
Let us call and fix .
Let us compute
We have
with and
Calling we have
so that we have
Recalling that
whenever is invertible, we have
We expand in powers of :
A tedious but simple computation yields
with
that is
The previous computations for the Lipschitz continuity of gave us (see (28)) that
It is also easy to check that
and that
Putting everything together, we get
so that we have proven that the directional derivative of the gradient is bounded in every direction by , i.e., the gradient is Lipschitz continuous as a function from to , the vector space of real matrices. □
5. Discussion
In this paper, we have proved that the loss has good analytic properties for the purpose of optimization, provided that the matrix fulfills certain conditions. As a consequence, following [23,28] and the supplement of [10], our analytic setup can provide a numerical solution to the problem
by using proximal gradient algorithms. The local convexity of is the key to applying first-order methods to solve (33). Following [23,28] and the supplement of [10], we derive the solution scheme reported in Algorithm 1.
Such an algorithm may be applied in many fields, including economics, finance, biology, genetics, health, climatology, and the social sciences. In future research, we plan to develop the selection of the threshold parameters, to study how local convexity interacts with the random nature of the sample error matrix , and to establish the consistency of the solution pair of (33).
Algorithm 1. Pseudocode to solve problem (33) given any input covariance matrix.
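For illustration, a scheme of the composite gradient type described in [23,28] can be sketched as follows. This is not the authors' Algorithm 1 but a generic stand-in: grad_smooth_loss is an assumed routine returning the gradient of the smooth loss with respect to the two matrix arguments, eta is a step size to be tied to the Lipschitz constant of Section 4, psi and rho are the two threshold parameters, and the proximal maps used are the standard eigenvalue soft thresholding for the nuclear norm under positive semidefiniteness and entrywise soft thresholding for the l1 norm.

```python
import numpy as np

def svt_psd(M, tau):
    """Eigenvalue soft thresholding of the symmetrized input: proximal map of
    tau times the nuclear norm under a positive semidefiniteness constraint."""
    w, V = np.linalg.eigh((M + M.T) / 2)
    return (V * np.maximum(w - tau, 0.0)) @ V.T

def soft_threshold_offdiag(M, tau):
    """Entrywise soft thresholding of the off-diagonal entries: proximal map of
    tau times the l1 penalty (the diagonal is left unpenalized, an assumption)."""
    T = np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)
    np.fill_diagonal(T, np.diag(M))
    return T

def proximal_gradient(Sigma_n, grad_smooth_loss, psi, rho, eta=0.5, n_iter=200):
    """Illustrative proximal gradient loop recovering a low rank plus sparse pair
    (L, S) from the input covariance matrix Sigma_n. This is a generic sketch,
    not the authors' Algorithm 1: grad_smooth_loss is an assumed routine
    returning the gradient of the smooth loss with respect to L and S."""
    p = Sigma_n.shape[0]
    L = np.zeros((p, p))
    S = np.diag(np.diag(Sigma_n))            # start from the diagonal of Sigma_n
    for _ in range(n_iter):
        G_L, G_S = grad_smooth_loss(Sigma_n, L, S)
        L = svt_psd(L - eta * G_L, eta * psi)                  # forward-backward step on L
        S = soft_threshold_offdiag(S - eta * G_S, eta * rho)   # forward-backward step on S
    return L, S
```

In practice, the iterates should also be monitored so that they remain within the local convexity region characterized in Section 3; this safeguard is left schematic here.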
Author Contributions
Conceptualization, M.F.; Investigation, E.B. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Not applicable.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Pourahmadi, M. High-Dimensional Covariance Estimation: With High-Dimensional Data; John Wiley & Sons: Hoboken, NJ, USA, 2013; Volume 882.
- Wainwright, M.J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint; Cambridge University Press: Cambridge, UK, 2019; Volume 48.
- Zagidullina, A. High-Dimensional Covariance Matrix Estimation: An Introduction to Random Matrix Theory; Springer Nature: Berlin/Heidelberg, Germany, 2021.
- Fan, J.; Liao, Y.; Liu, H. An overview of the estimation of large covariance and precision matrices. Econom. J. 2016, 19, C1–C32.
- Lam, C. High-dimensional covariance matrix estimation. Wiley Interdiscip. Rev. Comput. Stat. 2020, 12, e1485.
- Ledoit, O.; Wolf, M. Shrinkage estimation of large covariance matrices: Keep it simple, statistician? J. Multivar. Anal. 2021, 186, 104796.
- Chamberlain, G.; Rothschild, M. Arbitrage, factor structure, and mean-variance analysis on large asset markets. Econometrica 1983, 51, 1281.
- Fan, J.; Liao, Y.; Mincheva, M. Large covariance estimation by thresholding principal orthogonal complements. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2013, 75, 603–680.
- Abdi, H.; Williams, L.J. Principal component analysis. Wiley Interdiscip. Rev. Comput. Stat. 2010, 2, 433–459.
- Farnè, M.; Montanari, A. A large covariance matrix estimator under intermediate spikiness regimes. J. Multivar. Anal. 2020, 176, 104577.
- Fazel, M.; Hindi, H.; Boyd, S.P. A rank minimization heuristic with application to minimum order system approximation. In Proceedings of the American Control Conference, Arlington, VA, USA, 25–27 June 2001; Volume 6, pp. 4734–4739.
- Fazel, M. Matrix Rank Minimization with Applications. Ph.D. Thesis, Stanford University, Stanford, CA, USA, 2002.
- Donoho, D.L. For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution. Commun. Pure Appl. Math. 2006, 59, 797–829.
- Recht, B.; Fazel, M.; Parrilo, P.A. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 2010, 52, 471–501.
- Candès, E.J.; Plan, Y. Near-ideal model selection by l1 minimization. Ann. Stat. 2009, 37, 2145–2177.
- Candès, E.J.; Tao, T. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inf. Theory 2010, 56, 2053–2080.
- Mazumder, R.; Hastie, T.; Tibshirani, R. Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res. 2010, 11, 2287–2322.
- Srebro, N.; Rennie, J.; Jaakkola, T.S. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2005; pp. 1329–1336.
- Hastie, T.; Mazumder, R.; Lee, J.D.; Zadeh, R. Matrix completion and low-rank SVD via fast alternating least squares. J. Mach. Learn. Res. 2015, 16, 3367–3402.
- Candès, E.J.; Li, X.; Ma, Y.; Wright, J. Robust principal component analysis? J. ACM 2011, 58, 11.
- Danilova, M.; Dvurechensky, P.; Gasnikov, A.; Gorbunov, E.; Guminov, S.; Kamzolov, D.; Shibaev, I. Recent theoretical advances in non-convex optimization. arXiv 2020, arXiv:2012.06188.
- Dey, D.; Mukhoty, B.; Kar, P. AGGLIO: Global optimization for locally convex functions. In Proceedings of the 5th Joint International Conference on Data Science & Management of Data (9th ACM IKDD CODS and 27th COMAD), Bangalore, India, 8–10 January 2022; pp. 37–45.
- Nesterov, Y. Gradient methods for minimizing composite functions. Math. Program. 2013, 140, 125–161.
- Harville, D.A. Matrix Algebra from A Statistician’s Perspective; Springer: New York, NY, USA, 1997.
- Horn, R.A.; Johnson, C.R. Matrix Analysis; Cambridge University Press: Cambridge, UK, 2012.
- Graham, A. Kronecker Products and Matrix Calculus: With Applications; Ellis Horwood Limited: London, UK, 1981.
- Lax, P.D. Linear Algebra and Its Applications, 2nd ed.; Pure and Applied Mathematics (Hoboken); Wiley-Interscience (John Wiley & Sons): Hoboken, NJ, USA, 2007.
- Luo, X. High dimensional low rank and sparse covariance matrix estimation via convex minimization. arXiv 2011, arXiv:1111.1133.
- Cai, J.-F.; Candès, E.J.; Shen, Z. A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 2010, 20, 1956–1982.
- Daubechies, I.; Defrise, M.; De Mol, C. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Commun. Pure Appl. Math. 2004, 57, 1413–1457.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).