Abstract
This paper studies a new nonconvex optimization problem aimed at recovering high-dimensional covariance matrices with a low rank plus sparse structure. The objective is composed of a smooth nonconvex loss and a nonsmooth composite penalty. A number of structural analytic properties of the new heuristic are presented and proven, thus providing the framework needed to investigate its statistical applications. In particular, the first and second derivatives of the smooth loss are obtained, its range of local convexity is derived, and the Lipschitz continuity of the loss and of its gradient is shown. These results open the path to solving the described problem via a proximal gradient algorithm.
1. Introduction
The estimation of large covariance or precision matrices is a pressing challenge nowadays, due to the increasing availability, in many fields, of datasets with a large number of variables p compared to the sample size n. The relevance of this topic is attested by several recent books [1,2,3] and comprehensive reviews [4,5,6]. In this paper, we assume for the covariance matrix a low rank plus sparse decomposition, that is
where , is a matrix such that , is a diagonal matrix, and is element-wise sparse, i.e. it contains only off-diagonal non-zero elements. Since [7] proposed their approximate factor model, structure (1) has become the reference model for many high-dimensional covariance matrix estimators, like POET [8].
The recovery of structure (1) is a statistical problem of primary relevance. Ref. [7] proposed to consistently estimate (as ) by means of principal component analysis (PCA, see [9]), assuming that the eigenvalues of diverge with the dimension p while the eigenvalues of remain bounded. Ref. [8] proposes to estimate by the top r principal components of the sample covariance matrix (as ) and to estimate by thresholding their orthogonal complement. In [10], and are recovered by nuclear norm plus penalization, that is, by computing
where is a smooth loss function, is a nonsmooth penalty function, denotes positive semidefiniteness for and denotes positive definiteness for . In particular, denoting by , , the eigenvalues of a matrix sorted in descending order, , , where (the nuclear norm of ), (the norm of ), and and are non-negative threshold parameters.
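For concreteness, both penalty terms can be evaluated directly from a spectral decomposition and from the matrix entries. The Python/NumPy sketch below assumes the standard definitions recalled above (the nuclear norm as the sum of singular values, which equals the sum of eigenvalues for a positive semidefinite matrix, and the l1 penalty taken over the off-diagonal entries); the parameter names psi and rho are illustrative placeholders for the two thresholds.

```python
import numpy as np

def nuclear_norm(L):
    """Sum of the singular values of L (for symmetric positive semidefinite L,
    this coincides with the sum of its eigenvalues)."""
    return np.linalg.svd(L, compute_uv=False).sum()

def l1_offdiag(S):
    """Sum of the absolute off-diagonal entries of S (whether the diagonal is
    penalized is an assumption made here for illustration)."""
    return np.abs(S).sum() - np.abs(np.diag(S)).sum()

def composite_penalty(L, S, psi, rho):
    """Composite penalty of the form appearing in (2):
    psi times the nuclear norm plus rho times the l1 norm."""
    return psi * nuclear_norm(L) + rho * l1_offdiag(S)
```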
The nuclear norm was first proposed in [11] as an alternative to PCA. Ref. [12] proves that is the tightest convex relaxation of the original non-convex penalty . Ref. [13] proves that norm minimization provides the sparsest solution to most large underdetermined linear systems, while [14] proves that nuclear norm minimization guarantees rank minimization under a set of linear equality constraints. Ref. [15] shows that norm minimization selects the best linear model in a wide range of situations. The nuclear norm has also been used to solve large matrix completion problems, as in [16,17,18,19]. Nuclear norm plus norm minimization was first exploited in [20] to provide a robust version of PCA under grossly corrupted or missing data.
The pair of estimators (2) derived in [10] is named ALCE (ALgebraic Covariance Estimator). Although ALCE has many desirable statistical properties, there is room to improve it further by replacing with a different loss. Indeed, the Frobenius loss optimizes the entry-by-entry performance of , while a loss able to explicitly control the quality of spectrum estimation may be preferable. In this paper, we consider the loss
where , and . The heuristic (3) is controlled by the individual singular values of , because
and, therefore, it is better suited for the estimation of the underlying spectrum.
To the best of our knowledge, the mathematical properties of (3) have not been extensively studied. Analogously to the univariate context (), (3) is not a convex function. According to recent works such as [21], nonconvex problems may be approached either by searching for approximate solutions instead of global solutions, or by exploiting the geometric structure of the objective function. In our case, the idea of restricting the analysis to the convexity region of the objective, a region that may be indefinitely extended (see the concept of Extendable Local Strong Convexity in [22]), is the key to applying, for instance, existing proximal gradient algorithms for convex functions (see [23]). For this reason, in this paper we calculate the first and second derivatives of (3), we derive its range of local convexity, and we prove the Lipschitz continuity of (3) and of its gradient. This opens the path to using standard proximal gradient algorithms (see [23]) to solve problem (2) with as in (3).
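For reference, a composite (proximal) gradient update of the kind analyzed in [23] combines a gradient step on the smooth loss with the proximal operator of the nonsmooth penalty. The sketch below is generic: grad_smooth, prox_penalty, and the step size eta are abstract placeholders rather than the specific quantities of problem (2).

```python
import numpy as np

def proximal_gradient_step(X, grad_smooth, prox_penalty, eta):
    """One composite gradient update in the sense of [23]:
    a forward (gradient) step on the smooth loss followed by a
    backward (proximal) step on the nonsmooth convex penalty."""
    return prox_penalty(X - eta * grad_smooth(X), eta)
```

For problem (2), where the penalty separates over the two matrix variables, the proximal map splits into a singular value thresholding step for the nuclear norm and an entrywise soft thresholding step for the l1 norm.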
2. Analytic Setup
We consider the objective function
where is the smooth part of and is the non-smooth (but convex) part of . First, we calculate the derivative of the smooth component with respect to and , which is
Proof.
Let us consider two generic matrices and , their sum , and the matrix . Let us define the matrix function and the function . We denote by the i-th canonical basis vector, by its l-th element, and by the entry of . Then, following [24], for each , we can write
Therefore,
Since, for conformable matrices , ,
we get
Finally, considering that
we get
To sum up,
□
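A closed-form gradient such as (6) can be validated numerically by comparing directional derivatives with central finite differences along random symmetric directions. The checker below is a generic sketch: loss and grad are assumed callables returning the scalar loss and its gradient matrix, not the specific expressions derived above.

```python
import numpy as np

def check_gradient(loss, grad, X, n_dirs=10, h=1e-6, tol=1e-4):
    """Compare the directional derivative <grad(X), D> with a central
    finite-difference quotient of the scalar loss along random symmetric D."""
    rng = np.random.default_rng(0)
    G = grad(X)
    for _ in range(n_dirs):
        D = rng.standard_normal(X.shape)
        D = (D + D.T) / 2                     # symmetric direction
        D /= np.linalg.norm(D)                # unit Frobenius norm
        analytic = np.sum(G * D)              # Frobenius inner product <grad(X), D>
        numeric = (loss(X + h * D) - loss(X - h * D)) / (2 * h)
        if abs(analytic - numeric) > tol:
            return False
    return True
```

Such a check only probes directional derivatives, but it is usually enough to catch sign or transposition errors in matrix-calculus derivations.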
In the following, we make explicit the second derivative of , with and :
Moreover, if , we get
that is,
3. Local Convexity
The aim of this section is to determine the range of convexity of , , with respect to the positive semidefinite matrix . In the univariate context, the function is convex if and only if . In the multivariate context, it is therefore reasonable to expect that a similar condition on ensures local convexity. A proof can be given by showing the positive definiteness of the Hessian of for some range of . In other words, we need to show that there exists a positive such that, whenever , the function is convex.
Lemma 1.
Given , we have that the function
is convex on the set where denotes the spectral norm of .
Proof.
We proceed by using the convexity criterion, estimating the second derivative with respect to t of
Let us recall that
where is a differentiable square matrix-valued function and (15) holds for those values of t for which is invertible.
Furthermore, we also have
for any differentiable square matrix-valued function  (see, e.g., [25] or [26] for a proof of these identities).
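Identities of this kind are easy to check numerically. The sketch below verifies the classical log-determinant and trace derivative formulas, namely d/dt log det F(t) = tr(F(t)^{-1} F'(t)) and d/dt tr G(t) = tr(G'(t)); we take these standard forms as an assumption about the precise statements of (15) and (16).

```python
import numpy as np

rng = np.random.default_rng(1)
n, t0, h = 5, 0.3, 1e-6
C = rng.standard_normal((n, n))
C = C @ C.T                        # symmetric positive semidefinite direction

def F(t):
    """Matrix-valued curve F(t) = I + t*C, invertible for t >= 0, with F'(t) = C."""
    return np.eye(n) + t * C

# assumed form of (15): d/dt log det F(t) = tr(F(t)^{-1} F'(t))
lhs = (np.log(np.linalg.det(F(t0 + h))) - np.log(np.linalg.det(F(t0 - h)))) / (2 * h)
rhs = np.trace(np.linalg.solve(F(t0), C))
assert abs(lhs - rhs) < 1e-4

# assumed form of (16): d/dt tr G(t) = tr(G'(t)), here with G = F
lhs_tr = (np.trace(F(t0 + h)) - np.trace(F(t0 - h))) / (2 * h)
assert abs(lhs_tr - np.trace(C)) < 1e-6
```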
Calling , we see that and where
and .
Convexity will follow once we have proven that (17) is non-negative for every and every and in .
Due to the cyclic property of the trace function, we also have
that is
This can be written as
We recall that is self-adjoint, so that, denoting by the matrix and by the matrix , we get that (19) can be written as
We also recall that induces a scalar product to which the trace norm is attached:
where are the singular values of . In particular we have for every . Now from (21) convexity can be checked as
Let us consider
We have
Notice that the spectral norm is self-adjoint, that is for every (see e.g., [27]). Then
that is
Thus,
Assume now that : , . We deduce that and due to the structure of we also have
Finally, we have
Going back to (22) we have
since . □
By means of a simple change of variable, the following result can be proven.
Lemma 2.
For any the function
is convex on the closed ball .
In conclusion, even though the function is always concave, Lemma 2 shows that the function can be made locally convex on any ball centered at 0 by choosing a suitable .
4. Lipschitz-Continuity
In this section, we prove the Lipschitz continuity of the smooth function and of its gradient (see (6)).
Lemma 3.
The function is Lipschitz continuous in Euclidean norm with Lipschitz constant equal to 1:
Proof.
Let us recall that and are two generic matrices, is their sum, and . We reconsider the matrix function and the function . We recall from (6) that
Given two vectors , let us define the Euclidean inner product . We consider
where is the Euclidean norm of . Then we have
via the Cauchy-Schwarz inequality, for any . Now, choosing , we have
for every . Noticing that is invertible and plugging , , into the previous inequality, we obtain
Now recall (see [25], p. 312) that the spectral norm of a matrix , , can also be computed via the equality
and that the spectral norm is self-adjoint (see again [25], p. 309), that is, . Summing up, we have proved that
This means that the gradient of is uniformly bounded; since is a smooth function, the Lipschitz condition is satisfied with Lipschitz constant equal to 1:
□
We have proven that the function
is Lipschitz continuous.
Now, we prove that the function is Lipschitz continuous.
Lemma 4.
The function is Lipschitz continuous with Lipschitz constant equal to :
with and fix , for any .
Proof.
Let us call and fix .
Let us compute
We have
with and
Calling we have
so that we have
Recalling that
whenever is invertible, we have
We expand in powers of :
A tedious but simple computation yields
with
that is
The previous computations for the Lipschitz continuity of gave us (see (28)) that
It is also easy to check that
and that
Putting everything together, we get
so that we have proven that the directional derivative of the gradient is bounded in every direction by , i.e., the gradient is Lipschitz continuous as a function from to , the vector space of real matrices. □
5. Discussion
In this paper, we have proved that the loss has good analytic properties for the purpose of optimization, provided that the matrix fulfills certain conditions. As a consequence, following [23,28] and the supplement of [10], our analytic setup can provide a numerical solution to the problem
by using proximal gradient algorithms. The local convexity of is the key to applying first-order methods to solve (33). Following [23,28] and the supplement of [10], we derive the solution scheme reported in Algorithm 1.
Such an algorithm may be applied in many fields, including economics, finance, biology, genetics, health, climatology, and the social sciences. In future research, we plan to develop the selection of the threshold parameters, to study how local convexity interacts with the random nature of the sample error matrix , and to establish the consistency of the solution pair of (33).
Algorithm 1. Pseudocode to solve problem (33) given any input covariance matrix.
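For illustration, a scheme of the composite gradient type described in [23,28] can be sketched as follows. This is not the authors' Algorithm 1 but a generic stand-in: grad_smooth_loss is an assumed routine returning the gradient of the smooth loss with respect to the two matrix arguments, eta is a step size to be tied to the Lipschitz constant of Section 4, psi and rho are the two threshold parameters, and the proximal maps used are the standard eigenvalue soft thresholding for the nuclear norm under positive semidefiniteness and entrywise soft thresholding for the l1 norm.

```python
import numpy as np

def svt_psd(M, tau):
    """Eigenvalue soft thresholding of the symmetrized input: proximal map of
    tau times the nuclear norm under a positive semidefiniteness constraint."""
    w, V = np.linalg.eigh((M + M.T) / 2)
    return (V * np.maximum(w - tau, 0.0)) @ V.T

def soft_threshold_offdiag(M, tau):
    """Entrywise soft thresholding of the off-diagonal entries: proximal map of
    tau times the l1 penalty (the diagonal is left unpenalized, an assumption)."""
    T = np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)
    np.fill_diagonal(T, np.diag(M))
    return T

def proximal_gradient(Sigma_n, grad_smooth_loss, psi, rho, eta=0.5, n_iter=200):
    """Illustrative proximal gradient loop recovering a low rank plus sparse pair
    (L, S) from the input covariance matrix Sigma_n. This is a generic sketch,
    not the authors' Algorithm 1: grad_smooth_loss is an assumed routine
    returning the gradient of the smooth loss with respect to L and S."""
    p = Sigma_n.shape[0]
    L = np.zeros((p, p))
    S = np.diag(np.diag(Sigma_n))            # start from the diagonal of Sigma_n
    for _ in range(n_iter):
        G_L, G_S = grad_smooth_loss(Sigma_n, L, S)
        L = svt_psd(L - eta * G_L, eta * psi)                  # forward-backward step on L
        S = soft_threshold_offdiag(S - eta * G_S, eta * rho)   # forward-backward step on S
    return L, S
```

In practice, the iterates should also be monitored so that they remain within the local convexity region characterized in Section 3; this safeguard is left schematic here.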
Author Contributions
Conceptualization, M.F.; Investigation, E.B. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Not applicable.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Pourahmadi, M. High-Dimensional Covariance Estimation: With High-Dimensional Data; John Wiley & Sons: Hoboken, NJ, USA, 2013; Volume 882.
- Wainwright, M.J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint; Cambridge University Press: Cambridge, UK, 2019; Volume 48.
- Zagidullina, A. High-Dimensional Covariance Matrix Estimation: An Introduction to Random Matrix Theory; Springer Nature: Berlin/Heidelberg, Germany, 2021.
- Fan, J.; Liao, Y.; Liu, H. An overview of the estimation of large covariance and precision matrices. Econom. J. 2016, 19, C1–C32.
- Lam, C. High-dimensional covariance matrix estimation. Wiley Interdiscip. Rev. Comput. Stat. 2020, 12, e1485.
- Ledoit, O.; Wolf, M. Shrinkage estimation of large covariance matrices: Keep it simple, statistician? J. Multivar. Anal. 2021, 186, 104796.
- Chamberlain, G.; Rothschild, M. Arbitrage, factor structure, and mean-variance analysis on large asset markets. Econometrica 1983, 51, 1281.
- Fan, J.; Liao, Y.; Mincheva, M. Large covariance estimation by thresholding principal orthogonal complements. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2013, 75, 603–680.
- Abdi, H.; Williams, L.J. Principal component analysis. Wiley Interdiscip. Rev. Comput. Stat. 2010, 2, 433–459.
- Farnè, M.; Montanari, A. A large covariance matrix estimator under intermediate spikiness regimes. J. Multivar. Anal. 2020, 176, 104577.
- Fazel, M.; Hindi, H.; Boyd, S.P. A rank minimization heuristic with application to minimum order system approximation. In Proceedings of the American Control Conference, Arlington, VA, USA, 25–27 June 2001; Volume 6, pp. 4734–4739.
- Fazel, M. Matrix Rank Minimization with Applications. Ph.D. Thesis, Stanford University, Stanford, CA, USA, 2002.
- Donoho, D.L. For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution. Commun. Pure Appl. Math. 2006, 59, 797–829.
- Recht, B.; Fazel, M.; Parrilo, P.A. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 2010, 52, 471–501.
- Candès, E.J.; Plan, Y. Near-ideal model selection by l1 minimization. Ann. Stat. 2009, 37, 2145–2177.
- Candès, E.J.; Tao, T. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inf. Theory 2010, 56, 2053–2080.
- Mazumder, R.; Hastie, T.; Tibshirani, R. Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res. 2010, 11, 2287–2322.
- Srebro, N.; Rennie, J.; Jaakkola, T.S. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2005; pp. 1329–1336.
- Hastie, T.; Mazumder, R.; Lee, J.D.; Zadeh, R. Matrix completion and low-rank SVD via fast alternating least squares. J. Mach. Learn. Res. 2015, 16, 3367–3402.
- Candès, E.J.; Li, X.; Ma, Y.; Wright, J. Robust principal component analysis? J. ACM 2011, 58, 11.
- Danilova, M.; Dvurechensky, P.; Gasnikov, A.; Gorbunov, E.; Guminov, S.; Kamzolov, D.; Shibaev, I. Recent theoretical advances in non-convex optimization. arXiv 2020, arXiv:2012.06188.
- Dey, D.; Mukhoty, B.; Kar, P. AGGLIO: Global optimization for locally convex functions. In Proceedings of the 5th Joint International Conference on Data Science & Management of Data (9th ACM IKDD CODS and 27th COMAD), Bangalore, India, 8–10 January 2022; pp. 37–45.
- Nesterov, Y. Gradient methods for minimizing composite functions. Math. Program. 2013, 140, 125–161.
- Harville, D.A. Matrix Algebra from A Statistician’s Perspective; Springer: New York, NY, USA, 1997.
- Horn, R.A.; Johnson, C.R. Matrix Analysis; Cambridge University Press: Cambridge, UK, 2012.
- Graham, A. Kronecker Products and Matrix Calculus: With Applications; Ellis Horwood Limited: London, UK, 1981.
- Lax, P.D. Linear Algebra and Its Applications, 2nd ed.; Pure and Applied Mathematics (Hoboken); Wiley-Interscience (John Wiley & Sons): Hoboken, NJ, USA, 2007.
- Luo, X. High dimensional low rank and sparse covariance matrix estimation via convex minimization. arXiv 2011, arXiv:1111.1133.
- Cai, J.-F.; Candès, E.J.; Shen, Z. A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 2010, 20, 1956–1982.
- Daubechies, I.; Defrise, M.; De Mol, C. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Commun. Pure Appl. Math. 2004, 57, 1413–1457.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).