Local Laws for Sparse Sample Covariance Matrices

We proved the local Marchenko–Pastur law for sparse sample covariance matrices that correspond to rectangular observation matrices of order n × m with n/m → y (where y > 0) and sparsity probability np_n > log^β n (where β > 0). The bounds on the distance between the empirical spectral distribution function of the sparse sample covariance matrices and the Marchenko–Pastur distribution function, obtained in the complex domain z ∈ D with Im z > v_0 > 0 (where v_0 ~ n^{-1} log^4 n), were of order log^4 n / n, and the domain bounds did not depend on p_n while np_n > log^β n.


Introduction
The random matrix theory (RMT) dates back to the work of Wishart in multivariate statistics [1], which was devoted to the joint distribution of the entries of sample covariance matrices. The next RMT milestone was the work of Wigner [2] in the middle of the last century, in which the modelling of the Hamiltonian of excited heavy nuclei using a large dimensional random matrix was proposed, thereby replacing the study of the energy levels of nuclei with the study of the distribution of the eigenvalues of a random matrix. Wigner studied the eigenvalues of random Hermitian matrices with centred, independent and identically distributed elements (such matrices were later named Wigner matrices) and proved that the density of the empirical spectral distribution function of the eigenvalues of such matrices converges to the semicircle law as the matrix dimensions increase. Later, this convergence was named Wigner's semicircle law and Wigner's results were generalised in various aspects.
The breakthrough work of Marchenko and Pastur [3] gave impetus to new progress in the study of sample covariance matrices. Under quite general conditions, they found an explicit form of the limiting density of the expected empirical spectral distribution function of sample covariance matrices. Later, this convergence was named the Marchenko-Pastur law.
Sample covariance matrices are of great practical importance for the problems of multivariate statistical analysis, particularly for the method of principal component analysis (PCA). In recent years, many studies have appeared that have connected RMT with other rapidly developing areas, such as the theory of wireless communication and deep learning. For example, the spectral density of sample covariance matrices is used in calculations that relate to multiple input multiple output (MIMO) channel capacity [4]. An important object of study for neural networks is the loss surface. The geometry and critical points of this surface can be predicted using the Hessian of the loss function. A number of works that have been devoted to deep networks have suggested the application of various RMT models for Hessian approximation, thereby allowing the use of RMT results to reach specific conclusions about the nature of the critical points of the surface.
Another area of application for sample covariance matrices is graph theory. The adjacency matrix of a directed graph is asymmetric, so the study of its singular values leads to sample covariance matrices. An example of such graphs is the bipartite random graph, the vertices of which can be divided into two groups such that the vertices within each group are not connected to each other.
If we assume that the probability p_n of having graph edges tends to zero as the number of vertices n increases to infinity, we arrive at the concept of sparse random matrices. The behaviour of the eigenvalues and eigenvectors of a sparse random matrix significantly depends on its sparsity, and results that are obtained for non-sparse matrices cannot be applied. Sparse sample covariance matrices have applications in random graph models [5] and deep learning problems [6] as well. Sparse Wigner matrices have been considered in a number of papers (see [7][8][9][10]), in which many results have been obtained. With the symmetrisation of sample covariance matrices, it is possible to apply these results when observation matrices are square. However, when the sample size is greater than the observation dimensions, the spectral limit distribution has a singularity at zero, which requires a different approach. The spectral limit distribution of sparse sample covariance matrices with a sparsity of np_n ∼ n^ε (where ε > 0 is arbitrarily small) was studied in [11,12]. In particular, a local law was proven under the assumption that the matrix elements satisfied the moment conditions E|X_jk|^q ≤ (Cq)^{cq}. In this paper, we considered a case with a sparsity of np_n ∼ log^α n for α > 1 and assumed that the matrix element moments satisfied the conditions E|X_jk|^{4+δ} ≤ C < ∞ and |X_jk| ≤ c_1 (np_n)^{1/2−κ} for κ > 0.

Main Results
We let m = m(n), where m ≥ n. We considered the independent and identically distributed zero-mean random variables X_jk, 1 ≤ j ≤ n and 1 ≤ k ≤ m, with E X_jk = 0 and E X_jk^2 = 1, and an independent set of independent Bernoulli random variables ξ_jk, 1 ≤ j ≤ n and 1 ≤ k ≤ m, with E ξ_jk = p_n. In addition, we supposed that np_n → ∞ as n → ∞.
In what follows, we omitted the index n from p n when this would not cause confusion.
We considered a sequence of random matrices:

X = ( X_jk ξ_jk / √(mp_n) ), 1 ≤ j ≤ n, 1 ≤ k ≤ m.

Denoting by s_1 ≥ · · · ≥ s_n the singular values of X, the symmetrised empirical spectral distribution function (ESD) of the sample covariance matrix W = XX* was defined as:

F_n(x) = (1/(2n)) Σ_{k=1}^n ( I{s_k ≤ x} + I{−s_k ≤ x} ),

where I{A} stands for the indicator of the event A. We let y := y(n, m) = n/m and G_y(x) be the symmetrised Marchenko–Pastur distribution function with the density:

g_y(x) = (1/(2πy|x|)) √((b² − x²)(x² − a²)) I{a ≤ |x| ≤ b},

where a = 1 − √y and b = 1 + √y. We assumed that y ≤ y_0 < 1 for n, m ≥ 1.
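As a numerical sanity check (not part of the original argument), the following sketch samples a sparse observation matrix under the model above and verifies that its singular values concentrate on the Marchenko–Pastur support [1 − √y, 1 + √y]. The normalisation of the entries by √(mp_n) and the specific choices of n, m, p are our illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 400                      # aspect ratio y = n/m = 0.5
p = np.log(n) ** 2 / n               # illustrative sparsity with np_n ~ log^2 n

# Sparse observation matrix with entries X_jk * xi_jk / sqrt(m * p_n);
# this normalisation (our assumption) gives the entries variance 1/m,
# matching the MP support endpoints used below.
X = rng.standard_normal((n, m)) * (rng.random((n, m)) < p) / np.sqrt(m * p)

s = np.linalg.svd(X, compute_uv=False)       # singular values s_1 >= ... >= s_n
y = n / m
a, b = 1 - np.sqrt(y), 1 + np.sqrt(y)        # endpoints of the MP support

# The bulk of the singular values should lie in [a, b] up to edge fluctuations
inside = np.mean((s > a - 0.2) & (s < b + 0.2))
print(inside)
```

For np_n of order log² n, as here, almost all singular values already fall inside the limiting support at moderate n.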
When the Stieltjes transformation of the distribution function G_y(x) was denoted by S_y(z) and the Stieltjes transformation of the distribution function F_n(x) was denoted by s_n(z), we obtained:

S_y(z) = ∫ dG_y(x)/(x − z),  s_n(z) = ∫ dF_n(x)/(x − z),  Im z > 0.

We also put:

Λ_n = Λ_n(z) := s_n(z) − S_y(z).

In this paper, we proved the so-called local Marchenko–Pastur law for sparse covariance matrices. For a constant δ > 0, we defined the value κ = κ(δ) := δ/(2(4 + δ)). We assumed that the sparsity probability p_n and the moments of the matrix elements X_jk satisfied the following conditions:
• Condition (C0): for c_0 > 0 and n ≥ 1, we have np_n ≥ c_0 log^{2/κ} n;
• Condition (C1): for δ > 0, we have µ_{4+δ} := E|X_11|^{4+δ} < ∞;
• Condition (C2): a constant c_1 > 0 exists, such that for all 1 ≤ j ≤ n and 1 ≤ k ≤ m, |X_jk| ≤ c_1 (np_n)^{1/2−κ}.
We introduced the quantity v_0 = v_0(a_0) := a_0 n^{−1} log^4 n with a positive constant a_0. For constants u_0 > 0 and V, we defined the region:

D = D(a_0) := {z = u + iv : |u| ≤ u_0, v_0 ≤ v ≤ V}.

Next, we introduced the quantities a_n(z), b(z) and Γ_n used in the estimates below. We stated the improved bounds for Λ_n(z).

Theorem 1. Assume that the conditions (C0)–(C2) are satisfied. Then, for any Q ≥ 1, positive constants C = C(Q, δ, µ_{4+δ}, c_0, c_1), K = K(Q, δ, µ_{4+δ}, c_0, c_1) and a_0 = a_0(Q, δ, µ_{4+δ}, c_0, c_1) exist, such that for z ∈ D(a_0):

Pr{ |Λ_n(z)| > K log n/(nv) } ≤ C n^{−Q}.

We also proved the following result.
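The two Stieltjes transforms above can be compared numerically. The sketch below integrates the symmetrised Marchenko–Pastur density (our reading of the definition in the text) to approximate S_y(z) and compares it with the empirical transform s_n(z) of a dense (p_n = 1) model; the grid size, the point z and the tolerance are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 300, 600
y = n / m
a, b = 1 - np.sqrt(y), 1 + np.sqrt(y)

# Symmetrised MP density g_y(x) = sqrt((b^2-x^2)(x^2-a^2)) / (2*pi*y*|x|),
# supported on a <= |x| <= b (our reading of the definition).
def g(x):
    x = np.abs(x)
    return np.sqrt((b**2 - x**2) * (x**2 - a**2)) / (2 * np.pi * y * x)

xs = np.linspace(a + 1e-9, b - 1e-9, 200_000)
w = g(xs)
dx = xs[1] - xs[0]
mass = 2 * np.sum(w) * dx                    # total mass should be 1

z = 0.5 + 1.0j
# S_y(z) = int g(x)/(x - z) dx over both halves of the support (Riemann sum)
S = np.sum(w / (xs - z)) * dx + np.sum(w / (-xs - z)) * dx

# Empirical transform s_n(z) from the symmetrised singular values of X
X = rng.standard_normal((n, m)) / np.sqrt(m)
s = np.linalg.svd(X, compute_uv=False)
lam = np.concatenate([s, -s])
s_n = np.mean(1.0 / (lam - z))

print(abs(S - s_n))
```

Away from the real axis (Im z = 1 here), the two transforms agree up to the finite-n error, which is far smaller than the local-law scale 1/(nv).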

Organisation
The paper is organised as follows. In Section 3, we state Theorems 3–5 and several corollaries. In Section 4, the delocalisation is considered. In Section 5, we prove the corollaries that were stated in Section 3. Section 6 is devoted to the proof of Theorems 3–5. In Section 7, we state and prove some auxiliary results.

Notation
We use C for large universal constants, which may be different from line to line. S_y(z) and s_n(z) denote the Stieltjes transformations of the symmetrised Marchenko–Pastur distribution and the spectral distribution function, respectively. R(z) denotes the resolvent matrix. We let T = {1, . . . , n}, J ⊂ T, T^(1) = {1, . . . , m} and K ⊂ T^(1). We consider the σ-algebras M^(J,K), which are generated by the elements of X (with the exception of the rows from J and the columns from K). We write M instead of M^(J,K∪{l}) for brevity. The symbol X^(J,K) denotes the matrix X from which the rows with numbers in J and the columns with numbers in K are deleted. In a similar way, we denote all objects in terms of X^(J,K): the resolvent matrix is R^(J,K), the ESD Stieltjes transformation is s_n^(J,K)(z), and so on. The symbol E_j denotes the conditional expectation with respect to the σ-algebra M_j and E_{l+n} denotes the conditional expectation with respect to the σ-algebra M_{l+n}. We let J^c = T \ J and K^c = T^(1) \ K.

Main Equation and Its Error Term Estimation
Note that F_n(x) is the ESD of the block matrix:

V = [[O_n, X], [X*, O_m]],

where O_k is a k × k matrix with zero elements. We let R = R(z) be the resolvent matrix of V:

R(z) = (V − zI_{n+m})^{−1}.

By applying the Schur complement, we obtained: This implied: For the diagonal elements of R, we could write: for j ∈ J^c and: for l ∈ K^c. The correction terms ε_j^(J,K) for j ∈ J^c and ε_{l+n}^(J,K) for l ∈ K^c were defined as: and: By summing Equation (4) (J = ∅ and K = ∅), we obtained the self-consistent equation:

s_n(z) = S_y(z)(1 + T_n − yΛ_n s_n(z)),

with the error term T_n. We let s_0 > 1 and V be positive constants depending on δ. The exact values of these constants are defined below. For 0 < v ≤ V, we defined k_v as: Remembering that:

Λ_n = Λ_n(z) := s_n(z) − S_y(z),

we defined: The function b(z) was defined in (2). For a given γ > 0, we considered the event:

Q = Q_γ(v) := { |Λ_n(u + iv)| ≤ γ a_n(u + iv), for all u },

and the event B. For any γ value, a constant V = V(γ) existed, such that (6) was satisfied. It could be V = 2/γ, for example. In what follows, we assumed that γ and V were chosen so that (6) was satisfied. In this section, we demonstrate the following results.
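The correspondence between F_n(x) and the block matrix V can be checked directly: the spectrum of V consists of ±s_1, …, ±s_n together with m − n zeros. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 5
X = rng.standard_normal((n, m))

# Block matrix V = [[O_n, X], [X^*, O_m]]
V = np.block([[np.zeros((n, n)), X], [X.T, np.zeros((m, m))]])

eig = np.sort(np.linalg.eigvalsh(V))
s = np.sort(np.linalg.svd(X, compute_uv=False))

# Spectrum of V is {±s_1, ..., ±s_n} together with m - n zeros
expected = np.sort(np.concatenate([-s, s, np.zeros(m - n)]))
print(np.allclose(eig, expected))   # True
```

This is why the symmetrised ESD of W = XX* can be studied through the resolvent of the Hermitian matrix V.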

Remark 1.
Theorem 3 was auxiliary. T_n was the perturbation of the main equation in the Stieltjes transformation of the limit distribution. The size of T_n was responsible for the stability of the solution of the perturbed equation. We were interested in the estimates of T_n that were uniform in the domain D and had an order of log n/(nv) (such estimates were needed for the proof of the delocalisation of Theorem 6). It was important to know to what extent the estimates depended on both np_n and nv. The estimates behaved differently in the bulk and at the edges of the support of the limit distribution (the introduced functions a_n(z) and b(z) were responsible for the behaviour of the estimates, depending on whether the real part of the argument lay in the bulk or at the edges of the support of the limit distribution). For the estimation of Λ_n, there were two regimes: for |b(z)| ≥ Γ_n, we used the inequality (10) and for |b(z)| ≤ Γ_n, we used the inequality (18).

Corollary 1.
Under the conditions of Theorem 3, the following inequalities hold: and for any Q > 1, a constant C exists that depends on Q, such that: Moreover, for z = u + iv satisfying v ≥ v_0 and |z| ≥ C max{ √(log n)/√(np), log^4 n/(np)^{2κ} } and for Q > 1, a constant C exists that depends on Q, such that:

Corollary 3.
Under the conditions of Theorem 3, for Q ≥ 1, a constant C that depends on Q exists, such that:

Theorem 4. Under the conditions of Theorem 1, for Q ≥ 1, positive constants C = C(Q, δ, µ_{4+δ}, c_0, c_1) and a_0 = a_0(Q, δ, µ_{4+δ}, c_0, c_1) exist, such that for z = u + iv ∈ D(a_0): Moreover, for Q ≥ 1, the corresponding positive constants exist, such that:

To prove the main result, we needed to estimate the entries of the resolvent matrix.

Corollary 4.
Under the conditions of Theorem 5, for v ≥ v_0 and q ≤ c log n, a constant H exists, such that for j, k ∈ T ∪ (T^(1) + n):

Delocalisation
In this section, we demonstrate some applications of the main result. We let L = (L_jk)_{j,k=1}^n and K = (K_jk)_{j,k=1}^m be orthogonal matrices from the SVD of the matrix X, such that:

X = L D̄ K*,

where D̄ = [D_n  O_{n,m−n}] and D_n = diag{s_1, . . . , s_n}.
We proved the following result.

Theorem 6.
Under the conditions (C0)–(C2), for Q ≥ 1, positive constants exist, such that: Moreover, for j = 1, . . . , n, we have:

Proof. First, we noted that according to [13], based on [14] and Theorem 1, constants c_1, c_2, C > 0 exist, such that: Furthermore, by Lemma 11, we obtained: We noted that: These implied that: We chose λ ∼ n^{−1} log^4 n. Then, by Corollary 4, we obtained: We obtained the bounds for K_jk in a similar way. Thus, the theorem was proven.
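The delocalisation of the singular vectors can be observed numerically. In the sketch below, the entries of the singular-vector matrix L of a sparse model stay within a constant multiple of √(log n / n); the slack factor 5 and the parameter choices are illustrative assumptions, not constants from the theorem.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 300, 600
p = np.log(n) ** 2 / n                         # illustrative sparsity
X = rng.standard_normal((n, m)) * (rng.random((n, m)) < p) / np.sqrt(m * p)

L, _, Kt = np.linalg.svd(X)                    # X = L D K^T, L is n x n orthogonal

# Delocalisation: every entry of the singular-vector matrix is of order
# sqrt(log n / n), far below the localised extreme of order 1.
bound = 5 * np.sqrt(np.log(n) / n)             # the factor 5 is illustrative slack
print(np.abs(L).max(), bound)
```

A localised vector would have an entry of order 1; here the largest entry is already an order of magnitude smaller at n = 300.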

The Proof of Corollary 4
Proof. We could write: Combining this inequality with |R_jk| ≤ v^{−1}, we found that: By applying Theorem 5, we obtained what was required. Thus, the corollary was proven.
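The bound |R_jk| ≤ v^{−1} used above is the standard resolvent bound for Hermitian matrices: the spectral norm of R(z) is at most 1/Im z, and every entry is dominated by the norm. A quick numerical check (the matrix sizes and the point z are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 30, 50
X = rng.standard_normal((n, m)) / np.sqrt(m)
V = np.block([[np.zeros((n, n)), X], [X.T, np.zeros((m, m))]])

z = 0.7 + 0.05j                       # spectral parameter with Im z = v = 0.05
R = np.linalg.inv(V - z * np.eye(n + m))

# For a symmetric matrix, every eigenvalue is real, so |lambda - z| >= Im z
# and hence every entry of the resolvent is bounded by 1/Im z.
v = z.imag
print(np.abs(R).max() <= 1 / v + 1e-9)   # True
```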

The Proof of Corollary 2
Proof. We considered the domain D. We noted that for z ∈ D, we obtained: First, we considered the case |b(z)| ≥ Γ_n. This inequality implied that: From there, it followed that: Furthermore, for the case |b(z)| ≥ Γ_n, we obtained |b_n(z)| I{Q} ≥ (1 − γ)|b(z)| I{Q}. We used the inequality: By Chebyshev's inequality, we obtained: By applying Corollary 1, we obtained: First, we noted that for q = K log n: Moreover, for q = C log n: From there, it followed that: Furthermore: Using these estimates, we could show that: By choosing q = K log n and K > C(Q), we obtained: Then, we considered the case |b(z)| ≤ Γ_n. In this case: By applying the inequality |Λ_n(z)| ≤ C|T_n| and Corollary 1, we obtained: It was then simple to show that: Thus, the first inequality was proven. The proof of the second inequality was similar to the proof of the first. We had to use the inequality: which was valid on the real line, instead of |Λ_n| ≤ C|T_n|, which held in the domain D.
Moreover, we noted that for any z value, we obtained: Thus, the corollary was proven.

Proof of Corollary 3
Proof. According to Theorem 4: We noted that for v = V: We split the interval [v_0, V] into subintervals by v_0 < v_1 < · · · < v_M = V, such that for k = 1, . . . , M:

Proof of the Theorems

Proof of Theorem 1

Proof. We obtained: The second term in the RHS of the last inequality was bounded by Corollary 3. For z (such that |b(z)| ≥ CΓ_n(z)), we used the inequality: and the Markov inequality. We could write: We recalled that in the case |b(z)| ≥ Γ_n: In the case |b(z)| ≥ Γ_n and using Corollary 1, we obtained: First, we considered the case |b(z)| ≥ Γ_n. By our definition of r_n(z), we obtained: This inequality completed the proof for |b(z)| ≥ Γ_n. We then considered |b(z)| ≤ Γ_n. We used the inequality |Λ_n(z)| ≤ |T_n| and Corollary 1 to obtain: By choosing a sufficiently large K value, we obtained the proof. Thus, the theorem was proven.

Proof of Theorem 2
Proof. The proof of Theorem 2 was similar to the proof of Theorem 1. We only noted the inequality:

The Proof of Theorem 5
Proof. Using the definition of the Stieltjes transformation, we obtained: It is also well known that for z = u + iv: We considered the following event for 1 ≤ j ≤ n, 1 ≤ k ≤ m and C > 0: We set: For j ∈ J^c, k ∈ K^c and all u, we obtained: We recalled: Then: We introduced the events: It was easy to see that: In what follows, we used Q := Q_γ(v).
We considered the off-diagonal elements of the resolvent matrix. It could be shown that for j ≠ k ∈ J^c: for l ≠ k ∈ K^c: and where Inequalities (21) and (22) implied that: for 1 ≤ j ≤ n and C > 4√y and that:

Pr{ |R_{l+n,l+n}| I{Q} > C|A_0(z)| } ≤ Pr{ |ε_{l+n}| I{Q} > 1/(4|A_0(z)|) }

for 1 ≤ l ≤ m and C > 2. Equations (23)–(25) produced:

Pr{ |R_jk| I{Q} > C|S_y(z)| } ≤ Pr{ |R_jj| I{Q} > C|S_y(z)| } + Pr{ |ζ_jk| I{Q} > 1 }

for 1 ≤ j ≠ k ≤ n and:

Pr{ |R_{l+n,k+n}| I{Q} > C|A_0(z)| } ≤ Pr{ |R_{l+n,l+n}| I{Q} > C|A_0(z)| } + Pr{ |ζ_{l+n,k+n}| I{Q} > 1 }

for 1 ≤ l ≠ k ≤ m. Similarly, we obtained: We noted that for |z| ≤ B, we obtained: Using Rosenthal's inequality, we found that: for 1 ≤ j ≠ k ≤ n and that: for 1 ≤ j ≠ k ≤ m. We noted that: Using Chebyshev's inequality, we obtained: By applying the triangle inequality to the results of Lemmas 1–3 (the bounds for the entries of the resolvent matrix), we arrived at the inequality:

( (q + q²s(a_n(z) + |A_0(z)|)) / (nv) )^{q/2}.

When we set q ∼ log² n, nv > C log^4 n and np > C(log n)^{2/κ} and took into account that κ < 1/2 and |A_0(z)| ≤ C/|z|, we obtained: Moreover, the constant c could be made arbitrarily large. We could obtain similar estimates for the quantities ε_{l+n}, ζ_jk, ζ_{j+n,k}, ζ_{j,k+n}, ζ_{j+n,k+n}. Inequalities (27) and (28) implied: The last inequalities produced: We noted that k_v ≤ C log n for v ≥ v_0 = n^{−1} log^4 n. So, by choosing c large enough, we obtained: This completed the proof of the theorem.
We noted that: where, as before: We estimated the value: It was easy to see that: To estimate D_n, we used the approach developed in [15], which goes back to Stein's method. We let:

ϕ(z) := z|z|^{q−2}.
We set: Then, we could write:

D_n := E T_n ϕ(T_n).
The equality: implied that a constant C exists that depends on the γ in the definition of Q, such that: We considered: Then:

D_n ≤ E |T_n|^q I{Q} I{B} + Cn^{−c log n}.
By the definition of T_n, we could rewrite the last inequality as: We set: where D_n^{(1)} was defined correspondingly. We obtained:

(1/n) Σ_{j=1}^n ε_{j1} R_jj = s_n(z)/(2n) + s_n(z)/(2nz),

and this yielded: Then, we used: Inequality (30) implied that for z ∈ D: where J_1 = C a_n(z)/(nv).

Estimation of D_n^{(24)}
Using Taylor's formula, we obtained: where τ is uniformly distributed on the interval [0, 1] and is independent of the remaining random variables. Since I{B} = 1 yields |R_jj| ≤ C|S_y(z)|, we found that: Taking into account the inequality: we obtained: By applying Hölder's inequality, we obtained: Jensen's inequality produced: To estimate D_n^{(24)}, we had to obtain the bounds for: Using Cauchy's inequality, we obtained:

Estimation of V_j^{(1)}

Lemma 2 produced: and, in turn, Lemma 3 produced: By summing the obtained estimates, we arrived at the following inequality: We considered T̂_n − T̂_n^{(j)}. Since T̂_n = T_n h_γ(|Λ_n|, v) and T̂_n^{(j)} = E_j T̂_n, we obtained: Further, we noted that: We obtained: Then, we returned to the estimation of V_j^{(2)}. Equality (41) implied: We could rewrite this as: First, we found that: and We noted that: It was straightforward to see that: This bound implied that:
By combining the estimates that were obtained for A_1, . . . , A_4, we concluded that:
By applying: we obtained: The last inequality produced: We put:

The Proof of Theorem 4
Proof. We considered the case z ∈ D. For such z, we obtained: This implied that a constant C_1 exists, depending on V and y, such that: First, we considered the case |b(z)| ≥ Γ_n. Without a loss of generality, we assumed that C_0 ≥ C_1, where C_0 is the constant in the definition of a_n(z). This meant that a_n(z) = Im b(z) + C_0 Γ_n. Furthermore: Using Theorem 3, we obtained: We let: We then analysed F_i/|b(z)|^q for i = 1, . . . , 6.
• The bound of F_1/|b(z)|^q. By the definition of a_n(z) and F_1, we obtained:
• The bound of F_2/|b(z)|^q. By the definition of F_2, we obtained: For this, we used |S_y(z)||A_0(z)| = |1 + zS_y(z)| ≤ C.
From there and from the definition of F_5, it followed that:
• The bound of F_6/|b(z)|^q. Simple calculations showed that: We defined: By combining all of these estimates and using: we obtained: For z ∈ D (such that Γ_n ≤ |b(z)|), we could write: Then, we considered |b(z)| ≤ Γ_n. In this case, we used the inequality: In what follows, we assumed that q ∼ log n. The bound of E|T_n|^q for |b(z)| ≤ Γ_n:
• By the definition of a_n(z), we obtained: a_n(z)/(nv) = Γ_n/(nv). We could obtain from this that, for sufficiently small δ > 0 values:
• We noted that Γ_n ≥ Im b(z) ≥ Im A_0(z). This immediately implied that:
• We noted that for Im b(z) ≤ |b(z)| ≤ Γ_n, we obtained: From there, it followed that:
• Simple calculations showed that:
• Simple calculations showed that:
• It was straightforward to check that: By applying the Markov inequality for Γ_n ≤ Im b(z) ≤ C, we obtained: On the other hand, when Im b(z) ≤ Γ_n, we used the inequality: By applying the Markov inequality, we obtained: This implied that: We noted that Q = Q(v) for V ≥ v ≥ v_0 and that for V ≥ v ≥ v_0: a_n(z) ≥ C log² n/n.
On the other hand: We chose ∆v, such that: It was enough to put ∆v := n^{−4}. We let K := (V − v_0)/∆v. For ν = 0, . . . , K − 1, we defined: and v_K = V. We noted that v_0 < v_1 < · · · < v_K = V and that: We started with v_K = V. We noted that: This implied that: From there, it followed that: By repeating this procedure and using the union bound, we obtained the proof. Thus, Theorem 4 was proven.

Auxiliary Lemmas
Lemma 1. Under the conditions of Theorem 5, for j ∈ J^c and l ∈ K^c, we have: Proof. For simplicity, we only considered the case J = ∅ and K = ∅. We noted that:
By applying Schur's formula, we obtained: The second inequality was proven in a similar way.
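Schur's formula used here is the block-inverse identity: the upper-left block of M^{−1} equals the inverse of the Schur complement A − BD^{−1}C. A numerical check on a generic well-conditioned block matrix (the sizes and the diagonal shift are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
k, l = 3, 4
A = rng.standard_normal((k, k)) + 5 * np.eye(k)   # shift keeps blocks invertible
B = rng.standard_normal((k, l))
C = rng.standard_normal((l, k))
D = rng.standard_normal((l, l)) + 5 * np.eye(l)

M = np.block([[A, B], [C, D]])
Minv = np.linalg.inv(M)

# Schur complement identity: (M^{-1})_{[0:k, 0:k]} = (A - B D^{-1} C)^{-1}
schur = np.linalg.inv(A - B @ np.linalg.inv(D) @ C)
print(np.allclose(Minv[:k, :k], schur))   # True
```

Applied to V − zI, this identity expresses each diagonal resolvent entry R_jj through the resolvent of the minor with row and column j removed.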

Lemma 2.
Under the conditions of Theorem 5, for all j ∈ J^c, the following inequalities are valid: In addition, for q > 2, we have: Proof. For simplicity, we only considered the case J = ∅ and K = ∅. The first two inequalities were obvious. We only considered q > 2. By applying Rosenthal's inequality, for q > 2, we obtained: We recalled that µ_r = E|X_jk ξ_jk|^r and, under the conditions of the theorem: By substituting the last inequality into Inequality (44), we obtained: The second inequality could be proven similarly.
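Rosenthal's inequality controls E|Σ η_i|^q through a variance term and the individual q-th moments. For q = 4 and independent Rademacher signs, the underlying moment computation is exact, E S_n^4 = n + 3n(n − 1) = 3n² − 2n, which the following exhaustive enumeration confirms (the choice of Rademacher variables and n = 10 is purely illustrative):

```python
import itertools

# Exact fourth moment of S_n = eps_1 + ... + eps_n for independent signs:
# only the n "all four equal" terms and the 3n(n-1) "paired" terms survive,
# so E S_n^4 = n + 3n(n-1) = 3n^2 - 2n.
n = 10
total = sum(sum(eps) ** 4 for eps in itertools.product((-1, 1), repeat=n))
moment = total / 2 ** n
print(moment)   # 280.0 = 3*10**2 - 2*10
```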

Lemma 3.
Under the conditions of the theorem, for all j ∈ J^c, the following inequalities are valid: and In addition, for q > 2, we have: and for l ∈ K^c, we have: Proof. It sufficed to apply the inequality from Corollary 1 of [16].
We recalled the notation:

Lemma 4.
Under the conditions of the theorem, the following bounds are valid: and Proof. We considered the equality: It implied that: Further, we noted that for a sufficiently small γ value, a constant H existed, such that: |R_jj| I{Q} ≤ H|S_y(z)| I{Q}. Hence: It was easy to see that: We introduced the events: It was obvious that: I{Q} ≤ I{Q} I{Q^(j)}. Consequently: Further, we considered Q = {|Λ_n| ≤ 2γ a_n(z)}. We obtained: Then, it followed that: Next, the following inequality held: Under the condition (C0) and the inequality |R_jj| ≤ v_0^{−1}, we obtained the bounds: By applying Lemmas 2 and 3, for the first term on the right side of (48), we obtained: This completed the proof of Inequality (45). Furthermore, by using representation (47), we obtained: By applying Lemmas 2 and 3, we obtained: By applying Young's inequality, we obtained the required proof. Thus, the lemma was proven.

Lemma 5.
Under the conditions of the theorem, we have: Proof. We set Λ (j) n (z) − S y (z). Using Schur's complement formula: X jl X jk ξ jl ξ jk [R (j) ] 2 k+n,l+n )R jj .
Since Λ (j) n was measurable with respect to M (j) , we could write: n }.
We introduced the notation: X jl X jk ξ jl ξ jk [R (j) 2 ] k+n,l+n .
From the above estimates and Lemma 4, we concluded that: |S y (z)| 2 np + E j |R jj − E j R jj | 2 I{Q} I{B} + C|S y (z)| 2 (nv) 2 a n (z) nv .
Thus, the lemma was proven.
Proof. We used the representation involving [(R^{(j)})²]_{l+n,l+n}(R_jj − E_j R_jj). We noted that, by using Rosenthal's inequality: Similarly, for the second moment of η_{j2}, we obtained the following estimate:

E_j |η_{j2}|^q I{Q} I{B} ≤ C^q q^q a_n(z)/(n^{q/2} v^q) + C^q q^{2q} |A_0(z)|^q / ((np)^{2qκ+2} v^q).

From the estimates above and Lemma 4, we concluded that:

E_j |T_n^{(j)}|^q I{Q} I{B} ≤ C^q (a_n^q(z)/(nv)^q) E_j |R_jj − E_j R_jj|^q I{Q} I{B} + C^q q^{q/2} a_n^{q/2}(z) |A_0(z)|^{q/2} |S_y(z)|^q / ((nv)^q (np)^{q/2}) + C^q q^q |S_y(z)|^q |A_0(z)|^q / ((nv)^q (np)^{2qκ+1}) + C^q q^q a_n^{q/2}(z) |S_y(z)|^q / (nv)^{3q/2}.

To finish the proof, we applied Inequalities (45) and (46). Thus, the lemma was proven.
We let X be a rectangular n × m matrix with m ≥ n and s_1 ≥ · · · ≥ s_n be the singular values of the matrix X. The diagonal n × n matrix with d_jj = s_j was denoted by D_n = (d_jk). We let O_{n,k} be an n × k matrix with zero entries. We put O_n = O_{n,n} and D̄ = [D_n  O_{n,m−n}]. We let L and K be orthogonal (Hermitian) matrices, such that the singular value decomposition held:

X = L D̄ K*.

Furthermore, we let I_n be the n × n identity matrix and E_n = [I_n  O_{n,m−n}]. We introduced the matrices L_n = L E_n and K_n = K E_n^T. We noted that L_n^* = E_n^T L^* and K_n^* = E_n K^*. We introduced the matrix:

V = [[O, X], [X*, O]].

We considered this matrix and then obtained the following: