No Statistical-Computational Gap in Spiked Matrix Models with Generative Network Priors

We provide a non-asymptotic analysis of the spiked Wishart and Wigner matrix models with a generative neural network prior. Spiked random matrices have the form of a rank-one signal plus noise and have been used as models for high dimensional Principal Component Analysis (PCA), community detection and synchronization over groups. Depending on the prior imposed on the spike, these models can display a statistical-computational gap between the information-theoretically optimal reconstruction error achievable with unbounded computational resources and the sub-optimal performance of currently known polynomial time algorithms. These gaps are believed to be fundamental, as in the emblematic case of Sparse PCA. In stark contrast to such cases, we show that there is no statistical-computational gap under a generative network prior, in which the spike lies on the range of a generative neural network. Specifically, we analyze a gradient descent method for minimizing a nonlinear least squares objective over the range of an expansive Gaussian neural network and show that it can recover in polynomial time an estimate of the underlying spike with rate-optimal sample complexity and dependence on the noise level.


Introduction
One of the fundamental problems in statistical inference and signal processing is the estimation of a signal given noisy high dimensional data. A prototypical example is provided by spiked matrix models, where a signal y* ∈ R^n is to be estimated from a matrix Y taking one of the following forms:
• Spiked Wishart Model, in which Y ∈ R^{N×n} is given by
Y = u y*^T + σ Z,   (1)
where σ > 0, u ∼ N(0, I_N), the entries Z_ij are i.i.d. from N(0, 1), and u and Z are independent;
• Spiked Wigner Model, in which Y ∈ R^{n×n} is given by
Y = y* y*^T + ν H,   (2)
where ν > 0 and H ∈ R^{n×n} is drawn from a Gaussian Orthogonal Ensemble GOE(n), that is, H_ii ∼ N(0, 2/n) for all 1 ≤ i ≤ n and H_ij = H_ji ∼ N(0, 1/n) for 1 ≤ j < i ≤ n.
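As a concrete illustration, both observation models can be simulated directly; the sketch below assumes the forms Y = u y*^T + σZ and Y = y* y*^T + νH written above, with the GOE(n) normalization stated for H.

```python
import numpy as np

def spiked_wishart(y_star, N, sigma, rng):
    """Data matrix Y = u y*^T + sigma Z, with u ~ N(0, I_N) and Z_ij i.i.d. N(0, 1)."""
    n = y_star.shape[0]
    u = rng.standard_normal(N)
    Z = rng.standard_normal((N, n))
    return np.outer(u, y_star) + sigma * Z

def spiked_wigner(y_star, nu, rng):
    """Observation Y = y* y*^T + nu H, with H ~ GOE(n):
    H_ii ~ N(0, 2/n) and H_ij = H_ji ~ N(0, 1/n)."""
    n = y_star.shape[0]
    A = rng.standard_normal((n, n)) / np.sqrt(n)   # i.i.d. N(0, 1/n) entries
    H = (A + A.T) / np.sqrt(2)                     # symmetrized: GOE(n) variances
    return np.outer(y_star, y_star) + nu * H

rng = np.random.default_rng(0)
y_star = rng.standard_normal(200)
y_star /= np.linalg.norm(y_star)
Y_wishart = spiked_wishart(y_star, N=500, sigma=0.5, rng=rng)
Y_wigner = spiked_wigner(y_star, nu=0.5, rng=rng)
```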
In the last 20 years, spiked random matrices have been extensively studied, as they serve as a mathematical model for many signal recovery problems such as PCA [1][2][3][4], synchronization over graphs [5][6][7] and community detection [8][9][10]. Furthermore, these models are archetypal examples of the trade-off between statistical accuracy and computational efficiency. From a statistical perspective, the objective is to understand how the

Our Contribution
In this paper we analyze the spiked matrix models (1) and (2) under a generative network prior in the non-asymptotic, finite data regime. We consider a d-layer feedforward generative network G : R^k → R^n with architecture (3). We furthermore assume that the planted spike y* ∈ R^n lies on the range of G, that is, there exists a latent vector x* ∈ R^k such that y* = G(x*).
To estimate y*, we first find an estimate x̂ of the latent vector x* and then use G(x̂) to estimate y*. We thus consider the minimization problem (4) (under the conditions on the generative network specified below, it was shown in [15] that G is invertible and there exists a unique x* that satisfies y* = G(x*)), where:
• for the Wishart model (1) we take M = Σ_N − σ² I_n with Σ_N = Y^T Y/N;
• for the Wigner model (2) we take M = Y.
Despite the non-convexity and non-smoothness of the problem, our preliminary work in [31] shows that when the generative network G is expansive and has Gaussian weights, (4) enjoys a favorable optimization geometry. Specifically, every nonzero point outside two small neighborhoods around x* and a negative multiple of it has a descent direction, which is given a.e. by the gradient of f. Furthermore, in [31] it is shown that the global minimum of f lies in the neighborhood around x* and has optimal reconstruction error. This result suggests that a first order optimization algorithm can efficiently solve (4), and that no statistical-computational gap is present for the spiked matrix models with a (random) generative network prior in the finite data regime. In the current paper, we prove this conjecture by providing a polynomial-time subgradient method that minimizes the non-convex problem (4) and obtains information-theoretically optimal error rates.
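The matrix M passed to the minimization problem (4) is built from the observations exactly as listed above; a minimal sketch:

```python
import numpy as np

def build_M(Y, model, sigma=None):
    """M for problem (4): Wishart: M = Sigma_N - sigma^2 I_n with Sigma_N = Y^T Y / N;
    Wigner: M = Y."""
    if model == "wishart":
        N, n = Y.shape
        Sigma_N = Y.T @ Y / N
        return Sigma_N - sigma ** 2 * np.eye(n)
    if model == "wigner":
        return Y
    raise ValueError("model must be 'wishart' or 'wigner'")
```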
Our main contribution can be summarized as follows. We analyze a subgradient method (Algorithm 1) for the minimization of (4) and show that after a polynomial number of steps T̂, and up to polynomial factors in the depth d of the network, the iterate x_T̂ satisfies the following reconstruction errors:
• in the Spiked Wishart Model, the bound (5), in the regime N ≳ k log(n);
• in the Spiked Wigner Model, the bound (6).
We notice that these bounds are information-theoretically optimal up to the log factors in n, and correspond to the best achievable in the case of a k-dimensional subspace prior. In particular, they imply that efficient recovery in the Wishart model is possible with a number of samples N proportional to the intrinsic dimension of the signal y*. Similarly, the bound in the Spiked Wigner Model implies that imposing a generative network prior leads to a reduction of the noise by a factor of k/n.
Algorithm 1: Subgradient method for the minimization problem (4)
Input: weights W_i, observation matrix M, step size µ > 0, initial point x_0 ∈ R^k \ {0};
1 for i = 0, 1, . . . do
2   if f(−x_i) < f(x_i) then x̃_i ← −x_i, else x̃_i ← x_i;
3   x_{i+1} ← x̃_i − µ v_{x̃_i}, where v_{x̃_i} ∈ ∂f(x̃_i);
4 end
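A compact NumPy sketch of Algorithm 1 is given below. The loss is assumed to take the rank-one least squares form f(x) = (1/4)‖M − G(x)G(x)^T‖²_F (the display (4) is not reproduced here), and the subgradient is the almost-everywhere gradient obtained by differentiating through the active ReLU pattern; the normalization of the objective and the fixed iteration count are assumptions of this sketch.

```python
import numpy as np

def forward(x, weights):
    """ReLU network G(x) together with Lambda_x such that G(x) = Lambda_x @ x (a.e.)."""
    h, Lam = x, np.eye(x.size)
    for W in weights:
        pre = W @ h
        W_plus = (pre > 0)[:, None] * W        # diag(W h > 0) W: rows active at h
        Lam = W_plus @ Lam
        h = np.maximum(pre, 0.0)
    return h, Lam

def loss(x, M, weights):
    """Assumed form of the objective (4): f(x) = (1/4) ||M - G(x) G(x)^T||_F^2."""
    g, _ = forward(x, weights)
    return 0.25 * np.linalg.norm(M - np.outer(g, g)) ** 2

def algorithm1(weights, M, mu, x0, num_iters):
    """Subgradient method with the negation check described in the text."""
    x = x0.copy()
    for _ in range(num_iters):
        if loss(-x, M, weights) < loss(x, M, weights):
            x = -x                              # negation step: avoid the spurious negative multiple
        g, Lam = forward(x, weights)
        v = Lam.T @ (np.outer(g, g) - M) @ g    # a.e. gradient of the assumed objective
        x = x - mu * v
    return x
```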

Sparse PCA and Other Computational-Statistical Gaps
A canonical problem in statistics is finding the directions that explain most of the variance in a given cloud of data, and it is classically solved by Principal Component Analysis. Spiked covariance models were introduced in [1] to study the statistical performance of this algorithm in the high dimensional regime. Under a spiked covariance model it is assumed that the data are of the form
y_i = u_i y* + σ z_i, i = 1, . . . , N,   (7)
where σ > 0, u_i ∼ N(0, 1) and z_i ∼ N(0, I_n) are independent and identically distributed, and y* is the unit-norm planted spike. Each y_i is an i.i.d. sample from a centered Gaussian N(0, Σ) with spiked covariance matrix Σ = y* y*^T + σ² I_n, with y* being the direction that explains most of the variance. The estimate of y* provided by PCA is the leading eigenvector ŷ of the empirical covariance matrix Σ_N = (1/N) ∑_{i=1}^N y_i y_i^T, and standard techniques from high dimensional probability can be used to show that the bound (8) holds with overwhelming probability as long as N ≳ n (we write f(n) ≳ g(n) if f(n) ≥ C g(n) for some constant C > 0 that might depend on σ and ‖y*‖_2, and similarly for f(n) ≲ g(n)). Note incidentally that the data matrix Y ∈ R^{N×n} with rows {y_i^T}_i can be written as (1). Bounds of the form (8), however, become uninformative in modern high dimensional regimes where the ambient dimension n of the data is much larger than, or on the order of, the number of samples N. Even worse, in the asymptotic regime n/N → c > 0 and for σ² large enough, the spike y* and the estimate ŷ become orthogonal [32], and minimax techniques show that no other estimator based solely on the data (7) can achieve better overlap with y* [33].
In order to obtain consistent estimates and lower the sample complexity of the problem, additional prior information on the spike y* therefore has to be enforced. For this reason, in recent years various priors have been analyzed, such as positivity [34], cone constraints [35] and sparsity [32,36]. In the latter case y* is assumed to be s-sparse, and it can be shown (e.g., [33]) that for N ≳ s log n and n ≳ s, the s-sparse largest eigenvector ŷ_s of Σ_N,
ŷ_s = argmax_{y ∈ S^{n−1}, ‖y‖_0 ≤ s} y^T Σ_N y,
satisfies with high probability the condition
min_{ε=±1} ‖ε ŷ_s − y*‖²_2 ≲ (s log n)/N.
This implies, in particular, that the signal y* can be estimated with a number of samples that scales linearly with its intrinsic dimension s. These rates are also minimax optimal; see for example [4] for the mean squared error and [2] for support recovery. Despite these encouraging results, no currently known polynomial time algorithm achieves such optimal error rates: for example, the covariance thresholding algorithm of [37] requires N ≳ s² samples in order to obtain exact support recovery or the estimation rate min_{ε=±1} ‖ε ŷ_s − y*‖²_2 ≲ (s² log n)/N, as shown in [3]. In summary, only computationally intractable algorithms are known to reach the statistical limit N = Ω(s) for Sparse PCA, while known polynomial time methods are sub-optimal, requiring N = Ω(s²). Notably, [38] provided a reduction of Sparse PCA to the planted clique problem, which is conjectured to be computationally hard. Further strong evidence for the hardness of Sparse PCA has been given in a series of recent works [39][40][41][42][43]. Other computational-statistical gaps have also been found and studied in a variety of contexts, such as sparse Gaussian mixture models [44], tensor principal component analysis [45], community detection [46] and synchronization over groups [47]. These works fit in the growing and important body of literature aiming at understanding the trade-offs between statistical accuracy and computational efficiency in statistical inverse problems.
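The estimator ŷ_s above requires a search over all size-s supports, which is the source of the computational intractability discussed here. A brute-force sketch (feasible only for very small n and s; the function name is ours):

```python
import numpy as np
from itertools import combinations

def sparse_pca_exhaustive(Sigma_N, s):
    """s-sparse leading eigenvector: argmax over unit vectors y with ||y||_0 <= s of y^T Sigma_N y.
    For a fixed support, the optimum is the top eigenvector of the corresponding s x s
    principal submatrix, so the search runs over all (n choose s) supports."""
    n = Sigma_N.shape[0]
    best_val, best_vec = -np.inf, None
    for support in combinations(range(n), s):
        idx = list(support)
        vals, vecs = np.linalg.eigh(Sigma_N[np.ix_(idx, idx)])
        if vals[-1] > best_val:
            best_val = vals[-1]
            y = np.zeros(n)
            y[idx] = vecs[:, -1]
            best_vec = y
    return best_vec
```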
We finally note that many of the above mentioned problems can be phrased as the recovery of a spike vector from a spiked random matrix. The difficulty can be viewed as arising from simultaneously imposing low-rankness and additional prior information on the signal (sparsity in the case of Sparse PCA). This difficulty is found in sparse phase retrieval as well. For example, [25] has shown that m = O(s log n) quadratic measurements are sufficient to ensure well-posedness of the estimation of an s-sparse signal of dimension n lifted to a rank-one matrix, while m = Ω(s²/log² n) measurements are necessary for the success of natural convex relaxations of the problem. Similarly, [48] studied the recovery of simultaneously low-rank and sparse matrices, showing the existence of a gap between what can be achieved with convex, tractable relaxations and with nonconvex, intractable methods.

Inverse Problems with Generative Network Priors
Recently, in the wake of the successes of deep learning, generative networks have gained popularity as a novel approach for encoding and enforcing priors in signal recovery problems. In one deep-learning-based approach, a dataset of "natural signals" is used to train a generative network in an unsupervised manner. The range of this network defines a low-dimensional set which, if the training is successful, contains, or approximately contains, target signals of interest [19,21]. Non-convex optimization methods are then used for recovery by optimizing over the range of the network. We notice that allowing the algorithms complete knowledge of the generative network architecture and of the learned weights is roughly analogous to allowing sparsity-based algorithms knowledge of the basis or frame in which the signal is modeled as sparse.
The use of generative networks for signal recovery has been successfully demonstrated in a variety of settings such as compressed sensing [21,49,50], denoising [16,51], blind deconvolution [22], inpainting [52] and many more [53][54][55][56]. In these papers, generative networks significantly outperform sparsity-based priors at signal reconstruction in the low-measurement regime. This fundamentally leverages the fact that a natural signal can be represented more concisely by a generative network than by a sparsity prior under an appropriate basis. This characteristic has been observed even in untrained generative networks, where the prior information is encoded only in the network architecture, and has been used to devise state-of-the-art signal recovery methods [57][58][59].
Parallel to these empirical successes, a recent line of work has investigated theoretical guarantees for various statistical estimation tasks with generative network priors. The works [15,21] gave global guarantees for compressed sensing, followed by many others for various inverse problems [19,20,50,51,55]. In particular, in [17] the authors have shown that m = Ω(k log n) measurements are sufficient to recover a signal from random phaseless observations, assuming that the signal lies on the range of a generative network with latent dimension k. The same authors then provided in [23] a polynomial time algorithm for recovery in the same setting. Note that, contrary to the sparse phase retrieval problem, generative priors for phase retrieval allow for efficient algorithms with optimal sample complexity, up to logarithmic factors, with respect to the intrinsic dimension of the signal.
Further theoretical advances in signal recovery with generative network priors have been spurred by techniques from statistical physics. Recently, [30] analyzed the spiked matrix models (1) and (2) with y* in the range of a generative network with random weights, in the asymptotic limit k, n, N → ∞ with n/k = O(1) and N/n = O(1). The analysis is carried out mainly for networks with sign or linear activation functions, in the Bayesian setting where the latent vector is drawn from a separable distribution. The authors of [30] provide an Approximate Message Passing and a spectral algorithm, and they numerically observe no statistical-computational gap, as these polynomial time methods are able to asymptotically match the information-theoretic optimum. In this asymptotic regime, [60] further provided precise statistical and algorithmic thresholds for compressed sensing and phase retrieval.

Algorithm and Main Result
In this section we present an efficient and statistically optimal algorithm for the estimation of the signal y* given a spiked matrix Y of the form (1) or (2). The recovery method is detailed in Algorithm 1, and it is based on the direct optimization of the nonlinear least squares problem (4).
Applied in [16] for denoising and compressed sensing under generative network priors, and later used in [23] for phase retrieval, the first order optimization method described in Algorithm 1 leverages the theory of Clarke subdifferentials (the reader is referred to [61] for more details). As the objective function f is continuous and piecewise smooth, at every point x ∈ R^k it has a Clarke subdifferential given by
∂f(x) = conv(v_1, . . . , v_T),   (9)
where conv denotes the convex hull of the vectors v_1, . . . , v_T, which are the gradients of the T smooth pieces of f meeting at x. The vectors v_x ∈ ∂f(x) are the subgradients of f at x, and at a point where f is differentiable the subdifferential reduces to the singleton {∇f(x)}. The reconstruction method presented in Algorithm 1 is motivated by the landscape analysis of the minimization problem (4) for a network G with sufficiently expansive Gaussian weight matrices. Under this assumption, we showed in [31] that (4) has a benign optimization geometry and, in particular, that at any nonzero point outside a neighborhood of x* and of a negative multiple of it, any subgradient of f is a direction of strict descent. Furthermore, we showed that the points in the vicinity of the spurious negative multiple of x* have function values strictly larger than those close to x*. Figure 1 shows the expected value of f in the noiseless case, ν = 0 and N → ∞, for a generative network with latent dimension k = 2. The plot highlights the global minimum at x* = [1, 1] and the flat region near a negative multiple of x*. At each step, the subgradient method in Algorithm 1 checks whether the current iterate x_i has a larger loss value than its negative multiple and, if so, negates x_i. As we show in the proof of our main result, this step ensures that the algorithm avoids the neighborhood around the spurious negative multiple of x* and converges to the neighborhood around x* in a polynomial number of steps.
Below we make the following assumptions on the weight matrices of G.

Assumption 1.
The generative network G defined in (3) has weights W_i ∈ R^{n_i × n_{i−1}} with i.i.d. entries from N(0, 1/n_i), satisfying the expansivity condition (10) with constant ε > 0 for all i and a universal constant c > 0.
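A network satisfying Assumption 1 can be sampled as in the sketch below; the layer sizes are hypothetical and should be chosen so that the expansivity condition (10) holds.

```python
import numpy as np

def sample_network(dims, rng):
    """Weights W_i in R^{n_i x n_{i-1}} with i.i.d. N(0, 1/n_i) entries, i = 1, ..., d."""
    return [rng.standard_normal((n_out, n_in)) / np.sqrt(n_out)
            for n_in, n_out in zip(dims[:-1], dims[1:])]

rng = np.random.default_rng(0)
weights = sample_network([20, 200, 800], rng)   # k = 20, n_1 = 200, n_2 = 800 (hypothetical)
```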
We note that in [31] the expansivity condition was more stringent, requiring an additional log factor. Since the publication of our paper [31], it has been shown in [62] that the more relaxed assumption (10) suffices to ensure a benign optimization geometry. Under Assumption 1, our main theorem below shows that the subgradient method in Algorithm 1 can estimate the spike y* with optimal sample complexity and in a polynomial number of steps.
Theorem 1. Let x* ∈ R^k be nonzero and y* = G(x*), where G is a generative network satisfying Assumption 1 with ε ≤ K_1/d^96. Consider the minimization problem (4) and assume that the noise level and sample size satisfy the required bounds, where
• for the Spiked Wishart Model (1) we take M = Σ_N − σ² I_n, and
• for the Spiked Wigner Model (2) we take M = Y.
Consider Algorithm 1 with nonzero x_0 satisfying ‖x_0‖_2 < R, where R ≥ 5‖x*‖_2/(2√2), and step size µ = 2^{2d} K_3/(8d⁴R²). Then, with overwhelming probability, after a polynomial number of steps the iterate x_T̂ satisfies the reconstruction error bounds (5) and (6), where C > 0, K_1, . . . , K_4 > 0, ρ_1 ∈ (0, 1) and ρ_2 > 0 are universal constants.
Note that the quantity 2^{2d} in the hypotheses and conclusions of the theorem is an artifact of the scaling of the network, and it should not be taken as requiring exponentially small noise or an exponential number of steps. Indeed, under Assumption 1 the ReLU activation zeros out roughly half of the entries of its argument, leading to an "effective" operator norm of approximately 1/2 per layer. We furthermore notice that the dependence on the depth d is likely quite conservative; it was not optimized in the proof, as the main objective was to obtain tight dependence on the intrinsic dimension k of the signal. As shown in the numerical experiments, the actual dependence on the depth is much better in practice. Finally, observe that despite the nonconvex nature of the objective function in (4), we obtain a rate of convergence which does not depend directly on the dimension of the signal, reminiscent of what happens in the convex case.
The quantity ω in Theorem 1 can be interpreted as the intrinsic noise level of the problem (inverse SNR). The theorem guarantees that in a polynomial number of steps the iterates of the subgradient method converge to x* up to ω. For T̂ large enough, G(x_T̂) then satisfies the rate-optimal error bounds (5) and (6).

Numerical Experiments
We illustrate the predictions of our theory by providing results of Algorithm 1 on a set of synthetic experiments. We consider 2-layer generative networks with ReLU activation functions, hidden layer of dimension n_1 = 500, output dimension n_2 = n = 1500 and varying latent dimension k ∈ {40, 60, 100}. We sample the entries of the weight matrices independently from N(0, 2/n_i) (this scaling removes the 2^d dependence in Theorem 1). We then generate data Y according to the spiked models (1) and (2), where x* ∈ R^k is chosen so that y* = G(x*) has unit norm. For the Wishart model we vary the number of samples N, and for the Wigner model we vary the noise level ν, so that the control parameters θ_WS and θ_WG remain constant across the different networks with latent dimension k. In Figure 2 we plot the reconstruction error ‖G(x̂) − y*‖_2 against θ_WS and θ_WG. As predicted by Theorem 1, the errors scale linearly with respect to these control parameters, and moreover the overlap of these plots confirms that the rates are tight with respect to the order of k.
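A minimal sketch of the synthetic setup for the Wigner experiment follows; the value of ν used here is a placeholder, whereas in the experiments it is chosen so that θ_WG stays constant across k.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n = 500, 1500
for k in (40, 60, 100):
    # two-layer ReLU network with N(0, 2/n_i) entries, as in the experiments
    W1 = rng.standard_normal((n1, k)) * np.sqrt(2.0 / n1)
    W2 = rng.standard_normal((n, n1)) * np.sqrt(2.0 / n)
    x_star = rng.standard_normal(k)
    y_star = np.maximum(W2 @ np.maximum(W1 @ x_star, 0.0), 0.0)
    scale = np.linalg.norm(y_star)
    x_star, y_star = x_star / scale, y_star / scale   # ReLU nets are positively homogeneous
    nu = 0.1                                          # placeholder noise level
    A = rng.standard_normal((n, n)) / np.sqrt(n)
    Y = np.outer(y_star, y_star) + nu * (A + A.T) / np.sqrt(2)   # spiked Wigner observation
```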

Recovery Under Deterministic Conditions
We will derive Theorem 1 from Theorem 3 below, which is based on a set of deterministic conditions on the weight matrices and on the noise. Specifically, we consider the minimization problem (4) with M = G(x*) G(x*)^T + H for an unknown symmetric matrix H ∈ R^{n×n}, a nonzero x* ∈ R^k, and a given d-layer feedforward generative network G as in (3).
In order to describe the main deterministic conditions on the generative network G, we begin by introducing some notation. For W ∈ R^{n×k} and x ∈ R^k, we define the operator W_{+,x} := diag(Wx > 0) W, which keeps the rows of W that are active at x and zeroes out the remaining ones; for the layers of G, W_{j,+,x} denotes the analogous operator for the j-th weight matrix, with the activation pattern determined by propagating x through the first j − 1 layers. Finally we let Λ_x = ∏_{j=d}^{1} W_{j,+,x} = W_{d,+,x} · · · W_{1,+,x} and note that G(x) = Λ_x x. With this notation we next recall the following deterministic condition on the layers of the generative network.
Definition 2 (Weight Distribution Condition [15]). We say that W ∈ R^{n×k} satisfies the Weight Distribution Condition (WDC) with constant ε > 0 if for all nonzero x_1, x_2 ∈ R^k:
‖W_{+,x_1}^T W_{+,x_2} − Q_{x_1,x_2}‖ ≤ ε,
where Q_{x_1,x_2} := ((π − θ_{1,2})/(2π)) I_k + (sin θ_{1,2}/(2π)) M_{x̂_1↔x̂_2}, θ_{1,2} := ∠(x_1, x_2), and M_{x̂_1↔x̂_2} is the matrix that sends x̂_1 ↦ x̂_2, x̂_2 ↦ x̂_1 and the orthogonal complement of span{x_1, x_2} to zero. Note that Q_{x_1,x_2} is the expected value of W_{+,x_1}^T W_{+,x_2} when W has rows w_i ∼ N(0, I_k/n), and if x_1 = x_2 then Q_{x_1,x_2} is an isometry up to the scaling factor 1/2. Below we will say that a d-layer generative network G of the form (3) satisfies the WDC with constant ε > 0 if every weight matrix W_i satisfies the WDC with constant ε, for all i = 1, . . . , d.
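The WDC can be explored numerically: the sketch below implements Q_{x1,x2} in the form recalled above from [15] and checks by sampling that W_{+,x1}^T W_{+,x2} concentrates around it as n grows.

```python
import numpy as np

def Q(x1, x2):
    """Q_{x1,x2} = ((pi - theta)/(2 pi)) I_k + (sin(theta)/(2 pi)) M, where M swaps the
    unit vectors x1_hat <-> x2_hat and is zero on their orthogonal complement."""
    k = x1.size
    u = x1 / np.linalg.norm(x1)
    w = x2 / np.linalg.norm(x2)
    theta = np.arccos(np.clip(u @ w, -1.0, 1.0))
    if np.isclose(theta, 0.0):
        M = np.outer(u, u)
    else:
        v = w - (w @ u) * u
        v /= np.linalg.norm(v)
        c, s = np.cos(theta), np.sin(theta)
        M = c * np.outer(u, u) + s * np.outer(u, v) + s * np.outer(v, u) - c * np.outer(v, v)
    return (np.pi - theta) / (2 * np.pi) * np.eye(k) + np.sin(theta) / (2 * np.pi) * M

rng = np.random.default_rng(0)
n, k = 20000, 5
W = rng.standard_normal((n, k)) / np.sqrt(n)         # rows w_i ~ N(0, I_k / n)
x1, x2 = rng.standard_normal(k), rng.standard_normal(k)
Wp1 = (W @ x1 > 0)[:, None] * W
Wp2 = (W @ x2 > 0)[:, None] * W
print(np.linalg.norm(Wp1.T @ Wp2 - Q(x1, x2), 2))    # spectral deviation, small for large n
```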
The WDC was originally introduced in [15]; it ensures that the angle between two vectors in the latent space is approximately preserved at the output layer and, in turn, it guarantees the invertibility of the network. Assumption 1 guarantees that the generative network G satisfies the WDC with high probability.
We are now able to state our recovery guarantees for a spike y* under deterministic conditions on the network G and the noise H.
Theorem 3. Let d ≥ 2 and assume the generative network (3) has weights W_i ∈ R^{n_i × n_{i−1}} satisfying the WDC with constant 0 < ε ≤ K_1/d^96. Consider Algorithm 1 with M = G(x*) G(x*)^T + H, x* ∈ R^k \ {0}, and H a symmetric matrix satisfying the noise bound (13). Then the iterates {x_i}_{i≥0} generated by Algorithm 1 satisfy 0 < ‖x_i‖ < R and obey the following: there exists an iteration number T, polynomial in the relevant parameters, such that for any i ≥ T the error bounds (11) and (12) hold, where K_1, . . . , K_6 > 0, ρ_1 ∈ (0, 1) and ρ_2 > 0 are universal constants.
Theorem 1 then follows from Theorem 3 after proving that, with high probability, the spectral norm of Λ_x^T H Λ_x, where H = M − y* y*^T, can be upper bounded by ω/2^d, and that the weights of the network G satisfy the WDC with high probability.
In the rest of this section we describe the main steps and tools needed to prove Theorem 3.

Technical Tools and Outline of the Proofs
Our proof strategy for Theorem 3 can be summarized as follows:
1. In Proposition A1 (Appendix A.3) we show that the iterates {x_i}_{i≥1} of Algorithm 1 stay inside the Euclidean ball of radius R and remain nonzero for all i ≥ 1.
2. We then identify two small Euclidean balls B_+ and B_− around, respectively, x* and −ρ_d x*, where ρ_d ∈ (0, 1) only depends on the depth of the network. In Proposition A2 we show that after a polynomial number of steps the iterates {x_i} of Algorithm 1 enter the region B_+ ∪ B_− (Appendix A.4).
3. We show, in Proposition A3, that the negation step causes the iterates of the algorithm to avoid the spurious point −ρ_d x* and to enter B_+ within a polynomial number of steps (Appendix A.5).
4. We finally show, in Proposition A4, that in B_+ the loss function f enjoys a favorable convexity-like property, which implies that the iterates {x_i} remain in B_+ and eventually converge to x* up to the noise level (Appendix A.6).
One of the main difficulties in the analysis of the subgradient method in Algorithm 1 is the lack of smoothness of the loss function f. We show that the WDC allows us to overcome this issue by proving that the subgradients of f are uniformly close, up to the noise level, to a vector field h_x ∈ R^k which is continuous for nonzero x (see Appendix A.2). We show furthermore that h_x is locally Lipschitz, which allows us to conclude that the subgradient method decreases the value of the loss function until eventually reaching B_+ ∪ B_− (Appendix A.4).
Using the WDC, we show that the loss function f is uniformly close to an explicit function f_E. A direct analysis of f_E reveals that its values inside B_− are strictly larger than those inside B_+. This property extends to f as well, and guarantees that the method will not converge to the spurious point −ρ_d x* (Appendix A.5).

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Supporting Lemmas and Proof of Theorem 3
The proof of Theorem 3 is provided in Appendix A.7. We begin this section with a set of preliminary results and supporting lemmas.
Appendix A.1. Notation We collect the notation used throughout the paper. For any real number a, let relu(a) = max(a, 0), and for any vector v ∈ R^n denote by relu(v) the entrywise application of relu. Let diag(Wx > 0) be the diagonal matrix with i-th diagonal element equal to 1 if (Wx)_i > 0 and 0 otherwise. For any vector x we denote by ‖x‖ its Euclidean norm, and for any matrix A we denote by ‖A‖ its spectral norm and by ‖A‖_F its Frobenius norm. The Euclidean inner product between two vectors a and b is ⟨a, b⟩, while for two matrices A and B their Frobenius inner product is denoted by ⟨A, B⟩_F. For any nonzero vector x ∈ R^n, let x̂ = x/‖x‖. For a set S we write |S| for its cardinality and S^c for its complement. Let B(x, r) be the Euclidean ball of radius r centered at x, and S^{k−1} the unit sphere in R^k. Let θ_0 = ∠(x, x*) and, for i ≥ 0, let θ_{i+1} = g(θ_i), where g is defined in (A1). We write γ = Ω(δ) to mean that there exists a positive constant C such that γ ≥ Cδ, and similarly γ = O(δ) if γ ≤ Cδ. Additionally, we write a = b + O_1(δ) when ‖a − b‖ ≤ δ, where the norm is understood to be the absolute value for scalars, the Euclidean norm for vectors and the spectral norm for matrices.

Appendix A.2. Preliminaries
For later convenience we define the vectors p_x := Λ_x^T Λ_x x and η_x := Λ_x^T H Λ_x x, and we write v̄_x for the gradient of f in the noiseless case H = 0. Note then that when f is differentiable at x, ṽ_x := ∇f(x) = v̄_x − η_x; in particular, when H = 0 we have ṽ_x = v̄_x.
The following function g, defined in (A1), controls how angles are contracted by a ReLU layer. As we mentioned in Section 4.1, our analysis is based on showing that the subgradients of f are uniformly close to the vector field h_x, which is defined in terms of the angles θ_i := g(θ_{i−1}), with g given by (A1) and θ_0 = ∠(x, x*).
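For intuition, the angle-contraction map can be iterated numerically. The closed form used below is the one given for this map in [15]; since the display (A1) is not reproduced here, we take it as an assumption of the sketch.

```python
import numpy as np

def g(theta):
    """ReLU angle contraction (form as in [15], assumed to coincide with (A1))."""
    return np.arccos(((np.pi - theta) * np.cos(theta) + np.sin(theta)) / np.pi)

angles = [0.9 * np.pi]        # start from a nearly antipodal pair of latent directions
for _ in range(8):
    angles.append(g(angles[-1]))
print(np.round(angles, 3))    # the angle shrinks with every additional ReLU layer
```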
The next lemma shows that the noiseless gradient v̄_x concentrates around h_x.
Lemma A2. Suppose d ≥ 2 and the WDC holds with ε < 1/(16πd²)²; then the stated concentration bound holds for all nonzero x.
We now use the characterization of the Clarke subdifferential given in (9) to derive a bound on the concentration of v_x ∈ ∂f(x) around h_x up to the noise level.
Lemma A3. Under the assumptions of Lemma A2, and assuming that the noise term Λ_x^T H Λ_x is suitably bounded, the subgradients v_x ∈ ∂f(x) concentrate around h_x up to the noise level.
From the above and the bound on the noise level ω we can bound the norm of the step v_x in Algorithm 1.

Lemma A4. Under the assumptions of Lemma A3, and assuming that ω satisfies (13), the stated bound holds for any v_x ∈ ∂f(x).

Appendix A.3. Iterates Stay Bounded
In this section we prove that all the iterates {x_i} generated by Algorithm 1 remain inside the Euclidean ball B(0, R), where R ≥ √2 C ‖x*‖ and C = 5/4.
Lemma A5. Under the assumptions of Theorem 3, for any x ∈ R^k with C‖x*‖ < ‖x‖ ≤ R and any λ ∈ [0, 1], it holds that ‖x − λµ v_x‖ ≤ ‖x‖.
From the previous lemma we can now derive the boundedness of the iterates of the Algorithm 1.

Proposition A1. Under the assumptions of Theorem 3, the iterates {x_i} of Algorithm 1 remain inside B(0, R) and are nonzero for all i ≥ 1.
Proof. Assume C‖x*‖ < ‖x‖ ≤ R; then the conclusion follows from Lemma A5. Assuming instead that ‖x‖ ≤ C‖x*‖, note that ‖x − λµ v_x‖ ≤ R by Lemma A4, d ≥ 2 and the assumptions on µ and R. Finally observe that if x_i ∈ B(0, R) then the same holds for x̃_i.
Lastly, if for some i ≥ 0 it were the case that 0 = x̃_i − λµ v_{x̃_i}, then this would imply ‖x̃_i‖ = λµ ‖v_{x̃_i}‖, which cannot happen because, by Lemma A4 and the choice of the step size, it holds that µ ‖v_{x̃_i}‖ ≤ ‖x̃_i‖/8.

Appendix A.4. Convergence to S_β
We define the set S_β, outside of which we can lower bound the norm of h_x, where the constant β is defined in (A7). Outside the set S_β the sub-gradients of f are bounded below and the landscape has favorable optimization geometry.
Based on the previous lemma we can prove the main result of this section.

Proposition A2. Under the assumptions of Theorem 3, as long as the iterates of Algorithm 1 lie outside S_β, each step decreases the loss by at least the amount in (A10); consequently, the iterates enter S_β after at most a polynomial number of steps.
By the mean value theorem for Clarke subdifferentials [61] (Theorem 8.13), there exists λ ∈ (0, 1) such that the displayed bound holds, where the first inequality follows from the triangle inequality and the second from Equation (A9). Next observe that (A8), together with the definition of µ and (A11), gives (A10).
Next take x_i ∉ S_β and assume that the negation step is applied, that is, f(−x_i) < f(x_i); we then obtain (A10) proceeding as before.
Finally, the claim on the maximum number of iterations follows directly from a telescoping sum over (A10) and f(·) ≥ 0.

Appendix A.5. Convergence to a Neighborhood Around x*
In the previous section we showed that after a finite number of steps the iterates {x_i} of Algorithm 1 enter the region S_β. In this section we show that, thanks to the negation step in the descent algorithm, they eventually remain confined in a neighborhood of x*.
The following lemma shows that S_β is contained inside two balls around x* and −ρ_d x*, where R_1, R_2 are numerical constants and 0 < ρ_d < 1 is such that ρ_d → 1 as d → ∞.
We furthermore observe that the ball around −ρ_d x* has loss values strictly higher than the one around x*.
The main result of this section concerns the convergence of the iterates {x_i} to a neighborhood of x*.
Proposition A3. Under the assumptions of Theorem 3, there exists an integer T such that x_{i+T} ∈ B(x*, R_1 β d^{10} ‖x*‖); in particular, (A13) holds.
Proof. Either x_i ∈ S_β or, by Proposition A2, there exists T such that x_{i+T} ∈ S_β. By the choice of ε > 0, the definition of β in (A7), and the assumption on the noise level (13), it follows that the hypotheses of Lemma A7 are satisfied. We therefore define the two neighborhoods S_β^+ := S_β ∩ B_+ and S_β^− := S_β ∩ B_− and conclude that either x_{i+T} ∈ S_β^+ or x_{i+T} ∈ S_β^−.
We next notice that S_β^+ ⊆ B(x*, ‖x*‖ d^{−12}) and S_β^− ⊆ B(−ρ_d x*, ‖x*‖ d^{−12}). We can then use Lemma A8 to conclude that, by the negation step, if x_{i+T} ∈ S_β^+ then x̃_{i+T} ∈ S_β^+, and otherwise x̃_{i+T} ∈ −S_β^−. We now analyze the case x̃_{i+T} ∈ −S_β^− ∩ S_β^c. Applying Proposition A2 again, there exists an integer T′ such that x_{i+T+T′} ∈ S_β. Furthermore, Proposition A2 implies that f(x_{i+T+T′}) < f(x̃_{i+T}), while from Lemma A8 we know that f(x̃_{i+T}) < f(y) for all y ∈ S_β^−. We therefore conclude that x_{i+T+T′} must be in S_β^+.
In summary, we have obtained that there exists an integer T such that x_{i+T} ∈ S_β^+ ⊂ B_+. Finally, Equation (A13) follows from the definition of β in (A7) and of B_+.
Based on the previous condition on the direction of the sub-gradients, we can then prove that the iterates of the algorithm converge to x* up to the noise.
Proposition A4. Under the assumptions of Theorem 3, if x_T ∈ B_+ then for any i ≥ T it holds that x_i ∈ B_+ and furthermore (A14) holds.
Proof. If i = T then, by Proposition A2, x_i = x̃_i ∈ B_+. Furthermore, the assumptions of Lemma A9 are satisfied and we can apply its bound. Next recall that ω satisfies (13), µ = K_3 2^{2d}/(8d⁴R²) and R² ≥ 25‖x*‖²/8. Therefore, for K_2 and K_3 small enough, we obtain that x_{i+1} ∈ B(x*, R_1 β d^{10} ‖x*‖), and by induction this holds for all i ≥ T. Finally, we obtain (A14).
Appendix A.7. Proof of Theorem 3
We begin by recalling the following fact on the local Lipschitz property of the generative network G under the WDC.
Lemma A10 (Lemma 21 in [16]). Suppose x ∈ B(x*, d√ε ‖x*‖) and the WDC holds with ε < 1/(200⁴d⁶). Then the stated local Lipschitz bound on G holds.
We then conclude the proof of Theorem 3 by using the above lemma and the results in the previous sections.

(I) By assumption x_0 ∈ B(0, R)\{0}, so that according to Proposition A1 the iterates remain in B(0, R)\{0} for any i ≥ 0. (II) By Proposition A3, there exists an integer T such that x_T ∈ B_+, and therefore x_T satisfies the conclusions of Theorem 3.A. (III) Once in B_+, the iterates of Algorithm 1 converge to x* up to the noise level, as shown by Proposition A4 and Equation (A14), which corresponds to (11) in Theorem 3.B. (IV) The reconstruction error (12) in Theorem 3.B then follows from (11) by applying Lemma A10 and the lower bound (A4).

Appendix B.1. Supplementary Proofs for Appendix A.2
Below we prove Lemma A2 on the concentration of the gradient of f at a differentiable point.
Proof of Lemma A2. We begin by noticing that: Below we show the bounds (A16) and (A17), from which the thesis follows. Regarding Equation (A16), observe that: where in the first inequality we used ⟨p_x, x⟩ = ‖Λ_x x‖² and in the second we used Equations (A5) and (A6) of Lemma A1. Next note that: where in the second inequality we used the bound (A6) and the definition of h̄_x. Equation (A17) is then found by appealing to Equation (A5) in Lemma A1.
The previous lemma is now used to control the concentration of the subgradients v x of f around h x .
Proof of Lemma A3. When f is differentiable at x, ∇f(x) = ṽ_x = v̄_x − η_x, so that by Lemma A2 and the assumption on the noise we obtain (A18). Observe now that, by (9), for any x ∈ R^k, v_x ∈ ∂f(x) = conv(v_1, . . . , v_T), and therefore v_x = a_1 v_1 + · · · + a_T v_T for some a_1, . . . , a_T ≥ 0 with ∑_i a_i = 1. Moreover, for each v_i there exists a w_i such that v_i = lim_{δ→0^+} ṽ_{x+δw_i}. The claim then follows from Equation (A18), the continuity of h_x with respect to nonzero x, and ∑_i a_i = 1. The above results are now used to bound the norm of v_x ∈ ∂f(x).
Proof of Lemma A4. Since ε < 1/(16πd²)², observe that 86d⁴√ε ≤ 2d². Therefore, by the assumption on the noise level and Lemma A3, the stated bound holds for any v_x ∈ ∂f(x). Next observe that, since ‖h_x‖ ≤ d‖x‖, the claimed bound on ‖v_x‖ follows.

Appendix B.2. Supplementary Proofs for Appendix A.3
In this section we prove Lemma A5, which implies that the norm of the iterates x_i does not increase in the region B(0, R)\B(0, C‖x*‖).
Proof of Lemma A5. Note that the thesis is equivalent to 2⟨x, v_x⟩ ≥ λµ‖v_x‖². Next recall that, by the WDC, for any x ∈ R^k with 2dε < 1 the lower bound (A19) holds. Moreover, by Lemma A4, the definition of the step length µ, and max(‖x‖², ‖x*‖²) ≤ R², we have the upper bound (A20), which, using ‖x‖ > C‖x*‖, gives the desired inequality; we can then conclude by observing that the assumptions hold for small enough constants. At a non-differentiable point x, the characterization of the Clarke subdifferential implies that the lower bound (A19) also holds for ⟨x, v_x⟩. Similarly, Lemma A4 leads to the upper bound (A20) also for v_x, which then leads to the thesis 2⟨x, v_x⟩ ≥ λµ‖v_x‖².

Appendix B.3. Supplementary Proofs for Appendix A.4
In the next lemmas we show that h x is locally Lipschitz.
Lemma A11. For all nonzero x, y:
Note that h̄_x = ξ x̂* + ζ x̂, where ξ and ζ are defined in (A23). Then observe that by (A28) and (A29) it follows that ‖h_x‖ ≤ d‖x‖. Furthermore, by [16] (Lemma 18), for any nonzero x, y ∈ R^k we have: Next notice that, by the triangle inequality, the first term on the left hand side can be bounded as: Finally, from the bound (A21) we obtain the claim.
Based on the previous lemma we can now prove that h_x is locally Lipschitz.
Based on the previous result we can now prove Lemma A6.
where we used the definition of β in Equation (A7) and Lemma A3. Next take λ ∈ [0, 1]; then in the first inequality we used Lemma A3 and Lemma A12, in the second inequality (A22), and in the last one (A8).

Appendix B.4. Supplementary Proofs for Appendix A.5
Below we prove that the region of R^k where we cannot control the norm of the vector field h_x is contained in two balls around x* and −ρ_d x*.
We prove Lemma A7 by showing the following.
Proof of Lemma A13. Without loss of generality, let x* = e_1 and x = r cos θ̄_0 · e_1 + r sin θ̄_0 · e_2 for some θ̄_0 ∈ [0, π] and r ≥ 0. Recall that x̂ = x/‖x‖ and x̂* = x*/‖x*‖. We then introduce the following notation, where θ_i = g(θ_{i−1}) with g as in (A1), and observe that h̄_x = ξ x̂* + ζ x̂. Let α := ⟨h̄_x, x̂⟩; then we can write: Using the definition of x̂ and x̂* we obtain: and conclude that, since x ∈ S_β: We now list some bounds that will be useful in the subsequent analysis. The identities (A26) through (A34) can be found in Lemma 16 of [16], while the identity (A35) follows by noticing that α = ξ cos θ̄_0 + ζ = cos θ_d and using (A27) together with d ≥ 2.
Controlling the distance. We have shown that either θ̄_0 ≈ 0 and ‖x‖² ≈ ‖x*‖², or θ̄_0 ≈ π and ‖x‖² ≈ ρ_d² ‖x*‖². We can therefore conclude that either x ≈ x* or x ≈ −ρ_d x*. Observe that if a two dimensional point is known to have magnitude within ∆r of some r and is known to be within an angle ∆θ of 0, then its Euclidean distance to the point with coordinates (r, 0) is no more than ∆r + (r + ∆r)∆θ. Similarly we can write: In the small angle case, by (A36), (A38), and ‖x*‖ |‖x‖ − ‖x*‖| ≤ |‖x‖² − ‖x*‖²|, we have
‖x − x*‖ ≤ 258πd⁶β + (1 + 258πd⁶β) 32d⁴πβ ≤ 550πd^{10}β.
Next we notice that ρ_2 = 1/π and ρ_d ≥ ρ_{d−1}, as follows from the definition and (A31), (A32). Then, considering the large angle case and using (A37), we have: The latter, together with (A38), yields: where in the second inequality we have used ρ_d ≤ d and in the third 8βπd⁶ ≤ 1.
We next show that the values of the loss function in a neighborhood of x* are strictly smaller than those in a neighborhood of −ρ_d x*.
Recall the definition of f; we next define the loss functions f_H, f_0 and f_E. In particular, notice that f(x) = f_H(x) + (1/4)‖H‖²_F. Below we show that, assuming the WDC is satisfied, f_0(x) concentrates around f_E(x).
Lemma A14. Suppose that d ≥ 2 and the WDC holds with ε < 1/(16πd²)²; then for all nonzero x:
Observe that: We analyze each term separately. The first term can be bounded as: We next take ϕ as specified and x ∈ B(φ_d x*, ϕ‖x*‖), so that by Lemma A15 and the assumption 2^d d^{44} ω ≤ K_2 ‖x*‖², we have: Similarly, if y ∈ B(−φ_d x*, ϕ‖x*‖), we obtain: In order to guarantee that f(y) > f(x), it suffices to have:
with C_d := (544d⁴ + 10πd³π + 3K_2 d^{−44} + πd/2 + 1/100), that is, to require:
Finally, notice that by Lemma 17 in [16] it holds that 1 − ρ_d ≥ (K(d + 2))^{−2} for some numerical constant K; we therefore choose φ proportional to 1/d^{12} with a small enough constant.

Appendix B.5. Supplementary Proofs for Appendix A.6
In this section we use strong convexity and smoothness to prove convergence to x* up to the noise level ω. The idea is to show that every vector in the subdifferential points in the direction of (x − x*). Recall the form of the gradient in the noiseless case; we show that, by continuity of Λ_x, when x is close to x* then v̄_x is close to:
We begin by recalling the following result, which can be found in the proof of Lemma 22 of [16].
Lemma A16. Suppose x ∈ B(x*, d√ε ‖x*‖) and the WDC holds with ε < 1/(200⁴d⁶). Then it holds that:
We now prove that v̄_x ≈ v_x for x close to x*.
Lemma A17. Suppose x ∈ B(x*, d√ε ‖x*‖) and the WDC holds with ε < 1/(200⁴d⁶). Then it holds that:
where the second inequality follows from Lemma A1, and the third from Lemma A16.
We next prove that, by the WDC, v_x ≈ v̄_x.
Lemma A18. Suppose the WDC holds with ε < 1/(16πd²)². Then for all nonzero x, x*:
Proof. For notational convenience we define E_d := I_k/2^d, the scaled identity in R^k. Here the second inequality follows from Lemma A1 and the third from (17) in [15]. We conclude by noticing that ‖M_{x,x*}‖ ≤ ‖x − x*‖ (‖x‖ + ‖x*‖).
Next consider the quartic function f̄(x) := (1/2^{2d+2}) ‖xx^T − x* x*^T‖²_F and observe that: Following [63], we show that f̄ is β-smooth and strongly convex in a neighborhood of x*, in turn deriving the result in Lemma A19.
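For reference, a short computation of the gradient of the quartic (assuming f̄ is as written above) makes the convexity structure explicit:

```latex
% Gradient of \bar f(x) = 2^{-(2d+2)} \| x x^T - x_\star x_\star^T \|_F^2:
\nabla \bar f(x)
  = \frac{1}{2^{2d}} \left( x x^T - x_\star x_\star^T \right) x
  = \frac{1}{2^{2d}} \left( \|x\|_2^2 \, x - \langle x_\star, x \rangle \, x_\star \right),
% whose Hessian at x_\star equals 2^{-2d} ( \|x_\star\|_2^2 I_k + x_\star x_\star^T ) \succ 0,
% so \bar f is smooth and strongly convex on a neighborhood of x_\star.
```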

Appendix C. Proof of Theorem 1
By Theorem 3 it suffices to show that, with high probability, the weight matrices {W_i}_{i=1}^d satisfy the WDC, and that ω/2^d upper bounds the spectral norm of the noise term Λ_x^T H Λ_x, where:
• in the Spiked Wishart Model, H = Σ_N − Σ with Σ_N = Y^T Y/N and Σ = y* y*^T + σ² I_n;
• in the Spiked Wigner Model, H = ν H̄ with H̄ ∼ GOE(n).
Regarding the Weight Distribution Condition (WDC), we observe that it was initially proposed in [15], where it was shown to hold with high probability for networks with random Gaussian weights under an expansivity condition on their dimensions. It was later shown in [62] that a less restrictive expansivity rate is sufficient.
By a union bound over all layers, using the above result we can conclude that the WDC holds simultaneously for all layers of the network with probability at least 1 − ∑_{i=1}^d e^{−C n_{i−1}}. Note in particular that this argument does not require the independence of the weight matrices {W_i}_{i=1}^d. By Lemma A20, with high probability the random generative network G satisfies the WDC. Therefore, if we can guarantee that the assumptions on the noise term H are satisfied, the proof of the main Theorem 1 follows from the deterministic Theorem 3 and Lemma A20.
Before turning to the bounds on the noise terms in the spiked models, we recall the following lemma, which bounds the number of possible matrices Λ_x for x ≠ 0. Note that this is related to the number of regions defined by a deep ReLU network.
In the next section we use this lemma to control the noise term Λ_x^T H Λ_x on the event where the WDC holds.