Abstract
We provide a non-asymptotic analysis of the spiked Wishart and Wigner matrix models with a generative neural network prior. Spiked random matrices have the form of a rank-one signal plus noise and have been used as models for high dimensional Principal Component Analysis (PCA), community detection and synchronization over groups. Depending on the prior imposed on the spike, these models can display a statistical-computational gap between the information theoretically optimal reconstruction error that can be achieved with unbounded computational resources and the sub-optimal performances of currently known polynomial time algorithms. These gaps are believed to be fundamental, as in the emblematic case of Sparse PCA. In stark contrast to such cases, we show that there is no statistical-computational gap under a generative network prior, in which the spike lies on the range of a generative neural network. Specifically, we analyze a gradient descent method for minimizing a nonlinear least squares objective over the range of an expansive-Gaussian neural network and show that it can recover in polynomial time an estimate of the underlying spike with a rate-optimal sample complexity and dependence on the noise level.
1. Introduction
One of the fundamental problems in statistical inference and signal processing is the estimation of a signal given noisy high dimensional data. A prototypical example is provided by spiked matrix models where a signal is to be estimated from a matrix Y taking one of the following forms:
- Spiked Wishart Model, in which the data matrix Y ∈ R^{N×n} is the sum of a rank-one signal and a Gaussian noise matrix: each row is a scalar multiple of the planted spike plus independent Gaussian noise, where the scalar multipliers collected in the vector u and the entries of the noise matrix Z are i.i.d. Gaussian, and u and Z are independent;
- Spiked Wigner Model, in which Y ∈ R^{n×n} is the sum of a rank-one signal proportional to the outer product of the spike with itself and a noise matrix H drawn from a Gaussian Orthogonal Ensemble GOE(n), that is, H is symmetric with independent centered Gaussian entries above the diagonal and diagonal entries of twice the variance.
In the last 20 years, spiked random matrices have been extensively studied, as they serve as a mathematical model for many signal recovery problems such as PCA [1,2,3,4], synchronization over graphs [5,6,7] and community detection [8,9,10]. Furthermore, these models are archetypal examples of the trade-off between statistical accuracy and computational efficiency. From a statistical perspective, the objective is to understand how the choice of the prior on the spike determines the critical signal-to-noise ratio (SNR) and number of measurements above which it becomes information-theoretically possible to estimate the signal. From a computational perspective, the objective is to design efficient algorithms that leverage such prior information. A recent and vast body of literature has shown that, depending on the chosen prior, gaps can arise between the minimum SNR required to solve the problem and the one above which known polynomial-time algorithms succeed. An emblematic example is provided by the Sparse PCA problem, where the signal in (1) is taken to be s-sparse. In this case a number of samples proportional to s (up to log factors) is sufficient for estimating the spike [2,4], while the best known efficient algorithms require a number of samples on the order of s² [3,11,12]. This gap is believed to be fundamental. This “statistical-computational gap” has been observed also for Spiked Wigner models (2) and, in general, for other structured signal recovery problems where the imposed prior has a combinatorial flavor (see the next section and [13,14] for surveys).
Motivated by the recent advances of deep generative networks in learning complex data structures, in this paper we study the spiked random matrix models (1) and (2), where the planted signal has a generative network prior. We assume that a generative neural network G: R^k → R^n with k ≪ n has been trained on a data set of spikes, and that the unknown spike lies on the range of G, that is, it is the image under G of some latent vector in R^k. As a mathematical model for the trained G, we consider a network of the form:
with weight matrices W_i ∈ R^{n_i × n_{i−1}}, i = 1, …, d, and the ReLU activation applied entrywise. We furthermore assume that the network is expansive, that is, each layer is wider than the one before it, and that the weights have i.i.d. Gaussian entries. These modeling assumptions and their variants were used in [15,16,17,18,19,20].
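To make the architecture concrete, the following is a minimal Python sketch of an expansive fully connected ReLU network of the form (3). The function names, the layer widths in the example, and the 1/n_i weight variance are illustrative assumptions, not the exact configuration used in the paper.

```python
import numpy as np

def sample_expansive_relu_network(layer_dims, rng):
    """Sample Gaussian weight matrices W_i of shape (n_i, n_{i-1}).

    layer_dims = [k, n_1, ..., n_d] is assumed strictly expansive.
    The 1/n_i variance is one common normalization; other scalings
    only change the overall scale of G.
    """
    return [rng.normal(0.0, 1.0 / np.sqrt(n_out), size=(n_out, n_in))
            for n_in, n_out in zip(layer_dims[:-1], layer_dims[1:])]

def G(weights, x):
    """Forward pass G(x) = relu(W_d ... relu(W_1 x))."""
    for W in weights:
        x = np.maximum(W @ x, 0.0)  # entrywise ReLU
    return x

# Example: a 2-layer expansive network mapping R^10 into R^500.
rng = np.random.default_rng(0)
weights = sample_expansive_relu_network([10, 100, 500], rng)
spike = G(weights, rng.normal(size=10))
```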
Enforcing generative network priors has led to substantially fewer measurements needed for signal recovery than with traditional sparsity priors for a variety of signal recovery problems [17,21,22]. In the case of phase retrieval, [17,23] have shown that under the generative prior (3), efficient compressive phase retrieval is possible with sample complexity proportional (up to log factors) to the underlying signal dimensionality k. In contrast, for a sparsity-based prior, the best known polynomial time algorithms (convex methods [24,25,26], iterative thresholding [27,28,29], etc.) require a sample complexity proportional to the square of the sparsity level for stable recovery. Given that generative priors lead to no computational-statistical gap with compressive phase retrieval, one might anticipate that they will close other computational-statistical gaps as well. Indeed, [30] analyzed the spiked models (1) and (2) under a generative network prior similar to (3) and observed no computational-statistical gap in the asymptotic limit where the latent and ambient dimensions grow proportionally. For more details on this work and on the comparison of sparsity and generative priors, see Section 2.2.
Our Contribution
In this paper we analyze the spiked matrix models (1) and (2) under a generative network prior in the nonasymptotic, finite data regime. We consider a d-layer feedforward generative network with architecture (3). We furthermore assume that the planted spike lies on the range of G, that is, there exists a latent vector in R^k whose image under G is the spike.
To estimate the spike, we first find an estimate of the latent vector and then map it through G. We thus consider the following minimization problem over the latent space, in which a nonlinear least squares objective f measures the misfit between the rank-one matrix generated by G and the data (under the conditions on the generative network specified below, it was shown in [15] that G is invertible on its range, so the latent vector corresponding to the spike is unique):
where:
- for the Wishart model (1) the objective is built from the empirical covariance matrix of the rows of Y, and
- for the Wigner model (2) the objective is built directly from the data matrix Y (a schematic rendering of both objectives is sketched after this list).
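As an illustration, here is a minimal Python sketch of one natural way to write these least squares objectives, reusing the forward map `G` from the sketch above. The 1/4 Frobenius-norm form and the centering of the empirical covariance by the identity are assumptions consistent with the quartic structure analyzed in the Appendix, not necessarily the exact definitions in (4).

```python
import numpy as np

def wigner_objective(Y, weights, x):
    """Least-squares fit of the rank-one matrix G(x)G(x)^T to a
    (symmetric, n x n) spiked Wigner observation Y."""
    Gx = G(weights, x)
    return 0.25 * np.linalg.norm(np.outer(Gx, Gx) - Y, "fro") ** 2

def wishart_objective(Y, weights, x):
    """Same fit, but against a centered version of the empirical
    covariance of the rows of the N x n data matrix Y."""
    N, n = Y.shape
    M = Y.T @ Y / N - np.eye(n)  # assumed centering by the identity
    Gx = G(weights, x)
    return 0.25 * np.linalg.norm(np.outer(Gx, Gx) - M, "fro") ** 2
```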
Despite the non-convexity and non-smoothness of the problem, our preliminary work in [31] shows that when the generative network G is expansive and has Gaussian weights, (4) enjoys a favorable optimization geometry. Specifically, every nonzero point outside two small neighborhoods, one around the true latent vector and one around a negative multiple of it, has a descent direction which is given a.e. by the gradient of f. Furthermore, in [31] it is shown that the global minimum of f lies in the neighborhood around the true latent vector and has optimal reconstruction error. This result suggests that a first order optimization algorithm can succeed in efficiently solving (4), and that no statistical-computational gap is present for the spiked matrix models with a (random) generative network prior in the finite data regime. In the current paper, we prove this conjecture by providing a polynomial-time subgradient method that minimizes the non-convex problem (4) and obtains information-theoretically optimal error rates.
Our main contribution can be summarized as follows. We analyze a subgradient method (Algorithm 1) for the minimization of (4) and show that, after a polynomial number of steps and up to polynomial factors in the depth d of the network, the iterate satisfies the following reconstruction error bounds:
- in the Spiked Wishart Model, a rate-optimal error bound holding in the regime where the number of samples N is at least proportional, up to logarithmic factors, to the latent dimension k;
- in the Spiked Wigner Model, a rate-optimal error bound at the corresponding noise level.
We note that these bounds are information-theoretically optimal up to the log factors in n, and correspond to the best achievable in the case of a k-dimensional subspace prior. In particular, they imply that efficient recovery in the Wishart model is possible with a number of samples N proportional to the intrinsic dimension k of the signal. Similarly, the bound in the Spiked Wigner Model implies that imposing a generative network prior reduces the effective noise by a dimension-dependent factor.
Algorithm 1: Subgradient method for the minimization problem (4).
2. Related Work
2.1. Sparse PCA and Other Computational-Statistical Gaps
A canonical problem in Statistics is finding the directions that explain most of the variance in a given cloud of data, and it is classically solved by Principal Component Analysis. Spiked covariance models were introduced in [1] to study the statistical performance of this algorithm in the high dimensional regime. Under a spiked covariance model it is assumed that the data are of the form:
where the noise vectors are independent and identically distributed standard Gaussians and the planted spike has unit norm. Each sample is then an i.i.d. draw from a centered Gaussian with spiked covariance matrix equal to the identity plus a rank-one term along the spike, which is the direction that explains most of the variance. The estimate of the spike provided by PCA is the leading eigenvector of the empirical covariance matrix, and standard techniques from high dimensional probability can be used to show that, with overwhelming probability, its error (up to a sign) is bounded by a constant multiple of the square root of n/N as long as N ≳ n (here we write a ≲ b if a ≤ C b for a constant C that may depend on the signal-to-noise ratio, and similarly a ≳ b if b ≲ a). Note incidentally that the data matrix whose rows are the samples can be written in the form (1).
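For concreteness, the following is a minimal sketch of the classical PCA estimator in a spiked covariance model; the spike strength, the dimensions, and the exact normalizations are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, N, beta = 200, 1000, 2.0           # ambient dim, samples, assumed spike strength
v_star = rng.normal(size=n)
v_star /= np.linalg.norm(v_star)      # unit-norm planted spike

# y_i ~ N(0, I_n + beta * v v^T), realized as sqrt(beta) * g_i * v + z_i
Y = np.sqrt(beta) * rng.normal(size=(N, 1)) * v_star + rng.normal(size=(N, n))

Sigma_hat = Y.T @ Y / N               # empirical covariance
eigvals, eigvecs = np.linalg.eigh(Sigma_hat)
v_hat = eigvecs[:, -1]                # leading eigenvector = PCA estimate

err = min(np.linalg.norm(v_hat - v_star), np.linalg.norm(v_hat + v_star))
print(f"reconstruction error (up to sign): {err:.3f}")
```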
Bounds of this form, however, become uninformative in modern high dimensional regimes where the ambient dimension n of the data is much larger than, or on the order of, the number of samples N. Even worse, in the proportional asymptotic regime where n/N converges to a large enough constant, the spike and the PCA estimate become asymptotically orthogonal [32], and minimax techniques show that no other estimator based solely on the data can achieve a better overlap with the spike [33].
In order to obtain consistent estimates and lower the sample complexity of the problem, therefore, additional prior information on the spike has to be enforced. For this reason, in recent years various priors have been analyzed, such as positivity [34], cone constraints [35] and sparsity [32,36]. In the latter case the spike is assumed to be s-sparse, and it can be shown (e.g., [33]) that the s-sparse largest eigenvector of the empirical covariance matrix satisfies, with high probability, an error bound proportional to the square root of (s log n)/N.
This implies, in particular, that the signal can be estimated with a number of samples that scales linearly with its intrinsic dimension s. These rates are also minimax optimal; see for example [4] for the mean squared error and [2] for support recovery. Despite these encouraging results, no currently known polynomial time algorithm achieves such optimal error rates: for example, the covariance thresholding algorithm of [37] requires a number of samples on the order of s² in order to obtain exact support recovery or a comparable estimation rate, as shown in [3]. In summary, only computationally intractable algorithms are known to reach the statistical limit for Sparse PCA, while known polynomial time methods are sub-optimal, requiring on the order of s² samples. Notably, [38] provided a reduction from the planted clique problem, which is conjectured to be computationally hard, to Sparse PCA.
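For illustration, here is a rough sketch in the spirit of covariance thresholding: soft-threshold the off-diagonal entries of the empirical covariance and extract a sparse leading direction. The threshold level and the final restriction to the s largest coordinates are simplifications, not the exact procedure of [37].

```python
import numpy as np

def covariance_thresholding_pca(Y, s, tau):
    """Crude sparse-PCA heuristic: soft-threshold the empirical covariance,
    take its leading eigenvector, then keep only the s largest entries.

    tau is a thresholding level, e.g. of order sqrt(log(n) / N)."""
    N, n = Y.shape
    Sigma = Y.T @ Y / N
    off = Sigma - np.diag(np.diag(Sigma))
    Sigma_soft = np.sign(off) * np.maximum(np.abs(off) - tau, 0.0)
    _, vecs = np.linalg.eigh(Sigma_soft)
    v = vecs[:, -1]
    support = np.argsort(np.abs(v))[-s:]      # keep the s largest coordinates
    v_sparse = np.zeros(n)
    v_sparse[support] = v[support]
    return v_sparse / np.linalg.norm(v_sparse)
```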
Further strong evidence for the hardness of Sparse PCA has been given in a series of recent works [39,40,41,42,43]. Other computational-statistical gaps have also been found and studied in a variety of other contexts, such as sparse Gaussian mixture models [44], tensor principal component analysis [45], community detection [46] and synchronization over groups [47]. These works fit in the growing and important body of literature aiming at understanding the trade-offs between statistical accuracy and computational efficiency in statistical inverse problems.
We finally note that many of the above mentioned problems can be phrased as the recovery of a spike vector from a spiked random matrix. The difficulty can be viewed as arising from simultaneously imposing low-rankness and additional prior information on the signal (sparsity in the case of Sparse PCA). This difficulty can be found in sparse phase retrieval as well. For example, [25] has shown that a number of quadratic measurements proportional to s (up to log factors) is sufficient to ensure well-posedness of the estimation of an s-sparse signal of dimension n lifted to a rank-one matrix, while on the order of s² measurements are necessary for the success of natural convex relaxations of the problem. Similarly, [48] studied the recovery of simultaneously low-rank and sparse matrices, showing the existence of a gap between what can be achieved with convex, tractable relaxations and with nonconvex, intractable methods.
2.2. Inverse Problems with Generative Network Priors
Recently, in the wake of successes of deep learning, generative networks have gained popularity as a novel approach for encoding and enforcing priors in signal recovery problems. In one deep-learning-based approach, a dataset of “natural signals” is used to train a generative network in an unsupervised manner. The range of this network defines a low-dimensional set which, if successfully trained, contains, or approximately contains, target signals of interest [19,21]. Non-convex optimization methods are then used for recovery by optimizing over the range of the network. We notice that allowing the algorithms complete knowledge of the generative network architecture and of the learned weights is roughly analogous to allowing sparsity-based algorithms knowledge of the basis or frame in which the signal is modeled as sparse.
The use of generative networks for signal recovery has been successfully demonstrated in a variety of settings such as compressed sensing [21,49,50], denoising [16,51], blind deconvolution [22], inpainting [52] and many more [53,54,55,56]. In these papers, generative networks significantly outperform sparsity-based priors at signal reconstruction in the low-measurement regime. This fundamentally leverages the fact that a natural signal can be represented more concisely by a generative network than by a sparsity prior under an appropriate basis. This characteristic has been observed even in untrained generative networks, where the prior information is encoded only in the network architecture, and has been used to devise state-of-the-art signal recovery methods [57,58,59].
Parallel to these empirical successes, a recent line of work has investigated theoretical guarantees for various statistical estimation tasks with generative network priors. Following [21], the work of [15] gave global guarantees for compressed sensing, followed then by many others for various inverse problems [19,20,50,51,55]. In particular, in [17] the authors have shown that a number of measurements proportional, up to log factors, to the latent dimension k is sufficient to recover a signal from random phaseless observations, assuming that the signal lies on the range of a generative network with latent dimension k. The same authors have then provided in [23] a polynomial time algorithm for recovery under the previous settings. Note that, contrary to the sparse phase retrieval problem, generative priors for phase retrieval allow for efficient algorithms with optimal sample complexity, up to logarithmic factors, with respect to the intrinsic dimension of the signal.
Further theoretical advances in signal recovery with generative network priors have been spurred by using techniques from statistical physics. Recently, [30] analyzed the spiked matrix models (1) and (2) with the spike in the range of a generative network with random weights, in the asymptotic limit where the latent and ambient dimensions grow proportionally. The analysis is carried out mainly for networks with sign or linear activation functions in the Bayesian setting where the latent vector is drawn from a separable distribution. The authors of [30] provide an Approximate Message Passing and a spectral algorithm, and they numerically observe no statistical-computational gap, as these polynomial time methods are able to asymptotically match the information-theoretic optimum. In this asymptotic regime, [60] further provided precise statistical and algorithmic thresholds for compressed sensing and phase retrieval.
3. Algorithm and Main Result
In this section we present an efficient and statistically-optimal algorithm for the estimation of the signal given a spiked matrix Y of the form (1) or (2). The recovery method is detailed in Algorithm 1, and it is based on the direct optimization of the nonlinear least squares problem (4).
Applied in [16] for denoising and compressed sensing under generative network priors, and later used in [23] for phase retrieval, the first order optimization method described in Algorithm 1 leverages the theory of Clarke subdifferentials (the reader is referred to [61] for more details). As the objective function f is continuous and piecewise smooth, at every point x it has a Clarke subdifferential (9), given by the convex hull of the gradients of the finitely many smooth functions that agree with f on the pieces adjacent to x. The elements of this convex hull are the subgradients of f at x, and at a point x where f is differentiable the subdifferential reduces to the singleton containing the gradient of f at x.
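For reference, a standard form of this characterization for continuous piecewise smooth functions, in the spirit of (9) (the precise statement used in the paper may carry additional qualifications):

$$
\partial f(x) \;=\; \operatorname{conv}\bigl\{\nabla f_1(x), \dots, \nabla f_T(x)\bigr\},
\qquad
\partial f(x) = \{\nabla f(x)\} \ \text{ whenever } f \text{ is differentiable at } x,
$$

where $f_1, \dots, f_T$ are the smooth functions that coincide with $f$ on the pieces adjacent to $x$.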
The reconstruction method presented in Algorithm 1 is motivated by the landscape analysis of the minimization problem (4) for a network G with sufficiently expansive Gaussian weight matrices. Under this assumption, we showed in [31] that (4) has a benign optimization geometry and, in particular, that at any nonzero point outside a neighborhood of the true latent vector and a neighborhood of a negative multiple of it, any subgradient of f is a direction of strict descent. Furthermore we showed that the points in the vicinity of the spurious negative multiple have function values strictly larger than those close to the true latent vector. Figure 1 shows the expected value of f in the noiseless case for a generative network with a small latent dimension. This plot highlights the global minimum at the true latent vector and the flat region near a negative multiple of it.
Figure 1.
Expected value, with respect to the weights, of the objective function f in (4) in the noiseless case (see (16) for the explicit formula), for a network with a small latent dimension.
At each step, the subgradient method in Algorithm 1 checks whether the current iterate has a larger loss value than its negation and, if so, negates the iterate. As we show in the proof of our main result, this step ensures that the algorithm avoids the neighborhood around the spurious negative multiple of the true latent vector and converges to the neighborhood around the true latent vector in a polynomial number of steps.
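The following is a minimal Python sketch of the iteration just described; the fixed step size, the stopping rule, and the way the subgradient is supplied are illustrative choices rather than the exact statement of Algorithm 1.

```python
import numpy as np

def subgradient_method(f, grad_f, x0, step, n_iters):
    """Subgradient descent on the latent variable with a negation check.

    grad_f(x) should return any Clarke subgradient of f at x; at points of
    differentiability this is just the gradient of f."""
    x = np.array(x0, dtype=float)
    for _ in range(n_iters):
        # Negation step: jump to -x if it has a smaller loss value.
        if f(-x) < f(x):
            x = -x
        x = x - step * grad_f(x)
    return x
```

In practice, `grad_f` can be obtained by automatic differentiation of f, which coincides with a valid subgradient at every point of differentiability.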
Below we make the following assumptions on the weight matrices of G.
Assumption 1.
The generative network G defined in (3) has weights with i.i.d. Gaussian entries, scaled by the layer width, and satisfies the expansivity condition (10) with constant ε: each layer width n_i exceeds the previous width n_{i−1} by a multiplicative factor depending only on ε, for all i.
We note that in [31] the expansivity condition was more stringent, requiring an additional log factor. Since the publication of our paper, [62] has shown that the more relaxed assumption (10) suffices for ensuring a benign optimization geometry. Under Assumption 1, our main theorem below shows that the subgradient method in Algorithm 1 can estimate the spike with optimal sample complexity and in a polynomial number of steps.
Theorem 1.
Let the target latent vector be nonzero and the spike be its image under a generative network G satisfying Assumption 1 with a sufficiently small constant. Consider the minimization problem (4) and assume that the noise level ω is sufficiently small, where:
- for the Spiked Wishart Model (1), the matrix entering the objective and the noise level ω are built from the empirical covariance of the samples, and
- for the Spiked Wigner Model (2), they are built directly from the observed matrix Y.
Consider Algorithm 1 with a nonzero initialization and a sufficiently small stepsize. Then, with high probability, there exists an integer T such that for all subsequent iterations:
where the remaining constants are universal.
Note that the exponentially small quantity appearing in the hypotheses and conclusions of the theorem is an artifact of the scaling of the network, and it should not be taken as requiring an exponentially small noise level or number of steps. Indeed, under Assumption 1, the ReLU activation zeros out roughly half of the entries of its argument, leading to an “effective” operator norm of approximately 1/√2 per layer. We furthermore notice that the dependence on the depth d is likely quite conservative; it was not optimized in the proof, as the main objective was to obtain a tight dependence on the intrinsic dimension k of the signal. As shown in the numerical experiments, the actual dependence on the depth is much better in practice. Finally, observe that despite the nonconvex nature of the objective function in (4), we obtain a rate of convergence which does not directly depend on the dimension of the signal, reminiscent of what happens in the convex case.
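This heuristic can be checked numerically: for a Gaussian layer, the ReLU zeroes out about half of the coordinates, so each layer shrinks squared norms by roughly a factor of two and a d-layer network scales norms by roughly 2^{−d/2}. The script below is an illustration of this back-of-the-envelope argument; the layer widths and the 1/n_i weight variance are assumptions for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(2)
k, dims, trials = 20, [20, 200, 500, 1000], 200   # assumed layer widths

ratios = []
for _ in range(trials):
    x = rng.normal(size=k)
    h = x
    for n_in, n_out in zip(dims[:-1], dims[1:]):
        W = rng.normal(0.0, 1.0 / np.sqrt(n_out), size=(n_out, n_in))
        h = np.maximum(W @ h, 0.0)                 # ReLU layer
    ratios.append(np.linalg.norm(h) / np.linalg.norm(x))

d = len(dims) - 1
print(f"average norm ratio: {np.mean(ratios):.3f}, "
      f"predicted 2^(-d/2) = {2 ** (-d / 2):.3f}")
```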
The quantity ω in Theorem 1 can be interpreted as the intrinsic noise level of the problem (an inverse SNR). The theorem guarantees that in a polynomial number of steps the iterates of the subgradient method converge to the true latent vector up to an error proportional to this noise level. For a large enough number of samples, or a small enough noise level, the resulting estimate satisfies the rate-optimal error bounds (5) and (6).
Numerical Experiments
We illustrate the predictions of our theory by providing results of Algorithm 1 on a set of synthetic experiments. We consider 2-layer generative networks with ReLU activation functions, with fixed hidden and output dimensions and a varying latent dimension k. We randomly sample the entries of the weight matrices independently from a centered Gaussian with variance inversely proportional to the layer width (this scaling removes the corresponding dependence in Theorem 1). We then generate data Y according to the spiked models (1) and (2), where the latent vector is chosen so that the spike has unit norm. For the Wishart model we vary the number of samples N, and for the Wigner model we vary the noise level, so that the control parameters appearing in the bounds of Theorem 1 remain constant across the different networks with latent dimension k.
In Figure 2 we plot the reconstruction error against the corresponding control parameter for each model. As predicted by Theorem 1, the errors scale linearly with respect to these control parameters, and moreover the overlap of the plots for different values of k confirms that the rates are tight with respect to the order of k.
Figure 2.
Reconstruction error for the recovery of a spike in the Wishart and Wigner models with random generative network priors. Each point corresponds to the average over 50 random draws of the network weights and samples. These plots demonstrate that the reconstruction errors follow the scalings established by Theorem 1.
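A schematic of the synthetic setup is sketched below: a unit-norm spike on the range of a random two-layer ReLU network, a rank-one perturbation of Gaussian data (Wishart), and a rank-one perturbation of a GOE matrix (Wigner). The dimensions and the signal-to-noise normalizations are assumptions for illustration, not the exact scalings used to produce Figure 2.

```python
import numpy as np

rng = np.random.default_rng(3)
k, n1, n = 6, 150, 600                     # latent, hidden, output dims (assumed)
W1 = rng.normal(0.0, 1.0 / np.sqrt(n1), size=(n1, k))
W2 = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n1))

x_star = rng.normal(size=k)                # hypothetical latent vector
spike = np.maximum(W2 @ np.maximum(W1 @ x_star, 0.0), 0.0)
spike /= np.linalg.norm(spike)             # unit-norm spike on the range of G

# Spiked Wishart data: N samples with covariance I_n + beta * spike spike^T.
N, beta = 2000, 3.0
Y_wishart = np.sqrt(beta) * rng.normal(size=(N, 1)) * spike + rng.normal(size=(N, n))

# Spiked Wigner data: rank-one signal plus a GOE(n) matrix.
A = rng.normal(size=(n, n))
H = (A + A.T) / np.sqrt(2 * n)             # off-diag var 1/n, diag var 2/n
lam = 2.0                                  # assumed signal strength
Y_wigner = lam * np.outer(spike, spike) + H
```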
4. Recovery Under Deterministic Conditions
We will derive Theorem 1 from Theorem 3 below, which is based on a set of deterministic conditions on the weight matrices of the network and on the noise. Specifically, we consider the minimization problem (4) with its objective written in terms of an unknown symmetric perturbation matrix H, a nonzero latent vector, and a given d-layer feedforward generative network G as in (3).
In order to describe the main deterministic conditions on the generative network G, we begin by introducing some notation. For a matrix W and a vector x, we define the operator $W_{+,x} = \mathrm{diag}(Wx > 0)\, W$, which keeps the rows of W that are active at x and zeroes out the others. Moreover, we apply this construction recursively through the layers, at each layer keeping the rows active at the output of the previous layers. Finally, we denote by $\Lambda_x$ the resulting product over all d layers and note that $G(x) = \Lambda_x x$. With this notation we next recall the following deterministic condition on the layers of the generative network.
Definition 2
(Weight Distribution Condition [15]). We say that $W \in \mathbb{R}^{m \times p}$ with rows $w_j$ satisfies the Weight Distribution Condition (WDC) with constant $\epsilon > 0$ if for all nonzero $x, y \in \mathbb{R}^p$:
$$\Bigl\| \sum_{j=1}^{m} 1_{\langle w_j, x\rangle > 0}\, 1_{\langle w_j, y\rangle > 0}\, w_j w_j^{\mathsf T} \;-\; Q_{x,y} \Bigr\| \le \epsilon,$$
where
$$Q_{x,y} = \frac{\pi - \theta_{x,y}}{2\pi}\, I_p + \frac{\sin \theta_{x,y}}{2\pi}\, M_{\hat{x} \leftrightarrow \hat{y}},$$
and $\theta_{x,y}$ is the angle between x and y, $\hat{x} = x/\|x\|$, $\hat{y} = y/\|y\|$, $I_p$ is the identity matrix and $M_{\hat{x} \leftrightarrow \hat{y}}$ is the matrix that sends $\hat{x} \mapsto \hat{y}$, $\hat{y} \mapsto \hat{x}$, and has kernel $\mathrm{span}\{x, y\}^{\perp}$.
Note that $Q_{x,y}$ is the expected value of the sum above when W has i.i.d. Gaussian rows with covariance $I_p/m$, and that if $x = y$ then $Q_{x,x}$ is an isometry up to the scaling factor 1/2. Below we will say that a d-layer generative network G of the form (3) satisfies the WDC with constant $\epsilon$ if every weight matrix $W_i$ satisfies the WDC with constant $\epsilon$.
The WDC was originally introduced in [15]; it ensures that the angle between two vectors in the latent space is approximately preserved at the output layer and, in turn, it guarantees the invertibility of the network. Assumption 1 guarantees that the generative network G satisfies the WDC with high probability.
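The WDC for a single Gaussian layer can be checked numerically. The script below compares the sum over jointly active rows with the matrix $Q_{x,y}$ written in the form given in [15]; the dimensions are arbitrary, and the construction of $Q_{x,y}$ is our own rendering of that formula.

```python
import numpy as np

def Q_matrix(x, y):
    """Expected value of sum_j 1{<w_j,x>>0} 1{<w_j,y>>0} w_j w_j^T for
    Gaussian rows w_j with covariance I/m, following the formula of [15]."""
    m = x.size
    xh, yh = x / np.linalg.norm(x), y / np.linalg.norm(y)
    theta = np.arccos(np.clip(xh @ yh, -1.0, 1.0))
    if np.isclose(theta, 0.0):                 # degenerate case x parallel to y
        return 0.5 * np.eye(m)
    u1 = xh
    u2 = (yh - np.cos(theta) * u1) / np.sin(theta)
    # M swaps the unit vectors xh and yh and vanishes on their orthogonal complement.
    M = np.cos(theta) * (np.outer(u1, u1) - np.outer(u2, u2)) \
        + np.sin(theta) * (np.outer(u1, u2) + np.outer(u2, u1))
    return (np.pi - theta) / (2 * np.pi) * np.eye(m) + np.sin(theta) / (2 * np.pi) * M

m, n_out = 40, 4000                            # input and output dims (assumed)
rng = np.random.default_rng(4)
W = rng.normal(0.0, 1.0 / np.sqrt(n_out), size=(n_out, m))
x, y = rng.normal(size=m), rng.normal(size=m)
mask = ((W @ x > 0) & (W @ y > 0)).astype(float)
emp = (W * mask[:, None]).T @ W                # sum over jointly active rows
print("WDC deviation (spectral norm):", np.linalg.norm(emp - Q_matrix(x, y), 2))
```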
We are now able to state our recovery guarantees for a spike under deterministic conditions on the network G and noise H.
Theorem 3.
Let the WDC constant be sufficiently small and assume the generative network (3) has weights satisfying the WDC with that constant. Consider Algorithm 1 with a sufficiently small stepsize, and let H be a symmetric matrix satisfying:
Take a nonzero initialization. Then the iterates generated by Algorithm 1 remain nonzero and obey the following:
- (A)
- there exists an integer such that
- (B)
- for any sufficiently large iteration, the error bounds (11) and (12) hold, where the constants appearing in them are universal.
Theorem 1 then follows from Theorem 3 after proving that, with high probability, the spectral norm of the noise term can be suitably upper bounded and the weights of the network G satisfy the WDC.
In the rest of this section we describe the main steps and tools needed to prove Theorem 3.
4.1. Technical Tools and Outline of the Proofs
Our proof strategy for Theorem 3 can be summarized as follows:
- In Proposition A1 (Appendix A.3) we show that the iterates of Algorithm 1 stay inside a Euclidean ball of fixed radius and remain nonzero at every step.
- We then identify two small Euclidean balls, one around the true latent vector and one around a negative multiple of it, whose radii depend only on the depth of the network. In Proposition A2 we show that after a polynomial number of steps the iterates of Algorithm 1 enter the union of these two balls (Appendix A.4).
- We show, in Proposition A3, that the negation step causes the iterates of the algorithm to avoid the spurious ball and actually enter the ball around the true latent vector within a polynomial number of steps (Appendix A.5).
- We finally show, in Proposition A4, that in this ball the loss function f enjoys a favorable convexity-like property, which implies that the iterates remain in the ball and eventually converge to the true latent vector up to the noise level (Appendix A.6).
One of the main difficulties in the analysis of the subgradient method in Algorithm 1 is the lack of smoothness of the loss function f. We show that the WDC allows us to overcome this issue by showing that the subgradients of f are uniformly close, up to the noise level, to an explicit vector field, which is continuous for nonzero x (see Appendix A.2). We show furthermore that this vector field is locally Lipschitz, which allows us to conclude that the subgradient method decreases the value of the loss function until the iterates eventually reach the two small balls (Appendix A.4).
Using the WDC, we show that the loss function f is uniformly close to an explicit deterministic surrogate.
A direct analysis of this surrogate reveals that its values inside the ball around the negative multiple are strictly larger than those inside the ball around the true latent vector. This property extends to f as well, and guarantees that the subgradient method will not converge to the spurious point (Appendix A.5).
Author Contributions
Conceptualization, P.H. and V.V.; Formal analysis, J.C.; Writing—original draft, J.C.; Writing—review and editing, J.C. and P.H.; Supervision, P.H. and V.V. All authors have read and agreed to the published version of the manuscript.
Funding
PH was partially supported by NSF CAREER Grant DMS-1848087 and NSF Grant DMS-2022205.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A. Supporting Lemmas and Proof of Theorem 3
The proof of Theorem 3 is provided in Appendix A.7. We begin this section with a set of preliminary results and supporting lemmas.
Appendix A.1. Notation
We collect the notation that is used throughout the paper. For any real number a, let and for any vector , denote the entrywise application of relu as . Let be the diagonal matrix with i-th diagonal element equal to 1 if and 0 otherwise. For any vector x we denote with its Euclidean norm and for any matrix A we denote with its spectral norm and with its Frobenius norm. The euclidean inner product between two vectors a and b is , while for two matrices A and B their Frobenius inner product will be denoted by . For any nonzero vector , let . For a set S we will write for its cardinality and for its complement. Let be the Euclidean ball of radius r centered at x, and be the unit sphere in . Let and for let where g is defined in (A1). We will write to mean that there exists a positive constant C such that and similarly if . Additionally we will use when , where the norm is understood to be the absolute value for scalars, the Euclidean norm for vectors and the spectral norm for matrices.
Appendix A.2. Preliminaries
For later convenience we will define the following vectors:
Note then that when f is differentiable at x, then and in particular when then we have .
The following function controls how the angles are contracted by a ReLU layer:
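For reference, the following is the form of this angle-contraction function as given in [15]; we restate it here since it drives the recursion for the angles across layers used below:

$$
g(\theta) \;=\; \cos^{-1}\!\left( \frac{(\pi - \theta)\cos\theta + \sin\theta}{\pi} \right).
$$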
As we mentioned in Section 4.1, our analysis is based on showing that the subgradients of f are uniformly close to the vector field given by
where
and for g given by (A1) and .
Lemma A1
(Lemma 8 in [15]). Suppose that and the WDC holds with , then for all nonzero ,
Proof.
The first two bounds can be found in [15] (Lemma 8). The third bound follows by noticing that the WDC implies:
where we used and for all . □
The next lemma shows that the noiseless gradient concentrates around .
Lemma A2.
Suppose and the WDC holds with , then for all nonzero :
We now use the characterization of the Clarke subdifferential given in (9) to derive a bound on the concentration of around up to the noise level.
Lemma A3.
Under the assumptions of Lemma A2, and assuming , for any :
From the above and the bound on the noise level we can bound the norm of the steps taken by Algorithm 1.
Lemma A4.
Under the assumptions of Lemma A3, and assuming that ω satisfies (13), for any :
Appendix A.3. Iterates Stay Bounded
In this section we prove that all the iterates generated by Algorithm 1 remain inside the Euclidean ball where and .
Lemma A5.
Let the assumptions of Lemma A4 be satisfied, with . Then for any with and any , it holds that .
From the previous lemma we can now derive the boundedness of the iterates of Algorithm 1.
Proposition A1.
Under the assumptions of Theorem 3, if it follows that . Furthermore if , the iterates of the Algorithm 1 satisfy for all and .
Proof.
Assume , then the conclusions follows from Lemma A5. Assuming instead that , note that
using Lemma A4, and the assumptions on and . Finally observe that if then the same holds for .
Finally if for some it was the case that , then this would imply that which cannot happen because by Lemma A4 and the choice of the step size it holds that . □
Appendix A.4. Convergence to
We define the set outside which we can lower bound the norm of as
where we take
Outside this set the subgradients of f are bounded below in norm and the landscape has a favorable optimization geometry.
Lemma A6.
Let , then for all
Moreover let and then
for all and .
Based on the previous lemma we can prove the main result of this section.
Proposition A2.
Under the assumptions of Theorem 3, if then
for some numerical constant . Moreover there exists an integer such .
Proof.
Let and assume that , then . By the mean value theorem for Clarke subdifferentials [61] (Theorem 8.13), there exists such that for and a it holds that
where the first inequality follows from the triangle inequality and the second from Equation (A9). Next observe that by (A8)
which together with the definition of and (A11) gives (A10).
Next take and assume , so that . Observe that
we then obtain (A10) by proceeding as before.
Finally, the claim on the maximum number of iterations follows directly by a telescoping sum of (A10). □
Appendix A.5. Convergence to a Neighborhood Around
In the previous section we have shown that after a finite number of steps the iterates of Algorithm 1 enter the region defined above. In this section we show that, thanks to the negation step in the descent algorithm, they will eventually be confined to a neighborhood of the true latent vector.
The following lemma shows that this region is contained inside two balls, one around the true latent vector and one around a negative multiple of it.
Lemma A7.
Suppose , then we have where
where are numerical constants and such that as .
We furthermore observe that the loss values on the ball around the negative multiple are strictly higher than those on the ball around the true latent vector.
Lemma A8.
Suppose that , the WDC holds with and H satisfies (13). Then for any , it holds that
for all and where is a universal constant.
The main result of this section establishes the convergence of the iterates to a neighborhood of the true latent vector.
Proposition A3.
Under the assumptions of Theorem 3, if , then there exists a finite number of steps such that . In particular it holds that
Proof.
Either the iterate is already in the region, or by Proposition A2 there exists an iteration at which it enters it. By the choice of the radius, the definition of the region in (A7), and the assumption on the noise level (13), it follows that the hypotheses of Lemma A7 are satisfied. We therefore consider the two neighborhoods around the true latent vector and its negative multiple and conclude that the iterate lies in one of them.
We next notice that and . We can then use Lemma A8 to conclude that by the negation step, if then , otherwise we will have .
We now analyze the case in which the iterate lies in the neighborhood of the negative multiple. Applying again Proposition A2, we have that there exists an integer at which the iterate leaves this neighborhood. Furthermore, Proposition A2 controls the corresponding objective values, while from Lemma A8 we know that the values near the negative multiple remain strictly larger. We conclude therefore that the iterate must be in the neighborhood of the true latent vector.
In summary we obtained that there exists an integer such that . Finally Equation (A13) follows from the definition of in (A7) and . □
Appendix A.6. Convergence to up to Noise
Lemma A9.
Suppose the WDC holds with , then for all and , it holds that
where , and .
Based on the previous condition on the direction of the sub-gradients we can then prove that the iterates of the algorithm converge to up to noise.
Proposition A4.
Under the assumptions of Theorem (3), if then for any it holds that and furthermore
where and are numerical constants.
Proof.
If , by Proposition A2, it follows that . Furthermore, the assumptions of Lemma A9 are satisfied and we can write:
Next recall that satisfies (13), and , then
Therefore for and small enough we obtain that and by induction this holds for all . Finally we obtain (A14) by letting and in (A15). □
Appendix A.7. Proof of Theorem 3
We begin recalling the following fact on the local Lipschitz property of the generative network G under the WDC.
Lemma A10
(Lemma 21 in [16]). Suppose , and the WDC holds with . Then it holds that:
We then conclude the proof of Theorem 3 by using the above lemma and the results in the previous sections.
- (I)
- By assumption so that according to Proposition A1 for any it holds that .
- (II)
- By Proposition A3, there exists an integer T such that and therefore it satisfies the conclusions of Theorem 3.A
- (III)
- Once in this neighborhood, the iterates of Algorithm 1 converge to the true latent vector up to the noise level, as shown by Proposition A4 and Equation (A14), which corresponds to (11) in Theorem 3.B.
- (IV)
- The reconstruction error (12) in Theorem 3.B, follows then from (11) by applying Lemma A10 and the lower bound (A4).
Appendix B. Supplementary Proofs
Appendix B.1. Supplementary Proofs for Appendix A.2
Below we prove Lemma A2 on the concentration of the gradient of f at a differentiable point.
Proof of Lemma A2.
We begin by noticing that:
Below we show that:
and
from which the thesis follows.
Regarding Equation (A16) observe that:
where in the first inequality we used and in the second we used Equations (A5) and (A6) of Lemma A1.
Next note that:
where in the second inequality we have used the bound (A6) and the relevant definition. Equation (A17) is then found by appealing to Equation (A5) in Lemma A1. □
The previous lemma is now used to control the concentration of the subgradients of f around .
Proof of Lemma A3.
When f is differentiable at x, , so that by Lemma A2 and the assumption on the noise:
Observe, now, that by (9), for any , conv, and therefore for some , . Moreover for each there exist a such that . Therefore using Equation (A18), the continuity of with respect to nonzero x and :
□
The above results are now used to bound the norm of .
Proof of Lemma A4.
Since observe that , therefore by the assumption on the noise level and Lemma A3 it follows that for any and
Next observe that, using the previous bound, we have:
and from we obtain the thesis. □
Appendix B.2. Supplementary Proofs for Appendix A.3
In this section we prove Lemma A5 which implies that the norm of the iterates does not increase in the region .
Proof of Lemma A5.
Note that the thesis is equivalent to . Next recall that by the WDC for any and :
At a nonzero differentiable point with , then
Next, by Lemma A4, the definition of the step length and , we have:
which, using , gives
We can then conclude by observing that by the assumptions and for small enough constants .
At a non-differentiable point x, by the characterization of the Clarke subdifferential, we can write where then
which implies that the lower bound (A19) also holds for . Similarly Lemma A4 leads to the upper bound (A20) also for , which then leads to the thesis . □
Appendix B.3. Supplementary Proofs for Appendix A.4
In the next lemmas we show that is locally Lipschitz.
Lemma A11.
For all
Proof.
Note that where and are defined in (A23). Then observe that by (A28) and (A29) it follows that . Furthermore by [16] (Lemma 18) for any nonzero we have
Next notice that
where by triangle inequality the first term on the left hand side can be bounded as
Finally note that from the bound (A21) we obtain:
where we used
□
Based on the previous lemma we can now prove that is locally Lipschitz.
Lemma A12.
Let , , with and , then for it holds that:
Proof.
Consider and observe that by Lemma A4, and the choice of , for any and any we have . It follows that
and in particular
Therefore by Lemma A11 we deduce that
where in the second inequality we used by Proposition A1. The thesis is obtained by substituting the definition of and using . □
Based on the previous result we can now prove Lemma A6.
Proof of Lemma A6.
Let ,
where we used the definition in Equation (A7) and Lemma A3.
Next take and with . Then for any
where in the first inequality we used Lemma A3 and Lemma A12, in the second inequality (A22) and in the last one (A8). □
Appendix B.4. Supplementary Proofs for Appendix A.5
Below we prove that the region of where we cannot control the norm of the vector field is contained in two balls around and .
We prove Lemma A7 by showing the following.
Lemma A13.
Suppose . Define:
where and . If , then we have either:
or
In particular, we have:
where are numerical constants and as .
Proof of Lemma A13.
Without loss of generality, let and , for some , and . Recall that we call and . We then introduce the following notation:
where with g as in (A1), and observe that . Let , then we can write:
Using the definition of and we obtain:
and conclude that since , then:
We now list some bounds that will be useful in the subsequent analysis. We have:
The identities (A26) through (A34) can be found in Lemma 16 of [16], while the identity (A35) follows by noticing that and using (A27) together with .
Bound on R.
We now show that if , then and therefore .
If , then the claim is trivial. Take , then note that either or must hold. If then from (A25) it follows that which implies:
using (A29) and (A35) in the second inequality and in the third. Next take , then (A24) implies which in turn results in:
using (A28), (A29), (A35) and . In conclusion if then .
Bounds on .
We proceed by showing that we only have to analyze the small angle case and the large angle case .
At least one of the following three cases must hold:
- (1)
- : Then we have or as .
- (2)
- : Then (A24), (A35) and yield . Using (A30), we then get .
- (3)
- and : Then (A27) gives which used with (A24) leads to:Then using (A35), the assumption on and we obtain . The latter together with (A30) leads to . Finally as then (A25) leads to . Therefore as and , we can conclude that .
Inspecting the three cases, and recalling that , we can see that it suffices to analyze the small angle case and the large angle case .
Small angle case.
We assume with and show that .
We begin collecting some bounds. Since , then assuming , which holds true since . Moreover from (A29) we have . Finally observe that for . We then have so that and . We can therefore rewrite (A24) as:
Using the bound and the definition of , we obtain:
Large angle case.
Here we assume with and show that it must be .
From (A33) we know that , while from (A34) we know that as long as . Moreover for large angles and , it holds . These bounds lead to:
and using :
Then recall that (A24) is equivalent to , that is:
and in particular:
where we used , the definition of and .
Controlling the distance.
We have shown that it is either and or and . We can therefore conclude that it must be either or .
Observe that if a two dimensional point is known to have magnitude within a given tolerance of some r and is known to be within a given angle from 0, then its Euclidean distance to the corresponding reference point is no more than the sum of these contributions. Similarly we can write:
In the small angle case, by (A36), (A38), and , we have:
Next we notice that and as follows from the definition and (A31), (A32). Then considering the large angle case and using (A37) we have:
The latter, together with (A38), yields:
where in the second inequality we have used and in the third .
We conclude by noticing that as as shown in [16] (Lemma 16). □
We will next show that the values of the loss function in a neighborhood of the true latent vector are strictly smaller than those in a neighborhood of its negative multiple.
Recall that , we next define the following loss functions:
In particular notice that . Below we show that assuming the WDC is satisfied, concentrates around .
Lemma A14.
Suppose that and the WDC holds with , then for all nonzero
Proof.
Observe that:
We analyze each term separately. The first term can be bounded as:
where in the first inequality we used (A6) and in the second inequality (A5). Similarly we can bound the second term:
We next note that and therefore from (A6) and we obtain:
We can then conclude that:
□
We next consider this loss and show that in a neighborhood of the negative multiple it has larger values than in a neighborhood of the true latent vector.
Lemma A15.
Fix and then:
Proof.
Let then observe that and . Then observe that:
using and as long as . We can therefore write:
where in the second inequality we used for all . We then observe that:
where in the second inequality we have used and in the last one and . We can then conclude that:
We next take which implies and . We then note that for and as defined in (A23) we have:
where the second inequality is due to (A33) and (A34), the rest from , and . Finally using , we can then conclude that:
□
The above two lemmas are now used to prove Lemma A8.
Proof of Lemma A8.
Let for a that will be specified below, and observe that by the assumptions on the noise:
and therefore by Lemma A14:
We next take and , so that by Lemma A15 and the assumption , we have:
Similarly if , and we obtain:
In order to guarantee that , it suffices to have:
with , that is to require:
Finally notice that by Lemma 17 in [16] it holds that for some numerical constant K, we therefore choose for some small enough. □
Appendix B.5. Supplementary Proofs for Appendix A.6
In this section we use strong convexity and smoothness to prove convergence to the true latent vector up to the noise level. The idea is to show that every vector in the subdifferential points towards the true latent vector. Recall that the gradient in the noiseless case was:
We show that by continuity of , when x is close to , then it is close to:
which in turn concentrates around:
by the WDC.
We begin by recalling the following result which can be found in the proof of Lemma 22 of [16].
Lemma A16.
Suppose and the WDC holds with . Then it holds that:
We now prove that for x close to .
Lemma A17.
Suppose and the WDC holds with . Then it holds that:
Proof.
Let and . Then observe that:
where the second inequality follows from Lemma A1, and the third from Lemma A16. □
We next prove that by the WDC .
Lemma A18.
Suppose the WDC holds with . Then for all nonzero :
Proof.
For notational convenience we define the scaled identity in , and . Next observe that:
where the second inequality follows from Lemma A1 and the third from (17) in [15]. We conclude by noticing that . □
Next consider the quartic function and observe that:
Following [63], we show that this quartic is smooth and strongly convex in a neighborhood of the signal, in turn deriving the result in Lemma A19.
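To make the structure of this argument concrete, recall the prototypical rank-one quartic appearing in analyses of this kind; this is a sketch under the assumption that the relevant quartic has this least-squares form, and the constants in Lemma A19 may refer to a scaled version:

$$
\varphi(w) = \tfrac{1}{4}\bigl\| w w^{\mathsf T} - z z^{\mathsf T} \bigr\|_F^2,
\qquad
\nabla \varphi(w) = \bigl( w w^{\mathsf T} - z z^{\mathsf T} \bigr)\, w,
$$

which is smooth and satisfies a local regularity (strong-convexity-like) condition around $\pm z$ in the sense of [63].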
Lemma A19.
Assume with . Take , then:
where .
Proof.
Let with , then:
since satisfies . Using (A40) we then obtain
We therefore have:
where in the second line we have used (A39) and in the third .
Next notice that using (A39) we have
and therefore, for all
The above bounds imply in particular that for all the gradient of satisfies the regularity condition (see [63]):
where and . By (A41) we can then conclude that for :
Finally letting , , and by the assumptions on and the thesis follows. □
We finally can prove Lemma A9.
Proof of Lemma A9.
Let , then and since we have from Lemma A17 and Lemma A18
If is a differentiable point of f, then by Lemma A19 and the assumption (13) on the noise H, it holds that:
where , and .
In general, if and , by the characterization of Clarke subdifferential (9) and the previous results:
We finally obtain the thesis by noting that by the assumptions on it holds that and taking , and . □
Appendix C. Proof of Theorem 1
By Theorem 3 it suffices to show that with high probability the weight matrices satisfy the WDC, and that the assumed noise level upper bounds the spectral norm of the relevant noise term, where
- in the Spiked Wishart Model where and the ;
- in the Spiked Wigner Model where .
Regarding the Weight Distribution Condition (WDC), we observe that it was initially proposed in [15], where it was shown to hold with high probability for networks with random Gaussian weights under an expansivity condition on their dimensions. It was later shown in [62] that a less restrictive expansivity rate is sufficient.
Lemma A20
(Theorem 3.2 in [62]). There are constants with the following property. Let and suppose has entries. Suppose that . Then with probability at least , W satisfies the WDC with constant ϵ.
By a union bound over all layers, using the above result we can conclude that the WDC holds simultaneously for all layers of the network with high probability. Note in particular that this argument does not require the independence of the weight matrices.
By Lemma A20, with high probability the random generative network G satisfies the WDC. Therefore if we can guarantee that the assumptions on the noise term H are satisfied, then the proof of the main Theorem 1 follows from the deterministic Theorem 3 and Lemma A20.
Before turning to the bounds on the noise terms in the spiked models, we recall the following lemma, which bounds the number of possible active-row patterns of the network over the latent space. Note that this is related to the number of linear regions defined by a deep ReLU network.
Lemma A21.
Consider a network G as defined in (3) with , weight matrices with i.i.d. entries . Then, with probability one, for any the number of different matrices is
Proof.
The first inequality follows from Lemma 16 and the proof of Lemma 17 in [15]. For the second inequality notice that as , and it follows that □
In the next section we use this lemma to control the noise term on the event where the WDC holds.
Appendix C.1. Spiked Wigner Model
Recall that in the Wigner model the symmetric noise matrix is drawn from a Gaussian Orthogonal Ensemble GOE(n), that is, its entries above the diagonal are independent centered Gaussians and its diagonal entries are independent centered Gaussians with twice the variance. Our goal is to bound the relevant quadratic form of the noise uniformly over x with high probability.
Fix δ ∈ (0, 1) and let N_δ be a δ-net on the relevant unit sphere, so that every unit vector is within distance δ of the net and the cardinality of the net is controlled by the standard volumetric bound (see for example [64]).
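For completeness, the standard covering bound from [64] controlling the cardinality of such a net is (stated here for a sphere of dimension k; the dimension relevant to the argument may differ):

$$
|\mathcal{N}_\delta| \;\le\; \left(1 + \frac{2}{\delta}\right)^{k} \;\le\; \left(\frac{3}{\delta}\right)^{k}
\qquad \text{for } \delta \le 1.
$$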
Observe next that for any by the definition of GOE(n) it holds that
Therefore for any let , then . In particular by (A6), the quadratic form is sub-Gaussian with parameter given by
Then for fixed , standard sub-Gaussian tail bounds (e.g., [64]) and a union bound over give for any
Lemma A21 then ensures that the number of possible active-row patterns is suitably bounded, so a union bound over this set allows us to conclude that
Choosing then and substituting the definition of , we obtain
which implies the thesis as and .
Appendix C.2. Spiked Wishart Model
Each row of the matrix Y in (1) can be seen as an i.i.d. sample from a centered Gaussian with spiked covariance. In the minimization problem (4) we take the data matrix entering the objective to be built from the empirical covariance matrix of these samples. The symmetric noise matrix H is then given by the deviation of the empirical covariance from its population counterpart, which vanishes by the Law of Large Numbers as the number of samples grows. We bound the corresponding quadratic form with high probability uniformly over the latent space.
Fix , let be a -net on the sphere such that , and notice that:
By a union bound on we obtain for any fixed :
Let and , so that we can write
and in particular
Observe then that with . It follows for by the small deviation bound for random variables (e.g., [33] (Example 2.11))
Recall now that , then proceeding as for the Wigner case by a union bound over all possible :
Substituting we find that:
since .
Similarly if by large deviation bounds for sub-exponential variables
Substituting we find that:
Finally observe that using (A6) for bounding (by the WDC) and for bounding , we have
which combined with (A42) and (A43) implies the thesis.
References
- Johnstone, I.M. On the distribution of the largest eigenvalue in principal components analysis. Ann. Stat. 2001, 29, 295–327. [Google Scholar] [CrossRef]
- Amini, A.A.; Wainwright, M.J. High-dimensional analysis of semidefinite relaxations for sparse principal components. In Proceedings of the 2008 IEEE International Symposium on Information Theory, Toronto, ON, Canada, 6–11 July 2008; pp. 2454–2458. [Google Scholar]
- Deshpande, Y.; Montanari, A. Sparse PCA via covariance thresholding. In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 8–13 December 2014; pp. 334–342. [Google Scholar]
- Vu, V.; Lei, J. Minimax rates of estimation for sparse PCA in high dimensions. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics, La Palma, Canary Islands, Spain, 21–23 April 2012; pp. 1278–1286. [Google Scholar]
- Abbe, E.; Bandeira, A.S.; Bracher, A.; Singer, A. Decoding binary node labels from censored edge measurements: Phase transition and efficient recovery. IEEE Trans. Netw. Sci. Eng. 2014, 1, 10–22. [Google Scholar] [CrossRef]
- Bandeira, A.S.; Chen, Y.; Lederman, R.R.; Singer, A. Non-unique games over compact groups and orientation estimation in cryo-EM. Inverse Probl. 2020, 36, 064002. [Google Scholar] [CrossRef]
- Javanmard, A.; Montanari, A.; Ricci-Tersenghi, F. Phase transitions in semidefinite relaxations. Proc. Natl. Acad. Sci. USA 2016, 113, E2218–E2223. [Google Scholar] [CrossRef] [PubMed]
- McSherry, F. Spectral partitioning of random graphs. In Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science, Newport Beach, CA, USA, 8–11 October 2001; pp. 529–537. [Google Scholar]
- Deshpande, Y.; Abbe, E.; Montanari, A. Asymptotic mutual information for the binary stochastic block model. In Proceedings of the 2016 IEEE International Symposium on Information Theory (ISIT), Barcelona, Spain, 10–15 July 2016; pp. 185–189. [Google Scholar]
- Moore, C. The computer science and physics of community detection: Landscapes, phase transitions, and hardness. arXiv 2017, arXiv:1702.00467. [Google Scholar]
- D’Aspremont, A.; Ghaoui, L.; Jordan, M.; Lanckriet, G. A direct formulation for sparse PCA using semidefinite programming. Adv. Neural Inf. Process. Syst. 2004, 17, 41–48. [Google Scholar] [CrossRef]
- Berthet, Q.; Rigollet, P. Optimal detection of sparse principal components in high dimension. Ann. Stat. 2013, 41, 1780–1815. [Google Scholar] [CrossRef]
- Bandeira, A.S.; Perry, A.; Wein, A.S. Notes on computational-to-statistical gaps: Predictions using statistical physics. arXiv 2018, arXiv:1803.11132. [Google Scholar] [CrossRef]
- Kunisky, D.; Wein, A.S.; Bandeira, A.S. Notes on computational hardness of hypothesis testing: Predictions using the low-degree likelihood ratio. arXiv 2019, arXiv:1907.11636. [Google Scholar]
- Hand, P.; Voroninski, V. Global guarantees for enforcing deep generative priors by empirical risk. IEEE Trans. Inf. Theory 2019, 66, 401–418. [Google Scholar] [CrossRef]
- Heckel, R.; Huang, W.; Hand, P.; Voroninski, V. Rate-optimal denoising with deep neural networks. arXiv 2018, arXiv:1805.08855. [Google Scholar]
- Hand, P.; Leong, O.; Voroninski, V. Phase retrieval under a generative prior. In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018; pp. 9136–9146. [Google Scholar]
- Ma, F.; Ayaz, U.; Karaman, S. Invertibility of convolutional generative networks from partial measurements. Adv. Neural Inf. Process. Syst. 2018, 31, 9628–9637. [Google Scholar]
- Hand, P.; Joshi, B. Global Guarantees for Blind Demodulation with Generative Priors. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 11535–11543. [Google Scholar]
- Song, G.; Fan, Z.; Lafferty, J. Surfing: Iterative optimization over incrementally trained deep networks. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 15034–15043. [Google Scholar]
- Bora, A.; Jalal, A.; Price, E.; Dimakis, A.G. Compressed sensing using generative models. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 537–546. [Google Scholar]
- Asim, M.; Shamshad, F.; Ahmed, A. Blind Image Deconvolution using Deep Generative Priors. arXiv 2019, arXiv:cs.CV/1802.04073. [Google Scholar] [CrossRef]
- Hand, P.; Leong, O.; Voroninski, V. Compressive Phase Retrieval: Optimal Sample Complexity with Deep Generative Priors. arXiv 2020, arXiv:2008.10579. [Google Scholar]
- Hand, P.; Voroninski, V. Compressed sensing from phaseless gaussian measurements via linear programming in the natural parameter space. arXiv 2016, arXiv:1611.05985. [Google Scholar]
- Li, X.; Voroninski, V. Sparse signal recovery from quadratic measurements via convex programming. SIAM J. Math. Anal. 2013, 45, 3019–3033. [Google Scholar] [CrossRef]
- Ohlsson, H.; Yang, A.Y.; Dong, R.; Sastry, S.S. Compressive phase retrieval from squared output measurements via semidefinite programming. arXiv 2011, arXiv:1111.6323. [Google Scholar] [CrossRef]
- Cai, T.; Li, X.; Ma, Z. Optimal rates of convergence for noisy sparse phase retrieval via thresholded Wirtinger flow. Ann. Stat. 2016, 44, 2221–2251. [Google Scholar] [CrossRef]
- Wang, G.; Zhang, L.; Giannakis, G.B.; Akçakaya, M.; Chen, J. Sparse phase retrieval via truncated amplitude flow. IEEE Trans. Signal Process. 2017, 66, 479–491. [Google Scholar] [CrossRef]
- Yuan, Z.; Wang, H.; Wang, Q. Phase retrieval via sparse wirtinger flow. J. Comput. Appl. Math. 2019, 355, 162–173. [Google Scholar] [CrossRef]
- Aubin, B.; Loureiro, B.; Maillard, A.; Krzakala, F.; Zdeborová, L. The spiked matrix model with generative priors. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 8366–8377. [Google Scholar]
- Cocola, J.; Hand, P.; Voroninski, V. Nonasymptotic Guarantees for Spiked Matrix Recovery with Generative Priors. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 11 December 2020; Volume 33. [Google Scholar]
- Johnstone, I.M.; Lu, A.Y. On consistency and sparsity for principal components analysis in high dimensions. J. Am. Stat. Assoc. 2009, 104, 682–693. [Google Scholar] [CrossRef] [PubMed]
- Wainwright, M.J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint; Cambridge University Press: Cambridge, UK, 2019; Volume 48. [Google Scholar]
- Montanari, A.; Richard, E. Non-negative principal component analysis: Message passing algorithms and sharp asymptotics. IEEE Trans. Inf. Theory 2015, 62, 1458–1484. [Google Scholar] [CrossRef]
- Deshpande, Y.; Montanari, A.; Richard, E. Cone-constrained principal component analysis. Adv. Neural Inf. Process. Syst. 2014, 27, 2717–2725. [Google Scholar]
- Zou, H.; Hastie, T.; Tibshirani, R. Sparse principal component analysis. J. Comput. Graph. Stat. 2006, 15, 265–286. [Google Scholar] [CrossRef]
- Krauthgamer, R.; Nadler, B.; Vilenchik, D.; others. Do semidefinite relaxations solve sparse PCA up to the information limit? Ann. Stat. 2015, 43, 1300–1322. [Google Scholar] [CrossRef]
- Berthet, Q.; Rigollet, P. Computational lower bounds for Sparse PCA. arXiv 2013, arXiv:1304.0828. [Google Scholar]
- Cai, T.; Ma, Z.; Wu, Y. Sparse PCA: Optimal rates and adaptive estimation. Ann. Stat. 2013, 41, 3074–3110. [Google Scholar] [CrossRef]
- Ma, T.; Wigderson, A. Sum-of-squares lower bounds for sparse PCA. Adv. Neural Inf. Process. Syst. 2015, 28, 1612–1620. [Google Scholar]
- Lesieur, T.; Krzakala, F.; Zdeborová, L. Phase transitions in sparse PCA. In Proceedings of the 2015 IEEE International Symposium on Information Theory (ISIT), Hong Kong, China, 14–19 June 2015; pp. 1635–1639. [Google Scholar]
- Brennan, M.; Bresler, G. Optimal average-case reductions to sparse pca: From weak assumptions to strong hardness. arXiv 2019, arXiv:1902.07380. [Google Scholar]
- Arous, G.B.; Wein, A.S.; Zadik, I. Free energy wells and overlap gap property in sparse PCA. In Proceedings of the Conference on Learning Theory, PMLR, Graz, Austria, 9–12 July 2020; pp. 479–482. [Google Scholar]
- Fan, J.; Liu, H.; Wang, Z.; Yang, Z. Curse of heterogeneity: Computational barriers in sparse mixture models and phase retrieval. arXiv 2018, arXiv:1808.06996. [Google Scholar]
- Richard, E.; Montanari, A. A statistical model for tensor PCA. Adv. Neural Inf. Process. Syst. 2014, 27, 2897–2905. [Google Scholar]
- Decelle, A.; Krzakala, F.; Moore, C.; Zdeborová, L. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Phys. Rev. E 2011, 84, 066106. [Google Scholar] [CrossRef] [PubMed]
- Perry, A.; Wein, A.S.; Bandeira, A.S.; Moitra, A. Message-Passing Algorithms for Synchronization Problems over Compact Groups. Commun. Pure Appl. Math. 2018, 71, 2275–2322. [Google Scholar] [CrossRef]
- Oymak, S.; Jalali, A.; Fazel, M.; Eldar, Y.C.; Hassibi, B. Simultaneously structured models with application to sparse and low-rank matrices. IEEE Trans. Inf. Theory 2015, 61, 2886–2908. [Google Scholar] [CrossRef]
- Dhar, M.; Grover, A.; Ermon, S. Modeling sparse deviations for compressed sensing using generative models. arXiv 2018, arXiv:1807.01442. [Google Scholar]
- Shah, V.; Hegde, C. Solving linear inverse problems using gan priors: An algorithm with provable guarantees. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 4609–4613. [Google Scholar]
- Mixon, D.G.; Villar, S. Sunlayer: Stable denoising with generative networks. arXiv 2018, arXiv:1803.09319. [Google Scholar]
- Yeh, R.A.; Chen, C.; Lim, T.Y.; Schwing, A.G.; Hasegawa-Johnson, M.; Do, M.N. Semantic image inpainting with deep generative models. arXiv 2016, arXiv:1607.07539. [Google Scholar]
- Sønderby, C.K.; Caballero, J.; Theis, L.; Shi, W.; Huszár, F. Amortised map inference for image super-resolution. arXiv 2016, arXiv:1610.04490. [Google Scholar]
- Yang, G.; Yu, S.; Dong, H.; Slabaugh, G.; Dragotti, P.L.; Ye, X.; Liu, F.; Arridge, S.; Keegan, J.; Guo, Y.; et al. DAGAN: Deep de-aliasing generative adversarial networks for fast compressed sensing MRI reconstruction. IEEE Trans. Med. Imaging 2017, 37, 1310–1321. [Google Scholar] [CrossRef]
- Qiu, S.; Wei, X.; Yang, Z. Robust One-Bit Recovery via ReLU Generative Networks: Improved Statistical Rates and Global Landscape Analysis. arXiv 2019, arXiv:1908.05368. [Google Scholar]
- Xue, Y.; Xu, T.; Zhang, H.; Long, L.R.; Huang, X. Segan: Adversarial network with multi-scale l 1 loss for medical image segmentation. Neuroinformatics 2018, 16, 383–392. [Google Scholar] [CrossRef] [PubMed]
- Heckel, R.; Hand, P. Deep Decoder: Concise Image Representations from Untrained Non-convolutional Networks. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
- Heckel, R.; Soltanolkotabi, M. Denoising and regularization via exploiting the structural bias of convolutional generators. arXiv 2019, arXiv:1910.14634. [Google Scholar]
- Heckel, R.; Soltanolkotabi, M. Compressive sensing with un-trained neural networks: Gradient descent finds the smoothest approximation. arXiv 2020, arXiv:2005.03991. [Google Scholar]
- Aubin, B.; Loureiro, B.; Baker, A.; Krzakala, F.; Zdeborová, L. Exact asymptotics for phase retrieval and compressed sensing with random generative priors. In Proceedings of the First Mathematical and Scientific Machine Learning Conference, PMLR, Princeton, NJ, USA, 20–24 July 2020; pp. 55–73. [Google Scholar]
- Clason, C. Nonsmooth Analysis and Optimization. arXiv 2017, arXiv:1708.04180. [Google Scholar]
- Daskalakis, C.; Rohatgi, D.; Zampetakis, M. Constant-Expansion Suffices for Compressed Sensing with Generative Priors. arXiv 2020, arXiv:2006.04237. [Google Scholar]
- Chi, Y.; Lu, Y.M.; Chen, Y. Nonconvex optimization meets low-rank matrix factorization: An overview. IEEE Trans. Signal Process. 2019, 67, 5239–5269. [Google Scholar] [CrossRef]
- Vershynin, R. High-Dimensional Probability: An Introduction with Applications in Data Science; Cambridge University Press: Cambridge, UK, 2018; Volume 47. [Google Scholar]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).