A Composite Initialization Method for Phase Retrieval

Abstract: Phase retrieval is a classical inverse problem of recovering a signal from a system of phaseless constraints. Many recently proposed methods for phase retrieval, such as PhaseMax and gradient-descent algorithms, enjoy benign theoretical guarantees on the condition that an elaborate estimate of the true solution is provided. Current initialization methods do not perform well when the number of measurements is low, which deteriorates the success rate of current phase retrieval methods. We propose a new initialization method that obtains an estimate of the original signal with uniformly higher accuracy by combining the advantages of the null vector method and the maximal correlation method. The constructed spectral matrix for the proposed initialization method has a simple and symmetrical form. A lower error bound is proved theoretically as well as verified numerically.


Introduction
Phase retrieval (PR) has many applications in science and engineering, including X-ray crystallography [1], molecular imaging [2], biological imaging [3], and astronomy [4]. Mathematically, phase retrieval is the problem of finding a signal x ∈ R^n or C^n satisfying the phaseless constraints

b_i = |⟨a_i, x⟩| + ε_i, i = 1, . . . , m, (1)

where b_i ∈ R is the observed measurement, a_i ∈ R^n or C^n is the measuring vector, and ε_i denotes noise. In optics and quantum physics, a_i is related to the Fourier transform or the fractional Fourier transform [5]. To remove the theoretical barrier, recent works also focus on generic measurement vectors, namely a_i ∈ R^n independently sampled from N(0, I_{n×n}).
Current methods for solving phase retrieval can be generally divided into two groups: convex and non-convex approaches. Convex methods, e.g., PhaseLift [6] and PhaseCut [7], utilize a semidefinite relaxation to transform the original problem into a convex one. Nevertheless, the heavy computational cost of these methods hinders their application in practical scenarios. On the other hand, non-convex methods, including Gerchberg-Saxton (GS) [8,9], the (truncated) Wirtinger flow (WF/TWF) [10,11], and the (truncated) amplitude flow (AF/TAF), have a significant advantage in lower computational cost and sampling complexity (i.e., the minimal required m/n to achieve successful recovery) [12]. More recently, two variants of AF using the reweighting [13] and the smoothing technique [14,15] have been proposed to further lower the sampling complexity. However, for the aforementioned non-convex methods, the rate of successful reconstruction relies upon an elaborate initialization. Moreover, an initialization point near the true solution is also a necessity for the convergence guarantees in theoretical analysis.

Prior Art
Current initialization methods for the PR problem include the spectral method [16], the orthogonality promoting method [12], the null vector method [17], and the (weighted) maximal correlation method [13]. These methods first construct a spectral matrix of the form M = ∑_{i=1}^m w_i a_i a_i^H, where each weight w_i is a positive real number. The estimate is then given as the (scaled) eigenvector corresponding to the largest or smallest eigenvalue of M. The spectral method and the (weighted) maximal correlation method fall into one category, as they seek the leading eigenvector of M as the estimate. We categorize them as correlation-focused methods, since their spectral matrices assign more weight to the measuring vectors that are more correlated with the original signal. Conversely, the null vector and orthogonality promoting methods can be regarded as orthogonality-focused methods: for a vector a_i less correlated with the original signal, the corresponding weight w_i in the spectral matrix is larger. These two types of initialization methods each have their own merits and demerits. The correlation-focused methods perform better when the oversampling rate m/n is low, whereas, when m/n is large enough, the orthogonality-focused methods provide a more accurate estimate.

This Work
In this paper, we propose a method combining the advantages of the above two types of initialization methods. Our intuition is to construct a composite spectral matrix, which is a weighted sum of the spectral matrices of a correlation-focused method and an orthogonality-focused method. Hence, the new method is termed the composite initialization method.

Article Organization and Notations
This paper adopts the following notations. Bold lower-case letters denote column vectors, e.g., b, z. Bold capital letters such as A denote matrices. (·)^T and (·)^H stand for the transpose and the conjugate transpose, respectively. Calligraphic letters such as I denote an index set, and |I| denotes the number of elements of I. {a_i}_{i=1}^m is a short notation for a set of elements of the form {a_1, . . . , a_m}. ‖x‖_2 denotes the ℓ_2-norm of x, and ‖x‖ means ‖x‖_2 for simplicity, unless otherwise specified. The remainder of this article is organized as follows. The algorithm is given in Section 2. Section 3 provides the theoretical error analysis of the proposed initialization method, with some relevant technical error bounds placed in the Appendix. Section 4 illustrates the numerical performance and compares the proposed method with other initialization methods.

The Formulation of the Composite Initialization Method
This section provides the formulation of the proposed method. Without loss of generality, we assume that b_1 ≤ b_2 ≤ · · · ≤ b_m. The measuring vectors {a_i}_{i=1}^m are assumed to be independently sampled centered Gaussian variables with identity covariance for the convenience of theoretical analysis. Let x_0 be the true solution of (1). For concreteness, we focus on the real-valued Gaussian model. We further assume that ‖x_0‖ = 1, since this investigation focuses on the estimation of the original signal. In practice, ‖x_0‖ can be estimated by √(∑_{i=1}^m b_i²/m), since E[b_i²] = ‖x_0‖² under the Gaussian model. To begin with, we describe the null vector method and the maximal correlation method, which form the basis of our algorithm.
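The norm estimation step can be illustrated numerically. Below is a minimal sketch, assuming the noiseless amplitude model b_i = |a_i^T x_0|, under which E[b_i²] = ‖x_0‖² and hence ‖x_0‖ ≈ √(∑_i b_i²/m); all sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 2000
x0 = rng.standard_normal(n)            # unknown signal (norm not 1 here)

A = rng.standard_normal((m, n))        # rows a_i ~ N(0, I_n)
b = np.abs(A @ x0)                     # amplitude measurements b_i = |a_i^T x0|

# Since E[b_i^2] = ||x0||^2 for Gaussian a_i, the norm is estimated by
norm_est = np.sqrt(np.mean(b ** 2))
print(norm_est, np.linalg.norm(x0))    # the two values are close
```

With m = 20n measurements, the relative error of this estimate is typically a few percent.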
The null vector method first picks out a subset of vectors {a_i}_{i∈I_1}, where I_1 ⊂ {1, 2, · · · , m}. Then, the null vector method approximates the original signal x_0 by a vector that is most orthogonal to this subset. Since {b_i}_{i=1}^m is in ascending order, I_1 can be directly written as {1, 2, · · · , T_1}, where T_1 is the cardinality of I_1. The intuition of the null vector method is as follows. Since ∑_{i∈I_1} x^H a_i a_i^H x takes the very small value ∑_{i∈I_1} b_i² when x = x_0, one can construct the following minimization problem to match this property:

min_{‖x‖=1} ∑_{i∈I_1} |a_i^H x|². (2)

Solving (2) is equivalent to finding the smallest eigenvalue and the corresponding eigenvector of M_1 = ∑_{i∈I_1} a_i a_i^H. The so-called orthogonality promoting method proposed by Wang is an approximately equivalent method that transforms the minimization problem (2) into a maximization problem.
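The null vector estimate is easy to sketch numerically; the sizes and the subset size T_1 below are illustrative choices, not the paper's tuned parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 20, 400
x0 = rng.standard_normal(n)
x0 /= np.linalg.norm(x0)               # unit-norm true signal

A = rng.standard_normal((m, n))        # rows a_i ~ N(0, I_n)
b = np.abs(A @ x0)                     # amplitude measurements

# I_1: indices of the T1 smallest measurements (rows most orthogonal to x0).
T1 = 6 * n
I1 = np.argsort(b)[:T1]

# Spectral matrix M_1 of the null vector method and its smallest eigenvector.
M1 = A[I1].T @ A[I1]
eigvals, eigvecs = np.linalg.eigh(M1)  # eigh returns ascending eigenvalues
x_hat = eigvecs[:, 0]                  # estimate of x0 (up to sign)

print(abs(x_hat @ x0))                 # close to 1 for a good estimate
```

Since the sign of an eigenvector is arbitrary, the quality of the estimate is judged by |⟨x̂, x_0⟩| rather than by x̂ itself.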
The maximal correlation method is based on the opposite intuition: it picks out the subset of vectors {a_i}_{i∈I_2} corresponding to the largest measurements b_i. Since {b_i}_{i=1}^m is in ascending order, I_2 can also be written directly as {m − T_2 + 1, · · · , m}. The core idea of the maximal correlation method is to search for the vector that is most correlated with {a_i}_{i∈I_2}. This is achieved by solving the following maximization problem:

max_{‖x‖=1} ∑_{i∈I_2} |a_i^H x|². (3)
From (2) and (3), one can observe that these two methods each utilize only a subset of the vectors, either those most correlated with or those most orthogonal to the original signal. To make use of as much information as possible, we try to solve a composite problem combining (2) and (3) together.
Specifically, we consider the composite problem

min_{‖x‖=1} ∑_{i∈I_1} |a_i^H x|² − α ∑_{i∈I_2} |a_i^H x|². (4)

We can observe that (4) has a symmetrical structure and can be interpreted as a modified version of (2) that adds the objective of (3) as a penalty term. The solution of (4) will be a more accurate estimate than those of (2) and (3) alone, since it combines the information from both. Let H_1 = ∑_{i∈I_1} b_i² and H_2 = ∑_{i∈I_2} b_i². We set the parameter in (4) as α = H_1/H_2. The true signal x_0 is roughly the solution of (2) and (3); thus, it is also the approximate solution of (4). In the next section, we will analyze the error of our proposed method.
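The construction of the composite spectral matrix can be sketched as follows. Reading H_1 and H_2 as the sums of b_i² over I_1 and I_2 is our interpretation of the definitions, and the subset sizes are illustrative rather than the paper's choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 20, 400
x0 = rng.standard_normal(n)
x0 /= np.linalg.norm(x0)
A = rng.standard_normal((m, n))
b = np.abs(A @ x0)

order = np.argsort(b)
T1, T2 = 6 * n, n
I1, I2 = order[:T1], order[-T2:]       # most orthogonal / most correlated rows

# Assumed reading: H1, H2 are the sums of b_i^2 over I_1 and I_2.
H1, H2 = np.sum(b[I1] ** 2), np.sum(b[I2] ** 2)
alpha = H1 / H2

# Composite spectral matrix of (4); the estimate is its smallest eigenvector.
M = A[I1].T @ A[I1] - alpha * (A[I2].T @ A[I2])
x_hat = np.linalg.eigh(M)[1][:, 0]
```

Subtracting α times the correlated part makes the direction of x_0 even more strongly the minimizer, which is the intuition behind combining (2) and (3).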
The algorithm is presented in Algorithm 1. The term 2αI is added to ensure the positive definiteness of M. The x_{k+1} in Step 4 is obtained by solving the following system of linear equations:

M x_{k+1} = x_k, (5)

which can be solved by the conjugate gradient method with x_{k−1} as the initializer, since M is positive definite and Hermitian. Since x_0 and e^{−iφ}x_0 are indistinguishable for phase retrieval, we evaluate the estimate using the following metric:

RMSE = min_{φ∈[0,2π)} ‖x_0 − e^{iφ} x̂‖. (6)
The other initialization methods, e.g., the spectral method, the truncated spectral method, and the (weighted) maximal correlation method, are all based on the power iteration. The spectral method finds the leading eigenvector of the matrix ∑_{i=1}^m b_i² a_i a_i^T, which is also realized by the power method. Chen proposed the truncated spectral method to improve the performance of the spectral method, which constructs the spectral matrix via a truncation determined by the threshold parameter τ [17]; the corresponding matrix for iteration retains only the terms b_i² a_i a_i^T whose measurements pass the threshold. The numerical performance of these initialization methods will be compared later.

Algorithm 1:
The composite initialization method.
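The body of Algorithm 1 is not reproduced above; the following is a hedged sketch of the inverse power iteration it describes, with each step solving (5) by conjugate gradients via `scipy.sparse.linalg.cg`. The subset sizes, iteration count, and the extra numerical safeguard shift are our illustrative additions, not the paper's exact choices.

```python
import numpy as np
from scipy.sparse.linalg import cg

def composite_init(A, b, T1, T2, iters=50, seed=0):
    """Sketch of the composite initialization via inverse power iteration."""
    m, n = A.shape
    order = np.argsort(b)
    I1, I2 = order[:T1], order[-T2:]
    H1, H2 = np.sum(b[I1] ** 2), np.sum(b[I2] ** 2)
    alpha = H1 / H2
    # Composite spectral matrix, shifted by 2*alpha*I as in the text.
    M = A[I1].T @ A[I1] - alpha * (A[I2].T @ A[I2]) + 2 * alpha * np.eye(n)
    # Safeguard (our addition): enforce positive definiteness numerically.
    lam_min = np.linalg.eigvalsh(M)[0]
    if lam_min <= 0:
        M += (1e-6 - lam_min) * np.eye(n)
    x = np.random.default_rng(seed).standard_normal(n)
    x /= np.linalg.norm(x)
    for _ in range(iters):
        x, _ = cg(M, x, x0=x)          # solve M x_{k+1} = x_k (M is SPD)
        x /= np.linalg.norm(x)
    return x
```

Each linear solve reuses the previous iterate as the conjugate-gradient starting point, mirroring the warm start mentioned in the text, so late iterations converge in very few CG steps.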

Theoretical Analysis
In this section, we will present the error estimate of the proposed method under Gaussian assumptions in the real case.
Since the measuring vectors are assumed to be i.i.d. Gaussians, we can assume without loss of generality that x_0 = e_1 = (1, 0, · · · , 0)^T, because PA is still composed of Gaussian vectors due to the invariance of the Gaussian distribution under orthogonal transforms.
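The rotation-invariance argument is easy to check empirically: for any orthogonal P, the rows of AP^T are again i.i.d. N(0, I_n). A small sketch with arbitrary sizes:

```python
import numpy as np

rng = np.random.default_rng(4)
n, N = 5, 200_000
A = rng.standard_normal((N, n))        # rows a_i ~ N(0, I_n)

# Any orthogonal P leaves the distribution invariant: P a_i ~ N(0, I_n).
P = np.linalg.qr(rng.standard_normal((n, n)))[0]
B = A @ P.T                            # rows are P a_i

print(np.cov(B.T))                     # sample covariance is close to I_n
```

This is exactly what licenses the convention x_0 = e_1 in the analysis.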
The spectral matrix for orthogonality promoting method is the following.
Denoting d_i = (a_{i2}, a_{i3}, · · · , a_{in})^T, M can be written as follows, where the following is the case.
The matrix for maximal correlation method is the following: where the following is the case.
The constructed matrix for estimation is the following.
To proceed, we notice the following basic facts:
1. The orthogonality promoting method finds the eigenvector of the minimum eigenvalue of M.
2. In the ideal case that E is a zero matrix, M degenerates to the following. If we can also ensure that G = G_1 − αG_2 is positive definite, the eigenvector corresponding to the smallest eigenvalue is exactly the true signal.
These two facts inspire us to estimate the error by computing the eigenvector of the matrix M after adding the perturbation E = E_1 − αE_2. Thus, the outline of the theoretical analysis is as follows:
1. Estimate the perturbation ∆λ of the smallest eigenvalue caused by E.
2. With ∆λ, compute the perturbation of the corresponding eigenvector, ‖v − ṽ‖, which is exactly the error of our algorithm.
Specifically, we have the following roadmap of theoretical analysis. Section 3.1 presents bound results for each component of the spectral matrix M. Using the results in Section 3.1, the bounds of ∆λ can then be easily obtained in Section 3.2. The relationship between the perturbations of eigenvalues and eigenvectors is presented in Section 3.3, which finally yields the formal error estimate of our algorithm in Section 3.4.

Analysis of Each Component of the Spectral Matrix
In this part, we provide bounds for the variables involved in the matrix M, which are the basic ingredients for estimating the perturbations of the eigenvalue and eigenvector, namely ∆λ and ‖v − ṽ‖. In particular, the upper bound of H_1, the lower bound of H_2, and the norm of E will be analysed.
The proof of Lemma 1 is placed in Appendices A and B. As for the lower bound of H_2, we borrow a result from Wang ([13], Lemma 3).
Then, we obtain the required upper bound of H_1/H_2.

Lemma 3.
Under the set-up of Lemmas 1 and 2, we have the following bound, which holds with a probability of at least the quantity stated below, where ρ_0 = 1/√(2π), and p_1 and p_2 denote T_1/m and T_2/m, respectively.

Lower Bound of the Smallest Eigenvalue of G
Since G is a linear combination of G 1 and G 2 , the lower bound of the smallest eigenvalue of G can be estimated from the bounds of eigenvalues of G 1 and G 2 .
According to (10) and (12), we rewrite G_1 and G_2 as follows, where D_1 = [d_1, . . . , d_{T_1}]^T and D_2 = [d_{m−T_2+1}, . . . , d_m]^T. D_1 and D_2 are termed Gaussian matrices here, since their entries are all sampled from the Gaussian distribution. We notice that D_1^T D_1 and D_2^T D_2 are Wishart matrices, i.e., products of a Gaussian matrix and its adjoint [18]. Moreover, the singular values of a Gaussian matrix obey the following classical result. Theorem 1 ([19], Corollary 5.35). Let A be an N_1 × N_2 matrix whose entries are independent standard normal random variables. Then, for every t ≥ 0, with a probability of at least 1 − 2 exp(−t²/2), the following events hold simultaneously:

√N_1 − √N_2 − t ≤ s_min(A) ≤ s_max(A) ≤ √N_1 + √N_2 + t,

where s_min(A) and s_max(A) stand for the smallest and the largest singular values of A, respectively.
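Theorem 1 is easy to sanity-check numerically. With t = 3, the stated bounds hold with probability at least 1 − 2e^{−4.5} ≈ 0.978; the dimensions below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
N1, N2, t = 500, 100, 3.0
A = rng.standard_normal((N1, N2))        # Gaussian matrix as in Theorem 1
s = np.linalg.svd(A, compute_uv=False)   # singular values, descending

lower = np.sqrt(N1) - np.sqrt(N2) - t    # lower bound on s_min(A)
upper = np.sqrt(N1) + np.sqrt(N2) + t    # upper bound on s_max(A)
print(lower <= s[-1], s[0] <= upper)
```

In this regime the extreme singular values concentrate near √N_1 ∓ √N_2, so the bounds hold with room to spare.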
Applying Theorem 1 to D_1 and D_2, one can see that the following is the case.
In the theoretical analysis, we assume that T_1 is large enough that √T_1 − 1.1√(n − 1) ≥ 0. However, it should be noted that setting T_1 ≥ n suffices to output reasonable estimates, as shown in the numerical tests. Based on this assumption, one then has the following.
A necessary condition for the validity of our method is s_min(G) ≥ 0, which can be satisfied by choosing proper s_1 and s_2.

The Upper Bound of ‖E‖_2
Now, let us turn to the estimation of the norm of E. We denote each entry of E as ξ_k.
Since a_{ik} obeys the standard normal distribution, we can derive the following.
Let ψ_k = ξ_k/√(H_1/T_1 + α²H_2/T_2); then ψ_k² obeys a chi-squared distribution, which is a sub-exponential distribution. With ψ_k, we can rewrite ‖E‖_2² as follows.
The expectation of ‖E‖_2² is obviously H_1(s_1 + s_2α). The variance of ‖E‖_2² is also needed in order to obtain its upper bound. To this end, we recall the Bernstein-type inequality for sub-exponential random variables. Theorem 2. Let X_1, · · · , X_N be i.i.d. centered sub-exponential random variables with sub-exponential norm K. Then, for every t ≥ 0, we have

P( |(1/N) ∑_{i=1}^N X_i| ≥ t ) ≤ 2 exp( −cN min(t²/K², t/K) ),

where c > 0 is an absolute constant. The sub-exponential norm is defined as K = sup_{p≥1} p^{−1} (E|X|^p)^{1/p}.
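Because the constant c in Theorem 2 is unspecified, the bound itself cannot be checked numerically, but the concentration it predicts for averages of centered chi-squared (sub-exponential) variables is easy to observe; the sample sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
N, trials = 1000, 2000

# Z_i = xi_i - 1 with xi_i ~ chi^2(1): centered and sub-exponential.
Z = rng.standard_normal((trials, N)) ** 2 - 1.0
means = Z.mean(axis=1)                 # N^{-1} * sum of Z_i, per trial

# Large deviations of the sample mean are exponentially rare.
t = 0.2
print(np.mean(np.abs(means) >= t))     # empirical tail probability, tiny
```

Here the standard deviation of each sample mean is √2/√N ≈ 0.045, so deviations of size 0.2 are extremely rare, as Theorem 2 predicts.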
Define the centered sub-exponential random variables Z_i = ξ_i − 1. The sub-exponential norm of Z_i is computed in Appendix C, which shows that K ≤ 2 in our case. By replacing N with n − 1, t with 0.1, and K with its bound in Theorem 2, one can then conclude the following result. Theorem 3. Since K ≤ 2, letting t = 0.1, we have the following, where c > 0 is an absolute constant.
Gathering these results, we have the following theorem.
The above holds with a probability of at least the quantity stated below, for some absolute constant c > 0.

Estimate of ∆λ
To make an estimate of ∆λ, we first recall a classical result from matrix perturbation theory in [20]. Theorem 5. Let the matrix below be Hermitian, and set η = min |µ − ν| over all µ ∈ σ(H) and ν ∈ σ(G), where σ(H) stands for the set of eigenvalues of H. Then, the following is the case, where λ_j^↑ and λ̃_j^↑ are the j-th smallest eigenvalues of M and M̃, respectively.
To ensure the effectiveness of our algorithm, the eigenvalue near H_1 should remain the minimum eigenvalue after adding the perturbation in our case. Then, η is simply s_min(G), which is bounded by (25). Hence, ∆λ can be estimated by the following.

Computing ‖v − ṽ‖
With the upper bound of ∆λ, we can now estimate the perturbation of the eigenvector. The minimum eigenvector ṽ of M is computed by solving

(M − (λ + ∆λ)I_n) ṽ = 0, (36)

where I_n denotes the n × n identity matrix. Without loss of generality, the solution of (36) can be written in the form ṽ = (1, δv)^T. Then, we have the following.
Therefore, ‖v − ṽ‖ can be bounded as follows, where (38) is derived based on (37). The resulting expression (39) is somewhat complicated, as stated below.

Main Result
Notice that ‖v − ṽ‖ is exactly ‖x_0 − x̂‖. It is then straightforward to derive our main theorem by combining the above results.
Theorem 6 (Main result). Consider the problem of estimating an arbitrary x_0 ∈ R^n from m phaseless measurements (1) under the Gaussian assumption. Suppose the output of Algorithm 1 is x̂. If m ≥ c_1 T_1, m ≥ c_2 T_2, T_1 > n, and m ≥ T_1 + T_2, then we have the following error estimate for the composite initialization method, which holds with a probability of at least the quantity stated below, where p_i = T_i/m and s_i = (n − 1)/T_i for i = 1, 2, and c_0, c_1, c_2, ρ_0 ≥ 0 are absolute constants.
The probability P in (42) can be negative when m and n are too small, which shows a limitation of our analysis: m and n need to be large enough to render the probability meaningful. In the extreme case that p_1, p_2, s_1, and s_2 are close to 0, our error estimate (41) can be approximated by the following simpler form, which is verified by the numerical experiments later.

Numerical Results
In this section, we test the accuracy of the proposed method and compare it with other methods, including the spectral method, the truncated spectral method, the reweighted maximal correlation method, the null vector method, and the orthogonality promoting method. The sampling vectors a_i and the original signal x_0 ∈ R^100 are randomly and independently generated. To eliminate the influence of the error introduced by estimating ‖x_0‖, the original signal x_0 is normalized. We chose T_1 = n and T_2 = 0.3n in the numerical experiments for the proposed method. All the following simulation results are averaged over 80 independent Monte Carlo realizations. Figure 1 provides the RMSE calculated by (6) versus the oversampling rate m/n for the mentioned initialization methods. As expected, all methods exhibit better performance as m/n increases. In particular, the proposed initialization method outperforms the other methods. When m/n ≥ 10, roughly, the composite initialization method performs about as well as the null vector method does, and the convergence behavior coincides with (45). When m/n ≤ 10, the proposed method does not lose accuracy as dramatically as the null vector method does. The proposed algorithm provides the most accurate estimate when the oversampling rate is close to the information limit, i.e., m/n ≈ 2.
Figure 1. RMSE vs. oversampling rate for several initialization methods. n = 100, T_1 = n, and T_2 = 0.3n for the composite initialization method, while for the other algorithms, the involved parameters are selected according to the related articles.

Figure 2 illustrates the importance of the initialization method for refining the success rate of the TAF algorithm [12]. In each simulation, each initialization generates an estimate used as the initializer for the TAF algorithm. A trial is considered successful if the TAF algorithm returns a result with an RMSE of less than 10^{−5}. When m/n is large, all the presented initialization methods ensure an almost 100% success rate. However, as m/n decreases, the differences in the success rates of the various methods appear gradually. Therefore, the composite initialization method can help TAF achieve a better success rate compared with the other two methods. Table 1 reports the CPU time needed for TAF using our method as an initializer compared with two other typical initialization methods. To distinguish the performance more clearly, the length of the signal is set to a large number, n = 1000. The proposed method and the null vector method need to solve a system of linear equations at each inverse power step, which makes them more time-consuming than the power iterations of the maximal correlation method. However, the proposed method provides a more accurate initializer, which helps TAF converge faster. Hence, the overall efficiency of our algorithm is not far behind the power-type methods, as shown in Table 1.

Conclusions
This paper proposes a new initialization method that combines the advantages of two methods built on two opposite intuitions. Both theoretical analysis and numerical experiments indicate the validity of our method and its higher accuracy compared with other methods. Future work will focus on extending our initialization algorithm to more generalized PR problems, e.g., the quadratic sensing and matrix reconstruction problems [21].

Appendix A

Since {b_i}_{i=1}^m has been sorted in ascending order, we have the following.
By the Gaussian assumption, |a_i^T x_0|² = a_{i1}² obeys the chi-squared distribution with the probability density function stated below, where ρ_0 is the normalization constant. The cumulative distribution function is the following.
Let p = T_1/m and τ* = F^{−1}(p). The estimation of H_1 hinges on the value of τ*. However, ρ(z) is not explicitly integrable; hence, we cannot derive an explicit expression of τ* in terms of p. To this end, we first calculate the following bounds on τ* using some basic inequalities, as the following two lemmas show. The detailed proofs are placed in Appendix B.
Then, we turn back to the estimation of H_1. Notice that the b_i² can be approximately regarded as random variables sampled from a bounded chi-squared distribution. Therefore, we first provide an upper bound for the largest element of {b_i²}_{i=1}^{T_1}, namely, b_{T_1}². We prove the following result.
where the τ_i independently obey the chi-squared distribution with the probability density function (A1) and the characteristic function stated below.
The event b_{T_1}² ≥ 9τ*/8 means that at least m − T_1 + 1 measurements are larger than 9τ*/8. Therefore, we have the following.
The expectation of r_i is then given below. Since r_i follows a bounded distribution, the upper bound of the sum of the r_i can be provided by the well-known Hoeffding's inequality.
Lemma A3 (Hoeffding's inequality). Let X_1, · · · , X_n be i.i.d. random variables bounded in the interval [a, b]. Then, for every t ≥ 0, the following is the case:

P( |(1/n) ∑_{i=1}^n (X_i − E X_i)| ≥ t ) ≤ 2 exp( −2nt²/(b − a)² ).
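In its standard two-sided form for the sample mean, Hoeffding's inequality reads P(|n^{−1}∑(X_i − EX_i)| ≥ t) ≤ 2 exp(−2nt²/(b − a)²), which the following simulation respects; the distribution and sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
n, trials, t = 100, 50_000, 0.1
a, b = 0.0, 1.0                        # X_i bounded in [a, b], mean 0.5

X = rng.uniform(a, b, size=(trials, n))
dev = np.abs(X.mean(axis=1) - 0.5)     # |sample mean - expectation|

empirical = np.mean(dev >= t)          # observed tail probability
hoeffding = 2 * np.exp(-2 * n * t ** 2 / (b - a) ** 2)
print(empirical, hoeffding)
```

The empirical tail is far below the Hoeffding bound, which is distribution-free and therefore conservative for any particular bounded law.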
Then, it can be proved that the following is the case.
Define F̃(τ) = ∫_0^τ ρ̃(x) dx. Then, it is obvious that the following is the case.
Therefore, we have the following.