Next Article in Journal
Study on Supermode Control of External Cavity VCSEL Array with Parallel-Coupled Model
Next Article in Special Issue
Quantum State Tomography in Nonequilibrium Environments
Previous Article in Journal
Highly Dispersive Optical Solitons in Absence of Self-Phase Modulation by Laplace-Adomian Decomposition
Previous Article in Special Issue
Quantum Speed Limit for a Moving Qubit inside a Leaky Cavity
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Fast Quantum State Reconstruction via Accelerated Non-Convex Programming

1
Computer Science Department, Rice University, Houston, TX 77005, USA
2
IBM Research, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA
3
Information Sciences Institute, University of Southern California, Marina del Rey, CA 90292, USA
4
IBM Quantum, IBM T.J. Watson Research Center, Yorktown Heights, NY 10598, USA
*
Author to whom correspondence should be addressed.
Photonics 2023, 10(2), 116; https://doi.org/10.3390/photonics10020116
Submission received: 30 December 2022 / Revised: 17 January 2023 / Accepted: 19 January 2023 / Published: 22 January 2023
(This article belongs to the Special Issue Photonic State Tomography: Methods and Applications)

Abstract

:
We propose a new quantum state reconstruction method that combines ideas from compressed sensing, non-convex optimization, and acceleration methods. The algorithm, called Momentum-Inspired Factored Gradient Descent (MiFGD), extends the applicability of quantum tomography for larger systems. Despite being a non-convex method, MiFGD converges provably close to the true density matrix at an accelerated linear rate asymptotically in the absence of experimental and statistical noise, under common assumptions. With this manuscript, we present the method, prove its convergence property and provide the Frobenius norm bound guarantees with respect to the true density matrix. From a practical point of view, we benchmark the algorithm performance with respect to other existing methods, in both synthetic and real (noisy) experiments, performed on the IBM’s quantum processing unit. We find that the proposed algorithm performs orders of magnitude faster than the state-of-the-art approaches, with similar or better accuracy. In both synthetic and real experiments, we observed accurate and robust reconstruction, despite the presence of experimental and statistical noise in the tomographic data. Finally, we provide a ready-to-use code for state tomography of multi-qubit systems.

1. Introduction

Quantum tomography is one of the main procedures to identify the nature of imperfections and deviations in quantum processing unit (QPU) implementations [1,2]. Generally, quantum tomography is composed of two main parts: (i) measuring the quantum system, and (ii) analyzing the measurement data to obtain an estimate of the density matrix (in the case of state tomography [1]), or of the quantum process (in the case of process tomography [3]). In this manuscript, we focus on the state tomography.
Quantum tomography is generally considered as a non-scalable protocol [4], as the number of free parameters that define quantum states and processes scale exponentially with the number of subsystems. In particular, quantum state tomography (QST) suffers from two bottlenecks. The first concerns about the large amount of data one needs to collect to perform tomography; the second concerns about numerically searching in an exponentially increasing space for a density matrix that is consistent with the data.
There have been various approaches to improve the scalability of QST, in terms of the amount of data required [5,6,7]. To address the data collection bottleneck, prior information about the unknown quantum state is often assumed. For example, in compressed sensing QST [4,8], the density matrix of the quantum system is assumed to be low-rank. In neural network QST [9,10,11], the wave-functions are often assumed to be real and positive, confining the landscape of quantum states (To handle more complex wave-functions, neural network approaches require a proper re-parameterization of the Restricted Boltzmann machines [9]). The prior information considered in these cases is that they are characterized by structured quantum states [9] (Such assumptions are often the reason behind accurate solutions of neural network QST in high-dimensional spaces.) Ref. [9] considers also the case of a completely unstructured case and test the limitation of this technique, which does not perform as expected, due to lack of any exploitable structure). Similarly, in matrix-product-state tomography [12,13], one assumes that the quantum state can be represented with low bond-dimension matrix-product state.
To address the computational bottleneck, several works introduce sophisticated numerical methods to improve the efficiency of QST. In particular, variants of gradient-based convex solvers—e.g., [14,15,16,17]—have been tested on synthetic scenarios [17]. The problem is that, achieving these results often requires utilizing special-purpose hardwares, such as Graphics Processing Units (GPUs), on top of carefully designing a proper distributed system [18]. Thus, going beyond current capabilities requires novel methods that can efficiently search in the space of density matrices under more realistic scenarios. Importantly, such numerical methods should come with rigorous guarantees on their convergence and performance.
The setup we consider here is the estimation of an n-qubit state, under the prior assumption that the state is close to a pure state, and thus its density matrix is of low-rank. This assumption is justified by the state-of-the-art experiments, where the aim is to manipulate the pure states with unitary maps. From a theoretical perspective, the low-rank assumption implies that we can use compressed sensing techniques [19], which allow the recovery of the density matrix from relatively fewer measurement data [20,21].
Indeed, compressed sensing QST is widely used for estimating highly-pure quantum states; e.g., [4,22,23,24]. However, compressed sensing QST usually relies on convex optimization for the estimation [8], which limits the applicability to relatively small system sizes [4] (In particular, convex solvers over low-rank structures utilize the nuclear norm over 2 n × 2 n matrices. This assumes calculating all 2 n eigenvalues of such matrices per iteration, which has cubic complexity O ( ( 2 n ) 3 ) ). On the other hand, non-convex optimization approaches could preform much faster than their convex counterparts [25]. Although non-convex optimization typically lacks convergence guarantees, it was recently shown that one can formulate the compressed sensing QST as a non-convex problem, and solve it with rigorous convergence guarantees (under certain but generic conditions), allowing the state estimation of larger system sizes [26].
Following the non-convex path, we introduce a new algorithm to the toolbox of QST—the Momentum-Inspired Factored Gradient Descent (MiFGD). Our approach combines the ideas from compressed sensing, non-convex optimization, and acceleration/momentum techniques to scale QST beyond the current capabilities. MiFGD includes acceleration motions per iteration, meaning that it uses two previous iterates to update the next estimate; see Section 2 for details. The intuition is that if the k-th and ( k 1 ) -th estimates were pointing to the correct direction, then both information should be useful to determine the ( k + 1 ) -th estimate. Of course such approach requires an additional estimate to be stored—yet, we show both theoretically and experimentally that momentum results in faster estimation. We emphasize that the analysis becomes non-trivially challenging due to the inclusion of two previous iterates.
The contributions of the paper are summarized as follows:
(i)
We prove that the non-convex MiFGD algorithm asymptotically enjoys an accelerated linear convergence rate in terms of the iterate distance, in the noiseless measurement data case and under common assumptions.
(ii)
We provide QST results using the real measurement data from IBM’s quantum computers up to 8-qubits, contributing to recent efforts on testing QST algorithms in real quantum data [22]. Our synthetic examples scale up to 12-qubits effortlessly, leaving the space for an efficient and hardware-aware implementation open for future work.
(iii)
We show through extensive empirical evaluations that MiFGD allows faster estimation of quantum states compared to the state-of-the-art convex and non-convex algorithms, including recent deep learning approaches [9,10,11,27], even in the presence of statistical noise in the measurement data.
(iv)
We further increase the efficiency of MiFGD by extending its implementation to utilize parallel execution over the shared and distributed memory systems. We experimentally showcase the scalability of our approach, which is particularly critical for the estimation of larger quantum system.
(v)
We provide the implementation of our approach at https://github.com/gidiko/MiFGD (accessed on 18 January 2023), which is compatible with the open-source software Qiskit [28].
The rest of this manuscript is organized as follows. In Section 2, we set up the problem in detail, and present our proposed method: MiFGD. Then, we detail the experimental set up in Section 3, followed by the results in Section 4. Finally, we discuss related and future works with concluding remarks in Section 5.

2. Methods

2.1. Problem Setup

We consider the estimation of a low-rank density matrix ρ C d × d on an n-qubit Hilbert space with dimension d = 2 n , through the following 2 -norm reconstruction objective:
min ρ C d × d f ( ρ ) : = 1 2 A ( ρ ) y 2 2 subject to ρ 0 , rank ( ρ ) r .
Here, y R m is the measurement data (observations) (Specific description on how y is generated and what it represents will follow), and A ( · ) : C d × d R m is the linear sensing map, where m d 2 . The sensing map relates the density matrix ρ to (expected, noiseless) observations through the Born rule: A ( ρ ) i = Tr ( A i ρ ) , where { A i } i = 1 m C d × d are matrices closely related to the measured observable or the POVM elements of appropriate dimensions.
The objective function in Equation (1) has two constraints: the positive semi-definite constraint: ρ 0 , and the low-rank constraint: rank ( ρ ) r . The former is a convex constraint, whereas the latter is a non-convex one, rendering Equation (1) to be a non-convex optimization problem (Convex optimization problem requires both the objective function as well as the constraints to be convex). Following compressed sensing QST results [8], the unit trace constraint Tr ( ρ ) = 1 (which should be satisfied by any density matrix by definition) can be disregarded, without affecting the precision of the final estimate.
A pivotal assumption to apply compressed sensing results is that the linear sensing map A should satisfy the restricted isometry property, which we recall below.
Definition 1 
(Restricted Isometry Property (RIP) [29]). A linear operator A : C d × d R m satisfies the RIP on rank-r matrices with the RIP constant δ r ( 0 , 1 ) , if the following holds with high probability for any rank-r matrix X C d × d :
( 1 δ r ) · X F 2 A ( X ) 2 2 ( 1 + δ r ) · X F 2 .
Such maps (almost) preserve the Frobenius norm of low-rank matrices, and, as an extension, of low-rank Hermitian matrices. The intuition behind RIP is that the operator A ( · ) behaves almost as a bijection between the subspaces C d × d and R m for low-rank matrices.
Following recent works [26], instead of solving Equation (1), we propose to solve a factorized version of it:
min U C d × r 1 2 A ( U U ) y 2 2 ,
where U C r × d denotes the adjoint of U, and ρ is re-parametrized with ρ = U U . The motivation for this reformulation is two-folds. First, by representing the d × d dimensional density matrix ρ with (the outer product of) its d × r dimensional low-rank factors U, the search space for the density matrix (that is consistent with the measurement data) significantly reduces, given that r d . Second, via the reformulation ρ = U U , both the PSD constraint and the low-rank constraint are automatically satisfied, transforming the constrained optimization problem in Equation (1) to the unconstrained optimization problem in Equation (3). An important implication is that, to solve Equation (3), one can bypass the projection step onto the PSD and low-rank subspace, which requires a sigular value decomposition (SVD) of the estimate of the density matrix ρ on every iteration. This is prohibitively expensive when the dimension d = 2 n is large, which is the case for even moderate number of qubits n. As such, working in the factored space was shown to improve time and space complexities [26,30,31,32,33,34].
A common approach to solve a factored objective as in Equation (3) is to use gradient descent [35] on the parameter U, with iterates as follows (We assume the case where f ( · ) = f ( · ) . If this does not hold, the theory still holds by carrying around f ( · ) + f ( · ) instead of just f ( · ) , after proper scaling):
U k + 1 = U k η f ( U k U k ) · U k
= U k η A A ( U k U k ) y · U k .
Here, U k C d × r is the k-th iterate, and the operator A : R m C d × d is the adjoint of A , defined as A ( x ) = i = 1 m x i A i , for x R m . The hyperparameter η > 0 is the step size. This algorithm has been studied in [25,32,34,36,37,38]. We will refer to the above iteration as the factored gradient descent (FGD) algorithm, as in [30]. In what follows, we will introduce our proposed method, the MiFGD algorithm: momentum-inspired factored gradient descent.

2.2. The MiFGD Algorithm

The MiFGD algorithm is a two-step variant of FGD, which iterates as follows:
U k + 1 = Z k η A A ( Z k Z k ) y · Z k ,
Z k + 1 = U k + 1 + μ U k + 1 U k .
Here, Z k C d × r is a rectangular matrix (with the same dimension as U k ) that accumulates the “momentum” of the iterates U k [39]. μ is the momentum parameter that weighs the amount of mixture of the previous estimate U k and the current U k + 1 to generate Z k + 1 . The above iteration is an adaptation of Nesterov’s accelerated first-order method for convex problems [40]. We borrow this momentum formulation, and study how the choice of the momentum parameter μ affects the overall performance in non-convex problem formulations, such as Equation (3). We note that the theory and algorithmic configurations in [40] do not generalize to non-convex problems, which is one of the contributions of this work. Albeit being a non-convex problem, we show that MiFGD asymptotically converges at an accelerated linear rate around a neighborhood of the optimal value, akin to convex optimization results [40].
An important observation is that the factorization ρ = U U is not unique. For instance, suppose that U is an optimal solution for Equation (3); then, for any rotation matrix R C r × r satisfying R R = I , the matrix U ^ = U R is also optimal for Equation (3) (To see this, observe that ρ = U U = U I U = U R R U = U ^ U ^ ). To resolve this ambiguity, we use the distance between a pair of matrices as the minimum distance min R O U U R F up to rotations, where O = { R C r × r | R R = I } . In words, we want to track how close an estimate U is to U , up to the minimizing rotation matrix.
Algorithm 1 contains the details of the MiFGD. As Problem (3) is non-convex, the initialization plays an important role in achieving global convergence. The initial point U 0 is either randomly initialized [36,41,42], or set according to Lemma 4 in [26]:
ρ 0 = U 0 U 0 = Π C 1 1 + δ 2 r · f ( 0 ) = 1 1 + δ 2 r Π C ( i = 1 m y i A i ) ,
where Π C ( · ) is the projection onto the set of PSD matrices, δ 2 r ( 0 , 1 ) is the RIP constant from Definition 1, and f ( 0 ) is the gradient of f ( · ) evaluated at all-zero matrix. Since computing the RIP constant is NP-hard, in practice we compute U 0 through ρ 0 = 1 L ^ Π C i = 1 m y i A i , where L ^ ( 1 , 11 10 ] ; see Theorem 1 below for details.
Algorithm 1 Momentum-Inspired Factored Gradient Descent (MiFGD).
Input: A (sensing map), y (measurement data), r (rank), and μ (momentum parameter).
Set U 0 randomly or as in Equation (8).
Set Z 0 = U 0 .
Set η as in Equation (9).
for  k = 0 , 1 , 2 , do
    U k + 1 = Z k η A A ( Z k Z k ) y · Z k
    Z k + 1 = U k + 1 + μ U k + 1 U k
end for 
Output: ρ = U k + 1 U k + 1
Compared to randomly selecting U 0 , the initialization scheme in Equation (8) involves a gradient and a top-r eigenvalue computations. Yet, Equation (8) provides a more informed initial point, as it is based on the data { y i , A i } i = 1 m , which could lead to convergence in fewer iterations in practice, and satisfies the initialization condition of Theorem 1 for small enough κ (Based on our experiments, in practice, both initializations are applicable and useful).
For the step size η in Algorithm 1, it is set to the following based on our theoretical analysis (c.f., Lemma A6):
η = 1 4 ( 1 + δ 2 r ) Z 0 Z 0 2 + A A ( Z 0 Z 0 ) y 2 ,
where Z 0 = U 0 . Similarly to the above, in practice we replace the RIP constant δ 2 r with L ^ . The step size η remains constant at every iteration, and requires only two top-eigenvalue computations to obtain the spectral norms Z 0 Z 0 2 and A A ( Z 0 Z 0 y 2 . These computations can be efficiently implemented by any off-the-shelf eigenvalue solver, such as the Power Method or the Lanczos method [43].

2.3. Theoretical Guarantees of the MiFGD Algorithm

We now present the formal convergence theorem, where under certain conditions, MiFGD asymptotically achieves an accelerated linear rate.
Theorem 1 
(Accelerated asymptotic convergence rate). Assume that A ( · ) satisfies the RIP in Definition 1 with the constant δ 2 r 1 10 . Initialize U 0 = U 1 such that
min R O U 0 U R F = min R O U 1 U R F σ r ( ρ ) 10 3 κ τ ( ρ ) ,
where κ : = 1 + δ 2 r 1 δ 2 r is the (inverse) condition number of A ( · ) , τ ( ρ ) : = σ 1 ( ρ ) σ r ( ρ ) is the condition number of ρ with rank ( ρ ) = r , and σ i ( ρ ) is the i-th singular value of ρ. Set the step size η such that
1 1 + δ 2 r 1 δ 2 r ( 2 + 1 ) 1 + δ 2 r 4 · 10 4 σ r ( ρ ) ( 1 δ 2 r ) η 10 4 σ r ( ρ ) ( 1 δ 2 r ) ,
and the momentum parameter μ = ε μ 2 · 10 3 r τ ( ρ ) κ , for user-defined ε μ ( 0 , 1 ] . Then, for the (noiseless) measurement data y = A ( ρ ) with rank ( ρ ) = r , the output of the MiFGD in Algorithm 1 satisfies the following: for any ϵ > 0 , there exist constants C ϵ and C ˜ ϵ such that, for all k,
min R O U k + 1 U R F 2 + min R O U k U R F 2 1 / 2 C ϵ 1 1 δ 2 r 1 + δ 2 r + ϵ k + 1 min R O U 0 U R F + ξ · μ · σ 1 ( ρ ) 1 / 2 · r · C ˜ ϵ C ϵ 1 1 δ 2 r 1 + δ 2 r + ϵ k + 1 min R O U 0 U R F + O ( μ ) .
where ξ = 1 4 η σ r ( ρ ) ( 1 δ 2 r ) 10 . That is, the MiFGD asymptotically enjoys an accelerated linear convergence rate in iterate distances up to a constant proportional to the momentum parameter μ.
Theorem 1 can be interpreted as follows. The right hand side of Equation (10) depends on the initial distance min R O U 0 U R F akin to convex optimization results, where asymptotically O 1 1 δ 2 r 1 + δ 2 r appear as the contraction factor. In contrast, the contraction factor of vanilla FGD [26] is of the order O 1 1 δ 2 r 1 + δ 2 r .
The main assumption is that the sensing map A ( · ) satisfies RIP. This assumption implies that the condition number of f depends on the RIP constants δ 2 r such that L μ 1 + δ 2 r 1 δ 2 r , since the eigenvalues of the Hessian of f, i.e., A A ( · ) , lie between 1 δ 2 r and 1 + δ 2 r (when restricted to low-rank matrices) (In this sense, the RIP assumption plays the similar role to assuming f is μ -strongly convex and L-smooth (when restricted to low-rank matrices)). Such assumption has become standard in the optimization and the signal processing community [19,25,29]. Hence, MiFGD has better dependency on the (inverse) condition number of f compared to FGD. Such improvement of the dependency on the condition number is referred to as “acceleration” in the convex optimization literature [44,45]. Thus, assuming that the initial points U 0 and U 1 are close enough to the optimum as stated in the theorem, MiFGD decreases its distance to U at an accelerated linear rate, up to an “error” level that depends on the momentum parameter μ , which is bounded by 1 2 · 10 3 r τ ( ρ ) κ .
Theorem 1 requires a strong assumption on the momentum parameter μ , which depends on quantities that might not be known a priori for general problems. However, we note that for the special case of QST, we know these quantities exactly: r is the rank of density matrix—thus, for pure states r = 1 ; τ ( ρ ) is the (rank-restricted) condition number of the density matrix ρ —for pure states, τ ( ρ ) = σ 1 ( ρ ) σ r ( ρ ) = σ 1 ( ρ ) σ 1 ( ρ ) = 1 ; and finally, κ is the condition number of the sensing map, and satisfies: κ 11 9 given the constraint δ 2 r 1 10 . This analysis leads to a momentum value μ ϵ μ / 2211 (This is the numerical value of μ we use in experiments in Section 4). However, as we show in both real and synthetic experiments in Section 4 (and further in Appendix A), the theory is conservative; much larger values of μ lead to fast, stable, and improved performance. Finally, the bound on the condition number in Theorem 1 is not strict, and comes out of the analysis we follow; we point the reader to similar assumptions made where τ ( ρ ) is assumed to be constant: O ( 1 )  [46].
The detailed proof of Theorem 1 is provided in Appendix B. The proof differs from state of the art proofs for non-accelerated factored gradient descent: due to the inclusion of the memory term, three different terms— U k + 1 , U k , U k 1 —need to be handled simultaneously. Further, the proof differs from other recent proofs on non-convex, but non-factored, gradient descent methods, as in [47]: the distance metric over rotations min R O Z k U R F , where Z k includes estimates from two steps in history, is not amenable to simple triangle inequality bounds. As a result, a careful analysis is required, including the design of two-dimensional dynamical systems, where we characterize and bound the eigenvalues of a 2 × 2 contraction matrix.

3. Experimental Setup

3.1. ρ Density Matrices and Quantum Circuits

In our numerical and real experiments, we have considered (different subsets of) the following n-qubit pure quantum states (The content in this subsection is implemented in the states.py component of our complementary software package: https://github.com/gidiko/MiFGD (accessed on 18 January 2023)):
  • The (generalized) GHZ state:
    | GHZ ( n ) = | 0 n + | 1 n 2 , n > 2 .
  • The (generalized) GHZ-minus state:
    | GHZ ( n ) = | 0 n | 1 n 2 , n > 2 .
  • The Hadamard state:
    | Hadamard ( n ) = | 0 + | 1 2 n .
  • A random state | Random ( n ) .
We have implemented these states (on the IBM quantum simulator and/or the IBM’s QPU) using the following circuits. The GHZ state | GHZ ( n ) is generated by applying the Hadamard gate to one of the qubits, and then applying n 1 CNOT gates between this qubit (as a control) and the remaining n 1 qubits (as targets). The GHZ-minus state | GHZ ( n ) is generated by applying the X gate to one of the qubits (e.g., the first qubit) and the Hadamard gate to the remaining n 1 qubits, followed by applying n 1 CNOT gates between the first qubit (as a target) and the other n 1 qubits (as controls). Finally, we apply the Hadamard gate to all of the qubits. The Hadamard state | Hadamard ( n ) is a separable state, and is generated by applying the Hadamard gate to all of the qubits. The random state | Random ( n ) is generated by a random quantum gate selection: In particular, for a given circuit depth, we uniformly select among generic single-qubit rotation gates with 3 Euler angles, and controlled-X gates, for every step in the circuit sequence. For the rotation gates, the qubits involved are selected uniformly at random, as well as the angles from the range [ 0 , 1 ] . For the controlled-X gates, the source and target qubits are also selected uniformly at random.
We generically denote the density matrix that correspond to pure state ψ as ρ = ψ ψ . For clarity, we will drop the bra-ket notation when we refer to | GHZ ( n ) , | GHZ ( n ) , | Hadamard ( n ) and | Random ( n ) . While the density matrices of the GHZ ( n ) and GHZ ( n ) are sparse in the { 0 , 1 } n basis, the density matrix of Hadamard ( n ) state is fully-dense in this basis, and the sparsity of the density matrix that of Random ( n ) may be different form one state to another.

3.2. Measuring Quantum States

The quantum measurement model. In our experiments (both synthetic and real), we measure the qubits in the Pauli basis (This is the non-commutative analogue of the Fourier basis, for the case of sparse vectors [19,48]). A Pauli basis measurement on an n-qubit system has d = 2 n possible outcomes. The Pauli basis measurement is uniquely defined by the measurement setting. A Pauli measurement is a string of n letters α : = ( α 1 , α 2 , , α n ) such that α k { x , y , z } for all k [ n ] . Note that there are at most 3 n distinct Pauli strings. To define the Pauli basis measurement associated with a given measurement string α , we first define the the following three bases on C 2 × 2 :
B x = x , 0 : = 1 2 ( 0 + 1 ) , x , 1 : = 1 2 ( 0 1 ) , B y = y , 0 : = 1 2 ( 0 + i 1 ) , y , 1 : = 1 2 ( 0 i 1 ) , B z = z , 0 : = 0 , z , 1 : = 1 .
These are the eigenbases of the single-qubit Pauli operators, σ x , σ y , and σ z , whose 2 × 2 matrix representations are given by:
σ x = 0 1 1 0 , σ y = 0 i i 0 , and σ z = 1 0 0 1 .
Given a Pauli setting α , the Pauli basis measurement Π α is defined by the 2 n projectors:
Π α = v ( α ) v ( α ) = k = 1 n α k , k α k , k : k { 0 , 1 } k [ 1 , n ] ,
where denotes the bit string ( k 1 , k 2 , , k n ) . Since there are 3 n distinct Pauli measurement settings, there are the same number of possible Pauli basis measurements.
Technically, this set forms a positive operator-valued measure (POVM). The projectors that form Π α are the measurement outcomes (or POVM elements) and the probability to obtain an outcome | v ( α ) v ( α ) | –when the state of the system is ρ – is given by the Born rule: v ( α ) | ρ | v ( α ) = Tr | v ( α ) v ( α ) | · ρ .
The RIP and expectation values of Pauli observables. Starting with the requirements of our algorithm, the sensing map A : C d × d R m we consider is comprised of a collection of matrices { A i C d × d } i = 1 m , such that y i = Tr ( A i ρ ) . We denote the vector ( y 1 , , y m ) by y.
When no prior information about the quantum state is assumed, to ensure its (robust) recovery, one must choose a set m sensing matrices A i , so that d 2 of them are linearly independent. One example of such choice is the POVM elements of the 3 n Pauli basis measurements.
Yet, when it is known that the state-to-be-reconstructed is of low-rank, theory on low-rank recovery problems suggests that A i could just be “incoherent” enough with respect to ρ [20], so that recovery is possible from a limited set of measurements, i.e., with m d 2 . In particular, it is known [4,20,21] that if the sensing matrices correspond to random Pauli monomials, then m = O r · d · poly log d A i ’s are sufficient for a successful recovery of ρ , using convex solvers for Equation (1) (The main difference between [4,20] and [21] is that the former guarantees recovery for almost all choices of m = O r · d · poly log d random Pauli monomials, while the latter proves that there exists a universal set of m = O r · d · poly log d Pauli monomials A i that guarantees successful recovery).
A Pauli monomial P i is an operator in the set P i { 1 , σ x , σ y , σ z } n , that is, an n-fold tensor product of single-qubit Pauli operators (including the identity operator). For convenience we re-label the single-qubit Pauli operators as σ 0 : = 1 , σ 1 : = σ x , σ 2 : = σ y , and σ 3 : = σ z , so that we can also write P i = k = 1 n σ i k with i k { 0 , , 3 } for all k [ n ] . These results [4,20,21] are feasible since the Pauli-monomial-based sensing map A ( · ) obeys the RIP property, as in Definition 1 (In particular, the RIP is satisfied for the sensing mechanisms that obeys A ( ρ ) i = d m Tr ( A i * ρ ) , i = 1 , , m . Further, the case considered in [21] holds for a slightly larger set than the set of rank-r density matrices: for all ρ C d × d such that ρ * r ρ F ). For the remainder of this manuscript, we will use the term “Pauli expectation value” to denote Tr ( A i ρ ) = Tr ( P i ρ ) .
From Pauli basis measurements to Pauli expectation values. While the theory for compressed sensing was proven for Pauli expectation values, in real QPUs, experimental data is obtained from Pauli basis measurements. Therefore, to make sure we are respecting the compressed sensing requirements on the sensing map, we follow this protocol:
(i)
We sample m = O r · d · poly log d or m = measpc · d 2 Pauli monomials uniformly over { σ i } n with i { 0 , , 3 } , where measpc [ 0 , 1 ] represents the percentage of measurements out of full tomography.
(ii)
For every monomial, P i , in the generated set, we identify an experimental setting α ( i ) that corresponds to the monomial. There, qubits, for which their Pauli operator in P i is the identity operator, are measured, without loss of generality, in the σ 3 basis. For example, for n = 3 and P i = σ 0 σ 1 σ 1 , we identify the measurement setting α ( i ) = ( z , x , x ) .
(iii)
We measure the quantum state in the Pauli basis that corresponds to α ( i ) , and record the outcomes.
To connect the measurement outcomes to the expectation value of the Pauli monomial, we use the relation:
Tr ( P i ρ ) = { 0 , 1 } n ( 1 ) χ f ( ) · Tr | v ( α ( i ) ) v ( α ( i ) ) | · ρ ,
where f ( ) : { 0 , 1 } n { 0 , 1 } n is a mapping that takes a bit string and returns a new bit string ˜ (of the same size) such that ˜ k = 0 for all k’s for which i k = 0 (that is, the locations of the identity operators in P i ), and χ ˜ is the parity of the bit string ˜ .

3.3. Algorithmic Setup

In our implementation, we explore a number of control parameters, including the maximum number of iterations maxiters, the step size η , the relative error from successive state iterates reltol, the momentum parameter μ , the percentage of the complete set of measurements (i.e., over all possible Pauli monomials) measpc, and the seed. In the sequel experiments we set maxiters = 1000 , η = 10 3 , and reltol = 5 × 10 4 , unless stated differently. Regarding acceleration, μ = 0 when acceleration is muted; we experiment over the range of values μ { 1 8 , 1 4 , 1 3 , 3 4 } when investigating the acceleration effect, beyond the theoretically suggested μ . In order to explore the dependence of our approach on the number of measurements available, measpc varies over the set of { 5 % , 10 % , 15 % , 20 % , 40 % , 60 % } ; seed is used for differentiating repeating runs with all other parameters kept fixed (maxiters is num_iterations in the code; also reltol is relative_error_tolerance, measpc is complete_measurements_percentage).
Denoting ρ ^ the estimate of ρ by MiFGD, we report on outputs including:
  • The evolution with respect to the distance between ρ ^ and ρ : ρ ^ ρ F , for various μ ’s.
  • The number of iterations to reach reltol to ρ for various μ ’s.
  • The fidelity of ρ ^ , defined as Tr ρ ρ ^ (for rank-1 ρ ), as a function of the acceleration parameter μ in the default set.
In our plots, we sweep over our default set of measpc values, repeat 5 times for each individual setup, varying supplied seed, and depict their 25-, 50- and 75-percentiles.

3.4. Experimental Setup on Quantum Processing Unit (QPU)

We show empirical results on 6- and 8-qubit real data, obtained on the 20-qubit IBM QPU ibmq_boeblingen. The layout/connectivity of the device is shown in Figure 1. The 6-qubit data was from qubits [ 0 , 1 , 2 , 3 , 8 , 9 ] , and the 8-qubit data was from [ 0 , 1 , 2 , 3 , 8 , 9 , 6 , 4 ] . The T 1 coherence times are [ 39.1 , 75.7 , 66.7 , 100.0 , 120.3 , 39.2 , 70.7 , 132.3 ] μ s , and T 2 coherence times are [ 86.8 , 94.8 , 106.8 , 63.6 , 156.5 , 66.7 , 104.5 , 134.8 ] μ s . The circuit for generating 6-qubit and 8-qubit GHZ states are shown in Figure 1. The typical two qubit gate errors measured from randomized benchmarking (RB) for relevant qubits are summarized in Table 1.
The QST circuits were generated using the tomography module in qiskit-ignis (https://github.com/Qiskit/qiskit-ignis (accessed on 18 January 2023)). For complete QST of a n-qubits state 3 n circuits are needed. The result of each circuit is averaged over 8192, 4096 or 2048, for different n-qubit scenarios. To mitigate for readout errors, we prepare and measure all of the 2 n computational basis states in the computation basis to construct a calibration matrix C. C has dimension 2 n by 2 n , where each column vector corresponds to the measured outcome of a prepared basis state. In the ideal case of no readout error, C is an identity matrix. We use C to correct for the measured outcomes of the experiment by minimizing the function:
min v cal R d C v cal v meas 2 subject to i v i cal = 1 , v i cal 0 , i = 1 , , d
Here v meas and v cal are the measured and calibrated outputs, respectively. The minimization problem is then formulated as a convex optimization problem and solved by quadratic programming using the package cvxopt [49].

4. Results

4.1. MiFGD on 6- and 8-Qubit Real Quantum Data

We realize two types of quantum states on IBM QPUs, parameterized by the number of qubits n for each case: the GHZ ( n ) and the Hadamard ( n ) circuits. We collected measurements over all possible Pauli settings by repeating the experiment for each setting a number of times: these are the number of shots for each setting. The (circuit, number of shots) measurement configurations from IBM Quantum devices are summarized in Table 2.
In Appendix A, we provide target error list plots for the evolution of ρ ^ ρ F 2 for reconstructing all the settings in Table 2, both for real data and for simulated scenarios. Further, we provide plots that relate the effect of momentum acceleration on the final fidelity observed for these cases. For clarity, in Figure 2, we summarize the efficiency of momentum acceleration, by showing the reconstruction error only for the following settings: maxiters = 1000 , η = 10 3 , reltol = 5 × 10 4 , and measpc = 20 % . In the plots, μ = 0 corresponds to the FGD algorithm in [30], μ corresponds to the value obtained through our theory, while we use μ 1 8 , 1 4 , 1 3 , 3 4 to study the acceleration effect. For μ , per our theory, we follow the rule μ ε μ / 2211 for ε μ ( 0 , 1 ] ; see also Section 2 for details (For this application, σ r ( ρ ) = 1 , τ ( ρ ) = 1 , and r = 1 by construction; we also approximated κ = 1.223 , which, for user-defined ε μ = 1 , results in μ = 4.5 × 10 4 . Note that smaller ε μ values result into a smaller radius of the convergence region; however, more pessimistic ε μ values result into small μ , with no practical effect in accelerating the algorithm). Note that, in most of the cases, the curve corresponding to μ = 0 is hidden behind the curve corresponding to μ μ . We run each QST experiment for 5 times for random initializations. We record the evolution of the ρ ^ ρ F 2 error at each step, and stop when the relative error of successive iterates gets smaller than reltol or the number of iterations exceeds maxiters (whichever happens first). To implement measpc = 20 % , we follow the description given in Equation (11) with m = measpc · d 2 .
To highlight the level of noise existing in real quantum data, in Figure 3, we repeat the same setting using the QASM simulator in qiskit-aer. This is a parallel, high performance quantum circuit simulator written in C++ that can support a variety of realistic circuit level noise models.
Figure 2 summarizes the performance of our proposal on different ρ , and for different μ values on real IBM QPU data. All plots show the evolution of ρ ^ ρ F across iterations, featuring a steep dive to convergence for the largest value of μ we tested: we report that we also tested μ = 0 , which shows only slight worse performances than μ . Figure 2 highlights the universality of our approach: its performance is oblivious to the quantum state reconstructed, as long as it satisfies purity or it is close to a pure state. Our method does not require any additional structure assumptions in the quantum state.
To highlight the effect of real noise on the performance of MiFGD, we further plot its performance on the same settings but using measurements coming from an idealized quantum simulator. Figure 3 considers the exact same settings as in Figure 2. It is obvious that MiFGD can achieve better reconstruction performance when data are less erroneous. This also highlights that, in real noisy scenarios, the radius of the convergence region of MiFGD around ρ is controlled mostly by the the noise level, rather than by the inclusion of momentum acceleration.
Finally, in Figure 4, we depict the fidelity of ρ ^ achieved using MiFGD, defined as Tr ρ ρ ^ , versus various μ values and for different circuits ( ρ ) . Shaded area denotes standard deviation around the mean over repeated runs in all cases. The plots show the significant gap in performance when using real quantum data versus using synthetic simulated data within a controlled environment.

4.2. Performance Comparison with Full Tomography Methods in Qiskit

We compare the MiFGD with publicly available implementations for QST reconstruction. Two common techniques for QST, included in the qiskit-ignis distribution [28], are: (i) the CVXPY fitter method, that uses the CVXPY convex optimization package [50,51]; and (ii) the lstsq method, that uses least-squares fitting [52]. Both methods solve the full tomography problem (In [8], it was sown that the minimization program (13) yields a robust estimation of low-rank states in the compressed sensing. Thus, one can use CVXPY fitter method to solve Equation (13) with m d 2 Pauli expectation value to obtain a robust reconstruction of ρ ) according to the following expression:
min ρ C d × d f ( ρ ) : = 1 2 A ( ρ ) y 2 2 subject to ρ 0 , Tr ( ρ ) = 1 .
We note that MiFGD is not restricted to “tall” U scenarios to encode PSD and rank constraints: even without rank constraints, one could still exploit the matrix decomposition ρ = U U to avoid the PSD projection, ρ 0 , where U C d × d . For the lstsq fitter method, the putative estimate ρ ^ is rescaled using the method proposed in [52]. For CVXPY, the convex constraint makes the optimization problem a semidefinite programming (SDP) instance. By default, CVXPY calls the SCS solver that can handle all problems (including SDPs) [53,54]. Further comparison results with matrix factorization techniques from the machine learning community is provided in the Appendix for n = 12 .
The settings we consider for full tomography are the following: GHZ ( n ) , Hadamard ( n ) and Random ( n ) quantum states (for n = 3 , , 8 ). We focus on fidelity of reconstruction and computation timings performance between CVXPY , lstsq and MiFGD . We use 100% of the measurements. We experimented with states simulated in QASM and measured taking 2048 shots. For MiFGD, we set η = 0.001 , μ = 3 4 , and stopping criterion/tolerance reltol = 10 5 . All experiments are run on a Macbook Pro with 2.3 GHz Quad-Core Intel Core i7CPU and 32GB RAM.
The results are shown in Figure 5; higher-dimensional cases are provided in Table 3. Some notable remarks: (i) For small-scale scenarios ( n = 3 , 4 ), CVXPY and lstsq attain almost perfect fidelity, while being comparable or faster than MiFGD. (ii) The difference in performance becomes apparent from n = 6 and on: while MiFGD attains 98% fidelity in <5 s, CVXPY and lstsq require up to hundreds of seconds to find a good solution. (iii) Finally, while MiFGD gets to high-fidelity solutions in seconds for n = 7 , 8 , CVXPY and lstsq methods could not finish tomography as their memory usage exceeded the system’s available memory.
It is noteworthy that the reported fidelities for MiFGD are the fidelities at the last iteration, before the stopping criterion is activated, or the maximum number of iterations is exceeded. However, the reported fidelity is not necessarily the best one during the whole execution: for all cases, we observe that MiFGD finds intermediate solutions with fidelity >99%. Though, it is not realistic to assume that the iteration with the best fidelity is known a priori, and this is the reason we report only the final iteration fidelity.

4.3. Performance Comparison of MiFGD with Neural-Network Quantum State Tomography

We compare the performance of MiFGD with neural network approaches. Per [9,10,11,27], we model a quantum state with a two-layer Restricted Boltzmann Machine (RBM). RBMs are stochastic neural networks, where each layer contains a number of binary stochastic variables: the size of the visible layer corresponds to the number of input qubits, while the size of the hidden layer is a hyperparameter controlling the representation error. We experiment with three types of RBMs for reconstructing either the positive-real wave function, the complex wave function, or the density matrix of the quantum state. In the first two cases the state is assumed pure while in the last, general mixed quantum states can be represented. We leverage the implementation in QuCumber [10], PositiveRealWaveFunction (PRWF), ComplexWaveFunction (CWF), and DensityMatrix (DM), respectively.
We reconstruct GHZ ( n ) , Hadamard ( n ) and Random ( n ) quantum states (for n = 3 , , 8 ), by training PRWF, CWF, and DM neural networks (We utilize GPU (NVidia GeForce GTX 1080 TI,11GB RAM) for faster training of the neural networks) with measurements collected by the QASM Simulator.
For our setting, we consider measpc = 50% and shots = 2048. The set of measurements is presented to the RBM implementation, along with the target positive-real wave function (for PRWF), complex wavefunction (for CWF) or the target density matrix (for DM) in a suitable format for training. We train Hadamard and Random states with 20 epochs, and GHZ state with 100 epochs (We experimented higher number of epochs (up to 500) for all cases, but after the reported number of epochs, Qucumber methods did not improve, if not worsened). We set the number of hidden variables (and also of additional auxilliary variables for DM) to be equal to the number of input variables n and we use 100 data points for both the positive and the negative phase of the gradient (as per the recommendation for the defaults). We choose k = 10 contrastive divergence steps and fixed the learning rate to 10 (per hyperparameter tuning). Lastly, we limit the fitting time of Qucumber methods (excluding data preparation time) to be three hours. To compare to the RBM results, we run MiFGD with η = 0.001 , μ = 3 4 , reltol = 10 5 and using measpc = 50%, keeping previously chosen values for all other hyperparameters.
We report the fidelity of the reconstruction as a function of elapsed training time for n = 3 , 4 in Figure 6 for PRWF, CWF, and DM. We observe that for all cases, Qucumber methods are orders of magnitude slower than MiFGD. E.g., for n = 8 , for all three states, CWF and DM did not finish a single epoch in 3 h, while MiFG achieves high fidelity in less than 30 s. For the Hadamard ( n ) and Random ( n ) , reaching reasonable fidelities is significantly slower for both CWF and DM, while PRWF hardly improves its performance throughout the training. For the GHZ case, CWF and DM also shows non-monotonic behaviors: even after a few thousands of seconds, fidelities have not “stabilized”, while PRWF stabilizes in very low fidelities. In comparison MiFGD is several orders of magnitude faster than both CWF and DM and fidelity smoothly increases to comparable or higher values. Further, in Table 4, we report final fidelities (within the 3 h time window), and reported times.

4.4. The Effect of Parallelization

We study the effect of parallelization in running MiFGD. We parallelize the iteration step across a number of processes, that can be either distributed and network connected, or sharing memory in a multicore environment. Our approach is based on Message Passing Interface (MPI) specification [55], which is the lingua franca for interprocess communication in high performance parallel and supercomputing applications. A MPI implementation provides facilities for launching processes organized in a virtual topology and highly tuned primitives for point-to-point and collective communication between them.
We assign to each process a subset of the measurement labels consumed by the parallel computation. At each step, a process first computes the local gradient-based corrections due only to its assigned Pauli monomials and corresponding measurements. These local gradient-based corrections will then (i) need to be communicated, so that they can be added, and (ii) finally, their sum will be shared across all processes to produce a global update for U for next step. We accomplish this structure in MPI using MPI_Allreduce collective communication primitive with MPI_SUM as its reduction operator: the underlying implementation will ensure minimum communication complexity for the operation (e.g., log p steps for p processes organized in a communication ring) and thus maximum performance (This communication pattern can alternatively be realized in two stages, as naturally suggested in its structure: (i) first invoke MPI’s MPI_Reduce primitive, with MPI_SUM as its reduction operator, which results in the element-wise accumulation of local corrections (vector sum) at a single, designated root process, and (ii) finally, send a “copy” of this sum from root process to each process participating in the parallel computation (broadcasting); MPI_Bcast primitive can be utilized for this latter stage. However, MPI_Allreduce is typically faster, since its actual implementation is not constrained by the requirement to have the sum available at a specific, root process, at an intermediate time point - as the two-stage approach implies). We leverage mpi4py [56] bindings to issue MPI calls in our parallel Python code.
We conducted our parallel experiments on a server equipped with 4 × E7-4850 v2 CPUs @ 2.30GHz (48/96 physical/virtual cores), 256 GB RAM, using shared memory multiprocessing over multiple cores. We experimented with states simulated in QASM and measured taking 8192 shots; parallel MiFGD runs with default parameters and using all measurements (measpc = 100%). Reported times are wall-clock computation time. These exclude initialization time for all processes to load Pauli monomials and measurements: we here target parallelizing computation proper in MiFGD.
In our first round of experiments, we investigate the scalability of our approach. We vary the number p of parallel processes ( p = 1 , 2 , 4 , 8 , 16 , 32 , 48 , 64 , 80 , 96 ), collect timings for reconstructing GHZ ( 4 ) , Random ( 6 ) and GHZ ( 8 ) states and report speedups T p / T 1 we gain from MiFGD in Figure 7 Left. We observe that the benefits of parallelization are pronounced for bigger problems (here: n = 8 qubits) and maximum scalability results when we use all physical cores (48 in our platform).
Further, we move to larger problems ( n = 10 qubits: reporting on reconstructing Hadamard ( 10 ) state) and focus on the effect parallelization to achieving a given level of fidelity in reconstruction. In Figure 7 Middle, we illustrate the fidelity as a function of the time spent in the iteration loop of MiFGD for ( p = 8 , 16 , 32 , 48 , 64 ): we observe the smooth path to convergence in all p counts which again minimizes compute time for p = 48 . Note that in this case we use measpc = 10% and μ = 1 4 .
Finally, in Figure 7 Right, we fix the number of processes to p = 48 , in order to minimize compute time and increase the percentage of used measurements to 20 % of the total available for Hadamard ( 10 ) . We vary the acceleration parameter, μ = 0 (no acceleration) to μ = 1 4 and confirm that we indeed get faster convergence times in the latter case while the fidelity value remains the same (i.e., coinciding upper plateau value in the plots). We can also compare with the previous fidelity versus time plot, where the same μ but half the measurements are consumed: more measurements translate to faster convergence times (plateau is reached roughly 25 % faster; compare the green line with the yellow line in the previous plot).

5. Conclusions and Discussions

We have introduced the MiFGD algorithm for the factorized form of the low-rank QST problems. We proved that, under certain assumptions on the problem parameters, MiFGD converges linearly to a neighborhood of the optimal solution, whose size depends on the momentum parameter μ , while using acceleration motions in a non-convex setting. We demonstrate empirically, using both simulated and real data, that MiFGD outperforms non-accelerated methods on both the original problem domain and the factorized space, contributing to recent efforts on testing QST algorithms in real quantum data [22]. These results expand on existing work in the literature illustrating the promise of factorized methods for certain low-rank matrix problems. Finally, we provide a publicly available implementation of our approach, compatible to the open-source software Qiskit [28], where we further exploit parallel computations in MiFGD by extending its implementation to enable efficient, parallel execution over shared and distributed memory systems.
Despite our theory does not apply to the Pauli basis measurement directly (i.e., using randomly selected Pauli bases Π α , does not lead to the 2 -norm RIP), using the data from random Pauli basis measurements directly could provide excellent tomographic reconstruction with MiFGD. Preliminary results suggest that only O ( r · log d ) random Pauli bases should be taken for a reconstruction, with the same level of accuracy as with O ( r · d · log d ) expectation values of random Pauli matrices. We leave the analysis of our algorithm in this case for future work, along with detailed experiments.

Related Work

Matrix sensing. The problem of low-rank matrix reconstruction from few samples was first studied within the paradigm of convex optimization, using the nuclear norm minimization [29,57,58]. The use of non-convex approaches for low-rank matrix recovery—by imposing rank-constraints—has been proposed in [59,60,61]. In all these works, the convex and non-convex algorithms involve a full, or at least a truncated, singular value decomposition (SVD) per algorithm iteration. Since SVD can be prohibitive, these methods are limited to relatively small system sizes.
Momentum acceleration methods are used regularly in the convex setting, as well as in machine learning practical scenarios [62,63,64,65,66,67]. While momentum acceleration was previously studied in non-convex programming setups, it mostly involve non-convex constraints with a convex objective function [47,61,68,69]; and generic non-convex settings but only considering with the question of whether momentum acceleration leads to fast convergence to a saddle point or to a local minimum, rather than to a global optimum [45,70,71,72].
The factorized version for semi-definite programming was popularized in [73]. Effectively the factorization of a the set of PSD matrices to a product of rectangular matrices results in a non-convex setting. This approach have been heavily studied recently, due to computational and space complexity advantages [25,26,30,31,32,33,34,36,37,38,41,74,75,76]. None of the works above consider the inclusion and analysis of momentum. Moreover, the Procrustes Flow approach [32,34] uses certain initializations techniques, and thus relies on multiple SVD computations. Our approach on the other hand uses a single, unique, top-r SVD computation. Comparison results beyond QST are provided in the appendix.
Compressed sensing QST using non-convex optimization. There are only few works that study non-convex optimization in the context of compressed sensing QST. The authors of [16] propose a hybrid algorithm that (i) starts with a conjugate-gradient (CG) algorithm in the factored space, in order to get initial rapid descent, and (ii) switch over to accelerated first-order methods in the original ρ space, provided one can determine the switch-over point cheaply. Using the multinomial maximum likelihood objective, in the initial CG phase, the Hessian of the objective is computed per iteration (i.e., a 4 n × 4 n matrix), along with its eigenvalue decomposition. Such an operation is costly, even for moderate values of qubit number n, and heuristics are proposed for its completion. From a theoretical perspective, [16] provide no convergence or convergence rate guarantees.
From a different perspective, [77] relies on spectrum estimation techniques [78,79] and the Empirical Young Diagram algorithm [80,81] to prove that O ( r d / ε ) copies suffice to obtain an estimate ρ ^ that satisfies ρ ^ ρ F 2 ε ; however, to the best of our knowledge, there is no concrete implementation of this technique to compare with respect to scalability.
Ref. [82] proposes an efficient quantum tomography protocol by determining the permutationally invariant part of the quantum state. The authors determine the minimal number of local measurement settings, which scales quadratically with the number of qubits. The paper determines which quantities have to be measured in order to get the smallest uncertainty possible. See [83] for a more recent work on permutationally invariant tomography. The method has been tested in a six-qubit experiment in [84].
Ref. [22] presented an experimental implementation of compressed sensing QST of a n = 7 qubit system, where only 127 Pauli basis measurements are available. To achieve recovery in practice, the authors proposed a computationally efficient estimator, based on gradient descent method in the factorized space. The authors of [22] focus on the experimental efficiency of the method, and provide no specific results on the optimization efficiency, neither convergence guarantees of the algorithm. Further, there is no available implementation publicly available.
Similar to [22], Ref. [26] also proposes a non-convex projected gradient decent algorithm that works on the factorized space in the QST setting. The authors prove a rigorous convergence analysis and show that, under proper initialization and step-size, the algorithm is guaranteed to converge to the global minimum of the problem, thus ensuring a provable tomography procedure. Our results extend these results by including acceleration techniques in the factorized space. The key contribution of our work is proving convergence of the proposed algorithm in a linear rate to the global minimum of the problem, under common assumptions. Proving our results required developing a whole set of new techniques, which are not based on a mere extension of existing results.
Compressed sensing QST using convex optimization. The original formulation of compressed sensing QST [4] is based on convex optimization methods, solving the trace-norm minimization, to obtain an estimation of the low-rank state. It was later shown [8] that essentially any convex optimization program can be used to robustly estimate the state. In general, there are two drawbacks in using convex optimization optimization in QST. Firstly, as the dimension of density matrices grow exponentially in the number of qubits, the search space in convex optimization grows exponentially in the number of qubits. Secondly, the optimization requires projection onto the PSD cone at every iteration, which becomes exponentially hard in the number of qubits. We avoid these two drawbacks by working in the factorized space. Using this factorization results in a search space that is substantially smaller than its convex counterpart, and moreover, in a single use of top-r SVD during the entire execution algorithm. Bypassing these drawbacks, together with accelerating motions, allows us to estimate quantum states of larger qubit systems than state-of-the-art algorithms.
Full QST using non-convex optimization. The use of non-convex algorithms in QST was studied in the context of full tomography as well. By “full tomography” we refer to the situation where an informationally complete measurement is performed, so that the input data to the algorithm is of size 4 n . The exponential scaling of the data size restrict the applicability of full tomography to relatively small system sizes. In this setting non-convex algorithms which work in the factored space were studied [85,86,87,88,89]. Except of the work [88], we are not aware of theoretical results on the convergence of the proposed algorithm due to the presence of spurious local minima. The authors of [88] characterize the local vs. the global behavior of the objective function under the factorization ρ = U U and discuss how existing methods fail due to improper stopping criteria or due to the lack of algorithmic convergence results. Their work highlights the lack of rigorous convergence results of non-convex algorithms used in full quantum state tomography. There is no available implementation publicly available for these methods as well.
Full QST using convex optimization. Despite the non-scalability of full QST, and the limitation of convex optimization, a lot of research was devoted to this topic. Here, we mention only a few notable results that extend the applicability of full QST using specific techniques in convex optimization. Ref [52] shows that for given measurement schemes the solution for the maximum likelihood is given by a linear inversion scheme, followed by a projection onto the set of density matrices. More recently, the authors of [18] used a combination of the techniques of [52] with the sparsity of the Pauli matrices and the use of GPUs to perform a full QST of 14 qubits. While pushing the limit of full QST using convex optimization, obtaining full tomographic experimental data for more than a dozen qubits is significantly time-intensive. Moreover, this approach is highly centralized, in comparison to our approach that can be distributed. Using the sparsity pattern property of the Pauli matrices and GPUs is an excellent candidate approach to further enhance the performance of non-convex compressed sensing QST.
QST using neural networks. Deep neural networks are ubiquitous, with many applications to science and industry. Recently, [9,10,11,27] show how machine learning and neural networks can be used to perform QST, driven by experimental data. The neural network architecture used is based on restricted Boltzmann machines (RBMs) [90], which feature a visible and a hidden layer of stochastic binary neurons, fully connected with weighted edges. Test cases considered include reconstruction of W state, magnetic observables of local Hamiltonians, the unitary dynamics induced by Hamiltonian evolution. Comparison results are provided in the Main Results section. Alternative approaches include conditional generative adversarial networks (CGANs) [91,92]: in this case, two dueling neural networks, a generator and a discriminator, learn to generate and identify multi-modal models from data.
QST for Matrix Product States (MPS). This is the case of highly structured quantum states where the state is well-approximated by a MPS of low bond dimension [12,13]. The idea behind this approach is, in order to overcome exponential bottlenecks in the general QST case, we require highly structured subsets of states, similar to the assumptions made in compressed sensing QST. MPS QST is considered an alternative approach to reduce the computational and storage complexity of QST.
Direct fidelity estimation. Rather than focusing on entrywise estimation of density matrices, the direct fidelity estimation procedure focuses on checking how close is the state of the system to a target state, where closeness is quantified by the fidelity metric. Classic techniques require up to 2 n / ϵ 4 number of samples, where ϵ denotes the accuracy of the fidelity term, when considering a general quantum state [93,94], but can be reduced to almost dimensionality-free 1 / ϵ 2 number of samples for specific cases, such as stabilizer states [95,96,97]. Shadow tomography is considered as an alternative and generalization of this technique [98,99]; however, as noted in [94], the procedure in [98,99] requires exponentially long quantum circuits that act collectively on all the copies of the unknown state stored in a quantum memory, and thus has not been implemented fully on real quantum machines. A recent neural network-based implementation of such indirect QST learning methods is provided here [100].
The work in [93,94], goes beyond simple fidelity estimation, and utilizes random single qubit rotations to learn a minimal sketch of the unknown quantum state by which one that can predict arbitrary linear function of the state. Such methods constitute a favorable alternative to QST as they do not require number of samples that scale polynomially with the dimension; however, this, in turn, implies that these methods cannot be used in general to estimate the density matrix itself.

Author Contributions

Conceptualization, A.K. (Amir Kalev) and A.K. (Anastasios Kyrillidis); methodology, J.L.K. and A.K. (Anastasios Kyrillidis); software, J.L.K. and G.K.; formal analysis, J.L.K. and A.K. (Anastasios Kyrillidis); investigation, J.L.K., G.K., A.K. (Amir Kalev) and A.K. (Anastasios Kyrillidis); data curation, K.X.W. and G.K.; writing—original draft preparation, J.L.K. and A.K. (Anastasios Kyrillidis); writing—review and editing, J.L.K., G.K., A.K. (Amir Kalev) and A.K. (Anastasios Kyrillidis); visualization, J.L.K. and G.K.; supervision, A.K. (Amir Kalev) and A.K. (Anastasios Kyrillidis); project administration, A.K. (Anastasios Kyrillidis); funding acquisition, A.K. (Amir Kalev) and A.K. (Anastasios Kyrillidis). All authors have read and agreed to the published version of the manuscript.

Funding

Anastasios Kyrillidis and Amir Kalev acknowledge funding by the NSF (CCF-1907936).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The empirical results were obtained via synthetic and real experiments; the algorithm’s implementation is available at https://github.com/gidiko/MiFGD (accessed on 18 January 2023).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Additional Experiments

Appendix A.1. IBM Quantum System Experiments: GHZ—(6) Circuit, 2048 Shots

Figure A1. Target error list plots for reconstructing GHZ ( 6 ) circuit using real measurements from IBM Quantum system experiments.
Figure A1. Target error list plots for reconstructing GHZ ( 6 ) circuit using real measurements from IBM Quantum system experiments.
Photonics 10 00116 g0a1
Figure A2. Target error list plots for reconstructing GHZ ( 6 ) circuit using synthetic measurements from IBM’s quantum simulator.
Figure A2. Target error list plots for reconstructing GHZ ( 6 ) circuit using synthetic measurements from IBM’s quantum simulator.
Photonics 10 00116 g0a2
Figure A3. Convergence iteration plots for reconstructing GHZ ( 6 ) circuit using using real measurements from IBM Quantum system experiments and synthetic measurements from Qiskit simulation experiments.
Figure A3. Convergence iteration plots for reconstructing GHZ ( 6 ) circuit using using real measurements from IBM Quantum system experiments and synthetic measurements from Qiskit simulation experiments.
Photonics 10 00116 g0a3
Figure A4. Fidelity list plots for reconstructing GHZ ( 6 ) circuit using using real measurements from IBM Quantum system experiments and synthetic measurements from Qiskit simulation experiments.
Figure A4. Fidelity list plots for reconstructing GHZ ( 6 ) circuit using using real measurements from IBM Quantum system experiments and synthetic measurements from Qiskit simulation experiments.
Photonics 10 00116 g0a4

Appendix A.2. IBM Quantum System Experiments: GHZ—(8) Circuit, 2048 Shots

Figure A5. Target error list plots for reconstructing GHZ ( 8 ) circuit using real measurements from IBM Quantum system experiments.
Figure A5. Target error list plots for reconstructing GHZ ( 8 ) circuit using real measurements from IBM Quantum system experiments.
Photonics 10 00116 g0a5
Figure A6. Target error list plots for reconstructing GHZ ( 8 ) circuit using synthetic measurements from IBM’s quantum simulator.
Figure A6. Target error list plots for reconstructing GHZ ( 8 ) circuit using synthetic measurements from IBM’s quantum simulator.
Photonics 10 00116 g0a6
Figure A7. Convergence iteration plots for reconstructing GHZ ( 8 ) circuit using using real measurements from IBM Quantum system experiments and synthetic measurements from Qiskit simulation experiments.
Figure A7. Convergence iteration plots for reconstructing GHZ ( 8 ) circuit using using real measurements from IBM Quantum system experiments and synthetic measurements from Qiskit simulation experiments.
Photonics 10 00116 g0a7
Figure A8. Fidelity list plots for reconstructing GHZ ( 8 ) circuit using using real measurements from IBM Quantum system experiments and synthetic measurements from Qiskit simulation experiments.
Figure A8. Fidelity list plots for reconstructing GHZ ( 8 ) circuit using using real measurements from IBM Quantum system experiments and synthetic measurements from Qiskit simulation experiments.
Photonics 10 00116 g0a8

Appendix A.3. IBM Quantum System Experiments: GHZ—(8) Circuit, 4096 Shots

Figure A9. Target error list plots for reconstructing GHZ ( 8 ) circuit using real measurements from IBM Quantum system experiments.
Figure A9. Target error list plots for reconstructing GHZ ( 8 ) circuit using real measurements from IBM Quantum system experiments.
Photonics 10 00116 g0a9
Figure A10. Target error list plots for reconstructing GHZ ( 8 ) circuit using synthetic measurements from IBM’s quantum simulator.
Figure A10. Target error list plots for reconstructing GHZ ( 8 ) circuit using synthetic measurements from IBM’s quantum simulator.
Photonics 10 00116 g0a10
Figure A11. Convergence iteration plots for reconstructing GHZ ( 8 ) circuit using using real measurements from IBM Quantum system experiments and synthetic measurements from Qiskit simulation experiments.
Figure A11. Convergence iteration plots for reconstructing GHZ ( 8 ) circuit using using real measurements from IBM Quantum system experiments and synthetic measurements from Qiskit simulation experiments.
Photonics 10 00116 g0a11
Figure A12. Fidelity list plots for reconstructing GHZ ( 8 ) circuit using using real measurements from IBM Quantum system experiments and synthetic measurements from Qiskit simulation experiments.
Figure A12. Fidelity list plots for reconstructing GHZ ( 8 ) circuit using using real measurements from IBM Quantum system experiments and synthetic measurements from Qiskit simulation experiments.
Photonics 10 00116 g0a12

Appendix A.4. IBM Quantum System Experiments: Hadamard(6) Circuit, 8192 Shots

Figure A13. Target error list plots for reconstructing Hadamard ( 6 ) circuit using real measurements from IBM Quantum system experiments.
Figure A13. Target error list plots for reconstructing Hadamard ( 6 ) circuit using real measurements from IBM Quantum system experiments.
Photonics 10 00116 g0a13
Figure A14. Target error list plots for reconstructing Hadamard ( 6 ) circuit using synthetic measurements from IBM’s quantum simulator.
Figure A14. Target error list plots for reconstructing Hadamard ( 6 ) circuit using synthetic measurements from IBM’s quantum simulator.
Photonics 10 00116 g0a14
Figure A15. Convergence iteration plots for reconstructing Hadamard ( 6 ) circuit using using real measurements from IBM Quantum system experiments and synthetic measurements from Qiskit simulation.
Figure A15. Convergence iteration plots for reconstructing Hadamard ( 6 ) circuit using using real measurements from IBM Quantum system experiments and synthetic measurements from Qiskit simulation.
Photonics 10 00116 g0a15
Figure A16. Fidelity list plots for reconstructing Hadamard ( 6 ) circuit using using real measurements from IBM Quantum system experiments and synthetic measurements from Qiskit simulation experiments.
Figure A16. Fidelity list plots for reconstructing Hadamard ( 6 ) circuit using using real measurements from IBM Quantum system experiments and synthetic measurements from Qiskit simulation experiments.
Photonics 10 00116 g0a16

Appendix A.5. IBM Quantum System Experiments: Hadamard(8) Circuit, 4096 Shots

Figure A17. Target error list plots for reconstructing Hadamard ( 8 ) circuit using real measurements from IBM Quantum system experiments.
Figure A17. Target error list plots for reconstructing Hadamard ( 8 ) circuit using real measurements from IBM Quantum system experiments.
Photonics 10 00116 g0a17
Figure A18. Target error list plots for reconstructing Hadamard ( 8 ) circuit using synthetic measurements from IBM’s quantum simulator.
Figure A18. Target error list plots for reconstructing Hadamard ( 8 ) circuit using synthetic measurements from IBM’s quantum simulator.
Photonics 10 00116 g0a18
Figure A19. Convergence iteration plots for reconstructing Hadamard ( 8 ) circuit using using real measurements from IBM Quantum system experiments and synthetic measurements from Qiskit simulation.
Figure A19. Convergence iteration plots for reconstructing Hadamard ( 8 ) circuit using using real measurements from IBM Quantum system experiments and synthetic measurements from Qiskit simulation.
Photonics 10 00116 g0a19
Figure A20. Fidelity list plots for reconstructing Hadamard ( 8 ) circuit using using real measurements from IBM Quantum system experiments and synthetic measurements from Qiskit simulation experiments.
Figure A20. Fidelity list plots for reconstructing Hadamard ( 8 ) circuit using using real measurements from IBM Quantum system experiments and synthetic measurements from Qiskit simulation experiments.
Photonics 10 00116 g0a20

Appendix A.6. Synthetic Experiments for n = 12

We compare MiFGD with (i) the Matrix ALPS framework [61], a state of the art projected gradient descent algorithm, and an optimized version of matrix iterative hard thresholding, operating on the full matrix variable ρ , with adaptive step size η (we note that this algorithm has outperformed most of the schemes that work on the original space ρ ; see [61]); (ii) the plain Procrustes Flow/FGD algorithm [25,26,32], where we use the step size as reported in [25], since the later has reported better performance than vanilla Procrustes Flow. We note that the Procrustes Flow/FGD algorithm is similar to our algorithm without acceleration. Further, the original Procrustes Flow/FGD algorithm relies on performing many iterations in the original space ρ as an initialization scheme, which is often prohibitive as the problem dimensions grow. Both for our algorithm and the plain Procrustes Flow/FGD scheme, we use random initialization.
To properly compare the algorithms in the above list, we pre-select a common set of problem parameters. We fix the dimension d = 4096 (equivalent to n = 12 qubits), and the rank of the optimal matrix ρ R d × d to be r = 10 (equivalent to a mixed quantum state reconstruction). Similar behavior has been observed for other values of r, and are omitted. We fix the number of observables m to be m = c · d · r , where c { 3 , 5 } . In all algorithms, we fix the maximum number of iterations to 4000, and we use the same stopping criterion: ρ i + 1 ρ i F / ρ i F tol = 10 3 . For the implementation of MiFGD , we have used the momentum parameter μ = 2 3 , as well as the theoretical μ value.
The procedure to generate synthetically the data is as follows: The observations y are set to y = A ρ + w for some noise vector w; while the theory holds for the noiseless case, we show empirically that noisy cases are robustly handled by the same algorithm. We use permuted and subsampled noiselets for the linear operator A [101]. The optimal matrix ρ is generated as the multiplication of a tall matrix U R d × r such that ρ = U U , and ρ F = 1 , without loss of generality. The entries of U are drawn i.i.d. from a Gaussian distribution with zero mean and unit variance. In the noisy case, w has the same dimensions with y, its entries are drawn from a zero mean Gaussian distribution with norm w 2 = 0.01 . The random initialization is defined as U 0 drawn i.i.d. from a Gaussian distribution with zero mean and unit variance.
The results are shown in Figure A21. Some notable remarks: (i) While factorization techniques might take more iterations to converge compared to non-factorized algorithms, the per iteration time complexity is much less, such that overall, factorized gradient descent converges more quickly in terms of total execution time. (ii) Our proposed algorithm, even under the restrictive assumptions on acceleration parameter μ, performs better than the non-accelerated factored gradient descent algorithms, such as Procrustes Flow. (iii) Our theory is conservative: using a much larger μ we obtain a faster convergence; the proof for less strict assumptions for μ is an interesting future direction. In all cases, our findings illustrate the effectiveness of the proposed schemes on different problem configurations.
Figure A21. Synthetic example results on low-rank matrix sensing in higher dimensions (equivalent to n = 12 qubits). Top row: Convergence behavior vs. time elapsed. Bottom row: Convergence behavior vs. number of iterations. Left panel: c = 5 , noiseless case; Center panel: c = 3 , noiseless case; Right panel: c = 5 , noisy case, w 2 = 0.01 .
Figure A21. Synthetic example results on low-rank matrix sensing in higher dimensions (equivalent to n = 12 qubits). Top row: Convergence behavior vs. time elapsed. Bottom row: Convergence behavior vs. number of iterations. Left panel: c = 5 , noiseless case; Center panel: c = 3 , noiseless case; Right panel: c = 5 , noisy case, w 2 = 0.01 .
Photonics 10 00116 g0a21

Appendix A.7. Asymptotic Complexity Comparison of lstsq, CVXPY, and MiFGD

We first note that lstsq can be only applied to the case we have a full tomographic set of measurements; this makes lstsq algorithm inapplicable in the compressed sensing scenario, where the number of measurements can be significantly reduced. Yet, we make the comparison by providing information-theoretically complete set of measurements to lstsq and CVXPY, as well as to MiFGD, to highlight the efficiency of our proposed method, even in the scenario that is not exactly intended in our work. Given this, we compare in detail the asymptotic scailing of MiFGD with lstsq and CVXPY below:
  • lstsq is based on the computation of eigenvalues/eigenvector pairs (among other steps) of a matrix of size equal to the density matrix we want to reconstruct. Based on our notation, the density matrices are denoted as ρ with dimensions 2 n × 2 n . Here, n is the number of qubits in the quantum system. Standard libraries for eigenvalue/eigenvector calculations, like LAPACK, reduce a Hermitian matrix to tridiagonal form using the Householder method, which takes overall a O ( 2 n ) 3 computational complexity. The other steps in the lstsq procedure either take constant time, or O ( 2 n ) complexity. Thus, the actual run-time of an implementation depends on the eigensystem solver that is being used.
  • CVXPY is distributed with the open source solvers; for the case of SDP instances, CVXPY utilizes the Splitting Conic Solver (SCS) (https://github.com/cvxgrp/scs (accessed on 18 January 2023)), a general numerical optimization package for solving large-scale convex cone problems. SCS applies Douglas-Rachford splitting to a homogeneous embedding of the quadratic cone program. Based on the PSD constraint, this again involves the computation of eigenvalues/eigenvector pairs (among other steps) of a matrix of size equal to the density matrix we want to reconstruct. This takes overall a O ( 2 n ) 3 computational complexity, not including the other steps performed within the SCS solver. This is an iterative algorithm that requires such complexity per iteration. Douglas-Rachford splitting methods enjoy O ( 1 ε ) convergence rate in general [53,102,103]. This leads to a rough O ( ( 2 n ) 3 · 1 ε ) overall iteration complexity (This is an optimistic complexity bound since we have skipped several details within the Douglas-Rachford implementation of CVXPY).
  • For MiFGD, and for sufficiently small momentum value, we require O ( κ · log ( 1 ε ) ) iterations to get close to the optimal value. Per iteration, MiFGD does not involve any expensive eigensystem solvers, but relies only on matrix-matrix and matrix-vector multiplications. In particular, the main computational complexity per iteration origins from the iteration:
    U k + 1 = Z k η A A ( Z k Z k ) y · Z k , Z k + 1 = U k + 1 + μ U k + 1 U k .
    Here, U k , Z k R 2 n × r for all k. Observe that A ( Z k Z k ) R m where each element is computed independently. For an index j [ m ] , ( A ( Z k Z k ) ) j = Tr ( A j Z k Z k ) requires O ( ( 2 n ) 2 · r ) complexity, and thus computing A ( Z k Z k ) y requires O ( ( 2 n ) 2 · r ) complexity, overall. By definition the adjoing operation A : R m C 2 n × 2 n satisfies: A ( x ) = i = 1 m x i A i ; thus, the operation A A ( Z k Z k ) y is still dominated by O ( ( 2 n ) 2 · r ) complexity. Finally, we perform one more matrix-matrix multiplication with Z i , which results into an additional O ( ( 2 n ) 2 · r ) complexity. The rest of the operations involve adding 2 n × r matrices, which does not dominate the overall complexity. Combining the iteration complexity with the per-iteration computational complexity, MiFGD has a O ( ( 2 n ) 2 · r · κ · log ( 1 ε ) ) complexity.
Combining the above, we summarize the following complexities:
O ( ( 2 n ) 3 ) lstsq vs O ( ( 2 n ) 3 · 1 ε ) CVXPY vs O ( ( 2 n ) 2 · r · κ · log ( 1 ε ) ) MiFGD
Observe that (i) MiFGD has the best dependence on the number of qubits and the ambient dimension of the problem, 2 n ; (ii) MiFGD applies to cases that lstsq is inapplicable; (iii) MiFGD has a better iteration complexity than other iterative algorithms, while has a better polynomial dependency on 2 n .

Appendix B. Detailed Proof of Theorem 1

For notational brevity, we first denote U + U k + 1 , U U k , U U k 1 and Z Z k . Let us start with the following equality. For R Z O as the minimizer of min R O Z U R F , we have:
U + U R Z F 2 = U + Z + Z U R Z F 2
= U + Z F 2 + Z U R Z F 2 2 U + Z , U R Z Z .
The proof focuses on how to bound the last part on the right-hand side. By the definition of U + , we get:
U + Z , U R Z Z = Z η A A ( Z Z ) y Z Z , U R Z Z = η A A ( Z Z ) y Z , Z U R Z
Observe the following:
A A ( Z Z ) y Z , Z U R Z = A A ( Z Z ) y , Z Z U R Z Z = A A ( Z Z ) y , Z Z 1 2 U U + 1 2 U U U R Z Z = 1 2 A A ( Z Z ) y , Z Z U U + A A ( Z Z ) y , 1 2 ( Z Z + U U ) U R Z Z = 1 2 A A ( Z Z ) y , Z Z U U + 1 2 A A ( Z Z ) y , ( Z U R Z ) ( Z U R Z ) .
By Lemmata A7 and A8, we have:
U + U R Z F 2 = U + Z F 2 + Z U R Z F 2 2 U + Z , U R Z Z = η 2 A A ( Z Z ) y Z F 2 + Z U R Z F 2 η A A ( Z Z ) y , Z Z U U η A A ( Z Z ) y , ( Z U R Z ) ( Z U R Z ) η 2 A A ( Z Z ) y Z F 2 + Z U R Z F 2 1.0656 η 2 A ( A ( Z Z ) y ) Z F 2 η 1 δ 2 r 2 U U Z Z F 2 + η ( θ σ r ( ρ ) · Z U R Z F 2 + 1 200 β 2 · η ^ · 3 2 + 2 | μ | 2 1 3 2 + 2 | μ | 1 10 3 2 · A ( A ( Z Z ) y ) · Z F 2 )
Next, we use the following lemma:
Lemma A1 
([32] (Lemma 5.4)). For any W , V C d × r , the following holds:
W W V V F 2 2 ( 2 1 ) · σ r ( V V ) · min R O W V R F 2 .
From Lemma A1, the quantity U U Z Z F 2 satisfies:
U U Z Z F 2 2 ( 2 1 ) · σ r ( ρ ) · min R O Z U R F 2 = 2 ( 2 1 ) · σ r ( ρ ) · Z U R Z F 2 ,
which, in our main recursion, results in:
U + U R Z F 2 η 2 A A ( Z Z ) y Z F 2 + Z U R Z F 2 1.0656 η 2 A ( A ( Z Z ) y ) Z F 2 η ( 2 1 ) ( 1 δ 2 r ) σ r ( ρ ) Z U R Z F 2 + η ( θ σ r ( ρ ) · Z U R Z F 2 + 1 200 β 2 · η ^ · 3 2 + 2 | μ | 2 1 3 2 + 2 | μ | 1 10 3 2 · A ( A ( Z Z ) y ) · Z F 2 ) ( i ) η 2 A A ( Z Z ) y Z F 2 + Z U R Z F 2 1.0656 η 2 A ( A ( Z Z ) y ) Z F 2 η ( 2 1 ) ( 1 δ 2 r ) σ r ( ρ ) Z U R Z F 2 + η ( θ σ r ( ρ ) · Z U R Z F 2 + 1 200 β 2 · 10 9 η · 3 2 + 2 | μ | 2 1 3 2 + 2 | μ | 1 10 3 2 · A ( A ( Z Z ) y ) · Z F 2 ) = ( i i ) 1 + 1 200 β 2 · 10 9 · 3 2 + 2 | μ | 2 1 3 2 + 2 | μ | 1 10 3 2 1.0656 η 2 A ( A ( Z Z ) y ) · Z F 2 + 1 + η θ σ r ( ρ ) η ( 2 1 ) ( 1 δ 2 r ) σ r ( ρ ) Z U R Z F 2
where ( i ) is due to Lemma A6, and ( i i ) is due to the definition of U + .
Under the assumptions that μ = σ r ( ρ ) 1 / 2 10 3 κ τ ( ρ ) · ε 4 · σ 1 ( ρ ) 1 / 2 · r , for ε ( 0 , 1 ) user-defined, and δ 2 r 1 10 , the main constants in our proof so far simplify to:
β = 1 + 3 2 + 2 | μ | · 1 10 3 1 3 2 + 2 | μ | · 1 10 3 = 1.003 , and β 2 = 1.006 ,
by Corollary A3. Thus:
1 + 1 200 β 2 · 10 9 · 3 2 + 2 | μ | 2 1 3 2 + 2 | μ | 1 10 3 2 1.0656 0.0516 ,
and our recursion becomes:
U + U R Z F 2 0.0516 · η 2 · A ( A ( Z Z ) y ) · Z F 2 + 1 + η θ σ r ( ρ ) η ( 2 1 ) ( 1 δ 2 r ) σ r ( ρ ) Z U R Z F 2 .
Finally, we have
θ = ( 1 δ 2 r ) 1 + 3 2 + 2 | μ | 1 10 3 2 10 3 + ( 1 + δ 2 r ) 2 + 3 2 + 2 | μ | · 1 10 3 3 2 + 2 | μ | · 1 10 3 = ( i ) ( 1 δ 2 r ) · 1 + ( 3 2 + 2 | μ | ) 1 10 3 2 10 3 + κ 2 + 3 2 + 2 | μ | · 1 10 3 3 2 + 2 | μ | · 1 10 3 0.0047 · ( 1 δ 2 r ) .
where ( i ) is by the definition of κ : = 1 + δ 2 r 1 δ 2 r 1.223 for δ 2 r 1 10 , which is one of our assumptions as explained above. Combining the above in our main inequality, we obtain:
U + U R Z F 2 0.0516 · η 2 · A ( A ( Z Z ) y ) · Z F 2 + 1 + η σ r ( ρ ) ( 1 δ 2 r ) · ( 0.0047 2 + 1 ) Z U R Z F 2 1 4 η σ r ( ρ ) ( 1 δ 2 r ) 10 Z U R Z F 2 .
Taking square root on both sides, we obtain:
U + U R Z F 1 4 η σ r ( ρ ) ( 1 δ 2 r ) 10 · Z U R Z F
Let us define ξ = 1 4 η σ r ( ρ ) ( 1 δ 2 r ) 10 . Using the definitions Z = U + μ ( U U ) and R Z arg min R O Z U R F , we get
U + U R Z F ξ · min R O Z U R F = ξ · min R O U + μ ( U U ) U R F = ξ · min R O U + μ U U ( 1 μ + μ ) U R F ( i ) ξ · | 1 + μ | · min R O U U R F + ξ · | μ | · min R O U U R F + ξ · | μ | · r σ 1 ( ρ ) 1 / 2
where ( i ) follows from steps similar to those in Lemma A5. Further observe that
min R O U + U R F U + U R Z F ,
which leads to:
min R O U + U R F ξ · | 1 + μ | · min R O U U R F + ξ · | μ | · min R O U U R F + ξ · | μ | · r σ 1 ( ρ ) 1 / 2 .
Including two subsequent iterations in a single two-dimensional first-order system, we get the following characterization:
min R O U k + 1 U R F min R O U k U R F ξ · | 1 + μ | ξ · | μ | 1 0 · min R O U k U R F min R O U k 1 U R F + 1 0 · ξ · | μ | · σ 1 ( ρ ) 1 / 2 · r .
Now, let x j : = min R O U j U R F . Then, we can write the above relation as
x k + 1 x k ξ · | 1 + μ | ξ · | μ | 1 0 : = A · x k x k 1 + 1 0 · ξ · | μ | · σ 1 ( ρ ) 1 / 2 · r ,
where we denote the “contraction matrix” by A. Observe that A has non-negative values. Unfolding the above recursion for k iterations, we obtain:
x k + 1 x k A k + 1 · x 0 x 1 + t = 0 k A t · 1 0 · ξ · | μ | · σ 1 ( ρ ) 1 / 2 · r
Taking norms on both sides, we get
x k + 1 x k A k + 1 · x 0 x 1 + t = 0 k A t · 1 0 · ξ · | μ | · σ 1 ( ρ ) 1 / 2 · r ( i ) A k + 1 · x 0 x 1 + t = 0 k A t · 1 0 · ξ · | μ | · σ 1 ( ρ ) 1 / 2 · r ( i i ) A k + 1 · x 0 x 1 + t = 0 k A t · 1 0 · ξ · | μ | · σ 1 ( ρ ) 1 / 2 · r ,
where ( i ) is by triangle inequality, and ( i i ) is by submultiplicativity of matrix norms.
To bound Equation (A6), we will use spectral analysis. We first recall the definition of the spectral radius of a matrix M:
ρ ( M ) : = max { | λ | : λ Sp ( M ) } ,
where Sp ( M ) is the set of all eigenvalues of M. We then use the following lemma:
Lemma A2 
(Lemma 11 in [104]). Given a matrix M and ϵ > 0 , there exists a matrix norm · such that
M ρ ( M ) + ϵ .
We further use the Gelfand’s formula:
Lemma A3 
(Theorem 12 in [104]). Given any matrix norm · , the following holds:
ρ ( M ) = min k M k 1 / k
The proofs for the above lemmata can be found in [104]. Using Lemmas A2 and A3, we have that for any ϵ > 0 , there exists K ϵ N such that
A k 1 / k ( ρ ( A ) + ϵ ) for all k .
Further, let C ϵ : = max k < K ϵ max 1 , A k ( ρ ( A ) + ϵ ) k . Then, we have
A k C ϵ ( ρ ( A ) + ϵ ) k for all k K ϵ .
Hence, using Equation (A7) in Equation (A6), we have
x k + 1 x k A k + 1 · x 0 x 1 + t = 0 k A t · 1 0 · ξ · | μ | · σ 1 ( ρ ) 1 / 2 · r C ϵ ( ρ ( A ) + ϵ ) k + 1 x 0 x 1 + 1 0 · ξ · | μ | · σ 1 ( ρ ) 1 / 2 · r · C ϵ · t = 0 k ( ρ ( A ) + ϵ ) t .
Therefore, asymptotically, the convergence rate is O ( ρ ( A ) k + 1 ) , where ρ ( A ) is the spectral radius of the contraction matrix A. We thus compute ρ ( A ) below. Since A is a 2 × 2 matrix, its eigenvalues are:
λ 1 , 2 = ξ · | 1 + μ | 2 ± ξ 2 ( 1 + μ ) 2 4 + ξ · | μ | ( i ) ρ ( A ) : = max { λ 1 , λ 2 } = λ 1 = ξ · | 1 + μ | 2 + ξ 2 ( 1 + μ ) 2 4 + ξ · | μ | ,
where ( i ) follows since every term in λ 1 , 2 is positive.
To show an accelerated convergence rate, we want the above eigenvalue (which determines the convergence rate) to be bounded by 1 1 δ 2 r 1 + δ 2 r (This is akin to the notion of acceleration and optimal method (for certain function classes) from convex optimization literature, where the condition number appears inside the · term. For details, see [39,44]). To show this, first note that this term is bounded above as follows:
λ 1 = ξ · | 1 + μ | 2 + ξ 2 ( 1 + μ ) 2 4 + ξ · | μ | ( i ) ξ + ξ 2 + ξ ( i i ) ξ + 2 ξ ( i i ) ( 2 + 1 ) ξ ,
where (i) is by the conventional bound on momentum: 0 < μ < 1 , and (ii) is by the relation ξ 2 ξ ξ for 0 ξ 1 . Therefore, to show an accelerated rate of convergence, we want the following relation to hold:
( 2 + 1 ) ξ 1 1 δ 2 r 1 + δ 2 r ξ 1 + δ 2 r 1 δ 2 r ( 2 + 1 ) 1 + δ 2 r .
Recalling our definition of ξ = 1 4 η σ r ( ρ ) ( 1 δ 2 r ) 10 , the problem boils down to choosing the right step size η such that the above inequality on ξ in Equation (A9) is satisfied. With simple algebra, we can show the following lower bound on η :
1 1 + δ 2 r 1 δ 2 r ( 2 + 1 ) 1 + δ 2 r 4 · 10 4 σ r ( ρ ) ( 1 δ 2 r ) η
Finally, the argument inside the · term of ξ = 1 4 η σ r ( ρ ) ( 1 δ 2 r ) 10 > 0 has to be non-negative, yielding the following upper bound on η :
η 10 4 σ r ( ρ ) ( 1 δ 2 r ) .
Combining two inequalities, and noting that the term 1 1 + δ 2 r 1 δ 2 r ( 2 + 1 ) 1 + δ 2 r 4 is bounded above by 1, we arrive at the following bound on η :
1 1 + δ 2 r 1 δ 2 r ( 2 + 1 ) 1 + δ 2 r 4 · 10 4 σ r ( ρ ) ( 1 δ 2 r ) η 10 4 σ r ( ρ ) ( 1 δ 2 r ) .
In sum, for the specific η satisfying Equation (A10), we have shown that
ρ ( A ) = λ 1 = ξ · | 1 + μ | 2 + ξ 2 ( 1 + μ ) 2 4 + ξ · | μ | 1 1 δ 2 r 1 + δ 2 r
Above bound translates Equation (A8) into:
x k + 1 x k C ϵ ( ρ ( A ) + ϵ ) k + 1 x 0 x 1 + 1 0 · ξ · | μ | · σ 1 ( ρ ) 1 / 2 · r · C ϵ · t = 0 k ( ρ ( A ) + ϵ ) t C ϵ 1 1 δ 2 r 1 + δ 2 r + ϵ k + 1 x 0 x 1 + 1 0 · ξ · | μ | · σ 1 ( ρ ) 1 / 2 · r · C ϵ · t = 0 k ( ρ ( A ) + ϵ ) t
Without loss of generality, we can take the Euclidean norm from the above (By the equivalence of norms [105], C ϵ can absorb additional constants), which yields:
x k + 1 x k 2 C ϵ 1 1 δ 2 r 1 + δ 2 r + ϵ k + 1 x 0 x 1 2 + 1 0 · ξ · | μ | · σ 1 ( ρ ) 1 / 2 · r · C ϵ · t = 0 k ( ρ ( A ) + ϵ ) t 2 = C ϵ 1 1 δ 2 r 1 + δ 2 r + ϵ k + 1 x 0 2 + x 1 2 + ξ · | μ | · σ 1 ( ρ ) 1 / 2 · r · C ϵ · t = 0 k ( ρ ( A ) + ϵ ) t = C ϵ 1 1 δ 2 r 1 + δ 2 r + ϵ k + 1 x 0 2 + x 1 2 + ξ · | μ | · σ 1 ( ρ ) 1 / 2 · r · C ϵ · 1 ( ρ ( A ) + ϵ ) k + 1 1 ( ρ ( A ) + ϵ ) .
Re-substituting x j = min R O U j U R F , and using the same initialization for U 0 and U 1 , we get:
min R O U k + 1 U R F 2 + min R O U k U R F 2 1 / 2 C ϵ 1 1 δ 2 r 1 + δ 2 r + ϵ k + 1 2 min R O U 0 U R F 2 1 / 2 + ξ · | μ | · σ 1 ( ρ ) 1 / 2 · r · C ϵ · 1 ( ρ ( A ) + ϵ ) k + 1 1 ( ρ ( A ) + ϵ ) = C ϵ 1 1 δ 2 r 1 + δ 2 r + ϵ k + 1 min R O U 0 U R F + ξ · | μ | · σ 1 ( ρ ) 1 / 2 · r · C ˜ ϵ C ϵ 1 1 δ 2 r 1 + δ 2 r + ϵ k + 1 min R O U 0 U R F + O ( μ ) ,
where in the equality, with slight abuse of notation, we absorbed 2 factor to C ϵ , and similarly absorbed the last term such that C ˜ ϵ = C ϵ · 1 ( ρ ( A ) + ϵ ) k + 1 1 ( ρ ( A ) + ϵ ) . This concludes the proof for Theorem 1.

Supporting Lemmata

In this subsection, we present a series of lemmata, used for the proof of Theorem 1.
Lemma A4. 
Let U C d × r and U C d × r , such that U U R F σ r ( ρ ) 1 / 2 10 3 κ τ ( ρ ) for some R O , where ρ = U U , κ : = 1 + δ 2 r 1 δ 2 r > 1 , for δ 2 r 1 10 , and τ ( ρ ) : = σ 1 ( ρ ) σ r ( ρ ) > 1 . Then:
σ 1 ( ρ ) 1 / 2 1 1 10 3 σ 1 ( U ) σ 1 ( ρ ) 1 / 2 1 + 1 10 3 σ r ( ρ ) 1 / 2 1 1 10 3 σ r ( U ) σ r ( ρ ) 1 / 2 1 + 1 10 3
Proof. 
By the fact · 2 · F and using Weyl’s inequality for perturbation of singular values [106] (Theorem 3.3.16), we have:
σ i ( U ) σ i ( U ) σ r ( ρ ) 1 / 2 10 3 κ τ ( ρ ) σ r ( ρ ) 1 / 2 10 3 , 1 i r .
Then,
σ r ( ρ ) 1 / 2 10 3 σ 1 ( U ) σ 1 ( U ) σ r ( ρ ) 1 / 2 10 3 σ 1 ( ρ ) 1 / 2 σ r ( ρ ) 1 / 2 10 3 σ 1 ( U ) σ 1 ( ρ ) 1 / 2 + σ r ( ρ ) 1 / 2 10 3 σ 1 ( ρ ) 1 / 2 1 1 10 3 σ 1 ( U ) σ 1 ( ρ ) 1 / 2 1 + 1 10 3 .
Similarly:
σ r ( ρ ) 1 / 2 10 3 σ r ( U ) σ r ( U ) σ r ( ρ ) 1 / 2 10 3 σ r ( ρ ) 1 / 2 σ r ( ρ ) 1 / 2 10 3 σ r ( U ) σ r ( ρ ) 1 / 2 + σ r ( ρ ) 1 / 2 10 3 σ r ( ρ ) 1 / 2 1 1 10 3 σ r ( U ) σ r ( ρ ) 1 / 2 1 + 1 10 3 .
In the above, we used the fact that σ i ( U ) = σ i ( ρ ) 1 / 2 , for all i, and the fact that σ i ( ρ ) 1 / 2 σ j ( ρ ) 1 / 2 , for i j . □
Lemma A5. 
Let U C d × r , U C d × r , and U C d × r , such that min R O U U R F σ r ( ρ ) 1 / 2 10 3 κ τ ( ρ ) and min R O U U R F σ r ( ρ ) 1 / 2 10 3 κ τ ( ρ ) , where ρ = U U , and κ : = 1 + δ 2 r 1 δ 2 r > 1 , for δ 2 r 1 10 , and τ ( ρ ) : = σ 1 ( ρ ) σ r ( ρ ) > 1 . Set the momentum parameter as μ = σ r ( ρ ) 1 / 2 10 3 κ τ ( ρ ) · ε 4 · σ 1 ( ρ ) 1 / 2 · r , for ε ( 0 , 1 ) user-defined. Then,
Z U R Z F 3 2 + 2 | μ | · σ r ( ρ ) 1 / 2 10 3 κ τ ( ρ ) .
Proof. 
Let R U arg min R O U U F and R U arg min R O U U R F . By the definition of the distance function:
Z U R Z F = min R O Z U R F = min R O U + μ ( U U ) U R F = min R O U + μ ( U U ) ( 1 μ + μ ) U R F | 1 + μ | · U U * R U F + | μ | · | | U U * R U | | F = | 1 + μ | · U U * R U F + | μ | · U U * R U U * R U + U * R U F = | 1 + μ | · U U * R U F + | μ | · ( U U * R U ) + U * ( R U R U ) F | 1 + μ | · min R O U U R F + | μ | · min R O U U R F + | μ | · U ( R U R U ) F | 1 + μ | · min R O U U R F + | μ | · min R O U U R F + 2 | μ | · σ 1 ( ρ ) 1 / 2 r ( i ) 3 2 + 2 | μ | · σ r ( ρ ) 1 / 2 10 3 κ τ ( ρ )
where ( i ) is due to the fact that μ σ r ( ρ ) 1 / 2 10 3 κ τ ( ρ ) · 1 4 · σ 1 ( ρ ) 1 / 2 · r . We keep μ in the expression, but we use it for clarity for the rest of the proof. □
Corollary A1. 
Let Z C d × r and U C d × r , such that Z U R F 3 2 + 2 | μ | · σ r ( ρ ) 1 / 2 10 3 κ τ ( ρ ) for some R O , and ρ = U U . Then:
σ 1 ( ρ ) 1 / 2 1 3 2 + 2 | μ | 1 10 3 σ 1 ( Z ) σ 1 ( ρ ) 1 / 2 1 + 3 2 + 2 | μ | 1 10 3 σ r ( ρ ) 1 / 2 1 3 2 + 2 | μ | 1 10 3 σ r ( Z ) σ r ( ρ ) 1 / 2 1 + 3 2 + 2 | μ | 1 10 3 .
Given that μ = σ r ( ρ ) 1 / 2 10 3 κ τ ( ρ ) · ε 4 · σ 1 ( ρ ) 1 / 2 · r 1 10 3 , we get:
0.998 · σ 1 ( ρ ) 1 / 2 σ 1 ( Z ) 1.0015 · σ 1 ( ρ ) 1 / 2 0.998 · σ r ( ρ ) 1 / 2 σ r ( Z ) 1.0015 · σ r ( ρ ) 1 / 2 .
Proof. 
The proof follows similar motions as in Lemma A4. □
Corollary A2. 
Under the same assumptions of Lemma A4 and Corollary A1, and given the assumptions on μ, we have:
99 100 · ρ 2 Z Z 2 101 100 · ρ 2 99 100 · ρ 2 Z 0 Z 0 2 101 100 · ρ 2 a n d 99 101 · Z 0 Z 0 2 Z Z 2 101 99 · Z 0 Z 0 2
Proof. 
The proof is easily derived based on the quantities from Lemma A4 and Corollary A1. □
Corollary A3. 
Let Z C d × r and U C d × r , such that Z U R F 3 2 + 2 | μ | · σ r ( ρ ) 1 / 2 10 3 κ τ ( ρ ) for some R O , and ρ = U U . Define τ ( W ) = σ 1 ( W ) σ r ( W ) . Then:
τ ( Z Z ) β 2 τ ( ρ ) ,
where β : = 1 + 3 2 + 2 | μ | · 1 10 3 1 3 2 + 2 | μ | · 1 10 3 > 1 . for μ σ r ( ρ ) 1 / 2 10 3 κ τ ( ρ ) · 1 4 · σ 1 ( ρ ) 1 / 2 · r .
Proof. 
The proof uses the definition of the condition number τ ( · ) and the results from Lemma A4 and and Corollary A1. □
Lemma A6. 
Consider the following three step sizes:
η = 1 4 ( 1 + δ 2 r ) Z 0 Z 0 2 + A A ( Z 0 Z 0 ) y 2 η ^ = 1 4 ( 1 + δ 2 r ) Z Z 2 + A A ( Z Z ) y Q Z Q Z 2 η = 1 4 ( 1 + δ 2 r ) ρ 2 + A A ( ρ ) y 2 .
Here, Z 0 C d × r is the initial point, Z C d × r is the current point, ρ C d × d is the optimal solution, and Q Z denotes a basis of the column space of Z. Then, under the assumptions that min R O U U R F σ r ( ρ ) 1 / 2 10 3 κ τ ( ρ ) , and min R O Z U R F 3 2 + 2 | μ | · σ r ( ρ ) 1 / 2 10 3 κ τ ( ρ ) , and assuming μ = σ r ( ρ ) 1 / 2 10 3 κ τ ( ρ ) · ε 4 · σ 1 ( ρ ) 1 / 2 · r , for the user-defined parameter ε ( 0 , 1 ) , we have:
10 9 η η ^ 10 10.5 η , a n d 100 102 η η 102 100 η
Proof. 
The assumptions of the lemma are identical to that of Corollary A2. Thus, we have: 99 100 · U 2 2 Z 2 2 101 100 · U 2 2 , 99 100 · U 2 2 Z 0 2 2 101 100 · U 2 2 , and 99 101 · Z 0 2 2 Z 2 2 101 99 · Z 0 2 2 . We focus on the inequality η ^ 10 10.5 η . Observe that:
A A ( Z Z ) y Q Z Q Z 2 A A ( Z Z ) y 2 = A A ( Z Z ) y A A ( Z 0 Z 0 ) y + A A ( Z 0 Z 0 ) y 2 ( i ) ( 1 + δ 2 r ) Z Z Z 0 Z 0 F + A A ( Z 0 Z 0 ) y 2 ( 1 + δ 2 r ) Z Z U U F + ( 1 + δ 2 r ) Z 0 Z 0 U U F + A A ( Z 0 Z 0 ) y 2
where ( i ) is due to smoothness via RIP constants of the objective and the fact · 2 · F . For the first two terms on the right-hand side, where R Z is the minimizing rotation matrix for Z, we obtain:
Z Z U U F = Z Z U R Z Z + U R Z Z U U F = ( Z U R Z ) Z + U R Z ( Z U R Z ) F Z 2 · Z U R Z F + U 2 · Z U R Z F Z 2 + U 2 · Z U R Z F ( i ) 101 99 + 100 99 Z 0 2 · Z U R Z F ( i i ) 101 99 + 100 99 Z 0 2 · 0.001 σ r ( ρ ) 1 / 2 101 99 + 100 99 · 0.001 · 100 99 · Z 0 2 2
where ( i ) is due to the relation of Z 2 and U 2 derived above, ( i i ) is due to Lemma A5. Similarly:
Z 0 Z 0 U U F 101 99 + 100 99 · 0.001 · 100 99 · Z 0 2 2
Using these above, we obtain:
A A ( Z Z ) y Q Z Q Z 2 4.1 ( 1 + δ 2 r ) 10 3 Z 0 Z 0 2 + A A ( Z 0 Z 0 ) y 2
Thus:
η ^ = 1 4 ( 1 + δ 2 r ) Z Z 2 + A A ( Z Z ) y Q Z Q Z 2 1 4 ( 1 + δ 2 r ) 101 99 Z 0 Z 0 2 + + 4.1 ( 1 + δ 2 r ) 10 3 Z 0 Z 0 2 + A A ( Z 0 Z 0 ) y 2 1 4 10.5 10 · ( 1 + δ 2 r ) Z 0 Z 0 2 + A A ( Z 0 Z 0 ) y 2 10 10.5 η
Similarly, one gets η ^ 10 9 η .
For the relation between η and η , we will prove here the lower bound; similar motions lead to the upper bound also. By definition, and using the relations in Corollary A2, we get:
η = 1 4 ( 1 + δ 2 r ) Z 0 Z 0 2 + A A ( Z 0 Z 0 ) y 2 1 4 ( 1 + δ 2 r ) 101 100 ρ 2 + A A ( Z 0 Z 0 ) y 2
For the gradient term, we observe:
A A ( Z 0 Z 0 ) y 2 A A ( Z 0 Z 0 ) y A A ( ρ ) y 2 + A A ( ρ ) y 2 = ( i ) A A ( Z 0 Z 0 ) y A A ( ρ ) y 2 ( i i ) ( 1 + δ 2 r ) Z 0 Z 0 U U F ( i i i ) ( 1 + δ 2 r ) Z 0 2 + U 2 · Z U R Z F ( i v ) ( 1 + δ 2 r ) 101 100 + 1 U 2 · 0.001 · U 2 2 0.002 · ( 1 + δ 2 r ) ρ 2
where ( i ) is due to A A ( ρ ) y 2 = 0 , ( i i ) is due to the restricted smoothness assumption and the RIP, ( i i i ) is due to the bounds above on Z 0 Z 0 U U F , ( i v ) is due to the bounds on Z 0 2 , w.r.t. U 2 , as well as the bound on Z U R F .
Thus, in the inequality above, we get:
η 1 4 ( 1 + δ 2 r ) 101 100 ρ 2 + A A ( Z 0 Z 0 ) y 2 1 4 ( 1 + δ 2 r ) 101 100 ρ 2 + 0.001 · ( 1 + δ 2 r ) ρ 2 + A A ( ρ ) y 2 1 4 ( 1 + δ 2 r ) 102 100 ρ 2 + A A ( ρ ) y 2 100 102 η
Similarly, one can show that 102 100 η η . □
Lemma A7. 
Let U C d × r , U C d × r , and U C d × r , such that min R O U U R F σ r ( ρ ) 1 / 2 10 3 κ τ ( ρ ) and min R O U U R F σ r ( ρ ) 1 / 2 10 3 κ τ ( ρ ) , where ρ = U U , and κ : = 1 + δ 2 r 1 δ 2 r > 1 , for δ 2 r 1 10 , and τ ( ρ ) : = σ 1 ( ρ ) σ r ( ρ ) > 1 . By Lemma A5, the above imply also that: Z U R Z F 3 2 + 2 | μ | · σ r ( ρ ) 1 / 2 10 3 κ τ ( ρ ) . Then, under RIP assumptions of the mapping A , we have:
A ( A ( Z Z ) y ) , ( Z U R Z ) ( Z U R Z ) θ σ r ( ρ ) · Z U R Z F 2 + 10.1 100 β 2 · η ^ · ( 1 + 2 | μ | ) 2 1 1 + 2 | μ | 1 200 2 · A ( A ( Z Z ) y ) · Z F 2
where
θ = ( 1 δ 2 r ) 1 + ( 1 + 2 | μ | ) 1 200 2 10 3 + ( 1 + δ 2 r ) 2 + 1 + 2 | μ | · 1 200 1 + 2 | μ | · 1 200 ,
and η ^ = 1 4 ( ( 1 + δ r ) Z Z 2 + A A ( Z Z ) y Q Z Q Z 2 ) .
Proof. 
First, denote Δ : = Z U R Z . Then:
A ( A ( Z Z ) y ) , ( Z U R Z ) ( Z U R Z ) = ( i ) A ( A ( Z Z ) y ) · Q Δ Q Δ , Δ Z Δ Z Tr A ( A ( Z Z ) y ) · Q Δ Q Δ · Δ Z Δ Z ( i i ) A ( A ( Z Z ) y ) · Q Δ Q Δ 2 · Tr ( Δ Z Δ Z ) ( i i i ) A ( A ( Z Z ) y ) · Q Z Q Z 2 + A ( A ( Z Z ) y ) · Q U Q U 2 Z U R Z F 2
Note that ( i ) follows from the fact Δ Z = Δ Z Q Δ Q Δ , for a matrix Q that spans the row space of Δ Z , and ( i i ) follows from | Tr ( A B ) | A 2 Tr ( B ) , for PSD matrix B (Von Neumann’s trace inequality [107]). For the transformation in ( i i i ) , we use that fact that the row space of Δ Z , SPAN ( Δ Z ) , is a subset of SPAN ( Z U ) , as Δ Z is a linear combination of U and U .
To bound the first term in Equation (A12), we observe:
A ( A ( Z Z ) y ) · Q Z Q Z 2 · Z U R Z F 2 = ( i ) η ^ · 4 ( ( 1 + δ 2 r ) Z Z 2 + A ( A ( Z Z ) y ) · Q Z Q Z 2 ) · A ( A ( Z Z ) y ) · Q Z Q Z 2 · Z U R Z F 2 = 4 η ^ ( 1 + δ 2 r ) Z Z 2 A ( A ( Z Z ) y ) · Q Z Q Z 2 · Z U R Z F 2 : = A + 4 η ^ A ( A ( Z Z ) y ) · Q Z Q Z 2 2 · Z U R Z F 2
where ( i ) is due to the definition of η ^ .
To bound term A, we observe that A ( A ( Z Z ) y ) · Q Z Q Z 2 ( 1 δ 2 r ) σ r ( Z Z ) 10 3 or A ( A ( Z Z ) y ) · Q Z Q Z 2 ( 1 δ 2 r ) σ r ( Z Z ) 10 3 . This results into bounding A as follows:
4 η ^ ( 1 + δ 2 r ) Z Z 2 A ( A ( Z Z ) y ) · Q Z Q Z 2 · Z U R Z F 2 max { 4 · η ^ · ( 1 + δ 2 r ) Z Z 2 · ( 1 δ 2 r ) σ r ( Z Z ) 10 3 · Z U R Z F 2 , η ^ · 4 · 10 3 κ τ ( Z Z ) A ( A ( Z Z ) y ) · Q Z Q Z 2 2 · Z U R Z F 2 } 4 · η ^ · ( 1 δ 2 r 2 ) Z Z 2 · σ r ( Z Z ) 10 3 · Z U R Z F 2 + η ^ · 4 · 10 3 κ τ ( Z Z ) A ( A ( Z Z ) y ) · Q Z Q Z 2 2 · Z U R Z F 2 .
Combining the above inequalities, we obtain:
A ( A ( Z Z ) y ) · Q Z Q Z 2 · Z U R Z F 2 ( i ) ( 1 δ 2 r ) σ r ( Z Z ) 10 3 · Z U R Z F 2 + ( 10 3 κ τ ( Z Z ) + 1 ) · 4 · η ^ A ( A ( Z Z ) y ) · Q Z Q Z 2 2 · Z U R Z F 2 ( i i ) ( 1 δ 2 r ) σ r ( Z Z ) 10 3 · Z U R Z F 2 + ( 10 3 β 2 κ τ ( ρ ) + 1 ) · 4 · η ^ A ( A ( Z Z ) y ) · Q Z Q Z 2 2 · 3 2 + 2 | μ | 2 κ τ ( ρ ) 1 10 6 σ r ( ρ ) ( i i i ) ( 1 δ 2 r ) σ r ( Z Z ) 10 3 · Z U R Z F 2 + 4 · 1001 β 2 · η ^ · A ( A ( Z Z ) y ) · Q Z Q Z 2 2 · 3 2 + 2 | μ | 2 10 6 1 3 2 + 2 | μ | 1 10 3 2 σ r ( Z Z ) ( i v ) ( 1 δ 2 r ) σ r ( Z Z ) 10 3 · Z U R Z F 2 + 4 · 1001 β 2 · η ^ · 3 2 + 2 | μ | 2 10 6 1 3 2 + 2 | μ | 1 10 3 2 · A ( A ( Z Z ) y ) · Z F 2 ( v ) ( 1 δ 2 r ) 1 + ( 3 2 + 2 | μ | ) 1 10 3 2 σ r ( ρ ) 10 3 · Z U R Z F 2 + 1 200 β 2 · η ^ · 3 2 + 2 | μ | 2 1 3 2 + 2 | μ | 1 10 3 2 · A ( A ( Z Z ) y ) · Z F 2
where ( i ) follows from η ^ 1 4 ( 1 + δ 2 r ) Z Z 2 , ( i i ) is due to Corollary A3, bounding Z U R Z F ρ σ r ( ρ ) 1 / 2 , where ρ : = 3 2 + 2 | μ | 1 10 3 κ τ ( ρ ) by Lemma A5, ( i i i ) is due to ( 10 3 β 2 κ τ ( ρ ) + 1 ) 1001 β 2 κ τ ( ρ ) , and by Corollary A1, ( i v ) is due to the fact
σ r ( Z Z ) A ( A ( Z Z ) y ) · Q Z Q Z 2 2 A ( A ( Z Z ) y ) Z F 2 ,
and ( v ) is due to Corollary A1.
Next, we bound the second term in Equation (A12):
A ( A ( Z Z ) y ) · Q U Q U 2 · Z U R Z F 2 ( i ) A ( A ( Z Z ) y ) A ( A ( ρ ) y ) 2 · Z U R Z F 2 ( i i ) ( 1 + δ 2 r ) · Z Z U U F · Z U R Z F 2 ( i i i ) ( 1 + δ 2 r ) ( 2 + ρ ) · ρ · σ 1 ( U ) · σ r ( U ) · Z U R Z F 2 ( i v ) ( 1 + δ 2 r ) ( 2 + ρ ) 3 2 + 2 | μ | · 1 10 3 σ r ( ρ ) · Z U R Z F 2 ( 1 + δ 2 r ) 2 + 3 2 + 2 | μ | · 1 10 3 3 2 + 2 | μ | · 1 10 3 σ r ( ρ ) · Z U R Z F 2 ,
where ( i ) follows from A ( A ( Z Z ) y ) · Q U Q U 2 A ( A ( Z Z ) y ) 2 and A ( A ( ρ ) y ) = 0 , ( i i ) is due to smoothness of f and the RIP constants, ( i i i ) follows from [25] (Lemma 18), for ρ = 3 2 + 2 | μ | · 1 10 3 κ τ ( ρ ) , ( i v ) follows from substituting ρ above, and observing that τ ( ρ ) = σ 1 ( U ) 2 / σ r ( U ) 2 > 1 and κ = ( 1 + δ 2 r ) / ( 1 δ 2 r ) > 1 .
Combining the above we get:
A ( A ( Z Z ) y ) , ( Z U R Z ) ( Z U R Z ) θ σ r ( ρ ) · Z U R Z F 2 + 1 200 β 2 · η ^ · 3 2 + 2 | μ | 2 1 3 2 + 2 | μ | 1 10 3 2 · A ( A ( Z Z ) y ) · Z F 2
where θ = ( 1 δ 2 r ) 1 + 3 2 + 2 | μ | 1 10 3 2 10 3 + ( 1 + δ 2 r ) 2 + 3 2 + 2 | μ | · 1 10 3 3 2 + 2 | μ | · 1 10 3 . □
Lemma A8. 
Under identical assumptions with Lemma A7, the following inequality holds:
A ( A ( Z Z ) y ) , Z Z U U 1.1172 η A ( A ( Z Z ) y ) Z F 2 + 1 δ 2 r 2 U U Z Z F 2
Proof. 
By smoothness assumption of the objective, based on the RIP assumption, we have:
1 2 A ( Z Z ) y 2 2 1 2 A ( U + U + ) y 2 2 A ( A ( Z Z ) y ) , U + U + Z Z 1 + δ 2 r 2 U + U + Z Z F 2 1 2 A ( Z Z ) y 2 2 1 2 A ( U U ) y 2 2 A ( A ( Z Z ) y ) , U + U + Z Z 1 + δ 2 r 2 U + U + Z Z F 2
due to the optimality A ( U U ) y 2 2 = 0 A ( V V ) y 2 2 , for any V C d × r . Also, by the restricted strong convexity with RIP, we get:
1 2 A ( U U ) y 2 2 1 2 A ( Z Z ) y 2 2 + A ( A ( Z Z ) y ) , U U Z Z + 1 δ 2 r 2 U U Z Z F 2
Adding the two inequalities, we obtain:
A ( A ( Z Z ) y ) , Z Z U U A ( A ( Z Z ) y ) , Z Z U + U + 1 + δ 2 r 2 U + U + Z Z F 2 + 1 δ 2 r 2