Almost Optimality of the Orthogonal Super Greedy Algorithm for µ-Coherent Dictionaries

Abstract: We study the approximation capability of the orthogonal super greedy algorithm (OSGA) with respect to µ-coherent dictionaries in Hilbert spaces. We establish Lebesgue-type inequalities for the OSGA, which show that the OSGA provides an almost optimal approximation on the first [1/(18µs)] steps. Moreover, we improve the asymptotic constant in the Lebesgue-type inequality for the OGA obtained by E. D. Livshitz.


Introduction
Approximation by sparse linear combinations of elements from a fixed redundant system continues to develop actively, driven not only by theoretical interest but also by applications in areas such as signal processing and machine learning, cf. [1][2][3][4][5][6][7]. This type of approximation is called highly nonlinear approximation. Greedy-type algorithms have been used as a tool for generating such approximations. Among them, the orthogonal greedy algorithm (OGA) has been widely used in practice. In fact, the OGA is regarded as the most powerful algorithm for solving the problem of approximation with respect to redundant systems, cf. [8][9][10].
We recall some notation and definitions from the theory of greedy algorithms. Let H be a Hilbert space with an inner product ⟨·, ·⟩ and the norm ‖x‖ := ⟨x, x⟩^{1/2}. A set D ⊂ H of unit-norm elements whose span is dense in H is called a dictionary. The OGA is defined inductively: set f_0 := f; at step m + 1, choose g_{m+1} ∈ D maximizing |⟨f_m, g⟩| over g ∈ D, and put G^{OGA}_{m+1}(f, D) := P_{span{g_1, g_2, ···, g_{m+1}}}(f) and f_{m+1} := f − G^{OGA}_{m+1}(f, D), where P_{span{g_1, g_2, ···, g_m}} is the operator of the orthogonal projection onto span{g_1, g_2, ···, g_m}.
In [11], Liu and Temlyakov proposed the orthogonal super greedy algorithm (OSGA). The OSGA selects more than one element from a dictionary in each iteration step and hence reduces the computational burden of the conventional OGA. Therefore, the OSGA is more efficient than the OGA from the viewpoint of the computational complexity.
Given a natural number s, the OSGA(s) is defined inductively: set f_0 := f. At the m-th iteration: (1) choose g_{(m−1)s+1}, ···, g_{ms} ∈ D attaining the s largest values of |⟨f_{m−1}, g⟩| over g ∈ D; (2) let H_m := H_m(f) := span{g_1, g_2, ···, g_{ms}} and let P_{H_m} denote the operator of the orthogonal projection onto H_m, and define G^{OSGA(s)}_m(f, D) := P_{H_m}(f) and f_m := f − G^{OSGA(s)}_m(f, D). Note that, in the case s = 1, OSGA(s) coincides with OGA.
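To make the procedure concrete, a minimal finite-dimensional sketch of OSGA(s) is given below (our own illustration, not code from [11]; the dictionary is assumed to be given as the unit-norm columns of a matrix `D`, and the orthogonal projection is computed via least squares):

```python
import numpy as np

def osga(f, D, s, steps):
    """Run `steps` iterations of OSGA(s).

    f     : target vector in R^d
    D     : d x N matrix whose unit-norm columns form the dictionary
    s     : number of dictionary elements selected per iteration
    Returns the residual f_m and the list of selected column indices.
    """
    residual = f.copy()
    selected = []
    for _ in range(steps):
        # Greedy step: pick the s columns with the largest |<residual, g>|.
        inner = np.abs(D.T @ residual)
        inner[selected] = -np.inf            # do not reselect earlier columns
        selected.extend(np.argsort(inner)[-s:].tolist())
        # Orthogonal projection of f onto the span of all selected columns,
        # computed via least squares.
        coeffs, *_ = np.linalg.lstsq(D[:, selected], f, rcond=None)
        residual = f - D[:, selected] @ coeffs
    return residual, selected
```

For s = 1 the selection rule reduces to that of the OGA.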
In this paper, we study the approximation capability of the OSGA with respect to µ-coherent dictionaries in Hilbert spaces. We denote by µ = µ(D) := sup_{g≠h, g,h∈D} |⟨g, h⟩| the coherence of a dictionary. The coherence µ is a blunt instrument for measuring the redundancy of a dictionary. It is clear that if D is an orthonormal basis, then µ(D) = 0. The smaller µ(D) is, the more closely D resembles an orthonormal basis. We study dictionaries with small values of coherence µ(D) > 0 and call them µ-coherent dictionaries.
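In finite dimensions, where the dictionary elements are the columns of a matrix, µ(D) can be computed directly from the Gram matrix (a small sketch of our own; the helper name `coherence` is ours):

```python
import numpy as np

def coherence(D):
    """Coherence mu(D) = sup_{g != h} |<g, h>| for the columns of D."""
    D = D / np.linalg.norm(D, axis=0)   # normalize columns to unit norm
    G = np.abs(D.T @ D)                 # pairwise |<g, h>| values
    np.fill_diagonal(G, 0.0)            # discard the diagonal <g, g> = 1
    return G.max()
```

For an orthonormal basis this returns 0, in line with µ(D) = 0 above.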
In [11], the authors found that this reduction of the computational burden does not degrade the approximation capability of the OSGA if f belongs to the closure of the convex hull of the symmetrized dictionary D^± := {±g, g ∈ D}, which is denoted by A_1(D).

Theorem 1. Let D be a dictionary with coherence parameter µ := µ(D). Then, for s ≤ (2µ)^{−1}, the algorithm OSGA(s) provides an approximation of f ∈ A_1(D) with the following error bound: It seems that a dimension-independent convergence rate was deduced, but the condition that the target element belongs to A_1(D) becomes increasingly stringent as the number of elements in D grows, cf. [2].
Fang, Lin, and Xu [12] studied the behavior of the OSGA for f ∈ H. They defined L_1 := {f : f = ∑_{g∈D} a_g g} and ‖f‖_{L_1} := inf{∑_{g∈D} |a_g| : f = ∑_{g∈D} a_g g} for f ∈ L_1, and obtained the following theorem.
Theorem 2. Let D be a dictionary with coherence µ. Then, for all f ∈ H, h ∈ L_1 and arbitrary s ≤ (2µ)^{−1} + 1, the OSGA(s) provides an approximation of f with the error bound: This shows that, for µ-coherent dictionaries, the reduction of the computational burden does not degrade the approximation capability. Moreover, if µ > 1/2, then s ≤ (2µ)^{−1} + 1 < 2 forces s = 1, so OSGA coincides with OGA.
Let Σ_m denote the collection of elements of H that can be expressed as a linear combination of at most m elements of the dictionary D, namely Σ_m := {∑_{i=1}^{m} c_i g_i : g_i ∈ D, c_i ∈ R}. For an element f ∈ H, we define its best m-term approximation error by σ_m(f) := σ_m(f, D) := inf_{h∈Σ_m} ‖f − h‖. An inequality connecting the error of greedy approximation with the error of best m-term approximation is called a Lebesgue-type inequality, cf. [13][14][15]. In this paper, we establish Lebesgue-type inequalities for the OSGA with respect to µ-coherent dictionaries.
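For an orthonormal basis the best m-term approximation is explicit: keep the m coefficients of largest absolute value. The following small example of σ_m is our own illustration, not part of the paper:

```python
import numpy as np

def sigma_m(f, m):
    """Best m-term approximation error of f in an orthonormal basis:
    the error of dropping all but the m largest coefficients."""
    if m <= 0:
        return np.linalg.norm(f)
    c = np.sort(np.abs(f))            # coefficients sorted ascending
    return np.sqrt(np.sum(c[:-m] ** 2))

# Example: f with coefficients (3, 2, 1) in an orthonormal basis.
f = np.array([3.0, 2.0, 1.0])
```

Here σ_1(f) = √(2² + 1²) = √5 and σ_3(f) = 0, since f itself is 3-sparse.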
We first recall some results on the efficiency of the OGA with respect to µ-coherent dictionaries. These results relate the error of the OGA after A(m) steps to the error of the best m-term approximation with an extra multiplier:

‖f_{A(m)}‖ ≤ B(m) σ_m(f) for all m ≤ C(µ)µ^{−1}, (1)

where A(m) ∈ N and B(m), C(µ) ∈ R. Gilbert, Muthukrishnan, and Strauss [16] gave the first Lebesgue-type inequality for the OGA, with a factor B(m) growing like √m. The constant in their inequality was improved by Tropp in [17]. Donoho, Elad, and Temlyakov [18] dramatically improved the factor in front of σ_m(f) to the constant 24, which is not the best. Many researchers have sought to improve the factor B(m); Temlyakov and Zheltov further improved the inequality in [4]. Livshitz [19] took the parameters A(m) := 2m, B(m) := 2.7, C(µ) := 1/20 in (1) and obtained the following profound result.

Theorem 3. For every µ-coherent dictionary D and any f ∈ H, the OGA applied to f provides ‖f_{2m}‖ ≤ 2.7 σ_m(f, D) for all m ≤ 1/(20µ).

By using the same method as in [19], Ye and Wei [20] slightly improved the constant 2.7. Based on the above works, we give an error bound of the form (1) for the OSGA with respect to dictionaries with small but non-vanishing coherence.

Theorem 4. Let D be a dictionary with coherence µ. Then, for any f ∈ H and any ε > 0, the OSGA(s) applied to f provides

‖f_{Am}‖ ≤ 2.24(1 + ε) σ_m(f) (2)

for all 1 ≤ m ≤ 1/(18µs), 3/100 ≤ µ ≤ 1/18, and an absolute constant A ≥ 2.
1. We remark that the values of µ and A for which (2) holds are coupled. For example, it is possible to obtain a smaller value of µ at the price of a larger value of A. Moreover, for sufficiently large A, µ can be arbitrarily close to zero.
2. Our results improve Theorem 3 only in the asymptotic constant and not in the rate. Under the conditions of Theorem 4, for s = 1, taking (A, ε, µ) = (2, 0.1, 0.03), we obtain ‖f_{2m}‖ ≤ 2.24 × 1.1 σ_m(f, D) ≤ 2.5 σ_m(f, D). Compared with Theorem 3, the constant we obtain is better.
3. The specific constant 2.24 in (2) is not the best. By adjusting the parameters A and µ, we can obtain a more general estimate, in which the constant C(A) in front of σ_m(f) and the parameter α_A governing the admissible range of m are interdependent. Thus, Theorem 4 shows that OSGA(s) achieves an almost optimal approximation on the first [1/(18µs)] steps for dictionaries with small but non-vanishing coherence.
The paper is organized as follows. In Section 2, we establish several preliminary lemmas. In Section 3, for a certain closed subspace L of H defined there, we first estimate ‖P_L^⊥(f_n)‖ in different situations based on the lemmas of Section 2, then estimate ‖P_L(f_n)‖, and finally combine the two estimates to give a detailed proof of Theorem 4. In Section 4, we test the performance of the OSGA in a finite-dimensional Euclidean space. In Section 5, we make some concluding remarks.

Preliminary Lemmas
In this section, we introduce several quantities and discuss their properties, which are important for the proof of our main result. By the condition of Theorem 4, we have the inequality (3). We establish three preliminary lemmas.
Proof. For any g_l ∈ D, 1 ≤ l ≤ n, we have For 1 ≤ n ≤ Am, we set Assume that x_{i,n}, n ≥ 1, 1 ≤ i ≤ ns, satisfy the equation Next, we give the estimates of {x_{i,n}}_{i=1}^{ns} and {d_n} for n ≥ 1 in turn. Applying Lemma 1, we have the following estimates for x_{i,n}.
Proof. For 1 ≤ n ≤ Am, according to the definition of d_n, we have We continue by estimating the two summands on the right-hand side of the above inequality. For the first summand, the greedy step implies For the second summand, by Lemma 1, we have Combining inequalities (9)–(11) with Lemma 2, we conclude that Thus, for any n and 1 ≤ l ≤ n ≤ Am + 1, we have where k_2 := exp(Amsµ/(1 + Amsµ)).

Proof of Theorem 4
Based on the above preliminary lemmas, we prove Theorem 4 step by step. We first introduce some notation. Define Let a_{j,n} ∈ R, 1 ≤ j ≤ m, 0 ≤ n ≤ Am, satisfy the following equations Thus, for f_n, 0 ≤ n ≤ Am, we have To obtain an upper bound of ‖f_n‖, it suffices to estimate ‖ξ_n‖ and ‖P_L(f_n)‖. By the definitions of the sets T_1, T_2 and I_n in OSGA, we first estimate ‖ξ_n‖ according to whether the intersection of T_2 and I_n is empty.

Proof. Let Λ_n := ∪_{i=1}^{n} I_i, T_2^n := T_2 ∩ Λ_n, t_n := |T_2^n|. By Lemma 3, for 1 ≤ l ≤ n ≤ Am, Then, we have so we can obtain that Since I_n ∩ T_2 = ∅, we obtain t_n = t_{n−1}. We define Note that By the definitions of L, T_1, T_2, Λ_n and the expression (14), we have P_L^⊥(P_{H_n}(f_{n−1})) = P_L^⊥(h). Then, we obtain To obtain the final result, it suffices to estimate the upper bounds of |⟨ξ_{n−1}, h⟩| and ‖P_L^⊥(h)‖². By (12) and (14), we have where we have used the fact that ⟨f_{n−1}, h⟩ = 0.
On the one hand, for any 1 ≤ l ≤ m and n satisfying T_2 ∩ I_n = ∅, we obtain Thus, by Lemma 1 and inequality (17), we obtain On the other hand, by Lemma 2, we have, for 1 ≤ j ≤ m, Thus, substituting (18) and (19) into (16), and then combining it with (13), we get the estimate Finally, we estimate ‖P_L^⊥(h)‖². Note that By using (13), we have Combining (15) and (20) with (21), we obtain the desired bound. Theorem 5 gives the estimate of ‖ξ_n‖ in the situation I_n ∩ T_2 = ∅. The following theorem deals with the situation I_n ∩ T_2 ≠ ∅.

Theorem 6. Let n satisfy 1 ≤ n ≤ Am and I_n ∩ T_2 ≠ ∅. Then,

Proof. Since we set ξ_n = P_L^⊥(f_{n−1} − ∑_{i∈I_n} x_{i,n} g_i), h = ∑_{i∈T_2^{n−1}} x_{i,n} g_i and write ξ_n as According to the following inequality, we need to estimate ‖ξ_n‖², ⟨ξ_n, h⟩ and ‖h‖². We first estimate ‖h‖² by (23). Next, we continue to estimate ‖ξ_n‖². It is not difficult to see that ‖∑_{i∈I_n} x_{i,n} g_i‖². (24) Note that ∑_{i∈I_n} By (18), for any i ∈ T_2 ∩ I_n, we have Combining Lemma 2 with inequality (26), we obtain ∑_{i∈I_n} for 0 ≤ s ≤ 1/(18Amµ) ≤ 1/(9(1+m)Aµ), m ≥ 1 and A ≥ 2. For the last summand on the right-hand side of the inequality in (24), we have Thus, combining (27) with (28), for 3/100 ≤ µ ≤ 1/18, we have We next estimate |⟨ξ_n, h⟩|. Since we need to give the upper bounds of A and B. By (18) and (19), we have As for B, since for 1 ≤ j ≤ (n−1)s < i ≤ ns ≤ Ams, i, j ∈ T_2, P_L(g_i) = ∑_{l=1}^{m} c_l^i ψ_l, by Lemma 1, we know that and Combining (32) with (33), we have Using Lemma 1 again, we obtain from (34) that Combining (22), (23) and (29) with (36), we have It remains to estimate ‖P_L(f_n)‖. We first recall a lemma proven by Fang, Lin and Xu in [12].

Lemma 4.
Assume that a dictionary D has coherence µ. Then, for any distinct g_i ∈ D and a_i ∈ R, i = 1, 2, ···, s, we have the inequalities

(1 − µ(s − 1)) ∑_{i=1}^{s} a_i² ≤ ‖∑_{i=1}^{s} a_i g_i‖² ≤ (1 + µ(s − 1)) ∑_{i=1}^{s} a_i².
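As a sanity check, the two-sided bound of Lemma 4 can be verified numerically on a random collection of unit vectors (a sketch of our own, with µ taken as the maximal pairwise inner product of the chosen elements):

```python
import numpy as np

rng = np.random.default_rng(0)
d, s = 50, 5

# s random unit-norm dictionary elements g_1, ..., g_s in R^d.
G = rng.standard_normal((d, s))
G /= np.linalg.norm(G, axis=0)
a = rng.standard_normal(s)

# Coherence of this small collection: maximal pairwise |<g_i, g_j>|.
M = np.abs(G.T @ G)
np.fill_diagonal(M, 0.0)
mu = M.max()

# Two-sided bound of Lemma 4 on ||sum_i a_i g_i||^2.
lhs = (1.0 - mu * (s - 1)) * np.sum(a ** 2)
mid = np.linalg.norm(G @ a) ** 2
rhs = (1.0 + mu * (s - 1)) * np.sum(a ** 2)
assert lhs <= mid <= rhs
```

The bound follows from expanding ‖∑ a_i g_i‖² and applying the Cauchy–Schwarz inequality to the cross terms, so the assertion holds for any choice of the vectors.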

Theorem 7.
For any 1 ≤ n ≤ Am, we have Proof. From Lemma 4, we know that From Lemmas 1 and 2, we have, for any 1 ≤ l ≤ n + 1, Thus, Combining (37) with (38), we have Next, using Theorems 5 and 6, we give the estimate of ‖ξ_n‖.
Theorem 8. For A ≥ 1 and any positive integer m, the following inequalities hold.
Proof. From (3), we have By using Theorems 5 and 6, we derive which is equivalent to Furthermore, we also have ‖ξ_{Am}‖ ≤ ‖ξ_0‖. Now, we can give the proof of our main result.
Proof of Theorem 4. Note that From Theorem 7 and Theorem 8, we obtain that Thus, we complete the proof of Theorem 4.

Simulation Results
It is known from Theorem 4 that if f ∈ Σ_m, then σ_m(f) = 0 and hence f = G_m(f). In this spirit, the OSGA can be used to recover sparse signals in compressed sensing, a relatively new field of signal processing. We remark that, in signal processing, the orthogonal super greedy algorithm (OSGA) is also known as orthogonal multi-matching pursuit (OMMP). For the reader's convenience, we use the term OMMP instead of OSGA in what follows.
In this section, we test the performance of the orthogonal multi-matching pursuit with parameter s (OMMP(s)). We consider the following model. Suppose that x ∈ R^N is an unknown N-dimensional signal that we wish to recover from the given data y = Φx, (40) where Φ ∈ R^{M×N} is a known measurement matrix with M ≪ N. Since M ≪ N, the column vectors of Φ are linearly dependent, and the collection of these columns can be viewed as a redundant dictionary.
For arbitrary x, y ∈ R^N, define ⟨x, y⟩ := ∑_{j=1}^{N} x_j y_j and ‖x‖ := ⟨x, x⟩^{1/2}, where x = (x_j)_{j=1}^{N} and y = (y_j)_{j=1}^{N}. Obviously, R^N is a Hilbert space with the inner product ⟨·, ·⟩.
A signal x ∈ R^N is said to be K-sparse if ‖x‖_0 := #supp(x) = #{i : x_i ≠ 0} ≤ K < N. We recover the support of a K-sparse signal via OMMP(s) under the model (40). It is well known that OMMP takes the following form; see, for instance, [3].
ORTHOGONAL MULTI-MATCHING PURSUIT (OMMP(s))
Input: measurement matrix Φ, data vector y, parameter s, and the stopping criterion.
Step 1 (initialization): set the index set Λ^0 := ∅, the residual r^0 := y, and l := 1.
Step 2 (identification): choose an index set I^l corresponding to the s largest values of |⟨r^{l−1}, φ_i⟩| over the columns φ_i of Φ, and set Λ^l := Λ^{l−1} ∪ I^l.
Step 3 (estimation): compute x^l := argmin_{z: supp(z) ⊆ Λ^l} ‖y − Φz‖ and update the residual r^l := y − Φx^l.
End if the stopping condition is achieved. Otherwise, set l := l + 1 and return to Step 2.
Output: If the algorithm stops at the k-th iteration, then output Λ^k and x_{Λ^k} = x^k.

In the experiment, we set the measurement matrix Φ to be a Gaussian matrix whose entries are drawn independently from the N(0, M^{−1}) distribution, with density function p(x) := (M/(2π))^{1/2} e^{−x²M/2}. We execute OMMP(s) with the data vector y = Φx and stop the algorithm when #Λ^l ≥ K. The mean square error (MSE) of the recovered signal is defined as follows:

Figure 2 describes the case of dimension N = 256. It displays the percentage (averaged over 100 input signals) of the support elements that are found correctly, as a function of M, with s = 3. If the percentage equals 100%, all the elements of the support are found, which means that the input signal is exactly recovered.
As expected, Figure 2 shows that when the sparsity level K increases, more measurements are necessary to guarantee signal recovery.
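A scaled-down version of this experiment can be scripted as follows (a sketch of our reading of the setup; the dimensions, the random seed, and the helper name `ommp` are ours, while the stopping rule #Λ^l ≥ K matches the description above):

```python
import numpy as np

def ommp(Phi, y, s, K):
    """OMMP(s): add s indices per iteration until at least K are selected."""
    support, residual = [], y.copy()
    while len(support) < K:
        corr = np.abs(Phi.T @ residual)
        corr[support] = -np.inf                     # skip chosen columns
        support.extend(np.argsort(corr)[-s:].tolist())
        # Least-squares fit on the current support and residual update.
        coeffs, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coeffs
    x_hat = np.zeros(Phi.shape[1])
    x_hat[support] = coeffs
    return x_hat, set(support)

rng = np.random.default_rng(1)
M, N, K, s = 128, 256, 10, 3
Phi = rng.normal(0.0, np.sqrt(1.0 / M), (M, N))     # entries ~ N(0, 1/M)

# A K-sparse test signal x and its measurements y = Phi x.
x = np.zeros(N)
true_support = rng.choice(N, K, replace=False)
x[true_support] = rng.standard_normal(K)
y = Phi @ x

x_hat, found = ommp(Phi, y, s, K)
recovered = len(found & set(true_support)) / K      # fraction of support found
mse = np.mean((x - x_hat) ** 2)
```

With M this large relative to K, OMMP typically recovers the full support; decreasing M reproduces the degradation with growing sparsity level seen in Figure 2.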

Concluding Remarks
This paper investigates the error behavior of the orthogonal super greedy algorithm (OSGA) with respect to µ-coherent dictionaries. The OSGA is simpler than the OGA from the viewpoint of computational complexity. Under the assumption that the coherence parameter µ has a lower bound, we establish a Lebesgue-type inequality for the OSGA, which shows that the OSGA provides an almost optimal approximation on the first [1/(18µs)] steps. Moreover, we improve the asymptotic constant in the Lebesgue-type inequality for the OGA obtained in [19]. We develop some new techniques to obtain these results. We found that there is a strong dependency between the constant A and the coherence parameter µ in (2). The specific constant 2.24 is not the best; it can be changed by adjusting the values of A and µ, but the best constant is still unknown. In fact, we do not even know whether a best constant exists. We will continue to study the improvement of the Lebesgue constant in future work. As for applications of the OSGA, our simulation results show that the OSGA is very efficient for recovering sparse signals.
Author Contributions: The authors contributed equally to this paper. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.