Efficient Implementations for Orthogonal Matching Pursuit

Abstract: Based on the efficient inverse Cholesky factorization, we propose an implementation of OMP (called version 0, i.e., v0) and four memory-saving versions of it (i.e., the proposed v1, v2, v3 and v4). In our simulations, the five proposed versions and the existing OMP implementations have nearly the same numerical errors. Among all the OMP implementations, the proposed v0 has the lowest computational complexity, and is the fastest in the simulations for almost all problem sizes. As a tradeoff between computational complexity/time and memory requirements, the proposed v1 appears preferable to all the existing efficient OMP implementations that store G (i.e., the Gram matrix of the dictionary), the proposed v2 and v3 appear preferable to the only existing efficient implementation that does not store G, and the proposed v4 appears preferable to the naive implementation that has the (known) minimum memory requirements. Moreover, each iteration of all five proposed versions involves only parallelizable matrix-vector products, and none of them needs the back-substitutions that are required by some existing efficient implementations (e.g., those utilizing the Cholesky factorization).


Introduction
A signal vector x ∈ C^N is said to be K-sparse if x includes no more than K nonzero entries. The measurements of the signal x can be collected in a data vector u ∈ C^M satisfying

u = Φx + n, (1)

where Φ is an M × N dictionary with K < M ≪ N, and n is noise. The problem of signal recovery, i.e., solving (1) to estimate x, is called sparse approximation or sparse representation [1-6].
Recently, the OMP algorithm has been applied in various scenarios [7-13]. The stagewise OMP algorithm was proposed in [7] to enhance the convergence speed of switching-signal reconstruction in IGBT (insulated gate bipolar transistor) online condition monitoring, while the SiT-based stagewise OMP was proposed in [8] for the estimation of massive-MIMO sparse uplink channels. A modified OMP algorithm was utilized in [9] for compressed-sensing ISAR (inverse synthetic aperture radar) imaging under highly maneuvering motion. The OMP algorithm was used in [10] for AoD (angle of departure) estimation at the mobile station, applied in [11] for channel estimation and equalization based on Pseudo-Noise (PN) sequences, and chosen in [12] as the imaging algorithm to reconstruct scenes in the single-pixel camera, since it offers stable and efficient reconstruction of small data sets. Moreover, simultaneous OMP [14] was utilized in [13] to convert image patches into sparse coefficients for representing the source multi-focus images.
Based on the efficient inverse Cholesky factorization proposed in [36], we propose an efficient OMP implementation and its four memory-saving versions (this work was presented in part in [37] at the IEEE Vehicular Technology Conference (VTC) 2013-Fall; the new contributions in this manuscript include the memory-saving versions 1 and 4, the derivations of the equations in the proposed OMP implementations, a more detailed introduction of the existing OMP implementations, and a more detailed comparison among the presented OMP implementations), which are then compared with the existing implementations. To focus on efficient implementations of OMP, we do not consider the extended, variant, or approximate implementations of OMP, such as the Stagewise OMP (StOMP) [38], the Backtracking-based Adaptive OMP (BAOMP) [39], the gradient pursuit [40], or the cyclic matching pursuit [41].
This paper is organized as follows. Section 2 describes the existing implementations of OMP. In Section 3, we propose an efficient OMP implementation and its four memory-saving versions. Then in Section 4, we evaluate the presented implementations of OMP. Finally, we draw conclusions in Section 5. In what follows, (·)^T, (·)^* and (·)^H denote the matrix transpose, the matrix conjugate and the Hermitian (conjugate) transpose, respectively. We use 0_M, I_M and ∅ to denote the length-M zero column vector, the M × M identity matrix and the empty matrix (or vector), respectively, and we use ∥·∥ to denote the norm (i.e., the L2 norm) of a vector.

The Existing Implementations for OMP
In this section, we summarize the existing OMP implementations.

The Naive Implementation of OMP
The dictionary Φ in (1) can be written as

Φ = [ϕ_{:1}, ϕ_{:2}, ..., ϕ_{:N}], (2)

where ϕ_{:i} ∈ C^M denotes the i-th column of Φ and is called an atom. The signal recovery problem is to find the sparse solution of the underdetermined linear system (1), i.e., to represent u as a weighted sum of the least number of atoms. OMP is a greedy algorithm that selects atoms from Φ iteratively to approximate u gradually. In each iteration, OMP selects the atom best matching the current residual (i.e., the approximation error), renews the weights of all the already selected atoms to obtain the least-squares approximation of u, and then updates the residual accordingly. Algorithm 1 summarizes the naive implementation of OMP [20,24], which is called Naive in this paper, as in [24]. As in the OMP implementation described in [20] (left lines 27-34 on page 2), we follow these conventions: Λ_k is a set containing the indices of the k selected atoms, and Φ_{Λ_k} is the sub-matrix of Φ containing only those columns of Φ with indices in Λ_k. The estimate x̂ of x has nonzero entries at the indices listed in Λ_k, and the λ_i-th entry of x̂ equals the i-th entry of x̂_k.
Step 2: Projection: compute the inner products

p^{k−1} = Φ^H r_{k−1}. (3)

Step 3: Atom selection: find λ_k = arg max_i |p_i^{k−1}| and set Λ_k = Λ_{k−1} ∪ {λ_k}, where λ_k is the k-th entry of the set Λ_k.

Step 4: Renew the weights of the k selected atoms by

x̂_k = (Φ_{Λ_k}^H Φ_{Λ_k})^{−1} Φ_{Λ_k}^H u, (4)

and then utilize x̂_k to update the residual by

r_k = u − Φ_{Λ_k} x̂_k. (5)

Step 5: k := k + 1. Then return to Step 2 if k < K and the residual energy [24] still satisfies ∥r_k∥² > ξ, where ξ is a small positive real number, e.g., ξ = 10^{−10}.
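The naive implementation summarized in Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `omp_naive` and the use of `numpy.linalg.lstsq` for the least-squares step in (4) are our own choices, not from the paper.

```python
import numpy as np

def omp_naive(Phi, u, K, xi=1e-10):
    """Recover a K-sparse x from u ≈ Phi @ x by naive OMP (Algorithm 1)."""
    M, N = Phi.shape
    r = u.copy()                          # Step 1: residual r_0 = u
    support = []                          # Lambda_k: indices of selected atoms
    x_hat = np.zeros(N, dtype=Phi.dtype)
    for k in range(K):
        p = Phi.conj().T @ r              # Step 2: projections p^{k-1}
        lam = int(np.argmax(np.abs(p)))   # Step 3: best-matching atom
        support.append(lam)
        Phi_s = Phi[:, support]
        # Step 4: least-squares weights for all selected atoms
        w, *_ = np.linalg.lstsq(Phi_s, u, rcond=None)
        r = u - Phi_s @ w                 # update the residual
        if np.vdot(r, r).real <= xi:      # Step 5: stop if residual energy is tiny
            break
    x_hat[support] = w
    return x_hat, support
```

With an orthonormal dictionary the recovery is exact, which makes the sketch easy to check; for a generic overcomplete Φ the same loop applies unchanged.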
For simplicity, we assume that the dictionary Φ has normalized columns, i.e., ∥ϕ_{:i}∥² = 1, as in [18] and the shared simulation code [42] of [24]. We omit step 6 to focus only on the k-th iteration, and we omit steps 1, 3 and 5, since they are the same as those in Algorithm 1, except that it is now required to initialize ∥r_0∥² = ∥u∥² in step 1. Thus we only introduce steps 2 and 4, which update the projection p^{k−1} and the residual energy ∥r_k∥², respectively.

Many existing efficient OMP implementations utilize the Gram matrix [20] of all the N atoms ϕ_{:1}, ϕ_{:2}, ..., ϕ_{:N}, i.e.,

G = Φ^H Φ. (7)

The Gram matrix of the k atoms selected in iteration k is

G_{Λ_k} = Φ_{Λ_k}^H Φ_{Λ_k}, (8)

where Φ_{Λ_k} can be written as

Φ_{Λ_k} = [ϕ_{:λ_1}, ϕ_{:λ_2}, ..., ϕ_{:λ_k}]. (9)

In Algorithm 2 we introduce Chol-1 and Chol-2, which utilize the Cholesky factor [43] V_k of G_{Λ_k}, satisfying

G_{Λ_k} = V_k V_k^H. (10)

We also need the vector

s_{k−1} = Φ_{Λ_{k−1}}^H ϕ_{:λ_k}, (11)

which is obtained within G_{Λ_k} through the block form

G_{Λ_k} = [G_{Λ_{k−1}}, s_{k−1}; s_{k−1}^H, g_{λ_k,λ_k}]. (12)

To verify (12) and (11), substitute (9) into (8). In Algorithm 3 we introduce QR-1 and QR-2, which utilize the QR factors [19] of Φ_{Λ_k}, i.e., Q_k and R_k satisfying

Φ_{Λ_k} = Q_k R_k, (14)

where Q_k ∈ C^{M×k} has orthonormal columns and R_k ∈ C^{k×k} is upper-triangular. In Algorithm 4 we introduce MIL-1 and MIL-2. In each iteration, MIL-1 updates the biorthogonal basis of Φ_{Λ_k}, i.e., the matrix whose Hermitian transpose is the pseudo-inverse of Φ_{Λ_k}, while MIL-2 updates G_{Λ_k}^{−1}.

Algorithm 2 OMP by the Cholesky Decomposition
Step 2. Projection: if k = 1, compute p_0 by (3); else, for k ≥ 2, compute x̂_{k−1} by solving the triangular system

V_{k−1}^H x̂_{k−1} = y_{k−1}, (17)

and then compute p^{k−1} in one of two ways. In the first way (Chol-1), explicitly compute r_{k−1} (k ≥ 2) by (5) and then directly compute p^{k−1} (k ≥ 2) by (3); in the second way (Chol-2), compute p^{k−1} (k ≥ 2) by

p^{k−1} = Φ^H u − G(:, Λ_{k−1}) x̂_{k−1}, (18)

where G(:, Λ_{k−1}) includes the k − 1 columns of G with indices in Λ_{k−1} and satisfies

G(:, Λ_{k−1}) = Φ^H Φ_{Λ_{k−1}}. (19)

Step 4. If k = 1, obtain V_1 = 1 from (10); else, for k ≥ 2, obtain V_k by

V_k = [V_{k−1}, 0_{k−1}; z_{k−1}^H, (g_{λ_k,λ_k} − ∥z_{k−1}∥²)^{1/2}], (20)

where g_{λ_k,λ_k} = ϕ_{:λ_k}^H ϕ_{:λ_k} is an entry of G, and z_{k−1} is computed by solving the triangular system

V_{k−1} z_{k−1} = s_{k−1}. (21)

Then for any k ≥ 1, compute y_k by solving

V_k y_k = p̃_0, (22)

where p̃_0 ∈ C^k has as its i-th entry the λ_i-th entry of p_0, and finally compute

∥r_k∥² = ∥u∥² − ∥y_k∥². (23)
Step 4. If k = 1, obtain Q_1 and R_1 from (14); else, for k ≥ 2, compute

h_{k−1} = Q_{k−1}^H ϕ_{:λ_k} and q̃_k = ϕ_{:λ_k} − Q_{k−1} h_{k−1},

to obtain

Q_k = [Q_{k−1}, q_{:k}] with q_{:k} = q̃_k/∥q̃_k∥, and R_k = [R_{k−1}, h_{k−1}; 0_{k−1}^T, ∥q̃_k∥],

and finally compute ∥r_k∥² = ∥r_{k−1}∥² − |q_{:k}^H u|².
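The incremental QR update behind Algorithm 3 can be illustrated with a short Gram-Schmidt append. This is only a sketch of the standard rank-one QR extension under our own naming (`qr_append`, `h`, `q_tilde`), not the paper's exact procedure.

```python
import numpy as np

def qr_append(Q, R, phi_new):
    """Append one column to a thin QR factorization: [A, phi_new] = Q' R'.
    Q is M x (k-1) with orthonormal columns, R is (k-1) x (k-1) upper-triangular."""
    h = Q.conj().T @ phi_new           # projections of the new atom onto the basis
    q_tilde = phi_new - Q @ h          # component orthogonal to the current basis
    d = np.linalg.norm(q_tilde)
    Q_new = np.column_stack([Q, q_tilde / d])
    k = R.shape[0] + 1
    R_new = np.zeros((k, k), dtype=R.dtype)
    R_new[:k-1, :k-1] = R              # old triangular block
    R_new[:k-1, -1] = h                # new last column
    R_new[-1, -1] = d
    return Q_new, R_new
```

Each iteration of a QR-based OMP thus costs one projection (matrix-vector product) plus a normalization, while the residual energy can be updated from |q_{:k}^H u|² as in the text.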

The Proposed Efficient OMP Implementation and Its Memory-Saving Versions
In this section, we propose an efficient OMP implementation, which is based on the efficient inverse Cholesky factorization proposed in [36]. We also introduce four memory-saving versions for the proposed OMP implementation.

The Proposed OMP Implementation by the Inverse Cholesky Factorization
The inverse Cholesky factor [36] of G_{Λ_k} can be denoted as F_k satisfying

F_k F_k^H = G_{Λ_k}^{−1}, (42)

while from (10) we can deduce

G_{Λ_k}^{−1} = (V_k V_k^H)^{−1} = V_k^{−H} V_k^{−1}. (43)

Then we can compare (43) with (42) to deduce

F_k = V_k^{−H}, (44)

since V_k is the unique lower-triangular Cholesky factor [43] of G_{Λ_k}, and F_k, the upper-triangular equivalent Cholesky factor of G_{Λ_k}^{−1} [36], is also unique [43]. In [36], F_k is computed from F_{k−1} iteratively by

F_k = [F_{k−1}, t_{k−1}; 0_{k−1}^T, γ_k], (45)

where γ_k and t_{k−1} are computed by ([36] (Equation (24)))

γ_k = 1/(g_{λ_k,λ_k} − ∥w_{k−1}∥²)^{1/2} (46)

and

t_{k−1} = −γ_k F_{k−1} w_{k−1}, (47)

with

w_{k−1} = F_{k−1}^H s_{k−1}. (48)

Notice that s_{k−1} in (48) and g_{λ_k,λ_k} in (46) are the vector and scalar in G_{Λ_k}, respectively, as shown in (12).
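The block recursion (45)-(48) can be checked numerically with a short script. The sketch below follows our reading of the update in [36]; the helper name `inv_chol_append` and the intermediate `w` are our own.

```python
import numpy as np

def inv_chol_append(F, s, g):
    """Given F with F F^H = G^{-1} for a (k-1)x(k-1) Gram matrix G,
    extend it to the k x k Gram matrix [[G, s], [s^H, g]] (Eqs. (45)-(48))."""
    w = F @ (F.conj().T @ s)                      # F_{k-1} w_{k-1}, with w = F^H s
    gamma = 1.0 / np.sqrt((g - np.vdot(s, w)).real)   # (46)
    t = -gamma * w                                # (47)
    k = F.shape[0] + 1
    F_new = np.zeros((k, k), dtype=F.dtype)       # (45): block upper-triangular form
    F_new[:k-1, :k-1] = F
    F_new[:k-1, -1] = t
    F_new[-1, -1] = gamma
    return F_new
```

Note that the update uses only two matrix-vector products with F and F^H, which is the property the proposed implementations exploit: no triangular system is ever solved.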
To utilize F_k in OMP, we substitute (42) into (4) to deduce

x̂_k = F_k F_k^H Φ_{Λ_k}^H u. (49)

Then we substitute (49) into (5) to obtain

r_k = u − Φ_{Λ_k} F_k F_k^H Φ_{Λ_k}^H u. (50)

Equation (50) can be written as (51), where the intermediate quantities are defined in (52) and (53). Let us deduce (54) from (42), and substitute (47) into (45) to obtain (55). Then we utilize the above-described intermediate results a_k and B_k to propose an efficient OMP implementation (called version 0, i.e., v0) in the following Algorithm 5, where steps 1, 3 and 5 are omitted as in Algorithms 2-4. In Algorithm 5, we use (3), (46), (55), (56) and (5) successively.

Algorithm 5 The Proposed OMP Implementation (v0) by the Inverse Cholesky Factorization
Step 2. Projection: if k = 1, compute p_0 by (3); else, for k ≥ 2, update p^{k−1} by (57).

Step 4. Compute the intermediate results by (58)-(62), where p_{λ_k}^{k−1} is the λ_k-th entry of p^{k−1}, g_{:λ_k} is the λ_k-th column of G, and c_0 = B_0 = a_0 = ∅ is assumed for k = 1. Finally update the residual energy by

∥r_k∥² = ∥r_{k−1}∥² − |a_k|². (63)

If F_k is required to compute x̂_k and r_k, compute F_k by F_1 = 1/(g_{λ_1,λ_1})^{1/2} when k = 1, or by (45)-(48) when k ≥ 2.

Step 6. Output: ∥r_K∥² and Λ_K. When x̂_K and r_K are required, compute x̂_K by (64) and r_K by (5).

In the next subsection, we will verify Equations (57)-(64) in Algorithm 5. Moreover, we will also verify

e_{:k} = γ_k(ϕ_{:λ_k} − E_{k−1} E_{k−1}^H ϕ_{:λ_k}) (65)

and

E_k = [E_{k−1}, e_{:k}], (66)

which can be utilized to update the intermediate result E_k.
To verify (60), we substitute (66) into (52) to obtain a_k = [(E_{k−1}^H u)^T, e_{:k}^H u]^T, into which we substitute (52) to verify (60) and obtain a_k = e_{:k}^H u.
Now let us verify (59). We can substitute (65) into (75) to obtain (76), and finally substitute (58) into (76) to obtain (77). It can be seen from (51) that the quantity in (77) is the i_k-th entry of the vector in (78).

To verify (57), substitute (60) and (62) into (51) to deduce (78), and then substitute (51) (with k replaced by k − 1) into (78).

Finally let us verify (63). Substitute (64) into (5) to obtain (79), and substitute (9), (45), (59) and (60) into (79) to obtain (80). In what follows, we consider the second, third and fourth terms on the right side of (80). Substitute (79) into the second term to obtain (81). It can be seen from (52) that the last (k-th) entry of a_k is (82), while (83) follows from (8), (10) and (44). Then we substitute (82) and (83) into (81) to deduce the value (84) of the second term. The third term on the right side of (80) is equal to (85), since (86) holds. The fourth term on the right side of (80) is (87), where (84) is utilized. Lastly, we can substitute (85)-(87) into (80) to obtain (r_k)^H r_k = (r_{k−1})^H r_{k−1} − a_k a_k^*, i.e., (63).

Several Memory-Saving Versions of the Proposed OMP Implementation
The proposed OMP implementation (v0) requires N 2 + Nk memories (when considering the required memories, we neglect the difference between k − 1 and k for simplicity) for G and B k in the k-th iteration, and requires extra N M memories for Φ in the 1-st iteration. In some cases, we may need to consider the tradeoff between computational complexities and memory requirements. Accordingly, we will show that the proposed OMP implementation (v0) can be modified to obtain the proposed memory-saving v1, v2, v3 and v4 (i.e., versions 1, 2, 3 and 4), which all can save some memories at the expense of higher computational complexity.
With respect to the proposed v0, the proposed memory-saving v1 saves the Nk memory entries for B_k, and requires k² extra memory entries for F_k. To obtain the proposed memory-saving v1, we need to modify steps 2 and 4 of the proposed OMP implementation (v0) (i.e., Algorithm 5) into steps 2a and 4a, respectively, which are described in the following Algorithm 6. To deduce (89) in Algorithm 6, we substitute (19) and then (88) into (57).
With respect to the proposed v0, the proposed memory-saving v2 saves the N² memory entries for G, and requires Mk extra memory entries for E_k. The proposed memory-saving v2 only needs to modify step 4 of the proposed OMP implementation (v0) (i.e., Algorithm 5) into step 4b, which is described in the following Algorithm 7. It can easily be seen that (90) in Algorithm 7 can be deduced from (72).

Algorithm 7 The Proposed Memory-Saving Version 2
Step 4b. Compute the intermediate results by (90) to obtain B_k = [B_{k−1}, b_{:k}]. Notice that when k = 1, c_0 = E_0 = B_0 = a_0 = ∅ is assumed.
Finally update the residual energy by ∥r_k∥² = ∥r_{k−1}∥² − |a_k|². If F_k is required to compute x̂_k and r_k, compute F_k by F_1 = 1/(g_{λ_1,λ_1})^{1/2} when k = 1, or by the recursion (45) when k ≥ 2.

With respect to the proposed v0, the proposed memory-saving v3 saves the N² + Nk memory entries for G and B_k, and requires Mk extra memory entries for E_k. It modifies steps 2 and 4 of the proposed OMP implementation (v0) (i.e., Algorithm 5) into steps 2c and 4c, respectively, which are described in the following Algorithm 8. We only need to substitute (73) into (57) to obtain (91) in Algorithm 8. Moreover, we can substitute (54) into (67) to obtain b_{i_k,1:k−1}^H = ϕ_{:i_k}^H E_{k−1}, which is then substituted into (58) to obtain (92) in Algorithm 8.

Algorithm 8 The Proposed Memory-Saving Version 3
Step 2c. Projection: if k = 1, compute p_0 = Φ^H r_0; else, for k ≥ 2, compute p^{k−1} by (91).

Step 4c. Compute c_{k−1} from E_{k−1} by (92). Finally update the residual energy by (63). If F_k is required to compute x̂_k and r_k, compute F_k by F_1 = 1/(g_{λ_1,λ_1})^{1/2} when k = 1, or by the recursion (45) when k ≥ 2.

With respect to the proposed memory-saving v3, the proposed memory-saving v4 saves the Mk memory entries for E_k, and requires k² extra memory entries for F_k. The proposed memory-saving v4 needs to modify steps 2 and 4 of the proposed OMP implementation (v0) (i.e., Algorithm 5) into steps 2d and 4d, respectively, which are described in the following Algorithm 9. From (54), we can deduce e_{:k} = Φ_{Λ_k} f_k, which is then substituted into (91) to obtain (93) in Algorithm 9.
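As a quick sanity check of this bookkeeping, the dominant memory counts stated above for v0-v4 can be tabulated with a tiny helper. The function is our own and simply restates the counts given in the text for iteration k; the NM entries for Φ, needed by every version, are counted separately, and the difference between k − 1 and k is neglected as in the text.

```python
def omp_memory(version, N, M, k):
    """Dominant memory counts (in entries) for the proposed versions in iteration k."""
    v0 = N * N + N * k                    # v0: G (N^2) plus B_k (N*k)
    return {
        "v0": v0,
        "v1": v0 - N * k + k * k,         # drop B_k, keep F_k   -> N^2 + k^2
        "v2": v0 - N * N + M * k,         # drop G, keep E_k     -> N*k + M*k
        "v3": v0 - N * N - N * k + M * k, # drop G and B_k       -> M*k
        "v4": v0 - N * N - N * k + k * k, # v3 with F_k for E_k  -> k^2
    }[version]
```

For typical sizes with k ≪ M ≤ N, this reproduces the ordering claimed in the text: v0 and v1 are dominated by G, v2 and v3 by E_k, and v4 by F_k alone.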

Analysis of Presented OMP Implementations
In this section, we compare the presented implementations for OMP, by theoretical and empirical analysis.

Theoretical Analysis of Computational Complexities and Memory Requirements
Table 1 compares the expected computational complexities and memory requirements of the existing and proposed OMP implementations in the k-th iteration (k ≥ 2) (in the implementations with N² memory entries to store G, the Gram matrix of the dictionary, we assume that G has been computed offline and stored), where the complexities are the numbers of multiplications and additions, including both the dominating terms and their coefficients. It can be seen that, for each implementation, the numbers of multiplications and additions are nearly the same. "Dominating terms" here means that we omit the terms that are O(N), O(M) or O(k). For example, the dominating terms for Chol-2 only include the terms for (18) and for solving the triangular systems (17), (21) and (22). When evaluating the memory requirements for Table 1, we neglect the fact that we may only store parts of some matrices, such as the Hermitian Gram matrix G and the triangular Cholesky factor. The entries in Table 1 are counted from the above-described relevant equations. The complexities of the existing OMP implementations listed in Table 1 are consistent with those in Table 1 of [24], while the complexity of each proposed OMP implementation is broken down by equation in Table 2.

[Table 1: Algorithm, Complexity, Memory for Naive and the other presented implementations.]

From Table 1, it can be seen that the proposed OMP implementation (i.e., Proposed-v0) has the lowest computational complexity. With respect to any of the existing efficient implementations (i.e., Chol-2, MIL-2, Chol-1, MIL-1 and QR-2) spending N² memory entries on G, Proposed-v1 requires lower computational complexity and the same or less memory. On the other hand, with respect to the only existing efficient implementation not storing G, i.e., QR-1, Proposed-v2 needs lower computational complexity and a little more memory, while Proposed-v3 needs the same complexity and less memory. Lastly, with respect to Naive, Proposed-v4 needs much lower computational complexity and only a little more memory.

Table 2. Complexities of the equations in the k-th iteration (k ≥ 2) for the proposed OMP implementations.

[Table 2: Algorithm, Complexity, Equation for Proposed-v0 through Proposed-v4.]

Empirical Comparison for OMP Implementations
We perform MATLAB simulations to compare all five proposed versions of the OMP implementation with the existing ones, on a 64-bit 2.4-GHz Microsoft-Windows Xeon workstation with 15.9 GB of RAM. We give the simulation results for numerical errors and computational time (the MATLAB code to generate Figures 1-6 of this paper is shared at https://github.com/zhuhufei/OMP.git) in Figures 1-6. Moreover, we show the floating-point operations (flops) required by the different OMP implementations in Figures 7-9, where the complexities listed in Table 1 are utilized to count the flops. To obtain Figures 1-9, the shared simulation code [42] is utilized.

As Figure 1 in [24], Figure 1 here shows the accumulation of numerical errors in the recursive computation of the inner products p^k and the solutions x̂_k, where the errors are the differences in the computed p^k and x̂_k between the naive implementation and any considered efficient implementation. For example, the numerical errors of Chol-2 are defined by ∥p̄^k − p^k∥/∥p^k∥ and ∥x̄_k − x̂_k∥/∥x̂_k∥, where p^k and x̂_k are computed by Naive while p̄^k and x̄_k are computed by Chol-2. We perform 100 independent trials, and the other simulation parameters are mostly identical to those for Figure 1 in [24]. For example, we set N = M = 400, K = 200 and the sparsity ρ = K/M = 0.5, and sample sparse vectors from a Rademacher or Normal distribution. For any k-th (k = 1, 2, ..., K) iteration, we compute the mean relative errors of the inner products and those of the solutions, which are plotted in Figure 1a and Figure 1b, respectively. Since Figure 1 in [24] shows that QR-1, Chol-2 and MIL-1 have nearly the same numerical errors, here we only compare the numerical errors of the proposed five versions with those of Chol-2. Moreover, we do the same simulations in Figure 2 as in Figure 1, except that we set the sparsity to ρ = 0.1 instead of ρ = 0.5.
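The error measure just described (a mean of per-iteration relative differences against the naive implementation) can be sketched as follows; the helper name is ours, not from the shared code [42].

```python
import numpy as np

def mean_relative_error(ref_list, test_list):
    """Mean of ||test - ref|| / ||ref|| over per-iteration vectors,
    mirroring the error measure used for Figures 1 and 2."""
    errs = [np.linalg.norm(t - r) / np.linalg.norm(r)
            for r, t in zip(ref_list, test_list)]
    return float(np.mean(errs))
```

Here `ref_list` would hold the p^k (or x̂_k) vectors from Naive over the iterations of one trial, and `test_list` the corresponding vectors from the efficient implementation under test; the values are then averaged over the 100 trials.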
In Figure 3, we perform Monte Carlo simulations to show the Signal-to-Reconstruction-Error Ratio (SRER) of different OMP implementations versus the indeterminacy M/N (i.e., the fraction of measurements), where we follow the simulations for Figures 1 and 2 in [44], and SRER is defined by

SRER = E{∥x∥²}/E{∥x − x̂∥²}.

As Figure 1 in [44], Figure 3a gives the SRER performance for Normal (i.e., Gaussian) sparse signals in the clean measurement case and in the noisy measurement case with SMNR = 15 dB, where SMNR denotes the Signal-to-Measurement-Noise Ratio and satisfies

SMNR = E{∥x∥²}/E{∥n∥²}.

Similarly, Figure 3b gives the SRER performance for Rademacher sparse signals, as Figure 2 in [44]. In Figure 3, we set N = 500 and K = 20. Moreover, we generate the dictionary Φ 100 times, and for each realization of Φ, we generate a sparse signal x 100 times.

From Figures 4-6, the computational time of Proposed-v4 is near or less than that of an existing efficient implementation (e.g., MIL-2 or QR-1) for most problem sizes, while the memory requirements of Proposed-v4 are much less than those of any existing efficient implementation, and are only a little more than the (known) minimum requirements (i.e., the requirements of Naive). Thus Proposed-v4 seems to be a good choice when we need an efficient implementation with the minimum memory requirements. Moreover, from Table 1 and Figures 4-6, it can be seen that with respect to Chol-2 and MIL-2, Proposed-v1 needs the same amount of memory and (1/2)k² fewer operations, and is usually a little faster.
In Figure 7, we show log10(Flops_naive(K, M, N)/Flops_imp(K, M, N)) for two different N. Then in Figures 8 and 9, we show the flops ratios of Chol-2 over Proposed-v0, Proposed-v1 and MIL-2, respectively, to compare flops among the efficient OMP implementations storing G, and we also show the flops ratios of QR-1 over Proposed-v2, Proposed-v3 and Proposed-v4, respectively, to compare flops among the efficient implementations that do not store G. In Figure 8, we give the flops ratios as a function of the sparsity K/M for the indeterminacies M/N = 0.1155 and M/N = 0.5086, respectively, while in Figure 9, we give the flops ratios as a function of the indeterminacy M/N for the sparsities K/M = 0.1155 and K/M = 0.5086, respectively. From Figure 7, it can be seen that the speedups in flop numbers of Proposed-v0, Proposed-v1, Chol-2, QR-2 and MIL-1 over Naive are obvious for almost all problem sizes. From Figures 8 and 9, it can be seen that the speedups in flop numbers of Proposed-v0 and Proposed-v1 over Chol-2, and those of Proposed-v2 over QR-1, grow linearly with the sparsity K/M for the two fixed indeterminacies M/N, and grow linearly with the indeterminacy M/N for the two fixed sparsities K/M. On the other hand, with respect to QR-1, Proposed-v3 requires a similar number of flops and Proposed-v4 requires a few more flops, while, as shown in Table 1, both Proposed-v3 and Proposed-v4 require less memory.

Table 1 and Figure 2 of [24] also indicate that, with respect to QR-1, Chol-2 and MIL-1 need fewer flops yet more computational time. One possible reason for this phenomenon could be that the computational time is decided not only by the required flops, but also by the required memory-access operations.
Specifically, this reason can partly explain why MIL-2 is obviously slower than Chol-2: in the k-th iteration, MIL-2 needs to access k² memory entries in (38) to obtain the inverse of G_{Λ_{k−1}}, while Chol-2 only needs to access k memory entries in (20) to obtain the Cholesky factor of G_{Λ_{k−1}}.
In addition to the flops and the memory-access operations, the parallelizability of an OMP implementation can also affect the computational time. When considering this factor, notice that the proposed five versions do not need any back-substitution in each iteration, and only include parallelizable matrix-vector products [45]. Contrarily, the existing Chol-1, Chol-2 and QR-2 usually solve triangular systems by back-substitutions [43], which are inherently serial processes unsuitable for parallel implementations [45]. It can easily be seen that Chol-1 and Chol-2 both need (3/2)k² multiplications and additions for the back-substitutions to solve the three k × k triangular systems (17), (21) and (22), while QR-2 needs k² multiplications and additions for the back-substitutions to solve the two k × k triangular systems (25) and (26).
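To see why back-substitution resists parallelization, note that each solved entry feeds the computation of the next; the minimal sketch below (our own helper, not taken from [43]) makes the serial dependence explicit.

```python
import numpy as np

def back_substitute(U, b):
    """Solve U x = b for upper-triangular U by back-substitution.
    Each x[i] depends on all the later entries x[i+1:], so the outer
    loop is an inherently serial chain of k dependent steps."""
    n = U.shape[0]
    x = np.zeros_like(b, dtype=float)
    for i in range(n - 1, -1, -1):
        x[i] = (b[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
    return x
```

By contrast, a matrix-vector product such as F_k^H s (the kind of operation the proposed versions rely on) computes all its output entries independently, so the same work can be distributed across parallel units.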

Conclusions
Based on the efficient inverse Cholesky factorization proposed in [36], we propose an implementation of OMP (i.e., the proposed v0) and its four memory-saving versions (i.e., the proposed v1, v2, v3 and v4), and compare them with the existing implementations. Among all the OMP implementations, the proposed v0 has the lowest computational complexity, and is the fastest in the simulations for almost all problem sizes. The proposed v1, v2, v3 and v4 all offer good tradeoffs between computational complexity/time and memory requirements. When only considering the efficient OMP implementations that spend N² memory entries to store G (i.e., the Gram matrix of the dictionary), the proposed v1 seems to be better than all the existing ones (i.e., QR-2, MIL-1, MIL-2, Chol-1 and Chol-2), since it needs lower computational complexity, is usually faster in the simulations, and does not require more memory. When only considering the efficient OMP implementations that do not store G, with respect to the only existing one (i.e., QR-1), the proposed v2 needs lower computational complexity and a little more memory, while the proposed v3 needs the same complexity and less memory. Moreover, our simulations show that both the proposed v2 and v3 are faster than all the existing implementations for most problem sizes. Lastly, with respect to the naive OMP implementation that has the (known) minimum memory requirements, the proposed v4 only requires a little more memory and needs much less computational time, while its time and complexity are comparable to those of some existing efficient OMP implementations (e.g., the computational time of MIL-2 and Chol-2, and the complexity of QR-1). Thus the proposed v4 seems to be the best choice when we need an efficient implementation with the minimum memory requirements.
Our simulations show that the proposed five versions and the existing implementations have nearly the same numerical errors. On the other hand, our simulations also show that some speedups in computational time or complexity of the proposed OMP implementations over the existing ones grow almost linearly with the sparsity K/M or the indeterminacy M/N, e.g., the speedups in computational time of the proposed v0, v2 and v3 over Naive, those in complexity of the proposed v0 and v1 over Chol-2, and those in complexity of the proposed v2 over QR-1. Moreover, in each iteration, some existing efficient OMP implementations (e.g., Chol-1, Chol-2 and QR-2) solve triangular systems by back-substitutions (which are inherently serial processes unsuitable for parallel implementations), while the proposed five versions of the OMP implementation do not need any back-substitution, and only include parallelizable matrix-vector products.