Stochastic Recursive Gradient Support Pursuit and Its Sparse Representation Applications

In recent years, a series of matching pursuit and hard thresholding algorithms have been proposed to solve the sparse representation problem with an ℓ0-norm constraint. Some stochastic hard thresholding methods have also been proposed, such as stochastic gradient hard thresholding (SG-HT) and stochastic variance reduced gradient hard thresholding (SVRGHT). However, each iteration of all these algorithms requires one hard thresholding operation, which leads to a high per-iteration complexity and slow convergence, especially for high-dimensional problems. To address this issue, we propose a new stochastic recursive gradient support pursuit (SRGSP) algorithm, in which only one hard thresholding operation is required in each outer-iteration. Thus, SRGSP has a significantly lower computational complexity than existing methods such as SG-HT and SVRGHT. Moreover, we provide a convergence analysis of SRGSP, which shows that SRGSP attains a linear convergence rate. Our experimental results on large-scale synthetic and real-world datasets verify that SRGSP outperforms related state-of-the-art methods on various sparse representation problems. We also conduct extensive experiments on two real-world sparse representation applications, image denoising and face recognition, and the results validate that our SRGSP algorithm obtains much better performance than other sparse representation learning optimization methods in terms of PSNR and recognition rate.


Introduction
In recent years, sparse representation has proven to be a useful approach for representing or compressing high-dimensional signals. Sparse representation algorithms have attracted many researchers in the fields of signal processing, image processing, medical imaging, machine learning, computer vision, pattern recognition, and so on [1]. In most applications, the unknown signal of interest is regarded as a sparse combination of a few columns from a given dictionary, and this problem is usually formulated as a sparsity constrained problem. Such sparse representation problems are common in image denoising, image inpainting, and face recognition, among other applications [2][3][4].
Image denoising is a classical problem of improving image quality in computer vision. The aim is to recover the clean image x from the noisy image y = x + e, where e is generally additive white Gaussian noise [5]. It can be tackled by three main types of methods: transform-domain methods [6], spatial filtering [7], and dictionary learning-based methods [8,9]. Note that the dictionary learning-based methods optimize the following model in their ℓ0-norm sparse coding step:

min_x ‖x‖_0, subject to ‖y − Dx‖_2 ≤ ε, (1)

where x represents the sparse code of y ∈ R^n with a given error tolerance ε > 0, D ∈ R^{n×d} is a given dictionary, ‖x‖_0 denotes the number of nonzero entries of the vector x, and ‖x‖_2 is the ℓ2-norm, i.e., ‖x‖_2 = √(∑_i x_i²). This is an ℓ0-norm constrained minimization problem, which can be solved via convex relaxation or greedy algorithms. The performance of the algorithm used to find a sparse representation has a large impact on the denoised images. Similar effects can be found in face recognition tasks.
As a challenging application in computer vision and pattern recognition, face recognition has become a complicated problem on account of illumination, occlusion, expression, and facial disguise. Many face recognition methods have been proposed, e.g., Eigenfaces [10] and Fisherfaces [11]. Wright et al. [12] proposed the sparse representation-based classification (SRC) method, which regards the training face images themselves as an overcomplete dictionary. A testing face image can then be represented by the atoms of this overcomplete dictionary, and thus we need to solve a sparse coding problem similar to Equation (1). Hence, the same question arises as in image denoising. In this paper, our main aim is to find a more efficient sparse representation algorithm for image denoising and face recognition tasks.

Stochastic Hard Thresholding Methods
So far, many algorithms have been proposed for pursuing sparse solutions via ℓp-norm (0 ≤ p ≤ 1) minimization [1]. Although convex relaxation algorithms can easily find an optimal solution owing to the convexity of the ℓ1-norm, ℓ1-norm minimization has some limits, as pointed out in [13], and sometimes yields worse empirical performance [14]. Thus, it is necessary to solve the ℓ0-norm problem directly. To obtain sparse solutions, many greedy algorithms [15][16][17][18][19] have been developed. Moreover, there are several hard thresholding-based methods, such as iterative hard thresholding [20], fast gradient hard thresholding pursuit [21], and gradient support pursuit (GraSP) [22]. All these methods have been applied successfully to various real-world problems such as sparse vector and low-rank matrix recovery. However, the hard thresholding methods above are deterministic optimization algorithms: they need to compute a full gradient using all training samples and thus have a high per-iteration complexity of O(nd), which makes them unsuitable for large-scale real-world data.
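To make the deterministic baseline concrete, the following is a minimal sketch of iterative hard thresholding for a least-squares loss; the function names and the step-size choice are illustrative assumptions, not code from the cited papers. Note that each iteration computes a full gradient, which is the O(nd) cost discussed above.

```python
import numpy as np

def hard_threshold(x, s):
    """Keep the s largest-magnitude entries of x, zero out the rest."""
    idx = np.argsort(np.abs(x))[-s:]
    out = np.zeros_like(x)
    out[idx] = x[idx]
    return out

def iht(A, y, s, eta=None, n_iter=300):
    """Iterative hard thresholding for min ||Ax - y||_2^2 s.t. ||x||_0 <= s."""
    n, d = A.shape
    if eta is None:
        # A conservative step size: 1 / largest eigenvalue of A^T A.
        eta = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(d)
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)   # full gradient: O(nd) per iteration
        x = hard_threshold(x - eta * grad, s)
    return x
```

For a well-conditioned random design and a truly sparse signal, this projected-gradient scheme typically recovers the support exactly.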
To address this issue, Nguyen et al. [23] proposed a stochastic gradient hard thresholding (SG-HT) algorithm, which introduced the idea of stochastic optimization into hard thresholding methods. It randomly selects one sample per iteration and thus has a much lower per-iteration complexity. However, SG-HT cannot reduce the variance between the stochastic gradient and the full gradient. Li et al. [24] proposed a stochastic variance reduced gradient hard thresholding (SVRGHT) method, which uses the stochastic variance reduced gradient (SVRG) technique [25], as in [26]. With the help of variance reduction, SVRGHT obtains a faster convergence rate. More recently, several stochastic hard thresholding algorithms using first-order or second-order information have appeared [27][28][29][30][31][32][33]. However, many stochastic algorithms such as SVRGHT perform a hard thresholding operation in each iteration, whose computational complexity is relatively high, O(d log(d)) in general [34], especially for high-dimensional data. In addition, hard thresholding methods have two main drawbacks. The first concerns their optimization-theoretic basis: when the current iterate is not a minimizer, moving in the direction of the negative gradient decreases the function value; however, this is no longer guaranteed once the hard thresholding operator H_s(·) is applied to the current iterate x_t, which means that gradient information is lost. Thresholding breaks the structure of the current solution and may waste much of the computation spent on gradient descent. Secondly, the computational burden of the hard thresholding operation is still linear in d, which is not negligible. An interesting question is whether there is an algorithm that overcomes these drawbacks. We answer this question affirmatively, in theory and in practice.

Our Contributions
In this paper, we propose the first variance reduced stochastic recursive gradient method for sparse representation problems. In other words, we use the stochastic recursive gradient proposed in [35], which is suitable for solving non-convex problems, to optimize the non-convex sparse representation problem considered in this paper. To keep the gradient information of the current iterate, as suggested in [36], we perform many gradient descent steps followed by a single hard thresholding operation. We also construct the most relevant support, on which minimization will be efficient. Therefore, this paper proposes a novel sparsity-constrained algorithm, called stochastic recursive gradient support pursuit (SRGSP). At each iteration of SRGSP, we first find the most relevant support set, approximately minimize over this support set with our stochastic recursive gradient solver, which satisfies a certain descent condition, and then perform hard thresholding on the updated model parameter. The main contributions and novelty of this paper are as follows: (1) We provide a non-trivial analysis of the statistical estimation performance of SRGSP under mild assumptions, and the theoretical results show that SRGSP attains a fast linear convergence rate. (2) Benefiting from fewer hard thresholding operations than existing algorithms such as SVRGHT, the average per-iteration cost of our algorithm is much lower (O(d) for SRGSP vs. O(d log(d)) for SVRGHT), which leads to faster convergence. (3) Moreover, applying the hard thresholding operator to the current variable less often preserves gradient information, which improves empirical performance. Stochastic recursive gradient support pursuit thus points to a new way of reducing the cost of hard thresholding while maintaining or even improving performance.
(4) We also evaluate the empirical performance of our SRGSP method on sparse linear and logistic regression tasks as well as real-world applications such as image denoising and face recognition. Our experimental results show the efficiency and effectiveness of SRGSP.
The remainder of this paper is organized as follows. In Section 2, we introduce the related applications (i.e., image denoising and face recognition), and we propose our SRGSP algorithm in Section 3. The convergence analysis is provided in Section 4. In Section 5, many experimental results on both synthetic and real-world datasets verify the effectiveness of SRGSP, and the results of image denoising and face recognition further demonstrate the superiority of SRGSP against some state-of-the-art hard thresholding algorithms. Section 6 presents conclusions and future work.

Related Work
In this section, we start with a brief description of some related applications, in which sparse representation can play an important role.

Notation
In this paper, ‖x‖_0 denotes the number of nonzero entries of the vector x, supp(x) denotes the index set of the nonzero entries of x, and supp(x, s) is the index set of the top s entries of x in magnitude. In addition, we denote by I^c the complement of the set I, and by x|_I the restriction of the vector x to the coordinates indexed by I. Furthermore, we denote by H_F(·) the Hessian matrix of the function F(·), and by E(·) the expectation.
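The notation above can be made concrete in code; the following is a small illustrative helper set, where x|_I is modeled as zeroing every coordinate outside I:

```python
import numpy as np

def supp(x):
    """supp(x): index set of nonzero entries of x."""
    return set(np.nonzero(x)[0])

def supp_top(x, s):
    """supp(x, s): index set of the top-s entries of x in magnitude."""
    return set(np.argsort(np.abs(x))[-s:])

def restrict(x, I):
    """x|_I: keep the coordinates indexed by I, zero out the rest."""
    out = np.zeros_like(x)
    idx = list(I)
    out[idx] = x[idx]
    return out
```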

Sparse Representation-Based Image Denoising
In sparse representation, clean images or signals can be approximated by a sparse combination of atoms from a basis set, called a dictionary. In this setting, denoising a patch vector y_j, extracted from the noisy image matrix, with a dictionary D ∈ R^{n×d} amounts to solving the following sparsity constrained optimization problem:

min_x ∑_i (y_i^j − D_i x)², subject to ‖x‖_0 ≤ s, (2)

where D_i x is an estimate of y_i^j, s is a sparsity constant, and y_j is the j-th patch of the noisy image y. There are many dictionary learning algorithms, such as [8,37,38], which alternately update the dictionary and the learned sparse iterate x. Although these algorithms have demonstrated that dictionaries learned on noisy images or on a set of good-quality images can achieve better performance than off-the-shelf ones such as [9], we use a fixed overcomplete dictionary here to isolate the behavior of our algorithm in sparse coding. The overcomplete dictionary D, in which the number of columns may be greater than the number of rows, can be obtained from the discrete cosine transform (DCT) [39] or its redundant version, as implemented in [38]. Since images may be very large, current practice is to sparsely represent image patches rather than the full image.
In summary, we obtain an overcomplete dictionary matrix D by the DCT and then use ℓ0-norm constrained optimization algorithms to find an approximate solution of Equation (2) to restore the image.
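As an illustration of the fixed dictionary used in the experiments later, the sketch below builds an overcomplete 2D-DCT dictionary (e.g., 64 × 256 for 8×8 patches) as the Kronecker product of a 1D overcomplete DCT basis with itself; this mirrors a common construction, but the exact implementation in [38] may differ in details:

```python
import numpy as np

def overcomplete_dct_dictionary(patch_size=8, atoms_1d=16):
    """Build a (patch_size^2) x (atoms_1d^2) overcomplete 2D-DCT dictionary,
    e.g., 64 x 256 for 8x8 patches, as the Kronecker product of a 1D
    overcomplete DCT basis with itself."""
    D1 = np.zeros((patch_size, atoms_1d))  # 1D basis: patch_size x atoms_1d
    for k in range(atoms_1d):
        atom = np.cos(np.arange(patch_size) * k * np.pi / atoms_1d)
        if k > 0:
            atom -= atom.mean()            # remove DC from non-constant atoms
        D1[:, k] = atom / np.linalg.norm(atom)
    return np.kron(D1, D1)                 # (patch_size^2) x (atoms_1d^2)
```

Since each 1D atom has unit norm, every column of the Kronecker product is also unit-norm.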

Sparse Representation-Based Face Recognition
Face recognition is an active research field in computer vision; the task is to use labeled training samples from k classes to classify a testing sample into the correct class. In this paper, we plug our algorithm into the SRC framework [12] for face recognition. Whereas SRC uses ℓ1-norm minimization to solve the sparse coding model, here we use ℓ0-norm minimization instead, which also yields a sparse solution; the resulting algorithm is given in Appendix A. In the SRC algorithm, an l×h gray-scale facial training image is reshaped into a column vector a_{r,u} ∈ R^n, i.e., n = lh. Then we construct the matrix A_r = [a_{r,1}, a_{r,2}, ..., a_{r,d_r}] ∈ R^{n×d_r} using the d_r training samples belonging to the r-th class. Each testing sample y_r = [y_1^r, ..., y_n^r]^T ∈ R^n of the same class can be linearly represented by the columns of A_r:

y_r = a_{r,1} x_{r,1} + a_{r,2} x_{r,2} + ... + a_{r,d_r} x_{r,d_r}. (3)

Here, x_{r,1}, x_{r,2}, ..., x_{r,d_r} are scalars, the representation coefficients of y_r. Since the class of the testing sample is unknown, we consider all training samples of the k classes and define a matrix A = [A_1, A_2, ..., A_k] ∈ R^{n×d}. Therefore, the representation of a testing sample can be rewritten with respect to all training samples as:

y_r = A x_0, (4)

where x_0 is a coefficient vector whose nonzero elements are associated only with the r-th class. In this paper, sparse representation with ℓ0-norm minimization is used to solve the following sparsity constrained optimization problem in the SRC framework:

min_x ∑_{i=1}^{n} (y_i^r − A_row^i x)², subject to ‖x‖_0 ≤ s, (5)

where s is the sparsity constant, which bounds the number of nonzero elements of x, and A_row^i ∈ R^{1×d} is the i-th row vector of A. We define δ_p : R^d → R^d as the selection function corresponding to the p-th class: given a sparse vector x ∈ R^d from Equation (5), δ_p(x) ∈ R^d is a new vector whose nonzero elements are associated only with the p-th class. Then minimizing the following residual function encodes the identity of the sample y_r:

identity(y_r) = arg min_p ‖y_r − A δ_p(x)‖_2. (6)
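The residual-based classification rule of Equation (6) can be sketched as follows; `src_classify` and `class_of_atom` are hypothetical names introduced here for illustration, with `class_of_atom[j]` giving the class label of the j-th column of A:

```python
import numpy as np

def src_classify(A, class_of_atom, y, x):
    """Given a sparse code x of test sample y over the stacked training
    matrix A = [A_1, ..., A_k], return the class p with the smallest
    residual ||y - A * delta_p(x)||_2."""
    residuals = {}
    for p in set(class_of_atom):
        # delta_p(x): keep only the coefficients belonging to class p.
        delta_p = np.where(np.asarray(class_of_atom) == p, x, 0.0)
        residuals[p] = np.linalg.norm(y - A @ delta_p)
    return min(residuals, key=residuals.get)
```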

Our Stochastic Recursive Gradient Support Pursuit Method
In this section, we propose a novel stochastic recursive gradient support pursuit (SRGSP) method for sparsity constrained problems. Different from existing gradient support pursuit methods (e.g., GraSP [22]), SRGSP only requires its sub-solver to satisfy a certain descent condition in each iteration, and thus has a faster convergence speed in practice.
In recent years, many non-convex gradient support pursuit methods such as [20,22] have been proposed, and it has been shown that they can outperform convex ℓ1-norm methods in certain circumstances. Most existing gradient support pursuit algorithms use deterministic optimization methods to minimize various sparse learning problems (e.g., Problem (2)). However, the per-iteration complexity of all these algorithms is O(nd), which leads to slow convergence, especially for large-scale and high-dimensional problems. Inspired by GraSP [22], a well-known gradient support pursuit method, we propose an efficient stochastic recursive gradient support pursuit (SRGSP) algorithm to approximate the solution of Problem (2), as outlined in Algorithm 1.

Algorithm 1: Stochastic Recursive Gradient Support Pursuit (SRGSP)
Input: sparsity level s, learning rate η, and the numbers of outer- and inner-iterations, T and J.
Initialize: x_0.
At each outer-iteration of Algorithm 1, we first compute the gradient of F(·) at the current estimate, i.e., g_0 = ∇F(x_{t−1}). Then we choose the 2s coordinates of g_0 with the largest magnitude as the directions in which pursuing the minimization will be most effective, and denote their index set by Z, where s is the sparsity constant. Merging the support of the current estimate with these 2s coordinates, we obtain the combined support, a set of at most 3s indices, i.e., T = Z ∪ supp(x_{t−1}) (some of this notation was defined in Section 2). Over the current support set T, we compute an estimate b by stochastic recursive gradient descent as an approximate solution to Problem (7).
The key difference between GraSP and SRGSP is that GraSP needs to compute the exact solution b̂ of the following minimization problem:

b̂ = arg min_x F(x), subject to x|_{T^c} = 0, (7)

where T^c is the complement of the set T in line 4 of Algorithm 1. In contrast, our SRGSP method only requires a sub-solver (e.g., Steps 5 to 10 of Algorithm 1) that finds an approximate solution b to Problem (7) satisfying

E‖b − b̂‖_2² ≤ c_1 ‖x_{t−1} − b̂‖_2², (8)

where b̂ is the exact solution of Problem (7), x_{t−1} is the result of the last outer-iteration, and 0 < c_1 < 1 is an error bound constant, meaning that our algorithm achieves a guaranteed decrease at each iteration, as shown in our convergence analysis in the next section. In other words, we can select other efficient solvers (e.g., SVRG [25], VR-SGD [32] and their accelerated variants [40,41]) within the proposed framework, as long as they satisfy the descent condition in Equation (8).
Since stochastic recursive gradient descent [42,43] has been proved to converge faster than other stochastic gradient estimators such as SVRG [25] on non-convex optimization problems, we choose the former as our solver rather than SVRG as in [24]. When a fully deterministic optimization method is used as the sub-solver in Algorithm 1 for solving Problem (7), GraSP can be viewed as a special case of SRGSP.
In our experiments, we usually set J = 2n inner-iterations, as in the original SARAH algorithm [35]. Within each inner loop of Algorithm 1, our main update rules are:

g_j = ∇f_{i_j}(b_j) − ∇f_{i_j}(b_{j−1}) + g_{j−1}, b_{j+1} = b_j − η g_j|_T,

where i_j is sampled uniformly at random from {1, ..., n}, b_0 = x_{t−1}, and g_0 = ∇F(b_0). Note that g_j is the stochastic recursive gradient first proposed in [35]. That is, our SRGSP algorithm updates g_j using accumulated stochastic information, which naturally accelerates convergence. The parameter x_t is then updated by the hard thresholding operator, which keeps the s largest-magnitude entries of the intermediate estimate b. This step makes x_t the best s-term approximation of the estimate b. The hard thresholding operator is defined as follows:

[H_s(x)]_i = x_i if i ∈ supp(x, s), and 0 otherwise,

where x_i is the i-th coordinate of the vector x.

Assumption 1. The solution to the sub-problem (7) is unique.
From the above analysis, we can see that our SRGSP algorithm applies one hard thresholding operation after many stochastic recursive gradient updates, while existing stochastic algorithms such as SVRGHT [24] perform hard thresholding in every inner-iteration, which is very time consuming for high-dimensional problems.
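Putting the pieces together, Algorithm 1 can be sketched for the least-squares loss F(x) = (1/2n)‖Ax − y‖_2². This is an illustrative reading of the algorithm, with the step size, loop counts, and function names as assumptions rather than the authors' reference implementation:

```python
import numpy as np

def hard_threshold(x, s):
    """H_s(x): keep the s largest-magnitude entries of x, zero out the rest."""
    idx = np.argsort(np.abs(x))[-s:]
    out = np.zeros_like(x)
    out[idx] = x[idx]
    return out

def srgsp(A, y, s, eta=0.01, T=20, J=None, rng=None):
    """Sketch of SRGSP for min_x (1/2n)||Ax - y||^2 s.t. ||x||_0 <= s."""
    n, d = A.shape
    J = 2 * n if J is None else J
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.zeros(d)
    for _ in range(T):
        # Step 1: full gradient at the current estimate.
        g = A.T @ (A @ x - y) / n
        # Step 2: merge the top-2s gradient coordinates with supp(x).
        Tset = np.union1d(np.argsort(np.abs(g))[-2 * s:],
                          np.nonzero(x)[0]).astype(int)
        mask = np.zeros(d)
        mask[Tset] = 1.0
        # Step 3: SARAH-style recursive-gradient inner loop on the support.
        b_prev, b = x.copy(), x - eta * (g * mask)
        for _ in range(J):
            i = rng.integers(n)
            g = A[i] * (A[i] @ b - y[i]) - A[i] * (A[i] @ b_prev - y[i]) + g
            b_prev, b = b, b - eta * (g * mask)
        # Step 4: a single hard thresholding per outer-iteration.
        x = hard_threshold(b, s)
    return x
```

Note that hard thresholding is applied once per outer-iteration, after J recursive-gradient steps, in contrast to SVRGHT, which thresholds inside every inner-iteration.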

Convergence Analysis
In this section, we provide the convergence analysis of our SRGSP algorithm.

Convergence Property of Our Sub-solver
In this part, we consider the convergence property of our sub-solver in Algorithm 1, namely that it satisfies the descent condition in Equation (8). As most algorithms in the literature provide a bound on F(b) − F(b̂), our convergence analysis requires additional structure in F(·) to obtain a bound on ‖b − b̂‖_2. We therefore recall the following insightful summary of such structures of F(·) [44].

Lemma 1. Let F(·) be a function with a Lipschitz-continuous gradient. Then the following implications hold:

(SC) ⇒ (ESC) ⇒ (WSC) ⇒ (PL) ⇒ (QG),

where SC means Strong Convexity, ESC Essential Strong Convexity, WSC Weak Strong Convexity, PL the Polyak-Łojasiewicz condition, and QG Quadratic Growth. For their definitions, we refer the reader to [44]. If we further assume that F(·) is convex, then (PL) ≡ (QG).
These results show that QG is the weakest assumption. Next, we prove that our sub-solver satisfies the descent condition in Equation (8).
Theorem 1. Suppose F(·) satisfies the QG condition with parameter ρ and has an L-Lipschitz-continuous gradient. If the number of inner-iterations J is sufficiently large, our sub-solver has the following expected convergence property:

E[F(b) − F(b̂)] ≤ c_2 [F(x_{t−1}) − F(b̂)], (9)

where 0 < c_2 < 1 is a constant, and then we have

E‖b − b̂‖_2² ≤ (2c_2 L/ρ) ‖x_{t−1} − b̂‖_2².

The detailed proofs of Theorem 1 and the theorem below are provided in the Supplementary Material. Similar to the linear convergence analysis of SARAH for convex problems in [35], our sub-solver exhibits expected descent in the objective value, as shown in Equation (9). If the sub-solver runs with a sufficiently large number of inner-iterations, then c_2 can be made very small. According to Theorem 1, one can easily verify that our sub-solver satisfies the descent condition in Equation (8) whenever 2c_2 L/ρ ≤ c_1. That is, our sub-solver with a sufficiently large number of inner-iterations satisfies the descent condition required by Algorithm 1.

Convergence Property of SRGSP
Before giving our main convergence result, we first present some important definitions.

Definition 1 (Stable Restricted Hessian).
Suppose that F(·) is a twice continuously differentiable function whose Hessian matrix is denoted by H_F(·). For a given positive integer k, let

A_k(x) = sup { u^T H_F(x) u : |supp(x) ∪ supp(u)| ≤ k, ‖u‖_2 = 1 },
B_k(x) = inf { u^T H_F(x) u : |supp(x) ∪ supp(u)| ≤ k, ‖u‖_2 = 1 },

for all k-sparse vectors x. Then F(·) is said to have a Stable Restricted Hessian (SRH) with constant μ_k if A_k(x) ≤ μ_k B_k(x) for all k-sparse x. This definition shows that the SRH condition is similar to the various forms of Restricted Strong Convexity (RSC) used in the performance analysis of existing sparsity constrained algorithms [22]. Note that this property is suitable for smooth loss functions, and a broad family of loss functions have Lipschitz-continuous gradients.

Theorem 2. Let F(·) be a twice continuously differentiable function that has μ_4s-SRH with μ_4s < √2 and satisfies Assumption 1. Suppose that for some ε > 0 we have ε < B_4s(u) for all 4s-sparse u, and let {x_t} be the sequence generated by Algorithm 1. Then we have

E‖x_t − x*‖_2 ≤ δ E‖x_{t−1} − x*‖_2 + (C/ε) ‖∇F(x*)|_I‖_2,

where δ := (1 + c_1)(μ_4s² − 1) + 2c_1 < 1, C > 0 is a constant, and I is the set of positions of the 3s largest entries of ∇F(x*) in magnitude.
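For a quadratic loss F(x) = ½‖Ax − y‖_2², the Hessian A^T A is constant, so the SRH constant can be estimated by enumerating k-sparse supports; this brute-force reading of the definition is our own illustration and is feasible only for small d:

```python
import numpy as np
from itertools import combinations

def srh_constant(A, k):
    """For F(x) = 0.5*||Ax - y||^2 the Hessian is the constant matrix A^T A,
    so A_k and B_k reduce to the extreme eigenvalues of k x k principal
    submatrices; mu_k is the worst-case ratio (enumerative, small d only)."""
    H = A.T @ A
    amax, bmin = -np.inf, np.inf
    for S in combinations(range(A.shape[1]), k):
        w = np.linalg.eigvalsh(H[np.ix_(S, S)])
        amax, bmin = max(amax, w[-1]), min(bmin, w[0])
    return amax / bmin
```

For an orthonormal A the ratio is 1, i.e., the best possible conditioning over all supports.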
As discussed above, our sub-solver with a sufficiently large number of inner-iterations can satisfy the descent condition in Equation (8) with a very small constant c_1, which makes δ < 1 hold.
Then we also have 0 < c_1 < (2 − μ_4s²)/(μ_4s² + 1). This implies that our sub-solver has to achieve a certain accuracy for the theorem to hold. Theorem 2 shows that our proposed algorithm achieves a linear convergence rate. The error bound consists of two terms: the first corresponds to the optimization error and the second to the statistical error. After sufficiently many iterations, the first term approaches zero. Therefore, our algorithm converges to the unknown true parameter x*, up to the statistical error, as the number of iterations increases.
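The requirement δ < 1 in Theorem 2 translates into an explicit accuracy requirement on the sub-solver; assuming the stated form of δ, solving for c_1 gives:

```latex
\delta = (1+c_1)\,(\mu_{4s}^2 - 1) + 2c_1 < 1
\;\Longleftrightarrow\;
c_1\,(\mu_{4s}^2 + 1) < 2 - \mu_{4s}^2
\;\Longleftrightarrow\;
c_1 < \frac{2 - \mu_{4s}^2}{\mu_{4s}^2 + 1},
```

and the right-hand side is positive exactly when μ_4s < √2, matching the SRH assumption of Theorem 2.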

Experimental Results
In this section, we evaluate the performance of our SRGSP method on synthetic and real-world large-scale datasets. Moreover, we apply SRGSP to various sparse representation problems, including image denoising and face recognition. We use these two real-world applications to illustrate the performance of our SRGSP algorithm against other sparse learning optimization methods, including GraSP [22], SG-HT [23], SVRGHT [24], and loopless semi-stochastic gradient descent with less hard thresholding (LSSG-HT) [34].

Synthetic Data
We generated a synthetic matrix A of size n×d, each row of which was drawn independently from a d-dimensional Gaussian distribution with mean 0 and covariance matrix Σ ∈ R^{d×d}. The response vector was generated from the linear model y = Ax* + e, where x* ∈ R^d is the s*-sparse coefficient vector, and the noise e was generated from a multivariate normal distribution N(0, σ²I) with σ² = 0.01. The nonzero entries of x* were sampled independently from a uniform distribution over the interval [−1, 1]. For the experiments, we constructed two synthetic datasets: (1) n = 2500, d = 5000, s* = 250, Σ = I; (2) n = 5000, d = 10,000, s* = 500, with the diagonal entries of the covariance matrix Σ set to 1 and the off-diagonal entries set to 0.1. The sparsity parameter was set to s = 1.2s* for all the algorithms. Figure 1 shows the performance (the logarithm of the objective function value and the estimation error) of all the algorithms on the synthetic data. The results show that our algorithm converges significantly faster than the state-of-the-art methods in terms of both function value and estimation error in all settings. Although SRGSP, SVRGHT and LSSG-HT have all been theoretically proved to have a linear convergence rate, SRGSP consistently outperforms SVRGHT and LSSG-HT in terms of the number of effective passes.

Figure 1. Comparison of gradient support pursuit (GraSP) [22], stochastic gradient hard thresholding (SG-HT) [23], stochastic variance reduced gradient with hard thresholding (SVRGHT) [24], loopless semi-stochastic gradient descent with less hard thresholding (LSSG-HT) [34] and SRGSP for solving sparse linear regression problems on synthetic data. (a) n = 2500, d = 5000, s* = 250; (b) n = 5000, d = 10,000, s* = 500.
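The synthetic data described above can be generated with a short script; the following is a sketch with our own parameter names, and the sizes should be scaled up to match the experiments:

```python
import numpy as np

def make_synthetic(n, d, s_star, rho=0.0, sigma2=0.01, seed=0):
    """Generate (A, x_star, y) as described: Gaussian rows with covariance
    Sigma (ones on the diagonal, rho elsewhere), an s*-sparse coefficient
    vector with nonzeros uniform on [-1, 1], and Gaussian noise."""
    rng = np.random.default_rng(seed)
    Sigma = np.full((d, d), rho)
    np.fill_diagonal(Sigma, 1.0)
    A = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    x_star = np.zeros(d)
    idx = rng.choice(d, size=s_star, replace=False)
    x_star[idx] = rng.uniform(-1.0, 1.0, size=s_star)
    e = rng.normal(0.0, np.sqrt(sigma2), size=n)
    return A, x_star, A @ x_star + e
```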

Real-World Data
In this subsection, we focus on two real-world large-scale datasets: rcv1-train and real-sim, which can be downloaded from the LIBSVM website (https://www.csie.ntu.edu.tw/~cjlin/libsvm/). In our experiments, we use the rcv1-train and real-sim datasets to evaluate the performance for linear regression; the two datasets include 20,242 samples with 47,236 features and 72,309 samples with 20,958 features, respectively. Moreover, we set s = 200 for both datasets and compare all the algorithms in terms of the logarithm of the objective value gap versus the number of effective passes and the running time (in seconds). Figure 2 illustrates the performance of our algorithm in terms of the logarithm of the function value gap (i.e., log(F(x_t) − F(x*))). More specifically, our SRGSP algorithm has a faster convergence rate than the four state-of-the-art sparsity constrained algorithms. In addition, SRGSP is able to escape a local minimum and find a better solution, as shown in Figure 2b. Compared with SVRGHT, the results of SRGSP in the first few iterations are similar; however, thanks to many gradient updates followed by a single hard thresholding, SRGSP obtains a better solution, as discussed in [36]. This further verifies the advantage of SRGSP over the other methods. On the other hand, our SRGSP algorithm reaches better solutions in much less CPU time than the other methods, including SVRGHT and LSSG-HT. SRGSP performs one hard thresholding operation per outer-iteration, while SVRGHT needs n operations per epoch (n is the number of samples); thus SVRGHT has a higher per-iteration complexity than SRGSP, especially for large-scale and high-dimensional data. Therefore, our SRGSP algorithm is very well suited to large-scale non-convex sparsity constrained problems.

Figure 2. Comparison of GraSP [22], SG-HT [23], SVRGHT [24], LSSG-HT [34] and our SRGSP method for solving sparse linear regression problems. In each plot, the vertical axis shows the logarithm of the objective value minus the minimum, and the horizontal axis is the number of effective passes over the data or the running time (in seconds). (a) rcv1; (b) real-sim.

Image Denoising
In this subsection, we apply our SRGSP algorithm to image denoising tasks to evaluate its performance. First of all, the most important ingredient in image denoising is a suitable dictionary. The DCT is a reliable choice following Guleryuz's work [45], and we use the overcomplete DCT dictionary in our experiments. Similar to [8], the error bound ε is set empirically to 1.15σ. The experiments are conducted on 6 standard benchmark images with synthetic white Gaussian noise. The sparsity level is set to s = 10, and Gaussian random noise with zero mean and standard deviation σ is added to the standard images. The dictionary is the fixed overcomplete DCT dictionary of size 64 × 256, designed to process image patches of 8×8 pixels. The denoising process mainly consists in sparse coding of these patches using the different sparsity constrained algorithms (e.g., GraSP, SG-HT, SVRGHT, LSSG-HT and SRGSP) and the classical greedy orthogonal matching pursuit (OMP) algorithm [46]. The sparse code x is computed until the loss of Equation (1) falls below the error bound or the number of iterations exceeds 32 (half the row size of the dictionary). When computing the sparse representation over overlapping patches, every algorithm has to evaluate the sparse coding solution of 62,001 patches for images of size 256×256, or 255,025 patches for images of size 512×512. Then the restored patches are averaged following the same procedure as in [8]. All the experiments are repeated 10 times, and the average results are reported. Table 1 shows the results (PSNR and SSIM) of all the algorithms at different noise levels, i.e., values of σ from 5 to 55. As we can see from Table 1, our SRGSP algorithm obtains higher PSNR and SSIM than the other methods in all settings, which indicates that the intrinsic low-dimensional structure can be found by our algorithm.
Figure 3 shows the visual results of all the methods (i.e., SRGSP, LSSG-HT, SVRGHT, SG-HT, GraSP and OMP) on the cameraman image with σ = 15 and s = 10. It is clearly visible that the sky region of the cameraman image is well restored by SRGSP, while the results of the other methods are not as well recovered. Moreover, our SRGSP algorithm attains a higher PSNR of 29.08 dB, compared to 27.96 dB for SVRGHT and 27.32 dB for OMP. The SSIM results of OMP, SVRGHT and SRGSP are 0.6006, 0.8588 and 0.8761, respectively. All these results demonstrate the effectiveness of SRGSP for image denoising tasks.

Table 1. The denoising results (PSNR (dB)/SSIM) of all the methods, including OMP [46], GraSP [22], SG-HT [23], SVRGHT [24], LSSG-HT [34] and our SRGSP method, on 6 standard images at noise levels from 5 to 55. The best performance is shown in bold.

Figures 4-6 show the denoising results of all the algorithms on the hill, pepper and boat images at different noise levels (e.g., σ = 25, 35, 45). We can see that our SRGSP algorithm consistently outperforms the other methods in terms of both PSNR and SSIM. Moreover, we offer the following empirical parameter-tuning suggestions for SRGSP. Based on all the experimental results, we find that in image denoising tasks the error bound can be set empirically to 1.15σ to yield good results. For general parameter settings, the number of outer-iterations T is set in the interval [20, 30], and the number of inner-iterations is set to an integer multiple of the number of samples.
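The overlapping-patch pipeline described above can be sketched as follows; `sparse_coder` is a placeholder introduced here for any of the ℓ0 solvers compared in this section:

```python
import numpy as np

def denoise_image(img, D, sparse_coder, patch=8):
    """Patch-based denoising sketch: sparse-code every overlapping patch over
    the dictionary D with a user-supplied l0 sparse coder, then average the
    reconstructed patches back into the image."""
    h, w = img.shape
    acc = np.zeros_like(img, dtype=float)   # sum of patch reconstructions
    cnt = np.zeros_like(img, dtype=float)   # how many patches cover a pixel
    for i in range(h - patch + 1):
        for j in range(w - patch + 1):
            p = img[i:i + patch, j:j + patch].reshape(-1)
            x = sparse_coder(D, p)                  # sparse code of the patch
            rec = (D @ x).reshape(patch, patch)
            acc[i:i + patch, j:j + patch] += rec
            cnt[i:i + patch, j:j + patch] += 1.0
    return acc / cnt
```

With a perfect coder (e.g., an identity dictionary and exact codes), the averaging step reproduces the input image, which makes the overlap bookkeeping easy to sanity-check.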

Face Recognition
In this subsection, we evaluate the performance of our SRGSP algorithm for face recognition on two real-world face datasets. More specifically, SRGSP is used as the solver in the sparse representation-based classification (SRC) framework [12]. We compare the recognition rates of SRGSP with those of several state-of-the-art sparsity constrained algorithms, namely GraSP, SG-HT, SVRGHT and LSSG-HT. In order to evaluate the robustness of our algorithm, we manually add Gaussian noise to the face data.

Datasets
Although many datasets are available for face recognition, we choose two common publicly available ones, the AR database [47] and the extended Yale B database [48]. The extended Yale B database contains 2414 frontal-face images of 38 people under different controlled lighting conditions [48]. For each individual, we randomly choose 26 images for training and 15 images for testing. The AR database contains over 4000 color images of 126 people's faces (70 men and 56 women). The images were obtained under different conditions, including varying facial expressions, illumination conditions, and occlusions such as a scarf or sunglasses. For simplicity, we randomly choose 100 subjects, each with 15 images for training and 11 for testing. Note that the AR database may be more difficult for face recognition because there are more classes to recognize and fewer training samples per subject.

Experimental Setup
For each sparsity constrained algorithm, the sparsity level makes a great difference to the solution of sparse representation, especially at different noise levels. Therefore, in order to approach the best performance of all these algorithms, we vary the sparsity parameter within a certain range for each algorithm, i.e., s ∈ {5, 10, 20, 30, 40, 50}, which ensures that each algorithm achieves its best recognition rate under this parameter setting. In all the experiments, the images are down-sampled to 32×32 pixels. As in [12], a series of preprocessing operations are applied to the above datasets. We first rescale the training matrix into [0,1] for the convenience of adding noise, and then add Gaussian noise with zero mean and standard deviation σ. Finally, we normalize the columns of the training matrix to have unit ℓ2-norm.
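The preprocessing steps above can be sketched as follows. This is a NumPy illustration with hypothetical names (`preprocess`, `train_images`, the fixed seed); the actual pipeline may differ in details such as how the rescaling is computed:

```python
import numpy as np

def preprocess(train_images, sigma, seed=0):
    """Rescale to [0, 1], add zero-mean Gaussian noise, and normalize
    each column to unit l2-norm.

    train_images: (d, n) matrix whose columns are vectorized face images.
    sigma: standard deviation of the additive Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    X = train_images.astype(float)
    # rescale the whole training matrix into [0, 1]
    X = (X - X.min()) / (X.max() - X.min() + 1e-12)
    # add Gaussian noise with zero mean and standard deviation sigma
    X = X + rng.normal(0.0, sigma, size=X.shape)
    # normalize columns to unit l2-norm
    X = X / (np.linalg.norm(X, axis=0, keepdims=True) + 1e-12)
    return X

A = preprocess(np.random.rand(32 * 32, 5) * 255, sigma=0.25)
print(np.linalg.norm(A, axis=0))  # each column now has (near-)unit norm
```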

Results on Real-World Face Data
In this part, we report the recognition rates of SRGSP on both the AR and the extended Yale B databases. Figure 7 shows real testing images from the AR database at different Gaussian noise levels (e.g., σ = 0.25 and σ = 0.5). As we can see, it is challenging even for humans to correctly recognize the face images in this situation. However, even under these extreme conditions, SRGSP achieves a high recognition rate with high probability.
In Figure 8, SVRGHT, LSSG-HT and SRGSP obtain much higher recognition rates than the other two methods on both datasets, which verifies the superiority of the variance reduction and recursive gradient techniques. Although SRGSP may have slightly lower recognition rates than SVRGHT at the very early iterations, as the number of iterations increases, SRGSP achieves the highest recognition rate among all the classifiers. Moreover, SRGSP is several times faster than SVRGHT due to its fewer hard thresholding operations. For example, when σ = 0.25 on the Yale B database, SRGSP obtains over 90% recognition rate within 50 s, while SVRGHT needs more CPU time to reach the same accuracy. In fact, for the same number of passes, SRGSP still achieves the highest recognition rate among the state-of-the-art hard thresholding algorithms, which demonstrates the effectiveness and efficiency of SRGSP. Even in the extreme situation of the AR database with Gaussian noise level σ = 0.5, SRGSP still achieves a higher recognition rate than the other methods. The recognition rates with respect to the number of passes on both datasets are provided in Figures 9 and 10, and further demonstrate the superiority of SRGSP.

Conclusions and Future Work
In this paper, we proposed a stochastic recursive gradient support pursuit (SRGSP) method for solving large-scale sparsity constrained optimization problems. We also provided the convergence analysis of SRGSP, which shows that SRGSP attains a linear convergence rate. Whereas existing hard thresholding-based algorithms require a hard thresholding operation in every iteration, SRGSP needs only one per epoch, and thus has a significantly lower per-iteration computational complexity, i.e., O(d) vs. O(d log(d)). Experimental results on large-scale synthetic and real-world datasets verified the effectiveness and efficiency of SRGSP.
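To illustrate this per-epoch structure on a concrete objective, the following is a simplified sketch for least squares, with a SARAH-style recursive gradient inner loop and a single hard thresholding (sparse projection) at the end of each epoch. It is our own illustrative reconstruction under those assumptions, not the exact pseudocode of SRGSP; the step size `eta` and all names are hypothetical:

```python
import numpy as np

def srgsp_sketch(A, b, s, eta=0.01, T=20, m=None, seed=0):
    """Sketch: minimize f(x) = (1/2n)||Ax - b||^2 subject to ||x||_0 <= s,
    with one hard thresholding per outer epoch."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    m = m or n                        # inner iterations per epoch
    x = np.zeros(d)
    for _ in range(T):
        v = A.T @ (A @ x - b) / n     # full gradient at the epoch start
        y_prev, y = x, x - eta * v
        for _ in range(m):
            i = rng.integers(n)       # sample one data point
            # recursive (SARAH-style) gradient estimator:
            # v_t = grad_i(y_t) - grad_i(y_{t-1}) + v_{t-1}
            v = A[i] * (A[i] @ y - b[i]) - A[i] * (A[i] @ y_prev - b[i]) + v
            y_prev, y = y, y - eta * v
        # single hard thresholding per epoch: keep the s largest entries
        idx = np.argpartition(np.abs(y), -s)[-s:]
        x = np.zeros(d)
        x[idx] = y[idx]
    return x

# tiny usage sketch: recover a 3-sparse signal from noiseless measurements
rng = np.random.default_rng(1)
A = rng.normal(size=(200, 50))
x_true = np.zeros(50)
x_true[[3, 10, 40]] = [1.0, -1.0, 2.0]
x_hat = srgsp_sketch(A, A @ x_true, s=3, T=30)
```

The point of the sketch is only the loop structure: the inner loop performs cheap dense updates, and the O(d log(d)) projection is paid once per epoch rather than once per iteration.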
Moreover, we also applied our SRGSP method to tackle image denoising and face recognition tasks, where sparse representation learning plays an important role. Our experimental results show that SRGSP outperforms other sparse representation methods in terms of PSNR and recognition rates. Note that for the image denoising application, the dictionary is a fixed overcomplete DCT dictionary. Inspired by more sophisticated methods such as K-SVD [38], in future work we will iteratively update the dictionary, which could further improve performance. In fact, there are many real-world sparse representation learning applications, such as image super-resolution, image restoration, image classification and visual tracking. Therefore, we will apply our SRGSP method to more of these applications in the future. In addition, our SRGSP algorithm can be extended to tackle low-rank matrix and tensor completion and recovery problems, as in [49][50][51].

Sparse Representation-Based Classification (SRC)
In this paper, we apply our algorithm within the SRC framework for face recognition tasks, as outlined in Algorithm A1.
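The SRC decision rule assigns a test sample to the class whose training columns yield the smallest reconstruction residual given the sparse code. A minimal sketch of this rule follows; `solve_sparse` stands in for any sparsity constrained solver (such as SRGSP), and all names here are illustrative rather than taken from Algorithm A1:

```python
import numpy as np

def src_classify(D, labels, y, solve_sparse):
    """SRC decision rule: sparsely code y over the dictionary D, then pick
    the class whose training columns give the smallest residual.

    D: (d, n) dictionary whose columns are (unit-norm) training samples.
    labels: (n,) class label of each column of D.
    y: (d,) test sample.
    solve_sparse: any sparse solver, called as x = solve_sparse(D, y).
    """
    x = solve_sparse(D, y)
    best_class, best_res = None, np.inf
    for c in np.unique(labels):
        xc = np.where(labels == c, x, 0.0)   # keep only class-c coefficients
        res = np.linalg.norm(y - D @ xc)     # class-wise reconstruction residual
        if res < best_res:
            best_class, best_res = c, res
    return best_class
```

For example, plugging in a plain least-squares solver (`lambda D, y: np.linalg.lstsq(D, y, rcond=None)[0]`) already exercises the decision rule; replacing it with an s-sparse solver gives the sparsity constrained variant used in the experiments.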