Kernel Joint Sparse Representation Based on Self-Paced Learning for Hyperspectral Image Classification

Abstract: By means of joint sparse representation (JSR) and kernel representation, kernel joint sparse representation (KJSR) models can effectively model the intrinsic nonlinear relations of hyperspectral data and better exploit spatial neighborhood structure to improve the classification performance of hyperspectral images. However, the performance of KJSR is greatly affected by noisy or inhomogeneous pixels around the central testing pixel in the spatial domain. Motivated by the idea of self-paced learning (SPL), this paper proposes a self-paced KJSR (SPKJSR) model to adaptively learn weights and sparse coefficient vectors for different neighboring pixels in the kernel-based feature space. The SPL strategy learns a weight that indicates the difficulty of each feature pixel within a spatial neighborhood. By assigning small weights to unimportant or complex pixels, the negative effect of inhomogeneous or noisy neighboring pixels can be suppressed. Hence, SPKJSR is usually much more robust. Experimental results on the Indian Pines and Salinas hyperspectral data sets demonstrate that SPKJSR is much more effective than traditional JSR and KJSR models.


Introduction
Hyperspectral sensors simultaneously acquire digital images in many narrow and contiguous spectral bands across a wide range of the spectrum. The resulting hyperspectral data contains both detailed spectral characteristics and rich spatial structure information from a scene. Exploiting the rich spectral and spatial information, hyperspectral remote sensing has been successfully applied in many fields [1][2][3], such as agriculture, the environment, the military, etc. In most of these applications, pixels in a scene need to be classified [2,4,5]. The commonly used classifiers include spectral-based classifiers such as nearest neighbor (NN) and support vector machine (SVM), and spatial-spectral classifiers such as mathematical morphological (MM) and Markov random field (MRF) methods [2]. Compared with the spectral-based classifiers that only use spectral information, spatial-spectral classifiers use both the spectral and spatial information of a hyperspectral image (HSI) and can produce much better classification results [5][6][7][8]. Now, spatial-spectral classifiers have become the research focus [2,5].
Under the definition of spatial dependency systems [5], spatial-spectral classification methods can be approximately divided into three categories [5]: preprocessing-based classification, postprocessing-based classification, and integrated classification. In preprocessing-based methods, spatial features are first extracted from the HSI and then used for classification. In postprocessing-based methods, a pixel-wise classification is first conducted, and then spatial information is used to refine the previous pixel-wise classification results. The integrated methods simultaneously use the spectral and spatial information to generate an integrated classifier. In this paper, we focus on the integrated methods.
The joint sparse representation (JSR)-based classification method is a typical integrated spatial-spectral classifier [9][10][11][12][13][14][15][16][17][18]. JSR pursues a joint representation of spatial neighboring pixels in a linear and sparse representation framework. If neighboring pixels are similar, making a joint representation of the neighboring pixels can improve the reliability of sparse support estimation [17,19]. The success of the JSR model mainly lies in the following two factors: (1) joint representation: the neighborhood pixel set is consistent, that is, pixels in a spatial neighborhood are highly similar or belong to the same class; (2) linear representation: the linear representation framework in the JSR model is coincident with the hyperspectral data characteristics. However, in practice, a spatial neighborhood is likely to contain inhomogeneous pixels [15], such as background, noise, and pixels from other classes, which dramatically affect the joint representation. In addition, hyperspectral data usually exhibit nonlinear characteristics [20][21][22][23], so the linear representation framework may also be unreasonable.
To alleviate the effect of noisy or inhomogeneous neighboring pixels, an intuitive and natural idea is to introduce a weight vector to discriminate neighboring pixels. The weight can be predefined as a non-local weight [10] or dynamically updated in a nearest regularized JSR (NRJSR) model [11]. Rather than weighting neighboring pixels, another method is to construct an adaptive spatial neighborhood system such that inhomogeneous neighboring pixels are precluded. The adaptive neighborhood can be constructed based on traditional image segmentation techniques [12], the superpixel segmentation method [13], and the shape-adaptive region extraction method [14]. Both the weighting method and adaptive neighborhood method improve the consistency of neighborhood pixel sets such that the joint representation framework is effective in most cases. However, all these methods are linear methods and show performance deficiency when data are not linearly separable.
Hyperspectral data are considered inherently nonlinear [23]. The nonlinearity is attributed to multiple scattering between solar radiation and targets, variations in the sun-canopy-sensor geometry, the presence of a nonlinear attenuating medium (e.g., water), the heterogeneity of pixel composition, etc. [23]. To cope with the nonlinearity, kernel-based JSR (KJSR) methods have been proposed [19,[24][25][26]. KJSR mainly includes two steps: projecting the original data into a high-dimensional feature space using a nonlinear map and then performing JSR in the feature space. Because the data in the feature space are more nearly linearly separable, the JSR model can be directly applied; in practice, only a kernel function between samples needs to be computed. Most works on the KJSR method concentrate on the design of the kernel function, including the Gaussian kernel [24], a spatial-spectral derivative-aided kernel [25], and a multi-feature composite kernel [26]. Compared with the original JSR, the use of kernel methods yields a significant performance improvement [24]. However, these kernel-based JSR methods assume that neighboring pixels have equal importance and do not consider the differences between neighboring pixels in the feature space. This is obviously unreasonable when pixels in the spatial neighborhood are inhomogeneous.
To simultaneously improve the joint representation and linear representation abilities of the JSR model, we adopt a self-paced learning (SPL) strategy [27][28][29] to select feature-neighboring pixels and propose a self-paced KJSR (SPKJSR) model. In detail, a self-paced regularization term is incorporated into the KJSR model, and the resulting regularized KJSR model can simultaneously learn the weights and sparse coefficient vectors for different feature-neighboring pixels. The optimized weight vector indicates the importance of neighboring pixels. The inhomogeneous or noisy pixels are automatically assigned small weights and hence their negative effects are eliminated.
The rest of this paper is organized as follows. Section 2 introduces JSR and KJSR and then describes our proposed method. Section 3 provides the experimental results. Section 4 gives a discussion. Finally, Section 5 draws a conclusion.

Self-Paced Learning (SPL)
Given a training set {x_i, y_i}_{i=1}^N, the goal of a learning algorithm is to learn a function f(·, θ) (or simply the parameter θ) by solving the following minimization problem:

\min_{\theta} \sum_{i=1}^{N} L(y_i, f(x_i, \theta)), \quad (1)

where y_i is the true value, f(x_i, θ) is the estimated value, and L is a loss function.
In model (1), all training samples are used to learn the parameter θ without considering the differences between training samples. When there exist noisy training samples or outliers, the learned function f or model parameter θ will be inaccurate.
Rather than using all training samples for learning, curriculum learning (CL) and self-paced learning (SPL) adopt a gradual learning strategy that selects samples from easy to complex for training [17,[27][28][29][30][31]. Both CL and SPL share a similar conceptual learning paradigm but differ in the derivation of the curriculum. A curriculum determines a sequence of training samples ranked in ascending order of learning difficulty. In CL, the curriculum is predefined by some prior knowledge and is thus problem-specific and lacks generalization. To alleviate this problem, SPL incorporates curriculum updating into the process of model optimization [28]. SPL jointly optimizes the learning objective and the curriculum such that the learned model and curriculum are consistent.
For model (1), SPL simultaneously optimizes the model parameter θ and the weight vector ω = [ω_1, ..., ω_N]^T by employing a self-paced regularizer:

\min_{\theta, \omega} \sum_{i=1}^{N} \omega_i L(y_i, f(x_i, \theta)) + h(\lambda, \omega), \quad (2)

where ω_i is a weight that indicates the difficulty of sample x_i, and h(λ, ω) is a self-paced function. λ is a "model age" parameter that controls how many samples are admitted into training. The parameters θ and ω in model (2) can be solved by an alternating optimization strategy [28].
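To make the alternating scheme concrete, the following is a minimal sketch (our own illustration, not the paper's code) of SPL with the simple binary regularizer h(λ, ω) = −λ Σ_i ω_i on a least-squares problem. With θ fixed, the optimal weights then have the closed form ω_i = 1 if the loss is below λ and 0 otherwise:

```python
import numpy as np

def spl_least_squares(X, y, lam, n_iter=10):
    """Self-paced least squares with the binary regularizer
    h(lam, w) = -lam * sum(w): alternately (1) fit theta by weighted
    least squares on the currently active samples and (2) update the
    0/1 weights in closed form: w_i = 1 iff loss_i < lam."""
    w = np.ones(X.shape[0])            # start with every sample active
    for _ in range(n_iter):
        Xw = X * w[:, None]            # rows scaled by their weights
        # solve the weighted normal equations  X^T W X theta = X^T W y
        theta = np.linalg.lstsq(Xw.T @ X, Xw.T @ y, rcond=None)[0]
        loss = (X @ theta - y) ** 2    # per-sample squared loss
        w = (loss < lam).astype(float) # closed-form weight update
    return theta, w
```

On data with an injected outlier, the alternation quickly drives the outlier's weight to zero while the inliers keep weight 1, so the fitted parameter is barely affected by the corruption.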

Joint Sparse Representation (JSR)
Given a set of N training samples x_1, x_2, ..., x_N with x_i ∈ R^B, we can construct a training dictionary as X = [x_1, x_2, ..., x_N] ∈ R^{B×N}. For a testing pixel z, we can extract its neighboring pixels in a w × w spatial window centered at z and denote them as Z = [z_1, z_2, ..., z_T] ∈ R^{B×T} (z_1 = z), where T is the number of pixels in the neighborhood.
In the framework of sparse representation [32], each neighboring pixel z_k can be represented by the training dictionary X with a sparse coefficient vector α_k. Because neighboring pixels in a small spatial window are highly similar, they have similar representations under the training dictionary. By further assuming that the positions of the nonzero elements in the sparse coefficient vectors are the same and combining the representations of the neighboring pixels, the following JSR model is generated [9]:

Z = [z_1, z_2, \ldots, z_T] = X [\alpha_1, \alpha_2, \ldots, \alpha_T] = XS, \quad (3)

where S = [α_1, α_2, ..., α_T] ∈ R^{N×T} is a matrix with only K nonzero rows. The sketches of sparse representation and joint sparse representation are shown in Figure 1. The optimization problem for solving the matrix S in (3) is:

\min_{S} \|Z - XS\|_F^2 \quad \text{s.t.} \quad \|S\|_{\text{row},0} \leq K, \quad (4)

where ‖·‖_F denotes the Frobenius norm, ‖S‖_{row,0} denotes the number of nonzero rows of S, and K is an upper bound of the sparsity level. The solution of (4) can be obtained by the simultaneous orthogonal matching pursuit (SOMP) algorithm [9,33].
Once the row-sparse matrix S is obtained, the testing pixel z is classified into the class with the minimal reconstruction residual:

\text{class}(z) = \arg\min_{c = 1, \ldots, C} r_c(Z), \quad (5)

where the c-th residual is r_c(Z) = \|Z - X_{:,\Omega_c} S_{\Omega_c,:}\|_F^2, and Ω_c ⊂ {1, 2, ..., N} is the index set corresponding to the c-th class.
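As a concrete reference, the greedy SOMP procedure for problem (4) can be sketched as follows (a simplified NumPy version; variable names are ours):

```python
import numpy as np

def somp(X, Z, K):
    """Simultaneous OMP for  min_S ||Z - X S||_F^2  with at most K
    nonzero rows of S: greedily pick the atom (column of X) most
    correlated with the joint residual, then refit by least squares."""
    N = X.shape[1]
    R = Z.copy()                       # residual matrix
    idx = []                           # selected atom indices (support)
    for _ in range(K):
        corr = np.linalg.norm(X.T @ R, axis=1)  # joint correlation per atom
        corr[idx] = -np.inf            # never reselect an atom
        idx.append(int(np.argmax(corr)))
        coef, *_ = np.linalg.lstsq(X[:, idx], Z, rcond=None)
        R = Z - X[:, idx] @ coef       # update the joint residual
    S = np.zeros((N, Z.shape[1]))
    S[idx, :] = coef
    return S
```

When all columns of Z genuinely share a small common support, the greedy selection recovers that support and the residual drops to zero after K steps.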

Kernel Joint Sparse Representation (KJSR)
In order to exploit the intrinsic nonlinear properties of the hyperspectral imagery, pixels in the original space are projected to a high-dimensional feature space by a nonlinear map, and a kernel-based JSR (KJSR) model is obtained by performing the JSR on the feature space [24].
Denote by φ a nonlinear map. The KJSR model assumes that the mapped neighboring pixels in the feature space (i.e., φ(z_1), φ(z_2), ..., φ(z_T)) are also similar and hence share a similar sparsity pattern. The KJSR model is represented as

Z_\phi = [\phi(z_1), \phi(z_2), \ldots, \phi(z_T)] = X_\phi S, \quad (6)

where X_φ = [φ(x_1), φ(x_2), ..., φ(x_N)] is the dictionary in the feature space. The optimization problem for solving the row-sparse matrix S is

\min_{S} \|Z_\phi - X_\phi S\|_F^2 \quad \text{s.t.} \quad \|S\|_{\text{row},0} \leq K, \quad (7)

which can be solved by the kernel SOMP (KSOMP) algorithm [24], as shown in Algorithm 1.
Algorithm 1 (KSOMP). Initialize the index set Λ_0 = ∅ and the iteration number k = 1, then run the following steps until convergence:
1. Compute the correlation coefficient matrix between the feature-space dictionary atoms and the current residual: C = K_{X,Z} - K_{X,\Lambda_{k-1}} (K_{\Lambda_{k-1},\Lambda_{k-1}})^{-1} K_{\Lambda_{k-1},Z}, where K_{A,B} denotes the kernel matrix between the corresponding sample sets.
2. Identify the optimal atom, and find the corresponding index: λ_k = \arg\max_j \|C_{j,:}\|_2.
3. Enlarge the index set: Λ_k = Λ_{k-1} ∪ {λ_k}. Update the iteration number: k = k + 1, and go to Step 1.
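A minimal kernel-matrix-only sketch of KSOMP, following the same greedy scheme as Algorithm 1 (our own simplified implementation; the small ridge term for numerical stability is our addition):

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix between the columns of A and B."""
    d2 = (np.sum(A**2, axis=0)[:, None] + np.sum(B**2, axis=0)[None, :]
          - 2.0 * A.T @ B)
    return np.exp(-gamma * d2)

def ksomp(K_XX, K_XZ, sparsity, ridge=1e-10):
    """Kernel SOMP using kernel matrices only: K_XX[i,j] = <phi(x_i),
    phi(x_j)>, K_XZ[i,t] = <phi(x_i), phi(z_t)>.  Greedily grows the
    support and returns (support indices, coefficient matrix S)."""
    N = K_XX.shape[0]
    idx = []
    for _ in range(sparsity):
        if idx:
            sub = np.ix_(idx, idx)
            # projection coefficients onto the selected feature atoms
            A = np.linalg.solve(K_XX[sub] + ridge * np.eye(len(idx)),
                                K_XZ[idx, :])
            # feature-space correlation with the joint residual
            C = K_XZ - K_XX[:, idx] @ A
        else:
            C = K_XZ
        corr = np.linalg.norm(C, axis=1)
        corr[idx] = -np.inf            # never reselect an atom
        idx.append(int(np.argmax(corr)))
    A = np.linalg.solve(K_XX[np.ix_(idx, idx)] + ridge * np.eye(len(idx)),
                        K_XZ[idx, :])
    S = np.zeros((N, K_XZ.shape[1]))
    S[idx, :] = A
    return idx, S
```

With a linear kernel (K_XX = X^T X, K_XZ = X^T Z) this reduces to plain SOMP, which is a convenient sanity check; in the paper's setting a Gaussian kernel such as `gaussian_kernel` would be used instead [24].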

Self-Paced Kernel Joint Sparse Representation (SPKJSR)
In the KJSR model, it is assumed that the transformed neighboring pixels φ(z_1), φ(z_2), ..., φ(z_T) have equal importance in the sparse representation. However, this assumption is usually unreasonable because there exist differences between the original spatial neighboring pixels, and the nonlinear map φ may further enlarge these differences. The spatial inconsistency mainly appears in the following two aspects: (1) When a target pixel is near the boundary of an object, its spatial neighborhood usually contains inhomogeneous pixels, such as background pixels or pixels from different classes; (2) when a target pixel lies in the center of a large homogeneous region, all neighboring pixels are from the same class. This notwithstanding, the spatial distances between the neighboring pixels and the central target pixel differ, and neighboring pixels that are far away from the center usually contribute little to the classification of the central target pixel, especially when the neighborhood window is large.
When there exists a spatial inconsistency between neighboring pixels, the feature representations of neighboring pixels in the kernel space are also dissimilar. Considering the distinctiveness of feature neighboring pixels, we employ a self-paced learning strategy to select important feature neighboring pixels and propose a self-paced KJSR (SPKJSR) model for the classification of HSIs.
For convenience, we first convert the matrix-norm-based objective function in model (7) into a vector-norm-based one as follows:

\min_{S} \sum_{t=1}^{T} \|\phi(z_t) - X_\phi \alpha_t\|_2^2 \quad \text{s.t.} \quad \|S\|_{\text{row},0} \leq K. \quad (8)

Based on the self-paced learning strategy, the SPKJSR model simultaneously optimizes a weight vector and the sparse coefficient matrix for the feature neighboring pixels:

\min_{S, \omega} \sum_{t=1}^{T} \omega_t \|\phi(z_t) - X_\phi \alpha_t\|_2^2 + h(\lambda, \omega) \quad \text{s.t.} \quad \|S\|_{\text{row},0} \leq K, \quad (9)

where ω = [ω_1, ω_2, ..., ω_T]^T is the weight vector of the feature neighboring pixels and h(λ, ω) is the self-paced function. Here, the self-paced learning strategy is used to select important feature neighboring pixels for the joint sparse representation. The optimization problem (9) can be solved by an alternating optimization strategy. With a fixed ω, (9) can be represented as:

\min_{S} \sum_{t=1}^{T} \omega_t \|\phi(z_t) - X_\phi \alpha_t\|_2^2 \quad \text{s.t.} \quad \|S\|_{\text{row},0} \leq K, \quad (10)

which can be written in matrix form as

\min_{S} \|(Z_\phi - X_\phi S) W^{1/2}\|_F^2 \quad \text{s.t.} \quad \|S\|_{\text{row},0} \leq K, \quad (11)

where W = diag(ω_1, ω_2, ..., ω_T). As the feature map φ is unknown, we cannot compute Z_φ W^{1/2} directly. Fortunately, in the KSOMP algorithm, we only need to compute the correlation matrix between Z_φ W^{1/2} and X_φ:

X_\phi^T Z_\phi W^{1/2} = K_{X,Z} W^{1/2}, \quad (12)

where (K_{X,Z})_{ij} = κ(x_i, z_j) is the kernel matrix between the training samples and the neighboring pixels. By employing the KSOMP algorithm (Algorithm 1), we can obtain the sparsity coefficient matrix

S = [\alpha_1, \alpha_2, \ldots, \alpha_T], \quad (13)

and further compute the approximation error of each neighboring pixel:

\ell_t = \|\phi(z_t) - X_\phi \alpha_t\|_2^2 = \kappa(z_t, z_t) - 2 \alpha_t^T K_{X, z_t} + \alpha_t^T K_{X,X} \alpha_t. \quad (14)

Based on the approximation errors, we can describe the complexity of each feature neighboring pixel φ(z_t) and determine the corresponding weight value by self-paced learning as follows:

\omega_t = \arg\min_{\omega_t \in [0,1]} \; \omega_t \ell_t + h(\lambda, \omega_t). \quad (15)

To solve for the weight ω_t in (15), a self-paced function h needs to be defined. The self-paced function can be binary, linear, logarithmic, or a mixture function [28]. Here, we take the mixture function as an example to show how to solve for the weight. The mixture function is defined as

h(\lambda_1, \lambda_2, \omega_t) = -\zeta \log\left(\omega_t + \frac{\zeta}{\lambda_1}\right), \quad \zeta = \frac{\lambda_1 \lambda_2}{\lambda_1 - \lambda_2}, \quad \lambda_1 > \lambda_2 > 0. \quad (16)

Substituting (16) into (15) obtains

\omega_t = \arg\min_{\omega_t \in [0,1]} \; \omega_t \ell_t - \zeta \log\left(\omega_t + \frac{\zeta}{\lambda_1}\right). \quad (17)

Taking the derivative with respect to ω_t and setting it to zero, we obtain the mixture weight:

\omega_t = \begin{cases} 1, & \ell_t \leq \lambda_2 \\ 0, & \ell_t \geq \lambda_1 \\ \zeta / \ell_t - \zeta / \lambda_1, & \lambda_2 < \ell_t < \lambda_1 \end{cases} \quad (18)

Based on the weight in (18), the pixels are divided into three classes: (1) "easy" pixels with small loss (ℓ_t ≤ λ_2), corresponding to weight 1; (2) "complex" pixels with large loss (ℓ_t ≥ λ_1), corresponding to weight 0; and (3) "moderate" pixels whose loss is between λ_2 and λ_1, with weight between 0 and 1. It is clear that the "complex" pixels are excluded from the JSR model.
When λ 1 is small, only very limited "easy" pixels are used. With an increase in λ 1 , more neighboring pixels are regarded as "moderate" pixels and hence used for the model.
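The three-branch weight rule in (18) is simple to implement; the following sketch (our own, using NumPy) exploits the fact that clipping the loss to [λ_2, λ_1] before applying the moderate-pixel formula reproduces all three branches at once:

```python
import numpy as np

def mixture_weight(loss, lam1, lam2):
    """Closed-form minimizer of  w * loss + h(lam1, lam2, w)  for the
    mixture self-paced regularizer (lam1 > lam2 > 0).  Clipping the
    loss to [lam2, lam1] before applying the moderate-pixel formula
    reproduces all three branches: w = 1 for loss <= lam2, w = 0 for
    loss >= lam1, and a smooth decrease from 1 to 0 in between."""
    zeta = lam1 * lam2 / (lam1 - lam2)
    clipped = np.clip(np.asarray(loss, dtype=float), lam2, lam1)
    return zeta / clipped - zeta / lam1
```

Because the moderate-pixel expression ζ/ℓ − ζ/λ_1 evaluates to exactly 1 at ℓ = λ_2 and 0 at ℓ = λ_1, the clipped formula is continuous across the branch boundaries.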
From (18), we can see that the parameters λ_1 and λ_2 are related to the losses of the neighboring pixels. As the amplitude of the losses differs in each iteration, setting these parameters directly is difficult. Considering that λ_1 and λ_2 control the size of the active feature neighboring pixel set, we can instead set them by specifying the number of feature neighboring pixels selected in each iteration of the self-paced learning process. For a square neighborhood with T pixels, we first define a pixel number sequence {T_1, T_2, ..., T_max}, where T_i denotes the number of feature neighboring pixels selected in the i-th iteration [17] and T_max = T. In the i-th iteration, the loss vector ℓ^{(i)} of the neighboring pixels is sorted in ascending order, resulting in a sorted loss vector ℓ_a^{(i)}. The parameters λ_1 and λ_2 are then set to the appropriate entries of ℓ_a^{(i)}, so that they are determined by the fixed initialization parameters k_1, k_2 and the step size δ. In the experiments, these are set as k_1 = 0.5, k_2 = 0.2, and δ = 0.05.
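Since the exact index formula for reading λ_1 and λ_2 off the sorted loss vector is not reproduced here, the following is a hypothetical sketch (our own assumption) in which the active fraction of pixels starts at k_1 and grows by δ per iteration, while λ_2 tracks the k_2 quantile:

```python
import numpy as np

def schedule_lambdas(loss, k1=0.5, k2=0.2, delta=0.05, iteration=1):
    """Hypothetical sketch: choose the model-age parameters lam1/lam2
    as entries of the sorted loss vector.  The active fraction of
    pixels starts at k1 and grows by delta per iteration; lam2 is
    tied to the k2 quantile so that the 'easy' set stays small."""
    a = np.sort(np.asarray(loss, dtype=float))
    T = len(a)
    frac1 = min(k1 + (iteration - 1) * delta, 1.0)
    i1 = min(int(np.ceil(frac1 * T)), T) - 1   # index of lam1
    i2 = min(int(np.ceil(k2 * T)), T) - 1      # index of lam2
    return a[i1], a[i2]
```

Any schedule with this shape gradually admits more "moderate" pixels as the model age grows, which is the behavior described in the text.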
By alternately updating the coefficient matrix S according to (13) and the weight vector ω according to (15) or (18), the objective function in (11) decreases, and the algorithm finally converges.
When the coefficient matrix S and the weight vector ω are obtained, the reconstruction residual for each class can be computed:

r_c(Z_\phi) = \|(Z_\phi - X_{\phi, \Omega_c} S_{\Omega_c,:}) W^{1/2}\|_F^2, \quad c = 1, \ldots, C. \quad (19)

Finally, the testing pixel z is assigned to the class with the minimal reconstruction residual. Algorithm 2 shows the implementation process of SPKJSR.

Algorithm 2 (SPKJSR). Run the following steps until convergence:
1. Compute the sparsity coefficient matrix S by the KSOMP algorithm with the current weights, and compute the loss ℓ_t of each feature neighboring pixel by (14).
2. Compute the sorted loss vector ℓ_a^{(k)}, update the model age parameters (λ_1, λ_2), and update the weights ω_t by (18).
3. Classify the testing pixel z by the minimal weighted reconstruction residual (19).
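For illustration, the class-wise weighted residual in (19) can be evaluated from kernel values alone by expanding the squared feature-space norm per pixel; the function below is our own sketch (the dictionary/class bookkeeping via a plain dict is hypothetical):

```python
import numpy as np

def weighted_kernel_residuals(K_ZZ, K_XZ, K_XX, S, w, class_index):
    """Class-wise weighted reconstruction residual in the feature
    space, from kernel matrices only:
    r_c = sum_t w_t * || phi(z_t) - X_phi[:, Omega_c] S[Omega_c, t] ||^2,
    expanded per pixel as k(z_t, z_t) - 2 s_t^T K_XZ + s_t^T K_XX s_t."""
    residuals = {}
    for c, omega in class_index.items():
        Sc = S[omega, :]                               # (|Omega_c|, T)
        # quadratic term s_t^T K s_t for every pixel t at once
        quad = np.einsum('it,ij,jt->t', Sc, K_XX[np.ix_(omega, omega)], Sc)
        cross = np.einsum('it,it->t', K_XZ[omega, :], Sc)
        per_pixel = np.diag(K_ZZ) - 2.0 * cross + quad
        residuals[c] = float(np.sum(w * per_pixel))
    return residuals
```

The class whose atoms can reproduce the (weighted) neighborhood with the smallest residual wins, exactly as in the final step of Algorithm 2.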

Data Sets
To evaluate the performance of the proposed method for HSI classification, we use two benchmark hyperspectral data sets, Indian Pines and Salinas (available online: http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes). In real applications, the number of labeled samples is usually very limited, which makes HSI classification a challenging problem. To show the performance of our proposed method in the case of small sample sizes, we randomly choose 1% of the samples from each class to form the training set, and all remaining samples constitute the testing set. The selected training and testing samples for the two data sets are shown in Tables 1 and 2, respectively.
In the experiments, the class-specific accuracy (CA), overall accuracy (OA), average accuracy (AA), and kappa coefficient (κ) on the testing set are recorded to assess the performance of the different classification methods. The CA is the ratio of correctly classified pixels to the total number of pixels for each individual class. The OA is the ratio of correctly classified pixels to the total number of pixels. The AA is the mean of the CAs. The kappa coefficient quantifies the agreement of the classification. McNemar's statistical test is used to evaluate the statistical significance of the differences between the overall accuracies of different algorithms [3,15,34,35]. The Z value of McNemar's test is defined as:

Z = \frac{f_{12} - f_{21}}{\sqrt{f_{12} + f_{21}}},

where f_{12} indicates the number of testing samples classified incorrectly by classifier 1 and correctly by classifier 2, and f_{21} is defined conversely. Accepting the common 5% level of significance, the difference in accuracy between classifiers 1 and 2 is said to be statistically significant if |Z| > 1.96. Here, we set the proposed SPKJSR algorithm as classifier 1 and the comparison algorithm as classifier 2. Thus, Z < −1.96 indicates that SPKJSR is statistically more accurate than the comparison algorithm.
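The statistic is straightforward to compute; a small helper (ours) for reference:

```python
import math

def mcnemar_z(f12, f21):
    """McNemar's Z statistic: f12 = # test samples misclassified by
    classifier 1 but not by classifier 2; f21 the converse.
    |Z| > 1.96 indicates a significant difference at the 5% level."""
    return (f12 - f21) / math.sqrt(f12 + f21)
```

With the convention above, a strongly negative Z (e.g., f12 = 10 vs. f21 = 40 gives Z ≈ −4.24) means classifier 1 makes significantly fewer unilateral errors.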
For SVM, the Gaussian radial basis function kernel is used, and the related parameters are set as C = 10000 and σ = 0.5. For SVMCK, a weighted summation spatial-spectral kernel is used [6]. The parameters for SVMCK are a penalty parameter C, the width of spectral kernel σ w , the width of spatial kernel σ s , the combination coefficient µ, and the neighborhood size T. These parameters are empirically set as C = 10000, σ w = 2, σ s = 0.25, µ = 0.8, and T = 81. As recommended in [9,10], the number of neighboring pixels and the sparsity level for JSR and WJSR on the Indian Pines data set are set as T = 81 and K = 30. In [24], the number of neighboring pixels for KJSR is also set as T = 81, and the sparsity level K ≥ 30 corresponds to similar results. For a fair comparison, we set the number of neighboring pixels and the sparsity level as T = 81 and K = 30 in the Indian Pines data set for four JSR-based methods (i.e., JSR, WJSR, KJSR, and SPKJSR). Considering that the Salinas image consists of large homogeneous regions, a large spatial window of size 9 × 9 (T = 81) is used, and the sparsity level is also set as K = 30 [15,17].

Classification Results
In the experiment, we randomly choose the training samples and testing samples five times and record the mean accuracies and standard deviations over the five runs for different algorithms. The classification results for the Indian Pines and Salinas data sets are shown in Tables 3 and 4, respectively. From the classification results, we can see that (1) the proposed SPKJSR provides the best classification results for both data sets. Compared with the original KJSR, SPKJSR improves the OA and κ coefficient by about 4% on Indian Pines and by about 2% on Salinas. This demonstrates that the self-paced learning strategy can eliminate the negative effect of inhomogeneous pixels and select effective feature neighboring pixels for the joint sparse representation, which helps to improve the classification performance. Moreover, the improvement of the proposed model over the other methods is statistically significant, as the Z values of McNemar's test in Table 5 are much smaller than −1.96.
(2) Compared with JSR, KJSR improves the OA by 9% and 6% on the Indian Pines and Salinas data sets, respectively. This demonstrates that nonlinear relations exist between samples in these two hyperspectral data sets and that the nonlinear kernel can effectively describe the intrinsic nonlinear structure.
(3) For the CA, SPKJSR obtains the best results for most of the classes. In particular, SPKJSR shows good performance for the classes with large numbers of samples, such as Classes 2, 10, 11, and 12 of Indian Pines and Classes 8, 9, 10, and 15 of Salinas. In addition, our SPKJSR also provides satisfactory performance for the classes with very limited samples, such as Classes 4, 7, 9, 10, and 15 of Indian Pines. In contrast, traditional sparse-based methods almost fail for these classes due to the lack of dictionary atoms. In the JSR model, the training dictionary and the neighborhood pixel matrix are two key factors. From the mechanism of sparse representation, when a class has a large number of samples, the number of dictionary atoms corresponding to this class is also large, so the representation ability and classification performance on large classes are usually better. Our proposed SPKJSR employs a self-paced learning strategy to adaptively select important pixels and discard inhomogeneous pixels, which refines the structure of the neighborhood feature pixel matrix and improves the reliability of the joint representation. Figures 4 and 5 show the classification maps obtained by different algorithms. The spectral-based methods (i.e., SVM, SRC, and OMP) generate poor results with considerable "salt & pepper" noise because they use only the spectral information, without the spatial neighborhood information. Spatial-spectral methods greatly improve the classification performance. Compared with KJSR, our SPKJSR shows relatively better results. In particular, on the Salinas data set, traditional sparse-based methods, such as SRC, OMP, JSR, and WJSR, misclassify almost all of Class No. 15 "Vineyard untrained" as Class No. 8 "Grape untrained". These two classes are spatially adjacent and their spectral characteristics are very similar, which makes them very difficult to distinguish.
This notwithstanding, our SPKJSR can still discriminate the subtle differences between these two classes and provides desirable results.
Table 3. Overall, average, and individual class accuracies and κ statistics in the form of mean ± standard deviation for the Indian Pines data set. The best results are highlighted in bold typeface.
The computational times for the different methods when reaching their optimal classification performances are reported in Table 6. Because sparse models need to make predictions for each test sample separately, they are usually more time-consuming. As an iterative algorithm, the proposed SPKJSR is relatively slower than JSR and KJSR.
In the previous experiments, we evaluated the effectiveness of our proposed SPKJSR in the case of small sample sizes (i.e., 1% of samples per class for training). Here, we further show the results for larger numbers of training samples and analyze the effect of the number of training samples. For this purpose, we draw 1%, 2%, 3%, 4%, and 5% of the labeled samples from each class to form the training set and then evaluate the performance of the different algorithms on the corresponding testing set. Figures 6 and 7 show the OA of the different methods versus the ratio of training samples per class for the Indian Pines and Salinas data sets, respectively. It can clearly be seen that SPKJSR provides consistently better results than the other algorithms for different numbers of training samples. Compared with the original linear JSR method, the kernel-based KJSR and SPKJSR show a great performance improvement, which demonstrates the effectiveness of the kernel method for describing the intrinsic nonlinear relations of hyperspectral data.

As shown in Algorithm 2, SPKJSR employs an alternating iterative strategy to update the sparsity coefficient matrix and the weight vector, so it is an iterative algorithm. Now, we investigate the effect of the number of iterations. Figures 8 and 9 show the classification OA versus the number of iterations on the Indian Pines and Salinas data sets, respectively. It can be seen that the proposed algorithm achieves relatively stable and good performance after 3 iterations.

Discussions
From the previous results in Tables 3 and 4, we can see that KJSR dramatically improves the original JSR. This demonstrates that kernel representation is effective for modeling the nonlinear relation between hyperspectral pixels [24]. In addition, the improvement of WJSR over JSR demonstrates that pixels in a spatial neighborhood have differences, and the weighted strategy can alleviate the negative effect of noisy or inhomogeneous pixels in the neighborhood [10]. Due to the presence of noisy neighboring pixels, directly performing kernel representation on these noisy pixels may not be robust. Our proposed SPKJSR model provides a robust kernel representation to improve the robustness of the joint representation of neighboring pixels in the feature space.
Although many existing JSR methods have tried to eliminate noisy or inhomogeneous pixels in the spatial neighborhood in a preprocessing step by means of image segmentation techniques [12][13][14], they usually suffer from an inaccurate identification of inhomogeneous neighboring pixels. The identification of pixels is based on spectral similarity, which is usually inaccurate due to the spectral variation of spatially adjacent pixels. Rather than deleting inhomogeneous pixels in advance, defining a robust metric has proven to be effective for suppressing the effect of inhomogeneous pixels on the JSR model [15,17,18]. Some robust metrics, such as correntropy-based metrics [15] and maximum likelihood estimation-based metrics [18,36], are used. This notwithstanding, in the feature space, there is a lack of robust metrics on kernel representation for the KJSR model. To the best of our knowledge, this paper is the first to provide a robust kernel representation for the KJSR model.

Conclusions
We have proposed a self-paced kernel joint sparse representation (SPKJSR) model for hyperspectral image classification. The proposed SPKJSR mainly improves the feature neighboring pixel representation in the traditional kernel joint sparse representation (KJSR) model. By introducing a self-paced learning strategy, the proposed SPKJSR simultaneously optimizes the sparsity coefficient matrix and the weight vector for feature neighboring pixels in a regularized KJSR framework. The optimized weights indicate the difficulty of the neighboring pixels. Inhomogeneous or noisy neighboring pixels are assigned small or zero weights and hence have very limited effects on the joint sparse representation. Thus, SPKJSR is much more robust and accurate than the traditional kernel joint sparse representation method. To validate the effectiveness of the proposed method, we have performed experiments on two benchmark hyperspectral data sets: Indian Pines and Salinas. Experimental results have shown that the proposed SPKJSR provides consistently better results than other existing joint sparse representation methods. In particular, when only 1% of the samples per class are used for training, SPKJSR improves the overall accuracy of KJSR by about 3.5% on the Indian Pines data set and 1.8% on the Salinas data set. In the future, we will consider using different kinds of kernels for the KJSR model, such as spatial-spectral-based kernels and Log-Euclidean kernels.