Discriminative Sparse Representation for Hyperspectral Image Classification: A Semi-Supervised Perspective

Abstract: This paper presents a novel semi-supervised joint dictionary learning (S²JDL) algorithm for hyperspectral image classification. The algorithm jointly minimizes the reconstruction and classification error by optimizing a semi-supervised dictionary learning problem with a unified objective loss function. To this end, we construct a semi-supervised objective loss function that combines a reconstruction term from unlabeled samples with a reconstruction–discrimination term from labeled samples, thereby leveraging both unsupervised and supervised information. In addition, a soft-max loss is used to build the reconstruction–discrimination term. In the training phase, we randomly select unlabeled samples and loop through the labeled samples to form training pairs, and first-order stochastic gradient descent updates are computed to simultaneously update the dictionary and the classifier by feeding the training pairs into the objective loss function. Experimental results with three popular hyperspectral datasets indicate that the proposed algorithm outperforms related methods.


Introduction
Hyperspectral remote sensing sensors provide a wealth of useful information, allowing for the acquisition of hundreds of contiguous bands for the same area on the surface of the Earth and enabling accurate discrimination of spectrally similar materials of interest [1]. The acquired hyperspectral images have been extensively exploited for classification tasks [2][3][4][5], which aim at assigning each pixel one thematic class for an object in a scene.
Recently, sparse representation has emerged as an effective way to solve various computer vision tasks, such as face recognition, image super-resolution, motion tracking, image segmentation, image denoising and inpainting, background modeling, photometric stereo, and image classification [6], where the concept of sparsity often leads to state-of-the-art performance. Sparse representation has also been used for hyperspectral image classification [4,7], target detection [7,8], unmixing [9], pansharpening [10], image decomposition [11,12], and dimensionality reduction [13,14], where the high-dimensional pixel vectors can be sparsely represented by a few training samples (atoms) from a given dictionary and the encoded sparse vectors carry the class-label information. In order for sparse representation to yield superior performance, the desired dictionary should have good representation power [15].
Traditional methods for learning a representative and compact dictionary have been extensively exploited [16][17][18][19]. The method of optimal directions (MOD) [16] is an improvement of matching pursuit with an iterative optimization strategy, which iteratively calculates the optimal adjustment for the atoms. Singular value decomposition generalized from K-means (K-SVD) [17] is an iterative method that alternates between sparse coding of instances based on the current dictionary and updating the dictionary to better represent the data. The majorization method (MM) [18] is an optimization strategy that substitutes the original objective function with a surrogate function updated in each optimization step. The recursive least squares dictionary learning algorithm (RLS-DLA) [19] adopts a continuous update approach as each atom is processed. These methods have shown good performance in various computer vision tasks. However, these dictionary learning algorithms are designed for reconstruction tasks rather than for classification. Moreover, they often suffer from high computational complexity, less representative power for specific classes, and lower discriminative power.
Advanced dictionary learning algorithms have recently been proposed to incorporate a discriminative term into the objective function of the dictionary learning problem [20][21][22][23][24][25][26][27][28][29][30], which we have termed discriminative sparse representation (DSR) in our previous work [31]. DSR allows for jointly learning reconstructive and discriminative parts instead of only the reconstructive one. Existing DSR models include the following categories: (i) the seminal studies that paved the way for considering discrimination in dictionary learning.
Mairal [20] adopted MOD and K-SVD to update the dictionary by using a truncated Newton iteration method. However, this method is not strictly convex and did not explore the discrimination capability of the sparse coefficients. Later, Mairal [22] adopted a logistic loss function to build a binary classification problem. This work also illustrated the possibility of extending the proposed binary classification problem to multi-class classification problems using a soft-max loss function. Pham [21] designed a constrained optimization problem by adopting a linear classifier with quadratic loss and ℓ2-norm regularization to jointly minimize the reconstruction and classification errors. This approach may suffer from local minima because it iteratively alternates between the reconstruction and classification terms. (ii) Exploiting different loss functions and discriminative criteria. Lian [23] adopted a hinge loss function inspired by support vector machines (SVMs) to design a unified objective loss function that links classification with dictionary learning. Such a framework is able to further increase the margins of a binary classifier, which consequently decreases the error bound of the classifier.
Yang [26] presented a novel Fisher discrimination dictionary learning (FDDL) method that uses the Fisher discrimination criterion to penalize the sparse coefficients, where a structured dictionary is used to minimize the reconstruction error. Henao [32] developed a new Bayesian formulation for the nonlinear SVM, based on a Gaussian process and with the hinge loss expressed as a scaled mixture of normals. (iii) Incorporating K-SVD into a DSR model. Following the work in [21], Zhang [25] proposed discriminative K-SVD (D-KSVD) by incorporating the classification error term into the K-SVD-based objective function. However, this approach does not guarantee discrimination ability when acting on a small training set. To overcome this issue, Jiang [29] presented a label consistent K-SVD (LC-KSVD) algorithm, where the class-label information is associated with dictionary atoms to enforce the discriminative property, and the optimal solution is efficiently obtained using the K-SVD algorithm. (iv) Exploiting hybrid supervised and unsupervised DSR models. Lian [24] presented a probabilistic model that combines an unsupervised model (i.e., a Gaussian mixture model) and a supervised model (i.e., logistic regression) for supervised dictionary learning. Mairal [33] presented an online dictionary learning (OnlineDL) algorithm. Following this work, Zhang [30] presented an online semi-supervised dictionary learning (OnlineSSDL) algorithm that optimizes the reconstruction error from labeled and unlabeled data and the classification error from labeled data. However, for the sake of simplicity, OnlineSSDL drops the weight decay for the classifier parameters, so the problem is no longer strictly convex, resulting in a suboptimal solution. (v) Exploiting structured sparsity in the DSR model. Compared with traditional sparse representation methods, structured sparsity-based methods are generally more robust to noise due to the stability associated with group structure [34]. In addition, the structured
sparsity-inducing dictionary learning methods require a smaller sample size to obtain the optimal solution [35][36][37][38]. Based on graph topology, Jiang [27] proposed a submodular dictionary learning (SDL) algorithm by optimizing an objective function that accounts for the entropy rate of a random walk on a graph and a discriminative term. This dictionary learning problem can be considered a graph partitioning problem, where the dictionary is updated by finding a graph topology that maximizes the objective function.
These works also stated that a good dictionary learning method should find a proper balance between reconstruction, discrimination, and compactness. In particular, some studies have exploited DSR for hyperspectral image processing. Charles [39] modified an existing unsupervised learning method to learn the dictionary for hyperspectral image classification. Later, Castrodad [40] exploited DSR, where block-structured dictionary learning and subpixel unsupervised abundance mapping were jointly considered. More recently, Wang [41] designed a hinge loss function inspired by learning vector quantization to address the discriminative dictionary learning problem. Wang [42] proposed a semi-supervised classification method by jointly learning the classifier and dictionary in a task-driven framework, where a logistic loss function is adopted to build the discriminative term. Our previous work presented a new method for DSR by learning a reconstructive dictionary and a discriminative classifier in a sparse representation model regularized with total variation [31].
Despite the good performance of these dictionary learning methods, some shortcomings can be observed. On the one hand, most of these approaches deal with supervised dictionary learning problems, and the performance of the learnt dictionary for classification greatly depends on the number of labeled samples. Unfortunately, the collection of labeled training samples is generally difficult, expensive, and time-consuming, whereas unlabeled training samples can be generated much more easily, which has fostered the idea of exploiting semi-supervised learning (SSL) for hyperspectral image classification [43]. On the other hand, the loss function adopted in most of these approaches is the square loss, which treats classification as a regression problem. In addition, the square loss suffers from the critical flaw that data outliers are punished too heavily when the errors are squared.
In this paper, we consider the above issues by jointly learning a reconstructive and discriminative dictionary in a semi-supervised fashion. To this end, we first employ a soft-max loss function to build the multi-class discriminative term and thus address the multi-class classification problem. Different from the square loss, the soft-max loss is overparameterized: for any hypothesis to be fitted, multiple parameter settings give rise to exactly the same mapping from inputs to predictions. We then compute first-order stochastic gradient descent (SGD) [44] updates to simultaneously update the dictionary and the classifier. The dictionary learning phase is performed iteratively in a semi-supervised fashion with the obtained labeled and unlabeled training pairs. The ultimate goal of this study is classification; the dictionary is an implicit variable when applying the proposed DSR model to hyperspectral image classification. Note that the recent study [42] is related to our work. However, we adopt the soft-max loss to build the discriminative term, whereas [42] used the logistic loss.
Although our previous studies [11,14,45,46] have exploited sparse representation for hyperspectral image classification, their methodologies are quite different from this work: Xue [11] focused on hyperspectral image decomposition for spectral–spatial classification, Xue [14] addressed hyperspectral image dimensionality reduction using sparse graph embedding, and Xue [45,46] exploited sparse graph regularization for hyperspectral image classification with very few labeled samples.
In this context, the main contribution of our work is the proposed semi-supervised joint dictionary learning (S²JDL) algorithm, which leverages the information from both labeled and unlabeled samples, allowing more accurate classification. Note that the proposed algorithm is unique compared to previously proposed approaches in the hyperspectral image classification community. In addition, we adopt a soft-max loss function to build the DSR problem, which is beneficial for hyperspectral image classification since multi-class problems are very common in this community.

Background
Let X = [X^l, X^u] = [x_1, ..., x_N] ∈ R^{n×N} be a hyperspectral dataset with an n-dimensional signal for each pixel x_i = [x_i^1, ..., x_i^n]^T, i ∈ {1, ..., N}. Let the superscripts l and u denote a labeled or unlabeled sample or dataset, and let the subscripts u and s of an objective loss function (i.e., Γ or L) denote its unsupervised or supervised form. Let Y = [y_1, ..., y_N] ∈ R^{m×N} represent the label matrix for the input data, where the index of the nonzero value (i.e., 1) of y_i represents its label. Let Y(x_i) ∈ {1, ..., m} denote the label of x_i. Let D = [d_1, ..., d_K] ∈ R^{n×K} be the dictionary, W = [w_1^T, ..., w_m^T]^T ∈ R^{m×K} the classifier, and Z = [z_1, ..., z_N] ∈ R^{K×N} the sparse coefficients for X.

Sparse Representation
In the context of sparse representation, the sparse coefficients of X with respect to the dictionary D can be obtained by optimizing an ℓ1-norm regularization problem [6]

arg min_Z (1/2)||X − DZ||_F^2 + λ Σ_{i=1}^{N} ||z_i||_1,    (1)

where λ is a regularization parameter controlling the tradeoff between reconstruction error and sparsity.
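For intuition, Equation (1) can be solved column-wise with a simple iterative soft-thresholding (ISTA) scheme. The Python sketch below is our illustration with illustrative names; the paper itself uses SUnSAL and OMP solvers rather than this code:

```python
import numpy as np

def sparse_code_ista(X, D, lam=0.1, n_iter=200):
    """Solve min_Z 0.5*||X - D Z||_F^2 + lam*||Z||_1 column-wise via ISTA."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the smooth part
    Z = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(n_iter):
        G = Z - D.T @ (D @ Z - X) / L      # gradient step on the quadratic term
        Z = np.sign(G) * np.maximum(np.abs(G) - lam / L, 0.0)  # soft threshold
    return Z
```

Each iteration performs a gradient step on the reconstruction term followed by soft thresholding, which implements the ℓ1 penalty.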

Dictionary Learning for Classification
For various dictionary learning algorithms, the construction of D can be achieved by minimizing the reconstruction error subject to a sparsity constraint:

arg min_{D,Z} ||X − DZ||_F^2   s.t.  ||z_i||_0 ≤ T, i = 1, ..., N.    (2)

K-SVD [17], which iteratively alternates between sparse coding of the instances based on the current dictionary and updating the dictionary to better fit the data, is an efficient algorithm generalized from the K-means clustering process to solve Equation (2). However, K-SVD is not explicitly designed for classification tasks, as it only focuses on minimizing the reconstruction error.
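The alternation behind Equation (2) can be sketched in Python with a greedy OMP coder and a MOD-style least-squares dictionary update. This is our simplified stand-in, not the authors' implementation, and the function names are illustrative:

```python
import numpy as np

def omp(D, x, T):
    """Greedy orthogonal matching pursuit with at most T nonzeros."""
    z = np.zeros(D.shape[1]); r = x.copy(); idx = []
    for _ in range(T):
        idx.append(int(np.argmax(np.abs(D.T @ r))))   # most correlated atom
        sub = D[:, idx]
        coef, *_ = np.linalg.lstsq(sub, x, rcond=None)
        r = x - sub @ coef                             # update the residual
    z[idx] = coef
    return z

def learn_dictionary(X, K, n_iter=10, T=3):
    """Alternate sparse coding (OMP) and a MOD least-squares dictionary update."""
    rng = np.random.default_rng(0)
    D = X[:, rng.choice(X.shape[1], K, replace=False)].astype(float)
    D /= np.linalg.norm(D, axis=0, keepdims=True)
    for _ in range(n_iter):
        Z = np.column_stack([omp(D, x, T) for x in X.T])  # sparse coding step
        D = X @ Z.T @ np.linalg.pinv(Z @ Z.T)             # MOD dictionary update
        D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-12
    Z = np.column_stack([omp(D, x, T) for x in X.T])      # codes for the final D
    return D, Z
```

K-SVD replaces the MOD step with per-atom SVD updates, but the alternating structure is the same.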
Separating dictionary learning from classification may result in a suboptimal D for classification. Therefore, it is generally preferred to jointly learn the dictionary and classifier by solving [21,22,25,[28][29][30]

arg min_{D,W,Z} ||X − DZ||_F^2 + Σ_{i=1}^{N} Γ(y_i, W, z_i) + λ1 ||W||_F^2,    (3)

where the classifier can be obtained by optimizing the model parameter W = [w_1, ..., w_m]^T ∈ R^{m×K} as

arg min_W Σ_{i=1}^{N} Γ(y_i, W, z*(x_i, D)) + λ1 ||W||_F^2,    (4)

where Γ denotes the objective loss function, which can take square, logistic, soft-max, or hinge loss forms; z*(x_i, D) denotes the sparse code z_i obtained by solving Equation (1); and λ1 is another regularization parameter preventing overfitting.

Related Work
Recently proposed joint dictionary learning methods mainly focus on supervised dictionary learning, which takes the form [22]

arg min_{D,W,Z} Σ_{i=1}^{N} Γ(y_i, W, z_i) + λ0 ||X − DZ||_F^2 + λ1 ||W||_F^2.    (5)

However, Equation (5) is designed for binary classification with y_i ∈ {[1 0]^T, [0 1]^T}. K-SVD can be extended to discriminative K-SVD (D-KSVD) [25] by reconstructing an augmented dictionary with augmented training data, which can be formulated as

arg min_{D,W,Z} α||X − DZ||_F^2 + β||Y − WZ||_F^2   s.t.  ||z_i||_0 ≤ T,    (6)

which is solved by applying K-SVD to the augmented data [√α X^T, √β Y^T]^T with the augmented dictionary [√α D^T, √β W^T]^T, where α and β are two scalars controlling the relative contributions of the corresponding terms.
Recently, label consistent K-SVD (LC-KSVD) [29] has emerged as an effective way to solve Equation (6) by jointly adding a classification term and a label consistency regularization term into the square loss objective function, which is of the form

arg min_{D,G,W,Z} ||X − DZ||_F^2 + α||Q − GZ||_F^2 + β||Y − WZ||_F^2   s.t.  ||z_i||_0 ≤ T,    (7)

where the term ||Q − GZ||_F^2 signifies the discriminative sparse code error, and Q = [q_1, ..., q_N] ∈ R^{K×N} refers to the discriminative sparse codes (0 or 1) corresponding to the input data Z. The nonzero values (i.e., 1) of q_i = [q_i^1, ..., q_i^K]^T ∈ R^K occur at those indices where the input signal z_i and the dictionary atom d_k share the same label. G ∈ R^{K×K} is a linear transformation matrix, which transforms the original sparse codes into the most discriminative ones in the sparse feature space R^K. Similar to D-KSVD, LC-KSVD is solved by using K-SVD with an augmented dictionary [√α D^T, √β W^T, √γ G^T]^T [29].

More recently, based on LC-KSVD, Zhang [30] tried to solve Equation (7) in an online SSL fashion by using a block-coordinate gradient descent algorithm to update the dictionary. This method, named OnlineSSDL, keeps the reconstruction term over both labeled and unlabeled samples while restricting the discriminative terms to the labeled samples. Mairal [28] represented semi-supervised dictionary learning as an extension of the task-driven dictionary learning problem, which takes the form

arg min_{D,W} µ Σ_{x_i ∈ X^l} Γ_s(y_i, W, z*(x_i, D)) + (1 − µ) Σ_{x_i ∈ X} Γ_u(x_i, D) + λ1 ||W||_F^2,    (9)

where the loss functions Γ_s and Γ_u are, respectively, for the supervised and unsupervised learning fashions, and µ ∈ (0, 1) is a new parameter controlling the tradeoff between them. However, Equation (9) adopts the logistic loss with a one-versus-all strategy and addresses classification as a regression problem, resulting in scalability issues and a large memory burden.

Proposed Method
In the proposed method, we first define the semi-supervised joint dictionary learning problem. Then, the optimization phase includes initialization, sparse coding, dictionary updating, and classifier updating. Finally, the class labels of unknown data are predicted using the learnt classifier and the sparse coefficients. Figure 1 graphically illustrates the main idea.

Model Assumption
An attractive and promising research line for jointly learning the dictionary and classifier is to incorporate SSL. Inspired by Equations (8) and (9), we now reformulate the semi-supervised joint dictionary learning problem into an improved form

arg min_{D,W} µ L_s(X^l, Y, D, W) + (1 − µ) L_u(X^u, D),    (10)

where L_u is the unsupervised loss combining the reconstruction error with a sparsity penalty λψ(z), L_s additionally includes the discriminative term Γ_s, and ψ is a sparsity-inducing function.
We adopt an ℓ1-norm for ψ, and adopt a soft-max loss to design the supervised objective loss function, which takes the form

Γ_s(y_i, W, z_i) = − Σ_{j=1}^{m} 1{Y(x_i) = j} log( exp(w_j z_i) / Σ_{k=1}^{m} exp(w_k z_i) ),    (11)

where 1{·} is an indicator function, such that 1{a true statement} = 1 and 1{a false statement} = 0, and j ∈ {1, 2, ..., m} refers to the class. Finally, the designed semi-supervised joint dictionary learning problem can be defined as

arg min_{D,W} µ Σ_{x_i ∈ X^l} [ (1/2)||x_i − Dz_i||_2^2 + λ||z_i||_1 + Γ_s(y_i, W, z_i) ] + (1 − µ) Σ_{x_i ∈ X^u} [ (1/2)||x_i − Dz_i||_2^2 + λ||z_i||_1 ] + (λ2/2)||W||_F^2.    (12)

Optimization
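Before detailing the optimization steps, the soft-max loss and its gradient with respect to W can be sketched in Python for a single sparse code z. This is our illustration (variable names are ours, not the paper's):

```python
import numpy as np

def softmax_loss(W, z, label):
    """Soft-max loss -log p(label | z) for a linear classifier W (m x K)."""
    scores = W @ z
    scores -= scores.max()                 # shift for numerical stability
    p = np.exp(scores) / np.exp(scores).sum()
    return -np.log(p[label])

def softmax_grad_W(W, z, label):
    """Gradient of the soft-max loss with respect to W: (p - y) z^T."""
    scores = W @ z; scores -= scores.max()
    p = np.exp(scores) / np.exp(scores).sum()
    p[label] -= 1.0                        # subtract the one-hot target
    return np.outer(p, z)
```

The gradient form (p − y) z^T is what the SGD updates in the training phase rely on.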

Initialization
Let us assume that we have a small labeled dataset X^l spanning all classes and a large unlabeled dataset X^u. Two variables need to be initialized, since Y can be seen as a prior. For D_0, we initialize the dictionary so that its atoms are uniformly allocated to the classes, with the number of atoms per class proportional to the dictionary size. Thus, we randomly select multiple class-specific dictionaries of equal size from the training data. The initialization process is completely supervised, and the class labels attached to the dictionary atoms remain fixed during dictionary learning. As for W_0, we employ a multivariate ridge regression model [47]

arg min_W ||Y − WZ||_F^2 + λ1 ||W||_F^2,    (13)

which is equipped with square loss and ℓ2-norm regularization and yields the closed-form solution

W_0 = Y Z^T (Z Z^T + λ1 I_K)^{-1},    (14)

where I_K denotes the K × K identity matrix.
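The closed-form initialization of Equation (14) is a few lines of Python (our sketch; names are illustrative):

```python
import numpy as np

def init_classifier(Y, Z, lam1=1e-5):
    """Multivariate ridge regression: W0 = Y Z^T (Z Z^T + lam1 I)^{-1}."""
    K = Z.shape[0]
    return Y @ Z.T @ np.linalg.inv(Z @ Z.T + lam1 * np.eye(K))
```

The solution satisfies the normal equations W_0 (Z Z^T + λ1 I_K) = Y Z^T, which is a quick way to verify an implementation.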
We employ the sparse unmixing by variable splitting and augmented Lagrangian (SUnSAL) algorithm [48] to obtain the sparse codes Z for the input data X with respect to the initialized dictionary D_0. Then, the initial W_0 can be computed using Equation (14). Our previous studies have validated the good performance of SUnSAL for hyperspectral image processing [11,14,31,45,46].

Variables Updating
We resort to the SGD algorithm to optimize Equation (12), since the objective loss function in our problem is highly nonlinear. In order to achieve semi-supervised optimization, we first regard the optimization process as two independent ingredients (i.e., unsupervised learning and supervised learning) and calculate their gradients separately. Then, we combine the gradients with a weighted summation strategy to obtain the final update.
At iteration t, we first select the t-th labeled sample x^l_t from X^l. Assume that the current dictionary D_t and the label vector y_t are given. We then randomly select an unlabeled sample x^u_t from X^u. Next, we calculate the sparse codes (z^u_t, z^l_t) for the training pair (x^u_t, x^l_t) with respect to the current dictionary using the SUnSAL algorithm. Finally, D_t and W_t are updated by feeding the training pair into the semi-supervised objective loss function. To this end, the gradients must first be formulated, which poses the critical challenge in optimizing the proposed dictionary learning problem.
For an unlabeled sample x^u_t, we solve

arg min_z (1/2)||x^u_t − D_t z||_2^2 + λ||z||_1,    (15)

and then compute the gradient of L_u for x^u_t with respect to D_t as ∇_{D_t} L_u = −(x^u_t − D_t z^u_t)(z^u_t)^T. For a labeled sample x^l_t, we solve the following problem

arg min_{D,W} (1/2)||x^l_t − D z^l_t||_2^2 + λ||z^l_t||_1 + Γ_s(y_t, W, z^l_t) + (λ2/2)||W||_F^2.    (16)

The skeleton optimization process of the presented semi-supervised joint dictionary learning (S²JDL) method is summarized in Algorithm 1. More details of the dictionary and classifier updates can be found in Appendix A.
Algorithm 1: S²JDL.
1: Input: X^l, X^u, K, T, λ, λ1, λ2, µ, ρ.
2: Output: D and W.
3: Initialization: initialize D_0 with K atoms from X^l and obtain W_0 by Equation (14).
4: for each x^l_t in X^l do
5:   Randomly select an unlabeled sample x^u_t from X^u.
6:   Obtain (z^l_t, z^u_t) for (x^l_t, x^u_t) with the current dictionary D_t using SUnSAL.
9:   Update D and W by Equations (A4) and (A5), respectively.
10:  Remove the selected unlabeled sample: X^u ← X^u \ x^u_t.
11: end for
12: return D_{t+1} and W_{t+1}.
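Putting the pieces together, the training loop of Algorithm 1 can be sketched in Python as below. This is our illustrative reimplementation under simplifying assumptions, not the authors' code: ISTA stands in for SUnSAL, the supervised gradient uses plain soft-max residuals, and a 1/(1+t) learning-rate decay is assumed.

```python
import numpy as np

def sparse_code(D, x, lam=1e-2, n_iter=50):
    """ISTA stand-in for SUnSAL: min_z 0.5*||x - D z||^2 + lam*||z||_1."""
    L = np.linalg.norm(D, 2) ** 2
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = z - D.T @ (D @ z - x) / L
        z = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)
    return z

def s2jdl_train(Xl, yl, Xu, D, W, lam=1e-2, lam2=1e-5, mu=0.5, rho=1e-3):
    """One pass over the labeled set, pairing each labeled sample with a
    randomly drawn (and then removed) unlabeled sample, as in Algorithm 1."""
    rng = np.random.default_rng(0)
    order = list(rng.permutation(Xu.shape[1]))
    for t in range(Xl.shape[1]):
        xl, xu = Xl[:, t], Xu[:, order.pop()]
        zl, zu = sparse_code(D, xl, lam), sparse_code(D, xu, lam)
        # unsupervised ingredient: gradient of 0.5*||xu - D zu||^2 w.r.t. D
        gDu = -np.outer(xu - D @ zu, zu)
        # supervised ingredient: reconstruction gradient plus soft-max gradient
        gDl = -np.outer(xl - D @ zl, zl)
        s = W @ zl; s -= s.max()
        p = np.exp(s) / np.exp(s).sum()
        p[yl[t]] -= 1.0                      # soft-max residual (p - y)
        gW = np.outer(p, zl) + lam2 * W      # with weight decay
        lr = rho / (1.0 + t)                 # decaying step size
        D -= lr * (mu * gDl + (1.0 - mu) * gDu)
        D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-12
        W -= lr * gW
    return D, W
```

The weighted sum of the two gradients mirrors the µ/(1 − µ) tradeoff in Equation (12); atom renormalization after each update is a common convention we assume here.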

Classification
Once we have obtained the learnt dictionary D and the classifier parameter W from Algorithm 1, we can predict a new incoming test sample x_test. To this end, we first compute its sparse code z_test using SUnSAL and then assign its label according to the position of the largest (most probable) value in the label vector:

Y(x_test) = arg max_{j ∈ {1,...,m}} exp(w_j z_test) / Σ_{k=1}^{m} exp(w_k z_test).    (17)

Since the proposed method produces class probabilities as in Equation (17), we adopt graph cuts [49] to obtain smoother classification maps. Graph cuts are an energy minimization algorithm that tackles combinatorial optimization problems involving unary and pairwise interaction terms, i.e., the maximum a posteriori (MAP) segmentation problem using multinomial logistic regression with a multilevel logistic prior. Graph cuts yield very good approximations to the MAP segmentation and are quite efficient from a computational point of view [50]. Overall accuracy (OA) and the kappa value (κ) generated from the confusion matrix are used to evaluate the classification performance based on the ground truth [51]. In addition, the kappa [52] and McNemar z-score statistical tests [53] are also adopted for accuracy assessment.
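The prediction rule of Equation (17) reduces to an arg-max over soft-max probabilities; a minimal Python sketch (our illustration):

```python
import numpy as np

def classify(W, z_test):
    """Assign the label of the largest soft-max probability (Equation (17))."""
    scores = W @ z_test
    p = np.exp(scores - scores.max())      # stable soft-max
    p /= p.sum()
    return int(np.argmax(p)), p
```

The returned probability vector p is exactly what a graph-cut smoothing step would consume as the unary potentials.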

Experimental Settings
For performance comparison, several closely related dictionary learning algorithms are considered. The unsupervised methods are MOD, K-SVD, and OnlineDL; the supervised category includes D-KSVD, LC-KSVD, and SDL. We also implemented a semi-supervised variant of the proposed method with a logistic loss function (S²JDL-Log, "Log" for logistic loss). We then denote the proposed method by S²JDL-Sof ("Sof" for soft-max loss) to differentiate them. It is worth noting that all the methods employ a multivariate ridge regression model for classifier learning (see Equation (14)) and adopt orthogonal matching pursuit (OMP) [54] for sparse coding.
It is worth noting that MOD, K-SVD, D-KSVD, LC-KSVD, OnlineDL, and SDL are implemented based on the original source codes publicly released by their authors, and they share the common parameters of sparsity level (T = 5), reconstruction error (E = 1 × 10⁻⁴), and ℓ1-norm penalty (λ = 1 × 10⁻⁵). In addition, three other parameters in SDL were set to their recommended values. We also note that these parameters were selected by careful cross-validation. In this context, the parameter settings always ensure a fair comparison even if they are suboptimal. For the dictionary and classifier updates, the initial learning rate is set to ρ = 1 × 10⁻³, and the tradeoff between the unsupervised and supervised learning ingredients is set to 0.5.
We randomly select labeled training samples from the ground truth to initialize the dictionary and classifier. The ℓ2-norm penalty parameter is set to λ1 = 1 × 10⁻⁵ when initializing the classifier using Equation (14).
For classification, we set the maximum number of iterations to k_max = 10, which means that the reported overall accuracies (OAs), average accuracies (AAs), kappa statistics (κ), and class individual accuracies are derived by averaging the results of ten independent Monte Carlo runs with respect to the initialized labeled training set. In addition, the smoothness parameter (τ) in graph cuts is set to 3, and we adopt graph cuts for all the methods.
Finally, it is worth noting that all the implementations were carried out using Matlab R2015b on a desktop PC equipped with an Intel Xeon E3 CPU (at 3.4 GHz) and 32 GB of RAM.It should be emphasized that all the results are derived from our own implementations and the involved parameters for these methods have been carefully optimized in order to ensure a fair comparison.

Experiments with AVIRIS Indian Pines Dataset
Experiment 1: We first analyze the parameter sensitivity of the proposed method. Figure 5 illustrates the sensitivity of the proposed method to the parameters λ, λ1, and µ under the condition that 10% of the labeled samples per class are used for training and 15 labeled samples per class are used to build the dictionary. As we can see from Figure 5a, the OA is insensitive to λ and λ1. Therefore, we roughly set λ = 1 × 10⁻⁵ and λ1 = 1 × 10⁻⁵. This observation is reasonable, since λ controls the uniqueness of sparse coding and λ1 determines the initial performance of the classifier parameter. The impacts of the two parameters are reduced in an iterative learning scheme, since sparse coding and the classifier update are conducted alternately. According to Figure 5b, we found that the OA is sensitive to µ, and the optimal value is set to µ = 3. This observation is in accordance with that made in [50].

Experiment 2: In this test, Gaussian random noise with a pre-defined signal-to-noise ratio (SNR) is generated and added to the original imagery. Figure 6 illustrates the evolution of OA with SNR for the different classifiers. As shown in the figure, the proposed method outperforms the others in most cases (e.g., SNR = 5, 10). The interval between the curve of the proposed method and the others visually indicates the significance, which confirms the robustness of the proposed method to data noise.

Experiment 3: We then analyze the impact of training data size on classification accuracy. To this end, we randomly choose 10% of the labeled samples per class (a total of 1027 samples) to initialize the training data and evaluate the impact of the number of atoms on the classification performance. Figure 7 shows the OAs as a function of the number of atoms per class obtained by the different methods. As we can see from the figure, the proposed method obtains the highest accuracies in all cases. Another observation is that, as the number of atoms increases, the OAs also increase. When the number of atoms per class is
equal to 15, the proposed method reaches a stable level, with an OA higher than 80%. It is interesting to note that D-KSVD and LC-KSVD obtain very similar results.
Figure 8 shows the OAs as a function of the ratio of labeled samples per class for the different methods. As we can see from the figure, the proposed method obtains higher accuracies when the ratio is larger than 5%. It is worth noting that the proposed method stably obtains improved classification performance with additional labeled samples. However, the other methods cannot produce higher OAs as the ratio ranges from 3% to 10%. This observation can be explained as follows: as the number of labeled samples increases, the proposed method can exploit more information from both the labeled and unlabeled training samples, whereas the supervised dictionary learning methods can only rely on the labeled information, so their performance cannot be improved significantly. Another observation is that S²JDL-Log yields a very competitive classification performance. This is due to the fact that S²JDL-Log also operates in a semi-supervised learning fashion.
Experiment 4: Table 1 reports the OA, AA, individual classification accuracies, and κ statistic. The results of the proposed algorithm are listed in the last column of the table, and we have marked in bold the best result for each class and the best OA, AA, and κ statistic. Our method achieves the best results compared to the other dictionary learning methods. The improvements in classification accuracy are around 10-40% when compared to the other methods. In particular, when classifying the class Wheat, our method performs very well. Although our method may not always obtain the best accuracy for some specific classes, AA, OA, and kappa are more convincing metrics of classification performance. In addition, the time costs of the different methods are listed in the table, where we can see that the proposed method is more efficient than K-SVD, D-KSVD, and LC-KSVD. However, MOD, OnlineDL, and SDL take less time. The time cost of the proposed method mainly comes from the optimization step, which could be reduced by exploiting a more efficient optimization strategy.
Table 2 reports the between-classifier statistical tests in terms of the Kappa z-score and McNemar z-score. The critical value of the z-score is 1.96 at a confidence level of 0.95, and all the tests are significant with 95% confidence, which indicates that the proposed method significantly outperforms the other methods. Another observation is that lower z-score values indicate closer classification results; e.g., the Kappa z-score for S²JDL-Log/S²JDL-Sof is 4.4. Similar observations can be made for the tests using the McNemar z-score.
For illustrative purposes, Figure 9 shows the obtained classification maps (corresponding to the best result over ten Monte Carlo runs). The advantages obtained by adopting the semi-supervised dictionary learning approach with regard to the corresponding supervised and unsupervised cases can be visually appreciated in the classification maps displayed in Figure 9, where the classification OA obtained for each method is reported in parentheses. Compared to the ground-truth map, the proposed method obtains a more accurate and smoother classification map. Significant differences can be observed when classifying the class Wheat in this scene, which is in accordance with the former results. However, the classification maps obtained by D-KSVD and LC-KSVD are much less homogeneous than those of the other methods. This observation can be explained by the fact that graph cuts are adopted as a post-processing strategy, which relies heavily on the initial classification probabilities produced by the classifiers. If the initial classification results are poor, the resulting improvements may not be satisfactory. That is the case for D-KSVD and LC-KSVD, with an initial OA = 60.07% for the former and an initial OA = 60.52% for the latter.

Experiment 5: In the last experiment, we analyze the mechanism of the proposed method. Firstly, we plot the stem distributions of the sparse coefficients obtained by the different methods. As we can see from Figure 10, the distributions differ significantly between classifiers. Precisely, the atoms belonging to the same land cover type contribute more than the others, thus making the associated coefficients sparsely located at those atoms. For example, the atoms indexed 146-160 in the dictionary belong to the class Wheat, and the sparse coefficients will mainly be located at these atoms if this class is well
represented. Obviously, the proposed method produces more accurate sparse coefficients, since the stem distributions are mainly located at the corresponding atoms (see Figure 10g). As for the other methods, the associated sparse atoms cannot be accurately found; e.g., the stem distributions obtained by OnlineDL are partially located at the class Woods. Figure 11 spatially exhibits the sparse coefficients relative to the class Wheat for the different methods. As shown in Figure 11, the proposed method yields more accurate sparse coefficients relative to the class Wheat. Therefore, the aggregation characteristics of the sparse coefficients naturally enlarge the discrimination between different land cover types, and the spatial variations of the sparse coefficients explain the accuracy of the proposed method for sparse representation. The above observations demonstrate the good performance of the proposed method in dictionary learning, and the discrimination performance of our method has been validated in the former experiments.
Secondly, we analyze the denoising power of the proposed method by plotting the original spectrum, the reconstructed spectrum, and the noise for the class Wheat. From the results shown in Figure 12, we can see that the original spectrum can be accurately reconstructed with a very small Root-Mean-Square Error (RMSE). The RMSE for two observations x_i and x_j of length B can be defined as RMSE(x_i, x_j) = sqrt((1/B) * sum_{b=1}^{B} (x_{i,b} - x_{j,b})^2), and the noise is the difference between the original spectrum (x) and the reconstructed spectrum (Dz). It is worth noting that the proposed method obtains a very small RMSE value. In this context, the proposed method can accurately reconstruct the original spectrum with high fidelity by removing noise, which explains the robustness to noise of the proposed method observed in the former experiment. We then evaluate the global reconstruction performance of the proposed method by considering all classes. As reported in Table 3, the proposed method obtains the smallest RMSE value. This experiment also hints at the good performance of the proposed method in dictionary learning. Finally, we attempt to analyze the dictionary structure by visually illustrating the matrix D learnt by the different methods. As shown in Figure 13, S2JDL-Sof and S2JDL-Log yield similar data structures: the atoms belonging to the same class are more similar to each other, while the atoms belonging to different classes are more distinct from each other. Similar observations can be made for D-KSVD, LC-KSVD, and SDL. However, the dictionaries learnt by OnlineDL model very little data structure, as shown in the figure. Note that we cannot currently explain the factors inducing these differences in dictionary structure.
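The RMSE defined above is straightforward to compute; a minimal sketch (the spectra below are illustrative values, not data from the paper):

```python
import numpy as np

def rmse(x, x_hat):
    """Root-mean-square error between two spectra of equal length B."""
    x, x_hat = np.asarray(x, float), np.asarray(x_hat, float)
    return np.sqrt(np.mean((x - x_hat) ** 2))

x = np.array([0.42, 0.51, 0.48, 0.60])        # original spectrum x
x_hat = np.array([0.40, 0.52, 0.47, 0.61])    # reconstructed spectrum Dz
print(rmse(x, x_hat))                          # ~0.0132
```

The noise plotted in Figure 12 is then simply the residual vector x - Dz, whose magnitude the RMSE summarizes.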

Experiments with ROSIS University of Pavia Dataset
Experiment 1: We first analyze the impact of training data size on classification accuracy. We randomly choose 5% of the labeled samples per class (a total of 2138 samples) to initialize the training data and evaluate the impact of the number of atoms on the classification performance achieved by the proposed method for the ROSIS University of Pavia dataset. Figure 14 shows the OAs as a function of the number of atoms per class obtained by the different methods. Again, the proposed method obtains the highest accuracies in all cases. Another observation is that, for most of the methods, the OAs increase as the number of atoms increases. Different from the former experiments, when the number of atoms per class is larger than 10, the OAs obtained by D-KSVD and LC-KSVD become lower. In this scene, MOD does not perform well in any case. Figure 15 depicts the OAs as a function of the ratio of labeled samples per class for the different methods. As we can see from the figure, the proposed method obtains the highest accuracies in all cases. It is interesting to note that the proposed method cannot stably obtain improved classification performance with additional labeled samples. This may be due to the fact that the homogeneity in this scene is not so significant, and graph cuts reduce the effects of the learning phase, since the classification accuracy cannot be significantly improved by using additional labeled samples. In addition, similar observations can be made for the other methods. In this scene, D-KSVD does not perform well in the different cases. Again, S2JDL-Log provides competitive performance.
Experiment 2: Table 4 lists the OA, AA, individual classification accuracies, and κ statistic. As reported in the table, the proposed method achieves the best results compared to the other supervised dictionary learning methods. The improvements in classification accuracy are around 3-30% when compared to the other methods. When classifying the class Bare soil, our method obtains the highest
accuracy, at 23.10%. Although this accuracy is not very high, it demonstrates the merit of the proposed method, since Bare soil is very difficult to accurately classify. In addition, the time costs of the different methods are listed in the table, where we can see that the proposed method is more efficient than K-SVD, D-KSVD, and LC-KSVD. Again, MOD, OnlineDL, and SDL take less time. The between-classifier statistical tests in terms of the Kappa z-score and McNemar z-score (Table 5) again indicate that the proposed method significantly outperforms the other methods. In this scene, we observe that OnlineDL is closest to our method, i.e., the Kappa z-score value for OnlineDL/S2JDL is 14.2, which is in accordance with the results reported in Table 4. The critical value of the z-score is 1.96 at a confidence level of 0.95, and all the tests are significant with 95% confidence.
Figure 16 visually depicts the obtained classification maps. The advantages obtained by adopting the semi-supervised dictionary learning approach with regard to the corresponding supervised case can be visually appreciated in the classification maps displayed in Figure 16, where the classification OAs obtained for each method are reported in parentheses. As shown in the figure, the homogeneity is very clear in this scene, and the proposed method yields a more accurate and smoother classification map. As expected, D-KSVD and LC-KSVD obtain poor classification maps for this scene.

Experiments with AVIRIS Salinas Dataset
Experiment 1: Similarly, we first analyze the impact of training data size on classification accuracy. A total of 5% of the labeled samples per class (2706 samples in total) are randomly selected to initialize the training data. We evaluate the impact of the number of atoms on the classification performance achieved by the proposed method for the AVIRIS Salinas dataset. Figure 17 shows the OAs as a function of the number of atoms per class obtained by the different methods. Similar to the experiments conducted on the AVIRIS Indian Pines dataset, the OAs become stable when 15 atoms per class are used to build the dictionary. Another observation is that when the number of atoms per class is larger than 10, our method stably outperforms the other methods. It is interesting to note that D-KSVD and LC-KSVD obtain similar results even though the latter incorporates the class-label information. In most cases, the OAs increase as the number of atoms increases. Figure 18 plots the OAs as a function of the ratio of labeled samples per class for the different methods. As we can see from the figure, the proposed method obtains the highest accuracies in all cases. Different from the former experiments, the proposed method stably obtains improved classification performance with additional labeled samples in this scene. However, the additional labeled samples deteriorate the classification performance of SDL.
Experiment 2: Table 6 gives the OA, AA, individual classification accuracies, and κ statistic. As reported in the table, the proposed method achieves the best results compared to the other supervised dictionary learning methods. The improvements in classification accuracy are around 10-20% when compared to the other methods. As for the individual classification accuracies, the proposed method obtains a higher accuracy when classifying the class Lettuce-romaine-6wk. In addition, the time costs of the different methods are listed in the table, where we can see that the proposed
method is more efficient than D-KSVD and LC-KSVD. Again, MOD, OnlineDL, and SDL take less time. Similarly, we conduct the between-classifier statistical tests in terms of the Kappa z-score and McNemar z-score for this scene. According to the results reported in Table 7, we observe that the proposed method significantly outperforms the other methods, since all the tests are significant with 95% confidence. Another observation is that K-SVD and the proposed method produce the most similar results, with a Kappa z-score value of 22.9. The critical value of the z-score is 1.96 at a confidence level of 0.95.
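The pairwise tests used throughout these experiments follow standard forms: the Kappa z-score compares two kappa coefficients via their estimated variances, and the McNemar z-score compares the two discordant counts (samples classified correctly by exactly one of the two classifiers). A minimal sketch, assuming the usual large-sample formulas (the paper's exact variance estimator is not shown here):

```python
import math

def kappa_z(k1, var1, k2, var2):
    """z-score for the difference of two kappa coefficients with
    estimated variances var1 and var2."""
    return abs(k1 - k2) / math.sqrt(var1 + var2)

def mcnemar_z(f12, f21):
    """z-score from the discordant counts: f12 = samples correct only
    under classifier 1, f21 = samples correct only under classifier 2."""
    return abs(f12 - f21) / math.sqrt(f12 + f21)

# A test is significant at the 95% level when z > 1.96 (illustrative values)
print(kappa_z(0.85, 1e-4, 0.80, 1e-4) > 1.96)   # True
print(mcnemar_z(120, 60) > 1.96)                 # True
```

With these forms, the critical value 1.96 quoted in the text corresponds to the two-sided 95% confidence level of the standard normal distribution.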
The classification maps are given in Figure 19, where the OAs obtained for each method are reported in parentheses. As shown in the figure, we can see clear differences between the different methods. For example, when classifying the class Lettuce-romaine-6wk, the proposed method is more accurate than the other methods. In addition, the homogeneity is very clear in this scene, which is similar to the AVIRIS Indian Pines dataset. Our method also produces a more accurate and smoother classification map compared to the other methods.

Conclusions
In this paper, we have developed a novel semi-supervised algorithm for jointly learning a reconstructive and discriminative dictionary for hyperspectral image classification. Precisely, we design a unified semi-supervised objective loss function which integrates a reconstruction term with a reconstruction-discrimination term built on the soft-max loss to leverage the unsupervised and supervised information from the training samples. In the iterative semi-supervised learning phase, we simultaneously update the dictionary and classifier by feeding the obtained training pairs into the unified objective function via an SGD algorithm. The experimental results obtained using three real hyperspectral images indicate that the proposed algorithm leads to better classification performance compared to the other related methods. We should note that although dictionary learning serves as an important part of DSR, previous studies involving the DSR problem have mainly focused on classification performance. Our experiments also mainly focus on classification, since it is the ultimate goal of this work. On the basis of the comprehensive experiments, we draw the following conclusions: (i) The proposed method is insensitive to λ and λ1, but it is sensitive to µ. (ii) The proposed method outperforms the other related algorithms in terms of classification accuracy, which demonstrates the superiority of the soft-max loss. (iii) Although the proposed method exhibits slightly higher computational complexity compared with MOD, OnlineDL, and SDL, its computational time is bearable.
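The joint update summarized above can be sketched as one first-order SGD step on the unified loss: an unlabeled sample contributes only the reconstruction gradient, while a labeled sample also contributes the soft-max classification gradient. The sketch below holds the sparse code z fixed during the step and uses illustrative step sizes; it is a simplification of the paper's update rule, not the exact algorithm:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def sgd_step(D, W, x, z, y=None, lr=0.01, mu=1.0):
    """One first-order update of dictionary D and classifier W.
    y is None for an unlabeled sample (reconstruction term only)."""
    r = D @ z - x                       # reconstruction residual
    D -= lr * np.outer(r, z)            # gradient of 0.5 * ||x - Dz||^2 w.r.t. D
    if y is not None:
        p = softmax(W @ z)              # class probabilities
        p[y] -= 1.0                     # gradient of soft-max loss w.r.t. scores
        W -= lr * mu * np.outer(p, z)   # mu weights the discrimination term
    return D, W

rng = np.random.default_rng(0)
D = rng.normal(size=(50, 20)); W = np.zeros((3, 20))
x, z = rng.normal(size=50), rng.normal(size=20)
loss0 = 0.5 * np.sum((x - D @ z) ** 2)
D, W = sgd_step(D, W, x, z, y=1)
loss1 = 0.5 * np.sum((x - D @ z) ** 2)
print(loss1 < loss0)                    # True: the reconstruction loss decreases
```

In the full algorithm, the sparse code z would itself be recomputed at each iteration, and the step size would decay over iterations as described in the training procedure.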
Further experiments with additional scenes and comparison methods should be conducted in the future. Furthermore, we envisage four future perspectives for the development of the presented work: (i) Given the fact that the dictionary and classifier are updated by randomly selected unlabeled samples, our future work will consider exploiting active learning [43] to select the most informative samples during the learning phase in the DSR model. (ii) Since the computational complexity of our algorithm is somewhat high, in our future work we will exploit the objective function to speed up the optimization process by incorporating incremental learning [56]. (iii) Inspired by the experimental results, our future work will explore the theoretical reasons for when and why more labeled samples help in semi-supervised joint dictionary learning tasks, which is a crucial and interesting problem. Possible factors include the training data distribution, the variance across the training and test data domains, the feature dimension, the sample size, and the number of labeled samples per class [57]. (iv) Since the spatial property is important in the proposed method, our future work will also focus on exploiting the dictionary structure in the DSR problem [37], and we will try to reveal the relation between dictionary structure and classification accuracy in DSR.
convergence of SGD, with t_0 = T/10, where T is the total number of iterations. Before applying another iteration, we remove x_u from the candidate pool X_u. The process of sampling, sparse coding, and updating is repeated as we loop through all the labeled samples x_l in X_l.
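The sampling-and-update loop described here can be outlined as follows. `sparse_code` and `update` stand in for the paper's sparse coding and SGD steps, and the learning-rate schedule lr_t = lr0 * t_0 / (t_0 + t) with t_0 = T/10 is one common decaying schedule consistent with the description, assumed here for illustration:

```python
import random

def train_epoch(X_l, X_u, sparse_code, update, lr0=0.1):
    """Loop through labeled samples, pairing each with a randomly drawn
    unlabeled sample; used unlabeled samples are removed from the pool."""
    pool = list(X_u)
    T = len(X_l)
    t0 = max(1, T // 10)               # as in the text: t0 = T/10
    for t, (x_l, y) in enumerate(X_l):
        lr = lr0 * t0 / (t0 + t)       # decaying SGD step size
        if pool:                        # sample and remove x_u from the pool
            x_u = pool.pop(random.randrange(len(pool)))
            update(x_u, None, sparse_code(x_u), lr)   # reconstruction only
        update(x_l, y, sparse_code(x_l), lr)          # reconstruction + soft-max

# Tiny smoke test with stand-in functions
calls = []
train_epoch(X_l=[(1.0, 0), (2.0, 1)], X_u=[9.0, 8.0, 7.0],
            sparse_code=lambda x: x,
            update=lambda x, y, z, lr: calls.append((x, y)))
print(len(calls))                      # 4: one unlabeled + one labeled update per pair
```

Each pass therefore consumes one unlabeled sample per labeled sample, matching the pairing scheme described in the training phase.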

Figure 1. Graphical illustration of the proposed method.
Three hyperspectral images collected by the Airborne Visible Infrared Imaging Spectrometer (AVIRIS) and the Reflective Optics Spectrographic Imaging System (ROSIS) are used in our experiments. The first hyperspectral image was acquired by the AVIRIS sensor over the Indian Pines region in northwestern Indiana in 1992. The image size is 145 × 145 pixels, with a moderate spatial resolution of 20 m. The number of data channels in the acquired image is 220 (with a spectral range from 0.4 to 2.5 µm). A total of 200 radiance channels are used in the experiments after removing several noisy and water-absorption bands. A three-band false color composite image and the ground-truth map are shown in Figure 2. A total of 10,366 samples covering 16 classes are available.
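Discarding the noisy and water-absorption channels amounts to masking band indices before classification. A minimal sketch; the removed-band list below is an assumption (the 20 channels commonly discarded for this scene in the literature), since the text only states that 200 of the 220 bands are retained:

```python
import numpy as np

cube = np.zeros((145, 145, 220))   # AVIRIS Indian Pines: 145 x 145 pixels, 220 bands
# Hypothetical removed channels (0-indexed): noisy/water-absorption bands.
bad = set(range(103, 108)) | set(range(149, 163)) | {219}
keep = [b for b in range(220) if b not in bad]
clean = cube[:, :, keep]           # retain the 200 usable radiance channels
print(clean.shape)                 # (145, 145, 200)
```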

Figure 5. Parameter sensitivity analysis of the proposed method for the AVIRIS Indian Pines dataset (10% of labeled samples per class are used for training and 15 labeled samples per class are used to build the dictionary). (a) Overall accuracy (OA) as a function of λ and λ1; (b) OA as a function of τ.

Figure 6. The evolution of overall accuracy with SNR for different classifiers (10% of labeled samples per class are used for training and 15 labeled samples per class are used to build the dictionary).

Figure 7. Overall accuracy (OA) as a function of the number of atoms per class for the AVIRIS Indian Pines dataset (10% of labeled samples per class are used for training). Error bars indicate the standard deviations obtained by the proposed method.

Figure 8. Overall accuracy (OA) as a function of the ratio of labeled samples per class for the AVIRIS Indian Pines dataset (15 labeled samples per class are used to build the dictionary). Error bars indicate the standard deviations obtained by the proposed method.

Figure 10. Stem distributions of sparse coefficients relative to the class Wheat obtained by different methods for the AVIRIS Indian Pines dataset. The circles terminating the stems represent the sparse coefficients of the associated atoms, which are marked with different colors representing different classes. (a) MOD; (b) K-SVD; (c) D-KSVD; (d) LC-KSVD; (e) OnlineDL; (f) SDL; (g) S2JDL-Log; (h) S2JDL-Sof.

Figure 14. Overall accuracy (OA) as a function of the number of atoms per class for the ROSIS University of Pavia dataset (5% of labeled samples per class are used for training). Error bars indicate the standard deviations obtained by the proposed method.

Figure 15. Overall accuracy (OA) as a function of the ratio of labeled samples per class for the ROSIS University of Pavia dataset (15 labeled samples per class are used to build the dictionary). Error bars indicate the standard deviations obtained by the proposed method.

Figure 17. Overall accuracy (OA) as a function of the number of atoms per class for the AVIRIS Salinas dataset (5% of labeled samples per class are used for training). Error bars indicate the standard deviations obtained by the proposed method.

Figure 18. Overall accuracy (OA) as a function of the ratio of labeled samples per class for the AVIRIS Salinas dataset (15 labeled samples per class are used to build the dictionary). Error bars indicate the standard deviations obtained by the proposed method.

Table 1. Overall (OA), average (AA), and individual class accuracies (%), kappa statistic (κ), and the standard deviations over ten Monte Carlo runs obtained by the different classification methods for the AVIRIS Indian Pines dataset with a balanced training set (10% of labeled samples per class are used for training and 15 labeled samples per class are used to build the dictionary).

Table 2. Pairwise statistical tests in terms of the Kappa z-score and McNemar z-score for the AVIRIS Indian Pines dataset with a balanced training set (10% of labeled samples per class are used for training and 15 labeled samples per class are used to build the dictionary).

Table 4. Overall (OA), average (AA), and individual class accuracies (%), kappa statistic (κ), and the standard deviations over ten Monte Carlo runs obtained by the different classification methods for the ROSIS University of Pavia dataset with a balanced training set (5% of labeled samples per class are used for training and 15 labeled samples per class are used to build the dictionary).

Table 5 also reports the between-classifier statistical tests in terms of the Kappa z-score and McNemar z-score; again, the results indicate that the proposed method significantly outperforms the other methods.

Table 5. Pairwise statistical tests in terms of the Kappa z-score and McNemar z-score for the ROSIS University of Pavia dataset with a balanced training set (5% of labeled samples per class are used for training and 15 labeled samples per class are used to build the dictionary).

Table 6. Overall (OA), average (AA), and individual class accuracies (%), kappa statistic (κ), and the standard deviations over ten Monte Carlo runs obtained by the different classification methods for the AVIRIS Salinas dataset with a balanced training set (5% of labeled samples per class are used for training and 15 labeled samples per class are used to build the dictionary).

Table 7. Pairwise statistical tests in terms of the Kappa z-score and McNemar z-score for the AVIRIS Salinas dataset with a balanced training set (5% of labeled samples per class are used for training and 15 labeled samples per class are used to build the dictionary).