A Robust Sparse Representation Model for Hyperspectral Image Classification

Sparse representation has been extensively investigated for hyperspectral image (HSI) classification and led to substantial improvements in the performance over the traditional methods, such as support vector machine (SVM). However, the existing sparsity-based classification methods typically assume Gaussian noise, neglecting the fact that HSIs are often corrupted by different types of noise in practice. In this paper, we develop a robust classification model that admits realistic mixed noise, which includes Gaussian noise and sparse noise. We combine a model for mixed noise with a prior on the representation coefficients of input data within a unified framework, which produces three kinds of robust classification methods based on sparse representation classification (SRC), joint SRC and joint SRC on a super-pixels level. Experimental results on simulated and real data demonstrate the effectiveness of the proposed method and clear benefits from the introduced mixed-noise model.


Introduction
Unlike classical multispectral images, hyperspectral images (HSIs) provide richer spectral information about the image objects in hundreds of narrow bands. An HSI is captured as a three-dimensional data cube comprising two-dimensional spatial information and one-dimensional spectral information. The spectral signature of a pixel is a vector whose entries correspond to spectral responses in different bands. Different materials have diverse spectral signatures, so hyperspectral imaging allows differentiation between materials that are often visually indistinguishable. Application areas include agriculture [1,2], defense and security [3] and environmental monitoring [4,5].
Classification of HSIs currently enjoys huge interest in the remote sensing community. The objective of supervised hyperspectral classification is to group pixels into different classes with the classifiers trained by the given training samples. A large number of HSI classification methods have been proposed, based on artificial neural networks [6], multinomial logistic regression [7,8], spectral-spatial preprocessing with multihypothesis prediction [9], information fusion [10] and support vector machines (SVM) [11], just to name a few. With the target of exploiting spatial information in the classification task, spatial-spectral classification approaches have been developed, including SVM with composite kernels [12], methods based on mathematical morphology [13][14][15][16][17] and image segmentation [18].
In recent years, sparse representation classification (SRC) [19] emerged as another effective classification approach, which became widely adopted for HSI [20][21][22][23][24][25][26][27][28][29][30][31][32][33][34]. SRC assumes that each test sample can be sparsely represented as a linear combination of atoms from a dictionary, which is constructed or learned from training samples [19]. Chen et al. [20] introduced the joint sparse representation classification (JSRC) in HSI classification by incorporating spatial information. The model was based on the observation that the pixels in a patch share similar spectral characteristics and can be represented by a common set of atoms but with different sparse coefficients. Zhang et al. [21] proposed a nonlocal weighted joint sparse representation (NLW-JSRC) to further improve the classification accuracy. They enforced a weight matrix on the pixels of a patch in order to discard the invalid pixels whose class was different from that of the central pixel. The works in [22,24] extended the JSRC to the kernel versions to address the linearly non-separable problem. In [27], a multi-layer spatial-spectral sparse representation framework was proposed for HSI classification in order to stabilize the sparse codes of the traditional single-layer sparse representation. Related classification methods effectively exploiting spatial information with adaptive neighborhood were reported in [25,26,31] and produced good results. Recent studies in [29][30][31][32][33] indicated that learning a compact and discriminative dictionary from the training samples can reduce the computational burden significantly.
However, all of these sparsity-based methods for HSI classification only take into account Gaussian noise. In real applications, HSIs are inevitably corrupted by different kinds of noise, including Gaussian noise and sparse noise. Here, sparse noise is defined as noise of arbitrary magnitude that only affects certain bands or pixels, such as impulse noise, dead lines and stripes. It may arise due to defective pixels and poor imaging conditions such as water vapor and atmospheric effects [35]. By considering sparse noise in HSI denoising [36,37], unmixing [35,38] and robust learning [39], significant improvements have been achieved over the state-of-the-art methods, which indicates the importance of taking sparse noise into account in those tasks. In the classification task, sparse noise can likewise degrade the performance. We are not aware of any sparsity-based classification method that takes it explicitly into account. This motivates us to develop a robust classification model that accounts for realistic degradations in HSIs.
The key idea of our model is to incorporate the presence of sparse noise in HSIs into the classification problem, by combining the appropriate statistical models for the sparse noise and the representation coefficients of test pixel(s) within a unified framework. In particular, we make use of the fact that test pixels can be represented with relatively few atoms from a well-constructed dictionary, meaning that the representation coefficients are sparse or jointly sparse within small neighborhoods. This is the main assumption of the SRC and JSRC models. In addition, we introduce a statistical model for the sparse noise as an instance of a multivariate Laplacian distribution, which allows us to derive an optimization problem that elegantly extends the previous ones with an additional $\ell_1$ norm on the sparse noise term. Following this idea, we extend and generalize the existing SRC [19] and JSRC [20] methods to robust versions, i.e., robust SRC (R-SRC) and robust JSRC (R-JSRC), respectively. We also derive an optimization algorithm for the corresponding objective function, based on the alternating minimization strategy.
Moreover, in order to further exploit the available spatial information, we extend the R-JSRC model to a classification model on a super-pixel level. In the JSRC model, spatial information is defined by the collection of neighbouring pixels in a square window of fixed size, while super-pixel segmentation can adaptively divide the HSI into a number of non-overlapping homogeneous regions depending on the spatial content, which makes the joint sparse representation more effective and precise. We name this extended method robust super-pixel level joint sparse representation classification (R-SJSRC). The results on simulated and real data demonstrate improved performance in comparison to recent related methods and a clear benefit resulting from the introduced robust model. Parts of this work have been accepted for presentation at a conference [40]. In comparison to the conference version, here we give a more elaborate presentation and analysis of the method. Moreover, extra experiments with both simulated and real HSI data are conducted to investigate the effect of sparse noise and of the parameters on performance.
The main contributions of the paper can be summarized as follows: (1) A robust sparsity-based classification model for HSIs is proposed when the data is corrupted by Gaussian noise and sparse noise, by incorporating the appropriate priors for noise-free data and degradations into an optimization framework. (2) An efficient algorithm is developed to solve the optimization problem by using an alternating minimization strategy. (3) The robust model is extended to efficiently incorporate spatial information. By jointly processing super-pixels, we strongly improve the performance both in terms of the classification accuracy and processing speed.
The rest of this paper is organized as follows. Section 2 reviews briefly the classical sparsity-based models in HSI classification. Section 3 extends the existing sparsity-based models to the robust versions and designs an effective algorithm to solve corresponding optimization problems. Section 4 presents experimental results with simulated and real data and Section 5 concludes the paper.

Sparse Representation Classification
Let $x \in \mathbb{R}^{B}$ be a test pixel and $D = [D_1, D_2, \ldots, D_C] \in \mathbb{R}^{B \times d}$ a structured dictionary constructed from training samples, where $B$ is the number of bands in the HSI, $d$ is the number of training samples, $C$ is the number of classes, $D_i \in \mathbb{R}^{B \times d_i}$ ($i = 1, 2, \ldots, C$) is the sub-dictionary in which each column is a training sample of the $i$-th class, and $d_i$ is the number of training samples from class $i$, such that $\sum_{i=1}^{C} d_i = d$. The goal of sparse representation is to represent each test pixel as
$$x = D\alpha + n, \tag{1}$$
where $n \in \mathbb{R}^{B}$ is Gaussian noise and $\alpha \in \mathbb{R}^{d}$ are sparse coefficients, satisfying
$$\hat{\alpha} = \arg\min_{\alpha} \|x - D\alpha\|_2^2 \quad \text{s.t.} \quad \|\alpha\|_0 \le K. \tag{2}$$
Here, $\|\alpha\|_0$ denotes the number of non-zero elements in $\alpha$ and $K$ is the sparsity level, i.e., the largest number of atoms in dictionary $D$ needed to represent any input sample $x$. The problem in Equation (2) is typically solved with a greedy algorithm, such as Orthogonal Matching Pursuit (OMP) [41].
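The class of the test sample is identified by calculating the class-specific residuals $r_i$ [19]:
$$\text{class}(x) = \arg\min_{i = 1, \ldots, C} r_i(x) = \arg\min_{i = 1, \ldots, C} \|x - D_i \hat{\alpha}_i\|_2, \tag{3}$$
where $\hat{\alpha}_i$ are the sparse coefficients associated with class $i$.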
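The SRC pipeline above (greedy sparse coding followed by class-specific residuals) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names `omp` and `src_classify`, the `labels` array that maps each atom to its class, and the toy dictionary are all hypothetical.

```python
import numpy as np

def omp(D, x, K):
    """Greedy Orthogonal Matching Pursuit: use at most K atoms of D to approximate x."""
    residual = x.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(K):
        # Pick the atom most correlated with the current residual.
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j in support:
            break
        support.append(j)
        # Re-fit all selected atoms jointly by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    alpha = np.zeros(D.shape[1])
    alpha[support] = coef
    return alpha

def src_classify(D, labels, x, K):
    """Return the class whose atoms reconstruct x with the smallest residual."""
    alpha = omp(D, x, K)
    classes = np.unique(labels)
    residuals = [np.linalg.norm(x - D[:, labels == c] @ alpha[labels == c])
                 for c in classes]
    return classes[int(np.argmin(residuals))]

# Toy dictionary: 4 bands, 2 atoms per class (hypothetical data).
D = np.eye(4)
labels = np.array([0, 0, 1, 1])
x = np.array([0.0, 0.0, 2.0, 0.5])      # built from the class-1 atoms only
predicted = src_classify(D, labels, x, K=2)
```

In practice the atoms are real training spectra, usually normalised to unit length so that the correlation step is not biased towards bright pixels.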

Joint Sparse Representation Classification
An effective method to exploit the spatial information of the HSI is using joint sparse representation of neighbouring pixels. The assumption is that the pixels in a small patch are likely to belong to the same class and thus share the same sparsity pattern, meaning that they can be represented by the same set of atoms but with different sets of coefficients [20]. In the JSRC model, the spatial neighbourhood for the central pixel is a square window and all the neighbouring pixels are gathered into the input matrix $X = [x_1, x_2, \ldots, x_T] \in \mathbb{R}^{B \times T}$, where $x_i$ is the spectral signature of the $i$-th pixel in a patch of size $\sqrt{T} \times \sqrt{T}$. Denoting by $\alpha_i$ the sparse coefficients of $x_i$ in dictionary $D$ leads to
$$X = [D\alpha_1 + n_1, D\alpha_2 + n_2, \ldots, D\alpha_T + n_T] = DA + N, \tag{4}$$
where $A = [\alpha_1, \alpha_2, \ldots, \alpha_T] \in \mathbb{R}^{d \times T}$ is the coefficient matrix and $N = [n_1, n_2, \ldots, n_T] \in \mathbb{R}^{B \times T}$ collects the Gaussian noise.
Since all $x_i$ in a small patch are likely to belong to the same class and thus share the same set of atoms, the $\alpha_i$ have non-zero entries at the same positions. Therefore, $A$ is row-sparse, and can be obtained by solving the following problem with the Simultaneous Orthogonal Matching Pursuit (SOMP) algorithm [42]:
$$\hat{A} = \arg\min_{A} \|X - DA\|_F^2 \quad \text{s.t.} \quad \|A\|_{\text{row},0} \le K_0, \tag{5}$$
where $\|X\|_F$ denotes the Frobenius norm of $X$, $\|A\|_{\text{row},0}$ denotes the number of non-zero rows of $A$ and $K_0$ is the row-sparsity level. In a similar way to SRC, the central test pixel of the patch is labeled by minimizing the class-specific residual:
$$\text{class}(x_{\text{central}}) = \arg\min_{i = 1, \ldots, C} \|X - D_i \hat{A}_i\|_F, \tag{6}$$
where $\hat{A}_i$ is the portion of the sparse matrix $\hat{A}$ associated with class $i$.
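The joint model can be illustrated in the same way. The sketch below is a simplified SOMP in NumPy, written for this illustration rather than taken from [42]: atoms are scored by the 2-norm of their correlations with all residual columns, which is one common choice, and the names `somp`, `jsrc_classify` and `labels` are hypothetical.

```python
import numpy as np

def somp(D, X, K0):
    """Simultaneous OMP: select up to K0 atoms shared by all columns of X."""
    R = X.copy()
    support = []
    coef = np.zeros((0, X.shape[1]))
    for _ in range(K0):
        # Score each atom by its total correlation with all residual columns.
        scores = np.linalg.norm(D.T @ R, axis=1)
        j = int(np.argmax(scores))
        if j in support:
            break
        support.append(j)
        coef, *_ = np.linalg.lstsq(D[:, support], X, rcond=None)
        R = X - D[:, support] @ coef
    A = np.zeros((D.shape[1], X.shape[1]))
    A[support, :] = coef                 # only the selected rows are non-zero
    return A

def jsrc_classify(D, labels, X, K0):
    """Label the patch by the class with the smallest Frobenius residual."""
    A = somp(D, X, K0)
    classes = np.unique(labels)
    res = [np.linalg.norm(X - D[:, labels == c] @ A[labels == c, :])
           for c in classes]
    return classes[int(np.argmin(res))], A

# Toy patch of T = 2 pixels that both use the class-0 atoms (hypothetical data).
D = np.eye(4)
labels = np.array([0, 0, 1, 1])
X = np.array([[1.0, 2.0],
              [0.5, 1.0],
              [0.0, 0.0],
              [0.0, 0.0]])
label, A = jsrc_classify(D, labels, X, K0=2)
n_active_rows = int(np.count_nonzero(np.linalg.norm(A, axis=1)))
```

Both pixels end up with the same two active rows but different coefficients, which is exactly the row-sparsity structure assumed for $A$.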

Robust SRC Model
Here, we develop a more general classification method, which takes into account not only the Gaussian noise (as described above) but also sparse noise, which affects real HSIs. The motivation is as follows. In practice, HSIs are often contaminated by horizontal and vertical stripes, impulse noise and dead lines. This type of degradation is called sparse noise as it affects only relatively few pixels. Sparse noise typically arises in situations with poor imaging conditions or due to sensor artifacts. In real HSIs, different bands can be corrupted by different kinds of noise [35,38]. In some bands, sparse noise is the dominant degradation, while others may be corrupted by mixed noise. An example of the noise in a real HSI (Hyperspectral Digital Image Collection Experiment (HYDICE) Urban data set [35]) can be found in Figure 1, where Figure 1a shows a band affected by stripe noise, and Figure 1b shows a band affected by a mixture of sparse noise and Gaussian noise. We model the observed pixel in an HSI as:
$$x = y + s + n, \tag{7}$$
where $y \in \mathbb{R}^{B}$ is an error-free sample, $s \in \mathbb{R}^{B}$ is sparse noise and $n \in \mathbb{R}^{B}$ is Gaussian noise.
As the error-free samples are not available in practice, we have to express $y$ in terms of the observed samples. To this end, we will employ in our derivation a hypothetical, ideal dictionary $D_y$. Let $D_y = [y_1, y_2, \ldots, y_d] \in \mathbb{R}^{B \times d}$ denote an ideal, error-free dictionary and $y_j \in \mathbb{R}^{B}$ the $j$-th error-free training sample. The main assumption of SRC is that any $y$ can be represented by a few atoms in $D_y$ as follows:
$$y = D_y \alpha + \varepsilon, \tag{8}$$
where $\alpha$ is a sparse vector and $\varepsilon$ is arbitrarily small. The model (7) holds for any observed sample, including the training samples:
$$D = D_y + D_s + D_n, \tag{9}$$
where $D = [x_1, x_2, \ldots, x_d]$, $D_s = [s_1, s_2, \ldots, s_d]$ and $D_n = [n_1, n_2, \ldots, n_d]$ are collections, or dictionaries, composed of the observed data $x_i$, sparse noise components $s_i$ and Gaussian noise components $n_i$, respectively. Substituting Equations (7) and (9) into (8), we derive the representation of $x$ as follows:
$$x = D\alpha + s' + n', \tag{10}$$
where $s' = s - D_s \alpha$ and $n' = n - D_n \alpha + \varepsilon$.
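A linear combination of two (or more) sparse vectors is not necessarily sparse. However, the sparse noise in HSIs is typically detected only in certain and relatively few bands, which means the non-zero elements of $s$ and $s_i$ are located at relatively few positions. Therefore $s'$, being a linear combination of $s$ and elements of $\{s_i\}_{i=1}^{d}$, is sparse as well. The expression in Equation (10) tells us that the observed pixel contaminated by sparse noise and Gaussian noise can be represented by relatively few atoms from the noisy dictionary with the addition of a sparse term $s'$ and an error term $n'$. Note that $s'$ in Equation (10) is not exactly the sparse noise of $x$ but a mixture of the sparse noise in $x$ and $D$, which is the reason why this model cannot be directly used in the denoising task.

Now, we are ready to define an optimization problem that generalizes the one in Equation (2) as a result of our mixed-noise model. Observe first that the problem in Equation (2) can equivalently be written as
$$\hat{\alpha} = \arg\max_{\alpha} p(x; \alpha) \quad \text{s.t.} \quad \|\alpha\|_0 \le K, \tag{11}$$
where $p(x; \alpha)$ is the probability distribution of $x$ with parameter $\alpha$, which according to the model in Equation (1) is $\mathcal{N}(D\alpha, \sigma_n I)$. We formulate a similar problem taking into account the sparse noise $s'$:
$$\{\hat{\alpha}, \hat{s}'\} = \arg\max_{\alpha, s'} p(x, s'; \alpha) \quad \text{s.t.} \quad \|\alpha\|_0 \le K. \tag{12}$$
Making use of the fact that $p(x, s') = p(x|s')p(s')$ and that the parameter $\alpha$ appears only in the first term, we can rewrite the objective function in Equation (12) as
$$\{\hat{\alpha}, \hat{s}'\} = \arg\max_{\alpha, s'} \; \log p(x|s'; \alpha) + \log p(s') \quad \text{s.t.} \quad \|\alpha\|_0 \le K. \tag{13}$$
From our model (10), it follows that $p(x|s'; \alpha) \propto \exp\left(-\frac{1}{2\sigma_n}\|x - D\alpha - s'\|_2^2\right)$. By imposing a Laplacian prior on $s'$ of the form $p(s') \propto \exp\left(-\frac{1}{2\tau}\|s'\|_1\right)$ with $\tau > 0$, the objective in Equation (13) can be written as
$$\arg\min_{\alpha, s'} \; \frac{1}{2\sigma_n}\|x - D\alpha - s'\|_2^2 + \frac{1}{2\tau}\|s'\|_1. \tag{14}$$
With this, we can rewrite Equation (13) as
$$\{\hat{\alpha}, \hat{s}'\} = \arg\min_{\alpha, s'} \; \|x - D\alpha - s'\|_2^2 + \lambda\|s'\|_1 \quad \text{s.t.} \quad \|\alpha\|_0 \le K, \tag{15}$$
where $\lambda = \sigma_n/\tau$ is a positive parameter that controls the tradeoff between data fidelity and the constraint on the sparse noise.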
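The sparsity argument used at the start of this derivation, namely that the mixed term stays sparse when the sparse noise of the test pixel and of the atoms is confined to the same few bands, can be checked numerically. The sketch below is purely illustrative: the sizes, the affected bands and the coefficient values are made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
B, d = 50, 20                          # bands and dictionary atoms (illustrative)
noisy_bands = np.array([7, 8, 30])     # assumed: sparse noise only hits these bands

# Sparse noise of the dictionary atoms (D_s) and of the test pixel (s).
D_s = np.zeros((B, d))
D_s[noisy_bands, :] = rng.normal(size=(noisy_bands.size, d))
s = np.zeros(B)
s[noisy_bands] = rng.normal(size=noisy_bands.size)

# A sparse coefficient vector with three active atoms.
alpha = np.zeros(d)
alpha[[2, 5, 11]] = [0.6, 0.3, 0.1]

# Mixed sparse term of the robust model: sparse noise of x minus that of the atoms.
s_mixed = s - D_s @ alpha
support = np.flatnonzero(s_mixed)
```

The support of the mixed term is contained in the set of affected bands, so it has at most three non-zero entries out of fifty. If every atom were corrupted in different bands, the support could grow with the number of active atoms, which is why the band-confinement assumption matters.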
The resulting optimization problem in Equation (15) combines prior knowledge about the representation coefficients $\alpha$ (meaning that $\alpha$ is sparse), a statistical model for the observation $x$ in Equation (10) expressed as $x \sim \mathcal{N}(D\alpha + s', \sigma_n I)$, and a prior model for the sparse noise $p(s') \propto \exp\left(-\frac{1}{2\tau}\|s'\|_1\right)$. We solve this problem by an alternating minimization algorithm described later (Section 3.4).
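Once the sparse coefficients are obtained, we can calculate the class of $x$ by
$$\text{class}(x) = \arg\min_{i = 1, \ldots, C} \|x - D_i \hat{\alpha}_i - \hat{s}'\|_2, \tag{16}$$
where $\hat{\alpha}_i$ is the sparse vector associated with class $i$.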

Robust JSRC Model
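Similar to Equation (4), by gathering all the neighbouring pixels around a central test pixel into a matrix $X$, we can rewrite Equation (10) in matrix form as follows:
$$X = DA + S + N, \tag{17}$$
where $S \in \mathbb{R}^{B \times T}$ and $N \in \mathbb{R}^{B \times T}$ are the corresponding matrices representing sparse noise and Gaussian noise, respectively. With the assumption as in the JSRC model that the pixels in a small patch share the same set of training samples, the proposed optimization problem with respect to $A$ and $S$ can be formulated as:
$$\{\hat{A}, \hat{S}\} = \arg\min_{A, S} \; \|X - DA - S\|_F^2 + \lambda\|S\|_1 \quad \text{s.t.} \quad \|A\|_{\text{row},0} \le K_0, \tag{18}$$
where $\|S\|_1$ is a norm defined as $\|S\|_1 = \sum_{i=1}^{B}\sum_{j=1}^{T} |S_{ij}|$. After finding the sparse coefficient matrix $\hat{A}$ and the sparse noise matrix $\hat{S}$, we can label the class of the central pixel by
$$\text{class}(x_{\text{central}}) = \arg\min_{i = 1, \ldots, C} \|X - D_i \hat{A}_i - \hat{S}\|_F, \tag{19}$$
where $\hat{A}_i$ denotes the portion of the sparse matrix $\hat{A}$ corresponding to class $i$.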

Robust Super-Pixel Level JSRC
Imposing that pixels within a fixed-size rectangular neighbourhood share the same sparsity pattern, as in JSRC, has the following limitations. First, the size of the window is a free parameter, and determining its optimal value requires some tuning that varies from one image to the other. Secondly, when the central pixel is located on or near the boundaries between different classes, its neighbouring pixels belong to multiple classes, violating the assumption of the JSRC model and causing classification errors in these border regions. Finally, in practice, both the shape and the size of nearly homogeneous regions may vary a lot across a real scene, which suggests adaptive neighbourhoods instead of the fixed ones. The price to pay for such adaptive instead of fixed neighbourhoods is that a certain type of segmentation is needed. However, it turns out that such an approach with adaptive neighbourhoods may be advantageous not only in terms of accuracy, but also in terms of the net computation time, since each small region can be classified simultaneously as we show next.
We develop here a robust JSRC model, where the spatial information is captured at a super-pixel level, instead of using fixed-size rectangular neighbourhoods. Super-pixel segmentation techniques [43] adaptively divide the image into non-overlapping super-pixels being nearly homogeneous regions according to some criterion. In our problem, each super-pixel is a relatively small arbitrarily shaped and nearly homogeneous region, composed of pixels that belong to the same class. Let X now denote a matrix composed of pixels within the same super-pixel. With the same reasoning as in the previous section, we assume the model in Equation (17). Note that now the size of X is not fixed, but, otherwise, the formal description remains equivalent to the previous one, with the optimisation problem defined in Equation (18).
An important difference, both formally and practically, is that now we can assign the whole of $X$ to a given class at once, instead of its central pixel alone as in Equation (19). Now, we have
$$\text{class}(X) = \arg\min_{i = 1, \ldots, C} \|X - D_i \hat{A}_i - \hat{S}\|_F. \tag{20}$$
Here, the class label is calculated for all pixels of a super-pixel simultaneously, which also means that the sparse coding problem, the calculation of class residuals and the minimization over these are carried out only once per non-overlapping super-pixel. In contrast, in Section 3.2, all these operations are performed in each sliding window, centred around each image pixel. A typical hyperspectral image in remote sensing often has a size of thousands by thousands of pixels or more, amounting to over a million pixels, while we segment it into a few hundred or a few thousand super-pixels. This indicates a tremendous saving in computation. A concrete example is given in Section 4.2.
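The bookkeeping behind this saving is simple: once each super-pixel has been classified, its label is copied to all of its pixels. The sketch below assumes a segmentation map with integer ids from 0 to Ns - 1; the function name `broadcast_superpixel_labels` and the toy map are hypothetical, while the scene size and super-pixel count in the final lines are the Indian Pines values reported in Section 4.2.

```python
import numpy as np

def broadcast_superpixel_labels(segments, superpixel_class):
    """Assign to every pixel the class decided for its super-pixel.

    segments: (H, W) integer map, one id per super-pixel (ids 0..Ns-1).
    superpixel_class: length-Ns array with the class chosen for each super-pixel.
    """
    return np.asarray(superpixel_class)[segments]

# Toy 3 x 3 segmentation with three super-pixels (hypothetical data).
segments = np.array([[0, 0, 1],
                     [0, 2, 1],
                     [2, 2, 1]])
superpixel_class = np.array([3, 1, 2])
label_map = broadcast_superpixel_labels(segments, superpixel_class)

# Number of joint sparse coding problems for a 145 x 145 scene:
n_sliding = 145 * 145      # one window per pixel
n_superpixels = 300        # one problem per super-pixel
```

For this scene the sliding-window approach solves 21025 joint problems against 300 for the super-pixel approach, although the individual problems differ in size, so the ratio of counts is only a rough indicator of the runtime ratio.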

Optimization Algorithm
Here, we present an optimization algorithm to solve the proposed robust model by an alternating minimization strategy. A general derivation for the optimization in matrix form is shown in Algorithm 1, where the input matrix $X$ can represent a patch in R-JSRC or a super-pixel in R-SJSRC, or reduce to a single vector in R-SRC. We employ alternating minimization similarly as in [28,31,36,44] to split a difficult problem into two easily solvable ones, by fixing one variable while solving for the other and alternating the process iteratively. In the $(k+1)$-th iteration, we update $A$ and $S$ as follows:
$$A^{(k+1)} = \arg\min_{A} \|X - DA - S^{(k)}\|_F^2 \quad \text{s.t.} \quad \|A\|_{\text{row},0} \le K_0, \tag{21}$$
$$S^{(k+1)} = \arg\min_{S} \|X - DA^{(k+1)} - S\|_F^2 + \lambda\|S\|_1. \tag{22}$$
The problem in Equation (21) can be solved by the SOMP algorithm [42]. For the problem in Equation (22), the optimization with respect to $S^{(k+1)}$ is formulated as
$$S^{(k+1)} = \arg\min_{S} \; \|S - (X - DA^{(k+1)})\|_F^2 + \lambda\|S\|_1, \tag{23}$$
which is the well-known shrinkage problem. By introducing the following soft-thresholding operator, applied element-wise:
$$\mathcal{R}_{\Delta}(x) = \operatorname{sgn}(x)\,\max(|x| - \Delta, 0), \tag{24}$$
the solution of Equation (23) is given by
$$S^{(k+1)} = \mathcal{R}_{\lambda/2}(X - DA^{(k+1)}). \tag{25}$$
Note that, for the vector form of R-SRC in Algorithm 1, the sparse coefficients $\alpha$ in step 4 are obtained by the OMP algorithm [41] and $s'$ in step 5 is obtained as $\mathcal{R}_{\lambda/2}(x - D\alpha)$. The class in step 8 is labeled by Equation (16).
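The two alternating updates can be sketched as follows. This is an illustrative NumPy version, not a transcription of Algorithm 1: the fixed iteration count `n_iter` stands in for whatever stopping rule the algorithm uses, the simplified `somp` routine is written for this example, and all function names and the toy data are assumptions.

```python
import numpy as np

def soft_threshold(R, t):
    """Element-wise shrinkage: sign(r) * max(|r| - t, 0)."""
    return np.sign(R) * np.maximum(np.abs(R) - t, 0.0)

def somp(D, X, K0):
    """Simplified Simultaneous OMP: up to K0 atoms shared by all columns of X."""
    R, support = X.copy(), []
    coef = np.zeros((0, X.shape[1]))
    for _ in range(K0):
        j = int(np.argmax(np.linalg.norm(D.T @ R, axis=1)))
        if j in support:
            break
        support.append(j)
        coef, *_ = np.linalg.lstsq(D[:, support], X, rcond=None)
        R = X - D[:, support] @ coef
    A = np.zeros((D.shape[1], X.shape[1]))
    A[support, :] = coef
    return A

def robust_jsrc(D, labels, X, K0, lam, n_iter=10):
    """Alternate the A-update (SOMP) and the S-update (shrinkage), then classify."""
    S = np.zeros_like(X)
    for _ in range(n_iter):
        A = somp(D, X - S, K0)                    # A-step on noise-compensated data
        S = soft_threshold(X - D @ A, lam / 2)    # S-step in closed form
    classes = np.unique(labels)
    res = [np.linalg.norm(X - D[:, labels == c] @ A[labels == c, :] - S)
           for c in classes]
    return classes[int(np.argmin(res))], A, S

# Toy data: 6 bands, class-0 pixels, one spike of height 3 in band 5 (hypothetical).
D = np.eye(6)[:, :4]
labels = np.array([0, 0, 1, 1])
X = np.array([[1.0, 2.0], [0.5, 1.0], [0.0, 0.0],
              [0.0, 0.0], [0.0, 0.0], [3.0, 0.0]])
label, A, S = robust_jsrc(D, labels, X, K0=2, lam=0.2)
```

In this toy case the spike is absorbed by `S` (shrunk by the threshold 0.1 to 2.9) and the patch is assigned to class 0. Passing a single-column `X` gives the vector form used by R-SRC, with SOMP reducing to OMP.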

Experiments
We evaluate the performance of our methods on both simulated and real hyperspectral images, in comparison with SVM with radial basis function (RBF) kernel [45], SRC [19], JSRC [20] and NLW-JSRC [21]. As quantitative performance measures, we adopt the common indicators: overall accuracy (OA), average accuracy (AA) and Kappa coefficient (κ). All the reported results represent the average of ten runs. In each run, the training samples are randomly selected and the remaining labeled samples are used for testing.
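For reference, the three indicators can be computed from a confusion matrix as in the sketch below. It uses the standard definitions of OA, AA and Cohen's kappa; the function name and the toy labels are illustrative, and the AA line assumes that every class has at least one test sample.

```python
import numpy as np

def classification_scores(y_true, y_pred, n_classes):
    """Overall accuracy, average (per-class) accuracy and Cohen's kappa."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                       # rows: reference, columns: prediction
    n = cm.sum()
    oa = np.trace(cm) / n                   # fraction of correctly labelled samples
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))   # mean of per-class accuracies
    pe = np.sum(cm.sum(axis=0) * cm.sum(axis=1)) / n**2   # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1]
oa, aa, kappa = classification_scores(y_true, y_pred, n_classes=2)
```

On this toy example OA is 5/6, AA is 0.875 and kappa is 2/3. AA weights all classes equally, which is why it is informative for data sets with very unbalanced class sizes.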

Results for the Simulated HSI Experiment
The Washington DC image shown in Figure 2a was collected by the HYDICE sensor. Due to its high quality, this image is commonly used to simulate data degraded with different kinds of noise. The image is of size 280 × 307 × 210 with the spectrum ranging from 0.4 to 2.4 µm and has six classes in total. In this experiment, we reduce the number of bands to 191 by removing the opaque bands. Five percent of the labeled samples were randomly selected as training samples and the remainder were used as test samples, as shown in Table 1.
Experiment 1 (Synthetic simulation): In this simulated experiment, four kinds of noise were added as follows:

1. Zero-mean Gaussian noise in all bands with SNR value for each band varying from 10 to 20 dB.

3. Dead lines in bands 70-73 with width ranging from one line to three lines.

For the results in Table 1, all the parameters were tuned to give the best results; the best results are denoted in bold and the suboptimal results are underlined. In order to evaluate the contribution of each component of the proposed approach separately (both the robust nature and the handling of spatial context), we also implemented the super-pixel level joint sparse representation classification (SJSRC) method with the same segmentation map as R-SJSRC. The results in Table 1 and Figure 2 show that the R-SJSRC model yielded a superior performance in terms of OA, AA and Kappa coefficient. The improvement due to the better spatial modelling can be clearly seen by comparing the performance of the super-pixel based SJSRC with the original JSRC. In terms of OA, this improvement was above 9.6%. A further improvement results from the improved noise model in R-SJSRC (the OA increases by another 1.5% compared to SJSRC). Similarly, the robust versions R-SRC and R-JSRC improve consistently over the corresponding SRC and JSRC methods, respectively.

Experiment 2 (Effect of sparse noise): In this experiment, we analyse the robustness of our models to degradations dominated by sparse noise. We attempt to simulate a realistic situation where at least a small amount of Gaussian noise is always present and where sparse noise only affects certain bands or pixels of HSIs. Therefore, we first add a small amount of zero-mean white Gaussian noise, such that the resulting SNR is 30 dB, and subsequently we introduce sparse noise. Let $S_b$ denote the fraction of bands affected by sparse noise and $S_p$ the fraction of affected pixels in each band. We perform experiments with $S_b = S_p = S \in \{0, 0.05, 0.1, 0.2\}$. The results are reported in Figure 3. R-SJSRC is the most stable method among all the tested ones, while SRC degrades sharply with the increasing level of sparse noise.
Clearly, the performance of R-SJSRC is less sensitive to sparse noise than that of other methods. Moreover, the robust methods R-SRC, R-JSRC and R-SJSRC yield consistent improvements over the original models as expected.
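A degradation of this kind can be simulated along the following lines. The sketch makes several assumptions that the text does not fix: SNR is taken as a per-band power ratio, the sparse noise is modelled as salt-and-pepper impulses, and the function names, the random cube and its size are all hypothetical.

```python
import numpy as np

def add_gaussian_noise(cube, snr_db, rng):
    """Add zero-mean white Gaussian noise so that each band has the given SNR (dB)."""
    signal_power = np.mean(cube ** 2, axis=(0, 1), keepdims=True)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return cube + rng.normal(size=cube.shape) * np.sqrt(noise_power)

def add_sparse_noise(cube, frac_bands, frac_pixels, rng):
    """Corrupt a fraction of bands; in each, set a fraction of pixels to min or max."""
    out = cube.copy()
    H, W, B = cube.shape
    n_bands = int(round(frac_bands * B))
    n_pix = int(round(frac_pixels * H * W))
    bands = rng.choice(B, size=n_bands, replace=False)
    lo, hi = cube.min(), cube.max()
    for b in bands:
        idx = rng.choice(H * W, size=n_pix, replace=False)
        rows, cols = np.unravel_index(idx, (H, W))
        out[rows, cols, b] = rng.choice([lo, hi], size=n_pix)   # impulse values
    return out, np.sort(bands)

rng = np.random.default_rng(0)
cube = rng.uniform(0.2, 0.8, size=(20, 20, 40))        # stand-in for a clean HSI
gauss = add_gaussian_noise(cube, 30.0, rng)            # 30 dB Gaussian noise first
mixed, bands = add_sparse_noise(gauss, 0.1, 0.1, rng)  # then S_b = S_p = 0.1
```

Other forms of sparse noise mentioned in the paper, such as dead lines and stripes, would replace the impulse step by setting whole rows or columns of the selected bands.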

Experiment 3 (Effect of sparsity constraint λ):
In this experiment, we study the effect of the parameter $\lambda$ on the classification performance of our methods. The test image was first degraded by Gaussian noise such that the SNR is 30 dB, and then corrupted by sparse noise with $S_b = S_p = 0.2$. The classification performance for R-SRC, R-JSRC and R-SJSRC is reported in Figure 4. Note that when the parameter $\lambda$ is set to zero, R-SRC, R-JSRC and R-SJSRC reduce to SRC, JSRC and SJSRC, respectively.
We can observe in Figure 4 that the overall accuracies of the three models show similar trends as a function of the parameter $\lambda$. When the value of $\lambda$ is relatively low, which means we enforce a smaller weight on the sparse noise, the performance of the proposed methods is not significantly improved over the results with $\lambda = 0$. As the value of $\lambda$ increases, the classification performance also improves, reaching its highest values at $\lambda = 10^{-3}$ for R-JSRC and R-SJSRC, and at $\lambda = 10^{-2}$ for R-SRC. The improvements for R-SRC, R-JSRC and R-SJSRC show the benefit of incorporating the effect of sparse noise in our models.

Results for Real HSI Experiment
In this section, two real HSI datasets are used: Indian Pines data set and an urban area HYDICE data set.

Classification Results on the Real Datasets
The first experiment was conducted on the Indian Pines image, which was acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the Indian Pines region in northwestern Indiana in 1992, as shown in Figure 5a. This image has 16 classes and 220 spectral reflectance bands ranging from 0.4 to 2.5 µm. In this experiment, 20 water absorption spectral bands (104-108, 150-163 and 200) are removed; therefore, the size of the hyperspectral image actually used is 145 × 145 × 200. Nine percent of the labeled samples are randomly selected as training samples and the remainder are used as test samples, as shown in Table 2.

Table 2. Reference classes for the Indian Pines.

No.  Class Name                 Train   Test
1    Alfalfa                        6     40
2    Corn-notill                  129   1299
3    Corn-mintill                  83    747
4    Corn                          24    213
5    Grass-pasture                 48    435
6    Grass-trees                   73    657
7    Grass-pasture-mowed            5     23
8    Hay-windrowed                 48    430
9    Oats                           4     16
10   Soybean-notill                97    875
11   Soybean-mintill              196   2259
12   Soybean-clean                 59    534
13   Wheat                         21    184
14   Woods                        114   1151
15   Bldgs-grass-trees-drives      39    347
16   Stone-steel-towers            12     81
     Total                        958   9291

The optimal parameters of our methods were: $\lambda = 4 \times 10^{-4}$, $K = 11$ for R-SRC; $\lambda = 1.5 \times 10^{-3}$, $K_0 = 30$, $T = 49$ for R-JSRC; and $\lambda = 0.003$, $K_0 = 50$, Ns = 300 for R-SJSRC. For JSRC, the optimal window size was 7 × 7 and the sparsity level was 30. In NLW-JSRC, the parameters were chosen following the recommendation of [21]. For the SVM and SRC classifiers, we tuned the parameters so as to produce the best classification results. The results are listed in Table 3 and Figure 5. In most cases, our method R-SJSRC yields better results than the other classifiers. Based on super-pixel segmentation, the SJSRC model achieved at least 2.7% improvement over the reference methods JSRC and NLW-JSRC. By accounting for sparse noise through the mixed-noise model, our proposed R-SJSRC further improves OA by 1.5% over SJSRC. Moreover, the proposed robust models R-SRC, R-JSRC and R-SJSRC outperform SRC, JSRC and SJSRC, respectively. In Table 3, it should be noted that, even though the number of training samples for classes 1, 7 and 9 is very limited, both SJSRC and R-SJSRC still achieve a much higher classification accuracy than the other methods, which is largely due to the exploitation of super-pixel segmentation. Classifying on a super-pixel level, both R-SJSRC and SJSRC are able to alleviate the effect of unbalanced training samples on the performance to a certain degree. We also test the computation time saving of R-SJSRC compared to R-JSRC.
The experiment was implemented in Matlab R2015a on a computer with an Intel Core i7-3930K CPU and 64 GB RAM. We recorded the time consumption of one iteration, including super-pixel segmentation and classification map generation for R-SJSRC, and classification map generation for R-JSRC. The results show that R-JSRC takes 321 s, while R-SJSRC takes only 5 s for one iteration, which indicates the benefit of R-SJSRC in terms of time saving. The high complexity of R-JSRC mainly comes from the computation of the sparse coefficients in the sliding window, which has to be repeated many times.
The second image that we use for evaluation is HYDICE Urban, captured by the HYDICE sensor [46]. The original image is of size 307 × 307 × 210 and there are five classes in total. In this experiment, we tested our method on a part of this image with size 200 × 200, as shown in Figure 6a. The number of bands was reduced to 188 by removing the bands 104-108, 139-151 and 207-210, which were seriously affected by the atmosphere and water absorption. We used this image as it contains different types of noise, including stripes, dead lines, impulse noise and Gaussian noise [46]. The numbers of samples used for training and testing are shown in Table 4.
The quantitative results and classification maps from different methods are shown in Table 4 and Figure 6. The optimal parameters were $\lambda = 2.5 \times 10^{-3}$, $K = 4$ for R-SRC; $\lambda = 0.01$, $K_0 = 10$, $T = 25$ for R-JSRC; and $\lambda = 0.01$, $K_0 = 12$, Ns = 1450 for R-SJSRC. For the other classification methods, we tuned the parameters in order to yield the best results. The results in Table 4 and Figure 6 show clearly that the proposed R-SJSRC model performs better than the other classification methods on the HYDICE Urban image in terms of quantitative measurements and visual evaluation. A superior performance can also be observed for the other robust models, i.e., R-SRC and R-JSRC, over SRC and JSRC.

Here, we examine the effect of the training set size on the classification performance, using the HYDICE Urban image as a case study. The number of training samples per class was set to 5, 10, 20, 40, 80 and 160, respectively, and the parameters for the different methods were fixed as specified earlier. The results shown in Figure 7 reveal that the OA of all the methods improves significantly as the training sample size increases, and that R-SJSRC consistently achieves the best performance among all tested methods. It can be observed that the highest improvement of R-SJSRC over SJSRC, as well as of R-JSRC over JSRC and of R-SRC over SRC, is obtained when the number of training samples is the smallest (five per class). This improvement, resulting from accounting for the sparse noise in our model, becomes less significant when the size of the training set increases. This demonstrates that our robust model is especially effective when the training samples are limited.

The Influence of the Segmentation Granularity
To investigate the influence of the segmentation granularity on the performance of R-SJSRC, we conduct experiments with a varying number of super-pixels and record the resulting overall classification accuracy. Figure 8 shows the results for the HYDICE Urban image and Indian Pines, where the number of super-pixels Ns ranges from 100 to 3200, and the parameters of R-SJSRC are as specified earlier. The results demonstrate that, for R-SJSRC, the OA on the HYDICE Urban image is less sensitive to the number of super-pixels than on the Indian Pines image. The OA on the HYDICE Urban image increases steadily to 97.93% while Ns is less than 1600. The OA on the Indian Pines image rapidly reaches its peak of 95.47% at Ns = 300, and then drops to 85% at Ns = 3200. The stronger sensitivity of Indian Pines to Ns may be due to the large diversity of the ground truth within the same class. When the number of super-pixels is large, homogeneous regions are split into many small pieces, which weakens the joint sparsity constraint on the pixels in the same super-pixel and deteriorates the performance of R-SJSRC.

Practical Specification of the Parameters
In our experiments, we make sure that the comparison between different methods is fair by presenting for all of them the best achievable performance, assuming that the parameters were set optimally. In practice, ground truth data are rarely available. We advise the user in this case to optimize the parameters (using e.g., a widely adopted grid search) for images that are similar (in resolution and variability) to the ones being tested and for which ground truth data are available. The parameter values that we give may also be used without extensive decrease of the performance on a wide range of images of two types: AVIRIS and high-resolution urban images. The diagrams where we report the influence of the different parameters should also serve as a useful guideline in this respect. Figure 4 shows that λ can be chosen in a relatively wide range around the optimal value, without strongly affecting the performance. The same holds for the segmentation parameter, especially for Urban types of images.
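The grid search mentioned above can be sketched generically as follows. Everything here is illustrative: `score_fn` stands for whatever validation accuracy the user computes on a reference image with ground truth, and `toy_score` is a made-up surrogate with a known optimum, used only to show the mechanics.

```python
from itertools import product

def grid_search(score_fn, grid):
    """Evaluate score_fn on every parameter combination and keep the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical stand-in for "overall accuracy on a labelled reference image".
def toy_score(lam, K0):
    return -abs(lam - 1e-3) * 100 - abs(K0 - 30) / 100

grid = {"lam": [1e-4, 1e-3, 1e-2], "K0": [10, 30, 50]}
best_params, best_score = grid_search(toy_score, grid)
```

Because the sparsity weight acts on a multiplicative scale, a logarithmically spaced grid for it, as in this example, is usually a more economical choice than a linear one.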

Conclusions
In this work, we have proposed a robust classification model for HSIs, which combines an appropriate statistical model for the sparse noise and the representation coefficients of test samples into a unified framework, explicitly accounting for both Gaussian noise and sparse noise. An alternating minimization strategy is utilized to solve the resulting optimization problems. The robust model can easily generalize the off-the-shelf classification model to a robust version. The superior performance of the proposed methods over the existing methods is confirmed by the experiments on both real and simulated data, which demonstrates the effectiveness of the proposed robust model.