Classification of Hyperspectral Images Using Kernel Fully Constrained Least Squares †

As a widely used classifier, sparse representation classification (SRC) has shown good performance for hyperspectral image classification. Recent works have highlighted that it is the collaborative representation mechanism underlying SRC that makes SRC a highly effective technique for classification purposes. If the dimensionality and the discrimination capacity of a test pixel are high, norms other than the sparsity-inducing ℓ1-norm (e.g., the ℓ2-norm) can be used to regularize the coding coefficients. In this paper, we show that in the kernel space the nonnegativity constraint can play the same role, and thus suggest investigating kernel fully constrained least squares (KFCLS) for hyperspectral image classification. Furthermore, in order to improve the classification performance of KFCLS by incorporating spatial-spectral information, we investigate two kinds of spatial-spectral methods using two regularization strategies: (1) the coefficient-level regularization strategy, and (2) the class-level regularization strategy. Experimental results on four real hyperspectral images demonstrate the effectiveness of the proposed KFCLS and indicate how to incorporate spatial-spectral information efficiently within the regularization framework.


Introduction
Sparse representation classification (SRC) has been widely used in many applications, such as pattern recognition [1,2], visual classification [3,4], and hyperspectral image classification [5][6][7][8][9][10][11]. Unlike common classifiers (e.g., support vector machines (SVMs) [12] and multinomial logistic regression [13]), SRC is not a learning-based classifier: it first represents a test sample as a sparse linear combination of all training samples, and then directly assigns a class label to the test sample by evaluating which class leads to the minimum reconstruction error. Although the use of the sparsity prior often leads to robust classification performance, recent works [14,15] have shown that it is the collaborative representation (CR) mechanism underlying SRC (i.e., representing a test sample collaboratively with training samples from all classes) that makes SRC a highly effective technique for classification purposes. Moreover, if the dimensionality and the discrimination capacity of a sample are high, other regularization terms such as the ℓ2-norm can play the same role as the sparsity ℓ1-norm. Several approaches have demonstrated the effectiveness of classification methods using the ℓ2-norm in many applications [14,16], including hyperspectral image classification [17,18]. For the sake of simplicity, the method using the ℓ2-norm is referred to as collaborative representation classification (CRC), and both SRC and CRC are referred to as CR-based classification methods.
Although CR-based classification methods can achieve good performance, it is difficult to use them to classify data that are not linearly separable. Moreover, for hyperspectral image classification, the discriminability of a pixel is generally low owing to the presence of redundant spectral bands, even though its dimensionality is high. As shown by the pixel-wise classification results reported in [17,18], SRC often produces superior performance compared to CRC. Some approaches have considered using the kernel method that is widely used in SVM classification [12,19] to mitigate these problems [20][21][22], since in the kernel feature space the dimensionality of a sample is very high and its discriminability is generally enhanced [23]. In hyperspectral image classification, kernel CR (KCR)-based classification has shown an improvement over CR-based classification [21,24,25], and kernel CRC (KCRC) exhibits competitive advantages in terms of classification accuracy and computational cost when compared with kernel SRC (KSRC) [23].
In the development of CR-based classification methods, most attention has been paid to the selection of norms. However, both SRC and CRC belong to the family of regularized least squares. That is to say, the improvement brought by the norms can also be achieved by other regularization or constraint terms. Among the numerous candidates, the nonnegativity constraint is an effective one that is very common in other techniques and applications, such as nonnegative matrix factorization [26] and spectral unmixing [27]. Moreover, the nonnegativity constraint can also induce sparsity in the coding coefficients [28,29]. Accordingly, we consider exploiting kernel nonnegative constrained least squares (KNLS) for hyperspectral image classification. Since in the kernel feature space the dimensionality and the discrimination capacity of a pixel are high, the nonnegativity constraint may play the role of CR. Considering that a nonnegative coding coefficient reflects the similarity between a test pixel and the related training pixel, we propose to provide posterior probabilistic outputs by enforcing the nonnegative coding coefficients of each pixel to sum to one, and thus investigate kernel fully constrained least squares (KFCLS) [30,31] for hyperspectral image classification.
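The sparsity-inducing effect of the nonnegativity constraint cited above can be observed with a small numerical sketch on hypothetical data; a plain projected-gradient solver stands in here for a dedicated NNLS routine:

```python
import numpy as np

def nnls_pg(A, x, n_iter=3000):
    # Projected gradient for min 0.5 * ||x - A s||^2  s.t.  s >= 0.
    step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / Lipschitz constant of the gradient
    s = np.zeros(A.shape[1])
    for _ in range(n_iter):
        s = np.maximum(s - step * (A.T @ (A @ s - x)), 0.0)
    return s

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 100))               # overcomplete "dictionary"
x = A[:, :3] @ np.array([0.5, 0.3, 0.2])     # mixture of three dictionary columns
s = nnls_pg(A, x)
# the recovered s is nonnegative and mostly zero, even without an l1 penalty
```

Because the projection clips coordinates at exactly zero, most entries of the recovered coefficient vector vanish, mirroring the sparsity behavior reported in [28,29].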
The investigated KFCLS is a pixel-wise classifier that treats hyperspectral data as an unordered list of feature vectors rather than as images. In order to improve the coarse classification maps produced by a pixel-wise classifier, previous methods have considered incorporating spatial-contextual information during the classification process [32]. According to the relationship between the pixel-wise classification process and the fusion of spatial-spectral information, these methods can be roughly divided into three categories: (1) The first category can be treated as pre-processing methods. These methods usually extract spatial features first, and then feed both the spatial and spectral features into a pixel-wise classifier. For instance, in [33] a composite kernel framework is proposed to combine the spatial and spectral features, which are subsequently embedded into an SVM for classification purposes.
In [34], the authors first extract multiple types of spatial features from both linear and nonlinear transformations, and then integrate them via multinomial logistic regression. In [35], a convolutional neural network is utilized to extract deep features from high levels of the image data, and the final classification is performed using SRC. (2) The second category can be treated as post-processing methods. These methods usually apply a pixel-wise classifier first, and then refine the pixel-wise results by incorporating spatial information. For instance, in [23,[36][37][38] the class conditional probability density functions are first estimated using a probabilistic pixel-wise classifier, and then refined using regularization models that incorporate the spatial information. In [39,40], the original hyperspectral image is first classified per pixel and simultaneously segmented into several adaptive neighborhoods, and then a decision fusion mechanism is applied to the pixel-wise classification results within these neighborhoods. In [41], KSRC is first used to obtain the coding coefficients of the original hyperspectral image, and then the coding coefficients are refined by incorporating the spatial information for the final classification.
(3) The last category can be treated as co-processing methods, which jointly integrate the pixel-wise classification process and the fusion of spatial-spectral information. For CR-based classification, the related methods usually assign a neighborhood/window to each pixel, and represent a test pixel jointly with its neighboring pixels [21,[42][43][44][45]. In addition, other methods incorporate the spatial information by appending a spatial-spectral term to the coding model of CR-based classification [11,25,46].
Notably, regularization is an important technique for CR-based classification, since all CR-based classification methods are built upon it. As a widely used technique in mathematical and image processing problems [47], regularization is very suitable for integrating different kinds of prior knowledge owing to its flexibility and availability. This paper considers incorporating the spatial information into KFCLS using the regularization technique. To this end, we consider both co-processing and post-processing methods, and propose a weighted H1-norm [48] for the description of spatial information. Furthermore, we investigate two regularization strategies to integrate the spatial and spectral information. One is the coefficient-level regularization strategy, which incorporates the spatial information by enforcing or refining the coding coefficients; the other is the class-level regularization strategy, which handles the posterior probabilistic outputs.
The remainder of this paper is organized as follows. Section 2 briefly introduces two instantiations of KCR-based classification (i.e., KSRC and KCRC). In Section 3, we first present the proposed KFCLS for hyperspectral image classification, and then introduce the co-processing and post-processing methods for KFCLS using two regularization strategies. The effectiveness of the proposed KFCLS and the suggested way of incorporating spatial-spectral information are demonstrated in Section 4 by conducting experiments on four real hyperspectral images. Finally, Section 5 concludes this paper.

KCR-Based Classification
In this section, we briefly review the general model of KCR-based classification and subsequently introduce its two instantiations. Given a hyperspectral image, every pixel in it can be interpreted as an L-dimensional column vector, with L being the number of spectral bands. Suppose the given hyperspectral image includes C classes, and there exists a feature mapping function φ which maps a test pixel x ∈ R^L and J training pixels A = [a_1, a_2, ..., a_J] ∈ R^{L×J} into a high-dimensional feature space, yielding φ(x) and Φ(A) = [φ(a_1), φ(a_2), ..., φ(a_J)]. For a mapped pixel φ(x), KCR-based classification supposes that it can be collaboratively represented as a linear combination of all mapped training pixels, i.e.,

φ(x) ≈ Φ(A)s,

where s ∈ R^J is an unknown coding coefficient vector of φ(x). To recover the coding coefficient vector s from φ(x) and Φ(A) stably, the regularization method is the best choice, and the corresponding optimization problem can be written as follows:

min_s (1/2)||φ(x) − Φ(A)s||_2^2 + λ||s||_q, (2)

where λ > 0 is a regularization parameter, and q = 0, 1, or 2.
Using different q leads to different instantiations of KCR-based classification. KSRC and KCRC are two instantiations, where q is set to 1 and 2, respectively. The corresponding optimization problems can be written as follows:

min_s (1/2)||φ(x) − Φ(A)s||_2^2 + λ||s||_1 (KSRC),

min_s (1/2)||φ(x) − Φ(A)s||_2^2 + λ||s||_2^2 (KCRC).

After solving the above optimization problems, the obtained s is used for the final classification. For KSRC, the class label y of x is determined via the minimal residual between φ(x) and its approximation from the mapped training pixels of each class, and the classification rule can be written as follows:

y = arg min_{c=1,...,C} ||φ(x) − Φ(A)δ_c(s)||_2, (4)

where δ_c(•) is the characteristic function that selects the coefficients related to the cth class and sets the rest to zero. KCRC considers the discriminative information brought by s, and modifies the classification rule as:

y = arg min_{c=1,...,C} ||φ(x) − Φ(A)δ_c(s)||_2 / ||δ_c(s)||_2.

Notably, all φ mappings used in kernel methods occur in the form of inner products. For every two pixels x_i and x_j, we can define a kernel function as:

K(x_i, x_j) = <φ(x_i), φ(x_j)>,

where <•,•> represents the inner product. In this paper, only the radial basis function (RBF) kernel (K(x_i, x_j) = exp(−γ||x_i − x_j||_2^2), γ > 0) is considered, owing to its simplicity and empirically observed good performance [12,19,21,49]. After defining the kernel function K, the optimization problems of KSRC and KCRC can be rewritten as:

min_s (1/2) s^T Q s − b^T s + λ||s||_1,

min_s (1/2) s^T Q s − b^T s + λ||s||_2^2,

where the constant terms are dropped, Q ∈ R^{J×J} is the kernel matrix with Q_{ij} = K(a_i, a_j), and b ∈ R^J is a vector with b_j = K(a_j, x). Similarly, the classification rules of KSRC and KCRC can be rewritten as:

y = arg min_{c=1,...,C} ( K(x, x) − 2 δ_c(s)^T b + δ_c(s)^T Q δ_c(s) )^{1/2} (divided by ||δ_c(s)||_2 for KCRC).

The optimization problem of KSRC is convex but not smooth. For this type of problem, several algorithms proposed in the sparse representation and compressive sensing community can be adopted [50][51][52]. In this paper, an alternating direction method of multipliers (ADMM) algorithm [53] is adopted owing to its flexibility and availability; the details can be found in [49]. As for the optimization problem of KCRC, it is convex and smooth, and an analytical solution can be derived (see [23]).
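As a concrete sketch of the kernelized quantities above (hypothetical data; pixels are stored as columns), the RBF kernel matrix Q, the vector b, and the KCRC coefficients can be computed as follows. With the regularizer written as λ||s||_2^2, setting the gradient of the smooth KCRC objective to zero gives the closed form s = (Q + 2λI)^{-1} b:

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    # K[i, j] = exp(-gamma * ||X[:, i] - Y[:, j]||^2)
    d2 = (np.sum(X * X, axis=0)[:, None] + np.sum(Y * Y, axis=0)[None, :]
          - 2.0 * X.T @ Y)
    return np.exp(-gamma * np.maximum(d2, 0.0))   # clip tiny negative round-off

def kcrc_coefficients(A, x, gamma, lam):
    # Solve min_s 0.5 s^T Q s - b^T s + lam * ||s||^2  =>  (Q + 2*lam*I) s = b
    Q = rbf_kernel(A, A, gamma)
    b = rbf_kernel(A, x.reshape(-1, 1), gamma).ravel()
    return np.linalg.solve(Q + 2.0 * lam * np.eye(A.shape[1]), b)
```

The per-pixel cost is one J × J linear solve, which is the computational advantage of KCRC noted in [23].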

Problem Formulation
The works in [14,15] point out that if the dimensionality and discriminability of a test sample are high, the estimated coefficient vector will be naturally sparse and concentrate on the training samples whose class labels are the same as that of the test sample, regardless of whether the ℓ1-norm or ℓ2-norm is used. Since in the kernel feature space the dimensionality of a test sample is very high and its discriminability is enhanced, KCRC can achieve the same performance as KSRC [23]. Considering these issues, one may wonder whether other constraint terms can play the same role as the ℓq-norm regularization terms. Notably, in the coefficient vector s, each entry can be treated as the similarity between the corresponding training pixel and the test pixel. If the test pixel is similar to some training pixels, large values will be assigned to the corresponding entries of s; otherwise, small (possibly negative) values will be assigned. It is natural to enforce the similarity to be nonnegative. Moreover, the nonnegativity constraint can promote the sparsity of the coefficient vector [28,29]. For this reason, we consider the KNLS problem, which is defined as follows:

min_s (1/2)||φ(x) − Φ(A)s||_2^2 s.t. s ⪰ 0_J, (10)

where 0_J ∈ R^J is a vector with all entries being 0, and the symbol ⪰ denotes component-wise inequality; i.e., s ⪰ 0_J means s_j ≥ 0 for j = 1, 2, ..., J. Since s is nonnegative and reflects similarity, it can be regarded as a probability distribution if we enforce its entries to sum to one [54]. Accordingly, we obtain KFCLS, and the corresponding problem can be written as follows:

min_s (1/2)||φ(x) − Φ(A)s||_2^2 s.t. s ⪰ 0_J, 1_J^T s = 1, (11)

where 1_J ∈ R^J is a vector with all entries being 1.
Figure 1 shows a comparison of the coefficient vectors obtained by KNLS, KFCLS, and KSRC. It can be observed that the coefficients of KNLS and KFCLS are almost as sparse as those of KSRC. Although the number of training pixels of each class may be unequal, the summation of the entries of δ_c(s) can reflect the similarity between the cth class and the test pixel [23]. Figure 2 shows the summation of the entries of each δ_c(s). It is apparent that the summation value of the true class label is predominant. Moreover, the outputs of a classifier should be calibrated posterior probabilities to facilitate subsequent processing, which is very useful in spatial-spectral classification [23,36,37]. With the aforementioned observation and motivation in mind, we define a posterior probability in this context as follows:

P(c|x) = (Ts)_c, c = 1, 2, ..., C,

where (•)_c denotes the cth entry of a vector, and the summation matrix T ∈ R^{C×J} is defined by T_{cj} = 1 if the jth training pixel belongs to the cth class, and T_{cj} = 0 otherwise. With this definition of the posterior probability, the classification rule of KFCLS can be written as follows:

y = arg max_{c=1,...,C} (Ts)_c. (14)

As for KNLS, we use the classification rule (4) in this paper. Notably, the classification rule (4) is also suitable for KFCLS.
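A small sketch of the summation matrix T and the resulting class posteriors, using hypothetical labels and coefficients:

```python
import numpy as np

def summation_matrix(train_labels, C):
    # T[c, j] = 1 if training pixel j belongs to class c, else 0
    J = len(train_labels)
    T = np.zeros((C, J))
    T[train_labels, np.arange(J)] = 1.0
    return T

train_labels = np.array([0, 0, 1, 1, 2])     # class of each training pixel
s = np.array([0.05, 0.15, 0.4, 0.2, 0.2])    # nonnegative KFCLS coefficients, sum to one
T = summation_matrix(train_labels, C=3)
p = T @ s                                    # class posteriors: [0.2, 0.6, 0.2]
y = int(np.argmax(p))                        # predicted label: 1
```

Because s is nonnegative and sums to one, Ts is automatically a probability vector over the C classes; no extra normalization is needed.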

Optimization Algorithm
In this paper, ADMM is adopted to solve the optimization problems (10) and (11). For KFCLS, the optimization problem (11) can be rewritten in the following equivalent form:

min_s (1/2) s^T Q s − b^T s + ι_S(s), (15)

where S = {s ∈ R^J : s ⪰ 0_J, 1_J^T s = 1} and ι_S is the indicator function of the set S (i.e., ι_S(x) = 0 if x ∈ S and ι_S(x) = ∞ if x ∉ S). By introducing a variable v ∈ R^J that carries the nonnegativity constraint, the optimization problem (15) can be rewritten as follows:

min_{s,v} (1/2) s^T Q s − b^T s + ι_+(v) s.t. s = v, 1_J^T s = 1, (16)

where ι_+ is the indicator function of the nonnegative orthant. The augmented Lagrangian function of (16) can be written as follows:

L(s, v, d) = (1/2) s^T Q s − b^T s + ι_+(v) + (µ/2)||s − v − d||_2^2,

where µ > 0 is the penalty parameter and d ∈ R^J is an auxiliary variable. The ADMM iteration procedure can be written as:

s^{t+1} = arg min_{s : 1_J^T s = 1} L(s, v^t, d^t),
v^{t+1} = arg min_v L(s^{t+1}, v, d^t),
d^{t+1} = d^t − (s^{t+1} − v^{t+1}), (18)

where t ≥ 0 is the iteration number. The first step of (18) is the s-subproblem, whose solution can be derived as:

s^{t+1} = P_1(F^{−1}(b + µ(v^t + d^t))), (19)

where F = Q + µI with I being the identity matrix, and the projection operator P_1 maps its argument onto the hyperplane {s : 1_J^T s = 1}. The second step of (18) is the v-subproblem, which is the well-known proximal operator [55]:

v^{t+1} = max(s^{t+1} − d^t, 0), (20)

where max(•, 0) sets the negative components to zero and keeps the nonnegative components unchanged. The last step of (18) updates the auxiliary variable. The algorithm of KFCLS is detailed as follows.
1. Input: a training dictionary A ∈ R^{L×J} and a hyperspectral pixel x ∈ R^L.
2. Select the parameter γ for the RBF kernel, and compute the matrix Q and the vector b.
3. Initialization: set t = 0 and initialize v^0 and d^0 (e.g., with zeros).
4. Repeat:
5. Update s^{t+1} via (19).
6. Update v^{t+1} via (20).
7. Update d^{t+1} via the last step of (18), and set t = t + 1.
Until some stopping criterion is satisfied.
8. Output: the estimated label of x using (14) or (4).
As for KNLS, its ADMM iteration procedure is almost the same as that of KFCLS, except that the projection operator P_1 in (19) is not needed.
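The ADMM loop above can be sketched as follows. This is a simplified implementation under stated assumptions: the sum-to-one step is handled by a plain Euclidean projection P1 onto the hyperplane, which may differ in detail from the exact equality-constrained s-subproblem solution, and the returned v is the nonnegative split variable:

```python
import numpy as np

def kfcls_admm(Q, b, mu=1e-4, n_iter=500):
    # ADMM sketch for min 0.5 s^T Q s - b^T s  s.t.  s >= 0, sum(s) = 1.
    J = len(b)
    F_inv = np.linalg.inv(Q + mu * np.eye(J))    # F = Q + mu*I, inverted once
    v = np.full(J, 1.0 / J)                      # split variable (nonnegative part)
    d = np.zeros(J)                              # scaled dual variable
    for _ in range(n_iter):
        z = F_inv @ (b + mu * (v + d))
        s = z + (1.0 - z.sum()) / J              # P1: Euclidean sum-to-one projection
        v = np.maximum(s - d, 0.0)               # proximal step: clip negatives
        d = d - (s - v)                          # dual update
    return v                                     # nonnegative; ~sums to one at convergence
```

With Q = I the problem reduces to Euclidean projection of b onto the probability simplex, which gives a quick sanity check: b = (1, 0, 0) should return approximately (1, 0, 0).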

Spatial-Spectral Classification
The suggested KFCLS is just a pixel-wise classifier that does not treat hyperspectral data as images but as an unordered list of pixels. In order to incorporate the spatial-spectral information, several methods have been proposed, as discussed in Section 1. Among these methods, the regularization strategy is an important one for CR-based classification, since CR-based classification itself is a group of regularization methods. In this section, we show how to incorporate spatial-spectral information into KFCLS using both the co-processing and post-processing methods, and consider two regularization strategies to combine the spatial-spectral information.

Problem Formulation
Suppose that a hyperspectral image is composed of a set of I pixels X = [x_1, x_2, ..., x_I] ∈ R^{L×I}, and let S = [s_1, s_2, ..., s_I] ∈ R^{J×I} collect the corresponding coefficient vectors. Then, the unconstrained optimization problem (15) for X can be written as:

min_S (1/2) Tr(S^T Q S) − Tr(B^T S) s.t. s_i ⪰ 0_J and 1_J^T s_i = 1 for i = 1, 2, ..., I,

where Tr(•) denotes the trace of a matrix and B = [b_1, b_2, ..., b_I] ∈ R^{J×I}. In this paper, the spatial relationship between two adjacent pixels x_i and x_j is modeled by the similarity defined as

W_ij = exp(−β ||x̄_i − x̄_j||_2^2) + ε, (22)

where β > 0 is a weight parameter, ε = 10^-6 is a small positive constant, and x̄_i and x̄_j are the pixels of the first three principal components of the hyperspectral image X. For each pixel x_i, its neighborhood N_i is built from its eight spatially adjacent neighbors, and W_ij is set to 0 if x_j does not belong to N_i.
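The weight construction in (22) can be sketched as follows; the exponential-plus-constant form is as reconstructed above, and the PCA projection to three components is assumed to have been done beforehand:

```python
import numpy as np

def spatial_weight(xbar_i, xbar_j, beta, eps=1e-6):
    # W_ij = exp(-beta * ||xbar_i - xbar_j||^2) + eps for j in the
    # 8-neighborhood N_i; W_ij = 0 for all other pixel pairs.
    return np.exp(-beta * np.sum((xbar_i - xbar_j) ** 2)) + eps

def eight_neighborhood(i, j, H, W):
    # Spatial positions of the eight neighbors of pixel (i, j),
    # clipped to the image boundary.
    return [(i + di, j + dj)
            for di in (-1, 0, 1) for dj in (-1, 0, 1)
            if (di, dj) != (0, 0) and 0 <= i + di < H and 0 <= j + dj < W]
```

Identical neighbors get a weight close to 1, while very dissimilar neighbors decay toward the floor ε, so every pair of adjacent pixels remains weakly connected.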

Co-Processing Methods
The spatial arrangement of the coefficient matrix S is associated with that of the hyperspectral image X. That is to say, the spatial relationship between every two pixels x_i and x_j also holds for the coefficient vectors s_i and s_j. It is therefore natural to integrate the spatial information of X by constraining S: if x_i is similar to x_j (i.e., W_ij is relatively large), s_i and s_j should be close to each other, and vice versa. In this paper, the weighted H1-norm, which is convex and smooth, is adopted to describe the aforementioned relationship, and the joint regularization model (JRM) can be written as follows:

min_S (1/2) Tr(S^T Q S) − Tr(B^T S) + (λ/2)||∇_w S||_F^2 s.t. s_i ⪰ 0_J, 1_J^T s_i = 1, i = 1, ..., I, (23)

where λ > 0, ||•||_F denotes the Frobenius norm, and ||∇_w S||_F^2 is the weighted H1-norm of S with ∇_w s_i = {√(W_ij) (s_i − s_j) | j ∈ N_i}. We may note that the spatial arrangement of the probability matrix P is also associated with that of the hyperspectral image X. That is to say, we can integrate the spatial information of X by constraining P (i.e., TS). In view of this, we propose the following class-level JRM (CJRM):

min_S (1/2) Tr(S^T Q S) − Tr(B^T S) + (λ/2)||∇_w (TS)||_F^2 s.t. s_i ⪰ 0_J, 1_J^T s_i = 1, i = 1, ..., I. (24)

Since the objective solutions of JRM and CJRM are both the coefficient matrix S, the columns of which sum to one, both classification rules (4) and (14) are suitable for the final classification.
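For a symmetric weight matrix, the weighted H1-norm penalty can equivalently be written with a graph Laplacian, which is what later enables closed-form solutions. A small numerical check of the identity Σ_i Σ_j W_ij ||s_i − s_j||² = 2 Tr(S G S^T) with G = diag(W1) − W (conventions assumed here; the factor 2 comes from counting each neighbor pair twice):

```python
import numpy as np

rng = np.random.default_rng(1)
I = 6
W = rng.uniform(size=(I, I))
W = (W + W.T) / 2.0                 # symmetric weights
np.fill_diagonal(W, 0.0)            # no self-loops
S = rng.normal(size=(4, I))         # coefficient vectors stored as columns

# Double-sum form of the weighted H1 penalty
h1 = sum(W[i, j] * np.sum((S[:, i] - S[:, j]) ** 2)
         for i in range(I) for j in range(I))

# Graph-Laplacian form: 2 * Tr(S G S^T), with G = D - W
G = np.diag(W.sum(axis=1)) - W
lap = 2.0 * np.trace(S @ G @ S.T)
# h1 and lap agree up to floating-point error
```

The Laplacian form turns the spatial penalty into a quadratic in S, so the smoothing subproblems reduce to linear systems.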

Post-Processing Methods
For the post-processing methods, the pixel-wise classification and the integration of spatial-spectral information are performed separately. Generally, the incorporation of spatial-spectral information can be done by refining the coefficient matrix S [41] or the probability matrix P [23,36,37,56]. To this end, we propose the corresponding post-processing regularization model (PRM) and class-level PRM (CPRM) for KFCLS, which are defined as follows:

min_V (1/2)||V − S||_F^2 + (λ/2)||∇_w V||_F^2, (25)

min_U (1/2)||U − P||_F^2 + (λ/2)||∇_w U||_F^2. (26)

Notably, we can verify with reference to (34) that the columns of the solutions V and U sum to one. The objective solution of PRM is the refined coefficient matrix V, and thus both classification rules (4) and (14) are suitable for the final classification; whereas the objective solution of CPRM is the refined probability matrix U, and thus only the classification rule (14) is suitable for classification purposes. Furthermore, we can prove with reference to (34) that PRM using the classification rule (14) is equivalent to CPRM. Accordingly, PRM using the classification rule (14) is dropped in the experiments.
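In the graph-Laplacian form of the weighted H1-norm, CPRM admits a closed-form solution, U = P (I + λG)^{-1} (a sketch consistent with the analytical solution referenced above; G denotes the Laplacian of the weights). Because G1 = 0, the smoothing operator preserves column sums, so the columns of U still sum to one whenever those of P do:

```python
import numpy as np

def cprm_refine(P, W, lam):
    # Closed-form refinement of the probability matrix P (C x I):
    #   min_U 0.5 * ||U - P||_F^2 + 0.5 * lam * Tr(U G U^T)
    # => U = P @ inv(I + lam * G), with G = D - W the graph Laplacian.
    G = np.diag(W.sum(axis=1)) - W
    return P @ np.linalg.inv(np.eye(W.shape[0]) + lam * G)
```

Since (I + λG)1 = 1, the inverse also fixes the all-ones vector, which is why the refined columns remain valid probability distributions.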

Optimization Algorithm
In this paper, the optimization problems (23) and (24) are solved by ADMM. For CJRM, the optimization problem (24) can be rewritten in the following formulation by introducing the variables V ∈ R^{J×I} and Z ∈ R^{C×I}:

min_{S,V,Z} (1/2) Tr(S^T Q S) − Tr(B^T S) + ι_+(V) + (λ/2)||∇_w Z||_F^2 s.t. S = V, TS = Z, 1_J^T S = 1_I^T, (27)

where ι_+ is the indicator function of the nonnegative orthant. The optimization problem (27) accords with the framework of ADMM, and the corresponding augmented Lagrangian function of (27) can be written as:

L(S, V, Z, D, R) = (1/2) Tr(S^T Q S) − Tr(B^T S) + ι_+(V) + (λ/2)||∇_w Z||_F^2 + (µ/2)||S − V − D||_F^2 + (µ/2)||TS − Z − R||_F^2,

where D ∈ R^{J×I} and R ∈ R^{C×I} are two auxiliary variables. Then, the optimization problem (27) can be solved by the following ADMM iterations:

S^{t+1} = arg min_{S : 1_J^T S = 1_I^T} L(S, V^t, Z^t, D^t, R^t),
V^{t+1} = arg min_V L(S^{t+1}, V, Z^t, D^t, R^t),
Z^{t+1} = arg min_Z L(S^{t+1}, V^{t+1}, Z, D^t, R^t),
D^{t+1} = D^t − (S^{t+1} − V^{t+1}),
R^{t+1} = R^t − (TS^{t+1} − Z^{t+1}). (29)

Similar to (18), the solutions of the first two subproblems of (29) can be derived as:

S^{t+1} = P_1(F^{−1}(B + µ(V^t + D^t) + µT^T(Z^t + R^t))),
V^{t+1} = max(S^{t+1} − D^t, 0),

where F = Q + µI + µT^T T, and the projection operator P_1 maps each column onto the hyperplane {s : 1_J^T s = 1}. The third subproblem of (29) can be written as follows:

min_Z (λ/2)||∇_w Z||_F^2 + (µ/2)||TS^{t+1} − Z − R^t||_F^2. (32)

The optimality condition of (32) is a linear system, which can be solved by the Gauss-Seidel method according to [48,51]. In addition, the optimization problem (32) can also be rewritten in the following formulation:

min_Z (λ/2) Tr(Z G Z^T) + (µ/2)||Z − (TS^{t+1} − R^t)||_F^2, (33)

where G ∈ R^{I×I} can be treated as the graph Laplacian associated with the weights W_ij. The analytical solution of (33) can be derived as:

Z^{t+1} = µ(TS^{t+1} − R^t)(µI + λG)^{−1}. (34)

The algorithm of CJRM is detailed as follows.
1. Input: a training dictionary A ∈ R^{L×J} and a hyperspectral data matrix X ∈ R^{L×I}.
2. Choose β and compute the weights W_ij according to (22).
3. Select the parameter γ for the RBF kernel and compute the matrices Q and B.
As for JRM, the aforementioned procedure also applies to the optimization problem (23): by changing the summation matrix T to the identity matrix I, we obtain the ADMM iterations of JRM. For the two post-processing methods PRM and CPRM, their formulations have the same form as (32). Thus, they can also be solved by the Gauss-Seidel method according to [48,51], and their analytical solutions can be obtained with reference to (34).

Discussion
The ℓ2-norm characterization of the coding residual (or fidelity term), i.e., the first term of (2), is related to the robustness of KCR-based classification to noise, as stated in [14,15]. The proposed pixel-wise classifiers KNLS and KFCLS can be treated as two instantiations of KCR-based classification, where the collaborative representation mechanism is implemented by the nonnegativity constraint. Therefore, their performance is almost the same as that of KSRC and KCRC, which is experimentally demonstrated in Section 4 using four different hyperspectral scenes. Moreover, in KCR-based classification, the accuracy of one class is usually not vulnerable to the number of training samples taken from another class, since all training samples contribute collaboratively (or competitively) to represent a test sample. In Section 4, the experimental results confirm this phenomenon, where the numbers of training samples of some classes are far smaller than those of the others (see Tables 1 and 2). That is to say, KNLS and KFCLS are not overly sensitive to class imbalance.
Because of the limitations of remote sensing sensors, a hyperspectral image may contain outliers such as noise and missing or corrupted pixels. The proposed spatial-spectral methods can cope with these pixels. Taking CPRM as an example, the optimization problem (26) can be rewritten as follows:

min_{u_i} (1/2)||u_i − p_i||_2^2 + (λ/2) Σ_{j∈N_i} W_ij ||u_i − u_j||_2^2, i = 1, 2, ..., I, (35)

where u_i is the ith column vector of U and p_i is the ith column vector of P. The solutions of (35) can be derived as:

u_i = (p_i + λ Σ_{j∈N_i} W_ij u_j) / (1 + λ Σ_{j∈N_i} W_ij). (36)

Since N_i is a 3 × 3 neighborhood, (36) can be treated as a 3 × 3 adaptive mean filter, and thus the outliers can be smoothed using their neighbors. Notably, deep learning has attracted a lot of attention recently. In hyperspectral image classification, various deep models (e.g., stacked autoencoders [57] and convolutional neural networks [35,[58][59][60]) have been proposed with the observation of good performance in terms of accuracy and flexibility. This paper proposes two new instantiations of KCR-based classification and investigates how to efficiently incorporate spatial-spectral information in the regularization framework. Compared with deep learning-based methods, the proposed methods have limitations in generalization performance owing to the drawbacks of traditional methods, but have advantages in the requirement of training samples and in computational cost, as mentioned in the existing deep learning approaches [35,[57][58][59][60]. In this paper, we do not expect the proposed methods to exceed the deep learning-based methods; comparing these two very different kinds of methods would be unfair and is beyond the scope of this paper.
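The adaptive-mean-filter reading of (36) can be sketched with one Jacobi-style sweep over all pixels (hypothetical data; W is the sparse neighborhood weight matrix with pixels flattened into columns):

```python
import numpy as np

def adaptive_mean_sweep(P, W, lam, U):
    # One sweep of u_i = (p_i + lam * sum_j W_ij u_j) / (1 + lam * sum_j W_ij):
    # each column of U is pulled toward a weighted mean of its neighbors.
    # P, U have shape (C, I); W has shape (I, I).
    deg = W.sum(axis=1)                          # sum_j W_ij for each pixel i
    return (P + lam * (U @ W.T)) / (1.0 + lam * deg)[None, :]
```

With λ = 0 the sweep returns P unchanged; repeated sweeps pull an isolated outlier column toward its neighborhood consensus, which is the smoothing effect described above.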

Data Collection and Experimental Setup
In the experiments, four hyperspectral remote sensing datasets have been considered to evaluate the performance of the proposed methods.
(1) The first one is the Indian Pines dataset, acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) over the Indian Pines test site in northwest Indiana in 1992. This dataset contains 220 spectral bands within the wavelength range of 0.4-2.5 µm, and consists of 145 × 145 pixels. Its spectral and spatial resolutions are 10 nm and 17 m, respectively. There are sixteen ground reference classes of interest, ranging from 20 to 2468 pixels in size. Figure 3 shows the false color composite image and the ground reference map. After removing 20 water absorption bands, 200 spectral bands remain in the experiments. We randomly chose about 5% of the labeled pixels for training, and used the rest for testing, as shown in Table 1.
(2) The second one is the Kennedy Space Center dataset, acquired by the AVIRIS sensor over the Kennedy Space Center, Florida, in 1996. This dataset contains 224 spectral bands covering the wavelength range of 0.4-2.5 µm. Its spectral and spatial resolutions are 10 nm and 18 m, respectively. This image, with a size of 512 × 614 pixels, contains 176 spectral bands after removing water absorption and low signal-to-noise bands. There are thirteen ground reference classes of interest, ranging from 105 to 927 pixels in size. Figure 4 shows the false color composite image and the ground reference map. In the experiments, we randomly chose 5% of the labeled pixels for training, and used the rest for testing, as shown in Table 2.
Table 6 summarizes the class-specific and global classification results for the two AVIRIS datasets, where the processing times in seconds are also included for reference. For the pixel-wise classification, the proposed KNLS and KFCLS achieve competitive results when compared with KSRC and KCRC, and both the classification rules "dist" and "prob" are suitable for KFCLS. For the spatial-spectral classification, it can be observed that all the spatial-spectral methods perform better than the pixel-wise methods. Among the six spatial-spectral methods, CPRM yields the highest global accuracies and most of the best class-specific accuracies, followed by CJRM-prob. The improvement of the two JRM methods over the pixel-wise methods is not significant when compared with the other four spatial-spectral methods. This is because JRM combines the spatial-spectral information by enforcing the coding coefficients directly, which is too strict and does not consider the variation of training pixels within each class. For CJRM, the classification rule "prob" is better than the classification rule "dist". Furthermore, considering that PRM using "prob" is equivalent to CPRM, we can conclude that the classification rule "prob" is more suitable for spatial-spectral classification than the classification rule "dist". As for the computational cost, KCRC is a cheap pixel-wise classifier, since its objective function has an analytical solution. CJRM is the cheapest among the four kinds of spatial-spectral methods, whereas JRM is the most expensive.
For the two ROSIS datasets, the classification results and processing times are presented in Table 7. From this table, we can draw almost the same conclusions as from Table 6. It is apparent that KNLS and KFCLS are two competitive methods compared with KSRC and KCRC. When using the same regularization strategy and classification rule, the post-processing methods outperform the co-processing methods. For both the co-processing and post-processing methods, it is better to use the class-level regularization strategy.
Figures 7-10 show the classification maps corresponding to one of the ten random tests in each case for the AVIRIS Indian Pines dataset, the AVIRIS Kennedy Space Center dataset, the ROSIS University of Pavia dataset, and the ROSIS Center of Pavia dataset, respectively. The numerical comparisons are confirmed by inspecting these classification maps. It is evident that the maps of the spatial-spectral methods are smoother than those of the pixel-wise methods, and that the maps of CPRM are closest to the ground truth maps.

Analysis of Parameters
In the first set of experiments, we investigated the impact of the input parameters on KFCLS. Apart from the ADMM parameter µ, which is empirically set to 10^-4, KFCLS has only one parameter γ, which is used for the RBF kernel. Figure 11 shows the impact of γ on the four given datasets, where γ is varied from 2^-9 to 2^7. It can be observed that for all four given datasets there is a wide optimal range for the choice of γ. When γ is small, the classification rule "prob" is more robust than the classification rule "dist" in most cases, and the difference between them is unapparent when γ is large.
In the next set of experiments, we investigated the impact of the input parameters on CJRM, PRM, and CPRM. Notably, JRM is dropped owing to its low accuracy and heavy computational cost. Apart from the parameters µ and γ, which are empirically set to the same values as those used in KFCLS, two parameters need to be tuned. One is the balance parameter λ used in (24)-(26), which is varied in the range [10^-4, 10^2] for CJRM and [10^-3, 10^9] for PRM and CPRM; the other is the weight parameter β used in (22), which is varied in the range [5, 500] for CJRM and [100, 1000] for PRM and CPRM. Figure 12 shows the classification accuracies for CJRM when applied to the four given datasets. It can be observed that the tuning of λ should synchronize with that of β, that the optimal parameters of CJRM-dist are almost the same as those of CJRM-prob, and that it is not difficult to get a good result in all cases, since there is a wide range from which to choose a suboptimal combination of parameters. Figure 13 shows the classification accuracies for PRM and CPRM when applied to the four given datasets. It is evident that the optimal value of λ is 10^6 in all cases. Notably, a small positive constant ε = 10^-6 is used in (22), and the majority of W_ij will be very small if β is relatively large. In order to connect all the spatially adjacent pixels, it is preferable to fix λ = 1/ε. As for the parameter β, we can choose it within a wide range.

Impact of the Number of Training Pixels
In this set of experiments, we evaluated the eleven classification methods compared in Section 4.2 in an ill-posed scenario, where different numbers of training pixels are considered. The parameters of these methods are fixed to the same values as those used in Section 4.2. For the two AVIRIS datasets, different percentages of the labeled pixels per class, varied in the range [1%, 20%], were randomly chosen for training, where a minimum of two training pixels per class was taken for very small classes. For the two ROSIS datasets, different numbers of training pixels per class were randomly chosen to build the training sets. Specifically, for University of Pavia the number was varied over 10, 20, 40, 60, 80, and 100, and for Center of Pavia over 5, 10, 20, 40, 60, and 80. Table 8 presents the classification results of the compared methods for the two AVIRIS datasets. It can be observed that for all the compared methods, the OAs increase monotonically as the number of training pixels increases. For the pixel-wise classification, there are no significant gaps between the five methods, and the two investigated classification rules are both suitable for the proposed KFCLS. For the spatial-spectral classification, CPRM performs the best, and JRM performs the worst.
Table 9 presents the classification results for the two ROSIS datasets. From this table, we can draw almost the same conclusions as from Table 8. It is evident that the OAs increase monotonically as the number of training pixels increases. For the pixel-wise classification, the proposed KNLS and KFCLS obtain results competitive with KSRC and KCRC. For the spatial-spectral classification, CPRM consistently yields better results than the other five methods, and the improvement of JRM over KFCLS is not significant. Among the three methods CJRM-dist, CJRM-prob, and PRM, CJRM-prob outperformed the others in most cases when applied to the University of Pavia dataset, and all three obtained almost the same results when applied to the Center of Pavia dataset.

Comparison to Other Classification Techniques
In this set of experiments, CPRM is compared with three other techniques that can incorporate the spatial-spectral information into KFCLS-prob. For these methods, the free parameters introduced by KFCLS are set to the same values as those used in KFCLS. The first method is the composite kernel technique [33] using the original spectral features and the extended multiattribute profile (EMAP) features [61,62], termed CKEMAP, where the EMAP features are extracted from the first three principal components of the hyperspectral image and built with the area and standard deviation attributes as reported in [63]. The additional parameters of CKEMAP are chosen using cross-validation. The second method is the pixel-wise KFCLS-prob followed by Markov random fields (MRF) [36,37], where the MRF technique is utilized to incorporate the spatial-spectral information by refining the posterior probabilistic outputs. The free parameters of MRF are chosen using cross-validation. The last method is the pixel-wise KFCLS-prob followed by majority voting within superpixel regions [39], termed MV, where the superpixel segmentation algorithm and its free parameters are chosen with reference to [64,65]. Moreover, two baseline classifiers, KFCLS-prob and SVM, are also included for reference. For SVM, the RBF kernel is used and the free parameters are chosen using cross-validation.
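The majority-voting step of MV can be sketched as follows, assuming the superpixel map is already provided by some segmentation algorithm; the function name and toy arrays are hypothetical:

```python
import numpy as np

def superpixel_majority_vote(pixel_labels, segments):
    """Sketch of the MV post-processing step: replace each pixel's
    predicted class by the majority class of the superpixel that
    the pixel falls in. `segments` is an integer superpixel map."""
    out = pixel_labels.copy()
    for s in np.unique(segments):
        mask = segments == s
        # Majority class within this superpixel.
        vals, counts = np.unique(pixel_labels[mask], return_counts=True)
        out[mask] = vals[np.argmax(counts)]
    return out

# Toy example: a stray prediction in each superpixel is voted away.
labels = np.array([0, 0, 1, 1, 1, 0])
segs   = np.array([0, 0, 0, 1, 1, 1])
print(superpixel_majority_vote(labels, segs))  # -> [0 0 0 1 1 1]
```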
Figure 14 shows the classification accuracies of the six compared methods when different numbers of training pixels are used. It can be observed that CKEMAP performs the best for the two AVIRIS datasets and CPRM performs the best for the two ROSIS datasets. Among the three post-processing methods (i.e., MRF, MV, and the proposed CPRM), CPRM outperforms the other two in most cases.

Conclusions
This paper considers using the nonnegative constraint to achieve the collaborative representation mechanism under SRC and CRC in the kernel space, and thereby proposes KNLS for hyperspectral image classification by replacing the ℓ1-norm or ℓ2-norm with the nonnegative constraint. In order to provide posterior probabilistic outputs, we propose KFCLS by enforcing the nonnegative coding coefficients of each pixel to sum to one, and subsequently introduce two classification rules to determine the class labels. Compared with KSRC and KCRC, KFCLS obtains competitive results, and its coding coefficients are more meaningful and useful for the subsequent processing steps. Furthermore, in order to incorporate the spatial-spectral information into KFCLS using the regularization technique, we investigated co-processing and post-processing methods by applying coefficient-level and class-level regularization strategies. Experimental results conducted on four real hyperspectral images have demonstrated that: (1) the proposed KFCLS obtains competitive results compared with the other pixel-wise classifiers; (2) the proposed classification rule "prob" is effective; (3) the class-level regularization strategy is better than the coefficient-level regularization strategy; and (4) CPRM is an effective post-processing method and the most efficient among the four kinds of investigated methods. In future work, we expect that the suggested regularization method can facilitate the development of spectral unmixing.
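For reference, the fully constrained coding problem underlying KFCLS can be written in the following generic form. The symbols below are standard notation from the fully constrained least squares literature, assumed here rather than reproduced from this paper's own equations:

```latex
\min_{\boldsymbol{\alpha}} \;
  \boldsymbol{\alpha}^{\top}\mathbf{K}\,\boldsymbol{\alpha}
  - 2\,\mathbf{k}^{\top}\boldsymbol{\alpha}
\quad \text{s.t.} \quad
  \boldsymbol{\alpha} \ge \mathbf{0}, \qquad
  \mathbf{1}^{\top}\boldsymbol{\alpha} = 1,
```

where $\mathbf{K}$ is the kernel Gram matrix of the training pixels, $\mathbf{k}$ is the vector of kernel evaluations between the test pixel and the training pixels, and $\boldsymbol{\alpha}$ holds the coding coefficients. Dropping the sum-to-one constraint recovers the KNLS problem, and the sum-to-one constraint is what lets the coefficients be read as posterior probabilistic outputs.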

Figure 11 .
Figure 11. OA as a function of γ for KFCLS when applied to the four given datasets.

Figure 14 .
Figure 14. OA as a function of the number of training pixels when applied to the four given datasets. (a) AVIRIS Indian Pines dataset, (b) AVIRIS Kennedy Space Center dataset, (c) ROSIS University of Pavia dataset, (d) ROSIS Center of Pavia dataset.

Table 1 .
The ground reference classes in the AVIRIS Indian Pines dataset and the number of training and test pixels used in experiments.

Table 2 .
The ground reference classes in the AVIRIS Kennedy Space Center dataset and the number of training and test pixels used in experiments.

Table 3 .
The ground reference classes in the ROSIS University of Pavia dataset and the number of training and test pixels used in experiments.

Table 4 .
The ground reference classes in the ROSIS Center of Pavia dataset and the number of training and test pixels used in experiments.

Table 5 .
The optimal combination of parameters for the investigated methods when applied to the four given datasets. JRM: joint regularization model; CJRM: class-level JRM; KCRC: kernel collaborative representation classification; KFCLS: kernel fully constrained least squares; KNLS: kernel nonnegative constrained least squares; KSRC: kernel sparse representation classification; PRM: post-processing regularization model; CPRM: class-level PRM.

Table 6 .
Classification accuracies for the two AVIRIS datasets using different classification methods. For both the pixel-wise classification and the spatial-spectral classification, the best results are highlighted in bold, and the second-best results are underlined. AA: average accuracy; KA: kappa coefficient of agreement; OA: overall accuracy.

Table 7 .
Classification accuracies for the two ROSIS datasets using different classification methods. For both the pixel-wise classification and the spatial-spectral classification, the best results are highlighted in bold, and the second-best results are underlined.

Table 8 .
OA as a function of the number of training pixels per class for different classification methods when applied to the two AVIRIS datasets. The standard deviation (in parentheses) of the ten random tests is also reported in each case. For both the pixel-wise classification and the spatial-spectral classification, the best results are highlighted in bold, and the second-best results are underlined.

Table 9 .
OA as a function of the number of training pixels per class for different classification methods when applied to the two ROSIS datasets. The standard deviation (in parentheses) of the ten random tests is also reported in each case. For both the pixel-wise classification and the spatial-spectral classification, the best results are highlighted in bold, and the second-best results are underlined.