Sparse Representation Graph for Hyperspectral Image Classification Assisted by Class Adjusted Spatial Distance

Abstract: In the past few years, sparse representation (SR) graph-based semi-supervised learning (SSL) has drawn a lot of attention for its impressive performance in hyperspectral image classification with small numbers of training samples. Among these methods, the probabilistic class structure regularized sparse representation (PCSSR) approach, which introduces the probabilistic relationship between samples into the SR process, has shown its superiority over state-of-the-art approaches. However, this category of classification methods generates the probabilistic relationship only through another SR process, which focuses on the spectral information but fails to utilize the spatial information. In this paper, we propose using the class adjusted spatial distance (CASD) to measure the distance between each two samples. We incorporate the proposed CASD-based distance information into the PCSSR model to further increase the discriminability of the original PCSSR approach. The proposed method considers not only the spectral information but also the spatial information of the hyperspectral data, consequently leading to significant performance improvement. Experimental results on different datasets demonstrate that, compared with state-of-the-art classification models, the proposed method achieves the highest overall accuracies of 99.71%, 97.13%, and 97.07% on the Botswana (BOT), Kennedy Space Center (KSC), and truncated Indian Pines (PINE) datasets, respectively, with a small number of training samples selected from each class.


Introduction
A hyperspectral image (HSI) records a wide range of electromagnetic wave data reflected by the earth's surface. HSI has been widely used in agricultural mapping [1] and mineral identification [2], and because of its high-resolution spectral record of land covers, HSI data is suitable for classifying different objects on land [3][4][5]. However, only a small portion of the acquired HSI data is labeled. In this situation, semi-supervised learning (SSL) provides a promising way to exploit both the limited labeled data and the abundant unlabeled data [6,7].
In recent years, many groups have applied SSL methods to HSI classification. Typical SSL methods include the self-training method [8], the collaborative training method [9], the generative model method [10], and the graph-based method [11]. The self-training method [8] adds pseudo-labels to high-confidence unlabeled samples in each iteration until all the unlabeled samples are labeled. Collaborative training [9] has been proposed to make HSI classification performance more reasonable.

1. We propose the concept of the CASD. The calculation of the CASD is based mainly on the planar Euclidean distance and the shortest path algorithm. The CASD takes the class similarity between samples into consideration, which makes the distance measurement more accurate and reasonable.
2. We apply the CASD to estimate the distance information needed in the PCSSR algorithm. The results show that this approach can enhance the performance of the PCSSR algorithm when enough training samples are provided. We achieve the highest improvements in classification accuracy of 8.65% and 3.85% on the KSC and BOT datasets, respectively, when the number of labeled samples selected from each class reaches 20, and of 15.97% on the truncated IND PINE dataset when the number of labeled samples selected from each class reaches 15.

Related Works
This section provides a brief discussion of existing graph construction methods for HSI classification. In graph-based SSL, label propagation (LP) is a crucial step that transfers labels from a limited number of labeled samples to abundant unlabeled samples [6], given a graph that encodes the connections among all samples. The basic idea of the LP algorithm is to assume that similar samples should have similar labels. Mathematically, this is achieved by defining an energy function (see Equation (8)) on the given graph that measures the "smoothness" of the classification results: if the results meet the assumption of LP (i.e., similar samples have similar labels), the value of the energy function is small, and vice versa.
To implement the above-mentioned procedure, we first need a well-constructed graph with an accurate adjacency matrix. The adjacency matrix of the graph reflects the relationship between samples, and a well-constructed graph should represent the similarity between samples faithfully. Therefore, we need a suitable method to generate an accurate similarity matrix, i.e., the adjacency matrix of the graph. Unlike traditional graph construction methods, SR-based methods can learn the local relationship from samples and compute well-discriminated edge weights for the graph, and are therefore robust to noise and parameter variations. We discuss below some representative methods in these two categories.

Traditional Graph Construction Methods
Graph construction is a crucial part of graph-based SSL and mainly involves two steps: building the graph adjacency structure and calculating the graph edge weights. For building the graph adjacency structure, k-nearest neighbors (KNN) and ε-ball neighborhood are the two most popular approaches [20]. As for graph weighting methods, Zhou et al. [21] use the Gaussian kernel (GK) function to calculate the edge weights; however, if only a few labeled samples are provided, it is hard to determine the hyper-parameters of the function [22]. Wang et al. [22] propose a non-negative local linear reconstruction (LLR) that uses the neighborhood information of each data point to construct the graph in a more reliable and stable way. First, they approximate the entire graph as a series of overlapping linear neighborhood patches; then they find the edge weights of each linear neighborhood patch; finally, they aggregate all the edge weights to form the edge weight matrix of the entire graph. Ma et al. [23] consider sparsity essential for improving the efficiency of SSL algorithms. Therefore, they propose a local linear embedding (LLE)-based weight, which captures the local geometric properties of hyperspectral data and weights the graph edges at a low computational cost. Zhuang et al. [24] propose the nonnegative low-rank and sparse (NNLRS) approach, which uses both the low-rankness of high-dimensional data samples and sparsity to construct a good graph. The obtained graph can capture both the local low-dimensional linear structures and the global cluster or subspace structures of the data samples.
However, these traditional methods share the same disadvantage: they all rely on fixed, manually tuned parameters. As a result, this category of graph construction methods is very sensitive to data noise and parameter variations.
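To make the dependence on hand-set parameters concrete, the following minimal Python sketch builds a KNN graph weighted by a Gaussian kernel. It is an illustration under stated assumptions, not the implementation of [21]: the function name `knn_gaussian_graph`, the symmetrization rule (an edge is kept if either endpoint selects the other), and the toy data are hypothetical, while `k` and `sigma` are the kind of manually tuned parameters discussed above.

```python
import math


def knn_gaussian_graph(points, k, sigma):
    """Build a symmetric KNN graph weighted by a Gaussian kernel.

    points: list of equal-length feature vectors (e.g. pixel spectra).
    k:      number of nearest neighbours kept per sample (hand-tuned).
    sigma:  Gaussian kernel bandwidth (hand-tuned).
    """
    n = len(points)
    # Squared Euclidean distance between every pair of samples.
    d2 = [[sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
           for j in range(n)] for i in range(n)]
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Indices of the k closest samples, excluding the sample itself.
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: d2[i][j])[:k]
        for j in neighbours:
            w = math.exp(-d2[i][j] / (2.0 * sigma ** 2))
            # Assumption: keep the edge if either endpoint selects the other.
            W[i][j] = W[j][i] = w
    return W


# Hypothetical one-dimensional toy data: two close samples and one far away.
toy_points = [[0.0], [1.0], [5.0]]
W_toy = knn_gaussian_graph(toy_points, k=1, sigma=1.0)
```

With `sigma = 1.0` the edge between the two close samples is strong, the edge to the distant sample is almost zero, and changing `sigma` or `k` changes the graph substantially, which is the sensitivity noted above.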

SR-Based Graph Construction Methods
Unlike the traditional graph generation approaches, SR-based methods can learn the local relationship from samples and compute well-discriminated edge weights for the graph. By encoding a given sample as a sparse linear combination of all the other samples, the sparse coefficients of the linear combination can be viewed as the edge weights from that sample to all the other samples [13,14]. In this way, the graph required by the LP algorithm can be generated.
Beyond the basic SR-based methods, Shao et al. [17] proposed the probabilistic class structure regularized sparse representation (PCSSR) approach. In their work, the authors incorporate into the SR model a probabilistic class structure that reflects the probabilistic relationship between each sample and each class. With the probabilistic class structure provided, the distance between each two samples can be derived from the difference between their probabilistic class labels. Finally, a class structure regularization is developed using the distance between each two samples. The authors claim that, with the class structure regularizer, PCSSR can learn a more discriminative graph from the data, and, as shown in their experimental results, the PCSSR method outperforms the state of the art on Hyperion and airborne visible infrared imaging spectrometer (AVIRIS) hyperspectral data. In the class structure regularizer and the full PCSSR model (Equations (1) and (2), respectively), $W$ is the adjacency matrix we need to obtain for the LP algorithm, $M$ is the distance matrix, each entry $M_{ij}$ of which represents the distance between two samples based on the difference between their probabilistic class labels, and $X$ denotes all samples in the training set and the testing set.
However, the probabilistic class structure used in the PCSSR paper is obtained only through another SR process, which fails to take into account the abundant spatial information in the HSI dataset. Despite its highly discriminative capability and high classification accuracy, PCSSR is therefore limited by neglecting the spatial information of the HSI. Since sample pixels are spatially continuous, ignoring spatial information discards a characteristic that is beneficial for classification, and classification results based only on spectral information tend to lack spatial continuity and smoothness. To address this limitation, our work incorporates spatial distance information into PCSSR to improve its discriminative capability, as introduced and tested in the following sections.

Modeling and Algorithms
This section details the proposed HSI classification approach, which introduces the CASD into an SR graph-based method in order to take advantage of spatial information and improve the classification accuracy. The fundamental idea is to measure the distance between any two samples with our proposed CASD instead of the distance matrix M acquired through an SR process in the original PCSSR method. The CASD-assisted PCSSR can thus achieve a more accurate and reasonable measurement of sample distances. We further employ the LP algorithm to predict the probability of each unlabeled pixel belonging to a certain class. Figure 1 illustrates the general flow of the proposed CASD-assisted HSI classification method. In what follows, we describe in detail the main steps in this classification flow.

Class Adjusted Spatial Distance
To incorporate spatial information into PCSSR, we propose using the CASD to replace the distance matrix $M$ obtained through an SR process in the original PCSSR. We first provide a brief introduction to the planar Euclidean distance (PED). Consider two points $(a_1, b_1)$ and $(a_2, b_2)$ in a plane. The PED between these two points is defined as:
$$\mathrm{PED} = \sqrt{(a_1 - a_2)^2 + (b_1 - b_2)^2}$$
As we have discussed, to improve the performance of the PCSSR algorithm, a proper distance measurement between each two samples is needed. The distance matrix should reflect the similarity or difference among samples. Since each sample is just an area on the ground, the simplest way to measure the distance between two samples is to calculate their spatial distance, i.e., the PED between them. Land covers are usually distributed continuously, so if a sample belongs to some class $c_i \in \{c_1, c_2, c_3, \ldots\}$, the samples in its spatial neighborhood are likely to belong to the same class. Thus, we can use the PED between two samples to represent their similarity.
However, the PED has its limitation for measuring the distance information that PCSSR needs. Two samples that are distant from each other may belong to the same class, which is not unusual in land cover classification. In this case, PCSSR using the PED would fail to classify such samples. To overcome this limitation, we introduce the class adjusted spatial distance (CASD) to replace the naïve planar Euclidean distance. Generally speaking, the CASD is a distance measurement that considers not only the Euclidean distance between two samples but also their class difference. We mainly use the Euclidean distance and the shortest path algorithm to compute the CASD.
We first generate a complete undirected graph $G(V, E)$, where $V$ represents all the $n$ samples and $E$ is weighted with the Euclidean distances between every two samples. The distance from a sample point to itself is defined as 0. Then, we check all the labeled samples (vertices) in the complete graph $G$. If two labeled samples belong to the same class, we change the edge weight between them to 0. In this way, we make samples of the same class "closer" to each other. In the last step, we apply a shortest path algorithm (for example, Dijkstra's algorithm [25]) between every two vertices in the graph $G$ and replace the edge weight between them with the length of the computed shortest path. We define this new edge weight as "the class adjusted spatial distance". The above process is illustrated in Figure 2, and Algorithm 1 is described below.
Algorithm 1.
1. Weight the edges in the graph by the PED between the pixel locations of their endpoints: for vertices $v_1$ at $(a_1, b_1)$ and $v_2$ at $(a_2, b_2)$, $M(v_1, v_2) = \sqrt{(a_1 - a_2)^2 + (b_1 - b_2)^2}$.
2. Update the edge weight $M(l_1, l_2)$ between every two labeled samples $l_1$, $l_2$ according to the following rule: $M(l_1, l_2) = 0$ if $l_1$ and $l_2$ belong to the same class; otherwise $M(l_1, l_2)$ is left unchanged.
3. Calculate the shortest path between every two vertices $v_1$, $v_2$ in the graph, and set $M(v_1, v_2)$ to its length.
The element value $M_{ij}$ in the output adjacency matrix $M$ represents the calculated CASD between the i-th sample and the j-th sample.

Figure 2. (a) The subscript below each sample shows its pixel location in the hyperspectral data. (b) Construct a complete undirected graph where each vertex represents a sample and the edge between every two samples is weighted by their Euclidean distance. (c) A and F are labeled samples of the same class, so the edge between them is reweighted to zero. (d) For every two vertices, compute the shortest path between them (the shortest path between A and C is marked in magenta). (e) Update the weight between every two vertices with the length of the shortest path between them. The new edge weight is called "the class adjusted spatial distance".
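The three steps of the CASD computation can be sketched in Python as follows. This is a minimal illustration, not the MATLAB implementation used in the experiments: the function name `casd_matrix` and the toy pixel locations are hypothetical, unlabeled samples are marked with `None`, and the Floyd-Warshall algorithm is used in place of Dijkstra's algorithm because all-pairs distances are needed and it keeps the sketch short.

```python
import math


def casd_matrix(coords, labels):
    """Class adjusted spatial distance between all pairs of samples.

    coords: list of (row, col) pixel locations, one per sample.
    labels: class id for labeled samples, None for unlabeled samples.
    """
    n = len(coords)
    # Step 1: complete graph weighted by the planar Euclidean distance.
    M = [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]
    # Step 2: zero the edge between labeled samples of the same class.
    for i in range(n):
        for j in range(n):
            if labels[i] is not None and labels[i] == labels[j]:
                M[i][j] = 0.0
    # Step 3: all-pairs shortest paths (Floyd-Warshall).
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if M[i][k] + M[k][j] < M[i][j]:
                    M[i][j] = M[i][k] + M[k][j]
    return M


# Hypothetical toy scene: A and F are labeled with the same class, C is unlabeled.
toy_coords = [(0, 0), (0, 10), (0, 9)]   # A, F, C
toy_labels = ["water", "water", None]
M_toy = casd_matrix(toy_coords, toy_labels)
```

In this toy scene the PED from A to C is 9, but because A and F share a class, the CASD from A to C is 1, the length of the path A, F, C.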

CASD-Assisted PCSSR
Based upon the CASD metric defined in Section 3.1, we now describe how to generate the graph for the LP algorithm by using the PCSSR flow. To start with, the PCSSR-based graph generation method is derived from the typical SR-based method. For every sample, the SR-based method aims to encode it as a sparse linear combination of the other samples [13,14]. In the typical SR model, $X$ denotes all the samples in the training set and the testing set, and $\|\cdot\|_1$ represents the L1 norm. By solving this regularization model, we obtain the graph weight matrix $W$ required by the subsequent LP process. Furthermore, due to the complex working environment and contamination during data transmission, many hyperspectral images are corrupted by different types and amounts of noise, two common types of which are striping noise and salt-and-pepper noise. Therefore, to account for corrupted data and noise introduced during collection, the model can be rewritten to enhance its robustness against noise; in the rewritten model, $X$ again denotes all the samples in the training set and the testing set, and $\lambda$ is a tradeoff parameter that controls the sparsity of $W$.
In the next step, we diverge from the original paper. The original PCSSR paper next introduces a probabilistic class structure term $P = [P_l; P_u] \in \mathbb{R}^{n \times c}$, where $P_{ij}$ represents the probability that sample $i$ belongs to class $j$, and then calculates the distance matrix $M$ based on the probabilistic class structure $P$, where $M_{ij} = \frac{1}{2}\|P_i - P_j\|^2$. In the original PCSSR paper, the probabilistic class structure $P$ is generated through a standard SR process, whereas one of the aims of our work is to introduce spatial information into PCSSR. Therefore, instead of computing the probabilistic class structure, we run Algorithm 1, as proposed in Section 3.1, to get the CASD between each two samples and apply it as the new distance matrix $M$, where $M_{ij}$ measures the distance between the i-th and j-th samples. If the two samples are close to each other (by category or in space), $M_{ij}$ is small, which means they are similar to each other. An additional regularizer $R(W)$ on the graph edge matrix $W$ is built from $M$: to acquire a smaller $R(W)$, $W_{ij}$ needs to be small when $M_{ij}$ is large. With this regularizer, the linkage between two far-away samples is regularized into a weak linkage. Once we obtain the distance matrix $M$ by calculating the CASD, the final CASD-assisted PCSSR model (7) is formulated with two parameters: $\lambda_1$ controls the sparsity of $W$, and $\lambda_2$ controls the effect of the class structure regularizer. The model formulated in (7) is a constrained optimization problem and can be relaxed and solved by Lagrange multiplier methods, for example the alternating direction method of multipliers (ADM) [26]. However, ADM has the disadvantage of introducing extra variables and requiring parameter tuning. In this work, following the original PCSSR method [17], we employ the ADM with adaptive penalty (ADMAP) [27], which can overcome the above-mentioned limitations, to solve problem (7).
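To illustrate how the distance matrix steers the sparse code, the sketch below solves a simplified, per-sample version of a distance-regularized SR problem. Several assumptions are made here that are not taken from the text: the reconstruction error is measured by a squared Euclidean norm, the regularizer is taken to be $R(W) = \sum_{i,j} M_{ij} |W_{ij}|$, each sample is coded independently against the other samples, and plain iterative soft-thresholding (ISTA) stands in for ADMAP. The names `soft_threshold` and `weighted_sparse_code` and the toy values are hypothetical.

```python
def soft_threshold(value, tau):
    """Proximal operator of tau * |.| (shrinks value toward zero)."""
    if value > tau:
        return value - tau
    if value < -tau:
        return value + tau
    return 0.0


def weighted_sparse_code(x, atoms, m, lam1, lam2, n_iter=500):
    """Minimise 0.5*||x - sum_j w[j]*atoms[j]||^2 + sum_j (lam1 + lam2*m[j])*|w[j]|.

    x:     target sample (list of floats).
    atoms: the other samples, one list of floats per atom.
    m:     distance from the target to each atom (e.g. one row of the CASD matrix).
    """
    dim, n_atoms = len(x), len(atoms)
    # The squared Frobenius norm bounds the Lipschitz constant of the gradient,
    # so its inverse is a safe (possibly conservative) step size.
    step = 1.0 / sum(v * v for atom in atoms for v in atom)
    w = [0.0] * n_atoms
    for _ in range(n_iter):
        # Residual of the current reconstruction, computed once per iteration.
        recon = [sum(w[j] * atoms[j][d] for j in range(n_atoms)) for d in range(dim)]
        resid = [recon[d] - x[d] for d in range(dim)]
        for j in range(n_atoms):
            grad_j = sum(atoms[j][d] * resid[d] for d in range(dim))
            # A larger distance m[j] gives a larger threshold, i.e. a weaker edge.
            w[j] = soft_threshold(w[j] - step * grad_j, step * (lam1 + lam2 * m[j]))
    return w


# Two identical candidate neighbours; only their distance to the target differs.
w_toy = weighted_sparse_code(x=[1.0, 0.0],
                             atoms=[[1.0, 0.0], [1.0, 0.0]],
                             m=[0.0, 5.0], lam1=0.01, lam2=0.1)
```

Both atoms reconstruct the target equally well, yet the whole weight goes to the atom at distance 0 and the far-away atom receives exactly zero weight, which is the behavior the regularizer is meant to produce.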

Label Propagation
After obtaining the sparse graph and its adjacency matrix $W$, we obtain the final prediction result by running the LP algorithm on the graph. As mentioned in Section 2, the main purpose of the LP algorithm is to transfer labels from the labeled samples to the unlabeled samples, and during this process a prediction matrix is generated. Furthermore, the generated prediction results should meet the basic assumption of the LP algorithm that similar samples should have similar labels. Mathematically, this is achieved by defining an energy function $E(f)$ on the given graph and minimizing it:
$$E(f) = \frac{1}{2} \sum_{i,j} W_{ij} \|f_i - f_j\|^2 \qquad (8)$$
where $f_i$ and $f_j$ are the predicted label vectors of the i-th and j-th data samples, respectively, $f$ is composed of all the predicted label vectors, and $W$ is the adjacency matrix of the graph needed for the LP process.
In order to maintain experimental consistency with the original PCSSR paper, we follow the formulation of the LP algorithm used there. The full explanation and the adapted formulas are detailed as follows.
The labeled samples are expressed as $X_l = [x_1, x_2, \ldots, x_l]$, and the large number of unlabeled samples as $X_u = [x_{l+1}, x_{l+2}, \ldots, x_{l+u}]$. There are $c$ classes in total, denoted as $C = \{1, 2, \ldots, c\}$. Let $n = l + u$ be the total number of data samples; usually, $l$ is much smaller than $u$. The matrix $W \in \mathbb{R}^{n \times n}$, the adjacency matrix of graph $G$ obtained from the PCSSR process, encodes the similarity or connection between each two samples. Next, we define a label matrix $Y_l$ with $l$ rows, where each row $(Y_l)_i \in \mathbb{R}^{1 \times c}$ is a one-hot vector representing the class that the corresponding labeled sample $x_i$ belongs to. $F \in \mathbb{R}^{n \times c}$ is the prediction matrix, each element $F_{ij}$ of which represents the probability of the i-th sample belonging to the j-th class; $F_l \in \mathbb{R}^{l \times c}$ is the upper $l$ rows of $F$, while $F_u \in \mathbb{R}^{u \times c}$ is the lower $u$ rows of $F$. The prediction matrix is obtained by solving:
$$\min_{F \in \mathbb{R}^{n \times c}} \frac{1}{2} \sum_{i,j=1}^{n} W_{ij} \|f_i - f_j\|^2 = \operatorname{tr}(F^{\top} L_W F), \quad \text{s.t. } F_l = Y_l$$
where the expression after min can be viewed as the energy function; $f_i \in \mathbb{R}^{1 \times c}$ and $f_j \in \mathbb{R}^{1 \times c}$ are the predicted label vectors of the data samples $x_i$ and $x_j$. $L_W = D - W$ is the Laplacian matrix, where $D$ is a diagonal matrix with $D_{ii} = \sum_j W_{ij}$.
Then we split $L_W$ into four blocks according to the numbers of labeled and unlabeled samples:
$$L_W = \begin{bmatrix} L_{ll} & L_{lu} \\ L_{ul} & L_{uu} \end{bmatrix}$$
Finally, we get the prediction matrix that records the probability of each unlabeled sample belonging to each class:
$$F_u = -L_{uu}^{-1} L_{ul} Y_l$$
The final prediction result for every unlabeled sample is given by:
$$y_i = \arg\max_{1 \le j \le c} (F_u)_{ij}$$
where $y_i$ denotes the class that the unlabeled sample $i$ is most likely to belong to.
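The label propagation step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: $W$ is assumed to be symmetric and non-negative with the labeled samples listed first, every unlabeled sample is assumed to have at least one neighbour, and the labeled rows are clamped to $Y_l$. Instead of inverting a block of the Laplacian, the sketch repeats a neighbour-averaging update whose fixed point satisfies $L_{uu} F_u = -L_{ul} Y_l$. The function name `propagate_labels` and the toy graph are hypothetical.

```python
def propagate_labels(W, Y_l, n_iter=500):
    """Harmonic label propagation with the labeled rows clamped to Y_l.

    W:   n x n symmetric non-negative adjacency matrix, labeled samples first.
    Y_l: l x c one-hot label matrix of the labeled samples.
    Returns (F_u, y_u): class scores and predicted class per unlabeled sample.
    """
    n, l, c = len(W), len(Y_l), len(Y_l[0])
    # Labeled rows hold the known labels; unlabeled rows start at zero.
    F = [[float(v) for v in row] for row in Y_l] + [[0.0] * c for _ in range(n - l)]
    for _ in range(n_iter):
        new_rows = []
        for i in range(l, n):
            degree = sum(W[i][j] for j in range(n) if j != i)
            # Each unlabeled score becomes the weighted mean of its neighbours.
            new_rows.append([
                sum(W[i][j] * F[j][k] for j in range(n) if j != i) / degree
                for k in range(c)
            ])
        F[l:] = new_rows  # labeled rows are never overwritten
    F_u = F[l:]
    y_u = [max(range(c), key=lambda k: row[k]) for row in F_u]
    return F_u, y_u


# Hypothetical chain graph A - u1 - u2 - B with A in class 0 and B in class 1.
W_chain = [[0, 0, 1, 0],   # A  (labeled)
           [0, 0, 0, 1],   # B  (labeled)
           [1, 0, 0, 1],   # u1 (unlabeled)
           [0, 1, 1, 0]]   # u2 (unlabeled)
Y_chain = [[1, 0], [0, 1]]
F_u_chain, y_u_chain = propagate_labels(W_chain, Y_chain)
```

On this chain, the sample next to A receives scores close to (2/3, 1/3) and the sample next to B receives scores close to (1/3, 2/3), so each unlabeled sample takes the class of the labeled sample it is more strongly connected to.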

Experimental Results and Analysis
In this section, we test the CASD-assisted PCSSR algorithm on six different hyperspectral datasets. The algorithm is implemented in MATLAB 2019b and runs on a laptop with an Intel i5-7300HQ CPU and a GTX 1050 Ti GPU. We use traditional graph-based algorithms for comparison. The codes and datasets used to generate the results and figures are available in Code Ocean [28].

Experimental Datasets
Two groups of datasets are used to evaluate our model. The first group includes the whole Botswana (BOT) dataset, the whole Kennedy Space Center (KSC) dataset, and the truncated Indian Pines (IND PINE) dataset. More information on these six datasets can be found in [29], and all datasets can be downloaded from [28]. The ground truth of every dataset is shown in Figures 3 and 4. The sample size of each class in each dataset is shown in Tables 1 and 2.
Class sample sizes: Buildings-grass-trees-drives, 297; class 16, Stone-steel-towers, 93.

Experimental Setup
In this part, we evaluate the performance of our CASD-assisted PCSSR algorithm on all datasets. Its performance on the Group I datasets is compared with the other traditional graph-based classification methods stated in [17], including the original PCSSR graph method, the Gaussian kernel (GK) graph method, the nonnegative local linear reconstruction (LLR) graph method, the local linear embedding (LLE) graph method, the nonnegative low-rank and sparse (NNLRS) graph method, and the SR graph method. Our CASD-assisted PCSSR approach is implemented under the same label propagation framework as the other models, and the hyperparameters of the other models stay the same as in [17]. The process of hyper-parameter determination during our model development is described in Section 4.4.
We separate every dataset into two parts, i.e., the training set and the testing set. In our case, the latter is much larger than the former. For each dataset, we randomly pick 3/5/10/15/20 samples per class as the training set (the labeled samples) and use the rest as the testing set (the unlabeled samples). An example of dividing the IND PINE dataset is illustrated in Figure 5. To be consistent with [17], we run our algorithm 20 times for each dataset. The means of the overall accuracy (OA), the average accuracy (AA), and the Kappa coefficient are used to evaluate the classification results.
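For reference, the three evaluation metrics can be computed from the true and predicted labels of the testing set as in the following sketch. It uses the standard definitions (OA as the fraction of correctly classified samples, AA as the mean of the per-class accuracies, and Cohen's Kappa as chance-corrected agreement); the function name `classification_scores` and the toy labels are hypothetical, and averaging over the 20 runs is left to the caller.

```python
def classification_scores(y_true, y_pred):
    """Return (OA, AA, kappa) for two equal-length label sequences."""
    classes = sorted(set(y_true) | set(y_pred))
    index = {c: i for i, c in enumerate(classes)}
    k, n = len(classes), len(y_true)
    # confusion[i][j]: number of samples of true class i predicted as class j.
    confusion = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        confusion[index[t]][index[p]] += 1
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    # OA: fraction of all samples classified correctly.
    oa = sum(confusion[i][i] for i in range(k)) / n
    # AA: mean of the per-class accuracies (classes absent from y_true are skipped).
    per_class = [confusion[i][i] / row_sums[i] for i in range(k) if row_sums[i]]
    aa = sum(per_class) / len(per_class)
    # Kappa: observed agreement corrected for the agreement expected by chance.
    pe = sum(row_sums[i] * col_sums[i] for i in range(k)) / (n * n)
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa


# Hypothetical toy labels: one sample of class 1 is mistaken for class 2.
oa_toy, aa_toy, kappa_toy = classification_scores([0, 0, 1, 1, 2, 2],
                                                  [0, 0, 1, 2, 2, 2])
```

OA weights every sample equally, whereas AA weights every class equally, so the two can differ noticeably on datasets with very unbalanced class sizes.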

For the result on the KSC dataset, as illustrated in Figure 6a, the CASD-assisted PCSSR-graph method performs better than the other methods when the number of labeled samples is more than 5 per class, finally achieving an accuracy of about 97%, about 10% higher than the other methods. For the result on the BOT dataset, as illustrated in Figure 6b, the CASD-assisted PCSSR-graph method performs better than the other methods when the number of labeled samples is more than 5 per class, finally achieving an accuracy of about 99%, about 5% higher than the other methods. For the result on the truncated IND PINE dataset, presented in Figure 6c, our method surpasses the other methods throughout and obtains an accuracy of about 96%, about 16% higher than the other methods, when the number of labeled samples is 15 per class.
Furthermore, the classification accuracy of each class, the overall accuracy (OA), the average accuracy (AA), and the Kappa coefficient for the different graph-based methods on the three datasets are shown in Tables 3-5, where the highest value of each row is shown in bold. For the BOT dataset, Table 3 shows that our method outperforms all the other algorithms, with the best class-specific accuracies on almost all classes and the best values on all indices. The only exception is Class 2, on which our method achieves an accuracy of 99.93% whereas the highest accuracy is 100.00%. For the KSC dataset, Table 4 shows that our method achieves better performance than all the other algorithms on almost all indices. The only exception is Class 11, on which our method achieves an accuracy of 99.64% whereas the highest accuracy is 99.70%. For the truncated IND PINE dataset, Table 5 shows that our method once again outperforms all the other algorithms, with the best class-specific accuracies on almost all indices. The only exception is Class 8, on which our method achieves an accuracy of 99.64% whereas the highest accuracy is 100.00%.

Results and Discussion
The above figures and tables show that the classification accuracies of our model are higher than those of the other traditional graph-based methods. Based on the above experimental results, we can come to the following conclusions:

1. For the datasets in Group I, the CASD-assisted PCSSR algorithm does not perform as well when only a small number of labeled samples are provided. However, as more labeled samples are given, our method gradually surpasses the other graph-based methods, finally by more than 5% in overall accuracy. This result indicates that introducing spatial information can effectively improve the classification accuracy of traditional spectrum-focused algorithms when a relatively larger training set is given.
2. For the datasets in Group II, our algorithm achieves very high accuracy on the SAL dataset. On the whole IND PINE dataset, however, the algorithm performs worse than on the truncated one in Group I.

Table 3. Classification accuracy of each class, OA, average accuracy (AA), and Kappa coefficients for BOT data with nine classes (20 training samples for each class). The highest value of each row is shown in bold.
Table 4. Classification accuracy of each class, OA, AA, and Kappa coefficients for KSC data with 13 classes (20 training samples for each class). The highest value of each row is shown in bold.
Column headers: Class | GK-Graph | LLR-Graph | LLE-Graph | NNLRS-Graph | SR-Graph | PCSSR-Graph | CASD

Conclusion 1 states that the performance of the CASD-assisted PCSSR algorithm is closely related to the number of labeled samples for each class. A lack of labeled samples leads to low accuracy, and increasing the number of labeled samples improves the result effectively.
Since the result of the PCSSR algorithm is regularized by the probabilistic class structure, which is generated by our CASD algorithm, the distances (CASDs) between samples have a great effect on the final performance of our algorithm. To visualize the effect of the distances, we can do the following: for every unlabeled sample, find the labeled sample with the shortest CASD to it, then mark the unlabeled sample with that labeled sample's class. The classification results on the BOT dataset (with three labeled samples per class) are shown in Figure 8. Please note that this visualization of the CASD is an independent process, intended only to give a better understanding of how well the CASD is measured. It is not an intermediate result of the CASD assisted PCSSR algorithm.
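The visualization step described above amounts to a nearest-labeled-sample assignment. A minimal sketch follows, assuming the CASD values have already been computed and stored in a matrix `dist` of shape (number of unlabeled samples, number of labeled samples), and assuming that "marking" a sample means giving it the class of its nearest labeled sample. The function name and arguments are hypothetical; the CASD computation itself is not reproduced here.

```python
import numpy as np

def nearest_labeled_assignment(dist, labeled_classes):
    """Assign each unlabeled sample the class of its nearest labeled sample.

    dist            : (n_unlabeled, n_labeled) precomputed distances
                      (assumed here to be CASD values).
    labeled_classes : (n_labeled,) class index of each labeled sample.
    """
    dist = np.asarray(dist, dtype=float)
    labeled_classes = np.asarray(labeled_classes)
    # Index of the labeled sample with the shortest distance, per unlabeled sample.
    nearest = dist.argmin(axis=1)
    # Look up the class of that nearest labeled sample.
    return labeled_classes[nearest]

# Toy usage with made-up distances (illustrative values only).
toy_dist = [[0.2, 0.9, 0.5],
            [0.8, 0.1, 0.3],
            [0.6, 0.7, 0.4]]
toy_marks = nearest_labeled_assignment(toy_dist, [1, 2, 2])
```

Reshaping the returned marks back to the image grid would give a map comparable in spirit to the one the text refers to, under the assumptions stated above.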
It is easy to see that if very few labeled samples are selected from the different categories, there are not enough of them for every ground block in the testing set to receive one. When such a block is classified, misclassification is likely to happen if the samples of the same category are far away or samples of different classes are nearby. The flaws in the probabilistic class structure generated by the spatial algorithm can interfere with the subsequent sparse representation process, finally resulting in a decrease in accuracy. As the number of labeled samples increases, the probability that a block is assigned labeled samples rises, the accuracy of the algorithm improves, and finally the OA improves.
The spatial algorithm performs well only when the samples to be predicted are close to the labeled samples. As the distance increases, the reliability of the prediction drops. Besides, the classification boundary delineated by the spatial algorithm does not take into account the edge information of the hyperspectral image. Therefore, samples in the intersecting area between classes are more affected by neighboring samples and more likely to be assigned to an incorrect category. If there are many unlabeled samples near the intersecting area, the classification result based on CASD could be unsatisfactory (Figure 9). To sum up, the classification tends to be relatively poor at category boundaries far from the training samples.
Conversely, if the ground blocks to be classified in the dataset are fragmented and scattered, the classification boundary is more likely to fall in negligible areas (the black background area), and the classification center is more likely to fall within a ground block that needs to be classified. Therefore, given enough training samples, the classification results on more scattered datasets are generally better.

Parameters Sensitivity Analysis
In this subsection, we discuss the parameter sensitivity of our model using the truncated IND PINE dataset with 10 labeled samples selected from every class. There are two parameters in the PCSSR algorithm, $\lambda_1$ and $\lambda_2$: $\lambda_1$ controls the sparsity of $W$, while $\lambda_2$ controls the effect of the class structure regularizer. We repeat 50 runs for each fixed parameter configuration and present the average results. For example, in Figure 10a, we fix $\lambda_1$ and vary $\lambda_2$ to observe the classification results. For each value of $\lambda_2$, we repeat 50 runs, calculate the classification accuracy of each run, and finally obtain the average accuracy. In the experiment, we first keep $\lambda_1$ equal to $1 \times 10^{-4}$ and vary $\lambda_2$ from $1 \times 10^{-5}$ to $1 \times 10^{-4}$ in steps of $1 \times 10^{-5}$. As Figure 10a shows, the algorithm reaches its optimal performance when $\lambda_2$ equals $7 \times 10^{-5}$. Then we fix $\lambda_2$ and let $\lambda_1$ change. As illustrated in Figure 10b,c, the OA remains essentially the same when $\lambda_1$ is between $1 \times 10^{-5}$ and $1 \times 10^{-4}$, and drops when $\lambda_1$ is larger. The results show that sparsity and the probabilistic class structure both matter in the classification process, though the variation in performance is not large when the parameters change.
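The sweep protocol described above (fix one parameter, vary the other over a grid, average 50 runs per grid point) can be sketched as follows. This is only an illustration under stated assumptions: `sweep_lambda2` and the `evaluate` callback are hypothetical names, each run is assumed to differ only by a random seed (for example, the random choice of labeled samples), and `dummy_evaluate` is an arbitrary placeholder that does not run PCSSR and does not reproduce any result from the paper.

```python
import numpy as np

def sweep_lambda2(evaluate, lam1, lam2_grid, n_runs=50):
    """Average evaluate(lam1, lam2, seed) over n_runs seeds for each lam2."""
    mean_oa = []
    for lam2 in lam2_grid:
        # One run per seed; the seed stands in for the random draw of a run.
        runs = [evaluate(lam1, lam2, seed) for seed in range(n_runs)]
        mean_oa.append(np.mean(runs))
    mean_oa = np.array(mean_oa)
    # Grid value with the highest averaged score.
    best_lam2 = lam2_grid[int(mean_oa.argmax())]
    return mean_oa, best_lam2

# Grid from the text: 1e-5 to 1e-4 in steps of 1e-5 (10 values).
lam2_grid = np.arange(1, 11) * 1e-5

def dummy_evaluate(lam1, lam2, seed):
    # PLACEHOLDER ONLY: an arbitrary smooth score with small seeded noise,
    # standing in for "train the classifier and return its OA".
    rng = np.random.default_rng(seed)
    return 0.9 - (lam2 / 1e-4 - 0.5) ** 2 + 0.001 * rng.standard_normal()

mean_oa, best_lam2 = sweep_lambda2(dummy_evaluate, 1e-4, lam2_grid, n_runs=50)
```

In a real experiment, `dummy_evaluate` would be replaced by a function that draws the labeled samples, builds the graph with the given parameters, and returns the measured OA. The same loop with the roles of the two parameters swapped covers the second half of the analysis.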

Conclusions
This paper has developed a novel graph construction method called the CASD assisted PCSSR algorithm. The proposed method introduces spatial information into the classification process on the SR graph, so that the "distance" between two samples can be measured by both spatial distance and class distance. The experimental results show that the CASD assisted PCSSR algorithm is an effective method for hyperspectral data classification and can achieve relatively high performance when enough training samples are provided.
Our method also has shortcomings. First, the number of training samples must be sufficient for the training process. If the training set is very limited while the ground blocks to predict are numerous, the final performance may suffer. However, owing to the sparse representation model used in this work, only a relatively small training set is needed to accomplish model training. Second, categorizing by CASD does not guarantee a well-delineated boundary between classes, which means that samples close to that boundary might be badly classified. Nevertheless, the final output of the model can be corrected by the subsequent sparse representation process, since the CASD algorithm only provides a "suggestion" to the PCSSR algorithm. Our future work is to extract the edge information from the hyperspectral data. Applying it to the CASD algorithm may compensate for the lack of classification accuracy in the intersecting area between classes.