1. Introduction
Hyperspectral remote sensing images contain rich spectral features, which can provide effective information for classification and other tasks [1]. Hyperspectral imaging technology is widely used in the field of remote sensing, including military target detection [2], urban land planning [3], and vegetation coverage analysis [4]. In addition, it also has applications in the fields of medicine and health [5] and the monitoring of plant diseases [6]. This paper mainly studies the general classification task for HSIs, which is to identify the object category of each pixel in the image. In early studies, individual pixels were often used as the target to train the classification model [7,8]. In fact, HSIs also have spatial characteristics, which means that adjacent pixels often belong to the same category. In recent years, research on spatial–spectral feature integration has attracted more and more attention.
In the task of HSI classification, many machine learning algorithms, such as Random Forest [9], K-Nearest Neighbor [10], Linear Discriminant Analysis [11], and Support Vector Machine (SVM) [7], are often used as classifiers. Since the Convolutional Neural Network (CNN) naturally has the ability to extract spatial features, it is also often used for HSI classification [12,13]. However, when the sample size of the training set is small, it is difficult to train a CNN to obtain stable results. On the contrary, SVM not only converges to stable results but also works well with small sample sizes [14]. Because manual sample labeling of HSIs is costly [15], it is very meaningful to explore classification with a small sample size. In much current research, it is also popular to combine SVM as a classifier with spatial–spectral feature integration algorithms [16,17]. These algorithms are mainly divided into two categories, namely “pre-processing before classification” and “processing after classification”.
The “pre-processing before classification” method refers to a pre-processing method that reconstructs the data before classification. In this process, the spatial features of the hyperspectral data are extracted and fused with the spectral features. Methods based on mathematical morphology are among those often used to extract spatial features [18], and many researchers have made improvements on this basis. For example, Liao W et al. proposed a morphological profiles (MPs) classification method based on partial reconstruction and directional MPs [19]. On this basis, they proposed a semi-supervised feature extraction method to reduce the dimensionality of the generated MPs. Hou B et al. proposed a 3-D morphological profile (3D-MP) based method that exploits the dependence within the data to improve the classification accuracy [20]. Imani M et al. pointed out that fixed-shape structuring elements cannot extract the profile information effectively, so they proposed a method to extract edge patch image-based MPs [21]. Kumar B et al. used multi-shape structuring elements instead of ones with a particular shape, and then used decision fusion to combine the classification results obtained from the different Extended Morphological Profiles (EMPs) [22]. In addition, Kumar B et al. also used parallel computing to improve the speed of the EMP algorithm [23].
In recent years, spatial–spectral feature integration methods based on spatial filtering have also been widely studied. Various filtering methods (e.g., Gabor filtering, Wiener filtering) have been applied. Gabor filtering is often combined with other algorithms to improve the model. For example, Wu K et al. combined Simple Linear Iterative Clustering, two-dimensional Gabor filtering and Sparse Representation, and proposed an SP-Gabor classification method [24]. Jia S et al. proposed an extended Gabor wavelet based on morphology, combining the advantages of the EMP operator and the Gabor wavelet transform [25]. In addition, many studies combine Gabor filtering with convolutional neural networks [26,27]. Wiener filtering is also often used in combination with other models or methods to improve classification performance. For example, in the CDCT-WF-SVM model proposed by Bazine R et al., after the data are processed with the Discrete Cosine Transform (DCT), Wiener filtering is applied to spatially filter the high-frequency components and further extract useful information [28]. In fact, the main role of Wiener filtering is to reduce noise [29]. In current research on the denoising of HSIs, the signal-dependent photon noise has become a research hotspot. For example, Liu X et al. used pre-whitening to transform the non-white noise in HSIs into white noise, and then used multidimensional Wiener filtering for denoising [30]. In addition to the above two kinds of filtering, filtering methods such as mean filtering [31,32] and edge-preserving filtering [33,34] are also often used to extract the spatial features of HSIs.
The “processing after classification” method refers to first using a classifier to obtain a predicted probability map, and then using one or more processing methods to refine the obtained probability map. Methods based on the Markov Random Field (MRF) have been widely studied, and they are mainly used to further process the pixel classification results obtained by the classifier [35,36]. For example, Qing C et al. proposed a deep learning framework in which a convolutional neural network is used as the pixel classifier, and an MRF is used to mine spatial information and refine the pixel classification results [37]. Chakravarty S et al. pointed out that a model that simply combines the SVM classifier and an MRF cannot enhance the smoothness of the spatial and spectral analysis; therefore, they used a fuzzy MRF to promote a smooth transition between classified pixels [38]. Xu Y et al. optimized the model by inserting the watershed algorithm in the process of combining SVM and MRF [39]. Tang B et al. proposed a classification framework based on the Spectral Angle Mapper (SAM) that obtains more accurate classifications by introducing a multi-center model and an MRF into the probabilistic decision framework [40]. To address the problem that a shallow MRF cannot fully utilize the spatial information of HSIs, Cao X et al. proposed a cascaded MRF model, which further improved the classification performance [41].
Similar to the MRF-based models, models based on graph segmentation (GC) are also used to further process the pixel classification results in order to make effective use of the spatial information of the hyperspectral image. Wang Y et al. proposed a classification model based on joint bilateral filtering (JBF) and graph segmentation; the model first uses SVM to classify pixels, and then uses JBF and GC to smooth the obtained probability map [42]. Yu H et al. combined the MRF and GC algorithms to successively process the classification probability map obtained by SVM, which further improved the classification accuracy [43]. In addition, there are also studies that combine “pre-processing before classification” and “processing after classification”. For example, Liao W et al. proposed an adaptive Bayesian context classification model, which first fuses the spatial–spectral features of the original data based on extended morphology, and then uses a Markov random field to process the probability map obtained after classification [44]. Cao X et al. used a low-rank matrix factorization (LRMF) method based on a Gaussian mixture algorithm to extract features before classification, and used the combination of SVM and MRF to classify the data [45]. The edge-preserving filtering methods mentioned above can also be used after classification [46,47], or both before and after classification [48].
Since HSIs are obtained by continuous imaging of the target area by the sensor, the probability that adjacent pixels belong to the same category is high. In methods based on image segmentation, each segmented region is considered to be composed of homogeneous pixels [49]. The spectral features of pixels belonging to the same category should be similar, so the correlation between two such pixels should be relatively high [50]. Based on this assumption, this paper proposes a Nested Sliding Window (NSW) method, which uses the correlation between pixel vectors to reconstruct the data in the pre-processing stage. In the NSW-PCA-SVM model, the PCA method is used to reduce the dimensionality of the reconstructed data, and an RBF-kernel SVM is used for classification.
Various popular methods are usually obtained by improving earlier methods or by combining existing algorithms. For example, the CDCT-WF-SVM model, based on the Discrete Cosine Transform (DCT) algorithm and the Adaptive Wiener Filter (AWF) algorithm, was obtained by combining existing algorithms [28]. Although the final classification accuracy can be significantly improved by improving or skillfully combining existing algorithms, we try to propose a new algorithm from a more intuitive perspective, hoping to provide additional ideas for future research.
The rest of this article is arranged as follows. Section 2 introduces the proposed model, including the structure of the model and the implementation details of the NSW method. Section 3 introduces the datasets used in this article and describes the procedure for validating the model performance. Section 4 gives the experimental results and the corresponding analysis, including the determination of the model parameters and the specific classification results. Section 5 discusses the advantages and limitations of the proposed approach. Section 6 concludes and summarizes the work of this article.
2. The Proposed Approach
The proposed model consists of three parts: reconstruction of the original data with the NSW method, dimensionality reduction of the reconstructed data with PCA, and classification of the dimensionality-reduced data with SVM.
Figure 1 shows the structure diagram of the NSW-PCA-SVM model.
The original hyperspectral data is a three-dimensional cube, which can be expressed as $\mathbf{X} \in \mathbb{R}^{H \times W \times B}$, where $\mathbb{R}$ represents the set of real numbers and $H \times W \times B$ indicates that the image has three dimensions, including two spatial dimensions (i.e., H and W) and one spectral dimension (i.e., B). $\mathbf{X}$ contains a total of $H \times W$ pixels, and each pixel is a B-dimensional vector. In fact, not all pixels in an HSI have a category label, and only pixels with labels are used as target pixels for reconstruction. Therefore, the data reconstructed by NSW is a two-dimensional matrix, i.e., $\mathbf{R} \in \mathbb{R}^{B \times N}$, where N represents the number of labeled pixels. Dimensionality reduction using PCA maps $\mathbf{R} \in \mathbb{R}^{B \times N}$ to $\mathbf{R}_C \in \mathbb{R}^{C \times N}$, where C represents the number of principal components and $C < B$. According to the sample sizes of the training sets specified in Section 3, a specified number of samples are randomly selected from $\mathbf{R}_C$ as the training set, and the remaining samples form the test set; the training set is used to train the SVM.
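To make the data flow above concrete, the following is a minimal sketch of the three-stage pipeline on synthetic data. It assumes numpy and scikit-learn; the function nsw_reconstruct is a hypothetical stand-in that simply gathers the labeled pixels, so that the sketch stays runnable before the NSW details are given in Section 2.1.

```python
# Minimal sketch of the NSW -> PCA -> SVM pipeline (synthetic data).
# nsw_reconstruct is a hypothetical stand-in for the NSW step of Section 2.1:
# here it only collects the labeled pixels so that the sketch runs end to end.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

H, W, B = 20, 20, 103                      # spatial size and number of bands
X = np.random.rand(H, W, B)                # synthetic HSI cube in R^{H x W x B}
Y = np.random.randint(0, 4, size=(H, W))   # synthetic label matrix, 0 = unlabeled

def nsw_reconstruct(cube, labels):
    mask = labels > 0
    return cube[mask], labels[mask]        # (N, B) pixel matrix and N labels

R, y = nsw_reconstruct(X, Y)               # N labeled pixels, each a B-dim vector
R_c = PCA(n_components=10).fit_transform(R)  # reduce B bands to C = 10 components
R_tr, R_te, y_tr, y_te = train_test_split(R_c, y, train_size=0.2, stratify=y)
clf = SVC(kernel="rbf").fit(R_tr, y_tr)    # RBF-kernel SVM on the training split
print("test OA:", clf.score(R_te, y_te))
```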
2.1. The Nested Sliding Window Method
Due to the continuity of the hyperspectral imaging area, it can theoretically be assumed that adjacent pixels have the same category label. However, at the junction of the areas occupied by two different ground targets, adjacent pixels obviously do not share the same label. The nested sliding window (NSW) method proposed in this paper uses the correlation between HSI pixels to reconstruct the data, and the Pearson correlation coefficient is used to measure the correlation. For pixel vectors $\mathbf{x}_i$ and $\mathbf{x}_j$, the Pearson correlation coefficient is calculated as follows:

$$\rho(\mathbf{x}_i,\mathbf{x}_j)=\frac{\operatorname{Cov}(\mathbf{x}_i,\mathbf{x}_j)}{\sqrt{D(\mathbf{x}_i)}\sqrt{D(\mathbf{x}_j)}}, \quad (1)$$

where $\operatorname{Cov}(\cdot,\cdot)$ represents covariance, and $D(\cdot)$ represents variance.
Denote the pixel vector in row i and column j of $\mathbf{X}$ as $\mathbf{x}_{i,j}$, and assume that $\mathbf{x}_{i,j}$ is the target pixel. Define a neighborhood $\mathbf{N}_{i,j}$ with a size of $(2a+1)\times(2a+1)$ that contains the surrounding pixels of $\mathbf{x}_{i,j}$:

$$\mathbf{N}_{i,j}=\left\{\mathbf{x}_{p,q}\right\}, \quad (2)$$

where $i-a \le p \le i+a$, $j-a \le q \le j+a$, and $a$ is a positive integer.
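As a quick illustration of Formula (1), the sketch below computes the Pearson correlation coefficient between two synthetic pixel spectra with numpy and checks it against numpy's built-in routine.

```python
import numpy as np

def pearson(x_i, x_j):
    # Formula (1): covariance divided by the square roots of the variances.
    xc, yc = x_i - x_i.mean(), x_j - x_j.mean()
    return (xc @ yc) / (np.sqrt((xc @ xc) * (yc @ yc)) + 1e-12)

x1, x2 = np.random.rand(103), np.random.rand(103)   # two synthetic B-dimensional pixels
print(pearson(x1, x2), np.corrcoef(x1, x2)[0, 1])   # the two values should agree
```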
Considering that Formula (2) cannot be used to obtain the neighborhood when the target pixel is at an edge position of $\mathbf{X}$ in the spatial dimensions, i.e., when $i \le a$, $i > H-a$, $j \le a$, or $j > W-a$, the data needs to be zero-padded before obtaining the neighborhood. Performing the zero-padding operation on $\mathbf{X}$ gives $\mathbf{X}' \in \mathbb{R}^{(H+2a)\times(W+2a)\times B}$:

$$\mathbf{x}'_{p,q}=\begin{cases}\mathbf{x}_{p-a,\,q-a}, & a < p \le H+a \ \text{and}\ a < q \le W+a,\\ \mathbf{0}, & \text{otherwise},\end{cases} \quad (3)$$

where $\mathbf{x}'_{p,q}$ denotes the pixel vector in row p and column q of $\mathbf{X}'$.
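A small sketch of the zero-padding of Formula (3) and the neighborhood extraction of Formula (2) is shown below (synthetic sizes; only the two spatial dimensions are padded).

```python
import numpy as np

H, W, B, a = 10, 12, 50, 2                         # synthetic sizes; neighborhood side = 2a+1
X = np.random.rand(H, W, B)

# Formula (3): zero-pad the two spatial dimensions by a on each side.
X_pad = np.pad(X, ((a, a), (a, a), (0, 0)), mode="constant")

def neighborhood(X_pad, i, j, a):
    # (i, j) index the original image (0-based); padding shifts both spatial
    # indices by a, so this slice is centered on the target pixel (Formula (2)).
    return X_pad[i:i + 2 * a + 1, j:j + 2 * a + 1, :]

print(neighborhood(X_pad, 0, 0, a).shape)          # (2a+1, 2a+1, B), valid even at the edge
```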
After zero-padding is performed on the raw data, the neighborhood of the target pixel $\mathbf{x}_{i,j}$ can be obtained from $\mathbf{X}'$ according to Formula (2), denoted as $\mathbf{N}_{i,j}$, whose elements are indexed locally as $\mathbf{n}_{u,v}$ with $1 \le u, v \le 2a+1$. Set a sliding sub-window with a size of $(a+1)\times(a+1)$ in the neighborhood. Using this sub-window, a smaller three-dimensional matrix $\mathbf{S}_{m,n} \in \mathbb{R}^{(a+1)\times(a+1)\times B}$ can be divided from $\mathbf{N}_{i,j}$, where m and n are used to determine the position of the sliding window, $1 \le m \le a+1$ and $1 \le n \le a+1$. $\mathbf{S}_{m,n}$ is defined as follows:

$$\mathbf{S}_{m,n}=\left\{\mathbf{n}_{u,v}\right\}, \quad (4)$$

where $m \le u \le m+a$ and $n \le v \le n+a$.
Within the valid range of m and n, $\mathbf{S}_{m,n}$ always contains the target pixel $\mathbf{x}_{i,j}$. Therefore, the correlation coefficients between $\mathbf{x}_{i,j}$ and the pixel vectors in $\mathbf{S}_{m,n}$ can be calculated according to Formula (1), and the resulting correlation coefficient matrix is denoted as $\mathbf{C}_{m,n}$:

$$\mathbf{C}_{m,n}=\left[\rho\!\left(\mathbf{x}_{i,j},\mathbf{n}_{u,v}\right)\right]_{(a+1)\times(a+1)}, \quad m \le u \le m+a,\ n \le v \le n+a. \quad (5)$$
Let the mean value of the elements in the matrix $\mathbf{C}_{m,n}$ be $\bar{\rho}_{m,n}$. When the position of the sliding sub-window changes, i.e., when the values of m and n change, $\bar{\rho}_{m,n}$ changes accordingly. Assuming that $(m^{*},n^{*})$ is the position that maximizes $\bar{\rho}_{m,n}$, the corresponding $\mathbf{S}_{m^{*},n^{*}}$ and $\mathbf{C}_{m^{*},n^{*}}$ can be obtained according to Formulas (4) and (5), respectively, and are used to reconstruct the target pixel. First, change the shape of $\mathbf{S}_{m^{*},n^{*}}$ from $(a+1)\times(a+1)\times B$ to $(a+1)^{2}\times B$, and change the shape of $\mathbf{C}_{m^{*},n^{*}}$ from $(a+1)\times(a+1)$ to $(a+1)^{2}\times 1$. Second, expand the reshaped $\mathbf{C}_{m^{*},n^{*}}$ from a one-dimensional vector to a two-dimensional matrix, denoted as $\mathbf{C}'$:

$$\mathbf{C}'=\left[\mathbf{c},\mathbf{c},\dots,\mathbf{c}\right]\in\mathbb{R}^{(a+1)^{2}\times B}, \quad (6)$$

where $\mathbf{c}$ is the reshaped $(a+1)^{2}\times 1$ coefficient vector, repeated B times as the columns of $\mathbf{C}'$.
The formula for the reconstruction of $\mathbf{x}_{i,j}$ is as follows:

$$\hat{\mathbf{x}}_{i,j}=\frac{\operatorname{sum}\!\left(\mathbf{S}_{m^{*},n^{*}}\odot\mathbf{C}'\right)}{\operatorname{sum}\!\left(\mathbf{C}_{m^{*},n^{*}}\right)}, \quad (7)$$

where $\hat{\mathbf{x}}_{i,j}$ is the reconstructed pixel of the current target pixel, which is a one-dimensional (B-dimensional) vector; $\odot$ represents the element-wise product (i.e., the Hadamard product) of the reshaped $\mathbf{S}_{m^{*},n^{*}}$ and $\mathbf{C}'$; and $\operatorname{sum}(\cdot)$ represents the sum of the elements in a vector (or matrix), taken along the pixel dimension in the numerator.
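In array terms, Formulas (6) and (7) amount to a reshape, a broadcast of the coefficients and a weighted sum; the sketch below reconstructs one target pixel from an already selected sub-window, assuming the normalized weighted-average reading of Formula (7).

```python
import numpy as np

a, B = 2, 50
S = np.random.rand(a + 1, a + 1, B)     # optimal pixel sub-window, (a+1) x (a+1) x B
C = np.random.rand(a + 1, a + 1)        # its correlation coefficients, (a+1) x (a+1)

S_flat = S.reshape(-1, B)               # (a+1)^2 x B
c_flat = C.reshape(-1, 1)               # (a+1)^2 x 1
C_exp = np.repeat(c_flat, B, axis=1)    # Formula (6): each row repeats one coefficient

# Formula (7): Hadamard product, sum over the pixel dimension, then normalize.
x_rec = (S_flat * C_exp).sum(axis=0) / c_flat.sum()
print(x_rec.shape)                      # (B,), the reconstructed target pixel
```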
Every time m and n change, a new $\mathbf{C}_{m,n}$ is calculated, which actually causes a lot of repeated calculations. Therefore, it is better to calculate the correlation coefficients between $\mathbf{x}_{i,j}$ and all the pixels in the neighborhood $\mathbf{N}_{i,j}$ at the beginning. The resulting correlation coefficient matrix is as follows:

$$\mathbf{P}_{i,j}=\left[\rho_{u,v}\right]_{(2a+1)\times(2a+1)}, \quad (8)$$

where $\rho_{u,v}$ represents the correlation coefficient between $\mathbf{x}_{i,j}$ and $\mathbf{n}_{u,v}$. Therefore, $\mathbf{C}_{m,n}$ can be obtained directly by partitioning $\mathbf{P}_{i,j}$.
In fact, only a fraction of the pixels in the datasets used for the experiments have category labels. If all the pixels in a dataset are reconstructed, there will be a lot of useless computational overhead. Therefore, whether a pixel needs to be reconstructed is decided by whether it has a category label. For the three-dimensional hyperspectral dataset $\mathbf{X}$, there is a corresponding label matrix $\mathbf{Y} \in \mathbb{R}^{H \times W}$, and the elements in $\mathbf{Y}$ correspond to the pixel vectors in $\mathbf{X}$. Generally, pixels without a category label are uniformly assigned the label 0 to facilitate their exclusion during analysis. For this reason, in the NSW method, pixels with label 0 are skipped without processing, and pixels with a non-zero label are used as the target pixels for reconstruction.
The pseudocode of NSW is shown in Algorithm 1.
Algorithm 1 Nested Sliding Window algorithm.
Input: the raw dataset $\mathbf{X}$, the category label matrix $\mathbf{Y}$, the size of the neighborhood $w = 2a+1$
Output: the reconstructed dataset $\mathbf{R}$, the label set $\mathbf{L}$
1: Initialize the reconstructed dataset, which is a 2-D matrix: $\mathbf{R} \leftarrow \varnothing$
2: Initialize the label set, which is a 1-D vector: $\mathbf{L} \leftarrow \varnothing$
3: /*Obtain the neighborhoods of pixels labeled non-zero.*/
4: Initialize the neighborhood set, which is a 4-D tensor: $\mathcal{N} \leftarrow \varnothing$
5: $\mathbf{X}' \leftarrow$ the matrix obtained by zero-padding on $\mathbf{X}$ according to Formula (3)
6: $a \leftarrow (w-1)/2$
7: for $i \leftarrow 1$ to $H$ do
8:   for $j \leftarrow 1$ to $W$ do
9:     if $\mathbf{Y}_{i,j} \neq 0$ then
10:      append the neighborhood $\mathbf{N}_{i,j}$ obtained from $\mathbf{X}'$ by Formula (2) to $\mathcal{N}$;
11:      append $\mathbf{Y}_{i,j}$ to $\mathbf{L}$;
12:    end if
13:  end for
14: end for
15: /*According to the obtained neighborhood tensor $\mathcal{N}$, reconstruct the data.*/
16: $n \leftarrow$ the first dimension of $\mathcal{N}$ ▹ Get the first dimension of $\mathcal{N}$.
17: for $k \leftarrow 1$ to $n$ do
18:   $\mathbf{N}_k \leftarrow$ the k-th neighborhood in $\mathcal{N}$;
19:   $\mathbf{P} \leftarrow$ the correlation coefficient matrix calculated using Formula (8);
20:   $\bar{\rho}_{\max} \leftarrow -1$;
21:   $(m^{*}, n^{*}) \leftarrow (1, 1)$;
22:   for $m \leftarrow 1$ to $a+1$ do
23:     for $n' \leftarrow 1$ to $a+1$ do
24:       $\mathbf{C}_{m,n'} \leftarrow$ the sub-window partitioned from $\mathbf{P}$;
25:       $\bar{\rho} \leftarrow$ the mean of the elements of $\mathbf{C}_{m,n'}$;
26:       if $\bar{\rho} > \bar{\rho}_{\max}$ then
27:         $\bar{\rho}_{\max} \leftarrow \bar{\rho}$;
28:         $(m^{*}, n^{*}) \leftarrow (m, n')$;
29:       end if
30:     end for
31:   end for
32:   $\hat{\mathbf{x}} \leftarrow$ the reconstructed pixel vector calculated using Formula (7);
33:   append $\hat{\mathbf{x}}$ to $\mathbf{R}$;
34: end for
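A compact Python transcription of Algorithm 1 is sketched below. It follows the assumptions used above (square neighborhood of side 2a+1, sub-window of side a+1, correlation-weighted averaging) and is meant as an unoptimized illustration rather than the authors' reference implementation.

```python
import numpy as np

def nsw_reconstruct(X, Y, w):
    """Nested Sliding Window reconstruction (sketch).
    X: HSI cube (H, W, B); Y: label matrix (H, W) with 0 = unlabeled; w: odd neighborhood side."""
    H, Wd, B = X.shape
    a = (w - 1) // 2
    Xp = np.pad(X, ((a, a), (a, a), (0, 0)), mode="constant")
    R, labels = [], []
    for i in range(H):
        for j in range(Wd):
            if Y[i, j] == 0:
                continue
            nb = Xp[i:i + w, j:j + w, :]                      # neighborhood (Formula (2))
            target = X[i, j, :]
            # Correlation of the target with every neighborhood pixel (Formula (8)).
            flat = nb.reshape(-1, B)
            fc = flat - flat.mean(axis=1, keepdims=True)
            tc = target - target.mean()
            P = (fc @ tc) / (np.linalg.norm(fc, axis=1) * np.linalg.norm(tc) + 1e-12)
            P = P.reshape(w, w)
            # Slide the (a+1) x (a+1) sub-window; keep the largest mean correlation.
            best, (bm, bn) = -np.inf, (0, 0)
            for m in range(a + 1):
                for n in range(a + 1):
                    mc = P[m:m + a + 1, n:n + a + 1].mean()
                    if mc > best:
                        best, (bm, bn) = mc, (m, n)
            S = nb[bm:bm + a + 1, bn:bn + a + 1, :].reshape(-1, B)
            c = P[bm:bm + a + 1, bn:bn + a + 1].reshape(-1, 1)
            R.append((S * c).sum(axis=0) / c.sum())           # Formula (7)
            labels.append(Y[i, j])
    return np.array(R), np.array(labels)

R, L = nsw_reconstruct(np.random.rand(15, 15, 60), np.random.randint(0, 4, (15, 15)), w=5)
print(R.shape, L.shape)                                        # (N, B) and (N,)
```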
An example is given to graphically illustrate the general processing flow of the NSW method, as shown in Figure 2.
In the example given in Figure 2, the size of the neighborhood is $5 \times 5$ (i.e., $a = 2$). By calculating the correlation coefficients between the target pixel and all pixels in the neighborhood, a correlation coefficient matrix of size $5 \times 5$ is obtained. Whenever the position of the sliding sub-window changes, a subset of pixels with a size of $3 \times 3$ is partitioned from the neighborhood, and a corresponding subset of correlation coefficients with a size of $3 \times 3$ is obtained, too. The pixel subset contains 1 target pixel (marked in red) and 8 adjacent pixels (marked in green). By comparing the mean values of the different correlation coefficient subsets, the maximum one corresponds to the optimal position of the sliding sub-window. Assume that the correlation coefficient subset corresponding to the optimal sub-window in Figure 2 is $\mathbf{C}_{m^{*},n^{*}}$ and the corresponding pixel subset is $\mathbf{S}_{m^{*},n^{*}}$. When performing the reconstruction, $\mathbf{C}_{m^{*},n^{*}}$ is first reshaped and expanded from a one-dimensional vector to a two-dimensional matrix (denoted as $\mathbf{C}'$); in the expanded matrix, the values within each row are the same, that is, each row is filled with the correlation coefficient of the corresponding pixel. The reconstruction is then carried out according to Formula (7).
2.2. Dimensionality Reduction and Classifier
In the proposed NSW-PCA-SVM model, PCA [51] is used for dimensionality reduction of the data reconstructed by NSW, and an RBF-kernel SVM [51] is used for classification. In Section 2.2, PCA and the RBF-kernel SVM are introduced in two subsections, respectively.
2.2.1. Principal Component Analysis
Principal component analysis (PCA) is one of the most commonly used dimensionality reduction algorithms. It learns the low-dimensional representation of data to achieve the purpose of dimensionality reduction and denoising.
After the reconstruction of the original data using the NSW method, a two-dimensional matrix with B rows and N columns is obtained, denoted as $\mathbf{R}$. The matrix contains N samples, and each sample is a B-dimensional vector. Before using PCA for dimensionality reduction, it is necessary to centralize the reconstructed data, that is, $\mathbf{r}_i \leftarrow \mathbf{r}_i - \frac{1}{N}\sum_{j=1}^{N}\mathbf{r}_j$, where $\mathbf{r}_i$ represents a sample in the reconstructed dataset.
Assuming that the reconstructed data needs to be reduced from B dimensions to d dimensions, the purpose of PCA is to find a two-dimensional transformation matrix with B rows and d columns, denoted as $\mathbf{W}$. The variance of the projected sample points is expected to be maximized, where the projection of the sample point $\mathbf{r}_i$ is $\mathbf{W}^{T}\mathbf{r}_i$. Therefore, the variance of the projected samples is $\sum_{i}\mathbf{W}^{T}\mathbf{r}_i\mathbf{r}_i^{T}\mathbf{W}$, and the optimization goal is as follows:

$$\max_{\mathbf{W}}\ \operatorname{tr}\!\left(\mathbf{W}^{T}\mathbf{R}\mathbf{R}^{T}\mathbf{W}\right)\quad \text{s.t.}\quad \mathbf{W}^{T}\mathbf{W}=\mathbf{I}, \quad (10)$$

where $\operatorname{tr}(\cdot)$ represents the trace of a matrix, and each column vector (denoted as $\mathbf{w}_i$) in $\mathbf{W}$ is a standard orthonormal basis vector, which means $\left\|\mathbf{w}_i\right\|_2 = 1$ and $\mathbf{w}_i^{T}\mathbf{w}_j = 0$ for $i \neq j$.
Using the Lagrange multiplier method for Equation (10) gives $\mathbf{R}\mathbf{R}^{T}\mathbf{w}_i = \lambda_i\mathbf{w}_i$, where $\lambda_i$ represents an eigenvalue of the covariance matrix $\mathbf{R}\mathbf{R}^{T}$. Assuming that the eigenvalues are sorted as $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_B$, then $\mathbf{W} = (\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_d)$, and the dimensionality-reduced data is $\mathbf{R}_d = \mathbf{W}^{T}\mathbf{R}$.
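The closed-form solution above can be written in a few lines of numpy; the following is a minimal eigendecomposition sketch of PCA on a synthetic, centered matrix (a library implementation such as scikit-learn's PCA would be used in practice).

```python
import numpy as np

B, N, d = 50, 200, 10
R = np.random.rand(B, N)                       # reconstructed data: B rows, N columns

R = R - R.mean(axis=1, keepdims=True)          # centralize each band
cov = R @ R.T                                  # B x B scatter matrix (covariance up to a constant)
eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
W = eigvecs[:, ::-1][:, :d]                    # top-d eigenvectors as columns -> B x d
R_d = W.T @ R                                  # dimensionality-reduced data, d x N
print(R_d.shape)
```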
2.2.2. RBF-Kernel Support Vector Machine
Support Vector Machine (SVM) is one of the most influential supervised classification algorithms and is originally designed for binary classification tasks. By combining multiple binary-class SVMs, a multi-class classifier can be obtained. A common combination strategy is One-Versus-One (OVO), which implements multi-class classification by training an SVM between every two categories. Assuming classification on a dataset with k categories, the number of binary-class SVMs that OVO-SVMs need to train is $k(k-1)/2$. Considering the classification results of all the binary-class SVMs, the predicted value with the highest frequency is the final classification result.
For a given training set $D=\{(\mathbf{x}_1,y_1),(\mathbf{x}_2,y_2),\dots,(\mathbf{x}_l,y_l)\}$, where $y_i \in \{-1,+1\}$, the binary-class SVM needs to find a hyperplane (i.e., $\mathbf{w}^{T}\mathbf{x}+b=0$) that best separates the sample points. Assuming that the hyperplane correctly classifies the training samples, for $(\mathbf{x}_i,y_i)\in D$, there is

$$\begin{cases}\mathbf{w}^{T}\mathbf{x}_i+b \ge +1, & y_i = +1,\\ \mathbf{w}^{T}\mathbf{x}_i+b \le -1, & y_i = -1.\end{cases} \quad (12)$$

Let the sum of the distances from the two heterogeneous support vectors to the hyperplane be

$$\gamma=\frac{2}{\left\|\mathbf{w}\right\|}, \quad (13)$$

where the support vectors refer to the sample points closest to the hyperplane, i.e., those for which the equality in Formula (12) holds. The distance $\gamma$ in Formula (13) is also called the margin. Therefore, the goal of SVM is to find the separating hyperplane with the maximum margin, that is, to find the $\mathbf{w}$ and b that maximize $\gamma$ and satisfy the constraints in Formula (12). Since maximizing $\gamma$ only requires maximizing $\left\|\mathbf{w}\right\|^{-1}$, which is equivalent to minimizing $\left\|\mathbf{w}\right\|^{2}$, the problem becomes

$$\min_{\mathbf{w},b}\ \frac{1}{2}\left\|\mathbf{w}\right\|^{2}\quad \text{s.t.}\quad y_i\left(\mathbf{w}^{T}\mathbf{x}_i+b\right) \ge 1,\ i=1,2,\dots,l. \quad (14)$$

This equivalent transformation makes solving Formula (14) a convex optimization problem.
Obviously, the above SVM solves the classification problem in a linear manner. For linearly inseparable data, the training samples can be made linearly separable in a transformed feature space by introducing a kernel function. The Gaussian kernel function (also known as the Radial Basis Function, RBF) is one of the most commonly used kernel functions:

$$\kappa\!\left(\mathbf{x}_i,\mathbf{x}_j\right)=\exp\!\left(-\frac{\left\|\mathbf{x}_i-\mathbf{x}_j\right\|^{2}}{2\sigma^{2}}\right),$$

where $\sigma$ is the bandwidth (width) of the Gaussian kernel.
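In practice, both the OVO combination and the RBF kernel are available off the shelf; the sketch below fits an RBF-kernel SVM with scikit-learn on synthetic data. Scikit-learn's SVC internally trains the k(k-1)/2 one-versus-one classifiers, and its gamma parameter corresponds to 1/(2σ²) in the kernel above.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic 3-class data: 60 samples with 10 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
y = np.repeat([0, 1, 2], 20)
X[y == 1] += 2.0                      # shift the classes apart so they are separable
X[y == 2] -= 2.0

# RBF-kernel SVM; gamma plays the role of 1/(2*sigma^2) in the Gaussian kernel.
clf = SVC(kernel="rbf", gamma=0.5, C=10.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```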
4. Experimental Results and Analysis
In this section, the classification performance of the proposed model was tested and analyzed experimentally on three public datasets. Firstly, the values of the two main parameters in the NSW-PCA-SVM model (i.e., the neighborhood size and the principal component number) were determined. Secondly, the best classification results of NSW-PCA-SVM on the three datasets are given and compared with the best results of two basic pixel-wise classification models and six state-of-the-art comparison models. Finally, the classification results of NSW-PCA-SVM and the comparison models under different training set sample sizes are presented.
4.1. Model Validation
There are two key parameters, namely the neighborhood size and the principal component number, which have a significant impact on the performance of the NSW-PCA-SVM model. The value of the neighborhood size (denoted as w) determines the value of a in Formula (2), i.e., $a = (w-1)/2$. According to the description of the NSW algorithm in Section 2.1, the value of w directly affects the number of related pixels used during reconstruction. When the value of w is large, the reconstructed pixels are more affected by the related pixels. Conversely, when the value of w is small, the reconstructed pixels are less affected by the related pixels.
Since each dataset has a different spatial resolution, it is necessary to conduct separate experiments on each dataset when determining the value of w, and a range of candidate values of w is tested. For the NSW-SVM model, the OAs corresponding to different values of w were obtained, and the optimal value of w is the one corresponding to the largest OA. Since the result of NSW-PCA-SVM is determined by two parameters, when determining the optimal value of w, it is necessary to consider the optimal value of the principal component number corresponding to each value of w.
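The search described here is essentially a small grid over w and, for NSW-PCA-SVM, over the principal component number c. The sketch below shows that loop structure; to keep it self-contained and runnable, a simple w × w mean filter stands in for the NSW reconstruction, and cross-validated OA stands in for the paper's fixed training/test protocol.

```python
import numpy as np
from scipy.ndimage import uniform_filter
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

H, W, B = 30, 30, 60
X = np.random.rand(H, W, B)                      # synthetic HSI cube
Y = np.random.randint(0, 5, size=(H, W))         # synthetic labels, 0 = unlabeled
mask = Y > 0

best = (None, None, -1.0)
for w in (3, 5, 7, 9):                           # candidate neighborhood sizes
    Xw = uniform_filter(X, size=(w, w, 1))       # stand-in for the NSW step
    R, y = Xw[mask], Y[mask]
    for c in (5, 10, 20):                        # candidate principal component numbers
        Rc = PCA(n_components=c).fit_transform(R)
        oa = cross_val_score(SVC(kernel="rbf"), Rc, y, cv=3).mean()
        if oa > best[2]:
            best = (w, c, oa)
print("best (w, c, OA):", best)
```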
Figure 3 shows the OAs of NSW-SVM and NSW-PCA-SVM corresponding to different values of w on the Indian Pines, Salinas and PaviaU datasets. As shown in Figure 3a, on the Indian Pines dataset, NSW-SVM and NSW-PCA-SVM achieved their highest OAs at the same value of w. As shown in Figure 3b,c, on the Salinas and PaviaU datasets, the values of w at which NSW-SVM and NSW-PCA-SVM achieved their highest OAs were close to each other; the selected values are listed in Table 4.
According to the trend of each curve in Figure 3, the following three conclusions can be drawn:
(1) All the curves basically showed a trend of first increasing and then decreasing, which is particularly obvious in Figure 3a,b. In the NSW method, the value of w determines the number of related pixels in the neighborhood. When the value of w is larger, the neighborhood contains relatively more related pixels, including both homogeneous and heterogeneous pixels. Therefore, although a larger value of w makes the reconstructed pixels contain richer useful information, which is provided by homogeneous pixels, it also makes them contain more interference information, which is provided by heterogeneous pixels. On the contrary, a smaller value of w makes the reconstructed pixels contain less interference information, but at the same time less useful information. Therefore, a value of w that is either too large or too small will cause the classification accuracy of the NSW-PCA-SVM model to decrease.
(2) For the same dataset, the optimal values of w in NSW-SVM and NSW-PCA-SVM are close. Therefore, the optimal value of w in NSW-PCA-SVM can be roughly determined according to the optimal value of w in NSW-SVM, so as to reduce the experimental effort required for parameter tuning.
(3) When NSW-SVM and NSW-PCA-SVM achieved their best classification results, the values of w on the Salinas dataset were both greater than those on the Indian Pines dataset. The reason is that the spatial resolution of the Salinas dataset is higher than that of the Indian Pines dataset. When other conditions are the same or similar, the higher the spatial resolution of an HSI, the more homogeneous pixels the neighborhood can contain.
For any non-adaptive dimensionality reduction algorithm, the dimensionality of the data after reduction needs to be determined in advance. The principal component number of PCA is the dimensionality of the dataset after reduction, denoted as c. When experimentally determining the value of c, it is also necessary to consider the optimal value of w corresponding to each value of c.
It can be seen from Figure 4 that the three curves generally show a trend of first increasing and then decreasing. Therefore, the possible range of the optimal value of c can first be located by searching over a sparse grid of candidate values. According to Figure 4, the optimal value of c in NSW-PCA-SVM on each of the three datasets can then be determined; the selected values are listed in Table 4.
According to the experimental results, the parameter settings of NSW-SVM and NSW-PCA-SVM on the three datasets are obtained, as shown in Table 4.
The experimental results of the five state-of-the-art comparison models were obtained with the codes provided by the authors of the original papers, so the parameters of each comparison model were set according to the original papers. For the basic SVM model and the PCA-SVM model, the best parameters were determined experimentally.
SVM: The kernel function is RBF, the kernel coefficient is 0.125, and the regularization parameter is 200. These parameter settings are used in the experiments on all three datasets, and apply to the SVM part of all models in this paper.
PCA-SVM: The principal component number of PCA is 6 on the Indian Pines dataset, 4 on the Salinas dataset, and 11 on the PaviaU dataset.
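The baseline settings listed above map directly onto scikit-learn estimators; the following configuration sketch uses the quoted values (the per-dataset principal component numbers are those of the PCA-SVM baseline), while the fitting and evaluation steps are left out.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# RBF-kernel SVM settings shared by the SVM part of all models in this paper:
# kernel coefficient (gamma) = 0.125, regularization parameter (C) = 200.
svm = SVC(kernel="rbf", gamma=0.125, C=200)

# PCA-SVM baseline: principal component number per dataset (quoted above).
n_components = {"Indian Pines": 6, "Salinas": 4, "PaviaU": 11}
pca_svm = {name: make_pipeline(PCA(n_components=c), SVC(kernel="rbf", gamma=0.125, C=200))
           for name, c in n_components.items()}
```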
4.2. Classification Results on the Indian Pines Dataset
In Section 4.2, the classification performances of NSW-SVM and NSW-PCA-SVM were tested on the Indian Pines dataset. Table 5 shows the specific classification results of each model, including the classification accuracy of each class, the overall accuracy (OA), the average accuracy (AA) and the Kappa coefficient ($\kappa$); the sample sizes of the training set are shown in Table 1.
According to Table 5, on the Indian Pines dataset, NSW-PCA-SVM achieved the best classification result (OA = 91.40%). The Kappa coefficient of each model is mainly determined by its OA, which is why NSW-PCA-SVM also achieved a high Kappa value ($\kappa$ = 90.20%). However, the proposed model performed worse on AA than the six state-of-the-art comparison models. By analyzing the per-class results, it can be seen that NSW-PCA-SVM achieved low classification accuracy for samples of categories 1, 7 and 9. According to Table 1, the Indian Pines dataset has a small number of samples in categories 1, 7 and 9, which makes it difficult for the NSW method to obtain effective spatial information and results in poor classification of these categories.
By comparing these results with those of the basic SVM model and the PCA-SVM model, it can be seen that the eight models utilizing the spatial information of HSIs significantly improved the classification accuracy.
Figure 5 shows the classification maps of each model on the Indian Pines dataset. According to Figure 5, the classification maps of the two pixel-wise classification models (SVM and PCA-SVM) contained a large amount of classification noise, while the classification maps of the models utilizing spatial information contained much less. In addition, by observing the classification maps of NSW-SVM and NSW-PCA-SVM, it can be seen that the maps of the two models contained less noise for the categories with more samples and, conversely, more noise for the categories with fewer samples.
In order to investigate the impact of the training set size on the classification accuracy of the models, this paper tested the OAs of each model under different training set sizes, as shown in Figure 6. According to Figure 6, as the sample size of the training set increases, the OAs of NSW-SVM and NSW-PCA-SVM increase by smaller margins than those of the other comparison models, and the OAs of the six state-of-the-art classification models get closer to that of NSW-PCA-SVM. In particular, when the number of samples in the training set exceeds 60 samples/class, the overall accuracy of NSW-PCA-SVM is worse than that of the Two-Stage model. Figure 6 thus shows the advantage of NSW-PCA-SVM when the training set sample size is small.
4.3. Classification Results on the Salinas Dataset
In Section 4.3, the classification performances of NSW-SVM and NSW-PCA-SVM were tested on the Salinas dataset. Table 6 shows the specific classification results of each model, including the classification accuracy of each category, the overall accuracy (OA), the average accuracy (AA) and the Kappa coefficient ($\kappa$); the sample sizes of the training set are shown in Table 2.
According to Table 6, on the Salinas dataset, NSW-PCA-SVM achieved the best classification result, while NSW-SVM achieved the second-best result. Different from the Indian Pines dataset, the AAs of the two proposed models on the Salinas dataset were better than those of the comparison models, except for the Two-Stage model. According to Table 2, although the numbers of samples in the categories of the Salinas dataset are not similar, there is no category with a particularly small number of samples.
Figure 7 shows the classification maps of each model on the Salinas dataset. According to Figure 7, the classification noise contained in the maps of NSW-SVM and NSW-PCA-SVM was far less than that of the other classification models, especially the two basic models. According to Figure 7k, it can be seen that the distribution of samples of the same class is very concentrated, which enables the NSW method to effectively extract the spatial information of the Salinas dataset and improve the classification results.
Figure 8 shows the OAs of each model on the Salinas dataset under different training set sample sizes. According to Figure 8, when the sample size of the training set was small, the OA of NSW-PCA-SVM was much higher than that of the comparison models. However, as the sample size of the training set increased, the advantage of the proposed model over the comparison models gradually weakened; in particular, when the sample size of the training set reached 100 samples/class, the OA of SDWT-2DCT-SVM was very close to that of NSW-PCA-SVM, and the OA of Two-Stage was higher than that of NSW-PCA-SVM. Figure 8 also reflects the superiority of NSW-PCA-SVM when the training set sample size is small.
4.4. Classification Results on the PaviaU Dataset
In Section 4.4, the classification performances of NSW-SVM and NSW-PCA-SVM were tested on the PaviaU dataset. Table 7 shows the specific classification results of each model, including the classification accuracy of each category, the overall accuracy (OA), the average accuracy (AA) and the Kappa coefficient ($\kappa$); the sample sizes of the training set are shown in Table 3.
According to Table 7, on the PaviaU dataset, NSW-PCA-SVM also achieved the best overall classification result. In addition, although NSW-PCA-SVM achieved the best OA, its AA was slightly lower than that of Two-Stage.
Figure 9 shows the classification maps of each model on the PaviaU dataset. According to Figure 9, the classification map of NSW-PCA-SVM contained less noise, which corresponds to the classification results in Table 7. According to Figure 9k, it can be seen that, except for a few categories, the spatial distribution of samples of the same category is scattered. Therefore, it is difficult for the NSW method to extract the spatial information of the PaviaU dataset as effectively.
Figure 10 shows the OAs of each model on the PaviaU dataset under different training set sample sizes. It can be seen from Figure 10 that NSW-PCA-SVM achieved greater advantages over the six state-of-the-art models when the sample size of the training set was small. However, as the sample size of the training set increased, the advantage gradually weakened. The result is similar to those on the first two datasets; for example, when the sample size of the training set exceeds 60 samples/class, the OA of the Two-Stage model exceeds that of NSW-PCA-SVM.
Table 8 shows the running times of the proposed NSW-SVM and NSW-PCA-SVM and the eight models used for comparison. It can be seen that the running times of the original SVM model and the PCA-SVM model are very short; the reason is that the SVM and PCA routines called by the program are actually implemented in the C programming language. Compared with the comparison models, the two models based on NSW require more computational time. The main computational cost lies in the calculation of the correlation coefficient matrices according to Formula (8). In fact, after the neighborhoods are divided using Formula (2), the reconstruction of each target pixel is independent of the others. Therefore, parallel computing techniques can be used to improve the calculation speed of the NSW algorithm; if a GPU is used for acceleration, the running time of the NSW algorithm can be greatly reduced.
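Because each labeled pixel is reconstructed independently once the neighborhoods are available, the per-pixel work can be distributed over CPU cores with a generic process pool. The sketch below illustrates the idea with joblib; reconstruct_pixel is a hypothetical, simplified per-pixel routine (a correlation-weighted mean over the whole neighborhood) rather than the full Algorithm 1.

```python
import numpy as np
from joblib import Parallel, delayed

def reconstruct_pixel(Xp, i, j, a):
    # Hypothetical, simplified per-pixel routine: a correlation-weighted mean
    # over the whole neighborhood (the sub-window search of Algorithm 1 is omitted).
    B = Xp.shape[2]
    nb = Xp[i:i + 2 * a + 1, j:j + 2 * a + 1, :].reshape(-1, B)
    t = Xp[i + a, j + a, :]
    fc = nb - nb.mean(axis=1, keepdims=True)
    tc = t - t.mean()
    rho = (fc @ tc) / (np.linalg.norm(fc, axis=1) * np.linalg.norm(tc) + 1e-12)
    w = np.clip(rho, 0.0, None)[:, None]          # keep non-negative weights only
    return (nb * w).sum(axis=0) / (w.sum() + 1e-12)

H, W, B, a = 30, 30, 60, 2
X = np.random.rand(H, W, B)
Y = np.random.randint(0, 5, size=(H, W))
Xp = np.pad(X, ((a, a), (a, a), (0, 0)), mode="constant")
targets = [(i, j) for i in range(H) for j in range(W) if Y[i, j] > 0]

# Reconstruct all labeled pixels in parallel across the available CPU cores.
R = np.asarray(Parallel(n_jobs=-1)(delayed(reconstruct_pixel)(Xp, i, j, a) for i, j in targets))
print(R.shape)
```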
5. Discussion
In the experimental part, we tested the proposed NSW-PCA-SVM model and the comparison models on three datasets. The performance of all models on the three datasets is relatively consistent, so the following analysis is based on the experimental results on the Indian Pines dataset. By analyzing the experimental results, the following conclusions can be drawn:
(1) The two basic models, SVM and PCA-SVM, only consider the spectral information of the HSI, so their classification accuracy is worse than that of the other models, which additionally consider spatial information. The classification accuracy of the NSW-SVM model (OA = 87.35%) is much higher than that of the SVM model (OA = 53.29%), and the classification accuracy of the NSW-PCA-SVM model (OA = 91.40%) is also much higher than that of the PCA-SVM model (OA = 58.44%), which shows that the NSW method can effectively extract the spatial information of HSIs and significantly improve the classification performance of the models.
(2) The classification accuracy of the five comparison models, i.e., CDCT-WF-SVM, CDCT-2DCT-SVM, SDWT-2DWT-SVM, SDWT-WF-SVM and SDWT-2DCT-SVM, is significantly higher than that of the SVM model and the PCA-SVM model. Among them, the CDCT-2DCT-SVM model (OA = 83.26%) achieved the highest classification accuracy. In fact, the CDCT-2DCT-SVM model and the other four models only perform spatial filtering on the noise component separated by the spectral filter, so part of the reconstructed data contains no spatial information. In contrast, the NSW method considers all spectral bands when extracting the spatial information of HSIs. Therefore, the classification results of the NSW-PCA-SVM model (OA = 91.40%, AA = 84.00%, Kappa = 90.20%) were better than those of the CDCT-2DCT-SVM model (OA = 83.26%, AA = 89.48%, Kappa = 81.07%) and the other four models.
(3) In our comparative experiment, we chose a “processing after classification” model, namely Two-Stage. Since the main classification stage of the Two-Stage model is based on pixel-wise classification, when the sample size of the training set is small, the OA of Two-Stage (OA = 89.02%) is lower than that of NSW-PCA-SVM (OA = 91.40%). The Two-Stage model uses a variational denoising method to restore the classification map, which makes the model less affected by the uneven sample distribution, so the AA of Two-Stage (AA = 94.64%) is higher than that of the NSW-PCA-SVM model (AA = 84.00%). It can also be seen from Figure 5 that the classification map obtained with the Two-Stage model contains less noise, but there are cases where all pixels in a certain area are classified incorrectly.
(4) When the number of samples in the training set is small, such as 20 samples/class, the advantage of the NSW-PCA-SVM model in OA over the comparison models can reach 2.38–38.11%. According to Figure 6, Figure 8 and Figure 10, further reducing the number of training samples, such as to 10 samples/class, would further expand the advantage of the NSW-PCA-SVM model. On the contrary, as the number of samples in the training set increases, the advantage of the NSW-PCA-SVM model decreases continuously. According to Table 8, the NSW-PCA-SVM model requires more computational time than the other models, although it can be further optimized. Therefore, using NSW-PCA-SVM for classification pays off mainly when the number of samples in the training set is small, such as 10, 20 or 40 samples/class.
Another disadvantage of the NSW-PCA-SVM model is that when the spatial distribution of homogeneous samples in the HSI is scattered, it is difficult for the NSW method to extract spatial information very effectively. For example, on the PaviaU dataset, the classification accuracy of NSW-PCA-SVM did not have a great advantage; according to Table 7, the advantage of the NSW-PCA-SVM model in OA is only 1.61–18.40% over the comparison models.
6. Conclusions
Based on the correlation between pixels, a Nested Sliding Window (NSW) method is proposed to extract the spatial features of HSIs. The NSW method can be used to reconstruct the pixels of HSIs, and the reconstructed pixels contain the information of the original pixels and the pixels that are in a spatially adjacent relationship with them. For the reconstructed data, PCA is used for dimensionality reduction to further eliminate the noise in the spectral dimension. Finally, the RBF-kernel SVM is used to classify the processed data. The NSW-PCA-SVM model has been tested experimentally on three public datasets. By comparing with the SVM model and the PCA-SVM model that only consider the spectral information of HSIs, the proposed model based on the NSW method can significantly improve the classification accuracy. Compared with five filter-based comparison approaches, NSW can extract spatial features for all spectral bands of HSIs. Compared with the “processing after classification” model, i.e., the Two-Stage model, the NSW-PCA-SVM model also has obvious advantages when the training set sample size is small. Therefore, our main contribution is to propose an effective classification model with limited training samples.
The limitations of NSW-PCA-SVM are as follows: when the number of training samples is large, it is difficult for NSW-PCA-SVM to maintain its advantage in classification accuracy, especially since the NSW method requires more computational time; on datasets where homogeneous samples are closely adjacent in space, such as the Indian Pines and Salinas datasets, NSW-PCA-SVM has greater advantages, while on datasets where homogeneous samples are spatially dispersed, such as the PaviaU dataset, it is difficult for NSW-PCA-SVM to achieve significant advantages. For the improvement of the NSW method, we will pursue the following goals in future work: consider using more reasonable measures of the degree of correlation between homogeneous samples; consider more reasonable reconstruction methods to avoid heterogeneous pixels participating in the reconstruction; and consider different neighborhood division methods, including divisions with irregular shapes.