1. Introduction
Hyperspectral images (HSIs) can be viewed as 3D data blocks that contain abundant 2D spatial information and hundreds of spectral bands. Therefore, they can be used to recognize different materials at the pixel level. HSIs have been used widely in many applications such as natural resource monitoring, geological mapping, and object detection [
1]. Over the past few decades, various machine learning methods have been proposed for HSI classification, mainly supervised methods such as multinomial logistic regression (MLR) [
2], SVM [
3], decision trees [
4], MLP [
5], random forest (RF) [
6], and sparse representation classifiers [
7].
Recently, deep learning methods have been introduced for HSI classification to combine both spatial and spectral features [
8,
9,
10,
11,
12,
13,
14]. Considering the limited size of the pixel patches, these deep learning models are usually not very deep. A five-layer CNN [
9] was proposed that combined several techniques, including batch normalization, dropout, and the parametric rectified linear unit (PReLU) activation function. Local contextual features were also learned by the CNN [
8] and diverse region-based CNN [
13] to improve the performance. Different features such as multiscale features [
11], multiple morphological profiles [
10], and diversified metrics [
15] were learned by different CNNs. The CNN was also extended to three dimensions such as the 3D CNN [
16], mixed CNN [
17], and hybrid spectral CNN (HybridSN) [
18]. As new methods such as the fully convolutional network (FCN), attention mechanisms, active learning, and transfer learning have been proposed and used successfully in computer vision problems, they have also been applied to HSI classification. These learning methods include the FCN with an efficient nonlocal module (ENL-FCN), active learning methods [
19,
20], 3D octave convolution with the spatial–spectral attention network (3DOC-SSAN) [
21], the CNN [
22] with transfer learning that uses an unsupervised pre-training step, the superpixel pooling convolutional neural network with transfer learning [
23], and the lightweight spectral–spatial attention network [
24]. Researchers have also tried to learn features more efficiently and robustly using the proxy-based deep learning framework [
25].
Other than CNNs, new architectures have also been applied to HSI classification, such as the deep recurrent neural network [
26] and spectral–spatial attention networks based on RNNs and CNNs, which learn spectral correlations within a continuous spectrum and the relevance of spatial neighbors. The deep support vector machine (DSVM) [
27] extended the SVM in the deep direction. Harmonic networks [
28] were proposed using circular harmonics instead of CNN kernels. These were extended as naive Gabor networks [
29] for HSI classification, which can reduce the number of learning parameters. A cascaded dual-scale crossover network [
30] was proposed to extract more features without extending the architecture in the deep direction. Recently, a recurrent feedback convolutional neural network [
31] was proposed to overcome overfitting, and a generative adversarial minority oversampling method [
32] was proposed to deal with imbalanced data in HSIs.
A single HSI dataset usually cannot provide a large number of training samples, which makes it inefficient to train deep learning models with many hyperparameters and layers. Incremental learning is learning new knowledge without forgetting previously learned knowledge, thereby overcoming catastrophic forgetting [
33]. It also allows the learning model to be generated according to the complexity of the learning task, which is useful for HSI classification with limited training samples. Researchers have proposed different incremental methods such as elastic weight consolidation (EWC) [
34], which preserves old knowledge by selectively slowing the updates of important weights, and incremental moment matching (IMM) [
35], which incrementally matches the moments of the posterior distributions of the parameters of Bayesian neural networks. Another similar idea is scalable learning, which mainly includes multistage scalable learning and tree-like scalable learning. The parallel, self-organizing, hierarchical neural networks (PSHNNs) [
36] and parallel consensual neural networks (PCNNs) [
37] have been proposed, which usually combine multiple stages of neural networks with an instance rejection mechanism or statistical consensus theory. The final output of these networks is the consensus result among all the stages of the neural networks. Scalable-effort classifiers [
38,
39] were proposed, which consist of multiple stages of classifiers with increasing architectural complexity. A Tree-CNN was proposed [
40], in which deep learning models are organized hierarchically to learn incrementally. Conditional deep learning (CDL) [
41] was proposed to activate additional convolutional layers only for inputs that are hard to classify. Stochastic configuration networks (SCNs) [
42] were proposed, which grow the hidden units incrementally using a stochastic configuration mechanism. In addition to these learning methods, researchers have also studied learning security, such as adversarial examples [
43] and the backdoor attack on multiple learning models [
44].
In addition to the deep direction, the learning model can be generated in wide [
45,
46] or both wide and deep [
47] directions. It has been shown that the training process of wide fully connected neural networks can be described by the evaluation of a Gaussian process, and wide neural networks usually generalize better [
45,
46]. Recently, researchers have also tried to combine HSI with LiDAR data for land cover classification, for example using discriminant correlation analysis [
48] and the inverse coefficient of variation features and multilevel fusion method [
49].
In this paper, we propose a dynamic wide and deep neural network (DWDNN) for hyperspectral image classification, which combines wide and deep learning and generates a learning model with the proper architectural complexity for different learning data and tasks. It is based on multiple dynamically organized efficient wide sliding windows and subsampling (EWSWS) networks. Each EWSWS network has multiple layers of transform kernels with sliding windows and strides, which can be extended in the wide direction to learn both the spatial and spectral features sufficiently. The parameters of these transform kernels can be obtained from randomly chosen training samples, and the number of outputs of the transform kernels can be reduced by subsampling. With multiple EWSWS layers combined in the deep direction, spatial and spectral features from the lower level to the higher level can be learned efficiently with a proper configuration of the hyperparameters of the EWSWS networks. The EWSWS networks can be generated one by one dynamically, so the DWDNN can learn features from the HSI data more smoothly and efficiently and thus overcome overfitting. The weights of the DWDNN are mainly in the fully connected layer and can be learned easily using iterative least squares. The contributions of the proposed DWDNN are as follows (a minimal sketch of a single EWSWS layer is given after this list):
Extracting spatial and spectral features from the low level to the high level efficiently by the EWSWS network with a proper architectural complexity;
Training the DWDNN easily, because the parameters of the transform kernels in the EWSWS networks can be obtained directly from randomly chosen training samples or with an unsupervised learning method; the only weights to be trained are those in the fully connected layers, which can be computed with iterative least squares;
Generating learning models with the proper architectural complexity according to the characteristics of the HSI data. Therefore, learning can be more efficient and smooth.
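To make the EWSWS layer concrete, the following is a minimal NumPy sketch based only on the description above: a layer slides a fixed window over each input vector with a given stride, projects every window onto transform kernels whose parameters are taken from randomly chosen training windows, and subsamples the responses; the fully connected readout is then fit by regularized least squares. All names are illustrative, and details such as the activation and the subsampling rule are assumptions rather than the exact implementation.

```python
import numpy as np

def ewsws_layer(X, stride, window, n_kernels, n_sub, rng):
    """One EWSWS layer: sliding windows + transform kernels + subsampling (illustrative)."""
    n, d = X.shape
    starts = np.arange(0, d - window + 1, stride)                      # sliding-window positions
    windows = np.stack([X[:, s:s + window] for s in starts], axis=1)   # (n, n_windows, window)
    # Transform-kernel parameters taken from randomly chosen training windows.
    idx = rng.integers(0, n, size=n_kernels)
    pos = rng.integers(0, len(starts), size=n_kernels)
    kernels = windows[idx, pos, :]                                     # (n_kernels, window)
    # Response of every window to every kernel, squashed by a nonlinearity (assumed tanh).
    resp = np.tanh(np.einsum('nwd,kd->nwk', windows, kernels)).reshape(n, -1)
    step = max(1, resp.shape[1] // n_sub)                              # keep roughly n_sub outputs
    return resp[:, ::step]

def fit_readout(features, labels, lam=1e-3):
    """Fully connected weights fit by regularized least squares on one-hot targets."""
    Y = np.eye(labels.max() + 1)[labels]
    A = np.hstack([features, np.ones((features.shape[0], 1))])         # bias column
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ Y)
```

Stacking several such layers gives one EWSWS network (the deep direction), and concatenating the outputs of several EWSWS networks before the readout corresponds to the wide direction.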
The rest of the paper is organized as follows:
Section 2 presents the detailed description of the proposed DWDNN.
Section 3 presents the datasets and the experimental settings.
Section 4 gives the classification results for the HSI data.
Section 5 and
Section 6 provide the discussions and conclusions.
3. Datasets and Experimental Settings
The HSI datasets Botswana, Pavia University, and Salinas [
53] used in the experiments are summarized in
Table 1 and described as follows:
(1) Botswana: This dataset was acquired by the NASA EO-1 satellite over the Okavango Delta, Botswana, in 2001–2004. The data have a 30 m pixel resolution over a 7.7 km strip in 242 bands, covering the spectral range of 400–2500 nm. After removing the uncalibrated and noisy bands that cover water absorption, 145 bands are included as the candidate features. There are 14 classes of land cover;
(2) Pavia University: The data were acquired by the ROSIS sensor over Pavia. The images are (after discarding the pixels without information, the image is ). There are 9 classes in total in the images;
(3) Salinas: This dataset was acquired over Salinas Valley, California, with 204 bands (after discarding the water absorption bands), and the size is . There are 16 classes in total in the images.
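For completeness, these datasets are commonly distributed as MATLAB .mat files containing the hyperspectral cube and a ground-truth label map. A minimal SciPy loading sketch is given below; the file names and variable keys shown in the comment are assumptions about the commonly circulated versions of the data, not something specified in this paper.

```python
from scipy.io import loadmat

def load_hsi(cube_path, gt_path, cube_key, gt_key):
    """Return the hyperspectral cube (H, W, B) and its ground-truth label map (H, W)."""
    cube = loadmat(cube_path)[cube_key]
    gt = loadmat(gt_path)[gt_key]
    return cube, gt

# Hypothetical example for Pavia University:
# cube, gt = load_hsi('PaviaU.mat', 'PaviaU_gt.mat', 'paviaU', 'paviaU_gt')
```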
The experiments were performed on a desktop with an Intel i7-8700K CPU, an NVIDIA RTX 2080 Ti GPU, and 32 GB of memory. The number of hyperspectral bands was reduced to 15 using principal component analysis. Different patch sizes of the HSIs have different effects on the classification results: the performance usually increases as the patch size increases, but the computational load also increases [
21]. Therefore, there is a trade-off between the patch size and the computational load. We chose
as the patch size for all datasets to ensure that the DWDNN could achieve good performance while keeping the computation efficient. For the hyperspectral data, there was one channel after the selected bands were concatenated as a vector. The proportions of the instances used for training and validation were 0.2 for Pavia University and Salinas; for Botswana, the ratios were 0.14 and 0.01, respectively. The training instances were organized together as a single set to train the DWDNN. The patch size for the proposed method was
. The remaining instances were used for testing. The overall accuracy (OA), average accuracy (AA), and Kappa coefficient [
50] were used to evaluate the performance. The DWDNN was composed of 10 EWSWS networks, and the detailed parameters for each EWSWS network on the different datasets are shown in
Table 2. The sliding window size is given in two formats: integers denote absolute window sizes, and decimals denote window sizes as a fraction of the length of the input vector.
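As a rough illustration of the preprocessing described above, the sketch below reduces the spectral dimension to 15 bands with PCA and cuts a square patch around every labeled pixel, flattening the selected bands of each patch into a single vector. The patch size is left as a parameter (its value is given in the text above), and the border padding mode is an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_bands(cube, n_components=15):
    """PCA over the spectral dimension: (H, W, B) -> (H, W, n_components)."""
    h, w, b = cube.shape
    reduced = PCA(n_components=n_components).fit_transform(cube.reshape(-1, b).astype(np.float64))
    return reduced.reshape(h, w, n_components)

def extract_patches(cube, gt, patch_size):
    """Cut a patch_size x patch_size window around every labeled pixel and flatten it."""
    r = patch_size // 2
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode='reflect')    # assumed border handling
    rows, cols = np.nonzero(gt)                                        # labeled pixels only
    patches = [padded[i:i + patch_size, j:j + patch_size, :].reshape(-1)
               for i, j in zip(rows, cols)]                            # bands concatenated per patch
    labels = gt[rows, cols] - 1                                        # classes re-indexed from 0
    return np.asarray(patches), np.asarray(labels)
```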
4. Performance of Classification with Different Hyperspectral Datasets
For the Botswana dataset, the proposed DWDNN was compared with SVM [
21,
26], multilayer perceptron (MLP) [
50], RBF [
50], the CNN [
50], the recently proposed 2D CNN [
21,
26], the 3D CNN [
21,
26], deep recurrent neural networks (DRNNs) [
21,
26], and the wide sliding window and subsampling (WSWS) network [
50]. MLP was implemented with 1000 hidden units. The RBF network had 100 Gaussian kernels, and the centers of these kernels were chosen randomly from the training instances. The CNN was composed of a convolutional layer with 6 convolutional kernels, a pooling layer (scale 2), a convolutional layer with 12 convolutional kernels, and a pooling layer (scale 2). The patch size was
for MLP and RBF and
for the CNN.
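For reference, the baseline CNN described above (6 convolutional kernels, pooling of scale 2, 12 convolutional kernels, pooling of scale 2, followed by a classifier) can be sketched in PyTorch as follows; the kernel size, activation, input channel count, and patch size are placeholders or assumptions, since they are not all specified here.

```python
import torch.nn as nn

class BaselineCNN(nn.Module):
    """Conv(6) -> pool(2) -> Conv(12) -> pool(2) -> fully connected classifier (illustrative)."""
    def __init__(self, in_channels, patch_size, n_classes, kernel_size=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 6, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),                                 # activation is an assumption
            nn.MaxPool2d(2),
            nn.Conv2d(6, 12, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        side = patch_size // 4                         # two poolings of scale 2
        self.classifier = nn.Linear(12 * side * side, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))
```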
The classification results are shown in
Table 3 and
Figure 5. The proposed DWDNN had the best OA and Kappa coefficient among the compared methods. The OA, AA, and Kappa coefficient of the DWDNN reached 99.64%, 99.60%, and 99.61%, respectively. The test accuracy for each class in the table represents the percentage of samples classified correctly among the total number of samples in the corresponding class. The test accuracies of 12 of the 14 classes reached 100%.
Figure 5 shows the predicted results of the whole hyperspectral image. The proposed DWDNN had much smoother classification results for almost all classes. For example, the class of exposed soils with yellow color in
Figure 5g has much smoother connected regions.
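For clarity, the reported metrics can be computed from the confusion matrix as in the short sketch below (a standard formulation, not code from the paper): OA is the fraction of correctly classified test samples, AA is the mean of the per-class accuracies listed in the table, and the Kappa coefficient measures agreement corrected for chance.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def oa_aa_kappa(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred).astype(np.float64)
    oa = np.trace(cm) / cm.sum()                                   # overall accuracy
    per_class = np.diag(cm) / cm.sum(axis=1)                       # per-class test accuracy
    aa = per_class.mean()                                          # average accuracy
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / cm.sum() ** 2   # chance agreement
    kappa = (oa - pe) / (1.0 - pe)                                 # Cohen's kappa
    return oa, aa, kappa
```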
For the Pavia University and Salinas datasets, the proposed DWDNN was compared with SVM [
17,
54], MLP [
50], RBF [
50], the CNN [
50], the 2D CNN [
17,
55], the 3D CNN [
16,
17], sparse modeling of spectral blocks (SMSB) [
56], and the WSWS Net [
50]. MLP had 1000 hidden units. The RBF network had 2000 Gaussian kernels, and the centers of these kernels were chosen randomly from the training instances. The CNN was composed of a convolutional layer with 6 convolutional kernels, a pooling layer (scale 2), a convolutional layer with 12 convolutional kernels, and a pooling layer (scale 2). The patch sizes were
for MLP,
for RBF, and
for the CNN, respectively.
The classification results for the Pavia University dataset are shown in
Table 4 and
Figure 6. The proposed DWDNN achieved the best classification results among the compared methods. The OA, AA, and Kappa coefficient of the DWDNN were 99.69%, 99.31%, and 99.59%, respectively. The test accuracy for each class in the table represents the percentage of samples classified correctly among the total number of samples in the corresponding class.
Figure 6 shows the predicted results of the whole hyperspectral image. The instances without class information were also predicted. The proposed DWDNN and the compared WSWS Net had smoother classification results than the other compared methods. For example, the predicted classes of bare soil with brown color and trees with green color in
Figure 6g are much smoother than the predicted results of the compared methods.
The classification results for the Salinas dataset are shown in
Table 5 and
Figure 7. The proposed DWDNN had the best classification results compared with the other methods, and the OA, AA, and Kappa coefficient were 99.76%, 99.73%, and 99.73%, respectively. The test accuracy for each class in the table represents the percentage of samples classified correctly among the total number of samples in the corresponding class. The instances without class information were predicted, as shown in
Figure 7. The proposed DWDNN had smoother predicted results than the compared methods, which can be seen from the predicted classes of grapes-untrained with purple color and vineyard-untrained with dark yellow color in
Figure 7g.
5. Discussion
5.1. Visualization of the Extracted Features from the DWDNN
In this discussion, the extracted high-level features are visualized to show that the proposed DWDNN can extract discriminative features effectively, which explains its good performance on the HSI classification task.
For the Botswana dataset, the proportions for training and validation were 0.2 and 0.2. The basic parameter group of each single EWSWS network had four EWSWS layers, with the stride number, window size, number of transform kernels, and subsampling number set to: 12, 71, 80, and 40 for the first EWSWS layer; 400, 0.1 of the length of the current input vector, 40, and 20 for the second EWSWS layer; 60, 0.7 of the length of the current input vector, 40, and 20 for the third EWSWS layer; and 2, 0.5 of the length of the current input vector, 20, and 10 for the fourth EWSWS layer, respectively.
For the Pavia University dataset, the proportions for training and validation were 0.2 and 0.2. The basic parameter group of each single EWSWS network had four EWSWS layers, with the stride number, window size, number of transform kernels, and subsampling number set to: 2, 5, 100, and 50 for the first EWSWS layer; 320, 0.1 of the length of the current input vector, 100, and 50 for the second EWSWS layer; 40, 0.2 of the length of the current input vector, 32, and 16 for the third EWSWS layer; and 12, 0.3 of the length of the current input vector, 12, and 6 for the fourth EWSWS layer, respectively. For the Salinas dataset, the settings were the same as in
Table 2.
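To show how compactly such a basic parameter group can be written down, the snippet below encodes the Botswana settings listed above as a plain Python structure; the field names are illustrative, and fractional window sizes are interpreted, as in Table 2, as a fraction of the length of the current input vector.

```python
# Botswana: four EWSWS layers, each described by
# stride, window size, number of transform kernels, and subsampling number.
BOTSWANA_EWSWS_LAYERS = [
    {"stride": 12,  "window": 71,  "kernels": 80, "subsample": 40},
    {"stride": 400, "window": 0.1, "kernels": 40, "subsample": 20},
    {"stride": 60,  "window": 0.7, "kernels": 40, "subsample": 20},
    {"stride": 2,   "window": 0.5, "kernels": 20, "subsample": 10},
]

def resolve_window(window, input_length):
    """Turn a fractional window size into an absolute one for the current input."""
    return window if isinstance(window, int) else max(1, int(round(window * input_length)))
```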
The DWDNN was composed of multiple EWSWS networks, and each EWSWS network had multiple layers to extract features from the low level to the high level. The extracted features of the DWDNN with the hyperspectral datasets are shown in
Figure 8,
Figure 9 and
Figure 10. These extracted features were taken from the training samples and combined in a cascade from the fourth layer of transform kernels in the DWDNN. Each curve in the figures represents an extracted feature vector from a training instance, and instances of the same class are plotted together. For the Botswana dataset, the extracted features from Classes 2, 10, 13, and 14 are shown, with all the training instances of the same class stacked together. It is observed in
Figure 8 that the extracted features of the same class have very similar curves. Classes 10 and 13 have more training samples, but their feature curves are still very similar. For Pavia University, the extracted features from Classes 1, 3, 5, and 7 are shown; for each class, 10% of the training instances were stacked together. It is observed in
Figure 9 that the feature curves within each class are very similar. Class 1 has more training samples, but its feature curves remain similar across these samples. For Salinas, the extracted features from Classes 1, 11, 13, and 14 are shown; again, 10% of the training instances of each class were stacked together. It is observed in
Figure 10 that the feature curves within each class are very similar. Because 210 features were extracted and shown, these feature curves are denser than those of the other two datasets.
It is observed from all the figures for the three datasets that different classes have different feature curves, while the instances in the same class have similar feature curves. This demonstrates that the proposed DWDNN can learn features effectively from the HSI data.
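The feature-curve plots of Figures 8–10 can in principle be reproduced with a few lines of matplotlib, as sketched below: the high-level feature vectors of the instances of one class are simply overlaid as curves (function and variable names are illustrative).

```python
import matplotlib.pyplot as plt

def plot_class_features(features, labels, classes, fraction=1.0):
    """Overlay the high-level feature vectors of the selected classes as curves."""
    fig, axes = plt.subplots(1, len(classes), figsize=(4 * len(classes), 3), sharey=True)
    for ax, c in zip(axes, classes):
        class_feats = features[labels == c]
        keep = max(1, int(fraction * len(class_feats)))   # e.g. fraction=0.1 for Pavia/Salinas
        for vec in class_feats[:keep]:
            ax.plot(vec, linewidth=0.5)
        ax.set_title(f"Class {c}")
        ax.set_xlabel("feature index")
    axes[0].set_ylabel("feature value")
    plt.tight_layout()
    plt.show()
```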
5.2. Smooth Fine-grained Learning with Different Numbers of EWSWS Networks
In this part, the experimental setting of the Botswana data was the same as in the previous discussion.
For the Pavia University and Salinas datasets, the settings were the same as in
Table 2. The main advantage of the DWDNN is that it can learn smoothly with fine-grained hyperparameter settings, because the features can be learned in both the deep and wide directions iteratively. The DWDNN is composed of a number of EWSWS networks: training can start from a single EWSWS network with a basic group of parameters, and the model can then learn incrementally by adding the succeeding EWSWS networks one after the other. Therefore, a DWDNN with the proper architectural complexity can be obtained without overfitting.
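The dynamic growth described above can be summarized by the small sketch below, under the assumption of helper functions (`train_ewsws`, `combine_outputs`, `fit_readout`, `accuracy`) in the spirit of the earlier sketches; this is not the authors' code, only an outline of adding EWSWS networks one by one, refitting the least-squares readout, and stopping when the validation accuracy no longer improves.

```python
def grow_dwdnn(X_tr, y_tr, X_val, y_val, max_networks=10, tol=1e-4):
    """Add EWSWS networks one by one; stop when validation accuracy stops improving."""
    networks, best_acc = [], 0.0
    for k in range(max_networks):
        net = train_ewsws(X_tr, y_tr, seed=k)            # new wide component (hypothetical helper)
        candidate = networks + [net]
        feats_tr = combine_outputs(candidate, X_tr)      # concatenated features of all networks
        readout = fit_readout(feats_tr, y_tr)            # (iterative) least-squares readout
        acc = accuracy(readout, combine_outputs(candidate, X_val), y_val)
        if acc < best_acc + tol:                         # no meaningful improvement: stop growing
            break
        networks, best_acc = candidate, acc
    return networks, best_acc
```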
It is observed in
Table 6,
Table 7 and
Table 8 that the performance can be improved smoothly as the number of EWSWS networks increases. The testing performance of the iteration process with the DWDNN including 10 EWSWS networks is shown in
Figure 11. During the iterations, the testing performance improved steadily. The iterations were stopped by the validation process. In
Table 6,
Table 7 and
Table 8, the testing accuracies rose above 98% as the number of EWSWS networks increased. The number of EWSWS networks can therefore be reduced while still reaching the desired testing performance. It is also observed in
Figure 11 that the iterations can start from a point with good performance, which reduces the number of iterations needed.
5.3. Running Time Analysis
The proposed method was extended in both the wide and deep directions. The number of EWSWS networks, the strides of the sliding windows, and the subsampling ratios at each EWSWS layer can be used to balance the performance and the computational load. The running times, including training and testing, were compared with those of different methods on the Botswana dataset for further discussion. The parameter settings were the same as in
Section 4. The results are shown in
Table 9. Compared with the classical machine learning models MLP and RBF and the recently proposed WSWS network, the proposed DWDNN has a longer training time but still a competitive testing time. This is because the DWDNN is composed of 10 EWSWS networks and needs somewhat more time to train the model iteratively. During testing, it computes quickly because the number of parameters of the DWDNN is reduced greatly by measures such as the strides of the sliding windows and the subsampling of the transform kernel outputs. The proposed DWDNN had both shorter training and testing times than the CNN.
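Training and testing times such as those in Table 9 can be measured in a uniform way, for example as in the generic sketch below, where `model.fit` and `model.predict` stand in for whichever method is being timed.

```python
import time

def time_model(model, X_tr, y_tr, X_te):
    """Return (training time, testing time) in seconds for a fit/predict-style model."""
    t0 = time.perf_counter()
    model.fit(X_tr, y_tr)
    t1 = time.perf_counter()
    model.predict(X_te)
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1
```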