Article

Depth-Wise Separable Convolution Neural Network with Residual Connection for Hyperspectral Image Classification

1 School of Computer and Information Engineering, Henan University, Kaifeng 475004, China
2 Henan Key Laboratory of Big Data Analysis and Processing, Henan University, Kaifeng 475004, China
3 Henan Engineering Laboratory of Spatial Information Processing, Henan University, Kaifeng 475004, China
4 College of Environment and Planning, Henan University, Kaifeng 475004, China
5 Department of Geography, Kent State University, Kent, OH 44240, USA
* Author to whom correspondence should be addressed.
Remote Sens. 2020, 12(20), 3408; https://doi.org/10.3390/rs12203408
Submission received: 13 September 2020 / Revised: 14 October 2020 / Accepted: 15 October 2020 / Published: 17 October 2020
(This article belongs to the Section Remote Sensing Image Processing)

Abstract

Neural network-based hyperspectral image (HSI) classification models have deep structures, which lead to large numbers of training parameters, long training times, and excessive computational cost. Deepened networks are also prone to vanishing gradients, which limits further improvement of classification accuracy. To this end, a residual unit with fewer training parameters was constructed by combining the residual connection with depth-wise separable convolution. As the depth of the network increases, the number of output channels of each residual unit grows linearly in small steps. By stacking these residual units, the deepened network continuously extracts spectral and spatial features while forming a cone-shaped network structure. At the end of the model, a 1 × 1 convolution layer combined with a global average pooling layer replaces the traditional fully connected layer, reducing the number of parameters needed in the network. Experiments were conducted on three benchmark HSI datasets: Indian Pines, Pavia University, and Kennedy Space Center. The overall classification accuracies were 98.85%, 99.58%, and 99.96%, respectively. Compared with other classification methods, the proposed network model achieves higher classification accuracy while spending less time on training and testing samples.

Graphical Abstract

1. Introduction

Hyperspectral images (HSI), produced by an emerging remote sensing technology, can simultaneously record two-dimensional geometric spatial information and one-dimensional continuous spectral information of the target object. This gives HSI the ability of "image-spectrum merging", that is, abstract features that combine the image (spatial) domain with the spectral domain. Geometric spatial information reflects the size, shape, and other external features of the target object, while spectral information reflects its physical structure and chemical composition. Together, they allow HSI to capture comprehensive characteristics of the studied objects. Accordingly, hyperspectral remote sensing is now widely used in analyzing the composition of planets [1], marine plant detection [2], and shallow river bathymetry and turbidity estimation [3].
An important consideration in applying hyperspectral remote sensing technology is how to build a more accurate and effective classification method. Traditional methods such as support vector machines (SVM) [4,5], 3D wavelet transforms [6], and Gaussian mixtures [7] usually use band selection and feature extraction to reduce the dimension of the original image and project it into a low-level feature space. These methods often alter the correlation among the original image bands, lose part of the spectral information, or fail to extract abstract features of the HSI, and these shortcomings limit the accuracy of the classified images.
Over the past few years, with the application and development of deep learning, the convolutional neural network (CNN) has been widely used in image classification [8,9,10], speech recognition [11], target detection [12], and image semantic segmentation [13], showing powerful feature extraction capabilities in these fields. In order to effectively exploit the spatial and spectral features of HSI and achieve more accurate classification, more and more researchers have begun to use CNNs in place of traditional classification methods. Hu et al. [14] applied a convolutional neural network to HSI classification for the first time, constructing a 1D-CNN composed of one convolutional layer and two fully connected layers. However, the model used only spectral information for classification; without considering the spatial information of HSI, its classification accuracy was slightly lower than that of conventional approaches.
Subsequently, Makantasis et al. [15] used random principal component analysis (R-PCA) to reduce the number of spectral channels of the input image, then used a two-layer convolution model to encode the spectral and spatial information of the pixels, and finally completed the classification with a multi-layer perceptron (MLP), achieving high classification accuracy. However, it should be noted that using PCA for dimensionality reduction breaks the spectral continuity and loses some spectral information. To this end, Zhang et al. [16] expanded the training samples of the original HSI data using data augmentation and proposed a multi-channel CNN that used one-dimensional convolution to extract the spectral characteristics of each pixel and two-dimensional convolution to extract the neighborhood spatial features of the target pixel; the two convolution branches were combined to perform classification. Yu et al. [17] also used data augmentation, together with 1 × 1 small convolutions and pooling, to extract features for effective classification. However, the data expansion procedure was relatively tedious, and the increase in training data undoubtedly lengthened the training time. Therefore, the above methods are not optimal choices for alleviating overfitting or improving classification accuracy.
In order to fully extract the spatial-spectral information of HSI and achieve effective and accurate classification with limited training samples, many neural network-based classification models tend to adopt deeper or wider hierarchical structures. Lee et al. [18] proposed a deep context CNN (DC-CNN) model based on the Inception module [19]. The DC-CNN model used convolution kernels of different sizes to combine the extracted spatial-spectral information in the first layer and then used a two-layer residual structure to further extract spatial-spectral features. He et al. [20] constructed an M3D-DCNN model using 3D convolution kernels, characterized by two multi-scale convolutional layers and one ordinary convolutional layer. Moreover, Zhong et al. [21] constructed a spatial-spectral residual network (SSRN) that includes a spectral feature extraction block and a spatial feature extraction block built from 3D kernels. Wang et al. [22] used 1 × 1 and 3 × 3 convolution kernels to extract spectral and spatial features through dense connections for effective classification. Gao et al. [23] proposed a feature multiplexing module (SC-FR) composed of two cross-layer connected 1 × 1 small convolution kernels; the cross-layer combination increased the depth of the model and strengthened the flow and reuse of feature information to achieve accurate classification. Paoletti et al. [24] constructed a deep residual network (pResNet) by stacking pyramidal bottleneck residual units [25] to achieve high classification accuracy. Although the above uses of 3D convolutions or deeper CNNs achieved good classification results to some extent, the deeper layers mean that the network model has more parameters, which not only increases the computational overhead but also demands more capable computer hardware.
In view of the shortcomings of existing research, this work designs a lightweight deep network classification model drawing on the experience of [24,25]. Our model improves the pyramidal residual unit [25] by replacing the standard convolution in the residual unit with depth-wise separable convolution [26], which greatly reduces the number of model parameters and the computational cost. By stacking the improved pyramidal residual units, the reuse of low-level feature information is strengthened, and the vanishing-gradient and overfitting problems of deep networks are alleviated. Moreover, the spatial-spectral information of HSI is used to effectively improve classification accuracy. All convolutional layers in the model outside the residual units use 1 × 1 small convolutions, and a global average pooling layer is used at the end of the model to replace the fully connected layer, further reducing the training parameters and accelerating classification.

2. Model Design

2.1. Depth-Wise Separable Convolution

Depth-wise separable convolution can be decomposed into a depth-wise convolution and a 1 × 1 convolution (also known as point-wise convolution). The depth-wise convolution performs a separate convolution on each channel of the input image and is used to extract spatial features in each dimension; the point-wise convolution is a 1 × 1 standard convolution on the resulting feature maps and is used to merge the feature maps across channels.
As shown in Figure 1, the size of the input image is Df × Df × M, where Df is the height and width of the input image and M is the number of channels. Assume the kernel size is k × k in the depth-wise convolution; the output feature map then has size Dg × Dg × M (Dg is the height and width of the output map), keeping the number of channels of the input image, and serves as the input of the next convolution. For the point-wise convolution, the kernel size is 1 × 1, and the number of channels of each kernel must equal the number of channels of the input feature map. Let the number of kernels be N; the output feature map after this convolution then has size Dg × Dg × N.
Consider input feature maps H of size Df × Df and convolution kernels K of size k × k, with M input channels and N output channels; the output feature maps G have size Dg × Dg. A standard convolution operation can be defined as follows:
$$G_j = \sum_{i=1}^{M} H_i \cdot K_{ij} + b_j, \qquad j = 1, 2, \ldots, N,$$
where $H_i$ is the $i$-th map in H, $G_j$ is the $j$-th map in G, $K_{ij}$ is the $i$-th slice of the $j$-th kernel, and $b_j$ is the bias of the output map $G_j$. The notation $\cdot$ stands for the convolution operator. Let the total number of trainable parameters be $P_1$ (ignoring bias parameters) and the number of floating-point operations be $F_1$ for a standard convolution; they can be calculated as shown in Equations (2) and (3) below:
$$P_1 = k \times k \times M \times N,$$
$$F_1 = k \times k \times M \times N \times D_g \times D_g.$$
From Equation (2), the number of parameters depends on the kernel size, the number of input channels M, and the number of output channels N. Equation (3) shows that the number of floating-point operations depends on $P_1$ and the output feature map size $D_g \times D_g$.
In depth-wise convolution, as shown in Figure 1, each kernel has a single slice that convolves one input channel map, and this process can be defined as:
$$G_j = H_j \cdot K_j + b_j, \qquad j = 1, 2, \ldots, M.$$
Here, $K_j$ is the $j$-th depth-wise convolution kernel. However, depth-wise convolution only filters the input channels; it does not combine them to create new features. Therefore, an additional 1 × 1 standard convolution layer is needed to generate these new features [10]. For a depth-wise separable convolution, the parameter count $P_2$ and the floating-point operation count $F_2$ are the sums over the depth-wise and 1 × 1 point-wise convolutions. Hence, $P_2$ and $F_2$ can be calculated as shown in Equations (5) and (6), respectively:
$$P_2 = k \times k \times M + M \times N,$$
$$F_2 = k \times k \times D_g \times D_g \times M + D_g \times D_g \times M \times N.$$
The ratio of Equations (5) and (2) and the ratio of Equations (6) and (3) are shown in Equations (7) and (8):
$$\frac{P_2}{P_1} = \frac{1}{N} + \frac{1}{k^2},$$
$$\frac{F_2}{F_1} = \frac{1}{N} + \frac{1}{k^2}.$$
It can be clearly seen that the parameters and computations of the depth-wise separable convolution are only $\frac{1}{N} + \frac{1}{k^2}$ times those of the standard convolution, which greatly reduces the parameter and computing costs of the model.
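As a concrete illustration, the following is a minimal PyTorch sketch of a depth-wise separable convolution and a check of the parameter ratio in Equation (7). The module and variable names are illustrative assumptions, not taken from the released code of this paper.

import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1, padding=1):
        super().__init__()
        # Depth-wise convolution: one k x k filter per input channel (groups = in_channels).
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size, stride=stride,
                                   padding=padding, groups=in_channels, bias=False)
        # Point-wise convolution: a 1 x 1 convolution that merges features across channels.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

if __name__ == "__main__":
    M, N, k = 38, 54, 3
    std_params = k * k * M * N                          # Equation (2)
    sep_params = k * k * M + M * N                      # Equation (5)
    print(sep_params / std_params, 1 / N + 1 / k ** 2)  # both values match Equation (7)
    block = DepthwiseSeparableConv(M, N)
    print(sum(p.numel() for p in block.parameters()))   # equals k*k*M + M*N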

2.2. Residual Unit

A deeper neural network tends to converge with difficulty from the beginning and also faces the problem of network degradation [27]. To alleviate this, several studies built HSI classification models [18,21] with residual connections and attempted to solve the gradient dispersion problem of deep networks by stacking residual modules to achieve better classification results. The basic residual unit is shown in Figure 2, where $X_i$ and $X_{i+1}$ are the input and output of the $i$-th residual unit, $F$ represents the residual function, and $H$ denotes the shortcut connection: if the identity mapping [27] is used, then $H(X_i) = X_i$. With these notations, the basic residual unit can be expressed as follows:
$$X_{i+1} = F(X_i, W_i) + X_i.$$
With the shortcut, the skip connections increase the depth of the network without adding extra parameters. Furthermore, this structure improves training efficiency and effectively mitigates network degradation [25].
Although some models constructed from the above unit have achieved good results in HSI classification, they are not optimal. To further improve on existing models, our study introduces the pyramid residual unit [25] to improve classification efficiency, for two reasons. First, the pyramid residual unit is a modification of the basic residual unit with significantly better generalization ability [25], which is very beneficial for classifying hyperspectral images with unbalanced sample distributions. Second, the pyramid residual unit provides a simple way to linearly increase the number of feature map channels in small steps, which greatly reduces the training parameters and computational cost of the model. The pyramid residual unit is shown in Figure 3; unlike the basic residual unit, the last rectified linear unit (ReLU) [28] is removed, and batch normalization (BN) [29] is applied before the first convolution operation. Specifically, the order of execution of the layers is BN→Conv→BN→ReLU→Conv→BN. When the number of channels output by the residual function differs from that of the input, a zero-padded shortcut [25] is used so that element-wise addition can still be performed. In addition, the original standard convolutions in the unit are replaced by depth-wise separable convolutions to reduce the model parameters. In short, the core parts of the network model proposed in this paper are all built from this type of unit. As the network goes deeper, the parameters do not increase significantly, so a lightweight residual classification model is constructed. The specific network structure is introduced in detail in Section 2.3.
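The following is a minimal PyTorch sketch of such a pyramid residual unit (BN→Conv→BN→ReLU→Conv→BN with a zero-padded shortcut, and 2 × 2 average pooling on the shortcut of down-sampling units), assuming the DepthwiseSeparableConv module sketched in Section 2.1. The exact layer arrangement of the released code may differ.

import torch.nn as nn
import torch.nn.functional as F

class PyramidResidualUnit(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # Execution order: BN -> Conv -> BN -> ReLU -> Conv -> BN (no ReLU at the end).
        self.bn0 = nn.BatchNorm2d(in_channels)
        self.conv1 = DepthwiseSeparableConv(in_channels, out_channels, stride=stride)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = DepthwiseSeparableConv(out_channels, out_channels, stride=1)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # Down-sampling units shrink the shortcut spatially with 2 x 2 average pooling.
        self.shortcut_pool = nn.AvgPool2d(2, ceil_mode=True) if stride > 1 else nn.Identity()
        self.extra_channels = out_channels - in_channels

    def forward(self, x):
        out = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(self.bn0(x))))))
        shortcut = self.shortcut_pool(x)
        if self.extra_channels > 0:
            # Zero-pad the channel dimension so that element-wise addition is valid.
            shortcut = F.pad(shortcut, (0, 0, 0, 0, 0, self.extra_channels))
        return out + shortcut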

2.3. Proposed Model for HSI Classification

Here, a deep neural network model was constructed, as shown in Figure 4. In general, the model first reduces the dimension of the input HSI data cube through a 1 × 1 convolution, which extracts abundant spectral information. Then, three residual units, R1, R2, and R3, are adopted to continuously extract both the spatial contextual features and the spectral features of the data cube. Finally, the combination of a 1 × 1 convolution and a global average pooling (GAP) layer, instead of a fully connected layer, fuses the extracted abstract features to complete the final classification. The code of this work will be available at https://github.com/pangpd/DS-pResNet-HSI for the sake of reproducibility.
Furthermore, Table 1 lists the detailed per-layer network configuration, taking the Indian Pines data set (presented in Table 2) as an example. First, the processed 3D HSI data with shape 11 × 11 × 200 (200 is the number of bands) is fed into the network. In the first layer, C1, 38 1 × 1 kernels recombine the channel features of the original input data while retaining the spatial information. Then, the R1 block, consisting of two 3 × 3 kernels with stride = 1, is adopted to preserve spatial edge information. Following the R1 block, the first layer of R2 is a 3 × 3 filter with stride = 2 that performs down-sampling, and the 3 × 3 kernel in its second layer uses a stride of 1 to generate 6 × 6 × 70 feature tensors. Similarly to R2, R3 continues the down-sampling and produces a 3 × 3 × 86 feature cube with a smaller spatial size. Finally, C2, the last convolutional layer of the model, uses 16 1 × 1 kernels to compress the discriminative feature maps and passes the generated features to the GAP layer, which transforms them into a one-dimensional vector of size 1 × 16. Next, we combine the characteristics of HSI to explain why the network was designed in this way.
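For reference, the following is a sketch of this overall structure for the Indian Pines configuration in Table 1 (11 × 11 × 200 input, C = 38, α = 48, R = 3, 16 classes), assuming the PyramidResidualUnit sketched in Section 2.2; class and variable names are illustrative.

import torch
import torch.nn as nn

class DSPyramidResNet(nn.Module):
    def __init__(self, bands=200, num_classes=16, init_channels=38, alpha=48, num_units=3):
        super().__init__()
        # C1: 1 x 1 convolution that recombines the spectral channels.
        self.c1 = nn.Sequential(nn.Conv2d(bands, init_channels, 1, bias=False),
                                nn.BatchNorm2d(init_channels), nn.ReLU(inplace=True))
        units, channels = [], init_channels
        for i in range(num_units):
            next_channels = channels + alpha // num_units   # Equation (11): add alpha/R channels per unit
            stride = 1 if i == 0 else 2                     # R1 keeps the size; R2 and R3 down-sample
            units.append(PyramidResidualUnit(channels, next_channels, stride=stride))
            channels = next_channels
        self.units = nn.Sequential(*units)
        self.c2 = nn.Conv2d(channels, num_classes, 1)       # C2: 1 x 1 convolution replacing the FC layer
        self.gap = nn.AdaptiveAvgPool2d(1)                  # global average pooling

    def forward(self, x):                                   # x: (batch, 200, 11, 11)
        x = self.c2(self.units(self.c1(x)))
        return self.gap(x).flatten(1)                       # (batch, num_classes)

model = DSPyramidResNet()
print(model(torch.randn(2, 200, 11, 11)).shape)             # torch.Size([2, 16])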

2.4. Detailed Design of the Model

HSI bands are continuous, but the pixel values of each band are relatively scattered. In order to speed up convergence and reduce the training time of the model, the input data cube was standardized to zero mean before being fed into the network. The standardization is defined in Equation (10):
$$X_{i,j}^{n} = \frac{X_{i,j}^{n} - \bar{X}^{n}}{\sigma^{n}}, \qquad 1 \le i \le W,\ 1 \le j \le H,\ 1 \le n \le N,$$
where $X_{i,j}^{n}$ represents the pixel value at the $i$-th row and $j$-th column of the $n$-th band of the HSI, $\bar{X}^{n}$ is the mean pixel value of the $n$-th band, and $\sigma^{n}$ is the standard deviation of the pixels in the $n$-th band; W, H, and N denote the width, height, and total number of bands of the input HSI, respectively.
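A minimal NumPy sketch of this per-band standardization, assuming the HSI cube is stored as a (W, H, N) array, is given below.

import numpy as np

def standardize_bands(cube):
    # cube: (W, H, N) hyperspectral array; Equation (10) applied band by band.
    mean = cube.mean(axis=(0, 1), keepdims=True)   # per-band mean
    std = cube.std(axis=(0, 1), keepdims=True)     # per-band standard deviation
    return (cube - mean) / std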
In order to utilize the spectral and spatial information of HSI simultaneously, the original HSI was preprocessed into neighborhood pixel blocks of size S × S × N as the input of the model, where S × S is the size of the neighborhood centered on a given pixel and N is the number of bands of the HSI. Considering that the input HSI data cube has a large number of spectral dimensions, the Hughes phenomenon [30] can easily occur; in other words, the imbalance between high-dimensional spectral bands and a limited number of training samples tends to cause overfitting. Therefore, in the first layer of the model, a 1 × 1 bottleneck layer was used to reduce the number of original channels, so that multiple channels could be recombined without changing the spatial size. In this manner, cross-channel information integration could be realized, and nonlinearity could be increased (a ReLU activation function was used after the convolution). On the whole, the 1 × 1 convolution not only retained the original spatial information of the HSI data cube but also reduced the number of spectral channels of the input data and effectively extracted the spectral features of the spatial blocks.
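As an illustration, the following sketch cuts an S × S × N neighborhood block around a labeled pixel; the use of reflection padding at the image borders is an assumption, not a detail stated in the paper.

import numpy as np

def extract_patch(cube, row, col, s=11):
    # cube: standardized (W, H, N) HSI; returns the s x s x N block centered at (row, col).
    half = s // 2
    padded = np.pad(cube, ((half, half), (half, half), (0, 0)), mode="reflect")
    return padded[row:row + s, col:col + s, :]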
As seen in Figure 4, R1, R2, and R3 are three pyramidal residual units, and each unit uses two 3 × 3 depth-wise separable convolutions to extract the corresponding features. First, convolution with stride = 1 and padding = 1 is adopted in R1, so that the input and output feature maps have the same size and the edge information of the feature map is retained. R2 and R3 are identical down-sampling units; the convolution strides of their first and second layers are 2 and 1, respectively, and they are used to extract more abstract spatial-spectral information. It should be pointed out that we adopted the zero-padded skip connection [25] in R1, while zero padding combined with 2 × 2 average pooling was used in R2 and R3. The advantage of this connection is that no additional parameters are introduced while the element-wise addition remains valid.
In conventional residual networks, when the size of the output feature map is reduced to half of the input, the number of output channels is doubled; this approach undoubtedly increases the number of parameters and the computational load. In contrast, the pyramid residual unit introduced in this paper linearly increases the number of feature map channels in small steps, which greatly reduces the number of parameters and the computational complexity compared with some existing methods. The number of output channels of each residual unit is calculated as shown in Equation (11):
$$D_i = \begin{cases} C, & i = 1 \\ D_{i-1} + \alpha / R, & i > 1, \end{cases}$$
where $D_i$ is the number of output channels of the $i$-th residual unit, $C$ is the number of initial channels fed to the first unit (that is, the number of output channels of the first 1 × 1 convolution layer), $R$ is the total number of residual units, and $\alpha$ is an integer greater than zero.
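For example, the following small sketch of this widening rule, with C = 38, α = 48, and R = 3, reproduces the 38→54→70→86 channel widths listed in Table 1.

def output_channels(c, alpha, r):
    # Equation (11): the first width is C, and each residual unit adds alpha/R channels.
    widths = [c]
    for _ in range(r):
        widths.append(widths[-1] + alpha // r)
    return widths

print(output_channels(38, 48, 3))  # [38, 54, 70, 86]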
In order to avoid the information redundancy of a fully connected layer, at the end of the model a 1 × 1 convolution and a global average pooling layer were combined, instead of a fully connected layer, to fuse the features extracted by the previous layers. This approach further reduces the number of model parameters, alleviates overfitting, and gives the network a faster convergence rate [31]. It should be mentioned that the number of 1 × 1 convolution kernels must equal the number of classes in the current HSI data set, so that a 1 × N feature vector (N being the number of classes) is output from the global average pooling layer and the final classification is completed.

3. Experimental Setup and Parameter Discussion

3.1. Datasets Description

In order to measure the classification performance of the proposed model, three benchmark HSI data sets, Indian Pines (IP), Pavia University (PU), and Kennedy Space Center (KSC), were selected for the experiments. These data sets are available from the Grupo de Inteligencia Computacional (GIC) website (http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes). The three data sets were acquired over the Indian Pines test site in northwest Indiana, the University of Pavia in northern Italy, and the Kennedy Space Center in Florida, USA, respectively. The details of the data sets are shown in Table 2.

3.2. Experimental Setup

The model designed in this paper was implemented in Python 3.6.5 with the PyTorch 1.0.0 deep learning framework (available from https://pytorch.org). The hardware consisted of an Intel(R) Xeon(R) E5-2697 CPU, 32 GB of memory, and an NVIDIA Tesla K20m GPU. We set the batch sizes for the Indian Pines, Pavia University, and Kennedy Space Center datasets to 64, 128, and 32, respectively. The value of α was fixed at 48, and the learning rate was set to 0.01 uniformly. A stochastic gradient descent (SGD) optimizer was used to optimize the training parameters, and each experiment was run for 200 epochs. In addition, all convolution layers were initialized with the MSRA method [32] before training.
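A sketch of this training configuration is shown below, assuming the DSPyramidResNet sketch from Section 2.3; the use of cross-entropy loss is an assumption, as the loss function is not stated explicitly in the text.

import torch
import torch.nn as nn

model = DSPyramidResNet()                      # Indian Pines configuration from Section 2.3
for m in model.modules():
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight)      # MSRA initialization [32]

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()              # assumed loss function

def train_one_epoch(loader):
    model.train()
    for patches, labels in loader:             # e.g., batch size 64 for Indian Pines
        optimizer.zero_grad()
        loss = criterion(model(patches), labels)
        loss.backward()
        optimizer.step()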
The experiments in this section are organized as follows. The influence of the spatial size, the number of initial convolution kernels, and the number of residual units on the classification accuracy was analyzed. The overall accuracy (OA), average accuracy (AA), and Kappa coefficient (K) were used as the evaluation indexes of the experimental results. In order to ensure the reliability of the experiments, each group of experiments was carried out 10 times, and the mean and standard deviation of the 10 runs were taken as the final result. To rule out variability due to random factors (training and testing sample order), we selected 10 different random seeds for every experiment.
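For clarity, the three metrics can be computed from a confusion matrix as in the following sketch; these are the standard definitions, not code taken from the paper.

import numpy as np

def oa_aa_kappa(conf):
    # conf[i, j]: number of samples of class i predicted as class j.
    total = conf.sum()
    oa = np.trace(conf) / total                          # overall accuracy
    aa = np.mean(np.diag(conf) / conf.sum(axis=1))       # average per-class accuracy
    pe = np.sum(conf.sum(axis=0) * conf.sum(axis=1)) / total ** 2
    kappa = (oa - pe) / (1 - pe)                         # Kappa coefficient
    return oa, aa, kappa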
In the three datasets, we randomly selected 200 samples per class as the training set. In the Indian Pines and Kennedy Space Center datasets, however, some classes contain fewer than 200 samples; for example, classes 1, 7, 9, and 16 in the Indian Pines data set make the overall sample distribution uneven. For such classes, about 80% of the samples were randomly selected as the training set. The remaining data were used for testing. In addition, 75% of the training samples were randomly selected for model validation. The specific sample divisions of the three datasets are shown in Table 3, Table 4 and Table 5.
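The per-class split described above can be sketched as follows; the 0.8 ratio for small classes follows the text, while the function itself and its seeding scheme are illustrative.

import numpy as np

def split_class(indices, n_train=200, small_ratio=0.8, seed=0):
    # indices: sample indices of one class; returns (training indices, testing indices).
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(indices)
    k = n_train if len(shuffled) > n_train else int(round(small_ratio * len(shuffled)))
    return shuffled[:k], shuffled[k:]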

3.3. The Impact of Spatial Size

For HSI, the neighborhood pixels of a given pixel often have similar characteristics, and the probability that they belong to the same class is very high. If the neighborhood pixel block is too small, the model cannot fully learn the spatial features of the image; if it is too large, it easily mixes in other classes of targets, which reduces the classification accuracy and increases the running time and memory overhead. Therefore, it is crucial to choose an appropriate neighborhood pixel block size. In our experiments, the initial number of convolution kernels C was set to 38 and the residual unit depth R was set to 3 on the three datasets. For the neighborhood pixel block size S, we chose 5, 7, 9, 11, 13, and 15 for the following experiments.
As shown in Figure 5, for all datasets, the values of OA, AA, and the Kappa coefficient increased significantly with the increase of spatial size. For the Indian Pines data set, the accuracy reached its highest when S was 11 and then began to decline. For the Pavia University dataset, the accuracy rose gradually when S was greater than 11 and was highest when S was 15. For the Kennedy Space Center dataset, the accuracy increased in a wavy trend as S increased and reached its highest when S was 15; for S between 5 and 15, the accuracy was always higher than 99.0%, which indicates that the classification accuracy on this dataset is less affected by the neighborhood size. As shown in Table 6, training and test times also increased with spatial size. For the Indian Pines and Pavia University data sets, a small amount of classification accuracy was sacrificed to balance the classification time, so S was set to 11, 13, and 9 for the Indian Pines, Pavia University, and Kennedy Space Center data sets, respectively, as the input of the next experiment.

3.4. The Impact of Initial Convolution Kernels Number

In this experiment, the number of initial convolution kernels refers to the number of channels input to the R1 unit, which is also the number of 1 × 1 convolution kernels in the first layer of the model. According to Equation (11), the number of output channels of the $i$-th (i > 0) residual unit is $C + \alpha \times i / R$ (C and R are the number of 1 × 1 convolution kernels and the number of residual units, respectively). When α and R are constant, the number of initial convolution kernels C completely determines the number of output channels of the later layers. In general, a larger C means that the model can learn more features. However, too many features not only increase the model parameters but may also lead to overfitting and reduce classification accuracy.
In order to explore the influence of the number of initial convolution kernels (C) on the classification accuracy, the depth of the residual units (R) was fixed at 3. We tested initial convolution kernel numbers of 16, 24, 32, 38, and 42 on the three datasets. The experimental results are shown in Table 7; on the same dataset, there is no significant difference in training or test time as the number of initial convolution kernels increases. As can be seen from Figure 6, the best classification accuracy on the Indian Pines, Pavia University, and Kennedy Space Center datasets was achieved when C was 38, 42, and 32, respectively. Consequently, we chose these values as the hyperparameters of the next experiments.

3.5. The Impact of Residual Unit Depth

In addition to the above factors, the depth of the residual units directly affects the feature extraction capability of the entire model. In other words, if the structure of the model is too shallow, it cannot extract features effectively, while a deeper structure is prone to vanishing gradients and cannot further improve the classification accuracy. In order to test the influence of the residual unit depth on the classification accuracy on the three datasets, we set the residual unit depth R to 1, 2, 3, and 4 in the experiments.
The experimental results are shown in Table 8 and Table 9. For the same dataset, each additional residual unit increases the parameters by about 10,000. On the Indian Pines dataset, the training and testing times did not change significantly as the number of residual units increased. On the Kennedy Space Center dataset, the training and testing times gradually became longer overall as the number of residual units increased, but they remained acceptable given that the highest classification accuracy was obtained. As shown in Figure 7, the optimal accuracy was obtained when R was 3 for the Indian Pines and Kennedy Space Center datasets. For the Pavia University dataset, the model acquired the highest classification accuracy when R was 2.

4. Results and Discussion

In order to further measure the performance of the proposed classification model, we selected several representative classification models from recent years: SVM-RBF [5], 1D-CNN [14], M3D-DCNN [20], SSRN [21], and pResNet [24]. Besides these, the same model built with standard convolution (Std-CNN) instead of depth-wise separable convolution was also constructed. These models were used in this study for detailed comparisons. More specifically, SVM-RBF and 1D-CNN are spectral-based methods, while M3D-DCNN, pResNet, SSRN, Std-CNN, and the present model are spectral-spatial approaches. In order to ensure the fairness of the experiments, each group of experiments on the three datasets was repeated 10 times. We uniformly used the spatial sizes determined in Section 3.3 as the input of the spectral-spatial models. The evaluation indicators OA, AA, Kappa coefficient, and F1-score are expressed in the form of "mean ± standard deviation".
In addition, we compared the proposed depth-separable convolution network (Des-CNN) with Std-CNN in detail to verify the feasibility and effectiveness of the depth-separable convolution approach for hyperspectral images classification.

4.1. Comparison with Other Methods

The classification results of each method are shown in Table 10, Table 11 and Table 12. First, it can be seen that the classification accuracies (OA, AA, Kappa, and F1-score) of the model proposed in this paper are higher than those of the other classification models (SVM-RBF, 1D-CNN, M3D-DCNN, pResNet) on all datasets. Specifically, on the Indian Pines data set, our model achieves a mean OA roughly 17.48% higher than that of SVM-RBF and 1D-CNN, and about 3.66% higher than that of the spectral-spatial models (M3D-DCNN, Std-CNN, and pResNet). Clearly, the accuracies of SVM-RBF and 1D-CNN, which use only spectral features, are significantly lower than those of the spectral-spatial models, implying that spectral features alone cannot achieve high classification accuracy. It should also be pointed out that the OA value of SSRN is only 0.04% higher than that of our model, but with a larger standard deviation. In addition, the performance of the 3D networks, M3D-DCNN and SSRN, appears limited; the accuracy of M3D-DCNN on the IP dataset, in particular, is much lower than that of SSRN, Std-CNN, and our proposed depth-wise separable convolution model. From Table 10, the accuracy of M3D-DCNN on classes No. 2 and 11 is only 86.29% and 80.15%, because the band information of these two classes is relatively similar, resulting in misjudgments. Moreover, for classes with few samples, the feature learning ability of SSRN is insufficient: its accuracy on classes No. 9 and 16 is less than 98%, while that of our model is higher than 99.44%. Furthermore, both Std-CNN and our proposed model overcome the small-sample problem, and their classification accuracy is superior to that of the other models on the classes with small sample sizes (No. 1, 7, 9, 16), indicating that our model still has strong feature extraction capability under small-sample conditions. Finally, for the Indian Pines and Kennedy Space Center datasets, the OA, AA, Kappa, and F1-score of the depth-wise separable convolution model are better than those of Std-CNN. For the Pavia University dataset, the OA and Kappa of the two models are equal, and the AA of the proposed model is only 0.04% lower than that of Std-CNN, so the gap is negligible.
In addition, compared with other classification models, the standard deviation of our model results is the smallest, which further indicates that the proposed model ensures high accuracy while the classification effect is more stable.
Figure 8, Figure 9 and Figure 10 visualize the classification results of the different models on the three datasets, together with the false color images of the original HSI and their corresponding ground-truth maps. It can be clearly seen from the classification maps that using only spectral features for classification, as in SVM-RBF and 1D-CNN, produces many noisy pixels, whereas the spectral-spatial methods, M3D-CNN, pResNet, and the proposed model, overcome this shortcoming, with our model showing the best classification effect. For example, on the Indian Pines data set, M3D-CNN and pResNet mistakenly labeled some pixels of Class 11 (Soybean-mintill) as Class 3 (Corn-mintill), while our proposed model labeled them correctly. Compared with the ground-truth maps, our model achieved a more accurate and smoother classification result.
In the last part of the experiments, we compared the CNN-based models, 1D-CNN, M3D-CNN, pResNet, SSRN, Std-CNN, and our model, in four aspects: floating point operations (FLOPs), the number of training parameters, training time, and test time. As shown in Figure 11, on the three datasets, the number of parameters of the model constructed in this paper was far lower than that of the other models. Among the other models, pResNet had the most parameters, mainly because its structure is extremely complex, with approximately 40 layers. The parameters of the M3D-DCNN model were second only to those of pResNet; it consists of 10 layers and contains two three-dimensional multi-scale convolutional layers with a width of 4, which greatly increases the parameters and slows down classification. The 1D-CNN model was the shallowest, with only 5 layers, and its number of parameters was much lower than those of M3D-DCNN or pResNet. In addition, from Table 13, the training and test times of the proposed model were much shorter than those of M3D-DCNN and pResNet, though slightly longer than those of 1D-CNN. This is mostly because the 1D-CNN model used only spectral features to complete the classification and discarded the spatial information, so it was slightly faster; its classification accuracy was lower, however.
On the other hand, we can also observe from Table 13 that the FLOPs of 1D-CNN are the lowest, only 0.16M. This is because the 1D convolution operation is relatively simple when exploring spectral information, so its feature learning ability is limited. In contrast, the FLOPs of M3D-DCNN and SSRN are as high as 234.68M and 75.31M, respectively; SSRN, in particular, requires about 100 times the FLOPs of our model. The reason is that 3D convolution networks need a great number of floating-point calculations for feature learning during training. By comparison, the computational burden of the 2D networks, such as pResNet and Std-CNN, is small; pResNet in particular, despite having more parameters, has fewer FLOPs than the 3D networks. Nevertheless, neither the 3D networks nor the standard 2D networks showed performance advantages, mainly because the excessive redundant parameters in these networks limit their feature representation. Finally, it should be noted that the parameters and FLOPs of Std-CNN are approximately 6 times those of the depth-wise separable convolution model on the same data set, which indicates that the computational cost of our model is lower.
In short, while pursuing high accuracy, the model in this paper is superior to the other models in terms of the number of parameters, training time, and testing time. In addition, depth-wise separable convolution is feasible and offers advantages such as low cost, a simple structure, and comparable accuracy relative to standard convolution. In practical applications, this lightweight framework does not require a high-end computer hardware platform.

4.2. Effectiveness Analysis to Depth-Separable Convolution

In this part, we compared Std-CNN with the proposed depth-wise separable convolution network (Des-CNN) in detail to verify the feasibility and effectiveness of depth-wise separable convolution for hyperspectral image classification. The OA and F1-score of the two models on the different data sets are recorded in Table 14 and Table 15, respectively. It can be seen that, compared with Std-CNN, Des-CNN achieved higher OA and F1-score on the Indian Pines data set. Specifically, most of the results of Des-CNN exceeded those of Std-CNN for the same setting under the same hyper-parameters, and the difference was more obvious for small settings (spatial size = 5, 7; initial convolution kernel number = 16, 24). However, the performance of Des-CNN on the Pavia University and Kennedy Space Center data sets was slightly worse than that of Std-CNN: under the same configuration, the accuracy (OA, F1-score) of Des-CNN was slightly lower than that of Std-CNN, but the difference was less than 1%.
Besides accuracy, the complexity and computational load of the model are another important consideration. Since standard convolution and depth-wise separable convolution appear only in the residual units, the complexity of the network framework is mainly affected by the depth of the residual units. The training parameters and FLOPs of Std-CNN and Des-CNN at different residual unit depths are shown in Figure 12. Obviously, the training parameters and FLOPs of Std-CNN were much higher than those of Des-CNN under the same configuration. From Figure 12a, the training parameters of Des-CNN for the different residual unit depths on the three data sets were all lower than 50,000, while the parameters of Std-CNN were mostly more than 100,000. From Figure 12b, the FLOPs of Std-CNN were higher than 5M, and even more than 10M on Indian Pines and Pavia University, while the FLOPs of Des-CNN were less than 5M on all three datasets.
From the above analysis, compared with Std-CNN, the Des-CNN constructed with depth-wise separable convolution achieved competitive results on the Indian Pines data set and was slightly weaker on the Pavia University and Kennedy Space Center data sets, though the differences were very small. In addition, the model complexity of Des-CNN was significantly lower than that of Std-CNN on all three data sets. Consequently, this is sufficient to demonstrate that depth-wise separable convolution is feasible for hyperspectral classification. Compared with Std-CNN, the proposed model is more lightweight and better suited to the limited computer hardware available in practice.

5. Conclusions

In this paper, a lightweight model for HSI classification was constructed and discussed. Experimental results show that it has fewer parameters and a faster classification speed. In the first layer of the model, a 1 × 1 convolution kernel was used to recombine the input channels of the HSI and realize cross-channel information integration, which reduced the number of spectral channels. Next, the spatial-spectral features were extracted by the residual units of the middle layers. At the end of the model, a combination of a 1 × 1 filter and a global average pooling layer was used to replace the fully connected layer and complete the final classification, which further reduced the number of model parameters and sped up classification while maintaining accuracy.
In the experiments, the effects of the spatial size, the number of initial convolution kernels, and the depth of the residual units on the classification accuracy were first analyzed. Then, we compared the experimental results with those of other classification models. The results show that the proposed model reduced the number of parameters to a large extent and had a faster classification speed while ensuring higher accuracy. In addition, the proposed model has powerful feature extraction capabilities, as it still achieves high classification accuracy on small-sample data. Finally, we explored the impact of building the model with standard convolution versus depth-wise separable convolution on classification accuracy, the number of parameters, and FLOPs. The experimental results showed that depth-wise separable convolution is feasible for hyperspectral classification; compared with Std-CNN, the proposed model is more lightweight and better suited to limited computer hardware in practice.
In future research, three-dimensional convolution will be used to extract spectral features and two-dimensional convolution will be used to fuse spatial-spectral features, with dense connections introduced to speed up the flow of feature information, further reduce the training and testing time of the model, and accelerate its convergence, so as to build a more rapid and effective HSI classification model.

Author Contributions

Conceptualization, L.D. and P.P.; methodology, L.D., J.L.; software, P.P.; validation, P.P., L.D.; investigation, P.P.; writing—original draft preparation, P.P.; writing—review and editing, J.L., L.D.; funding acquisition, L.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China, grant number 41801310; Technology Development Plan Project of Henan Province, China, grant number 202102210160.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Nakhostin, S.; Clenet, H.; Corpetti, T.; Courty, N. Joint anomaly detection and spectral unmixing for planetary Hyperspectral Images. IEEE Geosci. Remote Sens. 2016, 54, 6879–6894.
2. Torrecilla, E.; Stramski, D.; Reynolds, R.A.; Millán-Núñez, E.; Piera, J. Cluster analysis of hyperspectral optical data for discriminating phytoplankton pigment assemblages in the open ocean. Remote Sens. Environ. 2011, 115, 2578–2593.
3. Pan, Z.; Glennie, C.L.; Fernandez-Diaz, J.C.; Legleiter, C.J.; Overstreet, B. Fusion of LiDAR orthowaveforms and hyperspectral imagery for shallow river bathymetry and turbidity estimation. IEEE Geosci. Remote Sens. 2016, 54, 4165–4177.
4. Li, C.H.; Hun, C.C.; Taur, J.S. An effective feature selection method for hyperspectral image classification based on genetic algorithm and support vector machine. Knowl. Based Syst. 2011, 24, 40–48.
5. Kuo, B.C.; Ho, H.H.; Li, C.H. A kernel-based feature selection method for SVM with RBF kernel for hyperspectral image classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2013, 7, 317–326.
6. Zhu, Z.; Jia, S.; He, S.; Sun, Y.; Ji, Z.; Shen, L. Three-dimensional Gabor feature extraction for hyperspectral imagery classification using a memetic framework. Inf. Sci. 2015, 298, 274–287.
7. Fauvel, M.; Dechesne, C.; Zullo, A.; Ferraty, F. Fast forward feature selection of hyperspectral images for classification with Gaussian mixture models. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2015, 8, 2824–2831.
8. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE Inst. Electr. Electron Eng. 1998, 86, 2278–2324.
9. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2015, arXiv:1409.1556.
10. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2015, arXiv:1704.04861.
11. Abdel-Hamid, O.; Mohamed, A.R.; Jiang, H.; Penn, G. Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012; pp. 4277–4280.
12. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 2014, 580–587.
13. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 39, 640–651.
14. Hu, W.; Huang, Y.; Wei, L.; Zhang, F.; Li, H. Deep convolutional neural networks for hyperspectral image classification. J. Sens. 2015, 2015, 258619.
15. Makantasis, K.; Karantzalos, K.; Doulamis, A.; Doulamis, N. Deep supervised learning for hyperspectral data classification through convolutional neural networks. In Proceedings of the International Geoscience and Remote Sensing Symposium (IGARSS), Milan, Italy, 26–31 July 2015; pp. 4959–4962.
16. Zhang, H.; Li, Y.; Zhang, Y.; Shen, Q. Spectral-spatial classification of hyperspectral imagery using a dual-channel convolutional neural network. Remote Sens. Lett. 2017, 8, 438–447.
17. Yu, S.; Jia, S.; Xu, C. Convolutional neural networks for hyperspectral image classification. Neurocomputing 2017, 219, 88–89.
18. Lee, H.; Kwon, H. Contextual deep CNN based hyperspectral classification. In Proceedings of the International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, China, 10–15 July 2016; pp. 3322–3325.
19. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Rabinovich, A. Going deeper with convolutions. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 2015, 1–9.
20. He, M.; Li, B.; Chen, H. Multi-scale 3D deep convolutional neural network for hyperspectral image classification. Proc. IEEE Int. Conf. Image Process. 2017, 3904–3908.
21. Zhong, Z.; Li, J.; Luo, Z.; Chapman, M. Spectral–spatial residual network for hyperspectral image classification: A 3-D deep learning framework. IEEE Trans. Geosci. Remote Sens. 2018, 56, 847–858.
22. Wang, W.; Dou, S.; Jiang, Z.; Sun, L. A fast dense spectral spatial convolution network framework for hyperspectral images classification. Remote Sens. 2018, 10, 1068.
23. Gao, H.; Yang, Y.; Li, C.; Zhang, X.; Zhao, J.; Yao, D. Convolutional neural network for spectral–spatial classification of hyperspectral images. Neural Comput. 2018, 31, 8997–9012.
24. Paoletti, M.E.; Haut, J.M.; Fernandez-Beltran, R.; Plaza, J.; Plaza, A.J.; Pla, F. Deep pyramidal residual networks for spectral–spatial hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2019, 57, 740–754.
25. Han, D.; Kim, J.; Kim, J. Deep pyramidal residual networks. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 2017, 6307–6315.
26. Chollet, F. Xception: Deep learning with depthwise separable convolutions. arXiv 2017, arXiv:1610.02357.
27. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. Proc. IEEE Conf. Comput. Vis. Pattern Recognit. 2016, 770–778.
28. Nair, V.; Hinton, G.E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML), Haifa, Israel, 21–24 June 2010; Omnipress: Madison, WI, USA, 2010; pp. 807–814.
29. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, Paris, France, 6–11 July 2015; pp. 448–456.
30. Donoho, D.L. High-dimensional data analysis: The curses and blessings of dimensionality. AMS Math Chall. Lect. 2000, 1, 32.
31. Lin, M.; Chen, Q.; Yan, S. Network in network. arXiv 2013, arXiv:1312.4400.
32. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the International Conference on Computer Vision (ICCV), Santiago, Chile, 13–16 December 2015; pp. 1026–1034.
Figure 1. The process of depth-wise separable convolution. This process includes depth-wise convolution and 1 × 1 convolution.
Figure 2. The structure of the basic residual unit. Its execution order is: Conv→batch normalization (BN)→rectified linear unit (ReLU)→Conv→BN.
Figure 3. The structure of the pyramid residual unit. Unlike a basic residual unit, the last ReLU layer was deleted, and BN was required before the first convolution operation in this unit.
Figure 4. The overall structure of the proposed model for hyperspectral images (HSI) classification.
Figure 5. Accuracy of three data sets with different spatial sizes: (a) Indian Pines; (b) Pavia University; (c) Kennedy Space Center.
Figure 6. Accuracy of three data sets with different numbers of initial convolution kernels: (a) Indian Pines; (b) Pavia University; (c) Kennedy Space Center.
Figure 7. Accuracy of three data sets with different depths of residual unit: (a) Indian Pines; (b) Pavia University; (c) Kennedy Space Center.
Figure 8. Classification maps of the different models for the Indian Pines data set: (a) false color image; (b) ground-truth map; (c) SVM-RBF; (d) 1D-CNN; (e) M3D-CNN; (f) pResNet; (g) SSRN; (h) Std-CNN; (i) proposed.
Figure 9. Classification maps of the different models for the Pavia University data set: (a) false color image; (b) ground-truth map; (c) SVM-RBF; (d) 1D-CNN; (e) M3D-CNN; (f) pResNet; (g) SSRN; (h) Std-CNN; (i) proposed.
Figure 10. Classification maps of the different models for the Kennedy Space Center data set: (a) false color image; (b) ground-truth map; (c) SVM-RBF; (d) 1D-CNN; (e) M3D-CNN; (f) pResNet; (g) SSRN; (h) Std-CNN; (i) proposed.
Figure 11. The number of trainable parameters of the different models.
Figure 12. Training parameters and FLOPs (×10^6) of Std-CNN and Des-CNN under different residual unit depths: (a) the number of training parameters; (b) the number of FLOPs.
Table 1. The detailed configuration of the proposed network.
Layer | Output Size | Kernel Size | Stride | Padding
Input | 11 × 11 × 200 | – | – | –
C1 | 11 × 11 × 38 | 1 × 1 | 1 | 0
R1 | 11 × 11 × 54 | 3 × 3 (DS-Conv) | 1 | 1
R1 | 11 × 11 × 54 | 3 × 3 (DS-Conv) | 1 | 1
R2 | 6 × 6 × 70 | 3 × 3 (DS-Conv) | 2 | 1
R2 | 6 × 6 × 70 | 3 × 3 (DS-Conv) | 1 | 1
R3 | 3 × 3 × 86 | 3 × 3 (DS-Conv) | 2 | 1
R3 | 3 × 3 × 86 | 3 × 3 (DS-Conv) | 1 | 1
C2 | 3 × 3 × 16 | 1 × 1 | 1 | 0
GAP | 1 × 16 | – | – | –
Table 2. The detailed information of the three hyperspectral datasets.
 | IP | PU | KSC
Type of Sensor | AVIRIS | ROSIS | AVIRIS
Spatial Size | 145 × 145 | 610 × 340 | 512 × 614
Spectral Range | 0.4–2.5 µm | 0.43–0.86 µm | 0.4–2.5 µm
Spatial Resolution | 20 m | 1.3 m | 18 m
Bands | 200 | 103 | 176
Num. of Classes | 16 | 9 | 13
Table 3. Samples information for the Indian Pines data set.
No | Class | Total | Train | Test
1 | Alfalfa | 46 | 37 | 9
2 | Corn-notill | 1428 | 200 | 1228
3 | Corn-mintill | 830 | 200 | 630
4 | Corn | 237 | 200 | 37
5 | Grass-pasture | 483 | 200 | 283
6 | Grass-trees | 730 | 200 | 530
7 | Grass-pasture-mowed | 28 | 23 | 5
8 | Hay-windowed | 478 | 200 | 278
9 | Oats | 20 | 16 | 4
10 | Soybean-notill | 972 | 200 | 772
11 | Soybean-mintill | 2455 | 200 | 2255
12 | Soybean-clean | 593 | 200 | 393
13 | Wheat | 205 | 200 | 5
14 | Woods | 1265 | 200 | 1065
15 | Buildings-Grass-Trees | 386 | 200 | 186
16 | Stone-Steel-Towers | 93 | 75 | 18
 | Total | 10,249 | 2551 | 7698
Table 4. Samples information for the Pavia University data set.
No | Class | Total | Train | Test
1 | Asphalt | 6631 | 200 | 6431
2 | Meadows | 18,649 | 200 | 18,449
3 | Gravel | 2099 | 200 | 1899
4 | Trees | 3064 | 200 | 2864
5 | Sheets | 1345 | 200 | 1145
6 | Bare soils | 5029 | 200 | 4829
7 | Bitumen | 1330 | 200 | 1130
8 | Bricks | 3682 | 200 | 3482
9 | Shadows | 947 | 200 | 747
 | Total | 42,776 | 1800 | 40,976
Table 5. Samples information for the Kennedy Space Center data set.
No | Class | Total | Train | Test
1 | Scrub | 761 | 200 | 561
2 | Willow swamp | 243 | 200 | 43
3 | CP hammock | 256 | 200 | 56
4 | Slash pine | 252 | 200 | 52
5 | Oak/Broadleaf | 161 | 129 | 32
6 | Hardwood | 229 | 200 | 29
7 | Swamp | 105 | 84 | 21
8 | Graminoid marsh | 431 | 200 | 231
9 | Spartina marsh | 520 | 200 | 320
10 | Cattail marsh | 404 | 200 | 204
11 | Salt marsh | 419 | 200 | 219
12 | Mud flats | 503 | 200 | 303
13 | Water | 927 | 200 | 727
 | Total | 5211 | 2413 | 2798
Table 6. Training and test time and overall classification accuracy (OA) for different spatial sizes on the three data sets.
Spatial Size | IP Training Time (s) | IP Test Time (s) | IP OA (%) | PU Training Time (s) | PU Test Time (s) | PU OA (%) | KSC Training Time (s) | KSC Test Time (s) | KSC OA (%)
5 × 5 | 576.32 ± 19.84 | 2.07 ± 0.03 | 94.63 ± 0.97 | 519.46 ± 14.22 | 3.67 ± 0.05 | 97.23 ± 0.69 | 787.59 ± 15.14 | 1.90 ± 0.02 | 99.75 ± 0.12
7 × 7 | 593.70 ± 15.06 | 2.54 ± 0.05 | 97.31 ± 0.55 | 579.74 ± 44.64 | 5.25 ± 0.09 | 98.61 ± 0.31 | 776.79 ± 22.03 | 2.08 ± 0.07 | 99.95 ± 0.05
9 × 9 | 640.74 ± 27.28 | 3.32 ± 0.03 | 98.63 ± 0.38 | 591.55 ± 34.47 | 7.30 ± 0.09 | 99.25 ± 0.41 | 806.58 ± 26.58 | 2.28 ± 0.04 | 99.96 ± 0.08
11 × 11 | 714.28 ± 16.92 | 4.57 ± 0.03 | 98.85 ± 0.23 | 616.49 ± 22.25 | 10.62 ± 0.06 | 99.45 ± 0.19 | 871.04 ± 23.12 | 2.59 ± 0.02 | 99.93 ± 0.07
13 × 13 | 796.30 ± 16.75 | 6.22 ± 0.73 | 98.39 ± 0.36 | 666.41 ± 39.68 | 14.33 ± 0.22 | 99.46 ± 0.19 | 900.05 ± 0.95 | 2.90 ± 0.01 | 99.87 ± 0.28
15 × 15 | 1017.71 ± 8.19 | 8.15 ± 0.95 | 98.00 ± 0.44 | 748.43 ± 42.31 | 18.47 ± 0.09 | 99.64 ± 0.13 | 980.50 ± 13.08 | 3.39 ± 0.05 | 99.99 ± 0.03
Table 7. Training and test time and OA for different numbers of initial convolution kernels (C) on the three data sets.
C | IP Training Time (s) | IP Test Time (s) | IP OA (%) | PU Training Time (s) | PU Test Time (s) | PU OA (%) | KSC Training Time (s) | KSC Test Time (s) | KSC OA (%)
16 | 708.97 ± 15.58 | 4.60 ± 0.06 | 98.38 ± 0.65 | 739.87 ± 22.69 | 14.18 ± 0.06 | 99.07 ± 0.47 | 816.49 ± 12.83 | 2.29 ± 0.03 | 99.94 ± 0.09
24 | 715.34 ± 39.67 | 4.57 ± 0.03 | 98.46 ± 0.53 | 749.73 ± 40.44 | 14.24 ± 0.07 | 99.33 ± 0.42 | 814.42 ± 21.80 | 2.28 ± 0.02 | 99.91 ± 0.07
32 | 708.66 ± 15.27 | 4.56 ± 0.05 | 98.60 ± 0.40 | 719.52 ± 1.15 | 14.21 ± 0.05 | 99.43 ± 0.27 | 850.34 ± 74.70 | 2.31 ± 0.01 | 99.96 ± 0.05
38 | 714.28 ± 16.92 | 4.57 ± 0.03 | 98.85 ± 0.23 | 666.41 ± 39.68 | 14.33 ± 0.22 | 99.46 ± 0.19 | 806.58 ± 25.68 | 2.28 ± 0.04 | 99.96 ± 0.08
42 | 722.69 ± 20.08 | 4.57 ± 0.04 | 98.63 ± 0.29 | 729.48 ± 5.88 | 14.26 ± 0.16 | 99.51 ± 0.20 | 799.62 ± 5.93 | 2.27 ± 0.02 | 99.94 ± 0.04
Table 8. Model parameters for different residual unit depths (R) on the three data sets.
R | IP | PU | KSC
1 | 21,284 | 18,750 | 17,114
2 | 31,036 | 29,622 | 25,306
3 | 40,660 | 40,366 | 33,370
4 | 50,252 | 51,078 | 41,402
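The slow growth of the parameter counts in Table 8 follows from the depth-wise separable factorization itself: a standard k × k convolution needs k × k × C_in × C_out weights, whereas the separable version needs only k × k × C_in (depth-wise) plus C_in × C_out (point-wise). The snippet below is an illustrative calculation for a single 3 × 3 layer at the R1 width of 54 channels; it is a generic comparison, not a reproduction of the exact totals above (bias and batch-norm terms are omitted).

```python
# Illustrative parameter count for one 3 x 3 layer with C_in = C_out = 54.
def standard_conv_params(c_in, c_out, k=3):
    # standard convolution: one k x k filter per (input, output) channel pair
    return k * k * c_in * c_out

def ds_conv_params(c_in, c_out, k=3):
    # depth-wise k x k filter per input channel + 1 x 1 point-wise projection
    return k * k * c_in + c_in * c_out

std = standard_conv_params(54, 54)   # 26,244
dsc = ds_conv_params(54, 54)         # 3,402
print(std, dsc, round(std / dsc, 1)) # roughly a 7.7x reduction
```

As the output width grows, the ratio approaches k² ≈ 9 for 3 × 3 kernels, which is consistent with the gap between the Std-CNN and Des-CNN parameter counts shown in Figure 12.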
Table 9. Training time, test time, and OA for different residual unit depths (R) on the three data sets.
R | IP Training Time (s) | IP Test Time (s) | IP OA (%) | PU Training Time (s) | PU Test Time (s) | PU OA (%) | KSC Training Time (s) | KSC Test Time (s) | KSC OA (%)
1 | 725.25 ± 66.24 | 4.60 ± 0.04 | 98.49 ± 0.23 | 655.02 ± 50.04 | 14.13 ± 0.11 | 99.43 ± 0.25 | 755.43 ± 16.10 | 2.21 ± 0.03 | 99.91 ± 0.11
2 | 712.27 ± 24.13 | 4.59 ± 0.07 | 98.51 ± 0.28 | 739.40 ± 20.06 | 14.24 ± 0.05 | 99.58 ± 0.12 | 863.41 ± 13.49 | 2.25 ± 0.03 | 99.88 ± 0.12
3 | 714.28 ± 16.92 | 4.57 ± 0.03 | 98.85 ± 0.23 | 729.48 ± 5.88 | 14.26 ± 0.16 | 99.51 ± 0.20 | 850.34 ± 74.70 | 2.31 ± 0.01 | 99.96 ± 0.05
4 | 735.02 ± 28.17 | 5.74 ± 3.45 | 98.58 ± 0.36 | 638.42 ± 33.87 | 13.90 ± 0.08 | 99.50 ± 0.17 | 874.02 ± 55.47 | 2.34 ± 0.04 | 99.91 ± 0.08
Table 10. Classification results of different methods for the Indian Pines data set.
Class | SVM-RBF | 1D-CNN | M3D-DCNN | pResNet | SSRN | Std-CNN | Proposed
1 | 85.56 | 90.00 | 97.78 | 100.00 | 98.89 | 100.00 | 100.00
2 | 77.02 | 82.15 | 86.29 | 97.29 | 98.20 | 96.51 | 98.22
3 | 76.35 | 78.57 | 92.21 | 99.13 | 99.27 | 99.1 | 99.16
4 | 91.62 | 90.27 | 100.00 | 100.00 | 100.00 | 100.00 | 99.73
5 | 94.63 | 94.98 | 99.12 | 99.89 | 99.96 | 99.93 | 99.86
6 | 96.96 | 97.89 | 99.36 | 99.74 | 99.89 | 99.77 | 99.85
7 | 86.00 | 94.00 | 98.00 | 100.00 | 100.00 | 100.00 | 100.00
8 | 98.17 | 99.06 | 99.86 | 100.00 | 100.00 | 100.00 | 100.00
9 | 82.5 | 97.50 | 95.00 | 100.00 | 95.00 | 100.00 | 100.00
10 | 83.06 | 87.66 | 92.51 | 98.94 | 98.81 | 97.67 | 98.60
11 | 66.82 | 70.52 | 80.15 | 96.12 | 97.99 | 95.60 | 98.22
12 | 86.39 | 89.92 | 96.95 | 98.63 | 99.54 | 98.47 | 98.78
13 | 98.00 | 98.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00
14 | 89.25 | 91.62 | 96.03 | 98.61 | 99.88 | 96.51 | 99.58
15 | 80.16 | 82.53 | 99.19 | 99.84 | 100.00 | 99.10 | 100.00
16 | 97.22 | 96.11 | 100.00 | 99.44 | 97.22 | 100.00 | 99.44
OA (%) | 79.76 ± 0.68 | 82.99 ± 0.85 | 89.80 ± 1.36 | 98.09 ± 1.11 | 98.89 ± 0.44 | 97.67 ± 0.52 | 98.85 ± 0.23
AA (%) | 86.86 ± 2.07 | 90.05 ± 0.92 | 95.78 ± 1.09 | 99.10 ± 0.55 | 99.04 ± 0.93 | 99.12 ± 0.15 | 99.46 ± 0.14
Kappa × 100 | 76.46 ± 0.78 | 80.17 ± 0.95 | 88.05 ± 1.57 | 97.59 ± 1.24 | 98.68 ± 0.52 | 97.24 ± 0.62 | 98.63 ± 0.27
F1-score × 100 | 76.41 ± 1.91 | 78.32 ± 1.70 | 90.66 ± 1.68 | 97.67 ± 1.01 | 97.47 ± 1.98 | 96.11 ± 1.21 | 97.79 ± 1.17
Table 11. Classification results of different methods for the Pavia University data set.
Class | SVM-RBF | 1D-CNN | M3D-DCNN | pResNet | SSRN | Std-CNN | Proposed
1 | 86.01 | 87.48 | 90.84 | 98.86 | 99.56 | 99.81 | 99.83
2 | 90.25 | 89.87 | 96.01 | 99.55 | 99.68 | 99.64 | 99.71
3 | 84.47 | 84.92 | 91.72 | 98.58 | 99.02 | 99.71 | 99.67
4 | 95.10 | 95.93 | 98.01 | 98.84 | 98.20 | 97.92 | 97.74
5 | 99.42 | 99.78 | 99.92 | 99.94 | 99.87 | 99.79 | 99.83
6 | 89.96 | 88.66 | 97.84 | 99.69 | 99.99 | 100.00 | 99.90
7 | 93.19 | 92.27 | 96.65 | 99.51 | 100.00 | 100.00 | 100.00
8 | 85.11 | 81.92 | 93.27 | 99.15 | 99.16 | 99.35 | 99.22
9 | 99.92 | 99.79 | 99.57 | 99.92 | 99.42 | 99.45 | 99.41
OA (%) | 89.70 ± 0.95 | 89.40 ± 0.98 | 95.31 ± 2.10 | 99.35 ± 0.17 | 99.53 ± 0.14 | 99.58 ± 0.29 | 99.58 ± 0.12
AA (%) | 91.49 ± 0.45 | 91.18 ± 0.43 | 95.98 ± 1.31 | 99.34 ± 0.21 | 99.43 ± 0.16 | 99.52 ± 0.13 | 99.48 ± 0.15
Kappa × 100 | 86.36 ± 1.22 | 85.97 ± 1.23 | 93.76 ± 2.73 | 99.12 ± 0.23 | 99.37 ± 0.19 | 99.43 ± 0.39 | 99.43 ± 0.16
F1-score × 100 | 88.62 ± 0.68 | 88.24 ± 0.86 | 94.26 ± 1.96 | 99.16 ± 0.25 | 99.33 ± 0.25 | 99.44 ± 0.21 | 99.38 ± 0.16
Table 12. Classification results of different methods for the Kennedy Space Center data set.
Class | SVM-RBF | 1D-CNN | M3D-DCNN | pResNet | SSRN | Std-CNN | Proposed
1 | 92.23 | 90.71 | 99.18 | 99.63 | 99.57 | 99.93 | 99.89
2 | 95.81 | 90.70 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00
3 | 93.39 | 88.39 | 98.39 | 99.64 | 99.46 | 99.82 | 100.00
4 | 86.54 | 77.69 | 95.00 | 99.23 | 99.81 | 98.85 | 99.23
5 | 77.50 | 70.00 | 93.44 | 99.06 | 99.38 | 98.75 | 100.00
6 | 89.66 | 87.93 | 99.66 | 100.00 | 100.00 | 100.00 | 100.00
7 | 92.86 | 87.14 | 100.00 | 99.52 | 100.00 | 99.52 | 100.00
8 | 95.93 | 94.89 | 99.48 | 100.00 | 100.00 | 100.00 | 100.00
9 | 97.72 | 97.72 | 99.69 | 99.91 | 99.84 | 99.94 | 100.00
10 | 99.02 | 99.61 | 99.56 | 99.90 | 100.00 | 100.00 | 99.90
11 | 98.63 | 98.54 | 99.77 | 100.00 | 100.00 | 100.00 | 100.00
12 | 96.53 | 97.33 | 99.9 | 100.00 | 100.00 | 99.93 | 100.00
13 | 99.93 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00
OA (%) | 96.41 ± 0.30 | 95.67 ± 0.85 | 99.49 ± 0.26 | 99.87 ± 0.25 | 99.87 ± 0.10 | 99.93 ± 0.06 | 99.96 ± 0.05
AA (%) | 93.52 ± 0.90 | 90.82 ± 0.85 | 98.77 ± 0.81 | 99.76 ± 0.35 | 99.85 ± 0.13 | 99.75 ± 0.21 | 99.93 ± 0.09
Kappa × 100 | 95.78 ± 0.35 | 94.91 ± 0.98 | 99.40 ± 0.31 | 99.85 ± 0.29 | 99.85 ± 0.12 | 99.92 ± 0.08 | 99.95 ± 0.06
F1-score × 100 | 90.53 ± 0.71 | 87.82 ± 1.14 | 98.32 ± 0.89 | 99.57 ± 0.80 | 99.59 ± 0.36 | 99.71 ± 0.29 | 99.85 ± 0.19
Table 13. FLOPs, training time, and test time for the different CNN-based models on the three data sets.
Data Set | Metric | 1D-CNN | M3D-DCNN | pResNet | SSRN | Std-CNN | Proposed
Indian Pines | FLOPs (×10⁶) | 0.16 | 75.31 | 30.09 | 234.68 | 10.34 | 2.20
Indian Pines | Training time (s) | 511.25 | 3380.23 | 1040.36 | 3519.76 | 772.93 | 714.28
Indian Pines | Test time (s) | 1.29 | 8.49 | 8.01 | 16.75 | 4.62 | 4.57
Pavia University | FLOPs (×10⁶) | 0.08 | 54.26 | 42.16 | 168.71 | 17.95 | 3.01
Pavia University | Training time (s) | 541.52 | 1975.36 | 908.99 | 1759.32 | 684.71 | 739.40
Pavia University | Test time (s) | 1.81 | 29.72 | 26.52 | 60.46 | 14.11 | 14.24
Kennedy Space Center | FLOPs (×10⁶) | 0.15 | 39.25 | 20.44 | 137.74 | 5.87 | 1.20
Kennedy Space Center | Training time (s) | 509.26 | 2008.18 | 1396.61 | 2117.42 | 715.60 | 668.25
Kennedy Space Center | Test time (s) | 1.31 | 2.66 | 3.31 | 4.01 | 2.06 | 1.91
Table 14. Overall classification accuracy (OA, %) for the standard CNN (Std-CNN) and the proposed model (Des-CNN) on the three HSI datasets.
Parameter | Setting | IP Std-CNN | IP Des-CNN | PU Std-CNN | PU Des-CNN | KSC Std-CNN | KSC Des-CNN
Spatial size | 5 × 5 | 89.04 ± 0.81 | 94.63 ± 0.97 | 96.48 ± 0.99 | 97.23 ± 0.69 | 99.54 ± 0.24 | 99.75 ± 0.12
Spatial size | 7 × 7 | 94.36 ± 0.94 | 97.31 ± 0.55 | 98.33 ± 0.44 | 98.61 ± 0.31 | 99.80 ± 0.11 | 99.95 ± 0.05
Spatial size | 9 × 9 | 97.42 ± 0.57 | 98.63 ± 0.38 | 99.48 ± 0.11 | 99.25 ± 0.41 | 99.96 ± 0.05 | 99.96 ± 0.08
Spatial size | 11 × 11 | 97.87 ± 0.39 | 98.85 ± 0.23 | 99.46 ± 0.25 | 99.45 ± 0.19 | 99.97 ± 0.06 | 99.93 ± 0.07
Spatial size | 13 × 13 | 98.01 ± 0.46 | 98.39 ± 0.36 | 99.54 ± 0.25 | 99.46 ± 0.19 | 99.96 ± 0.05 | 99.87 ± 0.28
Spatial size | 15 × 15 | 97.68 ± 0.34 | 98.00 ± 0.44 | 99.57 ± 0.16 | 99.64 ± 0.13 | 99.99 ± 0.02 | 99.99 ± 0.03
Initial convolution kernels | 16 | 97.37 ± 0.42 | 98.38 ± 0.65 | 99.36 ± 0.24 | 99.07 ± 0.47 | 99.93 ± 0.09 | 99.94 ± 0.09
Initial convolution kernels | 24 | 97.63 ± 0.50 | 98.46 ± 0.53 | 99.50 ± 0.39 | 99.33 ± 0.42 | 99.95 ± 0.04 | 99.91 ± 0.07
Initial convolution kernels | 32 | 97.79 ± 0.43 | 98.60 ± 0.40 | 99.51 ± 0.26 | 99.43 ± 0.27 | 99.89 ± 0.11 | 99.96 ± 0.05
Initial convolution kernels | 38 | 97.87 ± 0.39 | 98.85 ± 0.23 | 99.54 ± 0.25 | 99.46 ± 0.19 | 99.96 ± 0.05 | 99.96 ± 0.08
Initial convolution kernels | 42 | 97.68 ± 0.48 | 98.63 ± 0.29 | 99.36 ± 0.64 | 99.51 ± 0.20 | 99.96 ± 0.05 | 99.94 ± 0.04
Residual unit depth | 1 | 98.74 ± 0.44 | 98.49 ± 0.23 | 99.57 ± 0.13 | 99.43 ± 0.25 | 99.92 ± 0.08 | 99.91 ± 0.11
Residual unit depth | 2 | 98.29 ± 0.24 | 98.51 ± 0.28 | 99.44 ± 0.53 | 99.58 ± 0.12 | 99.94 ± 0.06 | 99.88 ± 0.12
Residual unit depth | 3 | 97.87 ± 0.39 | 98.85 ± 0.23 | 99.54 ± 0.25 | 99.51 ± 0.20 | 99.89 ± 0.11 | 99.96 ± 0.05
Residual unit depth | 4 | 96.58 ± 0.67 | 98.58 ± 0.36 | 99.37 ± 0.28 | 99.50 ± 0.17 | 99.87 ± 0.10 | 99.91 ± 0.08
Table 15. F1-score × 100 for the standard CNN (Std-CNN) and the proposed model (Des-CNN) on the three HSI datasets.
Parameter | Setting | IP Std-CNN | IP Des-CNN | PU Std-CNN | PU Des-CNN | KSC Std-CNN | KSC Des-CNN
Spatial size | 5 × 5 | 91.24 ± 1.21 | 94.45 ± 1.14 | 96.31 ± 0.73 | 96.95 ± 0.90 | 98.67 ± 0.70 | 99.75 ± 0.12
Spatial size | 7 × 7 | 94.91 ± 0.92 | 96.91 ± 1.01 | 97.95 ± 0.42 | 98.29 ± 0.39 | 99.33 ± 0.41 | 99.80 ± 0.14
Spatial size | 9 × 9 | 97.19 ± 0.96 | 98.35 ± 0.85 | 99.30 ± 0.17 | 99.13 ± 0.35 | 99.88 ± 0.12 | 99.84 ± 0.31
Spatial size | 11 × 11 | 96.11 ± 1.21 | 97.79 ± 1.17 | 99.35 ± 0.25 | 99.31 ± 0.21 | 99.92 ± 0.19 | 99.78 ± 0.17
Spatial size | 13 × 13 | 96.19 ± 1.57 | 96.20 ± 1.56 | 99.41 ± 0.20 | 99.31 ± 0.23 | 99.96 ± 0.05 | 99.57 ± 0.79
Spatial size | 15 × 15 | 94.81 ± 1.95 | 96.04 ± 1.36 | 99.40 ± 0.16 | 99.47 ± 0.11 | 99.94 ± 0.12 | 99.93 ± 0.18
Initial convolution kernels | 16 | 96.68 ± 1.32 | 97.87 ± 1.23 | 99.20 ± 0.20 | 98.99 ± 0.38 | 99.71 ± 0.40 | 99.81 ± 0.37
Initial convolution kernels | 24 | 96.96 ± 0.96 | 97.44 ± 1.01 | 99.39 ± 0.33 | 99.22 ± 0.31 | 99.79 ± 0.19 | 99.91 ± 0.07
Initial convolution kernels | 32 | 96.93 ± 0.99 | 96.84 ± 1.76 | 99.51 ± 0.26 | 99.26 ± 0.32 | 99.61 ± 0.32 | 99.85 ± 0.19
Initial convolution kernels | 38 | 96.11 ± 1.21 | 97.79 ± 1.17 | 99.41 ± 0.20 | 99.31 ± 0.23 | 99.88 ± 0.12 | 99.84 ± 0.31
Initial convolution kernels | 42 | 96.51 ± 0.91 | 97.52 ± 1.79 | 99.24 ± 0.61 | 99.42 ± 0.17 | 99.82 ± 0.21 | 99.76 ± 0.16
Residual unit depth | 1 | 97.04 ± 1.41 | 96.02 ± 1.90 | 99.34 ± 0.18 | 99.14 ± 0.24 | 99.76 ± 0.25 | 99.59 ± 0.57
Residual unit depth | 2 | 96.04 ± 1.77 | 96.28 ± 1.68 | 99.30 ± 0.39 | 99.58 ± 0.12 | 99.94 ± 0.06 | 99.88 ± 0.12
Residual unit depth | 3 | 96.11 ± 1.21 | 97.79 ± 1.17 | 99.41 ± 0.20 | 99.42 ± 0.17 | 99.89 ± 0.11 | 99.85 ± 0.19
Residual unit depth | 4 | 96.25 ± 1.39 | 98.00 ± 0.79 | 99.39 ± 0.24 | 99.34 ± 0.36 | 99.49 ± 0.39 | 99.91 ± 0.08