Depth-Wise Separable Convolution Neural Network with Residual Connection for Hyperspectral Image Classiﬁcation

Neural network-based hyperspectral image (HSI) classification models have deep structures, which lead to an increase in training parameters, long training times, and excessive computational cost. Deepened network models are also likely to suffer from vanishing gradients, which limits further improvement of classification accuracy. To this end, a residual unit with fewer training parameters was constructed by combining a residual connection with depth-wise separable convolution. As the depth of the network increases, the number of output channels of each residual unit grows linearly in small increments. The deepened network continuously extracts spectral and spatial features while building a cone-shaped network structure by stacking the residual units. At the end of the model, a 1 × 1 convolution layer combined with a global average pooling layer replaces the traditional fully connected layer, completing the classification with fewer parameters. Experiments were conducted on three benchmark HSI datasets: Indian Pines, Pavia University, and Kennedy Space Center. The overall classification accuracies were 98.85%, 99.58%, and 99.96%, respectively. Compared with other classification methods, the proposed network model achieves higher classification accuracy while spending less time on training and testing samples.


Introduction
Hyperspectral imaging (HSI), an emerging remote sensing technology, can simultaneously detect two-dimensional geometric spatial information and one-dimensional continuous spectral information of the target object. This gives HSI the ability of "image-spectrum merging", i.e., abstract features that combine the image-space and spectral domains. Geometric spatial information reflects the size, shape, and other external features of the target object, while spectral information reflects its physical structure and chemical composition. Together, they allow HSI to extract comprehensive characteristics of the studied objects. As a result, hyperspectral remote sensing is now widely used in analyzing the composition of planets [1], marine plant detection [2], and shallow river bathymetry and turbidity estimation [3].
In view of the shortcomings of existing research, this work designs a lightweight deep network classification model based on experience from the literature [24,25]. Our model improves the pyramid residual unit [25] and replaces the standard convolution in the residual unit with a depth-wise separable convolution [26], which greatly reduces the number of model parameters and the computational cost. By stacking the improved pyramidal residual units, the reuse of low-level feature information is strengthened, while the gradient disappearance and overfitting phenomena of deep networks are alleviated. Moreover, the spatial-spectral information of HSI is used to effectively improve classification accuracy. All convolutional layers in the model, except those inside the residual units, use small 1 × 1 convolutions, and a global average pooling layer is used at the end of the model to replace the fully connected layer, thereby further reducing the training parameters and accelerating classification.

Depth-Wise Separable Convolution
Depth-wise separable convolution can be decomposed into a depth-wise convolution and a 1 × 1 convolution (also known as point-wise convolution). The depth-wise convolution performs a separate convolution on each channel of the input image, extracting spatial features per channel; the point-wise convolution is a 1 × 1 standard convolution on the resulting feature maps, used to merge features across channels.
As shown in Figure 1, the size of the input image is D_f × D_f × M, where D_f is the height and width of the input image and M is the number of channels. Assume the kernel size in the depth-wise convolution is k × k; the output feature map obtained by this convolution has size D_g × D_g × M (D_g is the height and width of the output map), with the same number of channels as the input, and serves as the input of the next convolution. For the point-wise convolution, the kernel size is 1 × 1, and the number of channels of each kernel must equal the number of input feature-map channels. Let the number of kernels be N; the output feature map then becomes D_g × D_g × N after convolution.
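The two-stage shape bookkeeping described above can be sketched in a few lines of pure Python (a minimal illustration with hypothetical dimensions, not code from the paper):

```python
# Sketch of the two stages of a depth-wise separable convolution, tracking
# only tensor shapes. All concrete numbers below are hypothetical examples.

def conv_out_size(d_in, k, stride=1, padding=0):
    """Spatial output size of a convolution: floor((d_in + 2p - k) / s) + 1."""
    return (d_in + 2 * padding - k) // stride + 1

def depthwise_separable_shapes(d_f, m, k, n, stride=1, padding=0):
    """Return feature-map shapes after the depth-wise and point-wise stages.

    d_f: input height/width, m: input channels,
    k: depth-wise kernel size, n: number of 1x1 point-wise kernels.
    """
    d_g = conv_out_size(d_f, k, stride, padding)
    depthwise_out = (d_g, d_g, m)   # one kernel slice per input channel
    pointwise_out = (d_g, d_g, n)   # 1x1 conv merges channels, m -> n
    return depthwise_out, pointwise_out

shapes = depthwise_separable_shapes(7, 3, 3, 8)   # 7x7x3 input, 8 output maps
```

Here a 7 × 7 × 3 input with a 3 × 3 depth-wise kernel gives a 5 × 5 × 3 intermediate map, and eight point-wise kernels turn it into 5 × 5 × 8.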

For input feature maps H of size D_f × D_f with M channels, a convolution kernel K of size k × k, and N output channels, the output feature maps G have size D_g × D_g. A standard convolution operation can be defined as:

G_j = Σ_{i=1}^{M} H_i · K_j^i + b_j,  j = 1, 2, …, N,  (1)

where H_i is the i-th map in H, G_j is the j-th map in G, K_j^i is the i-th slice of the j-th kernel, b_j is the bias of output map G_j, and the notation · stands for the convolution operator. Let P_1 be the total number of trainable parameters (without considering bias parameters) and F_1 the number of floating-point calculations in a standard convolution process. They can be calculated as shown in Equations (2) and (3) below:

P_1 = k × k × M × N,  (2)
F_1 = k × k × M × N × D_g × D_g.  (3)

From Equation (2), the number of parameters depends on the kernel size, the number of input channels M, and the number of output channels N. Equation (3) shows that the number of floating-point operations depends on P_1 and the output feature-map size D_g × D_g.
In depth-wise convolution, as shown in Figure 1, each kernel has only one slice, which convolves one input channel map; this process can be defined as:

G_j = H_j · K_j + b_j,  j = 1, 2, …, M,  (4)

where K_j is the j-th depth-wise convolution kernel. However, depth-wise convolution only filters the input channels; it does not combine them to create new features. Therefore, an additional layer of 1 × 1 standard convolution is needed to generate these new features [10]. For a depth-wise separable convolution, the parameter count P_2 and the floating-point calculation count F_2 are the sums over the depth-wise and 1 × 1 point-wise convolutions. Hence, P_2 and F_2 can be calculated as shown in Equations (5) and (6), respectively:

P_2 = k × k × M + M × N,  (5)
F_2 = (k × k × M + M × N) × D_g × D_g.  (6)

The ratio of Equation (5) to Equation (2) and the ratio of Equation (6) to Equation (3) are shown in Equations (7) and (8):

P_2 / P_1 = 1/N + 1/k²,  (7)
F_2 / F_1 = 1/N + 1/k².  (8)

It can be clearly seen that the parameters and calculations of the depth-wise separable convolution are only (1/N + 1/k²) times those of the standard convolution. This greatly reduces the parameter and computing cost of the model.
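The parameter and operation counts of Equations (2)-(8) can be checked numerically; the sketch below uses hypothetical layer sizes loosely inspired by the model's R_2 block:

```python
# Parameter and floating-point counts for a standard convolution versus its
# depth-wise separable counterpart, following Equations (2)-(8).

def standard_conv_costs(k, m, n, d_g):
    """P1 = k*k*M*N parameters; F1 = P1 * Dg * Dg operations (bias ignored)."""
    p1 = k * k * m * n
    return p1, p1 * d_g * d_g

def separable_conv_costs(k, m, n, d_g):
    """P2 = k*k*M + M*N; F2 = P2 * Dg * Dg (depth-wise plus point-wise)."""
    p2 = k * k * m + m * n
    return p2, p2 * d_g * d_g

# Hypothetical example sizes. The ratio P2/P1 = F2/F1 = 1/N + 1/k^2,
# as stated in Equations (7) and (8).
k, m, n, d_g = 3, 70, 70, 6
p1, f1 = standard_conv_costs(k, m, n, d_g)
p2, f2 = separable_conv_costs(k, m, n, d_g)
assert abs(p2 / p1 - (1 / n + 1 / k ** 2)) < 1e-12
```

With these sizes the separable variant needs 5530 parameters instead of 44,100, about 12.5% of the standard convolution, exactly the 1/N + 1/k² factor.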

Residual Unit
A deeper neural network tends to hamper convergence from the beginning, and it also faces the problem of network degradation [27]. To alleviate this, several studies built HSI classification models ([18,21]) with residual connections and attempted to solve the problem of gradient dispersion in deep networks by stacking residual modules to achieve better classification results. The basic residual unit is shown in Figure 2, where X_i and X_{i+1} are the input and output of the i-th residual unit, F represents the residual function, and H is the shortcut connection: if the identity mapping [27] is used, then H(X_i) = X_i. With these notations, the basic residual unit can be expressed as follows:

X_{i+1} = F(X_i) + H(X_i).  (9)
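Equation (9) amounts to an element-wise addition of the residual branch and the shortcut branch; a toy numeric sketch (with a stand-in residual function, not the paper's convolutions):

```python
# Minimal numeric illustration of Equation (9), X_{i+1} = F(X_i) + H(X_i),
# with an identity shortcut H(x) = x and a stand-in residual function F.

def residual_unit(x, residual_fn, shortcut_fn=lambda v: v):
    """Element-wise addition of the residual and shortcut branches."""
    return [f + h for f, h in zip(residual_fn(x), shortcut_fn(x))]

# A toy residual function standing in for the unit's two convolutions.
double = lambda xs: [2.0 * v for v in xs]

out = residual_unit([1.0, -2.0, 3.0], double)   # F(x) + x = 3x here
```

The key property is that when F outputs zeros, the unit reduces to the identity, which is what eases optimization in very deep networks.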

With the shortcut, the skipped connections increase the depth of the network, but they do not add additional parameters. Furthermore, this structure promotes training efficiency and effectively solves the problem of network degradation [25].
Although some models constructed from the above unit have achieved good results in HSI classification, they are not optimal. To further improve on existing models, our study introduces a pyramid residual unit [25] to promote classification efficiency. Two underlying ideas guide this choice. First, the pyramid residual unit is a modification of the basic residual unit that shows significant generalization ability [25], which is very beneficial for classifying hyperspectral images with unbalanced sample distributions. Second, the pyramid residual unit is a simple way to linearly increase the number of feature-map channels in small steps, which greatly reduces the training parameters and computational cost of the model. The pyramid residual unit is shown in Figure 3. Unlike a basic residual unit, the last rectified linear unit (ReLU) [28] is deleted, and batch normalization (BN) [29] is required before the first convolution operation. Specifically, the execution order of the layers is: BN→Conv→BN→ReLU→Conv→BN. When the number of channels output by the residual function differs from that of the input, the zero-padded skipped connection [25] is performed by element-wise addition. In addition, the original standard convolution in the unit is replaced by the depth-wise separable convolution to reduce the model parameters. In short, the core parts of the network model proposed in this paper are all made of this type of unit. As the network gradually deepens, the parameters do not increase significantly, so a lightweight residual classification model is constructed. The specific network structure is introduced in detail in Section 2.3.
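The zero-padded shortcut can be illustrated with a small helper (a hypothetical sketch, not the paper's implementation) that pads the identity branch with all-zero channel maps so the element-wise addition stays well defined:

```python
# Sketch of the zero-padded shortcut used when the residual branch widens
# the channel dimension: the identity branch is padded with all-zero channel
# maps so that element-wise addition with the wider residual output works.

def zero_pad_channels(feature_maps, target_channels):
    """feature_maps: list of 2D maps (one per channel). Pad with zero maps."""
    if not feature_maps:
        raise ValueError("need at least one channel to infer the spatial size")
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    extra = target_channels - len(feature_maps)
    return feature_maps + [[[0.0] * w for _ in range(h)] for _ in range(extra)]

x = [[[1.0, 2.0], [3.0, 4.0]]]          # one 2x2 channel
padded = zero_pad_channels(x, 3)        # now three channels, two all-zero
```

Because the padded channels are zero, adding them to the residual output passes that part of the residual through unchanged, and no trainable parameters are introduced.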
Figure 3. The structure of the pyramid residual unit. Unlike a basic residual unit, the last ReLU layer is deleted, and BN is required before the first convolution operation in this unit.

Proposed Model for HSI Classification
Here, a deep neural network model was constructed, as shown in Figure 4. In general, the model first reduces the dimension of the input HSI data cube through a 1 × 1 convolution, which extracts abundant spectral information. Then, three residual units, R_1, R_2, and R_3, are adopted to continuously extract both spatial contextual features and spectral features of the data cube. Finally, the combination of a 1 × 1 convolution and a global average pooling (GAP) layer, instead of a fully connected layer, fuses the extracted abstract features to complete the final classification. The code of this work is available at https://github.com/pangpd/DS-pResNet-HSI for the sake of reproducibility.

Furthermore, Table 1 lists the detailed per-layer network configuration, taking the Indian Pines dataset (presented in Table 2) as an example. First, the processed 3D HSI data of shape 11 × 11 × 200 (200 is the number of bands) is fed to the network. In the first layer, C_1, 38 1 × 1 kernels reconstruct the channel features of the original input data and retain the spatial information. Then, the R_1 block, consisting of two 3 × 3 kernels with stride = 1, is adopted to retain spatial edge information. Following the R_1 block, the first layer of R_2 uses a 3 × 3 filter with stride = 2 to perform down-sampling, and the 3 × 3 kernel in its second layer uses a stride of 1 to generate a 6 × 6 × 70 feature tensor. Then, similarly to the R_2 block, R_3 continues the down-sampling and produces a 3 × 3 × 86 feature cube with a smaller spatial size. Finally, C_2, the last convolutional layer of the model, includes 16 3 × 3 kernels for compressing discriminative feature maps; it passes the generated features to the GAP layer, which transforms them into a one-dimensional vector of size 1 × 16. Next, we combine the characteristics of HSI to expound and analyze the reasons the network was designed this way.
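The spatial sizes in this walkthrough follow from the standard convolution output-size formula; the sketch below reproduces the 11 → 6 → 3 sequence, assuming padding = 1 for the 3 × 3 layers (our assumption, chosen so the sizes match Table 1):

```python
# Tracing the spatial sizes of the Indian Pines configuration with the usual
# convolution output-size formula. padding = 1 for the 3x3 layers is an
# assumption made so that the sizes reproduce the 11 -> 6 -> 3 sequence.

def conv_out(d, k, stride, padding):
    return (d + 2 * padding - k) // stride + 1

size = 11                                # 11 x 11 input patch
size = conv_out(size, 1, 1, 0)           # C1: 1x1 conv, spatial size unchanged
size = conv_out(size, 3, 1, 1)           # R1: stride 1 keeps 11 x 11
size_r2 = conv_out(size, 3, 2, 1)        # R2 first layer: down-sample to 6
size_r3 = conv_out(size_r2, 3, 2, 1)     # R3 first layer: down-sample to 3
```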

Table 1. Per-layer configuration of the proposed network for the Indian Pines dataset (layer, output size, kernel size, stride, and padding).


Detailed Design of the Model
HSI have the continuity attribute, and the data of each band are relatively scattered. In order to speed up convergence and reduce the training time of the model, the input data cube is first zero-mean standardized before entering the network. The standardization is defined as in Equation (10):

X̃^n_{i,j} = (X^n_{i,j} − X̄^n) / σ^n,  (10)

where X^n_{i,j} represents the pixel value at the i-th row and j-th column in the n-th band of the HSI, X̄^n is the mean value of the pixels in the n-th band, and σ^n indicates the standard deviation of the pixels in the n-th band; W, H, and N denote the width, height, and total number of bands of the input HSI, respectively.
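A minimal sketch of the per-band standardization of Equation (10), using the population standard deviation (an assumption; the paper does not state the sample/population choice):

```python
# Per-band zero-mean standardization of Equation (10): each band is shifted
# by its mean and scaled by its standard deviation (population form assumed).
from statistics import mean, pstdev

def standardize_band(band):
    """band: flat list of pixel values from one spectral band."""
    mu, sigma = mean(band), pstdev(band)
    return [(x - mu) / sigma for x in band]

# Toy band with mean 5.0 and population standard deviation 2.0.
z = standardize_band([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

In practice this is applied independently to each of the N bands, so every band enters the network with zero mean and unit variance.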
In order to utilize the spectral and spatial information of HSI simultaneously, the original HSI was preprocessed into neighborhood pixel blocks of size S × S × N as the input of the model, where S × S is the size of the neighborhood window centered on a given pixel and N is the number of bands of the HSI. Considering that the input HSI data cube has numerous spectral dimensions, the Hughes phenomenon [30] can easily arise; in other words, the imbalance between high-dimensional spectral bands and a limited number of training samples tends to cause overfitting. Therefore, in the first layer of the model, a 1 × 1 bottleneck layer was used to reduce the number of original channels, so that multiple channels could be recombined without changing the spatial size. In this manner, cross-channel information integration could be realized and nonlinear characteristics could be increased (a ReLU activation function was applied after the convolution). On the whole, the 1 × 1 convolution not only retained the original spatial information of the HSI data cube, but also reduced the spectral channels of the input data and effectively extracted the spectral features of the spatial blocks.
As seen in Figure 4, R_1, R_2, and R_3 are three pyramidal residual unit modules, and each module uses two 3 × 3 depth-wise separable convolutions to extract the corresponding features. First, a convolution with stride = 1 and padding = 1 is adopted in R_1, so that the input and output feature maps have the same size and the edge spatial information of the feature map is retained. Then, R_2 and R_3 are identical down-sampling units; the convolution strides of their first and second layers are 2 and 1, respectively. They are used to extract more abstract spatial-spectral information. It should be pointed out that we adopted the zero-padded skipped connection [25] in R_1, while zero-padding with 2 × 2 average pooling was used in R_2 and R_3. The advantage of this connection is that no additional parameters are introduced, while normal addition is still ensured.
In conventional residual networks, when the output feature map keeps the same size as the input or is reduced to half of the original, the number of output channels is doubled; this method undoubtedly increases the number of parameters and the computational load. In contrast, the pyramid residual unit introduced in this paper linearly increases the number of feature-map channels in small steps, which greatly reduces the number of parameters and the computational complexity compared with some of the other existing methods. The number of output channels of each residual unit is calculated as shown in Equation (11):

D_i = C + (α × i) / R,  i = 1, 2, …, R,  (11)

where D_i is the number of output channels of the i-th residual unit, C is the number of initial channels fed to the first unit (that is, the number of output channels after the first-layer 1 × 1 convolution), R is the total number of residual units, and α is an integer greater than zero.
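Equation (11) can be evaluated directly; with the Indian Pines settings used later (C = 38, α = 48, R = 3) it reproduces the channel widths 54, 70, and 86 seen in Table 1. Integer division is assumed for the cases where α × i is not divisible by R:

```python
# Output-channel schedule of Equation (11): D_i = C + (alpha * i) / R.
# Integer (floor) division is an assumption for non-divisible cases.

def output_channels(c, alpha, r):
    return [c + (alpha * i) // r for i in range(1, r + 1)]

widths = output_channels(38, 48, 3)   # Indian Pines settings from the paper
```

The widening step per unit is α/R = 16 channels here, far gentler than the channel doubling of a conventional residual network.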
In order to avoid the information redundancy of a fully connected layer, at the end of the model a 1 × 1 convolution and a global average pooling layer were combined, instead of a fully connected layer, to fuse the features extracted from the previous layers. This approach further reduced the number of model parameters, alleviated overfitting, and gave the network a faster convergence rate [31]. It should be mentioned that the number of 1 × 1 convolution kernels should equal the number of classes in the current HSI dataset, so as to ensure that a 1 × N feature vector (N is the number of classes) is output from the global average pooling layer and the final classification is completed.
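Global average pooling simply averages each final feature map down to one scalar, which is why N kernels yield a 1 × N class vector; a minimal sketch:

```python
# Global average pooling: each of the final convolution's feature maps is
# reduced to a single scalar, yielding the 1 x N class-score vector directly.

def global_average_pool(feature_maps):
    """feature_maps: list of 2D maps (one per class channel) -> flat vector."""
    vec = []
    for fmap in feature_maps:
        total = sum(sum(row) for row in fmap)
        count = len(fmap) * len(fmap[0])
        vec.append(total / count)
    return vec

scores = global_average_pool([[[1.0, 3.0], [5.0, 7.0]],   # class-0 map -> 4.0
                              [[0.0, 0.0], [0.0, 8.0]]])  # class-1 map -> 2.0
```

Unlike a fully connected layer, this step has no trainable parameters, which is the source of the parameter savings described above.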

Datasets Description
In order to measure the classification performance of the proposed model, three benchmark HSI datasets, Indian Pines (IP), Pavia University (PU), and Kennedy Space Center (KSC), were selected for the experimental research. These datasets are available from the Grupo de Inteligencia Computacional (GIC) website (http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes). The three datasets were captured at the Indian Pines test site in northwestern Indiana, the University of Pavia in northern Italy, and the Kennedy Space Center in Florida, USA. The details of the datasets are shown in Table 2.

Experimental Setup
The model designed in this paper was implemented in Python 3.6.5 with the PyTorch 1.0.0 deep learning framework (available from https://pytorch.org). The hardware was an Intel(R) Xeon(R) E5-2697 v3 @ 2.60 GHz CPU with 32 GB of memory and an NVIDIA Tesla K20m GPU. We set the batch sizes of the Indian Pines, Pavia University, and Kennedy Space Center datasets to 64, 128, and 32, respectively. The value of α was fixed to 48, and the learning rate was uniformly set to 0.01. A stochastic gradient descent (SGD) optimizer was used to optimize the training parameters, and each experiment was run for 200 epochs. Besides that, all convolution layers were initialized with the MSRA method [32] before training.
The experiments in this section are organized as follows. The influence of the spatial size, the number of initial convolution kernels, and the number of residual units on classification accuracy was analyzed. The overall classification accuracy (OA), average classification accuracy (AA), and Kappa coefficient (K) were used as measurement indexes of the experimental results. In order to ensure the reliability of the experiments, each group of experiments was carried out 10 times, and the average value and standard deviation of the 10 runs were taken as the final experimental result. To rule out variability due to random factors (training and testing sample order), we selected 10 different random seeds for every experiment.
In the three datasets, we randomly selected 200 samples per class as the training set. In the Indian Pines and Kennedy Space Center datasets, however, some classes contain fewer than 200 samples. For example, classes 1, 7, 9, and 16 in the Indian Pines dataset make the overall sample distribution uneven. For such classes, roughly 80% of the samples were randomly selected as the training set. The remaining data were used for testing. In addition, 75% of the training samples were randomly selected for model validation. The specific sample divisions of the three datasets are shown in Tables 3-5.
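The sampling rule above can be sketched as follows (the rounding of the 80% fraction and the helper names are our assumptions for illustration):

```python
# Sketch of the per-class split rule: 200 training samples per class, except
# classes with fewer than 200 samples, where roughly 80% are taken.
import random

def training_count(class_size, cap=200, small_class_fraction=0.8):
    return cap if class_size >= cap else round(class_size * small_class_fraction)

def split_class(indices, seed=0):
    """indices: all sample indices of one class -> (train, test) index lists."""
    rng = random.Random(seed)
    shuffled = indices[:]
    rng.shuffle(shuffled)
    n_train = training_count(len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

train, test = split_class(list(range(46)))   # e.g. a 46-sample class -> 37/9
```

Fixing the seed per run mirrors the paper's use of 10 different random seeds to average out sample-order effects.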

The Impact of Spatial Size
For HSI, the neighborhood pixels of a given pixel often have similar characteristics, and the probability that they belong to the same class is very high. If the neighborhood pixel block is too small, the model cannot fully learn the spatial features of the image. Conversely, if the neighborhood pixel block is too large, it easily mixes in other classes of targets, which reduces the classification accuracy while increasing the running time and memory overhead. Therefore, it is crucial to choose an appropriate neighborhood pixel block size. In our experiments, the initial convolution kernel number C was set to 38 and the residual unit depth R was set to 3 on the three datasets. For the neighborhood pixel block size S, we chose 5, 7, 9, 11, 13, and 15 for the next experiments.
As shown in Figure 5, for all datasets, the values of OA, AA, and the Kappa coefficient increased significantly as the spatial size grew. For the Indian Pines dataset, the accuracy peaked when S was 11 and began to decline after that. For the Pavia University dataset, the accuracy rose gradually for S greater than 11 and was highest when S was 15. For the Kennedy Space Center dataset, the accuracy increased in a wavy trend as S increased and peaked when S was 15; for S between 5 and 15, the accuracy remained above 99.0%, indicating that the classification accuracy on this dataset is less affected by the neighborhood pixels. As shown in Table 6, training and test times also grew as the spatial size increased. For the Indian Pines and Pavia University datasets, a tiny amount of classification accuracy was sacrificed to balance the classification time, so S was set to 11, 13, and 9 for the Indian Pines, Pavia University, and Kennedy Space Center datasets, respectively, as input to the next experiments.


The Impact of Initial Convolution Kernels Number
In this experiment, the number of initial convolution kernels refers to the number of channels input to the R_1 unit, which is also the number of 1 × 1 convolution kernels in the first layer of the model. According to Equation (11), the number of output channels of the i-th (i > 0) residual unit is C + (α × i)/R (C and R are the numbers of 1 × 1 convolution kernels and residual units, respectively). When α and R are constants, the number of initial convolution kernels C completely determines the number of output channels of the later layers. In general, a larger C means the model can learn more features. However, too many features not only increase the model parameters, but may also lead to overfitting and reduce classification accuracy.
In order to explore the influence of the number of initial convolution kernels (C) on classification accuracy, the depth of the residual units (R) was fixed at 3. We tested 16, 24, 32, 38, and 42 initial convolution kernels on the three datasets. The experimental results are shown in Table 7; on the same dataset, as the number of initial convolution kernels increases, there is no significant difference between the training and test times. It can clearly be seen from Figure 6 that on the Indian Pines, Pavia University, and Kennedy Space Center datasets, the best classification accuracy was achieved when C was 38, 42, and 32, respectively. Consequently, we chose these values as the hyperparameters for the next experiments.

The Impact of Residual Unit Depth
In addition to the above factors, the depth of the residual units directly affects the feature extraction capability of the entire model. If the structure of the model is too shallow, it cannot extract features effectively; if it is too deep, it becomes prone to vanishing gradients and cannot further improve the classification accuracy. To test the influence of residual unit depth on classification accuracy on the three datasets, we conducted experiments with residual unit depths R of 1, 2, 3, and 4.
The experimental results are shown in Tables 8 and 9. For the same dataset, each additional residual unit increases the parameters by about 10,000. On the Indian Pines dataset, the training and testing times did not change significantly as the number of residual units increased. On the Kennedy Space Center dataset, the training and testing times gradually lengthened with the number of residual units, but they remained acceptable given that the highest classification accuracy was obtained. As shown in Figure 7, the optimal accuracy was obtained when R was 3 for the Indian Pines and Kennedy Space Center datasets. For the Pavia University dataset, the model acquired the highest classification accuracy when R was 2.
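The "about 10,000 parameters per additional unit" figure can be checked with a rough count, assuming each residual unit reduces to one depth-wise 3 × 3 convolution followed by a point-wise 1 × 1 convolution (a simplification of the actual unit, and α = 16 is again an illustrative value):

```python
def dsc_unit_params(c_in, c_out, k=3):
    # depth-wise k x k (one filter per input channel) + point-wise 1 x 1
    return k * k * c_in + c_in * c_out

def stacked_params(C, R, alpha, k=3):
    """Total parameters of R stacked units with linearly growing channels."""
    total, c_in = 0, C
    for i in range(1, R + 1):
        c_out = C + alpha * i
        total += dsc_unit_params(c_in, c_out, k)
        c_in = c_out
    return total

# At these widths, each additional unit adds on the order of 10^4 parameters,
# with the increment growing slowly as the channel count rises.
for R in (1, 2, 3, 4):
    print(R, stacked_params(38, R, 16))
```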


Results and Discussion
In order to further measure the performance of the proposed classification model, we selected several representative classification models from recent years: SVM-RBF [5], 1D-CNN [14], M3D-DCNN [20], SSRN [21], and pResNet [24]. In addition, the same model built with standard convolutions (Std-CNN) instead of depth-wise separable convolutions was constructed for detailed comparison. More specifically, SVM-RBF and 1D-CNN are spectral-based methods, while M3D-DCNN, pResNet, SSRN, Std-CNN, and the present model are spectral-spatial approaches. To ensure the fairness of the experiments, each group of experiments on the three datasets was repeated 10 times. We uniformly used the spatial size determined in Section 3.3 as the input of the spectral-spatial models. The evaluation indicators OA, AA, Kappa coefficient, and F1-score are expressed in the form of "mean ± standard deviation".
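For reference, the OA, AA, and Kappa reported throughout can all be computed from a confusion matrix; a minimal sketch (the 2 × 2 matrix below is made-up toy data, not from the experiments):

```python
def metrics(cm):
    """OA, AA, and Kappa from a confusion matrix cm[i][j] (true i, predicted j)."""
    n = sum(map(sum, cm))
    diag = [cm[i][i] for i in range(len(cm))]
    oa = sum(diag) / n                                            # overall accuracy
    aa = sum(d / sum(row) for d, row in zip(diag, cm)) / len(cm)  # average accuracy
    # expected chance agreement, from row and column marginals
    pe = sum(sum(cm[i]) * sum(r[i] for r in cm) for i in range(len(cm))) / n ** 2
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

print(metrics([[9, 1], [2, 8]]))  # OA = AA = 0.85, Kappa close to 0.7
```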
In addition, we compared the proposed depth-separable convolution network (Des-CNN) with Std-CNN in detail to verify the feasibility and effectiveness of the depth-separable convolution approach for hyperspectral image classification.

Comparison with Other Methods
The classification results for each method are shown in Tables 10-12. First, the classification accuracies (OA, AA, Kappa, and F1-score) of the proposed model are higher than those of the other classification models (SVM-RBF, 1D-CNN, M3D-DCNN, pResNet) on all datasets. Specifically, on the Indian Pines dataset, our model achieves a mean OA roughly 17.48% higher than that of the spectral models (SVM-RBF, 1D-CNN) and about 3.66% higher than that of the spectral-spatial models (M3D-DCNN, Std-CNN, and pResNet). Clearly, the accuracies of the SVM-RBF and 1D-CNN models, which use only spectral features, are significantly lower than those of the spectral-spatial models, which implies that spectral features alone cannot achieve high classification accuracy. It should also be pointed out that the OA of SSRN is only 0.04% higher than that of our model, but with a higher standard deviation. In addition, the performance of the 3D networks, M3D-DCNN and SSRN, appears limited; in particular, the accuracy of M3D-DCNN on the Indian Pines dataset is much lower than those of SSRN, Std-CNN, and our proposed depth-separable convolution model. From Table 10, the accuracies of classes No. 2 and No. 11 are only 86.29% and 80.15%, because the band information of these two classes is relatively similar, resulting in misjudgments. Moreover, for classes with few samples, the feature learning ability of SSRN is insufficient: its accuracies on classes No. 9 and No. 16 are less than 98%, while those of our model are higher than 99.44%. Furthermore, both Std-CNN and the proposed model overcome the small-sample problem, and their classification accuracies are superior to the other models on the same small-sample classes (No. 1, 7, 9, 16), indicating that our model retains strong feature extraction capability under small-sample conditions.
Finally, for the Indian Pines and Kennedy Space Center datasets, the OA, AA, Kappa, and F1-score of the depth-wise separable convolution model are better than those of Std-CNN. For the Pavia University dataset, the OA and Kappa of the two models are equal, and the AA of the proposed model is only 0.04% lower than that of Std-CNN, an insignificant gap. In addition, compared with the other classification models, the standard deviations of our model's results are the smallest, which further indicates that the proposed model ensures high accuracy while being more stable. Figures 8-10 visualize the classification results of the different models on the three datasets, together with the false-color images of the original HSI and their corresponding ground-truth maps. The classification maps clearly show that methods using only spectral features, such as SVM-RBF and 1D-CNN, produce many noise points, whereas the spectral-spatial methods (M3D-DCNN, pResNet, and ours) overcome this shortcoming; our proposed model in particular shows a better classification effect. For example, on the Indian Pines dataset, M3D-DCNN and pResNet mislabeled some pixels of Class 11 (Soybean-mintill) as Class 3 (Corn-mintill), while our proposed model labeled them correctly. Comparing against the ground-truth maps, our model achieved a more accurate and smoother classification. In the last part of the experiments, we compared the CNN-based models above (1D-CNN, M3D-DCNN, pResNet, SSRN, Std-CNN, and ours) in four aspects: floating-point operations (FLOPs), number of training parameters, training time, and testing time. As shown in Figure 11, on all three datasets, the number of parameters of our model was far lower than that of the other models.
Among the other models, pResNet had the most parameters, mainly because its structure is extremely complex, with approximately 40 layers. The parameters of the M3D-DCNN model were second only to those of pResNet. It consists of 10 layers, including two three-dimensional multi-scale convolutional layers of width 4, which greatly increase the parameters and slow down classification. The 1D-CNN model was the shallowest, with only 5 layers, and its number of parameters was much lower than those of M3D-DCNN or pResNet. In addition, from Table 13, the training and testing times of the proposed model were much less than those of M3D-DCNN and pResNet, but slightly higher than those of 1D-CNN. This is mostly because 1D-CNN uses only spectral features to complete the classification and discards the spatial information, so it is slightly faster; its classification accuracy, however, is lower.
On the other hand, we can also observe from Table 13 that the FLOPs of 1D-CNN are the lowest, only 0.16M. This is because the 1D convolution operation is relatively simple when exploring spectral information, so its feature learning ability is limited. In contrast, the FLOPs of M3D-DCNN and SSRN are as high as 234.68M and 75.31M, respectively; SSRN in particular requires about 100 times the FLOPs of our model. The reason is that 3D convolution networks require a great number of floating-point calculations during training for feature learning. By comparison, the computational burden of the 2D networks, such as pResNet and Std-CNN, is modest; notably, pResNet, despite having more parameters, has fewer FLOPs than the 3D networks. Nevertheless, neither the 3D networks nor the standard 2D network showed a performance advantage, mainly because their excessive redundant parameters limit feature representation. Finally, it should be noted that the parameters and FLOPs of Std-CNN are approximately 6 times those of the depth-separable convolution model on the same dataset, which indicates that the computational cost of our model is lower.
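The gap between Std-CNN and the depth-separable model follows directly from the per-layer cost formulas; a back-of-the-envelope sketch (the channel counts and map size below are illustrative, not the paper's exact configuration):

```python
def std_conv_macs(k, c_in, c_out, h, w):
    # multiply-accumulates of a standard k x k convolution on an h x w map
    return k * k * c_in * c_out * h * w

def dsc_conv_macs(k, c_in, c_out, h, w):
    # depth-wise k x k followed by point-wise 1 x 1
    return (k * k * c_in + c_in * c_out) * h * w

# Reduction factor is 1 / (1/c_out + 1/k^2); for k = 3 it saturates below 9.
k, c_in, c_out, h, w = 3, 64, 64, 9, 9
ratio = std_conv_macs(k, c_in, c_out, h, w) / dsc_conv_macs(k, c_in, c_out, h, w)
print(round(ratio, 2))  # close to 7.9 for these sizes
```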
In short, while pursuing high precision, the model in this paper is superior to the other models in terms of the number of parameters, training time, and testing time. Moreover, depth-wise separable convolution is feasible and, compared with standard convolution, offers advantages such as low cost, simple structure, and comparable accuracy. In practical applications, this lightweight framework does not place heavy demands on the hardware platform.

Effectiveness Analysis to Depth-Separable Convolution
In this part, we compare Std-CNN with the proposed depth-separable convolution network (Des-CNN) in detail to verify the feasibility and effectiveness of the depth-separable convolution approach for hyperspectral image classification. The OA and F1-score of the two models on the different datasets are recorded in Tables 14 and 15, respectively. Compared with Std-CNN, Des-CNN achieved higher OA and F1-score on the Indian Pines dataset. Specifically, most results of Des-CNN exceeded those of Std-CNN under the same hyper-parameters, especially for small thresholds (spatial size = 5, 7; number of initial convolution kernels = 16, 24), where the difference was more obvious. However, the performance of Des-CNN on the Pavia University and Kennedy Space Center datasets was slightly worse than that of Std-CNN: under the same configuration, the accuracy (OA and F1-score) of Des-CNN was slightly lower, but the difference was less than 1%.
Besides accuracy, the complexity and computational load of the model are another important consideration. Since the standard and depth-separable convolutions appear in the residual units, the complexity of the network framework is mainly determined by the depth of the residual units. The training parameters and FLOPs of Std-CNN and Des-CNN at different residual unit depths are shown in Figure 12. Obviously, the training parameters and FLOPs of Std-CNN were much higher than those of Des-CNN under the same configuration. From Figure 12a, the training parameters of Des-CNN at every residual unit depth on the three datasets were all lower than 50,000, while the parameters of Std-CNN were mostly more than 100,000. From Figure 12b, the FLOPs of Std-CNN were higher than 5M, and even more than 10M on Indian Pines and Pavia University, while the FLOPs of Des-CNN were less than 5M on all three datasets. From the above analysis, compared with Std-CNN, Des-CNN achieved competitive results on the Indian Pines dataset and was slightly worse on the Pavia University and Kennedy Space Center datasets, but the difference was very small. In addition, the model complexity of Des-CNN was significantly lower than that of Std-CNN on all three datasets. Consequently, this is sufficient to demonstrate that depth-separable convolution is feasible for hyperspectral classification.
Compared with Std-CNN, the proposed model is more lightweight and better suited to the limited computer hardware available in practice.

Conclusions
In this paper, a lightweight model for HSI classification was constructed and discussed. Experimental results show that it has fewer parameters and faster classification speed. In the first layer of the model, 1 × 1 convolution kernels were used to recombine the input channels of the HSI and realize cross-channel information integration, reducing the number of spectral channels. Next, the spatial-spectral features were extracted by the residual units of the middle layers. At the end of the model, a combination of a 1 × 1 convolution and a global average pooling layer was used to replace the fully connected layer to complete the final classification, which further reduced the number of model parameters and sped up classification while ensuring accuracy.
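The saving from replacing the fully connected layer can be illustrated with a quick parameter count (the sizes are hypothetical: 80 feature channels on a 9 × 9 map and 16 classes, as on Indian Pines):

```python
def fc_head_params(c, h, w, n_classes):
    # flatten the c x h x w feature map into a dense layer (weights + biases)
    return c * h * w * n_classes + n_classes

def gap_head_params(c, n_classes):
    # 1 x 1 conv maps c channels to n_classes; global average pooling adds none
    return c * n_classes + n_classes

print(fc_head_params(80, 9, 9, 16))  # 103696 parameters
print(gap_head_params(80, 16))       # 1296 parameters
```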
In the experiments, the effects of spatial size, the number of initial convolution kernels, and the depth of the residual units on the classification accuracy were first analyzed. We then compared our experimental results with those of other classification models. The results show that the proposed model reduces the number of parameters to a large extent and classifies faster while ensuring higher accuracy. In addition, the proposed model has powerful feature extraction capabilities, as it still shows high classification accuracy on small-sample data. Finally, we explored the impact of standard convolution versus depth-wise separable convolution on classification accuracy, the number of parameters, and FLOPs. The results showed that depth-separable convolution is feasible for hyperspectral classification; compared with Std-CNN, the proposed model is more lightweight and better suited to the limited computer hardware available in practice.
In future work, three-dimensional convolutions will be used to extract spectral features and two-dimensional convolutions to fuse spatial-spectral features, and dense connections will be introduced to speed up the flow of feature information, further reduce training and testing time, and accelerate model convergence, so as to build a faster and more effective HSI classification model.