Small Sample Hyperspectral Image Classification Method Based on Dual-Channel Spectral Enhancement Network

Abstract: Deep learning has achieved significant success in the field of hyperspectral image (HSI) classification, but challenges remain when the number of training samples is small. Feature fusion approaches based on multi-channel and multi-scale feature extraction are attractive for HSI classification where few samples are available. In this paper, based on feature fusion, we proposed a simple yet effective CNN-based Dual-channel Spectral Enhancement Network (DSEN) to fully exploit the features of the small labeled HSI samples for HSI classification. We worked with the observation that, in many HSI classification models, most of the incorrectly classified pixels of HSI are at the border of different classes, which is caused by feature obfuscation. Hence, in DSEN, we specially designed a spectral feature extraction channel to enhance the spectral feature representation of the specific pixel. Moreover, a spatial-spectral channel was designed using small convolution kernels to extract the spatial-spectral features of HSI. By adjusting the fusion proportion of the features extracted from the two channels, the expression of spectral features was enhanced in the fused features for better HSI classification. The experimental results demonstrated that the overall accuracy (OA) of HSI classification using the proposed DSEN reached 69.47%, 80.54%, and 93.24% when only five training samples for each class were selected from the Indian Pines (IP), University of Pavia (UP), and Salinas Scene (SA) datasets, respectively. The performance improved when the number of training samples increased. Compared with several related methods, DSEN demonstrated superior performance in HSI classification.


Introduction
Hyperspectral remote sensing is an important research field in remote sensing science [1]. Typically, the number of spectral bands and the data size of HSI are much greater than those of ordinary images, which presents challenges for the storage and analysis of HSI. However, due to the rich spatial and spectral information it contains, HSI plays an especially important role in a wide range of applications, such as vegetation research [2], fine agriculture [3,4], agricultural product detection [5], and environmental monitoring [6]. The classification and recognition of ground cover based on HSI is an important step in promoting the application of hyperspectral remote sensing technology. HSI classification determines the class of each pixel of an HSI and has become a hot research topic in the field of hyperspectral remote sensing [7].
The traditional HSI classification methods include support vector machine (SVM) [8], random forest [9,10], etc. Because HSI has a very large number of spectral bands relative to the number of labeled samples, the Hughes phenomenon easily occurs in HSI classification. Therefore, researchers proposed various methods for the dimensionality reduction of HSI, such as PCA [11], PPCA [12], and ICA [13]. Dimensionality reduction can effectively eliminate the redundancy of HSI data, so that HSI features can be extracted more effectively. In traditional HSI classification, the choice of classification method and the setting of intermediate parameters depend on past experience, resulting in unsatisfactory classification accuracy and robustness.
Researchers have proposed various solutions to improve HSI classification performance when few samples are available, such as using unsupervised methods to select discriminative bands [45] and extracting HSI features after dimensionality reduction [46]. Other methods, such as meta-learning [47], transform learning [48], and cross-scene classification [49], have also been applied to HSI classification with few samples. Because they extract discriminative features, multi-channel and multi-scale feature fusion approaches are also attractive for HSI classification when the number of training samples is low.

Contribution and Paper Organization
To make HSI classification perform well when only a few HSI samples are available, this paper combines two channels. Besides a channel that extracts joint spatial-spectral features using small convolution kernels, we focused on the spectral features of the specific pixel and designed a dedicated HSI spectral feature extraction channel that uses 1 × 1 3D and 2D convolution to avoid weakening the pixel's spectral feature representation. Moreover, the joint spatial-spectral features and the spectral features extracted by the designed model were fused through a plastic layer to enhance classification performance.
The rest of the paper is organized as follows. Section 2 presents the techniques used in the proposed model, such as hybrid convolution, Dropout, and Dropblock. Section 3 introduces the details and parameter settings of the proposed model. Section 4 reports the experimental setup and results. Section 5 provides some conclusions.

3D-2D Hybrid Convolution
Two-dimensional convolution for HSI classification [50], as shown in Figure 1, is generally divided into the following three steps: data dimension reduction (DR), feature extraction, and classification. DR is performed to reduce the spectral dimension of the original HSI data. Its main purpose is to reduce the number of HSI spectral bands, remove the redundancy between bands, and facilitate subsequent feature extraction. Feature extraction uses convolution operations to extract features from the data. The 2D convolution operation for the reduced HSI is similar to that for an ordinary image, except for the number of channels, which depends on the dimensionality reduction operation. The 2D feature information of the reduced HSI is obtained after several convolution operations. Classification applies a classification function, such as SoftMax, to the features extracted by the convolution layers to obtain specific classes. The 2D convolution model is simple and has a small number of parameters, but the extracted features lack the spectral dimension, which reduces the classification performance for HSI.
3DCNN is used to extract the spatial-spectral joint features of HSI [51], as shown in Figure 2. Unlike 2D convolution, 3D convolution can be applied directly to the raw HSI data and convolves in both the spatial and spectral dimensions. Compared to 2D convolution, the features obtained from 3D convolution contain an additional spectral dimension and can be used to improve classification performance. However, 3DCNN suffers from increased model computation, a large number of parameters, and a difficult training process.
There are therefore shortcomings in using 2D or 3D convolution alone. To solve this problem, many researchers proposed the use of 3D-2D hybrid convolution [52,53], as shown in Figure 3. First, 3DCNN is used to extract the spatial-spectral joint features. Then, the last two dimensions of the features extracted by the 3D convolution layer are combined to achieve dimension reduction. The data after dimension reduction are used as the input of the 2D convolution layer to further extract more abstract spatial features. Hybrid convolution not only preserves the feature extraction ability, but also reduces the complexity and number of parameters of the model, which makes the model easier to train.
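The step that joins the 3D and 2D stages, combining the last two dimensions of the 3D features, can be illustrated with a minimal NumPy sketch. It assumes a channels-last layout (height, width, spectral depth, feature maps); the function name `merge_spectral_into_channels` and all sizes are hypothetical and are not taken from the paper.

```python
import numpy as np

def merge_spectral_into_channels(features_3d: np.ndarray) -> np.ndarray:
    """Collapse (height, width, depth, channels) into (height, width, depth * channels)."""
    height, width, depth, channels = features_3d.shape
    # The spectral depth and the feature-map axis become one 2D channel axis.
    return features_3d.reshape(height, width, depth * channels)

# Hypothetical output of the last 3D convolution module: a 5 x 5 spatial grid,
# 4 remaining spectral positions, and 8 feature maps.
features_3d = np.random.default_rng(0).normal(size=(5, 5, 4, 8))
features_2d = merge_spectral_into_channels(features_3d)
print(features_2d.shape)  # (5, 5, 32)
```

After this reshape, an ordinary 2D convolution sees 32 input channels per pixel, so no spectral information is discarded; it is only rearranged.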

Dropout and Dropblock
Overfitting is a problem that is encountered in deep learning. When the number of training samples is low, the model learns the unique features from a few samples and ignores more general features, resulting in good performance in training and poor performance in testing. To solve this problem, researchers proposed some solutions, in which Dropout [54] was widely used in the application of deep learning due to its simple implementation and excellent results.
During forward propagation with Dropout, a neuron stops working with a certain probability (Figure 4), which makes the model more generalizable because it does not rely heavily on specific local features. With Dropout, a given set of neurons does not necessarily appear together in the network at every step. In this way, the updating of weights no longer depends on the joint action of hidden nodes with fixed relationships, which prevents some features from being effective only in the presence of other specific features. Dropout thus forces the network to learn more robust features and enhances the generalizability of the model. Dropout is generally used in the fully connected layers of a deep learning model, rather than in the convolution layers. This is because a convolution kernel corresponds to a region: if only a few neurons are stopped, the convolution kernel can still learn the information from adjacent neurons, which does not improve the generalizability of the model. Dropblock [55] omits multiple neurons in continuous regions, as shown in Figure 5, whose size is equal to that of the convolution kernel in the current layer. When the convolution kernel extracts a feature, it loses the feature information of the dropped region. The network therefore focuses on learning the features of other regions for classification, which improves the generalizability of the model.
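The difference between dropping individual neurons and dropping continuous regions can be made concrete with a small NumPy sketch. This is a simplified illustration under stated assumptions, not the implementation from [54] or [55]: `dropout` uses the common inverted scaling, and `dropblock_mask` assumes an odd `block_size`, clips blocks at the border, and omits the rescaling step. The names `drop_prob` and `seed_prob` are hypothetical.

```python
import numpy as np

def dropout(x, drop_prob, rng):
    """Inverted Dropout: zero each unit independently and rescale the survivors."""
    keep = rng.random(x.shape) >= drop_prob
    return x * keep / (1.0 - drop_prob)

def dropblock_mask(height, width, block_size, seed_prob, rng):
    """Return a 0/1 mask in which block_size x block_size squares are zeroed."""
    mask = np.ones((height, width))
    half = block_size // 2
    # Each position becomes the center of a dropped block with probability seed_prob.
    seeds = rng.random((height, width)) < seed_prob
    for row, col in zip(*np.nonzero(seeds)):
        top, left = max(row - half, 0), max(col - half, 0)
        mask[top:row + half + 1, left:col + half + 1] = 0.0
    return mask

rng = np.random.default_rng(0)
dense_out = dropout(np.ones(1000), drop_prob=0.5, rng=rng)   # fully connected features
feature_map = np.ones((9, 9))                                # one convolutional feature map
block_out = feature_map * dropblock_mask(9, 9, block_size=3, seed_prob=0.05, rng=rng)
```

With Dropout, the zeros are scattered, so a 3 × 3 kernel centered on a dropped position still sees its neighbors. With the block mask, the whole neighborhood is removed at once.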

Figure 4. The difference between a standard neural network model and a neural network using Dropout (the red solid circles represent the dropped-out neurons).
The proposed model extracts the features of HSI using two convolution channels to improve the performance of HSI classification, and uses Dropblock in the convolution layers and Dropout in the fully connected layers to manage the overfitting problem. The two feature extraction channels first use 3D convolution to extract features, and then carry out 2D convolution on the extracted features to extract deeper features, which not only yields discriminative features, but also enhances the generalizability of the model.

The Design of DSEN
As shown in Figure 6, the designed DSEN has two convolution channels. The upper channel is a spatial-spectral extraction channel, which extracts the spatial-spectral joint features of HSI from the data cube after dimension reduction. The lower channel is a spectral extraction channel, which focuses on the spectral feature representation of a specific pixel and extracts the spectral features from HSI. By adjusting the fusion proportion of the features extracted from the two channels, the expression of spectral features can be enhanced in the fused features for better HSI classification. The model is mainly composed of the following four modules: data preprocessing, feature extraction, feature fusion, and classification. These modules are described in detail below.

Data Preprocessing
The raw HSI cannot be used directly as the input of the proposed model and must be processed first, as shown in Figure 7. Assume that the size of the raw HSI data is W × H × B, where W and H are the length and width of the HSI, and B is the number of spectral bands.
For the spatial-spectral extraction channel, the size of the input data is S × S × L, where S is the length/width of the cube and L, the number of retained spectral bands, is much smaller than B. Classification assigns a class to the central pixel of each data cube. To avoid the Hughes effect, PCA is used to reduce the dimension of the raw data. Assuming that L spectral bands are retained from the raw HSI, the size of the reduced HSI is W × H × L. To make full use of the data, mirror padding is first carried out on the four sides of the reduced HSI to obtain a data cube of size (W + S/2) × (H + S/2) × L, and the data are then divided into cubes of size S × S × L. The number of cubes is therefore equal to the number of original pixels, giving a total of W × H data cubes.
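The preprocessing for the spatial-spectral channel can be sketched in NumPy as follows. This is only an illustration under assumptions that the text does not fix: S is odd, the padding is ⌊S/2⌋ pixels on each side so that every original pixel is the center of exactly one cube, "mirror padding" is taken to be NumPy's `reflect` mode, and PCA is computed with a plain SVD. The function names `pca_reduce` and `extract_cubes` and the toy sizes are hypothetical.

```python
import numpy as np

def pca_reduce(hsi, n_components):
    """Project the B spectral bands of a (W, H, B) cube onto n_components principal axes."""
    w, h, b = hsi.shape
    flat = hsi.reshape(-1, b).astype(float)
    flat -= flat.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    return (flat @ vt[:n_components].T).reshape(w, h, n_components)

def extract_cubes(reduced, patch_size):
    """Mirror-pad the reduced HSI and cut one patch_size x patch_size x L cube per pixel."""
    w, h, l = reduced.shape
    margin = patch_size // 2
    padded = np.pad(reduced, ((margin, margin), (margin, margin), (0, 0)), mode="reflect")
    cubes = np.empty((w * h, patch_size, patch_size, l))
    for i in range(w):
        for j in range(h):
            cubes[i * h + j] = padded[i:i + patch_size, j:j + patch_size]
    return cubes

rng = np.random.default_rng(0)
hsi = rng.random((6, 7, 20))          # toy raw HSI: W = 6, H = 7, B = 20
reduced = pca_reduce(hsi, 3)          # L = 3
cubes = extract_cubes(reduced, 5)     # S = 5
print(reduced.shape, cubes.shape)     # (6, 7, 3) (42, 5, 5, 3)
```

The central pixel of cube number i · H + j is the original pixel (i, j), which is the pixel whose class the cube is used to predict.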
For the spectral extraction channel, the input data size was 1 × 1 × B. Because only spectral features were extracted from the spectral channel, the original data were standardized based on the spectral dimension, rather than global standardization. The purpose of this was to maximize the spectral dimension features. The standardized formula (Equation (1)) is:
$$x' = \frac{x - \mu}{\delta} \qquad (1)$$
where x represents the original data, x' is the standardized data, µ is the average, and δ is the standard deviation. The standardized data were divided into W × H blocks with a size of 1 × 1 × B.
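A minimal NumPy sketch of this standardization is given below. The text does not state exactly how µ and δ are computed, so the sketch makes an explicit assumption: they are taken per pixel over the B bands of that pixel's spectrum. The small constant `eps` is an added safeguard against division by zero and is not part of the paper; the function name `standardize_spectra` is hypothetical.

```python
import numpy as np

def standardize_spectra(hsi, eps=1e-8):
    """Standardize each pixel's spectrum of a (W, H, B) cube along the band axis."""
    mean = hsi.mean(axis=-1, keepdims=True)   # mu, one value per pixel (assumption)
    std = hsi.std(axis=-1, keepdims=True)     # delta, one value per pixel (assumption)
    return (hsi - mean) / (std + eps)         # (x - mu) / delta

rng = np.random.default_rng(0)
hsi = rng.random((4, 5, 30))                  # toy raw HSI: W = 4, H = 5, B = 30
standardized = standardize_spectra(hsi)

# One 1 x 1 x B block per pixel, W * H blocks in total.
blocks = standardized.reshape(-1, 1, 1, hsi.shape[-1])
print(blocks.shape)  # (20, 1, 1, 30)
```

If the statistics were instead meant per band across all pixels, only the `axis` argument would change.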

Feature Extraction
As shown in Figure 6, the two extraction channels of the proposed model were implemented based on 3D-2D hybrid convolution, with similar structural settings, as shown in Figure 8. Each extraction channel consists of five convolution modules: three 3D convolution modules and two 2D convolution modules. Each 3D convolution module contains a convolution layer, a Batch Normalization layer [56], a ReLU activation function layer, and a 3D pooling layer. The Batch Normalization layer and ReLU function effectively alleviate the vanishing gradient problem and accelerate the convergence of the model. The pooling layer retains the main features and reduces the calculation cost. The spatial-spectral extraction channel is similar to other CNN-based methods in that it uses multiple convolutional layers to extract features [32]. In this paper, a 3D convolution kernel of size 3 × 3, instead of a larger size, was used in the spatial-spectral extraction channel. Compared to a single large convolution kernel, multiple small convolution kernels have a stronger feature extraction ability and lower computation cost.
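The cost argument for small kernels can be checked with simple arithmetic. The sketch below assumes cubic kernels (3 × 3 × 3 versus 5 × 5 × 5), stride 1, no pooling between the stacked layers, and no bias terms; the channel count of 8 is purely illustrative and is not taken from Table 1.

```python
def conv3d_weights(kernel, in_channels, out_channels):
    """Number of weights in one 3D convolution layer with a cubic kernel (biases ignored)."""
    return kernel ** 3 * in_channels * out_channels

def stacked_receptive_field(kernel, layers):
    """Receptive field along one axis of `layers` stacked stride-1 convolutions."""
    return 1 + layers * (kernel - 1)

channels = 8  # hypothetical number of feature maps kept constant across layers
one_large = conv3d_weights(5, channels, channels)
two_small = 2 * conv3d_weights(3, channels, channels)

print(stacked_receptive_field(3, 2))  # 5, the same coverage as one kernel of size 5
print(one_large, two_small)           # 8000 3456
```

Under these assumptions, two stacked small kernels cover the same neighborhood with fewer than half the weights, and they apply two nonlinearities instead of one.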
It should be noted that in the design of the spectral channel, multiple 1 × 1 3D convolution kernels were adopted to extract the spectral features of HSI. The reason is that the spectral features of neighboring pixels are introduced when large convolution kernels are used to extract HSI spectral features. When the number of samples is sufficient, the effect of the unrelated features brought by neighboring pixels is insignificant. However, if samples are scarce, these features interfere with the expression of the spectral features of the specific pixel, thereby affecting classification performance. Using a convolution kernel of size 1 × 1 makes the model focus only on the specific pixel when extracting spectral features, which avoids introducing unrelated information and enhances the classification performance of the model.
As shown in Figure 8, a reshape operation occurs after the last 3D convolution module. Its purpose is to transform the output of the 3D convolution module into the format expected by the subsequent 2D convolution module: the last two dimensions of the features extracted by 3D convolution are merged. The 2D convolution module contains a 2D convolution layer, a batch normalization layer, and a ReLU function. The pooling layer is removed because the feature size is already small after multiple downsampling operations. In the 2D convolution module, the size of the convolution kernel is 1 × 1, which can effectively integrate feature information. After the previous reshape operation, the number of channels is large, and the number of feature channels can be reduced by controlling the number of 2D convolution kernels.
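Both properties of 1 × 1 kernels discussed here, that they look only at the specific pixel and that the number of kernels sets the number of output channels, can be seen in a short NumPy sketch. It assumes a channels-last feature map and treats the 1 × 1 convolution as a per-pixel matrix product; the function name `conv1x1` and the sizes 32 and 16 are hypothetical.

```python
import numpy as np

def conv1x1(features, weights, bias):
    """Apply a 1 x 1 convolution to a (height, width, in_channels) feature map."""
    # Each pixel's channel vector is multiplied by the same (in_channels, out_channels) matrix.
    return features @ weights + bias

rng = np.random.default_rng(0)
features = rng.normal(size=(5, 5, 32))   # hypothetical reshaped 3D-convolution output
weights = rng.normal(size=(32, 16))      # 16 kernels reduce 32 channels to 16
bias = np.zeros(16)

out = conv1x1(features, weights, bias)

# Changing a neighboring pixel leaves the output at the center pixel untouched.
perturbed = features.copy()
perturbed[2, 3] += 10.0
out_perturbed = conv1x1(perturbed, weights, bias)
print(out.shape, np.allclose(out[2, 2], out_perturbed[2, 2]))  # (5, 5, 16) True
```

A larger kernel would mix the perturbed neighbor into the center pixel's output, which is the interference the spectral channel is designed to avoid.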

Feature Fusion and Classification
The features extracted by each of the two extraction channels are turned into a one-dimensional vector by a flatten layer. To obtain better classification results, the features extracted from the two channels need to be fused. This is done by splicing the two one-dimensional feature vectors into a new one-dimensional vector, whose dimension is the sum of the two feature dimensions.
Since the input data and the convolution kernel sizes of the two extraction channels differ, the feature dimensions obtained from the convolution layers are quite different, and the dimension of the features extracted by the spatial-spectral channel is much larger than that of the spectral channel. If the features extracted from the two channels are fused directly, the feature expression of the spectral channel is weakened. To avoid this problem, the features extracted by each channel are first passed through their own fully connected layer, and the outputs of these fully connected layers are then fused. The dimension of the features contributed by each channel can therefore be set by controlling the number of neurons in its fully connected layer. In this paper, this layer is referred to as the plastic layer. The features are fused after passing through the plastic layer (Figure 9) and are then used to obtain the classification result with the SoftMax function.
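The fusion scheme can be sketched as a forward pass in NumPy. All sizes here are hypothetical and are not the values from Table 1: the flattened spatial-spectral feature is given 512 dimensions, the flattened spectral feature 64, and both plastic layers project to 128 so that the two channels contribute equally to the fused vector. The ReLU activation in the plastic layers, the random weights, and the names `dense` and `fuse_and_classify` are likewise assumptions.

```python
import numpy as np

def dense(x, weights, bias):
    """Fully connected layer followed by ReLU (the activation is an assumption)."""
    return np.maximum(x @ weights + bias, 0.0)

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def fuse_and_classify(spatial_feat, spectral_feat, params):
    """Pass each channel through its plastic layer, concatenate, and classify."""
    spatial = dense(spatial_feat, *params["plastic_spatial"])
    spectral = dense(spectral_feat, *params["plastic_spectral"])
    fused = np.concatenate([spatial, spectral], axis=-1)
    weights, bias = params["classifier"]
    return softmax(fused @ weights + bias)

rng = np.random.default_rng(0)
n_classes = 16  # illustrative value of C
params = {
    "plastic_spatial": (rng.normal(size=(512, 128)) * 0.05, np.zeros(128)),
    "plastic_spectral": (rng.normal(size=(64, 128)) * 0.05, np.zeros(128)),
    "classifier": (rng.normal(size=(256, n_classes)) * 0.05, np.zeros(n_classes)),
}
probs = fuse_and_classify(rng.normal(size=(4, 512)), rng.normal(size=(4, 64)), params)
print(probs.shape)  # (4, 16)
```

Changing the output widths of the two plastic layers, for example 64 and 192 instead of 128 and 128, is how the fusion proportion between the channels would be adjusted in this sketch.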

Table 1 shows the basic parameters of the two feature extraction channels and the fully connected layers of the model, in which C represents the number of classes.

Table 1. Parameters of Spatial-Spectral channel, Spectral channel, and Fully connected layers.

Spatial-Spectral Channel Spectral Channel Fully Connected Layers
Layer Channels/P Size Layer Channels/P Size Layer Type Parameter

Experimental Data Sets
This paper used three public hyperspectral image datasets to test the classification performance of the proposed model: Indian Pines, University of Pavia, and Salinas Scene, which are detailed in Table 2. The Indian Pines (IP) dataset was collected using the AVIRIS sensor over the Indian Pines experimental field in northwest Indiana. The image size is 145 × 145 pixels, and there are 224 spectral bands. The wavelength range is 400-2500 nm. The scene mainly comprises about two-thirds agriculture, one-third forest, and a small amount of natural vegetation. Apart from crops with coverage of less than 5%, the scene contains two roads, one railway, and low-density houses and buildings, and the ground truth is divided into 16 classes. The number of bands is reduced to 200 by removing 24 bands covering the water absorption region.
The University of Pavia (UP) dataset was obtained using a ROSIS sensor during a flight over Pavia, northern Italy. The Pavia University scene is composed of 610 × 340 pixels with 103 spectral bands located in the wavelength range of 430-860 nm. The ground cover is divided into 9 urban land cover classes.
The Salinas Scene (SA) dataset was captured by the AVIRIS sensor over Salinas Valley, California. The data contain 512 × 217 pixels with a spatial resolution of 3.7 m and a total of 224 spectral bands, of which 204 remain after removing 20 water absorption bands. This dataset is divided into 16 classes, mainly composed of crops.

Experimental Setup
The training and testing of the network models in this paper were carried out on the same server. The server hardware configuration was as follows: Intel(R) Xeon(R) Silver 4114 CPU @ 2.20 GHz, 64 GB RAM, and an RTX 2060 GPU with 6 GB memory. Software configuration was as follows:

To verify the effectiveness of the proposed model, this paper used the overall accuracy (OA), average accuracy (AA), and Kappa coefficient to evaluate HSI classification performance. OA is the ratio of the number of correctly classified samples to the total number of test samples. AA is the mean of the per-class classification accuracies. The Kappa coefficient is used to test consistency and measure classification accuracy. DSEN was compared with HybridSN [32], MAPC [10], MFFN [44], and DC-CNN [39].

Each dataset was first randomly divided into 70% for training and 30% for testing; when selecting training samples, the upper limit per class was then set to 5, 10, and 15, respectively. In other words, the number of training samples for each class was at most 5, 10, or 15 in the experiments. In the training process, the optimizer was Adam, the learning rate was 0.001, the decay rate was 0.000001, and the batch size was 16.
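The three evaluation metrics described above can be sketched in a few lines of Python. The paper does not give the Kappa formula, so this sketch assumes the standard Cohen's kappa computed from the confusion matrix; the function names and the toy labels are illustrative only and are not experimental data.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true class, columns index the predicted class."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def oa_aa_kappa(cm):
    """Return (overall accuracy, average accuracy, Cohen's kappa)."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    row_sums = [sum(cm[i]) for i in range(k)]
    col_sums = [sum(cm[i][j] for i in range(k)) for j in range(k)]

    # OA: correctly classified samples over all test samples.
    oa = sum(cm[i][i] for i in range(k)) / n

    # AA: mean per-class accuracy, skipping classes with no test samples.
    per_class = [cm[i][i] / row_sums[i] for i in range(k) if row_sums[i] > 0]
    aa = sum(per_class) / len(per_class)

    # Kappa: agreement corrected for the agreement expected by chance
    # (undefined if expected == 1, which this sketch does not handle).
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (n * n)
    kappa = (oa - expected) / (1 - expected)
    return oa, aa, kappa


# Toy labels for three classes, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
oa, aa, kappa = oa_aa_kappa(cm)
print(cm, round(oa, 4), round(aa, 4), round(kappa, 4))
```

OA weights every test sample equally, so large classes dominate it, whereas AA weights every class equally; reporting both, together with Kappa, gives a fuller picture when class sizes are unbalanced.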
The plastic layer in the model controls the proportion of feature fusion of the two channels, and different fusion ratios have different effects on the classification performance. To clarify this influence, a series of experiments was conducted on different fusion ratios of the spatial-spectral channel (window size of input data: 25 × 25) and the spectral channel. The experimental results of model testing are presented in Table 3. Classification performance was best when the proportion was 1:1. Changing the fusion proportion of the two categories of features reduced performance to varying degrees, so we selected 1:1 as the fusion ratio, and this setting was used in all subsequent experiments.

Table 3. Classification performance of different fusion ratios (OA, Training sample = 10).

To confirm the influence of the window size S of the input data for the spatial-spectral channel, experiments were carried out with different values of S. The number of training samples in the experiment was 10. The classification results of model testing are listed in Table 4. The time consumed with different S is shown in Table 5. With the increase in S, the performance of the model improved. The performance improvement is obvious as S grows from 21 × 21 to 27 × 27, but a larger window also increases the complexity of the model, the amount of calculation required, and the time consumed. When S = 27 × 27, compared to S = 25 × 25, the performance is slightly improved, but the time consumption is obviously increased. Therefore, in the subsequent experiments, the window size for the spatial-spectral channel is set to 25 × 25.

Figure 10 shows the influence of the number of training samples on classification accuracy for each dataset. Figure 11 shows the training process of DSEN with different numbers of training samples. The model begins to converge when epoch = 50, which indicates that DSEN converges rapidly. When epoch = 100, the loss and accuracy of the model tend to remain stable without obvious fluctuations. With the increase in the number of training samples, the fluctuation of loss is lower, and the convergence is faster.
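The window size S discussed above determines how much spatial context the spatial-spectral channel receives for each pixel. The following pure-Python sketch extracts an S × S neighbourhood around a pixel from a small cube stored as nested lists. The paper does not state how border pixels are handled, so zero-filling is an assumption here, and `extract_patch` and the toy cube are hypothetical.

```python
def extract_patch(image, row, col, size):
    """Return the size x size neighbourhood centred on (row, col).

    `image` is a nested list indexed as image[r][c] -> list of band values.
    Positions outside the image are zero-filled (an assumed border policy).
    """
    if size % 2 == 0:
        raise ValueError("window size must be odd so the pixel is centred")
    half = size // 2
    height, width, bands = len(image), len(image[0]), len(image[0][0])
    patch = []
    for r in range(row - half, row + half + 1):
        patch_row = []
        for c in range(col - half, col + half + 1):
            inside = 0 <= r < height and 0 <= c < width
            patch_row.append(list(image[r][c]) if inside else [0.0] * bands)
        patch.append(patch_row)
    return patch


# Tiny hypothetical cube: 4 x 5 pixels, 3 bands; the value encodes the position.
cube = [[[float(10 * r + c + 1)] * 3 for c in range(5)] for r in range(4)]
patch = extract_patch(cube, row=0, col=0, size=3)

print(len(patch), len(patch[0]), len(patch[0][0]))
print(patch[1][1], patch[0][0], patch[2][2])
```

Because the number of positions in a patch grows with the square of S, a 27 × 27 window holds 729 positions against 625 for 25 × 25, about 17% more input per sample, which is in line with the higher time consumption reported for larger windows.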
Because DSEN uses 3D-2D hybrid convolution, its number of parameters is also reduced. Table 6 shows the total training time (10 training samples) and testing time of each model. Compared with MAPC, the time consumption of the CNN-based models is significantly lower. The time consumption of DC-CNN was the lowest because it uses only 2D and 1D convolution. Compared with HybridSN, DSEN has fewer parameters but takes more time, because it has more network layers and a more complex model structure.
Although MFFN also uses 3D convolution to extract features, it has no mechanism to reduce overfitting. The model performs well when the number of samples is sufficient, but overfitting is aggravated when few samples are available, so the final performance of MFFN is weakened. DC-CNN is also a dual-channel design, consisting of a 2D convolution channel and a 1D convolution channel. Its performance is stronger than that of HybridSN and MFFN, which also suggests that, when the number of samples is small, enhancing the expression of spectral features can improve the classification performance.

Comparison and Analysis
To investigate the role of the spatial-spectral and spectral channels in DSEN, this paper conducted experiments relying on a single channel. In each experiment, only one channel was used to extract features for classification. The experimental results were compared with those of the dual-channel model, as shown in Table 7. The spatial-spectral channel extracts joint spatial-spectral features, which are very effective for HSI classification. As a result, the performance gap between the spatial-spectral channel and the dual-channel model is significantly smaller than that between the spectral channel and the dual-channel model.
Although the classification performance of the features extracted by the spectral channel alone is not satisfactory when the number of HSI samples is low, combining them with other features enhances the expression of spectral features and can significantly improve the classification performance of the model. This also indicates that fully exploiting the spectral information of pixels is effective in HSI classification under scarce samples.

Table 8 provides the classification performance of model testing. Among the methods, MAPC is based on random forest and the rest are based on CNN. Among the CNN-based methods, DSEN demonstrated the best overall classification performance and MFFN had the worst performance. The multi-channel DSEN and DC-CNN performed better than the single-channel HybridSN and MFFN, which indicates that the multi-channel design can improve the classification performance of the model. Compared with HybridSN and MFFN, MAPC and DSEN have a significant lead regardless of the number of training samples. Compared with MAPC, DSEN leads in terms of performance when the samples are extremely few, such as when sample = 5 or 10. When the number of samples increases to 15 for each class, the performance gap between the two methods is very small, but DSEN still has a marginal advantage.

Figures 12-14 show the classification results of DSEN trained with different numbers of samples on the three datasets, respectively. The overall classification accuracy improves significantly as the number of training samples increases, and the accuracy for individual classes also improves. In some of the datasets, there is a large number of classification errors for one category with a training sample size of 5. This situation improves significantly as the training sample size increases.
It is worth noting that most of the incorrectly classified pixels are at the borders between classes. This is because the spatial neighborhood information of each pixel is used in the classification, so feature information from other classes is mixed into the features extracted for that pixel, which lowers the rate at which the model classifies such pixels correctly.

Figures 15-17 show the classification results of each model on the three datasets with training sample = 10, respectively. The results show that DSEN is superior to the other methods in terms of overall classification accuracy, but the accuracy of each method varies significantly between classes. For example, on the UP dataset, DSEN is weaker than HybridSN and DC-CNN for the classification of Bare Soil, but its overall accuracy is better. The experimental results show that although the proposed method is relatively weak in a few cases, its overall classification performance is superior to the compared methods in almost all cases.


Conclusions
This paper designed a novel dual-channel network model including two convolutional channels, in which one channel utilized 3D-2D hybrid convolution to extract the joint spatial-spectral features and the other channel used 1 × 1 3D and 2D convolution to extract the spectral features. The performance of the model for HSI classification with few samples improved after enhancing the expression of spectral features based on feature fusion. Through the experiments performed on three public datasets, the results revealed that DSEN has significant advantages in HSI classification performance compared with several other deep learning methods, thereby proving the effectiveness of our method.


Conflicts of Interest:
The authors declare no conflict of interest.