Hybrid Convolutional Network Combining 3D Depthwise Separable Convolution and Receptive Field Control for Hyperspectral Image Classification

Abstract: Deep-learning-based methods have been widely used in hyperspectral image classification. To address the excessive parameters and computational cost of 3D convolution, and the loss of detailed information caused by enlarging the receptive field too far in pursuit of multi-scale features, this paper proposes a lightweight hybrid convolutional network called the 3D lightweight receptive control network (LRCNet). The proposed network consists of a 3D depthwise separable convolutional network and a receptive field control network. The 3D depthwise separable convolutional network uses the depthwise separable technique to capture the joint features of the spatial and spectral dimensions while reducing the number of parameters. The receptive field control network preserves hyperspectral image (HSI) details by controlling the receptive field of the convolution kernel. To verify the validity of the proposed method, we test the classification accuracy of LRCNet on three public datasets, where it exceeds 99.50%. The results show that, compared with state-of-the-art methods, the proposed network has competitive classification performance.


Introduction
Hyperspectral images (HSIs) contain a large amount of spectral and spatial data, which provide abundant information based on the spectral characteristics of the objects and retain the overall shape of an object and its association with the surrounding objects [1]. Considering the characteristics of HSI data, it is important to analyze and extract the spectral and spatial features. HSI processing technology can satisfy military and civilian needs, such as medical image processing, agriculture and geological exploration, and sea resource investigation [2][3][4][5][6][7][8]. Consequently, hyperspectral image classification (HSIC) has become a research hotspot in image processing and remote sensing [9].
In the early works of HSIC, convolutional neural networks (CNNs) were usually used to extract the features [10][11][12][13][14]. Cheng et al. proposed a simple, effective method to extract hierarchical deep spatial features for HSI classification by exploring the power of off-the-shelf CNN models [15]. Makantasis et al. exploited a convolutional neural network to encode pixels' spectral and spatial information, and a multi-layer perceptron to conduct the classification task [16]. Many deep neural networks have been developed to handle HSIC tasks. Jiao et al. applied fully convolutional networks (FCNs) to the HSIC task for the first time by combining the weighted extracted features and spectral information [17]. Sun et al. introduced a supervised network for better performance and proposed a fully convolutional segmentation network (FCSN) [18]. Kang et al. believed that the CNN-based methods are unable to effectively extract the discriminant features and proposed a dual-path network (DPN) by combining a residual network and a dense convolutional network to perform HSIC [19]. In order to obtain additional neighborhood information, Soucy et al. proposed the clustering ensemble U-Net (CEU-Net) by combining clustering methods with U-Net [20]. Si et al. used DeepLab v3+ technology and a support vector machine (SVM) classifier for HSI feature extraction and classification recognition [21].
The aforementioned methods perform 2D convolution-based operations. However, some researchers believe that 2D CNNs cannot effectively extract the features because they do not consider the correlation information between channels [22]. In contrast, 3D convolution effectively combines the spatial and spectral features to improve the accuracy. Based on a 3D CNN for feature extraction [13], Hamida et al. presented an efficient method that enables joint processing of spectral and spatial information [23]. He et al. proposed a multi-scale 3D deep convolutional neural network (M3D-DCNN) in order to jointly learn both 2D multi-scale and 1D spectral features from HSI data [24]. Zhong et al. introduced additional residual blocks by using 3D convolutional layers and proposed the spectral-spatial residual network (SSRN) [25]. Roy et al. proposed HybridSN by combining 2D and 3D convolutions to obtain a higher classification accuracy [22]. Zhu et al. proposed the 3D deep capsule network based on its abundant feature representation capability [26]. Sun et al. proposed the cubic capsule network (EMAP-Cubic-Caps) to overcome shortcomings such as the inability to capture fine spatial features and the loss of important information caused by PCA dimensionality reduction [27].
Although 3D convolution effectively obtains the joint features from the spatial and spectral dimensions, it has its own limitations [22,28,29]. A network that incorporates 3D convolutions has a large number of parameters, leading to a higher computational cost [22]. In addition, most networks using 3D convolution do not consider the impact of controlling the receptive field size on the classification accuracy, and they emphasize expanding the receptive fields to obtain a better performance [30,31]. In fact, due to the low spatial resolution of HSIs, there is a considerable loss of detail in large receptive fields [9,32]. Although multiple pooling operations are useful for acquiring multi-scale features, they also have an adverse effect on the classification accuracy due to the loss of detailed features, creating confusion among similar category features [31]. However, if the receptive field is too small, the multi-scale features cannot be captured, which may lead to underfitting in the network. Therefore, a suitable receptive field is essential for enhancing the network's classification accuracy.
Recently, researchers have tried to address the above problems. Considering the low spatial resolution of HSIs, Pan et al. proposed the dilated semantic segmentation network (DSSNet) [31]. The authors presented a concept for controlling the receptive field of convolution kernels at 13 × 13 [31]. However, it is difficult to focus on the joint features of the space and spectrum because DSSNet still extracts features using 2D convolutions. Li et al. argued that existing networks do not effectively combine 2D and 3D convolutions, so they alternately used 2D and 3D units to solve the redundancy of the model structure [33]. Although their method can reduce the size of the model, it does not specifically reduce the cost of 3D convolution. Howard et al. proposed depthwise separable convolution in order to reduce the number of parameters and computations in the 2D convolution process, and used it in the lightweight network MobileNetV1 [34]. Fırat et al. introduced 2D depthwise separable convolution in HSIC tasks to decrease the computational cost [35]. Sandler et al. upgraded the depthwise separable convolution and proposed the inverted residual structure [36]. These methods lessen the number of parameters and computational costs that convolutions introduce. However, they are aimed at 2D convolution instead of 3D convolution. We also found that in other research fields, researchers have modified and introduced 3D depthwise separable convolution to reduce the computational cost [37][38][39]. In order to effectively obtain the multi-scale information from HSIs, Gong et al. proposed the multi-scale squeeze-and-excitation pyramid pooling network (MSPN) [28]. The classification accuracy of this network is affected by the introduction of a pooling layer without controlling the size of the receptive field. Although these solutions address parts of the aforementioned issues, none of them resolves the defects completely.
From the previous work, two problems can be summarized as follows. First, although 3D convolutions effectively capture the spectral and spatial features, the number of parameters and computations introduced during the training process is large [22]. Second, the spatial resolution of HSIs is usually low and some details are represented by only a few pixels [31,32]. It is noteworthy that the details may disappear after multiple pooling operations, and the lost details cannot be retrieved by up-sampling [40]. If the network is too deep, it may lose some details during the feature extraction process due to the large receptive field of the convolution kernel. As a result, the classification accuracy is affected.
To solve the above problems, the 3D lightweight receptive control network (LRCNet) is proposed in this paper. We combine 2D and 3D convolution to effectively integrate the features from the spatial and spectral dimensions. Next, in order to lower the computational cost and reduce the number of parameters, we employ depthwise separable convolution and convert it from 2D to 3D format. In order to reduce the negative impact of a low spatial resolution, we control the size of the receptive field based on dilated convolutions. Below is a summary of this work's contributions:

1. The application of 3D depthwise separable convolution decreases the computational costs of 3D convolution. Additionally, 3D depthwise convolution can effectively capture spatial and spectral features, while pointwise convolution can extract information from adjacent spectral bands, improving the learning ability of the spectral domain.

2. The receptive field control strategy is adopted. To prevent the loss of detailed information when learning multi-scale features, the receptive field is gradually increased through dilated convolution. Moreover, the receptive field is left unchanged during 3D convolution to enhance the robustness of the model and lower the risk of overfitting.

3. The experimental results show that the proposed method has a better classification accuracy on three public datasets, indicating that the model is competitive.

The rest of this paper is organized as follows: The LRCNet architecture and the functional block are presented in Section 2. The experimental results and analysis are discussed in Section 3, and the conclusion is presented in Section 4.

Methods
The proposed LRCNet's architecture is depicted in Figure 1. For the input HSI, we use principal component analysis (PCA) to reduce the dimensions of the data. Next, a 3D depthwise separable convolutional network comprising three 3D-DW modules is used. Afterwards, a reshape operation is applied, and the resulting data are used as the input of the receptive field control network. This network is followed by a fully connected (FC) module, which consists of three FC layers. Finally, the classification results are obtained.

Initial Data Input and Processing
As an HSI contains mixed land categories, there is a similarity between different categories. Additionally, a significant percentage of spectral bands exhibit redundancy, which makes it difficult to train models [16]. As shown in Figure 1, in order to reduce the impact of redundant HSI data during the training process, we use PCA before further processing [41]. Assume that the initial hyperspectral image is represented by $I \in \mathbb{R}^{H \times W \times D}$, where $I$ represents the input HSI data, $H$ and $W$ represent the height and width of the input data, respectively, and $D$ represents the number of bands in the input images. The data cube after dimension reduction based on PCA is $X \in \mathbb{R}^{H \times W \times B}$, where $X$ represents the data cube and $B$ represents the number of bands after dimension reduction. Next, we divide $X$ into equal sizes and obtain $P \in \mathbb{R}^{S \times S \times B}$, where $P$ represents the data cube after partition, $B$ is the number of bands, and $S \times S$ represents the height and width of $P$.
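The preprocessing described above (PCA along the spectral axis, then partitioning into S × S × B cubes) can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the partition is performed, so the sketch assumes one zero-padded patch centered on each pixel, and the function names `pca_reduce` and `extract_patches`, the toy cube size, and the random data are hypothetical.

```python
import numpy as np

def pca_reduce(cube, n_components):
    """Project an (H, W, D) cube onto its first n_components principal components."""
    h, w, d = cube.shape
    flat = cube.reshape(-1, d).astype(np.float64)  # astype returns a copy
    flat -= flat.mean(axis=0)                      # center each spectral band
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    reduced = flat @ vt[:n_components].T
    return reduced.reshape(h, w, n_components)

def extract_patches(cube, size):
    """Return one (size, size, B) patch per pixel (assumption: zero padding, odd size)."""
    h, w, b = cube.shape
    m = size // 2
    padded = np.zeros((h + 2 * m, w + 2 * m, b), dtype=cube.dtype)
    padded[m:m + h, m:m + w] = cube
    patches = np.empty((h * w, size, size, b), dtype=cube.dtype)
    for r in range(h):
        for c in range(w):
            # The patch centered on pixel (r, c) of the unpadded cube.
            patches[r * w + c] = padded[r:r + size, c:c + size]
    return patches

rng = np.random.default_rng(0)
I = rng.random((12, 10, 40))         # toy cube: H=12, W=10, D=40 (hypothetical sizes)
X = pca_reduce(I, n_components=5)    # B=5 bands after reduction
P = extract_patches(X, size=5)       # S=5
print(X.shape, P.shape)
```

In the experimental setup of this paper the corresponding values are S = 25 and B = 30; the smaller sizes here only keep the example fast.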

3D Depthwise Separable Convolutional Network
Depthwise separable convolution was first proposed by Howard et al. and used in MobileNetV1 [34]. The depthwise separable convolution splits the standard convolution into two parts. The first part is the depthwise convolution, which is utilized to extract the features from each input channel separately. The second part is the pointwise convolution, which uses 1 × 1 convolution to combine the output of the depthwise convolution.
Compared with the standard convolution, the depthwise separable convolution significantly reduces the number of parameters and the computational complexity of the convolution layer. We assume that the size of the input feature map is $H \times W \times C_{in}$ and that the parameters of a standard convolution layer are $K_{2D} \times K_{2D} \times C_{in} \times C_{out}$, where $H$ and $W$ represent the height and width of the input data, respectively, $C_{in}$ denotes the number of channels in the input feature map, $K_{2D}$ represents the size of the convolution kernel for performing 2D convolutions, and $C_{out}$ represents the number of output channels. If the output feature map size is still $H \times W$, we denote the computational complexity of the standard 2D convolution by $Cost_S$, which is calculated as follows [34]:

$$Cost_S = K_{2D} \cdot K_{2D} \cdot C_{in} \cdot C_{out} \cdot H \cdot W \tag{1}$$

If 2D depthwise separable convolution is adopted, we denote its computational cost by $Cost_{DW}$. $Cost_{DW}$ consists of two parts. The first part is the computational cost of the 2D depthwise convolution, $Cost_D$, and the second part is the computational cost of the 2D pointwise convolution, $Cost_P$. In order to compare the computational costs of the 2D depthwise separable convolution and the standard 2D convolution, we use the same kernel size $K_{2D}$, the same numbers of input and output channels $C_{in}$ and $C_{out}$, and the same height $H$ and width $W$ of the input data. $Cost_{DW}$ is then calculated as follows:

$$Cost_{DW} = Cost_D + Cost_P = K_{2D} \cdot K_{2D} \cdot C_{in} \cdot H \cdot W + C_{in} \cdot C_{out} \cdot H \cdot W \tag{2}$$

By comparing the computational costs of the two convolutions, the ratio of the computation is obtained as follows:

$$\frac{Cost_{DW}}{Cost_S} = \frac{1}{C_{out}} + \frac{1}{K_{2D}^2} \tag{3}$$

For convenience, we define the computational cost factor $\xi_{2D}$ as the ratio of the computational cost of the current 2D convolution to that of the standard 2D convolution, as shown in Equation (4):

$$\xi_{2D} = \frac{Cost_{DW}}{Cost_S} = \frac{1}{C_{out}} + \frac{1}{K_{2D}^2} \tag{4}$$

Generally, the values of $C_{out}$ and $K_{2D}$ are greater than 2; thus, $\xi_{2D} < 1$ can be obtained from Equations (3) and (4), which shows that 2D depthwise separable convolution can effectively decrease the computational costs. If a convolution kernel of size 3 × 3 is used and $C_{out}$ is large, the computational cost of 2D depthwise separable convolution is roughly one-ninth of that of the standard 2D convolution. Therefore, a lightweight network can be created using depthwise separable convolution, which can also increase the network's training efficiency.
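As a quick numerical check of the 2D cost comparison above, the following sketch counts multiply-accumulate operations for both kinds of convolution and compares their ratio with the closed form 1/C_out + 1/K². The function names and the channel counts (64 and 128) are illustrative assumptions, not values taken from LRCNet.

```python
def standard_conv2d_cost(h, w, c_in, c_out, k):
    """Multiply-accumulates of a standard k x k convolution with an h x w output."""
    return k * k * c_in * c_out * h * w

def separable_conv2d_cost(h, w, c_in, c_out, k):
    """Multiply-accumulates of a depthwise k x k plus a pointwise 1 x 1 convolution."""
    depthwise = k * k * c_in * h * w      # one k x k filter per input channel
    pointwise = c_in * c_out * h * w      # 1 x 1 convolution that mixes channels
    return depthwise + pointwise

def cost_factor_2d(c_out, k):
    """Closed-form ratio of separable cost to standard cost."""
    return 1 / c_out + 1 / (k * k)

# Hypothetical layer: 25 x 25 feature map, 64 -> 128 channels, 3 x 3 kernel.
h, w, c_in, c_out, k = 25, 25, 64, 128, 3
ratio = (separable_conv2d_cost(h, w, c_in, c_out, k)
         / standard_conv2d_cost(h, w, c_in, c_out, k))
print(f"measured ratio = {ratio:.4f}, closed form = {cost_factor_2d(c_out, k):.4f}")
```

The ratio does not depend on H, W, or C_in, which is why the closed form contains only C_out and the kernel size.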
In the 2D depthwise convolution part, the features are extracted separately from each input channel.If 2D depthwise convolution is adopted, the connection between different bands of the same pixel is ignored, and the spectral features cannot be learned completely.Moreover, it is easy to ignore the relationship between spatial and spectral features in channel-by-channel convolutions.Although pointwise convolution addresses this defect, there are still many features that cannot be obtained.
Considering the limitations of 2D depthwise separable convolution, we propose the 3D depthwise separable convolution technique, which can fully extract the spatial-spectral features and learn joint features from multiple bands to enhance the classification performance. As each 3D convolution convolves a data block, it is possible to capture the features of adjacent groups of bands. Figure 2 depicts the structure of the proposed 3D depthwise separable convolution (3D-DW) module. The proposed technique also splits the standard 3D convolution into two parts: 3D depthwise convolution and 3D pointwise convolution. In addition, the proposed 3D depthwise separable convolution retains the advantages of 2D depthwise separable convolutions. Note that the computational complexity of 3D depthwise separable convolution is lower than that of the standard 3D convolution.
Assume that the size of the input data cube is $C_{in} \times H \times W \times B$, where $C_{in}$ is the number of input channels, $B$ is the number of bands, and $H$ and $W$ are the height and width of the data cube, respectively. The number of parameters in a standard 3D convolution is $K_{3D} \times K_{3D} \times K_{3D} \times C_{in} \times C_{out}$, where $K_{3D}$ is the size of the 3D convolution kernel and $C_{out}$ is the number of output channels. If the spatial size of the output data cube remains unchanged, we denote the computational cost of the standard 3D convolution by $Cost_{3D-S}$, which is computed as follows:

$$Cost_{3D-S} = K_{3D}^3 \cdot C_{in} \cdot C_{out} \cdot H \cdot W \cdot B$$

If 3D depthwise separable convolution is adopted, we denote its computational cost by $Cost_{3D-DW}$. $Cost_{3D-DW}$ consists of two parts, i.e., the computational cost of the 3D depthwise convolution and the computational cost of the 3D pointwise convolution, which are denoted as $Cost_{3D-D}$ and $Cost_{3D-P}$, respectively. To compare the computational costs of 3D depthwise separable convolution with those of the standard 3D convolution, we use the same kernel size $K_{3D}$, the same numbers of input and output channels $C_{in}$ and $C_{out}$, and the same input size $H \times W \times B$. $Cost_{3D-DW}$ is then calculated as follows:

$$Cost_{3D-DW} = Cost_{3D-D} + Cost_{3D-P} = K_{3D}^3 \cdot C_{in} \cdot H \cdot W \cdot B + C_{in} \cdot C_{out} \cdot H \cdot W \cdot B$$

To compare the computational costs of the convolutions, we define the computational cost factor $\xi_{3D}$ as follows:

$$\xi_{3D} = \frac{Cost_{3D-DW}}{Cost_{3D-S}} = \frac{1}{C_{out}} + \frac{1}{K_{3D}^3} \tag{11}$$

Since $C_{out} \geq 2$ and $K_{3D} \geq 2$, $\xi_{3D} < 1$ is obtained using Equation (11). Therefore, it is evident that 3D depthwise separable convolution greatly reduces the computational cost.
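The 3D cost comparison above can be checked numerically in the same way. The sketch below generalizes the cubic kernel K_3D to a kernel given as a (depth, height, width) tuple, so that the non-cubic kernels used by the 3D-DW modules can be plugged in; this generalization, the function names, and the channel counts (8 and 16) are assumptions made for illustration. The 30 × 25 × 25 data size follows the experimental setup.

```python
from math import prod

def standard_conv3d_cost(shape, c_in, c_out, kernel):
    """Multiply-accumulates of a standard 3D convolution; shape is the (B, H, W) output size."""
    return prod(kernel) * c_in * c_out * prod(shape)

def separable_conv3d_cost(shape, c_in, c_out, kernel):
    """Multiply-accumulates of a 3D depthwise plus a 1 x 1 x 1 pointwise convolution."""
    depthwise = prod(kernel) * c_in * prod(shape)   # one 3D filter per input channel
    pointwise = c_in * c_out * prod(shape)          # 1 x 1 x 1 convolution mixing channels
    return depthwise + pointwise

def cost_factor_3d(c_out, kernel):
    """Closed-form ratio; equals 1/C_out + 1/K^3 for a cubic kernel of size K."""
    return 1 / c_out + 1 / prod(kernel)

shape = (30, 25, 25)        # bands, height, width (from the experimental setup)
c_in, c_out = 8, 16         # hypothetical channel counts
ratios = {}
for kernel in [(7, 3, 3), (5, 3, 3), (3, 3, 3)]:
    ratios[kernel] = (separable_conv3d_cost(shape, c_in, c_out, kernel)
                      / standard_conv3d_cost(shape, c_in, c_out, kernel))
    print(kernel, f"{ratios[kernel]:.4f}")
```

Larger kernels shrink the 1/(kernel volume) term, so for a fixed C_out the relative saving is largest for the (7, 3, 3) kernel.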
Figure 3 shows the difference between the filters of the 3D depthwise separable convolution and the filters of the standard 3D convolution. Since each input channel is convolved separately in depthwise convolution, it is difficult to efficiently utilize the feature information from many channels in the same spatial position. The convolution kernels of the 3D depthwise convolution have three dimensions, so each convolution kernel extracts features from a group of adjacent bands, effectively avoiding the defects of depthwise convolution. Additionally, the number of channels is adjusted, and features are captured again using 3D pointwise convolution. Note that the size of the pointwise convolution kernel is only 1 × 1 × 1. Therefore, as compared with the standard convolution, the 3D depthwise separable convolution has significantly fewer parameters and a lower computational cost.
The 3D depthwise separable convolutional network contains three 3D-DW modules. After each depthwise convolution and pointwise convolution, batch normalization (BN) is applied, along with the ReLU activation function. The parameters of the three 3D-DW modules are different. Since all the bands corresponding to each pixel in the HSI collectively reflect the features of that pixel, it is necessary to aggregate the information from multiple bands as much as possible when extracting the features. Therefore, we set the size of the convolution kernels for the 3D depthwise convolution to (7, 3, 3), (5, 3, 3), and (3, 3, 3).
In addition, the stride and padding parameters of the depthwise convolution and pointwise convolution are set to 1 and 0, respectively. As a result, the number of channels can be increased without changing the height and width of the input images. Due to the low spatial resolution of HSIs, it is easy to lose small features if the data size is compressed too early. This operation ensures that the receptive field of the convolution kernels does not increase during the 3D convolution and that spectral and spatial dimension information can be aggregated.
The essence of 3D depthwise convolution is still 3D convolution. For the pixel at spatial position $(x, y, z)$ in the $j$th feature map of the $i$th layer, we assume that the activation value $v_{i,j}^{x,y,z}$ is expressed as follows [22]:

$$v_{i,j}^{x,y,z} = f\left(b_{i,j} + \sum_{\tau=1}^{d_{l-1}} \sum_{\lambda=-\eta}^{\eta} \sum_{\rho=-\gamma}^{\gamma} \sum_{\sigma=-\delta}^{\delta} \omega_{i,j,\tau}^{\sigma,\rho,\lambda} \, v_{i-1,\tau}^{x+\sigma,\,y+\rho,\,z+\lambda}\right)$$

where $f$ represents the ReLU activation function, $b_{i,j}$ represents the bias parameter for the $j$th feature map of the $i$th layer, $d_{l-1}$ denotes the number of feature maps in the $(l-1)$th layer and the depth of kernel $\omega_{i,j}$ for the $j$th feature map of the $i$th layer, $2\gamma + 1$ is the width of the convolution kernel, $2\delta + 1$ is the height of the convolution kernel, and $2\eta + 1$ is the depth of the convolution kernel along the spectral dimension.
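The activation described above (a bias plus a sum over the previous layer's feature maps and over all kernel offsets, passed through ReLU) can be sketched for a single output position in plain Python. The function name `conv3d_activation`, the nested-list data layout, and the toy values are assumptions chosen for readability, not the implementation used in LRCNet.

```python
def conv3d_activation(prev_maps, kernel, bias, x, y, z):
    """ReLU activation of one 3D-convolution output at position (x, y, z).

    prev_maps[tau][x][y][z] holds the feature maps of the previous layer;
    kernel[tau] is a 3D weight block with odd extents, one block per input map.
    The position must be far enough from the border (no padding is applied).
    """
    total = bias
    for fmap, block in zip(prev_maps, kernel):
        # Half-extents of the kernel along each of its three axes.
        hx, hy, hz = len(block) // 2, len(block[0]) // 2, len(block[0][0]) // 2
        for s in range(-hx, hx + 1):
            for r in range(-hy, hy + 1):
                for l in range(-hz, hz + 1):
                    total += block[s + hx][r + hy][l + hz] * fmap[x + s][y + r][z + l]
    return max(0.0, total)  # ReLU

# Toy example: one 3 x 3 x 3 input map of ones and a 3 x 3 x 3 kernel of 0.5.
ones = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
half = [[[0.5] * 3 for _ in range(3)] for _ in range(3)]
active = conv3d_activation([ones], [half], bias=-10.0, x=1, y=1, z=1)
clipped = conv3d_activation([ones], [half], bias=-20.0, x=1, y=1, z=1)
print(active, clipped)
```

In a depthwise 3D convolution the outer loop over input feature maps disappears, because each output map is computed from exactly one input map.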

Receptive Field Control Network
This network includes a standard 2D convolution layer and two dilated convolution layers. The BN operation and ReLU activation function are also applied after each convolution operation. Since the data output by the 3D convolution have four dimensions, we merge the band and channel dimensions (their sizes are multiplied) to reshape the data into three dimensions. However, this operation results in too many input channels. To avoid the impact of data redundancy on the training results, we compress the number of channels using a standard 2D convolution layer. Two dilated convolution layers are added to increase the receptive field for obtaining the multi-scale features. The dilated convolution can also capture the features between neighboring pixels, which can help to improve the classification accuracy. The stride parameter of the two dilated convolutions is 1, the padding parameter is 0, and the dilation rate is 2. The lower side of each 2D convolution layer in Figure 1 also shows the size of its receptive field. It can be seen that the size of the final receptive field is 11 × 11.
Let $r_{out}$ denote the receptive field after a convolution layer. With the dilated convolution operation, $r_{out}$ is computed as in [31], where $r_{in}$ represents the size of the receptive field of the upper layer, stride is the stride parameter of the convolution layer, and $t$ represents the dilation parameter of the dilated convolution.
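The receptive field sizes quoted in this paper can be reproduced with the widely used layer-by-layer recurrence, in which each layer adds (k − 1) · t · (product of earlier strides) to the receptive field. The sketch below uses that standard recurrence, which is not necessarily the exact form given in [31]. It also assumes 3 × 3 kernels for all three 2D layers (the text does not state the kernel size) and, as the text does, counts only the layers of the receptive field control network.

```python
def receptive_field(layers):
    """Receptive field of a stack of conv layers given as (kernel, stride, dilation)."""
    r, jump = 1, 1                 # jump = product of the strides seen so far
    for k, stride, t in layers:
        r += (k - 1) * t * jump    # a dilated kernel spans (k - 1) * t + 1 input pixels
        jump *= stride
    return r

def control_network(dilation):
    """Assumed layout: one standard 3 x 3 conv followed by two dilated 3 x 3 convs."""
    return [(3, 1, 1), (3, 1, dilation), (3, 1, dilation)]

for t in (1, 2, 3, 4):
    size = receptive_field(control_network(t))
    print(f"dilation {t}: receptive field {size} x {size}")
```

Under these assumptions, dilation rate 2 gives the 11 × 11 receptive field stated above, and rates 1, 3, and 4 give the 7 × 7, 15 × 15, and 19 × 19 fields examined in the ablation experiments.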

Fully Connected Module
The FC module of the proposed LRCNet consists of three fully connected (FC) layers. The first FC layer converts the feature map output by the last dilated convolution layer into a 1D vector with 256 nodes. Due to the similarity between the classes of HSI data, we further compress the 256 nodes into 128 nodes by using an FC layer. As a result, the influence of the feature location on the classification results is reduced, which improves the network's robustness. To reduce the risk of overfitting, a dropout layer is added after each FC layer. Finally, we use an FC layer with the number of nodes equal to the number of classes in the dataset. Suppose that the 1D vector output by the FC layer is $A = (a_1, a_2, a_3, \ldots, a_{i-1}, a_i)$, where $a_i$ represents the $i$th element of $A$. Next, $a_i$ is calculated as follows:

$$a_i = \sum_{\kappa=1}^{q} W_{i,\kappa} \cdot G_{\kappa}$$

where $G_{\kappa}$ represents the $\kappa$th feature map, $W_{i,\kappa}$ denotes the weight matrix of the $\kappa$th feature map of the $i$th element, and $q$ denotes the total number of feature maps output by the receptive field control network.

Classification Result Output
After the third FC layer, we map the output to $(-\infty, 0]$ by using the log-softmax function. Since the softmax activation function performs exponential operations, overflow or underflow may occur during the calculation; using the log-softmax activation function avoids these problems, improves numerical stability, and speeds up the operation [31]. Assuming that $x_h \in \mathbb{R}^{1 \times C}$ is the output vector of pixel $h$ after passing through the FC layer, where $C$ is the number of object categories in the dataset, the output is expressed as:

$$y_c = \log\left(\frac{\exp(x_h(c))}{\sum_{c'=1}^{C} \exp(x_h(c'))}\right)$$

where $y_c$ represents the log-probability that $x_h$ belongs to category $c$, and $x_h(c)$ is the $c$th element in $x_h$. The cross-entropy loss is chosen as the loss function. Denoting the loss of classifying pixel $h$ by $loss_{CE}(h, c_t)$, it can be calculated as:

$$loss_{CE}(h, c_t) = -\log\left(\frac{\exp(x_h(c_t))}{\sum_{c=1}^{C} \exp(x_h(c))}\right)$$

where $c_t$ represents the correct class of pixel $h$, and $x_h(c_t)$ is the element in $x_h$ that belongs to class $c_t$.
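The overflow concern mentioned above can be made concrete with a small sketch. It computes log-softmax with the usual max-subtraction trick and derives the cross-entropy loss from it; the function names and the example scores are assumptions made for illustration.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax of a list of class scores."""
    m = max(scores)
    # Subtracting the maximum keeps every exponent <= 0, so exp() cannot overflow.
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

def cross_entropy(scores, target):
    """Cross-entropy loss for one pixel whose correct class index is target."""
    return -log_softmax(scores)[target]

# Scores this large would overflow a naive exp(); the stable form handles them.
x_h = [1000.0, 1001.0, 999.0]
y = log_softmax(x_h)
loss = cross_entropy(x_h, target=1)
print(y, loss)
```

Every entry of `y` lies in (−∞, 0], and exponentiating the entries gives probabilities that sum to one, which matches the mapping described in the text.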

Dataset Introduction
In this work, we used three public datasets to verify the performance of LRCNet in HSIC tasks [42], including Indian Pines (IndianP), Pavia University (PaviaU), and Salinas Valley (SalinasV).

Experimental Setup
We used a GTX 1080 Ti with 10 GB of memory for training the network. The hyperparameters of LRCNet were set as follows. We set the learning rate to 0.00008, the number of epochs to 100, and the batch size to 128. We divided the input image into small windows of 25 × 25 pixels and reduced the band number to 30. The Adam optimizer was adopted for training. The cross-entropy function was selected as the loss function. We reserved 30% of the data for testing and 70% of the data for training the network. In this paper, we used the OA, kappa, and AA metrics to evaluate the classification performance. OA represents the proportion of correctly classified test samples among all test samples, AA represents the average of the per-class accuracies, and kappa combines the diagonal and non-diagonal terms of the confusion matrix and is a robust measure of consistency.
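The three metrics above can all be computed from a confusion matrix. The sketch below uses the standard definitions (OA as the trace divided by the total, AA as the mean per-class recall, and Cohen's kappa); the convention that rows are true classes, the function name, and the 3-class matrix are assumptions for illustration and are not results from this paper.

```python
def classification_metrics(cm):
    """Return (OA, AA, kappa) for a square confusion matrix.

    Assumption: cm[i][j] counts samples of true class i predicted as class j,
    and every class has at least one sample.
    """
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    diag = [cm[i][i] for i in range(n_classes)]
    row_tot = [sum(row) for row in cm]                                  # true counts
    col_tot = [sum(cm[i][j] for i in range(n_classes)) for j in range(n_classes)]

    oa = sum(diag) / total                                              # overall accuracy
    aa = sum(d / r for d, r in zip(diag, row_tot)) / n_classes          # mean class accuracy
    expected = sum(r * c for r, c in zip(row_tot, col_tot)) / total**2  # chance agreement
    kappa = (oa - expected) / (1 - expected)
    return oa, aa, kappa

# Hypothetical confusion matrix with one small, frequently confused class.
cm = [[50, 0, 0],
      [2, 45, 3],
      [0, 5, 15]]
oa, aa, kappa = classification_metrics(cm)
print(f"OA={oa:.4f}  AA={aa:.4f}  kappa={kappa:.4f}")
```

Because AA weights every class equally, errors on a small class lower AA more than OA, which is the pattern discussed for the ablation results.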
Table 1 shows the OA, AA, and kappa values of the different methods on the three public datasets. The proposed LRCNet performs well on all three datasets, and its classification accuracy is competitive. On the PaviaU and SalinasV datasets, the classification accuracy of LRCNet is close to 100%. However, the classification performance obtained on the IndianP dataset does not reach the ideal result, and the AA score is only 98.40%. Based on the confusion matrix shown in Figure 4, we find that the proposed LRCNet easily misjudges the Soybean-mintill class as the Corn-notill class and the Grass-pasture class. Additional observations of the ground truth map of IndianP show that the three classes are very close in the image. We infer that the details of the three categories are wrongly fused together during feature learning.
In order to further verify LRCNet's high classification performance, we performed a one-sample t-test to study whether the mean values of OA, kappa, and AA were substantially different from those in Table 1. We repeated 10 tests on the three datasets, and the results are shown in Table 2. The results in Tables 1 and 2 do not significantly differ from one another. It can be inferred that the outstanding classification performance of LRCNet is not an accidental result.
Table 4 shows the number of parameters and the computational cost of training the proposed LRCNet and HybridSN on the IndianP dataset. It is evident from Table 4 that the proposed method effectively reduces the number of parameters and the computational cost of 3D convolution. LRCNet has only 3,857,330 parameters, and its floating-point operations (FLOPs) value is only 95.71 MB, which is 152.91 MB less than that of HybridSN. Therefore, it can be verified that the 3D depthwise separable technique maintains classification accuracy while reducing the computational cost.

Ablation Experiments
In order to confirm the impact of the 3D-DW module on the classification accuracy, we performed ablation experiments. We compared the proposed LRCNet with two modified networks in terms of classification performance. In Net1, we replaced the third 3D-DW module with a 2D-DW module. Similarly, in Net2, we replaced the second and third 3D-DW modules with two 2D-DW modules.
We tested the classification performance of the three networks on IndianP. Table 5 displays the results. It is evident from the results that the classification performance degrades significantly after the 3D-DW module is replaced with the 2D-DW module, and the AA index decreases the most. When only one 3D-DW module is used (Net2), the AA score is only 84.29%. When the number of 3D-DW modules is increased to 2 (Net1), the AA score increases to 93.75%. For classes with few samples, it is difficult to learn the corresponding features using the 2D-DW module, and they are often assigned to other classes during classification. Although the OA and kappa scores do not drop considerably, the AA scores are much lower. This shows the validity of the 3D-DW module, which learns the joint features of the spectral and spatial dimensions.
In order to verify the effectiveness of the receptive field control strategy, we also conducted ablation experiments. We changed the dilation rate of the dilated convolution in the receptive field control network by using three different values, i.e., 1, 3, and 4. The corresponding networks are Net3, Net4, and Net5.
We tested the classification accuracies of the three networks and compared them with LRCNet. The results are presented in Table 6. It is evident from the results that whether the receptive field is enlarged or reduced, the classification accuracy decreases to varying degrees. When the receptive field is 7 × 7, it is too small to capture the multi-scale information. Therefore, for smaller categories, the network is unable to learn all the features and is prone to incorrect classification results. When the receptive field is increased to 15 × 15 and 19 × 19, OA decreases to 98.68% and 98.45%, respectively. It can be inferred that a large receptive field loses some details, and that more details are lost as the receptive field grows. Therefore, the strategy for controlling the receptive field of LRCNet is successful.

Conclusions
In this work, we proposed the LRCNet for performing HSIC tasks. It is an end-to-end framework comprising two functional modules: a 3D depthwise separable convolutional network, which reduces the computational cost of the convolution and the number of parameters, and a receptive field control network, which controls the receptive field for capturing multi-scale features. In the 3D convolutional network, we use the 3D depthwise separable convolution technique to decrease the number of parameters and the computational cost. In the receptive field control network, we use dilated convolutions to control the receptive field, capturing multi-scale features while avoiding the loss of detailed features. In the ablation study, we found that the strategy of mixing 3D and 2D convolutions and controlling the receptive field can enhance the classification accuracy. To verify the classification performance of the proposed LRCNet, we tested it on three public datasets and obtained competitive results. The proposed method can be applied to accurately identify and classify ground objects in hyperspectral images.
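As a rough illustration of the parameter saving attributed to depthwise separable convolution, the sketch below counts the weights of a standard 3D convolution and of a generic depthwise separable counterpart (one depthwise kernel per input channel followed by a 1 × 1 × 1 pointwise convolution). The channel counts and kernel size are hypothetical and are not taken from LRCNet, biases are ignored, and the exact 3D-DW layer of the proposed network may be organized differently.

```python
from math import prod


def conv3d_params(c_in, c_out, kernel):
    """Weights of a standard 3D convolution (bias ignored)."""
    return prod(kernel) * c_in * c_out


def dw_separable_conv3d_params(c_in, c_out, kernel):
    """Weights of a generic depthwise separable 3D convolution (bias ignored)."""
    depthwise = prod(kernel) * c_in   # one 3D kernel per input channel
    pointwise = c_in * c_out          # 1x1x1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
c_in, c_out, kernel = 8, 16, (3, 3, 3)
standard = conv3d_params(c_in, c_out, kernel)
separable = dw_separable_conv3d_params(c_in, c_out, kernel)
print(f"standard: {standard}, separable: {separable}, "
      f"ratio: {standard / separable:.1f}x")
```

For these illustrative sizes, the standard layer has 3456 weights and the separable one has 344, roughly a tenfold reduction; the saving grows with the kernel volume and the number of output channels.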

Figure 1. The architecture of the proposed LRCNet.

Figure 3. A comparison of a 3D standard convolution kernel and a 3D depthwise separable convolution kernel.

Figure 4. The confusion matrices obtained using the proposed method on the IndianP, PaviaU, and SalinasV datasets (first, second, and third matrices, respectively).

Figures 5-7 show the classification results obtained using LRCNet, HybridSN, and DSSNet on three public datasets. The results show that the classification results obtained with the proposed LRCNet are close to the ground truth.

Table 2. The classification accuracy of LRCNet over repeated experiments on three datasets (%).

Table 3 displays the proposed LRCNet's classification performance under different training set partition ratios. IndianP was used in the experiment. The decline in the OA and kappa indicators is small: even when the training set accounts for only 10%, the OA still reaches 97.90%. This indicates that the proposed LRCNet learns most of the spatial and spectral features from a small number of training samples.

Table 3. The classification accuracies obtained using a small number of training samples (%).

Table 4. The comparison of the number of parameters and computational cost.

Table 5. The classification accuracy obtained using a small number of training samples (%).

Table 6. The classification accuracy for different receptive fields (%).