Remote Sensing Scene Image Classiﬁcation Based on Dense Fusion of Multi-level Features

: For remote sensing scene image classiﬁcation, many convolution neural networks improve the classiﬁcation accuracy at the cost of the time and space complexity of the models. This leads to a slow running speed for the model and cannot realize a trade-off between the model accuracy and the model running speed. As the network deepens, it is difﬁcult to extract the key features with a sample double branched structure, and it also leads to the loss of shallow features, which is unfavorable to the classiﬁcation of remote sensing scene images. To solve this problem, we propose a dual branch multi-level feature dense fusion-based lightweight convolutional neural network (BMDF-LCNN). The network structure can fully extract the information of the current layer through 3 × 3 depthwise separable convolution and 1 × 1 standard convolution, identity branches, and fuse with the features extracted from the previous layer 1 × 1 standard convolution, thus avoiding the loss of shallow information due to network deepening. In addition, we propose a downsampling structure that is more suitable for extracting the shallow features of the network by using the pooled branch to downsample and the convolution branch to compensate for the pooled features. Experiments were carried out on four open and challenging remote sensing image scene data sets. The experimental results show that the proposed method has higher classiﬁcation accuracy and lower model complexity than some state-of-the-art classiﬁcation methods and realizes the trade-off between model accuracy and model running speed.


Introduction
At present, remote sensing images with high resolution have been applied to many fields such as remote sensing scene classification [1], hyperspectral image classification [2], change detection [3,4], geographic image, and land use classification [5,6], etc. However, remote sensing images' complex spatial patterns and geographical structure bring great difficulties to image classification. Therefore, it is particularly important to understand the semantic content of remote sensing images effectively. The purpose of this study is to find a simple and efficient lightweight network model, which can accurately understand the semantics of remote sensing images and efficiently classify remote sensing scene images. In order to effectively extract image features, researchers have proposed many methods. Initially, manually made feature descriptors were used to extract image features, such as color histograms [7], texture descriptors [8], local binary mode [9], GIST [10], directional gradient histograms [11], bag-of-visual words (BOVW) [12], etc. Then, in order to solve the disadvantages brought by the method of manually extracting features, researchers proposed some unsupervised feature learning methods that can automatically extract shallow detail features from images, such as principal component analysis (PCA), sparse coding [13], autoencoders [14], Latent Dirichlet allocation [15], and probabilistic latent propose a hybrid downsampling method, which compensates for the pooled features by using a convolution branch to achieve the minimum loss of critical information while downsampling. Experiments show that the hybrid downsampling method is helpful in improving the performance of the model. (2) The pure bi-branching structure is difficult to extract the key features of the feature map with the deepening of the network. It also causes the loss of shallow features due to the deepening of the network, which is catastrophic for the classification of remote sensing scene images. To solve this problem, we propose a multi-level feature-intensive fusion structure based on dual branches. Each layer of the structure not only uses 3 × 3 depthwise separable convolution, 1 × 1 standard convolution, and Identity for feature extraction but also integrates with the features extracted by 1 × 1 standard convolution in the previous layer. Through the three branches of 3 × 3 depthwise separable convolution, 1 × 1 standard convolution, and identity, the information of the current layer can be fully extracted and fused with the features extracted by 1 × 1 standard convolution in the previous layer, which can avoid the loss of shallow information due to network deepening.
(3) A lightweight convolution neural network composed of shallow mixed downsampling structure and deep dual branch multi-level feature-intensive fusion structure is presented. A series of experimental results show that the proposed network is more suitable for remote sensing scene image classification.
The rest of this paper is as follows. In the second section, the proposed dual branch multi-level feature dense fusion-based lightweight convolutional neural network (BMDF-LCNN) is introduced in detail. In the third section, experiments and analysis are carried out and compared with some state-of-the-art methods to prove the superiority of the proposed method's performance. The fourth section contains the conclusion.

The Structure of the Proposed Method
The overall structure of the model is shown in Figure 1, which is divided into nine parts. In the first and second groups, we propose a feature extraction structure suitable for the shallow layers of the network (see Section 2.2 for the specific structure model). In the third group, the combination of standard convolution and depthwise separable convolution is adopted, and the maximum pool layer is used for downsampling to compress the spatial dimensions of the input images and reduce the risk of overfitting caused by irrelevant features. Groups 4 through 8 mainly extract representative features of remote sensing images. Groups 4 through 7 adopt the designed dual branch multi-level feature intensive fusion method to extract richer feature information. In Group 8, we used sequential 1 × 1 standard convolution, 3 × 3 standard convolution, and 3 × 3 depthwise separable convolution to extract deep-level features. On the basis of double branch fusion, the multilevel features are fully exchanged and fused, which not only improves the classification accuracy but also greatly improves the speed of the network and realizes the balance of accuracy and speed. In addition, in order to extract more features, the number of convolution channels in Groups 5 and 8 is widened to 256 and 512, respectively (see Section 3.2 for the specific channel number setting of other groups). Group 9 is used for classification, and the feature information obtained by the final fusion is utilized for calculating the probability of each scene category.
In deep feature extraction structures from Group 4 to Group 7, each layer can fully extract the information of the current layer through three branches of 3 × 3 depthwise separable convolution, 1 × 1 standard convolution, and Identity. In addition, by fusing the features extracted by 1 × 1 standard convolution with each previous layer, the shallow information loss due to network deepening can be effectively avoided. Using batch normalization (BN) [42] can reduce the dependence of the network on parameter initialization, make the training faster, and use a higher learning rate. In addition, compared with the natural image data set [43], the number of remote sensing images available for training Remote Sens. 2021, 13, 4379 4 of 25 is very small. To avoid possible over-fitting during training, L2 regularization is adopted after the cost function, which is: The partial derivative of the above formula is: In the gradient descent algorithm, in order to converge as quickly as possible, the parameters will be updated along the negative direction of the gradient, so a negative sign is added before the partial derivative of Formula (2) and multiplied by a learning rate factor χ to obtain the final iteration weight parameter j , that is: where γ is the regularization factor, we set it to 0.005, J() is the objective function, x is the training sample, y is the label corresponding to the training sample, and h (x (i) ) is the predicted value. As can be seen from Formula (4), each time the gradient is updated j is multiplied by a factor 1 − χγ m less than 1, so as to attenuate the weight parameters and prevent overfitting. In Group 9, global average pooling [44] is used instead of the traditional full connection layer to avoid the overfitting of the full connection layer.
Remote Sens. 2018, 10, x FOR PEER REVIEW 4 of 26 The partial derivative of the above formula is: In the gradient descent algorithm, in order to converge as quickly as possible, the parameters will be updated along the negative direction of the gradient, so a negative sign is added before the partial derivative of Formula (2) and multiplied by a learning rate factor χ to obtain the final iteration weight parameter j  , that is: where γ is the regularization factor, we set it to 0.005, J () is the objective function, x is the training sample, y is the label corresponding to the training sample, and is the predicted value. As can be seen from Formula (4), each time the gradient is updated j  is multiplied by a factor 1 m χγ − less than 1, so as to attenuate the weight parameters and prevent overfitting. In Group 9, global average pooling [44] is used instead of the traditional full connection layer to avoid the overfitting of the full connection layer.

Shallow Feature Extraction Strategy
The first and second sets of downsampling structures are designed to extract the shallow features of the network. In the process of shallow feature extraction, the effect of downsampling on network performance is significant. Downsampling is the reduction of the convoluted feature map to a certain scale, reducing the spatial size of the image while preserving the main features of the image. The main methods of downsampling in deep convolution neural networks are maximum pooling downsampling and convolution downsampling. Pooling is a non-linear downsampling method that requires a deep convolution overlay. Generally speaking, it is better to use pooled downsampling for small convolution networks, but when the network is deep, multi-layer overlay convolution can learn better non-linear features from the training set. After analyzing the advantages and disadvantages of the two downsampling methods, a hybrid downsampling method based on pooling and convolution is proposed. The proposed hybrid downsampling structure is shown in Figure 2c. The pooling branch in this structure is used for downsampling, but pooling will lead to the loss of some key feature information, which is not conducive to extracting deep network features. Therefore, we use convolution to compensate for the lost features in another branch, which reduces the feature size and ensures the integrity of information to a great extent. Figure 2a,b are multi-layer convolution downsampling and pooling downsampling, respectively. In order to verify the performance of the proposed downsampling methods, the experimental comparisons of the three sampling methods are carried out in the third section.

Shallow Feature Extraction Strategy
The first and second sets of downsampling structures are designed to extract the shallow features of the network. In the process of shallow feature extraction, the effect of downsampling on network performance is significant. Downsampling is the reduction of the convoluted feature map to a certain scale, reducing the spatial size of the image while preserving the main features of the image. The main methods of downsampling in deep convolution neural networks are maximum pooling downsampling and convolution downsampling. Pooling is a non-linear downsampling method that requires a deep convolution overlay. Generally speaking, it is better to use pooled downsampling for small convolution networks, but when the network is deep, multi-layer overlay convolution can learn better non-linear features from the training set. After analyzing the advantages and disadvantages of the two downsampling methods, a hybrid downsampling method based on pooling and convolution is proposed. The proposed hybrid downsampling structure is shown in Figure 2c. The pooling branch in this structure is used for downsampling, but pooling will lead to the loss of some key feature information, which is not conducive to extracting deep network features. Therefore, we use convolution to compensate for the lost features in another branch, which reduces the feature size and ensures the integrity of information to a great extent. Figure 2a,b are multi-layer convolution downsampling and pooling downsampling, respectively. In order to verify the performance of the proposed downsampling methods, the experimental comparisons of the three sampling methods are carried out in the third section.  Figure 3a is the basic structure for optimizing time and space complexity. The structure is derived from the fusion of two branches with similar structures. For the sake of description, one of the branches is explained. According to the number of input and output channels in the first layer, two different structures are shown in Figure 3b and Figure  3c. Each layer of the structure uses 3 × 3 depthwise separable convolution, 1 × 1 standard convolution, and Identity for feature extraction. Starting from the second layer, each layer of the structure not only uses 3 × 3 depthwise separable convolution, 1 × 1 standard convolution, and Identity for feature extraction but also integrates with the features extracted by 1 × 1 standard convolution in the previous layer. The process of dense fusion of multilevel features is as follows:  Figure 3a is the basic structure for optimizing time and space complexity. The structure is derived from the fusion of two branches with similar structures. For the sake of description, one of the branches is explained. According to the number of input and output channels in the first layer, two different structures are shown in Figure 3b,c. Each layer of the structure uses 3 × 3 depthwise separable convolution, 1 × 1 standard convolution, and Identity for feature extraction. Starting from the second layer, each layer of the structure not only uses 3 × 3 depthwise separable convolution, 1 × 1 standard convolution, and Identity for feature extraction but also integrates with the features extracted by 1 × 1 standard When the number of input and output channels of the first layer is not the same ( 1 2 C C ≠ ), the structure is shown in Figure 3c. Only the first layer has no Identity branch, and the other layers have the same structure as in the case. The output feature of each layer is: The process of reducing model complexity is analyzed in detail as follows. The time complexity of a convolution neural network can be represented as:

Strategies to Optimize Time and Space Complexity
Here, L denotes the number of convolution layers of the neural network, The spatial complexity of convolution neural networks is: In Formula (9), the first summation expression represents the total weight parameters of all the layers with parameters in the model, and the second summation expression represents the size of the output feature map of each layer in the model.  When the number of input and output channels of the first layer is the same (C1 = C2), the structure is as shown in Figure 3b. The 3 × 3 branch of layer i − 1 in this structure is represented by δ(BN(M (i−1) * W (3) )), and the Identity branch of layer i − 1, by δ(BN(M (i−1) )). Since each layer is fused with the 1 × 1 convolution branch of the previous layer starting from the second layer, we use . In particular, we specify that the input of the first layer M (1) is M (0) . The output features of each layer can be represented as: Here, BN is batch standardization; δ is the ReLU activation function; W (3) ∈ R C 1 ×C 2 ×3×3 represents the 3 × 3 depthwise separable convolution where the number of input chan-Remote Sens. 2021, 13, 4379 7 of 25 nels is C 1 and the number of output channels is C 2 . W (1) ∈ R C 1 ×C 2 represents the 1 × 1 convolution where the number of input channels is C 1 and the number of output channels is C 2 .
When the number of input and output channels of the first layer is not the same (C1 = C2), the structure is shown in Figure 3c. Only the first layer has no Identity branch, and the other layers have the same structure as in the case. The output feature of each layer is: The process of reducing model complexity is analyzed in detail as follows. The time complexity of a convolution neural network can be represented as: Here, L denotes the number of convolution layers of the neural network, M i denotes the size of the output feature map of the i convolution layer, K i denotes the convolution kernel size of the i convolution layer, i denotes the i convolution layer of the neural network, and C i−1 and C i denote the number of input and output channels C in and C out of the i convolution layer of the neural network, respectively.
The spatial complexity of convolution neural networks is: In Formula (9), the first summation expression represents the total weight parameters of all the layers with parameters in the model, and the second summation expression represents the size of the output feature map of each layer in the model.

Replace the Full Connection Layer with Global Average Pooling
Full Connection Layer is a special convolution layer whose convolution kernel size is the same as the input data size. The output feature map of each convolution kernel is a scalar, i.e., M = 1. The time and space complexity of the full connection layer are: where X represents the size of the input image, M represents the size of the output feature map for each convolution kernel, K represents the size of the convolution kernel, C in and C out represent the number of input and output channels, respectively. Formulas (10) and (11) show that the complexity of the full connection layer is related to the size of the input data. For global average pooling, the time and space complexity are: As seen from (12) and (13), after using global average pooling, both time and spatial complexity are only related to the number of input-output channels, and the number of operations and parameters are greatly reduced.

Replacing Standard Convolution with Depthwise Separable Convolution
The standard convolution operation is to convolute all the channels of the input for each convolution kernel, and the depthwise separable convolution is that each convolution kernel acts on only a certain channel of the input, which reduces the complexity of the model.
The time complexity of standard convolution is: The time complexity of depthwise separable convolutions is: The number of parameters P conv of the standard convolution is: The number of parameters P dsc for the deep separable convolution is: The parameter number ratio of the depthwise separable and standard convolutions is: As can be seen from (14)- (18), when using a convolution kernel of 3 × 3, the parameter amount of the depthwise separable convolution is about 1/9 that of the standard convolution. Therefore, the depthwise separable convolution is adopted, which can greatly reduce the number of parameters, effectively reduce the complexity of the model, and improve the running speed of the model.

Identity
In terms of network structure, the shallow network extracts simple and specific features. With the deepening of network structure, the extracted features become more complex and abstract. Since simple and abstract features can describe images from different aspects, the classification performance can be effectively improved through the information interaction between different hierarchical features. If Identity is not used, the classification of all images can only rely on complex features. After Identity is adopted, the shallow features can be retained, which can accelerate the running speed of the network.

Results
In this section, the proposed dual branching multi-level feature dense fusion method is comprehensively evaluated. Experiments are performed on four challenging datasets. The proposed BMDF-LCNN method is compared with some state-of-the-art classification methods. Experimental results demonstrate that the proposed method performs well with respect to various contrast indexes.

Dataset Settings
To prove the superiority of the proposed BMDF-LCNN method, the proposed BMDF-LCNN method and some state-of-the-art classification methods were compared experimentally in four datasets, the UC dataset [45], RSSCN dataset [46], AID dataset [47], and NWPU dataset [26]. The UC dataset is a remote sensing dataset of land use images from the USGS national map urban area imagery with a total of 2100 land use images in 21 categories, with 256 × 256 pixels per image. The RSSCN dataset is a remote sensing image dataset from Wuhan University with seven categories consisting of 2800 images, with 400 × 400 Remote Sens. 2021, 13, 4379 9 of 25 pixels per image. These images are sampled using different scales in different seasons and different weather conditions, making this dataset relatively challenging. The AID dataset is published jointly by Huazhong University of Science and Technology and Wuhan University. It has 10,000 remote sensing image datasets in 30 categories, with 600 × 600 pixels per image. The NWPU dataset is a remote sensing image dataset with 45 categories and 31,500 images created by Northern Polytechnic University, with 256 × 256 pixels per image. This dataset has the largest image size among the four datasets and the highest intra-class differences and inter-class similarities, which causes great challenges for classification tasks.

Setting of the Experiments
For the UC dataset, 80% of the images were randomly selected in each category as training data for model learning, and the remaining 20% of the images were used as test data to examine the performance of the model. For the RSSCN dataset, 50% of the images in each category were randomly selected as training data for model learning, and the remaining 50% were used as test data to verify model performance. For the AID dataset, 50% and 20% of the images were randomly selected in each category as training data for model learning, and the remaining images were used as test data to examine the performance of the model. For the NWPU45 dataset, 20% and 10% of the images in each category were randomly selected as training data for model learning, and the rest of the images were used as test data to verify the performance of the model.
The size of each convolution kernel is shown in Figure 1. Other settings are as follows: In Group 1 and Group 2, the number of convolution filters is 32 and 64, respectively, with the first convolution step being 2 and the remaining convolution step being 1. To further extract high-level features, the number of convolution filters from Group 3 through Group 8 are 128, 128, 256, 256, and 512, respectively. Set the max-pooling size from Group 1 to Group 8 to 2 × 2, and the pool step is 2. All the steps of Group 3 through Group 8 are 1. To overcome the drawbacks of the small size of the training data and improve the generalization ability of the model, we used data enhancement to increase data diversity. The settings for data enhancement are as follows: (1) Multiplying all pixels of the input image by a scaling factor, which was set to 1/255, reduced the pixel value to between 0 and 1 and favored convergence of the model. Furthermore, to reduce the risk of spillover of memory during training caused by excessive amounts of data, the input images were resized 256 × 256 with the bilinear interpolation method before training. Throughout the training process of the model, using an automatic learning rate reduction mechanism can reduce the learning rate according to the training situation and can quickly and accurately find the optimal model. The initial learning rate was set to 0.01. During the training process, the batch size was set to 16, and the momentum optimization algorithm was used to optimize the network for better and more stable convergence. Here we set the momentum factor to 0.9. The software used throughout the experiment was PyCharm. The final results were obtained by averaging the results of 10 experiments. The computer's configuration is as follows: RAM: 16GB; Processor: AMD Ryzen 7 4800H with RadeonGraphics@2.90GHz; GPU: NVIDIAGeForceRTX2060 6G.

The Performance of the Proposed Model
To verify the advantages of the proposed BMDF-LCNN method over other methods, six evaluation indexes, including overall accuracy (OA), average accuracy (AA), Kappa coefficient (Kappa), confusion matrix, average training time (ATT), and weighting parameters were used to evaluate the proposed method comprehensively. OA represents the ratio of the correct number of classes to the total number of classes on all test sets, AA represents the ratio of the correct number of predictions for each class to the total number of classes and measures the quality of classification on each class by the proposed methods, ATT represents the average time spent training each image by the proposed methods, and F1 score can be regarded as the weighted average of model accuracy and recall rate. The higher the F1 value, the better the model. During the experiments, to ensure fairness of the experiments, all comparative experiments were conducted in the same experimental setting. Considering the proposed method is an improvement on the lightweight convolutional neural network-branch feature fusion (LCNN-BFF) method, in order to verify the advantages of the improved BMDF-LCNN method over the LCNN-BFF method, we use OA, AA, Kappa, and confusion matrix as evaluation indicators to compare the proposed method with LCNN-BFF method on four datasets: UC [45], RSSCN [46], AID [47], and NWPU [26]. The OA and Kappa results of the LCNN-BFF and BMDF-LCNN methods on six datasets are shown in Table 1. As can be seen in Table 1, except for the 80/20UC dataset, both the OA and Kappa values of the proposed BMDF-LCNN method were elevated by more than 1% over those of the LCNN-BFF [41] method. The classification accuracy and Kappa value of the proposed BMDF-LCNN method on the UC dataset are close to 100%, which indicates that the method has better classification advantages on the UC dataset. Similarly, the BMDF-LCNN method has also achieved good classification results for the AID and NWPU datasets, with the most improvement on 10/90NWPU datasets, 5.12% higher classification accuracy, and 4.43% higher Kappa value than LCNN-BFF [41], indicating that the proposed method has better performance. Further, we use AP, F1, and confusion matrix as indicators to validate the advantages of the proposed method over other state-of-the-art classification methods.
Next, the confusion matrix is adopted to evaluate the performance of this method on four datasets, 20/80AID, 50/50RSSCN, 10/90NWPU, and 80/20UC. Each column in the confusion matrix represents the prediction category. Each line represents the actual category. The value on the diagonal line represents the probability value of the correct classification. The value outside the diagonal line indicates the probability of being wrongly classified as the current class.
Next, the confusion matrix is adopted to evaluate the performance of this method on four datasets, 20/80AID, 50/50RSSCN, 10/90NWPU, and 80/20UC. Each column in the confusion matrix represents the prediction category. Each line represents the actual category. The value on the diagonal line represents the probability value of the correct classification. The value outside the diagonal line indicates the probability of being wrongly classified as the current class.  From the results in Figure 5a, it can be seen that the classification accuracy of the BMDF-LCNN method for 'Overpass' and 'Storage tanks' is 95% on the 80/20UC dataset and 100% for other scenarios, which proves that this method has excellent performance on the UC dataset. Figure 5b shows that the BMDF-LCNN method achieves a classification accuracy of more than 96% for most scenes on the 50/50 RSSCN dataset. The recognition rate for 'Industry' is 94%. This is because the two scenes, 'Industry' and 'Parking', From the results in Figure 5a, it can be seen that the classification accuracy of the BMDF-LCNN method for 'Overpass' and 'Storage tanks' is 95% on the 80/20UC dataset and 100% for other scenarios, which proves that this method has excellent performance on the UC dataset. Figure 5b shows that the BMDF-LCNN method achieves a classification accuracy of more than 96% for most scenes on the 50/50 RSSCN dataset. The recognition rate for 'Industry' is 94%. This is because the two scenes, 'Industry' and 'Parking', have mutually inclusive relationships, with the presence of cars in the industry and industry in the parking, which leads to easy confusion when classifying. Nevertheless, the BMDF-LCNN method still achieves high classification accuracy. For the confusion matrix in Figure 6a, we can see that there are 20 categories on the 20/80AID datasets with classification accuracy above 95%, with the accuracy of 'Forest' and 'Park' at 100%. Five percent of the 'Squares' are misclassified as 'Parks', and five percent of the 'Schools' are misclassified as 'Commercial', mainly due to the high class similarity between 'Parks' and 'Squares', and 'Schools' and 'Commercial'. In Figure 6b, on the For the confusion matrix in Figure 6a, we can see that there are 20 categories on the 20/80AID datasets with classification accuracy above 95%, with the accuracy of 'Forest' and 'Park' at 100%. Five percent of the 'Squares' are misclassified as 'Parks', and five percent of the 'Schools' are misclassified as 'Commercial', mainly due to the high class similarity between 'Parks' and 'Squares', and 'Schools' and 'Commercial'. In Figure 6b, on the 10/90NWPU datasets with high similarities between classes and intra-class differences, the classification accuracy of 39 classes is more than 90%, and that of 'Chaparral' and 'Snowberg' are 100%. Due to the high class similarity between 'Palaces' and 'Churches', 12% of palaces are misclassified as churches.
The above experiments fully demonstrate the validity of the proposed method through multiple evaluation indexes. The experimental results show that the dense fusion structure of two-branch and multi-layer features can significantly improve the classification accuracy and robustness of the network through the dense communication of different hierarchical features.

Comparison with Advanced Methods
In this section, in order to further verify the advantages of the proposed BMDF-LCNN method in model complexity and classification accuracy, the most advanced remote sensing scene classification methods proposed in the last two years were chosen and compared with the proposed BMDF-LCNN method on the UC [45], RSSCN [46], AID [47], and NWPU datasets [26]. These methods were evaluated using OA, AA, F1, the number of parameters, Kappa, ATT, and FLOPs as evaluation indexes.

Experimental Results on UC-Merced Datasets
The comparison of the number of parameters, OA, AA, and F1 obtained by the proposed BMDF-LCNN method and that of the advanced methods are shown in Table 2. We can see in Table 2, for the UC dataset [45] with a training rate of 80%, the proposed method achieves a classification accuracy of 99.53%, which exceeds all the comparison methods. This indicates that the dense fusion module with two branches and multi-layers can significantly improve classification accuracy. Inception-v3-CapsNet [35], SF-CNN with VGGNet [32], SCCov [48] and PANNet [49] all achieve more than 99% classification accuracy. However, these four methods have a large number of parameters and do not have a good trade-off between the complexity of the model and the classification accuracy. The parameters of SCCov [48] are only 6M, which is the same as that of the proposed BMDF-LCNN method. However, the accuracy of SCCov [48] is only 98.04%, which is 1.49% lower than the proposed method. The F1 score of the proposed method is 99.51%, 1.49% higher than the lightweight method SCCov [48] and 1.42% higher than Contourlet CNN [50]. In addition, the Kappa values of the proposed methods are compared with those of the most advanced methods on the UC dataset [45], and the results are shown in Table 3. As shown in Table 3, the Kappa value of the proposed BMDF-LCNN method is 99.50%, 1.69% higher than that of Contourlet CNN [50], 1.87% higher than that of LiG with sigmoid kernrl [51], and 1.76% higher than that of SE-MDPMNet [34]. The comparison of the above data shows that the proposed BMDF-LCNN method can provide better classification performance.  To verify the strong immediacy of the proposed method, the proposed BMDF-LCNN method and several state-of-the-art methods were experimentally contrasted on UC datasets [45] under the same configuration conditions. The ATT comparison results are shown in Table 4. From Table 4, we can see that the ATT of the proposed method is 0.017s, which saves 0.035s, 0.031s to process an image with the two methods in [54] and saves 0.036s and 0.022s to process an image with the two methods in [50]. This further verifies the effectiveness of the method. Table 4. The average time between the proposed model and some advanced methods for image processing.

Experimental Results on AID Datasets
The comparison results between the proposed BMDF-LCNN method and the most advanced method are listed in Table 6. When the training ratio is 20%, the overall classification accuracy of the proposed method reaches 94.46%, which is 0.29% and 0.33% higher than that of LiG with RBF kernel [61] and that of Fine-tuneMobileNetV2 [34], respectively. The average accuracy of the proposed method is 94.24%, which is 2.89% and 0.19% higher than the lightweight methods SCCov [48] and LiG with RBF kernel [61], respectively. When the training ratio is 50%, the proposed method has the highest overall classification accuracy, which is 96.76%, which exceeds the accuracy of all the comparison methods. This accuracy is 1.31% higher than that of FACNN [55], 0.57% higher than that of LiG with RBF kernel [61], and 0.8% higher than that of Fine-tune MobileNetV2 [34]. Compared with the lightweight networks SCCov [48] and VGG_VD16 + SAFF [57], the average classification accuracy of the proposed method is improved by 3.07% and 2.76% respectively. This proves that our method can extract the features of images more effectively and understand the semantics of images more accurately. As far as the weight parameters are concerned, the weight parameters of the proposed method are 6M, slightly higher than that of LiG with RBF kernel [61], but our method can provide higher classification accuracy than LiG with RBF kernel [61].
The Kappa values of the proposed BMDF-LCNN method are compared with those of other methods, as shown in Table 7. It can be seen that the Kappa values of the proposed method are 96.24%, 1.91% higher than that of LiG with RBF kernel [61] and 1.41% higher than that of Fine-tune MobileNet V2 [34].

Experimental Results on NWPU Dataset
Experiments were carried out on the NWPU data set. The comparison results between the proposed BMDF-LCNN method and some of the most advanced methods are shown in Table 8. In Table 8, when the training ratio is 10%, the overall classification accuracy of the proposed method reaches 91.65%, which is 1.42% higher than that of LiG with RBF kernel [61] and 1.46% higher than that of LiG with sigmoid kernel [51]. Compared with the lightweight networks SCCov [48] and LiG with RBF kernel [61], the average classification accuracy of the proposed method is improved by 7.59% and 1.87% respectively. When the training ratio is 20%, the overall classification accuracy is 0.32%, which is 0.36% and 0.02% higher than that of LiG with RBF kernel [61], LiG with sigmoid kernel [51], and MSDFF [63], respectively. The average accuracy of the proposed method is 93.56%, which is 4.48% and 0.8% higher than the lightweight methods Contourlet CNN [50] and MSDFF [63], respectively. In terms of parameters, compared with LiG with RBF kernel [61] with small parameters, when the training ratio is 10%, the classification accuracy of the proposed method is improved by 1.42%, and when the training ratio is 20%, the classification accuracy of the proposed method is improved by 0.32%. Compared with SSCov [48] with the same parameters, when the training ratio is 10%, the classification accuracy of the proposed method is improved by 7.32%, and when the training ratio is 20%, the classification accuracy of the proposed method is improved by 6.27%. Experimental results show that the proposed method has better classification performance and fewer parameters, which is very suitable for mobile devices. The comparison of Kappa values of different methods is shown in Table 9. It can be seen that the Kappa of the proposed method is 93.42%, which is 0.40% and 0.49% higher than that of LiG with RBF kernel [61] and Fine-tune MobileNet V2 [34], respectively. The validity of the proposed method is further proved.

Comparison of Three Downsampling Methods
To validate the performance of our proposed downsampling methods, three downsampling methods mentioned in Section IIB are used in the first and second layers of the network. Experiments were performed on two datasets, i.e., UC and RSSCN, and the OA and Kappa were used as evaluation indicators. As shown in Figure 2, for the Conv-Downsampling (CD), the first and third convolution steps are 1, and the second and fourth convolution steps are 2. For the pooling downsampling (Maxpooling-Downsampling, MD), the convolution kernels are all 3 × 3, with convolution steps of 1 × 1. The size of max-pooling is 2 × 2, and the pooling step size is 2. A new downsampling method is proposed in Figure 2c. The experimental results are shown in Table 10. As shown in Table 10, both the classification accuracy and Kappa values of pooling downsampling are lower than those of convolution downsampling on the two datasets. The reason is that convolution downsampling in deep networks yields better non-linear performance than pooled downsampling. The classification accuracy of the proposed downsampling methods on 80/20UC and 50/50RSSCN datasets is 99.53%, 97.86%, and the Kappa values are 99.50%, 97.50%, respectively, which are higher than those of the other two downsampling methods. This further proves that the multi-level features dense fusion method can classify remote sensing scene images more effectively.

Evaluation of Size of Models
To further validate the effectiveness of our proposed method, we used FLOPs and parameter quantities to compare it with advanced methods, where FLOPs measure the complexity of the model, and the parameter quantities measure the size of the model. The results are shown in Table 11. It can be seen from Table 11 that compared with LCNN-BFF, the proposed method has slight advantages in parameter quantity and FLOPs, and the classification accuracy is still 3.22% higher than that of LCNN-BFF, which proves the great advantages of the proposed method. In addition, compared with other lightweight methods, such as MobileNetV2 [34] and SE-MDPMNet [34], the proposed method also can achieve a higher classification accuracy with fewer FLOPs and realize a good balance between model accuracy and complexity.

Discussions
In order to show the performance of the proposed method more intuitively, in this section, three kinds of visualization including grad cam, t-distribution random neighbor embedding (T-SNE), and randomly selected and tested are discussed and analyzed. The grad cam displays the extracted features according to the degree of significance through the visual thermal map. The last layer of convolution neural network contains the richest spatial and semantic information. Therefore, grad cam makes full use of the features of the last layer of convolution to generate an attention map to display important areas of an image. In this experiment, some remote sensing scene images 'Industries', 'Fields', 'Residence', 'Grass', 'Forests' in the RSSCN dataset are randomly selected. The visualization results of thermal diagrams of the improved BMDF-LCNN method with the original LCNN-BFF method are shown in Figure 7.
We can see that from Figure 7, for 'Industries' scenarios, the LCNN-BFF method does not accurately focus on the factory area but instead shifts the focus to the highway, whereas the proposed BMDF-LCNN method is well focused on the factory area. For both 'Fields' and 'Grass' scenarios, there was a partial deviation in the focused areas predicted by the LCNN-BFF model, ignoring the similar surrounding targets and searching with limited targets, while the BMDF-LCNN method is well focused on the target area. In addition, for scenario areas such as 'Residence' and 'Forests', the LCNN-BFF method has limited coverage and cannot extract the target completely, thus affecting the classification accuracy. However, the proposed BMDF-LCNN method can obtain a complete area of interest in these scenarios.  Next, we visualize the classification results on UC (8/2) and RSSCN (5/5) datasets using t-distribution random neighbor embedding (T-SNE). T-SNE maps high-latitude features to two-dimensional or three-dimensional space for visualization, which can evaluate the classification effect of the model very well. The result of the T-SNE visualization is shown in Figure 8.   From Figure 8, it can be seen that the proposed model has better global feature representation ability and increases the separability and relative distance between individual semantic clusters, which can more accurately distinguish different scene categories and improve the classification performance of the method.
In addition, some scene images of the UCM21 dataset were randomly selected and tested by the proposed BMDF-LCNN method. The experimental results are shown in Figure 9. As shown in Figure 9, for these test images, the predictive confidences of the proposed model are all more than 99%; some even reach 100%. This demonstrates that the proposed method can extract significantly discriminative features from input images more effectively and improve the classification accuracy of remote sensing scene images.

Conclusions
For the classification of remote sensing scene images, a lightweight network based on the dense fusion of dual-branch, multi-level features is presented. In addition, a new downsampling method was designed to obtain more representative feature information. The network through the three branches of 3 × 3 depthwise separable convolution, 1 × 1 standard convolution, and identity, the information of the current layer can be fully extracted and fused with the features extracted by 1 × 1 standard convolution in the previous layer, which realizes the information interaction between different levels of features, and effectively improves the classification performance and computational speed of the model. The proposed method is compared with the other most advanced methods on four data sets of remote sensing scene images. Experiments show that the proposed method can provide better classification accuracy and achieve a balance of speed and classification performance.
The proposed model still needs to be improved. When multi-level feature intensive

Conclusions
For the classification of remote sensing scene images, a lightweight network based on the dense fusion of dual-branch, multi-level features is presented. In addition, a new downsampling method was designed to obtain more representative feature information. The network through the three branches of 3 × 3 depthwise separable convolution, 1 × 1 standard convolution, and identity, the information of the current layer can be fully extracted and fused with the features extracted by 1 × 1 standard convolution in the previous layer, which realizes the information interaction between different levels of features, and effectively improves the classification performance and computational speed of the model. The proposed method is compared with the other most advanced methods on four data sets of remote sensing scene images. Experiments show that the proposed method can provide better classification accuracy and achieve a balance of speed and classification performance.
The proposed model still needs to be improved. When multi-level feature intensive fusion occurs, some redundant data will be generated, which increases the computational complexity. Future work should find a method that can selectively fuse, reduce the generation of redundant data, and further construct a lightweight model that incorporates both speed and precision.