Decision-Level Fusion with a Pluginable Importance Factor Generator for Remote Sensing Image Scene Classification

Abstract: Remote sensing image scene classification is an important task in remote sensing applications and benefits from the strong performance of deep convolutional neural networks (CNNs). When applying deep models to this task, the challenges are, on the one hand, that targets with highly different scales may exist in an image simultaneously and small targets can be lost in the deep feature maps of CNNs; and, on the other hand, that remote sensing image data exhibit high inter-class similarity and high intra-class variance. Both factors can limit the performance of deep models, which motivates us to develop an adaptive decision-level information fusion framework that can be incorporated into any CNN backbone. Specifically, given a CNN backbone that predicts multiple classification scores based on the feature maps of different layers, we develop a pluginable importance factor generator that predicts a factor for each score. The factors measure how confident the scores in different layers are with respect to the final output. Formally, the final score is obtained by a class-wise weighted summation of the scores and the corresponding factors. To reduce the co-adaptation effect among the scores of different layers, we propose a stochastic decision-level fusion training strategy that enables each classification score to randomly participate in the decision-level fusion. Experiments on four popular datasets, namely the UC Merced Land-Use dataset, the RSSCN 7 dataset, the AID dataset, and the NWPU-RESISC 45 dataset, demonstrate the superiority of the proposed method over other state-of-the-art methods.


Introduction
The rapid development of sensor technology enables us to easily collect a large number of high-resolution optical remote sensing images, which contain rich spatial details. How to efficiently understand and recognize the semantic contents of these images has become a popular research topic in recent years. As one of the basic tasks in understanding and recognizing remote sensing images, optical remote sensing image scene classification aims at automatically labelling the scene category (such as airport, forest, residential area, or church) according to the semantic content of each image [1]. It plays an important role in application fields including natural hazard detection, vegetation mapping [2], environment monitoring [3], geospatial object detection [4], and LULC determination [5].
While the high-resolution of remote sensing images is a valuable property for subsequent vision tasks, the complex image details and structures pose a difficult problem in the modelling of feature representation. The traditional methods [6][7][8][9] based on handcrafted features exhibit limited ability in understanding the remote sensing images. By contrast, the deep learning-based methods [10][11][12][13][14][15] achieve superior performance on remote sensing scene classification by benefiting from the powerful capability of extracting hierarchical semantic features. Especially, the convolutional neural network (CNN) is a typical deep model that dominates the recent research on remote sensing image scene classification.
Remote sensing images possess intrinsic characteristics that differ from those of natural images. Hence, CNN-based models that perform well on natural images cannot be directly applied to remote sensing images. As shown in Figure 1, small objects are crucial to correctly classifying the image scenes but only occupy small areas of the whole image, which is a common case in remote sensing data. This indicates that it is necessary to carefully consider small objects in remote sensing image scene classification. However, in most of the existing CNN-based models, the features of small objects are easily lost during the convolution or subsampling operations, especially in the deep layers. Hence, using deep CNN features alone is not sufficient to accurately recognize the content of the input image. Considering the impact of small-scale details in remote sensing images, it is encouraged to utilize not only the deep features, which have high-level semantics, but also the shallow features, which contain positional and fine-grained information. Fusing shallow and deep features is a promising way to improve the feature representation of remote sensing images.

Figure 1. Example scenes in which small objects determine the category: Airplane, Freeway, Basketball_Court, Golf_Course, Ship, Thermal_Power_Station, Tennis_Court, and Island.

In recent years, different strategies of feature-level fusion have already been investigated [16][17][18][19][20], which can produce more discriminative features and better performance than methods that only utilize deep features. Nevertheless, such feature-level fusion still has shortcomings. On one hand, the sizes of the feature maps at different layers are inconsistent, so it is necessary to adjust the sizes via deconvolution or interpolation before fusion, which increases the computation cost. On the other hand, the value range of the feature maps at different layers may also differ, so normalization is usually required before feature fusion, which, however, could harm the discriminativity of the resultant feature space. Both factors suggest that the feature-level fusion strategy incurs extra computational cost, which inevitably reduces the efficiency of the model on low-end devices with limited resources, such as embedded devices.
One important characteristic of remote sensing images is that they contain various kinds of objects with different poses, shapes, and illuminations, which leads to high intra-class variance within the same object category. As shown in Figure 2a, the four images with very different shapes and layouts are all labelled as church. Another characteristic is that images with similar appearances can be labelled as different categories, which causes high inter-class similarity between different categories. As shown in Figure 2b, although the four images have similar views, the left two examples belong to dense residential while the right two belong to medium residential. Therefore, the properties of high inter-class similarity and high intra-class variance pose a great difficulty for remote sensing image scene classification, which is addressed through metric learning in most of the existing methods. For example, Cheng et al. [21] optimized a distance function on the CNN features such that the distances between object features of the same category were minimized, while the distances between object features of different categories were maximized. Wei et al. [22] proposed an adaptive marginal centre loss function to enhance the contribution of hard samples during the training process.
(a) (b) Figure 2. Illustration of the high inter-class similarity and high intra-class variance in remote sensing image datasets. (a) Four images with very different appearances that are all labelled as church. (b) Dense residential in the left two images and medium residential in the right two images.
The above-discussed issues have been tackled by existing methods which, however, focus only on a single issue rather than developing a unified pipeline for solving all of them. In this paper, we propose an end-to-end adaptive decision-level fusion framework, which can simultaneously address the issues of information fusion, high inter-class similarity, and high intra-class variance in remote sensing image scene classification. Specifically, the fusion is conducted among the classification scores produced by different layers, meaning that information with different levels of semantics is fused. In this way, it is not necessary to adjust the dimensions of the classification scores before fusion, nor to consider a normalization operation for constraining the value range; hence, a natural fusion process is obtained. To implement an adaptive fusion strategy, we propose an importance factor generator that can dynamically assign an importance factor to the corresponding score of each category, which helps to decrease the high inter-class similarity and the high intra-class variance in the score space. To further improve the performance, we develop a stochastic decision-level fusion training strategy that allows each classification score to participate in the decision-level fusion with a certain probability in each training iteration, which is inspired by the idea of Dropout [23]. Experimental results compared with the state-of-the-art methods on four popular remote sensing scene classification datasets demonstrate the superiority of the proposed method. The main contributions of this paper can be summarized as follows.
• We propose an end-to-end adaptive decision-level information fusion strategy, which not only overcomes the problems brought by feature-level fusion but also addresses the issues of high inter-class similarity and high intra-class variance in remote sensing images.
• An importance factor generator is developed to support the decision-level fusion, which can adaptively generate importance factors for different categories in the classification score.
• By improving Dropout, we propose a stochastic training strategy for the decision-level fusion, which is empirically proved effective.
• Extensive experiments on public datasets demonstrate the state-of-the-art performance of our method.
The remainder of the paper is organized as follows. First, we review the related work in Section 2, followed by a detailed presentation of the proposed method in Section 3. Experiments and discussion are presented in Sections 4 and 5, respectively. Finally, conclusions are drawn in Section 6.

Related Work

The Conventional CNN Features-Based Methods
Among the recent competitive models, CNN has been widely applied in remote sensing image scene classification due to its strong nonlinear representation ability. Early applications of CNN employed simple convolutional architectures to classify remote sensing images directly. For example, Dimitrios et al. [24] applied a model pretrained on ImageNet to the task of remote sensing image scene classification, which alleviated the problem of limited training samples in remote sensing datasets. Cheng et al. [12] used a pretrained model to extract features which were then classified by a linear SVM. This method also empirically verified the effectiveness of model fine-tuning via transfer learning. Compared with the handcrafted feature-based methods, these deep learning-based methods have achieved a great improvement in classification performance.

The Feature-Level Fusion-Based Methods
To improve the discriminativity of deep features, information fusion has been integrated into the elaborated CNN architectures. Feature-level fusion is a typical option for information fusion in remote sensing scene classification. Zhang et al. [25] proposed an efficient multi-network integration method for the first time through parameter sharing, resulting in significant improvement of the classification performance compared with a single network. Zhang et al. [27] combined the features extracted from VGG-16 and Inception V3, which were then fed into the capsule network to obtain better classification results. Muhammet et al. [28] proposed a network integration strategy based on Snapshot and Homogeneous, which only fused the information of the last convolutional layer of each network, overcoming the problem of costly computation in multi-network integration. These feature-level fusion strategies based on multiple networks could yield pleasing classification performance, but bring heavy computation cost, hence being inefficient.
Other forms of feature-fusion are also common in remote sensing image scene classification. Fang et al. [18] explored the features of remote sensing images in the frequency domain, which were fused with the spatial-domain features to obtain better robustness. Dong et al. [16] fused the deep CNN features with the GIST features before the classification by LSTM. Xu et al. [17] proposed a two-stream feature fusion method, where one stream provided features using a pretrained CNN while the other output the multi-scale unsupervised MNBOVW features. Shi et al. [19] proposed a network with several groups, each of which had two branches for feature extraction and fusion respectively. Xu et al. [20] fused four groups of intermediate feature maps of the network through pooling, transformation, and other operations, to encourage efficient adoption among multi-layer features.
Lu et al. [30] claimed that the label information was crucial to the effect of feature fusion, and proposed to supervise the softmax outputs of the fused features by semantic labels.
In remote sensing scene classification, the semantic features of an image may be closely correlated with local regions, which suggests that different regions in a remote sensing image are of different importance for accurate classification. However, the above-reviewed methods ignore this fact. To solve this problem, Sun et al. [31] proposed a gating mechanism, allowing the network to determine which area is suitable to be fused. Based on the attention mechanism, Ji et al. [32] drove the network to focus on the features of the most attractive regions in an image, such that redundant features could be removed and the classification performance improved. Yu et al. [33] adopted ResNet-50 to extract features which were then adaptively enhanced by a channel attention module and finally fused by bilinear pooling. Zhang et al. [34] proposed a location-context fusion module, which could not only distinguish different regions but also satisfied the translation invariance that is very helpful for image scene classification.

The Decision-Level Fusion-Based Methods
In addition to model fusion and feature fusion, decision-level information fusion methods have also been applied in remote sensing scene classification. Li et al. [35] used a CNN model to obtain the top-N categories, and then input the features extracted by dual networks into an SVM to obtain the final category. The method involves multi-step operations and does not provide end-to-end training. Wang et al. [36] proposed a dual-stream architecture, in which one stream used the SKAL strategy to extract features of the regions of interest while the other stream extracted global features. These features were classified separately, and the results averaged to get the final score. Note that the fusion strategy in this method was coarse, since the local and global features contributed equally. In our work, the proposed architecture solves the problems of contribution allocation and adaptive decision-level fusion simultaneously.

The Proposed Method
In this section, we first illustrate the overall framework of the proposed method and then introduce the importance factor generator and the adaptive decision-level fusion method in detail. Finally, the stochastic decision-level fusion training strategy is presented.

Framework
Most of the existing CNN models have a typical architecture that can be divided into the feature extraction backbone and the classification head. The feature extraction backbone involves a hierarchical structure that tends to extract the semantics of different levels through subsampling or pooling in different depths. We could regard the layers between adjacent subsampling or pooling layers as a stage. To be explicit, the width and the height of the feature maps in all layers of a stage are consistent. As is well known, the features in shallow stages have low-level semantics while the deep stages provide high-level semantics.
The proposed framework aims at exploiting the discriminativity of the features in different stages. As shown in Figure 3, we group the stages of the CNN backbone into n + 1 stages (where n = 4 in the figure). In each of the last n stages, the sizes of the feature maps of all layers are consistent. The first stage contains all the remaining shallow layers, where the sizes of the feature maps may vary. The outputs of the last n stages are employed for classification, where each output produces a classification score. The output of the first stage is fed into the importance factor generator for generating the importance factors (IFs). These importance factors diversify the importance of each category in the classification scores as well as the importance of different stages. Specifically, the importance factor generator generates n factors, each having the same dimension as the number of classes of the task at hand. The n factors correspond to the last n stages. The values of the factors range from 0 to 1 and are hence viewed as weights for the classification scores. The n factors and the n classification score vectors are multiplied element-wise, followed by a class-wise summation of all the weighted score vectors. Finally, a fused classification score vector is obtained. Note that the importance factors produced by the importance factor generator are determined by the input image, or specifically its shallow features, which means that the decision-level fusion process is adjusted adaptively.

Figure 3. The framework of the proposed adaptive decision-level fusion method. From top to bottom, the blue blocks represent different stages of a deep CNN backbone; the light blue rectangles with round corners are the classifiers based on the features of different stages; the green block is the importance factor generator, which is used to generate the importance factor matrix composed of "IF1" to "IF4"; the grey blocks named "Lock 1" to "Lock 4" represent the stochastic decision-level fusion process.

Importance Factor Generator
The function of the importance factor generator is to process the feature maps produced by the first stage and generate an importance factor matrix to adaptively fuse the classification scores. For example, for an N-class classification problem, the importance factor generator produces an importance factor matrix of size N × n. As shown in Figure 4, the implementation passes the feature maps of the first stage through a convolutional layer with big kernels, where the kernel size is related to the size of the feature maps and the number of categories in the task, followed by a sigmoid activation layer and a dimension reshape layer. Mathematically, the importance factor matrix A can be written as:

A = reshape(σ(W_1 ∗ F_1 + b_1)),

where F_1 denotes the feature maps of the first stage, W_1 and b_1 are the weight and bias of the convolutional layer, respectively, ∗ denotes convolution, and σ is the sigmoid function.

Figure 4. The structure of the importance factor generator: the input feature maps pass through a convolution with big kernels, a sigmoid activation, and a reshape operation to produce the importance factor matrix.
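Because the "big kernel" described above spans the whole feature map, the convolution collapses to one linear map per output channel. The following minimal NumPy sketch illustrates the generator under this assumption; the function name, the shapes, and the conv-as-tensor-contraction simplification are illustrative choices of ours, not part of the paper.

```python
import numpy as np

def importance_factor_generator(f1, weight, bias, num_classes, num_stages):
    """Sketch of the importance factor generator (assumed shapes).

    f1:     feature maps of the first stage, shape (C, H, W)
    weight: big-kernel conv weights, shape (num_classes * num_stages, C, H, W)
    bias:   shape (num_classes * num_stages,)
    Returns the importance factor matrix A, shape (N, n), values in (0, 1).
    """
    # big-kernel convolution over the whole map == one tensor contraction
    z = np.tensordot(weight, f1, axes=([1, 2, 3], [0, 1, 2])) + bias
    a = 1.0 / (1.0 + np.exp(-z))            # sigmoid keeps each factor in (0, 1)
    return a.reshape(num_classes, num_stages)  # reshape to the N x n matrix
```

The sigmoid guarantees that every factor is a valid weight in (0, 1), matching the role of A as per-category, per-stage confidence weights.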

Adaptive Decision-Level Fusion
In the fusion process, the classifier of each stage is composed of a global pooling layer, a convolutional layer, and a softmax layer. The global pooling layer is implemented as average pooling, i.e., averaging all pixels along the width and height dimensions of the feature maps. The global pooling is robust to the spatial translation of the input image.
Concretely, the output feature maps of the ith stage (2 ≤ i ≤ n + 1), which have a size of C_i × W_i × H_i, are passed through the global average pooling layer, resulting in a tensor x̄_i with a size of C_i × 1 × 1. Denoting F_{jk,i} as the element in column j and row k of the feature maps of the ith stage, we have

x̄_i = (1 / (W_i H_i)) Σ_{j=1}^{W_i} Σ_{k=1}^{H_i} F_{jk,i}.

The tensor x̄_i is then processed by a 1 × 1 convolution, which transforms the channel dimension from C_i to the number of categories N, followed by the softmax layer to get the corresponding classification score x_i, i.e.,

x_i = softmax(W_i x̄_i + b_i),

where W_i and b_i are the weight and bias of the 1 × 1 convolutional layer for the ith stage.
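The per-stage classifier (global average pooling, 1 × 1 convolution, softmax) can be sketched in a few lines: a 1 × 1 convolution applied to a globally pooled tensor is simply a linear map over channels. Names and shapes below are illustrative assumptions.

```python
import numpy as np

def stage_classifier(feature_maps, weight, bias):
    """Sketch of one per-stage classifier.

    feature_maps: (C_i, W_i, H_i) output of the i-th stage
    weight:       (N, C_i) weights of the 1x1 convolution
    bias:         (N,) bias of the 1x1 convolution
    Returns the classification score x_i, shape (N,), summing to 1.
    """
    pooled = feature_maps.mean(axis=(1, 2))  # global average pooling -> (C_i,)
    logits = weight @ pooled + bias          # 1x1 conv == linear map over channels
    e = np.exp(logits - logits.max())        # numerically stable softmax
    return e / e.sum()
```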
Once the importance factor matrix and the classification scores are obtained, the decisions are fused as follows. As shown in Figure 3, we first multiply each classification score vector (i.e., x_i) with the corresponding importance factor element-wise, which produces a weighted classification score. Then, the weighted scores of all stages are summed up to complete the decision-level fusion. Finally, the softmax layer converts the fused score into the final classification result Y. The above fusion process can be expressed as

Y = softmax((A ⊙ X) v),

where X = [x_2, x_3, . . . , x_{n+1}], v is an n-dimensional vector of ones, and ⊙ denotes element-wise multiplication. This decision-level fusion method not only realizes the fusion of hierarchical information but also alleviates the problems caused by high inter-class similarity and high intra-class variance in remote sensing images, as discussed in the following.
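The fusion step described above amounts to a few lines of array arithmetic. A minimal NumPy sketch, with illustrative names:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def adaptive_decision_fusion(scores, factors):
    """Sketch of the adaptive decision-level fusion.

    scores:  X, shape (N, n) -- column i holds the classification score x_i
             produced by the i-th participating stage
    factors: A, shape (N, n) -- importance factor matrix from the generator
    Returns the fused classification result Y, shape (N,).
    """
    weighted = factors * scores                  # element-wise product A (.) X
    fused = weighted @ np.ones(scores.shape[1])  # class-wise sum over stages
    return softmax(fused)
```

Multiplying by the all-ones vector is just the class-wise summation over stages written as matrix algebra, so each class's fused score aggregates its stage-wise scores weighted by the generated factors.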
(1) Discussion on the reduction of intra-class variance. Intra-class variance refers to the difference between images belonging to the same category. In remote sensing image scene classification, the images generally exhibit high intra-class variance, in which case the deep model predicts diverse scores on the ground-truth class for different images of the same category, posing a great difficulty for classification. Let x_{ki} denote the score of the ith category in the kth stage, a_{ki} the corresponding importance factor, and y_i the adaptively fused classification score of the ith category, which can be expressed as

y_i = g(Σ_k a_{ki} x_{ki}),

where g represents the softmax operation that normalizes the scores.
The existing studies show that the discrimination ability of the feature representation in a single layer of a deep CNN model is insufficient for accurate classification, mainly because of the complex scene conditions (e.g., diverse shapes, illuminations, and textures). As shown in Figure 5a, the output score of ResNet-50 varies considerably for different images of the same class, in this case church. Instead, by taking the importance factors into consideration, the model tends to generate high factors for the important features while suppressing the useless features, which yields a more accurate result, as shown in Figure 5c.

(2) Discussion on the reduction of inter-class similarity. Inter-class similarity refers to the similarity between images belonging to different categories. In remote sensing image scene classification, the deep model could predict similar scores on a certain category for two images of different classes; generally, that category is the ground-truth label of one of the two images.
Let x_{is} and x_{it} denote the ith stage scores of the sth and tth categories, respectively. If these two classes have high inter-class similarity, the distance

d = |x_{is} − x_{it}|

will be small. Let y_s and y_t be the adaptively fused classification scores of the sth and tth categories, respectively. According to Equation (5), the distance between y_s and y_t can be written as

|y_s − y_t| = |g(Σ_i a_{is} x_{is}) − g(Σ_i a_{it} x_{it})|.

Similarly, a single layer of a deep model is insufficient to predict reliably for these classes. As shown in Figure 5b, the output scores of dense residential and medium residential are partially mixed, indicating that the boundary between these two classes is not clear enough. After the adaptive fusion, the importance factors help to produce high scores for the interested classes. As seen from Figure 5d, the score margin between the two classes becomes clearer than in the case without score fusion, which also indicates the validity of the proposed fusion strategy.

Stochastic Decision-Level Fusion Training Strategy
Dropout informs us that, during training, by ignoring a neuron with a certain probability, this neuron becomes independent from other neurons of the same layer; this operation reduces the co-adaptation between different neurons and enhances the representation ability of each neuron. This idea drives us to propose a stochastic decision-level fusion training strategy. As shown in Figure 6, in each training iteration, the classifiers stochastically participate in the adaptive decision-level fusion process with a certain probability, called the survival rate. Specifically, we set the same initial survival rate p_0 for all classifiers, which is fixed for the first T_f epochs of training. Then, the survival rate is increased gradually in a sinusoidal way as the training goes on, to improve the stability of the training process. We use p_t to denote the survival rate at the tth epoch, which can be expressed as

p_t = p_0, for t ≤ T_f;
p_t = p_0 + (1 − p_0) sin(π(t − T_f) / (2(T − T_f))), for T_f < t ≤ T,

where T is the total number of training epochs. In this way, each classifier participates in the final decision process with a certain probability. The optimization encourages different classifiers to be independent from each other, each tending to produce an accurate result. Hence, the stochastic training strategy improves the performance of the fused decision.
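The sinusoidal survival-rate schedule and the stochastic participation of classifiers can be sketched as follows. The exact functional form of the schedule and the keep-at-least-one safeguard are our assumptions, consistent with the description above (p_0 held for the first T_f epochs, then raised sinusoidally toward 1).

```python
import math
import random

def survival_rate(t, p0, t_f, t_total):
    """Survival rate at epoch t: fixed at p0 for the first t_f epochs,
    then increased sinusoidally toward 1 by the final epoch t_total."""
    if t <= t_f:
        return p0
    progress = (t - t_f) / (t_total - t_f)
    return p0 + (1.0 - p0) * math.sin(0.5 * math.pi * progress)

def sample_fusion_mask(num_classifiers, p, rng=random):
    """Each classifier joins the decision-level fusion with probability p
    (the 'Lock' blocks in Figure 3). At least one classifier is kept so
    that the fused score is always defined (an assumption of this sketch)."""
    mask = [rng.random() < p for _ in range(num_classifiers)]
    if not any(mask):
        mask[rng.randrange(num_classifiers)] = True
    return mask
```

During training, the fusion would sum only the score vectors whose mask entry is True; at test time all classifiers participate, analogous to Dropout.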

Experiments
In this section, we present the details of the experiments to validate the effectiveness of the proposed framework.

Datasets
We employ four public and popular datasets for remote sensing scene classification, which are introduced as follows.
• UC Merced Land-Use Dataset (UCM) [37]: This dataset is collected from the USGS National Map Urban Area Image series. The dataset contains 2100 scene images of 21 categories, including 100 images for each category. The image resolution is 256 × 256, and the spatial resolution of each pixel is 1 foot. In the experiment, we randomly choose 50% and 80% of the samples as the training set, and the rest is the test set.

Evaluation Metrics
To make a quantitative comparison, we adopt the widely used overall accuracy and the confusion matrix to evaluate the performance of all the competitors.

• The overall accuracy (OA) is obtained by dividing the number of correctly classified images by the number of all images in the test set, which demonstrates the overall classification performance of the model.
• The confusion matrix (CM) is an N × N matrix, where N is the number of categories. The value at the ith row and jth column indicates the proportion of samples in class i predicted as class j. The confusion matrix clearly shows which categories are difficult to distinguish from each other.
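The two metrics above can be computed in a few lines; the sketch below normalizes each confusion-matrix row to proportions, matching the definition given here.

```python
import numpy as np

def overall_accuracy(y_true, y_pred):
    """OA = number of correctly classified images / total test images."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

def confusion_matrix(y_true, y_pred, num_classes):
    """CM[i, j] = proportion of class-i samples predicted as class j,
    so each row of a class present in the test set sums to 1."""
    cm = np.zeros((num_classes, num_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    row_sums = cm.sum(axis=1, keepdims=True)
    return cm / np.maximum(row_sums, 1)  # avoid division by zero for absent classes
```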

Experimental Settings
All experiments are carried out using the open-source deep-learning library PyTorch. Unless noted otherwise, we adopt ResNet-50 [40] as the backbone to examine the performance of the proposed framework. The weights of ResNet-50 pretrained on ImageNet are employed as the initialization of the model parameters, which are then fine-tuned on the corresponding dataset. For the RSSCN 7 and AID datasets, we resize the images to 256 × 256. Since the input size of the model is 224 × 224, all 256 × 256 images are randomly cropped to 224 × 224. The batch size is set to 32 and the number of training epochs is set to 200. The SGD optimizer is employed, with momentum 0.9 and regularization coefficient 0.0005. The initial learning rate is 0.001, which is then adjusted according to the cosine annealing strategy. In addition, we use data augmentation, such as random horizontal flip, vertical flip, and 90° rotation, to enhance the generalization performance of the model. All the implementations are conducted on the Ubuntu 16.04 operating system equipped with a 3.3 GHz Core i9-7900X CPU and two NVIDIA 2080 Ti GPUs.
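The training setup described above can be sketched in PyTorch as follows. This is a minimal configuration fragment under stated assumptions, not the authors' released code; in particular, the rotation augmentation is approximated with standard torchvision transforms.

```python
import torch
import torchvision.transforms as T

# Augmentation as described: random 224x224 crop of the 256x256 input plus
# horizontal/vertical flips (the 90-degree rotation would need an extra
# transform; its exact implementation here is an assumption).
train_transform = T.Compose([
    T.RandomCrop(224),
    T.RandomHorizontalFlip(),
    T.RandomVerticalFlip(),
    T.ToTensor(),
])

def build_optimizer(model, epochs=200):
    """SGD with momentum 0.9, weight decay 0.0005, initial lr 0.001,
    and cosine annealing of the learning rate over all epochs."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                                momentum=0.9, weight_decay=0.0005)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler
```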

Ablation Study
(1) Investigation on model size and efficiency. The proposed importance factor generator involves very limited parameters but brings a noticeable improvement in performance. To validate this, we first examine how the model size and FLOPs change when the backbone is equipped with the decision-level fusion plugin. The model size is measured by the number of parameters. Considering that the execution time of the model differs across devices, FLOPs rather than latency is used to represent the model efficiency. The results are shown in Tables 1 and 2. It is seen that the improved model has a very limited increase in the number of parameters and FLOPs compared with the original ResNet-50. The differences across the datasets are caused by the different numbers of categories.

(2) Investigation on the importance factor generator. The proposed importance factor generator is the key to improving the performance of the deep models. To validate the effectiveness of this plugin, we employ multiple backbones, including VGG-16, MobileNet V2, ResNet-50, and ResNet-152. The RSSCN 7 dataset is used with a training ratio of 50%. The survival rate is set to 0.8 for VGG-16 and 0.9 for MobileNet V2 and ResNet-152. As seen from Table 3, compared with the original models, the accuracies are improved by 1.75%, 1.72%, 2.86%, and 3.73%, respectively. Moreover, the stochastic decision-level fusion training strategy further increases the accuracies by 0.38%, 0.98%, 0.34%, and 0.67%, respectively, which demonstrates its effectiveness. MobileNet V2 is a popular lightweight model, in contrast to VGG-16, which has about 138 million parameters. Hence, this indicates that our method works well with models of different sizes.
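The parameter and FLOP accounting used in investigation (1) above can be sketched with the standard formulas for a convolutional layer. FLOP-counting conventions vary; the sketch below counts one multiply-accumulate as two FLOPs, which is one common assumption.

```python
def conv2d_params(c_in, c_out, k, bias=True):
    """Parameter count of a 2D convolution with a k x k kernel:
    c_out filters, each with c_in * k * k weights (plus one bias term)."""
    return c_out * (c_in * k * k + (1 if bias else 0))

def conv2d_flops(c_in, c_out, k, h_out, w_out):
    """FLOPs of the same convolution over an h_out x w_out output map,
    counting one multiply-accumulate as two FLOPs (a common convention)."""
    return 2 * c_in * c_out * k * k * h_out * w_out
```

For instance, the 1 × 1 convolutions of the per-stage classifiers add only c_out × (c_in + 1) parameters each, which is why the plugin's overhead grows with the number of categories but stays small overall.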

(3) Investigation on survival rate
To examine how the proposed stochastic training strategy performs, we conduct experiments with different survival rates and compare the corresponding classification performance. The AID dataset is employed in the case of 50% training samples. The candidate survival rate ranges from 0.4 to 1, and the same value is set for each classifier. To reduce the deviation, the experiment in each case is conducted ten times, and the averaged performance is used for comparison. Figure 7 illustrates the error rate under different survival rates, in which the yellow star (i.e., a survival rate of 1.0) corresponds to the performance of the model trained without the stochastic decision-level fusion training strategy. It is clearly seen that the stochastic training strategy improves the classification accuracy. According to the rationale of Dropout, the stochastic training strategy reduces the co-adaptation effect between different classifiers and hence improves the performance. We also note that when the survival rate is reduced below 0.9, the accuracy gradually decreases. This is because when the survival rate is too small, the error signal is difficult to backpropagate, which hampers the optimization of the model parameters. From the viewpoint of optimization, a good optimization method maintains a balance between global exploration and local exploitation. The purpose of global exploration is to cover as large an area as possible in the solution space, while local exploitation examines the fine structures of the known areas to obtain a better solution. In our method, by randomly ignoring classification results during fusion, the proposed strategy diversifies the solution space, which corresponds to global exploration. When the global structure of the solution space has been explored sufficiently, the fine-tuning process corresponds to exploiting the fine structures.
Therefore, the stochastic decision-level fusion training strategy can be regarded as a trade-off between global exploration and local exploitation.
In subsequent experiments, we set different survival rates for different datasets and different training-test split ratios. Generally, the datasets with a large number of samples need global exploration, whereas the datasets with limited samples require local exploitation. The survival rates for different cases are listed in Table 4.

Comparison with State-of-the-Art Methods
We compare the proposed method with the state-of-the-art methods on the four datasets. To reduce the deviation, the experiment in each case is conducted ten times, and the averaged result is used for comparison.

• Experimental results for the UC Merced Land-Use dataset

The comparison with the state-of-the-art methods for the UC Merced Land-Use dataset is shown in Table 5. The classification accuracies of the proposed method in the cases of 50% and 80% training samples are 98.65% and 99.71%, respectively, which surpass most of the competitors. ADFF [43] and ResNet_LGFFE [44] are based on feature-level fusion, and their backbones are both ResNet-50. In the case of 50% training samples, the accuracy of our method is 1.43% higher than that of ADFF; in the case of 80% training samples, the accuracy of our method is 0.9% and 1.09% higher than those of ADFF and ResNet_LGFFE, respectively. This shows the superiority of the adaptive decision-level fusion in our method. With the stochastic decision-level fusion training strategy, the accuracy under 50% training samples is improved by 0.25%, while the accuracy under 80% training samples is slightly reduced. The main reason is that the performance under 80% training samples is nearly saturated and therefore difficult to improve further. The confusion matrix in Figure 8 shows the detailed classification result of each category when 50% of the samples are used for training. As shown, the accuracies of all categories are above 92%, some even close to 100%. As mentioned above, dense residential and medium residential have high inter-class similarity, which poses a great challenge. Our method achieves 92% and 100% accuracies for these two scenes, which validates the effectiveness of the adaptive decision-level fusion in reducing the inter-class similarity.

•
Experimental results for the RSSCN 7 dataset Table 6 shows the comparison on the RSSCN 7 dataset. Under 50% training samples, the classification accuracy of our method reaches 95.15%, surpassing all the competitors including the basic ResNet-50. ADFF [43], which also uses ResNet-50 as the backbone but adopts an SVM for classification, achieves the second-highest accuracy in the table. With end-to-end learning, the accuracy of our method is even higher. The stochastic decision-level fusion training strategy further improves the performance by 0.34% and also makes it more stable. Figure 9 shows the confusion matrix on the RSSCN 7 dataset. The grass class and the field class have high inter-class similarity and hence low classification accuracies. The accuracies of ResNet-50 on the grass class and the field class are 85% and 88%, which our method surpasses by 9% and 6%, respectively [42]. This comparison validates that the proposed method is effective in reducing the inter-class similarity and the intra-class variance.

•
Experimental results for the AID dataset The comparison on the AID dataset is listed in Table 7. There is a clear gap between the proposed method and the other competitors. Our accuracy reaches 94.69% and 96.67% under 20% and 50% training samples, respectively, which is 8.21% and 7.45% higher than ResNet-50. Moreover, the stochastic decision-level fusion training strategy further improves the results by 0.36% and 0.37%. Figure 10 shows the detailed accuracy of each category under 50% training samples. Except for the school class and the resort class, the accuracies of all categories are higher than 90%. The resort class is highly similar to other classes such as square, so its classification accuracy is low.
Compared with ResNet_LGFFE [44], which also adopts ResNet-50 as the backbone, the accuracies on the school class and the resort class are improved by 56% and 28%, respectively. In addition, the accuracies of our method on the dense residential and medium residential classes are 97% and 98%, respectively, which shows that our method greatly reduces the similarity between these two classes.

•
Experimental results for the NWPU-RESISC 45 dataset Table 8 shows the comparison on the NWPU-RESISC 45 dataset, which is currently the largest dataset for remote sensing image scene classification. The proportion of training samples is usually low, which demands high generalization ability from the model. As seen, the proposed method surpasses most of the other methods. Our accuracy is similar to that of RIR [49] under 10% training samples and is better than RIR under 20% training samples. Similar gains from the stochastic decision-level fusion training strategy are again observed. Figure 11 shows the confusion matrix on the NWPU-RESISC 45 dataset. Our method performs well in all categories except those with great intra-class variance and inter-class similarity, such as church, palace, and commercial area.
However, compared with DCNN [21], which focuses on inter-class similarity and intra-class variance, the accuracies of our method on the commercial area and palace classes are still 6% and 3% higher, respectively, which further verifies the superiority of our method in reducing intra-class variance and inter-class similarity.

Analysis about the Fusion Strategy
Through extensive experiments, the proposed adaptive decision-level fusion-based method has shown superior performance. Here, we examine how the scores vary before and after the fusion process. Specifically, we randomly select four examples from the UC Merced Land-Use dataset, belonging to #4 (buildings), #19 (storage tanks), #11 (intersection), and #6 (dense residential), respectively. The output scores of the different stages are shown in Figure 12.
As seen, the output scores vary significantly across stages. For example, in Figure 12a, the predicted classes of the four stages are #4 (buildings), #1 (airplane), #11 (intersection), and #6 (dense residential), respectively, while the fused probability yields the correct prediction. Since different stages of a deep CNN may predict different classes, average fusion or max fusion is also applicable but not necessarily effective; for example, taking the max score in Figure 12b misclassifies the image as #11 (intersection). Instead, the proposed importance factor generator predicts the weights adaptively according to the image itself, encouraging proper supervision during model training. In other words, our method can be viewed as a generalization of the conventional decision-level fusion strategies; it possesses high adaptivity to the image content and hence improves the fault tolerance of the model.
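The class-wise weighted summation behind this analysis can be sketched in a few lines. The function name, the normalization of the factors, and the example numbers below are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def fuse_scores(stage_scores, importance_factors):
    """Class-wise weighted sum of per-stage classification scores.

    stage_scores: (num_stages, num_classes) softmax outputs.
    importance_factors: (num_stages,) per-image weights from the generator.
    """
    w = np.asarray(importance_factors, dtype=float)
    w = w / w.sum()  # normalize so the factors act as convex weights
    return w @ np.asarray(stage_scores, dtype=float)

# Toy example: three stages disagree, and the (hypothetical) factor
# generator trusts the deepest stage the most.
scores = np.array([[0.6, 0.3, 0.1],
                   [0.2, 0.5, 0.3],
                   [0.1, 0.2, 0.7]])
fused = fuse_scores(scores, [0.1, 0.2, 0.7])
print(int(fused.argmax()))  # → 2
```

Note that uniform factors recover plain average fusion, which makes concrete the claim above that the adaptive scheme generalizes the conventional decision-level fusion strategies.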

Discussion
In this section, we present a detailed discussion on how the proposed information fusion strategy alleviates the issues caused by the loss of small objects, high inter-class similarity, and high intra-class variance. Specifically, the discussion starts from the comparison of different layers between the original ResNet-50 and the proposed method. All test examples are selected from the NWPU-RESISC 45 dataset.

Avoiding the Loss of Small Objects
In remote sensing datasets, objects occupying small areas are often discriminative for classifying the images. As shown in Figure 13, three remote sensing images with small objects, labelled as airplane, basketball court, and island, respectively, are used to validate that the proposed fusion strategy can avoid the loss of small objects during feature extraction and hence improve the performance.
The images are misclassified as #27 (palace), #22 (meadow), and #9 (cloud), respectively, by ResNet-50, which only pays attention to the background and ignores the small objects in deep CNN layers. By contrast, our method fuses the information from the shallow layers to the deep layers adaptively, which preserves the features of small objects in the final feature representation, thus correcting the classification results.

Reducing the Inter-Class Similarity and Intra-Class Variance
The issues of high inter-class similarity and high intra-class variance are challenging to remote sensing scene image classification. In Figure 14, two images labelled as medium residential and dense residential are tested to show the proposed strategy can reduce the inter-class similarity. In Figure 15, the reduction of intra-class variance is validated.
The images from top to bottom in Figure 14 are misclassified as #11 (dense residential) and #23 (medium residential) by ResNet-50, which struggles to discriminate samples between the two categories because of the high inter-class similarity. By contrast, our method produces a very high score for the correct category in Figure 14b, demonstrating that it learns a better decision boundary between the two categories than ResNet-50. For the image in Figure 14a, which exhibits a challenging case, our method still yields a correct prediction. The images from top to bottom in Figure 15 are misclassified as #10 (commercial area) and #27 (palace) by ResNet-50 due to the high intra-class variance of the church category. Instead, in Figure 15b, our method learns the universal features of the church category and thus produces a very high score for the correct category. In Figure 15a, our method obtains a correct result on the challenging sample, where the top two scores are very close to each other.

Conclusions
In this paper, we propose an adaptive decision-level fusion framework to improve the performance of the existing deep CNN models in remote sensing image scene classification. The architecture consists of a backbone network, a pluginable importance factor generator, multiple classifiers, and a decision fusion module. Each sub-classifier predicts the classification scores based on the features of different stages. The importance factor generator is used to adaptively assign an importance factor to each classification score in each stage. The factors and the scores are then fused to produce the final classification result. This framework not only achieves information fusion across different scales but also reduces the inter-class similarity and the intra-class variance which are obstacles in remote sensing scene classification. In addition, we further propose the stochastic decision-level fusion training strategy, which selectively enables the classification scores of each stage to participate in the decision-level fusion process during training. Experiments on four popular remote sensing image datasets validate the superiority of our adaptive decision-level fusion architecture and the stochastic decision-level fusion training strategy.

Data Availability Statement:
The data used to support the findings of this study are available from the corresponding author upon request.