A Multi-Branch Feature Fusion Strategy Based on an Attention Mechanism for Remote Sensing Image Scene Classification

In recent years, with the rapid development of computer vision, increasing attention has been paid to remote sensing image scene classification. To improve classification performance, many studies have increased the depth and width of convolutional neural networks (CNNs) to extract deeper features, thereby increasing the complexity of the model. To solve this problem, in this paper, we propose a lightweight convolutional neural network based on attention-oriented multi-branch feature fusion (AMB-CNN) for remote sensing image scene classification. Firstly, we propose two convolution combination modules for feature extraction, through which the deep features of images can be fully extracted by the cooperation of multiple convolutions. Then, the feature weights are calculated, and the extracted deep features are sent to the attention mechanism for further feature extraction. Next, all of the extracted features are fused by multiple branches. Finally, depthwise separable convolution and asymmetric convolution are used to greatly reduce the number of parameters. The experimental results show that, compared with some state-of-the-art methods, the proposed method retains a clear advantage in classification accuracy with very few parameters.


Introduction
Remote sensing image scene classification refers to the use of aerial scanning, microwave radar, and other methods to image the target scene and then extract useful information from different scene images, thus enabling an analysis and evaluation of the scene image. Relevant research on remote sensing scene classification has been widely used in national defense security [1], analyses of crop growth [2], and environmental management [3]. Because of the large content differences in the same scene, the similar content in different scenes, the inconsistent spatial scales of landforms, and the different shapes and sizes of images, remote sensing image scene classification has become a very challenging task. Therefore, in recent years, some researchers have focused on the effective scene classification of remote sensing images.
Deep learning, a very effective technology in the field of computer vision, was considered one of the ten technological breakthroughs in 2013. With the development of imaging technology and hardware equipment, deep learning has become widely used in remote sensing image scene classification and has natural advantages. Convolutional neural networks (CNNs) can extract rich feature details from images and are used by most researchers [4][5][6]. However, researchers are increasingly expanding the depths [...] and it still reaches up to 60 MB. Liu et al. proposed a weighted spatial pyramid matching collaborative-representation-based classification method, and further improved AlexNet, which is used as the basic network for feature extraction. W. Zhang et al. [23] proposed a capsule network structure based on InceptionNet, which focuses on the connection of spatial information while building a lightweight network, and reduces the information visibility loss. Although these methods can provide good classification accuracy, the number of parameters is still large, so the efficiency of the model is limited. Therefore, how to reduce the complexity of the model as much as possible while ensuring the classification accuracy is a problem that needs to be further studied. At the same time, the attention-based strategy has become a favorable method to improve the accuracy of classification.
In the field of artificial intelligence today, attention mechanisms are increasingly expected to focus on the details and locations of useful information, search for the important characteristics of the target, and filter out irrelevant information so as to improve the confidence of prediction. Attention mechanisms have emerged from research on human vision and perform well in many tasks, such as target detection [24], sentence generation [25], and speech recognition [26]. Attention mechanisms are mainly divided into three categories: soft attention, hard attention, and self-attention. The soft attention mechanism model is differentiable. By extracting the correlation weights between different layers, it focuses on the correlation between the input features and target features. The hard attention mechanism model is non-differentiable. Reinforcement learning is usually used to explore the correlation between the input mechanism and the target extracted by the convolutional neural network, which is more difficult to train than the soft attention mechanism. The self-attention mechanism mainly reflects the mutual attention of the input information, and processes the feature information extracted by different layers in parallel. In 2018, X. He et al. [27] proposed a sequence-to-sequence model that generates a question-and-answer dialogue language model by applying an encoder-decoder structure to multilingual translation. In addition, in 2017, a self-attention mechanism was proposed in [28], which alleviated the shortcomings of traditional attention mechanisms, which depend on external information, and makes greater use of the internal relationship of data. In 2019, Wang et al. [29] proposed a cyclic attention model. Since then, many researchers have tried to apply various attention mechanisms to remote sensing image scene classification according to the characteristics of dense connection between the layers of DenseNet and ResNet. W. Tong et al. [30] improved the DenseNet network and integrated it into a channel attention model. Then, the attention mechanism was used to enhance the weight of important feature information. D. Yu et al. [31] proposed a feature fusion framework based on hierarchical attention and bilinear pool (HABFNet) with ResNet50 as the basic network. The extracted information was enhanced and linearly fused. Through a large number of experiments, H. Alhichri et al. [32] found the location for integrating the attention mechanism into the specific convolutional layer in the model, and proposed a deep attention convolutional neural network (CNN), which makes the channel automatically adjust the weight of the learned feature information when training the CNN end-to-end by back propagation. However, not all attention mechanisms are universal, and attention strategies should be designed according to the characteristics of the networks. It remains a challenging task to develop an effective attention mechanism and to apply it to remote sensing scene classification.
For remote sensing image scene classification, a lightweight attention-based multi-branch feature fusion CNN (AMB-CNN) is proposed. On the premise of enlarging the field of perception, the proposed model utilizes an alternating combination of different convolutions for extracting deep features. The extracted effective information is sent to the attention module to obtain new features, which are fused with the features from the previous branches. The proposed AMB-CNN model can provide good performance in remote sensing image scene classification with lower complexity.
The main contributions of this paper are as follows:
(1) Two convolution combination modules for feature extraction are proposed. These modules use multi-convolution cooperation outside the module and alternating multi-convolution inside the module to enable the model to more fully determine the key information of the image and to discriminate the scene exactly.
(2) A strategy for fusing multi-branch features is explored. After extracting the feature information from multiple branches, the attention mechanism is utilized to extract the branch information again. Ultimately, the multi-segment features are fused.
(3) To solve the rapid growth of the network model parameters in recent years, a lightweight model with fewer parameters is constructed. Different convolutions are utilized to reduce the number of parameters of the network model. Meanwhile, the hardswish activation function is adopted to enhance the nonlinear representation ability of the model.
The rest of this paper is organized as follows. The proposed attention-based multi-branch feature fusion convolutional neural network (AMB-CNN) method is described in detail in Section 2. In Section 3, the experiments and analysis are carried out, and a comparison with state-of-the-art methods is used to show the effectiveness of the proposed method. Section 4 gives the discussion. The last section presents the conclusions.

The Structure of the Proposed Method
The AMB-CNN method consists of eight groups, as shown in Figure 1. The legend of Figure 1 marks the compression process, the excitation process, the channel feature after dimensionality reduction, the height and the channel of the feature map, and the final output features of the squeeze and excitation modules. The first three groups are utilized to extract the shallow features of the remote sensing images. This method combines the squeeze and excitation (SE) modules, enhances the relationship between feature channels, expands the global perception field, and reduces information loss during subsequent deep feature extraction. Starting with the fourth group, a multilinear fusion strategy based on spatial and channel attention is used to extract more useful information, as described in Section 2.2. Finally, in the eighth group, asymmetric convolution is added to further reduce the number of parameters. The process of the AMB-CNN model is described in Algorithm 1.
In the main part related to extracting deep features (groups 4 to 7), each group can be regarded as being composed of two modules whose inputs are obtained from the upper end. These two modules are an alternating combination of ordinary and depthwise separable convolution, and a combination of ordinary convolution and max-pooling layers, respectively. When the two modules are fused directly, although the extracted feature information is better than that of a single branch, the overall improvement is still not ideal. Therefore, the features extracted by one of the modules are input into the convolutional block attention module (CBAM). Then, the features are further extracted and multi-branch fusion is performed to obtain more key features, as described in the next section.
To build a lightweight network, this paper utilizes a combination of different convolutions to alleviate the problems of a large number of model parameters and slow training speed, and abandons the traditional method of directly and linearly stacking multiple large convolutions. Asymmetric convolution is applied in the eighth group of the model. Compared with traditional convolution, the number of parameters is greatly reduced. A detailed introduction is provided in Section 2.3. As remote sensing scene images often have rich details and many landforms have a high similarity, the model has been adjusted slightly. A more stable hard-swish is used instead of ReLU as the activation function, which improves the non-linear representation ability of the model. Finally, the convergence speed of the model is accelerated via batch normalization (BN) layer processing. In addition, to prevent overfitting during training, an L2 regularization penalty is utilized. The details are described in Section 2.4.

Feature Extraction and Attention Module
The second and third groups of the model are designed to extract the shallow features of the image. Here, we add the SE module, which takes the upper convolution block as the input, uses the average pooling layer to compress each channel, and employs the dense layer to increase the nonlinearity so as to reduce the complexity of the output channel. Next, a dense layer is used to give the channel a smooth gating function. Finally, each feature map is weighted to expand the field of perception, reduce the loss of feature information, and provide more detailed information for the feature extraction from the fourth group. The SE module includes the following two parts: squeeze and excitation. A detailed explanation is provided in the following.
$V = [V_1, V_2, \dots, V_C]$ represents the set of learned filter kernels, and $V_i$ represents the parameters of the $i$-th filter. The output can be shown as follows:
$$u_i = V_i * X,$$
where $*$ represents the operation of convolution, $X$ is the input, $U = [u_1, u_2, \dots, u_C]$ is a feature map with size $W \times H \times C$, $u_i$ represents the information of the $i$-th output channel, and $V_i$ represents the convolution kernel used in channel $i$ of the input.
The channel correlation is reflected in the spatial correlation of the image, and now the two are combined. When extracting the channel information, global average pooling is used to compress each channel into a single value, where channel $i$ is
$$z_i = \frac{1}{H \times W} \sum_{m=1}^{H} \sum_{n=1}^{W} u_i(m, n),$$
where $H$ is the height of the feature map and $W$ is the width of the feature map. To obtain the channel correlation, a gate function is employed, and a sigmoid is used as the activation function, as follows:
$$l = \sigma\big(g(z, K)\big),$$
where $\sigma$ is the activation function, $l$ is the output of the activation function, $z$ is the channel, and $K$ represents the weights of the fully connected layers used for the dimension reduction and dimension elevation. The gate mechanism is parameterized by forming a bottleneck around the nonlinearity of two fully connected (FC) layers; that is, the dimension-reducing layer parameter is $W_1$, and the dimension-reducing scale is $r$. By adjusting the output $U$ with the activation function, the final output of the attention module is
$$\tilde{X}_i = F_{scale}(u_i, l_i) = l_i \cdot u_i,$$
where $F_{scale}(u_i, l_i)$ refers to the corresponding channel product between the feature map $u_i \in \mathbb{R}^{W \times H}$ and the original eigenvalue $l_i$.
For the main part of the image feature extraction (groups 4 to 7), two modules are proposed in this paper. The first module involves the alternate use of 2D convolution and depthwise separable convolution. In the second module, the 2D convolution and the max-pooling layer are cascaded, as shown in Figure 2. On this basis, the channel and spatial attention mechanisms are added to the beginning layer (the fourth layer) and the end layer (the seventh layer) for deep feature extraction. It should be noted that the beginning layer refers to the beginning stage of extracting the deep features (groups 4 to 7 in the model), rather than the beginning of the whole model. The output feature map $F$ of the second module is transmitted to the convolutional block attention module (CBAM).
Given the input feature map $F$, CBAM sequentially applies a one-dimensional channel attention map $M_c$ and a two-dimensional spatial attention map $M_s$ as
$$F_2 = M_c(F) \otimes F, \qquad F_3 = M_s(F_2) \otimes F_2,$$
where $\otimes$ represents the multiplication of elements, $F_2$ represents the output feature map of channel attention, and $F_3$ represents the final output feature map of the CBAM attention module. In this way, the key information and location of the feature map can be further obtained when the receptive field of the shallow features is enlarged. Furthermore, the ability of the feature extraction is enhanced.
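The squeeze, excitation, and scaling steps of the SE module described in this section can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the toy weights, and the use of ReLU inside the bottleneck are assumptions made only for illustration, and a feature map is represented as a list of channels, each an H x W nested list.

```python
import math


def squeeze(feature_maps):
    """Global average pooling: one scalar per channel."""
    return [sum(sum(row) for row in u) / (len(u) * len(u[0])) for u in feature_maps]


def dense(vector, weights, biases):
    """Fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, biases)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def excite(z, w_reduce, b_reduce, w_expand, b_expand):
    """Bottleneck of two FC layers followed by a sigmoid gate."""
    hidden = [max(0.0, h) for h in dense(z, w_reduce, b_reduce)]  # ReLU is an assumption
    return [sigmoid(g) for g in dense(hidden, w_expand, b_expand)]


def scale(feature_maps, gates):
    """Channel-wise product between each feature map and its gate value."""
    return [[[gate * v for v in row] for row in u]
            for u, gate in zip(feature_maps, gates)]


def se_block(feature_maps, w_reduce, b_reduce, w_expand, b_expand):
    z = squeeze(feature_maps)
    gates = excite(z, w_reduce, b_reduce, w_expand, b_expand)
    return scale(feature_maps, gates), gates


# Toy input: C = 2 channels of size 2 x 2, reduction scale r = 2 (one hidden unit).
u = [[[1.0, 3.0], [5.0, 7.0]],
     [[0.0, 0.0], [0.0, 0.0]]]
out, gates = se_block(u,
                      w_reduce=[[0.5, 0.5]], b_reduce=[0.0],
                      w_expand=[[1.0], [-1.0]], b_expand=[0.0, 0.0])
```

In this toy case the first channel is squeezed to 4 and the second to 0, so the gate of the first channel is larger and that channel is emphasized in the output.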

Some Strategies of Building Lightweight Model
In the feature extraction, with the increase in the number of layers, the number of parameter calculations also increases. Taking a 3 × 3 convolution as an example, the parameter cost is massive after multi-channel convolution. Therefore, in the proposed model, a hybrid method of depthwise separable convolution combined with traditional 2D convolution is adopted, and the BN layer and nonlinear activation function are used after convolution to accelerate the convergence speed. The complexity and parameters of the depthwise separable convolution and the traditional 2D convolution are also compared.
Suppose the size of the input feature map is $T_f \times T_f \times P$, and the size of the convolution kernel is $T_k \times T_k \times P$. Then, with $Q$ output channels, the number of parameters of ordinary convolution is $T_k \times T_k \times P \times Q$. Depthwise separable convolution can be regarded as the sum of the pointwise convolution and the depthwise convolution, where the number of parameters of the pointwise convolution is $(1 \times 1 \times P) \times Q$, and that of the depthwise convolution is $(T_k \times T_k \times 1) \times P$. The ratio of depthwise separable convolution to ordinary convolution can be represented as
$$\frac{T_k \times T_k \times P + P \times Q}{T_k \times T_k \times P \times Q} = \frac{1}{Q} + \frac{1}{T_k^2}$$
after simplification. When $Q$ is large, the number of parameters is therefore reduced to about 1/9 of the original when using a 3 × 3 convolution kernel, and to about 1/25 when using a 5 × 5 convolution kernel.
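The parameter comparison above can be checked numerically. The following sketch counts weights only (biases are ignored); the function names and the channel numbers P = Q = 256 are illustrative assumptions rather than values taken from the proposed model.

```python
def standard_conv_params(kernel, in_ch, out_ch):
    """Weights of an ordinary k x k convolution with P input and Q output channels."""
    return kernel * kernel * in_ch * out_ch


def depthwise_separable_params(kernel, in_ch, out_ch):
    """Weights of a k x k depthwise convolution plus a 1 x 1 pointwise convolution."""
    depthwise = kernel * kernel * in_ch
    pointwise = in_ch * out_ch
    return depthwise + pointwise


def parameter_ratio(kernel, in_ch, out_ch):
    """Depthwise separable parameters divided by ordinary convolution parameters."""
    return (depthwise_separable_params(kernel, in_ch, out_ch)
            / standard_conv_params(kernel, in_ch, out_ch))


for k in (3, 5):
    ratio = parameter_ratio(k, in_ch=256, out_ch=256)
    print(f"{k}x{k} kernel: ratio = {ratio:.4f}, reduction = {1 / ratio:.1f}x")
```

The ratio equals 1/Q + 1/k² exactly, so the reduction approaches 9 (or 25) only as the number of output channels grows.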
In order to measure the computational complexity, the convolution stride is set to 1. As zero padding keeps the spatial size unchanged, the output feature map of ordinary convolution is
$$G_{k,l,n} = \sum_{i,j,m} K_{i,j,m,n} \cdot F_{k+i-1,\,l+j-1,\,m},$$
where $F$ is the input feature map, $K$ is the convolution kernel, and $G$ is the output feature map. The computational complexity is
$$T_k \times T_k \times P \times Q \times T_f \times T_f,$$
where $T_k$ is the kernel size, $P$ and $Q$ are the numbers of input and output channels, and $T_f$ is the spatial size of the feature map. Here, the complexity is related to the dimensions of the input and output channels, the input feature map, and the convolution kernel. The depthwise separable convolution used here does not depend on the size relationship between the convolution kernel size and the input feature map. In this paper, each channel is convolved separately, step by step. In this formulation, $R$ represents the channel of the output feature map, $V$ is the convolution kernel, $H$ is the height of the feature map, $W$ is the width of the feature map, $k$ is the number of convolutional kernels, $I$ is the length of the convolutional kernel, $C$ is the number of channels, and $n$ is the stride of the convolutional kernel.
In addition, in the eighth group of the proposed model, several asymmetric convolution fusion strategies are used to extract deeper features. Inspired by Inception-v3, which replaces a large convolution with a fusion of multiple small convolutions, we found that the cascade of a 1 × 3 convolution and a 3 × 1 convolution can reduce the convolution computation by about 33% compared with using a 3 × 3 convolution directly. In this way, the computational complexity of the model can be reduced effectively without affecting the performance of the network model.
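The saving of about 33% quoted for the asymmetric cascade follows from counting multiply-accumulate operations. The sketch below assumes stride 1, zero padding, and the same number of channels before, between, and after the two asymmetric convolutions; these assumptions and the function names are illustrative only.

```python
def conv_cost(kernel_h, kernel_w, in_ch, out_ch, out_h, out_w):
    """Multiply-accumulate count of a stride-1, zero-padded convolution."""
    return kernel_h * kernel_w * in_ch * out_ch * out_h * out_w


def asymmetric_saving(kernel, channels, size):
    """Fraction of computation saved by a 1 x k then k x 1 cascade versus one k x k."""
    full = conv_cost(kernel, kernel, channels, channels, size, size)
    cascade = (conv_cost(1, kernel, channels, channels, size, size)
               + conv_cost(kernel, 1, channels, channels, size, size))
    return 1.0 - cascade / full


print(f"saving for k = 3: {asymmetric_saving(3, channels=64, size=32):.1%}")
```

Under these assumptions the saving is 1 - 2/k, independent of the channel count and feature map size, which gives one third for k = 3.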

The Strategy of Nonlinear Feature Enhancement
The activation function plays an important role in training convolutional neural network models. The traditional ReLU function is as follows:
$$f(x) = \max(0, x).$$
Although this function converges faster than the Sigmoid function, ReLU is fragile during training. If parameters such as the learning rate are set inappropriately, neurons may die, after which their parameters will never be updated. The Sigmoid activation function can be represented as
$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$
Its derivative is
$$\sigma'(x) = \sigma(x)\big(1 - \sigma(x)\big).$$
In back propagation, when the gradient is close to 0, the weights are not updated, so the gradient can easily vanish. The hard-swish activation function is adopted in the proposed model and is characterized by smooth nonlinearity with no upper bound but with a lower bound. Although the cost of this function on embedded devices is non-zero, the convolutional and fully connected layers generally dominate the computation, accounting for more than 95% of the FLOPs, so the impact of the small cost of hard-swish is negligible. After the convolution of each layer, the BN layer and activation function are added, which not only accelerates the training of the model, but also enables the neurons to adapt more fully to complex, non-linear tasks. At the same time, the data offset can be eliminated to some extent, which can be seen in Algorithm 1.
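The three activation functions discussed above can be compared with a short sketch. The text does not spell out the hard-swish formula, so the definition below, x · ReLU6(x + 3) / 6, is assumed from its usual form in MobileNetV3.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relu6(x):
    """ReLU clipped to the range [0, 6]."""
    return min(max(x, 0.0), 6.0)


def hard_swish(x):
    """Piecewise approximation of swish (assumed MobileNetV3 form)."""
    return x * relu6(x + 3.0) / 6.0


for x in (-4.0, -1.5, 0.0, 1.0, 3.0, 10.0):
    print(f"x = {x:5.1f}  relu = {relu(x):6.3f}  "
          f"sigmoid = {sigmoid(x):6.3f}  hard_swish = {hard_swish(x):6.3f}")
```

With this definition, hard-swish is zero for x ≤ -3, equals x for x ≥ 3, and reaches its minimum of -0.375 at x = -1.5, so it is unbounded above and bounded below.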
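During the model training stage, a type of weight attenuation, called L2 regularization, is added to make the representative data distribution stand out. For the proposed model, a regular term is added after the cost function:
$$C = C_0 + \frac{\lambda}{2n} \sum_{w} w^2,$$
where $C_0$ is the original cost function, $\lambda$ is the regularization coefficient, and $n$ is the number of training samples. The partial derivative can be obtained to yield
$$\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{n} w.$$
With learning rate $\eta$, the gradient descent update is
$$w \rightarrow w - \eta \frac{\partial C_0}{\partial w} - \frac{\eta \lambda}{n} w = \left(1 - \frac{\eta \lambda}{n}\right) w - \eta \frac{\partial C_0}{\partial w}.$$
The coefficient of $w$ is $1 - \eta\lambda/n$, which is less than 1 and indicates that during the training process, the weight is attenuated, resulting in a smaller weight. Moreover, L2 regularization is adopted to alleviate the over-fitting phenomenon. In the experiment, the regularization coefficient of L2 is set to 0.005.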
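The weight attenuation effect of L2 regularization can be seen in a one-parameter sketch. It assumes the common form of the penalty, (λ / 2n) Σ w², with plain gradient descent; the function name, the learning rate of 0.01, and n = 1 are illustrative assumptions, while 0.005 is the regularization coefficient stated above.

```python
def l2_sgd_step(weight, grad_c0, lr, lam, n):
    """One gradient descent step on C = C0 + (lam / (2 n)) * sum(w^2).

    The weight is first shrunk by the factor (1 - lr * lam / n), then moved
    along the gradient of the unregularized cost C0.
    """
    return (1.0 - lr * lam / n) * weight - lr * grad_c0


# With a zero data gradient, the weight decays geometrically towards zero.
w = 1.0
for _ in range(3):
    w = l2_sgd_step(w, grad_c0=0.0, lr=0.01, lam=0.005, n=1)
```

Each step multiplies the weight by a factor slightly below 1, which is the attenuation described in the text.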

Dataset Settings
The images of the UCM21 dataset are manually extracted from the USGS National Map Urban Area Image Set for urban areas across the country. The spatial resolution of the images is 0.3 m. Each image is 256 × 256 pixels, and the dataset contains 21 types of scene images with 100 images of each type, for a total of 2100 images. In the experiment, 80% of the images are randomly selected from UCM21 for training, and the remaining images are used for testing. Some scene images are shown in Figure 3.

AID30 Dataset
The AID30 dataset, a large aerial image dataset, is obtained by collecting samples from Google Earth images. It comes from different remote imaging sensors, covering different seasons in China, the United States, Germany, France, the United Kingdom, Italy, and other countries. Therefore, it is a very challenging dataset. The spatial resolution of the images is 0.5-0.8 m. Compared with the UCM21 dataset, the AID30 dataset features more images and categories, with an image size of 600 × 600 and a total of 30 categories of scene images, each of which contains about 220~420 images, totaling 10,000 images. Some scene images are shown in Figure 4.
To evaluate the proposed method effectively, two different data partitioning methods are employed.
(1) In the experiment, 20% of the images are randomly selected for training, and the rest are used for testing.
(2) In the experiment, 50% of the images are randomly selected for training, and the rest are used for testing.
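The random partitioning described above can be sketched as follows. The text does not state whether the split is stratified by class or which random seed is used, so the plain shuffle, the seed, and the function name `split_dataset` are assumptions for illustration.

```python
import random


def split_dataset(samples, train_fraction, seed=0):
    """Shuffle the samples and split them into training and test subsets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder identifiers standing in for the 10,000 AID30 images.
image_ids = list(range(10000))
train_20, test_80 = split_dataset(image_ids, 0.2)
train_50, test_50 = split_dataset(image_ids, 0.5)
```

A per-class (stratified) split would apply the same function to the image list of each category separately, which keeps the class proportions identical in both subsets.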

RSSCN7 Dataset
The RSSCN7 dataset was released by Qin Zou Yu of Wuhan University in 2015. These images are obtained in different seasons and weather changes, and are sampled with four different scales. The RSSCN7 dataset includes seven types of scene images, 400 × 400 in size. Each type is represented by 400 images, for a total of 2800 images. In the experiment, 50% of the images are randomly selected for training, and the rest are used for testing. Some scene images are shown in Figure 5.

NWPU45 Dataset
The NWPU45 dataset is large, covering more than 100 regions around the world. It is composed of images obtained from Google Earth through satellite images, aerial photography, and geographic information systems (GIS). The images of the dataset selected for this study include different angles, lighting conditions, times, and seasons, so the similarity between classes is very high, which makes the task more challenging. The size of each image is 256 × 256, with 45 types of scene images, each of which is represented by 700 images, for a total of 31,500. The spatial resolution of the images is 0.2-30 m. Some scene images are shown in Figure 6. Three different data partition methods are used in the experiment.
(1) In the experiment, 10% of the images are randomly selected for training, and the rest are used for testing.
(2) In the experiment, 20% of the images are randomly selected for training, and the rest are used for testing.
(3) In the experiment, 80% of the images are randomly selected for training, and the rest are used for testing.

Data Preprocessing
A. Normalize the input image.
B. Rotate the input image by 0-60 degrees.
C. Randomly flip the input image horizontally or vertically.
D. Randomly offset the size of the image by 0.2 times.

Parameter Settings
The initial learning rate is 0.01. The momentum of training is 0.9, and the batch size is set to 16. The experiments are conducted on a computer with an Intel(R) Core(TM) i7-10750H CPU, an RTX2060 GPU, and 16 GB of RAM.

The Performance of the Proposed Model
The proposed model is based on an improvement of the MobileNet network. To prove the effectiveness of the attention-based multi-branch fusion strategy in the model, the proposed model and the MobileNet model are first compared on the UCM21, AID30, NWPU45, and RSSCN7 datasets. The overall accuracy (OA), kappa coefficient, F1 score, and average precision (AP) are adopted as the evaluation indexes. In the experiment, Keras is adopted to reproduce MobileNet and to fine-tune the last layer of the network. Table 1 gives the comparison results of the OA, kappa coefficient, AP, and F1 score between the proposed model and the MobileNet model.
It can be seen from Table 1 that the classification performance of the proposed method is better than that of MobileNet. For the AID30 dataset, when the proportions of training samples and test samples are 20% and 80% (i.e., 20/80), respectively, the OA of the proposed method is 6.06% higher than that of MobileNet, and the kappa of the proposed method is 6.28% higher than that of MobileNet. The F1 and AP results of the proposed method are also the highest. For other datasets with different proportions of training samples and test samples, the proposed method also shows an excellent classification performance. This proves that the proposed attention-based multi-branch fusion strategy can further extract the deep features of the remote sensing images, thus improving the classification performance.
In addition, on the UCM21 (20/80) dataset, the confusion matrices obtained by the proposed method and the MobileNet model are compared, as shown in Figure 7. It can be seen from Figure 7a that the proposed method provides a 100% correct classification rate for almost all of the categories. Moreover, the number of misclassified samples is far smaller than that of the MobileNet model.
To summarize, across multiple evaluation indicators (OA, AP, kappa, F1, and the confusion matrix), the classification performance of the proposed method on the six dataset partitions is higher than that of MobileNet. This proves the effectiveness of the attention-based multi-branch fusion strategy, which offers an excellent performance in remote sensing image scene classification.
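For readers who wish to reproduce the evaluation indexes, the OA, kappa coefficient, and per-class F1 score can all be derived from a confusion matrix. The sketch below assumes that rows are true classes and columns are predicted classes; the 2 × 2 matrix is a toy example, not data from the experiments.

```python
def overall_accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


def kappa(cm):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    row_sums = [sum(row) for row in cm]
    col_sums = [sum(cm[i][j] for i in range(n)) for j in range(n)]
    p_observed = overall_accuracy(cm)
    p_expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    return (p_observed - p_expected) / (1.0 - p_expected)


def f1_per_class(cm):
    """F1 score of each class from its precision and recall."""
    n = len(cm)
    scores = []
    for c in range(n):
        true_pos = cm[c][c]
        predicted = sum(cm[i][c] for i in range(n))  # column sum
        actual = sum(cm[c])                          # row sum
        denom = predicted + actual
        scores.append(2.0 * true_pos / denom if denom else 0.0)
    return scores


toy_cm = [[9, 1],
          [2, 8]]
print(overall_accuracy(toy_cm), kappa(toy_cm), f1_per_class(toy_cm))
```

The F1 expression 2·TP / (predicted + actual) is algebraically the harmonic mean of precision and recall.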

Comparison with Advanced Methods
The proposed AMB-CNN method considers the feature information and location of the feature information comprehensively, applies the feature map with an enlarged receptive field to the two convolution model structures, extracts the feature, and finally fuses the multiple branches obtained through the attention mechanism. In this way, not only the classification accuracy is effectively improved, but the complexity of the model is also greatly reduced.
In the experiment, the proposed method is evaluated comprehensively through a comparison with some state-of-the-art methods under the same conditions. Firstly, some experiments are carried out on the UCM21 dataset with a training/test ratio of 8:2. The comparison results are shown in Table 2.
Table 2. Performance comparison of the proposed model with some state-of-the-art methods on the UCM21 dataset.

Here, the OA of the proposed model is 0.31% higher than that of the recently proposed PANet50 [48] model and 0.23% higher than that of the LCNN-BFF dual-branch fusion network. Moreover, the proposed model has only 5.6 M parameters. Compared with SF-CNN with VGGNet [47], VGG16-DF [46], and FACNN [44], which take VGG16 as the basic network model, the number of parameters of the proposed model is only 4.3% of that of these methods. Compared with models based on ResNet, such as PANet50, the number of parameters is only 20%. Notably, the proposed AMB-CNN method still maintains the best performance with the fewest parameters. For a more comprehensive evaluation of the proposed method, the UCM21 dataset is used for a five-fold cross-validation experiment, and the OA of the proposed AMB-CNN method is 99.49%. Figures 8-10 show the AP comparison results of MobileNet, LCNN-BFF, and the proposed AMB-CNN method for the RSSCN7 (5/5), AID30 (2/8), and NWPU45 (1/9) datasets, respectively. The experiments show that the APs of the AMB-CNN method are higher than those of the other two methods for each specific category. These results illustrate that the strategy of multi-branch and attention fusion can extract the image features more effectively and can reduce the loss of useful information, which enables this strategy to provide an excellent classification performance.
Next, some experiments are carried out on the RSSCN7 dataset with a training/test ratio of 5:5. The results are shown in Table 3. Here, the proposed model still has great advantages on the RSSCN7 datasets with a high similarity between classes. Compared with the two-stage deep feature fusion [17] method, the SPM-CRC [40] method, the WSPM-CRC [40] method, and the LCNN-BFF [49] method, the OA of the proposed method is improved by 2.77%, 1.28%, 1.24%, and 0.50%, respectively. Compared with the ADFF method, although the OA of the proposed method is slightly lower, the number of parameters is only 24.3% that of the ADFF method. In general, the complexity of the proposed network model is greatly reduced at the cost of a slight decrease in classification accuracy. In addition, in order to avoid biased results, a cross-validation (five-fold) experiment is also used on the RSSCN7 dataset. The OA accuracy of the proposed method reaches 96.07%, which proves the effectiveness of the proposed method. Table 3. Performance comparison of the proposed model with some state-of-the-art methods on the RSSCN7 dataset.

Method: The proposed AMB-CNN; OA (%): 95.14 ± 0.24
Table 4 shows the experimental results on the AID30 dataset with training/test ratios of 2:8 and 5:5. In the AID30 (20/80) classification, the proposed method still provides the best classification accuracy. Compared with the GBNet+global feature method [43], the LCNN-BFF method [49], the GBNet method [43], and the DCNN method [19], the OA of the proposed method is improved by 1.07%, 1.67%, 3.11%, and 2.45%, respectively. In the AID30 (50/50) partition, the complexity of the proposed AMB-CNN model is only 4.3%, 4.1%, and 37.3% that of the DCNN method, the GBNet+global feature method [43], and the VGG_VD16+SAFF method [39], respectively. The proposed method is also tested on AID30 with 80% training, and the OA is up to 97.56%.

Figure 10. AP comparison of MobileNet, LCNN-BFF, and AMB-CNN on the NWPU45 (1/9) dataset.
Finally, the effectiveness of the proposed method is further evaluated on the large NWPU45 dataset. Some experiments are carried out with training/test ratios of 1:9 and 2:8. The results are shown in Table 5. The accuracy of our method on NWPU45 (10/90) is 88.99%, which outperforms some state-of-the-art classification methods as follows: 2.46% higher than LCNN-BFF [49], 4.66% higher than sCCov [41], and 3.66% higher than MSCP [42]. Moreover, in the NWPU45 (20/80) partition, the performance of our proposed model remains excellent. In order to evaluate our model more comprehensively, the proposed method is tested on the NWPU45 dataset with 80% training, and the OA is up to 95.97%, 1.27% higher than that of the Siamese VGG16 method.
To comprehensively evaluate the proposed model from different perspectives, Grad-CAM is used to visually analyze the different network models. This method uses the gradients of any target class flowing into the last convolutional layer of the network to generate a coarse attention map, which displays the areas of the image that are important for the model prediction. In the experiment, some images are selected randomly from the UCM21 dataset, and the latest LCNN-BFF method is compared with the proposed method. Some remote sensing scene images, including an airplane, storage tanks, a golf course, a sparse residence, and a forest, are randomly selected for comparison, as shown in Figure 16. It can be seen in Figure 16 that for the scene of storage tanks, the focus area of the LCNN-BFF model is shifted, whereas the proposed AMB-CNN model focuses on the target object very well. For the airplane, golf course, sparse residence, and forest scenes, the focus areas of the LCNN-BFF are limited, and thus the regions surrounding the target are ignored. Therefore, the extracted targets are incomplete. However, the proposed model can still provide complete focus areas.
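The core computation of Grad-CAM can be illustrated without a deep learning framework. In the sketch below, the activations of the last convolutional layer and the gradients of the class score with respect to them are assumed to be already available as nested lists; in practice they would come from the trained network. The function name and toy values are hypothetical.

```python
def grad_cam(activations, gradients):
    """Coarse attention map from K feature maps and their gradients.

    activations, gradients: lists of K maps, each an H x W nested list.
    Each map is weighted by the spatial mean of its gradient, the weighted
    maps are summed, and negative values are removed with ReLU.
    """
    weights = [sum(sum(row) for row in g) / (len(g) * len(g[0]))
               for g in gradients]
    height, width = len(activations[0]), len(activations[0][0])
    return [[max(0.0, sum(a * fmap[i][j] for a, fmap in zip(weights, activations)))
             for j in range(width)]
            for i in range(height)]


# Toy example with K = 2 feature maps of size 2 x 2.
acts = [[[1.0, 2.0], [3.0, 4.0]],
        [[4.0, 3.0], [2.0, 1.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]],
         [[-1.0, -1.0], [-1.0, -1.0]]]
cam = grad_cam(acts, grads)
```

The resulting map is usually upsampled to the input resolution and overlaid on the image as a heat map, which is how focus areas such as those in Figure 16 are typically rendered.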
The trained network model is also tested with randomly selected images, as shown in Figure 17. We can see that the prediction results provided by the proposed model are consistent with the real scenarios, and the prediction confidences are all above 99%, with some individual scenarios reaching 100%. This proves that the proposed method can extract image features more effectively.

Model Analysis
The proposed AMB-CNN method is evaluated on four datasets with different division proportions, and is shown to have a good classification performance. At the same time, the number of parameters of the proposed method is lower than that of the other advanced methods. These benefits come from the following two aspects. First, in the feature extraction stage, the model is divided into several modules, by which the shallow features and deep features are extracted. At the same time, considering the complexity of the model, a hybrid convolution method is adopted, and multiple small convolutions are cascaded instead of using a large convolution. Secondly, in terms of feature fusion, with the multi-segment attention mechanisms, the extracted features are input into the attention mechanism to extract important information again, and finally, the multi-branch features are fused.

Visual Dimension Assessment
T-distributed stochastic neighbor embedding (t-SNE) visualization is adopted to further evaluate the performance of the AMB-CNN model. t-SNE dimensionality reduction and visualization can better estimate the classification performance of the model by mapping high-dimensional data into two-dimensional space and using a scatter plot to visually display the classification effect. On the RSSCN7 (5/5) and UCM21 (8/2) datasets, the t-SNE visualization effects of MobileNet, LCNN-BFF, and the proposed AMB-CNN model are compared, as shown in Figure 18.
In Figure 18, we can see that compared with the other two methods, the classification results of the proposed method have a smaller intra-class distance and a larger inter-class distance, which indicates that the proposed method can better extract image features and distinguish different categories, thereby providing an excellent performance for remote sensing image scene classification.

Conclusions
In this paper, a lightweight convolutional neural network based on attention-oriented multi-branch feature fusion (AMB-CNN) is proposed. This method has been shown to be effective on various datasets and under various conditions. A multi-branch convolution block is designed for feature extraction under a fully expanded receptive field. At the same time, the attention mechanism is utilized for feature weighting analysis of the spatial and channel information. Finally, all of the features are fused. In this way, not only are the key features extracted accurately, but the loss of information is also reduced. In addition, a network model with very few parameters is constructed by alternating depthwise separable convolution and traditional convolution. The experimental results show that, compared with some state-of-the-art methods, the proposed method has only 5.6 M parameters while still having a great advantage in classification accuracy. In particular, on the UCM21 dataset, the OA of the proposed method is up to 99.52%, which exceeds that of most of the existing advanced methods. In addition, the proposed method also shows good performance on other datasets, such as the AID30 dataset with a training/test ratio of 2:8, where the OA of the proposed method reaches 93.27%. The next step is to extend this method to other remote sensing data, such as hyperspectral images, to improve the universality of the proposed model.