CRABR-Net: A Contextual Relational Attention-Based Recognition Network for Remote Sensing Scene Objective

Remote sensing scene objective recognition (RSSOR) has significant application value in both military and civilian fields. Convolutional neural networks (CNNs) have greatly advanced intelligent objective recognition for remote sensing scenes, but most CNN-based methods for high-resolution RSSOR either use only the feature map of the last layer or directly fuse the feature maps from various layers by summation. This not only ignores the useful relationship information between adjacent layers but also leads to redundancy and loss of feature map information, which hinders the improvement of recognition accuracy. In this study, a contextual relational attention-based recognition network (CRABR-Net) is presented. It extracts different convolutional feature maps from a CNN, focuses on important feature content by using a simple, parameter-free attention module (SimAM), fuses adjacent feature maps by using the complementary relationship feature map calculation, improves the feature learning ability by using the enhanced relationship feature map calculation, and finally uses the concatenated feature maps from different layers for RSSOR. Experimental results show that CRABR-Net exploits the relationship between different CNN layers to improve recognition performance and achieves better results than several state-of-the-art algorithms, with average accuracies on AID, UC-Merced, and RSSCN7 of up to 96.46%, 99.20%, and 95.43% under generic training ratios.


Introduction
RSSOR is widely applied to specific tasks such as geological exploration, precision agriculture, and urban planning [1][2][3]. As the name implies, RSSOR infers the correct category of scene objectives by evaluating the content features contained in remote sensing data. With the continuous advancement of urban construction and the rapid progress of high-resolution observation satellites, the diversity of ground objectives and the scale of data are increasing, and how to perform RSSOR more accurately has become a popular and difficult problem in remote sensing research [4][5][6].
With the accumulation of data and the improvement of computer performance, artificial neural networks and deep learning are developing rapidly, and CNNs have come into use for RSSOR [7]. As a deep learning technique, the CNN has the advantages of "sparse connection", "parameter sharing", and "equivariant representation" [8]. It can shorten the time required for model learning, reduce the number of parameters to be trained, and reduce the memory requirement for model operation. In addition, the feature maps obtained by a CNN can generally be grouped into three levels: the bottom layer reflects details of the color, texture, and shape of the objective; the middle layer reflects the state of an object in the image at a certain moment; and the top layer reflects the overall concept of the image with rich semantic information. The top-layer feature maps are also the most widely used in RSSOR. However, when a CNN is employed for RSSOR, ignoring the other layers and adopting only the last layer limits recognition performance and fails to fully exploit the advantages of the CNN [9].
Another popular CNN-based approach is to integrate the feature maps learned from different CNN layers to generate new discriminative feature maps for RSSOR, which can combine complementary feature advantages and improve the recognition performance of the network. Two structures are common for multilayer feature fusion networks: the parallel multi-branch network (PMBN) and the serial hop-layer connection network (SHLCN). PMBNs usually fuse features using convolutional kernels of different sizes, dilated convolution (convolution with holes), and pooling operations of different sizes. In [10], the features are first extracted and then fused using four parallel structures, each containing convolutional kernels of different sizes. In [11], highly accurate features were obtained using dilated convolutional networks. In [12], the recognition accuracy for small samples is improved by assembling feature maps of different scales under different weights. These methods achieve their purpose, but they ignore the relationship between adjacent layers. An SHLCN combines features through hop-layer (skip) connections. In [13], features fused through skip connections yield recognition results superior to those of traditional methods. In [14], a covariance matrix is computed from stacked multilayer features, and the covariance matrix is then combined with a support vector machine to obtain better classification results. In [15], sparse representation is used to fuse the middle-layer and top-layer features, and the fused features are then used for scene classification, which is effective with limited data. These methods use multilayer feature fusion, but they suffer from feature redundancy and offset in the integration process and also ignore the relationship between adjacent layers. In summary, the parallel structure acquires features with different receptive fields at the same level, while the serial structure integrates features from various levels. All these methods can enhance the features, but they also introduce redundancy and mutual exclusion among feature maps.
In addition, because ground features are complex and diverse, satellite imaging is affected by background, lighting, scale, and other imaging conditions. Therefore, two types of feature confusion problems arise in RSSOR: scene objectives of the same semantic category may differ greatly in visual appearance, and scene images of different semantic categories may share certain similarities [16]. To reduce the impact of these two problems, many researchers have tried to use an attention mechanism (AM) [17]. In [18], a dual-attention residual network is designed to extract features, embedding spatial attention into the bottom features and channel attention into the top features. In [19], an AM is added to the top-level features to selectively focus on key content and discard non-key information, which improves classification performance. These methods add attention only after convolutional processing, so attention features can be learned only from the current feature layer, ignoring the attention relationship with other convolutional layers.
To fully exploit the powerful learning capability of CNNs while reducing the impact of feature confusion on remote sensing scene objective recognition, and inspired by [20] and the AM, we explore the complementary and enhanced relationship information that exists between the feature maps of adjacent convolutional layers, focusing on key information and discarding non-key information during feature map computation.
In general, this study makes three main contributions:
(1) A complementary relational feature computation module is designed;
(2) An enhanced relational feature calculation module is designed;
(3) A contextual relational attention-based recognition network is proposed to effectively enhance the performance of CNN-based RSSOR.
The rest of the paper is organized as follows: Section 2 describes related work; Section 3 introduces CRABR-Net; Section 4 reports the experimental results; Section 5 presents the discussion; and Section 6 summarizes the paper.

Methods Based on Intuitive Feature
This category is the earliest recognition approach; it identifies the category of an image from the most intuitive low-level features of the scene objectives. The low-level features consist of local features and global features, such as color, spectrum, texture, and structure [21]. Color features are typical local features, and they are also the most easily observed and calculated low-level features [22]. A common color-based method, the color histogram, identifies categories by comparing the proportions of different colors in the entire image [23]. This method cannot determine the spatial position of each color in the image and is less effective for images that are spectrally similar but differ greatly in spatial distribution. Texture features are a type of global feature [24]. Typical methods, such as those based on the gray-level covariance matrix, calculate the gray-level covariance matrix of an objective and then identify categories by analyzing the resulting image features [25]. This approach is effective for images with large differences in texture features, but it struggles with scene images whose texture features are not distinctive.
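The spatial blindness of color histograms can be illustrated with a small sketch. The function names, the bin count, and the two toy images below are illustrative assumptions, not part of any cited method; the point is only that two images with the same color proportions but different layouts receive identical histograms.

```python
import numpy as np

def color_histogram(image, bins=4):
    """Return the normalized joint RGB histogram of an H x W x 3 uint8 image."""
    # Quantize each channel into `bins` levels, then count joint colour codes.
    q = (image.astype(np.int64) * bins) // 256
    codes = (q[..., 0] * bins + q[..., 1]) * bins + q[..., 2]
    hist = np.bincount(codes.ravel(), minlength=bins ** 3).astype(np.float64)
    return hist / hist.sum()

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means identical colour proportions."""
    return float(np.minimum(h1, h2).sum())

# Two toy 4 x 4 images with the same colours placed differently.
red, blue = (255, 0, 0), (0, 0, 255)
left_right = np.zeros((4, 4, 3), dtype=np.uint8)
left_right[:, :2] = red
left_right[:, 2:] = blue
top_bottom = np.zeros((4, 4, 3), dtype=np.uint8)
top_bottom[:2, :] = red
top_bottom[2:, :] = blue

similarity = histogram_intersection(color_histogram(left_right),
                                    color_histogram(top_bottom))
print(similarity)  # the two layouts are indistinguishable by histogram
```

Because the histogram discards pixel positions, the similarity is maximal even though the two images differ in layout, which is the limitation described above.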

Methods Based on Statistical Features
This approach aggregates intuitive features; its essence is to analyze the statistical distribution of the intuitive features of an image to establish the connection between them and semantic features. Representative methods are the bag of visual words (BoVW) and k-means clustering [26]. The core idea of BoVW is to collect the low-level features of an image, such as SIFT [27] and GIST [28], cluster these features with a method such as k-means to form a "visual dictionary", and then encode the image according to the frequency with which its features appear in the "visual dictionary", as a feature description of the image. BoVW recognizes better than methods based on intuitive features, but it uses only the frequency information of the visual dictionary, ignores spatial distribution, and lacks correlation between features, so it still has limitations. Later improvements, such as spatial pyramid matching [29], segment the image at multiple scales to enhance spatial information. However, these methods still need to extract many intuitive features, which is cumbersome and inflexible and tends to overlook semantic information.
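The BoVW pipeline described above can be sketched as follows. This is a minimal illustration under stated assumptions: random vectors stand in for real local descriptors such as SIFT, the plain k-means loop is a simplified hypothetical implementation, and the dictionary size of 8 is arbitrary.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means; returns the k cluster centres (the 'visual dictionary')."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every point to every centre, shape (n, k).
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centre if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers

def bovw_encode(descriptors, centers):
    """Encode one image as the normalized frequency of its visual words."""
    dist = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    words = dist.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centers)).astype(np.float64)
    return hist / hist.sum()

rng = np.random.default_rng(0)
# Assumption: random 16-D vectors stand in for descriptors from a training set.
train_descriptors = rng.normal(size=(500, 16))
dictionary = kmeans(train_descriptors, k=8)
image_descriptors = rng.normal(size=(40, 16))
code = bovw_encode(image_descriptors, dictionary)
print(code)
```

The resulting code records only how often each visual word occurs, so the positions of the descriptors in the image are lost, as noted in the text.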

Methods Based on Depth Feature
These methods use deep learning models to adaptively learn objectives in an "end-to-end" manner and achieve higher accuracy after obtaining deep semantic information. Commonly used models include the Stacked Auto Encoder (SAE) [30], Visual Transformer (VIT) [31], and CNN [32]. For example, Li et al. [33] apply the SAE; the model is simple, and a feature representation of the input data can be established quickly from a small number of features, but this type of method cannot capture the spatial relationships among local features. Bazi et al. [34] use VIT and achieve high recognition accuracy, but such methods take a long time to train and need a large amount of objective information to achieve a relatively good training result. CNN-based methods are the most popular approaches for RSSOR [9]. According to how deep features are used, they can be categorized into CNN without fusion, CNN with fusion, and CNN with AM methods.

• CNN without Fusion Method. The method uses a CNN to acquire local features of the training objectives and then transforms them directly into global features for recognition [35]. According to whether pretraining parameters are used, these methods can be divided into two classes. One class does not use pretraining parameters. Nogueira et al. [36]
• CNN with AM Method. These methods usually add an AM after the convolutional layer to filter useless information and enhance useful features. For example, [43] added a channel attention mechanism [44] to different stages of DenseNet-121, and Guo et al. [18] added a spatial attention mechanism [45] to the second convolutional module of ResNet-101 and channel attention to the third, fourth, and fifth convolutional modules. Wang et al. [19] propose a mask matrix as a convolutional feature for attention; Fan et al. [46] design an attention mechanism with trunk branches and mask branches for ResNet-50.
All of the above methods work well in RSSOR, but they either use a single layer of features or simply sum the features of several layers, ignoring the relational information between the features. Our goal is to make maximal use of the features extracted by each CNN layer and to obtain better recognition performance with just one CNN backbone network.

Methodology
The architecture of the proposed CRABR-Net is shown in Figure 1. (e) The fifth step is recognition. F is fed into a recognizer consisting of GAP, a fully connected layer, and a softmax layer for scene recognition.


Backbone Network for Extraction Feature Map
We use Se-ResNext-50 as the feature extraction backbone network for this remote sensing image recognition task. Se-ResNext-50 retains the advantages of the residual structure of ResNet, adopts the idea of widening the network from the Inception model, and incorporates Se-Net to exploit the relationships between feature channels; it performs better in feature learning than ResNet and its other variants [48].
As shown in (c) CNN Backbone Network in Figure 1, the Stem module, Layer1 module, Layer2 module, Layer3 module, and Layer4 module of the Se-ResNext-50 network process the preprocessed dataset in turn to obtain the output feature maps of the four layer modules. Within the Stem module, 64 convolution kernels of size 7 × 7 are applied with a stride of 2. The resulting feature maps are then max-pooled with a 3 × 3 window and a stride of 2 to obtain feature maps of size 56 × 56.
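The 56 × 56 output size of the Stem module can be checked with the standard output-size formula for convolution and pooling. The sketch below assumes the 224 × 224 input size given in the data preprocessing section and assumes paddings of 3 and 1, which are common in ResNet-style stems but are not stated in the text.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1

input_size = 224  # images are resized to 224 x 224 before training
# 7 x 7 convolution with stride 2 (padding 3 is an assumption).
after_conv = conv_output_size(input_size, kernel=7, stride=2, padding=3)
# 3 x 3 max pooling with stride 2 (padding 1 is an assumption).
after_pool = conv_output_size(after_conv, kernel=3, stride=2, padding=1)
print(after_conv, after_pool)
```

Under these assumptions, the convolution halves the resolution to 112 and the pooling halves it again to 56, which agrees with the size stated above.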
As shown in Figure 2, the Layer1 module contains three groups of Bottleneck. Each Bottleneck consists of Conv_1, Conv_2, Conv_3, and Se-Module, where the kernel sizes of the three convolutional modules are 1 × 1, 3 × 3, and 1 × 1, and the numbers of convolutional kernels are 128, 128, and 256, in that order. Specifically, in the second convolution stage, 32 identical structures are used to widen the network module. In the Se-Module, squeezing is performed by global average pooling, the associations between channels are modeled by a fully connected layer, a sigmoid function outputs one weight per input feature channel, and finally the normalized weights are applied to the features channel by channel. Similar to the Layer1 module, the number of Bottleneck compositions of Layer2, Layer3, and Layer4 modules are 4, 6, and

Preprocessing for Relational Feature Map
To prevent the model from becoming more complex and to control the number of parameters as much as possible, we use SimAM [49] to focus the feature expressions of the four different layers on the more important information without increasing the network parameters.
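For readers unfamiliar with SimAM, the following NumPy sketch follows the commonly published formulation of the module, in which each activation is weighted by a sigmoid of an inverse energy term computed from per-channel statistics. It is a minimal sketch, not the implementation used in this work; the function name and the regularization constant `e_lambda` are assumptions.

```python
import numpy as np

def simam(x, e_lambda=1e-4):
    """Parameter-free attention over a feature map x of shape (B, C, H, W)."""
    _, _, h, w = x.shape
    n = h * w - 1
    # Squared deviation of every activation from its channel mean.
    mu = x.mean(axis=(2, 3), keepdims=True)
    d = (x - mu) ** 2
    # Per-channel variance estimate.
    var = d.sum(axis=(2, 3), keepdims=True) / n
    # Inverse energy: activations far from the channel mean score higher.
    e_inv = d / (4.0 * (var + e_lambda)) + 0.5
    attention = 1.0 / (1.0 + np.exp(-e_inv))  # sigmoid
    return x * attention

rng = np.random.default_rng(0)
features = rng.normal(size=(2, 8, 14, 14))  # toy feature map
refined = simam(features)
print(refined.shape)
```

The function contains no learnable weights, which is consistent with the goal of adding attention without increasing the number of network parameters.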
To facilitate the later primary and advanced relational feature calculations, we use 1 × 1 convolution to reduce the number of channels of the feature maps. We design the convolutional dimensionality reduction modules separately; the number of input channels of the convolution is set to the channel count of the input features, and the number of output channels is kept the same as the channel count of $F_1$.
In this processing, to avoid instability in network learning caused by overly large feature values after the convolutional dimensionality reduction, we batch normalize the dimensionality reduction results so that the feature data follow a distribution with mean 0 and variance 1. In addition, to avoid over-fitting, we add a rectified linear unit (ReLU) [50], which keeps only the outputs larger than 0 and sets all others to 0, so that the network can be better fitted.
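The channel reduction step (1 × 1 convolution, batch normalization, ReLU) can be sketched in NumPy as below. This is an illustrative sketch only: the weights are random, the batch normalization uses batch statistics without learnable scale and shift, and the 512-channel input is a hypothetical example of a deeper layer being reduced to the 256 channels of Layer1.

```python
import numpy as np

def conv1x1(x, weight):
    """1 x 1 convolution: x is (B, C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,bchw->bohw", weight, x)

def batch_norm(x, eps=1e-5):
    """Normalize each channel to mean 0 and variance 1 over the batch."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def relu(x):
    """Keep positive values and set all others to 0."""
    return np.maximum(x, 0.0)

def reduce_channels(x, weight):
    return relu(batch_norm(conv1x1(x, weight)))

rng = np.random.default_rng(0)
features = rng.normal(size=(2, 512, 14, 14))  # hypothetical deeper-layer output
weight = rng.normal(size=(256, 512)) * 0.05   # hypothetical random weights
reduced = reduce_channels(features, weight)
print(reduced.shape)
```

After this step, all four feature maps would share one channel count, so they can be combined element-wise once their spatial sizes also match.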
So far, we have obtained the preprocessed relational feature maps $F_1$, $F_2$, $F_3$, and $F_4$.

Complementary Relationship Feature Map Calculation
The relationship information among $F_1$, $F_2$, $F_3$, and $F_4$ should be fully utilized. We design a primary relationship enhancement process from the high feature layer to the low feature layer to further extract the relationship between adjacent layer features and embed this relationship into the adjacent low-layer features, complementing their representation; the structure is shown in Figure 3.
To use the adjacent higher convolutional layer to supply the global information missing from low-level features, we enlarge the high-level feature maps with a bilinear interpolation algorithm to match the size of the feature maps acquired from the low-level convolutional layers. In particular, unlike [20], and considering that different ways of fusing convolutional features from adjacent layers affect the integrated features differently, we do not simply sum the corresponding elements directly; instead, we obtain the primary relational features by assigning different weight parameters to the adjacent feature layers and multiplying the corresponding elements by these weights before summation.
As shown in Figure 3, the higher-level feature map is first enlarged by the bilinear interpolation algorithm, and then the enlarged feature map and the lower-level feature map of the adjacent layer are summed position by position to obtain the fused feature map. Here, ⊕ denotes the element-by-element summation operation.
Then, using the fused features from the previous step, the global and self-attentive relationship weights are calculated with the sigmoid function. As shown in Figure 3, the upper branch computes the global attention features. We apply two-dimensional adaptive global average pooling to the input features and then use a convolutional kernel of size 1 × 1 whose output channel dimension is one-fourth of the input channel dimension to reduce the number of feature channels. To prevent the computed values from becoming too large and the network from over-fitting, we perform batch normalization and add a ReLU. Finally, the original number of feature channels is restored with a convolutional kernel of size 1 × 1, and batch normalization is performed to obtain the global attention features.
The lower branch computes the local attention features. A 1 × 1 convolution kernel reduces the channel dimension of the input features to one-fourth of the original size. Then, batch normalization is performed, and a ReLU is added. Finally, a convolution kernel of size 1 × 1 restores the channel dimension of the feature map to the original number of channels, and all feature values are normalized to obtain the self-attention features.
After the global attention features and self-attention features are summed element by element at corresponding positions, the sigmoid function is used to compute the focused relationship parameter of the lower layer of the adjacent feature layers, $S^{12}_{n+1,n}$. Similarly, the supplemental relationship parameter of the higher layer is obtained as $S^{11}_{n+1,n} = 1 - S^{12}_{n+1,n}$. This leads to the focused relation feature map $S^{22}_{n+1,n}$ and the supplemental relation feature map $S^{21}_{n+1,n}$, where ⊗ denotes element-wise multiplication at corresponding positions. Finally, the complementary relationship feature map is obtained.
By the same principle, we obtain the complementary relationship feature maps for $F_1$, $F_2$, $F_3$, and $F_4$.
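The complementary relationship calculation can be sketched in NumPy as follows. This is one possible reading of the description, not the reference implementation: nearest-neighbour upsampling stands in for bilinear interpolation, batch normalization is omitted for brevity, the weights are random, and the assignment of the focused weight to the lower layer and of its complement to the upsampled higher layer is an assumption based on the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1(x, weight):
    """1 x 1 convolution: x is (B, C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,bchw->bohw", weight, x)

def upsample2x(x):
    """Nearest-neighbour 2x upsampling (stand-in for bilinear interpolation)."""
    return x.repeat(2, axis=2).repeat(2, axis=3)

def branch(x, w_down, w_up):
    """1 x 1 conv (C -> C/4), ReLU, 1 x 1 conv (C/4 -> C); batch norm omitted."""
    return conv1x1(np.maximum(conv1x1(x, w_down), 0.0), w_up)

def complementary_fusion(f_low, f_high, params):
    f_up = upsample2x(f_high)                         # match the lower-layer size
    fused = f_low + f_up                              # element-wise summation
    pooled = fused.mean(axis=(2, 3), keepdims=True)   # global average pooling
    global_att = branch(pooled, *params["global"])    # shape (B, C, 1, 1)
    local_att = branch(fused, *params["local"])       # shape (B, C, H, W)
    s_low = sigmoid(global_att + local_att)           # focused weight (lower layer)
    s_high = 1.0 - s_low                              # supplemental weight (higher layer)
    return s_low * f_low + s_high * f_up

rng = np.random.default_rng(0)
channels = 16
params = {
    name: (rng.normal(size=(channels // 4, channels)) * 0.1,
           rng.normal(size=(channels, channels // 4)) * 0.1)
    for name in ("global", "local")
}
f_low = rng.normal(size=(2, channels, 8, 8))
f_high = rng.normal(size=(2, channels, 4, 4))
out = complementary_fusion(f_low, f_high, params)
print(out.shape)
```

Because the two weights sum to one at every position, each output value is a convex combination of the lower-layer value and the upsampled higher-layer value, which is what distinguishes this weighted fusion from direct summation.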

Enhanced Relationship Feature Map Calculation
Considering the main relationship feature maps of two neighboring layers, where one lower layer contains the contextual information of the upper layer and the main relationship feature map of the upper layer is a more abstract representation of the lower layer, there is a rich contextual dependency between these feature maps.
The purpose of this module is to capture such contextual relationships and embed them into the higher-level feature maps of neighboring layers so as to enhance the representation of higher-level features.
The calculation process of the module is illustrated in Figure 4. Let $F_n \in \mathbb{R}^{B \times C \times H_n \times W_n}$ denote the obtained primary relationship feature map, where $B$, $C$, $H_n$, and $W_n$ denote the number of features learned in one training step, the channel dimension of the features, the horizontal dimension of the features, and the vertical dimension of the features, respectively.
To establish the high-level enhancement relationship between two adjacent layers of features $F_n$ and $F_{n+1}$, global average pooling (GAP) is applied to obtain the global feature map $Z_{GAP}(F_n) \in \mathbb{R}^{C}$, and global max pooling (GMP) is applied to obtain the local feature map, where $G_{pool}$ denotes the GAP calculation and $G_{max}$ denotes the GMP calculation.
Then, the two results are fed into the MLP separately, where $W_0 \in \mathbb{R}^{C/r \times C}$, $W_1 \in \mathbb{R}^{C \times C/r}$, and $r$ represents the scaling ratio of the channel dimension. $W_0$ and $W_1$ are convolutional operations. In particular, the ReLU activation function comes right after $W_0$ to avoid over-fitting and speed up network convergence. Then, the outputs of the multilayer perceptron, $M_{GAP}$ and $M_{GMP}$, are summed element-wise and passed through a Sigmoid activation to generate the enhanced weights of the two adjacent layers of feature maps, where $\sigma$ denotes the Sigmoid function.
After calculating the enhanced weights of the two adjacent layers of feature maps, we perform an element-wise multiplication to obtain the enhanced feature map, where ⊗ denotes the element-wise multiplication operation. The enhanced relationship feature maps $F^{L}_{1}$, $F^{L}_{2}$, $F^{L}_{3}$, and $F^{L}_{4}$ can be calculated from Equation (10).
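The enhanced relationship calculation resembles channel attention built from GAP and GMP descriptors, and it can be sketched as below. This is a hedged illustration: sharing one MLP between the two descriptors, the ratio r = 4, the random weights, and the choice to reweight the higher-level map with weights derived from the lower-level map are all assumptions made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp(z, w0, w1):
    """z is (B, C); w0 is (C/r, C) followed by ReLU; w1 is (C, C/r)."""
    hidden = np.maximum(z @ w0.T, 0.0)
    return hidden @ w1.T

def enhancement_weights(f, w0, w1):
    """Channel weights from the GAP and GMP descriptors of f, shape (B, C)."""
    z_gap = f.mean(axis=(2, 3))  # global average pooling
    z_gmp = f.max(axis=(2, 3))   # global max pooling
    return sigmoid(mlp(z_gap, w0, w1) + mlp(z_gmp, w0, w1))

def enhance(f_source, f_target, w0, w1):
    """Reweight the channels of f_target with weights derived from f_source."""
    weights = enhancement_weights(f_source, w0, w1)
    return f_target * weights[:, :, None, None]

rng = np.random.default_rng(0)
channels, r = 32, 4
w0 = rng.normal(size=(channels // r, channels)) * 0.1
w1 = rng.normal(size=(channels, channels // r)) * 0.1
f_n = rng.normal(size=(2, channels, 8, 8))       # lower-level map
f_next = rng.normal(size=(2, channels, 4, 4))    # adjacent higher-level map
weights = enhancement_weights(f_n, w0, w1)
enhanced = enhance(f_n, f_next, w0, w1)
print(weights.shape, enhanced.shape)
```

Since the weights are per channel, the two maps need the same channel count but not the same spatial size, which is why the channel reduction in the preprocessing step matters here.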

Feature Fusion and Objective Recognition
The advanced enhancement features are fused using the concatenation function to generate the final multilevel enhanced relationship feature map.
Then, after the GAP calculation, the globally average-pooled features are flattened into a one-dimensional vector using the flatten function, and the flattened features are input to the fully connected layer. We use one-hot coding to represent the N remote sensing scene categories, where the true probability of a category is denoted as $y_{ij}$. The predicted probability $\hat{y}_{ij}$ of each of the N categories is obtained by inputting $Z_1$ into the Softmax layer.
The distance between the true probability and the predicted probability is measured by the loss function; the smaller the loss value, the more accurate the prediction. Here, N represents the total number of scene objective categories to be recognized.
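The recognition head can be sketched as concatenation, GAP, a fully connected layer, and softmax, followed by a loss on one-hot labels. The loss is not named in the text, so cross-entropy is assumed here as the usual choice for softmax outputs; the weights are random, and the four toy feature maps are assumed to share one spatial size so that they can be concatenated along the channel axis.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def recognize(feature_maps, fc_weight, fc_bias):
    """Concatenate along channels, apply GAP, flatten, FC layer, softmax."""
    fused = np.concatenate(feature_maps, axis=1)  # (B, sum of C, H, W)
    pooled = fused.mean(axis=(2, 3))              # GAP gives a flat (B, sum of C)
    logits = pooled @ fc_weight.T + fc_bias
    return softmax(logits)

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy between one-hot labels and predicted probabilities."""
    return float(-(y_true * np.log(y_pred + eps)).sum(axis=1).mean())

rng = np.random.default_rng(0)
num_classes = 7  # e.g. the seven RSSCN7 categories
maps = [rng.normal(size=(4, 16, 7, 7)) for _ in range(4)]  # toy enhanced maps
fc_weight = rng.normal(size=(num_classes, 64)) * 0.1
fc_bias = np.zeros(num_classes)
probs = recognize(maps, fc_weight, fc_bias)
labels = np.eye(num_classes)[[0, 3, 6, 2]]  # one-hot true classes
loss = cross_entropy(labels, probs)
print(probs.shape, loss)
```

With this loss, a uniform prediction over N classes gives a value of ln N, and the value approaches 0 as the predicted probability of the true class approaches 1.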

Datasets
To evaluate the recognition effect for CRABR-Net under different numbers of remote sensing scene categories and different amounts of remote sensing scene data, the proposed CRABR-Net is validated on the following three datasets.

1. AID Dataset. This is a large-scale aerial scene dataset collected from Google Earth images. It includes 30 categories of scene images covering targets such as landforms, terrain, and buildings, with approximately 220 to 420 images per category. The dataset contains 10,000 images in total, and each image is 600 × 600 pixels [51]. Figure 5 shows examples of the scene objectives for every category in this dataset;

2. UC-Merced Dataset. This is a land use image dataset extracted manually by researchers. The images reflect urban land use and cover 21 land use types with 100 images each. The dataset contains 2100 images in total, and each image is 256 × 256 pixels [52]. Figure 6 shows examples of the scene objectives for every category in this dataset;


3. RSSCN7 Dataset. This dataset consists of typical scene targets collected from Google Earth under diverse seasonal and weather conditions, which makes the data challenging to process. It contains seven scene types with 400 images each, for a total of 2800 images, and each image is about 400 × 400 pixels [53]. Figure 7 shows examples of the scene objectives for every category in this dataset.


Experimental Environment Setup
Our work was performed on a Linux platform equipped with four NVIDIA A100 GPUs. We used PyTorch, a deep learning framework released by Facebook, because it interoperates seamlessly with NumPy, accelerates training on GPUs, and uses dynamic computation graphs that make the network more flexible. To speed up convergence while reducing over-fitting, we used pretrained parameters and a distributed training approach, set the batch size to 64, used the Adam optimizer with L2 regularization of 0.0001 and a learning rate of 0.0003, and trained the network for 200 epochs.
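For reference, the training settings quoted above can be collected in a single configuration object. The following is only a minimal sketch: the class and field names are hypothetical, and mapping the L2 regularization onto the `weight_decay` argument of PyTorch's Adam optimizer is an assumption rather than something the text states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters quoted in the text; all field names are hypothetical."""
    num_gpus: int = 4
    batch_size: int = 64
    learning_rate: float = 0.0003
    l2_regularization: float = 0.0001
    epochs: int = 200
    pretrained: bool = True
    distributed: bool = True


def adam_kwargs(cfg: TrainConfig) -> dict:
    # Assumption: the L2 term is passed as Adam's `weight_decay` argument.
    return {"lr": cfg.learning_rate, "weight_decay": cfg.l2_regularization}


config = TrainConfig()
optimizer_kwargs = adam_kwargs(config)
```

Keeping the values in one frozen object makes it easier to hold them fixed across the ablation runs described later.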

Data Preprocessing
To demonstrate the advantages of the proposed method through comparative results, we adopted the training ratios used by many previous state-of-the-art algorithms on these datasets. Specifically, for the UC-Merced dataset, we set the ratio of training data to validation data to 1:1 and 4:1, and for the AID and RSSCN7 datasets, we set it to 1:4 and 1:1.
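The split ratios above can be illustrated with a small helper. This is a sketch under stated assumptions: the text does not say whether the split is drawn per class, so the per-class (stratified) sampling, the function name `split_by_ratio`, and the fixed seed are all illustrative choices.

```python
import random
from collections import defaultdict


def split_by_ratio(labels, train_parts, val_parts, seed=0):
    """Split sample indices per class so that train:val is train_parts:val_parts."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, val = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = len(indices) * train_parts // (train_parts + val_parts)
        train.extend(indices[:cut])
        val.extend(indices[cut:])
    return sorted(train), sorted(val)


# Toy labels with the RSSCN7 layout: 7 classes with 400 images each.
labels = [c for c in range(7) for _ in range(400)]
train_idx, val_idx = split_by_ratio(labels, 1, 4)    # 1:4, i.e. 20% for training
half_train, half_val = split_by_ratio(labels, 1, 1)  # 1:1, i.e. 50% for training
```

With the 1:4 ratio, each class contributes 80 training images and 320 validation images.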
An insufficient amount of data can easily lead to an under-fitted model. To minimize this risk, we used data augmentation techniques from the image processing domain to generate new training samples. Specifically, we increased data diversity by applying rotation, translation, and flipping to all the data in the training set before feeding it into the proposed model, and resized all images to 224 × 224 pixels. In addition, we converted all data to tensor format and normalized them to facilitate data processing and ensure faster convergence.
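The augmentation and normalization steps can be sketched on a tiny single-channel image stored as nested lists. This is an illustration only, not the authors' pipeline: the helper names, the 90-degree rotation angle, the one-pixel shift, and the mean and standard deviation of 0.5 are all assumptions, and resizing is omitted.

```python
def hflip(img):
    """Mirror each row (horizontal flip)."""
    return [row[::-1] for row in img]


def rotate90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def translate_right(img, shift, fill=0):
    """Shift pixels right by `shift` columns, padding the left edge with `fill`."""
    width = len(img[0])
    return [([fill] * shift + row)[:width] for row in img]


def normalize(img, mean, std):
    """Scale 8-bit values to [0, 1], then standardize with `mean` and `std`."""
    return [[(p / 255.0 - mean) / std for p in row] for row in img]


image = [[0, 64], [128, 255]]
augmented = [hflip(image), rotate90(image), translate_right(image, 1)]
normalized = normalize(image, mean=0.5, std=0.5)
```

Each transform returns a new image, so the original sample and its augmented variants can all be added to the training set.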

Performance Evaluation Metrics
To demonstrate the validity of our proposed method, CRABR-Net, we evaluated it quantitatively with several important metrics: Accuracy, Confusion Matrix (CM), Precision, Recall, and Specificity.

Accuracy
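To validate the recognition performance of the model on the validation dataset, we calculated the recognition accuracy as follows:

$$\mathrm{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left(f(x_i) = y_i\right) \quad (13)$$

where $y_i$ is the category of remote sensing scene sample $x_i$, $N$ is the total number of remote sensing scene samples, $f$ is the function that predicts the category, and $\mathbb{1}(\cdot)$ equals 1 when its argument is true and 0 otherwise.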
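The accuracy definition can be written directly as a short function. This is a minimal sketch; the toy predictor and the sample values are made up for illustration.

```python
def accuracy(predict, samples, labels):
    """Fraction of samples whose predicted category equals the true category."""
    correct = sum(1 for x, y in zip(samples, labels) if predict(x) == y)
    return correct / len(labels)


# Hypothetical predictor: classifies a number by its parity.
def toy_predict(x):
    return x % 2


samples = [1, 2, 3, 4]
labels = [1, 0, 0, 0]          # the third label disagrees with the predictor
acc = accuracy(toy_predict, samples, labels)
```

Here three of the four predictions match their labels, so `acc` is 0.75.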

Confusion Matrix
To determine which classes the model misidentified and to estimate the probability of misidentifying samples in each class, we constructed CMs for the three datasets at different training ratios using PyTorch 3.7. The vertical axis represents the true category of the remote sensing scene objective, and the horizontal axis represents the category identified by our method.
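A confusion matrix with this layout (rows for true categories, columns for predicted categories) can be built as follows. This is a plain-Python sketch with made-up labels; the row normalization, which turns counts into per-class rates, is an assumption about how the misidentification probabilities are obtained.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Count matrix: rows are true categories, columns are predicted categories."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


def row_normalize(matrix):
    """Convert each row of counts into per-class rates that sum to 1."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
rates = row_normalize(cm)
```

The diagonal of `rates` gives the per-class recognition accuracy, and each off-diagonal entry gives the rate at which one class is mistaken for another.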

Precision, Recall, Specificity
Precision is the percentage of samples predicted as positive that are truly positive. The higher the precision, the more accurate the positive predictions.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (14)$$

where $TP$ is the number of positive samples identified as positive, and $FP$ is the number of negative samples predicted as positive.

Recall is a measure of coverage: it is the percentage of positive samples that are identified as positive. The higher the recall, the more complete the search.

$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (15)$$

where $FN$ is the number of positive samples identified as negative.

Specificity indicates the ability to predict negative cases (the higher, the better).

$$\mathrm{Specificity} = \frac{TN}{TN + FP} \quad (16)$$

where $TN$ is the number of negative samples identified as negative.
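For a multi-class problem, these three metrics can be computed per category from a confusion matrix by treating that category as positive and all others as negative. The text does not state how the binary definitions are extended to multiple classes, so this one-vs-rest reading is an assumption, and the matrix values below are made up.

```python
def one_vs_rest_counts(matrix, cls):
    """TP, FP, FN, TN for class `cls`; rows are true labels, columns are predictions."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[cls][cls]
    fn = sum(matrix[cls]) - tp
    fp = sum(row[cls] for row in matrix) - tp
    tn = total - tp - fn - fp
    return tp, fp, fn, tn


def precision_recall_specificity(matrix, cls):
    """Per-class metrics; a ratio with an empty denominator is reported as 0.0."""
    tp, fp, fn, tn = one_vs_rest_counts(matrix, cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return precision, recall, specificity


cm = [[1, 1, 0],
      [0, 2, 0],
      [1, 0, 1]]
precision, recall, specificity = precision_recall_specificity(cm, cls=1)
```

For class 1 in this toy matrix, TP = 2, FP = 1, FN = 0, and TN = 3.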

Analysis of Accuracy
Given the characteristics of the proposed method, we compared it against three types of related methods under the same proportion of training data: single CNNs, multiple CNNs, and CNNs combined with AM, and analyzed the accuracy of CRABR-Net in three typical scenarios. The specific comparison is given below.
Table 1 gives the results of scene objective recognition using CNNs for the AID dataset. Of the three datasets, AID is the most challenging because it has more sample classes and a larger number of samples. As shown in Table 1, among single CNNs, CaffeNet, GoogLeNet, and VGG-VD-16 all use the top-level features of CNNs for scene recognition, and VGG-16 additionally uses pretrained parameters. Among multi-CNN methods, [54] uses two deep networks to learn different features of the same data separately and fuses the two deep features for scene recognition; [55] fuses local binary pattern features of remote sensing image data for classification; and [10] achieves scene recognition by cascading a CNN and a CapsNet. Among CNNs combined with AMs, Wang et al. [18] improved classification performance by applying an AM to top-layer features to selectively focus on key regions; Sun et al. [56] combined three layers of convolutional features to form new features for scene recognition and added two auxiliary linear classifiers to promote network convergence; and [57] applied a self-attention mechanism combined with an SVM to achieve scene recognition. CRABR-Net achieved strong performance in this task: with 20% of the dataset used for training, the accuracy is about 94.02%, and with 50% used for training, the accuracy is about 96.46%.
Table 1. Scene objective recognition accuracy on the AID dataset.

Modes | Solutions | Accuracy (%), 20% training | Accuracy (%), 50% training
Multiple CNNs | GBNet [54] | 90.16 ± 0.24 | 93.72 ± 0.34
Multiple CNNs | GBNet + global feature [54] | 92.20 ± 0.23 | 95.48 ± 0.12
CNN with AM | AlexNet + SAFF [57] | 87.51 ± 0.36 | 91.83 ± 0.27
CNN with AM | VGG_VD16 + SAFF [57] | 90.28 ± 0.29 | 93.83 ± 0.28
CNN with AM | ARCNet-VGG16 [18] | 88.75 ± 0.40 | 93.10 ± 0.

Sensors 2023, 23, 7514

The results of scene objective recognition using CNNs for the UC-Merced dataset are given in Table 2. As shown in Table 2, two approaches are proposed in [58]: one performs scene recognition using the fusion of feature maps from various convolutional layers, and the other collects feature maps from various layers separately and then fuses them to perform scene recognition with the fused features. CRABR-Net achieved strong performance in the UC-Merced scene recognition task: with 50% of the dataset used for training, the accuracy is about 98.06%, and with 80% used for training, the accuracy is about 99.20%.

Table 2 (columns: Modes, Solutions, Accuracy).
Table 3 gives the results of scene objective recognition using CNNs for the RSSCN7 dataset. In [59], scene recognition is achieved by fine-tuning the MobileNet V2 network and then using top-level features; Gao et al. [60] use channel attention and spatial attention to extract important feature information; in [61], a bilinear structure is built from depthwise separable convolution and regular convolution, and the features of both branches are fused for scene recognition; Liu et al. [62] propose a weighted spatial pyramid matching classification method based on collaborative representation. In [63], the features of each branch of the CaffeNet and VGG-VD-16 networks are fused separately, and then the features of both branches are fused to form new features for scene recognition; Xu et al. [64] use a CNN and a graph neural network in parallel to achieve scene recognition. As shown in Table 3, CRABR-Net achieved strong performance in the RSSCN7 scene recognition task: with 20% of the dataset used for training, the accuracy is about 93.21%, and with 50% used for training, the accuracy is about 95.43%.

Table 3 (columns: Modes, Solutions, Accuracy).

Analysis of Confusion Matrix
To analyze the recognition accuracy of CRABR-Net for each sample category in the three datasets, we constructed prediction CMs for each dataset.
Figure 8 shows the CMs generated under different proportions of AID training data. When the training data amount to 50% of all data, 27 remote sensing scene objective types are recognized by our proposed method with an accuracy close to 100%; when the training data amount to 20% of all data, seven types are recognized with 100% accuracy and eighteen types with more than 90% accuracy. Categories such as "BareLand", "MediumResidential", "River", "StorageTanks", "Viaduct", and "Bridge" are difficult to recognize because of the large amount of overlap in the content of the image data; despite this, our method achieves a recognition accuracy of nearly 90% on them.
Figure 10 shows the CMs generated with different proportions of RSSCN7 training data. Because of the overlap between the contents of "Industry", "Resident", and "Parking", the accuracy of "Industry" is close to 90% when the training data account for 20% of the total data. When the proportion of training data is increased to 50%, the accuracy of our proposed approach is greater than 90% for all remote sensing scene objective types.

Discussion
To evaluate our proposed method rigorously, we conducted ablation studies on three aspects: the backbone model used to extract features, the attention mechanism used in preprocessing, and the two modules used to calculate relational feature maps.

Effects of Backbone Network
To demonstrate the advantage of the Se-ResNext-50 model in our proposed approach, we compared it with ResNet-50 and its improved model. Specifically, the UC-Merced dataset was split into training data and validation data in a 1:1 ratio, while the feature preprocessing module and the two relational feature calculation modules were kept unchanged. The optimizer, learning rate, and other settings were also kept unchanged; only the backbone network for extracting features was replaced. Each model was trained for 200 epochs to obtain the RSSOR accuracy results shown in Figure 11.

The left panel in Figure 11 shows the recognition accuracy of the different backbone networks on the training data, while the right panel shows their recognition accuracy on the validation data. The solid lines indicate training with pretrained parameters, and the dashed lines indicate training without them. Under the same conditions, the Se-ResNext-50 model with pretrained parameters not only converges quickly and smoothly on both the training and validation data but also achieves the highest recognition accuracy. The chosen convolutional backbone therefore offers a clear advantage.
... lower than when no attention mechanism was used. Therefore, we chose SimAM, which has a facilitating effect, in the preprocessing stage.

Effects of MLP, GAP, and GMP
In the enhanced relationship feature map calculation, the number of feature channels input to the MLP is 256. We therefore set seven different scaling values and, using the UCM dataset under the same training conditions, obtained the accuracy of the model for each channel scaling ratio. As can be seen from Table 5, the model has the highest accuracy when the scaling ratio is 16. To verify the effect of GAP and GMP on accuracy, we designed three combinations and trained them under the same conditions, as shown in Table 6. When both GAP and GMP are used, both the local and the global enhancement coefficients of the input features take part in the relationship enhancement computation, which yields the highest accuracy.
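The two pooling operations and the effect of the scaling ratio can be illustrated with a small plain-Python sketch. It is not the authors' module: the function names are hypothetical, the feature map is a toy example, and reading the scaling ratio as the factor by which a bottleneck MLP reduces its 256 input channels is an assumption.

```python
def global_average_pool(feature_map):
    """GAP: one mean value per channel; feature_map is indexed [channel][row][col]."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]


def global_max_pool(feature_map):
    """GMP: one maximum value per channel."""
    return [max(map(max, ch)) for ch in feature_map]


def mlp_hidden_width(channels, ratio):
    # Assumption: the MLP squeezes `channels` down to `channels // ratio` units.
    return channels // ratio


feature_map = [
    [[1, 2], [3, 4]],   # channel 0
    [[0, 0], [0, 8]],   # channel 1
]
gap = global_average_pool(feature_map)
gmp = global_max_pool(feature_map)
hidden = mlp_hidden_width(256, 16)
```

Channel 1 shows why the two descriptors differ: its average is low because most positions are zero, while its maximum keeps the single strong response.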

Effects of Feature Fusion Strategy
To analyze the influence of multilevel enhanced relationship features on scene recognition under different fusion strategies, we took the concatenation of all four feature levels as the baseline and carried out comparison experiments on the four high-level enhanced features with three-level fusion, two-level fusion, and no fusion.
We designed the model architecture for each of the four fusion strategies by enumerating the possible combinations of feature levels. When no features are fused, the channel dimension is at its minimum of 256; when two features are fused, it is 512; and when three features are fused, it is 768. We trained under the same conditions using 80% and 50% of the UCM data. Figure 12 lists some of the experimental results, which show that the enhanced features achieve high accuracy and that, under the same conditions, the recognition accuracy varies with the strategy used to combine the features. When all four feature levels are concatenated, the channel dimension reaches 1024; the resulting features fully integrate the relationship information between all levels, and after training, the model achieves the highest accuracy.
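The channel dimensions quoted for each fusion strategy follow from concatenating 256-channel feature maps. The sketch below enumerates the possible combinations; the level names `F1` to `F4` are hypothetical, and treating every subset of levels as a candidate is an assumption based on the mention of a combinatorial design.

```python
from itertools import combinations

CHANNELS_PER_LEVEL = 256
LEVELS = ("F1", "F2", "F3", "F4")  # hypothetical names for the four feature levels


def fusion_options(levels=LEVELS, channels=CHANNELS_PER_LEVEL):
    """Map each subset of levels to the channel dimension after concatenation."""
    options = {}
    for k in range(1, len(levels) + 1):
        for subset in combinations(levels, k):
            options[subset] = channels * k
    return options


options = fusion_options()
full_fusion_channels = options[LEVELS]
```

Concatenation adds channel counts, so fusing k levels gives 256 × k channels, which reproduces the values 256, 512, 768, and 1024 stated above.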

Effects of Calculation Module
To analyze the effect of our proposed complementary relationship and enhanced relationship modules on recognition, we evaluated four different combinations of the two modules with all other conditions unchanged and compared the recognition accuracy of each combination.
As shown in Figure 13, the "00" mode indicates that neither the complementary nor the enhanced relationship module is used; the "01" mode indicates that only the enhanced relationship module is used; the "10" mode indicates that only the complementary relationship module is used; and the "11" mode indicates that both modules are used. We conducted comparison experiments on the UC-Merced dataset to obtain the recognition accuracy for each category of scene targets. The figure shows that the "11" mode has relatively high accuracy and is more stable than the other modes.

Conclusions
The accuracy of scene objective recognition is limited both by the complexity of remote sensing scene image data and by the simplistic use of features from each CNN layer. To address this issue, we use the convolutional features of the upper layer to complement those of the lower layer. Complementary weights between adjacent layers are calculated using the self-attention relation and the global attention relation and are then assigned to the adjacent layers to obtain complementary relationship feature maps. The global and local features of the underlying layers are extracted to form guide coefficients, which are fused with the features of the upper layers to obtain enhanced relationship feature maps. Finally, the features are fused, and scene objective recognition is performed with a softmax classifier. By using the complementary relationships between contextual features together with enhanced relational information, the network captures the key contents of scene objectives and strengthens the representation of deep features, which further improves the performance of CNN-based scene recognition. Experimental results on three common benchmark datasets (AID, UC-Merced, and RSSCN7) indicate that CRABR-Net can fully utilize the powerful learning ability of CNNs and achieve higher recognition accuracy. In future work, we will investigate various network architectures to further enhance the efficiency of remote sensing scene objective recognition by fusing and optimizing different networks.

Figure 5. Instances of the scene objectives within the AID dataset.

Figure 6. Instances of the scene objectives within the UC-Merced dataset.

Figure 7. Instances of the scene objectives within the RSSCN7 dataset.

Figure 8. CMs on the AID dataset.

Figure 9 shows the CMs generated under different proportions of UC-Merced training data. All types of remote sensing scene objectives are recognized by our proposed method with no less than 90% accuracy; 12 types are recognized with 100% accuracy when the training data amount to 50% of all data, and 16 types are recognized with 100% accuracy when the training data amount to 80% of all data.

Figure 9. CMs on the UC-Merced dataset.

Figure 12. Accuracy for different feature fusion strategies.

Figure 13. Per-category accuracy with different module combinations.

Table 2. Scene objective recognition accuracy on the UC-Merced dataset.

Table 3. Scene objective recognition accuracy on the RSSCN7 dataset.

Table 5. Accuracy at different MLP scaling ratios.

Table 6. Accuracy with different combinations of GAP and GMP.