High-Resolution Remote Sensing Image Segmentation Framework Based on Attention Mechanism and Adaptive Weighting

Abstract: Semantic segmentation is widely used as a basic task for extracting information from images. Despite this progress, two challenges remain: (1) it is difficult for a single-size receptive field to acquire sufficiently strong representational features, and (2) the traditional encoder-decoder structure directly integrates shallow features with deep features; however, because shallow features pass through only a few network layers, their representation ability is weak, and they introduce noise that degrades segmentation performance. In this paper, an Adaptive Multi-Scale Module (AMSM) and an Adaptive Fuse Module (AFM) are proposed to solve these two problems. AMSM adopts the idea of channel and spatial attention: it sets up three branches with different atrous (dilation) rates, adaptively fuses them, and flexibly generates weights according to the content of the image. AFM uses deep feature maps to filter shallow feature maps, obtaining weights for the deep and shallow maps that effectively suppress noise in the shallow features. Based on these two modules, we carried out extensive experiments. On the ISPRS Vaihingen dataset, the F1-score and Overall Accuracy (OA) reached 86.79% and 88.35%, respectively.


Introduction
Semantic segmentation of remote sensing images assigns a category to each pixel in the image, thereby completing a pixel-level classification task. Its applications are extensive, including vegetation extraction and monitoring [1], urban planning [2,3], and building extraction [4,5], among others.
In recent years, with the development of deep learning, remote sensing image segmentation algorithms based on deep learning have advanced rapidly, and the importance of multi-scale information for detecting targets of different scales has gradually emerged. Many methods applicable to image semantic segmentation have been proposed. In 2015, Long et al. proposed the Fully Convolutional Network (FCN) [6], which replaces the fully connected layers at the end of a traditional CNN with convolutional layers and uses an end-to-end deep convolutional neural network to complete semantic segmentation tasks. After that, Ronneberger et al. proposed the U-Net [7] network, a semantic segmentation model based on encoding and decoding. This model uses skip connections to connect the features obtained by the decoder with the corresponding feature maps of the encoder at each level. In this way, the semantic information between different levels can be fully utilized, and the loss of detailed information can be remedied well. After this, PSPNet [8] and DeepLab [9] further explored the encoder-decoder structure. PSPNet uses a spatial pyramid module to aggregate contextual information from different areas, thereby obtaining global information. Similarly, DeepLabv3+ [10] obtains multi-scale feature information through Atrous Spatial Pyramid Pooling (ASPP). In the semantic segmentation of high-resolution images, the Dense Pyramid Network (DPN) [11] processes multi-sensor data to extract feature maps of each channel separately.
Recently, in an end-to-end framework, the clustered detection network (ClusDet) [12] realized the detection of clustered multi-scale targets by unifying multi-scale normalized clustering with implicit models.
Although the semantic segmentation of remote sensing images has made considerable progress, there are still two limitations.
On the one hand, almost all remote sensing images are high-resolution images, in which the multi-scale phenomenon of objects is very obvious, as shown in Figure 1a. Therefore, it is difficult for a single-sized receptive field to obtain object features with sufficient characterization ability. The Atrous Spatial Pyramid Pooling (ASPP) [10] structure obtains the multi-scale features of the image to a certain extent through atrous convolutions with progressively increasing rates [9,10,13,14]. However, this method uses fixed weights to fuse the multi-scale features of each branch for every image; it cannot adapt the weights to the diversity of object sizes. Therefore, this strategy is not optimal for recognizing remote sensing images. On the other hand, the traditional encoder-decoder [15] structure directly merges the deep feature map and the shallow feature map through addition (ADD) or concatenation (CAT). Although this achieves a fusion of shallow and deep feature maps to a certain extent, the fusion is not selective. The shallow layers carry more detailed information, but because they pass through few network layers and a limited number of convolutions, their feature extraction ability is weak, and their feature maps contain considerable noise that harms segmentation. Therefore, it is necessary to filter the noise in shallow feature maps through information selection before fusing the feature maps.
In order to overcome the above shortcomings, we propose two structures. Aiming at the first shortcoming, we propose an Adaptive Multi-Scale Module (AMSM). Based on classic multi-branch feature extraction, we adaptively generate a different fusion weight ratio for each image according to the image scale. For example, for remote sensing images dominated by large-scale objects, the large-scale branch receives a larger fusion weight; similarly, for images dominated by small objects, the small-scale branch is weighted more heavily.
Through the above methods, the AMSM module uses an adaptive way to solve the multi-scale feature problem of remote sensing images.
Aiming at the second shortcoming, it is considered that although the deep feature map has lost some detailed information, it has better feature discrimination. Therefore, we propose the adaptive fuse module (AFM). This module uses deep features to filter shallow feature maps. After the noise information is filtered out, the feature maps are fused.
Through the above ideas, the AFM can solve the noise problem in the shallow features of remote sensing images well.
In summary, the main contributions of this paper are as follows: (1) A novel multi-scale fusion module, the AMSM (Adaptive Multi-Scale Module), is proposed, which can adaptively fuse multi-scale features from different branches according to the size characteristics of remote sensing images and achieves better segmentation on datasets with complex and variable object sizes. (2) We designed an AFM (Adaptive Fuse Module) that can filter and extract shallow information from remote sensing images. This module combines shallow and deep feature information effectively: after obtaining the weights of the shallow and deep layers, these weights are multiplied by the original feature maps to emphasize the useful information in the shallow feature map and suppress useless noise, so that the deep feature map can obtain more accurate detailed information. (3) A new network structure, the Adaptive Weighted Network (AWNet), is proposed, which embeds AMSM and AFM. AWNet achieved one of the best accuracies on the ISPRS Vaihingen dataset, reaching an overall accuracy of 88.35%.

Related work
In this part, we introduce the development of semantic segmentation structures and the attention mechanism [16] in order to better situate our work.

Semantic Segmentation
In recent years, with the development of deep learning and the computing power of graphics processing units, semantic segmentation has also made considerable progress. In 2015, FCN replaced the fully connected layers of classic classification networks with convolutional layers and achieved end-to-end training [6], which became the pioneering work of semantic segmentation.
Later, on this basis, DeepLabv3+ [10], MSCI [17], SPGNet [18], RefineNet [19], and DFN [20] all adopted encoder-decoder structures for dense prediction. Among them, both RefineNet and Global Convolutional Networks (GCN) [21] successively reached state-of-the-art performance. Gradually, multi-scale applications of semantic segmentation also made new progress. In order to handle the various scales and deformations of segmented objects, Deformable Convolutional Networks (DCN) [22] and scale-adaptive convolutions (SAC) [23] were used to improve the standard convolution operator. Soon after, CRF-RNN [24] and DPN [25] used graph models for semantic segmentation. In order to capture and match the semantic relationship between adjacent pixels in the label space, Adaptive Affinity Fields (AAF) [26] achieve this goal using adversarial learning. BiSeNet [27] targets real-time semantic segmentation. DenseDecoder [28] built feature-level long-range skip connections on a cascade architecture for the first time, further improving segmentation. Later, CE2P [29] proposed a network structure that achieves both edge detection and contextual embedding, an efficient and concise framework. Clearly, semantic segmentation has made significant progress in various fields.

Attention Module
At present, the attention mechanism is widely used in computer vision and natural language processing. The Squeeze-and-Excitation module was a landmark innovative design, consisting of three parts: squeeze, excitation, and attention (feature rescaling). For multi-label classification tasks, Hao Guo et al. used attention consistency [30] to make up for the defects of data augmentation in image classification. This model adopts a dual-branch structure and uses two heat maps generated by CAM [31] so that the network still focuses on the same parts after data augmentation. Subsequently, in order to localize the objects shared between images, Bo Li et al. set update and reset gates [32] to continuously update the hidden unit, integrating the information of all images and then regressing parameters to guide the generation of predicted values for each sample. For multi-task learning, Liu S et al. attached an attention module to each task as a feature selector [33], making it possible to extract task-specific features. Lu, Xiankai et al. proposed a co-attention [34] module, which aligns adjacent frames and then integrates the information between them to achieve unsupervised video object segmentation. For the target localization task, although each channel can respond to a specific object, the noise of a single channel is too large; Heliang Zheng et al. [35] use the idea of self-attention, regard each channel as a spatial attention map corresponding to a specified part, and realize adaptive, unsupervised localization of each part of the object.

Spatial Pyramid Pooling and Atrous Convolution
In earlier CNN structures, the network could only accept input images of a fixed size, which makes it difficult to meet the needs of modern computer vision. In order to recognize multi-scale objects, spatial pyramid pooling (SPP) was proposed on the basis of the convolutional neural network [26]. The SPP structure can take different sizes of the same image as input and output the same pooled features: regardless of the input image size, the features after SPP have a fixed-size output. Finally, all the segmentation results are merged to obtain the semantic segmentation result of the original input image.
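To make the fixed-length property concrete, here is a minimal numpy sketch of SPP-style pooling. The pyramid levels (1, 2, 4) and the use of max pooling are illustrative assumptions, not a configuration taken from the paper:

```python
import numpy as np

def spp_pool(feature_map, levels=(1, 2, 4)):
    """Spatial pyramid pooling sketch: max-pool a C x H x W feature map
    into a fixed-length vector, independent of H and W."""
    c, h, w = feature_map.shape
    pooled = []
    for n in levels:
        # split the map into an n x n grid and max-pool each cell
        rows = np.array_split(np.arange(h), n)
        cols = np.array_split(np.arange(w), n)
        for r in rows:
            for col in cols:
                cell = feature_map[:, r[0]:r[-1] + 1, col[0]:col[-1] + 1]
                pooled.append(cell.max(axis=(1, 2)))
    # length = C * sum(n*n for n in levels), regardless of H and W
    return np.concatenate(pooled)

# Two inputs of different spatial size yield vectors of identical length.
v1 = spp_pool(np.random.rand(8, 32, 32))
v2 = spp_pool(np.random.rand(8, 48, 57))
assert v1.shape == v2.shape == (8 * (1 + 4 + 16),)
```

The key point is that the grid is defined relative to the input size, so the output dimension depends only on the number of channels and pyramid levels.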
Later, the concept of parallel sampling was proposed, and an upgraded version of SPP, ASPP, appeared. In this module, the input feature map is sampled in parallel by atrous convolutions with different rates; after each branch is sampled, the branch results are fused by pixel-wise addition to obtain the final prediction. ASPP thus makes full use of atrous convolution, effectively expanding the receptive field without increasing the number of parameters while merging more context information. Atrous convolution is a convolution method born in the field of image segmentation. In a conventional pipeline, a convolutional neural network extracts features from the input image while pooling expands the receptive field and reduces the image size; upsampling then restores the image size to generate the output. However, due to the limitations of the upsampling algorithm, many details are lost in this process. Atrous convolution addresses this by expanding the receptive field without downsampling. It has an important parameter, the rate r: when r = 1, it is a standard convolution; when r > 1, the kernel taps are spaced r pixels apart, skipping (r − 1) pixels between samples. This is the same idea as dilated convolution [13].
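The effect of the rate r can be shown with a tiny 1-D sketch (the function name and 1-D setting are our own illustration; the paper works with 2-D convolutions):

```python
import numpy as np

def atrous_conv1d(x, kernel, rate):
    """1-D atrous (dilated) convolution sketch: the kernel taps are spaced
    `rate` pixels apart, enlarging the receptive field at no parameter cost."""
    k = len(kernel)
    span = (k - 1) * rate + 1          # effective kernel size
    out = np.empty(len(x) - span + 1)
    for i in range(len(out)):
        out[i] = sum(kernel[j] * x[i + j * rate] for j in range(k))
    return out

x = np.arange(10, dtype=float)
# rate = 1 reduces to a standard convolution (correlation form)
std = atrous_conv1d(x, [1.0, 1.0, 1.0], rate=1)
# rate = 2 skips every other pixel: span = 5, still only 3 parameters
dil = atrous_conv1d(x, [1.0, 1.0, 1.0], rate=2)
assert len(std) == 8 and len(dil) == 6
```

With three weights the rate-2 kernel covers a span of five pixels, which is exactly the "larger receptive field, same parameter count" trade-off described above.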

Materials and Methods
Inspired by the attention mechanism [30], we propose the AMSM module and the AFM module. In this part, we describe the realization of the model. First, we introduce the overall architecture used to test both modules, namely the AWNet workflow. After that, the network architecture of the AMSM module is introduced. Finally, the construction principle of the AFM module is explained.

Overview
As shown in Figure 2, this network is mainly composed of three parts: a ResNet preprocessing encoder based on residual blocks; the AMSM; and the AFM with its attention module and upsampling block. The encoder-decoder structure is one of the most common structures in segmentation networks, as it can extract effective global and local information from high-resolution remote sensing images. In the encoder part, the network ensures the effective use of semantic and spatial information. First, we output different levels of semantic information through the ResNet-101 [36] network. Then, AMSM, stacked in the encoder part, serves as a feature extractor: each level performs multi-scale feature extraction on the semantic information of the fixed-dimensional image, ensuring that the fusion adapts to changes in image scale. Next, AFM fully integrates deep and shallow semantic information; this module is a stronger feature extractor, designed to ensure consistent resolution. After this processing, the number of channels of the feature map is reduced to the number of categories. The decoder part uses upsampling to restore the feature map to the original image size and outputs the final result. Experiments demonstrate that our proposed AWNet, a new network structure combining AMSM and AFM, is very effective and practical in processing the spatial and semantic information of remote sensing images.
At the same time, it is important to consider low-level details while preserving high-level semantic information in order to achieve more accurate semantic segmentation, especially for high-resolution remote sensing images, which contain more detailed information than natural images. In general, deeper networks extract stronger features; however, because of vanishing gradients, training results can be unsatisfactory, as shown in Figure 1. This problem can be addressed with residual neural networks, as shown in Figure 3.
The mechanism of residual blocks can be expressed by the following formula:

X_{L+1} = f(H(X_L) + F(X_L, W_L))

Here, X_L and X_{L+1} represent the input and output of a residual block, and each residual block may contain a multi-layer structure. The residual function F is computed from the weights W_L and the output X_L of the previous layer. H(X_L) is the identity input of this layer of the network. If the expected output, Y_L, is a complex latent mapping, such a model is difficult to train directly; using the input H(X_L) of this layer directly as the initial result of its output effectively improves training. f(x) is the ReLU activation function.
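The formula above can be sketched in a few lines of numpy. The two-layer form of F and the matrix sizes are illustrative assumptions; only the identity shortcut and the ReLU wrap-up follow the formula:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Residual block sketch: X_{L+1} = f(H(X_L) + F(X_L, W_L)),
    with H the identity shortcut, F a two-layer transform, f = ReLU."""
    f_x = relu(x @ w1) @ w2        # residual function F(X_L, W_L)
    return relu(x + f_x)           # identity shortcut H(X_L) = X_L

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))
w1 = rng.standard_normal((16, 16)) * 0.1
w2 = rng.standard_normal((16, 16)) * 0.1
y = residual_block(x, w1, w2)
assert y.shape == x.shape
# With zero weights the block reduces to ReLU(x), so the identity path
# always survives and gradients can flow even when F learns nothing.
assert np.allclose(residual_block(x, np.zeros((16, 16)), np.zeros((16, 16))), relu(x))
```

The zero-weight check is the reason residual learning eases training: the block only has to learn a correction on top of the identity.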

Preprocessing
Before the formal training, we used the pre-trained dilated ResNet-101 [37] to preprocess the input image and extract semantic features in the global scope. The entire semantic input stream can be expressed as:

F = S_θ(I)

Here, I ∈ R^(H×W×3) represents the original input remote sensing image; W and H represent the width and height of the input image, respectively; θ represents the parameters of the semantic input stream; F ∈ R^((H/8)×(W/8)×2048) represents the output semantic feature map; and S_θ(I) represents the preprocessing process of the ResNet semantic stream under parameters θ. The independent variable I is the original semantic input stream.
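The shape notation above corresponds to an output stride of 8. A trivial sketch of the bookkeeping (the function name is our own, and we assume the input sides are divisible by 8):

```python
def semantic_stream_shape(h, w):
    """Output shape of the dilated ResNet-101 stream with output stride 8:
    I in R^(H x W x 3)  ->  F in R^((H/8) x (W/8) x 2048)."""
    assert h % 8 == 0 and w % 8 == 0, "input sides assumed divisible by 8"
    return (h // 8, w // 8, 2048)

# e.g. a 512 x 512 crop produces a 64 x 64 x 2048 semantic map
assert semantic_stream_shape(512, 512) == (64, 64, 2048)
```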

Adaptive Multi-Scale Module (AMSM)
Because the ASPP structure directly integrates multiple scales (different atrous rates), it does not guarantee adaptive fusion of the branch information, which leads to inconsistencies within a class. In order to solve the problem of feature differences between objects with the same label, we designed the AMSM structure, which adaptively optimizes features using the attention mechanism. Figure 4 shows this structure in detail. It combines a spatial attention module and a channel attention module; to fully enhance the representational power of the module, the two are used together. The structure consists of three parallel branches, and the weights obtained from the spatial attention mechanism are multiplied pixel by pixel with the outputs of the parallel branches. After that, the residual network module multiplies the feature map pixel by pixel with the weights generated by channel attention. Finally, the feature map is restored to the size of the original input image through the Fuse block (a 1 × 1 convolutional layer and a 3 × 3 convolutional layer in series). For the three branches of AMSM, each branch corresponds to a different atrous rate, so that different weights are obtained according to the object sizes in the remote sensing image.
The input image passes through a spatial attention module whose output has three channels. We take the output X_i (i = 1, 2, 3) as the feature of each branch, and the three output channels are regarded as the weights of their respective spatial positions. Then, we use the spatial attention mechanism to generate the weight SA_weight [37], as shown in the right part of Figure 4. After obtaining the spatial attention weight, it is multiplied by each branch to achieve fusion. Figure 5 shows the process of obtaining the spatial attention weights and channel attention weights. The input then passes through a channel attention module for further screening, with each layer outputting Y_i (i = 1, 2). The fused feature of each spatial attention channel is N_i = SA_i · X_i. Further, the outputs of the three branches are added and fused, and f(X) is the skip connection.
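The per-pixel weighting N_i = SA_i · X_i and the summation over branches can be sketched as follows. The softmax normalization of the three spatial maps is our own assumption for making the branch weights comparable; the paper only specifies that the weights are generated adaptively per image:

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def amsm_fuse(branches, logits):
    """AMSM-style adaptive fusion sketch: per-pixel weights over the three
    atrous branches, so each image gets its own fusion ratio.
    `branches`: list of 3 maps, each (C, H, W); `logits`: (3, H, W) scores."""
    sa = softmax(logits, axis=0)          # weights sum to 1 at each pixel
    # N_i = SA_i * X_i, then the three branch outputs are added and fused
    fused = sum(sa[i] * branches[i] for i in range(len(branches)))
    return fused, sa

rng = np.random.default_rng(1)
xs = [rng.standard_normal((8, 16, 16)) for _ in range(3)]
logits = rng.standard_normal((3, 16, 16))
fused, sa = amsm_fuse(xs, logits)
assert fused.shape == (8, 16, 16)
assert np.allclose(sa.sum(axis=0), 1.0)
```

Because the weights vary per pixel, a region dominated by large objects can lean on the large-rate branch while a nearby region of small objects leans on the small-rate branch.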
Figure 5. The generation mechanism of spatial attention weight (SA weight) and channel attention weight (CA weight) [36] in the Adaptive Multi-Scale Module. The numbers marked below the blocks are the sizes of the convolution kernels.
Finally, the output results are convolved with the residual structure through two layers to achieve the effect of reducing the number of channels.
It is worth mentioning that AMSM is very simple to use and does not require too many additional parameters or computations. For various network models, there are two common embedding methods. One is to add AMSM after each convolution layer of some network structure. The other is to add AMSM between the two blocks of the remaining network.
The channel attention module generates channel attention feature maps by exploiting the relationships between channels. Each channel of the feature map is treated as a feature detector; channel attention focuses mainly on what is meaningful in the input image. It uses two common operations to aggregate spatial information, namely max pooling and average pooling.
Spatial attention differs from the channel attention described above in that it mainly focuses on positional information in the input image. It first uses average pooling and max pooling to obtain two different feature descriptions; these two pooled descriptions are concatenated along the channel dimension; finally, a convolution over the concatenated maps generates the spatial attention map.
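The two pooling directions can be contrasted in a short sketch: channel attention pools over the spatial axes, spatial attention pools over the channel axis (the function names are our own; the subsequent learned layers of the two modules are omitted):

```python
import numpy as np

def channel_descriptors(x):
    """Channel attention aggregates spatial information twice: max pooling
    and average pooling over H x W, giving two C-dimensional descriptors."""
    return x.max(axis=(1, 2)), x.mean(axis=(1, 2))

def spatial_descriptors(x):
    """Spatial attention pools along the channel dimension instead, giving
    two 1 x H x W maps that are concatenated along the channel axis."""
    return np.stack([x.max(axis=0), x.mean(axis=0)])  # (2, H, W)

x = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)  # (C, H, W)
c_max, c_avg = channel_descriptors(x)
s = spatial_descriptors(x)
assert c_max.shape == c_avg.shape == (2,)
assert s.shape == (2, 4, 4)
```

In both modules these descriptors are then fed through small learned layers (omitted here) and a sigmoid to produce the final attention weights.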

Adaptive Fuse Module (AFM)
The input of AFM consists of semantic feature maps produced by different convolution kernel sizes. The architecture of the Adaptive Fuse Module is shown in Figure 6, where the numbers mark the sizes of the convolution kernels in each block. As can be seen, in this module we obtain feature maps of different resolutions from two branches, namely deep and shallow feature maps. To combine them, the two branches must have the same size, so the deep feature map is upsampled first to restore its spatial size. Shallow feature maps contain useful edge information and details but also troublesome noise. Therefore, we filter the shallow feature map with the help of the deep features, removing the unnecessary noise and retaining only the required details, and then perform the fusion operation. The weight generation process is shown in Figure 6a. The Fuse block applied after fusing the deep and shallow feature maps consists of three 1 × 1 convolutional layers in series. It should be noted that these feature maps are added pixel by pixel rather than simply concatenated.
There are several reasons why we chose this method. First, after data normalization, the weights of the two branches can be obtained easily. In addition, the computational cost is reduced while the sizes are unified. However, features processed in this form are not well suited to further computation and extraction; therefore, in order to better aggregate spatial information, we add a convolutional layer after fusion. We no longer use a single branch to compute the global semantic and spatial information; instead, weights are collected from the deep and shallow layers at multiple scales. The final feature map is the sum of the feature maps of the two branches, as shown in Figure 6b.
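The filter-then-add scheme can be sketched as follows. The sigmoid gate computed as a weighted channel sum of the deep features and the nearest-neighbour upsampling are our own simplifying assumptions; in the paper the weights come from learned convolutional blocks (Figure 6a):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def afm_fuse(shallow, deep_small, gate_weights):
    """AFM-style fusion sketch: upsample the deep map to the shallow map's
    resolution, derive a per-pixel gate from the deep features, use it to
    suppress noise in the shallow map, then add the maps pixel by pixel.
    `gate_weights` (a 1x1-conv surrogate, shape (C,)) is a stand-in."""
    # nearest-neighbour upsampling by 2 in both spatial dimensions
    deep = deep_small.repeat(2, axis=1).repeat(2, axis=2)
    # gate in (0, 1): a weighted channel sum of deep features per pixel
    gate = sigmoid(np.tensordot(gate_weights, deep, axes=1))   # (H, W)
    filtered = gate * shallow      # keep details the deep map supports
    return filtered + deep         # pixel-wise addition, not concatenation

rng = np.random.default_rng(2)
shallow = rng.standard_normal((4, 8, 8))
deep_small = rng.standard_normal((4, 4, 4))
out = afm_fuse(shallow, deep_small, rng.standard_normal(4))
assert out.shape == (4, 8, 8)
```

Because the gate is derived from the semantically stronger deep map, shallow responses that the deep map does not support are attenuated before the addition, which is exactly the noise-filtering behaviour described above.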

Experiments
To verify the validity of our model, we conducted a series of experiments on the ISPRS Vaihingen dataset. The dataset is available from http://www2.isprs.org/commissions/comm3/wg4/2d-sem-label-vaihingen.html (accessed 25 November 2020). In A, we introduce the dataset and evaluation metrics; in B, the preprocessing of the dataset; and in C, the specific hyperparameters used in the experiments.
The ISPRS Vaihingen dataset contains a total of 33 image tiles, each with a corresponding semantic annotation. Figure 7 shows the input image of a sample from the ISPRS dataset and the ground truth corresponding to this image. We conducted extensive experiments on the ISPRS Vaihingen dataset to evaluate our proposed model; the following experiments use it as the benchmark.

Experiments Sets
(1) Database Sets. The ISPRS Vaihingen dataset is composed of 33 aerial images, as shown in Table 1. These images have a spatial resolution of 9 cm and cover an area of 1.38 square kilometers. The average size of each image is 2494 × 2064 pixels, and each image has three bands: green (G), near infrared (NIR), and red (R). It is worth mentioning that, to keep the model general, we did not use the DSM data in the experiments conducted in this article. The dataset is divided into training data (IDs 1, 3, 11, 13, 15, 17, 21, 26, 28, 30, 32, 34) and validation data (IDs 5, 7, 23, 37). Meanwhile, in our study, all pixels in the images are divided into six categories: impervious surface (white), building (blue), low vegetation (cyan), tree (green), car (yellow), and background (red), as shown in Figure 8.
(2) Evaluation indicators. To better evaluate our model, we used the F1-score and Overall Accuracy (OA) as accuracy metrics. The two metrics are briefly introduced below.
The F1-score is a very important indicator in classification problems. It takes both precision and recall into account; both values lie between 0 and 1, and the closer a value is to 1, the higher the accuracy. The two quantities used to calculate the F1-score, precision and recall, are defined as:

precision(c) = (TP_c / P_c) × 100%
recall(c) = (TP_c / T_c) × 100%

where TP_c is the number of pixels of category c correctly predicted by the model, P_c is the total number of pixels the model predicts as category c, and T_c is the total number of pixels labeled as category c in the sample. When both precision and recall must be considered, the model's F1-score can be used to judge its quality. The F1-score is the harmonic mean of precision and recall, defined as:

F1 = 2 × precision × recall / (precision + recall)

Here, precision represents the accuracy of the model, i.e., the proportion of correct results among all results predicted by the model, and recall is the proportion of the sample's ground-truth labels that the model predicts correctly. Because F1 is the harmonic mean of precision and recall, the F1-score is high only when precision and recall are balanced.
Overall Accuracy represents the proportion of correctly classified samples among all samples. This indicator reflects the correctness of the overall classification of the map and is a coarse overall measure. The overall accuracy is the percentage of pixels whose class is predicted correctly, defined as:

OA = (T / A) × 100%

where T is the number of pixels with a correctly predicted category and A is the total number of pixels.
For a fair evaluation of the model, the mean F1-score is obtained by averaging the F1-scores of all categories. It is worth noting that the higher the F1-score, the better the model.
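To make the metrics concrete, the per-class precision, recall, and F1-score, together with the overall accuracy, can be computed as in the following sketch. The counts used here are illustrative only, not actual Vaihingen results.

```python
# Per-class precision/recall/F1 and overall accuracy, following the
# definitions above: TP_c = correctly predicted pixels of class c,
# P_c = pixels predicted as class c, T_c = ground-truth pixels of class c.

def precision(tp, predicted):
    return tp / predicted if predicted else 0.0

def recall(tp, actual):
    return tp / actual if actual else 0.0

def f1_score(p, r):
    # Harmonic mean: high only when precision and recall are balanced.
    return 2 * p * r / (p + r) if (p + r) else 0.0

def overall_accuracy(correct, total):
    # OA = T / A: fraction of all pixels predicted with the correct class.
    return correct / total

# Illustrative counts for one class (not actual dataset numbers).
tp, predicted, actual = 80, 100, 90
p = precision(tp, predicted)      # 0.8
r = recall(tp, actual)            # ~0.889
f1 = f1_score(p, r)
oa = overall_accuracy(850, 1000)  # 0.85
```

The mean F1-score reported in our tables would then be the average of `f1` over the six categories.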

Data Set Preprocessing
Due to the limited Graphics Processing Unit (GPU) memory, we cut the input images into crops of a fixed pixel size with a sliding window and then fed them into our model for training and validation. Similar to current mainstream processing methods, we used some of the more common data augmentation strategies, such as Gaussian blur, image rotation, random cropping, horizontal flip, vertical flip, 90-degree rotation, and grid mask. These methods not only augment the data but also prevent over-fitting to a certain extent.
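The sliding-window cropping step can be sketched as follows. This is a simplified illustration: the window size and stride are arbitrary example values, not the exact settings used in our experiments, and a real pipeline would also pad the image border and read the pixel data.

```python
# Enumerate the (top, left) coordinates of fixed-size crops produced by
# sliding a window over a large image. Border pixels that do not fill a
# complete window are ignored here; a real pipeline would pad them.

def sliding_window_coords(height, width, window, stride):
    coords = []
    for top in range(0, max(height - window, 0) + 1, stride):
        for left in range(0, max(width - window, 0) + 1, stride):
            coords.append((top, left))
    return coords

# A 2494 x 2064 Vaihingen tile cut into 512-pixel crops with stride 512
# (example values) yields a 4 x 4 grid of non-overlapping windows.
coords = sliding_window_coords(2494, 2064, window=512, stride=512)
```

Choosing a stride smaller than the window size would instead produce overlapping crops, which is a common way to increase the number of training samples.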

Implementation
We adopted the following training strategy. For the optimizer, we chose Adam with the recommended parameter settings and set the initial learning rate to 1e-3. The model was trained on a single NVIDIA Tesla V100. We set the batch size to 3 and trained for a total of 50 epochs, and we stopped training when the validation loss stopped decreasing. To reduce oscillation in the later stages of training, we adopted an adaptive learning-rate decay strategy, reducing the learning rate when the validation loss saturated. We used U-Net with ResNet-101 as our baseline. As in similar studies, a weighted cross-entropy loss function was used to train the whole model. We implemented our network in PyTorch. After setting the above parameters, we trained and tested AWNet; training took about 50 hours, and testing took about 20 minutes.
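The "decay the learning rate when the validation loss saturates, then stop when it no longer decreases" schedule can be sketched in pure Python as below. This is a simplified stand-in for plateau-based schedulers such as PyTorch's `torch.optim.lr_scheduler.ReduceLROnPlateau`; the patience, decay factor, and stopping threshold are illustrative, not our exact settings.

```python
# Minimal sketch of adaptive learning-rate decay with early stopping:
# if the validation loss fails to improve for `patience` epochs, the
# learning rate is multiplied by `factor`; after `max_drops` decays with
# still no improvement, training stops. All thresholds are illustrative.

class PlateauScheduler:
    def __init__(self, lr=1e-3, factor=0.1, patience=2, max_drops=2):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.max_drops = max_drops
        self.best = float("inf")
        self.bad_epochs = 0
        self.drops = 0

    def step(self, val_loss):
        """Record this epoch's validation loss; return False to stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
            return True
        self.bad_epochs += 1
        if self.bad_epochs >= self.patience:
            if self.drops >= self.max_drops:
                return False          # early stop: loss no longer decreases
            self.lr *= self.factor    # adaptive learning-rate decay
            self.drops += 1
            self.bad_epochs = 0
        return True

sched = PlateauScheduler(lr=1e-3)
for loss in [0.9, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8]:
    if not sched.step(loss):
        break  # validation loss plateaued through two decays
```

In the trace above the loss stops improving after epoch 2, so the learning rate is decayed twice before training halts.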

Ablation Study for Relation Modules
We evaluated each component of the model, using ResNet-101 as our baseline and adding AFM and AMSM to enhance the consistency of the model. To verify the performance of each proposed variant, we conducted a series of ablation experiments. The results of the different models on the Vaihingen dataset are presented in Table 2. The overall accuracy of ResNet-101 + AMSM + AFM in the ablation experiments is 88.35%, which is better than ResNet-101, ResNet-101 + AMSM, and ResNet-101 + AFM. As shown in Figure 9, in comparison with the ground truth, when only the baseline is used for segmentation, the adhesion between two similar objects that are close together is obvious, and there is clear noise at the edges. After adding AMSM or AFM, the adhesion is reduced and the independence of each object improves. As shown in Figure 9f, after using AMSM and AFM together, there is almost no adhesion between the edges of two nearby objects, and the noise at the edge of each object is also significantly reduced, making the boundaries of the segmentation results clearer.

Comparing with Existing Works
In order to make a more comprehensive evaluation of our research, we first compared our model with five landmark networks [6,7,14,15,38]; the test results are shown in Table 3, and the output images are shown in Figure 10. We also compared our model with five existing models based on improvements to basic networks, including FCN with fully connected CRF (FCN-dCRF), spatial propagation CNN (SCNN) [39], FCN with atrous convolution (Dilated FCN) [9], FCN with feature remake (FCN-FR) [40], a CNN with patch labeling learned through upsampling (CNN-FPL) [41], and PSPNet with VGG16 as the backbone network [42]; the test results are shown in Table 4. The numerical results on the Vaihingen dataset are shown in Tables 3 and 4. The results show that, whether compared with landmark classic networks or with improved networks based on them, our model outperforms the other methods in mean F1-score and overall accuracy. For example, compared with FCN-dCRF and SCNN, the mean F1-score of our proposed network increased by 1.70% and 1.92%, respectively, which verifies the high performance of the spatial relationship modules in our network.
This shows that the integration of the AMSM and AFM relationship modules is effective.
Our model has obvious advantages in dealing with small objects. Specifically, the "car" category is difficult to handle in the Vaihingen dataset because, compared with the other categories, cars are relatively small objects. As shown in Table 3 and Figure 11, the other categories contain far more pixels than the "car" category, and the objects within this category vary widely; for example, the diversity of car colors in the images leads to large intra-class differences. Our proposed method achieves an accuracy of 82.22% on the car category, significantly higher than the other models, which demonstrates the effectiveness of our method on small targets.
In addition, the qualitative results are shown in Figure 11. In the first row, although the low-vegetation area contains complex local context information and is easily misidentified, our network obtains more accurate results than the other methods: by using global relations, it alleviates the vision-blurring problem [43,44,45], and the phenomenon of category misclassification [46] is greatly reduced. Moreover, the edges produced by our model are clearer and more coherent, which shows that the model can eliminate outliers and that noise in the detail information has less impact on the result.

Conclusions and Future Work
In this paper, we propose two effective network modules to solve the noise and classification problems in remote sensing images: the Adaptive Multi-Scale Module (AMSM) and the Adaptive Fuse Module (AFM). The AMSM can adaptively generate spatial weights, which gives better segmentation on datasets with complex and variable object sizes. The AFM, which filters and extracts the shallow information of remote sensing images, is also designed. This module can effectively remove the noise in shallow feature maps and, with good robustness, complement the details missing in deep feature maps. Both relationship modules learn the global relationship information between the target and the feature map. Verified on the Vaihingen dataset, the network with both relationship modules identifies smaller targets better while still maintaining good overall accuracy. Moreover, the multi-scale convolutional feature network with AMSM and AFM is superior to the other models both visually and numerically: AWNet's mean F1-score reached 86.79%, and its OA reached 88.35%. However, our understanding of how these two modules deal with segmentation problems in remote sensing images is not yet complete, and further research is needed.