Dual Attention Feature Fusion and Adaptive Context for Accurate Segmentation of Very High-Resolution Remote Sensing Images

Land cover classification of high-resolution remote sensing images aims to obtain pixel-level land cover understanding and is often modeled as semantic segmentation of remote sensing images. In recent years, convolutional neural network (CNN)-based land cover classification methods have advanced considerably. However, previous methods fail to generate fine segmentation results, especially for pixels near object boundaries. To obtain boundary-preserving predictions, we first propose to incorporate spatially adaptive contextual cues. In this way, objects with similar appearance can be effectively distinguished with the extracted global contextual cues, which are very helpful for identifying pixels near object boundaries. On this basis, low-level spatial details and high-level semantic cues are fused with the help of our proposed dual attention mechanism. Concretely, when fusing multi-level features, we utilize a dual attention feature fusion module based on both spatial and channel attention mechanisms to relieve the influence of the large semantic gap and further improve the segmentation accuracy of pixels near object boundaries. Extensive experiments were carried out on the ISPRS 2D Semantic Labeling Vaihingen data and GaoFen-2 data to demonstrate the effectiveness of our proposed method. Our method achieves better performance than other state-of-the-art methods.


Introduction
With the development of very high resolution (VHR) remote sensing technology, large amounts of VHR satellite remote sensing images are obtained every day [1]. Semantic segmentation is a computer vision task that predicts the semantic category of every pixel in an image, and such comprehensive image understanding is essential for many vision-based applications such as orbital remote sensing, autonomous driving [2,3], medical image analysis, and so on [4][5][6]. However, semantic segmentation of VHR remote sensing images with complex scenes still faces many challenges, such as the low accuracy and low speed of multi-category semantic segmentation.
Traditional machine learning-based methods [7][8][9] rely on human experience and complex feature engineering. The segmentation performance mainly depends on whether researchers can obtain accurate features of their targets. Since feature extraction is done manually by the researcher, these human-designed features may fail to handle various complex applications. With the development of deep learning [10][11][12][13][14], many CNN-based methods [15][16][17][18][19][20][21] have been applied to the semantic segmentation of VHR remote sensing images. Previous methods [22][23][24] have used ConvNets for semantic segmentation, in which each pixel is labeled with the class of its enclosing object or region. The fully high-level and high-resolution feature map [26]. There is a semantic gap between the feature maps in different layers, which may have a negative influence on the result.
To address the above problems and improve the segmentation accuracy of targets, we propose a novel framework for remote sensing image segmentation, which is illustrated in Figure 1. Inspired by [34], we adopt an adaptive context aggregating module to reweigh pixels in different channels with the weight vectors generated by the global context information. We introduce matrix multiplication to generate the spatially-varying feature weight factors, which are utilized as the parameters of a series of dilated depth-wise convolutions with different dilation factors to capture information in multiple scales. In this way, we integrate the contextual cues into the feature maps with predictable and input-variant convolutions and the module re-weighs features at different spatial locations automatically for fine semantic segmentation results. Furthermore, we introduce a channel attention module to enhance the consistency of the feature maps.

Moreover, in order to make better use of the multi-scale feature maps in CNNs, we adopt a dual attention feature fusion module that fuses feature maps from different layers based on both channel and spatial attention mechanisms. Our goal is to import extra low-level details at object boundaries, where it is difficult to obtain accurate category labels for the pixels. As the feature maps at two different levels of a CNN may have a semantic gap, we utilize a matrix multiplication mechanism to measure the relevance of the two feature maps in both the channel and spatial dimensions, which is the basis of the weight vectors. For channel attention, we capture the dependencies between any two channels in the different layers and update the lower channel features with the weight vector. For spatial attention, any two positions in the different layers with similar features can contribute mutual improvement regardless of their spatial distance. Finally, the outputs of these two attention modules are combined to form the final output. This feature fusion module aims at reducing the semantic gap for fine segmentation results.
Overall, the contributions of this paper are summarized as follows:
• In order to utilize global contextual cues, we integrate the contextual cues with spatially-varying feature weighting factors.
• To improve the classification accuracy of pixels near object boundaries, we propose a multi-scale feature fusion module based on attention mechanisms in both the spatial and channel dimensions.
• To validate the effectiveness of our method, we conduct extensive experiments on the ISPRS 2D Semantic Labeling Vaihingen data [35] and GaoFen-2 data [36]. The results show that our method leads to significant improvements and demonstrate its effectiveness and robustness.
This article is organized as follows: Section 2 is about related work. In Section 3, we introduce our semantic segmentation framework in detail. Section 4 presents the results of experiments. In Section 5, we analyze the results of the experiments and discuss future work. Finally, we summarize the paper in Section 6.

Related Work
In this section, we review works related to semantic segmentation on three different aspects: FCN-based semantic segmentation, contextual cue extraction, and multi-scale feature fusion.

Semantic Segmentation
Fully convolutional network (FCN) [25] based methods have made significant progress in semantic segmentation. FCN replaces the fully connected layers of a traditional classification network with convolutional layers to produce a segmentation result, and achieves end-to-end training by up-sampling the output feature map to match the resolution of the input image. However, the spatial resolution is reduced by the down-sampling and pooling operations. To improve the resolution of the output, researchers have adopted a variety of methods. Badrinarayanan et al. [15] utilize convolution and deconvolution layers to construct a symmetric auto-encoder architecture, which maintains high-frequency details of the input image. Ronneberger et al. [16] adopt an encoder-decoder architecture to improve segmentation results. In order to preserve more detailed information, Chen et al. [18,20] adopt atrous spatial pyramid pooling (ASPP), which consists of parallel dilated convolutions with different dilation rates, to expand the receptive fields and embed contextual cues. However, drawbacks such as the gridding effect bring new challenges for improving the accuracy of the segmentation results. Chen et al. [21] employ a new joint upsampling module to solve the above issue caused by dilated convolution. Moreover, Zhao et al. [17] propose a pyramid pooling module to collect an effective contextual prior containing information at different levels. The ASPP module [18] has been utilized in many methods [37,38] to capture multi-scale contextual cues from the final convolutional feature map. Lin et al. [39] and Ding et al. [40] obtain context at different scales by fusing different feature maps.
Meanwhile, in remote sensing, researchers [41,42] are also inspired by the development of segmentation in natural scenes. Wang et al. [43] present a gated convolutional neural network to automatically select adaptive features when merging different-layer feature maps. Panboonyuen et al. [27] introduce the global convolutional network to capture different resolutions by extracting multi-scale features for better results on remotely sensed images. Li et al. [44] present an auto encoder-based architecture of deep learning that makes extensive use of residual learning and multiscaling for better semantic segmentation of remote sensing images. Kang et al. [45] design the dense spatial pyramid pooling to extract dense and multi-scale features simultaneously and use the focal loss to suppress the impact of the error labels in ground truth.

Contextual Cues Modeling
Although FCN [25] based methods have made great progress in semantic segmentation, some new problems have emerged as the research has developed. A series of convolution and down-sampling operations capture information with larger receptive fields. However, they still cannot take advantage of global or long-range contextual cues effectively. Liu et al. [46] encode the global pooling feature, which is concatenated with the original feature maps to integrate the global context. PSPNet [17] adopts a spatial pyramid pooling module consisting of a series of pooling operations to collect contextual cues at different scales. The Deeplab series [18,20] develops ASPP to obtain multi-scale contextual cues by dilated convolutional layers with different dilation rates. Yang et al. [47] and Bilinski et al. [48] encode contextual cues in a dense way. Huang et al. [28] propose a network structure whose multiple branches with different atrous rates can share a single kernel effectively. In order to increase the receptive field size, Peng et al. [33] directly utilize a large filter to capture the contextual cues. Although the above papers utilize various methods to capture the contextual cues, they treat all pixels in each sub-region with uniform weight for feature aggregation and therefore cannot capture the information in each channel with different weight vectors.
To solve this issue, some researchers try to aggregate features in an adaptive and flexible way. He et al. [29] adopt an adaptive context module to estimate inter-pixel affinity weights for feature aggregation. Zhang et al. [49] propose an aggregated co-occurring feature (ACF) module to aggregate the co-occurrent context. Based on the ACF module, Zhang et al. [50] propose the attentional class feature module to make different pixels adaptively focus on different class centers to improve the semantic segmentation. Zhao et al. [51] predict an attention map to aggregate contextual cues for each pixel. Fu et al. [52] propose a dual attention module, consisting of a position attention module and a channel attention module based on the self-attention mechanism, to aggregate features. These methods aggregate contextual cues robustly and encode the long-range context, which boosts the segmentation performance.

Multi-Scale Feature Fusion
FCN-based methods utilize a series of convolution and pooling operations to obtain semantic information about the target. However, successive convolution and pooling operations reduce the feature resolution and discard detail information, which lowers the accuracy of the result. To solve this issue, it is essential to make use of both high-level categorical semantics and low-level spatial details.
Unet [16] adopts an encoder-decoder architecture with skip connections to combine categorical semantics and spatial details from feature maps at different scales. Lin et al. [53] adopt the same architecture as Unet [16] with predictions from each level of the feature pyramid. Lin et al. [19] propose a multi-path refinement network to exploit features at multiple levels of abstraction for high-resolution semantic segmentation. Panboonyuen et al. [27] extract multi-scale features from different stages of the network and fuse these features for better results. Wang et al. [43] adopt a gate mechanism to integrate the feature maps more effectively. In order to take advantage of the redundancy in the label space of semantic segmentation, Tian et al. [54] propose a data-dependent upsampling to replace the bilinear one. He et al. [55] propose a dynamic multi-scale network to adaptively capture and fuse multi-scale contents for predicting pixel-level semantic labels. Li et al. [26] propose a new architecture to selectively fuse features from multiple levels using gates in a fully connected way. Yu et al. [56] adopt an encoder-decoder architecture containing new modules to select the more discriminative features and make the bilateral features of the boundary distinguishable with deep semantic boundary supervision.

Methods
In this section, we first present a general framework of our network and then introduce the adaptive context aggregating module (ACAM) and the dual attention feature fusion module (DAFFM). Finally, we describe how to aggregate them together for further refinement.

Overview
In remote sensing, objects vary widely in scale, lighting, and viewing angle. High-resolution remote sensing images involve complex scenes where objects in the same category can be diverse in appearance and features. Meanwhile, different semantic categories may have similar features. Since a series of convolution operations only provides a local receptive field, the features corresponding to pixels with the same label may differ significantly, which brings additional difficulties for accurate pixel-level classification.
Feature re-weighing has proven to be an efficient approach to capture semantic contexts at different distances according to channel-wise weight factors derived from the global contextual cues. However, it has the limitation that the weight vector is shared by all spatial locations of the 2D feature map. In fact, the feature map of each channel has different contextual cues, and a single weight vector cannot make full use of these differences [34]. Thus, it is not a suitable choice to use a globally-shared weight to re-weigh different spatial locations belonging to objects of different categories. Therefore, it is necessary to aggregate the contextual cues for better segmentation according to two principles: (1) learning the feature weight factors from the global context and (2) capturing different locations' unique characteristics in a spatially-varying way. To achieve this goal, we propose the ACAM to obtain the channel-wise vectors to re-weigh the 2D feature maps within the global context.
The top layers in CNNs encode rich global category semantics. However, local spatial details are missing [57][58][59][60]. On the contrary, the lower-level feature maps capture rich spatial details. However, the lower-level feature layers fail to encode global semantic cues due to limited discriminative ability [19]. Therefore, it is essential to fuse the global semantic cues and local spatial details. The authors of [16] address this issue by concatenating different levels' feature maps, whose improvement is limited by the large semantic gap. Inspired by some successfully applied attention mechanisms [61][62][63], we introduce a feature fusion module based on both position and channel attention to effectively combine the feature maps at different scales.
As illustrated in Figure 1, we introduce the ACAM to capture multi-scale contextual cues. On this basis, the DAFFM is utilized to fuse the multi-scale feature maps with both spatial and channel attention mechanisms. We adopt a pre-trained residual network [13] with the dilated strategy as the backbone: the down-sampling operations in the last two layers are replaced with dilated convolutions, so the final feature map is 1/8 the size of the input image and retains more details without adding extra parameters. The backbone encodes each remote sensing image into a feature map $X \in \mathbb{R}^{c \times h \times w}$, where $h$, $w$, and $c$ are the height, width, and number of channels of the feature map. First, we adopt the ACAM to capture contextual cues in the top feature map $X$ and re-weigh it with the spatially-varying weights generated from the global context. Then, we feed the feature maps generated by the ACAM and the feature maps from the fourth layer of the backbone CNN into the DAFFM to achieve a balanced fusion of features at different levels. Finally, we obtain the segmentation result by concatenating the feature maps generated by the DAFFM.
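The 1/8 output size follows from the strides that remain once the last two stages are dilated. The short sketch below illustrates the arithmetic, assuming the usual ResNet layout of five stride-2 steps (stem convolution, max-pooling, and three strided stages); the function names are hypothetical and the stride lists are an assumption, not taken from the paper.

```python
def total_stride(strides):
    """Product of the per-stage strides of a backbone."""
    total = 1
    for s in strides:
        total *= s
    return total


def feature_map_size(height, width, output_stride=8):
    """Spatial size of the backbone output, using ceiling division."""
    return (-(-height // output_stride), -(-width // output_stride))


# Assumed standard ResNet strides: stem conv, max-pool, stages 2, 3, 4.
standard_strides = [2, 2, 2, 2, 2]
# Dilated variant: the last two stages keep stride 1 and use dilation instead.
dilated_strides = [2, 2, 2, 1, 1]

print(total_stride(standard_strides))  # 32
print(total_stride(dilated_strides))   # 8
print(feature_map_size(512, 512))      # (64, 64)
```

For a 512 × 512 input, the dilated backbone therefore keeps a 64 × 64 feature map instead of 16 × 16.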

Adaptive Context Aggregating Module
The adaptive context aggregating module (ACAM) consists of three submodules: channel re-weighing module, convolution kernels predicting module, and context-adaptive capturing module.
First, we utilize the channel re-weighing module based on channel attention to enhance the consistency of the feature maps in the top layer. We will illustrate the details of the re-weighing module in Section 3.4.
In order to control the computation cost and maintain the spatial information, we utilize matrix multiplication to predict the convolution kernels. The input feature map is first transformed into the query feature map $Q \in \mathbb{R}^{c \times h \times w}$ and the key feature map $K \in \mathbb{R}^{c \times h \times w}$, each implemented by a $1 \times 1$ convolution to reduce the computation.
To aggregate the global spatial information, we first reshape the feature maps $K$ and $Q$ into $K_1 \in \mathbb{R}^{c \times n}$ and $Q_1 \in \mathbb{R}^{c \times n}$, where $n = h \times w$. Then, we transpose $Q_1$ and multiply it with $K_1$ to generate the feature map $W$, as illustrated in Equation (1):

$$W = K_1 Q_1^{\top} \tag{1}$$

where $W \in \mathbb{R}^{c \times c}$. Each element of $W$ measures the similarity between the spatial distributions of one channel of $K$ and one channel of $Q$. In this way, the global spatial information of each channel is concentrated into the feature map $W$. Then, we expand the dimension of $W$ and obtain the feature map $W_1 \in \mathbb{R}^{1 \times c \times c}$. After that, we reduce the channels of $W_1$ to 9 by $1 \times 1$ convolutions and compress its dimension to obtain the feature map $W_2 \in \mathbb{R}^{c \times 9}$. We then obtain the predicted convolution kernel from $W_2$ by reshaping and a batch normalization operation. To re-weigh each pixel of the input feature maps, we adopt depth-wise convolution with the predicted convolution kernels to generate the spatially-varying weight map. Thus, each channel of the predicted kernel is utilized to re-weigh one channel of the input feature map. In this way, the weight vectors are independent of each other and the contextual cues can be aggregated according to their spatial information. Moreover, we denote the original kernels $S \in \mathbb{R}^{c \times 3 \times 3}$ with dilation rate 1 as $S_1$ and obtain $S_2$ and $S_3$ with dilation rates 2 and 3 to expand the receptive field without introducing extra parameters and computations.
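The kernel prediction step can be sketched in a few lines. The example below uses NumPy rather than the PyTorch implementation described in this paper; the function name `predict_kernels` and the matrix `proj`, which stands in for the learned 1 × 1 convolution that maps each row of W to nine kernel weights, are assumptions made for illustration, and the batch normalization step is omitted.

```python
import numpy as np


def predict_kernels(query, key, proj):
    """Predict one 3x3 depth-wise kernel per channel from the global context.

    query, key: arrays of shape (c, h, w), standing in for Q and K.
    proj: array of shape (c, 9), a hypothetical stand-in for the learned
          1x1 convolution that reduces each row of W to 9 values.
    """
    c, h, w = query.shape
    q1 = query.reshape(c, h * w)   # Q_1, shape (c, n)
    k1 = key.reshape(c, h * w)     # K_1, shape (c, n)
    w_mat = k1 @ q1.T              # W = K_1 Q_1^T, shape (c, c)
    w2 = w_mat @ proj              # W_2, shape (c, 9)
    return w2.reshape(c, 3, 3)     # kernels S, shape (c, 3, 3)


rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 5, 6))
K = rng.standard_normal((4, 5, 6))
P = rng.standard_normal((4, 9))
S = predict_kernels(Q, K, P)
print(S.shape)  # (4, 3, 3)
```

Because W has size c × c regardless of h and w, the cost of the projection does not grow with the image size; only the product K_1 Q_1^T depends on the number of pixels.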
As shown in Figure 2, we perform depth-wise convolution on the input feature map with the convolution kernels $S_1$, $S_2$, and $S_3$ independently and apply the sigmoid function to generate the weight feature maps $R_1, R_2, R_3 \in \mathbb{R}^{c \times h \times w}$, which are added to generate the final weight feature map $R \in \mathbb{R}^{c \times h \times w}$:

$$R = R_1 \oplus R_2 \oplus R_3 \tag{2}$$

where $\oplus$ represents the element-wise addition.

Figure 2. Overview of the proposed ACAM. To make better use of the contextual cues, we introduce ACAM to aggregate the contextual cues adaptively. Different from current context aggregating methods, the ACAM generates the weight vectors according to the global context cues of all feature channels. Moreover, the dilated depth-wise convolutions with different dilation factors are introduced to capture contextual cues. In this way, we integrate the contextual cues adaptively for better performance.

Finally, we obtain the output feature map $O \in \mathbb{R}^{c \times h \times w}$ by performing the element-wise multiplication between $R$ and the input feature map $X$, as shown in Equation (3):

$$O = R \odot X \tag{3}$$

where $\odot$ denotes the element-wise multiplication. Since the scenes in remote sensing applications are more complicated and the categories of targets are richer than in natural scenes, the contextual cues in the background of the target vary. Thus, we utilize more channels in the feature map $Q$ to explore more complex relationships between the different channels.
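The re-weighing step can be illustrated as follows. This is a minimal NumPy sketch, assuming zero padding that preserves the spatial size and cross-correlation as used by common deep learning frameworks; the function names are hypothetical, and the kernels are passed in directly instead of being predicted.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def depthwise_conv3x3(x, kernels, dilation):
    """Depth-wise 3x3 convolution with zero padding that keeps the spatial size.

    x: feature map of shape (c, h, w); kernels: shape (c, 3, 3), one per channel.
    """
    c, h, w = x.shape
    d = dilation
    padded = np.pad(x, ((0, 0), (d, d), (d, d)))
    out = np.zeros_like(x, dtype=float)
    for i in range(3):
        for j in range(3):
            # Kernel tap (i, j) reads the input shifted by ((i - 1) * d, (j - 1) * d).
            window = padded[:, i * d:i * d + h, j * d:j * d + w]
            out += kernels[:, i, j][:, None, None] * window
    return out


def acam_reweigh(x, kernels, dilations=(1, 2, 3)):
    """R is the sum of sigmoid(depth-wise conv) over the dilations; O = R * X."""
    r = sum(sigmoid(depthwise_conv3x3(x, kernels, d)) for d in dilations)
    return r * x


x = np.ones((2, 4, 4))
zero_kernels = np.zeros((2, 3, 3))
# With all-zero kernels every sigmoid output is 0.5, so R = 1.5 everywhere.
print(acam_reweigh(x, zero_kernels)[0, 0, 0])  # 1.5
```

The same kernel weights are reused for all three dilation rates, so the larger receptive fields come at no extra parameter cost. Note that, as the sum of three sigmoids, each entry of R lies between 0 and 3.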

Dual Attention Feature Fusion Module
The feature maps in deeper layers of CNNs encode richer semantic cues but with smaller spatial resolution. On the contrary, the spatial resolution of the lower feature maps is larger, but they lack semantic cues. Although the existing multi-scale feature fusion mechanism is a reasonable solution, the improvement is limited by the large semantic gap among the multi-level features.
To effectively combine the multi-scale feature maps, we propose the dual attention feature fusion module (DAFFM), as illustrated in Figure 3, which is based on both spatial and channel attention mechanisms. Given the deeper-level feature map $A$ and the lower-level feature map $B$, where $A$ is generated by the ACAM, we first re-weigh the lower feature map by the module illustrated in Figure 4 and utilize a $1 \times 1$ convolution to compress the channels to generate the feature map $B_1 \in \mathbb{R}^{c \times h \times w}$, where $c$, $h$, and $w$ represent the channels, height, and width of the feature map.

Figure 3. Overview of the proposed DAFFM. In order to make better use of multi-scale feature maps, we adopt the feature fusion module DAFFM based on spatial attention and channel attention mechanisms to fuse the detail and semantic information for better segmentation results at the boundary. Benefiting from the aggregation of detail and semantic information based on attention mechanisms, the semantic gap between different layers is reduced. In this way, we utilize the multi-scale feature maps more effectively.

[Figure 4: diagram of the channel re-weighing module. Legend: global average pooling, ReLU (rectified linear unit), sigmoid, element-wise addition, element-wise multiplication.]

Firstly, we fuse the feature maps based on the attention mechanism in the spatial dimension. We reshape $A$ and $B_1$ to $P \in \mathbb{R}^{c \times n}$ and $Q \in \mathbb{R}^{c \times n}$, where $n = h \times w$ represents the number of pixels in each feature map. After that, we apply matrix multiplication between $P$ and $Q$ and utilize a softmax layer to calculate the spatial attention map $S \in \mathbb{R}^{n \times n}$:

$$s_{ji} = \frac{\exp(Q_i \cdot P_j)}{\sum_{i=1}^{n} \exp(Q_i \cdot P_j)} \tag{4}$$

where $s_{ji}$ measures the relevance between the $i$-th position in the lower feature map and the $j$-th position in the higher feature map. It can be inferred from Equation (4) that each position of the final feature $O \in \mathbb{R}^{c \times h \times w}$ is a weighted sum of the features across all positions of the deeper-level features. As the final feature map is generated by the deeper-level feature map, the high-level semantics are well preserved in the outputs.
Then we transpose the feature map $Q$ to $V \in \mathbb{R}^{c \times n}$ for performing a matrix multiplication with the spatial attention map $S$ and reshape the output to generate the feature map $L \in \mathbb{R}^{c \times h \times w}$. Finally, we utilize an element-wise sum operation between $A$ and $L$ to obtain the final output $M \in \mathbb{R}^{c \times h \times w}$ as follows:

$$M_j = \alpha \sum_{i=1}^{n} \left( s_{ji} V_i \right) + A_j \tag{5}$$

where $\alpha$ is initialized as 0 and gradually learns to assign a reasonable weight factor, $V_i$ represents the feature at the $i$-th position in the lower feature map, and $A_j$ represents the feature at the $j$-th position of the deeper-level feature map.
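The spatial branch can be sketched as below. This is an illustrative NumPy version, not the implementation used in our experiments: the function names are hypothetical, the softmax is assumed to normalize over the positions of the lower feature map, and α is passed as a plain number instead of a learned parameter.

```python
import numpy as np


def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def spatial_attention_fusion(a, b1, alpha):
    """Fuse the deeper map a and the lower map b1, both of shape (c, h, w)."""
    c, h, w = a.shape
    p = a.reshape(c, h * w)           # P: deeper-level features, one column per position
    q = b1.reshape(c, h * w)          # Q: lower-level features, one column per position
    s = softmax(p.T @ q, axis=1)      # s[j, i]: weight of lower position i for deeper position j
    fused = (q @ s.T).reshape(c, h, w)  # column j is the sum over i of s[j, i] * Q[:, i]
    return alpha * fused + a          # M = alpha * L + A


rng = np.random.default_rng(0)
A = rng.standard_normal((3, 2, 2))
B1 = rng.standard_normal((3, 2, 2))
M = spatial_attention_fusion(A, B1, alpha=0.0)
print(np.allclose(M, A))  # True
```

With α = 0, as at initialization, the module passes the deeper-level map through unchanged, so the fusion is introduced gradually during training. The attention map has n × n entries, which is one reason to apply it on the 1/8-resolution feature maps.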
On the other hand, we reshape $B$ into $U \in \mathbb{R}^{c \times n}$ and then perform a matrix multiplication between $U$ and $V$. Then, we utilize a softmax layer to obtain the channel-wise attention map $R \in \mathbb{R}^{c \times c}$, where $c$ is the number of channels:

$$r_{ji} = \frac{\exp(U_i \cdot Q_j)}{\sum_{i=1}^{c} \exp(U_i \cdot Q_j)} \tag{6}$$

where $r_{ji}$ measures the $i$-th channel's impact on the $j$-th channel. Then we perform a matrix multiplication between the transpose of $R$ and $Q$ to generate the feature map $K \in \mathbb{R}^{c \times h \times w}$. We multiply the result by a scale parameter $\beta$ and perform an element-wise sum operation with $A$ to obtain the final output $N \in \mathbb{R}^{c \times h \times w}$:

$$N_j = \beta \sum_{i=1}^{c} \left( r_{ji} Q_i \right) + A_j \tag{7}$$

where $\beta$ gradually learns a weight from 0, $A_j$ represents the $j$-th channel of $A$, and $Q_i$ represents the $i$-th channel of $Q$. We can learn from Equation (7) that each channel of the final output is a weighted sum of the channels of $Q$. It models the relevance of different channels in the feature maps and helps to boost feature fusion. Finally, we obtain the final fusion result $O \in \mathbb{R}^{c \times h \times w}$ by concatenating $M$ and $N$.
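The channel branch and the final concatenation can be sketched in the same style. This NumPy example is a sketch under stated assumptions: the function names are hypothetical, the lower-level map `b` is assumed to already have c channels, the softmax is assumed to normalize over the channels of U, and β is passed as a plain number instead of a learned parameter.

```python
import numpy as np


def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def channel_attention_fusion(a, b, b1, beta):
    """Channel branch: a is the deeper map, b and b1 are lower maps, all (c, h, w)."""
    c, h, w = a.shape
    u = b.reshape(c, h * w)          # U: one row per channel
    q = b1.reshape(c, h * w)         # Q: one row per channel
    r = softmax(q @ u.T, axis=1)     # r[j, i]: impact of channel i on channel j
    k = (r @ q).reshape(c, h, w)     # channel j is the sum over i of r[j, i] * Q_i
    return beta * k + a              # N = beta * K + A


def fuse(m, n):
    """Concatenate the spatial output M and the channel output N along channels."""
    return np.concatenate([m, n], axis=0)


rng = np.random.default_rng(0)
A = rng.standard_normal((3, 2, 2))
B = rng.standard_normal((3, 2, 2))
B1 = rng.standard_normal((3, 2, 2))
N = channel_attention_fusion(A, B, B1, beta=0.5)
# A stands in for M here, which is what the spatial branch returns when alpha = 0.
print(fuse(A, N).shape)  # (6, 2, 2)
```

In contrast to the spatial branch, the attention map here has only c × c entries, so its cost does not depend on the number of pixels beyond the two matrix products.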
In summary, we utilize matrix multiplication to measure the relevance between feature maps with different scales in both spatial and channel dimensions, which is utilized as the guidance of the fusion operation. In this way, we improve the semantic segmentation accuracy in the boundary and reduce the negative influence of the semantic gap for feature fusion.

Channel-Wise Feature Re-Weighing
Generally, each channel of the feature map can be regarded as a class-specific response. In order to enhance the consistency of the feature maps in each layer, we utilize the channel attention module (CAM) to change the weights of the features in each channel, as illustrated in Figure 4. We first employ a global average pooling layer to squeeze the spatial information and then utilize the sigmoid function to generate the weight vectors, which are finally combined with the input feature maps by an element-wise multiplication operation to generate the output feature map. The overall information is integrated into the weight vectors and strengthens the feature maps that are more relevant to the ground truth.
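A minimal NumPy sketch of this re-weighing is given below. It follows only the steps named in the text (global average pooling, sigmoid, element-wise multiplication); any learned layers between the pooling and the sigmoid are omitted because the text does not specify them, and the function name is hypothetical.

```python
import numpy as np


def channel_reweigh(x):
    """Global average pooling -> sigmoid -> channel-wise multiplication.

    x: feature map of shape (c, h, w). Learned layers are omitted (assumption).
    """
    pooled = x.mean(axis=(1, 2))              # squeeze each channel to one value, shape (c,)
    weights = 1.0 / (1.0 + np.exp(-pooled))   # one weight in (0, 1) per channel
    return x * weights[:, None, None]         # scale every position of a channel equally


x = np.stack([np.zeros((2, 2)), np.ones((2, 2))])
out = channel_reweigh(x)
print(out.shape)  # (2, 2, 2)
```

Unlike the spatially-varying weights of the ACAM, this module applies a single weight to all positions of a channel.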

Implementation Details
Our implementation is based on PyTorch [64]. For better training results, we adopt the pre-trained ResNet model [13] to initialize the backbone CNN and initialize the other layers with a normal distribution. Moreover, we replace the down-sampling operations with dilated convolutions in the last two layers, and the number of training epochs, batch size, and initial learning rate are set to 200, 6, and 0.005, respectively. Many public datasets have been published to advance semantic segmentation in remote sensing. We select two widely used datasets to evaluate our proposed method: ISPRS 2D Semantic Labeling Vaihingen data [35] and GaoFen-2 dataset [36].
The ISPRS 2D Semantic Labeling Vaihingen data [35] is provided by the International Society for Photogrammetry and Remote Sensing. It consists of 33 high-resolution true orthophoto tiles and corresponding digital surface models (DSMs), as well as ground-truth labels, and we adopt the DSM band channel to perform our experiments. The labels are classified into 6 categories: impervious surfaces, building, low vegetation, tree, car, and clutter. We select 11 tiles for the training set and 5 tiles for the validation set. The rest of the tiles are used for testing the performance of the method. Based on the 33 tiles, we perform experiments on all 6 categories.
Compared with the ISPRS 2D Semantic Labeling Vaihingen data [35], the GaoFen-2 dataset [36] is more challenging. It contains 500 labeled satellite images of size 512 × 512 collected from the GaoFen-2 satellite over different geographic locations in China, which have large intra-class differences and small inter-class differences. Thus, GaoFen-2 [36] is a more convincing test of the effectiveness and robustness of our method. GaoFen-2 [36] is split into 400 training and 100 validation images with annotations containing 9 categories: road, building, tree, grass, bare land, water, transportation, impervious surfaces, and others.
In order to evaluate the performance of our method, we utilize overall pixel accuracy (OA) and mean intersection over union (mIOU) as the metrics. OA is the proportion of correctly classified pixels over all categories. For each category, the intersection over union is the intersection of the prediction and the target divided by their union; mIOU is the average of this ratio over all categories and is the main criterion for evaluating the performance of each method [17,52].
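Both metrics can be computed from a confusion matrix. The following NumPy sketch illustrates the definitions; the function names are hypothetical, and skipping categories that appear in neither the prediction nor the target is an assumption about how empty categories are handled.

```python
import numpy as np


def confusion_matrix(pred, target, num_classes):
    """Entry [t, p] counts pixels with ground-truth label t and predicted label p."""
    idx = target.ravel() * num_classes + pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def overall_accuracy(cm):
    """Correctly classified pixels divided by all pixels."""
    return np.trace(cm) / cm.sum()


def mean_iou(cm):
    """Mean over categories of intersection / union."""
    intersection = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - intersection
    present = union > 0  # skip categories absent from both prediction and target
    return (intersection[present] / union[present]).mean()


target = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
cm = confusion_matrix(pred, target, num_classes=2)
print(float(overall_accuracy(cm)))     # 0.75
print(round(float(mean_iou(cm)), 4))   # 0.5833
```

In this toy example, category 0 has an IoU of 1/2 and category 1 an IoU of 2/3, so a single misclassified pixel lowers mIOU more than OA.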

Experimental Results
In this section, to evaluate the proposed method, we carry out comprehensive experiments and evaluate the performance of our method qualitatively and quantitatively.

Ablation Study for Each Module
We employ the ACAM and DAFFM for better segmentation results. To verify the contribution of the two modules and to understand them better, we benchmark each module on the above two datasets. The experimental results are shown in Tables 1 and 2.

Baseline
The baseline has an architecture similar to U-net and is used to evaluate the effectiveness of the other components. It only fuses the feature maps of the last two layers by concatenating them directly. The baseline achieves an mIOU of 68.02% on ISPRS 2D Semantic Labeling Vaihingen data [35] and 53.94% on GaoFen-2 dataset [36].

Adaptive Context Aggregating Module
Compared with the baseline, we introduce the ACAM to capture the contextual cues in the last layer. As shown in Tables 1 and 2

Network with Full Architecture
We integrate the ACAM and DAFFM into the baseline to build the network with the full architecture of our method. Compared with the models mentioned above, our method utilizes the ACAM to capture multi-distance context adaptively and combines the lower features by the DAFFM for more accurate segmentation results. The two modules improve the performance in terms of context and multi-scale features, respectively, which improves the classification accuracy of pixels in both the internal and boundary regions. The experimental results demonstrate the effectiveness of our method, which obtains the best performance with an mIOU of 70.51% on ISPRS 2D Semantic Labeling Vaihingen data [35] and 56.98% on GaoFen-2 [36], improvements of 2.49 and 3.04 percentage points over the baseline, respectively.
In order to show the effect of each module directly, we visualize the comparison results on ISPRS 2D Semantic Labeling Vaihingen data [35], as illustrated in Figure 5. The DAFFM obtains better segmentation performance than the baseline, which is consistent with our analysis above. The ACAM also achieves visibly better performance than the baseline on some small targets, which we attribute to the aggregation of multi-scale contextual cues. By combining the advantages of the DAFFM and ACAM, our method obtains the best performance according to the visual segmentation results.
According to the analysis from different aspects, the ACAM and DAFFM each improve the segmentation accuracy. Moreover, it is consistent and feasible to combine both of them in the same architecture, which yields further improvement, and the DAFFM brings the largest improvement in our method. We adopt the two modules to obtain better results in different dimensions, and their effectiveness is demonstrated on two datasets [65,66]. In addition, the quantitative and qualitative experimental results further demonstrate the importance of multi-scale contextual cues and of features at different scales for remote sensing image segmentation.

Figure 5. Visualization of the segmentation results for the ablation study. From left to right: the input image, the ground-truth segmentations, and the results from variants of our method.

Comparing with State-of-the-Art Methods
In order to show the effectiveness of the proposed method, we further perform comparisons with state-of-the-art semantic segmentation methods on both the ISPRS 2D Semantic Labeling Vaihingen data [35] and GaoFen-2 [36] datasets. We select four popular methods, PSPNet [17], DeeplabV3 [21], Unet [16], and DANet [52], for comparison. We choose the mIOU of each category, the overall accuracy, and the mIOU of all categories as the metrics to evaluate the performance of each method, and the mIOU is the main evaluation metric between different categories, similar to [66]. Results on the ISPRS 2D Semantic Labeling Vaihingen data [35] and GaoFen-2 [36] datasets are shown in Tables 3 and 4, respectively. As shown in Tables 3 and 4, the DAFFM and ACAM have similar or better performance compared with most of the state-of-the-art semantic segmentation methods, and our method achieves the best performance among all the state-of-the-art methods mentioned above on both datasets [65,66]. We also visualize the segmentation results of each method on both datasets [65,66], as illustrated in Figures 6 and 7. The comprehensive comparison above further demonstrates the effectiveness and superiority of our method.

Figure 6. Visualization of the segmentation results for state-of-the-art methods based on ISPRS 2D Semantic Labeling Vaihingen data [35]. From left to right: the input image, the ground-truth segmentation results, the results from our method, and the results from the state-of-the-art methods.

Figure 7. Visualization of the segmentation results for state-of-the-art methods based on the GaoFen-2 dataset [36]. From left to right: the input image, the ground-truth segmentation results, the results from our method, and the results from the state-of-the-art methods.

Discussion
Previous methods [18,20,33] use various strategies to capture contextual cues. However, they aggregate features with a uniform weight for all pixels and therefore cannot weight the information in each channel with different weight vectors. To address this issue, we integrate the contextual cues using a spatially varying feature-weighting factor. Thanks to this adaptive aggregation, the context information in the background of objects can be aggregated with different weight vectors. Take the second and fourth rows of Figure 7 as examples, where impervious surfaces, roads, and buildings have a similar appearance. PSPNet [17] uses a spatial pyramid pooling module to aggregate multi-scale context, but it cannot aggregate context adaptively and selectively, which leads to misclassification. Compared with PSPNet [17], DeeplabV3 [21] introduces the ASPP module and obtains better segmentation results. However, it also cannot aggregate context adaptively, which weakens its ability to recover fine details and complete object shapes. Compared with these methods, our method alleviates this issue through adaptive contextual cue aggregation. Figures 6 and 7 illustrate that our method obtains finer segmentation results for some objects.
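The contrast between uniform and spatially varying context aggregation can be made concrete with a small NumPy sketch. This is only an illustration of the general idea under stated assumptions, not the ACAM itself: the function names, the `grid` parameter, and the use of a scaled dot-product affinity followed by a softmax are all hypothetical choices made for this example.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def uniform_context(feat):
    """Every pixel receives the same global-average context vector. feat: (C, H, W)."""
    ctx = feat.mean(axis=(1, 2), keepdims=True)           # (C, 1, 1)
    return np.broadcast_to(ctx, feat.shape)

def adaptive_context(feat, grid=2):
    """Each pixel mixes grid*grid pooled region vectors with its own weights.

    Assumes H and W are divisible by grid (hypothetical simplification).
    """
    c, h, w = feat.shape
    # Average-pool the feature map into grid x grid region descriptors.
    regions = feat.reshape(c, grid, h // grid, grid, w // grid).mean(axis=(2, 4))
    regions = regions.reshape(c, grid * grid)             # (C, K)
    pixels = feat.reshape(c, h * w)                       # (C, N)
    # Per-pixel affinity to every region; each row sums to 1.
    weights = softmax(pixels.T @ regions / np.sqrt(c), axis=1)   # (N, K)
    ctx = regions @ weights.T                             # (C, N)
    return ctx.reshape(c, h, w), weights.reshape(h, w, grid * grid)
```

With `uniform_context`, two pixels on opposite sides of an object boundary receive identical context, whereas `adaptive_context` lets each pixel emphasize the regions it resembles most, which is the property the discussion above relies on.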
Feature fusion is an effective strategy for obtaining accurate segmentation results. However, the methods widely used to fuse multi-scale feature maps [16,21,53] ignore the differences between feature maps in different channels, which limits the improvement in segmentation performance. To address this issue, we integrate the DAFFM, which fuses feature maps from different layers of the CNN using both channel and spatial attention mechanisms, to make better use of the multi-scale feature maps. As illustrated in Figures 6 and 7, Unet [16] fuses features by direct concatenation without distinguishing their differences and similarities, which leads to wrong predictions in the boundary regions of objects. In contrast, our method performs better in boundary regions, which we attribute to the effective feature fusion produced by the DAFFM.
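To illustrate how channel and spatial attention can jointly gate the fusion of a low-level and a high-level feature map, the following NumPy sketch may help. It is a simplified stand-in written for this discussion and should not be read as the DAFFM architecture: the weight arrays `w` and `v`, the sigmoid gating, and the convex-combination fusion rule are assumptions, and both inputs are assumed to have already been brought to the same shape.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w):
    """One gate per channel. feat: (C, H, W); w: (C, C) hypothetical projection."""
    pooled = feat.mean(axis=(1, 2))            # global average pooling -> (C,)
    return sigmoid(w @ pooled)                 # (C,)

def spatial_attention(feat, v):
    """One gate per pixel. v: (C,) hypothetical 1x1-convolution weights."""
    return sigmoid(np.tensordot(v, feat, axes=(0, 0)))   # (H, W)

def dual_attention_fuse(low, high, w, v):
    """Blend low-level and high-level maps with a channel-and-pixel-wise gate."""
    combined = low + high
    ch = channel_attention(combined, w)[:, None, None]    # (C, 1, 1)
    sp = spatial_attention(combined, v)[None, :, :]       # (1, H, W)
    gate = ch * sp                                        # (C, H, W), values in [0, 1]
    # Convex combination: the gate decides, per channel and per pixel,
    # how much low-level detail versus high-level semantics is kept.
    return gate * low + (1.0 - gate) * high
```

Unlike direct concatenation, the gate here differs from channel to channel and from pixel to pixel, so boundary pixels can draw more on low-level detail while interior pixels rely on high-level semantics.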
Compared with the other methods mentioned above, our method improves the segmentation performance through context aggregation and feature fusion. This suggests that adaptive context aggregation and feature fusion are feasible ways to achieve better segmentation performance. Although our method achieved better performance in the experiments, it has a more complex structure. With the development of remote sensing technology, more and more remote sensing images will need to be processed. Therefore, we believe that improving the trade-off between structural complexity and accuracy is a direction for future work that deserves more attention.

Conclusions
In this work, we presented a network for the challenging and meaningful task of precise semantic segmentation of VHR remote sensing images. The network adaptively aggregates contextual cues and flexibly fuses multi-scale feature maps based on spatial and channel attention mechanisms. Specifically, we first introduced ACAM to capture the multi-scale contextual cues of objects with adaptive weight vectors that concentrate the global semantic information. Moreover, we adopted DAFFM, based on spatial and channel attention mechanisms, to explore the consistency of multi-scale features for better fusion results. In this way, our method makes better predictions for pixels of objects in complex remote sensing scenes and improves the classification accuracy of pixels near object boundaries. Finally, extensive ablation experiments on the ISPRS 2D Semantic Labeling Vaihingen data [35] and the GaoFen-2 [36] data show that our method gives more precise segmentation results and achieves state-of-the-art performance.