MAFF-HRNet: Multi-Attention Feature Fusion HRNet for Building Segmentation in Remote Sensing Images

Abstract: Built-up areas and buildings are two main targets in remote sensing research; consequently, automatic extraction of built-up areas and buildings has attracted extensive attention. This task is usually difficult because of boundary blur, object occlusion, and intra-class inconsistency. In this paper, we propose the multi-attention feature fusion HRNet, MAFF-HRNet, which can retain more detailed features to achieve accurate semantic segmentation. The design of a pyramidal feature attention (PFA) hierarchy enhances the multilevel semantic representation of the model. In addition, we develop a mixed convolution attention (MCA) block, which increases the capture range of receptive fields and overcomes the problem of intra-class inconsistency. To alleviate interference due to occlusion, a multiscale attention feature aggregation (MAFA) block is also proposed to enhance the restoration of the final prediction map. Our approach was systematically tested on the WHU (Wuhan University) Building Dataset and the Massachusetts Buildings Dataset. Compared with other advanced semantic segmentation models, our model achieved the best IoU results of 91.69% and 68.32%, respectively. To further evaluate the application significance of the proposed model, we transferred a model pretrained on the World-Cover Dataset to the Gaofen 16 m dataset for testing. Quantitative and qualitative experiments show that our model can accurately segment buildings and built-up areas from remote sensing images.


Introduction
Building and built-up area semantic segmentation in remote sensing images has proven applicable in various scenarios, such as emergency management, ancient city reconstruction, traffic evaluation, and map updating [1,2]. Owing to the large volume and complex backgrounds of remote sensing images, manual annotation is usually highly time-consuming and laborious work [3]. Therefore, it is necessary to find ways to automatically extract the objects in these images quickly and accurately. At the same time, automatic segmentation is often more challenging for remote sensing images than for natural scene images. Many factors affect and interfere with the extraction of buildings and built-up areas, including scale variations, background complexity (shadows, clouds, and trees, among others), diverse architectural styles, and other topological difficulties [4]. Therefore, fast and accurate automatic segmentation of buildings and built-up areas from remote sensing images has become a difficult task of considerable interest for remote sensing image processing in recent years.
To extract the shapes and contours of built-up areas and buildings from remote sensing images, three kinds of methods have been proposed: traditional extraction, machine learning, and deep learning. In the category of traditional extraction, most building segmentation methods rely on human experience and hand-crafted features, for example, morphological and texture features of buildings [5][6][7], geometrical information and spectral or spatial information [8][9][10], building and shadow companions, the spacing between buildings, and the topological relationships between buildings and roads. However, these methods reduce the representational ability and performance of the original features and depend on inefficient and complex manual feature selection. With the progress of science and technology, machine learning has gradually been applied to remote sensing segmentation. Initially, several feature descriptors were carefully designed for pixel-by-pixel classification [11]. For instance, Aptoula et al. [12] investigated the value of global morphological texture descriptors in remote sensing. Later, support vector machine (SVM) technology was gradually applied to the processing of remote sensing images. Mitra et al. [13] overcame the problem of insufficient labelled pixels in remote sensing images by applying the SVM algorithm. Qi et al. [14] presented a multiclass SVM-based semi-supervised approach to enhance the classification efficiency. Subsequently, the random forest method gained popularity in remote sensing image processing. Pal et al. [15] made use of random forest decision trees to vote for the most suitable class. Xia et al. [16] improved the original random forest algorithm by means of the "rotation forest" technique and applied it to remote sensing image processing. These methods have achieved a certain level of success, but traditional machine learning methods still have obvious weaknesses. In particular, the generalization ability of manually designed features is poor, resulting in insufficient segmentation accuracy.
With rapid advances in computational performance, deep-learning-based methods have strongly reshaped research on building segmentation. In the early stage, when deep learning was first becoming popular in computer vision studies [17][18][19], most research focused on classification or detection tasks, which promoted the development of convolutional neural network (CNN) structures and their applications for building segmentation in remote sensing images. Deep learning methods can extract high-level characteristics and coordinate various levels of abstraction [20]. Therefore, deep learning has gradually come to dominate the field of remote sensing segmentation owing to its excellent performance. At present, there are many techniques for the segmentation of buildings and built-up areas. Long et al. [21] designed a fully convolutional network (FCN) and produced accurate and detailed segmentation results. They replaced complex fully connected layers with convolutional layers and adopted deconvolution layers to obtain a segmentation map consistent with the size of the original image. Subsequently, encoder-decoder structures have become widely used in the field of segmentation. Zhao L. et al. [22] proposed a decoding network built on an FCN, which utilizes three different upsampling methods and fuses their generated features, thus effectively reducing the number of parameters and the training time. He C. et al. [23] optimized the model segmentation results and improved their accuracy by introducing boundary information. The U-Net model [24] proposed by Ronneberger et al. is an improvement on the basic encoder-decoder architecture. The corresponding layers of the encoder and decoder are connected to combine features from different layers, thereby narrowing the gap between them. To address the challenges of multiscale objects and low precision in remote sensing image segmentation, Sun Y. et al. [25] proposed the MA-UNet method. To reduce the need for a large number of pixel-level labels, Moghalles et al. [26] presented a novel segmentation framework based on a weakly supervised structure, which can compensate for the deficiency of the original data by generating labels to achieve better segmentation accuracy.
Despite the promising results achieved by advanced methods, current automatic semantic segmentation tasks for buildings still face the following main challenges. First, most remote sensing images have a complex background and a large size, while previous segmentation networks have a shallow model depth and a relatively insufficient feature extraction ability, resulting in boundary blurring in the segmented map. Moreover, the similarity of artificial features between buildings and background leads to intra-class inconsistency [27], which can cause buildings to be incorrectly classified as background. Finally, when a target building is blocked by shadows or trees, it is difficult to segment accurately. To address these challenges, Chen et al. [28] developed the atrous spatial pyramid pooling (ASPP) module with different dilation rates to obtain multi-scale feature maps. Zhang et al. [29] proposed a multilevel feature aggregation strategy to fuse feature maps of different resolutions from different layers. DFN [30] captured more distinctive features through channel attention. Zhao et al. [31] utilized spatial attention to relax local domain constraints. Hou et al. [32] embedded position information into channel attention to enhance the representation of objects of interest. Abdullahi et al. [33] were able to better extract object features through the combination of an attention mechanism and dense convolution. Inspired by the concepts of attention mechanisms and feature information fusion, in this paper, we developed the MAFF-HRNet model based on multi-attention fusion to automatically extract built-up areas and buildings from remote sensing images. Our contributions in this article are as follows:
1. We propose a high-resolution network (HRNet) [34] based on a pyramidal feature attention (PFA) hierarchy. Our model can retain more spatial feature information and achieve more accurate semantic segmentation, thus effectively solving the problem of the blurring of segmented object boundaries in remote sensing images.
2. We introduce a novel mixed convolution attention (MCA) block that can improve the capture range of receptive fields [35] and improve the ability to recognize object contours, thus mitigating segmentation errors caused by intra-class inconsistency.
3. We develop a novel multiscale attention feature aggregation (MAFA) block to boost the semantic representation in the feature map and better retain fine contextual details, thereby addressing the problem of inaccurate segmentation caused by object occlusion.
4. Without the use of additional datasets, data augmentation [36], or postprocessing, MAFF-HRNet achieves the best accuracy among a set of comparable models on two building segmentation datasets, and the practicability of our model was verified on the Gaofen 16 m dataset.
Our code is available at https://github.com/Zhihao-Che/MAFF-HRNet (accessed on 17 January 2023).
The rest of this paper is organized as follows. Section 2 details the structure of the proposed approach. In Section 3, we introduce the experimental dataset, model evaluation indicators, experimental implementation details, and experimental results in detail. Section 4 discusses the applicability of MAFF-HRNet on the Gaofen 16 m dataset and possibilities for further work. Finally, Section 5 summarizes the article.

Methods
In this section, we first introduce the overall structure of the proposed model in Section 2.1. Then, we describe PFA, MCA, and MAFA in detail in Sections 2.2-2.4, and finally our loss function is provided.

Architecture of the Proposed Framework
Most of the segmentation models used in previous studies have had encoder-decoder architectures. The traditional network structure for semantic segmentation includes a process of spatial downsampling for feature extraction; however, this process cannot ensure that deeper features retain a high resolution. At the same time, feature extraction performed in a serial manner by high-to-low-level encoders may generate fuzzy feature maps after multiple convolutions, resulting in the loss of some significant contour and boundary details. In contrast, an HRNet is composed of parallel subnetworks that operate at multiple spatial resolutions, and feature maps are constructed by fusing the outputs of these subnetworks by means of reused modules. Inspired by the HRNet architecture, we use a parallel structure instead of the traditional serial structure to produce feature maps of different resolutions, as shown in Figure 1. At the same time, we combine it with a feature pyramid structure to effectively retain the spatial feature information of each layer in the model and produce more accurate prediction maps, which is helpful for overcoming the blurred boundary problem. To further enhance the capture range of receptive fields in each branch structure, we develop an MCA block that enhances the discriminability of objects of different sizes and shapes in order to mitigate segmentation errors caused by intra-class inconsistency. In addition, to boost the performance of the proposed model in identifying fine details, especially for target buildings blocked by shadows or trees, we propose a MAFA block based on a mixed attention mechanism. Finally, we use a smoothing method to gradually merge the multilayer feature maps to obtain the final semantic segmentation results.

Pyramidal Feature Attention (PFA) Hierarchy
Because relying only on the feature information from the deep layer is not enough to reconstruct an accurate prediction map, the combination of multi-scale feature maps is required to compensate for rough local feature information to achieve accurate pixel classification [37,38]. A feature pyramid [39] is a basic component used to detect objects of different scales in object recognition models, which can take advantage of the multilevel characteristics of images and provide sufficient spatial information for each layer in the network, so as to achieve accurate pixel detection and classification. Therefore, to boost the multilevel representation ability of our network, pyramid feature levels are introduced in our method, as shown in Figure 2.
The PFA hierarchy employs a four-level top-down structure. To implement the downsampling function, we adopt the attention downsampling (AD) block, which consists of two 3 × 3 convolutional layers, a batch normalization operation, the rectified linear unit (ReLU) activation function, and an attention module. Two types of attention modules are used in the AD blocks. A channel attention mechanism is employed in the AD blocks in the first three layers, and the structure of this module is shown in Figure 3a. To boost the correlation between feature maps in the region of interest and enhance the contour and boundary details in the segmentation maps, we design the attention mechanism to utilize the interdependence between channel maps. We reshape the feature map $A \in \mathbb{R}^{C \times H \times W}$ to $A \in \mathbb{R}^{C \times N}$, where $N = H \times W$, and perform a matrix multiplication between $A$ and the transpose of $A$. Subsequently, we obtain the output attention feature map $X \in \mathbb{R}^{C \times C}$ through the softmax layer. The formula is as follows:

$$x_{ji} = \frac{\exp(A_i \cdot A_j)}{\sum_{i=1}^{C} \exp(A_i \cdot A_j)}$$

where $x_{ji}$ measures the $i$-th channel's impact on the $j$-th channel. Finally, we perform a matrix multiplication between the transpose of $X$ and $A$ and reshape the result to $E \in \mathbb{R}^{C \times H \times W}$.
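The channel attention step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `channel_attention` is hypothetical, the softmax is taken row-wise over the channel similarity matrix, and no learnable scale or residual connection is modelled because the text does not describe one.

```python
import numpy as np


def channel_attention(feature_map: np.ndarray) -> np.ndarray:
    """Re-weight a (C, H, W) feature map by channel-to-channel similarity."""
    c, h, w = feature_map.shape
    a = feature_map.reshape(c, h * w)        # reshape to (C, N) with N = H * W
    energy = a @ a.T                         # (C, C) channel similarity matrix
    # Subtract the row maximum before exponentiating for numerical stability.
    energy = energy - energy.max(axis=1, keepdims=True)
    weights = np.exp(energy)
    attention = weights / weights.sum(axis=1, keepdims=True)  # row-wise softmax
    # Following the text: transpose of the attention map times the features.
    out = attention.T @ a
    return out.reshape(c, h, w)
```

Because each row of the attention map sums to one, this particular formulation redistributes activations across channels while preserving their per-pixel sum, which is a convenient sanity check when experimenting with it.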

Figure 2. The framework of the four-level PFA hierarchy. For downsampling operations, AD blocks are used. We set the downsampling rate to 2, which means that, after a downsampling operation, the feature map will be reduced to half the original size. In contrast, the number of channels will double. The first-level feature map is directly input into the backbone of MAFF-HRNet without downsampling, and the other three layers are fused with the matching feature map in the network structure through downsampling.
In contrast, we adopt a multihead self-attention mechanism in the last layer of the AD block, with the architecture shown in Figure 3b. This block contains multiple parallel self-attention modules, which capture the same related information in different subspaces to achieve more comprehensive feature association from multiple perspectives and scales, further strengthening the contour and boundary features. As shown, the architecture of the multihead self-attention module consists of a scaled dot-product attention operation and an attention calculation. For the scaled dot-product attention operation, a scaling weight prevents the dot products from becoming too large when the vector dimension is high. The formula is as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where $Q$, $K$, and $V$ indicate the query matrix, key matrix, and value matrix, respectively; $\mathrm{Attention}(\cdot)$ indicates the attention matrix; and $\sqrt{d_k}$ is the normalization (scaling) factor. In addition, the attention calculation takes the results of each $Q$, $K$, and $V$ mapping in the scaled dot-product attention operation as input and fuses each result. The process is expressed as follows:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O}, \quad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$$

where $h$ is the number of heads and $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ are learned projection matrices.

After each AD block (using the transformer model [40,41] for reference), the spatial resolution of the feature map will be reduced to half the original size, while the number of channels will double. The first-level feature map is directly input to the first-layer branch of the model architecture. Additionally, the feature maps of the remaining corresponding layers after the downsampling operation are fused with the corresponding maps from the multiresolution branches through elementwise addition. Therefore, benefiting from the features obtained from the high level and low level, the PFA hierarchy enables the exploration of different layers of semantic features to transmit rich multilevel semantic information to the model, thus alleviating the problem of boundary blurring in the segmentation map.
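To make the two stages of the multihead self-attention module concrete, the following NumPy sketch implements standard scaled dot-product attention and a multihead wrapper in the style of the transformer model that the text cites. It is an illustration only: the function names, the flattening of the feature map into `n` tokens of width `d_model`, and the randomly supplied projection matrices are assumptions, not details taken from the paper.

```python
import numpy as np


def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(q, k, v):
    """Return (output, weights); scores are divided by sqrt(d_k)."""
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Self-attention over x of shape (n, d_model) with num_heads heads."""
    n, d_model = x.shape
    d_head = d_model // num_heads

    def split(t):
        # (n, d_model) -> (num_heads, n, d_head)
        return t.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    heads, _ = scaled_dot_product_attention(q, k, v)
    # Concatenate the heads back to (n, d_model) and fuse them with w_o.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ w_o
```

Each head attends within its own `d_head`-dimensional subspace, which is the mechanism the text refers to when it says related information is captured in different subspaces before being fused.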

Mixed Convolution Attention (MCA) Block
The parallel structure of the HRNet encoder can generate features of different sizes to enhance the high-resolution representation and improve the model performance, but the receptive fields for shallow feature maps are still small, which will lead to intra-class inconsistency in the segmentation results. In order for the segmentation model to obtain more abundant feature information in the shallow layer, we propose a new MCA block to obtain the feature information in all layers of the original backbone, thus improving the capture range of the receptive fields and the discriminability of objects of different sizes and shapes, as shown in Figure 4. Atrous convolution [42][43][44] has been widely applied to increase the receptive field without increasing the size of the convolution kernel or the amount of computation. Because atrous convolution does not require changing the model construction or adding other parameters, it can be embedded in any network model. Currently, most multiscale feature extraction strategies mainly rely on the method of stacking atrous convolutions with different dilation rates, which usually leads to a gridding effect. As an alternative, some researchers have suggested using multiple atrous convolutions with the same dilation rate to expand the receptive field; however, this will require a large number of atrous convolutions to obtain sufficient multiscale feature maps, and thus increase the computational complexity and the number of parameters. To overcome these drawbacks, we develop a new MCA block containing three consecutive atrous convolutions. Each block contains three atrous convolution kernels (with dilation rates of r = 1, 2, and 3), which are applied to the feature maps in parallel.
This architecture can capture longer-range information about multiscale objects through the merging of the three atrous convolution kernels. Regarding the three corresponding dilation rates, the greatest common divisor of these rates is no greater than 1, thus mitigating the gridding effect. Moreover, our MCA block includes residual connections to further preserve the spatial feature information from the previous network layer; consequently, it not only enriches the spatial feature information of the receptive fields, but also compensates for the disappearance of local features caused by the sparse sampling performed in atrous convolution, thereby further avoiding the gridding effect. In addition, we incorporate a coordinate attention mechanism after the three atrous convolutions; the corresponding architecture is shown in Figure 5. The coordinate attention block encodes the channels of the feature map in two dimensions, i.e., the horizontal and vertical coordinates, which enables it to generate a pair of direction-aware feature maps. This method can capture long-range correlations in space while maintaining spatial orientation information to a certain extent, thus allowing the model to locate targets more accurately. Because the input and output dimensions of the coordinate attention block are the same, it can be flexibly inserted into any network structure. For building segmentation, such coordinate attention blocks can be used to capture the global information of the building data, and this effective information can be fully used to further improve the capture range of the receptive fields, thus alleviating the problem of intra-class inconsistency in the segmentation map.
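The role of the dilation rates in the gridding effect can be checked with a small, pure-Python experiment. The sketch below is a one-dimensional simplification under stated assumptions: it models 3-tap atrous kernels applied one after another and lists which input offsets can influence a single output position. It does not model the parallel branches, residual connections, or coordinate attention of the MCA block, and the helper names are hypothetical.

```python
from itertools import product


def covered_offsets(dilation_rates):
    """Input offsets reachable by stacking 3-tap atrous convolutions in 1-D."""
    # A 3-tap kernel with dilation r samples offsets -r, 0, and +r.
    taps_per_layer = [(-r, 0, r) for r in dilation_rates]
    return {sum(combo) for combo in product(*taps_per_layer)}


def has_gridding_holes(dilation_rates):
    """True if some offset inside the receptive field is never sampled."""
    reach = sum(dilation_rates)
    full_field = set(range(-reach, reach + 1))
    return covered_offsets(dilation_rates) != full_field
```

With rates (2, 2, 2) only even offsets are ever sampled, so `has_gridding_holes((2, 2, 2))` is true, whereas rates (1, 2, 3), whose greatest common divisor is 1, cover every offset from -6 to 6.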

Multiscale Attention Feature Aggregation (MAFA) Block
In previous studies, most of the effort has been devoted to improving the encoder architecture, while simple upsampling or deconvolution has still been applied to gradually enlarge the outputs at smaller resolutions. However, such decoder designs largely ignore contextual information, resulting in a loss of detail and greatly reducing the segmentation accuracy [45], especially in the case of object occlusion. To retain more detailed features and prevent segmentation errors caused by occlusion, we propose the MAFA block to boost the recovery of the output map.
Without an attention mechanism, restoring a higher-resolution prediction map by means of simple bilinear upsampling may result in the loss of fine details and poor semantic segmentation. When the spatial correlations between feature maps are fully utilized, however, we can obtain more accurate segmentation results. As shown in Figure 6, instead of adopting an overly simple bilinear upsampling strategy for the fusion of neighbouring feature maps, we propose an efficient spatial and channel attention upsampling (ESCAU) module to improve the semantic reconstruction ability by making full use of the spatial and channel dependencies. Because the feature maps of an image contain rich interdependent structural information, we can convert each feature map into the matching object in the upsampling path and apply it to help fuse neighbouring feature maps. The formula is as follows:

$$\hat{F}_i = F_i \oplus \mathcal{U}(\hat{F}_{i+1})$$

where $F_i$ indicates the feature map from the $i$-th layer; $\hat{F}_i$ indicates the feature map from the $i$-th layer after the fusion operation $\oplus$; and $\mathcal{U}(\cdot)$ denotes the ESCAU module, which consists of an efficient spatial and channel attention (ESCA) mechanism and bilinear upsampling. The ESCA module, which combines spatial and channel attention, can effectively integrate multi-scale information in this context, thus enhancing the feature representation ability for building semantic segmentation. At the same time, this module involves only a handful of parameters, which greatly reduces the complexity of the model, as shown in Figure 7.

In the channel attention module, given an input, aggregated features are obtained through global average pooling (GAP), and then the channel weights are generated by performing one-dimensional convolution with a kernel size of $k$, where $k$ is adaptively determined from the following mapping with respect to the channel dimension $c$:

$$k = \psi(c) = \left| \frac{\log_2 c}{\gamma} + \frac{b}{\gamma} \right|_{\mathrm{odd}}$$

where $|\cdot|_{\mathrm{odd}}$ represents the nearest odd number to the argument and $\gamma$ and $b$ are set to 2 and 1, respectively. Finally, the result of the convolutional layer is output through the activation function. The channel attention is computed as follows:

$$M_c(F) = \sigma\left(\mathrm{C1D}_k(\mathrm{GAP}(F))\right)$$

where $\sigma(\cdot)$ signifies the activation function and $\mathrm{C1D}_k(\cdot)$ represents a one-dimensional convolution operation with a kernel size of $k$.

In the spatial attention module, we aggregate the channel information of a feature map by means of two pooling operations, generating two types of features: average-pooled features and max-pooled features. Then, these features are used to generate two-dimensional spatial attention maps by means of standard convolutional layers and sigmoid activation functions. The spatial attention is computed as follows:

$$M_s(F) = \sigma\left(f^{3 \times 3}\left([\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)]\right)\right)$$

where $\sigma(\cdot)$ denotes the sigmoid activation function and $f^{3 \times 3}(\cdot)$ represents a convolution operation with a filter size of 3 × 3. To ensure that the two attention modules complement each other, we use residual connections to combine them, thus making full use of the spatial and channel features and helping improve the accuracy of semantic segmentation. Finally, we can merge all adjacent feature maps in the bottom-up direction while retaining fine details and achieving more accurate pixel-level semantic predictions, thereby reducing the semantic segmentation errors caused by occlusion.

Loss Function
For feature selection in binary classification tasks, better results are obtained when the numbers of samples in the two classes are similar. However, we have found that, in the semantic segmentation of buildings, there is class imbalance between the foreground and background, which is also a core difficulty in high-resolution image segmentation. Thus, we replace the general loss function with a weighted binary cross-entropy loss function to alleviate the class imbalance problem. The binary cross-entropy loss function, which is widely used in various semantic segmentation or classification tasks, captures the difference between the true distribution of the input data and the distribution obtained by the model through training. Therefore, this loss function is also used in our network. The loss function is as follows:

$$L_{\mathrm{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$$

where $y$ denotes a label segmentation map, $\hat{y}$ denotes the corresponding predicted segmentation map, and the sum runs over the $N$ pixels of the map. By properly adjusting the loss caused by false positive samples, the imbalance between the two types of samples can be alleviated to a certain extent.
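Two pieces of this section lend themselves to a short numerical sketch: the adaptive kernel size used by the channel attention, and a weighted binary cross-entropy. Both functions below are illustrative assumptions rather than the authors' code. The rounding convention in `adaptive_kernel_size` (truncate, then bump even values up to the next odd number) is one common reading of "nearest odd number", and the weights `w_pos` and `w_neg` in `weighted_bce` are hypothetical parameters, since the text does not state the weighting values.

```python
import math


def adaptive_kernel_size(channels: int, gamma: int = 2, b: int = 1) -> int:
    """Map a channel count to an odd 1-D convolution kernel size."""
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    # Keep odd values as they are; bump even values up to the next odd number.
    return t if t % 2 == 1 else t + 1


def weighted_bce(y_true, y_pred, w_pos=1.0, w_neg=1.0, eps=1e-7):
    """Mean binary cross-entropy with separate foreground/background weights.

    y_true holds 0/1 labels and y_pred holds probabilities, both as flat
    sequences of the same length.
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1.0 - p)
    return total / len(y_true)
```

With the stated values of 2 and 1 for the two mapping constants, 64 channels give a kernel size of 3 and 128 or 512 channels give 5. Raising `w_neg` above 1 penalizes confident false positives more strongly, while `w_pos = w_neg = 1` recovers the plain binary cross-entropy.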

Experiment
In this section, we present the building dataset preparation, experimental environment, and performance measurement factors in Sections 3.1-3.3. Then, we evaluate the building extraction capability on two high-resolution public datasets in Section 3.4, and we report an ablation experiment conducted on the World-Cover Dataset in Section 3.5 to validate the built-up area extraction capability.

Public Datasets
We evaluate the building extraction capability on two public datasets: the WHU Building Dataset [46] and the Massachusetts Buildings Dataset [47].
The Massachusetts Buildings Dataset is a public dataset used for building segmentation in the Kaggle competition. It contains 151 remote sensing images of urban and suburban areas in Boston, each 1500 × 1500 pixels in size. The whole dataset is divided into 136 images for model training, 11 images for model testing, and the rest for model validation. Because the original images were too large for the available memory in these experiments, they were cut into 500 × 500 tiles with no overlap. The target map was produced by rasterizing a large quantity of high-quality building footprint data. The data were limited to areas where the average omission noise level was approximately 5% or less.
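The non-overlapping tiling described above is simple to express in code. The sketch below only computes tile origins; the function name `tile_origins` is hypothetical, and how the authors actually cropped and stored the tiles is not stated in the text.

```python
def tile_origins(height: int, width: int, tile: int):
    """Top-left (row, col) corners of non-overlapping square tiles."""
    return [
        (top, left)
        for top in range(0, height - tile + 1, tile)
        for left in range(0, width - tile + 1, tile)
    ]
```

For a 1500 × 1500 image and a tile size of 500, this yields a 3 × 3 grid, that is, nine tiles per image with no pixels left over.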
The WHU Building Dataset is a public dataset specifically used for building segmentation in remote sensing images. The remote sensing images were subjected to downsampling from 0.075 to 0.3 m resolution. The entire dataset is divided into 8189 images, without overlap, and the single-image size is 512 × 512. In total, 4736 images are assigned to model training, 2416 images are assigned to model testing, and the remaining images are used for model validation. In addition to differences in satellite sensors, changes in atmospheric conditions, atmospheric and radiation corrections, and seasonal changes in the samples also help to increase the robustness of building extraction algorithms.

Experimental Setup
Throughout the experiments, no training tricks [48] or data augmentation were adopted, such as warm-up [49], label smoothing [50], or the use of a pretrained model. To demonstrate the performance of our algorithm, we compare it with ten advanced models: SegNet [51], FPN [39], U-Net [24], DeepLabv3 [52], PSPNet [53], DeepLabv3+ [54], U-Net++ [55], SRINet [56], MSG-SR-Net [57], and the Spot-Seeds and Refinement Process (SSRP) method [26]. SegNet, FPN, PSPNet, U-Net, and DeepLabv3 are the most representative deep learning models in semantic segmentation. They are widely applied in various scenarios and have achieved good segmentation results. DeepLabv3+, U-Net++, SRINet, MSG-SR-Net, and SSRP are well-known models applied in the field of remote sensing segmentation and have good performance. Similar to our proposed method, DeepLabv3+ and U-Net++ are also fully supervised semantic segmentation networks that are improved variants of a given base network. In addition, MSG-SR-Net and SSRP adopt a weakly supervised approach to segment buildings, which alleviates the problem of insufficient segmentation accuracy caused by a lack of sample data and has gradually become a focus of research in recent years.
To ensure fairness of the experiments, ResNet50 was used as the backbone of all models, and no pretrained weights were loaded. All network models for comparative experiments were initialized via Kaiming initialization [58]. We trained and verified all models on an Ubuntu 18.04 server with four 12 GB Nvidia GeForce RTX 2080 Ti GPUs using PyTorch 1.8 and CUDA 10.2. Additionally, to compare the segmentation ability of the models more fairly, we set the number of training epochs to 300 and the batch size to 16. Considering the gradient oscillation in the training process, the adaptive moment estimation (Adam) optimizer [59] was used to train the models. As an adaptive learning method, Adam requires less tuning and has higher computational efficiency than other stochastic optimization methods. In addition, the cosine annealing strategy was adopted to decay the learning rate during the training stage.
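For readers unfamiliar with cosine annealing, the schedule can be written as a one-line formula. The sketch below uses the standard form without restarts; the function name is hypothetical, and the base learning rate of 1e-3 in the usage note is an assumed value for illustration only, since the text reports the epoch count (300) but not the learning rate.

```python
import math


def cosine_annealing_lr(epoch: int, total_epochs: int,
                        base_lr: float, min_lr: float = 0.0) -> float:
    """Learning rate at a given epoch under cosine annealing (no restarts)."""
    cos_term = 1.0 + math.cos(math.pi * epoch / total_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * cos_term
```

With `total_epochs=300` and an assumed `base_lr=1e-3`, the rate starts at 1e-3, passes through roughly half that value at epoch 150, and decays smoothly to `min_lr` at epoch 300.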

Evaluation Metrics
To verify and compare the model performance, we consider four traditional semantic segmentation metrics: precision, recall, the F1 score (F1), and the intersection over union (IoU) ratio. The meaning of each indicator is as follows. Precision indicates the proportion of pixels predicted as building that are actually building pixels, and thus directly reflects how reliable the predicted segmentation is. Recall represents the proportion of actual building pixels that the model predicts correctly. The F1 score combines precision and recall as their harmonic mean, which can effectively reflect the segmentation performance of the model. Finally, the IoU, as the most common indicator in semantic segmentation tasks, reflects the proportion of overlapping pixels between the predicted map and the label map. The formulas for calculating these metrics are shown in Equations (10)-(13):

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{10}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{11}$$

$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{12}$$

$$\text{Intersection over union (IoU)} = \frac{TP}{TP + FP + FN} \tag{13}$$

where $TP$, $FP$, and $FN$ denote the numbers of true positive, false positive, and false negative pixels, respectively. As the IoU and F1 scores can more comprehensively reflect and compare the segmentation ability of the model, we focus on these two main indicators when analysing the performance differences between the models in the subsequent discussion of the experimental results.
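The four metrics can be computed directly from pixel counts. The following pure-Python sketch operates on flattened binary masks (1 for building, 0 for background); the function name and the choice to return 0.0 when a denominator is zero are assumptions made for illustration, not conventions stated in the text.

```python
def segmentation_metrics(pred, label):
    """Precision, recall, F1, and IoU for flat 0/1 prediction and label masks."""
    tp = sum(1 for p, y in zip(pred, label) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(pred, label) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(pred, label) if p == 0 and y == 1)

    def ratio(num, den):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    f1 = ratio(2 * precision * recall, precision + recall)
    iou = ratio(tp, tp + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}
```

For example, a prediction of `[1, 1, 0, 0, 1]` against a label of `[1, 0, 0, 1, 1]` has two true positives, one false positive, and one false negative. On a single confusion matrix the two headline metrics are linked by F1 = 2·IoU / (1 + IoU), so they always rank two predictions in the same order; scores averaged over many images need not satisfy this identity exactly.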

Results on Public Datasets
In this part, we evaluate all models on two public datasets. We first tested MAFF-HRNet on the Massachusetts Buildings Dataset and compared its segmentation capability with that of advanced models, including SegNet, FPN, U-Net, HRNet, and DeepLabv3. The quantitative results for the five validation indicators on the validation set are shown in Table 1. The IoU scores of FPN, SegNet, HRNet, and U-Net are 62.24%, 56.57%, 66.60%, and 63.67%, respectively. In contrast, DeepLabv3, one of the state-of-the-art models, did not achieve the expected results; its IoU score was approximately 4% lower than that of FPN. The average F1 scores achieved by SegNet, U-Net, DeepLabv3, HRNet, and FPN were 72.08%, 77.80%, 73.58%, 79.74%, and 76.68%, respectively. MAFF-HRNet outperforms the other methods in building extraction, improving the F1 score by 1.43% and 3.37% compared with the original HRNet and U-Net models, respectively. Figure 8 shows the buildings extracted on the validation set by U-Net (the best model among the networks considered for comparison) and by MAFF-HRNet; the visual comparison is consistent with the quantitative results. Although U-Net eliminates interference from the complex background to some extent, the edges of the buildings in its predictions are not clear enough, and dense buildings cannot be accurately distinguished. In contrast, the segmentation results generated by our model contain more detailed features than those of the other methods. These findings indicate that the proposed PFA hierarchy enhances the multilevel semantic representation ability of our model and allows it to preserve object boundary information more accurately in building segmentation.
Owing to the presence of more occlusion by shadows or trees, the WHU Building Dataset presents greater segmentation challenges than other datasets. To compare the performance of different types of models on this dataset from multiple perspectives, we selected six state-of-the-art models, including both fully supervised and weakly supervised models, namely, PSPNet, DeepLabv3+, U-Net++, SRINet, MSG-SR-Net, and SSRP, and evaluated them on the WHU Building Dataset using the same metrics. PSPNet aggregates the context of different regions, which gives it the ability to understand global context information; it achieved an IoU score of 88.17% on the dataset. The designs of U-Net++ and DeepLabv3+ are similar to our idea: they are also fully supervised semantic segmentation networks that are improved variants of a given base network, and their IoU scores were the highest among the models considered for comparison, at 89.36% and 89.19%, respectively. As shown in Table 2, our model achieved excellent indicators on the WHU Building Dataset. In particular, MAFF-HRNet achieved the highest IoU score of 91.69% on the validation set, which was better than all other models in the experiment and exceeded that of U-Net++ by 2.33%. In addition, the results indicate that the fully supervised networks perform better than the weakly supervised networks. As shown in Figure 9, our method performs better in overcoming the problems of intra-class inconsistency and occlusion. For tree and shadow occlusion, as marked in the orange boxes, our model is almost unaffected. This is because the proposed MAFA block boosts the semantic representation ability of the feature maps and better retains fine contextual details. The green boxes indicate errors caused by intra-class inconsistencies. The proposed MCA block effectively increases the capture range of the receptive fields; thus, our model has a better recognition ability for the contours and scales of buildings. Therefore, these blocks embedded in the HRNet model can significantly increase the accuracy of building segmentation and effectively mitigate the problems caused by occlusion and intra-class inconsistency.
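The IoU and F1 scores quoted above can be illustrated with the usual pixel-wise definitions for a binary building mask. The following is only a minimal sketch: the function name `binary_metrics`, the nested-list mask format, and the toy masks are assumptions made for illustration, and the paper does not state whether its scores are accumulated over the whole validation set or averaged per image.

```python
def binary_metrics(pred, label):
    """Return (IoU, F1) for the positive class from two 0/1 masks of equal shape."""
    tp = fp = fn = 0
    for pred_row, label_row in zip(pred, label):
        for p, y in zip(pred_row, label_row):
            if p == 1 and y == 1:
                tp += 1          # building predicted and present
            elif p == 1 and y == 0:
                fp += 1          # building predicted but absent
            elif p == 0 and y == 1:
                fn += 1          # building missed
    union = tp + fp + fn
    if union == 0:
        # Assumption: two empty masks are scored as a perfect match.
        return 1.0, 1.0
    iou = tp / union
    f1 = 2 * tp / (2 * tp + fp + fn)
    return iou, f1


# Toy 2 x 3 masks (hypothetical data, not from either dataset).
pred = [[1, 1, 0],
        [0, 1, 0]]
label = [[1, 0, 0],
         [0, 1, 1]]
iou, f1 = binary_metrics(pred, label)
```

When both values are computed from the same pixel counts, they are tied by F1 = 2·IoU / (1 + IoU), so the two metrics always rank predictions on a single image in the same order.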

Ablation Experiments
To test the function of the PFA hierarchy and the MCA and MAFA blocks in MAFF-HRNet, and to demonstrate the effectiveness of the proposed model in segmenting built-up areas, we conducted the following ablation experiments on the World-Cover Dataset, adopting the IoU as the evaluation metric. The World-Cover Dataset was collected by the European Space Agency (ESA) based on Sentinel-2 data and consists of 11 land-use types. We selected only the imagery of built-up land in Paris, France, with a resolution of 20 m, as our dataset. The entire dataset is cut into 6354 non-overlapping tiles of 512 × 512 pixels. Of these, 5083 images are used for training and 1271 images are used for validation.
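The tiling and split described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' preprocessing: the helper names `tile_origins` and `split_counts` are hypothetical, partial tiles at the image border are assumed to be dropped, and the 0.8 training fraction is inferred from the reported counts (5083 of 6354), not stated in the text.

```python
def tile_origins(height, width, tile=512):
    """Top-left (row, col) of every full, non-overlapping tile in an image."""
    return [
        (r, c)
        for r in range(0, height - tile + 1, tile)
        for c in range(0, width - tile + 1, tile)
    ]


def split_counts(n_tiles, train_fraction=0.8):
    """Number of training and validation tiles for a given fraction."""
    n_train = round(n_tiles * train_fraction)
    return n_train, n_tiles - n_train


# A hypothetical 1024 x 1536 scene yields 2 x 3 = 6 tiles.
origins = tile_origins(1024, 1536)

# The reported tile count reproduces the reported split under an 80/20 ratio.
n_train, n_val = split_counts(6354)
```

With these assumptions, `split_counts(6354)` gives 5083 training and 1271 validation tiles, matching the numbers in the text.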
In these ablation experiments, HRNet was used as the baseline model. By optimizing HRNet using various combinations of the PFA hierarchy and the MCA and MAFA blocks, we obtained four additional models for comparison. First, the performance of the MCA block was verified. Specifically, the MCA block is used in place of the normal 3 × 3 convolutional layer in the backbone of HRNet, thereby increasing the receptive fields and semantic information without sacrificing spatial resolution. In addition, we employed the MAFA block to rebuild the decoder module of the baseline model in these experiments. Our feature maps are gradually fused from bottom to top, thereby retaining more fine details and achieving more accurate pixel-level semantic predictions. In Table 3, we observe that the IoU score gradually increases from 75.41% to 80.72% when our proposed MCA and MAFA blocks are used, which reflects the advantages of the MCA and MAFA blocks in improving the segmentation accuracy. Finally, we integrated the PFA hierarchy into HRNet to further boost the multilevel semantic representation ability. To visualize the segmentation effect, we compare our results with the prediction map obtained by the baseline model for the World-Cover Dataset in Figure 10. Our model clearly retains more details in the segmentation of built-up areas than the baseline model. This comparison again shows that the proposed MAFF-HRNet has improved performance and pixel classification ability.
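The claim that the MCA block enlarges the receptive field without reducing spatial resolution can be checked with simple arithmetic. The sketch below assumes 3 × 3 kernels at stride 1 and uses the dilation rates r = 1, 2, and 3 given in the Figure 4 caption; the function names are illustrative, and the coordinate attention part of the block is not modeled.

```python
def serial_receptive_field(kernel_size, dilations):
    """Receptive field (in pixels) of stride-1 convolutions applied in series."""
    rf = 1
    for d in dilations:
        # A dilated kernel spans d * (k - 1) + 1 pixels, adding d * (k - 1).
        rf += (kernel_size - 1) * d
    return rf


def same_padding(kernel_size, dilation):
    """Padding that keeps the spatial size unchanged at stride 1 (odd kernels)."""
    return dilation * (kernel_size - 1) // 2


rf_plain = serial_receptive_field(3, [1])        # one ordinary 3 x 3 convolution
rf_mca = serial_receptive_field(3, [1, 2, 3])    # three serial atrous convolutions
paddings = [same_padding(3, d) for d in (1, 2, 3)]
```

Under these assumptions, a single 3 × 3 convolution sees a 3 × 3 window, whereas the three serial atrous convolutions see a 13 × 13 window, and paddings of 1, 2, and 3 keep the feature map size fixed.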

Discussion
The experimental results on the two public datasets in the previous section show that our model obtains richer feature information through the PFA hierarchy and can therefore overcome the boundary blurring problem. At the same time, when faced with intra-class inconsistency and occlusion, our model is more robust than the other models owing to the addition of the MCA and MAFA blocks. In addition, our model obtained the best score on all validation indicators, which demonstrates its strong building segmentation ability in remote sensing images. To further verify the segmentation performance of our model on data of a different resolution, we test MAFF-HRNet on the Gaofen 16 m dataset.
At present, the Group on Earth Observations (GEO) is establishing the integrated, coordinated, and sustainable Global Earth Observation System of Systems (GEOSS), whose application focuses on achieving the United Nations 2030 Sustainable Development Goals (SDGs). China has always advocated for and participated in the construction of a global geodetic system and has announced that the GF1 WFV/GF6 WFV 16 m (GF1/GF6) data will be shared globally. The GF1/GF6 satellite is a typical high-temporal-resolution remote sensing satellite that carries a wide-view camera with a resolution of 16 m and a swath width of 800 km. To further evaluate the application significance of the proposed model, we migrated the model pretrained on the World-Cover Dataset to the Gaofen 16 m dataset for testing.
As seen in Figure 11, most of the built-up areas have been completely separated, and the false segmentation of the background has also been alleviated to some extent, which demonstrates that our model also has good segmentation ability on the Gaofen 16 m dataset. However, our model does not effectively segment built-up areas that are too small in the original image, which may be attributable to the effects of the resolution and model migration. The image resolution of the Gaofen 16 m data is lower than that of the WHU Building Dataset and the Massachusetts Buildings Dataset, which makes it more difficult to classify pixels and achieve the same effect. In addition, our testing on the Gaofen 16 m dataset was based on model migration, which further exacerbated the loss in accuracy. In general, however, the contours of the built-up areas are completely separated, which indicates that our method has practical significance for built-up area segmentation in Gaofen 16 m images. Furthermore, although the proposed network shows some improvements regarding boundary blurring, the feature loss caused by downsampling is irreversible. The boundaries of artificial built-up areas are uncertain in remote sensing images, and manual annotation introduces errors. This leads to adhesion problems and classification errors in the segmentation map, which we will address in future research.

Figure 1. Structure of the MAFF-HRNet model. Our model uses a parallel construction instead of a serial construction to generate four feature maps of different resolutions. We incorporate a PFA hierarchy (details in Section 2.2) into the original HRNet by adding the attention downsampling (AD) blocks. In addition, we further add MCA (details in Section 2.3) and MAFA (details in Section 2.4) blocks into the model to boost the capture range of the receptive fields and retain more detailed features, so as to obtain more accurate segmentation results.

Figure 3. (a) The structure of the channel attention mechanism applied in the AD blocks in the first three layers, and (b) the structure of the multihead self-attention mechanism applied in the last-layer AD block.

Figure 4. Different from general convolution, our MCA block is implemented based on three serial atrous convolution kernels (with dilation rates of r = 1, 2, and 3) and additionally includes a coordinate attention mechanism, thus increasing the capture range of the receptive fields.

Figure 5. The structure of the coordinate attention block. X Avg Pool and Y Avg Pool denote 1D horizontal global pooling and 1D vertical global pooling, respectively.
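The two 1D global pooling operations named in this caption can be illustrated on a single-channel feature map. This is a minimal sketch in plain Python; the function names are hypothetical, and which of the two directions the figure labels "X" and which "Y" is an assumption not settled by the caption.

```python
def pool_along_width(feature):
    """Average each row over its columns: one value per row (length H)."""
    return [sum(row) / len(row) for row in feature]


def pool_along_height(feature):
    """Average each column over its rows: one value per column (length W)."""
    n_rows = len(feature)
    return [sum(col) / n_rows for col in zip(*feature)]


# Toy feature map with H = 2 rows and W = 3 columns (hypothetical values).
feature = [[1.0, 2.0, 3.0],
           [4.0, 5.0, 6.0]]
row_means = pool_along_width(feature)
col_means = pool_along_height(feature)
```

Unlike a single 2D global average, the two outputs keep positional information along one axis each, which is what lets an attention weight depend on the row and column of a pixel.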

Figure 6. Framework of the MAFA block. Instead of using the traditional bilinear upsampling method to achieve the aggregation of adjacent levels, our MAFA block combines the ESCA mechanism with bilinear upsampling to boost the semantic reconstruction ability of feature maps. Based on the MAFA block, the algorithm utilizes the spatial and channel correlations of segmentation markers, allowing it to more accurately restore the semantic predictions at the pixel level.

Figure 7. The ESCA module is composed of two parts in series: a spatial attention module and a channel attention module.

Figure 8. Experimental results on the Massachusetts Buildings Dataset. (a) Original images. (b) Image labels. (c) U-Net results. (d) Results of our model.

Figure 9. Experimental results on the WHU Building Dataset. (a) Original images. (b) Image labels. (c) U-Net++ results. (d) Results of our model.

Figure 10. Experimental results on the World-Cover Dataset. (a) Original images. (b) Image labels. (c) Results of the baseline model. (d) Results of our model.

Figure 11. Experimental results on the Gaofen 16 m validation set.

Table 1. Evaluation results on the Massachusetts Buildings Dataset.

Table 2. Evaluation results on the WHU Building Dataset. The symbol "-" means that a corresponding result is not given for this method in its associated paper.

Table 3. Ablation results on the World-Cover Dataset.