Boundary-Aware Refined Network for Automatic Building Extraction in Very High-Resolution Urban Aerial Images

Abstract: Convolutional Neural Networks (CNNs), such as U-Net, have shown competitive performance in the automatic extraction of buildings from Very High-Resolution (VHR) aerial images. However, due to the unstable multi-scale context aggregation, the insufficient combination of multi-level features and the lack of consideration of the semantic boundary, most existing CNNs produce incomplete segmentation for large-scale buildings and result in predictions with huge uncertainty at building boundaries. This paper presents a novel network with a special boundary-aware loss embedded, called the Boundary-Aware Refined Network (BARNet), to address these issues. The unique properties of the proposed BARNet are the gated-attention refined fusion unit, the denser atrous spatial pyramid pooling module, and the boundary-aware loss. The performance of the BARNet is tested on two popular data sets that include various urban scenes and diverse patterns of buildings. Experimental results demonstrate that the proposed method outperforms several state-of-the-art approaches in both visual interpretation and quantitative evaluations.


Introduction
Automatic building extraction from VHR aerial images has been a hot topic in the field of photogrammetry and remote sensing for decades. The end product is of paramount importance for various applications such as urban planning, regional administration [1,2] and disaster management [3]. However, the heterogeneous spectral and structural characteristics among different buildings coupled with the highly complex urban scene pose enormous challenges to extract buildings precisely from VHR aerial images in an automatic fashion. Therefore, developing an advanced method for automated building extraction is essential and urgently needed.
The existing building extraction methods can be divided into three main categories based on different data sources, including optical image based [4], non-optical image based [5,6] and data fusion based approaches [7,8]. Furthermore, in terms of the adopted algorithms, these methods can be categorized into two major groups: non-learning based and learning based.
For non-learning based methods, buildings are extracted by: (1) thresholding buildings using specific characteristics such as spectral [9], shadow [10] and texture [11]; and (2) detecting building edges [12]. For learning based approaches, supervised classification methods, such as Support Vector Machine (SVM) [13], are applied to acquire building extraction maps at the pixel level or the object level [14,15]. However, it is rather difficult for these conventional methods to achieve truly automatic building extraction, particularly when handling complex VHR aerial images, because the empirically designed features vary across different building structures, imaging conditions and roof materials. Recently, with the rapid development of deep learning methods [16–18], a significant breakthrough has been made in mapping buildings using CNNs [19,20]. CNNs have the potential to address the challenge by closing the semantic gap between different semantic levels, and the feature representation can be learned autonomously and hierarchically from the data itself. Simultaneously, remote sensing has entered the big data era, with massive amounts of aerial images being captured, providing the fuel for deep learning methods to learn automatic building extraction. Thus, the latest research has paid much attention to exploiting CNNs for automatic building extraction. In general, given massive training data with high-quality annotations, the performance of CNN based approaches is superior to the other learning based methods in terms of generalization and precision. As a result, CNN based learning methods are widely utilized and represent an exciting program of research [21–25].
Building extraction aims to estimate a mask, where each pixel represents a specific category (i.e., building or non-building). Based on the Fully-Convolutional Network (FCN), previous works have achieved great success as reflected by numerous variants of FCNs, such as the encoder-decoder based U-Net [26] and SegNet [27]. In previous methods, the encoder-decoder structure [21,23] and some off-the-shelf context modelling approaches [22,25] were adopted to perform building extraction without considering the multi-scale problem and the boundary segmentation accuracy. There were some early attempts to tackle the multi-scale problem [28], but they are not yet adequate to cope with the scale variations in buildings and often involve insufficient multi-scale context aggregation, which leads to incomplete segmentation, particularly for large-scale building extraction. In addition, a standard FCN treats each pixel equally. Boundary estimation is extremely challenging since the spatial details are lost during the down-sampling process. As a consequence, the boundary accuracy of the building mask is limited. To support these statements, segmentation error maps produced by U-Net, PSPNet [29], DeepLab-v3+ [30] and MA-FCN [23] are shown in Figure 1. All four state-of-the-art methods exhibit errors, with missed holes in the large buildings and inconsistent boundary segmentation. Therefore, a novel method needs to be developed to address the issues mentioned above and further enhance the performance.

Figure 1. Examples of segmentation error maps for several existing state-of-the-art methods on the WHU aerial building dataset [21]. Column (a), original images. Column (b), reference labels. Columns (c-f), results obtained by U-Net, PSPNet, DeepLab-v3+ and MA-FCN, respectively. In (b-f), green, white and blue indicate building pixels, non-building pixels and misidentified building pixels, respectively.
Based on U-Net and DeepLab-v3+, the BARNet is proposed in this paper, with the aim of refining the accuracy of building extraction, particularly for large buildings and boundary regions. Different from other studies, the BARNet is novel in that it learns the boundary structure information in an end-to-end manner. The performance of our method is compared comprehensively with several state-of-the-art methods through extensive experiments. The three major contributions of this work are summarized as follows: (1) we develop the Gated-Attention Refined Fusion Unit (GARFU), which realizes a better fusion of cross-level features in the skip connection; (2) we propose a Denser Atrous Spatial Pyramid Pooling (DASPP) module to capture dense multi-scale building features; and (3) we design a boundary-enhanced loss that allows the model to pay attention to the boundary pixels.
The remainder of this article is organized as follows. Section 2 reviews the relevant works for this study. Section 3 presents the proposed method. Experiments and results are provided in Section 4. Section 5 gives some discussion of this work, and this paper is concluded in Section 6.

Related Work
In this section, we briefly review the relevant works of this study, i.e., CNNs for semantic segmentation in Section 2.1, multi-level feature fusion in Section 2.2, aggregation of the multi-scale context in Section 2.3 and boundary refinement in Section 2.4.

CNNs for Semantic Segmentation
Following the pioneering study of FCN [31], numerous FCN schemes have been put forward in the semantic segmentation domain, including dilated FCN, encoder-decoder and multi-path. For dilated FCNs, the last convolution layers of the backbone network (e.g., ResNet [16]) were usually replaced with dilated convolutions for maintaining the resolution of feature maps, and the transposed convolution and bilinear interpolation were embedded after the backbone network as segmentation heads. The encoder-decoder structure derived from U-Net [26] and SegNet [27] is composed of two parts: encoder and decoder. Because the low-level features containing rich details can be reused through multiple skip connections, this scheme enables the network to better restore the spatial details. The multi-path architecture often has multiple paths, such as the Bilateral Segmentation Network (BiSeNet) [32], which has a spatial and a context path. The highlight of this structure is that it is capable of constructing a lightweight network.

Multi-Level Feature Fusion
In general, there are rich spatial details such as edges in the high-resolution feature maps from shallow layers. By contrast, the abstract global representation is learned while the spatial details are lost with the successive convolution and down-sampling operations. Maintaining the resolution of feature maps is the key to semantic segmentation, while it is laborious for a standard CNN to balance the high-resolution and abstract semantic representation. Accordingly, it is natural to combine the high-level features from the deep layers and the high-resolution features from the shallow layers to exploit their complementarity. The FCN combines the low-level features and high-level features via an element-wise addition operation. U-Net employs channel-wise concatenation in the lateral short-cuts between the encoder and decoder to reuse the low-level features. Based on the structure of U-Net, the Feature Pyramid Network (FPN) [28] integrates the low- and high-level features at each level to make predictions. These approaches fuse the features of different levels directly, without regard to how useful each feature is, which limits the propagation of beneficial features.

Aggregation of the Multi-Scale Context
The context corresponds to the receptive field (RF) size of the CNN. Small objects need a small RF size, and vice versa. Owing to the fixed RF size, it is difficult for a CNN to suit objects of diverse sizes. A common way to handle this issue is to append a well-designed sub-module after the backbone network for modelling the multi-scale context. PSPNet [29] introduces the Pyramid Pooling Module (PPM) to capture the multi-scale information. Based on the dilated convolution, the DeepLab series [30,33] proposes Atrous Spatial Pyramid Pooling (ASPP) to obtain the multi-scale context. DenseASPP [34] combines a dense skip connection with ASPP, which effectively enlarges the receptive field size of the network. Recently, inspired by the success of the attention mechanism in natural language processing, the self-attention mechanism has also been applied to aggregate the dense pixel-wise context [18,35–37]. The major drawback of self-attention is its excessive computation and memory consumption.

Boundary Refinement
Due to the inevitable degradation of spatial details caused by the down-sampling process, the boundary accuracy of the segmentation mask is usually limited for most existing CNN based methods. To resolve this problem, painstaking efforts have been made. One way is to employ many post-processing operations with high computational costs such as the Dense Conditional Random Field (DenseCRF) [38]. Another way relies on combining the boundary prior information and the extra sub-network that is responsible for detecting edges [39]. For example, Gated-SCNN [40] refines the boundary predictions by exploiting the duality between semantic segmentation and boundary segmentation with two branches and a regularizer. Evidently, this way increases the complexity and the parameter amount of the model. The others focus on exploiting the online hard example mining strategy [29,37] and perceptual loss [41], which both require careful re-training or fine-tuning of the hyperparameters.

Methodology
In this section, the proposed method is presented in detail. We first overview the architecture of the proposed BARNet briefly. Then, the proposed gated-attention refined fusion unit, denser atrous spatial pyramid pooling module, boundary-aware loss and training loss are elaborated.

Model Overview
The BARNet takes VHR aerial images as the input and performs pixel-level building extraction in an end-to-end manner. As illustrated in Figure 2, the BARNet is a standard encoder-decoder structure composed of three parts: encoder, context aggregation module and decoder. ResNet-101 [16] is adopted as the backbone network to encode the basic features of buildings. The fully-connected layer and the last global average pooling layer in ResNet-101 are removed. To retain more details in the final feature map, for the last stage of ResNet-101, all 3 × 3 convolutions are replaced with dilated convolutions with dilation rates of {1, 2, 4}, and the stride in the down-sampling module is set to 1. After the encoder, DASPP is appended to capture the dense global context semantics. Before the low-level features are passed to the decoder, each low-level feature map from the encoder is reduced to 256 channels with a 3 × 3 convolution layer followed by Batch Normalization (BN) and ReLU layers, to lower the computational cost. The reduced low-level feature map and the corresponding high-level feature map from the decoder are fused in the GARFU. Then, the fused features are fed into the decoder block to restore the details. Each decoder block in the BARNet is equipped with two cascaded Conv-BN-ReLU blocks, the same as U-Net and DeepLab-v3+. At the end of the BARNet, a 1 × 1 convolution layer and a softmax layer are applied to output the final predictions. To recover the size of the original input, the final predictions are further up-sampled by a factor of 4 using bilinear interpolation.

Gated-Attention Refined Fusion Unit
A key feature of U-Net and FPN is that they integrate the cross-level features via simple concatenation and addition operations. Given a low-level feature map $E_i \in \mathbb{R}^{C_i \times H_i \times W_i}$ and a high-level feature map $D_i \in \mathbb{R}^{C_i \times H_i \times W_i}$, in which $i$, $C$, $H$ and $W$ denote the level order, number of channels and height and width of the feature map, respectively, the two basic fusion strategies can be formulated as:

$F_i = \mathrm{concat}(\mathrm{Upsample}(D_i), E_i)$ (1)

$F_i = \mathrm{Upsample}(D_i) + E_i$ (2)

where concat(.) denotes the concatenation operation, Upsample(.) means the up-sampling operation and $F_i$ represents the fused feature map. It can be observed from the above two equations that all feature maps are fused directly without considering the contribution of each feature. As described earlier in Section 2.2, different level features are complementary, which is beneficial for building extraction. However, to the best of our knowledge, the informative features in a feature map are mixed with massive redundant information [42]. As a result, the cross-level features should be re-calibrated before combining them for the best exploitation of the beneficial features. To achieve this goal, we developed the GARFU. As presented in Figure 2, the GARFU is embedded at each short-cut, which only incurs a small extra computational cost.

The GARFU consists of two major components: the channel-attention module and the spatial gate module. As displayed in Figure 3, the low-level feature map $E_i$ is first re-calibrated by a channel-wise attention vector $\alpha_i \in \mathbb{R}^{C_i \times 1 \times 1}$. Then, $E_i$ and $D_i$ are re-calibrated by multiplication with a gate map $\beta_i$. In terms of two operators, $f_C(\cdot)$, the multiplication along the channel axis, and $f_S(\cdot)$, the multiplication in the spatial dimension, $E_i$ is re-calibrated as $f_C(E_i, \alpha_i)$, and this result and $D_i$ are then each re-calibrated by $f_S(\cdot, \beta_i)$.

(1) Channel attention: Recent works have demonstrated the effectiveness of modelling the contribution of each channel using the channel attention mechanism [1,43]. Therefore, we adopt the channel attention module reported in SENet [43] to exploit channel-wise useful features in the low-level feature map, because the low-level features usually carry more noise. The attention vector $\alpha_i$ is generated based on the concatenated result of $D_i$ and $E_i$. Before concatenating them, $D_i$ is first up-sampled with bilinear interpolation so that it has the same spatial shape as $E_i$. Then, the concatenated features are passed through a 1 × 1 convolution layer to reduce the number of parameters. Let $x = \mathrm{GAP}(c)$, where $c$ are the concatenated features and GAP(.) is the channel-wise global average pooling operation. The attention vector $\alpha_i$ is obtained by:

$\alpha_i = \sigma_2(W_2\,\sigma_1(W_1 x))$

in which $W_1 \in \mathbb{R}^{\frac{C_i}{2} \times 1 \times 1}$ and $W_2 \in \mathbb{R}^{C_i \times 1 \times 1}$ are two linear transformations, $\sigma_1$ denotes the ReLU activation function and $\sigma_2$ is the sigmoid activation. $\alpha_i$ is multiplied by $E_i$ to enable the network to learn the salient channels that contribute to distinguishing buildings.
(2) Gate: Many studies [26,27,30] suggested that introducing the low-level information can improve the accuracy of predictions on the boundary and details, whereas lacking global semantics may lead to confusion in other regions due to a limited local receptive field size. On the other hand, there exists a semantic gap between low- and high-level features; that is, not all features benefit building extraction. With this motivation, we adopt the gate mechanism to generate a gate map $\beta_i$, which serves as a guide to enhance the informative regions and suppress the useless regions in both low- and high-level features. Gates are widely used in deep neural networks to control the information propagation [44]. For example, the gating units used in recurrent networks such as the LSTM are typical gates [45]. In this work, the gate $\beta_i$ is generated with a linear transform $W$ parameterized with $\mathbb{R}^{C_i \times 1 \times 1}$, a sigmoid activation $\sigma$ that normalizes the values into [0, 1] and a trainable scale factor $\gamma$ that prevents minima from occurring during the initial training. The gate map is learned under the supervision of the ground-truth during training, and each pixel value in it measures the degree of importance of that pixel. The feature at position $(x, y)$ is highlighted when the value of $\beta_i(x, y)$ is large, and vice versa. In this manner, the useless information is suppressed, and only useful features are passed to the following decoder block, thus obtaining better cross-level feature fusion. Different from the self-attention mechanism, the gate is learned with the explicit supervision of the ground-truth.
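The two re-calibration steps of the GARFU can be illustrated with a small NumPy sketch. It is not the authors' implementation: the weight shapes, the random weights, the use of the reduced concatenated features as the gate input, the placement of the scale factor and the final fusion by concatenation are all assumptions made for illustration, and the high-level feature map is assumed to be already up-sampled to the size of the low-level one.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def garfu_sketch(E, D_up, W0, W1, W2, Wg, gamma=1.0):
    """Illustrative gated-attention fusion for one level (hypothetical layout).

    E, D_up : (C, H, W) low-level and already up-sampled high-level features.
    W0      : (C, 2C) 1x1 convolution reducing the concatenation to C channels.
    W1, W2  : (C//2, C) and (C, C//2) squeeze-and-excitation style weights.
    Wg      : (1, C) 1x1 convolution producing a single-channel gate (assumed).
    """
    c = np.concatenate([D_up, E], axis=0)               # (2C, H, W)
    c = np.einsum("oc,chw->ohw", W0, c)                 # 1x1 conv -> (C, H, W)
    x = c.mean(axis=(1, 2))                             # global average pooling -> (C,)
    alpha = sigmoid(W2 @ np.maximum(W1 @ x, 0.0))       # channel attention vector
    gate = gamma * sigmoid(np.einsum("oc,chw->ohw", Wg, c))  # (1, H, W) gate map
    E_hat = E * alpha[:, None, None] * gate             # channel, then spatial re-calibration
    D_hat = D_up * gate                                 # spatial re-calibration only
    fused = np.concatenate([D_hat, E_hat], axis=0)      # assumed fusion by concatenation
    return fused, alpha, gate

rng = np.random.default_rng(0)
C, H, W = 8, 16, 16
E = rng.standard_normal((C, H, W))
D_up = rng.standard_normal((C, H, W))
fused, alpha, gate = garfu_sketch(
    E, D_up,
    W0=rng.standard_normal((C, 2 * C)) * 0.1,
    W1=rng.standard_normal((C // 2, C)) * 0.1,
    W2=rng.standard_normal((C, C // 2)) * 0.1,
    Wg=rng.standard_normal((1, C)) * 0.1,
)
print(fused.shape, alpha.shape, gate.shape)
```

The attention vector scales whole channels of the low-level map, whereas the gate scales every channel at a given pixel by the same factor, which is why a single-channel map is sufficient for the spatial step.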

Denser Atrous Spatial Pyramid Pooling
As is well known, building scale variance frequently occurs in complex urban scenes, resulting in non-unified extraction scales. Thus, an ideal context modelling unit should capture the dense multi-scale features as much as possible. To achieve this goal, a new ASPP module is developed. As it is inspired by ASPP and the main idea is to capture a denser image pyramid feature, we name it denser atrous spatial pyramid pooling. As illustrated in Figure 4, the DASPP module consists of a skip connection, a Cascaded Atrous Spatial Pyramid Block (CASPB) and a global context aggregation block. The skip pathway is only composed of a simple 1 × 1 convolution layer, aiming at reusing the high-level features and accelerating network convergence [16]. In the CASPB, we cascade multiple dilated 3 × 3 depthwise separable convolution [46] layers with different dilation rates and connect them with dense connections. Here, the depthwise separable convolution is utilized for reducing the parameters of DASPP, and its negative effect is almost negligible. The CASPB can be formulated as $L_i = \mathrm{Conv}_{i,d_i}(\mathrm{concat}(L_0, L_1, \ldots, L_{i-1}))$, where $d_i$ represents the dilation rate of the $i$th layer $L_i$, Conv(.) means the convolution operation and concat(.) denotes the concatenation operation. In this study, $d = \{1, 3, 5, 7, 9, 11\}$. Compared to ASPP, this change brings two main benefits: a denser feature pyramid and a larger receptive field. The receptive field sizes in the original ASPP are 13, 25 and 37, respectively, when the output stride of the encoder is 16. However, the max receptive field of each layer in the CASPB is 3, 8, 19, 36, 55 and 78, which is denser and larger than ASPP. This means the CASPB is more robust to building scale variations. In addition, the position-attention module of DANet [36] is also introduced to replace the image pooling branch in ASPP to generate a denser pixel-wise global context representation.
Unlike the global average pooling used in PPM and ASPP, the self-attention can generate a global representation and capture the long-range dependence between pixels. As presented in Figure 4, the position-attention module re-weights each pixel according to its degree of correlation with every other pixel.
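The difference between parallel and cascaded dilated convolutions can be checked with a short calculation. The sketch below assumes the usual ASPP dilation rates of 6, 12 and 18 at an output stride of 16 and the standard stride-1 receptive-field rule, in which each 3 × 3 layer with dilation d adds 2d pixels. The cascaded values depend on how the dense connections and the kernel footprint are counted, so they need not coincide exactly with the figures quoted in the text.

```python
def single_layer_rf(kernel, dilation):
    """Receptive field of one dilated convolution applied directly to the input."""
    return (kernel - 1) * dilation + 1

def cascaded_rf(kernel, dilations):
    """Largest receptive field after each layer of a stride-1 cascade."""
    rf, sizes = 1, []
    for d in dilations:
        rf += (kernel - 1) * d   # each layer widens the field by (k - 1) * d
        sizes.append(rf)
    return sizes

# Parallel branches (ASPP-style): every branch sees the input directly.
aspp = [single_layer_rf(3, d) for d in (6, 12, 18)]
# Cascaded layers (CASPB-style): the fields accumulate from layer to layer.
caspb = cascaded_rf(3, (1, 3, 5, 7, 9, 11))
print(aspp)
print(caspb)
```

Under these assumptions the parallel branches reproduce the ASPP sizes of 13, 25 and 37, while the cascade yields six progressively larger fields, the last of which is roughly twice the largest ASPP field.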

Boundary-Aware Loss
Although the re-calibrated low-level features contribute to refining the segmentation results [30], this is still not sufficient to locate building boundaries accurately. As mentioned earlier, the commonly used per-pixel cross-entropy loss treats each pixel equally. In fact, depicting boundaries is more challenging than locating semantic bodies because of the inevitable spatial detail degradation. Consequently, an individual loss should be applied to force the model to pay more attention to boundary pixels explicitly. The key here is how to decouple the building edges from the final predicted maps. If the corresponding boundary maps are obtained, we can use the binary cross-entropy loss to reinforce the boundary prediction. Herein, the Laplacian operator, defined by Equation (8), is applied both on the final prediction maps and the ground-truths to produce the boundary predictions and the corresponding boundary labels.
$\Delta f = \frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2}$ (8)

where $f$ is a 2D grey-scale image and $x$, $y$ are the two coordinate directions of $f$. The output of Equation (8) is termed the gradient information map, where a higher value indicates a higher probability that a pixel lies on the boundary, and vice versa. We extend the Laplacian operator to process the multidimensional tensor. An instance of using the Laplacian operator to obtain the building boundary is given in Figure 5. With the yielded boundary maps $\hat{B} \in \mathbb{R}^{N \times 1 \times H \times W}$ and boundary labels $B \in \mathbb{R}^{N \times 1 \times H \times W}$, where $N$, $H$ and $W$ are the batch size, image height and image width, respectively, the boundary refinement is defined as a cost minimization problem over $\theta$, the trainable parameters of the BARNet. For every single image, the weighted binary cross-entropy loss [26] is employed to compute the boundary-enhanced loss $L_{be}$, in which the boundary and non-boundary terms are weighted using $Z_+$ and $Z_-$, the numbers of pixels in boundary and non-boundary regions, respectively. The final Boundary-Aware (BA) loss $L_{ba}$ is defined as the addition of two losses, i.e.,

$L_{ba} = L_{ce} + \lambda_{be} L_{be}$

$L_{ce} = -\sum_i w_i\, y_i \log \hat{y}_i$ (12)

in which $\lambda_{be}$ is empirically set to 1 to balance the contribution of the boundary-enhanced loss, $w_i$ is the class weight calculated using the median frequency balance strategy [1] and $\hat{y}_i$ and $y_i$ denote the model predictions and corresponding labels, respectively.
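As an illustration of how boundary labels and a class-balanced boundary loss might be computed, the following NumPy sketch applies a 4-neighbour discrete Laplacian to a binary mask and evaluates a weighted binary cross-entropy. The thresholding of the Laplacian response, the replicated image border and the HED-style weights (each class weighted by the frequency of the other) are assumptions; the text only states that the weights are derived from the boundary and non-boundary pixel counts.

```python
import numpy as np

def laplacian_boundary(mask):
    """Apply the 4-neighbour discrete Laplacian and return a binary edge map."""
    m = np.pad(mask.astype(float), 1, mode="edge")   # replicate the image border
    lap = (m[:-2, 1:-1] + m[2:, 1:-1] + m[1:-1, :-2] + m[1:-1, 2:]
           - 4.0 * m[1:-1, 1:-1])
    return (np.abs(lap) > 0).astype(float)           # assumed thresholding step

def balanced_bce(prob, target, eps=1e-7):
    """Class-balanced binary cross-entropy (HED-style weighting, an assumption)."""
    prob = np.clip(prob, eps, 1.0 - eps)
    z_pos = target.sum()                 # number of boundary pixels
    z_neg = target.size - z_pos          # number of non-boundary pixels
    w_pos = z_neg / target.size          # weight of the boundary term
    w_neg = z_pos / target.size          # weight of the non-boundary term
    loss = -(w_pos * target * np.log(prob)
             + w_neg * (1.0 - target) * np.log(1.0 - prob))
    return loss.sum() / target.size

label = np.zeros((8, 8))
label[2:6, 2:6] = 1.0                    # a 4 x 4 square "building"
edge = laplacian_boundary(label)
print(int(edge.sum()))                   # number of pixels marked as boundary
print(balanced_bce(edge, edge), balanced_bce(1.0 - edge, edge))
```

Because the Laplacian responds on both sides of an edge, the resulting boundary band is two pixels wide: the outermost building pixels and the adjacent background pixels are both marked.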

Training Loss
The cross-entropy loss expressed in Equation (12) is utilized to supervise each learned gate. It should be noted that each gate is up-sampled to the same size as the ground-truth for computing the loss. In addition, in order to facilitate the training process, an auxiliary loss with a weight of 0.4 is set on the output feature map at the third stage of ResNet-101 [29]. Thus, the final total loss of our network is the sum of the BA loss $L_{ba}$, the auxiliary loss weighted by 0.4 and the gate losses, each weighted by a balance parameter $\lambda_i \in \{0.8, 0.6, 0.4\}$. Following the works [29,36], the Online Hard Example Mining (OHEM) strategy [29] is adopted to compute the BA loss $L_{ba}$ during training, to boost the performance of the BARNet.
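The way the individual terms might be combined is sketched below. The function name, the unit weight on the BA loss and the order in which the three weights are assigned to the gates are assumptions; only the weight values themselves come from the text.

```python
def total_loss(l_ba, l_aux, gate_losses,
               gate_weights=(0.8, 0.6, 0.4), aux_weight=0.4):
    """Weighted sum of the BA loss, the auxiliary loss and the gate losses.

    Hypothetical helper: the assignment of weights to gates is assumed.
    """
    if len(gate_losses) != len(gate_weights):
        raise ValueError("one weight is expected per gate loss")
    gate_term = sum(w * g for w, g in zip(gate_weights, gate_losses))
    return l_ba + aux_weight * l_aux + gate_term

# Example with made-up scalar loss values.
print(total_loss(1.0, 0.5, [0.1, 0.2, 0.3]))
```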

Datasets
Two standard open-source VHR aerial datasets were used to verify the effectiveness of the proposed method. All the images were collected in a complex urban scene by airborne sensors and have very high resolution. Two close-ups of the datasets are shown in Figure 6.
WHU aerial building dataset [21]: This dataset consists of more than 22,000 independent buildings extracted from aerial images with a 0.075 m spatial resolution and 450 km² covering Christchurch, New Zealand. The structure and roof materials of these buildings vary in different locations, ranging from low-rise urban residential settlements to homogeneous industry areas. In the previous literature, this dataset is the most popular benchmark for building extraction. Because the original images with a 0.075 m ground resolution are very large, the organizer down-sampled them to a 0.3 m ground resolution and seamlessly cropped them into 8189 tiles with 512 × 512 pixels. The dataset was officially divided into three parts: 4736 tiles for training, 1036 tiles for validation and 2416 tiles for testing.
ISPRS Potsdam dataset: This dataset, consisting of 38 True Orthophoto (TOP) aerial images, is provided by the International Society for Photogrammetry and Remote Sensing (ISPRS) and is widely used to evaluate algorithms for urban remote sensing semantic labelling. The size of each image is 6000 × 6000 pixels. Compared to the WHU dataset, it is more challenging due to the finer spatial resolution of 0.05 m. Dense residential buildings with different shapes and roof materials dominate in this dataset, making it hard to accurately separate buildings from background objects. Following the official suggestion, fourteen images were set as the test set, and the remaining 24 images were randomly split into 7 images for validation and 17 images for training.

Experimental Settings
Experimental configurations: The proposed BARNet was implemented with the PyTorch 1.6 framework in the Ubuntu 20.04 environment. All experiments were conducted on an Nvidia GeForce RTX 2080 Ti GPU with 11 GB of memory.
Training settings: The encoder was initialized with the weights of ResNet-101 trained on ImageNet [47], and the rest was initialized using Kaiming uniform [48]. The Adam algorithm [49], with an initial learning rate of 0.0002 and a weight decay of 0.0005, was selected to optimize the network. A warmup strategy [50] with a base learning rate of 5 × 10⁻⁸ and an exponential learning-rate decay strategy with a decay rate γ = 0.9 were employed to adjust the learning rate for each epoch. We set the warmup period to 5 epochs. During the warmup period, the learning rate was linearly increased.
The number of epochs was set to 100 for the WHU and Potsdam datasets. We set the batch size to 8 to make full use of the GPU memory. Note that all experiments were done using mixed-precision training [51].
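The learning-rate schedule described above can be sketched as a plain function of the epoch index. This is one reading of the settings, under the assumptions that epochs are counted from zero, that the rate rises linearly from the warmup base rate to the initial learning rate over the warmup period, and that the exponential decay starts once the warmup ends; the function name is hypothetical.

```python
def learning_rate(epoch, peak_lr=2e-4, warmup_start=5e-8,
                  warmup_epochs=5, gamma=0.9):
    """Linear warmup followed by per-epoch exponential decay (assumed layout)."""
    if epoch < warmup_epochs:
        frac = epoch / warmup_epochs
        return warmup_start + frac * (peak_lr - warmup_start)
    return peak_lr * gamma ** (epoch - warmup_epochs)

# Learning rate for each of the 100 training epochs.
schedule = [learning_rate(e) for e in range(100)]
print(schedule[0], schedule[5], schedule[99])
```

In a PyTorch training loop, a function of this kind could be passed to a lambda-based scheduler after dividing by the optimizer's initial learning rate, since such schedulers expect a multiplicative factor rather than an absolute rate.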
Dataset settings: For the Potsdam dataset, the images used for training and validation were cropped into 512 × 512 pixels without overlap, and the test images were cropped into 512 × 512 pixels with an overlap of 128 pixels. To reduce the risk of over-fitting, some commonly used data augmentation approaches, including random horizontal-vertical flipping, random cropping with a crop size of 512 × 512, random scaling with a factor chosen from {0.75, 1.0, 1.25, 1.5, 1.75} and random Gaussian smoothing, were applied on each training image.
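The overlapping crop of a test image can be sketched as follows. The helper name is hypothetical, and shifting the last tile back so that it ends exactly at the image border is an assumption, since the handling of the remainder is not described in the text.

```python
def tile_origins(size, tile=512, overlap=128):
    """Top-left coordinates of tiles along one axis of an image."""
    stride = tile - overlap
    if size <= tile:
        return [0]
    origins = list(range(0, size - tile + 1, stride))
    if origins[-1] != size - tile:
        origins.append(size - tile)   # shift the final tile to touch the border
    return origins

# One axis of a 6000 x 6000 Potsdam image with a 128-pixel overlap.
xs = tile_origins(6000)
tiles = [(x, y) for y in xs for x in xs]
print(len(xs), len(tiles), xs[-1])
```

With this scheme every pixel of the image is covered by at least one tile, and predictions in the overlapping strips can be averaged or cropped when the tiles are stitched back together.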

Comparison to State-Of-The-Art Studies
We evaluated the proposed method on the WHU and Potsdam datasets. To reveal whether the proposed method gains an advantage over other recent State-Of-The-Art (SOTA) studies, several remarkable CNNs for semantic segmentation and building extraction were chosen as comparative methods, namely U-Net [26], DeepLab-v3+ [30] and DANet [36]. Moreover, MA-FCN [23], which achieved a very high IoU score of 90.7% on the WHU dataset [23], was also chosen to verify the superiority of our method. Among these methods, DANet is a representative dilated FCN, where the self-attention mechanism was applied to aggregate the holistic context. The robustness of DeepLab-v3+ equipped with a strong decoder head has been proven in previous studies. MA-FCN, one of the variants of U-Net, focuses on the effects of building scale. Except for the mini-batch size and the input size, which were changed to suit the GPU memory used in this study, all the comparative methods were reproduced using the default settings given by the authors.

Visualization Results
The comparisons with different models are elaborated as follows: (1) WHU dataset: The results produced by different methods on the WHU dataset are illustrated in Figure 7. Visually, the proposed BARNet obtained the best global extraction results compared with the other SOTA methods. As displayed in the first row of Figure 7a, a reasonable performance was achieved by U-Net, DeepLab-v3+, DANet and MA-FCN in a simple scene. However, as the complexity of the scene and of the building structures increases, their performance drops dramatically, as can be observed in the second and third rows of Figure 7, where parts of buildings are missed. This implies that these methods have difficulty in accurately recognizing buildings with irregular structures and large scales, which seriously degrades the visual quality of the results. Conversely, almost all buildings were identified correctly and completely by the BARNet, and there were only a few errors according to the results presented in Figure 7g. This is mainly because our model can better aggregate contextual information. Compared with U-Net, PSPNet and MA-FCN, DeepLab-v3+ could maintain the completeness of the final predictions for relatively large buildings to some extent, whereas it also failed to handle the large buildings with complex shapes. A striking illustration of this can be seen in the third row of Figure 7d, where a high miss rate occurred. Even though MA-FCN is an improved version of U-Net, its performance is similar to that of U-Net and is sensitive to the variation in building scale and structure. For a more detailed inspection, close-ups of the selected regions (marked with yellow rectangles in Figure 7) in the test images are displayed in Figure 8. From the close-up views, we can observe that these SOTA methods exhibited a limited ability to separate the confusing non-building objects adjacent to buildings, yielding inaccurate boundary predictions. 
Nevertheless, owing to reinforcing the boundary and refining the multi-level feature fusion, our method performed well by generating only a small number of misclassified pixels in boundary regions. (2) Potsdam dataset: Figure 9 provides the experimental results on the Potsdam dataset for different methods. According to the results displayed in Figure 9, the BARNet always obtained the most consistent results with the reference building maps visually. Compared with other methods, our model is robust to cope with buildings in different complex scenes. In contrast, the performance of the other methods looks unstable. A striking illustration of close-up views for different methods is given in Figure 10, from which we can see that there are only a few errors in Figure 10g. Among the SOTA methods, U-Net produced building extraction results with incomplete predictions, suggesting that it has a poor ability to aggregate multi-scale contextual information. Meanwhile, the boundary location is not precise enough. Similar behaviours are observed in Figures 9 and 10 for DANet and DeepLab-v3+. Although DANet and DeepLab-v3+ are both equipped with a strong context modelling module, they still suffer an incomplete detection for large buildings. Another prominent problem for DANet and DeepLab-v3+ is that there exists an evident grid effect (see Column (d) and Column (e) in Figure 9) resulting from the atrous convolution, bringing unstable performance. Benefiting from the attention on the aggregation of the multi-scale representation, MA-FCN could extract most of the buildings stably, but its performance was not good enough. It can be clearly observed in Figure 10f that some buildings were not detected completely, and many non-building pixels were misclassified as buildings. 
For boundary refinement, as illustrated in the last row of Figure 10, the other four methods failed to accurately separate the cement square from the buildings that have similar characteristics to it, resulting in over-extracting.
According to the above analysis, we can conclude that the improvements of our method lie in recognizing the multi-scale buildings, especially the buildings with large scale, and locating building boundaries more accurately among different scenes, which demonstrates the effectiveness of the proposed method for automatic building extraction in urban VHR aerial images.

Quantitative Comparisons
To objectively assess the performance of the proposed method, following the work in [1,21,23], four commonly used metrics, i.e., precision, recall, $F_1$ score and Intersection over Union (IoU), were adopted in the following experiments. These metrics are expressed as:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN},$$

$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad \mathrm{IoU} = \frac{TP}{TP + FP + FN},$$

where TP (true-positive) is the number of correctly identified building pixels, FP (false-positive) is the number of non-building pixels misclassified as buildings, TN (true-negative) is the number of correctly classified non-building pixels, and FN (false-negative) is the number of building pixels that were not detected. Precision reports the ratio of TP to all positive predictions, and recall assesses the proportion of TP over all building pixels in the ground-truth. The $F_1$ score is the harmonic mean of precision and recall and therefore takes both into consideration. The IoU measures the overlap between the predicted and ground-truth building pixels, i.e., the ratio of their intersection to their union.
The quantitative comparison of the results of the different methods is listed in Table 1, where the best entries are in bold. According to Table 1, the BARNet outperformed the other SOTA methods. For the WHU dataset, the proposed method performs favourably against the SOTA methods in terms of all metrics. Compared with the second-best method (i.e., MA-FCN), the BARNet improved the IoU from an already very high score (90.70%) to a new highest score of 91.51%. In particular, the precision of the BARNet is 2.01 points higher than that of MA-FCN, which is a near-perfect performance on this dataset. The near-saturated performance confirms that the BARNet is capable of extracting buildings in VHR aerial images completely. Even on the challenging Potsdam dataset, where the images have a very high resolution, the BARNet also achieved a remarkable performance with a precision score of 98.64%, an $F_1$ score of 96.84% and an IoU score of 92.24%.
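As a worked illustration of the four metrics, the following minimal sketch computes them from a pair of binary masks using the standard confusion-matrix definitions. The function name `building_metrics` and the toy masks are hypothetical and are not taken from the authors' code. How the scores are aggregated over a whole dataset is not specified here, so the sketch handles a single image pair only.

```python
import numpy as np

def building_metrics(pred, ref):
    """Return (precision, recall, F1, IoU) for two binary building masks."""
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    tp = int(np.sum(pred & ref))    # building pixels correctly predicted
    fp = int(np.sum(pred & ~ref))   # non-building pixels predicted as building
    fn = int(np.sum(~pred & ref))   # building pixels that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1, iou

# Toy example: a 2 x 2 building, one missed pixel and one false alarm.
ref = np.array([[1, 1, 0, 0],
                [1, 1, 0, 0],
                [0, 0, 0, 0]])
pred = np.array([[1, 1, 1, 0],
                 [1, 0, 0, 0],
                 [0, 0, 0, 0]])
precision, recall, f1, iou = building_metrics(pred, ref)
print(precision, recall, f1, iou)
```

In this toy case TP = 3, FP = 1 and FN = 1, so precision and recall are both 3/4 and the IoU is 3/5. TN does not enter any of the four metrics.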
Here, the recall score of the BARNet is lower than that of the method reported by Wang et al. [1] because the additional normalized Digital Surface Model (nDSM) data were not utilized to enhance the model performance. Using only single R-G-B data, the proposed method improved the precision, $F_1$ and IoU by 2.94%, 1.14% and 1.68%, respectively, compared against the second-best metrics. The improvements indicate that the BARNet is robust enough to cope with building extraction in VHR aerial images with complex urban scenes.
To quantify the boundary segmentation quality of the different methods, the results of trimap boundary experiments performed on the WHU dataset are also reported, as illustrated in Figure 11. Notably, we did not conduct the trimap experiment on the Potsdam dataset because of the evident texture distortion in building boundary regions of the tested images, which means that the ground-truth does not precisely correspond to the actual building boundary. Specifically, as presented in Figure 11a, the eroded and dilated band along the boundary with a given width (in pixels) is called a trimap. We utilized the morphological dilation and erosion operations to generate boundary trimaps with widths of {2, 4, 6, 8, 10, 12}. After obtaining the predicted trimap and the reference trimap, we computed the mean IoU between them. The higher the mean IoU, the better the boundary segmentation. As shown in Figure 11, when the bandwidth of the trimap is lower than eight pixels, the comparative methods exhibited poor performance, suggesting that a large number of boundary pixels were misclassified. In contrast, it can be clearly observed that our method performed notably better in refining the boundary extraction, which verifies the positive effect of the GARFU and BA loss on enhancing model performance.
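The trimap evaluation described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the helper names (`dilate`, `erode`, `boundary_band`, `band_iou`) are hypothetical, a 3 × 3 structuring element is assumed, a band of width w is assumed to be built from w/2 dilation and w/2 erosion steps, and the "mean" is taken here as a simple average over the tested widths.

```python
import numpy as np

def dilate(mask, iterations=1):
    """Binary dilation with a 3x3 square structuring element."""
    out = np.asarray(mask, dtype=bool)
    h, w = out.shape
    for _ in range(iterations):
        padded = np.pad(out, 1, mode="constant", constant_values=False)
        grown = np.zeros_like(out)
        for dy in range(3):
            for dx in range(3):
                grown |= padded[dy:dy + h, dx:dx + w]
        out = grown
    return out

def erode(mask, iterations=1):
    """Binary erosion as the dual of dilation (image border is not eroded)."""
    return ~dilate(~np.asarray(mask, dtype=bool), iterations)

def boundary_band(mask, width):
    """Band of about `width` pixels straddling the mask boundary."""
    r = max(1, width // 2)  # assumption: half of the band lies on each side
    return dilate(mask, r) & ~erode(mask, r)

def band_iou(pred, ref, width):
    """IoU between the predicted and the reference boundary bands."""
    bp, br = boundary_band(pred, width), boundary_band(ref, width)
    union = np.sum(bp | br)
    return float(np.sum(bp & br) / union) if union else 1.0

# Toy example: a 6 x 6 building and a prediction shifted by one column.
ref = np.zeros((12, 12), dtype=bool)
ref[3:9, 3:9] = True
pred = np.zeros((12, 12), dtype=bool)
pred[3:9, 4:10] = True

widths = [2, 4, 6, 8, 10, 12]
scores = [band_iou(pred, ref, w) for w in widths]
print(dict(zip(widths, scores)), sum(scores) / len(scores))
```

Narrow bands penalize small boundary shifts most strongly, which is why the comparison at small widths is the most informative one for boundary quality.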

Ablation Studies
The ablation experiments included two parts: (a) a further investigation of the contribution of each sub-module introduced in Section 3 and of the improvement strategies adopted for training and inference; (b) several quantitative comparisons of the GARFU, the DASPP, the BA loss and the boundary-enhanced loss with corresponding SOTA methods. Unless otherwise stated, all the experiments involved were conducted under the same conditions and settings for fairness.

Network Design Evaluation
To verify the design of the BARNet, U-Net was chosen as the baseline model, and the IoU was adopted to assess the effectiveness quantitatively. The detailed evaluation results are summarized in Table 2. We first replaced the encoder part of the baseline with ResNet-101. Meanwhile, all the encoder feature maps were reduced to 256 channels to keep them consistent with the BARNet, which differs from the baseline. Owing to the strong feature encoding ability of ResNet-101, these changes improved the IoU by 1.49%. Inserting the GARFU module in the lateral skip connections improved the IoU by a further 0.72%, implying that selectively fusing cross-level features is better than direct concatenation. Adding the DASPP module to capture the multi-scale context brought a significant IoU improvement of 1.12%, indicating that the proposed DASPP handles the critical scale issue for building objects in VHR aerial images robustly. As expected, the BA loss notably boosted the performance, with an IoU increment of 0.53% compared with the commonly used cross-entropy loss. Additionally, the OHEM strategy further improved the IoU by 0.53%. With the help of MS inference, our model achieved a 91.51% IoU, outperforming the previous SOTA model, MA-FCN, which achieved a 90.7% IoU on the WHU dataset with multi-model voting and refined overlap strategies.
To test the proposed GARFU, the BARNet served as the baseline model. We replaced the GARFU with the addition operation and the concatenation operation, respectively. As described in Section 5.1.1, all the feature maps from the encoder were also reduced to 256 channels using 3 × 3 convolutions, and the other components remained unchanged. Table 3 shows the comparison results, from which we can observe that, without the GARFU, the IoU was reduced by 0.89% and 0.78%, respectively.
The decreased performance of addition and concatenation can be attributed to their ignoring the semantic gap between high- and low-level features. We rethought the contribution of features at different levels and bridged this gap. The GARFU adaptively harvests and fuses the useful information by learning a spatially gated map and a channel attention vector from two adjacent high- and low-level features. This simple yet efficient fusion strategy helps to make full use of features at different levels for building extraction.
The idea of DASPP is to make the network more stable in coping with buildings in VHR remote sensing images of complex scenes by capturing a denser multi-scale semantic context. It has been shown that appending an additional module after the backbone network to enlarge its receptive field is important for semantic segmentation. We compared the IoU performance of DASPP with several well-verified context modelling approaches, i.e., the Pyramid Pooling Module (PPM) in PSPNet [29], ASPP in DeepLab-v3+ [30], self-attention in the non-local net [35], and dual-attention in DANet [36]. Since complexity is also a key factor for a context modelling method, it is reported as well. The experimental results are summarized in Table 4, from which we can see that DASPP outperforms the other multi-scale context aggregation schemes in terms of both the IoU and Floating-Point Operations (FLOPs). Self-attention based methods can establish dense pixel-wise relations and achieve a reasonable IoU, but their practical application is restricted by the high computational cost. Although ASPP has the most trainable parameters, its performance is insufficient. In contrast, DASPP achieves the best performance while maintaining efficiency, which demonstrates its robustness in coping with building scale variation.
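To make the difference between plain addition and a gated, attention-weighted fusion concrete, the following toy sketch fuses two feature maps with a spatial gate and a channel attention vector. It is only an illustration of the general idea stated in the text, not the GARFU architecture: the function `gated_fusion`, the weights `w_gate` and `w_chan`, and the specific way the gate and attention are combined are all assumptions, and the high-level features are assumed to be already upsampled to the size of the low-level features.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(low, high, w_gate, w_chan):
    """Fuse low- and high-level features of shape (C, H, W).

    w_gate: (2C,) weights of a hypothetical 1x1 convolution giving a spatial gate.
    w_chan: (C, 2C) weights of a hypothetical linear layer giving channel attention.
    """
    stacked = np.concatenate([low, high], axis=0)           # (2C, H, W)
    gate = sigmoid(np.tensordot(w_gate, stacked, axes=1))   # (H, W), values in (0, 1)
    pooled = stacked.mean(axis=(1, 2))                      # (2C,), global average pooling
    attn = sigmoid(w_chan @ pooled)                         # (C,), channel attention
    low_selected = attn[:, None, None] * low                # re-weight low-level channels
    # The gate decides, per pixel, how much low-level detail is mixed in.
    return gate * low_selected + (1.0 - gate) * high

rng = np.random.default_rng(0)
C, H, W = 4, 5, 5
low = rng.normal(size=(C, H, W))
high = rng.normal(size=(C, H, W))

fused_add = low + high                                      # plain addition
fused_cat = np.concatenate([low, high], axis=0)             # plain concatenation
fused_gated = gated_fusion(low, high,
                           rng.normal(size=2 * C),
                           rng.normal(size=(C, 2 * C)))
print(fused_add.shape, fused_cat.shape, fused_gated.shape)
```

Addition and concatenation treat every pixel and channel of both inputs identically, whereas the gated variant lets learned weights suppress low-level responses where they are unhelpful. In a real network the weights would be trained rather than drawn at random.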
The traditional cross-entropy loss function only considers the pixel-level similarities between predictions and labels, so it does not distinguish between boundary pixels and non-boundary pixels. In this study, we developed the Boundary-Enhanced (BE) loss to strengthen boundary segmentation explicitly. Since its importance was demonstrated in Section 5.1.1, we only compared the BE loss with the well-verified conventional DenseCRF introduced in DeepLab-v1 [38]. We fine-tuned the hyperparameters of the DenseCRF to yield the optimal results. According to the results listed in Table 5, the proposed BE loss outperforms the DenseCRF regardless of the baseline model in which it is embedded. The DenseCRF only brings a slight improvement in the IoU. Moreover, in the experiments, we observed that embedding the DenseCRF comes with a high computational cost. The comparisons verify that the generality and robustness of the BE loss are better than those of the SOTA DenseCRF.
Table 5. Comparison of the Boundary-Enhanced (BE) loss and the Dense Conditional Random Field (DenseCRF), where the best is in bold. We report the results obtained on the WHU dataset. A mark in a column means that the corresponding method is adopted. Note that OHEM and MS are not used for this comparison.

Limitations and Future Works
Although the proposed approach has narrowed the gap left by most existing methods, whose performance is insufficient for recognizing large-scale buildings and locating building boundaries accurately, some inherent problems remain and should not be ignored. First, the total number of trainable parameters in the BARNet is 67.49 M, which is much larger than that of some medium-scale networks, such as U-Net (about 28.95 M). In addition, our method is relatively slow. It took about nine hours for training and about three minutes for inference on the WHU dataset, making the proposed method impractical to deploy on mobile platforms such as a UAV. Thus, future work should pay attention to achieving real-time extraction. Second, in the experiments, we found that the extraction of small buildings was insufficient (54.29% IoU on the WHU dataset for small buildings with an area of less than 2000 pixels). Therefore, a stronger high-resolution network should be developed to handle this problem. Last, we observed that there are many latent labelling errors in several existing open-source benchmarks, making it quite hard to improve the performance further on these datasets. For this reason, semi-supervised learning and few-shot learning deserve attention as ways to reduce the dependence on massive high-quality labelled data.

Conclusions
Even though tremendous efforts have been made in automatic building extraction from VHR images using CNNs, extracting large-scale buildings completely and locating building boundaries precisely remain challenging because of the limited multi-scale context and the lack of boundary consideration. Motivated by these issues, the BARNet is proposed. Within the BARNet, the GARFU is introduced to make full use of multi-level features by re-calibrating the information contribution in the channel and the spatial dimensions. Besides, DASPP is developed to encode the multi-scale context better. In particular, the BE loss is embedded into the network to force the model to pay attention to the boundary. Comprehensive experiments performed on the WHU and ISPRS benchmarks indicate that the BARNet is well suited to building extraction in VHR aerial remotely sensed images over complex urban scenes. Compared with several SOTA models, our method consistently exhibits the best performance with the highest accuracy. A lightweight network and semi-supervised learning will be developed to improve the computational efficiency and extraction accuracy in our future research.