Building Extraction from Very High Resolution Aerial Imagery Using Joint Attention Deep Neural Network

Abstract: Automated methods to extract buildings from very high resolution (VHR) remote sensing data have many applications in a wide range of fields. Many convolutional neural network (CNN) based methods have been proposed and have achieved significant advances in the building extraction task. To refine predictions, many recent approaches fuse features from earlier layers of CNNs to introduce abundant spatial information, a strategy known as skip connection. However, reusing earlier features directly without processing could reduce the performance of the network. To address this problem, we propose a novel fully convolutional network (FCN) that adopts attention-based re-weighting to extract buildings from aerial imagery. Specifically, we consider the semantic gap between features from different stages and leverage the attention mechanism to bridge the gap prior to the fusion of features. The inferred attention weights along the spatial and channel-wise dimensions make the low-level feature maps adaptive to the high-level feature maps in a target-oriented manner. Experimental results on three publicly available aerial imagery datasets show that the proposed model (RFA-UNet) achieves comparable or improved performance relative to other state-of-the-art models for building extraction.


Introduction
Automatic extraction of buildings from remote sensing imagery is of paramount importance in many application areas such as urban planning, population estimation, and disaster response [1]. Assigning a semantic building class label to each pixel in very high resolution (VHR) imagery of urban areas is a challenging task because of high intra-class and low inter-class variability [2,3]. In high resolution images, the building category contains man-made objects of many different sizes, and the amount of clutter increases, e.g., the shadows of tall buildings and the similarity of rooftops to some roads. As a result, it is difficult to label buildings reliably and accurately.
We have witnessed a rapid, revolutionary change in computer vision research, mainly driven by convolutional neural networks (CNNs) [4] and the availability of large-scale training data [5]. Recently, several CNN-based semantic segmentation methods have been used for building extraction from earth observation images [6][7][8]. Patch-based CNN methods [9][10][11][12][13] were initially adopted for prediction in dense urban areas. These patch-based CNNs label the center pixel by processing an image patch through a neural network. They tend to be computationally expensive and are usually used to detect large objects [14,15]. Since Long et al. [16] adapted the classification network into a fully convolutional network (FCN) for semantic segmentation, the FCN and its extensions have gradually become the preferred solution in the field of semantic labeling [17][18][19][20]. Although FCN-based methods can produce dense pixel-wise output directly, the pixel-wise classification derived from the final score map is quite coarse because of the sequential sub-sampling operations in the FCN.
To address the problem of coarse predictions, recent research [21][22][23][24][25][26] has further improved FCN-based methods for semantic labeling of remote sensing images. A growing number of studies [27][28][29][30][31] employ the encoder-decoder architecture with skip connections. UNet [32], a typical encoder-decoder model, reuses low-level information to refine the output, which results in better performance. To obtain accurate labeling of VHR images, an effective structure is needed to integrate the high-resolution, low-level features with the low-resolution, high-level features. The skip connection fuses features to compensate for the loss of spatial information caused by repeated local operations (e.g., pooling and strided convolution). Features passed via skip connections are multi-scale in nature because of the increasingly large receptive field sizes [33]. However, most existing approaches, which are built on top of a contemporary classification network, are good at aggregating global context. While the reuse of information from early encoding layers contributes to localization in the decoding phase, it may introduce redundant information, which results in over-segmentation [34] and unexpectedly ambiguous representations [35,36]. Specifically, the low-level features in the encoder are computed in the shallow layers of the network, while the high-level features in the decoder are computed in the deep layers. It is therefore reasonable to assume that the latter have undergone more processing and that there is a semantic gap between the encoder and decoder features. For example, a deep layer in the decoding stage may confidently discriminate whether a gray pixel belongs to 'asphalt roads' or 'rooftops', because more global context is passed through the long path from the low layers to the high layers. In contrast, the signals from the symmetric early layer have a different level of discrimination that is specific to the primary class 'impervious surface' and therefore express confidence in both subclasses. As a result, integrating these features directly through the skip connection may decrease prediction accuracy. Recent research has shown that directly fusing semantically dissimilar features from the encoder and decoder subnetworks can degrade segmentation performance [37]. Thus, it is important to bridge the semantic gap between the encoder and decoder features prior to fusion.
In recent years, several researchers have begun to apply attention mechanisms to CNNs. Initially, attention in CNNs was used to interpret the gradient of a class output score with respect to the input image [38]. Later, trainable self-attention was deployed for image captioning, image classification, object detection, and image segmentation [39][40][41][42]. A large body of literature exploring different gating architectures has emerged. For instance, Oktay et al. [43] proposed a self-attention gating module that can be utilized in FCN models for medical image segmentation. Zagoruyko et al. [44] improved the performance of a student CNN by transferring the attention maps from a teacher network. Unlike the above works, which used the grid-attention technique to capture spatially salient regions, Hu et al. [45] proposed channel-wise attention to highlight important feature dimensions. Subsequent studies [46][47][48][49] have demonstrated the performance of the channel-wise attention mechanism in the semantic segmentation task. In remote sensing, some attempts [50][51][52] have been made to adopt attention mechanisms for the building extraction task. Yang et al. [52] used a spatial attention module whose weight map is generated by applying a sigmoid function to the deep features. Pan et al. [50] used a generative adversarial network with spatial [34] and channel [45] attention to extract buildings. Although the above attention modules differ in some respects, most of these implementations use self-attention to enhance the representation of single-layer features.
Since attention can model interdependency and adjust the response of a position or a channel in the input feature maps, we exploit it to alleviate the semantic difference between features from different depths in the skip connection. Similar to [39,46], we employ a joint attention module (RFA) in the deep neural network, but our focus is to bridge the gap between hierarchical representations. To this end, we propose an attention re-weighting process that can be integrated into the UNet model for the building extraction task in VHR images. The proposed attention module adaptively emphasizes meaningful features and suppresses insignificant features along both the channel and spatial dimensions, under the guidance of deep features. Benefiting from the global context information captured by joint attention, the semantic information of the high-spatial-resolution but low-level features in the encoder is gradually enriched in a task-oriented direction before fusion. In summary, the contributions of our work are as follows:
(1) We implement a joint spatial and channel-wise attention mechanism to enhance the consistency of features across layers in the U-shaped FCN. Experimental results show that using attention jointly is effective in reducing semantic differences between features.
(2) We integrate the proposed attention module into the existing UNet model and propose an end-to-end method (RFA-UNet) for the building extraction task, which attains comparable and stable performance relative to other state-of-the-art models on three public datasets.
The remainder of this paper is organized as follows. Section 2 introduces the proposed method. The experimental results are presented in Section 3. The method and experiments are discussed in Section 4. Section 5 concludes the paper.

Methods
In this section, we present an end-to-end method (RFA-UNet) based on the common semantic segmentation architecture UNet with a new attention module. Our approach leverages the benefits of a typical segmentation architecture with skip connections. An overview of the proposed architecture is shown in Figure 1. First, input images are progressively filtered (convolution) and downsampled by a factor of 2 (pooling) at each level in the contracting path. Second, features from each encoding stage are filtered by the attention module before the skip connection. Unlike most existing methods, we introduce discriminative information from coarser scales to help generate joint attention maps. After that, the refined features are concatenated with the corresponding decoding features through the skip connection. The rest of this section describes the details of RFA-UNet.

Semantic Labeling Using UNet
In general, the UNet model, which includes an encoder and a decoder, can make dense pixel-wise predictions naturally. UNet constructs the encoder part by applying a classification network. Cascaded convolution and pooling operations downsample the output feature maps at each stage and simultaneously increase the number of feature map channels along the contracting path. To generate results that are the same size as the input image, an expansive path symmetric to the contracting path is adopted as the decoder. The size of the feature maps is recovered gradually, and each upsampling operation is followed by two convolution layers. To yield more precise localization, the network propagates spatial precision from earlier layers in the encoding stage to deep layers in the decoding stage, i.e., through skip connections.
It is well known that a deep network is built upon supervised training on a large dataset such as the ImageNet dataset [53]. In most cases, however, manual labeling of training data is costly, and this is also true when using CNNs for remote sensing classification. UNet has proven useful for segmentation problems with relatively small datasets, e.g., satellite image analysis and medical image analysis [54,55]. For this reason, we choose UNet as the baseline architecture for our study.

Residual Feature Map Attention
The complex structure of different buildings increases the difficulty of determining the building outlines in VHR images. Although the low-level features captured by the earlier layers can help refine the feature maps in the deeper layers via skip connections, they should be employed with caution to avoid introducing inconsistencies across different stages. In this work, we wish to make the low-level feature maps $X_l \in \mathbb{R}^{H \times W \times C_l}$ adaptive to the high-level feature maps $X_h \in \mathbb{R}^{H \times W \times C_h}$ in the skip connection. First, the feature maps are re-weighted by a channel-wise weight vector, so that the network tends to learn the most salient features that contribute to the classification. Then, a spatial attention map $X_S$ shows where the network focuses in order to highlight informative regions, as a complement to the channel attention map $X_C$. Figure 2 illustrates the structure of the residual feature map attention (RFA). The proposed joint attention module is defined in terms of the following quantities: $\alpha$ represents the channel attention weights, $\beta$ represents the spatial attention weights, $f_C(\cdot)$ denotes the multiplication of feature maps and the corresponding weights along the channel dimension, $f_S(\cdot)$ is the pixel-wise multiplication between spatial regions of feature maps and the corresponding weights, and $\oplus$ denotes element-wise addition.
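Read together, the definitions above suggest a three-step composition: channel re-weighting, then spatial re-weighting, then a residual addition. The following is a sketch under that assumption, not a formula confirmed by the text; the output symbol $\hat{X}$ and the choice of $X_l$ as the residual branch are introduced here for illustration only.

```latex
% Hypothetical composition inferred from the symbol definitions;
% \hat{X} and the residual branch X_l are illustrative assumptions.
\begin{aligned}
X_C     &= f_C(X_l, \alpha), \\
X_S     &= f_S(X_C, \beta), \\
\hat{X} &= X_l \oplus X_S .
\end{aligned}
```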
Remote Sens. 2019, 11, x FOR PEER REVIEW
Finally, a residual mapping is used to obtain the output of the attention schemes. This shortcut connection draws on the idea of the residual network [56], making the network easier to optimize.
(1) Channel Attention: Each filter performs pattern detection, and each channel of the feature map is a response activation of the corresponding convolutional filter [39]. In a standard CNN, the importance of each channel is considered to be the same. Thus, applying attention mechanisms to feature channels intuitively distinguishes the features in different stages, and a channel attention map is produced to enhance feature consistency. For channel-wise attention, we first upsample the high-stage features $X_h$ using bilinear interpolation so that they have the same shape (except in the channel dimension) as the low-stage features $X_l$. Then the two sets of features are concatenated to generate channel-wise statistics. The global information is computed by applying global average pooling to the concatenated features. The resulting vector $z$, which can be interpreted as a set of channel descriptors for the image (one scalar per channel after pooling over the $H \times W$ spatial positions), is calculated by
$$z_i = \frac{1}{H \times W} \sum_{h=1}^{H} \sum_{w=1}^{W} [X_l; X_h]_i(h, w)$$
where $[;]$ denotes the concatenation operation and the scalar $z_i$ represents the $i$-th channel descriptor.
To ensure that the module can learn nonlinear interactions between channels, the channel vector $z$ is passed through two fully connected layers. The channel attention vector $\alpha$ is then obtained with a sigmoid activation:
$$\alpha = \sigma_2\left(W_2\, \sigma_1\left(W_1 z\right)\right)$$
where $\sigma_1$ denotes the ReLU function and $\sigma_2$ the sigmoid activation. The two fully connected layers are parameterized by $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$, which reduces the dimension by a ratio $r$ for simplicity, and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$, which restores it.
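The channel branch described above (concatenate, global average pooling, two fully connected layers, sigmoid) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' Keras implementation: the function name `channel_attention`, the random weights, and the assumption that C equals C_l + C_h are all hypothetical. The sketch stops at the weight vector and does not show how it is mapped onto the low-stage features.

```python
import numpy as np

def channel_attention(x_low, x_high_up, w1, w2):
    """Return channel weights alpha for the concatenated features.

    x_low:     (H, W, C_l) low-stage features
    x_high_up: (H, W, C_h) high-stage features, already upsampled to H x W
    w1:        (C // r, C) reduction weights, with C = C_l + C_h (assumed)
    w2:        (C, C // r) restoration weights
    """
    concat = np.concatenate([x_low, x_high_up], axis=-1)  # (H, W, C)
    z = concat.mean(axis=(0, 1))                  # global average pooling, (C,)
    hidden = np.maximum(w1 @ z, 0.0)              # first FC layer + ReLU
    alpha = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # second FC layer + sigmoid
    return alpha

# Toy example with random features and weights (illustrative values only).
rng = np.random.default_rng(0)
H, W, C_l, C_h, r = 8, 8, 4, 4, 2
C = C_l + C_h
x_low = rng.normal(size=(H, W, C_l))
x_high_up = rng.normal(size=(H, W, C_h))
w1 = rng.normal(size=(C // r, C))
w2 = rng.normal(size=(C, C // r))
alpha = channel_attention(x_low, x_high_up, w1, w2)
print(alpha.shape)
```

Each entry of `alpha` lies between 0 and 1, so multiplying a channel by its weight can only attenuate it; the bottleneck of size C / r keeps the number of parameters small.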
(2) Spatial Attention: In a standard CNN, a global image descriptor derived from fully connected layers maps the input into a high-dimensional space in order to make the classes linearly separable [38]. However, using only the global feature representation to classify pixels ignores local spatial structural characteristics that need to be considered for semantic labeling tasks. Although the FCN has made the architecture more suitable for localization, the importance of each pixel location is still considered equal. In general, the identification of a pixel needs to consider its spatial context, and nearby pixels are more related to each other [57]. Therefore, our spatial attention module is designed to pay more attention to the semantic regions. A spatial attention map is produced to emphasize or suppress feature responses at different spatial locations. We reshape the current low-stage features $X_l$ by flattening the height and width to $V_l = [v_1, v_2, \ldots, v_m]$, where $v_i$ represents the pixel-wise vector of length $C$ at the $i$-th location, and $m = W \times H$. Similarly, a gating vector $V_h$ obtained from the upsampled high-level features is introduced as guidance. We employ linear transformations to give the two vectors the same length, i.e., to map them into the same dimensional space. Finally, we add them to generate the spatial attention map $\beta$, formulated as
$$\beta = \sigma_2\left(W_3^{T}\, \sigma_1\left(W_1^{T} V_l + W_2^{T} V_h\right)\right)$$
where $\sigma_1$ denotes the ReLU function and $\sigma_2$ the sigmoid activation, and the linear transformations $W_1 \in \mathbb{R}^{C_l \times \mathrm{int}}$, $W_2 \in \mathbb{R}^{C_g \times \mathrm{int}}$, and $W_3 \in \mathbb{R}^{\mathrm{int} \times 1}$ are computed using channel-wise 1 × 1 convolutions for the two inputs.
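The spatial branch can be sketched in the same way. Because a 1 × 1 convolution is a per-pixel matrix product, plain matrix multiplication over the flattened locations is enough for illustration. This is a simplified sketch under stated assumptions: the function names, the random weights, and the omission of bias terms are choices made here, not details taken from the text.

```python
import numpy as np

def spatial_attention(x_low, x_gate, w1, w2, w3):
    """Return a spatial weight map beta of shape (H, W).

    x_low:  (H, W, C_l) low-stage features
    x_gate: (H, W, C_g) upsampled high-level (gating) features
    w1: (C_l, n_int), w2: (C_g, n_int), w3: (n_int, 1)
    """
    h, w, _ = x_low.shape
    v_low = x_low.reshape(h * w, -1)    # m = H * W location vectors
    v_gate = x_gate.reshape(h * w, -1)
    # Project both inputs to the same length, add them, then apply ReLU.
    summed = np.maximum(v_low @ w1 + v_gate @ w2, 0.0)
    # Collapse to one scalar per location and squash with a sigmoid.
    beta = 1.0 / (1.0 + np.exp(-(summed @ w3)))
    return beta.reshape(h, w)

def apply_spatial_weights(x_low, beta):
    """Multiply every channel of x_low by the per-pixel weight in beta."""
    return x_low * beta[..., None]

# Toy example with random features and weights (illustrative values only).
rng = np.random.default_rng(1)
H, W, C_l, C_g, n_int = 6, 6, 4, 8, 3
x_low = rng.normal(size=(H, W, C_l))
x_gate = rng.normal(size=(H, W, C_g))
w1 = rng.normal(size=(C_l, n_int))
w2 = rng.normal(size=(C_g, n_int))
w3 = rng.normal(size=(n_int, 1))
beta = spatial_attention(x_low, x_gate, w1, w2, w3)
weighted = apply_spatial_weights(x_low, beta)
print(beta.shape, weighted.shape)
```

The gating features enter only through the weight map, so the re-weighted output keeps the resolution and channel count of the low-stage features.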

Network Architecture
We use VGG16, which consists of 16 sequential layers, as the feature encoder, but we remove the fully connected layers from the network. The architecture of the encoder is presented in Table 1. All convolutional layers have 3 × 3 kernels and are followed by a ReLU activation function, and the number of output channels doubles after each max pooling operation. The 512-channel output feature maps serve as the bottleneck of the network, separating the encoder from the decoder. In the symmetric decoder part, low-spatial-resolution deep features are upsampled with a deconvolution layer. The upsampled feature map is regarded as providing consistency guidance for the corresponding earlier low-level feature map in the encoder, and both features are passed into the proposed attention module (RFA) to obtain weighted feature maps. The reinforced meaningful features are then concatenated with the upsampled high-level features via the skip connection (see Figure 3). The output of the network passes through a softmax layer, resulting in an image in which each pixel value is the probability of belonging to a building.

Training and Inference
In the training stage, aerial images and the corresponding annotation images are both inputs to the model. As the building extraction task is a pixel-wise classification problem, we add the cross entropy loss and the Dice coefficient loss together as the segmentation loss function, in which $\hat{y}_i$ is the label (1 for building, 0 for background) of pixel $i$ and $y_i$ is the predicted probability for that pixel.
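As an illustration of such a combined loss, the sketch below sums a mean binary cross entropy and a soft Dice loss over a flat list of pixels. The equal weighting, the mean reduction, and the smoothing constant `eps` are assumptions made for this example; the text does not specify them. The function name `bce_dice_loss` is hypothetical.

```python
import math

def bce_dice_loss(labels, probs, eps=1e-7):
    """Binary cross entropy (mean over pixels) plus soft Dice loss.

    labels: ground-truth values, 1 for building and 0 for background
    probs:  predicted building probabilities in [0, 1]
    """
    n = len(labels)
    # Clip probabilities so that log() never receives 0.
    clipped = [min(max(p, eps), 1.0 - eps) for p in probs]
    bce = -sum(
        t * math.log(p) + (1 - t) * math.log(1.0 - p)
        for t, p in zip(labels, clipped)
    ) / n
    intersection = sum(t * p for t, p in zip(labels, probs))
    dice = (2.0 * intersection + eps) / (sum(labels) + sum(probs) + eps)
    return bce + (1.0 - dice)

labels = [1, 1, 0, 0]
good = bce_dice_loss(labels, [0.9, 0.8, 0.1, 0.2])  # mostly correct prediction
bad = bce_dice_loss(labels, [0.2, 0.1, 0.9, 0.8])   # mostly wrong prediction
print(good < bad)
```

The cross entropy term scores every pixel independently, while the Dice term measures the overlap of the predicted and reference building regions as a whole, which helps when building pixels are a minority.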
As the images are large, we crop patches from the original images and feed them into the network for training. In the prediction stage, we stitch the patch predictions back to the original size. However, splitting the image into small tiles and then stitching them together sometimes results in blocking artifacts at the borders. Predicting a pixel in the central part of a patch is much easier than predicting one at the edge, because the latter has limited contextual information. One way to predict the pixels in the edge area is to make predictions on overlapping patches and crop the edges, but we propose a more efficient approach. First, we extrapolate the missing context by mirroring the input image [32]. After that, we add a cropping layer to the output layers of the network, similar to [55], which solves two problems simultaneously: (1) in the prediction phase, it takes advantage of the contextual information in the margin; (2) the overlapping edges of each patch are cropped automatically in the prediction stage. Details of the experimental settings are given in Section 3.2.
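The mirror, predict, crop, and stitch procedure can be sketched with NumPy as follows. This is an assumption-laden illustration: the crop is performed outside the model rather than in a cropping layer, the image side lengths are assumed to be multiples of the output tile size, and `predict_large_image` and `predict_fn` are hypothetical names.

```python
import numpy as np

def predict_large_image(image, predict_fn, out_size, margin):
    """Tile a (H, W, C) image with mirrored borders and stitch cropped outputs.

    predict_fn maps a (tile, tile, C) array to a (tile, tile) score map,
    where tile = out_size + 2 * margin. H and W are assumed to be
    multiples of out_size.
    """
    h, w, _ = image.shape
    # Mirror the image at its borders to supply the missing context.
    padded = np.pad(image, ((margin, margin), (margin, margin), (0, 0)),
                    mode="reflect")
    tile = out_size + 2 * margin
    result = np.zeros((h, w), dtype=float)
    for top in range(0, h, out_size):
        for left in range(0, w, out_size):
            patch = padded[top:top + tile, left:left + tile]
            scores = predict_fn(patch)
            # Keep only the central region, whose pixels saw full context.
            center = scores[margin:margin + out_size, margin:margin + out_size]
            result[top:top + out_size, left:left + out_size] = center
    return result

# Sanity check with a stand-in "model" that returns the first channel.
image = np.arange(20 * 20 * 3, dtype=float).reshape(20, 20, 3)
stitched = predict_large_image(image, lambda p: p[..., 0], out_size=10, margin=3)
print(np.array_equal(stitched, image[..., 0]))
```

With the stand-in model, the stitched output reproduces the first channel of the input exactly, which confirms that the central crops tile the original image without gaps or overlaps.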

Dataset
Massachusetts Building Dataset (Mass. Buildings): This dataset was proposed by Mnih [9]. It includes 151 RGB images of the Boston area with a spatial resolution of 1 m. The ground truth, obtained from the OpenStreetMap project, is available for all images. There are 137 images in the training set, 10 images in the test set, and 4 images for validation.
ISPRS Potsdam Challenge Dataset (Potsdam) [58]: This dataset contains 38 images with a spatial resolution of 5 cm. The size of each tile is 6000 × 6000 pixels. Among them, 24 images with available ground truth are provided for training and 14 images are reserved for testing. We randomly split the 24 images into 17 for training and 7 for validation. Note that we use only the three-band IRRG images for a fair comparison with existing models.
WHU Aerial Dataset (WHU) [59]: This dataset contains 8189 RGB tiles with 0.3 m ground resolution, including 187,000 building samples in New Zealand. These 512 × 512 images are divided into three parts by the provider: 4736 tiles for training, 1036 tiles for validation, and 2416 tiles for testing. Each tile has a corresponding Boolean raster map derived from the building vector map. Figure 4 shows some images and reference data from the three datasets.

Experimental Setting
To train an effective deep model using a relatively small dataset, we cropped the raw images into small patches of 320 × 320 pixels with an overlap of 80 pixels. Only for the WHU dataset did we maintain the original image size of 512 × 512 pixels, following the settings of the providers. For each patch, we applied data augmentation consisting of flipping in the vertical or horizontal dimension and rotation by 90 degrees. Note that for a given patch, we applied these transformations randomly rather than applying all of them each time. Table 2 shows the number of patches in the three augmented datasets. In the prediction stage, we followed the overlapping cropping and stitching process described above to output the classification result for the large aerial images. First, the size of the test patches was set to 704 × 704 pixels for Mass. Buildings and Potsdam, and adjacent patches overlapped by 204 pixels. Then, we cropped the outputs by 102 pixels along each edge to obtain images of 500 × 500 pixels, which were easily stitched into a large test image. For the WHU dataset, since the test images had already been cropped into isolated tiles of 512 × 512 pixels, this process was not needed.
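The tiling numbers above can be checked with a few lines of arithmetic. The sketch below assumes that the image is mirror-padded by the crop margin (102 pixels) on each side, which is an assumption made here so that the cropped outputs cover the whole image; the helper name `tile_origins` is hypothetical. The 6000-pixel side length is that of a Potsdam tile.

```python
def tile_origins(image_size, tile_size, overlap):
    """Tile offsets along one axis of the padded image, plus stride and crop."""
    stride = tile_size - overlap     # distance between adjacent tile origins
    crop = overlap // 2              # margin removed from each output edge
    padded = image_size + 2 * crop   # assumed mirror padding on both sides
    origins = list(range(0, padded - tile_size + 1, stride))
    return origins, stride, crop

origins, stride, crop = tile_origins(6000, 704, 204)
output_size = 704 - 2 * crop
print(stride, crop, output_size, len(origins))
```

Under these assumptions each 704 × 704 patch yields a 500 × 500 output, and 12 × 12 such outputs cover a 6000 × 6000 Potsdam tile exactly.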
We implemented our models using the Keras framework with the TensorFlow backend. We initialized the network parameters using Xavier uniform initialization [60] and adopted Adam [61] as the weight optimization algorithm, with an initial learning rate of 0.001. We set the batch size to 8 to fit the GPU memory of the workstation used in the study. All experiments were run on a desktop with 32 GB of RAM and a 24 GB NVIDIA P6000 GPU.

Evaluation Metrics
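We used overall accuracy to evaluate the global performance of the methods. In addition, the F1-score of the positive (building) class and the Intersection over Union (IoU) were used to evaluate classification performance. The F1-score and IoU metrics are defined by
$$\mathrm{F1} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}, \quad \mathrm{precision} = \frac{TP}{TP + FP}, \quad \mathrm{recall} = \frac{TP}{TP + FN}$$
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$
where TP is the number of true positives, FP the number of false positives, and FN the number of false negatives.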
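For concreteness, the three metrics can be computed from flat lists of binary labels as in the sketch below. The function name `building_metrics` and the toy labels are illustrative, and overall accuracy is taken in its usual sense as the fraction of correctly labeled pixels.

```python
def building_metrics(y_true, y_pred):
    """Overall accuracy, F1-score, and IoU for the building class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"accuracy": accuracy, "f1": f1, "iou": iou}

# Eight pixels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = building_metrics(y_true, y_pred)
print(metrics)
```

In this toy case precision and recall are both 3/4, so the F1-score is 0.75, while the IoU is 3/5 = 0.6; IoU is never larger than F1 because it also counts both error types in the denominator without doubling the true positives.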

Evaluations of Attention
We evaluated the effect of the proposed joint attention module in UNet for building extraction in VHR images. Furthermore, we compared our attention module with existing advanced attention modules on the three datasets. These attention mechanisms are described as follows:
(1) CAB [48]: A channel-wise attention block, which reweights the low-stage feature maps using a weight vector inferred from the concatenated features, stage by stage.
(2) GRID [43]: A grid attention gate module for medical imaging that learns to identify salient image regions of varying shapes and sizes using a grid signal conditioned on image spatial information from deep features.
(3) DenseAN [52]: A spatial attention fusion module that uses the activation output of the high-level features to reweight the low-level features before summation.
(4) DualAN [62]: A dual attention module for scene segmentation that captures long-range contextual information in the spatial dimension and the channel dimension, respectively.
(5) RFA (Ours): A joint residual attention module consisting of channel attention and spatial attention for enhancing the semantic consistency of features across layers.
In the training phase, the training parameters and strategies adopted for these methods were the same as ours.
The comparisons of the different attention modules on the three test sets are presented as follows:
(1) Mass. Buildings: As shown in Figure 5, CAB (Figure 5c) and Ours-RFA (Figure 5g) visually achieved better global performance than the other attention methods. There are many FPs and FNs in the results of both GRID (Figure 5d) and DenseAN (Figure 5e). DualAN (Figure 5f) had difficulty recognizing small and dense buildings. Figure 6 shows close-ups (marked by yellow rectangles in Figure 5a) of the results for detailed inspection. The results in Figure 6 demonstrate that most buildings were correctly identified by all five methods, but CAB, GRID, and DenseAN tended to misclassify pixels, as some FPs are found in areas covered by ground or shadows (Figure 6c-e). DualAN (Figure 6f) performed better on large buildings. However, the FPs and FNs in the dense residential areas indicate that DualAN does not perform well enough.
(2) Potsdam: Figure 7 displays the classification results on the Potsdam test set. Ours-RFA (Figure 7g) outperforms the other four methods, whose predictions contain many FPs (Figure 7c-f). The result of DenseAN is slightly better, except for some FPs in the first two rows of Figure 7e. The FPs in the results of CAB, GRID, and DualAN (Figure 8c,d,f) indicate that these models are sensitive to low vegetation and roads, and they frequently misclassify vegetation pixels with a color similar to that of the rooftops as building pixels. Although the result of DualAN in the last row (Figure 8f) has more TPs, it also tends to have more FPs. Compared with the other attention methods, except DualAN, Ours-RFA (Figure 8g) achieves better performance, with more TPs and fewer FPs.
(3) WHU: Figure 9 shows the results of the proposed attention method and the methods compared with it. The results obtained using Ours-RFA (Figure 9g) are more complete, which indicates that the proposed attention module improves the labeling of fine-structured buildings in VHR images. Closer inspection of the second row in Figure 10 supports this point: only Ours-RFA identified almost all building pixels with a small number of misclassified pixels, while the other methods struggled to identify the rooftop, and their results have many 
FNs.The results of CAB (Figure 9c) are relatively good, but still some FPs presented in Figure 10c, indicating that CAB did not distinguish ground pixels well enough.Most FNs in the results of GRID (see the third row in Figure 10d) implies that GRID does not fully utilize the context information and lack ability of identifying rooftop pixels with complex texture.(3) WHU: Figure 9 exhibits the results of proposed attention method and its comparisons.The result obtained by using Ours-RFA (Figure 9g) are more complete, which indicates that the proposed attention module improves the performance of labeling fine-structured buildings in the VHR images.Closer inspection of the 2 rd row in Figure 10 testifies our point, only Ours-RFA method identified almost building pixels and a small number of misclassified pixels, while other methods struggled with identifying the rooftop and the results of them have many FNs.The results of CAB (Figure 9c) are relatively good, but still some FPs presented in Figure 10c, indicating that CAB did not distinguish ground pixels well enough.Most FNs in the results of GRID (see the third row in Figure 10d) implies that GRID does not fully utilize the context information and lack ability of identifying rooftop pixels with complex texture.Table 3 provides the summary statistics for quantitve comparisons of different attention modules.Ours-RFA obtained the best result with the overall accuracy, the F1-score and the IoU score among all attention methods on three datasets, and the result of quantitative comparison is consistent with visual effect.Both on the Mass.Buildings test sets and WHU test sets, CAB achieved the second best performance.DenseAN had a comparable result of the overall accuracy with the proposed RFA on the Potsdam test sets but the IoU score was 6.63% lower than that of Ours-RFA.The quantitive results of DualAN on three datasets are not satisfactory, implying that the strategy of only applying refinement on the deep feature is 
not stable for building extraction in the VHR images.Table 3 provides the summary statistics for quantitve comparisons of different attention modules.Ours-RFA obtained the best result with the overall accuracy, the F1-score and the IoU score among all attention methods on three datasets, and the result of quantitative comparison is consistent with visual effect.Both on the Mass.Buildings test sets and WHU test sets, CAB achieved the second best performance.DenseAN had a comparable result of the overall accuracy with the proposed RFA on the Potsdam test sets but the IoU score was 6.63% lower than that of Ours-RFA.The quantitive results of DualAN on three datasets are not satisfactory, implying that the strategy of only applying refinement on the deep feature is not stable for building extraction in the VHR images.
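
For reference, the three scores reported in Table 3 can be derived from pixel-wise TP, FP, FN, and TN counts. The following is a minimal NumPy sketch that uses the standard definitions of overall accuracy, F1-score, and IoU for a binary building mask. The function name `binary_scores` and the toy masks are illustrative assumptions, not the evaluation code used in the experiments.

```python
import numpy as np


def binary_scores(pred, ref):
    """Return overall accuracy, F1-score, and IoU for binary masks.

    pred, ref: arrays of the same shape; nonzero means "building".
    Hypothetical helper based on the standard metric definitions.
    """
    pred = np.asarray(pred).astype(bool)
    ref = np.asarray(ref).astype(bool)
    tp = int(np.sum(pred & ref))    # building predicted as building
    fp = int(np.sum(pred & ~ref))   # background predicted as building
    fn = int(np.sum(~pred & ref))   # building predicted as background
    tn = int(np.sum(~pred & ~ref))  # background predicted as background

    oa = (tp + tn) / (tp + fp + fn + tn)
    # Guard against empty denominators (no building pixels anywhere).
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return {"OA": oa, "F1": f1, "IoU": iou}


# Toy 2 x 4 example with 3 TPs, 1 FP, 1 FN, and 3 TNs.
ref = np.array([[1, 1, 0, 0],
                [1, 1, 0, 0]])
pred = np.array([[1, 1, 1, 0],
                 [1, 0, 0, 0]])
scores = binary_scores(pred, ref)
```

Because overall accuracy also counts correctly labeled background pixels, two methods can have similar overall accuracy while differing clearly in IoU, which is the pattern reported above for DenseAN and Ours-RFA on Potsdam.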

Comparison with State-of-the-Art
To evaluate the effectiveness of the proposed attention model, we compared it with other existing FCN-based methods for building detection. The models are as follows:
(1) RFA-UNet (Ours): an encoder-decoder style fully convolutional network with the extended hybrid attention module.
(2) UNet: an architecture originally proposed for the segmentation of biomedical images, which captures context information at multiple scales via contracting and expansive paths.
(3) SegNet [63]: an encoder-decoder architecture for scene segmentation, in which the decoder uses the pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling.
(4) RefineNet [64]: a multi-path refinement network that exploits multi-level features for high-resolution prediction with long-range residual connections, achieving state-of-the-art results on several public datasets.
(5) FC-DenseNet [65]: a model that adopts densely connected convolutional networks in a U-shaped architecture to tackle semantic segmentation. By using dense connections, multi-level features are concatenated iteratively to form a dense block.

It should be noted that we implemented the methods above ourselves (with the same training parameters as ours) and also incorporated some advanced numerical results reported in the literature for each of the three datasets [52,66,67].

Figures 11-13 show close-up views of the five classification results using three subset images from the three test sets, respectively. SegNet obtained comparable results on Potsdam (Figure 12c) but could not distinguish large building objects on WHU, where obvious FNs appeared in the last two rows of Figure 13c, indicating that SegNet is not robust in identifying complex man-made objects. The results of FC-DenseNet and RefineNet are relatively smooth but still less accurate. As shown in Figures 11d and 12e, FC-DenseNet and RefineNet did not perform well, as many FPs and FNs appeared in the second row of their results. Similar to the results on Mass. Buildings, there were also some FPs in the results of FC-DenseNet and RefineNet on Potsdam (see the first row in Figure 12d,e). These findings suggest that simply reusing features densely or using long-range residual connections is not effective enough, owing to the categorical ambiguity of the low level features. Our RFA-UNet model was more effective in recognizing building objects on the three test sets. Although the results of RFA-UNet have a few flaws, they still show more precise localization and more accurate labeling (see Figures 11g, 12g, and 13g). Meanwhile, the performance of our model relative to that of UNet shows that the RFA module improves the classification ability of the network.

According to Table 4, our method outperforms the abovementioned FCN-based semantic segmentation models and recently published methods in the remote sensing field on the same datasets. On the test set of Mass. Buildings (spatial resolution of 1.0 m), our method surpasses the second-best model by 0.64% in terms of F1-score. For the test sets with higher spatial resolution, the IoU score of our model was around 2.42% higher than that of the second-best model, RefineNet, on Potsdam (0.05 m), and 1.0% higher than that of the previous best model, SRI-Net, on WHU (0.3 m).
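
As an aside on the baselines listed above, the pooling-indices upsampling attributed to SegNet can be illustrated on a single feature map. The sketch below is a simplified NumPy illustration that assumes non-overlapping 2 x 2 windows and one channel, and it omits the convolutions that follow unpooling in the real network. The function names are hypothetical.

```python
import numpy as np


def max_pool_with_indices(x):
    """2 x 2, stride-2 max pooling on an (H, W) map with even H and W."""
    h, w = x.shape
    # Gather each 2 x 2 window into the last axis: shape (H/2, W/2, 4).
    windows = (x.reshape(h // 2, 2, w // 2, 2)
                .transpose(0, 2, 1, 3)
                .reshape(h // 2, w // 2, 4))
    indices = windows.argmax(axis=-1)  # position of the max inside each window
    pooled = windows.max(axis=-1)
    return pooled, indices


def max_unpool(pooled, indices):
    """Place each pooled value back at its recorded position; zeros elsewhere."""
    ph, pw = pooled.shape
    windows = np.zeros((ph, pw, 4), dtype=pooled.dtype)
    np.put_along_axis(windows, indices[..., None], pooled[..., None], axis=-1)
    # Undo the window gathering to recover the (2*ph, 2*pw) layout.
    return (windows.reshape(ph, pw, 2, 2)
                   .transpose(0, 2, 1, 3)
                   .reshape(ph * 2, pw * 2))


x = np.array([[1., 2., 0., 0.],
              [3., 4., 9., 0.],
              [0., 0., 5., 6.],
              [7., 0., 0., 8.]])
pooled, indices = max_pool_with_indices(x)
restored = max_unpool(pooled, indices)
```

The restored map is sparse: only the positions that held the window maxima are non-zero, so boundary locations are preserved without learning an upsampling kernel.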

Discussion
Applying the attention mechanism to the UNet segmentation model, we observe that our joint attention module improves the performance of the existing architecture for building extraction in VHR images. The improvement may be related to the inherent attributes of CNNs and to a flaw of the plain skip connection in encoder-decoder architectures. Generally, CNNs enlarge the receptive field by stacking convolution layers, which means that the receptive field of a given layer covers only a local region, especially in the shallow layers of the network. Therefore, deep and shallow layers differ in how much context information they use, which leads to different classification capacities. On the other hand, the spatial information of low level features is important for localizing the classified objects, but these low level features also bring ambiguous, noisy information that results in categorical errors [68].

In this paper, we rethink the relationship between shallow layers and their corresponding deep layers in the skip connection at the feature level. To leverage the spatial information from shallow layers and the context information from deep layers, we employ an attention mechanism that highlights advantageous features and suppresses features that contribute less. The channel-wise attention part of the proposed module applies global average pooling to the concatenated features, which extracts the global categorical information of the two input features. Two subsequent fully connected layers capture feature dependencies in the channel dimension, which ensures cross-layer information exchange. Thus, the rescaled low level output, activated by a sigmoid, is dynamically more consistent with the high level features. Furthermore, the spatial attention part uses additive attention to refine the low level features with the aid of the high level features, which have larger receptive fields; this introduces more elaborate context and improves the classification ability of the features.
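
The data flow described above can be sketched as a forward pass in NumPy. This is a rough illustration under several assumptions that are not taken from the paper: the high level features are already upsampled to the spatial size of the low level features, both have the same number of channels, a ReLU follows the first fully connected layer, 1 x 1 convolutions are written as per-pixel linear maps, the residual branch is a plain addition, and all weights are random. The names `channel_attention`, `spatial_attention`, and `rfa_sketch` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(low, high, w1, w2):
    """Rescale low level channels using pooled statistics of both inputs.

    low, high: (C, H, W) feature maps with matching shapes (assumed).
    w1: (C_hidden, 2C) and w2: (C, C_hidden) fully connected weights.
    """
    # Global average pooling over the concatenated features -> (2C,)
    pooled = np.concatenate([low, high], axis=0).mean(axis=(1, 2))
    hidden = np.maximum(w1 @ pooled, 0.0)  # first FC layer + ReLU (assumed)
    weights = sigmoid(w2 @ hidden)         # second FC layer + sigmoid -> (C,)
    return low * weights[:, None, None]


def spatial_attention(low, high, w_low, w_high, psi):
    """Additive attention: one weight per pixel, guided by high level features.

    w_low, w_high: (K, C) per-pixel linear maps (1 x 1 convolutions).
    psi: (K,) projection to a single attention logit per pixel.
    """
    joint = (np.einsum("kc,chw->khw", w_low, low)
             + np.einsum("kc,chw->khw", w_high, high))
    logits = np.einsum("k,khw->hw", psi, np.maximum(joint, 0.0))
    alpha = sigmoid(logits)                # (H, W), values in (0, 1)
    return low * alpha[None, :, :]


def rfa_sketch(low, high, params):
    """Channel attention, then spatial attention, plus a residual branch."""
    x = channel_attention(low, high, params["w1"], params["w2"])
    x = spatial_attention(x, high, params["w_low"], params["w_high"], params["psi"])
    return low + x                         # residual mapping (assumed form)


rng = np.random.default_rng(0)
C, H, W, K = 8, 16, 16, 4                  # hypothetical sizes
low = rng.standard_normal((C, H, W))
high = rng.standard_normal((C, H, W))      # assumed already upsampled to H x W
params = {
    "w1": rng.standard_normal((C // 2, 2 * C)),
    "w2": rng.standard_normal((C, C // 2)),
    "w_low": rng.standard_normal((K, C)),
    "w_high": rng.standard_normal((K, C)),
    "psi": rng.standard_normal(K),
}
refined = rfa_sketch(low, high, params)    # same shape as `low`
```

In this sketch both attention weights lie in (0, 1), so the attended branch can only attenuate low level responses, while the residual addition keeps the original signal available to the decoder.
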
Compared with other existing attention methods, flexibility is an advantage of our proposed attention module. The experimental results on three different datasets demonstrate that the RFA module handles building extraction better across different sources of aerial images. Taking the channel and spatial dimensions into account successively allows for a more robust interaction of context information between the feature layers in the segmentation model. Meanwhile, the residual mapping branch of RFA alleviates gradient vanishing during training. These are the two reasons why the proposed RFA attention module outperforms the single attention methods in this study. Our approach also differs considerably from DualAN, which likewise uses two kinds of attention. In particular, DualAN applies attention mechanisms in parallel at the bottleneck of the network and focuses on employing self-attention to enhance the representation of deep features, rather than on reducing the semantic discrepancy between features of different levels. Moreover, because of the high cost of the intermediate matrix multiplication in DualAN, its authors [62] apply it only to the bottleneck features, which have low spatial resolution. The experimental results imply that this strategy is not effective enough for building extraction in aerial images. In contrast, our experiments show that the proposed joint attention adds only a small cost in model parameters (about 0.4 million; see Table 5) and computation (about 1.53 MB), even when applied at every level of the network. This flexibility suggests that RFA could be embedded in other architectures in the future.

There is abundant room for further study. First, we have not validated the improvements that the RFA module might bring to other encoder-decoder models. At present, we do not apply the RFA module to other models because many factors need to be considered, such as the computational resource consumption of the models, the applicability of the models themselves to different datasets, and their hyperparameter settings. The comparison with other methods in training time also suggests that further hyperparameter optimization of the proposed module is possible (see Tables 5 and A1). Therefore, a more comprehensive comparison of these methods is needed in the future. Second, we conducted the experiments on three public datasets of urban buildings (Mass. Buildings, Potsdam, and WHU). It would be promising to develop RFA-based models for multi-source data and rural residential buildings. Finally, we focus only on building extraction in this paper. Since the proposed RFA-UNet can easily be transformed into a multi-class semantic segmentation model, we plan to extend our model to multiple classes and with extra geometric constraints.
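
The remark above about the high cost of the intermediate matrix multiplication in DualAN can be made concrete with a little arithmetic. A position-wise self-attention of that kind compares every pixel with every other pixel, so for an H x W feature map the intermediate affinity matrix has (H·W)² entries. The sketch below counts those entries for two hypothetical feature-map sizes; the sizes and the 4-byte float assumption are illustrative and are not taken from the paper.

```python
def affinity_entries(height, width):
    """Number of entries in an (H*W) x (H*W) pairwise affinity matrix."""
    n = height * width
    return n * n


def affinity_megabytes(height, width, bytes_per_value=4):
    """Approximate storage for the matrix, assuming 4-byte floats."""
    return affinity_entries(height, width) * bytes_per_value / 2**20


# Hypothetical sizes: a low-resolution bottleneck vs. a shallow decoder stage.
bottleneck = affinity_entries(32, 32)    # 1,048,576 entries
shallow = affinity_entries(256, 256)     # 4,294,967,296 entries
ratio = shallow // bottleneck            # 4096 times more entries
```

Under these assumptions the affinity matrix grows from about 4 MB at 32 x 32 to about 16 GB at 256 x 256, which is consistent with restricting such a module to low-resolution bottleneck features.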

Conclusions
In this paper, an end-to-end attention-based FCN model was proposed for building extraction from very high resolution aerial imagery. We implemented an attention mechanism based on a re-weighting technique to adjust feature responses dynamically along the channel-wise and spatial dimensions. With the aid of the context information from high level features, the proposed joint attention module can effectively enhance the semantic consistency of features across layers and thereby improve the discriminative power of the UNet model for the building extraction task. Experiments on three different high resolution building datasets verified the effectiveness of the attention mechanism, and the proposed RFA-UNet model achieved state-of-the-art performance on these popular benchmarks.

Figure 2. Diagram of the proposed attention module. Under the guidance of high stage features, low stage features are refined adaptively before passing through the skip connection.

Figure 3. A decoding stage embedded with the proposed attention module, where 'RFA' refers to the attention module and 'C' refers to the channel number of the features.

Figure 4. Image samples and corresponding reference images from the three datasets. The label of (a) Massachusetts Building includes two classes: building (red) and background (black). The labels of (b) Potsdam Challenge and (c) WHU Aerial include two classes: building (white) and background (black).

Remote Sens. 2019, 11

features.
(3) DenseAN [52]: A spatial attention fusion module that uses the activation output of the high level features to reweight the low level features before summation.
(4) DualAN [62]: A dual attention module for scene segmentation that captures long-range contextual information in the spatial and channel dimensions, respectively.
(5) RFA (Ours): A joint residual attention module consisting of channel attention and spatial attention for enhancing the semantic consistency of features across layers.
In the training phase, the training parameters and strategies adopted for these methods are the same as ours.

Figure 5. Comparison with the state-of-the-art attention methods on Mass. Buildings test sets. True positive (TP) is marked as green, false positive (FP) as blue, and false negative (FN) as red.
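
The color coding used in this and the following comparison figures can be reproduced from a predicted mask and a reference mask. The snippet below is a minimal NumPy sketch; the function name `error_map` is hypothetical, and leaving true negatives black is an assumption, since the captions do not state how background agreement is rendered.

```python
import numpy as np

TP_COLOR = (0, 255, 0)   # green: building correctly detected
FP_COLOR = (0, 0, 255)   # blue: background labeled as building
FN_COLOR = (255, 0, 0)   # red: building that was missed


def error_map(pred, ref):
    """Return an (H, W, 3) uint8 RGB image that color-codes TP, FP, and FN."""
    pred = np.asarray(pred).astype(bool)
    ref = np.asarray(ref).astype(bool)
    rgb = np.zeros(pred.shape + (3,), dtype=np.uint8)  # true negatives stay black (assumed)
    rgb[pred & ref] = TP_COLOR
    rgb[pred & ~ref] = FP_COLOR
    rgb[~pred & ref] = FN_COLOR
    return rgb


# Toy 2 x 2 example covering all four cases.
pred = np.array([[1, 1],
                 [0, 0]])
ref = np.array([[1, 0],
                [1, 0]])
rgb = error_map(pred, ref)
```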

Figure 6. Close-ups of comparison with the state-of-the-art attention methods on Mass. Buildings test sets. True positive (TP) is marked as green, false positive (FP) as blue, and false negative (FN) as red.

Figure 7. Comparison with the state-of-the-art attention methods on Potsdam test sets. True positive (TP) is marked as green, false positive (FP) as blue, and false negative (FN) as red.

Figure 8. Close-ups of comparison with the state-of-the-art attention methods on Potsdam test sets. True positive (TP) is marked as green, false positive (FP) as blue, and false negative (FN) as red.

Figure 9. Comparison with the state-of-the-art attention methods on WHU test sets. Building (white) and background (black). True positive (TP) is marked as green, false positive (FP) as blue, and false negative (FN) as red.

Figure 10. Close-ups of comparison with the state-of-the-art attention methods on WHU test sets. Building (white) and background (black). True positive (TP) is marked as green, false positive (FP) as blue, and false negative (FN) as red.

Figure 11. Comparison with the state-of-the-art attention methods on Mass. Buildings test sets. True positive (TP) is marked as green, false positive (FP) as blue, and false negative (FN) as red.

Figure 12. Comparison with the state-of-the-art attention methods on Potsdam building test sets. True positive (TP) is marked as green, false positive (FP) as blue, and false negative (FN) as red.

Figure 13. Comparison with the state-of-the-art attention methods on WHU building test sets. True positive (TP) is marked as green, false positive (FP) as blue, and false negative (FN) as red.

Table 1. Architecture of the encoder.

Table 2. Detailed information on the experimental settings for the three datasets.

Table 3. Quantitative comparison with other attention modules (%) on three test sets, where values in bold are the best and the underlined values are the second best.

Table 4. Quantitative comparison with other deep models (%) on three test sets, where values in bold are the best and the underlined values are the second best.
1 This table incorporates our own results and numerical results reported by other authors.

Table 5. Comparison with baseline.
1 Time consumed by the method to predict the complete test set.