Building Extraction from High-Resolution Aerial Imagery Using a Generative Adversarial Network with Spatial and Channel Attention Mechanisms

Segmentation of high-resolution remote sensing images is an important challenge with wide practical applications. The increasing spatial resolution provides fine details for image segmentation but also incurs segmentation ambiguities. In this paper, we propose a generative adversarial network with spatial and channel attention mechanisms (GAN-SCA) for the robust segmentation of buildings in remote sensing images. The segmentation network (generator) of the proposed framework is composed of the well-known semantic segmentation architecture (U-Net) and the spatial and channel attention mechanisms (SCA). The adoption of SCA enables the segmentation network to selectively enhance more useful features in specific positions and channels and enables improved results closer to the ground truth. The discriminator is an adversarial network with channel attention mechanisms that can properly discriminate the outputs of the generator and the ground truth maps. The segmentation network and adversarial network are trained in an alternating fashion on the Inria aerial image labeling dataset and Massachusetts buildings dataset. Experimental results show that the proposed GAN-SCA achieves a higher score (the overall accuracy and intersection over the union of Inria aerial image labeling dataset are 96.61% and 77.75%, respectively, and the F1-measure of the Massachusetts buildings dataset is 96.36%) and outperforms several state-of-the-art approaches.


Introduction
With the rapid advancement of aerospace remote sensing, both the volume and the spatial resolution of remote sensing images are increasing rapidly. As a result, accurate and automatic semantic labeling of high-resolution remote sensing images is of great significance and has received wide attention [1]. The large intra-class variance and small inter-class differences of images with higher spatial resolution may cause classification ambiguities, which makes semantic segmentation of high-resolution remote sensing images challenging. For buildings in high-resolution aerial images specifically, characteristics differ from region to region. For instance, some regions have small and very dense buildings, whilst others have low-density buildings. This variability poses great challenges for the building segmentation task and requires classification techniques with strong generalization capabilities [2,3].
Over the last few years, deep learning architectures have made breakthroughs in the image analysis field. Convolutional neural networks (CNNs) have been proposed not only for object detection and whole-image classification but also for finer inference, such as semantic segmentation. Semantic segmentation performs pixel-wise prediction, i.e., it assigns a class label to each pixel. Long et al. [4] proposed fully convolutional networks (FCNs) to accomplish pixel-wise classification. They replaced the fully connected layers of whole-image classification CNNs with convolutional layers and used deconvolutional layers to upsample the feature maps into a score map for each class. FCNs set a precedent for pixel-based encoder-decoder architectures. Following this paradigm, many CNN architectures have been proposed that further improve segmentation performance. In [5], U-Net was proposed to modify FCN by concatenating the feature maps of the encoder and decoder. The concatenation architecture takes full advantage of both low-level and high-level features, so more precise segmentation results can be obtained. After that, DeepLab V1 [6] and V2 [7] were proposed to mitigate the information loss caused by pooling operations. The authors introduced atrous convolutions to increase the receptive field size while maintaining a higher feature map resolution, and fully connected conditional random fields (CRFs) were used as a post-processor to further improve segmentation performance. In [8], Noh et al. proposed DeconvNet, which consists of a convolution network and a deconvolution network. In the deconvolution network, unpooling layers upscale the feature maps, and deconvolutional layers then densify the initially upscaled sparse feature maps. Badrinarayanan et al. presented SegNet [9], which also includes unpooling layers in the decoder stage but has fewer parameters than DeconvNet.
Although CNN-based segmentation methods have achieved promising results, they still have drawbacks and can be further improved. The main problem is that the pixel-wise prediction of a CNN can guarantee high pixel-wise accuracy, but the relationship between pixels is prone to be ignored. This may lead to discontinuous segmentation results, and object boundaries are often not accurate enough. Therefore, post-processing methods, e.g., fully connected CRFs or Markov random fields (MRFs), have been needed to further improve the raw segmentation results [10][11][12]. These graphical regularization models couple the input images with the predicted score maps of the CNN to refine the predictions using the color information and pixel positions of the original image. In addition, recurrent neural networks (RNNs) can refine the segmentation results by employing a feedback connection to form a directed cycle [13]. Bergado et al. [14] proposed to incorporate the recurrent approach into the semantic segmentation task (ReuseNet) to learn contextual dependencies in the label space and further refine the segmentation results. ReuseNet applies the semantic segmentation operations in R cycles, where each cycle takes the score map of the previous cycle concatenated with the original image as input. Moreover, methods based on generative adversarial networks (GANs) [15] can enforce spatial label contiguity to refine the segmentation results without any additional time cost during the testing phase. In [16], Luc et al. first applied the adversarial training strategy to the semantic segmentation task. A segmentation network and an adversarial network were trained in an alternating fashion to make the generated segmentation results hard to distinguish from the ground truth. By doing so, the joint distribution of all label variables at each pixel location can be assessed as a whole, which enforces forms of high-order consistency that cannot be enforced by pixel-wise classification or pairwise terms. Xue et al. [17] presented SegAN for medical image segmentation, which is composed of a segmentor and a critic network. A multi-scale L1 loss function was minimized and maximized alternately to train these two networks, and SegAN achieved better image segmentation performance than the original GAN.
In the semantic labeling of high-resolution remote sensing images, deep learning architectures also show excellent performance [18][19][20]. Saito et al. [21] used a patch-based CNN to learn classification maps from high-resolution images and achieved good results on the Massachusetts roads and buildings datasets [22]. However, patch-based methods suffer from a limited receptive field and large computational overhead, so they were soon surpassed by pixel-based methods. Maggiori et al. [23] proposed the Inria aerial image labeling dataset, which covers different forms of buildings, and provided a baseline segmentation result using an FCN-based architecture combined with a multi-layer perceptron. In [24], Bischke et al. introduced a new cascaded multi-task loss to mitigate the poor boundaries of existing prediction results. Learning with the proposed loss improves performance without any changes to the network architecture. A multi-stage multi-task CNN for building extraction was introduced in [18]. The first stage of that network provides the segmentation results, while the second stage aims to give the precise location using two branches. In [25], Khalel et al. proposed a stack of U-Nets to automatically label buildings in high-resolution aerial images, in which each U-Net can be regarded as a post-processor of the previous one. However, the existing results usually suffer from poor boundaries, and the accuracy can be further improved.
In this paper, we propose a generative adversarial network with spatial and channel attention mechanisms (GAN-SCA) for highly accurate semantic labeling of buildings in high-resolution aerial images with precise boundaries. The GAN-SCA is composed of a segmentation network and an adversarial network: the segmentation network predicts the pixel-wise labeling results, and the adversarial network distinguishes whether its inputs are predictions of the segmentation network or ground truth. Moreover, we embed channel and spatial attention mechanisms into the network to selectively enhance useful information and further improve the segmentation accuracy.
The main contributions of our work can be summarized as follows:
• A GAN-based network called GAN-SCA is proposed for building extraction from high-resolution aerial imagery. The architecture is composed of a segmentation network and an adversarial network. The segmentation network aims to predict pixel-wise labeling maps that are similar to the ground truths, while the adversarial network is set to discriminate the different characteristics of different label maps to further enhance the high-frequency continuity of the prediction maps.
• Spatial and channel attention mechanisms are embedded in the proposed GAN-SCA architecture so that the network can selectively attend to important features in both the spatial dimension and the channel relationship.
• The adversarial network and the segmentation network are trained alternately: the former optimizes a multi-scale L1 loss, and the latter optimizes multiple cross-entropy losses combined with a multi-scale L1 loss. Without requiring any post-processing, our proposed network improves on the state-of-the-art performance on both the Inria aerial image labeling dataset and the Massachusetts buildings dataset.
The rest of this paper is organized as follows. In Section 2, we introduce the architecture and training strategy of the proposed network in detail. The dataset description and experimental setting are presented in Section 3. Section 4 details the experimental results and analyses. Section 5 discusses the effectiveness of the spatial and channel attention mechanisms and the training strategy. Conclusions are drawn in Section 6.

Proposed Network GAN-SCA
As shown in Figure 1, the proposed GAN-SCA is composed of two parts, i.e., the segmentation network and the adversarial network. The segmentation network is a U-Net-based architecture in which spatial and channel attention mechanisms are embedded. U-Net is a powerful CNN architecture for semantic segmentation and has been widely applied in the field of remote sensing image classification [5]. U-Net was initially designed for binary segmentation of biomedical images with a relatively small number of training samples. As it achieves better performance than other classic semantic segmentation architectures, U-Net is a good choice for the building extraction task in this study. However, these classic deep convolutional neural network (DCNN) architectures for semantic segmentation usually produce a large number of multi-level feature maps but do not perform any feature selection throughout the whole process. On the one hand, fusing high-level and low-level features without feature selection may result in over-segmentation when the model tends to receive more information from the lower layers. On the other hand, combining channel-wise information with convolutional filters without considering channel inter-dependencies might affect the segmentation performance of the network. Therefore, we propose to introduce attention mechanisms that perform feature selection with respect to both spatial information and channel relationships.
To mitigate the neglect of inter-pixel relationships caused by the pixel-wise loss function used in the training phase, we propose to refine the segmentation result using adversarial training. The adversarial network can learn latent higher-order structural features that are fed back to the segmentation network in the training phase, so the segmentation results are refined without needing the adversarial network in the testing phase. In contrast with graphical models and recurrent approaches, adversarial training achieves segmentation refinement without extra time cost at test time. The adversarial network we adopt in the proposed GAN-SCA has a structure similar to the encoder of the segmentation network and is fed with the predicted maps combined with the original images and with the ground truth maps combined with the original images. In particular, the multi-scale features from different stages of the adversarial network are reshaped into one-dimensional vectors and concatenated to compute the multi-scale L1 loss.

Segmentation Network
The segmentation network of GAN-SCA is based on the U-Net architecture. To accomplish feature selection with respect to spatial information and channel-wise relationships, we introduce two kinds of attention mechanisms into the network architecture. The attention mechanism is an effective operation that enables the network to selectively enhance more useful features and has been widely applied in the image analysis field [26]. In this work, we consider both spatial and channel-wise attention mechanisms to improve the segmentation performance. The spatial attention mechanisms are embedded between the contracting path and the expanding path of the U-Net, as shown in Figure 1. The U-Net fuses the low-level feature maps of the contracting path with the high-level features of the expanding path by concatenation to re-use the fine details in the low-level features. However, this rough concatenation may result in the over-use of low-level features. Therefore, we use the semantic information of the high-level features to assist the selection of low-level information. The low-level features usually contain rich details, and we prefer to enhance the information that is hard to classify and suppress the interfering information. Figure 2 shows the error map of a U-Net prediction result, from which we can observe that building boundaries are prone to be mislabeled in the building extraction task. Inspired by [27], we note that the entropy score map of the high-level features has characteristics similar to those of the error map, as shown in Figure 2. Therefore, if we compute the entropy score map of the high-level features in each decoder stage and weight the low-level features by the corresponding entropy score map before fusing the high-level and low-level features, we can selectively enhance the hard-to-classify information while suppressing the less useful information in the low-level features. The entropy score map can be computed with Equation (1):
$$ E(x) = -\sum_{i=1}^{K} p_i(x) \log p_i(x) \qquad (1) $$
where $p_i(x)$ denotes the score map of class $i$, and $K$ is the total number of classes. Figure 2 displays the entropy score maps of the four-scale spatial attention mechanisms, from which we can see that the entropy maps have a strong relationship with the error map. Because building boundary pixels are prone to being mislabeled, the entropy maps also share similar characteristics with the boundaries of buildings. Thus, with the spatial attention mechanisms, building boundary information from the lower-level features is highly weighted in the final fused feature. The detailed structure of the spatial attention mechanism is shown in Figure 3a. As can be seen, the high-level features are first passed through 1 × 1 convolutions for dimensionality reduction and normalized to [0,1] with the sigmoid function to generate the score maps. Afterward, the entropy score map is computed and multiplied element-wise with the low-level features. The high-level features are then concatenated with the weighted low-level features for further processing. It is worth noting that, because the entropy score map has a strong relationship with building boundaries in the building extraction task, the spatial attention mechanisms benefit the segmentation of building boundaries. In particular, we compute a cross-entropy loss at each of the four spatial attention mechanisms and combine these losses with the overall cross-entropy loss to train the segmentation network. The details of model optimization are introduced in Section 2.2.
Apart from spatial attention, the proposed architecture also takes advantage of channel relationship enhancement. The squeeze-and-excitation (SE) block is a computational unit that adaptively re-scales each channel according to its importance. SE blocks can be stacked together with many existing state-of-the-art CNNs and bring significant improvements in performance across different datasets at minimal additional computational cost [28]. We therefore adopt SE blocks as channel attention mechanisms at each stage in both the contracting path and the expanding path, as shown in Figure 1. The structure of the SE block is depicted in Figure 3b; it models channel inter-dependencies in two steps, namely, squeeze and excitation. The input features $x$ are first squeezed into channel-wise statistics $s$ by global average pooling, and the $c$-th channel of $s$ is computed by:
$$ s_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j) \qquad (2) $$
where $x_c$ is the $c$-th channel of the input feature $x$, and $H \times W$ denotes the spatial dimensions of $x_c$.
To properly capture the information in $s$ and model the channel inter-dependencies, the squeeze is followed by the excitation operation. A fully connected (FC) layer reduces the dimension of $s$ from $1 \times 1 \times C$ to $1 \times 1 \times \frac{C}{R}$, giving $s'$, and is followed by a rectified linear unit (ReLU) activation. After that, another FC layer restores $s'$ to the original dimension $1 \times 1 \times C$. In this way, the block can better fit the complex relationship between channels with less computational overhead. The weight of each channel is normalized to [0,1] with a sigmoid activation. The excitation operation can be written as:
$$ w = \sigma(W_2 \, \delta(W_1 s)) \qquad (3) $$
where $w$ is the vector of channel weights, $\sigma$ stands for the sigmoid activation, and $\delta$ stands for the ReLU function [29]. $W_1$ and $W_2$ are two real matrices of size $\frac{C}{R} \times C$ and $C \times \frac{C}{R}$, respectively, chosen to limit the complexity and aid the generalization of the channel attention mechanism. This operation is implemented by the two FC layers.
The final output of the channel attention mechanism is the re-scaled input feature $y_c$. The re-scaling operation can be expressed by Equation (4) below:
$$ y_c = w_c \cdot x_c \qquad (4) $$
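The entropy weighting and the squeeze-and-excitation steps described above can be sketched in a few lines of plain Python. This is a minimal illustration on nested lists, not the authors' implementation: the function names, the `eps` guard inside the logarithm, and the use of two complementary score maps (building and background) in the usage note are assumptions made for the example.

```python
import math


def entropy_map(scores, eps=1e-12):
    """Per-pixel entropy of K class score maps, each an H x W nested list.

    eps is an assumed numerical guard so that log(0) is never evaluated.
    """
    height, width = len(scores[0]), len(scores[0][0])
    return [[-sum(p[i][j] * math.log(p[i][j] + eps) for p in scores)
             for j in range(width)]
            for i in range(height)]


def spatial_attention(low_level, entropy):
    """Multiply every channel of C x H x W low-level features by the H x W entropy map."""
    return [[[v * e for v, e in zip(row, e_row)]
             for row, e_row in zip(channel, entropy)]
            for channel in low_level]


def se_block(x, w1, w2):
    """Squeeze-and-excitation on C x H x W features.

    w1 has shape (C/R) x C and w2 has shape C x (C/R); biases are omitted.
    Returns the re-scaled features and the per-channel weights.
    """
    # Squeeze: global average pooling gives one statistic per channel.
    s = [sum(map(sum, channel)) / (len(channel) * len(channel[0]))
         for channel in x]
    # Excitation: FC layer, ReLU, FC layer, sigmoid.
    hidden = [max(0.0, sum(a * b for a, b in zip(row, s))) for row in w1]
    weights = [1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(row, hidden))))
               for row in w2]
    # Re-scale: multiply each channel by its weight.
    y = [[[v * wc for v in row] for row in channel]
         for channel, wc in zip(x, weights)]
    return y, weights
```

For a pixel with building score 0.5 (and background score 0.5) the entropy is at its maximum, ln 2, while a confident pixel with score 0.99 has a much lower entropy. This is why the weighting emphasizes uncertain regions such as building boundaries.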

Adversarial Network
The adversarial network of GAN-SCA has a structure similar to that of the encoder in the segmentation network. Two inputs are fed into the adversarial network, namely the original images concatenated with the predicted label maps and the original images concatenated with the ground truths. The network starts with a 1 × 1 convolutional layer that learns to fuse the input images with the predicted label maps or ground truths. Figure 4 shows two visual results of such fusion. The fused images are then each fed into the encoder-like network to extract features. To capture long- and short-range spatial relations between pixels, we extract multi-scale feature maps from multiple layers and concatenate them to compute the multi-scale L1 loss [17]. The loss function is introduced in detail in the next section.

Training Strategy
The proposed GAN-SCA is trained in an adversarial fashion. The segmentation network aims to generate predicted labeling maps that deceive the adversarial network, and the adversarial network aims to distinguish the ground truths from the predicted labeling maps generated by the segmentation network. Therefore, the segmentation network and the adversarial network are trained alternately in the training phase [30]. We first fix the parameters of the segmentation network (S) and train the adversarial network (A) to minimize the multi-scale L1 loss (Equation (5)). Then the parameters of A are fixed, and S is trained by minimizing the cross-entropy losses combined with the negative multi-scale L1 loss (Equation (7)).
In Equation (5), $(x_n, S(x_n))$ is the concatenation of the input images and the predicted results of S, $(x_n, y_n)$ is the concatenation of the input images and the ground truths, $f_A(x)$ denotes the hierarchical features extracted from $x$ by A, and $l_{mae}$ is the L1 distance, or mean absolute error (MAE), which is defined as:
$$ l_{mae}(f_A(x), f_A(x')) = \frac{1}{L} \sum_{i=1}^{L} \left\| f_A^i(x) - f_A^i(x') \right\|_1 \qquad (6) $$
where $L$ is the total number of feature scales in the adversarial network, and $f_A^i(x)$ is the feature at scale $i$.
In Equation (7), $L_{fa}$ is the auxiliary cross-entropy loss computed in each spatial attention mechanism, and $y(x_n)$ denotes the ground truth of the $n$-th image in the current batch.
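As an illustration of the multi-scale L1 distance described above, the following sketch compares two sets of hierarchical features, one per input to the adversarial network. It assumes that each scale has already been flattened into a one-dimensional list, and it averages the absolute differences within each scale (the "mean absolute error" reading). The function name and this per-scale normalization are assumptions. The full training objectives, including their signs and the cross-entropy terms, are not reproduced here.

```python
def multi_scale_l1(features_a, features_b):
    """Mean absolute difference between two feature hierarchies.

    features_a, features_b: lists with one flat feature vector per scale.
    Returns the average over scales of the per-scale mean absolute error.
    """
    if len(features_a) != len(features_b):
        raise ValueError("both inputs must have the same number of scales")
    total = 0.0
    for fa, fb in zip(features_a, features_b):
        if len(fa) != len(fb):
            raise ValueError("feature vectors at one scale differ in length")
        # Mean absolute error at this scale.
        total += sum(abs(a - b) for a, b in zip(fa, fb)) / len(fa)
    return total / len(features_a)
```

With two scales, `multi_scale_l1([[1.0, 2.0], [0.0]], [[1.0, 4.0], [3.0]])` averages a per-scale error of 1.0 and 3.0, and identical feature hierarchies give a distance of zero.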
The parameters of the segmentation network and the adversarial network are initialized with normally distributed random variables. The initial learning rate is set to $10^{-3}$ and divided by 2 every 15 epochs. The batch size is set to 5. We crop the training images into patches of size 384 × 384 with 25% overlap, and data augmentation, including flips and rotations, is also applied. In the testing phase, to meet memory constraints, we employ a sliding window of size 1024 × 1024 to accomplish the full-tile prediction. We set the overlap to 75% in the testing stage to mitigate inconsistent predictions at patch borders, since this overlap has been shown to give the best results in previous works [11,31].
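The schedule and tiling described above can be sketched as follows. This is only an illustration of the stated numbers: the function names are hypothetical, epochs are assumed to be counted from zero, and the extra final window that is placed flush with the image border is an assumption about how a tile whose size is not a multiple of the stride might be covered. How overlapping predictions are merged is not specified in the text and is not shown.

```python
def learning_rate(epoch, base_lr=1e-3, step=15, factor=2.0):
    """Step schedule: base_lr divided by `factor` once every `step` epochs."""
    return base_lr / (factor ** (epoch // step))


def window_starts(length, window, overlap):
    """Start offsets of sliding windows along one image axis.

    length:  image size in pixels along this axis
    window:  window size in pixels
    overlap: fraction of the window shared by neighbouring windows
    """
    stride = max(1, int(round(window * (1.0 - overlap))))
    if length <= window:
        return [0]
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        # Assumed convention: add a last window that ends at the border.
        starts.append(length - window)
    return starts


if __name__ == "__main__":
    # Test-time tiling of a 5000-pixel axis with 1024-pixel windows, 75% overlap.
    print(window_starts(5000, 1024, 0.75))
    # Training crops of 384 pixels with 25% overlap.
    print(window_starts(5000, 384, 0.25))
```

With 75% overlap, a 1024-pixel window advances by 256 pixels, so every interior pixel is covered by several windows along each axis.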

Datasets
The datasets used in this work are two open building datasets, namely the Inria aerial image labeling dataset and the Massachusetts buildings dataset. These two datasets cover various building characteristics, such as shape, size, distribution, and spatial resolution, and can therefore be used to evaluate the generalization ability of networks.
The first dataset is the Inria aerial image labeling dataset for buildings [23]. The dataset consists of 360 high-resolution aerial images covering different cities, including Austin, Chicago, Kitsap, Western/Eastern Tyrol, Vienna, Bellingham, Bloomington, and San Francisco. These regions cover dissimilar urban settlements; for instance, most buildings in Chicago and San Francisco are densely distributed and usually small, while buildings in Kitsap are scattered. The spatial resolution of the images is 30 cm, the image size is 5000 × 5000 pixels, and each image covers a surface of 1500 × 1500 m². Only 180 tiles are provided with ground truths, and the other 180 tiles are reserved for testing. Following common practice [23], we choose the first five images of each region from the training set for validation.
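The validation split and the tile footprint described above can be checked with a short sketch. It assumes that the 180 labeled tiles are spread evenly over five training regions (36 tiles each) and that tiles are identified by a region name followed by a 1-based index, such as `austin1`. Both the naming scheme and the region identifiers are assumptions made for illustration.

```python
def split_inria(regions, tiles_per_region=36, n_val=5):
    """Send the first n_val tiles of each region to validation, the rest to training."""
    train, val = [], []
    for region in regions:
        for index in range(1, tiles_per_region + 1):
            name = f"{region}{index}"  # hypothetical tile identifier
            (val if index <= n_val else train).append(name)
    return train, val


def tile_extent_m(pixels, resolution_m):
    """Side length in metres of a square tile with the given pixel size."""
    return pixels * resolution_m


if __name__ == "__main__":
    regions = ["austin", "chicago", "kitsap", "tyrol-w", "vienna"]
    train, val = split_inria(regions)
    print(len(train), len(val))
    print(tile_extent_m(5000, 0.3))
```

Under these assumptions the split leaves 155 tiles for training and 25 for validation, and a 5000-pixel tile at 30 cm resolution spans about 1500 m per side, which matches the stated footprint.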
The second dataset is the Massachusetts buildings dataset [22]. The dataset consists of 151 high-resolution aerial images of urban and suburban areas around Boston. The size of the images in this dataset is 1500 × 1500 pixels, and each image covers an area of 2.25 km² (1500 × 1500 m²). The dataset is randomly divided into three subsets, namely a training set (137 tiles), a validation set (4 tiles), and a testing set (10 tiles).

Evaluation Metrics
To make a fair comparison, we compute the same metrics as those used in the literature. For the Inria aerial image labeling dataset, the overall accuracy (Acc.) and intersection over union (IoU) are used for quantitative performance evaluation. Acc. is the proportion of correctly labeled pixels (see Equation (8)). IoU is the intersection of the pixels labeled as building in the predicted results and in the ground truths, divided by the union of the pixels labeled as building in the predicted results and in the ground truths (see Equation (9)).
$$ \mathrm{Acc.} = \frac{tp + tn}{tp + tn + fp + fn} \qquad (8) $$
$$ \mathrm{IoU} = \frac{tp}{tp + fp + fn} \qquad (9) $$
where $tp$ denotes the number of true positive pixels, $fp$ denotes the number of false positive pixels, $tn$ denotes the number of true negative pixels, and $fn$ denotes the number of false negative pixels.
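A small sketch of the two metrics defined above, computed from binary masks stored as nested lists (1 for building, 0 for background). The function names are illustrative, and returning an IoU of 1.0 when neither mask contains any building pixel is an assumed convention.

```python
def confusion_counts(pred, truth):
    """Count tp, fp, tn, fn over two binary masks of equal shape."""
    tp = fp = tn = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def overall_accuracy(tp, fp, tn, fn):
    """Proportion of correctly labeled pixels."""
    return (tp + tn) / (tp + fp + tn + fn)


def building_iou(tp, fp, fn):
    """Intersection over union of the building class."""
    union = tp + fp + fn
    return tp / union if union else 1.0  # assumed convention for empty masks


if __name__ == "__main__":
    pred = [[1, 1, 0, 0], [0, 0, 0, 0]]
    truth = [[1, 0, 0, 0], [1, 0, 0, 0]]
    tp, fp, tn, fn = confusion_counts(pred, truth)
    print(overall_accuracy(tp, fp, tn, fn), building_iou(tp, fp, fn))
```

In this toy example six of eight pixels are correct, so the accuracy is 0.75, yet the IoU is only 1/3, because the true negatives that dominate background-heavy tiles do not enter the IoU.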
For the Massachusetts buildings dataset, the relaxed F1-measure is used to evaluate the segmentation performance of each network. A relaxation factor ρ is introduced when computing the confusion matrix because the tool that the producers of this dataset used to generate the labels is accurate only up to a few pixels. Following previous works [23][24][25], we compute the F1-measure with a relaxation factor of three, and the F1-measure without relaxation (ρ = 0) is also reported. The F1-measure can be written as:
$$ F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} $$
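One common reading of the relaxed measure is sketched below. A predicted building pixel counts as correct if a true building pixel lies within ρ pixels of it (relaxed precision), and a true building pixel counts as detected if a predicted building pixel lies within ρ pixels of it (relaxed recall). The square (Chebyshev) neighbourhood, the function names, and the handling of empty masks are assumptions. This is not the evaluation code used to produce the reported numbers.

```python
def dilate(mask, rho):
    """Mark every pixel within rho pixels (square window) of a set pixel."""
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            if mask[i][j]:
                for di in range(-rho, rho + 1):
                    for dj in range(-rho, rho + 1):
                        ii, jj = i + di, j + dj
                        if 0 <= ii < height and 0 <= jj < width:
                            out[ii][jj] = 1
    return out


def relaxed_f1(pred, truth, rho=3):
    """Relaxed F1 for binary masks (nested lists of 0/1) of equal shape."""
    near_truth = dilate(truth, rho)
    near_pred = dilate(pred, rho)
    n_pred = sum(map(sum, pred))
    n_truth = sum(map(sum, truth))
    # Predicted pixels that have a true pixel within rho.
    hit_pred = sum(1 for p_row, t_row in zip(pred, near_truth)
                   for p, t in zip(p_row, t_row) if p and t)
    # True pixels that have a predicted pixel within rho.
    hit_truth = sum(1 for t_row, p_row in zip(truth, near_pred)
                    for t, p in zip(t_row, p_row) if t and p)
    precision = hit_pred / n_pred if n_pred else 0.0
    recall = hit_truth / n_truth if n_truth else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```

With ρ = 0 the function reduces to the ordinary pixel-wise F1-measure. A prediction that is offset from the label by a single pixel scores 0 at ρ = 0 but 1 at ρ = 1, which is the label inaccuracy that the relaxation is meant to absorb.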

Ablation Study
In this section, we first evaluate whether the two attention mechanisms benefit the segmentation performance by comparing the base architecture (i.e., the standard U-Net) with the U-Net embedded with the attention mechanisms (U-Net-SCA). It should be noted that U-Net-SCA is the segmentation network of the proposed GAN-SCA. In addition, we employ dense CRFs as a post-processor for U-Net-SCA to further improve the segmentation results (U-Net-SCA+CRFs). We also explore the recurrent approach to label refinement, following the ReuseNet in [14] (U-Net-SCA+Reuse), which applies U-Net-SCA in R cycles. We choose R = 3 in this experiment because the U-Net-SCA architecture with three cycles achieves the best performance on the Inria aerial image labeling dataset. Finally, we train U-Net-SCA combined with an adversarial network (GAN-SCA) in an alternating fashion to see how adversarial training affects the segmentation results.
The models described above are trained over five independent runs with random initialization, and the average accuracy and IoU, with the standard deviation, on the validation set of the Inria aerial image labeling dataset are reported in Table 1. As can be observed from Table 1, the proposed U-Net-SCA achieves improvements of 0.19% and 0.72% in terms of overall accuracy and IoU compared to the standard U-Net. U-Net-SCA also outperforms the standard U-Net in the accuracy and IoU of each region. The accuracy increase is more evident for Chicago and Vienna, where buildings are densely distributed and the proportion of building pixels in the training set is higher. This indicates that the spatial and channel attention mechanisms enable the network to selectively enhance useful features and thereby improve segmentation accuracy. U-Net-SCA+CRFs yields only a marginal improvement over U-Net-SCA, with the overall accuracy and IoU improved by 0.01% and 0.21%, respectively. By adopting the recurrent approach and the adversarial network, U-Net-SCA+Reuse and GAN-SCA achieve a similarly small improvement in overall accuracy and IoU when compared to U-Net-SCA. Recall that the adversarial training strategy adopted by the proposed GAN-SCA can learn high-order consistency without extra time cost in the testing phase, whereas the recurrent approach of U-Net-SCA+Reuse comes with a multi-fold increase in trainable weights, which increases the computational complexity.
Figure 5 shows the segmentation results of the methods described above on the Inria aerial image labeling dataset. Figure 5a shows the results for an image patch over Austin, from which we can observe that the standard U-Net is affected by shadows and fails to correctly segment the boundaries of buildings with complex structures (upper left part of the figure). With the help of the spatial and channel attention mechanisms, U-Net-SCA performs better in the same situation but still mislabels some non-building pixels in shadows as building. The extraction results from U-Net-SCA+CRFs, U-Net-SCA+Reuse, and GAN-SCA all appear to have clearer boundaries, especially for buildings with complex structures, owing to the different label refinement methods adopted. Among them, the proposed GAN-SCA produces the clearest and most accurate outlines for this kind of building. In addition, some large buildings are difficult to label correctly because edges on their rooftops have a color similar to that of roads. Figure 5b shows the results for a large building in Chicago, where the results of U-Net suffer from over-segmentation along the inner edges of the detected building, while U-Net-SCA improves the results by using the channel and spatial attention mechanisms to selectively enhance useful features. U-Net-SCA+CRFs smooths the results, yet the improvement seems insignificant. The result of U-Net-SCA+Reuse also shows a slight improvement, but the over-segmentation is not effectively resolved. In contrast, the proposed GAN-SCA labels the large building more completely. Moreover, buildings with complex shapes and multiple colors are prone to confuse the networks, as shown in the middle of Figure 5c: most methods mislabel this kind of building as non-building, while the proposed GAN-SCA provides a relatively accurate segmentation result.

Inria Aerial Image Labeling Dataset
To evaluate the performance on the Inria aerial image labeling dataset, we compare the proposed GAN-SCA (the best results we achieved) with several state-of-the-art methods, including the baseline method FCN [23], the multi-layer perceptron (MLP) [23], Mask R-CNN [32] as applied by Ohleyer et al. [33], SegNet+Multi-Task Loss [24], 2-Levels U-Nets [25], and the multi-stage multi-task (MSMT) network [34]. FCN and MLP are frameworks proposed by the producers of the Inria aerial image labeling dataset. MLP is derived from the base FCN and introduces a multi-layer perceptron to learn how to combine features at different resolutions. Mask R-CNN consists of a region proposal network (RPN) and an FCN. The RPN takes the whole image as input and outputs bounding box proposals, and the FCN then performs segmentation within the proposals. SegNet+Multi-Task Loss is based on the SegNet architecture and trained with an uncertainty-based multi-task loss. In particular, one convolutional layer L follows the last layer of the decoder to generate the distance classes, and the output of the decoder's last layer is then concatenated with the output of L to predict the final segmentation results. 2-Levels U-Nets was proposed in [25], where two U-Net architectures are arranged end-to-end and the second U-Net serves as a post-processor for the first. Moreover, test-time augmentation is applied to further improve the segmentation performance. The MSMT architecture was proposed in [34]. It is a neural network with two stages: the first stage is dedicated to semantic segmentation, while the second stage is designed for localization.
Table 2 presents the accuracy and IoU of the different methods on the Inria aerial image labeling dataset. It is worth noting that IoU takes into account both false alarms and missed detections, which makes it a more suitable metric than global accuracy on the Inria dataset, because this dataset contains large areas of background pixels. It can be seen from Table 2 that MLP outperforms the base FCN [23] by introducing a multi-layer perceptron to fuse multi-resolution features. Mask R-CNN is a promising architecture, but it requires very careful hyperparameter tuning [33]. As a result, it achieves better performance than MLP in Austin and Tyrol-w but lower performance in most other regions. SegNet+Multi-Task Loss improves the performance of SegNet by introducing a cascaded multi-task loss, but the improvement is still limited. Although it achieves the best accuracy in Chicago and Vienna, the corresponding IoU is not ideal. 2-Levels U-Nets and MSMT achieve similar accuracy, but the former outperforms the latter in terms of IoU in all regions. This is mainly because 2-Levels U-Nets is based on U-Net, which is a deeper architecture than that of MSMT. The proposed GAN-SCA is also built on U-Net. With the help of the attention mechanisms and the adversarial training strategy, GAN-SCA outperforms 2-Levels U-Nets in most evaluation metrics and produces the highest IoU in most regions, especially in densely populated cities such as Austin, Chicago, and Vienna. In terms of overall accuracy and IoU, the proposed method surpasses all other methods by a considerable margin, which shows that it can accomplish accurate building segmentation. The qualitative results of GAN-SCA are shown in Figure 6. It can be seen that GAN-SCA achieves accurate building segmentation results with smooth outlines in each region.

Massachusetts Buildings Dataset
We tested the performance of the proposed GAN-SCA on the Massachusetts buildings dataset by using the same metrics as the compared methods. We compared the performance of GAN-SCA with several state-of-the-art methods, including Mnih-CNN+CRFs [22], Satio-multi-MA&CIS [21], LG-Seg-ResNet-IL [35], and MSMT [34]. Mnih-CNN+CRFs was proposed by the producers of the Massachusetts buildings dataset; it belongs to the patch-based category and includes CRFs as a post-processor. Satio-multi-MA&CIS was based on the Mnih-CNN architecture, in which a channel-wise inhibited softmax (CIS) loss function and model averaging (MA) techniques were used to further enhance the extraction performance. LG-Seg-ResNet-IL is a dual local-global semantic segmentation architecture with residual connections and an intermediate contextual loss (IL), which learns to combine local appearance and global contextual information simultaneously in a complementary way. MSMT is the same method described in Section 4.2.1.
Table 3 compares the F1-measure of each method, in which ρ denotes the relaxation factor used when computing the corresponding recall and precision measures. As shown in Table 3, our GAN-SCA outperforms all other methods. With the help of the CIS loss function and MA with spatial displacement, Satio-multi-MA&CIS achieves a slight improvement compared to the baseline method Mnih-CNN+CRFs. LG-Seg-ResNet-IL effectively combines the local and global information, which mitigates the problem of the limited receptive field of the patch-based method. As a result, LG-Seg-ResNet-IL achieves a remarkable improvement compared to the first two methods. MSMT is an FCN-based method that introduces a multi-stage multi-task training strategy to enhance segmentation performance. MSMT and GAN-SCA achieve better performance than the patch-based methods, which indicates the superiority of the pixel-based approach. Thanks to the deeper architecture and the feature selection provided by the attention mechanisms, GAN-SCA exhibits better performance than MSMT, which further supports the rationale of the proposed method. Figure 7 exhibits the prediction results of the proposed model for three image patches. It can be seen that our proposed model performs well in challenging areas.

Table 3 (recovered rows; the two columns are the F1-measure at the two settings of ρ, and "-" marks a value that is not reported):
…                        -        92.30%
LG-Seg-ResNet-IL [35]    -        94.30%
MSMT [34]                83.39%   96.04%
GAN-SCA                  84.79%   96.36%
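The role of the factor ρ in the F1-measure can be illustrated with a minimal sketch. It assumes the commonly used relaxed definition, in which a predicted building pixel counts as correct if it lies within ρ pixels of a true building pixel, and a true building pixel counts as recalled if it lies within ρ pixels of a predicted one. The use of the Chebyshev distance, the function names, and the toy pixel sets are assumptions made for illustration; this is not the evaluation code used for Table 3.

```python
def _near(pixel, others, rho):
    """True if any pixel in `others` lies within `rho` (Chebyshev distance) of `pixel`."""
    r, c = pixel
    return any(abs(r - r2) <= rho and abs(c - c2) <= rho for r2, c2 in others)


def relaxed_f1(pred_pixels, true_pixels, rho):
    """Relaxed F1 over two sets of (row, col) building pixels."""
    if not pred_pixels or not true_pixels:
        return 0.0
    # Relaxed precision: predicted pixels that have a true pixel nearby.
    precision = sum(_near(p, true_pixels, rho) for p in pred_pixels) / len(pred_pixels)
    # Relaxed recall: true pixels that have a predicted pixel nearby.
    recall = sum(_near(t, pred_pixels, rho) for t in true_pixels) / len(true_pixels)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the prediction is the ground truth shifted right by one pixel.
true_pixels = {(0, 0), (0, 1), (1, 0), (1, 1)}
pred_pixels = {(r, c + 1) for r, c in true_pixels}

strict = relaxed_f1(pred_pixels, true_pixels, rho=0)   # exact pixel overlap only
relaxed = relaxed_f1(pred_pixels, true_pixels, rho=1)  # tolerates a 1-pixel offset
```

With ρ = 0 the one-pixel shift halves both precision and recall, whereas with ρ = 1 the same prediction is scored as perfect, which is why relaxed scores are systematically higher than strict ones.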

Experiments on FCN based GAN-SCA
The experiments above adopted U-Net as the baseline segmentation network for the proposed GAN-SCA and achieved a certain improvement over the standard U-Net. In fact, the proposed GAN-SCA can be built on top of many other semantic segmentation architectures. In this section, we explore GAN-SCA on top of FCN-8s to further demonstrate the effectiveness of the attention mechanisms and adversarial training in building extraction from high-resolution remote sensing images. Figure 8 shows the architecture of the segmentation network (FCN-8s-SCA), where the channel and spatial attention mechanisms are embedded into the FCN-8s architecture with VGG-16 [36] as the encoder. As in the U-Net based GAN-SCA described above, the adversarial network of this version follows the structure of the encoder of its segmentation network. We train FCN-8s, FCN-8s-SCA, and GAN-SCA on the Inria aerial image labeling dataset using the same training strategy as introduced in Section 2.2. The experimental results are reported in Table 4.
Compared with FCN-8s, FCN-8s-SCA improved the overall accuracy and IoU by 0.49% and 3.71%, respectively. With the adversarial training strategy, the FCN-8s based GAN-SCA further improved the extraction performance by 0.48% and 3.16% for the overall accuracy and IoU, respectively. We can conclude that the attention mechanisms improve the segmentation performance through feature selection, and that adversarial training further refines the segmentation result by learning high-order consistency. In addition, the improvement of the FCN-8s based GAN-SCA is more significant than that of the aforementioned U-Net based GAN-SCA. This is because the standard U-Net architecture already achieves remarkable segmentation performance, as it fuses high-level and low-level features by first concatenating them and then performing convolutions for dimensionality reduction. The convolutional layers in this process enable the network to learn how to fuse multi-scale features, which can be regarded as feature selection to some extent. In contrast, FCN-8s, which has lower segmentation accuracy on this dataset than the standard U-Net, fuses features by element-wise addition, which seems unsuitable without any feature selection. Therefore, FCN-8s can take more advantage of the attention mechanisms.
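The contrast between the two fusion styles can be sketched for a single pixel. The example below is a simplified illustration under stated assumptions: the convolution is reduced to a 1x1 channel mix (the actual networks use larger kernels), and the feature values, weight matrices, and function names are hypothetical.

```python
def fuse_by_addition(low, high):
    """FCN-8s-style fusion: element-wise sum of two feature vectors at one pixel."""
    return [a + b for a, b in zip(low, high)]


def fuse_by_concat_conv(low, high, weights):
    """U-Net-style fusion: concatenate channels, then mix them with a 1x1 convolution.

    weights[k] holds one weight per concatenated input channel for output channel k.
    """
    stacked = low + high  # channel-wise concatenation of the two feature vectors
    return [sum(w * x for w, x in zip(row, stacked)) for row in weights]


# Hypothetical 2-channel features at a single pixel.
low = [1.0, 2.0]
high = [10.0, 20.0]

# With these weights the 1x1 convolution reproduces plain addition ...
as_addition = [[1.0, 0.0, 1.0, 0.0],
               [0.0, 1.0, 0.0, 1.0]]
# ... while learned weights could instead down-weight one of the two sources.
prefer_low = [[1.0, 0.0, 0.5, 0.0],
              [0.0, 1.0, 0.0, 0.5]]

added = fuse_by_addition(low, high)
same_as_added = fuse_by_concat_conv(low, high, as_addition)
selective = fuse_by_concat_conv(low, high, prefer_low)
```

Element-wise addition is thus one fixed special case of concatenation followed by convolution, while the learned weights of the latter can emphasize or suppress individual sources. This is one way to read the statement that U-Net already performs a degree of feature selection.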

Discussion
The experimental results reported in Section 4 show that the proposed approach achieves state-of-the-art performance on both the Inria and Massachusetts buildings datasets. Furthermore, GAN-SCA can also be employed on top of other semantic segmentation architectures to improve their performance. The effectiveness of our proposed method comes from the feature selection in the spatial and channel dimensions and from the label refinement obtained by learning high-order structural features. First, the adoption of spatial and channel attention mechanisms helps to enhance useful features while suppressing interfering information, which improves the segmentation performance around building borders and mitigates over-segmentation. Second, the adversarial training strategy learns the latent high-order structural information in the training phase and achieves label refinement in the testing phase without extra time consumption. In particular, the segmentation network and adversarial network of our architecture were optimized with a multi-scale feature loss to better capture multi-range spatial relationships between pixels. These factors give the proposed GAN-SCA a better feature extraction capability and better segmentation performance.
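A multi-scale feature loss of this kind can be sketched as follows. This section does not give the exact formulation, so the sketch assumes an L1 distance between the adversarial network's features for the image fused with the predicted map and for the image fused with the ground truth, averaged over layers with equal weights. The function names and the feature values are hypothetical.

```python
def mean_abs_diff(a, b):
    """Mean absolute difference between two equally long flat feature lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def multi_scale_l1(features_pred, features_truth):
    """Average the per-layer L1 distances over all feature scales (equal weights assumed)."""
    per_layer = [mean_abs_diff(fp, ft) for fp, ft in zip(features_pred, features_truth)]
    return sum(per_layer) / len(per_layer)


# Hypothetical flattened features from three layers of the adversarial network,
# for the image fused with the predicted map and with the ground truth.
features_pred = [[0.0, 1.0, 2.0, 3.0], [1.0, 1.0], [4.0]]
features_truth = [[0.0, 1.0, 2.0, 5.0], [1.0, 3.0], [4.0]]

loss = multi_scale_l1(features_pred, features_truth)
```

Because each layer contributes its own distance, disagreements are penalized both at fine scales (early layers) and at coarse scales (deep layers), which matches the stated aim of capturing multi-range spatial relationships.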
Although the proposed approach performs well as a fully supervised method, it relies on a large number of manually labeled samples. Further research is needed to reduce the burden of manual annotation. Possible directions include data augmentation techniques and adversarial learning for semi-supervised semantic segmentation. Data augmentation techniques can increase the number of training samples and improve the generalization ability of models. We have explored some standard data augmentation techniques, including flipping and rotation, in this work and in previous works to mitigate overfitting. More data augmentation strategies will be explored in our future work. In addition, adversarial learning for semi-supervised semantic segmentation is an interesting research direction, as it can take advantage of unlabeled data to generate a self-taught signal that refines the segmentation network. These approaches will be highly relevant in fields such as remote sensing image analysis, in which large labeled datasets are expensive to obtain.
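Flip and rotation augmentation for segmentation must apply the same transform to the image and to its label map. The sketch below shows one way to enumerate the eight flip and rotation variants of a square tile; the function names and the tiny 2 x 2 tile are hypothetical and do not reproduce the augmentation pipeline used in this work.

```python
def hflip(grid):
    """Mirror a 2-D grid left to right."""
    return [row[::-1] for row in grid]


def rot90(grid):
    """Rotate a 2-D grid 90 degrees clockwise."""
    return [list(col) for col in zip(*grid[::-1])]


def augment_pair(image, mask):
    """Yield the 8 flip/rotation variants, applying each transform to both grids."""
    for flipped in (False, True):
        img, msk = (hflip(image), hflip(mask)) if flipped else (image, mask)
        for _ in range(4):
            yield img, msk
            img, msk = rot90(img), rot90(msk)


# Hypothetical 2x2 tile: the pixel with value 1 is the only building pixel.
image = [[1, 2], [3, 4]]
mask = [[1, 0], [0, 0]]

variants = list(augment_pair(image, mask))
```

Each variant keeps the label aligned with the pixel it describes, so a single labeled tile yields eight consistent training samples.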

Conclusions
This paper presented an effective GAN-based approach for building extraction from high-resolution remote sensing images. The adopted architecture consists of two parts, the segmentation network and the adversarial network, which are used, respectively, to generate segmentation maps of buildings and to discriminate between the ground truths and the predicted results of the segmentation network. To enable the segmentation network to focus on more useful information, spatial and channel attention mechanisms are embedded into the standard U-Net. The adversarial network architecture is similar to the encoder of the segmentation network, where the extracted multi-layer features are considered when computing the multi-scale L1 loss in the adversarial training phase.
The experiments were conducted on the Inria aerial image labeling dataset for buildings as well as the Massachusetts buildings dataset. The experimental results show that the spatial and channel attention mechanisms can selectively enhance useful features to improve the segmentation performance, while adversarial training can further refine the segmentation results with little time consumption during the testing stage. Compared with the state-of-the-art methods on both the datasets,

Figure 1. Architecture of the proposed generative adversarial network with spatial and channel attention mechanisms (GAN-SCA). A is the max pooling layer; B are convolutional + batch normalization + rectified linear unit (ReLU) layers; C is the upsampling layer; D is the concatenation operation; SA is the spatial attention mechanism; CA is the channel attention mechanism; RS is the reshape operation.

Figure 2. Entropy score maps of four-scale spatial attention mechanisms. (a) is the original image; (b) is the ground truth; (c) is the prediction result; (d) is the error map; (e-h) are the entropy score maps of the low-to-high scale spatial attention mechanisms.

Figure 3. Composition modules in the GAN-SCA. (a) Spatial attention mechanism; (b) Channel attention mechanism. FC is the fully connected layer.

Figure 4. Fusion features of input images (one channel) and the predicted label maps/ground truths. (a) Input images; (b) Fusion results of input images and ground truths; (c) Fusion results of input images and the predicted label maps (5000 iterations).

Figure 5. Building extraction results for three image patches of the Inria aerial image labeling dataset. (a) Image patch over Austin; (b) Image patch over Chicago; (c) Image patch over Vienna. Green: true positive (tp) pixels; Gray: true negative (tn) pixels; Blue: false positive (fp) pixels; Red: false negative (fn) pixels.

Figure 6. Building extraction results on the Inria aerial image labeling dataset. (a) Image patch over Austin; (b) Image patch over Chicago; (c) Image patch over Vienna. Green: true positive (tp) pixels; Gray: true negative (tn) pixels; Blue: false positive (fp) pixels; Red: false negative (fn) pixels.

Figure 7. Building extraction results on the Massachusetts buildings dataset. (a-c) Prediction results of three image patches in the Massachusetts buildings dataset. Green: true positive (tp) pixels; Gray: true negative (tn) pixels; Blue: false positive (fp) pixels; Red: false negative (fn) pixels.

Figure 8. Architecture of FCN-8s-SCA. A is the max pooling layer; B are convolutional + rectified linear unit (ReLU) layers; C is the transpose convolutional layer; SA is the spatial attention mechanism; CA is the channel attention mechanism; RS is the reshape operation.

Table 1. Experimental results on the Inria aerial image labeling dataset.

Table 2. Experimental results on the Inria aerial image labeling dataset.

Table 3. Experimental results on the Massachusetts buildings dataset.

Table 4. Experimental results on the Inria aerial image labeling dataset.