Self-Attention in Reconstruction Bias U-Net for Semantic Segmentation of Building Rooftops in Optical Remote Sensing Images

Abstract: Deep learning models have brought great breakthroughs in building extraction from high-resolution optical remote-sensing images. In recent research, the self-attention module has attracted wide interest in many fields, including building extraction. However, most current deep learning models equipped with a self-attention module still overlook the effectiveness of reconstruction bias. By tipping the balance between the encoding and decoding abilities, i.e., making the decoding network considerably more complex than the encoding network, the semantic segmentation ability can be reinforced. To remedy the lack of research combining self-attention and reconstruction-bias modules for building extraction, this paper presents a U-Net architecture that combines both. In the encoding part, a self-attention module is added to learn the attention weights of the inputs; through this module, the network pays more attention to positions that may contain salient regions. In the decoding part, multiple large-kernel convolutional up-sampling operations are used to increase the reconstruction ability. We test our model on two publicly available datasets, the WHU and Massachusetts Building datasets, achieving IoU scores of 89.39% and 73.49%, respectively. Compared with several recent well-known semantic segmentation methods and representative building extraction methods, our method achieves satisfactory results.


Introduction
Building footprints play an important role in many applications, ranging from urban planning [1], population estimation [2], disaster management [3] and land dynamic analysis to illegal building-construction recognition [4]. Given the rapid development of automatic semantic segmentation technology in computer vision, traditional manual building extraction and labeling work has been greatly relieved. Automated extraction of buildings from high-resolution optical remote-sensing images is a very active research topic in both the computer-vision and remote-sensing communities and has made substantial progress [4][5][6][7][8][9][10][11][12][13].
Currently, the most popular approach to building extraction is deep learning-based methods. In 2012, Hinton's group proposed a deep convolutional neural network (CNN) and won the ImageNet competition [14]. Since then, deep CNNs have become an instant hit worldwide, as many researchers found that they can obtain much better performance than traditional manually designed features.

Morphological and Geometrical Feature-Based
For morphological and geometrical feature-based methods, morphological and geometrical features are used as the criteria for building extraction. These features usually include shapes, lines, length and width, etc. As they are simple and easy to model visually for buildings, morphological and geometrical features have been widely used and have achieved a large amount of research success. A novel adaptive morphological attribute profile based on an object boundary constraint was proposed in [29] for building extraction from high-resolution remote-sensing images; their model was tested on groups of images from different sensors and showed good results. A building extraction approach was proposed in [30] based on the relationships among different morphological attributes (e.g., shape, size); they assessed their method on three VHR datasets and demonstrated good results. A method that combined a CNN and morphological filters was presented in [40] for building extraction from VHR images, where the morphological features were used for final filtering after the CNN extraction; the experiments showed that their method is effective. Morphological features and a support vector machine (SVM) were used in [31] for building extraction from VHR images; they tested their method on WorldView-2 and Sentinel-2 images and demonstrated good F1-scores.
Although morphological and geometrical features are simple and easy to use, they usually suffer from rigid modeling and sensitivity to image resolution, occlusion interference, etc.

Manually Designed Feature-Based
For manually designed feature-based methods, researchers usually use transformations to extract features and then combine the extracted features with classifiers for the final building extraction task. Typical classifiers include SVM, Hough Forest, Tensor Voting, Random Forest, etc. Since manually designed features have shown superior robustness to morphological features under occlusions, brightness changes, resolution changes, imaging-perspective changes, etc., manually designed feature-based methods have become popular over the past 20 years. A building extraction approach for high-resolution optical satellite images was proposed in [34] and achieved quite impressive results. Their method combined SVM, the Hough transformation and perceptual grouping: the Hough transformation was used for delineating circular buildings, while the perceptual grouping strategy constructed building boundaries by integrating detected lines. A hybrid approach to building extraction was proposed in [35], which used a template-matching strategy to automatically compute the relative height of buildings. After estimating the relative heights, an SVM-based classifier was employed to separate buildings from non-buildings, thus extracting the buildings. They tested on WorldView-2 images and achieved high building-detection accuracy.
Manually designed feature-based methods can usually extract the classical features of buildings, and buildings can be extracted with quite high accuracy when these features are combined with classifiers. However, the models' generalizability is still weak under brightness variations, occlusions, etc. The main reason may be that manually designed features cannot cover all building appearances in images, leading to incomplete consideration of special situations.

Deep Learning-Based
Recently, deep learning-based building extraction methods have made great breakthroughs. The classical models usually extract buildings with an end-to-end strategy, i.e., they take a target image as input and output a building extraction result image. The benefits of deep learning models lie in their great power for automatic feature learning and representation. In addition, deep learning-based methods can produce results quickly using GPUs: they usually take only several seconds to produce final results (sometimes even under 1 second), while unsupervised and manually designed methods usually take dozens of minutes (or even several hours) to process one image.
A single path-based CNN model was proposed in [41] for simultaneously extracting roads and buildings from remote-sensing images. After the CNN extraction, low-level features of roads and buildings were also combined to improve performance; the model was tested on two challenging datasets and demonstrated good extraction results. A Building-A-Nets model was proposed in [42,43], in which an adversarial network architecture was applied to jointly train a generator and discriminator; it was tested on publicly available datasets and achieved good results. A building extraction model based on a fully convolutional network (FCN) was proposed in [43]; to further improve the final results, Conditional Random Fields were employed, and high F1-scores and intersection-over-union (IoU) scores were obtained in the experiments. A new deep learning model based on ResNet was proposed in [44], which used specially designed guided filters to improve the results and remove salt-and-pepper noise; the method showed good performance in tests. A deep CNN model proposed in [45] integrated activations from multiple layers and introduced a signed distance function for representing building boundary outputs, demonstrating superior performance on test datasets. A deep learning model proposed in [46] aimed to conquer sensitivity to unavoidable noise and interference as well as the insufficient use of structure information, and showed good results on the test datasets. A Siamese fully convolutional network was proposed in [27] for building extraction, together with an open dataset called WHU that contains multiple data sources; the WHU dataset is now well known among publicly available building extraction datasets and has been used in much building extraction research. An EU-Net for building extraction was proposed in [47] that designed a dense spatial pyramid pooling module.
They achieved quite good results on the test datasets. In [48], a DE-Net was proposed that consists of four modules: an inception-style down-sampling module, an encoding module, a compressing module and a densely up-sampling module. The model was tested on a publicly available dataset and a self-built dataset called Suzhou, and the results showed good performance. Liu et al. proposed a building extraction model that used a spatial residual inception module to obtain multiscale contexts [49]; in addition, they used depthwise separable convolutions and convolution factorization to further improve computational efficiency. In [13], a JointNet was proposed to extract both large and small targets using a wide receptive field, with a focal loss function to further improve road extraction performance. In [50], an FCN was proposed that uses multiscale aggregation of feature pyramids to enhance scale robustness; after the segmentation results were obtained, a polygon regularization approach was further used to vectorize and polygonize them. In [40], a multifeature CNN was proposed to extract building outlines; to improve boundary regularity, morphological filtering was combined in post-processing, achieving good experimental results. In [51], a CNN model with an improved boundary-aware perceptual loss was proposed for building extraction, with very promising experimental results. The DR-Net, a dense residual network presented in [1], showed promising results on the test datasets. An attention-gate-based encoder-decoder network was used in [5] for building extraction and showed good performance on both a publicly available dataset and a self-built dataset.
Beyond the methods specially designed for building extraction from remote-sensing images, segmentation methods for natural scenes are also suitable for building extraction, so we also give a brief introduction to them. The classical deep learning-based segmentation methods include FCN [52], PSPNet [20], U-Net [53,54], DANet [55] and Residual U-Net [53], etc. In addition to the classical segmentation methods, many recent semantic segmentation methods have been proposed. Chen et al. proposed DeepLab for semantic segmentation and achieved good experimental results [56]. Zhong et al. proposed a Squeeze-and-Attention Network for semantic segmentation and achieved good results on two challenging public datasets. Zhang et al. used an encoding part that extracts multiscale contextual features for semantic segmentation and showed good experimental results [57]. Yu et al. proposed CPNet (Context Prior Network) for learning robust context features in semantic segmentation tasks and also showed good experimental results [58].
In general, deep learning-based methods have achieved great progress in building extraction from remote-sensing images. However, their main shortcoming is the requirement for a large amount of labeling work. On the other hand, according to our review, research into more powerful decoding parts of the designed models is still insufficient.

Data and Method
In this section, we first describe the datasets used in this paper. Then, we introduce our model architecture. Third, we give a detailed presentation of the self-attention (transformer) module. Finally, we describe the decoding module of our model in detail.

Datasets
In this paper, two publicly available datasets are employed for training and evaluating in our experiments: the WHU dataset [27] and the Massachusetts Building dataset [59].
The WHU dataset was released publicly in 2019 and has become a famous and popular building extraction dataset in remote-sensing research. It contains more than 2.2 million independent buildings in aerial images. The resolution of the aerial images is 7.5 cm, which is high enough to make building features clear. The coverage of WHU is about 450 km2, covering Christchurch, New Zealand. One major advantage of WHU is that it contains various architectural types of buildings in different areas (countryside, residential, cultural and industrial) and with different appearances (different colors and sizes). To make the 7.5 cm aerial imagery more suitable for building extraction, the final images are down-sampled to a resolution of 0.3 m. The original image size in WHU is 512 × 512 pixels, and WHU contains 4736, 1036 and 2416 training, validation and testing images, respectively. As our model input size is 256 × 256 pixels, we further down-sample each original WHU image to 256 × 256 pixels. Figure 1 shows several sample images used in the training, validation and testing stages, respectively. As Figure 1 shows, the image quality in WHU is high and the backgrounds are complex.
The second dataset used in this paper is the Massachusetts Building dataset [59]. It contains 137, 4 and 10 images for training, validation and testing, respectively. The original image size is 1500 × 1500 pixels. The authors did not provide detailed resolution information, but the resolution of the Massachusetts Building dataset is obviously much lower than that of WHU. To roughly estimate it, we compared the images with satellite images and found the resolution to be about 1.5 m. Each image in the Massachusetts Building dataset is too large to use directly for training, validation and testing. Thus, we cut each image into 256 × 256 pixel patches without overlapping, which fits our model input. In this way, we generated 3425, 100 and 250 images for training, validation and testing, respectively. Figure 2 shows several original images from the Massachusetts Building dataset, while Figure 3 shows several of the cut 256 × 256 pixel sample images used in our experiments. As Figures 2 and 3 show, the image quality of the Massachusetts Building dataset is lower than that of the WHU dataset. Moreover, the comparative brightness of the two datasets is also rather different.

Model Architecture
Figure 4 shows the detailed architecture of our model. The blue, pink, green, yellow, baby-blue and red rectangles represent the convolution, ReLU, pooling, up-sampling, dropout and sigmoid operations, respectively. Our model consists of two parts, the encoder and the decoder, and is essentially developed from the U-Net architecture. The encoder part includes two modules.
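The tiling described above can be sketched as follows. This is a minimal illustration (the function name and use of NumPy are our own); note that a 1500 × 1500 image yields 5 × 5 = 25 non-overlapping 256 × 256 tiles, so the 137 training images give 137 × 25 = 3425 patches, matching the counts reported above.

```python
import numpy as np

def tile_image(img: np.ndarray, tile: int = 256) -> list:
    """Cut an image into non-overlapping tile x tile patches,
    discarding partial patches at the right/bottom edges."""
    h, w = img.shape[:2]
    return [img[r:r + tile, c:c + tile]
            for r in range(0, h - tile + 1, tile)
            for c in range(0, w - tile + 1, tile)]

# A 1500 x 1500 Massachusetts image yields 5 x 5 = 25 tiles.
patches = tile_image(np.zeros((1500, 1500, 3), dtype=np.uint8))
print(len(patches))  # 25
```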
The first module is the self-attention module, which automatically learns the channel and position weights. The second module is the feature-extraction module, which includes four groups of convolution, ReLU and pooling operations; the fourth group also includes a dropout operation. Through these four groups of operations, the features are extracted. The convolution kernel size used in our model is 2 × 2, and max pooling with a pooling size of 2 × 2 is used. In the decoder part, we use a multi-up-sampling strategy with multiple kernel sizes for each up-sampling layer, similar to the strategy in [60]. Figure 5 shows the detailed flowchart of the self-attention module. In the self-attention module, three convolution operations are first implemented simultaneously. The outputs of two of the convolutions are then merged by matrix multiplication, followed by a softmax layer. Fourth, the output of the third convolution is merged with the softmax output through matrix multiplication. Finally, the merged result is added to the original inputs through an element-wise addition operation. In our model, the kernel size of the convolution operations is set at
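The computation in Figure 5 can be sketched as a forward pass in NumPy. This is only an illustration of the data flow (three parallel convolutions, matrix multiplication, softmax, a second matrix multiplication, and an element-wise residual addition), not the trainable Keras implementation; the 1 × 1 convolutions (here a per-pixel matrix multiply by `Wq`, `Wk`, `Wv`) and the reduced channel width are our assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_forward(x, Wq, Wk, Wv):
    """Forward pass of a self-attention block over an (h, w, c) map.
    Wq, Wk: (c, d) projections; Wv: (c, c) projection."""
    h, w, c = x.shape
    flat = x.reshape(h * w, c)                 # one row per position
    q, k, v = flat @ Wq, flat @ Wk, flat @ Wv  # three parallel 1x1 convs
    attn = softmax(q @ k.T, axis=-1)           # (N, N) position weights
    out = attn @ v                             # weight the value features
    return x + out.reshape(h, w, c)            # element-wise residual add

rng = np.random.default_rng(0)
x = rng.random((4, 4, 8))
Wq, Wk, Wv = rng.random((8, 2)), rng.random((8, 2)), rng.random((8, 8))
y = self_attention_forward(x, Wq, Wk, Wv)      # same shape as input
```

Because of the residual addition, the block preserves the input shape and can be dropped into the encoder without changing the surrounding layers.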

Loss Function
In our model, as our goal is to segment the input image into two classes (building and background), we use binary cross-entropy as the loss function. Given an image-label pair (x, y) and the model output y_p, the binary cross-entropy loss of sample y can be represented as:

L(y, y_p) = −[y · log(y_p) + (1 − y) · log(1 − y_p)]

Then, the learning goal of our network is:

min (1/n) · Σ_{i=1}^{n} L(y_i, y_p,i)

where n represents the total number of samples.
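The loss above can be sketched directly in NumPy; the `eps` clipping is a standard numerical safeguard we add (not from the paper) to avoid log(0).

```python
import numpy as np

def bce_loss(y, y_p, eps=1e-7):
    """Binary cross-entropy per sample: -[y*log(y_p) + (1-y)*log(1-y_p)]."""
    y_p = np.clip(y_p, eps, 1 - eps)   # guard against log(0)
    return -(y * np.log(y_p) + (1 - y) * np.log(1 - y_p))

# Training objective: mean loss over all n samples (illustrative values).
y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.6])
mean_loss = bce_loss(y_true, y_pred).mean()
```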

Results
In this section, we first introduce the details about implementations. Then, we introduce the evaluation criteria used in the experiments. Finally, we present and analyze the experimental results on the test datasets.

Experimental Implementation Detail
We implement our experiments on a computer with an Intel® Core™ i9-9900X 3.5 GHz CPU and 128 GB of memory. The GPU is an RTX 2080 Ti with 11 GB of GPU memory. Our code is based on Python, TensorFlow and Keras. Due to the GPU-memory limitation, we use a small batch size of 2 during training and validation. Instead of stochastic gradient descent (SGD), we use the Adam optimizer, with the learning rate set at 0.0001. For both the WHU and Massachusetts Building datasets, the number of training epochs is set at 500, with 4000 and 1800 steps per epoch, respectively. To further enhance performance and prevent overfitting, we apply a data augmentation strategy that randomly rotates the image in a range of −1 to 1 degree, shifts width and height each in a range of 0.1, and applies horizontal flips. During augmentation, the fill mode is set to nearest.
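A configuration sketch of the augmentation settings above, assuming Keras's `ImageDataGenerator` (the paper reports using Keras, but the exact generator class is our assumption):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation settings matching the description above.
augmenter = ImageDataGenerator(
    rotation_range=1,        # random rotation in [-1, 1] degrees
    width_shift_range=0.1,   # horizontal shift up to 10% of width
    height_shift_range=0.1,  # vertical shift up to 10% of height
    horizontal_flip=True,    # random horizontal flipping
    fill_mode="nearest",     # fill exposed pixels with nearest values
)
```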

Evaluation Criteria
To comprehensively evaluate model performance, we use four evaluation criteria that are widely used for evaluating building segmentation: recall, precision, IoU and F1-Score [61], defined as follows:

Recall = TP / (TP + FN)
Precision = TP / (TP + FP)
IoU = TP / (TP + FP + FN)
F1-Score = 2 × Precision × Recall / (Precision + Recall)

where TP, FN and FP denote true positives, false negatives and false positives, respectively. Note that all the following performance evaluations are calculated at the pixel level.

Experimental Results on the WHU Dataset
In the precision metric, SRI-Net achieves the best score. However, our approach achieves a higher recall than SRI-Net; thus, our IoU and F1-Scores are higher than SRI-Net's and the best among the compared methods. Since IoU and F1-Score are both comprehensive metrics, these results demonstrate the satisfactory performance of our model. In terms of F1-Score, our model scores about 1.96%, 3.7%, 0.6%, 0.2%, 1.3% and 0.9% higher than U-Net, SegNet, DRNet, SRI-Net, DeepLabV3+ and Zhou's method, respectively. In terms of IoU, our method scores about 3.7%, 4.5%, 1.2%, 0.3%, 2.4% and 1.6% higher than U-Net, SegNet, DRNet, SRI-Net, DeepLabV3+ and Zhou's method, respectively. Figure 6 shows sample extraction results of our method. The first, second and third columns show the original images, ground truths and our building extraction results, respectively. The selected target images are difficult for extraction, as the buildings are densely located against complex backgrounds. Even so, our model extracts the building areas with visually good results, which further demonstrates that our model is effective.
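The four formulas above translate directly into code; the counts in the usage example are illustrative only, not values from the paper.

```python
def segmentation_metrics(tp: int, fp: int, fn: int):
    """Pixel-level recall, precision, IoU and F1-Score from the
    confusion counts, following the formulas above."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    iou = tp / (tp + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, iou, f1

# Illustrative counts: 900 true-positive, 50 false-positive,
# 100 false-negative pixels.
r, p, iou, f1 = segmentation_metrics(tp=900, fp=50, fn=100)
```

Note that F1-Score simplifies to 2·TP / (2·TP + FP + FN), which is always at least as large as IoU for the same counts.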

Experimental Results on the Massachusetts Building Dataset
In this part, we show the quantitative comparison between our method and nine recent classical and popular methods tested on the Massachusetts Building dataset, including U-Net [54] and JointNet [13]. Table 2 shows the quantitative comparison results. Note that all the results of the compared methods are taken from the standard results reported by their authors. As not all the methods report results for all of the recall, precision, IoU and F1-Score metrics, several methods have blanks for some metrics. Fortunately, all the compared methods report scores for the comprehensive IoU metric. In Table 2, MSCRF obtains the highest recall score among the four valid scores. However, our method obtains a much higher precision score. Thus, for the comprehensive IoU metric, our method achieves a much higher score than MSCRF, and also higher than all the other compared methods. For the F1-Score, only three methods report scores, among which our method scores much higher than the other two. In detail, our method achieves an IoU score of 73.49%. Figure 7 shows several sample extraction results of our method on the Massachusetts Building dataset. The first, second and third columns are the original images, ground truths and our extraction results, respectively. The given test samples are all visually quite challenging. Nevertheless, our method obtains quite good extraction results on all of them. Compared with the ground truth, our extraction results appear smoother; however, they seem to miss very small building targets.

Major Abbreviations Used in Our Paper
To help readers better understand our work, we provide a detailed list of the major abbreviations used in this paper in Table 3.

Effectiveness of Transformer Module
In this part, we analyze the effectiveness of the transformer module through an ablation experiment. In this experiment, we removed the transformer module from our model structure and kept the other parts unchanged. We tested the model without the transformer module on the WHU dataset and compared its results with those of our original model. Table 4 shows the comparison. In Table 4, the recall score of our original model is slightly lower than that of the model without the transformer module. However, our original model achieves higher precision, IoU and F1-Score. The results demonstrate that the transformer module effectively enhances the attention-recognition ability through the channel and position weights, thus improving the final performance. Figure 8 shows several visual comparisons between our model with and without the transformer module. The first, second, third and fourth columns represent the original images, ground truths, results with the transformer module and results without it, respectively. From Figure 8, we can see that the results of our full model appear clearer in areas that may be interfered with by complex background objects. The reason may be that the transformer module automatically learns the position and channel weights during training; the learned attention weights tell the network where to pay more attention when extracting buildings from a given test image. By focusing on positions that are more likely to contain buildings, the model ultimately achieves better performance. We also verified the effectiveness of the transformer module on the Massachusetts Building dataset.
Table 5 shows the quantitative comparison between our method with and without the transformer module on the Massachusetts Building dataset. In the table, our method with the transformer module achieves higher precision, IoU and F1-Score than without it, while its recall is only slightly lower. These results also demonstrate that the transformer module effectively improves the performance of our model.

False Extraction
In this section, we analyze the false extractions (false positives and false negatives) in the tests. The goal is to identify the failure cases, from which we may find ways to further improve model performance in future work. Figure 9 shows the major false positive examples of our extraction results on the WHU dataset. The green, red and blue areas represent the correct, false positive and false negative areas, respectively. From the visualization of the major false positive areas, we can see that the major false positives occur for four reasons: (1) a building yard with an unusual, building-like shape can make the edge areas hard to recognize and cause false positives; (2) other objects whose shapes look very similar to buildings can be falsely recognized; (3) containers seem to cause a large number of false positives; (4) non-building areas with shadows may also result in false positives. Figure 10 shows the major false negative examples of our extraction results on the WHU dataset, with the same color coding. We analyzed the false negative areas in our test images and found that the major false negatives may occur for the following reasons: (1) occlusion by trees or other objects; (2) buildings with special roof colors similar to the color of the ground or roads; (3) our model seems to lose the consistency constraints of a building, as in the last three false negative samples in Figure 10. Figure 11 shows the major false positive and false negative examples of our extraction results on the Massachusetts Building dataset, again with the same color coding.
In Figure 11, the major false positives occur at four kinds of positions: (1) the edges of buildings; (2) the interspaces between buildings, especially where dark shadows appear; (3) sports grounds, such as tennis courts; (4) areas with light-grey colors, such as beaches. On the other hand, the major false negatives seem to occur at the following positions: (1) building areas occluded by shadows, trees, etc.; (2) wrongly labeled areas in the Massachusetts Building dataset; (3) areas that look similar to roads. Figures 12 and 13 show the extraction results of our model on the WHU and Massachusetts Building datasets, respectively. In Figures 12 and 13, few wrong extractions occur when the given test images contain no buildings or other artificial objects. The results demonstrate that our model is robust to images that contain only land, grass, vegetation, soil, etc. Overall, the visualizations of the false positive and false negative areas clearly show what happened in the wrong extractions, and analyzing them yields helpful information. According to this analysis, our model may partly miss extractions within a building in several situations. We think the reason is that our model does not consider the entirety of the building structure or the local context information. Moreover, no in-painting processing is applied after our model's extractions. From the above analysis, the following research directions may be breakthrough points for our future research. First, we will further study how to enhance context-information learning to improve performance. Second, an entirety-consistency constraint may be useful for ensuring the completeness of an extracted building. Third, to further improve the model's generalization ability, a more powerful and effective augmentation strategy for the training images is necessary; augmenting the training images with only rotation and shift transformations is insufficient. In our future work, we will keep studying building extraction in remote-sensing images from these three aspects.

Result Comparison Analysis
In this section, we discuss the experimental comparisons between our method and the compared methods. Although Tables 1 and 2 demonstrated the satisfactory performance of our method, we further reproduced the results of six classical building extraction and segmentation methods on both the WHU and Massachusetts Building datasets and analyzed both the quantitative and visual comparisons. Note that we only reproduced the methods whose codes are public. Table 6 shows the reproduced quantitative comparison among our method, U-Net [54], SegNet [62], DANet [55], PSPNet-101 [20] and Residual U-Net [53] on the WHU dataset. From the results, our method obtains the highest IoU and F1-Score, which further demonstrates its good performance. The reproduced performances of U-Net and SegNet were similar to their results in Table 1. Although DANet, PSPNet-101 and Residual U-Net are all famous segmentation methods, our method performed better than all of them in this experiment.
We also illustrate the visual comparisons. Figure 14 shows the visual comparison results of our method and the compared methods on the WHU dataset. The first to eighth columns are the original test images, the ground truth, and the results of U-Net, SegNet, DANet, PSPNet-101, Residual U-Net, and our method, respectively. From the visual comparisons, our results show better performance than the other methods. U-Net tends to produce more false positives; for example, the tennis court in the first row is extracted as a building. SegNet performs worse than our method in extraction completeness and appears to miss more positives, and the same problems exist in the results of DANet. PSPNet-101 is sensitive to large buildings but less accurate along building edges. The Residual U-Net's results are similar to U-Net's. Our method shows fewer false positives and false negatives in most results; however, it also extracts some buildings incompletely in a few test areas, e.g., the fourth row. Since none of the methods handles building completeness well, overcoming the problem of missing parts of a building may be an interesting research point in our future work.
Table 7 shows the quantitative comparison between our method and the reproduced results of the other five classical building extraction and segmentation methods on the Massachusetts Building dataset. The reproduced U-Net and SegNet obtained slightly higher scores than the reported results; however, their scores were still much lower than ours. Although DANet obtained the best precision, its recall was the lowest, resulting in lower IoU and F1-Score. PSPNet-101 performed well in IoU, and Residual U-Net performed well in F1-Score. Overall, our method achieved the best results in this comparison, which further convincingly demonstrates the effectiveness of our model.
Figure 15 shows the visual comparison of building extraction results between our method and the other five methods on the Massachusetts Building dataset. U-Net, SegNet, DANet, and Residual U-Net performed well in small building areas, whereas PSPNet-101 did not. However, PSPNet-101 outperformed the other four methods on large buildings, which may benefit from its pyramid pooling strategy, giving it a larger receptive field and a better understanding of context information. This observation suggests that capturing larger areas of context while preserving detail may be an attractive research point in building extraction from remote-sensing images. Compared with the other five methods, our results perform well on both large and small buildings; in this respect, our method performed better than the other five methods in the visual results.
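To make the pyramid pooling idea concrete, the following is a minimal, NumPy-only sketch of the strategy attributed to PSPNet above: average-pool a feature map at several grid resolutions, upsample each pooled map back to the input size, and concatenate along the channel axis. The function name and bin sizes are illustrative assumptions, not PSPNet's actual implementation.

```python
import numpy as np

def pyramid_pool(feat, bins=(1, 2, 4)):
    """Pyramid pooling over a square feature map feat of shape (C, H, H).
    For each bin size b, average-pool feat to a (C, b, b) grid, upsample
    back by nearest neighbour, and concatenate with the input channels."""
    c, h, _ = feat.shape
    outs = [feat]
    for b in bins:
        s = h // b                                        # pooling window
        pooled = feat.reshape(c, b, s, b, s).mean(axis=(2, 4))
        outs.append(np.kron(pooled, np.ones((1, s, s))))  # nearest upsample
    return np.concatenate(outs, axis=0)

feat = np.arange(8 * 8, dtype=float).reshape(1, 8, 8)
out = pyramid_pool(feat)
print(out.shape)  # (4, 8, 8): input plus one pooled channel per bin size
```

The bin-1 branch summarizes the whole map into one value per channel, which is what supplies the global context that helps on large buildings, while the concatenated original channels retain the local detail.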

Conclusions
This paper proposed an encoder-decoder model for building extraction from optical remote-sensing images. In our model, we added a transformer module in the encoder, making the model automatically learn channel and position weights for an input image. In the decoder, we used a reconstruction-bias structure, which reinforces the decoding ability of our model. We tested our model on two publicly available datasets, the WHU and Massachusetts Building datasets, and compared its performance with several classical and popular methods on both. The quantitative comparisons show that our model achieves satisfactory performance on both datasets: an IoU of 89.39% and an F1-score of 94.40% on the WHU dataset, and an IoU of 73.49% and an F1-score of 84.72% on the Massachusetts Building dataset. We also set up an ablation experiment to prove the effectiveness of the transformer module: on the WHU dataset, our model with the transformer module achieves an IoU of 89.39% and an F1-score of 94.40%, while without it the scores drop to 88.20% and 93.77%. Finally, we visualized the wrong extractions for analysis, which is helpful for our further study. From the analysis of false extractions, we found that our model is still weak at understanding context information and the overall structure of buildings. Thus, in future work we will continue our study, focusing on capturing context information and learning the overall structure of buildings.
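For readers unfamiliar with the attention mechanism referenced above, the following is a generic single-head scaled dot-product self-attention sketch over flattened spatial positions. It illustrates how per-position weights arise from the input itself; it is a textbook-style sketch with illustrative shapes, not the paper's transformer module.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention.
    x: (N, D) features for N flattened spatial positions;
    wq, wk, wv: (D, D) learned projection matrices."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(x.shape[1])          # (N, N) position affinities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ v                              # context-reweighted features

rng = np.random.default_rng(0)
n, d = 16, 8                                        # e.g. a 4x4 map, flattened
x = rng.standard_normal((n, d))
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(x, wq, wk, wv)
print(out.shape)  # (16, 8)
```

Because each output position is a softmax-weighted sum over all positions, salient regions can influence every location, which is the property the encoder-side attention module exploits.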