Building Extraction of Aerial Images by a Global and Multi-Scale Encoder-Decoder Network

Abstract: Semantic segmentation is an important and challenging task in the aerial image community, since it can extract target-level information for understanding aerial images. As a practical application of aerial image semantic segmentation, building extraction has long attracted researchers' attention, since buildings are a specific land cover in aerial images. There are two key points for building extraction from aerial images. One is learning the global and local features to fully describe buildings with diverse shapes. The other is mining the multi-scale information to discover buildings at different resolutions. Taking these two key points into account, we propose a new method named the global multi-scale encoder-decoder network (GMEDN) in this paper. Based on the encoder-decoder framework, GMEDN is developed with a local and global encoder and a distilling decoder. The local and global encoder aims at learning representative features from the aerial images for describing the buildings, while the distilling decoder focuses on exploring the multi-scale information for the final segmentation masks. Combining them, building extraction is accomplished in an end-to-end manner. The effectiveness of our method is validated by experiments conducted on two public aerial image datasets. Compared with several existing methods, our model achieves better performance.


Introduction
The aerial image is a kind of high-resolution remote sensing image that can provide diverse and high-definition information about land covers [1]. With the development of imaging techniques, large numbers of aerial images can be collected. How to effectively use them to gain knowledge for understanding our planet remains an open and difficult task. Many technologies can be adopted to mine the contents of aerial images, such as image classification [2], image semantic segmentation [3], and content-based image retrieval [4,5]. Among these techniques, image semantic segmentation is a basic and important research topic, as it can assign the corresponding semantics to each pixel within the aerial image [3]. Aerial image segmentation has been successfully applied to many remote sensing applications, such as land-use/land-cover classification [6] and points-of-interest detection [7][8][9][10].

1. How to explore local and global information is the first key point. Note that, in this paper, we define the detailed structure (e.g., buildings' outlines and shapes) as the local information, while the overall structure (e.g., buildings' context within an aerial image) is defined as the global information.

2. How to capture the multi-scale information is the second key point. Due to the specific characteristics of aerial images, the buildings within an aerial image differ in size: larger buildings may contain hundreds of pixels, while smaller ones occupy only dozens. In this paper, we refer to this issue as the multi-scale information contained in the aerial images.
Considering the two key points mentioned above, we propose a new building extraction network with an encoder-decoder structure for aerial images, which we name the global multi-scale encoder-decoder network (GMEDN). To extract the local and global information, we design a local and global encoder. The VGG16 network [29] is used to extract local information through several convolutional layers. Based on the feature maps with local information, a non-local block is introduced to capture global information by mining the similarities between all pixels.
To learn multi-scale information, a distilling decoder is developed, which includes a de-convolution branch and a multi-scale branch. The multi-scale branch predicts the final segmentation results by aggregating multi-layer feature maps in the decoder. Our source code is available at https://github.com/smallsmallflypigtang/Building-Extraction.
The main contributions of this method are summarized as follows:
1. The proposed GMEDN adopts the encoder-decoder framework with a simple skip connection scheme as the backbone. For the local and global encoder, a VGG16 is used to capture the local information from the aerial images. On top of this, the non-local block is introduced to explore the global information. Combining local and global information, buildings with diverse shapes can be segmented.

2. A distilling decoder is developed for our GMEDN, in which the de-convolution and multi-scale branches are combined to explore the fundamental and multi-scale information from aerial building images. Through the de-convolution branch, not only the low-level features (e.g., edges and textures) but also the high-level features (e.g., semantics) can be extracted from the images for segmenting the buildings; we name these features the fundamental information. Through the multi-scale branch, the multi-scale information used to predict buildings with different sizes can be captured. Integrating the fundamental and multi-scale information, buildings of various sizes can be segmented.
The remainder of this paper is organized as follows. Related work on segmentation networks and aerial building extraction is summarized in Section 2. In Section 3, we describe our GMEDN method in detail. The experiments and discussion are presented in Section 4. Finally, Section 5 concludes this paper.

Semantic Segmentation Architecture
Image semantic segmentation is a classical problem in the computer vision community. With the development of deep learning, CNNs have become central to this topic. Here, we roughly divide the current deep CNN-based segmentation networks into three groups.
In the first group, the proposed methods are general-purpose [30,31]. In 2015, the FCN model [27] transformed the fully connected layers of a traditional CNN into a series of convolutional layers, providing a basic recipe for solving the semantic segmentation problem with deep learning. Apart from FCN, another successful family of semantic segmentation networks, DeepLab, has three versions, published in [32][33][34], respectively. DeepLab-v1 applies a fully connected conditional random field (CRF) at the end of the FCN framework to obtain more accurate results. To control the receptive field of the network, it adopts dilated convolutions in place of the last two pooling layers. Building on DeepLab-v1, DeepLab-v2 adds atrous spatial pyramid pooling (ASPP) to the network. ASPP contains dilated convolutions with different rates, which capture the context of the image at multiple scales. Moreover, DeepLab-v2 uses a deep residual network instead of the VGG16 [29] used in DeepLab-v1 to improve performance. Based on DeepLab-v2, DeepLab-v3 refines the network structure so that the segmentation results are further improved.
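To make the dilation idea concrete, the following minimal numpy sketch shows a 1-D dilated convolution: inserting zeros between kernel taps enlarges the receptive field without adding parameters (ASPP applies several such convolutions with different rates in parallel and concatenates their outputs). The function name and toy data are illustrative, not taken from DeepLab.

```python
import numpy as np

def dilated_conv1d(x, kernel, rate):
    # Insert (rate - 1) zeros between kernel taps: the effective kernel
    # grows, enlarging the receptive field without extra parameters.
    k = np.zeros((len(kernel) - 1) * rate + 1)
    k[::rate] = kernel
    return np.convolve(x, k, mode="valid")

x = np.arange(10, dtype=float)
taps = np.array([1.0, 1.0, 1.0])
y1 = dilated_conv1d(x, taps, rate=1)  # plain conv, receptive field 3
y2 = dilated_conv1d(x, taps, rate=2)  # same 3 taps, receptive field 5
```

With rate 2, the three taps span five input positions, which is how ASPP sees context at multiple scales with the same parameter budget.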
In the second group, researchers make full use of CNN's hierarchical structure to extract rich information from the images, which is beneficial for segmenting complex contextual scenarios and small targets. For example, the pyramid scene parsing network (PSPNet) [35] aggregates the context information of different layers to improve the models' capacity of obtaining global information. PSPNet embeds the multi-scale, global, and local information in the FCN-based framework through the pyramid structure. Furthermore, to speed up convergence, the authors add the supervisory loss function into the backbone network. Another popular method, feature pyramid network (FPN), was introduced in the literature [36]. It uses feature maps of different resolutions to explore objects of different sizes. Through continuous up-sampling and cross-layer fusion mechanism, both the low-level visual information and the high-level semantic information can be reflected by the output feature maps.
In the third group, scholars develop segmentation networks based on object detection algorithms. Inspired by the region selection idea, the mask region-based CNN (Mask R-CNN) was presented in [37]. It adds a mask prediction branch to the structure of Faster R-CNN [38] to complete the semantic segmentation. In addition, regions of interest (RoI) Align is used to replace RoI Pooling to remove the rough quantization. Although the segmentation performance of Mask R-CNN is positive, its shared evaluation function would negatively influence the segmentation results. To address this issue, Huang et al. [39] developed the mask scoring R-CNN (MSR-CNN) model in 2019. It adds a Mask-IoU block that learns to predict the quality of an instance mask, and obtains promising results.

Aerial Building Extraction
The semantic segmentation methods discussed above were proposed for natural images. Although they can be used to extract buildings from aerial images, their performance may not be satisfactory. Therefore, many aerial-image-oriented semantic segmentation approaches have been introduced in recent years. Here, we review some popular aerial building extraction networks from the following two aspects.
The methods in the first group can be regarded as variations of FCN. Based on the FCN architecture, several post-processing technologies have been developed to improve the performance of aerial building extraction. For example, in [40], the authors designed a patch-based CNN architecture. In the post-processing stage, they combine the low-level features of adjacent regions with CNN features to improve performance. Shrestha et al. [41] proposed an enhanced FCN for building extraction. This model utilizes a CRF to refine the rough results obtained by the FCN, improving the performance of aerial building extraction. Another method was proposed in [42]: an end-to-end trainable gated residual refinement network (GRRNet). By fusing high-resolution aerial images and LiDAR point clouds, GRRNet considerably improves building extraction performance.
Besides the FCN family, other methods have been developed with the characteristics of aerial images in mind. To extract buildings from aerial images exactly, high-quality features with multi-scale and detailed information are important; therefore, some researchers focus on feature learning. Yuan et al. [43] used a simple CNN to learn features and aggregated the feature maps of multiple layers for building prediction. Furthermore, the authors introduce a signed distance function to classify boundary pixels, which is useful for obtaining fine segmentation results. In [44], another feature-learning-based building extraction method was proposed, which uses a spatial residual inception (SRI) module to capture and aggregate multi-scale context information. This model can accurately identify large buildings while preserving both global features and local detailed features. Besides the multi-scale schemes, there are other approaches for obtaining rich features for the semantic segmentation task. For example, Liu et al. [45] proposed a network with an encoder-decoder structure. By adding skip connections, the information loss caused by the usual pooling is noticeably reduced. Furthermore, spatial pyramid pooling is embedded in this model to ensure that rich contextual information can be learned. In [46], the authors combined the residual connection unit, the extended sensing unit, and the pyramid aggregation unit to complete the building extraction task. Due to the introduction of the form filter, the boundaries of the segmented results are accurate and smooth.

Overall Framework
The pipeline of our building extraction model GMEDN is shown in Figure 1. It consists of a local and global encoder and a distilling decoder. The local and global encoder is constructed from a basic feature extraction network (VGG16), a non-local block [47], and a connection block. It aims to explore the local and global information in the aerial images; by mining the object-level and spatial context information, diverse and abundant buildings can be fully represented. The details of the local and global encoder are discussed in Section 3.2. The distilling decoder contains the de-convolution branch and the multi-scale branch. The de-convolution branch captures the fundamental information (e.g., low-level visual features and high-level semantics), while the multi-scale branch explores multi-scale information; both are useful for accurately segmenting buildings with different sizes and distributions. The details of the distilling decoder are discussed in Section 3.3.

There is another point we want to touch on. As shown in Figure 1, there are five convolutional layers (i.e., convolution, max-pooling, and batch normalization) and five de-convolutional layers (i.e., up-sampling, convolution, and dropout) in the local and global encoder and the distilling decoder, respectively. Due to the max-pooling and up-sampling operations, some information within the images and feature maps would be lost. To reduce this information loss, the output of the i-th convolutional layer and the output of the matching de-convolutional layer are summed. This operation is the simple skip connection, which is widespread in semantic segmentation networks [27,48,49].
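The simple skip connection described above can be sketched in a few lines of numpy; the 2× up-sampling stand-in and the tensor shapes below are illustrative only, not the actual GMEDN layers.

```python
import numpy as np

def decode_with_skips(bottleneck, enc_feats, deconv_layers):
    """enc_feats is ordered shallow-to-deep; each decoder output is
    summed element-wise with the matching encoder feature map (the
    simple skip connection), which reduces the information lost to
    max-pooling and up-sampling."""
    x = bottleneck
    for deconv, skip in zip(deconv_layers, reversed(enc_feats)):
        x = deconv(x) + skip
    return x

# Toy stand-in for a de-convolutional layer: 2x nearest-neighbour
# up-sampling (the real layers also convolve and apply dropout).
up2 = lambda t: t.repeat(2, axis=0).repeat(2, axis=1)

enc = [np.ones((32, 32)), np.ones((16, 16))]  # shallow (32x32), deep (16x16)
out = decode_with_skips(np.zeros((8, 8)), enc, [up2, up2])
```

Because the skip is an element-wise sum rather than a concatenation, the decoder's channel widths stay unchanged, which is what makes this variant "simple".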

Local and Global Encoder
Before introducing the local and global encoder, we describe the non-local block first. The non-local block [47] is developed to obtain global information by capturing the dependencies among all pixels. Its architecture is shown in Figure 2. Suppose the input and output feature maps of the non-local block are X ∈ R^(H×W×C) and Z ∈ R^(H×W×C), where H and W represent the height and width of the feature maps and C indicates the number of channels. First, the input X is convolved by three groups of 1 × 1 convolutions with differently initialized weights to get U ∈ R^(H×W×C/2), V ∈ R^(H×W×C/2), and G ∈ R^(H×W×C/2). Note that the 1 × 1 convolution has two functions [50]. First, it enhances the non-local block's capacity for non-linear fitting, due to the activation function that follows the convolution. Second, it can change the number of channels of the input data by adjusting the number of kernels. Here, following [47], we set the number of 1 × 1 convolution kernels within each group to C/2, so the outputs (U, V, and G) have C/2 channels, which can be regarded as a channel reduction. Taking U as an example, the process can be expressed as

U(i, j) = (K * X)(i, j) + b = Σ_{m,n} k(m, n) · X(i − m, j − n) + b, (1)

where X is the input feature map, i and j are the coordinates of pixels of X, K denotes the convolutional kernel, k(m, n) indicates the value of K at coordinate (m, n), b is the convolution bias, the sign "*" represents the convolution operation, and the sign "·" denotes scalar multiplication. Second, to use the similarity relationships between pixels for mining complex contents, the non-local block constructs the similarity matrix R from U and V, and then uses R to obtain the global feature maps Y. To obtain the affinity matrix R, U and V are first reshaped to U_C ∈ R^(HW×C/2) and V_C ∈ R^(C/2×HW). Then, R ∈ R^(HW×HW) is obtained by

R = U_C × V_C, (2)

where HW is the number of pixels, C/2 indicates the number of channels of each pixel, and the sign "×" represents matrix multiplication. To get the global features Y, the similarity relationships between pixels are assigned as weights to the feature maps G. Before that, a softmax is used to normalize the affinity matrix R, and a reshape operation transforms G into G_C ∈ R^(HW×C/2). The process of getting Y can be represented as

Y = softmax(R) × G_C, (3)

where the sign "×" again denotes matrix multiplication. Third, the output of the non-local block Z ∈ R^(H×W×C) is obtained by fusing the global features Y with the input feature maps X, so that Z contains not only the initial characteristics of single pixels but also the resemblance among all pixels. To complete the feature fusion, Y is transformed into Y_C ∈ R^(H×W×C) through reshaping and 1 × 1 convolution operations. This process can be expressed as

Z = Y_C + X. (4)

The non-local block has several advantages. First, since the number of channels is halved in the first step, the non-local block is a lightweight model that adds little computational cost to the original network. Second, since its input and output have the same size, the non-local block is flexible and can be embedded after any layer. In this paper, we put the non-local block on top of the basic feature extraction network.
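The non-local computations described above can be summarized in a minimal numpy sketch. The 1 × 1 convolutions are written as per-pixel matrix products, and the weight names, shapes, and omitted biases/activations are simplifications of the actual block, not its exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def non_local_block(X, Wu, Wv, Wg, Wz):
    """Numpy sketch of the non-local block. X: (H, W, C). The 1x1
    convolutions become channel-mixing matrix products; the halved
    channel count C/2 follows the original design."""
    H, W, C = X.shape
    U = X @ Wu                          # (H, W, C/2)
    V = X @ Wv
    G = X @ Wg
    Uc = U.reshape(H * W, -1)           # (HW, C/2)
    Vc = V.reshape(H * W, -1).T         # (C/2, HW)
    R = Uc @ Vc                         # (HW, HW) pairwise similarities
    Y = softmax(R, axis=-1) @ G.reshape(H * W, -1)  # weighted features
    Yc = (Y @ Wz).reshape(H, W, C)      # restore C channels (1x1 conv)
    return Yc + X                       # residual fusion

rng = np.random.default_rng(0)
H, W, C = 4, 4, 8
X = rng.normal(size=(H, W, C))
Wu, Wv, Wg = (rng.normal(size=(C, C // 2)) for _ in range(3))
Wz = rng.normal(size=(C // 2, C))
Z = non_local_block(X, Wu, Wv, Wg, Wz)
```

Note how the output keeps the input's shape, which is what makes the block insertable after any layer.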
Based on the non-local block, we design our local and global encoder as follows. The local and global encoder is divided into three blocks, i.e., the basic feature extraction network, the non-local block, and the connection block. First, we adopt VGG16 as the basic feature extraction network. VGG16 is a classical CNN that contains five layers. The first, second, and third layers consist of the same components, i.e., one max-pooling, three convolution, and one batch normalization (BN) operations. The fourth layer contains two convolutions, and the fifth layer contains two convolutions and one BN operation. The convolution kernels in VGG16 are all 3 × 3, and the max-pooling operations use 3 × 3 kernels with a stride of 2. Suppose the input image is I ∈ R^(H0×W0×C0), where H0, W0, and C0 represent the height, width, and number of channels of the input image; the output of the VGG16 block is X ∈ R^(H×W×C). This process can be expressed as

X = F_VGG16(I; W_VGG16), (5)

where F_VGG16(·) denotes the mapping function of the VGG16 block and W_VGG16 is its learnable weights. Second, we embed a non-local block on top of VGG16, so the input of the non-local block is X. According to the description above, the output of the non-local block is Z ∈ R^(H×W×C). This process can be expressed as

Z = F_non-local(X; W_non-local), (6)

where F_non-local(·) denotes the mapping function of the non-local block and W_non-local represents its learnable weights. The non-local block captures global information (e.g., a building's context within an aerial image), which can be used to distinguish buildings from other objects; global information is thus beneficial to the building extraction task. Third, we add a connection block on top of the non-local block to fuse the information from VGG16 and the non-local block. It contains a 3 × 3 max-pooling with a stride of 2, followed by repeated 3 × 3 convolution and dropout operations. Here, the convolution fuses information through the weight matrix, while the dropout is used to avoid overfitting. Given the input Z ∈ R^(H×W×C), the output of the connection block is En ∈ R^(H1×W1×C1), where H1, W1, and C1 denote the height, width, and channels of the feature maps. The connection process can be expressed as

En = F_connection(Z; W_connection), (7)

where F_connection(·) denotes the mapping function of the connection block and W_connection represents its learnable weights.

Distilling Decoder
The distilling decoder is developed to obtain the final prediction mask, which has the same size as the input image. There are two branches in the distilling decoder, i.e., a de-convolution branch and a multi-scale branch. The de-convolution branch contains five de-convolutional layers, each constructed from 2× up-sampling, 3 × 3 convolution, and dropout operations. Here, the up-sampling and convolution operations accomplish the de-convolution, while the dropout operation prevents over-fitting. As shown in Figure 1, when En is fed into the de-convolution branch, the outputs of the 1st, 2nd, 3rd, and 4th layers are D1 ∈ R^(H0/16×W0/16×C_D1), D2 ∈ R^(H0/8×W0/8×C_D2), D3 ∈ R^(H0/4×W0/4×C_D3), and D4 ∈ R^(H0/2×W0/2×C_D4), respectively. Furthermore, the output of the de-convolution branch is Dec ∈ R^(H0×W0×Cla), where Cla denotes the number of classes. This process can be formulated as

Dec = F_de-convolution(En; W_de-convolution), (8)

where F_de-convolution(·) denotes the de-convolution branch and W_de-convolution is its weights. The output Dec only contains the fundamental information, which is limited in describing buildings of diverse sizes accurately. Thus, the multi-scale branch is developed to fully explore information at different scales from the input aerial image. Here, convolution and up-sampling operations unify the sizes of the outputs of the different de-convolutional layers. In detail, 8×, 4×, and 2× up-sampling are used for the feature maps D2, D3, and D4, respectively. Meanwhile, a 1 × 1 convolution is applied to reduce their dimensions to the number of classes. In this way, we get the feature maps Mul1 ∈ R^(H0×W0×Cla), Mul2 ∈ R^(H0×W0×Cla), and Mul3 ∈ R^(H0×W0×Cla). Then, we concatenate them with Dec to get Fin ∈ R^(H0×W0×4Cla), which contains both the fundamental and the multi-scale information. This process is shown in the following equation,

Fin = concat(Dec, Mul1, Mul2, Mul3), (9)

where concat(·) concatenates the feature maps along the channel dimension. To reduce the channels of Fin, we transform Fin ∈ R^(H0×W0×4Cla) into out ∈ R^(H0×W0×Cla) through a 1 × 1 convolution. Finally, a mask ∈ R^(H0×W0×1) with the values of the semantic classes is generated from out using the argmax function. In this paper, the mask is a binary prediction that indicates which pixels belong to buildings and which to the background.
There are some points we should further explain. First, in our distilling decoder, the 3 × 3 convolution is used to learn the features from the inputs, while the 1 × 1 convolution is selected to reduce the number of dimensions through adjusting the number of convolutional kernels. Second, since the feature maps of the first de-convolutional layer have little high-level semantic information, we do not take them into account during the multi-scale fusion.
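The fusion performed by the distilling decoder can be sketched as follows, assuming D2-D4 have already been reduced to Cla channels by their 1 × 1 convolutions; the nearest-neighbour up-sampling and the all-ones weight matrix are illustrative stand-ins for the learned operations.

```python
import numpy as np

def upsample(x, factor):
    # Nearest-neighbour up-sampling stand-in for the learned up-sampling.
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def multi_scale_fuse(Dec, D2, D3, D4, W1x1):
    """Sketch of the multi-scale branch: D2/D3/D4 are up-sampled 8/4/2
    times, concatenated with Dec (H0, W0, Cla), and squeezed back to
    Cla channels by a final 1x1 convolution (weight matrix W1x1)."""
    Mul1 = upsample(D2, 8)
    Mul2 = upsample(D3, 4)
    Mul3 = upsample(D4, 2)
    Fin = np.concatenate([Dec, Mul1, Mul2, Mul3], axis=-1)  # 4*Cla ch.
    out = Fin @ W1x1                    # (H0, W0, Cla)
    return out.argmax(axis=-1)          # per-pixel class mask

Cla, H0 = 2, 32
Dec = np.zeros((H0, H0, Cla))
D2, D3, D4 = (np.zeros((H0 // f, H0 // f, Cla)) for f in (8, 4, 2))
mask = multi_scale_fuse(Dec, D2, D3, D4, np.ones((4 * Cla, Cla)))
```

The D1 maps are excluded exactly as in the text, since the first de-convolutional layer carries little high-level semantic information.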

Datasets Introduction
In this paper, two benchmark datasets are chosen to verify our model, i.e., the Inria aerial image labeling dataset [1] and the Massachusetts building dataset [51]. Both have two classes, i.e., building and non-building. The buildings in these datasets are diverse in type, various in volume, and different in distribution. Therefore, these two datasets are suitable for validating the effectiveness of our method.
The Inria dataset contains 360 aerial Red-Green-Blue (RGB) images collected from ten cities, covering areas totaling 810 km². Each city has 36 images, numbered 1 to 36. These images cover different urban settlements, from densely populated areas to sparsely populated forested towns. The size of each aerial image is 5000 × 5000 pixels, with a spatial resolution of 0.3 m. In this dataset, only 180 images (corresponding to five cities, i.e., Austin, Chicago, Kitsap, Western Tyrol, and Vienna) and their ground truth are published. Some examples, including images and their ground truth, are shown in Figure 3. As suggested by the authors who released this dataset [1], 31 images are randomly selected from each released city to construct the training set, and the remaining images are used as the testing set in the following experiments.


Experimental Settings
All of the experiments are carried out on an HP-Z840 workstation with a Xeon(R) E5-2630 CPU, a GeForce GTX 1080 GPU, and 64 GB of RAM. Our GMEDN is trained with the Adam optimizer and the sparse softmax cross-entropy loss function. The parameters of the Adam optimizer are set as follows: β1 = 0.9, β2 = 0.999, and ε = 1 × 10⁻⁸. In addition, we use a schedule decay of 0.004, a dropout rate of 0.5, and a small learning rate of 2 × 10⁻⁴ to complete the training. We initialize the weights of the encoder (the VGG16 part) with weights pre-trained on ImageNet [52]. The other weights of our model are initialized with a Glorot uniform initializer, which draws samples from a uniform distribution. The optimization is stopped at 40 epochs, and an early stopping scheme is adopted to avoid overfitting. Note that, due to the limitations of GPU memory, the original images are cropped into 256 × 256 image patches in this paper. For the Inria dataset, a non-overlapped grid scheme is selected to generate the image patches. For the Massachusetts dataset, we use grids with a stride of 156 to augment the limited training patches.
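The patch generation described above can be sketched as a simple grid crop; the helper name is ours, and the border handling (here, remainders that do not fill a full patch are dropped) is an assumption rather than the paper's exact scheme.

```python
import numpy as np

def crop_patches(img, size=256, stride=256):
    """Grid cropping. stride == size gives the non-overlapped grid
    (Inria); stride = 156 gives overlapping patches, augmenting the
    limited Massachusetts training data."""
    H, W = img.shape[:2]
    return [img[i:i + size, j:j + size]
            for i in range(0, H - size + 1, stride)
            for j in range(0, W - size + 1, stride)]

patches = crop_patches(np.zeros((1500, 1500, 3)), size=256, stride=256)
overlap = crop_patches(np.zeros((1500, 1500, 3)), size=256, stride=156)
```

On this toy 1500 × 1500 input, the overlapping stride yields roughly 2.5× as many patches as the non-overlapped grid, which is the augmentation effect.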
To evaluate the effectiveness of our model numerically, two assessment criteria are selected, i.e., overall accuracy (OA) and intersection over union (IoU). OA is the ratio of the number of correctly predicted pixels to the total number of pixels in the test set, defined as

OA = N(correct_pixels) / N(total_pixels), (10)

where N(correct_pixels) is the number of correctly predicted pixels and N(total_pixels) is the total number of testing pixels. IoU measures the overlap between the prediction and the target label, and is widely used in binary semantic segmentation tasks [6,53]. Here, IoU is defined as

IoU = |A ∩ B| / |A ∪ B|, (11)

where A is the prediction and B is the target label. An IoU of 0 means that A and B do not overlap, and an IoU of 1 means that A and B are identical.
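The two criteria can be computed directly from the predicted and ground-truth masks, as in this short numpy sketch (function names are ours):

```python
import numpy as np

def overall_accuracy(pred, gt):
    # OA: correctly predicted pixels over all pixels.
    return (pred == gt).mean()

def iou(pred, gt):
    # IoU for the positive (building) class: |A ∩ B| / |A ∪ B|.
    inter = np.logical_and(pred == 1, gt == 1).sum()
    union = np.logical_or(pred == 1, gt == 1).sum()
    return inter / union if union else 1.0

pred = np.array([[1, 1], [0, 0]])
gt = np.array([[1, 0], [0, 0]])
# For this toy pair: OA = 0.75, IoU = 0.5
```

Note that OA can look high even when buildings are missed (the background class dominates), which is why IoU is the more discriminative score in Tables 1 and 2.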

Building Extraction Examples
The building extraction results of our GMEDN model are exhibited and studied in this section. For the Inria dataset, five images are selected from the published cities, and the results are exhibited in Figure 5. In each block, the original image, the ground truth map, and the result of our model are placed at the top, middle, and bottom, respectively. From these images, we can see that our method extracts the diverse buildings well. Taking the city of Chicago as an example, although the buildings are densely distributed and diverse in shape, our extraction result is similar to the ground truth map: the locations of the buildings are accurate and their boundaries are smooth. Taking the city of Kitsap as another example, although the buildings are distributed sparsely and irregularly, our model also obtains promising extraction results compared with the ground truth. These encouraging results show that our GMEDN is effective on the Inria dataset. For the Massachusetts dataset, two images are chosen randomly to complete the building extraction, and the results are shown in Figure 6. As mentioned in Section 4.1, the resolution of these two images is 1 m. Thus, compared with the images from the Inria dataset, the buildings within these aerial images are smaller in size but greater in number, which increases the difficulty of extraction. Even so, by observing the extraction maps, we can easily find that the proposed method still achieves good performance; even the buildings under the shadows of obstacles (such as trees and roads) are segmented well. These good visualization results indicate that our GMEDN is effective for the building extraction task.

Comparisons with Different Methods
To further validate our method, we compare our GMEDN with the following four popular methods.

• Fully convolutional network (FCN). Proposed in 2015 [27], this may be the first general-purpose semantic segmentation network. It consists of a common classification CNN (e.g., AlexNet [54] or VGG16 [29]) and several de-convolution layers. The CNN learns high-level features from the images, while the de-convolution layers predict dense labels for the pixels. Compared with traditional methods, its strong performance successfully attracted scholars' attention.
• Deep convolutional segmentation network (SegNet). SegNet was introduced in [55] and is an encoder-decoder model. It first uses VGG16 as the encoder for extracting semantic features. Then, a symmetrical network serves as the decoder, transforming the low-resolution feature maps into full-input-resolution maps and obtaining the segmentation results. Furthermore, a non-linear up-sampling scheme is developed to reduce the difficulty of training and to generate the sparse decoder maps for the final prediction.
• U-Net with pyramid pooling layers (UNetPPL). The UNetPPL model was introduced in [56] for segmenting buildings from high-resolution aerial images. By adding pyramid pooling layers (PPL) to U-Net [48], not only the shapes but also the global context information of the buildings can be explored, which improves the segmentation performance.
• Symmetric fully convolutional network with discrete wavelet transform (FCNDWT). Taking the properties of aerial images into account, the FCNDWT model [49] fuses deep features with textural features to explore objects from both spatial and spectral aspects. Furthermore, by introducing the DWT into the network, FCNDWT is able to leverage frequency information to analyze the images.
Note that, all of the comparisons are accomplished by ourselves, and their experimental settings are equal to ours for the sake of fairness.
The OA and IoU scores of the different methods on the Inria dataset are displayed in Table 1. It is obvious that the proposed GMEDN performs best. Among all of the compared methods, the performance of FCN is the weakest. This is because FCN is a general-purpose semantic segmentation model that does not consider the specific characteristics of aerial images. Compared with FCN, SegNet performs better thanks to its sufficient de-convolution layers and its specific up-sampling scheme. Due to its multi-scale aggregation scheme, UNetPPL outperforms both FCN and SegNet. However, its performance is still not as good as that of FCNDWT, since the UNetPPL model does not take into account the textural features, which are beneficial for separating objects from a complex background. These encouraging results demonstrate that our model is useful for the building extraction task. The IoU and OA scores of the different models on the Massachusetts dataset are summarized in Table 2. As with the Inria dataset, our GMEDN still achieves the best performance among all of the methods. The gains of our GMEDN in IoU/OA are 3.89%/1.12% (FCN), 6.96%/1.22% (SegNet), 4.74%/0.8% (UNetPPL), and 2.43%/0.48% (FCNDWT). Beyond this, a notable observation is that there is a distinct performance gap between GMEDN and the other methods in IoU. The reasons are summarized as follows. First, the details of buildings (e.g., shapes and edges) are grasped by the backbone network. Second, by adding the non-local block, global information is obtained, which helps distinguish the objects from the complex background. Third, the multi-scale scheme helps our GMEDN model capture buildings with diverse sizes. Observing the visual results, the following points can be discovered. First, FCN and SegNet produce smooth boundary predictions, and they perform well on small buildings. Second, UNetPPL extracts more diverse buildings, since the PPL module can learn multi-scale information from aerial images. Third, FCNDWT performs well on building locations and boundaries, as frequency information is fused. Compared with these methods, the maps generated by our GMEDN have clear boundaries and precise locations. These positive visual extraction results prove the effectiveness of our method again.

Ablation Study
As mentioned in Section 3, GMEDN can be divided into four sub-models: the basic encoder-decoder network (i.e., VGG16, the de-convolution branch, and the skip connection), the non-local block, the connection block, and the multi-scale block. To study their contributions to the building extraction task, we design four networks to complete the building extraction, respectively, i.e., • Net 1: the basic encoder-decoder network only; • Net 2: Net 1 plus the non-local block; • Net 3: Net 2 plus the connection block; • Net 4: Net 3 plus the multi-scale block, i.e., the full GMEDN. Note that the experimental settings of the networks in this section are the same as mentioned in Section 4.2.
The results of these networks on the two datasets are shown in Figure 9, where Figure 9a shows the OA scores and Figure 9b displays the IoU scores. The first six groups of bars are the results on the Inria dataset and the last group of bars gives the results on the Massachusetts dataset. From Figure 9, we can find that the performance of the different networks is proportional to the number of sub-models. In detail, the behavior of Net 1 is the weakest among all compared networks since it only consists of the basic encoder-decoder network. After adding the non-local block, the performance of Net 2 is stronger than that of Net 1, which proves the usefulness of the non-local block. Due to the fusion scheme, Net 3 outperforms Net 1 and Net 2, which confirms the contribution of the connection block. Integrating all sub-models, Net 4 achieves the best performance. Furthermore, the performance gap between Net 4 and the other networks is distinct. The results discussed above demonstrate that each sub-model makes a positive contribution to our GMEDN model.

Robustness Study of the Non-Local Block
In this section, the robustness of the non-local block is studied. To this end, we vary the position of the non-local block. Specifically, to study the influence of the non-local block on GMEDN, we construct three models by changing the non-local block's position, i.e., • Model 1: Embedding the non-local block after the third layer of VGG16; • Model 2: Embedding the non-local block after the fourth layer of VGG16; • Model 3: Embedding the non-local block after the fifth layer of VGG16.
To get the segmentation results, the three models are trained with the experimental settings discussed in Section 4.2. The OA and IoU scores on the Inria and Massachusetts datasets are exhibited in Figure 10, in which the first six sets of bars correspond to the Inria dataset while the last set corresponds to the Massachusetts dataset. It is easy to see that the performance differences between the three models are small. Taking the Vienna city within the Inria dataset as an example, the three models' OA scores are 94.43%, 94.32%, and 94.54%, while their IoU scores are 80.47%, 80.02%, and 80.72%. This indicates that our GMEDN is not sensitive to the position of the non-local block.
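The non-local operation studied here can be illustrated with a minimal self-attention computation. The sketch below (plain NumPy, omitting the learned projection matrices that a real non-local block would train) shows the core idea: every spatial position is updated with a similarity-weighted sum over all positions, which is what supplies the global context:

```python
import numpy as np

def nonlocal_block(x, eps=1e-9):
    """Minimal non-local operation on a feature map x of shape
    (H, W, C). Pairwise dot-product similarities are softmax-
    normalized and used to aggregate features from all positions,
    followed by a residual connection. Learned theta/phi/g
    projections are omitted for brevity."""
    h, w, c = x.shape
    flat = x.reshape(h * w, c)                           # N x C, N = H*W
    sim = flat @ flat.T                                  # N x N similarities
    sim = np.exp(sim - sim.max(axis=1, keepdims=True))   # stable softmax
    attn = sim / (sim.sum(axis=1, keepdims=True) + eps)  # rows sum to 1
    out = attn @ flat                                    # global aggregation
    return (out + flat).reshape(h, w, c)                 # residual connection

feat = np.random.rand(8, 8, 16).astype(np.float64)
y = nonlocal_block(feat)
```

Because the output at each position depends on every other position, the block's benefit is largely independent of the exact VGG16 layer it follows, which is consistent with the small performance differences observed above.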

Running Time
Here, we discuss the time costs of our GMEDN from the training and inference aspects. As mentioned in Section 4.2, the input data are the 256 × 256 aerial image patches that are cropped from the original aerial images. The volumes of the training sets are 1805 and 360 patches for the Inria and Massachusetts datasets, respectively. The training times on the two datasets are 5.71 h and 5.23 h, respectively. For inference, the time cost of predicting one patch is 0.26 s. Therefore, we need 98.7 s to segment an aerial image (5000 × 5000) from the Inria dataset, and 9.36 s to complete the building extraction from an aerial image (1500 × 1500) of the Massachusetts dataset.
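As a rough sanity check on these timings, the per-image cost can be estimated from the number of 256 × 256 patches times the 0.26 s per-patch cost. The sketch below assumes simple non-overlapping tiling; the paper does not state its exact tiling or stride, so the estimate is approximate:

```python
import math

def inference_time(img_size, patch=256, t_per_patch=0.26):
    """Estimate inference time for a square image by counting
    non-overlapping patch tiles (assumed tiling; the paper's
    exact scheme is unspecified)."""
    n = math.ceil(img_size / patch) ** 2  # tiles per side, squared
    return n, n * t_per_patch

# Massachusetts: 1500x1500 -> 6x6 = 36 patches, 36 * 0.26 = 9.36 s,
# matching the reported figure.
n_mass, t_mass = inference_time(1500)
# Inria: 5000x5000 -> 20x20 = 400 patches (~104 s under this naive
# tiling; the reported 98.7 s suggests a slightly different tiling).
n_inria, t_inria = inference_time(5000)
```

The Massachusetts figure reproduces exactly under this assumption, while the Inria figure is close, which supports the per-patch cost as the dominant term.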

Conclusions
Considering the characteristics of aerial images, a simple yet useful method (GMEDN) is proposed in this paper for building extraction based on the encoder-decoder framework with a skip connection. The main contributions of this work are summarized as follows.

1. To extract the local and global information from the aerial images for fully describing the buildings with various shapes, a local and global encoder is developed. It consists of a VGG16, a non-local block, and a connection block. VGG16 is used to learn the local information through several convolutional layers, the non-local block aims at learning global information from the similarities of all pixels, and the connection block further integrates the local and global information.

2. To explore the fundamental and multi-scale information from the aerial images for capturing the buildings with different sizes, a distilling decoder is developed. It contains a de-convolution branch and a multi-scale branch. The de-convolution branch focuses on learning the buildings' representations (low- and high-level visual features) under different scales by several de-convolutional layers, and the multi-scale branch aims at fusing them to improve the discrimination of the prediction mask.

3. To reduce the information loss caused by the max-pooling (encoder) and up-sampling (decoder) operations, a simple skip connection is added between the local and global encoder and the distilling decoder.
The introduced GMEDN can accomplish the building extraction in an end-to-end manner, and its superiority is confirmed by the encouraging experimental results on two aerial image datasets. Compared with some existing methods, for the Inria dataset, the minimum enhancements obtained by GMEDN are 0.26% in OA and 1.39% in IoU. For the Massachusetts dataset, the minimum gains achieved by GMEDN are 0.48% in OA and 2.43% in IoU.
Although GMEDN achieves positive results on building extraction tasks, it is not a lightweight network, which limits its practicability in many realistic scenarios. To address this issue, we will focus on developing a lightweight model for the building extraction task in the future.