Multi-Scale Residual Deep Network for Semantic Segmentation of Buildings with Regularizer of Shape Representation

Abstract: Semantic segmentation of buildings from high-resolution remote sensing images is challenging, given the high variability of building appearance and the complicated backgrounds in the images. In this communication, we propose an ensemble multi-scale residual deep learning method with a regularizer of shape representation for semantic segmentation of buildings. Based on the U-Net architecture with residual connections and multi-scale ASPP (atrous spatial pyramid pooling) modules, our method introduces the regularizer of shape representation and ensemble learning of multi-scale models to enhance model training and reduce over-fitting. In our method, the shape representation is coded in an autoencoder that is used to encode and reconstruct the shape characteristics of the buildings. In prediction, we consider multi-scale trained models for different resolution inputs and side effects to obtain an optimal semantic segmentation. With the high-resolution image of Changshan, an island county in China, we used two-thirds of the study region image to train the model and the remaining one-third for the independent test. We obtained an accuracy of 0.98–0.99, mean intersection over union (MIoU) of 0.91–0.93 and Jaccard coefficient of 0.89–0.92 in validation. In the independent test, our method achieved state-of-the-art performance (MIoU: 0.83; Jaccard index: 0.81). Comparisons with existing representative methods on four different data sets showed that the proposed method consistently improved the learning process and generalization. The study demonstrates the important contributions of ensemble learning of multi-scale residual models and the regularizer of shape representation to semantic segmentation of buildings.


Introduction
Extraction of buildings from high-resolution remotely sensed images is an important branch of remote sensing applications. Accurate extraction of urban buildings can provide critical information about the spatial distribution of buildings, which can, in turn, be applied in urban planning, administration and development, and disaster and crisis management [1][2][3]. However, given the high variability of appearance and the complicated backgrounds of buildings in remote sensing images, it is challenging to obtain highly accurate extraction of buildings.
There are two types of methods to extract buildings from remotely sensed images: top-down model-driven methods and bottom-up data-driven methods [4]. Based on multi-dimensional high-resolution remotely sensed data, the former extract the building information as a whole scenario through semantic models and a priori knowledge [5][6][7]. However, the performance of top-down model-driven methods depends considerably on the model's accuracy and a priori knowledge, and they need massive training samples; thus, their applicability is limited. The latter, bottom-up methods mainly consider the appearance of buildings and intrinsic features such as shape, texture, spectrum and auxiliary information (e.g., shadow) to distinguish buildings from other geo-features. For example, morphological building/shadow indices such as the morphological building index (MBI) [8] and the texture-derived built-up presence index (PanTex) [9] have been suggested as two indicators of building presence. These building indicators may be subject to commission and omission errors due to bright soil and roads with spectral characteristics similar to those of buildings [10]. Professional feature extraction methods such as GrabCut, Mean Shift, and Seeded Region Growing were introduced for segmentation of buildings [11][12][13]. The authors of [2] combined Mean Shift and a regional neighborhood graph to segment buildings, with post-processing using MBI, and Aytekin et al. (2012) [14] used Mean Shift to obtain segmentation results of artificial geo-features, with post-processing using principal component analysis and morphological operations.
Furthermore, for semantic segmentation of buildings, as one critical step of building extraction, machine learning methods such as support vector machines (SVM) [15][16][17] and random forest classifiers [18] have been used. However, the performance of SVM depends on manual, professional feature extraction, and traditional machine learning methods are also computationally limited for the massive pixel-level classification required in semantic segmentation of buildings from high-resolution remotely sensed images [19].
As a modern machine learning method, deep learning is increasingly used to achieve state-of-the-art performance in many fields including computer vision, natural language processing and bioinformatics [20]. However, applications of deep learning in remote sensing, including semantic segmentation of buildings, are limited by the difference in spectrum and texture between general images/videos and remotely sensed images, and by the shortage of labels [21,22]. Early approaches used convolutional layers as image-patch feature extractors for pixel-level classification; their heavy computing requirements are a key limitation for wide application to semantic segmentation, similar to those of traditional machine learning methods. In 2015, the fully convolutional network (FCN) was first proposed [23] (in this neural network, the fully connected layer connected to the last convolutional layers is replaced with a convolutional layer). Compared with traditional machine learning and patch-based pixel-level convolutional neural networks (CNN), computing in FCN was considerably and efficiently improved [24]. Based on FCN, many advanced deep learning methods such as the upsampling fully convolutional network (Up-FCN) [23], U-Net [25], SegNet [26], DeepLab Versions 1 and 2 [27,28], RefineNet [29], the global convolutional network (GCN) [30] and DeepLab Version 3+ [31] have been constructed. Although avoiding manual extraction of professional features, these advanced and efficient deep learning methods are based on samples of general optical images or biomedical images that have spectral characteristics quite different from remote sensing images [19], and thus, may not be used directly for semantic segmentation of buildings from remotely sensed images.
Gradually, deep learning methods have been used for the extraction of buildings or their features from remotely sensed high-resolution images. Zuo (2017) [32] enhanced FCN through extraction and fusion of multi-level features. Maggiori et al. (2017) [33] developed a multi-scale FCN using the original OSM (OpenStreetMap) data as labels to pre-train the model and then trained the models using a small set of manually labeled samples. Yang et al. (2018) [34] proposed a convolutional neural network extraction method based on local features to improve retrieval efficiency. Qin [36] developed a gated graph convolutional neural network for building segmentation. These approaches achieved good performance, but they did not explicitly consider the influence of the morphological characteristics of buildings [37], which might result in decreased generalization or over-fitting in practical applications. Although several of these existing methods consider fusion of multi-scale variability within the models, training of the models was constrained by the size of the input images.
In view of the state of semantic segmentation of buildings, this paper proposes a deep learning method that incorporates the regularizer of shape representation and multi-scale ensemble modeling. Based on our previous work, residual connections [19,38], multi-scale modules in the network and side effects [19] were similarly used in this study. Unlike our previous semantic segmentation method [19], we further added shape representation as a regularizer in the model to capture the morphological features of buildings and reduce over-fitting, and conducted ensemble learning of multi-scale models to improve scale invariance. The regularizer of the shape representation model was extracted from the ground truth masks of the buildings. Both the multi-scale modules within a model and ensemble learning of multi-scale models were used to enhance invariance across spatial scales. Transfer learning was used to pre-train the models on third-party data to enhance learning. Using the high-resolution remote sensing image of Changshan, an island county of China, the proposed method was tested and evaluated. In addition, in an extensive evaluation, we compared our method with the baseline U-Net [25] and the residual multi-scale model [19] on four datasets.
In total, this paper makes the following contributions to the literature on semantic segmentation of buildings: (1) we propose, to our knowledge for the first time, an end-to-end residual deep U-Net with a regularizer of shape representation, which can capture morphological features of buildings within the models to reduce over-fitting; (2) we use two complementary forms of multiple scales to improve generalization in prediction: embedding of multi-scale modules through atrous spatial pyramid pooling (ASPP) in the models, and ensemble learning of multi-scale models.
The remainder of this paper is organized as follows. Section 2 describes the proposed method (the network architecture with each component: residual connections, ASPP multi-scale modules, the regularizer of shape representation, and ensemble learning with multi-scale inputs), Section 3 introduces the study region and evaluation method, Section 4 presents and compares the results and discusses their implications, and Section 5 concludes the study.

Deep Residual Segmentation Method with Shape Representation and Multi-Scaling
With embedding of the regularizer of shape representation and consideration of multi-scales and side effects, our model was constructed based on the residual U-Net structure. This section describes the architecture and its components.

U-Net Architecture
Our network was constructed based on the U-Net structure, similar to the encoder-decoder architecture. Deriving from the FCN, U-Net [25] has a U-shaped structure with three parts, i.e., encoding, coding and decoding. The encoding part usually consists of multiple hidden layers with a decreasing number of nodes per layer to extract powerful representation features from the input; the coding layer is used as a compressed informative representation layer; and the decoding part also consists of multiple hidden layers with an increasing number of nodes per layer (each corresponding to an encoding layer) to recover the original input (as an autoencoder) or retrieve the target output (e.g., semantic segmentation). In the U-Net, skip connections are used to retrieve the early information from the corresponding encoding layers to boost the training process [39]. U-Net provides a starting point for later advanced semantic segmentation network structures [24].
Based on the U-Net structure, our network architecture (Figure 1) was enhanced with residual connections, multi-scale context modules, and the regularizer of the shape representation model. Compared with the traditional U-Net, in order to reduce model complexity and improve model learning, our architecture introduces short residual connections [40] within each encoding or decoding layer, and long residual connections from encoding layers to decoding layers implemented through tensor addition as skip connections (see Section 2.2 for details).
In addition to the short and long residual connections, we also embedded an ASPP module between the input layer and each encoding layer to capture multi-scale context information (Figure 1a), together with the shape regularizer (Figure 1c). ASPP (Section 2.3) uses multiple atrous (dilated) convolutions at different atrous rates to extract feature representations in a multi-scale context. In our previous study [19], the ASPP module was shown to capture context information well in semantic segmentation of land use from remote sensing imagery. The shape regularizer (Sections 2.4 and 2.5) was pre-trained to capture the shape characteristics of buildings and incorporated into the models through the total loss function (Figure 1b).

Residual Learning
Residual learning employs skip connections, or shortcuts, to jump over some hidden layers and reuse activations from a previous layer until the adjacent layer learns its weights [40,41]. Residual learning can effectively reduce or avoid the problem of vanishing gradients and improve learning efficiency [42]. For a convolutional neural network, a typical residual unit usually consists of two or more convolutional layers with skips, containing nonlinearities (ReLU) and batch normalization in between (Figure 2). In our previous study, residual connections were extended between the encoding layers and the corresponding decoding layers, which considerably improved the performance of the encoder-decoder-based deep neural network [38].
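The residual computation described above can be sketched without a deep learning framework. In this minimal NumPy sketch, a simple linear map stands in for a convolution so the identity shortcut y = F(x) + x is explicit; the shapes and weights are illustrative, not the paper's actual configuration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_unit(x, w1, w2):
    """A toy residual unit: two weight layers with ReLU in between,
    and an identity shortcut added to the output (y = relu(F(x) + x))."""
    h = relu(x @ w1)        # first "convolution" stands in as a linear map
    f = h @ w2              # second weight layer (no activation yet)
    return relu(f + x)      # identity skip: gradients can bypass F entirely

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))           # 4 samples, 8 features
w1 = rng.standard_normal((8, 8)) * 0.1
w2 = rng.standard_normal((8, 8)) * 0.1
y = residual_unit(x, w1, w2)

# With zero weights the unit reduces to the identity (up to the ReLU),
# which is one intuition for why residual layers are easy to optimize:
y_id = residual_unit(x, np.zeros((8, 8)), np.zeros((8, 8)))
```

In the paper's architecture the weight layers are convolutions with batch normalization, but the skip-and-add structure is the same.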
In the proposed architecture (Figure 1), residual connections were used in two respects: traditional residual units were used within each encoding or decoding layer, and the extended residual connection was used between each encoding layer and its corresponding decoding layer. A similar network structure was used in our previous study, where it showed optimal performance [19].
Although our architecture has a U-shaped structure like U-Net, it differs from U-Net. The critical difference lies in the skip connections, which are implemented by a concatenation tensor operation in U-Net but by a residual tensor operation (matrix addition) in our architecture. Thus, residual learning is implemented in our network with fewer parameters than in the regular U-Net.
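The parameter saving from addition-based skips can be illustrated with tensor shapes. This is a sketch with illustrative sizes: concatenation doubles the channel count that the next layer must consume, while addition leaves it unchanged.

```python
import numpy as np

# Feature maps from an encoding layer and its decoding counterpart,
# in (batch, height, width, channels) layout; sizes here are illustrative.
enc = np.ones((1, 32, 32, 64))
dec = np.ones((1, 32, 32, 64))

# U-Net style skip: concatenation along the channel axis
cat_skip = np.concatenate([enc, dec], axis=-1)   # channels double to 128

# Residual-style skip used in this architecture: element-wise addition
add_skip = enc + dec                             # channels stay at 64

# A following 3x3 convolution with 64 output filters needs
# 3*3*in_channels*64 weights, so the concatenation path costs twice as much:
params_cat = 3 * 3 * cat_skip.shape[-1] * 64
params_add = 3 * 3 * add_skip.shape[-1] * 64
```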

ASPP
ASPP was used to capture multi-scale context information to improve semantic segmentation in DeepLab Version 3+ [31]. ASPP was embedded in our network to probe convolutional feature layers with filters at multiple sampling rates, thereby capturing image context at multiple scales [43].
In our model, we set up multiple ASPP modules (Figure 1a) for the encoding and coding layers (Figure 3): one ASPP module (Figure 3a) for each encoding or coding layer. For example, in the example architecture of Figure 1, there are five encoding layers and one coding layer, so there are six ASPP modules in total. For each ASPP module, we used four different atrous rates (r = [4, 8, 16, 32]) to capture objects of different sizes. Depending on the complexity of the segmentation target, more atrous rates can be used in an ASPP module to capture multi-scale context information.
In each ASPP module, the dilated convolutional layers filtered at different atrous rates were first concatenated along the channel dimension into a matrix tensor (Figure 3b), which was then connected with ReLU activation and batch normalization (BN) layers. To embed the ASPP modules in our model, the output of each merged ASPP module was concatenated with the corresponding encoding layer along the channel dimension into a matrix tensor as the input for the next encoding or coding layer (Figure 3c,d). We also developed a custom resizing layer (Figure 3c) to alter the shape of the merged ASPP multi-scale output to match the output shape of the corresponding encoding or coding layer to be connected. Therefore, our method implemented multi-scale ASPP modules in the network, similar to [19].
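The atrous mechanism and the channel-wise merge can be sketched in one dimension. This is a simplified NumPy illustration, not the paper's 2D Keras implementation: a dilated filter samples the input at spacing r, and the four branch outputs are stacked along a channel axis as in Figure 3b.

```python
import numpy as np

def atrous_conv1d(x, kernel, rate):
    """1D dilated ('atrous') convolution with zero padding and stride 1.
    Equivalent to inserting (rate - 1) zeros between kernel taps, so the
    receptive field grows without adding weights."""
    k = len(kernel)
    pad = rate * (k // 2)
    xp = np.pad(x, pad)
    return np.array([
        sum(kernel[j] * xp[i + j * rate] for j in range(k))
        for i in range(len(x))
    ])

x = np.arange(16, dtype=float)
kernel = np.array([1.0, 1.0, 1.0])  # illustrative 3-tap filter

# One branch per atrous rate, as in the paper's ASPP module
branches = [atrous_conv1d(x, kernel, r) for r in (4, 8, 16, 32)]

# Stack the branches along a new channel dimension, analogous to the
# channel-wise concatenation of Figure 3b
merged = np.stack(branches, axis=-1)   # shape (16, 4)
```

At rate r = 4, each output position sums input samples 4 apart (e.g., position 8 combines x[4], x[8] and x[12]), so larger rates see proportionally larger context with the same three weights.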

Regularizer of the Shape Representation Autoencoder
To encode the morphological features of buildings, we developed a shape representation autoencoder (Figure 4). The autoencoder takes the ground truth masks of the training samples as both input and output to learn the shape representation. We used an autoencoder structure similar to the U-Net in Figure 1, but no residual units or ASPP modules were used, and the input and output were the same. Residual connections between each encoding layer and its corresponding decoding counterpart (but no residual units) were used to optimize the learning process. The middle layer of latent representation was used to encode the shape characteristics of buildings, and the input image (the building mask image) was then reconstructed in the decoder from the latent shape representation layer.
The shape representation autoencoder was pre-trained using the mask labels of the training samples, and the trained shape representation was then embedded into the loss function of the semantic segmentation network as a regularizer. For binary classification, one channel of integer labels can be used as the mask labels (1 represents building and 0 represents background); for multi-classification, K channels (K: the number of classes) of one-hot encoding [44] can be used as the mask labels. The mask labels from the data samples were used as the input and output of the shape representation autoencoder. However, pre-training is not limited to the data samples of the study area. If mask labels from other sources are available, we can also use them to re-train the shape autoencoder to enhance generalization in the extraction of the shape representation of buildings.
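The mask-label formats described above can be produced as follows; the class count and label values in this sketch are illustrative.

```python
import numpy as np

def masks_to_onehot(labels, num_classes):
    """Convert an integer label mask of shape (H, W) into K one-hot channels
    of shape (H, W, K) -- the input/output format of the shape representation
    autoencoder for multi-class segmentation."""
    eye = np.eye(num_classes, dtype=np.float32)
    return eye[labels]

# Binary case: a single channel where 1 = building, 0 = background
binary_mask = np.array([[0, 1], [1, 0]])

# Multi-class case, e.g. K = 3 classes (label values here are illustrative)
multi_mask = np.array([[0, 2], [1, 0]])
onehot = masks_to_onehot(multi_mask, num_classes=3)   # shape (2, 2, 3)
```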


Loss Function, Multi-Scale and Boundary Effects
The total loss function (Figure 5) consists of three parts, i.e., semantic segmentation, shape and reconstruction:

ℓ_t(Y, Y′) = ℓ_seg(Y, Y′) + λ1 · ℓ_shp(E(Y), E(Y′)) + λ2 · ℓ_rec(Y, D(E(Y′)))

where Y denotes the ground truth mask matrix, Y′ denotes the predicted probability matrix, ℓ_t represents the total loss, ℓ_seg is the primary loss of semantic segmentation, ℓ_shp is the loss of shape representation of Y, ℓ_rec is the loss of reconstruction, and λ1 and λ2 are the weights for ℓ_shp and ℓ_rec, respectively; ℓ_shp and ℓ_rec are used as regularizers in the total loss function. E(·) and D(·) represent the encoder and decoder parts of the shape representation model, respectively (Figure 4).

For the loss of semantic segmentation, we used the summation of binary cross-entropy and the normalized Jaccard index, which proved reliable [19]. For the reconstruction loss of the shape representation model, we used a similar combined function (binary cross-entropy + normalized Jaccard index); for the loss of latent shape representation, we used the mean squared error (MSE). As hyper-parameters, an optimal solution for λ1 and λ2 was retrieved using grid search [45].
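The combined loss can be sketched in NumPy as follows. This is a minimal illustration of the structure described above, not the paper's implementation: `encode`/`decode` are stand-ins for the pre-trained shape autoencoder (replaced here by an identity placeholder), and the λ values follow the paper's reported optimum.

```python
import numpy as np

def bce(y, p, eps=1e-7):
    """Binary cross-entropy averaged over pixels."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def jaccard_loss(y, p, eps=1e-7):
    """1 - soft Jaccard index, computed on predicted probabilities."""
    inter = np.sum(y * p)
    union = np.sum(y) + np.sum(p) - inter
    return float(1.0 - inter / (union + eps))

def seg_loss(y, p):
    """Segmentation loss: binary cross-entropy + normalized Jaccard loss."""
    return bce(y, p) + jaccard_loss(y, p)

def total_loss(y, p, encode, decode, lam1=0.01, lam2=0.001):
    """l_t = l_seg + lam1 * l_shp + lam2 * l_rec: the shape loss is the MSE
    between latent codes E(Y) and E(Y'), and the reconstruction loss compares
    Y with D(E(Y')). encode/decode stand in for the pre-trained shape
    autoencoder, which this sketch does not implement."""
    l_shp = float(np.mean((encode(y) - encode(p)) ** 2))
    l_rec = seg_loss(y, decode(encode(p)))
    return seg_loss(y, p) + lam1 * l_shp + lam2 * l_rec

y = np.array([[1.0, 0.0], [1.0, 0.0]])     # toy ground-truth mask
p = np.array([[0.9, 0.1], [0.8, 0.2]])     # toy predicted probabilities
identity = lambda m: m                     # placeholder autoencoder, demo only
loss = total_loss(y, p, encode=identity, decode=identity)
```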
The introduction of multiple scales generally improves semantic segmentation in practical applications [46]. Our previous study [19] embedded multi-scale modules connected to the input layer to improve semantic segmentation. However, such multi-scale modules are limited by the input sample size, and the limitations of GPU memory prevented us from using large inputs. Thus, in addition to the multi-scale ASPP, we used an ensemble of multi-scale base models to further improve the effectiveness of semantic segmentation. For a specific study area, a sensitivity analysis was conducted to find an optimal number of multi-scale models. We thus trained a number of residual deep networks with embedded ASPP and the regularizer of the shape representation model, and then obtained the averages of the predicted probabilities from the models (three in our case) as the final predictions.
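The ensemble step itself reduces to averaging probability maps. In this sketch the per-model outputs are mocked with random values and assumed to be already resampled to a common output grid.

```python
import numpy as np

# Mock probability maps from three models trained at different input scales,
# each already resampled back to the same output grid (values illustrative).
rng = np.random.default_rng(1)
probs_per_model = [rng.uniform(size=(64, 64)) for _ in range(3)]

# Ensemble prediction: average the predicted probabilities, then threshold
mean_prob = np.mean(probs_per_model, axis=0)
building_mask = (mean_prob >= 0.5).astype(np.uint8)
```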
To make predictions, we needed to crop small patches of the same size as the input samples from a large new image, and then merge the prediction masks of the small patches to obtain the label prediction for the entire image. This patching strategy usually leads to square artifacts at the edges of the resulting images, and the prediction quality decreases when moving away from the patch center [47]. Thus, we filtered out the local square boundary effects of the predictions by removing the boundaries at a distance of 16 pixels from each side (Figure 5). The same distance was used in [47] to remove the square artifacts. A sensitivity analysis showed that this distance was appropriate for our study region, where the buildings are small and their morphological complexity is low.
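The overlapped tiling with boundary removal can be sketched as follows. This is a simplified illustration: the 16-pixel margin follows the paper, while the patch size and the identity "model" are placeholders chosen so the sketch is self-checking.

```python
import numpy as np

def predict_full_image(image, predict_patch, patch=64, margin=16):
    """Tile a large image into overlapping patches, run predict_patch on each,
    and keep only the interior of every patch (dropping a `margin`-pixel border
    on each side) to suppress square boundary artifacts."""
    h, w = image.shape[:2]
    # Reflect-pad so every pixel is covered by some patch interior
    padded = np.pad(image, margin, mode="reflect")
    out = np.zeros((h, w), dtype=float)
    step = patch - 2 * margin   # stride so that patch interiors tile the image
    for top in range(0, h, step):
        for left in range(0, w, step):
            tile = padded[top:top + patch, left:left + patch]
            th, tw = tile.shape
            pred = predict_patch(tile)
            # Keep only the central region of the prediction
            core = pred[margin:th - margin, margin:tw - margin]
            out[top:top + core.shape[0], left:left + core.shape[1]] = core
    return out

# Demonstration with an identity "model", so the stitched result must equal
# the input image exactly if the tiling bookkeeping is correct
image = np.arange(128 * 128, dtype=float).reshape(128, 128)
stitched = predict_full_image(image, predict_patch=lambda t: t)
```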
In terms of implementation, we developed the proposed method and conducted the tests using Keras (Version 2.2.2) with TensorFlow (Version 1.9.0) as the backend.

Study Region
The study region is Dachangshan Island, located southeast of the Liaodong Peninsula and in the north of the Changshan Islands. It is the seat of the Changhai County Government in Liaoning Province, China, with a land area of 31.79 square kilometers, a coastline of 94.4 kilometers, and a sea area of 651.5 square kilometers. We obtained an RGB image covering the whole island (Figure 6) in April 2019 from Google Earth (https://www.google.com/earth) with a spatial resolution of approximately 1 × 1 m² (total pixel size: 49920 × 131840 = 6,581,452,800). Through manual interpretation, we obtained the ground truth masks of the buildings in this study region.
To improve the generalization of the trained model, we used a third-party dataset (20 images) with similar bands to pre-train the models and obtain the initial parameters before training. We used a set of very-high-resolution (0.61 × 0.61 m²) images of the city of Zurich (Switzerland) (https://sites.google.com/site/michelevolpiresearch/data/zurich--dataset) acquired by the QuickBird satellite in 2002. The original images were subset to match our base-scale model, with the RGB channels used (the near-infrared band was not used). With these data, we pre-trained the models of semantic segmentation to obtain the initial coefficients (weights and biases). Then, based on the pre-trained parameters, we trained the segmentation models, thereby shortening the training time.
Figure 6. The study region of Dachangshan Island of China with an RGB image.

Evaluation
The left third of the image of the study region was used for the independent test, and the remaining two-thirds of the image was randomly sampled around the label features to train (60% of the samples), validate (20%) and test (20%) the models. The independent test samples were not used in training and validation, and were used only to evaluate the true generalization of the model after training was finished.
For training the models, we used Adam with Nesterov momentum as the optimizer. We used an early stopping criterion to reduce over-fitting during training. A sensitivity analysis was also conducted to examine the effects of residual learning, the regularizer of the shape representation model, and multiple scales.
To measure the performance of the trained models, we used three metrics: pixel accuracy (PA, defined as the ratio of the number of correctly classified pixels to the total number of pixels), the Jaccard index (JI, defined as the size of the intersection of two sets divided by the size of their union), and mean intersection over union (MIoU, defined as the mean of the JI over the classes). For model comparison, in addition to MIoU and PA, we also report three other metrics: recall (the fraction of relevant instances that were actually retrieved), precision (the fraction of relevant instances among the retrieved instances) and the F-measure (a measure of test accuracy computed as the harmonic mean of precision and recall).
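The pixel-level metrics above can be computed directly from label masks; here is a minimal NumPy sketch with a toy two-class example.

```python
import numpy as np

def pixel_accuracy(y_true, y_pred):
    """PA: correctly classified pixels over all pixels."""
    return float(np.mean(y_true == y_pred))

def iou(y_true, y_pred, cls):
    """Jaccard index for one class: |intersection| / |union|."""
    t, p = (y_true == cls), (y_pred == cls)
    union = np.logical_or(t, p).sum()
    return float(np.logical_and(t, p).sum() / union) if union else 1.0

def mean_iou(y_true, y_pred, classes):
    """MIoU: mean of the per-class Jaccard indices."""
    return float(np.mean([iou(y_true, y_pred, c) for c in classes]))

# Toy 2-class example (1 = building, 0 = background)
y_true = np.array([[1, 1], [0, 0]])
y_pred = np.array([[1, 0], [0, 0]])

pa = pixel_accuracy(y_true, y_pred)              # 3 of 4 pixels correct
miou = mean_iou(y_true, y_pred, classes=[0, 1])  # mean of per-class IoU
```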
To validate the generalization of our method, we conducted an extensive evaluation on three additional publicly accessible datasets, comparing our method with the baseline U-Net [25], DeepLab V3+ [31], GCN [30] and the residual multi-scale model [19]: (1) The Kaggle dataset from the Defence Science and Technology Laboratory (DSTL) Satellite Imagery Feature Detection challenge in 2017 [48]. The dataset has high-resolution panchromatic images with a 31 cm resolution, 8-band (M-band) images with a 1.24 m resolution, and shortwave infrared (A-band) images with a 7.5 m resolution (all from the WorldView-3 satellite). Panchromatic sharpening [49] was performed to fuse the high-resolution panchromatic images and the lower-resolution images to obtain 25 images with a 31 cm resolution. We extracted binary building labels from the six class labels (buildings, crops, roads, trees, vehicles and background) for semantic segmentation of buildings. (2) The dataset of 20 multispectral ultra-high-resolution images collected by the QuickBird satellite in Zurich, Switzerland in 2002 [50]. The spatial resolution of the pan-sharpened images was 0.61 m, with 4 channels spanning the near-infrared to visible spectrum (NIR-R-G-B). We extracted binary building labels from the nine class labels (road, trees, bare soil, rail, buildings, grass, water, pools and background) for semantic segmentation of buildings. (3) The DroneDeploy Segmentation dataset [51], consisting of aerial scenes captured from drones with a ground resolution of 10 cm in 2019. The images are RGB TIFFs and the labels are PNGs with 7 colors representing 7 classes (building, clutter, vegetation, water, ground, car and background). In total, we had 36 small images (256 × 256 pixels) for training, 130 small images for validation, and 130 small images for testing.
Due to the small number of training samples with building labels, we trained the models to perform image segmentation on all the classes (including buildings) to evaluate our model.

Results and Discussion
In total, we obtained approximately 0.25% of the total pixels as the label masks of buildings. Using grid search, we obtained an optimal solution for the set of hyper-parameters: an initial learning rate of 0.001 with adaptive adjustment during learning, a mini-batch of 12 images, λ1 = 0.01, λ2 = 0.001, and 80 training epochs. A sensitivity analysis showed high learning efficiency (but no consistent change in test performance) for the models pre-trained by transfer learning on the Zurich dataset, compared with random initialization of the parameters.
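The grid search over the two regularizer weights can be sketched as follows. The scoring function here is a hypothetical stand-in: in practice each (λ1, λ2) pair requires training and validating a full segmentation model.

```python
import itertools

def grid_search(lam1_grid, lam2_grid, validate):
    """Exhaustive grid search over the two regularizer weights; `validate`
    stands in for training a model and returning its validation score
    (e.g., MIoU), which is maximized."""
    return max(
        itertools.product(lam1_grid, lam2_grid),
        key=lambda pair: validate(*pair),
    )

# Hypothetical scoring function for demonstration only: peaks at the
# paper's reported optimum (0.01, 0.001)
def mock_validate(lam1, lam2):
    return -((lam1 - 0.01) ** 2 + (lam2 - 0.001) ** 2)

best_lam1, best_lam2 = grid_search(
    lam1_grid=[0.001, 0.01, 0.1],
    lam2_grid=[0.0001, 0.001, 0.01],
    validate=mock_validate,
)
```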
In our study region of Dachangshan, the buildings were small in size and less complex in shape and characteristics. Sensitivity analysis showed that ensemble learning of three different scale models (256 m, 512 m and 1024 m) was sufficient to obtain an optimal solution. The results at the three scales (resolutions of 4 m, 2 m and 1 m) are reported in Table 1. The results of the models with and without the regularizer of shape representation (Figure 7 for their learning curves of loss and MIoU) at the three scales consistently showed better performance (lower loss and higher MIoU) for the models with the shape regularizer. Sensitivity analysis showed a 1–3% improvement in MIoU in the independent test for the models with the shape regularizer. Tong et al. (2018) [52] used a shape representation model to constrain an FCN to improve multi-organ segmentation for head and neck cancers. Our results showed that a similar method consistently improved semantic segmentation of buildings using remotely sensed data. The regularizer of shape representation helped reduce over-fitting of the trained models and improve their generalization in semantic segmentation of buildings. Although morphological features were extracted manually and used in early studies [8][9][10] to distinguish buildings from other geo-features, such features are missing in many existing deep learning studies of semantic segmentation of buildings. As demonstrated in this study, we used the autoencoder to extract the morphological feature as the shape representation, which was embedded as a regularizer within the loss function to reduce noisy output and over-fitting. As far as we know, this is one of the first studies to fuse the shape representation as a regularizer within the trained models to improve semantic segmentation of buildings.
As shown in our previous study [19], residual connections used to replace the matrix concatenation in U-Net reduced the number of parameters and, thus, over-fitting. Our sensitivity analysis showed a 1–3% improvement in validation and testing from the residual connections.
The scale of the input images has an important effect on semantic segmentation, given the different context at different scales (resolutions). To capture multi-scale contextual information, many studies embedded multi-scale modules within the models. For example, we embedded two multi-scale modules (resizing and ASPP) to capture local and long-term contextual information [19]; Zhang et al. [53] used a high-resolution network to aggregate multi-scale context in segmentation of remote sensing images. Although a multi-scale module within the network can help the model capture local and long-term or global contextual information [19,30,43,53,54], the scale of the input images may constrain the trained model from a wider or more global context. The results (Figure 8) showed that contextual information was captured at three different scales for the input size of 256 × 256: at the scale of 4 m resolution (Figure 8a,b), the input had a wider context than the other two scales; at the scale of 1 m resolution (Figure 8e,f), the input had more local details than the other two scales but less long-term or global contextual information. The trained model of each scale had different generalization in the independent test, as shown in Table 1. The model at the 1 m resolution scale had the lowest Jaccard index (0.72 vs. 0.86–0.88), illustrating the importance of a long-term or global context for generalization of the trained model. However, due to the memory limitation of the GPU used to train the model, there is a threshold for the size of the input image (in our case: 256). Thus, we resampled the large images to obtain samples at the target resolutions using nearest neighbor interpolation. With a fixed input size, we used this resampling technique to obtain different context at three resolutions (4 m, 2 m and 1 m).
Furthermore, multi-scale (resolution) ensemble models have been used to improve generalization of the trained models. For instance, Lee (2017) [55] used the strategy of over- and under-sampling to obtain multi-scale data samples to train multiple models. The final predictions were obtained through merging the outputs of the multi-scale ensemble models.
This strategy can reduce the limitation of the input sample size for a single multi-scale model, and avoid the high GPU memory requirements of large multi-scale networks, although it requires more time to train multiple models.
In our method, the ASPP modules were embedded within the network, making it an end-to-end integrated deep network with dynamic extraction of multi-context information based on the input. Although input samples of a sufficient size (e.g., 1024 × 1024) might be expected to train a robust multi-scale model, it is actually difficult to train such a large model effectively due to the limitations of GPU memory. Generally, to meet the GPU memory requirements, we may need to crop the input image or reduce its size through scaling so that the model can be trained. The use of smaller input samples by clipping in training may result in the loss of extensive context information; the use of reduced-size input samples by scaling in training may result in the lack of local details. Thus, ensemble learning of multi-scale models is a compromise between a large model with a sufficient input sample size and the limitations of the available GPU memory. For this study, in addition to embedding the multi-scale ASPP module within the model, we used the strategy of multi-scale ensemble models. In our method, the final predicted probability of the building label was obtained by averaging the probability outputs (weighted by the Jaccard index in the independent test) of the three scale models (spatial resolutions: 4 m, 2 m and 1 m). From the final prediction probability, we used a threshold of 0.5 to extract the buildings. For the ensemble predictions at the original resolution (1 m), we obtained a Jaccard index of 0.81 and MIoU of 0.83, a 1–2% improvement in MIoU compared with the results of the base models in the independent test. Regarding the choice of model scales, for a larger study region with varying building sizes and more complex shape characteristics than ours, more local and global scale models may be needed to obtain an optimal solution.
By comparing with the baseline U-Net, DeepLab V3+, GCN, and the residual multi-scale model, our method was extensively evaluated in the independent tests (Table 2). Compared with the baseline U-Net and DeepLab V3+, the residual multi-scale model performed better, indicating the contribution of residual learning and multi-scale modules in the network, which was proven in an extensive comparison in our previous work [19]. As the optimal model, compared with the residual multi-scale model, our method consistently achieved an additional improvement in the tests on the four datasets, showing the significant contribution of the regularizer of shape representation and ensemble learning of multi-scale models.
Overall, the performance of GCN and DeepLab V3+ was similar to or better than U-Net, but similar to or worse than our method. As aforementioned, DeepLab V3+ and GCN were mainly developed for general image or video data, and their direct application to segmentation of buildings using remote sensing data is restricted, as shown in our test. However, some advanced techniques in DeepLab V3+ and GCN may be adapted and applied in our architecture. For example, we introduced the multi-scale ASPP modules from DeepLab to enhance the extraction of multi-context information in our architecture (Figure 1). As a potential improvement to capture global context information, the global convolution and boundary refinement of GCN may be incorporated in our future architecture. The results (Figure 9 for the upper left part of the test area; Figure 10 for the upper right part; Figure 11 for the lower left part) showed that the predicted output well matched the ground truth masks of the buildings in the independent test. The majority of the ground truth masks (>80%) were basically covered by the ensemble predicted masks, illustrating the reliability of the proposed method. The ensemble results also showed fewer noise segmentations for the model with the shape regularizer, compared with the model without it. Compared with the predictions of a model at a single scale, the ensemble predictions have the advantage of integrating the outputs from multiple base models at different scales, thus better capturing local and long-term contextual information.
There are two limitations to this study. One is the limited number of scales in ensemble multi-scale learning. We used just three resolutions (4 m, 2 m and 1 m) to train the base models at three scales. However, our method can be conveniently generalized to more scales, such as those between or beyond the three used here, to capture more local details and wider contextual information and so enhance generalization in practical predictions. The other limitation is the lack of post processing for the predicted masks. Post processing techniques such as conditional random fields (CRF) can be used to remove noise masks and obtain integral results [56], but this is beyond the scope of this paper.
In the future, one important direction of the study will be the development of an integral end-to-end method fusing multiple base multi-scale models, regularizer of shape representation and post processing for semantic segmentation of the buildings.

Conclusions
Considering the high variability of building appearance and complex backgrounds, accurate semantic segmentation of buildings is challenging. Many deep learning methods have been developed for general or biomedical images or videos; given the differences in spectral and morphological characteristics between remotely sensed data and general images, these methods have seen only limited application in semantic segmentation of buildings. In this paper, we presented a residual deep learning method incorporating multi-scale modules, ensemble learning of multi-scale models, and the regularizer of shape representation for semantic segmentation of buildings. Based on an encoder-decoder architecture similar to U-Net, we used residual connections to boost the learning efficiency of deep networks, and an autoencoder to encode the shape representation of the buildings as a regularizer in the model, capturing the shape characteristics of the buildings to improve generalization of the trained models. To capture local and long-term or global contextual information in semantic segmentation, in addition to embedding the multi-scale ASPP modules within the model, we applied ensemble learning of multi-scale base models to reduce the limitation on the size of the input samples. Compared with the predictions of a trained model at a single scale, the ensemble predictions improved generalization (higher MIoU). Compared with the existing representative methods (the baseline U-Net, DeepLab V3+, GCN and the residual multi-scale model), our method achieved state-of-the-art performance in the independent tests of this study region and of three additional publicly accessible datasets. The study showed the important contributions of multi-scale residual models, ensemble learning, and the regularizer of shape representation to semantic segmentation of buildings.
Although only a limited number of multi-scale models (three scale models) was used in our research case, our flexible modeling architecture can be easily expanded by adding more scale models according to the size and morphological complexity of the buildings.
From the perspective of future model development, we consider merging global convolution and boundary refinement into the network architecture to capture global contextual information, as well as integrating multiple base multi-scale models, the regularizer of shape representation and post processing into a systematic end-to-end method to improve the efficiency of learning and prediction for segmentation of buildings.
Author Contributions: C.W. was responsible for conceptualization, methodology, data, literature of building extraction and financial support. L.L. was responsible for conceptualization, methodology, literature of deep learning, software, validation, formal analysis and writing. All authors have read and agreed to the published version of the manuscript.