A Building Segmentation Network Based on Improved Spatial Pyramid in Remote Sensing Images

: Building segmentation is widely used in urban planning, disaster prevention, human flow monitoring, and environmental monitoring. However, due to complex landscapes and high-density settlements, automatically delineating buildings in urban villages or cities from remote sensing images is very challenging. Inspired by recent deep learning methods, this paper proposes a novel end-to-end building segmentation network for segmenting buildings from remote sensing images. The network includes two branches: one uses a Widely Adapted Spatial Pyramid (WASP) structure to extract multi-scale features, and the other uses a deep residual network combined with a sub-pixel up-sampling structure to enhance the detail of building boundaries. We compared our proposed method with three state-of-the-art networks: DeepLabv3+, ENet, and ESPNet. Experiments were performed using the publicly available Inria Aerial Image Labeling dataset (Inria aerial dataset) and the Satellite dataset II (East Asia). The results showed that our method outperformed the other networks in the experiments, with Pixel Accuracy reaching 0.8421 and 0.8738, respectively, and mIoU reaching 0.9034 and 0.8936, respectively. Compared with the basic network, this is an increase of about 25% or more. The method can extract not only building footprints but also especially small building objects.


Introduction
Building segmentation from remote sensing images is the process of delineating roof or footprint boundaries by grouping adjacent pixels with similar feature values (such as brightness, edge, texture, and color). Because remote sensing images capture fine details in many areas, and very fine details under favorable conditions, they greatly facilitate the classification and extraction of area-related features, such as roads and buildings. They have also been widely used in many applications, including urban planning [1], urban environmental modeling [2], disaster management [3], land-use change monitoring [4], and digital city evolution [5].
The application of such methods is quite challenging for three key reasons. First, objects exist at multiple scales: in a remote sensing image, a pixel may belong to an inconspicuous small object, a salient large object, or background "stuff". Inconspicuous objects and stuff may be dominated by salient objects and, as such, their information is weakened or even disregarded, resulting in an uneven categorical distribution. Second, remote sensing images are generally obtained by satellites, airplanes, drones, and so on; they therefore provide only a top-down view, which does not contain all of the important characteristics of objects. The missing characteristics are usually visible only in a ground-based or panoramic view of the object. Third, adjacent buildings belonging to the same category in remote sensing images may have huge differences in appearance (such as color and shape). In recent years, there has been more and more research in this field, and some remarkable achievements have been made.
Conventional remote sensing image segmentation methods usually rely on handcrafted features to extract spatial and texture information of the image, and consider a building as a combination of low-level features, such as spectral information [6], tone [7], texture [8], and geometric shape [9,10]. Many classic, powerful feature extractors can also achieve high efficiency, such as support vector machines (SVMs) [11], multinomial logistic regression [12,13], boosted decision trees (DTs) [14], neural networks [15][16][17], Scale-Invariant Feature Transform (SIFT) [18], Histogram of Oriented Gradient (HOG) [19], and Speeded Up Robust Features (SURF) [20]. These methods, however, place high demands on the researchers' hand-crafted feature design, and cannot represent the high-level semantic features of the image well.
A deep Convolutional Neural Network (CNN) [21] predicts using only features learned from training data, instead of hand-engineered features. Essential to their success is the built-in invariance of deep CNNs to local image transformations, which allows them to learn increasingly abstract feature representations. The challenge of CNN-based semantic segmentation comes from the inner content, shape, and scale variations of the same objects, as well as the easily confused and fine boundaries among different objects. First, the convolution layer has both shift- and spatial-invariant characteristics. While invariance is clearly desirable for high-level vision tasks, it may hamper low-level tasks, such as pose estimation and semantic segmentation, in which precise localization is required, rather than the abstraction of spatial details. Second, CNNs are able to capture informative representations with global receptive fields by stacking convolutional layers and down-sampling; however, the use of down-sampling to compress data is irreversible, resulting in information loss, translation invariance, and overly smooth results. This also leads to inaccurate object boundary acquisition. Third, the multiple scales of objects are also a challenge for CNNs. The features extracted by CNNs usually have a limited receptive field, where the feature mainly describes the core region and largely ignores the context around the boundary, no matter whether the receptive field is large or small. Extracting features from the same object at different scales will produce dissimilar results.
With the development of deep learning, techniques have been introduced to meet these requirements. To enlarge the receptive field of feature maps, the Pyramid Scene Parsing Network (PSPNet) [22] adopts Spatial Pyramid Pooling, which pools the feature maps into different sizes and concatenates them after up-sampling. DeepLab [23][24][25] adopts Atrous Spatial Pyramid Pooling (ASPP), which employs dilated/atrous convolution [26], in order to extract at a single scale and to perform accurate and effective classification of regions at any scale. To improve the resolution of the reconstructed image and, thus, object boundaries, Single-Image Super-Resolution (SISR) [27] has been proposed to generate a high-resolution image from low-resolution observations of the same scene. The Super-Resolution Convolutional Neural Network (SRCNN) [28] is the first end-to-end super-resolution algorithm that uses the CNN structure, and has achieved superior performance against traditional methods. Very Deep Convolutional Networks for Super-Resolution (VDSR) [29] and the Deeply-Recursive Convolutional Network (DRCN) [30] further improve SRCNN by using gradient clipping, skip connections, or recursive supervision to ease the difficulty of training deep networks. Accelerating the Super-Resolution Convolutional Neural Network (FSRCNN) [31] has been proposed to accelerate the training and testing of SRCNN. Super-Resolution Using a Generative Adversarial Network (SRGAN) has been proposed, constructing a deeper network with perceptual losses [32] and a generative adversarial network (GAN) [33] for photo-realistic super-resolution. The Efficient Sub-Pixel Convolutional Neural Network (ESPCN) [34] uses a shallow network and only upscales the low-resolution image at the very last stage, in order to achieve real-time performance. Deep Back-Projection Networks for Super-Resolution (DBPN) [35] provide an error feedback mechanism for projection errors at each stage.
Recent research has made improvements to the CNN structure, based on the characteristics of remote sensing images.
CNN networks such as FCN, DeepLabv3+, ENet [36], and ESPNet [37] have also been applied to remote sensing images and have achieved excellent results in building detection. Bittner et al. used FCN to extract buildings [38]. Xu et al. used an improved UNet [39] network with a guided filter to extract buildings [40]. Wu [41] used SegNet for riverbank monitoring with Unmanned-Aerial-Vehicle-based images. Zhang [42] proposed a multi-constraint fully convolutional network (MC-FCN) model to perform end-to-end building segmentation. Sanjeevan [43] improved the FCN using an exponential linear unit (ELU), in order to reduce noise (i.e., falsely classified buildings) and sharpen the boundaries of the buildings. SPP-PCANet [44] was designed to combine PCANet and spatial pyramid pooling, in order to reduce the number of false positives and to improve the detection rate. Wang [45] applied ASPP and superpixel-based DenseCRF to perform dense semantic labeling on remote sensing images. Ma [46] used an MR-SSD model to detect different objects in large-scale SAR images. WSF-Net [47] achieves results comparable to fully supervised methods when using only image-level annotations with weakly supervised binary segmentation. SPMF-Net [48] takes image-level labels as supervision information in a classification network that combines superpixel pooling and multi-scale feature fusion structures for building segmentation. Wang [49] combined semi-supervised training, Average Update of Pseudo-label (AUP) with pseudo-labels, and strong labels to improve segmentation performance. Gergelova [50] proposed a processing procedure in a geographic information system (GIS) environment that identifies roof surfaces based on LiDAR point clouds. Lyu [51] proposed statistical region merging (SRM) and a shape context similarity model for Light Detection and Ranging (LiDAR) data.
Inspired by the methods above, we propose a novel building segmentation network, which combines semantic segmentation and super-resolution for accurate building segmentation in remote sensing images. Our main contributions can be summarized as follows:

1.
A Detail Enhancement (DE) module based on SISR is proposed to enhance the boundaries of buildings. A Deep Residual Block is designed to improve the learning ability of the network. This design reduces the local information loss caused by the use of convolutional layers and down-sampling in a single deep CNN, which negatively impacts the details of the image around small objects.

2.
The Widely Adapted Spatial Pyramid (WASP) is designed to enlarge the receptive fields for extracting more context information. This design extracts multi-scale feature maps with pooling layers and dilated convolutions, and integrates features of different scales step-by-step, which incorporates neighboring scales of context features more precisely and introduces better feature learning and fitting capabilities. We further apply the improved ResNet50 and Global Feature Guidance to improve the segmentation accuracy.
The paper is organized as follows: Section 2 presents our building segmentation network. Section 3 presents the experimental results on two public datasets along with the discussion. Our conclusions are presented in Section 4.

Proposed Methods
In this paper, the Semantic Segmentation and Detail Enhancement (DE) modules are designed. The Semantic Segmentation module is used to obtain coarse segmentation results for building objects in the remote sensing images. The DE module is independent of the Semantic Segmentation module, using an independent feature extraction module (Deep Residual Block) and a feature transfer method (Feature Fusion Connection) to decrease the information loss caused by the pooling layers and down-sampling in the segmentation network. As shown in the overall architecture in Figure 1, we apply the improved ResNet50 (LResNet50) to extract dense features. We use a pooling operation to obtain spatial distribution information in the image, and use this as a guide for intact bodies and accurate boundaries, by embedding the results of the Detail Enhancement and Semantic Segmentation modules.


Detail Enhancement (DE) Module
A super-resolution reconstruction algorithm reconstructs a corresponding high-resolution image from a low-resolution image. In a neural network, the convolutional layers can directly learn low-frequency information through non-linear filters; the low-frequency information is well-preserved, such that the low-resolution input and the high-resolution output are similar with respect to the outlines of ground objects. However, the loss of high-frequency information still causes the reconstructed image to lack detail. Therefore, existing super-resolution networks generally use a shallow network to ensure the quality of the reconstructed image, but a shallow network has limited learning ability. As remote sensing images contain a large amount of information, the network can only be fully utilized when sufficient depth is guaranteed. Meanwhile, in the deep learning community, many theoretical studies [52][53][54] have shown that deep convolutional neural networks (number of layers > 5) can recursively identify patches of the images at lower layers, resulting in exponential increases in the number of linear regions of the input space. The deep layers typically extract more complex features from the input image than the shallow layers (i.e., layers < 5).
Based on the above analysis, we designed a Detail Enhancement (DE) module based on ESPCN, to enhance the details of the buildings in the remote sensing image. The module, as shown in Figure 2, contains a Deep Residual Block and Feature Fusion Connections. ESPCN uses a pixel shuffle layer, which performs sub-pixel up-sampling. Sub-pixel up-sampling differs from the deconvolution layer and interpolation-based up-sampling in that the interpolation function is implicitly included in the preceding convolutional layer and can be learned automatically. The image size is transformed only in the last layer, and the previous convolution operations are performed on low-resolution images, so sub-pixel up-sampling is more efficient than bicubic up-sampling and deconvolution. In this module, we retain the original sub-pixel up-sampling layer. For the feature extraction part, we designed a feature enhancement module to increase the depth of the network, and we used a feature transfer module to ensure that the network retains more high-frequency information.
In the Deep Residual Block, we apply the Residual Block proposed by EDSR [55]. Compared with the original ResNet [56], the Residual Block of EDSR does not apply the pooling layer or the batch normalization (BN) [57] layer; correspondingly, the structure is simpler and requires less computation, which makes it suitable for processing images with rich information. As shown in Figure 2a, every Group Block contains three Residual Blocks; as shown in Figure 2b, the Deep Residual Block (DRblock) contains two Group Blocks. Since image super-resolution is an image-to-image translation task in which the input image is highly correlated with the target image, we only need to learn the residual between them, which we name the global residual feature. In this case, the network avoids learning a complicated transformation from one complete image to another; instead, it only needs to learn a residual map to restore the missing high-frequency details. The local residual learning is similar to the residual learning in ResNet, and is used to alleviate the degradation problem caused by increasing network depth, reducing training difficulty and improving learning ability. Between the Deep Residual Blocks and Group Blocks, we applied skip connections for feature transfer, using the Feature Fusion Connection inspired by ResNet. The Feature Fusion Connections are implemented by shortcut connections and element-wise addition. The difference is that the former directly connects the input and output (the black line in Figure 2), transporting the global residual feature to the pixel shuffle layer, while the latter adds multiple shortcuts between layers at different depths inside the network, transferring the features extracted by every block step-by-step to the sub-pixel up-sampling layer.
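As a concrete illustration of the sub-pixel (pixel shuffle) up-sampling that ESPCN uses and our DE module retains, the rearrangement can be sketched as follows. This is a minimal pure-Python sketch, assuming an input feature map stored as a nested list of shape (C·r², H, W); the function name and data layout are illustrative, not the paper's implementation.

```python
def pixel_shuffle(feat, r):
    """Rearrange a (C*r*r, H, W) feature map into (C, H*r, W*r).

    Each group of r*r channels is scattered into an r x r spatial
    neighborhood, so all convolutions before this layer run at low
    resolution and only this final step enlarges the image.
    """
    cr2 = len(feat)
    h, w = len(feat[0]), len(feat[0][0])
    assert cr2 % (r * r) == 0, "channel count must be divisible by r^2"
    c = cr2 // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c)]
    for ch in range(c):
        for i in range(r):          # sub-pixel row offset
            for j in range(r):      # sub-pixel column offset
                src = feat[ch * r * r + i * r + j]
                for y in range(h):
                    for x in range(w):
                        out[ch][y * r + i][x * r + j] = src[y][x]
    return out
```

For example, a (4, 1, 1) input with r = 2 becomes a single 2 × 2 channel, with each input channel supplying one sub-pixel position.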
In this way, the Feature Fusion Connection part of the module densifies feature reuse within the network, which helps to address the gradient vanishing problem.

Widely Adapted Spatial Pyramid (WASP) Module
In a deep neural network, the receptive field [58] size roughly indicates how much contextual information is used. Due to the convolutional nature of a CNN, local convolutional features usually have a limited receptive field. Even with a large receptive field, the feature mainly describes the core region and ignores the contextual information around boundaries. Due to the arbitrary sizes of objects in the image, purely using single-scale convolution for image semantic segmentation will lead to the incorrect classification of pixels between categories. Some small objects, such as streetlights, are hard to distinguish but may be of great importance. In addition, large objects may exceed the receptive field of the CNN, thus causing discontinuous prediction. These issues demand processing at multiple scales. As strided convolutions are used in convolution layers and spatial pooling is used in pooling layers, the resolution of the feature maps is reduced. The use of a pooling layer also leads to reduced localization accuracy in the labeled images. As a result, many recent works have attempted to deal with the problems mentioned above, in one way or another.
The DeepLab series utilizes Atrous Spatial Pyramid Pooling (ASPP), with image-level features responsible for encoding the global context. Differing from conventional approaches, which tackle objects at multiple scales by applying rescaled versions of an input image and then aggregating the feature maps, ASPP applies dilated/atrous convolution to integrate the visual context at multiple scales. With dilated convolution, the size of the captured feature map is not reduced. The structure can also probe an incoming feature map using filters with multiple sampling rates, whose effective fields-of-view encode the context at multiple scales. Aggressively increasing the dilation rate may, however, cause an inherent problem called "gridding" [59]. PSPNet abstracts different sub-regions by adopting varying-size pooling kernels, and uses global pooling to extract global context information, combining it with the spatial pyramid [60][61][62][63]. The structure fuses features under four different pyramid scales, and up-samples the low-dimensional feature maps by bilinear interpolation to obtain features of the same size as the original feature map. The use of pooling layers reduces the resolution of the feature map and abandons local information, which further makes it difficult to distinguish objects, especially small-sized objects.
Current semantic segmentation architectures perform feature learning by extracting multi-scale features and merging them, but lack selective extraction of features at different scales. ENet, based on SegNet, uses two blocks with max-pooling to heavily reduce the input size, and uses only a small set of feature maps. The idea behind this is that visual information is highly spatially redundant, and thus can be compressed into a more efficient representation; the intuition is that the initial network layers should not directly contribute to classification, but should rather act as good feature extractors that only preprocess the input for later portions of the network. ESPNet uses point-wise convolutions to reduce computation, while its spatial pyramid of dilated convolutions re-samples the feature maps to learn representations from a large effective receptive field. EncNet [64] employs the dilation strategy and proposes a Context Encoding Module (CEM) incorporating the Semantic Encoding Loss (SE-loss), in order to capture contextual information and to selectively highlight class-dependent feature maps. The proposed SE-loss, unlike per-pixel loss, is capable of taking the global context into consideration, resulting in context-aware training. The Selective Kernel Network (SKNet) [65] proposes a dynamic selection mechanism to adaptively adjust the receptive field size of each neuron based on multiple scales of the input information, where multiple branches with different kernel sizes are fused using a softmax attention function guided by the information in these branches. Although these branches can extract receptive fields at different scales in feature maps, they lack a global context prior attention to select the features in a channel-wise manner.
The Pyramid Attention Network (PAN) [66] exploits global contextual information, combining an attention mechanism and spatial pyramid to extract precise dense features for pixel labeling, instead of complicated dilated convolutions and artificially designed decoder networks.
Following PAN, we consider how to exploit high-level feature maps to guide low-level features in recovering pixel localization. Global context features and sub-region context are helpful, in this regard, to distinguish among various categories. This global prior is designed to remove the fixed-size constraint of CNNs for image classification.
Considering the above observations, the Widely Adapted Spatial Pyramid (WASP) is proposed, as shown in Figure 3. In this module, we concatenate dilated convolution and pooling layers in the Multi-Scale Feature Extraction block. Each branch first uses a pooling operation to capture global contextual information at the image level, where the contextual information can use spatial statistical data to interpret the entire image scene. Then, we use dilated convolution to expand the receptive field of the feature map without reducing its resolution, where each pixel of the feature map belongs to a different image level and corresponds to a different sub-region of the original image. When the level size of the pyramid is N, we reduce the size of the context representation to 1/N of the original size. Specifically, the four branches of the pyramid are designed at 1/8, 1/4, 1/2, and 1 of the original size. Compared with a single branch, this setting can more effectively restore the spatial information damaged by the pooling operation [67], so that objects in the final prediction have sharper boundaries. We use a spatial pyramid structure similar to DeepLabv3, where the dilation rates we selected were 6, 12, 18, and 1. This design is based on the receptive field formula shown in Equation (1). This structure avoids the receptive field obtained by one branch being equal to, or an integral multiple of, that of another branch, under the premise of obtaining a receptive field as large as possible.
r_n = r_{n-1} + (k_n - 1) × ∏_{i=1}^{n-1} S_i, (1)

where n represents the number of layers, r_n represents the receptive field size of the nth layer, k_n represents the convolution kernel size of the nth layer, d represents the dilation rate of the nth layer, and S_n represents the convolution stride size of the nth layer. The equivalent kernel of a dilated convolution is calculated as:

k_d = d × (k_n - 1) + 1,

where k_d represents the equivalent convolution kernel for the dilated convolution of the nth layer. In our network design, each branch of the pyramid has a pooling layer and a dilated convolution layer. The receptive field of each branch can then be calculated as:

r = k_p + (k_d - 1) × S_p,

where k_p and S_p represent the convolution kernel size and stride of the pooling layer. In this paper, we set the stride and kernel size of the pooling layer equal to the pooling rate. Assuming that the pooling rate is p (i.e., k_p = S_p = p), the formula can be expressed as:

r = p + (k_d - 1) × p.

It can also be expressed as:

r = p × (d × (k_n - 1) + 1).

Then, we upsample the feature maps to obtain features with the same size as the original feature map. Benefiting from the Progressive Upsample structure, the multi-scale feature maps are fused and learned through the convolution layers in the progressive process. Compared with structures that directly concatenate the feature maps after the parallel branches, as in DeepLab or PSPNet, this leads to better feature learning and fitting capabilities, as well as relatively dense feature maps. The Progressive Upsample structure integrates features at different scales step-by-step, and can incorporate neighboring scales of context features more precisely. At the output end of the final multi-scale branch, we also adopt a progressive convolution structure. After the feature processing of each parallel branch is completed, we use bilinear interpolation to upsample the feature map of the current branch, and feature maps of the same size are added in a step-by-step manner. Finally, a feature map of the same size as the original input is obtained.
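The receptive-field relations above can be checked numerically. The sketch below is a pure-Python illustration of the formulas, assuming each WASP branch is a pooling layer (kernel = stride = p) followed by one dilated 3 × 3 convolution; the specific pairing of pooling rates and dilation rates is our assumption for illustration, not taken from the paper.

```python
def dilated_kernel(k, d):
    # equivalent kernel size of a dilated convolution: k_d = d*(k - 1) + 1
    return d * (k - 1) + 1

def branch_receptive_field(p, d, k=3):
    """Receptive field of one branch: pooling (kernel = stride = p)
    followed by a single dilated convolution with kernel k, dilation d."""
    r1 = p                       # pooling layer: r_1 = k_p = p
    kd = dilated_kernel(k, d)
    # dilated conv after pooling: r = r_1 + (k_d - 1) * S_p, S_p = p
    return r1 + (kd - 1) * p     # equals p * (d*(k - 1) + 1)

# hypothetical (pooling rate, dilation) pairs for the four branches
for p, d in [(8, 6), (4, 12), (2, 18), (1, 1)]:
    print(f"p={p}, d={d}: receptive field {branch_receptive_field(p, d)}")
```

Note that the resulting branch receptive fields are neither equal nor integer multiples of one another, which matches the design goal stated above.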
In this way, the feature maps at multiple scales are fused and learned through the convolution operator in the progressive process. Compared with structures that only use a single-layer convolution to process the feature maps after parallel branches are merged, it can introduce more convolution operations, better feature learning and fitting capabilities, and relatively dense feature maps.
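The step-by-step fusion of the Progressive Upsample structure can be sketched as follows. This pure-Python sketch substitutes nearest-neighbor up-sampling for the bilinear interpolation used in the paper, and treats each feature map as a single 2-D grid for simplicity.

```python
def upsample2x(fm):
    # nearest-neighbor 2x up-sampling of a 2-D feature map
    out = []
    for row in fm:
        expanded = [v for v in row for _ in (0, 1)]
        out.append(expanded)
        out.append(list(expanded))
    return out

def progressive_fuse(pyramid):
    """Fuse feature maps ordered from coarse to fine.

    Each map is twice the size of the previous one: the running sum is
    up-sampled and added to the next scale, instead of up-sampling all
    branches to full size and concatenating them at once.
    """
    fused = pyramid[0]
    for fm in pyramid[1:]:
        up = upsample2x(fused)
        fused = [[up[i][j] + fm[i][j] for j in range(len(fm[0]))]
                 for i in range(len(fm))]
    return fused
```

In the full module, a convolution layer would follow each addition so that neighboring scales are also learned jointly, not just summed.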
We add the multi-scale feature map coarsely obtained from the Semantic Segmentation module to the feature map obtained by the Detail Enhancement module, for which average pooling is applied to the image. The global context information of the entire image is obtained by the pooling operation. Assuming that the input feature map is of size h × w × C, a tensor of size 1 × 1 × C is obtained after pooling. Global pooling is more native to the convolution structure, enforcing correspondences between feature maps and categories, such that it generates one feature map for each corresponding category of the classification task and sums up the spatial information; it is therefore more robust to spatial translations of the input. We apply the global context information to directly perform pixel-by-pixel weighted selection of the input feature map. We call this Global Feature Guidance (abbreviated as GFG) to facilitate subsequent experiments.
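Global Feature Guidance can be sketched as global average pooling followed by channel-wise re-weighting of the feature map. The following pure-Python sketch is our illustrative reading of the mechanism; the sigmoid gate is an assumption, since the paper does not specify the gating function here.

```python
import math

def global_feature_guidance(feat):
    """Re-weight a (C, H, W) feature map by its global context.

    Global average pooling yields a 1 x 1 x C context vector; a sigmoid
    (assumed) turns it into per-channel weights that scale every pixel
    of the corresponding channel.
    """
    weighted = []
    for channel in feat:
        h, w = len(channel), len(channel[0])
        mean = sum(sum(row) for row in channel) / (h * w)  # global average pool
        gate = 1.0 / (1.0 + math.exp(-mean))               # assumed sigmoid gate
        weighted.append([[gate * v for v in row] for row in channel])
    return weighted
```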

Other Settings in Our Method
In our method, ResNet50 was chosen to extract the feature map. The original ResNet50 uses the Rectified Linear Unit (ReLU) as the activation function. ReLU (which can be calculated as in Equation (2)) retains the biological inspiration of the step function (i.e., the neuron is activated only when the input exceeds the threshold). When the input is positive, the derivative is non-zero, allowing gradient-based learning (although, at x = 0, the derivative is undefined). Using this function makes the calculation process very fast, as neither the function nor its derivative contains complex mathematical operations. However, when the input is negative, the gradient is zero, so the weights cannot be updated; the learning speed of ReLU may therefore become very slow, or the neuron may even become permanently inactive. The Leaky ReLU function (LReLU) (which can be calculated as in Equation (3)) is a variant of the classic ReLU activation function. The output of this function still has a small slope when the input is negative. Since the derivative is then non-zero, it reduces the appearance of silent neurons, allowing gradient-based learning (although slow) and thus solving the problem of neurons not learning after the ReLU function enters a negative interval. Compared with ReLU, LReLU has a larger activation range. Based on the above, we replaced all ReLU functions in ResNet50 with LReLU (Leaky Rectified Linear Unit). In this paper, the ResNet50 with LReLU instead of ReLU is termed "LResNet50".
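The two activation functions compared above can be written directly. A minimal sketch follows, using the common negative slope of 0.01 as an assumption, since the paper does not state the slope used in LResNet50.

```python
def relu(x):
    # zero output (and zero gradient) for x < 0: negative inputs can
    # silence a neuron permanently
    return max(0.0, x)

def leaky_relu(x, negative_slope=0.01):
    # small non-zero slope for x < 0 keeps the gradient alive, so the
    # neuron can still learn in the negative interval
    return x if x > 0 else negative_slope * x
```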

Datasets Description and Training Details
To verify the effectiveness of our proposed method, we conducted various experiments using two publicly available datasets: the Inria Aerial Image Labeling dataset (Inria aerial dataset) [68] and the Satellite dataset II (East Asia) [69]. Although building shapes are usually regular and mostly rectangular, we used thousands of images from the two datasets as the training set, to ensure the generalization performance of our method.
The Inria Aerial Image Labeling Dataset, released by Maggiori et al., covers an area of 810 square kilometers, with a total of 360 images, each of 5000 × 5000 pixels. The dataset includes 10 densely populated cities and remote villages (Austin, Bellingham, Bloomington, Chicago, Innsbruck, Kitsap, San Francisco, Western and Eastern Tyrol, and Vienna). The spatial resolution of the images reaches 0.3 m, and the shape and structure of the buildings are clearly visible. The images cover dissimilar urban settlements, ranging from densely populated areas (e.g., San Francisco's financial district) to alpine towns (e.g., Lienz in Austrian Tyrol). Instead of splitting adjacent portions of the same images into the training and test subsets, different cities are included in each of the subsets. The dataset was constructed by combining public domain imagery and public domain official building footprints. The Satellite dataset II (East Asia) consists of 17,388 images of 512 × 512 pixels, derived from six neighboring satellite images covering 860 km² in East Asia with 0.45 m ground resolution. In this experiment, we used 3135 images as the training set and 903 images as the test set. Unlike the Satellite dataset II, the test set of the Inria aerial dataset does not provide labels that can be used for comparative segmentation experiments, so we chose four groups of images and their labels from the training set as the test set of the Inria aerial dataset in our experiments.
Due to GPU memory limitations, we resized the images from the datasets: each image and its corresponding label were cropped to 512 × 512 pixels. In order to improve the image utilization of the datasets, the images and corresponding labels in the original dataset were randomly cropped. In order to facilitate observation of the experimental results, the test images were cropped to 512 × 512 pixels with an overlap of 10 pixels between adjacent tiles. These operations ensure that the network has enough data, prevent the model from overfitting, and improve its robustness.
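The tile origins for cropping with a fixed overlap can be computed as below. This is a sketch under the assumption that a final tile is snapped to the image border when the stride does not divide the image size evenly; the exact tiling scheme of the paper may differ.

```python
def tile_starts(length, tile=512, overlap=10):
    """Return the starting offsets of crop tiles along one image axis.

    Consecutive tiles share `overlap` pixels; a last tile aligned with
    the far border is appended if needed so every pixel is covered.
    """
    stride = tile - overlap
    starts = list(range(0, length - tile + 1, stride))
    if not starts or starts[-1] + tile < length:
        starts.append(max(length - tile, 0))
    return starts
```

For a 1024-pixel-wide image this yields three tiles per axis (the last one snapped to the right border), while a 512-pixel image yields a single tile.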

Evaluation Metrics
To verify the effectiveness of the Detail Enhancement module, we used the Peak Signal-to-Noise Ratio (PSNR) as an evaluation metric. In the experiments, the original images from the dataset were used as the I_HR images. The corresponding I_LR images were obtained by downscaling the images by the factor U through bicubic interpolation. A low spatial resolution image I_LR with W columns, H rows, and C bands can be expressed by a tensor of size W × H × C; its corresponding I_HR has a size of UW × UH × C. We compared our module with several state-of-the-art super-resolution methods. MSE represents the mean square error between the reconstructed image F and the original high-resolution image I. Assuming the size of the reconstructed image and the original high-resolution image is m × n, the MSE can be calculated as:

MSE = (1 / (m × n)) × Σ_{i=1}^{m} Σ_{j=1}^{n} (F(i, j) - I(i, j))².

Let k represent the number of bits per pixel. The PSNR can then be calculated as:

PSNR = 10 × log_10((2^k - 1)² / MSE).

To verify the effectiveness of the Semantic Segmentation module, we used the Pixel Accuracy (PA) and Mean Intersection over Union (mIoU) as evaluation metrics. PA indicates the proportion of correctly classified pixels to the total number of pixels. Assuming that there are a total of k + 1 classes in the image, p_ij represents the number of pixels that belong to class i but are predicted to be class j, while p_ii represents the number of pixels that are classified correctly:

PA = Σ_{i=0}^{k} p_ii / Σ_{i=0}^{k} Σ_{j=0}^{k} p_ij.
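The MSE and PSNR defined above can be computed as follows. This is a minimal sketch for single-band images stored as nested lists, with k = 8 bits per pixel as the default.

```python
import math

def mse(F, I):
    # mean squared error between reconstruction F and reference I (m x n)
    m, n = len(F), len(F[0])
    return sum((F[i][j] - I[i][j]) ** 2
               for i in range(m) for j in range(n)) / (m * n)

def psnr(F, I, k=8):
    # peak signal-to-noise ratio; peak value is (2^k - 1) for k-bit pixels
    err = mse(F, I)
    if err == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(((2 ** k - 1) ** 2) / err)
```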
The mIoU is used to measure the segmentation accuracy over the datasets. The mIoU can therefore be calculated as:

mIoU = (1 / (k + 1)) × Σ_{i=0}^{k} p_ii / (Σ_{j=0}^{k} p_ij + Σ_{j=0}^{k} p_ji - p_ii).
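Both PA and mIoU can be computed directly from the pixel counts p_ij. A minimal sketch follows, where `p` is a (k+1) × (k+1) matrix with p[i][j] the number of pixels of class i predicted as class j.

```python
def pixel_accuracy(p):
    # correctly classified pixels (diagonal) over all pixels
    correct = sum(p[i][i] for i in range(len(p)))
    total = sum(sum(row) for row in p)
    return correct / total

def mean_iou(p):
    # per-class IoU: p_ii / (row sum + column sum - p_ii), then averaged
    k1 = len(p)  # k + 1 classes
    ious = []
    for i in range(k1):
        union = sum(p[i]) + sum(p[j][i] for j in range(k1)) - p[i][i]
        ious.append(p[i][i] / union)
    return sum(ious) / k1
```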

Comparison Experiments
In the experiments, we decomposed and recombined the proposed method, in order to verify the effectiveness of each module. In Section 3.3.1, ESPCN is used as the basic network. In Section 3.3.2, ResNet50 with ASPP (similar to DeepLabv3) is used as the basic network; the feature map extracted by ResNet50 was fed into the WASP. In Section 3.3.3, we integrated the two modules together, with some other optimization operations, to obtain our proposed method (LResNet50 + DE + WASP + GFG). The results are summarized in the associated tables and figures.

Comparison Experiments for DE Module
In this subsection, the super-resolution results under different factors are discussed. The performance of the proposed DE module was evaluated over the two selected datasets. For each dataset, a 512 × 512 sub-region was randomly selected from the image to train the network, and another 512 × 512 sub-region, divided equally from the image, was used to validate the performance of our proposed method. In order to simulate a low-resolution image, we first downsampled by a factor of N, so that the image size was reduced to 1/N of the original, and used the result as the input to the super-resolution module. For all the models compared in this experiment, we used the PSNR as the metric.
We first compared the proposed DE module under different settings, with upsampling factors of 2, 3, and 4. The results are summarized in Tables 1 and 2. It was observed that: (1) our proposed DE module obviously outperformed all of the other methods over these two datasets, showing the highest PSNR; (2) the PSNR value improved when we used the Deep Residual Block (DRblock); and (3) when we further used the Feature Fusion Connection (FFC), the PSNR improved further. With these two blocks, the module can extract features more accurately and can fuse global and local residual features efficiently. We then compared the DE module with other state-of-the-art super-resolution methods (ESPCN, EDSR, SRCNN, and DBPN). Among these methods, SRCNN and ESPCN have the simplest structures, consisting of only a few convolutional layers in the feature extraction part, with a sub-pixel upsampling layer used in the up-sampling part of ESPCN. EDSR uses the Residual Block module, but uses an interpolation method in the up-sampling part. The structure of DBPN is the most complicated. The DE module we designed uses the same sub-pixel upsampling layer as ESPCN and uses the Residual Block from EDSR as the feature extraction module. The factors were again set to 2, 3, and 4. The results are summarized in Tables 3 and 4. It was observed that: (1) the PSNR of DE was very close to that of DBPN and much higher than those of EDSR and SRCNN on the two datasets; and (2) as the upsampling factor increased, the superiority of our proposed DE became less obvious, since the SR problem becomes more difficult with a higher upsampling factor. The visual results of our proposed DE module were compared with those of five other state-of-the-art methods (bicubic, ESPCN, EDSR, SRCNN, and DBPN) on the two datasets, as shown in Figures 4 and 5.
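The sub-pixel up-sampling layer shared by ESPCN and the DE module can be sketched in NumPy as a channel-to-space rearrangement (a minimal single-image version; framework implementations operate on batched tensors):

```python
import numpy as np

def pixel_shuffle(x, r):
    """Sub-pixel up-sampling (ESPCN-style): rearrange an (H, W, C*r*r)
    feature map into an (H*r, W*r, C) image, trading channel depth
    for spatial resolution without any interpolation."""
    h, w, c = x.shape
    assert c % (r * r) == 0, "channel count must be divisible by r^2"
    c_out = c // (r * r)
    # split the channels into an r x r sub-pixel grid per location...
    x = x.reshape(h, w, r, r, c_out)
    # ...then interleave the grid into the spatial dimensions
    x = x.transpose(0, 2, 1, 3, 4)      # (h, r, w, r, c_out)
    return x.reshape(h * r, w * r, c_out)
```

With r = 2, a 1 × 1 × 4 feature map containing the values 0..3 is rearranged into the 2 × 2 image [[0, 1], [2, 3]].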
In order to facilitate comparison, we used the bicubic method (which is commonly used for up-sampling) to up-sample the low-resolution images obtained through downsampling to the size of the original image, in order to compare the visual results. It can be observed that the details in the image reconstructed using the DE module were richer. In Figure 4, compared with the SRCNN and ESPCN algorithms, objects with different sizes in the image reconstructed by DE module can be easily distinguished from the surrounding environment. Some small-sized objects (such as cars in shadows) can also be distinguished from their surroundings. However, due to a lack of details, the distinction between the outline of the car in the images reconstructed by SRCNN and ESPCN and the surrounding scenery was not high. The reconstruction effects of DBPN and our algorithm were similar. In the red boxes of Figure 5, the ridges in the fields can be clearly seen in the image reconstructed by our DE module. The DBPN algorithm performed slightly better, but the fields in the red box area of the reconstructed image of the SRCNN and ESPCN algorithms are mixed and difficult to distinguish. This demonstrates that the proposed module can effectively enrich the details of the reconstructed image; especially the boundaries between some insignificant adjacent objects, which can also be clearly distinguished.

Comparison Experiments for Semantic Segmentation Module
In this experiment, we applied ResNet50 with ASPP (a structure similar to the ASPP in DeepLab V3) as the basic structure. In detail, we conducted experiments with several settings, including max pooling or average pooling, WASP, and Global Feature Guidance. The accuracy results of the proposed structure are summarized in Tables 5 and 6, in which "LResNet50" represents ResNet50 with LReLU instead of ReLU, "MAX" and "AVE" represent max pooling and average pooling, respectively, and "GFG" represents Global Feature Guidance. It can be observed that: (1) the richer contextual information extracted by WASP was very effective in improving the segmentation accuracy. WASP could also effectively handle large differences in object scales, as the proposed structure combines the features of every parallel branch; (2) average pooling and LReLU improved the segmentation effect, but the improvement was limited: in Table 5, they improved the performance by almost 5% on the Inria aerial dataset and improved the mIoU by almost 7% on the Satellite dataset II (East Asia); (3) the use of GFG did not obviously increase the mIoU (less than 1%), but it increased the PA (by more than 4% on the Inria aerial dataset and by almost 7% on the Satellite dataset). This indicates that the use of global features can preserve the spatial information of the original image and improve the segmentation effect, such that our method could better distinguish categories and reduce false positives.
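The multi-rate pyramid idea behind WASP can be illustrated with a minimal NumPy sketch: parallel dilated 3 × 3 convolutions over the same feature map, whose outputs are stacked into a multi-scale feature volume. The kernel and dilation rates here are illustrative placeholders, not the authors' configuration:

```python
import numpy as np

def dilated_conv2d(x, kernel, d):
    """3x3 dilated convolution with 'same' zero padding on a single-channel
    map. Dilation d samples the input on a grid with spacing d, so the
    effective receptive field of a 3x3 kernel grows to (2d+1) x (2d+1)."""
    h, w = x.shape
    xp = np.pad(x, d)                        # zero padding of width d
    out = np.zeros((h, w), dtype=np.float64)
    for ki in range(3):
        for kj in range(3):
            out += kernel[ki, kj] * xp[ki * d : ki * d + h,
                                       kj * d : kj * d + w]
    return out

def pyramid_branches(x, kernel, rates=(1, 2, 4)):
    """Sketch of the pyramid idea: parallel dilated branches over the same
    feature map, stacked into a multi-scale feature volume."""
    return np.stack([dilated_conv2d(x, kernel, d) for d in rates], axis=-1)
```

Each branch sees the same input at a different receptive-field size, which is what lets the pyramid cope with large differences in object scale.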
The visual results of our proposed Semantic Segmentation module on the two datasets are shown in Figures 6 and 7. The roofs of buildings in remote sensing images (which are captured from a top view) may appear very similar to squares, roads, and other surfaces; therefore, some adjoining building objects may be mixed together, with boundaries that cannot be distinguished. In Figures 6c and 7c, which show the segmentation results of the basic structure, small building objects had a relatively good segmentation effect, but the shapes of the segmented buildings differed considerably from the shapes in the ground-truth or the original image, and some adjoining buildings still had mixed boundaries. In Figures 6d and 7d, the number of building objects was the same as the number in the original image and label; the shapes were also very close, with clear boundaries between adjacent building targets. In Figures 6e and 7e, the building object boundaries in the segmentation result were clear and the segmented object shapes were close to the target shapes in the original image. In Figures 6f and 7f, most of the building targets in the original image and the ground-truth were segmented; the shapes of the building objects were very close to those in the ground-truth, and their outlines were very clear; however, the boundaries and shapes of some adjoining buildings in the image still differed from the buildings in the original image.
Different from Figures 6 and 7, which use the Inria aerial dataset, the buildings and backgrounds in Figures 8 and 9 are very similar in color (green); some adjacent buildings were very small and difficult to distinguish. While ensuring the accurate segmentation of the building shapes in the ground-truth, the proposed Semantic Segmentation module also segmented small-sized buildings that were not marked in the original image. This demonstrates that our proposed method can be especially useful for small building objects when the difference between the sizes of objects is very large.

Comparison Experiments for Different Semantic Methods
In this experiment, ResNet50 with ASPP was set as the basic method. The quantitative results of this experiment, using the two datasets, are listed in Tables 7 and 8. Compared with other state-of-the-art semantic segmentation methods, our method had a better segmentation effect. Compared with the basic method, our method improved the mIoU from 0.6412 to 0.9034 (i.e., an improvement of almost 28.2%) and the PA from 0.6826 to 0.8421 (i.e., an improvement of 24.5%) when using the Inria aerial dataset. When using the Satellite dataset II (East Asia), our method improved the mIoU from 0.6981 to 0.8936 and the PA from 0.7408 to 0.8738, thus improving the mIoU by almost 28% and the PA by 17%. The reason for this is that the connection of dilated convolution and pooling in the parallel branches of the proposed pyramid structure could extract richer contextual information. At the same time, due to the use of the super-resolution algorithm in the DE module, the low- and high-frequency information in the network could be better utilized; therefore, it was more suitable for remote sensing images. From the visualization results, we determined that our method had an ideal segmentation effect, especially for small building objects, as can be seen in Figures 10 and 11. The methods used in the experiment employ pooling operations, dilated convolutions, and pyramid structures in order to obtain contextual information; nevertheless, almost all the ground objects and boundaries were correctly identified only when using our method. Similar to the results in Figures 10 and 11, our method also outperformed the other methods on the Satellite dataset II (East Asia). It can be seen that our method could segment accurate contours for complex-shaped building objects.
The building objects in Figures 12 and 13, especially the adjacent buildings, were accurately segmented, regardless of whether they were large- or small-sized buildings. In particular, the shape and location of small-sized buildings were consistent with the ground-truth. These findings show that our method can effectively improve the performance of remote sensing image semantic segmentation.

Conclusions
In this paper, we proposed a novel building segmentation network for remote sensing images, combining the characteristics of image semantic segmentation and super-resolution methods. Through extensive experiments on two remote sensing image datasets, our method was shown to significantly improve segmentation performance compared to the state of the art: it achieves accurate segmentation of building objects at multiple scales, especially small objects. The designed method also achieves very effective segmentation in complex scenes containing complex-shaped buildings.
At present, remote sensing technology is developing rapidly, especially in autonomous driving and artificial intelligence applications. The recognition of multi-scale targets in remote sensing images has been the focus of a large body of research. These applications also impose requirements on computation speed and algorithmic complexity, which will be the focus of our future work.