Building Outline Extraction Directly Using the U²-Net Semantic Segmentation Model from High-Resolution Aerial Images and a Comparison Study

Abstract: Deep learning techniques have greatly improved the efficiency and accuracy of building extraction from remote sensing images. However, obtaining high-quality building outline extraction results that can be applied in the field of surveying and mapping remains a significant challenge; in practice, most building extraction tasks are still executed manually. Therefore, an automated procedure that yields building outlines with precise positions is required. In this study, we directly used the U²-Net semantic segmentation model to extract building outlines. The extraction results showed that the U²-Net model can provide building outlines with better accuracy and more precise positions than other models, based on comparisons with semantic segmentation models (Segnet, U-Net, and FCN) and edge detection models (RCF, HED, and DexiNed) applied to two datasets (Nanjing and Wuhan University (WHU)). We also modified the binary cross-entropy loss function in the U²-Net model into a multi-class cross-entropy loss function to directly generate a binary map containing the building outline and the background. We achieved a further refined building outline, showing that with the modified U²-Net model it is not necessary to use non-maximum suppression as a post-processing step to refine the edge map, as is done in the other edge detection models. Moreover, the modified model is less affected by the sample imbalance problem. Finally, we created an image-to-image program to further validate the modified U²-Net semantic segmentation model for building outline extraction. The results, including those for complex buildings with irregular boundaries, confirm the effectiveness of our approach.


Introduction
With the rapid development of aerospace technology, high-resolution remote sensing images have become increasingly accessible and convenient to use [1]. As a key element of a city's structure, buildings play an important role in human society. Extracting building information from remote sensing images is beneficial as it allows us to quantify the spatial distribution characteristics of buildings. Such information is invaluable for urban planning [2,3], disaster management [4,5], land use change [6], mapping [7][8][9], and many other applications. In actual production activities, we tend to work with the edges of a building, as they convey the shape, position, distribution, and other geometric features of the building in a comprehensible manner. In recent years, with the continuous advancement of deep learning technology, there has been significant progress in building edge extraction [10][11][12]; however, it remains a major challenge to automatically extract clear and precise building edges from remote sensing images in practical applications.
In the past few decades, the extraction of remote sensing-based building edges by using traditional methods has been extensively studied. These efforts can be divided into three categories: (1) edge or straight-line detection technology, (2) curve propagation techniques, and (3) segmentation technology. Based on traditional edge detection technology, the edges of an image are first extracted. Then, the edges or the straight lines are combined to form closed contours. Ultimately, a complete building contour is extracted using prior knowledge of a building's shape. Mohammad et al. [13] extracted a straight line and obtained the connection points to identify and quantify a building roof contour through these connection points. Qi et al. [14] first used a priori knowledge to constrain the extracted lines, group the segments of the lines, and ultimately extract the polygonal building boundary. Turker et al. [15] used the support vector machine (SVM) to classify the data and build image blocks in a study area. They subsequently used the Canny operator and applied the Hough transform for each image block to retrieve straight-line segments. Finally, the straight-line segments were grouped to extract the building contours. Given a variety of discontinuous edge segments in traditional edge detection methods, a post-processing step is required to improve the accuracy of the building edge detection. To solve this problem, curve propagation techniques such as active contour extraction and the level set method have been used to extract a closed contour of a target building. For instance, Karantzalos et al. [16] used various prior information sources to construct a target energy function for the level set segmentation, which can be effectively used to extract multiple targets in an image that meets the requirements of the building shape library. Ahmadi et al. 
[17] used the difference between the internal and external contours of the target to form an energy function for active contour tracking to extract the target contour of a typical building. Although such methods can be used for extracting closed contours, the sensitivity of the method to the initial edge detection cannot guarantee the acquisition of a globally optimal boundary. Moreover, it is difficult to detect weak building edges by using only local information. Given the fact that the former two methods cannot fully exploit the prior global and local building information, the object-oriented segmentation technique has been widely used for building target extraction purposes. As such, Attarzadeh et al. [18] obtained optimal scale parameters through the segmentation technique. They segmented a remote sensing image and extracted multiple features to quantify the buildings through classification. Sellouti et al. [19] used the watershed algorithm to segment the input images for choosing candidate regions. Then, they classified the segmented objects using the similarity index and finally extracted the target building information using the region growing method. Thus, the segmentation technique allows for the association of global and local variables to a certain extent, but its performance strongly depends on the initial segmentation, and it is difficult to extract complex buildings.
In recent years, the rapid development of deep learning techniques has resulted in the subsequent progress of convolutional neural networks [20] in the field of computer vision including image classification [21][22][23][24], object detection [25][26][27], and semantic segmentation [28,29] methods. The deep learning techniques promoted the transformation of manual design features to autonomous learning features [30,31]. From the classification perspective, a new generation of convolutional networks led by AlexNet [24] and ResNet [21] have shown great success on image classification tasks. Based on the traditional convolutional neural network, Long et al. [29] proposed a fully convolutional network (FCN) by eliminating the original full connection layer. It realizes the pixel-to-pixel model and lays a foundation for subsequent models such as U-Net [32] or Segnet [33]. Owing to this, the FCN and its variants have been widely used in the fields of remote sensing classification [23,34] and objective extraction [35,36]. The fusion of CNN and GCN proposed by Hong et al. [37] broke the performance bottleneck of a single model and was used to extract different spatial-spectral features to further improve the classification effect.
Since deep learning technology was introduced into the remote sensing field, numerous influential studies have been published [37][38][39][40][41]. Recently, FCNs have yielded robust results in remote sensing image semantic segmentation. Li et al. [42] performed a comparative analysis of applying an FCN for building extraction, and their results proved the effectiveness of the FCN. The SRI-Net model proposed by Liu et al. [43] can accurately detect large buildings that are easily overlooked, while maintaining their overall shape. Huang et al. [35] proposed the DeconvNet model, using different band combinations (RGB and NRG) in the Vancouver dataset to fine-tune the model and verify its effectiveness. However, a traditional FCN easily ignores the rich low-level features and intermediate semantic features, ultimately blurring building boundaries [44]. To this end, some scholars utilized dilated convolution [45] or added skip connections [32] to aggregate multi-scale context. Moreover, to further optimize the extraction results, some researchers turned to post-processing. For example, Chen et al. [46] applied conditional random fields (CRFs) to the network output to improve the quality of boundary extraction. Li et al. [47] proposed a two-step method for building extraction: first, a modified CRF model was used to reduce the edge misclassification effect; then, the prominent features of buildings were exploited to strengthen the expression of building edge information. Shrestha et al. [48] used a CRF as a post-processing tool for segmentation results to improve the quality of building boundaries. To obtain clear and complete building boundaries, some researchers have tried to extract the building boundary directly. For example, Lu et al.
[12] introduced the richer convolutional features (RCF) [49] edge detection network into the field of remote sensing, constructed an RCF-building network, and used terrain surface geometry analysis to refine the extracted building boundaries. Li et al. [50] proposed a new instance segmentation network to detect several key points of a single building, and then used these key points to construct the building boundary. Zorzi et al. [51] used machine learning methods to perform automatic regularization and polygonization of segmentation results: first, an FCN is used to generate a prediction map; then, a generative adversarial network performs boundary regularization; and finally, a convolutional neural network predicts building corners from the regularized results. Chen et al. [10] introduced the PolygonCNN model, which first uses an FCN variant to segment the initial building area, then encodes the building polygon vertices together with the segmentation results, and finally uses an improved PointNet network to predict the polygon vertices and generate vector building results. These studies advance the solution of the vector generation problem, but many experiments are still needed to improve and verify their accuracy and effectiveness.
Although deep learning technology has greatly promoted the development of remote sensing-based building edge extraction methods, producing high-quality building edge extraction results that can be applied in the field of surveying and mapping still faces huge challenges. In the present work, we tested the feasibility of using the semantic segmentation model U²-Net directly for building outline extraction, and further refined the model output by modifying the loss function of the model. The rest of the paper is organized as follows. The approach is described in Section 2. Section 3 presents the experiment, including the dataset introduction, experimental setup, and evaluation metrics. Section 4 reports and discusses the experimental results on the Nanjing and WHU datasets, with comparisons against edge detection and semantic segmentation models. Section 5 provides our conclusions and further research plans.

Methods
The main idea of this study is to use the existing semantic segmentation model U²-Net [52] to extract the edges of buildings in high-resolution remote sensing images in an automated manner. This section provides an overview of the network architecture. The U-type residual blocks are described in Section 2.2, and the loss function in Section 2.3.

Overview of Network Architecture
The U²-Net model was proposed by Qin et al. [52], and its structure is shown in Figure 1. Qin et al. [52] evaluated its performance on six datasets and achieved good results. The model has also made good progress in other areas. For example, Vallejo [53] developed an application to extract the foreground of an image using this model, and Tim [54] introduced an application to remove the background from a video. Finally, Scuderi [53] designed a clipping camera based on this model, which can detect relevant objects in a scene and clip them, and also developed a style transfer program based on the model.

Figure 1. Architecture of the U²-Net [52]. The network as a whole is a U-shaped structure whose basic unit is the RSU; the decoders and the last encoder each produce a side output, and the side outputs are concatenated along the channel dimension and passed through a convolutional layer to produce the fused output.
The U²-Net is a two-level nested U-structure. The outer layer is a large U-structure consisting of 11 stages, each populated by a residual U-block (RSU) (the inner layer). Theoretically, the nested U-structure allows multi-scale and multi-level features to be extracted more efficiently. It consists of three parts: (1) encoder, (2) decoder, and (3) map fusion module, as described below.
(1) There are six stages in the encoder. Each stage is composed of an RSU. In the RSUs of the first four stages, the feature map is down-sampled to increase the receptive field and to capture more large-scale information. In the last two stages, dilated convolution replaces the pooling operation; this prevents the loss of context information, as the receptive field is increased while the feature map is not reduced.
(2) The decoder stages have structures similar to those of the encoder stages. Each decoder stage concatenates the up-sampled feature map from its previous stage with the feature map from its symmetrical encoder stage as the input.
(3) Feature map fusion using a deep supervision strategy is the last stage applied to generate a probability map. The model body produces six side outputs. These outputs are then up-sampled to the size of the input image and fused with a concatenation operation.
To summarize, the U²-Net design not only has a deep architecture with rich multi-scale features, but also has low computing and memory costs. Moreover, as the U²-Net architecture is built solely on RSU blocks and does not use any pre-trained backbones, it is flexible and easily adapted to different working environments with little performance penalty.

Residual U-Blocks
Unlike the plain stacking of convolutional layers, the RSU is a residual module with an embedded U-structure, used to capture multi-scale features. The structure of RSU-L(C_in, M, C_out) is shown in Figure 2, where L is the number of layers in the encoder, C_in and C_out denote the numbers of input and output channels, and M denotes the number of channels in the internal layers of the RSU. The RSU consists of three main parts.

Figure 2. Residual U-block (RSU) [52]. A residual unit composed of a U-shaped structure. The dashed frame is the U-shaped block. The feature map obtained by the convolution operation and the output of the U-shaped block are added to obtain the output feature map.
First, a convolution layer transforms the input feature map (H × W × C_in) into an intermediate map F1(x) with C_out channels. This is a plain convolution layer used to extract local features. Second, a U-shaped symmetric encoder-decoder of height L takes the intermediate feature map F1(x) as input; this structure extracts multi-scale information and reduces the loss of context information caused by up-sampling. Third, the local features and the multi-scale features are fused by a residual connection.
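The three steps above amount to the residual fusion H_RSU(x) = U(F1(x)) + F1(x). A minimal functional sketch is given below; the `conv_in` and `u_block` callables are placeholders standing in for the actual convolution and U-shaped encoder-decoder, not the real U²-Net layers:

```python
def rsu_forward(x, conv_in, u_block):
    """Sketch of the RSU computation: a plain convolution extracts
    local features F1(x); a U-shaped encoder-decoder extracts
    multi-scale features U(F1(x)); a residual connection sums them."""
    f1 = conv_in(x)          # local features with C_out channels
    return u_block(f1) + f1  # residual fusion: U(F1(x)) + F1(x)
```

The residual connection means the U-shaped block only has to learn the multi-scale refinement on top of the local features, which is what lets RSUs stack into a deep network without losing detail.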

Loss
Deep supervision can constrain multiple stages of the network, and its effectiveness has been verified by HED [55], RCF [49], and DexiNed [56]. The total loss is defined in Equation (1):

L = \sum_{m=1}^{M} w_{side}^{(m)} l_{side}^{(m)} + w_{fuse} l_{fuse}    (1)

where w_{side}^{(m)} and w_{fuse} denote the weights of each loss term, and l_{side}^{(m)} indicates the loss of the m-th side output map. In the original version, the loss function uses binary cross entropy to calculate each loss (l_{side}, l_{fuse}), as shown in Equation (2):

l = -\frac{1}{N} \sum_{i=1}^{N} [p_i \log q_i + (1 - p_i) \log(1 - q_i)]    (2)

where p_i and q_i denote the pixel values of the ground truth and of the predicted probability map obtained through the sigmoid function, respectively, and N is the total number of pixels. The training objective is to minimize this loss.
To obtain sharper edges, we replace the binary cross entropy with a multi-class cross entropy, which helps in obtaining thinner edges and directly generating binary maps. It is defined in Equation (3):

l = -\frac{1}{N} \sum_{x=1}^{H} \sum_{y=1}^{W} [\varphi_0(x, y) \log a_0(x, y) + \varphi_1(x, y) \log a_1(x, y)],  N = H \times W    (3)

where a_0 is the probability of the pixel (x, y) being predicted as background, a_1 is the probability of the pixel (x, y) being predicted as building edge, \varphi_0 is the value of the background in the label, \varphi_1 is the value of the edge in the label, H and W are the height and width of the input image, respectively, and N is the total number of pixels. Unlike the probability map generated with binary cross entropy, the output of the model trained with multi-class cross entropy is classified directly into background and building edge: each pixel is limited to these two classes instead of taking a probability value. This makes the difference between the two classes more pronounced and, in turn, the edges more refined.
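Equation (3) can be sketched in NumPy as follows. The softmax normalization and the channel layout (channel 0 = background, channel 1 = edge) are assumptions about the implementation, not details stated in the text:

```python
import numpy as np

def multiclass_edge_loss(logits: np.ndarray, label: np.ndarray) -> float:
    """Two-class (background / building-edge) cross entropy of Equation (3).

    logits: (2, H, W) raw network output; channel 0 = background,
            channel 1 = building edge (assumed layout).
    label:  (H, W) integer map; 0 = background, 1 = edge.
    """
    # softmax over the class channel -> a0, a1 of Equation (3)
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    probs = e / e.sum(axis=0, keepdims=True)
    h, w = label.shape
    n = h * w
    # one-hot ground truth -> phi0, phi1 of Equation (3)
    phi1 = label == 1
    phi0 = ~phi1
    eps = 1e-12  # numerical guard against log(0)
    loss = -(phi0 * np.log(probs[0] + eps)
             + phi1 * np.log(probs[1] + eps)).sum() / n
    return float(loss)
```

Taking the arg-max over the two channels at inference then yields the binary background/edge map directly, with no threshold to tune.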

Experiment
Our model was implemented using the machine learning library PyTorch 1.7.1 and the CUDA toolkit 10.2. The model was trained from scratch with no pre-trained weights. To check the validity and correctness of the model, we conducted a building edge extraction experiment on remote sensing images and compared the results with those of semantic segmentation and edge detection models. This section introduces the dataset details, describes the experimental setup, and presents the evaluation criteria.

Dataset
To evaluate the effectiveness of the method, two building datasets, the Nanjing dataset and the WHU building dataset, were used. The Nanjing dataset was produced by the authors of this study, and the WHU building dataset is an international open-source dataset.
The Nanjing dataset covers a range of building types, shapes, and complex surroundings, as shown in Figure 3. It has three bands (red, green, blue) with a spatial resolution of 0.3 m. We manually annotated all labels. The whole image was cropped into 2376 pieces of 256 × 256 pixels without overlapping, and only 500 images were obtained after eliminating invalid data. We used 400 images as the training and validation sets and 100 images as the test set. Through data augmentation, including flips and rotations, 400 images were expanded to 2000 (80% were used for training and 20% for validation).
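One possible augmentation scheme that grows each tile fivefold (400 → 2000) is sketched below; the exact combination of flips and rotations used is an assumption, since the text only names "flips and rotations":

```python
import numpy as np

def augment(tile: np.ndarray) -> list:
    """Expand one training tile into five variants: the original,
    a vertical flip, a horizontal flip, and 90/270 degree rotations."""
    return [tile,
            np.flipud(tile),    # up-to-down flip
            np.fliplr(tile),    # left-to-right flip
            np.rot90(tile, 1),  # 90 degree rotation
            np.rot90(tile, 3)]  # 270 degree rotation
```

The same transform must be applied to the image and its edge label so that the pair stays aligned.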
The WHU building dataset [57] contains two subsets of satellite and aerial imagery. The satellite imagery is divided into two subsets, containing 204 and 17,388 images, respectively. In this study, we used the aerial imagery, which covers 450 km² of Christchurch, New Zealand, and contains 8188 images of 512 × 512 pixels, down-sampled from a ground resolution of 0.075 m to 0.300 m. The dataset was divided into a training set (4736 images), a test set (2416 images), and a validation set (1036 images).
We made several changes as the labels that did not contain buildings and the corresponding original images were eliminated (training set: 4317, test set: 1731, validation set: 1036). In addition, our label is a binary building region picture, which can be easily converted into a binary building edge picture by the Canny operator, as shown in Figure 4.
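The paper converts binary region labels to edge labels with the Canny operator; an equivalent minimal conversion for clean binary masks (an edge pixel is a building pixel with a 4-connected background neighbour) can be sketched without OpenCV as:

```python
import numpy as np

def region_to_edge(mask: np.ndarray) -> np.ndarray:
    """Convert a binary building-region mask (1 = building) into a
    one-pixel-wide binary edge mask. Edge pixels are building pixels
    with at least one 4-connected background neighbour."""
    m = mask.astype(bool)
    padded = np.pad(m, 1, constant_values=False)
    # a pixel is interior if it and its four neighbours are all building
    interior = (padded[1:-1, 1:-1]
                & padded[:-2, 1:-1] & padded[2:, 1:-1]
                & padded[1:-1, :-2] & padded[1:-1, 2:])
    return (m & ~interior).astype(np.uint8)
```

On an ideal binary mask this yields the same one-pixel boundary a Canny pass would; Canny is only strictly needed when the region labels contain gradients or noise.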

Experimental Setup
Owing to the small data volume of the Nanjing dataset, data augmentation methods were used to expand it, including vertical flips, horizontal flips, and 90° rotations. To verify the effectiveness of our method, we compared it with several other models. As the essence of both the semantic segmentation model and the edge detection model is pixel-level prediction, we compared our method with the current mainstream semantic segmentation models FCN [29], U-Net [32], and Segnet [33], and, in a similar way, with edge detection models from the computer vision field (HED [55], RCF [49], and DexiNed [56]).

We implemented and tested the model in PyTorch on a 64-bit Ubuntu 16.04 system equipped with an NVIDIA Quadro P4000 GPU. The model used the Adam optimizer, starting with a learning rate of 0.001, beta_1 of 0.9, and epsilon of 1e-8. As mentioned, the model was trained from scratch with no pre-trained weights. After each up-sampling stage, the feature map was up-sampled to the original image size through bilinear interpolation, yielding six side outputs in total. The six side outputs were then concatenated along the channel direction to form a feature map with the same size as the input image and six channels, and a fused result was generated through a 1 × 1 convolution. The seven outputs were each compared with the label to calculate the loss, and the sum of the losses was used for back propagation and parameter optimization.
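The side-output fusion step described above can be sketched as follows. Nearest-neighbour upsampling stands in for the bilinear interpolation, and the scalar weights stand in for the learned 1 × 1 convolution kernel, so this is an illustrative simplification rather than the trained operation:

```python
import numpy as np

def fuse_side_outputs(side_maps, weights, bias=0.0):
    """Fuse side outputs U2-Net style: upsample each map to the largest
    (input) size, stack along the channel axis, then combine the
    channels with a 1x1 convolution (a weighted sum per pixel)."""
    th = max(m.shape[0] for m in side_maps)
    tw = max(m.shape[1] for m in side_maps)

    def upsample(m):  # nearest-neighbour stand-in for bilinear
        return np.repeat(np.repeat(m, th // m.shape[0], axis=0),
                         tw // m.shape[1], axis=1)

    stacked = np.stack([upsample(m) for m in side_maps])      # (K, H, W)
    return np.tensordot(np.asarray(weights), stacked, axes=1) + bias
```

In the real network the six side maps plus this fused map give the seven outputs that each contribute a loss term.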


Evaluation Metrics
The outputs of the model are probability maps with the same spatial resolution as the input images. Each pixel of the predicted map has a value within the 0–1 range. The ground truth is a binary mask in which each pixel is either 0 or 1 (1 indicates background pixels; 0 indicates edge pixels).
To evaluate the methodological performance correctly, note that our approach is essentially a variant of the semantic segmentation task. We therefore adopted the four indicators commonly used in semantic segmentation: intersection-over-union (IoU), recall, precision, and F1 score (F1). IoU is the ratio of the intersection to the union of the prediction and the ground truth (Equation (4)). Precision denotes the fraction of the identified "edge" pixels that agree with the ground reference (Equation (5)). Recall expresses how many "edge" pixels in the ground reference are correctly predicted (Equation (6)). F1 is a commonly used index in deep learning, expressing the harmonic mean of precision and recall (Equation (7)):

IoU = TP / (TP + FP + FN)    (4)
Precision = TP / (TP + FP)    (5)
Recall = TP / (TP + FN)    (6)
F1 = 2 × Precision × Recall / (Precision + Recall)    (7)

where TP (true positive) is the number of pixels correctly identified with the class label "edge", FN (false negative) is the number of omitted pixels with the class label "edge", FP (false positive) is the number of "non-edge" pixels in the ground truth that are mislabeled as "edge" by the model, and TN (true negative) is the number of correctly identified pixels with the class label "non-edge".
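Equations (4)-(7) can be sketched as follows, assuming the edge class is taken as the positive class (value 1) in both the thresholded prediction and the ground-truth mask:

```python
import numpy as np

def edge_metrics(pred: np.ndarray, gt: np.ndarray, threshold: float = 0.5):
    """Compute IoU, precision, recall and F1 for a predicted edge
    probability map against a binary ground-truth edge mask (1 = edge)."""
    p = pred >= threshold
    g = gt.astype(bool)
    tp = np.sum(p & g)    # edge pixels correctly predicted
    fp = np.sum(p & ~g)   # background pixels predicted as edge
    fn = np.sum(~p & g)   # edge pixels missed by the model
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return iou, precision, recall, f1
```

Sweeping `threshold` over the test set and keeping the best global or per-image F-measure gives the ODS and OIS scores used in the comparisons below.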

Results and Discussion
To assess the quality of the extraction results, several state-of-the-art models (Segnet [33], U-Net [32], FCN [29], RCF [49], HED [55], and DexiNed [56]) were compared with our approach on the two datasets. Since both edge detection and semantic segmentation models essentially make pixel-level predictions, we compared against both families of models.
In the field of edge detection, non-maximum suppression is typically used as a post-processing step to refine the edges. We eliminated non-maximum suppression and performed end-to-end model comparisons; this is closer to practical use, and the contrast between models is more visible. Therefore, the following quantitative and qualitative analyses are based on the direct output of the models.

Table 1 shows the experimental results of our approach and the three edge detection methods on the Nanjing dataset. As the most effective and popular models for edge detection, RCF [49], HED [55], and DexiNed [56] achieved mIoU scores of 45.33%, 55.94%, and 57.80%, respectively. The IoU indicator is computed separately for the building-edge and non-edge categories; given the imbalance of the samples, we used the mean IoU (mIoU), which considers both. Two evaluation metrics widely exploited in the edge detection community were also adopted: the fixed edge threshold (ODS), which uses a single threshold for all prediction maps in the test set, and the per-image best threshold (OIS), which selects an optimal threshold for each prediction map. The F-measure of both ODS and OIS was used in our experiments. As seen from Table 1, our approach outperforms the other models in all three indicators, suggesting that it has clear advantages for building edge extraction.

Figure 5 shows a qualitative assessment of the various models on the Nanjing dataset. We selected four representative buildings in the test set for display. The 1st row of Figure 5 shows the results for small buildings that touch the image borders. We note that our approach correctly extracted the building edges, whereas the other models were prone to erroneously extracting river boundaries. The 2nd row shows the results for different types of buildings (one type solid blue, the other hollow gray).
As seen, our approach is suitable for identifying both edge types (solid and hollow), while the other models extract only solid building edges or only hollow ones. The 3rd row shows buildings against a similar background, a task that is challenging for all the models. We found that our approach can detect relatively large building edges. Although small buildings similar to the background were ignored, our approach exhibited superior results compared with the other models. The 4th row shows that the proposed approach can work in extremely complex situations.


Comparison with Semantic Segmentation Model
Although only a few scholars have used semantic segmentation models for edge detection, our experiment confirmed their suitability. In terms of ODS and OIS, our method has fundamental advantages over the semantic segmentation models. The results of these semantic segmentation models on the Nanjing test set are summarized in Table 2. Similar to the above, four different building types were compared in Figure 6. The 1st row shows that both Segnet [33] and FCN [29] misinterpret the river margins as building boundaries. Although U-Net [32] correctly identified the building boundaries, they were insufficiently thin. The 2nd and 3rd rows show that our method can fully extract the building edges, also yielding cleaner results. The 4th row indicates that the same conclusions hold for complex buildings with irregular boundaries, thus underlining the effectiveness of our approach.
To summarize, we conducted comparative experiments using the semantic segmentation and edge detection models on the Nanjing dataset to verify whether our approach yields better building recognition results. The experimental results showed that our method provides a more complete and cleaner building edge. At the same time, they showed that the semantic segmentation models obtain a clearer building edge than the edge detection models. We emphasize that the semantic segmentation models do not account for the sample imbalance problem in their loss function, while the edge detection models do.
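The loss-function difference mentioned above can be sketched as follows: a minimal numpy version of the HED-style class-balanced binary cross-entropy next to the plain, unweighted loss typical of segmentation models. The beta weighting follows the HED formulation; everything else is a simplified illustration.

```python
import numpy as np

def balanced_bce(prob, gt, eps=1e-7):
    """HED-style class-balanced binary cross-entropy.

    Edge pixels are rare, so each positive term is weighted by the
    fraction of negatives (beta) and each negative term by the
    fraction of positives (1 - beta)."""
    prob = np.clip(prob, eps, 1 - eps)
    beta = (gt == 0).mean()          # proportion of background pixels
    pos = -beta * np.log(prob[gt == 1]).sum()
    neg = -(1 - beta) * np.log(1 - prob[gt == 0]).sum()
    return (pos + neg) / gt.size

def plain_bce(prob, gt, eps=1e-7):
    """Unweighted binary cross-entropy, as in a vanilla segmentation
    loss: the abundant background pixels dominate the sum."""
    prob = np.clip(prob, eps, 1 - eps)
    return -(gt * np.log(prob) + (1 - gt) * np.log(1 - prob)).mean()
```

On a map where edge pixels form a small minority, the balanced loss down-weights the background terms, which is exactly the compensation the segmentation models omit.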

Comparison with Edge Detection Model
On the one hand, the WHU dataset provides abundant training information; on the other hand, it contains a larger data volume and different building types, thus posing additional challenges for building extraction. A quantitative assessment of the various models on the WHU dataset is given in Table 3. Among the edge detection models, DexiNed [56] shows the best performance, achieving 89.29, 89.54, and 62.22 on ODS-F1, OIS-F1, and mIoU, respectively. Our method exhibited slightly better performance than the other methods. Four representative buildings were selected from the test set to qualitatively assess the experimental results on the WHU dataset (Figure 7). The 1st row shows the results for small buildings located close to each other and near the image boundaries. Although the three edge detection models can extract the building edges, their extractions suffer from artifacts and are ultimately not sufficiently accurate. The 2nd row illustrates performance on buildings of different sizes and shapes. As seen, the RCF [49] and HED [55] models are susceptible to interference from irrelevant information, and their extraction results are not clean enough; DexiNed [56] and our method extract the boundaries of the circular buildings more completely. The 3rd row confirms the strong ability of our method to extract dense, small building edges, while the 4th row shows the results for large and staggered buildings.

Comparison with Semantic Segmentation Model
Likewise, we compared our results with the semantic segmentation models. A quantitative assessment of several methods on the WHU dataset is shown in Table 4. The experimental results indicated that Segnet [33] and U-Net [32] had similar scores on ODS-F1 and OIS-F1, while the mIoU of Segnet [33] was 5.04% lower than that of U-Net [32]. The indicators of the FCN [29] were low. Overall, these experiments confirmed that our method outperforms the other models on these indicators. The results of the models on the WHU test set are shown in Figure 8. We selected four representative buildings in the test set as examples. In particular, Segnet [33] and U-Net [32], alongside our approach, yielded good results. Zhou [58] reported that Segnet [33], being based on an encoding-decoding structure whose convolution kernels are mainly used for image feature extraction, cannot maintain the integrity of the building. However, our experiments suggest that all these encoding-decoding structures provide more acceptable results than FCN [29]. The extraction results of Segnet [33] and U-Net [32] were not clean enough, and some edges were slightly thick, whereas our approach yielded more accurate results.
We conducted experiments on the WHU dataset in a similar way to the Nanjing dataset, applying the semantic segmentation and edge detection models, respectively. The experimental results confirm the applicability of the proposed method. We emphasize that in the domain of computer edge detection (HED [55], RCF [49], and DexiNed [56]), sample imbalance is considered. However, we empirically showed that the semantic segmentation models, which do not consider sample imbalance, performed better than the edge detection models, which do. This finding requires further scientific investigation; the two cases of each model should be compared separately for an explicit interpretation of this phenomenon. Moreover, in the domain of computer edge detection, non-maximum suppression (NMS) is used to optimize the network output to retrieve thinner edges. Our experiment showed that NMS as a post-processing tool is not a prerequisite for acquiring accurate results.
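For context, the NMS step that edge detectors rely on, and that our approach renders unnecessary, can be sketched as follows. This is a simplified Canny-style thinning over an edge-probability map, not the exact implementation used by HED, RCF, or DexiNed.

```python
import numpy as np

def edge_nms(prob):
    """Thin an edge-probability map by non-maximum suppression:
    keep a pixel only if it is a local maximum along the gradient
    direction, quantised to horizontal/vertical/diagonal."""
    h, w = prob.shape
    # Gradient of the probability map (central differences inside).
    gy, gx = np.gradient(prob)
    angle = (np.rad2deg(np.arctan2(gy, gx)) + 180) % 180
    out = np.zeros_like(prob)
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            a = angle[i, j]
            if a < 22.5 or a >= 157.5:        # gradient ~horizontal
                n1, n2 = prob[i, j - 1], prob[i, j + 1]
            elif a < 67.5:                     # ~45 degrees
                n1, n2 = prob[i - 1, j + 1], prob[i + 1, j - 1]
            elif a < 112.5:                    # ~vertical
                n1, n2 = prob[i - 1, j], prob[i + 1, j]
            else:                              # ~135 degrees
                n1, n2 = prob[i + 1, j + 1], prob[i - 1, j - 1]
            if prob[i, j] >= n1 and prob[i, j] >= n2:
                out[i, j] = prob[i, j]
    return out
```

Applied to a thick horizontal edge response, only the peak row survives; our modified model produces a thin binary edge without this extra pass.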
Finally, we found that the indicators on the WHU dataset were much higher than those on the Nanjing dataset. We explain this finding by the presence of many buildings shaded by trees and the very irregular buildings in our data, which hampered manual annotation and model training. Furthermore, although we used data augmentation, the amount of Nanjing data is still only half that of the WHU dataset.

Loss Function for Clearer Edge
A dedicated comparison before and after the model modification is shown in Figure 9.
Here, A and B are two types of buildings from the Nanjing dataset (simple regular buildings and complex irregular buildings); a1 and b1 are the edge detection results before the model modification, and a2 and b2 are the results after the loss function modification. Our modified loss function plays the role of threshold segmentation and produces thinner building edges. C and D are two types of images from the WHU dataset: in C the buildings are sparse, while in D they are dense. Furthermore, c1 and d1 are the results before the model modification, and c2 and d2 are the results after it. The comparison across datasets and building types showed that the modified model, with the function of threshold segmentation built in, directly generates the binary map, and the building edges it produces are also thinner.
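In miniature, the modification amounts to replacing a single sigmoid output with a two-channel (background/edge) output trained with multiclass cross-entropy, so the binary map falls directly out of an argmax with no threshold to tune. A small numpy illustration with made-up logits:

```python
import numpy as np

def softmax(z, axis=0):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multiclass_ce(logits, labels):
    """Per-pixel multiclass cross-entropy over the two classes
    {0: background, 1: building edge}."""
    p = softmax(logits, axis=0)
    rows = labels.ravel()
    idx = np.arange(rows.size)
    picked = p.reshape(2, -1)[rows, idx]
    return -np.log(picked).mean()

# Hypothetical logits for a 1 x 5 strip of pixels (made-up numbers).
logits = np.array([[[2.0, 1.5, -0.5, 1.8, 2.2]],   # background channel
                   [[0.1, 0.4,  1.9, 0.2, 0.0]]])  # edge channel
labels = np.array([[0, 0, 1, 0, 0]])

loss = multiclass_ce(logits, labels)
# The binary edge map falls straight out of argmax -- no threshold
# selection and no non-maximum suppression needed.
binary_map = logits.argmax(axis=0)
```

The decision "edge channel beats background channel" is learned by the loss itself, which is why the output is already a clean binary map.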

Application Prospect in Practical Work
To explore the application prospects of our findings, we implemented an image-to-image program: we input a high-resolution remote sensing image of arbitrary size and output its corresponding building edge map. The program completes a full workflow of image cutting, prediction, and splicing, thus realizing the automatic extraction of building contours. Figure 10 shows a remote sensing image of the Nanjing aerial data with a size of 13,669 × 10,931 pixels and a resolution of 0.3 m, acquired in the same period as the training data. The final output and its corresponding building edge map are shown in the same figure. We selected five representative buildings for this analysis. For regular buildings, the edges are predicted well, as shown in panels (a), (c) and (e) of Figure 10. For irregular buildings, the edges of most can be predicted, but there are pixel discontinuities in some predictions, as shown in panels (b) and (d) of Figure 10. As mentioned, this effect can emerge from the lack of training data and from the many buildings shaded by trees, which hampered our annotation and training.
Figure 9. Comparison of building edge details before and after model modification.

As we did not have the complete WHU image data, we mosaicked part of the test set into a large image with a size of 3072 × 2048 pixels and a resolution of 0.3 m. We input the large image into the program to obtain the corresponding building prediction results, as shown in Figure 11. Five places were randomly selected for a more specific illustration. We found that the building edges can be extracted well, and only a few are discontinuous. The corresponding building edge maps can thus be obtained from the above two images of different sizes. Although a few building edges are discontinuous, most are predicted accurately, and the width of the predicted edges is sufficiently small. This solves the problem of processing large images in memory that had been previously reported by Lu et al. [12,59].
Furthermore, this provides a valuable reference for realizing the automatic edge extraction of buildings.
Figure 10. Image-to-image prediction results on the Nanjing dataset (a-e).
Figure 11. Image-to-image prediction results on the WHU dataset (a-e).
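The cut-predict-splice workflow of the image-to-image program can be sketched as follows. This is a simplified version without tile overlap; `predict_tile` is a placeholder for the trained network, and the tile size and reflect padding are illustrative assumptions rather than the program's actual settings.

```python
import numpy as np

TILE = 512  # tile size fed to the network (illustrative)

def predict_tile(tile):
    """Placeholder for the trained model: takes an H x W x C tile
    and returns an H x W edge map. Here: an all-zero dummy output."""
    return np.zeros(tile.shape[:2], dtype=np.uint8)

def predict_large_image(image, tile=TILE, model=predict_tile):
    """Cut an arbitrarily sized image into tiles, predict each tile,
    and splice the tile predictions back into a full-size edge map.
    Reflect padding on the right/bottom handles sizes that are not
    multiples of the tile size; the result is cropped back."""
    h, w = image.shape[:2]
    ph, pw = -h % tile, -w % tile          # padding to a tile multiple
    padded = np.pad(image, ((0, ph), (0, pw), (0, 0)), mode="reflect")
    out = np.zeros(padded.shape[:2], dtype=np.uint8)
    for y in range(0, padded.shape[0], tile):
        for x in range(0, padded.shape[1], tile):
            out[y:y + tile, x:x + tile] = model(padded[y:y + tile, x:x + tile])
    return out[:h, :w]                      # crop back to the input size
```

Because only one tile is resident in GPU memory at a time, this pattern handles images such as the 13,669 × 10,931 Nanjing scene without the memory problems noted by Lu et al. [12,59].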


Conclusions
This study explored a method for automated building edge extraction from remote sensing images, using a semantic segmentation model to generate a binary map of the building outline. The U²-Net model was used as the experimental model. We conducted two experiments: (a) U²-Net was directly compared with other semantic segmentation models (Segnet [33], U-Net [32], and FCN [29]) and edge detection models (RCF [49], HED [55], and DexiNed [56]) on the WHU and Nanjing datasets; and (b) the original binary cross-entropy loss function in U²-Net was replaced with a multiclass cross-entropy loss function. The experimental results showed that the modified U²-Net stands out with three advantages for building outline extraction. First, it provides a clearer and more precise outline without an NMS post-processing step. Second, it exhibited better performance than both the semantic segmentation models and the edge detection models. Third, the modified U²-Net directly generates a binary map of the building edge with a further refined outline. Overall, it is robust because it is somewhat immune to the sample imbalance problem. Moreover, the image-to-image method has clear prospects for practical applications in building edge extraction. In the future, we plan to use more datasets to validate the modified U²-Net and to further improve the model's accuracy by incorporating an attention mechanism. This could ultimately open a window for the direct generation of vector polygons of building outlines by incorporating other networks such as PointNet. Meanwhile, we also note the recent achievements of graph convolutional networks (GCNs) in remote sensing classification, for example, the fusion of a GCN and a CNN to improve classification accuracy [37]. This provides a good paradigm, and incorporating a GCN for building edge extraction is an interesting direction for further study.