A Multiscale and Multitask Deep Learning Framework for Automatic Building Extraction

Abstract: Detecting buildings, segmenting building footprints, and extracting building edges from high-resolution remote sensing images are vital in applications such as urban planning, change detection, smart cities, and map-making and updating. The tasks of building detection, footprint segmentation, and edge extraction affect each other to a certain extent. However, most previous works have focused on only one of these three tasks and have lacked a multitask learning framework that can solve all of them simultaneously, making it difficult to obtain smooth and complete buildings. This study proposes a novel multiscale and multitask deep learning framework that considers the dependencies among building detection, footprint segmentation, and edge extraction while completing all three tasks. In addition, a multitask feature fusion module is introduced into the deep learning framework to increase the robustness of feature extraction, and a multitask loss function is introduced to balance the training losses among the various tasks to obtain the best training results. Finally, the proposed method is applied to open-source building datasets and large-scale high-resolution remote sensing images and compared with other advanced building extraction methods. To verify the effectiveness of multitask learning, the performance of multitask learning and single-task training is compared in ablation experiments. The experimental results show that the proposed method has advantages over other methods and that multitask learning can effectively improve single-task performance.


Introduction
Buildings, as an integral part of human life, are among the most important elements on a map. Accordingly, building extraction is extremely important in urban planning, land use analysis, and map-making. With the rapid development of earth observation technology, the available spatial resolution of remote sensing imagery has increased year by year. By capturing rich and detailed information on ground objects, high-resolution remote sensing images enable fine extraction of ground objects (e.g., buildings and roads), providing important data support for the automatic extraction of large-scale buildings. However, because of the complexity of high-resolution remote sensing images and the lag in the development of extraction techniques [1], the automatic and precise extraction of buildings in high-resolution remote sensing images has remained a central challenge in remote sensing applications and cartography.
Most early building extraction methods, including mathematical morphology-based methods [2][3][4][5] and methods based on shape, color, and texture features [6][7][8][9], relied on manually extracted features for judgment. However, due to their limited capabilities for image expression, manually designed features are usually applicable only to specific regions and provide minimal support for model generalization. In recent years, deep learning technology has been widely applied in various fields related to computer vision and image processing, such as image classification [10], object detection [11], and image segmentation [12,13]. Considering the similarity between building extraction from remote sensing images and computer vision tasks, some building extraction methods based on deep learning technology [14][15][16] have been applied in remote sensing, raising the intelligence of building extraction methods to a new level. Compared with traditional building extraction methods that rely on manually designed features, deep learning methods have the advantage of powerful feature representation capabilities, enabling them to address more complex tasks.
In the computer vision field, building extraction tasks are commonly divided into three categories: building detection tasks, building footprint segmentation tasks, and building edge extraction tasks.
Building detection involves recognizing each building instance in a remote sensing image, applying object recognition techniques to obtain the location of each building instance as a rectangular bounding box, and determining the quantity of buildings. In recent years, as deep learning technology has rapidly advanced, a series of outstanding object detection algorithms have become available. These algorithms can be roughly divided into two categories. One category includes the two-stage object detection algorithms represented by the region-based convolutional neural network (R-CNN) [17], Fast R-CNN [18] and Faster R-CNN [19] methods, whose main concept is to generate regional proposal boxes first and then input them into a convolutional neural network (CNN) for further classification. The other category includes single-stage object detection algorithms represented by the single-shot multibox detector (SSD) [20] and You Only Look Once (YOLO) [21] series of models, which constitute an end-to-end object detection framework that can directly output the category of each detected object. The aforementioned object detection methods have been applied for building detection in remote sensing images. Based on Faster R-CNN, Ding et al. [22] used deformable convolution to improve the adaptability to arbitrarily shaped collapsed buildings and proposed a new method of estimating the intersected proportion of objects (IPO) to describe the degrees to which bounding boxes intersect, thus offering better detection precision and recall. Bai et al. [23] proposed a Faster R-CNN method based on DRNet and ROI Align and utilized texture information to solve region mismatch problems. Building detection methods can approximately recognize building locations but cannot achieve the higher accuracy of pixel-level segmentation.
Building footprint segmentation refers to the pixel-level segmentation of remote sensing images, in which each pixel in an image is assigned either a building or nonbuilding label. In most building footprint segmentation methods, the fully convolutional network (FCN) architecture [12] or one of its variants is used as the basic architecture, and various measures are implemented to improve the multiscale learning capability of the model. Xie et al. [24] proposed MFCNN, a symmetric CNN with ResNet [25] as the feature extractor, which contains many complex designs, such as dilated convolution units and pyramid feature fusion. MAP-Net, proposed by Zhu et al. [26], has an HRNet-like [27] architecture with multiple feature encoding branches and a channel attention mechanism. Ma et al. [28] proposed the global and multiscale encoder-decoder network (GMEDN), which consists of a U-Net-like [29] network and a nonlocal modeling unit. These methods have greatly enhanced the accuracy of footprint segmentation for differently sized buildings in remote sensing images. However, the aforementioned methods focus only on distinguishing between building and nonbuilding pixel values and rarely closely observe building edge information, often resulting in blurred contours and failure to obtain regular boundaries in the segmentation results.
Building edge extraction refers to marking and extracting the outer boundaries of building instances in remote sensing images. Building edge extraction methods prioritize building boundary information and attempt to reach beyond pixel-level footprint segmentation to directly obtain regular and accurate building contour lines. Recently, some researchers have introduced building boundary information into deep learning networks to improve their building extraction accuracy. Lu et al. [30] adopted a deep learning network to extract building edge probability maps from remote sensing images and applied postprocessing to the edge probability maps based on geometrical morphological analysis to achieve refined building edge extraction. Wu et al. [31] proposed a novel deep FCN architecture, named the boundary regulated network (BRNet) architecture, which utilizes local and global information to simultaneously predict segments and contours to achieve better building segmentation and more accurate contour extraction. Jiwani et al. [32] improved the DeepLabV3+ [33] model by introducing a feature pyramid network (FPN) [34] module to achieve cross-scale feature extraction and designing a special weighted boundary diagram to penalize incorrect predictions of building boundaries. Li et al. [35] combined a graph-based conditional random field model with a segmentation network to preserve clear boundaries and fine-grained segmentation. However, due to the structural diversity of buildings and their complex environments, accurately locating and recognizing building edges remain significant challenges.
As a basis for interpreting remote sensing images, building detection provides a foundation for higher-level interpretation tasks, such as building footprint segmentation and building edge extraction. Building detection determines the general locations for building footprint segmentation and edge extraction, while footprint segmentation and edge extraction enhance building shape features for detection. Simultaneously, building footprint segmentation provides closed shape information for edge extraction, while building edge extraction yields exact boundary information for footprint segmentation. To an extent, building detection, building footprint segmentation, and building edge extraction have a symbiotic relationship of mutual dependence and information complementarity. Nevertheless, although many deep learning-based methods have been used to extract buildings and have achieved good performance, the existing methods have been developed for individual tasks, e.g., footprint segmentation of buildings. Hence, multitask learning frameworks are required to simultaneously perform multiple tasks, e.g., the detection, footprint segmentation, and edge extraction of buildings.
Multitask learning can improve the performance on each task by learning better feature representations from the shared information for multiple related tasks. The classic instance segmentation framework Mask R-CNN [13] is based on the object detection framework Faster R-CNN with the addition of a branch for object mask prediction. It first locates the objects in an image and then segments the target objects in the positioning boxes, effectively combining the semantic segmentation and object detection tasks to facilitate their mutual performance enhancement. Considering the symbiotic relationship between road detection and centerline extraction, Lu et al. [36] proposed the MSMT-RE framework for performing these two tasks simultaneously, and this framework has delivered excellent road detection results. MultiNet [37] consists of a shared encoder and three independent decoders for simultaneously completing the three scene perception tasks: scene classification, object detection, and drivable area segmentation. Wu et al. [38] proposed the panoptic driving perception network YOLOP to simultaneously perform the tasks of traffic object detection, drivable area segmentation, and lane detection, significantly improving performance on each single task. Bischke et al. [39] adopted a multitask learning framework to combine the learning of boundaries and semantic information to improve the semantic segmentation of building boundaries. As seen above, multitask learning methods have been widely applied in segmentation tasks, but there is a lack of multitask learning frameworks that can solve the tasks of building detection, footprint segmentation and edge extraction simultaneously.
To address the abovementioned problems, this study proposes a multiscale and multitask deep learning framework called MultiBuildNet for simultaneously performing the building detection, footprint segmentation, and edge extraction tasks. This framework is also integrated with a multiscale feature fusion network to combine features from different scales, aiming to improve the robustness of feature extraction against complex backgrounds. In addition, to minimize the loss function across the three tasks, this study introduces a multitask loss function that fully considers any deviations between the predicted values and true values in all three tasks to obtain the best training effect.
The main contributions of this study are as follows:
(1) An effective multiscale and multitask deep learning framework is designed that can simultaneously handle the three key tasks in building extraction: building detection, footprint segmentation, and edge extraction.
(2) A multitask loss function is introduced that can address the imbalances between positive and negative samples and among sample categories in the three tasks.
The remainder of this study is organized as follows. The proposed MultiBuildNet framework is introduced in detail in Section 2. Then, Section 3 describes experiments conducted with the proposed method on open-source building datasets and large-scale high-resolution remote sensing images and presents comparisons with other advanced building extraction methods. Ablation experiments conducted to compare the performances of multitask learning and single-task training in order to verify the effectiveness of multitask learning are also reported. Section 4 discusses the performance improvements achieved by the MultiBuildNet framework compared with other deep learning methods as well as its limitations and prospects for future work. Finally, conclusions are presented in Section 5.

Materials and Methods
The MultiBuildNet framework essentially possesses an encoder-decoder architecture. The encoder part consists of a backbone network and a neck network, while the decoder primarily includes specific networks for the three building extraction tasks. As illustrated in Figure 1, the MultiBuildNet framework consists of three main parts: (1) A multiscale feature extraction and fusion network. A spatial pyramid pooling (SPP) [40] module and an FPN module are integrated into the encoder portion of the MultiBuildNet framework. The framework can generate and fuse features at different scales and different semantic levels by performing multiscale feature extraction in the spatial context to improve the robustness of feature extraction.
(2) A building extraction module based on multitask learning. The MultiBuildNet framework includes a multitask learning module based on YOLACT [41], which can simultaneously train networks for building detection, footprint segmentation, and edge extraction.
(3) A multitask loss function. A multitask loss function is introduced into the MultiBuildNet framework to avoid situations in which the multiple tasks are insufficiently collaborative with respect to the network weights and to balance the joint training process across all tasks.
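As a concrete illustration of how an SPP module can mix receptive fields before FPN fusion, here is a minimal PyTorch sketch; the pooling kernel sizes and channel counts are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial pyramid pooling block: parallel max-pooling at several
    kernel sizes, concatenated with the input so the next layer sees
    features pooled over several receptive fields."""
    def __init__(self, in_ch, out_ch, kernels=(5, 9, 13)):
        super().__init__()
        # stride-1 pooling with padding k // 2 preserves spatial size
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernels)
        # a 1x1 conv fuses the concatenated maps back to out_ch channels
        self.fuse = nn.Conv2d(in_ch * (len(kernels) + 1), out_ch, 1)

    def forward(self, x):
        feats = [x] + [p(x) for p in self.pools]
        return self.fuse(torch.cat(feats, dim=1))

y = SPP(64, 128)(torch.randn(1, 64, 32, 32))
# y.shape -> torch.Size([1, 128, 32, 32]): spatial size is preserved
```

Because every pooling branch keeps the spatial resolution, the block can be dropped between the backbone and the FPN without changing feature-map geometry.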

Multiscale Feature Extraction and Fusion Network
Convolutional layers of different sizes can generally extract feature information at different semantic levels. Therefore, this study introduces a multiscale feature extraction and fusion network, which incorporates an SPP module and an FPN module in the encoder portion while conducting multiscale encoding in the spatial context to yield rich spatial information that can be used to fuse features at different scales and different semantic levels, thus enhancing the robustness of feature extraction in complex environments. Figure 2 shows the multiscale feature extraction and fusion network. YOLACT serves as the backbone network. The neck network fuses the features generated by the backbone network and is formed by combining an SPP module and an FPN module. In addition, multiple parallel convolutional layers of different sizes are combined in place of the resampling operation to achieve information extraction at different semantic levels and feature extraction at different scales. The convolutional layer for each level integrates convolution kernels of three different sizes, 1 × 1, 3 × 3, and 5 × 5, each of which generates a corresponding number of feature maps. The cascaded multiscale features are fed to the next convolution operation and pass through three successive convolution operations on the contracting path.
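The parallel multi-kernel design described above can be sketched in PyTorch as follows; the channel counts are illustrative assumptions, since the paper does not specify them:

```python
import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    """Parallel convolutions with 1x1, 3x3, and 5x5 kernels; padding keeps
    the spatial size identical across branches, and the outputs are
    concatenated so the next layer sees features from several receptive
    fields at once."""
    def __init__(self, in_ch, branch_ch):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(in_ch, branch_ch, k, padding=k // 2),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True))
            for k in (1, 3, 5))

    def forward(self, x):
        # cascade (concatenate) the three scale-specific feature maps
        return torch.cat([b(x) for b in self.branches], dim=1)

feat = MultiScaleConv(32, 16)(torch.randn(2, 32, 64, 64))
# feat has 3 * 16 = 48 channels: torch.Size([2, 48, 64, 64])
```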

Building Extraction Module Based on Multitask Learning
As shown in Figure 3, the building extraction module based on multitask learning consists of three networks, namely, a building detection network, a footprint segmentation network, and an edge extraction network, all of which run in parallel during training. The building extraction module takes red-green-blue (RGB) images as input. In the encoder part, there are four repeated groups of convolutional layers, where each group consists of two convolutional (Conv) layers, each followed by a batch normalization (BN) layer and a rectified linear unit (ReLU) activation function. Every group is followed by a max-pooling (Pooling) layer for downsampling. The decoder part primarily consists of the specific networks for each of the three building extraction tasks.
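A minimal PyTorch sketch of one such encoder group follows, under the assumption of 3 × 3 kernels and a channel width that doubles per group (neither is stated explicitly in the text):

```python
import torch
import torch.nn as nn

def conv_group(in_ch, out_ch):
    """One encoder group as described above: two Conv-BN-ReLU layers
    followed by 2x2 max-pooling for downsampling."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2))

# four repeated groups halve the resolution each time: 256 -> 16
encoder = nn.Sequential(conv_group(3, 64), conv_group(64, 128),
                        conv_group(128, 256), conv_group(256, 512))
out = encoder(torch.randn(1, 3, 256, 256))
# out.shape -> torch.Size([1, 512, 16, 16])
```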
The building detection network adopts a multiscale detection scheme based on anchor boxes. The encoder part transfers semantic features from top to bottom through the FPN, while the decoder part uses a path aggregation network (PAN) [42] to transfer localization features from bottom to top. The combination of the FPN and PAN in the building detection network achieves an improved feature fusion effect and allows building detection to be performed using multiscale fusion feature maps in the PAN. Each grid point of the multiscale feature maps is used to generate three prior anchor boxes with different aspect ratios. For each prior anchor box, the building detection network head predicts the position offsets and the longitudinal and transverse scaling factors, together with the probability that the box contains a building instance and the corresponding confidence value.
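The paper does not give the exact anchor parameterization; a common YOLO-style decoding of the predicted offsets and scaling factors is sketched below, purely for illustration:

```python
import math

def _sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def decode_anchor(tx, ty, tw, th, grid_x, grid_y, anchor_w, anchor_h, stride):
    """Illustrative YOLO-style decode: sigmoid offsets keep the box centre
    inside grid cell (grid_x, grid_y), and the exponential terms rescale
    the prior anchor's width and height."""
    cx = (grid_x + _sigmoid(tx)) * stride
    cy = (grid_y + _sigmoid(ty)) * stride
    w = anchor_w * math.exp(tw)
    h = anchor_h * math.exp(th)
    return cx, cy, w, h

# zero predictions centre the box in cell (4, 4) and keep the prior size
print(decode_anchor(0, 0, 0, 0, 4, 4, 32, 48, 8))  # (36.0, 36.0, 32.0, 48.0)
```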
The building footprint segmentation network and building edge extraction network have the same network structure. The final prediction map is obtained by upsampling the underlying feature map from the FPN three times to restore the output feature map to the input image size. To reduce computational costs, the bilinear interpolation method is adopted in the upsampling layers instead of the deconvolution operation, and every upsampling layer is followed by a convolutional layer, a BN layer, and a ReLU activation. Thus, the footprint segmentation and edge extraction networks not only deliver high-performance segmentation results but also ensure an efficient inference speed.
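A minimal PyTorch sketch of such a decoder head follows; the channel widths and the 1/8 input resolution of the FPN map are assumptions made for illustration:

```python
import torch
import torch.nn as nn

def up_block(in_ch, out_ch):
    """One decoder stage: 2x bilinear upsampling (cheaper than a learned
    deconvolution) followed by Conv-BN-ReLU to refine the upsampled map."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))

# three stages take an assumed 1/8-resolution FPN map back to input
# resolution; a final 1x1 conv outputs two classes (building / background)
head = nn.Sequential(up_block(256, 128), up_block(128, 64),
                     up_block(64, 32), nn.Conv2d(32, 2, 1))
pred = head(torch.randn(1, 256, 32, 32))
# pred.shape -> torch.Size([1, 2, 256, 256])
```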

Multitask Loss Function
Multitask learning is actually a multiobjective optimization problem, and one of the complex challenges in solving such a problem is the optimization process itself, i.e., designing an appropriate loss function. In the extreme case in which the loss value for one task is large while those for the other tasks are very small, multitask learning approximately degenerates into single-task learning and loses the advantage of multitask information sharing. Therefore, to avoid a situation in which one or more tasks dominate the network weights during training, an appropriate loss function is necessary to balance the loss values of the tasks. To this end, this study introduces a multitask loss function L_ext, which consists of three components: a building detection loss function L_det, a building footprint segmentation loss function L_seg, and a building edge extraction loss function L_edg. The symbols used for the loss functions in this paper and their meanings are shown in Table 1 (e.g., L_CIoU denotes the complete intersection over union loss function and L_ce the cross-entropy loss function). The multitask loss function is a weighted sum of all three parts, as shown in Equation (1):

L_ext = λ1 L_det + λ2 L_seg + λ3 L_edg, (1)

where λ1, λ2, and λ3 are weighting coefficients that balance the three tasks. The building detection loss function is a weighted sum of the classification loss L_cla, the objective loss L_obj, and the bounding box loss L_box, as shown in Equation (2):

L_det = μ1 L_cla + μ2 L_obj + μ3 L_box. (2)

The focal loss L_fl is adopted for the classification loss and the objective loss to control the weights of positive and negative samples and of difficult and easy samples. Two penalty factors are introduced to reduce the weights of easy samples so that the model can focus more on difficult samples during training, thus addressing the issue of unbalanced sample categories. The classification loss penalizes errors in the binary classification of building/nonbuilding samples, while the objective loss penalizes the confidence of building instance predictions.
The complete intersection over union loss function L_CIoU is adopted for the bounding box loss. It penalizes the predicted frame and takes into account three important geometric factors, namely, the overlap area between the predicted and real frames, the center-point distance, and the aspect ratio, as shown in Figure 4. In this figure, the solid yellow rectangle represents the real frame for building detection, the solid green rectangle represents the predicted frame for building detection, the dotted black rectangle indicates the minimum closed area that can cover both the predicted and real frames, c is the diagonal length of that minimum closed area, and d is the distance between the center points of the predicted and real frames. The focal loss and the complete intersection over union loss are calculated as shown in Equations (3) and (4), respectively:

L_fl = -α p (1 - p̂)^γ log(p̂) - (1 - α)(1 - p) p̂^γ log(1 - p̂), (3)

L_CIoU = 1 - IoU + d²/c² + βv, with v = (4/π²)(arctan(w_gt/h_gt) - arctan(w/h))². (4)

For the building footprint segmentation loss function, the cross-entropy loss L_ce is adopted to minimize the classification errors between the predicted pixels and building instances, as shown in Equation (5):

L_seg = L_ce = -[p log(p̂) + (1 - p) log(1 - p̂)]. (5)
On the basis of the cross-entropy loss, the building edge extraction loss function additionally includes the complete intersection over union loss to improve the prediction performance for building edges in sparse areas, as shown in Equation (6):

L_edg = L_ce + L_CIoU, (6)

where p̂ is the model-predicted probability for a sample and p is the sample label (p is 1 for positive samples and 0 otherwise); w, h, w_gt, and h_gt represent the widths and heights of the predicted and real frames, respectively; IoU is the intersection over union of the building pixels, i.e., the ratio between the intersection and union of the predicted and real building pixels; v measures the similarity between the aspect ratios of the predicted and real building frames; α is a penalty factor for negative samples, with a range of [0, 1], used to control the ratio of positive to negative samples; γ is the modulation coefficient, with a range of [0, +∞), used to distinguish the complexity of samples and enable the model to focus more on difficult samples by reducing the weights of easy samples during training; β is a balance factor for the overlapping areas of the predicted and real building frames; b and b_gt represent the center points of the predicted and real frames, respectively; d is the distance between these center points; and c is the diagonal length of the smallest closed area that can cover both the predicted and real frames.
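For illustration, the per-sample focal loss and the CIoU loss can be written directly from the symbol definitions above; the default α and γ values below are the common choices from the focal loss literature, not values stated in the paper:

```python
import math

def focal_loss(p_hat, p, alpha=0.25, gamma=2.0):
    """Focal loss for one sample: p is the 0/1 label and p_hat the
    predicted probability; alpha balances positive/negative samples and
    gamma down-weights easy samples."""
    if p == 1:
        return -alpha * (1 - p_hat) ** gamma * math.log(p_hat)
    return -(1 - alpha) * p_hat ** gamma * math.log(1 - p_hat)

def ciou_loss(iou, d, c, w, h, w_gt, h_gt):
    """CIoU loss: combines the overlap term 1 - IoU, the normalized centre
    distance d^2/c^2, and the aspect-ratio term beta * v."""
    v = (4 / math.pi ** 2) * (math.atan(w_gt / h_gt) - math.atan(w / h)) ** 2
    denom = (1 - iou) + v
    beta = v / denom if denom > 0 else 0.0
    return 1 - iou + d ** 2 / c ** 2 + beta * v

# an easy, confident positive contributes far less than a hard one,
# and a perfect box prediction incurs zero CIoU loss
print(focal_loss(0.9, 1) < focal_loss(0.6, 1))  # True
print(ciou_loss(1.0, 0.0, 10.0, 4, 3, 4, 3))    # 0.0
```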

Experimental Setup
In this study, three groups of experiments were conducted to verify the effectiveness of the proposed MultiBuildNet framework: experiments based on open-source building datasets, experiments based on large-scale high-resolution remote sensing images, and ablation experiments. The hardware used for the experiments has the following specifications: an Intel(R) Xeon(R) Gold 6230 CPU @ 2.10 GHz, 128 GB of RAM, and an NVIDIA (https://www.nvidia.cn/) TU102GL (Quadro RTX 6000/8000) graphics card, with Ubuntu 20.04 as the operating system and Python 3.7 and PyTorch (https://pytorch.org/) 1.7.1 as the programming environments.

Accuracy Evaluation Indicators
To thoroughly and quantitatively evaluate the performance of the MultiBuildNet framework, Recall and AP50 are adopted in this study as evaluation indicators for building detection accuracy. In addition, six indicators, specifically Recall, Precision, F1-score, IoU (intersection over union), mIoU (mean IoU), and Kappa, are used to evaluate the precision of building footprint segmentation and edge extraction. Recall refers to the proportion of correctly predicted building pixels among all real building pixels, Precision refers to the proportion of correctly predicted building pixels among all predicted building pixels, the F1-score is a combined metric considering both Recall and Precision, IoU refers to the ratio between the intersection and union of the predicted and real building pixels, mIoU is the mean of the IoU values for buildings and background, Kappa reflects the accuracy for both buildings and background, and AP50 represents the building detection accuracy when detected buildings with an IoU greater than 50% are regarded as correct. The above indicators are calculated using the following equations [43][44][45]:

Recall = TP / (TP + FN),
Precision = TP / (TP + FP),
F1-score = 2 × Precision × Recall / (Precision + Recall),
IoU = TP / (TP + FP + FN),

where TP represents the number of correctly predicted building pixels, TN represents the number of correctly predicted background pixels, FP represents the number of background pixels incorrectly predicted to be building pixels, and FN represents the number of building pixels incorrectly predicted to be background pixels.
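These indicators can be computed from the four confusion counts as follows; mIoU and Kappa use their standard definitions, which match the verbal descriptions above:

```python
def metrics(tp, tn, fp, fn):
    """Pixel-level accuracy indicators from the confusion counts: TP/TN
    are correctly predicted building/background pixels, FP/FN the
    corresponding misclassifications."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)             # building IoU
    iou_bg = tn / (tn + fp + fn)          # background IoU
    miou = (iou + iou_bg) / 2
    total = tp + tn + fp + fn
    po = (tp + tn) / total                # observed agreement
    pe = ((tp + fp) * (tp + fn)
          + (tn + fn) * (tn + fp)) / total ** 2   # chance agreement
    kappa = (po - pe) / (1 - pe)
    return {"Recall": recall, "Precision": precision, "F1": f1,
            "IoU": iou, "mIoU": miou, "Kappa": kappa}

m = metrics(tp=80, tn=90, fp=10, fn=20)
print(round(m["Recall"], 3), round(m["Kappa"], 3))  # 0.8 0.7
```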

Dataset
In this experiment, we used the WHU building dataset (WHU) [14], the Massachusetts buildings dataset (Massachusetts) [46] and the remote sensing imagery for building extraction (RSIBE) dataset [47] to evaluate the performance of the proposed MultiBuildNet framework. The original images in these datasets are orthoimages derived from very high-resolution aerial imagery. The details of each dataset are listed in Table 2. Among the above datasets, the RSIBE dataset natively supports multitask learning. We reconstructed the detection labels and edge labels from the segmentation labels provided by the WHU and Massachusetts datasets for multitask building learning. Figure 5 shows the dataset images and labels used in our experiment.
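As a sketch of how detection and edge labels can be reconstructed from a segmentation mask (here for a single-instance binary mask; real datasets additionally require per-instance connected-component labeling, which is omitted):

```python
def bbox_from_mask(mask):
    """Axis-aligned detection label (x_min, y_min, x_max, y_max) covering
    all foreground pixels of a binary mask given as rows of 0/1 values."""
    pts = [(x, y) for y, row in enumerate(mask)
           for x, v in enumerate(row) if v]
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    return min(xs), min(ys), max(xs), max(ys)

def edge_from_mask(mask):
    """Edge label: a foreground pixel is an edge pixel if any of its
    4-neighbours is background or lies outside the image."""
    h, w = len(mask), len(mask[0])
    edge = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            nbrs = [(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)]
            if any(not (0 <= ny < h and 0 <= nx < w) or not mask[ny][nx]
                   for ny, nx in nbrs):
                edge[y][x] = 1
    return edge

mask = [[0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0]]
print(bbox_from_mask(mask))        # (1, 1, 3, 3)
print(edge_from_mask(mask)[2][2])  # 0 -- interior pixel, not an edge
```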

Results of Building Detection
In this experiment, the MultiBuildNet framework was used to perform building detection on the three open-source building datasets and was compared with four classic object detection methods: YOLOv4, Faster R-CNN, Mask R-CNN, and YOLACT. The visualized results are shown in Figures 6-8. Among the tested methods, YOLOv4 and Faster R-CNN can only detect buildings, while Mask R-CNN, YOLACT and MultiBuildNet can simultaneously detect buildings and segment building footprints. Figures 6-8 indicate that the MultiBuildNet framework has an advantage over the other models when detecting small buildings, adjacent buildings, buildings with unique shapes, and building groups with large dimensional differences. Other models often fail to detect small buildings shaded by trees, while the MultiBuildNet framework can successfully adapt to building detection tasks in different scenarios. Tables 3-5 compare the quantitative results for the building detection accuracy of the tested methods. According to Tables 3-5, all models achieve satisfactory building detection results on the three open-source datasets, but the method proposed in this study is superior in terms of both Recall and AP50.

Results of Building Footprint Segmentation
In this experiment, the MultiBuildNet framework was utilized to conduct building footprint segmentation on the three open-source building datasets and was compared with four classic deep learning methods: U-Net, PSPNet [48], HRNet and DeepLabV3+. The visualized results are shown in Figures 9-11. Figures 9-11 indicate that these deep learning methods achieve satisfactory results on the three open-source datasets, and all of them can fundamentally and accurately recognize buildings in remote sensing images. However, due to the limited range of the receptive fields obtained by their convolution kernels, U-Net, PSPNet, HRNet, and DeepLabV3+ are prone to producing holes in the extracted buildings, and fragmentation can easily occur in areas with dense buildings. In contrast, the MultiBuildNet framework adopts a multiscale feature fusion strategy. It can extract complete large buildings and obtain segmentation results with regular contours, and it does not often overlook small buildings shaded by trees. Even for circular buildings that are difficult to extract, MultiBuildNet can deliver better results than the other models. Tables 6-8 compare the results for the building footprint segmentation accuracy of the above methods. As indicated in the tables, the MultiBuildNet framework is superior to the other deep learning models on the RSIBE dataset in terms of all indicators, with an IoU of 93.54%, a Precision of 95.23%, and a Recall of 96.34%, presenting significant performance advantages.

Table 6. Accuracy of building footprint segmentation on the WHU dataset.

Table 8. Accuracy of building footprint segmentation on the RSIBE dataset.

Results of Building Edge Extraction
In contrast to deep learning networks such as U-Net, PSPNet, HRNet, and DeepLabV3+, the MultiBuildNet framework proposed in this study can refine the footprint segmentation results to additionally achieve building edge extraction. Figures 12-14 show the visual results of building edge extraction on the three open-source building datasets obtained using the MultiBuildNet framework. As shown in Figures 12-14, the MultiBuildNet framework can improve unclear building edges to some extent and obtain complete boundaries for building instances of various shapes. However, it still has shortcomings. For example, it still has difficulty extracting the edges of small buildings. Moreover, although the right-angle features of buildings can be effectively maintained, the problem of how to construct the edges of irregular buildings still requires attention. Figure 15 shows the accuracy evaluation of building edge extraction on the three open-source datasets using the MultiBuildNet framework. The IoU of buildings trained and tested on the RSIBE dataset is 19.86% higher than that on the Massachusetts dataset because the RSIBE dataset has high-resolution images and high-accuracy labeling. Meanwhile, the Massachusetts dataset directly uses OSM data as labels, which leads to its relatively poor labeling quality.

Experiments Based on Large-Scale High-Resolution Remote Sensing Images
Due to limited GPU memory, it is often necessary to crop remote sensing images into smaller images when deep learning methods are adopted for automatic building extraction. Hence, deep learning models perform well when inferring the contents of smaller cropped images but poorly when conducting direct inference on larger remote sensing images. In addition, when smaller images are spliced to obtain the building extraction results for a larger area, inaccurate segmentation commonly occurs for objects whose edges have been cut by the cropping process. To verify the MultiBuildNet framework's performance in building footprint segmentation on large-scale high-resolution remote sensing images, three large-scale aerial images of Christchurch, New Zealand, were selected for experimentation, each with a resolution of 0.075 m. Residential, commercial, and industrial areas all appear in the three selected images, each of which has a size of 10,000 × 10,000 pixels. Figure 16 shows the visualized results of building footprint segmentation obtained from these large-scale high-resolution remote sensing images with U-Net, PSPNet, and MultiBuildNet. The building labels and the segmentation results of U-Net, PSPNet, and MultiBuildNet are shown from top to bottom below the original image. As seen from Figure 16, U-Net performs well when extracting relatively regular buildings in residential areas but yields a large number of missed extractions and incomplete segmentations in commercial and industrial areas. In comparison, although PSPNet can achieve relatively complete building extraction, its extraction effect is poor: it cannot maintain the inherent features of buildings, resulting in tortuous edges and unclear right-angle features in the extracted buildings. Meanwhile, MultiBuildNet demonstrates strong feature extraction capabilities in all three areas, with complete building edge extraction and few missed or incorrect extractions.
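One common way to mitigate the edge-cutting problem described above is to crop with overlapping windows and discard tile borders when stitching; the paper does not state its exact scheme, but the window computation can be sketched as:

```python
def tile_windows(width, height, tile, overlap):
    """Top-left corners of overlapping square crops covering a large
    image. The stride is tile - overlap; the final row/column is clamped
    to the image border so every pixel is covered."""
    stride = tile - overlap
    xs = list(range(0, max(width - tile, 0) + 1, stride))
    if xs[-1] + tile < width:
        xs.append(width - tile)   # clamp the last column to the border
    ys = list(range(0, max(height - tile, 0) + 1, stride))
    if ys[-1] + tile < height:
        ys.append(height - tile)  # clamp the last row to the border
    return [(x, y) for y in ys for x in xs]

# 512-pixel tiles with a 64-pixel overlap over a 10,000 x 10,000 image
wins = tile_windows(10000, 10000, 512, 64)
# every window stays inside the image, and the last one touches the corner
```

When the per-tile predictions are stitched back together, keeping only each tile's central region (inside the overlap margin) avoids the truncated-object artifacts at crop boundaries.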
These results indicate the significant advantages that MultiBuildNet gains through multitask learning and multifeature fusion. Figure 17 shows a quantitative accuracy comparison of the three deep learning methods when extracting building footprints from large-scale remote sensing images. Compared with U-Net and PSPNet, MultiBuildNet has obvious advantages in extracting buildings from large-scale remote sensing images.
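The tiling-and-stitching issue described above can be mitigated with overlapped sliding-window inference, in which overlapping tile predictions are averaged so that buildings cut by a tile border are still seen whole by at least one tile. The sketch below illustrates this standard practice and is not the exact inference pipeline used in our experiments; `predict` stands in for any per-tile segmentation model, and the tile size and overlap are arbitrary example values.

```python
import numpy as np

def sliding_window_inference(image, predict, tile=512, overlap=64):
    """Tile a large image, run `predict` on each tile, and blend
    overlapping tile probabilities by averaging to reduce seam artifacts."""
    h, w = image.shape[:2]
    stride = tile - overlap
    prob = np.zeros((h, w), dtype=np.float32)   # accumulated probabilities
    count = np.zeros((h, w), dtype=np.float32)  # times each pixel was predicted
    for y in range(0, max(h - overlap, 1), stride):
        for x in range(0, max(w - overlap, 1), stride):
            # Clamp the tile so it never extends past the image border.
            y0 = max(min(y, h - tile), 0)
            x0 = max(min(x, w - tile), 0)
            y1, x1 = min(y0 + tile, h), min(x0 + tile, w)
            prob[y0:y1, x0:x1] += predict(image[y0:y1, x0:x1])
            count[y0:y1, x0:x1] += 1.0
    return prob / np.maximum(count, 1.0)
```

In practice, a weighted blend (e.g., a Gaussian window per tile) can further downweight tile borders, where predictions are least reliable.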

Ablation Experiment
Ablation experiments were designed to compare the performance of the multitask and single-task learning strategies, to verify the effectiveness of multitask learning, and to clarify the role of each network in the multitask learning framework. In these experiments, we tested the building detection (Det), footprint segmentation (Seg), and edge extraction (Edg) modules of the MultiBuildNet framework, as well as various combinations thereof, on the RSIBE dataset. Table 9 compares the performance of the two strategies on each individual task. The models using the multitask learning strategy outperform the models adopting the single-task learning strategy in terms of all indicators. Similar tasks share relevant features because of their underlying commonality. In the multitask learning strategy, the tasks of building footprint segmentation and edge extraction are highly similar; therefore, combining these two tasks is especially helpful in improving the network performance. The building detection network can use the global information of building objects to better identify buildings and reduce false and missed detections. The footprint segmentation network can use the local information of surrounding pixels to segment buildings more completely and capture more details. The building edge extraction network can make use of the global information of polygons to generate better contours and obtain smoother building boundaries.
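Training the Det, Seg, and Edg branches jointly requires balancing their loss terms, as noted in the introduction. One widely used balancing scheme, shown here purely as an illustration and not necessarily the formulation adopted in this paper, is homoscedastic uncertainty weighting (Kendall et al., 2018), in which each task loss is scaled by a learned precision and a regularization term discourages the network from simply inflating all uncertainties:

```python
import numpy as np

def combined_loss(losses, log_vars):
    """Uncertainty-weighted multitask loss:
    total = sum_i exp(-s_i) * L_i + s_i, with s_i = log(sigma_i^2)
    a learnable per-task log-variance. Tasks the model is uncertain
    about (large s_i) are automatically downweighted."""
    total = 0.0
    for loss, s in zip(losses, log_vars):
        total += np.exp(-s) * loss + s
    return total

# Hypothetical per-task losses for the Det, Seg, and Edg branches;
# in a real framework, log_vars would be trainable parameters.
task_losses = [1.0, 2.0, 3.0]
log_vars = [0.0, 0.0, 0.0]
total = combined_loss(task_losses, log_vars)
```

With all log-variances at zero the scheme reduces to a plain sum of the task losses; as training proceeds, the learned log-variances rebalance the tasks automatically.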

Regarding the Proposed MultiBuildNet Framework
The multitask learning framework proposed in this paper can accomplish building detection, building footprint segmentation, and building edge extraction simultaneously and thus yields smooth and complete building extraction results. Compared with classic building extraction methods, the proposed MultiBuildNet framework captures more details and obtains more consistent results. The ablation experiments show that the mutual dependence among building detection, footprint segmentation, and edge extraction is helpful for feature fusion and extraction. Because of its multitask learning capabilities, the MultiBuildNet framework can utilize both local information from surrounding pixels to segment buildings and global information from polygons to generate building outlines, thus achieving superior performance.

Limitations and Future Work
The MultiBuildNet framework is currently suitable for the pixel-level semantic segmentation of buildings, but it cannot generate regular building polygons. When the framework was tested on large-scale images whose data distribution differed from that of the training set, its building extraction ability decreased because of its limited transfer learning ability. In future work, a more general deep learning framework will be studied for application to data from different sources. In addition to network learning, postprocessing could help vectorize irregular building edges and improve the network predictions. For example, Zhao et al. [49] applied building boundary regularization in the postprocessing stage to produce better regularized polygons and achieved good performance. Our team has previously carried out extensive research on building simplification [50]. Therefore, in future research, we will consider integrating building simplification techniques into the network learning process to achieve end-to-end mapping.
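As a concrete example of the kind of boundary postprocessing discussed above, classical Douglas-Peucker simplification removes insignificant vertices from a traced building outline while retaining true corners. The sketch below is illustrative only; it is neither the regularization method of [49] nor the simplification approach of [50], and the tolerance `eps` is an arbitrary example value in pixel units.

```python
import numpy as np

def douglas_peucker(points, eps):
    """Simplify an open polyline with the Douglas-Peucker algorithm:
    a vertex is kept only if it deviates from the chord joining the
    segment endpoints by more than `eps`."""
    points = np.asarray(points, dtype=float)
    if len(points) < 3:
        return points.tolist()
    start, end = points[0], points[-1]
    dx, dy = end - start
    norm = np.hypot(dx, dy)
    if norm == 0.0:
        # Degenerate chord (coincident endpoints): use point distance.
        dists = np.hypot(points[:, 0] - start[0], points[:, 1] - start[1])
    else:
        # Perpendicular distance from each vertex to the chord.
        dists = np.abs(dx * (points[:, 1] - start[1])
                       - dy * (points[:, 0] - start[0])) / norm
    i = int(np.argmax(dists))
    if dists[i] > eps:
        # Keep the farthest vertex and recurse on both halves.
        left = douglas_peucker(points[:i + 1], eps)
        right = douglas_peucker(points[i:], eps)
        return left[:-1] + right
    return [points[0].tolist(), points[-1].tolist()]
```

Applied to a noisy outline such as `[(0, 0), (1, 0.01), (2, 0), (3, 1), (4, 0)]` with `eps=0.1`, the small 0.01-pixel jitter at `(1, 0.01)` is removed while the genuine corner at `(3, 1)` is preserved. Building-specific regularizers additionally snap edges to dominant orientations to restore right angles, which plain Douglas-Peucker does not do.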

Conclusions
This study proposes the MultiBuildNet framework to address the common problems associated with the extraction of buildings from high-resolution remote sensing images, such as incorrect extraction, missed extraction, insufficient integrity, and inaccurate boundaries. The proposed method can generate and fuse features at different scales and different semantic levels by means of its multiscale feature extraction and fusion network and learn rich semantic information in the spatial context, thereby improving the robustness of feature extraction against complex backgrounds. In addition, a multitask learning strategy is adopted to simultaneously perform the tasks of building detection, footprint segmentation, and edge extraction. Through the utilization of shared information among multiple tasks, this strategy allows better feature representations to be learned, thereby improving the performance on each task. Experiments on open-source building datasets and large-scale high-resolution remote sensing images indicate that the multiscale and multitask learning framework proposed in this study demonstrates considerable performance superiority over other deep learning methods and can better adapt to building extraction tasks in different scenarios. However, the method is currently applicable only to the raster-level semantic segmentation and instance segmentation of buildings and still produces pixel-based building patches. Methods to extend the building extraction process from the raster level to the vector level and to extract building vector polygons directly from remote sensing images merit further study.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.

Acknowledgments:
The authors would like to thank the anonymous reviewers for their constructive comments, which considerably improved the quality of the final manuscript, and also express their gratitude to Springer Nature Author Services (https://authorservices.springernature.cn/, accessed on 15 September 2022) for the expert linguistic services provided.

Conflicts of Interest:
The authors declare no conflict of interest.