MSL-Net: An Efficient Network for Building Extraction from Aerial Imagery

Abstract: Several challenges are encountered in the task of extracting buildings from aerial imagery using convolutional neural networks (CNNs). First, the tremendous complexity of existing building extraction networks impedes their practical application. In addition, it is arduous for networks to sufficiently utilize the various building features in different images. To address these challenges, we propose an efficient network called MSL-Net that focuses on both multiscale building features and multilevel image features. First, we use depthwise separable convolution (DSC) to significantly reduce the network complexity, and then we embed a group normalization (GN) layer in the inverted residual structure to alleviate network performance degradation. Furthermore, we extract multiscale building features through an atrous spatial pyramid pooling (ASPP) module and apply long skip connections to establish long-distance dependence to fuse features at different levels of the given image. Finally, we add a deformable convolution network layer before the pixel classification step to enhance the feature extraction capability of MSL-Net for buildings with irregular shapes. The experimental results obtained on three publicly available datasets demonstrate that our proposed method achieves state-of-the-art accuracy with a faster inference speed than that of competing approaches. Specifically, the proposed MSL-Net achieves 90.4%, 81.1% and 70.9% intersection over union (IoU) values on the WHU Building Aerial Imagery dataset, Inria Aerial Image Labeling dataset and Massachusetts Buildings dataset, respectively, with an inference speed of 101.4 frames per second (FPS) for an input image of size 3 × 512 × 512 on an NVIDIA RTX 3090 GPU. With an excellent tradeoff between accuracy and speed, our proposed MSL-Net may hold great promise for use in building extraction tasks.


Introduction
As the main gathering places for human production and living, buildings are crucial indicators for monitoring urbanization, and the extraction of buildings is playing an increasingly notable role in the study of urbanization [1]. Currently, various data sources, such as satellite images, aerial imagery, and point cloud data, are available for building extraction tasks. Among them, aerial images have high spatial resolutions and are easy to obtain. Building extraction from aerial imagery is critical for urban planning [2], population estimation [3] and digital cartography [4].
In building extraction tasks, considerable human labor will be consumed if all buildings are manually annotated. Therefore, how to extract buildings with algorithms rather than human experts is an immediate challenge to be addressed. Traditional extraction methods can generally be divided into feature detection-based methods [5][6][7], area segmentation-based methods [8][9][10][11] and auxiliary information-combined methods [12][13][14][15][16][17][18]. However, based on handcrafted features such as spectral, shadow, and texture features, these traditional methods can only process the low-or mid-level information contained in images, and their building extraction results usually have poor accuracy and integrity [19]. Traditional methods are not sufficiently intelligent and often require tedious parameter tuning steps. With aerial image acquisition becoming easier, if buildings can be automatically extracted by algorithms in real time, the efficiency of the extraction task will be significantly improved, and human labor costs will be dramatically reduced.
Numerous semantic segmentation methods based on deep learning have been developed in recent years. Compared with traditional methods, semantic building segmentation methods based on deep learning approaches are capable of obtaining and utilizing the high-level features contained in images. Applicable for fully automatic semantic segmentation, trained deep learning models hold great promise in building extraction tasks. In fully convolutional networks (FCNs) [20], the fully connected layers in the convolutional neural network (CNN) structure are replaced with convolutional layers, and some researchers [21,22] have used variants of FCNs to automatically extract buildings, as they eliminate the jagged edges encountered when segmenting blocky regions and achieve obviously improved segmentation accuracy. However, as early semantic segmentation models, FCNs are limited in their ability to utilize the contextual information contained in an image, leading to discontinuities and holes in the segmentation results.
To fully obtain and utilize the features in images, two main approaches are currently available for improving semantic segmentation models.
(1) For feature maps, feature pyramids can be applied to enlarge their receptive fields and obtain multiscale target features. The pyramid scene parsing network (PSPNet) [23] fuses four different scales of feature maps in parallel via a pyramid pooling module, improving the network's ability to obtain multiscale information. In DeepLabv3 [24], an atrous spatial pyramid pooling (ASPP) structure is adopted to enlarge the receptive fields and, thus, has a significant advantage in large object segmentation.
To restore more building contour information, Xu [25] enhanced the combination of an encoder and a decoder based on a DeepLabv3+ [26] network embedded with an ASPP module. (2) For input images, skip connections are applied to fuse different levels of feature maps to obtain multilevel image features. The level of image features increases as the network layers deepen. Low-level features provide the basis for object category detection, and high-level features facilitate accurate segmentation and positioning. U-Net [27] uses long skip connections to integrate low-level features with high-level features and has high performance in medical image segmentation. Improved from U-Net, networks such as IEU-Net [28], HA U-Net [29] and EMU-CNN [30] have performed well. The MPRSU-Net [31] was constructed by combining long and short skip connections, alleviating the holes and fragmentary edges in the segmentation results obtained when extracting large buildings.
Researchers [32][33][34] have also considered both types of approaches and constructed new building segmentation models by combining feature pyramid modules and skip connections, notably enhancing the efficiency of the building extraction task and the generalization capacities of the developed models. Nevertheless, the majority of existing building extraction methods fail to address model applicability, resulting in considerable computational complexity and tedious parameter tuning steps. These problems limit their deployment in practical applications such as disaster/emergency response [35], damage assessment [36,37], and military reconnaissance [38] that require high algorithmic efficiency. Therefore, to facilitate the practical application of our method, we reduce the model complexity and propose a network with an "encoder-decoder" structure called MSL-Net, which is capable of obtaining and utilizing both multiscale and multilevel features, where "M" represents the prefix "multi", "S" represents "scale", "L" represents "level", and "Net" represents "Network". The key contributions are as follows.
1. In the encoding stage of MSL-Net, we introduce the MobileNetV2 [39] architecture to extract multilevel features. The inverted residual blocks in MobileNetV2 are constructed as bottlenecks using depthwise separable convolution (DSC) [40] and group normalization (GN) operations [41], which noticeably reduce the model complexity while improving its training and inference speeds. The multiscale features are extracted by an ASPP module to enhance the ability of the model to recognize multiscale buildings. 2. In the decoding stage of MSL-Net, long skip connections [42] are applied to establish a long-distance dependence between the feature encoding and feature decoding layers. This long-distance dependence is beneficial for obtaining the rich hierarchical features of an image and effectively preventing holes in the segmentation results [31]. Before performing pixel classification, a deformable convolution network (DCN) layer [43] is added to ensure strong model robustness even when extracting buildings with irregular shapes.

MSL-Net Architecture
The encoder and decoder in MSL-Net are shown in Figure 1. The function of the encoder is to extract image features in a layer-by-layer manner. As the network layers gradually deepen, the feature map becomes more abstract; nonetheless, the extracted semantic information becomes richer, which is beneficial for classifying each pixel in the input image. In MSL-Net, the lightweight MobileNetV2 network with inverted residual blocks is introduced as the backbone to extract the original image features. We embed a GN layer in the inverted residual block, and three levels of feature maps with channels × height × width values of 24 × 128 × 128, 32 × 64 × 64, and 320 × 64 × 64 are output. Since the image downsampling operation during feature extraction lowers the feature map resolution and causes a partial loss of spatial information, the downsampling rate is limited to 8, and only three downsampling operations are performed. An ASPP module is used to extract multiscale features from the output high-level feature maps.

The decoder outputs prediction results with the same size as that of the original input image; these results are expressed as the final binary building segmentation map in the building semantic segmentation task. First, layer 1 is output by the ASPP module and concatenated with the mid-level feature map output by the backbone through a long skip connection. After the channels of the concatenated feature map are adjusted to 256 through a 1 × 1 convolution, we obtain layer 2. Second, layer 2 is resized by bilinear reshaping (denoted RESHAPE) to the same size as that of the low-level feature map extracted by the backbone, and then a concatenation operation (denoted CONCAT) and a 1 × 1 convolution (denoted CONV 1 × 1) are executed to obtain layer 3 with 256 channels and a size of 128 × 128. Thus far, multilevel image features and multiscale feature map features have been extracted.
Third, the features of irregular building shapes are extracted through a DCN layer, whose output feature map has the same number of channels and size as its input feature map. Eventually, after a 1 × 1 convolution and a bilinear reshaping, a semantic segmentation image of the buildings with 2 channels and a size of 512 × 512 is output.

Feature Extraction Backbone in the Encoder
Typically, the deeper a network is, the richer the extracted features and the better the model performs. However, a deep learning model does not always perform better after simply stacking the layers of the network. Instead, the weight matrix may degrade, causing the network performance to deteriorate. The residual structure in ResNet [44] allows certain layers to be connected to each other through short skip connections, which weakens the strong correlation between two adjacent layers and mitigates network degradation. The inverted residual structure in the MobileNetV2 feature extraction backbone is based on the residual structure of ResNet, while the feature extraction sequence of "downscaling-convolution-upscaling" is changed to "upscaling-convolution-downscaling", and the middle normal convolution is replaced by a DSC.

DSC in the Backbone
As shown in Figure 2, a DSC consists of two steps, a depthwise convolution and a pointwise convolution, which are performed separately in the spatial and channel dimensions to capture spatial information and fuse cross-channel depth information. Suppose that the input feature map size (length × width × channels) is DF × DF × M, the output feature map size is DF × DF × N, and the normal convolution kernel size is DK × DK × M. Then, the ratio of the number of DSC parameters to the number of normal convolution parameters is:

$$\frac{D_K \cdot D_K \cdot M + M \cdot N}{D_K \cdot D_K \cdot M \cdot N} = \frac{1}{N} + \frac{1}{D_K^2}$$

which indicates that the DSC can substantially reduce the required number of parameters, and this advantage becomes increasingly apparent as the number of layers increases.
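As a quick sanity check, the parameter counts and the ratio above can be computed directly (a minimal sketch; biases are omitted, and the channel sizes below are illustrative, not taken from the paper):

```python
def conv_params(dk, m, n):
    """Parameters of a normal dk x dk convolution: dk * dk * m * n (bias omitted)."""
    return dk * dk * m * n

def dsc_params(dk, m, n):
    """Depthwise (dk * dk * m) plus pointwise (m * n) parameters (bias omitted)."""
    return dk * dk * m + m * n

# Example: 3 x 3 kernel, 32 input channels, 64 output channels.
dk, m, n = 3, 32, 64
ratio = dsc_params(dk, m, n) / conv_params(dk, m, n)
# The ratio equals 1/N + 1/DK^2, matching the formula above.
assert abs(ratio - (1 / n + 1 / dk ** 2)) < 1e-12
print(f"DSC uses {ratio:.1%} of the normal convolution's parameters")
```

For a 3 × 3 kernel the ratio is dominated by the 1/DK² = 1/9 term, so each DSC layer needs roughly one-eighth of the parameters of its normal counterpart.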

GN in the Backbone
Batch normalization (BN) [45] has been widely used in existing deep learning algorithms. MobileNetV2 contains three BN layers in each inverted residual structure. Each BN layer takes the overall statistics for inference, imposing constraints on the search spaces of the system parameters, accelerating network convergence, and alleviating the overfitting problem. However, due to the stacking effect of BN in the network, the input distribution deviation between the training and test sets causes BN estimation bias to accumulate, which adversely affects the test performance of the model [46]. Note that GN can prevent the accumulation of such estimation bias. For this reason, we replace the second BN layer in the original inverted residual structure (shown in Figure 3c) with a GN layer to prevent network performance degradation due to distribution bias and ensure the robustness of the network. The general feature normalization formula is:

$$\hat{x}_i = \frac{1}{\sigma_i}(x_i - \mu_i)$$

For two-dimensional images, x is a computed feature derived from the feature map, and i = (iN, iC, iH, iW) is a four-dimensional vector of features indexed in the following order: "batch axis, channel axis, spatial height axis, spatial width axis". µ and σ are the mean and standard deviation, respectively, computed using the following equations:

$$\mu_i = \frac{1}{m}\sum_{k \in S_i} x_k, \qquad \sigma_i = \sqrt{\frac{1}{m}\sum_{k \in S_i}(x_k - \mu_i)^2 + \varepsilon}$$

where ε is a constant with a small value, Si is the set of pixels used to compute the mean and standard deviation, and m is the size of Si. Then, formally, the set over which GN computes these statistics is defined as:

$$S_i = \left\{k \;\middle|\; k_N = i_N,\ \left\lfloor \frac{k_C}{C/G} \right\rfloor = \left\lfloor \frac{i_C}{C/G} \right\rfloor\right\}$$

Here, G and C are the numbers of groups and channels, respectively, and C/G is the number of channels in each group.
⌊·⌋ is the floor operation, and the condition ⌊kC/(C/G)⌋ = ⌊iC/(C/G)⌋ indicates that indices i and k are in the same channel group, assuming that each channel group is stored sequentially along the channel axis. GN computes µ and σ along the spatial height axis, the spatial width axis and a group of C/G channels. Specifically, we use the same µ and σ to normalize the pixels in the same group.
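The grouped statistics described above can be sketched in a few lines of NumPy (a minimal illustration; the learnable scale and shift parameters that a full GN layer also carries are omitted):

```python
import numpy as np

def group_norm(x, groups, eps=1e-5):
    """Normalize an (N, C, H, W) array per sample and per channel group."""
    n, c, h, w = x.shape
    g = x.reshape(n, groups, c // groups, h, w)        # split C channels into G groups
    mu = g.mean(axis=(2, 3, 4), keepdims=True)         # mean over (C/G, H, W)
    sigma = np.sqrt(g.var(axis=(2, 3, 4), keepdims=True) + eps)
    return ((g - mu) / sigma).reshape(n, c, h, w)

x = np.random.default_rng(0).normal(3.0, 2.0, size=(2, 8, 4, 4))
y = group_norm(x, groups=4)
# Each group of C/G = 2 channels now has ~zero mean and ~unit variance.
grouped = y.reshape(2, 4, 2, 4, 4)
assert np.allclose(grouped.mean(axis=(2, 3, 4)), 0.0, atol=1e-6)
```

Unlike BN, no statistic here depends on the batch dimension, which is why GN avoids the train/test estimation bias discussed above.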

ASPP in the Encoder
Atrous convolutions [47] introduce the concept of "dilation rates" based on the normal convolution, as shown in the ASPP module in Figure 4. Atrous convolutions with different dilation rates insert corresponding zero values into the normal convolution kernel to achieve convolution dilation, thereby increasing the receptive fields; this is similar to pooling operations. The scale features extracted by atrous convolutions with different dilation rates are also different. Additionally, since the inserted zero values are not calculated, the calculation counts do not increase, and the spatial resolution of the output feature maps does not decrease. We replace the original ASPP branch with an atrous convolution possessing a dilation rate of 2 in our experiments. The ASPP module in MSL-Net concatenates and fuses the five feature maps output by a 1 × 1 convolution and four atrous convolutions with dilation rates of 2, 12, 24 and 36, obtaining both large-scale global information and small-scale local detail information. After a 1 × 1 convolution, the number of channels in the concatenated feature map is compressed to 256.
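The zero-insertion mechanism can be illustrated directly: dilating a 3 × 3 kernel with rate r yields an effective kernel of size 3 + 2(r − 1) while keeping the same nine nonzero taps (a toy sketch of the concept, not the network implementation):

```python
import numpy as np

def dilate_kernel(kernel, rate):
    """Insert (rate - 1) zeros between adjacent taps, as an atrous convolution does implicitly."""
    k = kernel.shape[0]
    size = k + (k - 1) * (rate - 1)              # effective kernel size after dilation
    out = np.zeros((size, size), dtype=kernel.dtype)
    out[::rate, ::rate] = kernel                 # original taps land on a strided grid
    return out

base = np.ones((3, 3))
for rate in (2, 12, 24, 36):                     # the dilation rates used in the ASPP module
    d = dilate_kernel(base, rate)
    assert d.shape == (3 + 2 * (rate - 1),) * 2
    assert d.sum() == base.sum()                 # the inserted zeros add no computation
```

With the rates used here, the effective receptive fields of the four branches grow from 5 × 5 up to 73 × 73, which is what lets the module capture both local detail and large-scale context.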

Deformable Convolution in the Decoder
Buildings in images are susceptible to different degrees of deformation due to external conditions such as the attitude of the equipment and weather conditions. In addition, due to the diversity of the shapes of buildings, a normal convolution has difficulty extracting the shape features of buildings, while a deformable convolution is able to adaptively adjust according to the deformation of the object in an image and efficiently extract robust features from objects with different shapes and directions. Figure 5 depicts schematic diagrams of a normal convolution and a deformable convolution in the two-dimensional plane. Figure 5a represents the normal convolution, where the convolution kernel size is 3 × 3, and the sample points are organized in a regular pattern. Figure 5b represents the deformable convolution, where each sample point has position offsets, and the arrangement becomes irregular.
A normal convolution applied to the input feature map x at position p can be written as:

$$y(p) = \sum_{k=1}^{K} w_k \cdot x(p + p_k)$$

where k lists the K sampling positions of the kernel, w_k is the kernel weight at position k, and p_k is the fixed offset of the k-th sampling position.

In a deformable convolution, the position offsets are first obtained through a normal convolution layer, and then the offsets and magnitudes of the features learned from each sampling point are modulated. Finally, a more complex geometric transform feature learning process is performed, which is calculated as:

$$y(p) = \sum_{k=1}^{K} w_k \cdot x(p + p_k + \Delta p_k) \cdot \Delta m_k$$

where ∆pk and ∆mk represent the learnable offset and modulation scalar at position k, respectively. ∆mk lies in the range [0, 1], and the sampling at the resulting fractional positions is executed using bilinear interpolation. ∆pk and ∆mk can be obtained by applying a convolution to the same input feature map layer. The number of output channels of this convolution is 3K, where K is the number of sampling points in the convolutional kernel (e.g., K = 9 for a 3 × 3 kernel). The first 2K channels represent the x and y offsets (∆pk) at each position, and the remaining K channels are passed through a sigmoid layer to obtain the modulation scalar (∆mk) of each position.
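Since the learned offsets ∆pk are fractional, each displaced sampling point must be read from the feature map via bilinear interpolation. A minimal single-point sketch of that interpolation step (illustrative only; libraries such as torchvision's DeformConv2d implement this internally):

```python
import numpy as np

def bilinear_sample(x, py, px):
    """Sample feature map x at a fractional position (py, px), as a deformable
    convolution does when a learned offset moves a sampling point off the grid."""
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    dy, dx = py - y0, px - x0
    return ((1 - dy) * (1 - dx) * x[y0, x0] + (1 - dy) * dx * x[y0, x0 + 1]
            + dy * (1 - dx) * x[y0 + 1, x0] + dy * dx * x[y0 + 1, x0 + 1])

feat = np.arange(16, dtype=float).reshape(4, 4)
# Halfway between rows 1 and 2 at column 1: average of feat[1, 1] = 5 and feat[2, 1] = 9.
assert bilinear_sample(feat, 1.5, 1.0) == 7.0
```

Because the interpolation weights are differentiable with respect to (py, px), the gradients can flow back into the offset-predicting convolution, which is what makes ∆pk learnable.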
We add a deformable convolution layer to enhance the feature extraction ability of our model for geometric shapes before the final pixel classification output of the network.

Warmup and Cosine Annealing Learning Rate Policy
During training, gradient descent is usually adopted to optimize models. The learning rate (LR) is one of the hyperparameters that affect the model optimization process; it plays a guiding role in how to use the loss function gradient to adjust the network weights in the gradient descent step. When model training starts, a large initial LR is generally set to rapidly decrease the loss value of the network, and the LR decreases in a certain way as the number of iterations increases to ensure small model fluctuations during the later training stages as it gradually approaches the global optimal solution. The LR usually decreases via exponential decay, piecewise constant decay, or cosine annealing [48].
Since the network is relatively unstable in the early training stage, a large initial LR causes the gradient of the weights to fluctuate back and forth, and a small initial LR decelerates network convergence; thus, we employ the "warmup and cosine annealing" LR policy to ensure the performance of the model. As shown in Figure 6b, in the first 10 epochs, the LR linearly increases from 0 to the base LR and then gradually decreases from the base LR to 0 as the number of epochs increases. The LR is calculated by:

$$\eta_t = \begin{cases} \eta_{max} \cdot \dfrac{T_{cur}}{T_{wu}}, & T_{cur} \leq T_{wu} \\[2mm] \dfrac{\eta_{max}}{2}\left[1 + \cos\left(\dfrac{T_{cur} - T_{wu}}{T_{max} - T_{wu}}\,\pi\right)\right], & T_{cur} > T_{wu} \end{cases}$$

where ηt is the LR of the current training epoch, ηmax is the base LR, Tcur is the current number of training epochs, Twu is the total number of warmup epochs, and Tmax is the maximum number of training epochs.
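This schedule is straightforward to reproduce; the sketch below plugs in the settings used in this paper (base LR 0.0005, 10 warmup epochs, 120 total epochs):

```python
import math

def warmup_cosine_lr(t_cur, base_lr=5e-4, t_wu=10, t_max=120):
    """Linear warmup for the first t_wu epochs, then cosine annealing to zero."""
    if t_cur <= t_wu:
        return base_lr * t_cur / t_wu                 # linear ramp from 0 to base_lr
    progress = (t_cur - t_wu) / (t_max - t_wu)        # 0 at end of warmup, 1 at t_max
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

assert warmup_cosine_lr(0) == 0.0                     # training starts from LR 0
assert warmup_cosine_lr(10) == 5e-4                   # peak at the end of warmup
assert warmup_cosine_lr(120) < 1e-10                  # annealed to ~0 at the final epoch
```

In practice the same shape can be obtained by composing standard scheduler classes in a deep learning framework; the function above simply makes the formula explicit.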

Descriptions of the Datasets
In this study, we use the WHU Building Aerial Imagery dataset (WHU dataset) [3], Inria Aerial Image Labeling dataset (Inria dataset) [49], and Massachusetts Buildings dataset (Massachusetts dataset) [50], with spatial resolutions ranging from 0.3 m to 1.0 m; thus, we can fully test the performance of the developed model. The details of each dataset are listed in Table 1. For both the Massachusetts dataset and the WHU dataset, the training and validation steps are performed directly using the training, validation, and test set ratios that have been partitioned by default in these datasets. For the Inria dataset, the training, validation, and test sets are divided at a ratio of 8:1:1. The numbers of images in the training, validation and test sets used for each dataset in the experiments are shown in Table 2.

Experimental Settings
The main software and hardware used in our study are listed in Table 3. Since buildings are the only experimental objects, the pixel value range of the labeled binary map is adjusted from [0, 255] to [0, 1] before training, where pixels with values of 1 represent the buildings and pixels with values of 0 represent the background. The widely used and high-performing U-Net, PSPNet, and DeepLabv3+ are selected for comparison, and all four models are tested on the above three datasets using the same training, validation, and test sets. The input images are one-hot coded, the batch size is set to 12, the loss function is a direct summation of the focal loss [51] and dice loss [52], the Adam optimizer is used in the training process, the base LR is set to 0.0005, and each comparison model is trained for 120 epochs using an exponential decay LR policy with a gamma value of 0.0005. MSL-Net is first warmed up for 10 epochs and then trained for 110 epochs using the cosine annealing LR policy.
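The combined loss can be sketched as follows (a minimal NumPy illustration of the direct summation of the two terms; the focusing parameter gamma = 2 is the common default and is an assumption here, as the paper does not state its loss hyperparameters):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, eps=1e-7):
    """Binary focal loss; p holds predicted building probabilities, y the 0/1 labels."""
    pt = np.where(y == 1, p, 1 - p)              # probability assigned to the true class
    return float(np.mean(-((1 - pt) ** gamma) * np.log(pt + eps)))

def dice_loss(p, y, eps=1e-7):
    """Soft dice loss: 1 - 2|P∩Y| / (|P| + |Y|)."""
    return float(1 - (2 * np.sum(p * y) + eps) / (np.sum(p) + np.sum(y) + eps))

def combined_loss(p, y):
    return focal_loss(p, y) + dice_loss(p, y)    # direct summation, as described above

y = np.array([1.0, 0.0, 1.0, 1.0])
assert combined_loss(y, y) < 1e-5                # a perfect prediction gives ~zero loss
```

The focal term down-weights easy pixels, while the dice term directly optimizes region overlap, which helps with the foreground/background imbalance typical of building masks.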
In the training stage, data augmentation strategies are applied to preprocess the input images to obtain more feature information from the limited data. Due to the rich geometric features of the buildings in the dataset, we first apply spatial data augmentation strategies, including random horizontal mirror flipping with a 50% probability, random rotation with a 50% probability at angles between −10° and 10°, and random scaling with ratios between 0.25 and 2. Then, spectral data augmentation strategies are applied to reduce the impact caused by imaging condition differences to improve the generalization ability of the network. The spectral data augmentation techniques include hue augmentation, saturation augmentation, value augmentation and random Gaussian blurring.
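A key detail of spatial augmentation for segmentation is that the same random transform must be applied to the image and its label mask. A minimal sketch (90° rotations stand in for the small-angle rotations above, which would require interpolation, and the spectral augmentations are omitted):

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(image, label):
    """Apply identical random spatial transforms to an image and its building mask."""
    if rng.random() < 0.5:                       # random horizontal mirror flip
        image, label = image[:, ::-1], label[:, ::-1]
    if rng.random() < 0.5:                       # random rotation (90° multiples here,
        k = int(rng.integers(1, 4))              # standing in for -10°..10° rotations)
        image, label = np.rot90(image, k), np.rot90(label, k)
    return image, label

img = np.arange(16).reshape(4, 4)
out_img, out_lab = augment(img, img.copy())
assert np.array_equal(out_img, out_lab)          # geometry stays aligned with the mask
```

Libraries such as Albumentations handle this pairing (and the spectral transforms) out of the box; the sketch only shows the aligned-transform principle.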

Evaluation Metrics
To quantitatively evaluate the reliability and accuracy of each model, we use six metrics, the Intersection-Over-Union (IoU), F1-score, Accuracy, Recall, Precision and Kappa, to evaluate the segmentation results. The building segmentation results are compared with the corresponding building labels at the pixel level, and the instances are classified as positive or negative, with true indicating a correct prediction and false indicating an incorrect prediction; thus, true positives (TPs) represent the correctly predicted building pixels, false positives (FPs) represent the pixels that predict the background as buildings, false negatives (FNs) represent the pixels that predict buildings as the background, and true negatives (TNs) represent the correctly predicted background pixels. The confusion matrix is shown in Table 4. Among these six metrics, Recall denotes the proportion of correctly predicted building pixels among all real building pixels, Precision denotes the proportion of correctly predicted building pixels among all predicted building pixels, and the IoU, which is currently the most commonly used evaluation metric in semantic segmentation tasks, denotes the ratio of the intersection to the union of the predicted building and real building pixels. Accuracy indicates the proportion of correctly predicted pixels among all pixels. The F1-score is the harmonic mean of the Recall and Precision. Kappa is a metric that considers both the target and background accuracies. The Recall, Precision, IoU, Accuracy, F1-score and Kappa are defined as:

$$Recall = \frac{TP}{TP + FN}, \qquad Precision = \frac{TP}{TP + FP}, \qquad IoU = \frac{TP}{TP + FP + FN}$$

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}, \qquad F1\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall}$$

$$Kappa = \frac{p_o - p_e}{1 - p_e}, \quad p_o = Accuracy, \quad p_e = \frac{(TP + FP)(TP + FN) + (FN + TN)(FP + TN)}{(TP + TN + FP + FN)^2}$$
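These definitions translate directly into code (a small sketch over confusion-matrix counts; the example counts are illustrative, not from the paper's tables):

```python
def segmentation_metrics(tp, fp, fn, tn):
    """Pixel-level evaluation metrics computed from the confusion matrix."""
    total = tp + fp + fn + tn
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    iou = tp / (tp + fp + fn)
    accuracy = (tp + tn) / total
    f1 = 2 * precision * recall / (precision + recall)
    # Kappa compares observed agreement (accuracy) with chance agreement p_e.
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (accuracy - p_e) / (1 - p_e)
    return {"Recall": recall, "Precision": precision, "IoU": iou,
            "Accuracy": accuracy, "F1": f1, "Kappa": kappa}

m = segmentation_metrics(tp=80, fp=10, fn=20, tn=890)
assert m["Recall"] == 0.8 and m["IoU"] == 80 / 110
```

Note how Kappa stays well below Accuracy when the background class dominates, which is why it is a useful complement for building extraction, where background pixels are the majority.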

Comparison on the WHU Dataset
The WHU dataset is the most accurate building dataset available to date [53]; it consists of aerial images of Christchurch, New Zealand, with extraordinarily high image resolutions. Building images with obvious spectral features, geometric features, and spatial distribution features are selected from the dataset for display, as shown in Figure 7, where white pixels indicate the correctly detected parts of a building, red pixels indicate incorrectly detected parts of a building, blue pixels indicate the missed parts of a building, and black pixels indicate the correctly detected background. In Figure 7a, U-Net, PSPNet, and DeepLabv3+ all fail to detect large bungalows to some degree, and some results have rough building edges with some discrete pixels, whereas MSL-Net decreases the salt and pepper noise by fusing multiscale and multilevel features through its ASPP module and long skip connections. In Figure 7b, the other three methods have varying degrees of false detection. This is because the building roofs are similar to the ground in terms of color and texture, which signifies "intraclass spectral heterogeneity". The results of Figure 7b demonstrate that MSL-Net effectively reduces the negative impact of "intraclass spectral heterogeneity" on the extraction of buildings and achieves enhanced recognition accuracy. Figure 7c depicts a region with a significant number of densely distributed small-scale buildings, and MSL-Net efficiently mitigates false detection. Buildings of various sizes and shapes can be found in Figure 7d. According to the extraction results, U-Net and PSPNet are unable to effectively discriminate between the ground and buildings. Buildings with irregular shapes, such as round buildings, are ineffectively extracted by PSPNet and DeepLabv3+. 
MSL-Net not only eliminates the negative impact of "intraclass spectral heterogeneity" but also completely extracts round buildings and maintains their continuity, revealing that the deformable convolutional layer is involved in the detection of target features with irregular shapes.
To quantitatively evaluate the extraction effect of each method, the results of each metric are calculated, as shown in Table 5. Table 5 presents the metrics produced on the WHU dataset; the highest scores are bolded, and the second-highest scores are underlined. MSL-Net exceeds 90% in every metric, with IoU, F1-score and Kappa values that are approximately 2.4%, 1.4% and 1.5% higher than those of the second-best model, respectively. Our proposed model thus outperforms these widely used models in recognition accuracy.

Comparison on the Inria Dataset
The Inria dataset comprises a variety of urban landscapes, covering an area of 810 km² in 10 different cities. Various places have diverse architectural types, and the spectral features and shadow features of the images also vary depending on their imaging times and meteorological conditions, so this dataset might be a good indicator of a model's robustness in different scenarios. Some of the original images and labels and the corresponding results extracted by each method are shown in Figure 8. The spatial distribution of the buildings in Figure 8a is uneven, and the materials and spectral features of the roofs of the scattered buildings vary. The buildings in Figure 8b are large in scale and are connected by several buildings with different spectral features. The building structures are complex, and shadows obscure the buildings. Figure 8c contains two types of buildings, villas and large bungalows, and the spectral features of the buildings and the ground are relatively similar. The buildings in Figure 8d are rectangular ambulatory planes in terms of shape, and the spectral features of the roofs are complex.
MSL-Net effectively reduces missed detections and suppresses false detections, as shown in Figure 8a,c. Figure 8b shows how MSL-Net successfully distinguishes rooftops and road surfaces with similar spectral features, weakening the influence of "interclass spectral homogeneity". MSL-Net also effectively distinguishes between white and brown rooftops with different spectral features and successfully extracts white buildings shaded by trees, indicating that MSL-Net can reduce the impact of shaded buildings to some extent. We can observe in Figure 8b,d that MSL-Net can recognize buildings with complicated structures. Table 6 lists the computed evaluation metric results. The IoU, F1-score and Kappa coefficient of MSL-Net are 0.2%, 0.2% and 0.2% higher than those of the second-best model PSPNet, respectively; these margins are modest, but they still indicate that MSL-Net is competitive in the building recognition task across varied scenarios.

Comparison on the Massachusetts Dataset
The Massachusetts dataset includes Boston aerial images with 1-m spatial resolutions, which are significantly lower than the 0.3-m resolutions of the WHU dataset and Inria dataset. With these lower spatial resolutions, the feature information of buildings is rough and more difficult to extract. Some of the original images and labels and the corresponding results extracted by each method are shown in Figure 9. In general, most of the buildings in the segmentation results are displayed in fragmented patchy distributions, which greatly test each model's ability to extract small targets. Figure 9a,b demonstrate that MSL-Net can alleviate the occurrences of missed and false detections to a certain extent, and Figure 9c,d demonstrate that MSL-Net is also able to effectively identify irregular buildings.
The results of the calculated evaluation metrics are shown in Table 7. The IoU, F1-score and Kappa coefficient values of MSL-Net are 3.3%, 2.3% and 2.9% higher than those of the second-best model U-Net, respectively, and all other metrics are also better. The results show that MSL-Net still has strong robustness even when working with images possessing poor spatial resolutions.

Complexity Comparison
Complexity is a critical factor that affects the practical application of a model. In the building extraction task based on the CNN method, lower numbers of parameters and floating-point operations (FLOPs) often result in faster training and inference speeds. A model with lower complexity is more convenient for practical applications. To objectively evaluate the complexity of each model, the number of parameters, the number of FLOPs, the training speed and the inference speed are calculated separately for each model. On an NVIDIA RTX 3090 GPU, the training speed is expressed as the number of frames per second (FPS) achieved for an input image of size 3 × 512 × 512, and the inference speed is expressed as the number of FPS achieved for two input images of size 3 × 512 × 512. The quantitative comparison results are shown in Figure 10.
In Figure 10, we can see that MSL-Net clearly achieves the fastest training speed and inference speed with very small numbers of parameters and FLOPs. In detail, the numbers of parameters and FLOPs required by MSL-Net are much lower than those of DeepLabv3+ and PSPNet, amounting to approximately 14% and 37%, respectively, of the numbers required by the second-best U-Net model. Our proposed MSL-Net reaches a competitive training speed of 53.1 FPS, which is 65% faster than that of the second-best model, while its inference speed of 101.4 FPS surpasses those of the other models by more than 57.1%.

Comparison with State-of-the-Art Methods
To verify the effectiveness of the proposed network, MSL-Net is compared with recent state-of-the-art building extraction methods, including AGs-Unet [54], PISANet [55], DR-Net [56], RSR-Net [57], BRRNet [58], and SRI-Net [59], on the accurate WHU dataset. As the WHU dataset is publicly available, we use the reported model performances for our comparisons. The quantitative comparison results are shown in Table 8. As seen in Table 8, the IoU and F1-score of MSL-Net are superior to those of all the tested methods developed in recent studies, demonstrating the state-of-the-art performance of our proposed method. Compared with RSR-Net, BRRNet, and SRI-Net, our proposed MSL-Net achieves IoU improvements of 2.1%, 1.4%, and 1.3% on the WHU dataset, respectively. MSL-Net also achieves a competitive number of parameters, demonstrating its excellent tradeoff between complexity and accuracy.

Ablation Experiments
To verify the effectiveness of each improvement, we perform ablation experiments on the WHU dataset, Inria dataset and Massachusetts dataset. Based on the baseline (MSL-Net with the unimproved MobileNetV2), we first change the LR policy from exponential decay (ED) to warmup and cosine annealing (WCA), then we add a DCN layer at the end of the network, and finally we replace the second BN layer with a GN layer in the inverted residual structure. The results of the six metrics are calculated, as shown in Table 9. The metrics of the model improve consistently with the adjustment of the LR policy, the addition of the DCN, and the embedding of GN in the inverted residual module. For instance, on the WHU dataset, WCA increases the IoU by 0.8% and the F1-score by 0.7%, the DCN module increases the IoU by 0.3% and the F1-score by 0.3%, and the improvement of the inverted residual structure increases the IoU by 1.0% and the F1-score by 0.1%. The ablation experimental results strongly prove the effectiveness of our introduced improvements.

Limitations and Future Work
Despite the superior performance achieved by the proposed MSL-Net in terms of accuracy and complexity, it still has limitations that need to be addressed. The experimental results of MSL-Net on the Inria dataset, which contains more complex scenes, show only a slight advantage over those of PSPNet, and it is still necessary to strengthen the robustness of the lightweight MSL-Net in complex scenes. Additionally, the parameters of the ASPP module account for a large proportion of those of the whole network. In future work, improvements to or replacements of the ASPP module can be considered to address this limitation.

Conclusions
In this paper, we propose MSL-Net, an efficient neural network for building extraction. In terms of its network structure, MSL-Net adopts an ASPP module and skip connections to obtain the multiscale features of buildings and the multilevel features of images. The numbers of network parameters and computations are reduced by a DSC, GN is embedded in the inverted residual structure to alleviate network degradation, and the extraction capability of the model for irregularly shaped buildings is ensured by a DCN layer. Experiments are conducted on three publicly available datasets with varying spatial resolutions and building styles, and MSL-Net outperforms the comparison methods in model accuracy, demonstrating its superiority and robustness. Complexity evaluation experiments reveal that MSL-Net surpasses other models by more than 57.1% with an inference speed of 101.4 FPS; it requires only 14% of the parameters and 37% of the FLOPs required by the second-best method, manifesting the efficiency of MSL-Net. Ablation experiments indicate the effectiveness of each improvement. However, we find that the experimental results obtained by MSL-Net on the Inria dataset with more complex scenes are not sufficiently superior, and we will focus on enhancing the robustness of MSL-Net to complex scenes in our future work so that MSL-Net can perform well when monitoring urbanization in different cities.

Acknowledgments: We thank the editors and reviewers for their constructive and helpful comments that led to the substantial improvement of this paper.

Conflicts of Interest:
The authors declare no conflict of interest.