DMU-Net: A Dual-Stream Multi-Scale U-Net Network Using Multi-Dimensional Spatial Information for Urban Building Extraction

Automatically extracting urban buildings from remote sensing images has essential application value in tasks such as urban planning and management. Gaofen-7 (GF-7) provides multi-perspective and multispectral satellite images from which three-dimensional spatial information can be obtained. Previous studies on building extraction have often ignored information outside the red-green-blue (RGB) bands. To utilize the multi-dimensional spatial information of GF-7, we propose a dual-stream multi-scale network (DMU-Net) for urban building extraction. DMU-Net is based on U-Net, and the encoder is designed as a dual-stream CNN structure whose two streams input RGB images and fused near-infrared (NIR) + normalized digital surface model (nDSM) images, respectively. In addition, an improved FPN (IFPN) structure is integrated into the decoder, enabling DMU-Net to effectively fuse features from different bands and multi-scale image features. The new method is tested on a study area within the Fourth Ring Road in Beijing, and the conclusions are as follows: (1) Our network achieves an overall accuracy (OA) of 96.16% and an intersection-over-union (IoU) of 84.49% on the GF-7 self-annotated building dataset, outperforming other state-of-the-art (SOTA) models. (2) Three-dimensional information significantly improves the accuracy of building extraction: compared with RGB and RGB + NIR, the IoU increases by 7.61% and 3.19%, respectively, after adding nDSM data. (3) DMU-Net is superior to SMU-Net, DU-Net, and IEU-Net, improving the IoU by 0.74%, 0.55%, and 1.65%, respectively, indicating the superiority of the dual-stream CNN structure and the IFPN structure.


Introduction
Buildings are the basic geographical entities of cities, and their spatial information is widely used in urban planning, disaster mitigation and prevention, population forecasting, and energy consumption analysis [1][2][3][4]. With the rapid development of remote sensing technology, the spatial resolution of remote sensing images has reached the sub-meter level. These images carry richer spectral, texture, and spatial-structure information, making refined, automatic building extraction possible. However, due to the density of urban buildings and the diversity of roof materials and structures, extracting buildings accurately and quickly still faces challenges.
In recent decades, there has been a great deal of research on building extraction, and the methods can be roughly categorized by the type of data used. A common way to handle scale variation is to add the SPP module at the end of the encoder [34]. However, this involves a large amount of computation, ignores the shallow features of the model, and the sizes of the different receptive fields must be determined through multiple experiments. Lin et al. proposed the feature pyramid network (FPN), which constructs semantic features at various scales through a hierarchical structure with lateral connections [35]. FPN is simple to implement, requires minimal computation, is structurally similar to U-Net, and can easily be integrated into a U-Net backbone. It has been used in related semantic segmentation tasks with good results [36,37].
CNNs are mainly designed for RGB images and cannot be directly applied to multi-modal data, so a corresponding multi-modal fusion strategy is necessary. According to the position of fusion, strategies can be divided into three types: (1) Data-level fusion. Multi-modal data are fused before feature extraction using data superposition or dimensionality reduction [38,39]. However, this strategy ignores the correlation between the features of different modalities. (2) Feature-level fusion. The features of different modalities are fused during the feature learning stage [40,41]. This method fails to fully exploit the high-level features of the individual modalities. (3) Decision-level fusion. The output results of different modalities are fused by averaging or voting [41,42]. This method fails to fully exploit the low-level and mid-level features of the individual modalities. We propose a new fusion architecture with U-Net as the basic network structure to make full use of the low-level, mid-level, and high-level features of the different modalities. A dual-stream CNN structure is used to extract the features of the RGB images and the NIR + nDSM images. The shallow and mid-level features of the different modalities are fused with the deep features of the up-sampling stage via skip connections, avoiding the loss of features at varying depths.
To effectively extract the features of buildings of different scales and fully leverage both individual-modal and cross-modal features, we propose a simple and effective dual-stream multi-scale building extraction network named DMU-Net. The main contributions are as follows: (1) The dual-stream structure of DMU-Net can effectively extract the features of multi-modal data, and the building features of different scales can be effectively integrated through the IFPN structure. (2) The fusion of three-dimensional data with two-dimensional data significantly improves the accuracy of urban building extraction. (3) Compared with different semantic segmentation networks, DMU-Net achieves higher accuracy while preserving edge details.

Study Area and Dataset
GF-7, successfully launched in November 2019, is China's first civilian sub-meter stereoscopic mapping satellite, equipped with dual line-array cameras. GF-7 acquires 0.8 m forward-view panchromatic images (+26°), 0.65 m backward-view panchromatic images (−5°), and 2.6 m four-band multispectral images acquired by the backward-view multispectral camera. Taking the region within the Fourth Ring Road of Beijing as the study area, two adjacent GF-7 images obtained on 16 October 2020 were selected, covering an area of about 302 km², as shown in Figure 1. The study area covers major commercial and residential zones in Beijing, with dense buildings and diverse building structures.
To ensure the generalizability of the model, five distinct regions are selected from the study area: regions (a), (b), and (c) as training and validation regions, and regions (d) and (e) as test regions. The ground-truth labels of the five regions are obtained by manual annotation. Regions (a), (b), and (c) cover the main types of buildings within the Fourth Ring Road, such as dense low-rise buildings, medium- and high-rise buildings, and factory buildings. Region (d) contains other ground objects, such as water bodies, vegetation, squares, and roads, which can test the ability of different data to distinguish buildings from other ground objects. Region (e) contains large factory buildings, which can test the ability of the nDSM data and the different network structures to extract buildings completely and in detail.

Materials and Methods
The flowchart of the proposed method in this study is illustrated in Figure 2. It can be summarized by the following steps: (1) Data Preprocessing. The GF-7 backward multispectral image and backward panchromatic image are fused to obtain a VHR multispectral image, and the nDSM data is constructed based on the front and backward panchromatic images. (2) DMU-Net for urban building extraction. (3) Morphological operations and vector data regularization are used for post-processing.

Data Preprocessing
In this study, GF-7 image sharpening and nDSM generation are performed with PCI Geomatica. Image sharpening mainly includes ground control point (GCP) and tie point (TP) collection, orthorectification, and panchromatic sharpening. The 0.5 m resolution orthophoto from Map World (TianDiTu) is used as the geographic reference image. The fast Fourier transform phase matching (FFTP) algorithm is used to collect GCPs between the backward-view panchromatic and multispectral images and the georeferenced image. GCPs with residual values greater than three are removed, and TPs are gathered to match the panchromatic and multispectral images. Afterward, the points with larger errors are removed based on the residual report to complete the collection of GCPs and TPs. Finally, the backward panchromatic and multispectral images are orthorectified based on the rational function model, and the multi-resolution analysis algorithm is used for image sharpening. nDSM generation mainly includes the collection of GCPs and TPs, the creation of epipolar images, DSM extraction, and image filtering. The GCPs and TPs of the forward and backward panchromatic images are collected using FFTP. Then, the forward panchromatic image and the backward panchromatic image are set as the left and right epipolar images to complete the creation of the epipolar pair. The semi-global matching algorithm is used to generate the DSM [43]. Finally, a variety of filtering strategies are used to obtain digital elevation model (DEM) data, the DEM is subtracted from the DSM to obtain the nDSM, and the nDSM values of water bodies and buildings with missing height information are corrected by calculating the regional average or reassigning them.
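The final differencing-and-patching step can be sketched in NumPy. This is a simplified illustration only: the array names, the no-data value, and the regional-mean fill strategy are assumptions based on the description above, not the authors' actual implementation.

```python
import numpy as np

def compute_ndsm(dsm, dem, invalid_value=-9999.0):
    """Subtract the filtered DEM from the DSM to obtain the nDSM, then
    patch pixels with missing height information (e.g., water bodies or
    occluded buildings) with the mean of the valid pixels in the region."""
    ndsm = dsm - dem
    invalid = (dsm == invalid_value) | (dem == invalid_value)
    # Reassign missing values using the regional average of valid pixels.
    ndsm[invalid] = ndsm[~invalid].mean()
    # Heights below ground level are clipped to zero.
    return np.clip(ndsm, 0.0, None)

# Toy 2 x 3 tiles; one DSM pixel is marked as no-data.
dsm = np.array([[12.0, 15.0, -9999.0], [10.0, 10.5, 11.0]])
dem = np.array([[10.0, 10.0, 10.0], [10.0, 10.0, 10.0]])
ndsm = compute_ndsm(dsm, dem)
```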


DMU-Net Architecture
We design DMU-Net to fuse multi-modal data and multi-scale features to improve the accuracy of building extraction. As shown in Figure 3, inspired by previous studies [36,[44][45][46], based on U-Net, we embed a dual-stream CNN structure into the encoder of U-Net to obtain comprehensive features of the different modalities. In the dual-stream structure, one stream inputs the RGB image to obtain the features of the two-dimensional spatial structure. The other stream inputs the NIR + nDSM image, mainly to obtain the features of the three-dimensional spatial structure. The decoder retains the up-sampling and skip connections of U-Net. Up-sampling restores the dimensions of the feature map, and the skip connection fuses the down-sampled feature maps during the up-sampling process to combine shallow and deep features and reduce the loss of original data details. The design of the independent dual-stream structure can extract multi-modal data features while avoiding mutual interference between them, making full use of the image information of the R, G, B, NIR, and nDSM bands [47]. Then, the IFPN structure is introduced into the decoding structure to fuse multi-scale information and account for the features of buildings of different scales. Finally, the sigmoid function is used to obtain the building segmentation map.


Fusion Strategy
As shown in Figure 4, the two streams in the encoder have the same network structure and are independent of each other, and each stream performs four max-pooling down-sampling operations. Before each pooling, the features of the two streams are fused by the Add operation and then fused with the corresponding up-sampled features in the decoder by the Concat operation. To avoid a large number of parameters and excessive memory consumption, the numbers of channels of the convolution kernels in the dual-stream structure are set to 32, 64, 128, 256, and 512 in turn. The up-sampling stage in the decoder first performs linear-interpolation up-sampling and then applies a convolution. This is equivalent to a transposed convolution operation, is more effective than simple interpolation alone, and can effectively eliminate the aliasing effect. To speed up training, a BN layer is added after each 3 × 3 convolution for normalization [48]. Dropout layers with a probability of 0.5 are added after the 4th and 5th groups of convolutional layers to enhance the robustness of the network and avoid overfitting [49].
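Shape-wise, the Add fusion and the Concat skip connection described above can be sketched in NumPy as a stand-in for the actual Keras layers; the tensor sizes below are illustrative assumptions, not the network's real dimensions.

```python
import numpy as np

# Illustrative (H, W, C) feature maps from the two encoder streams at
# one scale; sizes are made up for demonstration.
rgb_feat = np.random.rand(64, 64, 32)   # RGB stream
nir_feat = np.random.rand(64, 64, 32)   # NIR + nDSM stream

# "Add" fusion: element-wise sum of the two streams (shapes must match,
# so the channel count is unchanged).
fused = rgb_feat + nir_feat             # still (64, 64, 32)

# Skip connection: the fused encoder features are concatenated with the
# corresponding up-sampled decoder features along the channel axis
# ("Concat"), doubling the channel count.
decoder_feat = np.random.rand(64, 64, 32)
skip = np.concatenate([fused, decoder_feat], axis=-1)
```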

Improved Feature Pyramid Network
FPN was first proposed to solve the multi-scale problem in object detection. The high-level features with low resolution and high semantic information and the low-level features with high resolution and low semantic information are fused by top-down lateral connections. Therefore, the features of FPN at all scales have rich semantic information, which helps to extract objects at different scales. As shown in Figure 5a,b, U-Net and FPN have similar network structures. Ji et al. proposed a method to introduce the FPN module into the U-Net network (Figure 5c), and inspired by this, the IFPN is designed [45]. Different from the work of Ji et al., in the decoder part, we first up-sample the first three up-sampled feature maps by 8×, 4×, and 2× linear interpolation to restore the original image size. Then a 3 × 3 convolution is used to reduce the dimension of the feature map, as shown in Figure 5d. The 3 × 3 convolution has a larger receptive field than the 1 × 1 convolution, further enhancing the semantic features of buildings at different scales. Finally, the Add operation is used to fuse the feature maps.

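The IFPN fusion can be sketched in NumPy as a dependency-free illustration. Two stand-ins are assumed here: nearest-neighbour up-sampling via np.kron replaces the paper's linear interpolation, and a fixed 3 × 3 mean filter replaces the learned 3 × 3 convolution; single-channel maps are used for simplicity.

```python
import numpy as np

def upsample(m, factor):
    # Nearest-neighbour up-sampling via np.kron (a stand-in for the
    # paper's linear interpolation).
    return np.kron(m, np.ones((factor, factor)))

def conv3x3_mean(x):
    # 3x3 mean filter as a placeholder for the learned 3x3 convolution.
    p = np.pad(x, 1, mode="edge")
    h, w = x.shape
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def ifpn_fuse(maps):
    """IFPN-style fusion: up-sample the decoder maps by 8x, 4x, 2x, 1x
    to the full image size, apply a 3x3 convolution to each, and fuse
    them with element-wise Add."""
    target = maps[-1].shape[0]
    return sum(conv3x3_mean(upsample(m, target // m.shape[0])) for m in maps)

# Decoder outputs at 1/8, 1/4, 1/2 and full resolution (sizes assumed).
maps = [np.random.rand(s, s) for s in (32, 64, 128, 256)]
out = ifpn_fuse(maps)
```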

Digital Morphological Processing
The binary building maps predicted by a CNN usually contain noise and voids, which significantly interfere with the accuracy of building extraction and require a series of optimization steps. In this paper, the opening and closing operations of mathematical morphology are used to optimize the building prediction map; erosion and dilation are the basis of the opening and closing operations [50]. The erosion of A by B is defined as A Θ B = {z | (B)z ⊆ A}, and the dilation of A by B is defined as A ⊕ B = {z | (B̂)z ∩ A ≠ ∅}, where A is the set of building pixels, B is the structuring element, and z denotes a translation of B. The opening of A by B is A ∘ B = (A Θ B) ⊕ B, and the closing of A by B is A • B = (A ⊕ B) Θ B: opening removes small isolated noise, while closing fills small voids.
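A minimal binary-morphology sketch of these operations with a 3 × 3 square structuring element (NumPy only; illustrative, not the authors' implementation or their exact sequence of operations):

```python
import numpy as np

def dilate(a):
    # Binary dilation with a 3x3 square structuring element.
    p = np.pad(a, 1, constant_values=False)
    h, w = a.shape
    out = np.zeros((h, w), dtype=bool)
    for i in range(3):
        for j in range(3):
            out |= p[i:i + h, j:j + w]
    return out

def erode(a):
    # Binary erosion with a 3x3 square structuring element.
    p = np.pad(a, 1, constant_values=False)
    h, w = a.shape
    out = np.ones((h, w), dtype=bool)
    for i in range(3):
        for j in range(3):
            out &= p[i:i + h, j:j + w]
    return out

def opening(a):
    return dilate(erode(a))   # removes small isolated noise

def closing(a):
    return erode(dilate(a))   # fills small voids inside buildings

# Toy prediction maps: an isolated false positive and a building
# footprint with a one-pixel hole.
noise = np.zeros((7, 7), dtype=bool)
noise[3, 3] = True
holey = np.zeros((7, 7), dtype=bool)
holey[1:6, 1:6] = True
holey[3, 3] = False
```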

Boundary Regularization
Although the building boundary after morphological processing has been smoothed, it cannot reflect the regular boundaries of buildings well. To better fit the building boundaries, we adopt the polyline compression algorithm proposed by Gribov [51,52]. For building vector data, the algorithm's goal is to find a point within the tolerance of each node of the vector line segment, so that the sum of the penalties between the synthetic polyline connecting all the points and the source polyline is the smallest. If there are synthetic polylines with the same penalty, the one with the smallest squared deviation from the source polyline is selected.
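Gribov's penalty-minimizing algorithm itself is involved; as an illustrative stand-in for tolerance-based polyline simplification, the classic Douglas-Peucker algorithm can be sketched in plain Python. This is not the method used in the paper, only a simpler example of the same idea of collapsing vertices within a tolerance.

```python
def point_seg_dist(p, a, b):
    """Distance from point p to segment ab."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == dy == 0:
        return ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    cx, cy = ax + t * dx, ay + t * dy
    return ((px - cx) ** 2 + (py - cy) ** 2) ** 0.5

def simplify(points, tol):
    """Douglas-Peucker: keep the farthest vertex if it exceeds the
    tolerance, otherwise collapse the run to its two endpoints."""
    if len(points) < 3:
        return list(points)
    dmax, idx = 0.0, 0
    for i in range(1, len(points) - 1):
        d = point_seg_dist(points[i], points[0], points[-1])
        if d > dmax:
            dmax, idx = d, i
    if dmax <= tol:
        return [points[0], points[-1]]
    return simplify(points[:idx + 1], tol)[:-1] + simplify(points[idx:], tol)

# A jagged building edge collapses to its two endpoints under tol=0.5.
edge = [(0, 0), (1, 0.1), (2, -0.1), (3, 0.05), (4, 0)]
out = simplify(edge, 0.5)
```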



Evaluation Metric
To quantitatively evaluate the prediction results of the model, three evaluation metrics are calculated from the confusion matrix: overall accuracy (OA), intersection-over-union (IoU), and F1-score (F1). OA assesses the global accuracy of the extraction results, IoU measures the overlap between the building predictions and the ground-truth labels, and F1 takes into account both the precision and the recall of the model:

OA = (TP + TN) / (TP + TN + FP + FN)
IoU = TP / (TP + FP + FN)
F1 = 2TP / (2TP + FP + FN)

where TP (true positive) is the number of correctly identified building pixels, FP (false positive) is the number of non-building pixels misclassified as building, TN (true negative) is the number of correctly classified non-building pixels, and FN (false negative) is the number of building pixels that were not detected.
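These three metrics can be computed directly from binary masks, as in the following NumPy sketch of the standard confusion-matrix definitions:

```python
import numpy as np

def building_metrics(pred, truth):
    """OA, IoU and F1 from binary prediction and ground-truth masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)     # building pixels correctly identified
    fp = np.sum(pred & ~truth)    # non-building pixels marked as building
    fn = np.sum(~pred & truth)    # building pixels that were missed
    tn = np.sum(~pred & ~truth)   # non-building pixels correctly rejected
    oa = (tp + tn) / (tp + tn + fp + fn)
    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return oa, iou, f1

pred = np.array([[1, 1, 0], [0, 1, 0]])
truth = np.array([[1, 0, 0], [0, 1, 1]])
oa, iou, f1 = building_metrics(pred, truth)
```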

Experimental Details
All experiments were performed on a desktop computer running 64-bit Windows 11, equipped with an Intel(R) Core(TM) i5-11400F CPU @ 2.60 GHz, an NVIDIA GeForce RTX 3060 GPU with 12 GB of VRAM, and 16 GB of memory (DDR4 3200 MHz). The methods in this paper are implemented in Python with TensorFlow (version 2.5.0) and Keras (version 2.5.0). The hyperparameters are set as follows: the cross-entropy loss function and the Adam optimizer were used to train for 100 epochs with a batch size of four images and a learning rate of 0.0001. In DMU-Net, RGB images are input to one stream and NIR + nDSM images to the other. The changes in the accuracy and loss over training are shown in Figure 6.

Figure 6. The Accuracy and Loss of DMU-Net for training the GF-7 self-annotated building dataset.

Comparison with SOTA Methods
To demonstrate the performance of DMU-Net, we use the GF-7 self-annotated building dataset and select four excellent building extraction models for comparison, namely PSPNet, DeepLab v3+, EU-Net, and RU-Net [32,[53][54][55]. Among them, the feature extraction networks of PSPNet and DeepLab v3+ have the same structure as the single-stream CNN of SMU-Net.
As summarized in Table 2, which reports all three metrics on the GF-7 self-annotated building dataset, our proposed DMU-Net outperforms PSPNet, DeepLab v3+, EU-Net, and RU-Net and achieves a considerably high IoU (84.49%).
We adopt DMU-Net to extract the buildings of the whole study area. To improve the binary results of DMU-Net, remove isolated points, and fill in holes, a mathematical morphology method is employed based on 3 × 3 rectangular structuring elements: first, two opening operations are performed, and then three closing operations. Small non-building noise is then further eliminated to obtain the final building extraction result, which is vectorized on the ArcGIS platform. To correct the deformation of the buildings' vector boundaries and better fit the building edges, the polyline compression algorithm is used to regularize the building vector data. Figure 8c shows the vector result of the buildings extracted by DMU-Net in the study area. Most of the buildings are correctly extracted. Due to missing DSM data generated by GF-7, there is no extraction result for the buildings in the upper left area. To further analyze the results, we selected a typical local region from Figure 8b; this region contains high-rise independent buildings, contiguous low-rise buildings, and other ground objects similar to buildings. As shown in Figure 8d, DMU-Net can completely extract large buildings with regular and complete edges. However, due to the limited spatial resolution of the GF-7 image, the dense low-rise buildings (marked by the blue boxes) can only be extracted contiguously, and individual buildings cannot be distinguished.


Validity of NIR and nDSM Data
To explore the impact of NIR and nDSM data on building extraction, we fixed one stream to input RGB images, while the second stream input NIR, nDSM, and NIR + nDSM images, respectively. Based on the different inputs of the second stream, four models, M1, M2, M3, and M4, are designed. Figure 9 shows the building extraction results of the different models. Among them, M1 and M3 introduce nDSM images, which can effectively distinguish similar objects (such as playgrounds, squares, etc.), improve the completeness of large-building extraction, and avoid the adhesion of adjacent buildings. To further evaluate the effectiveness of our method, we quantitatively analyze the building extraction results of the different models. As shown in Table 3, for the GF-7 self-annotated building dataset, M1 has the highest building extraction accuracy. Compared with M4, the IoU of M1, M2, and M3 increased by 8.31%, 5.12%, and 7.61%, respectively, indicating that NIR and nDSM data help improve the accuracy of building extraction. At the same time, compared with M2, the IoU of M3 increased by 2.49%, showing that nDSM data contribute more to improving the accuracy of building extraction.

Comparison of Different Network Structures
To verify the contributions of the dual-stream CNN structure and the IFPN structure to DMU-Net, the dual-stream model with FPN (DMU-Net (FPN)), the single-stream model with IFPN (SMU-Net), the dual-stream model without IFPN (DU-Net), and the single-stream model without IFPN (IEU-Net) were constructed for comparison (Table 4). For a fair comparison, the input data of all models contain RGB + NIR + nDSM images. One stream of DMU-Net, DMU-Net (FPN), and DU-Net inputs RGB images, and the other stream inputs NIR + nDSM images; SMU-Net and IEU-Net directly input RGB + NIR + nDSM images. As shown in Figure 10, DMU-Net has clear advantages in the accurate extraction of adjacent buildings and the completeness of large-building extraction. According to Table 5, using the dual-stream structure and the IFPN structure can effectively improve the accuracy of building extraction. Compared with SMU-Net, DU-Net, and IEU-Net, DMU-Net improves the building IoU by 0.74%, 0.55%, and 1.65%, respectively. In addition, compared with DMU-Net (FPN), DMU-Net improves the building IoU by 0.22%, while the parameters and floating-point operations (FLOPs) of the model change little, showing the advantage of IFPN. Compared with the single-stream CNN structure, the trainable parameters and FLOPs of the dual-stream CNN structure increase by about 1.6 times. The IFPN structure has little effect on the model; the trainable parameters and FLOPs do not increase by more than 0.02 M. Table 4. Models of different network structures.

DMU-Net: The method proposed in this paper.
DMU-Net (FPN): Data are input by a dual-stream CNN and the FPN structure is retained.
SMU-Net: Data are input by a single-stream CNN and the IFPN structure is retained.
DU-Net: The IFPN structure is removed and the dual-stream CNN structure is retained.
IEU-Net: Data are input by a single-stream CNN and the IFPN structure is removed [46].

Different Fusion Methods
Multimodal data fusion methods are divided into data-level fusion, feature-level fusion, and decision-level fusion; we used feature-level fusion. The building extraction accuracy of the different fusion methods is shown in Table 6. The fusion method in this paper is the best in all three indicators: OA, IoU, and F1. The IoU increased by 0.58% and 2.08% compared with data-level fusion and decision-level fusion, respectively. Note: Due to memory limitations, each batch takes three training images.

Advantages of Regularization
To confirm that the regularization method adopted in this paper can effectively optimize the building extraction results, we refer to the PoLiS metric proposed by Avbelj et al. [56]. We evaluate the similarity of all predicted building vectors to the ground-truth building vectors by computing the overall mean PoLiS [57]; smaller values indicate a higher similarity between the predicted and actual building vectors. According to Table 7, although the OA, IoU, and F1 of the building extraction results after regularization are slightly reduced, PoLiS is halved, indicating that the regularized vector boundaries are more similar to the actual buildings. Figure 11 shows that after morphological processing and regularization, the building edges are more consistent with the natural shapes of the buildings, and the holes in the buildings are eliminated. Figure 11. The vectorization results of building extraction. The first row shows the vectorized results of the original building predictions; the second row shows the results after morphological processing and regularization.
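Under the definition of Avbelj et al., PoLiS is the symmetric mean distance from the vertices of each polygon to the boundary of the other. A plain-Python sketch follows; the vertex lists are illustrative, and this is a simplified reading of the metric rather than the evaluation code used in the paper.

```python
def _pt_seg(p, a, b):
    # Distance from point p to segment ab.
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == dy == 0:
        return ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return ((px - ax - t * dx) ** 2 + (py - ay - t * dy) ** 2) ** 0.5

def _pt_poly(p, poly):
    # Distance from a point to a closed polygon boundary.
    return min(_pt_seg(p, poly[i], poly[(i + 1) % len(poly)]) for i in range(len(poly)))

def polis(a, b):
    """PoLiS between two polygons given as (x, y) vertex lists: the mean
    vertex-to-boundary distance, averaged symmetrically over both
    polygons (sketch after Avbelj et al.)."""
    da = sum(_pt_poly(p, b) for p in a) / (2 * len(a))
    db = sum(_pt_poly(p, a) for p in b) / (2 * len(b))
    return da + db

# A unit test pair: a 2 x 2 square and the same square shifted by 0.5.
square = [(0, 0), (2, 0), (2, 2), (0, 2)]
shifted = [(0.5, 0), (2.5, 0), (2.5, 2), (0.5, 2)]
```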

Limitations and Future Works
Although the nDSM data constructed from GF-7 stereo pairs significantly improve building extraction accuracy, nDSM derived from multi-view satellite images is of lower quality than LiDAR data: its precision is limited, and occlusion leaves some buildings without height information, which degrades the performance of multi-view satellite data in building extraction. However, the main advantage of multi-view satellite imagery is that it can quickly acquire the height of ground objects over large areas, with strong timeliness and economic applicability. Improving the quality of nDSM data generated from multi-view satellite data such as GF-7 is therefore a promising direction for future work.
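An nDSM is conventionally obtained by subtracting a terrain model (DTM) from the photogrammetric surface model (DSM), leaving only above-ground height. The snippet below is a minimal sketch of that subtraction with zero-clamping; the elevation grids are illustrative values, and the paper's actual GF-7 production chain is not reproduced here.

```python
def ndsm(dsm, dtm):
    """Normalized DSM: above-ground height in metres, clamped at zero
    where the photogrammetric DSM dips below the terrain model."""
    return [[max(0.0, s - t) for s, t in zip(row_s, row_t)]
            for row_s, row_t in zip(dsm, dtm)]

# Illustrative 2x3 elevation grids (metres above sea level).
dsm = [[52.0, 58.5, 51.0],
       [49.5, 60.0, 50.5]]
dtm = [[50.0, 50.0, 50.5],
       [50.0, 50.5, 50.5]]
heights = ndsm(dsm, dtm)  # [[2.0, 8.5, 0.5], [0.0, 9.5, 0.0]]
```

The clamping step matters for stereo-derived data: matching noise routinely pushes the DSM below the terrain, and negative "heights" would otherwise leak into the height stream of the network.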
DMU-Net achieves the best performance on the GF-7 self-annotated building dataset. However, the dual-stream structure incurs greater computational and memory overhead than a single-stream one, which limits its applicability across different hardware. Reducing the computational cost of the dual-stream CNN is a bottleneck that must be overcome to ensure wide application, and more efficient multi-modal data fusion networks need to be proposed. In addition, the regularized post-processing method used in this paper requires complex hand-crafted rules, which undoubtedly adds to the task's workload. Integrating the regularization method into an end-to-end segmentation model is an exciting direction for future work.
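To make the hand-crafted nature of the post-processing concrete, one of the simplest morphological steps mentioned above, removing holes inside predicted buildings (cf. Figure 11), can be written as a border flood fill. This is an illustrative sketch of that single step under our own assumptions, not the paper's full rule set.

```python
from collections import deque

def fill_holes(mask):
    """Fill enclosed zero-regions in a binary building mask: flood-fill
    the background from the image border; any zero pixel the flood never
    reaches lies inside a building and is set to 1."""
    h, w = len(mask), len(mask[0])
    outside = [[False] * w for _ in range(h)]
    queue = deque((r, c) for r in range(h) for c in range(w)
                  if (r in (0, h - 1) or c in (0, w - 1)) and mask[r][c] == 0)
    for r, c in queue:
        outside[r][c] = True
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < h and 0 <= nc < w and mask[nr][nc] == 0 \
                    and not outside[nr][nc]:
                outside[nr][nc] = True
                queue.append((nr, nc))
    return [[1 if mask[r][c] == 1 or not outside[r][c] else 0
             for c in range(w)] for r in range(h)]

mask = [[0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 0, 1, 0],   # centre pixel is a hole inside the building
        [0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0]]
filled = fill_holes(mask)  # the enclosed centre pixel becomes 1
```

Each such rule (hole filling, edge straightening, corner snapping) has its own thresholds and special cases, which is exactly the workload an end-to-end learned regularizer would eliminate.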

Conclusions
In this paper, we propose a dual-stream multi-scale U-Net network, DMU-Net, to automatically and accurately extract urban buildings from GF-7 stereo images. In DMU-Net, a dual-stream CNN architecture is designed in the encoder to learn multi-dimensional features from different modal data, and the decoder introduces the IFPN structure to fuse features at different scales. Compared with four SOTA models, our model achieves the best results on the GF-7 self-annotated building dataset. In addition, the nDSM data constructed from GF-7 stereo images help improve building extraction accuracy; in particular, they help distinguish buildings from similar ground objects and improve the completeness of large buildings. In the future, we will develop more effective multimodal fusion models and regularization methods.