MSPNet: Multi-Scale Strip Pooling Network for Road Extraction from Remote Sensing Images

Extracting roads from remote sensing images can support a range of geo-information applications. However, it is challenging due to factors such as the complex distribution of ground objects and occlusion of buildings, trees, shadows, etc. Pixel-wise classification often fails to predict road connectivity and thus produces fragmented road segments. In this paper, we propose a multi-scale strip pooling network (MSPNet) to learn the linear features of roads. Motivated by the strip pooling being more aligned with the shape of roads, which are long-span and narrow, we develop a multi-scale strip pooling (MSP) module that utilizes strip pooling layers with long but narrow kernel shapes to capture multi-scale long-range context from horizontal and vertical directions. The proposed MSP module focuses on establishing relationships along the road region to guarantee the connectivity of roads. Considering the complex distribution of ground objects, the spatial pyramid pooling is applied to enhance the learning ability of complex features in different sub-regions. In addition, to alleviate the problem caused by an imbalanced distribution of road and non-road pixels, we use binary cross-entropy and dice-coefficient loss functions to jointly train our proposed deep learning model. Then, we perform ablation experiments to adjust the loss contributions to suit the task of road extraction. Comparative experiments on a popular benchmark DeepGlobe dataset demonstrate that our proposed MSPNet establishes new competitive results in both IoU and F1-score.


Introduction
The automatic extraction of road maps from very high-resolution remote sensing images (RSIs) is an important and active research area. It supports numerous applications that rely on the efficient, real-time updating of road maps, such as navigation, cartography, urban planning, location-based mobile services and autonomous driving. In disaster zones, especially in developing countries, maps and accessibility information are crucial for crisis response. Extracting roads from RSIs is a promising approach and has been studied for decades. Recently, the wide use of convolutional neural networks (CNNs) [1], especially networks with a fully convolutional network (FCN) [2] architecture, has greatly improved the accuracy of road extraction and made the task end-to-end trainable [3]. However, existing road extraction results are still not satisfactory, mainly because of complex urban traffic environments and the special characteristics of roads. Compared with most other ground objects with bulk shapes, such as buildings and trees, roads in remote sensing images are narrow and long-span, and can be broadly described as elongated regions with similar spectral and texture patterns. As a result, occlusion by trees, buildings, clouds, etc. often causes road extraction algorithms to produce fragmented road segments, which leads to disconnected road networks. Additionally, some other ground objects look visually similar to roads and are difficult to distinguish from them. To match the geometric and physical features of roads, road extraction methods are expected to optimize their results to some degree in order to reduce missing connections and false alarms [4].
Conventional expert-knowledge-based methods, such as knowledge-driven, template-matching, and object-oriented methods, were mainly adopted to extract roads from remote sensing images [5,6]. These hand-crafted methods often involve cumbersome steps and suffer from error accumulation, since they usually combine multiple algorithms to match the complex features of roads. Fixed hand-crafted criteria based on geometric and physical features are usually unsuitable and inefficient for large volumes of remote sensing data, and they cannot guarantee the connectivity of roads. With the rapid development of deep learning technologies, CNN-based methods make it possible to process large-scale remote sensing data [4]. CNN-based methods improve the accuracy of road extraction significantly compared with conventional methods and have become the mainstream in road extraction owing to the strong feature learning power of CNNs [7].
Road extraction can be viewed as a binary semantic segmentation problem in CNN-based methods. Several works have been proposed and applied successfully to road segmentation, especially those with an encoder-decoder architecture. In [8], a Dense-UNet model that combines an encoder-decoder architecture with dense connection units is proposed to extract the road network from remote sensing images. Combining the deep residual network, pyramid pooling module, and deep decoder, Han et al. [9] propose a deep residual and pyramid pooling network (DRPPNet) for extracting road regions from high-resolution remote sensing images. A dual-attention capsule U-Net (DA-CapsUNet) is designed in [10] for road map extraction by combining the properties of capsule representations and the features of attention mechanisms. Other typical works based on modified encoder-decoder structures include [4,11]. However, accurately extracting road maps remains challenging, mainly because of the special characteristics of thin and elongated roads, which are easily interrupted by trees, buildings, shadows, etc. These difficulties lead to fragmented road segmentation, and the above methods cannot guarantee the connectivity of roads. A possible solution to these problems is to enhance the embedding of linear features within CNN architectures.
In this paper, we propose a multi-scale strip pooling network (MSPNet) to address the above-mentioned problems. We first introduce an encoder-decoder architecture network to learn the feature of roads, where the Pyramid Pooling module (PPM) is adopted to increase the receptive field of feature points and learning ability of complex features in different sub-regions. Inspired by the strip pooling being more aligned with the shapes of roads that are long-span, narrow, and distributed continuously, we propose a multi-scale strip pooling (MSP) module to learn the linear features of roads, which is placed in the skip-connection paths. The MSP focuses on establishing relationships in the elongated road region between road and occluded road pixels to guarantee the connectivity of roads. Extensive experiments on popular benchmarks in terms of several metrics demonstrate the superiority of our MSPNet compared with several state-of-the-art methods.
The main contributions of this work are summarized as follows.
• We propose an end-to-end multi-scale strip pooling network (MSPNet) with symmetric encoder-decoder network design for the task of road extraction. This network design can preserve spatial detailed information and therefore optimize the smoothness of roads. In addition, it is also suitable for processing large-scale images.
• We develop a multi-scale strip pooling (MSP) module that utilizes strip pooling layers to aggregate multiple long-range contextual information. The linear features of roads are enhanced within CNN architecture, which thus improves the road connectivity.
• Ablation studies and comparative experiments on a benchmark DeepGlobe data set are performed to verify the effectiveness of our proposed MSPNet.
The remainder of this article is organized as follows. Section 2 introduces the related works of road extraction. In Section 3, we describe datasets, evaluation metrics, and implementation details, and we illustrate our proposed MSPNet in detail. Extensive experiments are performed to evaluate the performance of the proposed method for road extraction in Section 5. The conclusion and discussion are presented in Section 6.

Related Work
The literature on automatic road extraction can be divided into two categories: expert knowledge-based methods and CNN-based ones. Although CNN-based methods improve the accuracy significantly owing to their powerful feature-embedding abilities, the way traditional methods exploit the geometric and physical properties of roads still provides inspiration for future research. In this section, we briefly introduce these two categories of methods.

Expert Knowledge-Based Methods
Conventional expert knowledge-based algorithms for road extraction usually match roads by their geometric and physical features. Fu et al. [12] propose a road detection method based on a Circular Projection (CP) matching and tracking strategy, which is beneficial for detecting winding roads. Xu et al. [5] present a morphological method that combines automatic thresholding and morphological operations to extract roads from remote sensing images. Wang et al. [13] propose an automatic road extraction method for vague aerial images with an improved Canny edge detection operator and Hough line transform algorithm. Herumurti et al. [14] propose a road extraction method based on zebra crossing detection. Song et al. [15] develop an approach for road extraction that uses pixel spectral information for classification together with object features derived from image segmentation. However, these methods based on hand-designed road features are generally inefficient and unsuitable for large-scale remote sensing data.

CNN-Based Methods
(1) Segmentation of Roads: Semantic segmentation is a basic and essential research domain in computer vision. With the great success of deep learning in the field of semantic segmentation, some studies [3,16,17] treat road extraction as a binary semantic segmentation problem and solve it with CNN-based approaches. Mnih et al. [18] were the first to present a neural network-based approach with a restricted Boltzmann machine (RBM) for detecting roads in high-resolution aerial images. Some deep learning models with encoder-decoder structures such as UNet [19] and LinkNet [20] have been proven to be efficient in the field of semantic segmentation, and their variants have also been widely proposed to segment roads [8,10,21]. A semantic segmentation neural network, which combines the strengths of residual learning and U-Net, is proposed for road area extraction by Zhang et al. [22]. Zhou et al. [23] follow the LinkNet architecture and employ dilated convolution layers in both cascade mode and parallel mode to enlarge the receptive field. Zhou et al. [24] propose HsgNet, which inserts a Middle Block based on bilinear pooling into the middle of LinkNet between the encoder and decoder. These methods usually perform better than traditional methods, but they cannot guarantee the connectivity of roads and thus produce fragmented road segments.
(2) Connectivity of Roads: Recently, several works have been proposed to obtain segmentation results with better road connectivity. Li et al. [25] put forward a road extraction method based on a LinkNet deep learning model, in which an auxiliary constraint task is designed at the pre-processing step to solve the connectivity problem caused by occlusions. Zhou et al. [21] propose a fusion network (FuNet) that fuses remote sensing imagery and location data, and a universal iteration reinforcement (IteR) module is added to enhance the ability of road connectivity reasoning. Meanwhile, Zhang et al. [26] introduce a deep learning-based multistage framework to extract the road surface and road centerline simultaneously. They initially segment roads with an FCN-based model, after which an iterative search strategy is applied to track consecutive and complete road networks. However, the iterative steps are time-consuming. The authors of [6] employ a novel linearity index to discriminate elongated road segments from other objects, together with customized tensor voting to fill missing parts of the road network. In these approaches, pre-processing or post-processing is added to maintain the connectivity of roads. However, they are time-consuming and not suitable for areas with high road density and occlusions.
In conclusion, although the CNN-based methods have greatly improved the accuracy of road extraction, most of them are simple extensions of the widely used CNN architectures and do not consider the structural features of roads. Therefore, there are still margins to improve the accuracy of road extraction in terms of topological connectivity.

Dataset
A public road extraction dataset, DeepGlobe [27], is used to evaluate the performance of the proposed method. The dataset provides images with a size of 1024 × 1024 pixels and a ground resolution of 50 cm/pixel, covering multiple scenes such as cities, villages, wild suburbs, seashores, tropical rainforests, etc. The dataset contains 6226 images and corresponding annotated ground truth labels. In this paper, we divide it into 4696 images for training and 1530 for testing following [4]. Some samples are shown in Figure 1. Data augmentation is a common and useful strategy, which can enhance the generalizability of deep learning models. In this paper, data augmentation is applied by flipping and shifting the images randomly with a probability of 0.5. The visualization of data augmentation is shown in Figure 2.
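The random flip and shift described above can be sketched as follows. This is only an illustration on small nested lists, not the authors' pipeline: the function names, the choice of a horizontal flip, the `max_shift` range and the zero padding are all assumptions. The one point taken from the text is that each transform fires with probability 0.5, and the sketch applies the same transform to the image and to its label so that they stay aligned.

```python
import random


def hflip(grid):
    """Mirror a 2-D grid (list of rows) left to right."""
    return [row[::-1] for row in grid]


def shift(grid, dx, dy, fill=0):
    """Translate a 2-D grid by (dx, dy) pixels, padding exposed cells with `fill`."""
    h, w = len(grid), len(grid[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = grid[y][x]
    return out


def augment(image, mask, rng, p=0.5, max_shift=2):
    """Apply the same random flip and shift to an image and its road mask.

    `max_shift` and the zero fill value are hypothetical choices.
    """
    if rng.random() < p:
        image, mask = hflip(image), hflip(mask)
    if rng.random() < p:
        dx = rng.randint(-max_shift, max_shift)
        dy = rng.randint(-max_shift, max_shift)
        image, mask = shift(image, dx, dy), shift(mask, dx, dy)
    return image, mask
```

Passing a seeded `random.Random` instance as `rng` keeps the augmentation reproducible across runs.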

Evaluation Metrics
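In this paper, to evaluate the performance of our proposed method and other methods, we use the metrics of overall accuracy (OA), Intersection over Union (IoU), Mean Intersection over Union (MIoU), and F1-score. These are the most widely used measurements in both road extraction and other segmentation tasks [28]. They are defined by:
$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \mathrm{IoU} = \frac{TP}{TP + FP + FN}, \quad \mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$
in which
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \quad \mathrm{Recall} = \frac{TP}{TP + FN},$$
where TP, FP, TN and FN are the numbers of true positive, false positive, true negative and false negative pixels, respectively. MIoU is the mean of the IoU values of the road and background classes.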
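As a worked illustration, the metrics can be computed from the four confusion counts as below. The function name and the zero-division guards are our own additions, and MIoU is taken here as the mean of the road and background IoU, which is the usual two-class definition and is assumed rather than quoted from the paper.

```python
def road_metrics(tp, fp, tn, fn):
    """Compute OA, IoU, MIoU and F1 from pixel-level confusion counts.

    Road is the positive class. Empty denominators fall back to 0.0.
    """
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    iou_road = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    # Background IoU swaps the roles of TP and TN.
    iou_background = tn / (tn + fp + fn) if tn + fp + fn else 0.0
    return {
        "OA": (tp + tn) / total if total else 0.0,
        "IoU": iou_road,
        "MIoU": (iou_road + iou_background) / 2,
        "F1": f1,
    }
```

With only a few road pixels in an image, OA stays high even for a poor road prediction, which is why IoU and F1 are the more informative scores for this task.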

Network Structure
We propose a multi-scale strip pooling network (MSPNet) for road extraction from remote sensing images, as illustrated in Figure 3. Encoder-decoder networks have been applied to many computer vision tasks, and their superiority has been validated [19,29]. Therefore, we adopt one as the overall architecture of our proposed MSPNet. ResNet [30] is widely used in image recognition because of its outstanding performance in feature learning. The ResNet family contains a series of residual network models with different numbers of layers. A CNN-based model with more layers generally performs better, but it requires a higher computational cost and longer training time. We will provide analysis of the performance and parameter cost for the ResNet series in the next subsection. Here, we employ ResNet-34 pre-trained on ImageNet [31] as the encoder.
Roads in remote sensing images are narrow, long-span, and mostly straight, so segmentation models often produce fragmented road segments. To improve the connectivity of roads, our proposed multi-scale strip pooling (MSP) module is added to each of the first three skip-connection paths to extract linear features of roads at multiple scales. Unlike roads, most background objects have bulk shapes. Considering the advantages of traditional pooling, we adopt the Pyramid Pooling Module (PPM) [32], which is added to the last skip-connection path, to effectively learn complex features. Given the complex distribution of roads and other objects, the use of PPM enhances the learning ability of complex features and thus improves the performance. In the decoder module, we apply transposed convolution to upsample the feature maps to an appropriate size and 1 × 1 convolution to adjust the number of channels; the result is then concatenated with the corresponding output feature map of the MSP module in the skip-connection path.
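To make the layout concrete, the short sketch below traces the feature-map sizes on the four skip paths for a 1024 × 1024 input. It assumes that the skip connections are taken from the four residual stages of a standard ResNet-34 (64, 128, 256 and 512 channels at strides 4, 8, 16 and 32); the text does not state this explicitly, and the stage names simply follow the common `layer1` to `layer4` convention. The function itself is hypothetical.

```python
def mspnet_skip_shapes(input_size=1024):
    """List (stage, module, channels, height, width) for each skip path.

    Assumes a standard ResNet-34 encoder; the first three paths carry an
    MSP module and the last one carries the PPM, as described in the text.
    """
    # (stage name, output channels, cumulative stride) of ResNet-34
    stages = [
        ("layer1", 64, 4),
        ("layer2", 128, 8),
        ("layer3", 256, 16),
        ("layer4", 512, 32),
    ]
    plan = []
    for index, (name, channels, stride) in enumerate(stages):
        module = "MSP" if index < 3 else "PPM"
        side = input_size // stride
        plan.append((name, module, channels, side, side))
    return plan


if __name__ == "__main__":
    for row in mspnet_skip_shapes():
        print(row)
```

Under these assumptions the MSP modules operate on 256 × 256, 128 × 128 and 64 × 64 maps, while the PPM sees the coarsest 32 × 32 map.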

Multi-Scale Strip Pooling
Most natural objects often have bulk shapes. Accordingly, the traditional kernel shape of the pooling layer in most CNN architectures is designed to be square for feature learning, which is suitable for most computer vision tasks. However, the roads in remote sensing images are narrow, long-span and can be described as elongated areas. Traditional pooling layers with square kernel shapes neglect the modeling of linear features of roads. By contrast, the strip pooling is more in line with the shape of the roads, which utilizes a long but narrow kernel to capture long-range dependencies in road regions and thus enhance the embedding of linear features within CNN models. The strengthened learning ability of linear features is helpful for retaining the connectivity and making the segmentation results more complete.
Motivated by the above fact and the advantages of strip pooling, we develop a multi-scale strip pooling (MSP) module to help our MSPNet generate road networks with better connectivity. As shown in Figure 4, MSP uses multiple strip pooling layers with long but narrow kernel shapes to capture multi-scale long-range context from horizontal and vertical directions. In addition, the two directions we choose are also aligned with the distribution of most roads in remote sensing images. Let $X \in \mathbb{R}^{H \times W}$ denote the input tensor for the MSP module, where $H$ and $W$ represent the height and width. In the MSP module, $X$ is first fed into two pathways along the horizontal and vertical spatial dimensions, each of which contains a strip pooling layer with a long but narrow kernel of shape $\frac{H}{r} \times r$ or $r \times \frac{W}{r}$ to extract linear features of roads, where $r$ is the scaling factor for adjusting the kernel sizes. Let $y^h_r$ and $y^v_r$ be the output feature maps extracted by the two strip pooling layers along the horizontal and vertical directions. Then, we upsample the two feature maps to the size of the input tensor by using a bilinear upsampling layer. Afterwards, we combine the two feature maps $y^h_r$ and $y^v_r$ to obtain $y_r$. The feature maps $y_r$ contain rich long-range contextual information of roads at different scales, depending on the scaling factor $r$. The appropriate selection of the scaling factor $r$ is usually based on experience. In this paper, $r$ is set to 1, 3 and 7, respectively, to obtain three feature maps containing information at three different scales. Like [32], the features of different scales are concatenated as the final pooled global feature, which can be formulated as:
$$y = \mathrm{Concat}(y_{r=1}, y_{r=3}, y_{r=7})$$
Finally, the output $z$ of the MSP module can be written as:
$$z = \mathrm{Scale}(X, \alpha(f(y)))$$
where $\mathrm{Scale}(\cdot, \cdot)$ represents element-wise multiplication, $\alpha$ is the sigmoid function and $f$ is a 1 × 1 convolution.
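The idea can be illustrated with a minimal single-channel sketch at scale r = 1, where the two strips span a whole row and a whole column. Several points here are assumptions rather than the paper's design: the two directional maps are fused by addition, the 1 × 1 convolution is reduced to a hypothetical scalar `weight` and `bias`, and the multi-scale concatenation is omitted. At r = 1, expanding a pooled strip back to H × W amounts to plain replication.

```python
import math


def strip_pool_gate(x, weight=1.0, bias=0.0):
    """Strip pooling at r = 1 on one channel, followed by sigmoid gating.

    Returns (context, gated), both H x W nested lists.
    """
    h, w = len(x), len(x[0])
    # Horizontal strip: one average per row.
    row_mean = [sum(row) / w for row in x]
    # Vertical strip: one average per column.
    col_mean = [sum(x[i][j] for i in range(h)) / h for j in range(w)]
    context, gated = [], []
    for i in range(h):
        ctx_row, out_row = [], []
        for j in range(w):
            # Fusing the two directions by addition is an assumption.
            y = row_mean[i] + col_mean[j]
            gate = 1.0 / (1.0 + math.exp(-(weight * y + bias)))
            ctx_row.append(y)
            out_row.append(x[i][j] * gate)  # Scale(X, sigmoid(f(y)))
        context.append(ctx_row)
        gated.append(out_row)
    return context, gated


if __name__ == "__main__":
    # A horizontal road in row 2 with an occluded pixel at column 2.
    x = [[0.0] * 5 for _ in range(5)]
    x[2] = [1.0, 1.0, 0.0, 1.0, 1.0]
    context, _ = strip_pool_gate(x)
    print(context[2][2], context[0][2])
```

In the example, the occluded pixel on the road receives a much larger context value than an off-road pixel in the same column, because the row-wise strip gathers evidence from the visible road pixels on both sides of the gap.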

Loss Function
Road extraction can be formulated as a pixel-wise binary classification task in semantic segmentation. Cross-entropy is defined as a measure of the difference between two probability distributions for a given random variable. In the deep learning domain, the binary cross-entropy (BCE) loss function is used to optimize models in binary classification tasks. Assuming that the size of the input image is H × W, the BCE loss function is calculated as follows:
$$L_{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]$$
where $y_i$ is the ground truth denoting road or background for the pixel at position $i$, $p_i$ is the corresponding probability predicted by the model, and $N = H \times W$. The BCE loss function evaluates the predicted class of each pixel separately and then averages over all pixels, so all pixels are learned equally. Thus, it is difficult to learn the features of road pixels when there are far fewer road pixels than background pixels. The dice-coefficient loss function is introduced to alleviate this problem caused by sample imbalance. Compared with the BCE loss function, the dice-coefficient loss function directly supervises the similarity of prediction and ground truth [33], which can be calculated as follows:
$$L_{Dice} = 1 - \frac{2 \sum_{i=1}^{N} y_i p_i}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} p_i}$$
Because of the imbalance of road and non-road pixels, previous work [11,23] trains deep learning models with a simple combination of the binary cross-entropy (BCE) and dice-coefficient (Dice) loss functions, which can be defined as:
$$L = L_{BCE} + L_{Dice}$$
This simple combination gives both terms the same weight, which may lead to suboptimal training results in road extraction tasks. Therefore, the final loss function used in this paper is modified as follows:
$$L = (1 - K) L_{BCE} + K L_{Dice}$$
where the adjustment factor $K$ is set to balance the loss contributions of the BCE and dice loss functions.
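A minimal sketch of the joint loss on flattened pixel lists is given below. It assumes the weighted form (1 − K) · BCE + K · Dice, which matches the ablation settings reported later (K = 0 is BCE only, K = 1 is dice only) but is not quoted from the paper. The clamping constant `eps` and the smoothing term in the dice ratio are our own numerical safeguards, and the function names are hypothetical.

```python
import math


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over all pixels."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(probs)


def dice_loss(probs, targets, eps=1e-7):
    """One minus the soft dice coefficient; eps guards empty masks."""
    intersection = sum(p * y for p, y in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def joint_loss(probs, targets, k=0.2):
    """Weighted sum of BCE and dice loss (assumed form of the final loss)."""
    return (1.0 - k) * bce_loss(probs, targets) + k * dice_loss(probs, targets)
```

A quick check of the imbalance argument: predicting background everywhere on an image with a few road pixels yields a small BCE value but a dice loss close to 1, so the dice term keeps penalizing the missed road class.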

Implementation Details
The proposed method is implemented on the PyTorch machine learning framework and is trained on two NVIDIA GeForce RTX 2080 Ti GPUs with 11 GB memory. The source code will be made available at: https://github.com/Shenming-Qu/MSPNet (accessed on 25 March 2022).
During the experiments, due to the limitation of GPU memory size, the batch size is set to 8. Following most previous works [34,35], we adopt stochastic gradient descent (SGD) with momentum as the optimizer, with the weight decay set to 0.0005 and the momentum set to 0.9. We adopt the "poly" learning rate policy, $\mathrm{lr} = \mathrm{base\_lr} \times \left(1 - \frac{\mathrm{iter}}{\mathrm{max\_iter}}\right)^{\mathrm{power}}$, to gradually reduce the learning rate, where the base learning rate is set to 0.005 and power is set to 0.9. The number of training epochs is set to 200 by default. Finally, we save the trained model for testing the performance on the test set.
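The "poly" policy is short enough to state directly in code. The sketch below uses the base learning rate and power given above; the function name and the iteration counts in the example are illustrative only.

```python
def poly_lr(iteration, max_iter, base_lr=0.005, power=0.9):
    """Learning rate under the "poly" decay policy."""
    return base_lr * (1.0 - iteration / max_iter) ** power


if __name__ == "__main__":
    # Print the rate at a few points of a hypothetical 1000-iteration run.
    for it in (0, 250, 500, 750, 1000):
        print(it, poly_lr(it, 1000))
```

The rate starts at the base value and decays smoothly to zero at the final iteration, slightly faster than a linear schedule would near the end because the power is below 1.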

Comparison of Backbone Networks
This subsection compares the performance and parameter counts of ResNet series networks as the encoder in the proposed model, with the purpose of selecting the appropriate ResNet model for subsequent research. Figure 5 plots the progression of F1-score values during training when different ResNet models are used as the backbone of the encoder. Table 1 summarizes the F1-score and the number of parameters of each model. The experiments show that the performance of the network improves as the number of parameters increases. ResNet-101 performs best, with an F1-score of 85.14%, while ResNet-18, which has the fewest layers, performs worst, 3.35% lower than ResNet-101. ResNet-50 and ResNet-34 follow ResNet-101, reaching F1-scores of 84.92% and 84.51%, respectively. Apart from ResNet-18, the models achieve similar results. However, ResNet-50 and ResNet-101 have several times as many parameters as ResNet-34, while they give only a trivial performance gain. Since road extraction is a simple binary segmentation problem that does not need to model complex background information, ResNet-34 has sufficient capacity for road extraction tasks. Therefore, we choose ResNet-34 as the backbone of the encoder based on performance and parameter considerations.
To alleviate the problem caused by the imbalance between road and background pixels described in Section 3.5, the loss function used to optimize our model contains a manually selected hyper-parameter K, which balances the loss contributions of the BCE and dice loss functions. Choosing an appropriate K may therefore help the model reach a local optimum quickly and improve performance.
We train our proposed model with different values of the hyper-parameter K in ascending order, while all other conditions are kept the same, to select an optimal K. Considering representativeness and the number of experiments, K is set to 0, 0.2, 0.4, 0.6, 0.8 and 1, respectively.
The experimental results with different values of the hyper-parameter K are shown in Figure 6. It can be seen that the F1-score and MIoU fluctuate as K increases. With K = 0.2, the model achieves the best MIoU score of 86.74% and the best F1-score of 84.51%. The F1-score drops to 82.28% with K = 0 and to 79.41% with K = 1; this comparison shows that using the BCE (K = 0) or dice (K = 1) loss function alone cannot obtain the optimal results for road extraction. As listed in Table 2, the performance with K = 0.2 is also better than the simple combination of "BCE + Dice", with improvements of 1.48% on the MIoU score and 0.75% on the F1-score.
To further illustrate the influence of different weight combinations of the dice and BCE loss functions, we plot the changes in the loss value during training with respect to the number of epochs, as shown in Figure 7. It can be seen that the loss curve with K = 0.2 is smoother than the others, which indicates a reasonable allocation of loss weights. To better optimize our model, we adopt K = 0.2 in the method evaluations.

Comparison with State-of-the-Art Methods
To evaluate the performance of the proposed model for road extraction from remote sensing images, we compare it with several baseline and state-of-the-art methods: FCN [2], ResUNet [22], D-LinkNet [23], and SE-DeepLab [36]. ResUNet is built with residual learning and UNet; D-LinkNet is a variant of the LinkNet architecture with additional dilated convolution layers in the center part, which achieved the best performance in the CVPR DeepGlobe 2018 Road Extraction Challenge [27]; SE-DeepLab employs the structure of DeepLab v3 and incorporates a squeeze-and-excitation (SE) module. All these models are trained with the same learning rates and the same data processing to ensure fairness. Table 3 reports the quantitative results of the compared methods on the DeepGlobe dataset. The accuracy of FCN is lower than that of the other methods, which is mainly due to the loss of spatial details. D-LinkNet, with its additional dilated convolution layers, outperforms ResUNet, obtaining an improvement of 1.5% on the IoU score and 0.67% on the F1-score. The SE module in SE-DeepLab improves the IoU score by 7.62% and the F1-score by 4.01% compared with D-LinkNet.
Compared with those methods, the proposed MSPNet achieves the best performance in OA, IoU, and F1-score. For example, MSPNet obtains an F1-score of 84.51% and an IoU score of 73.64%, which are better than those of SE-DeepLab by 1.78% and 2.27%, respectively. We attribute these significant performance gains to three factors. (1) Due to the symmetric network design with skip-connections, the proposed MSPNet is able to preserve low-level spatial details and thus extract roads with smoother boundaries from remote sensing images. (2) The strip pooling design in our proposed model enhances the embedding of linear features, which greatly improves the connectivity of roads; therefore, the proposed MSP module can further improve the segmentation accuracy. (3) The joint supervision of BCE and dice loss alleviates the class imbalance problem. Some visualized segmentation results of our MSPNet and the other methods are illustrated in Figure 8, which shows three examples from different scenes. The corresponding distributions of FP and FN are plotted in Figure 9. The results extracted by FCN are worse than those of the other methods, which is mainly caused by the loss of spatial information after multiple downsampling operations in its early layers. The results of ResUNet and D-LinkNet are similar, and both contain many missing connections and FP pixels. SE-DeepLab shows better road connectivity than these methods and obtains the second-best results. In comparison, our proposed method extracts roads with better connectivity and smoother road edges. The roads segmented by other methods may be interrupted, especially in regions where occlusions exist, while our MSPNet recovers the connectivity very well by effectively capturing long-range dependencies along road regions.
For example, in the results for rural areas (the first row in Figure 8), there are several residential houses in the upper-right and lower-left corners of the image, with many trees planted on both sides; the results extracted by other methods contain some broken road segments and fail to maintain the connectivity, but the result extracted by MSPNet is consistent with the ground truth. In addition, in the results for densely connected road areas (the second and third rows in Figure 8), other methods recognize some branch roads as background, while the results of the proposed method are consistent with the ground truth and contain very few FP pixels. These visualized segmentation results verify the superiority of our MSPNet in the task of road extraction from remote sensing images.

Discussion
The above experimental results demonstrate that our proposed MSPNet obtains competitive performance compared with other state-of-the-art methods. However, the road maps extracted by all the considered methods still have some interruptions. This situation occurs in some special surface environments. We show two samples in Figure 10. In the upper-left corner of Figure 10 (the first row), the road regions are severely occluded by a large number of trees and are difficult to distinguish. Figure 10 (the second row) shows some areas of farmland, where the color of the roads is very similar to the surrounding environment. Our MSPNet fails to predict complete road maps in these challenging regions. In future studies, other prior information, such as the direction of the roads, will be introduced to segment the roads in these challenging regions, which may be helpful for generating more complete results in occluded regions.
To fairly evaluate different methods, as shown in Table 4, we list the number of floating-point operations (FLOPs) and the inference time of our MSPNet and several compared methods. The comparison experiments of all methods are run on a workstation with an NVIDIA RTX 2080Ti GPU. For fair comparison, the FLOPs and inference time are calculated based on an input size of 1024 × 1024. It can be seen that the proposed MSPNet obtains a competitive inference time at a reasonable computational cost compared with other methods. The comparison results illustrate that our proposed MSPNet is suitable for road extraction from remote sensing images.
During the training process, we apply data augmentation for improving the generalization of the proposed model. We also perform comparative experiments to show the contribution of data augmentation, as shown in Table 5. We find that it gives an improvement of 0.38 in terms of IoU and 0.44 in F1-score. The results demonstrate that data augmentation is a useful strategy to improve performance.

Conclusions
In this paper, we propose an end-to-end road segmentation network for road extraction from remote sensing images. Although CNN-based methods have greatly improved the accuracy of road extraction over traditional approaches, road connectivity should be further improved to generate more complete results. As one of the important geometric and topological properties, road connectivity is necessary for autonomous driving, vehicle navigation, and route planning. However, existing CNN-based methods often fail to predict road connectivity and thus produce fragmented road segments. By comparison, our proposed MSPNet generates road segmentation results with better connectivity and therefore meets the requirements of large-scale remote sensing data analysis. Specifically, the proposed MSPNet strengthens the linear features of roads by introducing strip pooling layers, whose kernel shapes are more in line with the shape of roads. Accordingly, a multi-scale strip pooling (MSP) module is developed to learn long-range contextual information at multiple scales. The widely used design of a symmetric encoder-decoder network with skip-connections is adopted so that low-level features can be used to recover spatial details, which is beneficial for parsing high-resolution remote sensing images. Moreover, to alleviate the problem caused by unbalanced road and background pixels, we have performed ablation experiments to adjust the loss contributions of the binary cross-entropy and dice-coefficient loss functions to suit the task of road extraction. We have also compared the performance and computational cost of ResNet series models to select an appropriate backbone network. Experimental results on the popular DeepGlobe benchmark dataset show the superiority of our proposed MSPNet compared with several mainstream methods.
As discussed in the above section, some road types whose spectral and texture characteristics are similar to those of the background are not well identified, and some discontinuities remain. Since road extraction can be viewed as a binary classification problem, it may be useful to suppress background noise to improve the generalization ability of the road segmentation model. This is left for our future studies.