Road Extraction in Mountainous Regions from High-Resolution Images Based on DSDNet and Terrain Optimization

Abstract: High-quality road network information plays a vital role in regional economic development, disaster emergency management and land planning. To date, studies have primarily focused on samples of flat urban roads, while fewer have paid attention to road extraction in mountainous regions. Compared with road extraction in flat regions, road extraction in mountainous regions suffers more interference from mountain shadows and road-like terrain. Furthermore, more practical problems arise when studying an entire region rather than individual samples. To address these difficulties, this paper takes Jiuzhaigou county in China as an example and studies road extraction in practical applications. Based on deep learning methods, we used a multistage optimization method to improve the extraction results. First, we used the contrast limited adaptive histogram equalization (CLAHE) algorithm to attenuate the influence of mountain shadows and improve image quality. Then, roads were extracted by the improved DSDNet network. Finally, a terrain constraint method was used to reduce false detections caused by terrain factors, yielding the final road extraction result. To evaluate the road extraction comprehensively, we used multiple data sources (i.e., points, raster and OpenStreetMap data) in different evaluation schemes to verify the accuracy of the results. The accuracy of our method for the three schemes was 0.8631, 0.8558 and 0.8801, which is higher than that obtained by the other methods. The results show that our method can effectively mitigate the interference of shadow and terrain encountered in road extraction over mountainous regions, significantly improving the extraction results.
Across the multiple evaluation schemes, the mountain road extraction method proposed in this paper achieved the highest extraction accuracy and has high practical value. The CLAHE algorithm improved results, especially for quality degradation caused by fog and for the interference with road spectral information caused by mountain shadows. With DSDNet, road features can be extracted more accurately. Postprocessing with terrain constraints reduces false detections.


Introduction
High-quality road network information plays an important role in many practical applications [1]. In urban regions, extracted road network information can be used for map making and urban planning. In mountainous regions, it can be used to adjust travel route plans and to inform emergency decision-making for disasters such as earthquakes and floods. Road extraction from remote sensing imagery has gained considerable attention, but the task remains challenging owing to irregular and complex road sections and structures [2].
With the recent development of artificial intelligence, especially deep learning technology, new ideas have been provided for road extraction [3]. Deep learning was developed from research on neural networks and has made significant advancements in computer vision [4,5] and other fields. It has shown great advantages in image scene classification [6,7], target detection [8,9] and semantic segmentation [10][11][12]; the latter is a useful tool for road extraction [3], so we use it here as the basis for our method. However, from the perspective of practical applications, there are many special issues that need to be studied. The problems associated with extracting roads from complete mountainous regions are summarized as follows:
(1) Shadowed areas of mountains change the observed spectrum. Unlike urban architectural shadows, shadows in mountainous regions cover large areas.
(2) Mountainous regions are cloudy on most days of the year, so it is often difficult to guarantee a high-quality data source. This places higher requirements on feature extraction and recognition.
(3) Due to mountain topography, the spectrum and shape of valleys and of bare regions where water once flowed are very similar to those of roads, which can cause false detection.
The above problems are based on the complete regional analysis of Jiuzhaigou county. If we had selected only some samples, some of these problems might have been overlooked. Therefore, it is necessary to study the road extraction of a complete region. Here, we address the problems discussed as follows:
(1) We enhance the image quality through the contrast limited adaptive histogram equalization (CLAHE) algorithm to alleviate the problems of mountain shadows and low-quality images.
(2) We propose a new network to improve the extraction ability of the neural network by using the squeeze-and-excitation module and integrating dense upsampling convolution into the encoder-decoder model. We also improve the learning rate decay method and activation function.
(3) The interference caused by the complex terrain is eliminated using a post-processing method. We use the least squares method to fit the road direction; then we calculate the road-gradient angle and the road-direction slope through topographic analysis to constrain the road extraction results.
This article uses a variety of precision evaluation schemes to study mountain road extraction from the perspective of practical applications, including cross-validation using raster data, validation using point data, and validation using OSM (OpenStreetMap, https://www.openstreetmap.org/) data.
The remainder of this article is organized as follows: Section 2 reviews the relevant literature on road extraction; Section 3 introduces the principle of the method proposed in this article; Section 4 introduces the research region and experimental plan of this article; Section 5 introduces the basic accuracy evaluation indicators and three accuracy evaluation schemes used in this article; Section 6 introduces the experimental results and related discussions and Section 7 summarizes our results and discusses future research directions.

Related Work
The mountain road network can be drawn by manual editing and field inspection. However, these methods consume a lot of manpower and financial resources, and they are difficult to apply for large-scale updates over a short time. Many automated methods have been used to extract road information, such as spectral feature-based methods, object-oriented methods, shallow machine learning methods and deep learning methods. Previous studies have classified existing road extraction algorithms from different points of view [13][14][15]. Considering the particularity of deep learning methods, in this article we divide the existing methods into non-deep learning methods and deep learning methods.

Non-Deep Learning Methods
Some useful road extraction algorithms do not come from learning methods. Gruen et al. [16] proposed a semiautomatic road extraction scheme that used wavelet decomposition and a model-driven linear feature extraction algorithm. It is robust to gaps and long roads, but the extraction accuracy needs to be improved. Hinz et al. [17] proposed a model that comprised explicit knowledge about geometry, radiometry, topology and context. They also considered the influence of some mountain terrain factors. Their approach is suited for images with a resolution from 0.2 to 0.5 m, but there are still obvious discontinuities in their results, and their model is complex and was only tested on high-quality images in small areas. Zhang et al. [14] proposed a method combining the stroke width transform and mean shift. This method performs image segmentation through mean shift and extracts roads through the stroke width transform. When the road features are obvious, this method can achieve good results. However, the applicability of the algorithm is greatly affected by surrounding objects in the image. Zhao et al. [18] proposed a method based on the marked point process with local structure constraints. This method uses a random point process to define the position and uses line segments as the marks to define the geometric structure. By using the characteristics of the road, the road network is extracted by combining Bayesian theory and the reversible jump Markov chain Monte Carlo algorithm. This method has good extraction accuracy, but it has difficulty extracting the contour information of the road. Peng et al. [19] used an adaptive spatial filtering algorithm to control the watershed algorithm for segmentation. They extracted roads through features such as geometry and texture. This method can extract roads with high accuracy, but its extraction of contour edge information is problematic.
Shallow machine learning methods have also been used in road extraction. Soni et al. [15] proposed a supervised multilevel framework based on the least squares support vector machine (LS-SVM), mathematical morphology and road shape features to extract road networks from remote sensing images. This method uses the LS-SVM to segment the image into road and non-road regions, and then uses the morphological and shape features to extract non-road objects. This method has good results with high-quality images, but does not perform as well in regions with complex spectra and complex road junctions.
The above methods achieve relatively good road extraction results, but they have obvious drawbacks, including the demand for high image quality, poor contour boundaries and complicated feature design and extraction.

Deep Learning Methods
The rise of deep learning technology has had a strong impact in the field of image processing. Many scholars have conducted in-depth research on the application of deep learning in road extraction [20,21] and different effective neural network structures have been applied and improved for road extraction.
The use of CNN has shown good results in road extraction. Bastani et al. [22] proposed the RoadTracer method, which uses an iterative search process guided by CNN-based decision-making functions to directly derive a road network map from the output of CNN with a relatively small error rate. Qi et al. [23] combined the structure of multiscale convolution and attention mechanism with the LinkNet network and obtained the ATD-LinkNet, which can effectively use spatial and semantic information in remote sensing images. In view of the low quality of road boundary extraction and the discontinuity of road extraction, Zhou et al. [1] proposed a boundary and topological-aware road extraction network (BT-RoadNet), a coarse-to-fine architecture composed of a coarse map prediction module (CMPM) and a fine map prediction module (FMPM); this method handles interruptions caused by shadows and occlusions well.
To solve the problem of limited quantity of data, some scholars applied generative adversarial networks (GANs) [24] to road extraction. Zhang et al. [15] proposed a method based on the generative adversarial network, which showed better performance than other methods on the Massachusetts roads dataset. GAN can also be used to estimate the roads covered by trees or shadows. Zhang et al. [25] proposed a multi-supervised generative adversarial network (MsGAN), which learns the spectral and topology features. Costea et al. [26] proposed a new dual-hop generative adversarial network (DH-GAN) and applied a smoothing-based graph optimization to road extraction. This method performed well on a large dataset with European roads.
Deep transfer learning [27] is also used for road extraction. Transfer learning refers to a learning process that uses the similarity between data, tasks or models to apply the information learned in the previous domain to a new domain [28]. Deep transfer learning is the method that uses deep models such as deep neural networks for transfer learning [29]. Senthilnath et al. [30] used deep transfer learning and ensemble classifier to extract roads from UAV (unmanned aerial vehicle) imagery. He et al. [31] trained a U-Net-based baseline network on a large-scale remote sensing image dataset and used the cross-mode training data to fine-tune the first two convolutional layers of the pretrained network to achieve adaptation to the local features of the cross-mode data. These methods do not usually require large amounts of data to achieve good extraction results, so they may be applicable to mountain road extraction.
In addition, studies have used different data sources. Zhang et al. [32] adjusted the U-Net network and used dual-polarimetric (VV and VH) Sentinel-1 SAR (Synthetic Aperture Radar) imagery to conduct road extraction research and improve the accuracy of road extraction. However, feature extraction from high-resolution data may be somewhat different. Henry et al. [33] combined a CNN with a tolerance rule for spatially small mistakes to reach an effective solution for road extraction from SAR images. However, high-resolution SAR data are not easily obtained and are prone to high noise interference.
Most of the existing research only studied road extraction from sample locations. However, as described in Section 1 of this article, specific problems arise when extracting a road network at the region level. It is important to pay attention to the practical application of methods and solve problems for entire regions. Some scholars have studied land cover mapping based on deep learning at a large scale [34,35], but these studies lacked specificity for road extraction. Salberg et al. [36] used a fully convolutional network for large-scale mapping of small roads, but they used lidar images, which are not easy to obtain. In addition, road extraction in mountainous regions is a context that has rarely been studied. Although some studies have considered terrain factors [25], they focused only on common factors like occlusion and shadows and did not consider the interference caused by the spatial and spectral characteristics of mountains. In the study by Courtial et al. [37] on mountain road generalization, some interference can be resolved. However, some details are ignored, which makes the approach unsuitable in situations such as earthquake emergencies. Therefore, it is useful to study the road extraction of a complete mountainous region. This paper takes Jiuzhaigou county as an example and improves road extraction from the perspective of practical application.

Overall Process of Road Extraction
The VHR (very high-resolution) image data used in this study are Google Earth images. The images are available at different viewing altitudes, and we used the images with a spatial resolution of 0.6 m. These images are in the RGB color model with 8 bits per channel. Google Earth images are widely used [38], but unlike images with more spectral bands and a larger dynamic range, they usually require better feature extraction and image processing methods.
Starting from practical applications, we addressed the issue of road extraction from a complete mountainous region from different aspects. The method described in this paper can be divided into the following three aspects:
• Preprocessing: We used the CLAHE algorithm to enhance image quality.
• Network: We proposed DSDNet with optimizations of the existing network model.
• Postprocessing: We calculated some indicators to constrain the extraction results according to the characteristics of the mountain terrain.
Figure 1 shows the overall flow chart of road extraction. In some studies, interrupted roads are connected by topology [1] or other similar methods [22]. However, these kinds of methods are not very suitable for this study because road interruptions are often real, especially after earthquakes, floods or other natural disasters. These interrupted regions are also important information. Therefore, the method in this article does not use additional methods to connect all the suspected interrupted regions. It is worth noting that the CLAHE method and the DSDNet network can reduce road interruptions caused by vehicles, shadows and thin clouds through their effective feature extraction capabilities.

CLAHE Algorithm
An image histogram reflects the statistics of the different gray levels of the image. Through histogram equalization (HE), the brightness can be better distributed on the histogram, so the contrast of the image can be enhanced. However, when a part of the image is obviously brighter or darker than other regions, ordinary histogram equalization cannot bring out the details of that part. The adaptive histogram equalization (AHE) algorithm expands the local contrast and reveals the details of smooth regions by performing histogram equalization in a rectangular region around the pixel being processed. However, images obtained by the AHE algorithm still contain some noise. The CLAHE algorithm deals with this problem by limiting the contrast amplification of the AHE algorithm. The contrast amplification around a given pixel value is mainly determined by the slope of the transformation function. This slope is proportional to the slope of the cumulative histogram of the neighborhood. CLAHE clips the histogram at a predefined threshold before calculating the CDF (cumulative distribution function) to limit the amplification.
The main steps of the CLAHE algorithm are as follows:
(1) Extend the image boundary so that the image can be divided exactly into several sub-blocks.
(2) Divide the image into blocks and, taking the block as the unit, first calculate the histogram, then clip the histogram and carry out equalization. For each gray level of each sub-block's histogram, apply the preset limit value and count the number of pixels that exceed the limit over the entire histogram.
(3) Go through each image block and apply linear interpolation between neighboring blocks.
(4) Carry out a layer filter blending operation with the original image.
We used the HE and CLAHE algorithms for enhancement processing on high-resolution remote sensing images; example results are shown in Figure 2. Both the CLAHE and HE algorithms can enhance the display effect of the original image to a certain extent. However, while the HE algorithm enhances the display effect, it also interferes with road extraction. In the upper right corner of the image obtained by the HE algorithm, the spectral information for the road and the ground was very similar, which is prone to misdetection during feature extraction. In the image enhanced by the CLAHE algorithm, the road in the upper right corner was still easy to distinguish.
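The histogram clipping in step (2) can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the implementation used in this study: the function names `clip_histogram` and `tile_mapping` are hypothetical, the clipped excess is assumed to be spread evenly over all gray levels, and the interpolation between blocks in step (3) is left out.

```python
def clip_histogram(hist, clip_limit):
    """Clip every bin at clip_limit and spread the excess evenly over all bins.

    After redistribution a bin may slightly exceed clip_limit again;
    this simple sketch does not iterate to remove that overshoot.
    """
    excess = sum(max(0, h - clip_limit) for h in hist)
    clipped = [min(h, clip_limit) for h in hist]
    share, remainder = divmod(excess, len(hist))
    return [h + share + (1 if i < remainder else 0)
            for i, h in enumerate(clipped)]


def tile_mapping(pixels, clip_limit, levels=256):
    """Build the gray-level mapping of one block from its clipped histogram."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    hist = clip_histogram(hist, clip_limit)
    total = sum(hist)
    mapping, cumulative = [], 0
    for h in hist:
        cumulative += h  # running CDF of the clipped histogram
        mapping.append(round((levels - 1) * cumulative / total))
    return mapping


if __name__ == "__main__":
    # A block dominated by one dark gray level (4 levels for readability).
    block = [0] * 10 + [3] * 2
    print(clip_histogram([10, 0, 0, 2], clip_limit=4))
    print(tile_mapping(block, clip_limit=4, levels=4))
```

A lower clip limit flattens the clipped histogram, which reduces the slope of the mapping and therefore the contrast amplification. In practice, a library routine such as OpenCV's `cv2.createCLAHE` would normally be used rather than hand-written code.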

Network Structure
Based on a variety of excellent neural network structures, we proposed the DSDNet network structure (network with dilated convolution, SE module and dense upsampling convolution). The DSDNet network model structure is shown in Figure 3. The main features of this structure are as follows:
(1) DSDNet uses the encoder-decoder structure as its basis.
(2) The encoder is improved from the D-LinkNet network [39]. DSDNet uses the pretrained ResNet [9] model and the dilated (atrous) convolution pooling module. To exploit the channel characteristics, our network adds the SE module to the ResNet backbone to better model the relationships between channels.
(3) The decoder adopts a dense upsampling method and integrates it into the encoder-decoder model. It is realized by a convolutional layer and a pixelshuffle layer.
(4) DSDNet uses Leaky ReLU as the activation function.
(5) DSDNet uses a new method with two control variables to optimize the learning rate.
DSDNet uses the encoder-decoder structure, which is used in many other effective networks, such as SegNet [12], U-Net [13], LinkNet [14] and DeepLab v3+ [40]. In the encoder, multiple levels of image features are obtained through multilayer convolution, and in the decoder the extraction results are restored to the original image size by upsampling. The encoder and decoder in DSDNet are connected through pointwise addition. In this way, the position and contour information of the extracted object can be preserved while extracting the categorical information from the image.
In the encoder, DSDNet retains part of the ResNet network, so through this kind of transfer learning we can quickly train a better model on our dataset. To extract information effectively, DSDNet adds an SE module in the encoder section. For convolution operations, most work has focused on improving the receptive field, that is, fusing more features spatially, or on extracting multiscale spatial information, as in the multibranch structure of the Inception network [8]. Along the channel dimension, the convolution operation by default simply fuses all channels of the input feature map. The squeeze-and-excitation (SE) module [41] focuses on the relationship between channels, so that the model can automatically learn the importance of different channel features. The SE module performs the squeeze operation on the feature map to obtain channel-level global features, and then performs the excitation operation on the global features to learn the relationship between channels and obtain the weights of the different channels. It then multiplies the original features by the channel weights to obtain the final features. Since imagery of many mountainous regions is of poor quality, the SE module can help to extract features well. We adjusted the method of reading the pretrained model so that the network could partly read the pretrained parameters of ResNet while adding an SE module.
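The squeeze, excitation and scaling operations described above can be illustrated with a minimal standard-library sketch. It assumes the usual SE formulation (global average pooling, two fully connected layers with ReLU and sigmoid) and omits bias terms for brevity; the function name `se_block` and the weight matrices `w1` and `w2` are hypothetical, and this is not the DSDNet code.

```python
import math


def se_block(feature_map, w1, w2):
    """Apply a squeeze-and-excitation step to a [C][H][W] nested list.

    w1 has shape [C_reduced][C] and w2 has shape [C][C_reduced];
    both are illustrative weights, not trained parameters.
    """
    # Squeeze: global average pooling gives one value per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excitation: fully connected -> ReLU -> fully connected -> sigmoid.
    hidden = [max(0.0, sum(w * z for w, z in zip(row, squeezed))) for row in w1]
    weights = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply each channel of the original features by its weight.
    scaled = [
        [[weight * value for value in row] for row in channel]
        for channel, weight in zip(feature_map, weights)
    ]
    return scaled, weights


if __name__ == "__main__":
    features = [[[1.0, 3.0]], [[0.0, 0.0]]]  # 2 channels, 1 x 2 pixels
    scaled, weights = se_block(features, w1=[[1.0, 0.0]], w2=[[0.0], [1.0]])
    print(weights)
    print(scaled)
```

Each channel weight lies between 0 and 1, so the block can only attenuate channels relative to one another; which channels are emphasized is learned during training.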
In the decoder, DSDNet uses a modified dense connection upsampling method. Wang et al. [42] designed dense upsampling convolution (DUC) for upsampling. This method compensates for the loss of length and width through the channel dimension. DSDNet uses a convolutional layer and a pixelshuffle layer to achieve a similar function. It differs in that DSDNet uses multilayer upsampling and connects each layer to the encoder. The pixelshuffle algorithm was first used for super-resolution reconstruction [43]. Unlike bilinear interpolation, this algorithm uses subpixel convolution to achieve feature map magnification and upsampling (Figure 4). The process can be trained to achieve better results.
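The rearrangement performed by a pixelshuffle layer can be sketched as follows. The sketch covers only the shuffling step, which moves values from the channel dimension into the spatial dimensions; the preceding convolutional layer that produces the extra channels is not shown. The channel ordering follows the common subpixel-convolution convention, which is an assumption about DSDNet, and the function name `pixel_shuffle` is illustrative.

```python
def pixel_shuffle(x, r):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r].

    Output pixel (h*r + i, w*r + j) of channel c is taken from input
    channel c*r*r + i*r + j at position (h, w).
    """
    channels_in, height, width = len(x), len(x[0]), len(x[0][0])
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    channels_out = channels_in // (r * r)
    out = [[[0] * (width * r) for _ in range(height * r)]
           for _ in range(channels_out)]
    for c in range(channels_out):
        for i in range(r):
            for j in range(r):
                source = x[c * r * r + i * r + j]
                for h in range(height):
                    for w in range(width):
                        out[c][h * r + i][w * r + j] = source[h][w]
    return out


if __name__ == "__main__":
    # Four 1 x 1 channels become one 2 x 2 channel for r = 2.
    print(pixel_shuffle([[[0]], [[1]], [[2]], [[3]]], r=2))
```

Because the values placed at each subpixel position come from channels produced by a learned convolution, the upsampling is trainable, unlike fixed bilinear interpolation.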
DSDNet uses two control variables to optimize the learning rate (Figure 5). Both variables are related to the loss of each epoch. In one epoch, if the loss exceeds the previous minimum loss, both variables are increased by 1. Otherwise, variable 1 is reset to 0 and variable 2 remains unchanged. We set dynamic thresholds for the two variables, determined according to the magnitude of the loss. If one of the variables reaches its threshold, both variables are reset to 0 and the learning rate is updated (multiplied by the change factor). In addition, we set a minimum learning rate: once the learning rate reaches 0.000002, it is no longer updated. We also used Leaky ReLU to replace the original ReLU (rectified linear unit) as the activation function.
Figure 5. Method of optimizing the learning rate: V1 and V2 represent the two variables, Th1 and Th2 represent the thresholds, LR represents the learning rate and mLoss and mLR represent the minimum loss and minimum learning rate, respectively.
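The two-variable learning rate rule can be sketched as a small Python class. Several details are assumptions made for illustration: the class name `TwoCounterLRSchedule`, the default change factor and threshold values, and the fact that the thresholds are constant here (the text states only that they depend on the magnitude of the loss, so the `thresholds` method is a placeholder to override). The minimum learning rate of 0.000002 is taken from the text.

```python
class TwoCounterLRSchedule:
    """Sketch of a learning rate rule driven by two counters, V1 and V2."""

    def __init__(self, lr, factor=0.5, th1=3, th2=6, min_lr=2e-6):
        self.lr = lr
        self.factor = factor      # hypothetical change factor
        self.th1 = th1            # hypothetical threshold for V1
        self.th2 = th2            # hypothetical threshold for V2
        self.min_lr = min_lr      # minimum learning rate from the text
        self.v1 = 0
        self.v2 = 0
        self.min_loss = float("inf")

    def thresholds(self, loss):
        """Placeholder for the loss-dependent thresholds (constant here)."""
        return self.th1, self.th2

    def step(self, loss):
        """Update the counters with one epoch loss and return the learning rate."""
        if loss > self.min_loss:
            # No improvement: both counters grow.
            self.v1 += 1
            self.v2 += 1
        else:
            # Improvement: V1 restarts, V2 keeps its accumulated count.
            self.min_loss = loss
            self.v1 = 0
        th1, th2 = self.thresholds(loss)
        if self.v1 >= th1 or self.v2 >= th2:
            self.v1 = 0
            self.v2 = 0
            if self.lr > self.min_lr:
                self.lr = max(self.lr * self.factor, self.min_lr)
        return self.lr


if __name__ == "__main__":
    schedule = TwoCounterLRSchedule(lr=0.001, th1=2, th2=3)
    for epoch_loss in [1.0, 1.2, 1.1, 0.9]:
        print(epoch_loss, schedule.step(epoch_loss))
```

Under this reading, V1 counts consecutive epochs without improvement, while V2 accumulates all such epochs since the last learning rate update, so a loss that oscillates around its minimum still triggers a decay eventually.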

Terrain Constraints Processing
Because of the terrain of mountainous regions, there are often streams in the valleys. After these creeks dry up, the bare riverbeds often look similar to roads. Even where there is no riverbed, the valley itself can look very similar to a road. After processing with CLAHE and DSDNet, the false detections caused by this special terrain can be partly resolved. We went on to analyze the remaining false detection regions:

• The length of bare riverbeds and valley lines is usually very short.

• The directions of these bare riverbeds are sometimes similar to the slope gradient, whereas the directions of roads usually are not. We proposed the road-gradient angle to represent the angle between the road direction and the steepest direction.

• The slope along the road direction is usually small for roads, but not for bare riverbeds and valley lines. We used the road-direction slope to represent the slope along the road direction.
Each individual condition outlined above is not enough to distinguish between roads and non-roads. For example, some roads are relatively short, and some special roads in mountainous regions may also follow the direction of the slope gradient. However, if we consider these conditions in combination, we can distinguish roads better. Therefore, we considered the three conditions at the same time by setting appropriate thresholds, and only the predictions that meet the conditions of shorter length, smaller road-gradient angle and smaller road-direction slope are judged as non-roads. It should be noted that we did not consider mountain roads that are too narrow. These roads can be almost perpendicular to the ground, but they differ from typical roads. In addition, there are many places to walk in mountainous regions, and we cannot treat all of them as road extraction objects. Therefore, we set three thresholds: the maximum length of the road, the maximum value of the road-gradient angle and the maximum value of the road-direction slope. The predictions within these three thresholds are regarded as non-roads and are removed.

Figure 6 shows a schematic diagram of the mountain terrain variables. The length of the road is easy to calculate. We then fitted the road to a straight line by the least squares method and calculated the road direction from the direction of that line. Combined with the aspect, which was calculated from the DEM (Digital Elevation Model), the road-gradient angle was calculated. Finally, we calculated the road-direction slope using the fitted straight line of the road and the DEM data.
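The terrain variables described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the road direction and the steepest direction are assumed to be given in the same angular convention, and the road-direction slope β is derived from the steepest slope α and the road-gradient angle θ with the standard apparent-slope relation tan β = tan α · cos θ for a locally planar hillside. The paper only states that β is obtained from the fitted line and the DEM, so the exact relation used there may differ.

```python
import math
import numpy as np

def segment_length(xs, ys):
    """Polyline length of a road segment from its vertex coordinates."""
    return float(np.sum(np.hypot(np.diff(xs), np.diff(ys))))

def road_direction_deg(xs, ys):
    """Direction of the least-squares line through the road points, in [0, 180)."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    # Regress along the axis with the larger spread so that
    # near-vertical segments do not produce an unstable slope.
    if np.ptp(xs) >= np.ptp(ys):
        k = np.polyfit(xs, ys, 1)[0]          # y = k * x + c
        angle = math.degrees(math.atan2(k, 1.0))
    else:
        k = np.polyfit(ys, xs, 1)[0]          # x = k * y + c
        angle = math.degrees(math.atan2(1.0, k))
    return angle % 180.0

def road_gradient_angle_deg(road_dir_deg, steepest_dir_deg):
    """Acute angle (0 to 90 degrees) between the road line and the steepest direction."""
    diff = abs(road_dir_deg - steepest_dir_deg) % 180.0
    return min(diff, 180.0 - diff)

def road_direction_slope_deg(alpha_deg, theta_deg):
    """Assumed relation: tan(beta) = tan(alpha) * cos(theta) on a planar slope."""
    tan_beta = math.tan(math.radians(alpha_deg)) * math.cos(math.radians(theta_deg))
    return math.degrees(math.atan(tan_beta))
```

Each extracted segment would then be compared against the three thresholds described in the text; the threshold values themselves are not given here and have to be determined experimentally.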

Figure 6. Schematic diagram of mountain terrain variables: θ represents the road-gradient angle, α represents the slope at the steepest direction directly obtained from a DEM and β represents the required road-direction slope.

Research Region
Jiuzhaigou county belongs to the Tibetan Qiang Autonomous Prefecture of Ngawa in Sichuan province, China. It is located on the eastern edge of the Qinghai-Tibet Plateau and in the northeastern part of Ngawa prefecture, covering a total area of 5286 km². It has a complex geological background with an extensive carbonate distribution, developed folds and fractures, strong neotectonic movement, large crustal uplift and a variety of complex forces creating a variety of landforms. The river valleys in Jiuzhaigou county criss-cross the region. The terrain is high in the northwest and low in the southeast and is dominated by high mountains. The topography changes throughout this region and the elevation difference is up to 2000 m. There are many lakes, waterfalls and calcified beach streams in the gullies, and virgin forest covers more than half of the area.
A topographic map of Jiuzhaigou county is shown in Figure 7. The DEM data in the figure come from the high-resolution terrain-corrected data of ALOS PALSAR (https://search.asf.alaska.edu/). The region outline comes from the National Catalogue Service for Geographic Information of China (http://www.webmap.cn/). It can be seen that the terrain of Jiuzhaigou county is complex, so special processing is needed in road extraction.

Figure 7. Topography of Jiuzhaigou county. We used a rendering method to display the DEM, so there is not a precise numerical representation, but rather high and low elevation.

Experimental Environment
We used Windows 10 with an NVIDIA GeForce RTX 2070 graphics card (8 GB) as the experimental environment. We implemented the network models in Python with the PyTorch framework.

Experiment Using Sample Locations
We selected 90 sample points in the Jiuzhaigou county region to obtain sample images of 1000 × 1000 pixels. The labels were then drawn manually according to the RGB images, yielding 90 image samples with corresponding labels. To avoid instability caused by the limited amount of data, the samples were organized into two groups, each containing all 90 images but with a different division into training and test data. In each group, 60 images were used as training data and 30 images as test data. The test data used in the two groups were completely different. We compared the results of our method with those of the U-Net and D-LinkNet networks, which perform well in road extraction [37]; D-LinkNet achieved first place in the CVPR 2018 DeepGlobe Road Extraction Challenge and U-Net is a classic network structure widely used for road information extraction [44]. When comparing the networks, we also compared the model performance before and after adding the CLAHE algorithm.

Experiment Over a Complete Region
We conducted experiments using the complete high-resolution data of Jiuzhaigou county. Road extraction was performed on all images in Jiuzhaigou county region using moving windows. After extraction, the road extraction raster map of the complete region was obtained and the raster map was converted into a shapefile. Then we used a postprocessing operation with terrain constraints to improve road extraction.
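A moving-window pass over a large raster can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the window size, the stride, the averaging of overlapping predictions and the `predict_fn` callback (standing in for the trained network) are all assumptions introduced here.

```python
import numpy as np

def window_starts(length, size, stride):
    """Start offsets so that windows of `size` cover [0, length) completely."""
    if length <= size:
        return [0]
    starts = list(range(0, length - size + 1, stride))
    if starts[-1] != length - size:
        starts.append(length - size)  # extra window flush with the border
    return starts

def predict_tiled(image, predict_fn, size=1000, stride=500):
    """Run `predict_fn` on each window and average overlapping outputs.

    `predict_fn` is a hypothetical callback that maps an (h, w, c) tile
    to an (h, w) array of road probabilities.
    """
    height, width = image.shape[:2]
    prob = np.zeros((height, width), dtype=np.float32)
    hits = np.zeros((height, width), dtype=np.float32)
    for top in window_starts(height, size, stride):
        for left in window_starts(width, size, stride):
            tile = image[top:top + size, left:left + size]
            prob[top:top + size, left:left + size] += predict_fn(tile)
            hits[top:top + size, left:left + size] += 1.0
    return prob / hits  # every pixel is covered at least once
```

Thresholding the averaged map would give the binary road raster that is subsequently converted into a shapefile and passed to the terrain-constraint postprocessing.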

Accuracy Evaluation Scheme
Since our study involved the extraction of complete road information in a mountainous region, a single accuracy evaluation scheme cannot reasonably evaluate the methods we proposed. We used three accuracy evaluation methods to evaluate the results of mountain road extraction. All of the evaluation methods have the same basic evaluation indicators.

Basic Accuracy Evaluation Indicators
The main evaluation indicators are precision, recall and a comprehensive accuracy score (the F1 score). We took the sample label as the actual value and the output of our method as the predicted value. The precision and recall are obtained by:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

where TP indicates the number of correct extractions, FP indicates the number of extraction errors and FN indicates the number of correct values that have not been extracted. On this basis, the F1 score can be obtained by:

$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

Precision indicates whether the extracted road was extracted correctly, and recall indicates whether the real road was extracted completely. The F1 score comprehensively reflects the extraction accuracy, so we used the F1 score to judge accuracy.
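For pixel-wise evaluation, these indicators can be computed directly from a predicted mask and a label mask. The following is a minimal sketch; the function name is hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not something specified in the paper.

```python
import numpy as np

def precision_recall_f1(pred, truth):
    """Precision, recall and F1 score for two binary road masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.count_nonzero(pred & truth)    # road predicted, road in label
    fp = np.count_nonzero(pred & ~truth)   # road predicted, background in label
    fn = np.count_nonzero(~pred & truth)   # road in label, missed by prediction
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2.0 * precision * recall / total if total else 0.0
    return precision, recall, f1
```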

Cross Validation Based on Raster Data
This is a commonly used method for verifying road extraction accuracy. It checks every pixel of the samples, so it is convincing and credible. Because our study used relatively few samples, we adopted a cross-validation method to make the experimental results more stable, using the samples introduced in Section 4.3. All samples were divided into two groups: the test samples in one group are used as part of the training samples in the other group, and the training samples of the two groups differ from each other. We tested the two groups of samples separately and calculated the average of the accuracy of the results.
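The two-group arrangement can be sketched as below. It assumes the 90 samples are identified by integer indices and that the two test sets are drawn at random with a fixed seed; the paper does not say how the samples were assigned, so both the function names and the random assignment are illustrative.

```python
import random

def two_group_split(n_samples=90, n_test=30, seed=0):
    """Build two train/test groups whose test sets do not overlap.

    Each group uses all samples: n_test for testing and the rest for
    training, so the test samples of one group are part of the training
    samples of the other group.
    """
    ids = list(range(n_samples))
    random.Random(seed).shuffle(ids)
    test_sets = (ids[:n_test], ids[n_test:2 * n_test])
    groups = []
    for test in test_sets:
        held_out = set(test)
        train = [i for i in ids if i not in held_out]
        groups.append({"train": train, "test": test})
    return groups

def mean_score(scores):
    """Average of the per-group accuracy values (for example F1 scores)."""
    return sum(scores) / len(scores)
```

With the default arguments each group holds 60 training and 30 test samples, matching the sample counts given in the experimental setup.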

Large-Scale Validation on Point Data
Although the method in Section 5.2.1 can perform pixel-by-pixel inspection, it leaves some densely validated regions and some sparsely validated regions. At the large regional scale, some areas may be missed. Point samples in mountainous regions can make the sample distribution more uniform so that the extraction accuracy can be expressed more completely. Therefore, 200 sample points were selected and distributed relatively evenly over the study region. These points were verified against Google Earth imagery and field survey data of Jiuzhaigou.

Validation Using OSM Data
We used OpenStreetMap (OSM) data to test the road extraction results and improve the reliability of the accuracy measurement. OSM is a well-known world map that can be used freely under an open license. As shown in Figure 8, the OSM data were consistent with the shape and general position of the actual roads, but there were some deviations in their exact spatial position. To allow for these deviations, we calculated a 100 m buffer around the OSM data and used this buffer together with the road extraction results to calculate the accuracy value. We calculated the length of the OSM data and the length of the matching extraction result, then calculated the accuracy by comparing their lengths.
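The buffer-and-length comparison can be approximated without a GIS library, as in the sketch below. It is only an illustration: it assumes both polylines are given as vertex lists in projected coordinates in metres, it cuts the extracted road into short pieces (10 m by default, an arbitrary choice) and counts a piece as matched when its midpoint lies within the buffer distance of the OSM line. The function names are hypothetical, and a GIS buffer and intersection would be the more precise equivalent.

```python
import numpy as np

def point_segment_distance(p, a, b):
    """Distance from point p to the segment from a to b."""
    p, a, b = (np.asarray(v, dtype=float) for v in (p, a, b))
    ab = b - a
    denom = float(ab @ ab)
    t = 0.0 if denom == 0.0 else float(np.clip((p - a) @ ab / denom, 0.0, 1.0))
    return float(np.linalg.norm(p - (a + t * ab)))

def polyline_length(line):
    """Total length of a polyline given as a sequence of (x, y) vertices."""
    pts = np.asarray(line, dtype=float)
    return float(np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1)))

def matched_length(extracted, osm, buffer_m=100.0, step_m=10.0):
    """Length of `extracted` that lies within `buffer_m` of the `osm` polyline."""
    ext = np.asarray(extracted, dtype=float)
    ref = np.asarray(osm, dtype=float)
    total = 0.0
    for a, b in zip(ext[:-1], ext[1:]):
        seg_len = float(np.linalg.norm(b - a))
        pieces = max(1, int(np.ceil(seg_len / step_m)))
        for i in range(pieces):
            mid = a + (b - a) * ((i + 0.5) / pieces)
            dist = min(point_segment_distance(mid, p, q)
                       for p, q in zip(ref[:-1], ref[1:]))
            if dist <= buffer_m:
                total += seg_len / pieces
    return total

def length_ratio(extracted, osm, buffer_m=100.0):
    """Matched extraction length divided by the OSM road length."""
    return matched_length(extracted, osm, buffer_m) / polyline_length(osm)
```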

Results of the Raster Samples
The experimental results for the raster samples are shown in Table 1. All experiments used pretrained models. According to the characteristics of each structure, we used the pretrained model of VGGNet [45] in U-Net, and the pretrained model of ResNet in the other networks. It can be seen from Table 1 that the extraction accuracy of the D-LinkNet network was higher than that of the U-Net series. The accuracy of U-Net in the two groups was not stable, while the extraction accuracy of D-LinkNet was relatively stable. After adding the histogram equalization method, the extraction accuracy of the D-LinkNet network decreased. Because of the influence of mountain shadows and poor image quality, the globally enhanced histogram equalization method does not have a positive effect. However, after using the CLAHE algorithm, the accuracy greatly improved, which shows the effectiveness of the CLAHE algorithm. The road extraction accuracy was highest when using the DSDNet proposed in this paper as the backbone network, showing that feature extraction benefited from the SE-module and the dense connection upsampling method. Figure 9 shows the details of the extraction effect from different methods. In Figure 9a, the results of several network structures showed road interruptions, but these did not appear with DSDNet (CLAHE), whose result was the closest to the ground truth. In (b), there is some interference information beside the road, which is non-road based on a manual judgment. The neural network extraction results show different degrees of mis-extraction. The result extracted by DSDNet (CLAHE) had the fewest false detections and was the closest to the ground truth. In (c), there is a small road next to the main road. The path is narrow with many trees covering it, so it is difficult to identify. Each network structure had different degrees of missed road detection, but among them D-LinkNet (HE) and DSDNet (CLAHE) were most effective.
In (d), the spectral characteristics of the road and the background are very similar. There are obvious misdetections by the U-Net and D-LinkNet networks. After using histogram equalization processing, the effect was significantly improved. The CLAHE method had a greater improvement in the extraction effect. It can be seen from these samples that the method proposed in this paper can extract roads very well. Moreover, compared with manually drawn roads, the road extraction contour obtained by the neural network is smoother and more in line with the actual situation.

Results for the Complete Region
Although we selected raster samples as evenly as possible, it is still inevitable that some information will be missed, especially with a complex mountainous background. Therefore, the road extraction experiment was carried out over the complete region of Jiuzhaigou county, and a complete road extraction map was obtained. We used 200 verification points (Figure 10) and the results of verification are shown in Table 2. Because most parts of the mountain areas are not roads, uniform or random points cannot fully reflect the accuracy, so the selected verification points were neither strictly uniform nor randomly generated. However, these points were relatively evenly distributed throughout the research region and had been strictly verified.
According to Table 2, DSDNet (CLAHE) was better than the D-LinkNet network in terms of overall F1 accuracy, and its advantage was mainly reflected in the recall rate. Some mountainous terrain was also extracted by mistake because of its similarity to roads. After using terrain constraints, the precision was significantly improved, which shows that terrain constraints could effectively reduce false detections. Although recall was slightly reduced after using terrain constraints, its extraction accuracy significantly improved and was 5.68% and 3.44% higher than D-LinkNet and DSDNet (CLAHE) without postprocessing, respectively. Figure 11 shows the details of road extraction by different methods. In the first row, some parts of the road were obscured by mountain shadows, resulting in a significant spectral difference. In the algorithm with no CLAHE enhancement, the shadow areas had obvious missed detections. CLAHE-preprocessing substantially improved road recognition rates. In the second and third rows, there were no roads, but some areas were incorrectly extracted as roads in the method without terrain constraints. Many mountainous features were very similar to roads in the mountainous region. After using terrain constraint postprocessing, false detection was reduced.
Figure 10. Diagram of point samples and OSM roads. We used a rendering method to display the DEM, so there is not a precise numerical representation, but rather high and low elevation.

We used OSM data to further test the accuracy of road extraction. Since OSM data in mountainous regions were not very detailed, we tested the extraction accuracy of the main road network (Figure 9) and calculated only the recall. The verification results are shown in Table 3. It can be seen that the extraction results using DSDNet (CLAHE) were acceptable and higher than the result of D-LinkNet. Therefore, we could conclude that the method we proposed had a good practical application effect. Figure 12 shows the results (polylines) from the different methods for a typical area of the region. The figure shows the road extraction accuracy using our proposed method in a shaded area. In the shaded area, the method we proposed performed better than the original D-LinkNet method. As the OSM data are public data and not determined by the experimenter, the validation of the OSM data further verified the effectiveness of the method we proposed.
Taking the multiple evaluation schemes together, the mountain road extraction method proposed in this paper achieved the highest extraction accuracy and has high practical value.

Conclusions and Future Lines of Research
According to the experiments we conducted, the method proposed in this paper was able to accurately extract roads from remote sensing data in a mountainous region. The CLAHE algorithm improved results especially for the quality degradation caused by clouds and fog and for the interference with road spectral information caused by mountain shadows. DSDNet extracts road features more accurately. Postprocessing with terrain constraints effectively reduces false detections. Figure 13a,b show that our method had good results in the case of tree shadows and vehicle interference. However, Figure 13c shows that some tall trees could completely obscure longer roads (sometimes more than 1 km), which are difficult to extract even with visual interpretation. We considered combining multisource data (such as SAR data) and multiperiod data for comprehensive judgment and extraction under these circumstances. In addition, the postprocessing in this article was limited by the threshold method, which requires multiple experiments to reach the optimal threshold values. In the future, we will consider studying general methods to determine the thresholds or using shallow machine learning algorithms (such as SVM) to replace the threshold method.

Conflicts of Interest:
The authors declare no conflict of interest.