Complex Mountain Road Extraction in High-Resolution Remote Sensing Images via a Light Roadformer and a New Benchmark

Abstract: Mountain roads are of great significance to traffic navigation and military road planning. Extracting mountain roads from high-resolution remote sensing images (HRSIs) is a hot spot in current road extraction research. However, massive terrain objects, blurred road edges, and sand coverage in complex environments make it challenging to extract mountain roads from HRSIs. These complex environments have led to weak research results on targeted extraction models and a lack of corresponding datasets. To solve the above problems, first, we propose a new dataset: Road Datasets in Complex Mountain Environments (RDCME). RDCME comes from the QuickBird satellite and covers terrain at elevations between 1264 m and 1502 m with a resolution of 0.61 m; it contains 775 image samples, including red, green, and blue channels. Then, we propose the Light Roadformer model, which uses a transformer module and self-attention module to focus on extracting more accurate road edge information. A post-process module is further used to remove incorrectly predicted road segments. Compared with previous related models, the Light Roadformer proposed in this study has higher accuracy: it achieved a road IoU of 89.5% on the validation set and 88.8% on the test set. The test on RDCME using Light Roadformer shows that the results of this study have broad application prospects for the extraction of mountain roads with similar backgrounds.


Introduction
Mountainous road extraction is essential for spatial geographic information databases and is significant for traffic navigation and military road planning [1]. However, the mountainous environment is desolate and complex, accompanied by frequent sandstorms, rainstorms, blizzards, and other poor weather. These environmental conditions may cause loss of life and economic losses during long-distance driving. To safeguard the economy and people's safety, the extraction of mountain roads is becoming increasingly important.
HRSIs have become increasingly critical for geographic information system applications [2][3][4]. HRSIs are also an effective means of road extraction in mountainous areas. Research scholars have extracted information from the spectral features, shape features, and spatial relationships of HRSIs before classifying and identifying roads [5][6][7]. Currently, much road extraction work has been performed on urban road datasets [8][9][10][11][12][13][14][15][16][17], but there is a lack of extraction work on mountain roads. On the one hand, there is a lack of road datasets in the complex environment of mountainous areas, and on the other hand, there is a lack of effective models for road extraction in mountainous areas.
Road extraction used to be performed by manual annotation; however, even though the manual process is accurate, it is time-consuming. The task of road extraction has received more attention with the advancement of computer applications. The existing methods can be divided into heuristic methods and data-driven methods. According to the degree of interaction, heuristic road extraction methods can be divided into semi-automatic and automatic extraction methods. The mainstream semi-automatic methods are the active contour model [18], dynamic programming [19][20][21], and template matching [22][23][24][25]. The semi-automatic methods require human intervention, so these methods are less efficient. The most commonly used automatic methods are segmentation methods [26][27][28][29][30][31], edge analysis methods [32,33], object-based methods [34,35], and multispectral segmentation methods [36,37].
In the context of big data and deep learning, semantic segmentation methods among the data-driven methods have become mainstream. Early neural-network-based methods could not scale effectively to challenging large datasets because of the small size of the networks and the lack of big data [9]. After deep learning was proposed [38], Mnih et al. [9] first attempted to extract roads using deep learning and achieved significant improvements in precision and recall. Because distinguishing categories cannot rely on a single pixel, Mnih et al. [39] predicted every small patch of labels from one large image context by using a patch-based deep convolutional neural network (DCNN). However, the overlap of the patches and the duplication of adjacent pixels made the prediction process time-consuming and inefficient. In 2015, the FCN was proposed using a pixel-level classification method [40]. The FCN replaced the last fully connected layers with convolutional layers, and it was applied to road extraction in 2016 [41]. DeconvNet was proposed based on the FCN, and SegNet [42], DeepLab [43], and U-Net [44] replaced the interpolation layers with deconvolutional layers (called the decoder). In 2014, generative adversarial nets (GANs) were designed [45], consisting of a generator and a discriminator. GANs were then used in road extraction work and achieved better extraction results [46,47]. However, GAN models may suffer from non-convergence and vanishing gradients, and because the segmentation is pixel-level, the model output is noisy. These data-driven methods result in poor continuity of the extracted road segments [48]. Iterative road tracking [49] and polygon detection [50] are both graph-based methods, which use vectorized representations and show higher connectivity, but their graph reconstruction and optimization processes are complex. Attention-like mechanisms were introduced in the 1990s.
Based on this mechanism, attention was added to RNNs to learn the important parts of an image [51]. Attention mechanisms were also applied in NLP by Bahdanau et al. [52]. Many researchers also attempted to use attention on CNNs [53]. In 2017, self-attention was proposed to replace the RNN or CNN and was shown to perform well on NLP [54]. Compared with the CNN and RNN, the attention mechanism has fewer parameters and lower model complexity. In 2020, the first pure transformer structure achieved outstanding results in computer vision [55], and then, various transformer variants such as T2T-ViT [56], IPT [57], PVT [58], and Swin Transformer [59] were developed. The self-attention and transformer modules showed better performance than the FCN, being able to capture the global information of the whole image while focusing on the crucial details. However, the above methods are more suitable for extracting urban roads than mountain roads.
Additionally, in recent years, some road datasets have been published, mainly for urban road extraction [60][61][62]. Wang et al. introduced the TorontoCity benchmark [63], which covers the full Greater Toronto Area (GTA). RoadTracer, which covers high-resolution ground truth road network maps of 40 urban cores in six countries, was proposed in 2018 [64]. In 2019, the Toulouse Road Network Dataset was released for road network extraction from remote sensing images [65]. In 2021, Xu et al. proposed Topo-boundary for offline topological road boundary detection [66], which contains 25,295 four-channel aerial images of 1000 × 1000 pixels. Although many urban road datasets have been proposed, these datasets are not accurate enough because cities are developing and changing all the time. At the same time, there is also a lack of mountain road datasets and of network models suitable for mountain road extraction. In 2021, Zhou et al. [67] proposed a split depthwise (DW) separable graph convolutional network (SGCN) and a mountain road dataset. The SGCN achieved good accuracy on the Massachusetts road dataset; however, the classical network models were not very accurate on the mountain road dataset, so the accuracy of different network models on Zhou's mountain dataset is mixed. In 2021, the DSDNet [68] also achieved good results on mountain road extraction, but it is limited by the threshold in its post-processing. The NIGAN [69] was proposed in 2022 and achieved good results on mountain road extraction. However, the resolution of the dataset used in that work is low for extracting roads, and the NIGAN does not solve the problem of poor road extraction in complex environments such as shadowed areas.
Compared with urban roads, mountain roads have two characteristics. Figure 1 shows that mountain roads are small in size and have blurred edges. Figure 2 shows that mountain roads have high similarity to terrain objects with respect to topological and morphological features. Both of these characteristics make the extraction of mountain roads challenging. To address the low-precision segmentation of mountain roads caused by vague road edges and complex backgrounds, we first labeled RDCME, collected from high-resolution remote sensors. Then, the transformer-based road extraction model Light Roadformer was built: a self-attention module and pyramid structure were employed to attend to the entire image while focusing on local details, and a post-process was applied to remove incorrectly classified road segments based on road topological features. Finally, we compared our model with other road extraction models on RDCME, and two road segments from the area were extracted to test the performance.

Study Area
RDCME is located in the northwest of China, and it has the two characteristics mentioned above. First, the roads in RDCME are small in size and have blurred edges. The road edges in DeepGlobe are clear and wide, while the roads in RDCME are more occluded, with blurred edges, as shown in Figure 1. In addition to complex environmental factors such as mountain weathering caused by frequent dust storms, road surfaces and road boundaries are blurred due to the limited quality of the satellite imagery. Additionally, mountain roads are smaller in size and situated in complex environments: the actual width of the roads in mountainous regions is about 4-5 m, and the road objects are 12-15 pixels wide in the HRSIs. Second, mountainous roads have high similarity in their topological and morphological features to the terrain objects. In a complex mountainous area, there are various terrain objects, such as rivers, mountains, and dunes. Geologically, geographical features such as rivers, dunes, and ridgelines have linear representations. The roads in DeepGlobe are clearly differentiated from cities, towns, and farmland. However, the roads in RDCME are morphologically similar to these geological lines, as Figure 2 shows.

Datasets
The remote sensing image data are from the QuickBird satellite, which uses Ball Aerospace's Global Imaging System 2000 (BGIS 2000). The satellite collects panchromatic (black and white) imagery at a 0.61 m resolution and multispectral imagery at a 2.44 m to 1.63 m resolution. The main objects of our study area are roads, rivers, and mountains, and the elevation is between 1264 m and 1502 m.
The remote sensing images we chose for road extraction include red, green, and blue channels with a resolution of 0.61 m. HRSIs can appear blurry with low contrast because of camera angles, sunlight angles, shadows from mountains, and the motion of satellites. After removing corrupted images, we selected 22 HRSIs containing mountain roads from the study area. The size of the image samples ranges from 1536 × 2048 pixels to 12,288 × 13,312 pixels. In general, we chose an experimental area with a more complex geological environment, as shown in Figure 3. We located the roads by visual recognition and labeled the road segments at the pixel level by painting in the program Krita. To train the models and test the performance, 20 HRSIs were used to make the dataset, and the remaining two image samples were predicted by the model.

Image Pre-Processing
In the shadow of the mountains in the study area, road objects may be dark and difficult to segment from their surroundings. Contrast-limited adaptive histogram equalization (CLAHE) [70] was proposed on the basis of adaptive histogram equalization (AHE) [71] to enhance the contrast of image samples while limiting noise amplification. CLAHE is applied to tiles of the image rather than the entire image; to remove the resulting artificial tile boundaries, adjacent tiles are combined using bilinear interpolation. The effect of the CLAHE image enhancement method is shown in Figure 4: the low-contrast and dim-image problems are solved, and in the enhanced images, road objects are easy to identify.
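As a rough illustration of the per-tile step only (the production pipeline uses a full CLAHE implementation with bilinear blending between tiles, which is omitted here), contrast-limited equalization of a single tile can be sketched in pure Python; the function names and the `clip_limit` default are hypothetical:

```python
def clip_histogram(hist, clip_limit):
    # Clip each bin at clip_limit and spread the excess evenly over all bins;
    # this is what limits noise amplification compared with plain AHE.
    excess = sum(max(0, h - clip_limit) for h in hist)
    clipped = [min(h, clip_limit) for h in hist]
    bonus = excess // len(clipped)
    return [h + bonus for h in clipped]

def equalize_tile(pixels, clip_limit=40, levels=256):
    # Histogram of the tile's intensities.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    hist = clip_histogram(hist, clip_limit)
    # The cumulative distribution gives a monotone intensity remapping.
    total = sum(hist)
    cdf, running = [], 0
    for h in hist:
        running += h
        cdf.append(running)
    return [round((levels - 1) * cdf[p] / total) for p in pixels]
```

A dark, low-contrast tile such as `[10, 10, 20, 30]` is stretched across the full intensity range, which is the effect visible in Figure 4.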

Light Roadformer Model
In the current paper, we propose Light Roadformer to extract roads in a mountainous area. The architecture of the Light Roadformer model is shown in Figure 5. The model consists of two parts: an encoder and a decoder module. We also adjusted some parameters of the model to improve its performance: the number of transformer layers in the four encoder stages was set to (3, 6, 32, 3), which improved the overall performance of the model.

Encoder:
The encoder consists of a pyramid structure that extracts high-resolution coarse features and low-resolution fine-grained features, which helps enhance the segmentation performance. Every transformer block contains a self-attention layer, a feed-forward network, and overlapped patch merging. The attention module maps a query and a set of key-value pairs to an output: first, a compatibility function is used to calculate the weights from the query and the corresponding keys of dimension d_head. Then, the weights are assigned to the values, and the weighted sum of the values is computed as the output. The attention module can be described as in Equation (1) and (a) in Figure 6.
Attention(O, P, Q) = softmax(OP^T / √d_head)Q, (1)

where O represents the query, P represents the key, and Q represents the value. The complexity of the attention module is O(N^2). The complexity of the self-attention mechanism is reduced using a reduction ratio R:

P̂ = Reshape(N/R, C·R)(P), (2)

P̄ = Linear(C·R, C)(P̂), (3)

where the shape of P is (N, C); Equation (2) reshapes P to P̂, whose shape is (N/R, C·R); Equation (3) projects P̂ back to the original channel dimension; the complexity is thus reduced from O(N^2) to O(N^2/R). Instead of single attention, multihead attention was applied in the model. Multihead attention projects the queries, keys, and values multiple times and then performs the attention functions in parallel, concatenating the results into the final values, as shown in (b) of Figure 6. The role of the multihead attention structure is to jointly learn information from different representation subspaces.

The mix feed-forward network (Mix-FFN) is used to learn the location information after the self-attention module. In the vision transformer (ViT) [55], positional encoding (PE) is used to introduce the location information. However, the fixed PE leads to a drop in accuracy when interpolated. A 3 × 3 convolution was applied to alleviate this problem in [72], and Xie et al. [73] showed that a 3 × 3 convolution can provide positional information, replacing the PE within the FFN. The Mix-FFN is formulated as:

M_out = MLP(GELU(Conv_3×3(MLP(M_in)))) + M_in, (4)

where M_in is the feature from the self-attention module. The Mix-FFN mixes a 3 × 3 convolution and an MLP into each FFN.

Decoder:

The decoder first unifies the channel dimensions with an MLP layer (Equation (5)), upsamples and concatenates the features from the different transformer blocks (Equation (6)), fuses them with an MLP layer (Equation (7)), and finally predicts the segmentation mask M from the fused feature (Equation (8)).
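The spatially reduced self-attention above can be sketched in pure Python. This is only an illustrative toy: the real model uses learned linear projections, whereas `project` here merely averages channel groups as a stand-in for Linear(C·R, C), and single-head attention is shown instead of the multihead version:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def spatial_reduce(tokens, R):
    # Fold each run of R consecutive tokens into one: (N, C) -> (N/R, C*R).
    return [sum(tokens[i:i + R], []) for i in range(0, len(tokens), R)]

def project(token, C):
    # Toy stand-in for the learned Linear(C*R, C): average channel groups.
    g = len(token) // C
    return [sum(token[i * g:(i + 1) * g]) / g for i in range(C)]

def reduced_attention(query, key, value, R):
    # Keys/values are shortened by a factor of R before the usual scaled
    # dot-product, cutting the score matrix from N x N to N x N/R.
    C = len(query[0])
    key = [project(t, C) for t in spatial_reduce(key, R)]
    value = [project(t, C) for t in spatial_reduce(value, R)]
    out = []
    for q in query:
        scores = softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(C)
                          for k in key])
        out.append([sum(w * v[j] for w, v in zip(scores, value))
                    for j in range(C)])
    return out
```

For a sequence of N tokens, the score computation touches N·N/R pairs instead of N², which is the O(N²/R) reduction stated above.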
The decoder gathers the extracted features and draws the road extraction map in four steps:

1. Unify the channel dimension with an MLP layer:

F̂_i = Linear(C_i, C)(F_i), ∀i, (5)

where F_i represents the feature extracted from the i-th transformer block, C_i represents the channel number of the output features, C represents the unified channel number, and F̂_i is the unified feature.

2. Upsample all the features of different sizes to the same size:

F̂_i = Upsample(W/4 × H/4)(F̂_i), ∀i, (6)

where W is the image width and H is the image height.

3. Concatenate the features and fuse them with an MLP layer:

F = Linear(4C, C)(Concat(F̂_i)), ∀i. (7)

4. Segment the mask M from the fused feature with an MLP layer to produce the extracted road segments:

M = Linear(C, N_cls)(F), (8)

where N_cls represents the number of object classes and M is the final predicted mask.
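The upsample-and-concatenate part of the decoder can be mimicked at a shape level in plain Python. This is only an illustrative sketch under simplifying assumptions: nearest-neighbour interpolation stands in for the model's upsampling, the learned MLP layers are omitted, and single-channel feature maps are used:

```python
def upsample_nearest(fmap, H, W):
    # Nearest-neighbour upsampling of a 2D feature map to H x W.
    h, w = len(fmap), len(fmap[0])
    return [[fmap[i * h // H][j * w // W] for j in range(W)] for i in range(H)]

def fuse_features(fmaps, H, W):
    # Bring every stage's map to the common size, then stack the per-pixel
    # values channel-wise (the MLP fusion of Equation (7) is omitted).
    ups = [upsample_nearest(f, H, W) for f in fmaps]
    return [[[u[i][j] for u in ups] for j in range(W)] for i in range(H)]
```

After this step, each pixel carries one value per pyramid stage, which is what the fusion MLP then mixes into a single feature vector.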

Post-Process
Combining the characteristics of mountain roads, we post-processed the predicted results. When the study area is predicted, the image is clipped into small image patches, called tiles. Because of the different characteristics of mountainous and urban roads, the roads on each tile of RDCME are not as dense as those in urban road datasets, which increases the probability of misclassification; post-processing is therefore a key factor affecting the prediction results of Light Roadformer. Every tile is predicted and combined into one whole image. Because a road segment is always connected to other road segments, wrongly classified pixels will not be connected to other predicted road segments. Based on this characteristic, wrongly predicted road segments can easily be removed by checking the size of each road segment. Our post-processing uses the depth-first search (DFS) algorithm [74]. DFS traverses a tree or graph by going through the nodes along the depth of the tree and searching deep into the branches. When every edge of a node v has been searched, or a node does not meet the conditions during the search, the search backtracks to the starting node of the edge of node v. The whole process is repeated iteratively until all nodes have been visited. This blind search method is less efficient but effective.
Post-processing can be divided into the following steps:
1. First, traverse the whole resulting image to find the predicted road pixels and ignore the non-road pixels.
2. Second, for each road pixel, take every connected pixel that is predicted as road as belonging to the same road segment; then, search the connected pixels to calculate the size of each road segment.
3. Third, according to the ratio of the extracted road segment area to the entire image area and whether the road segment is connected to the edge of the image, remove the non-road segments.
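The steps above can be sketched as a connected-component filter using an iterative DFS. This is a simplified sketch: only the area criterion is shown, with the edge-of-image check omitted, and `min_area` is a hypothetical absolute threshold standing in for the area ratio described above:

```python
def remove_small_segments(mask, min_area):
    # mask is a 2D list of 0/1 predictions; components of road pixels
    # smaller than min_area are treated as misclassifications and erased.
    H, W = len(mask), len(mask[0])
    seen = [[False] * W for _ in range(H)]
    out = [row[:] for row in mask]
    for y in range(H):
        for x in range(W):
            if mask[y][x] == 1 and not seen[y][x]:
                # Iterative DFS over the 4-connected component.
                stack, comp = [(y, x)], []
                seen[y][x] = True
                while stack:
                    cy, cx = stack.pop()
                    comp.append((cy, cx))
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < H and 0 <= nx < W
                                and mask[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                # Drop components smaller than the area threshold.
                if len(comp) < min_area:
                    for cy, cx in comp:
                        out[cy][cx] = 0
    return out
```

An explicit stack is used instead of recursion so that long road segments cannot overflow the recursion limit.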

Experimental Setting and Evaluation Metrics
An Intel Core i7-11700K CPU, an NVIDIA GeForce RTX 3090 GPU with 24 GB of graphics memory, and 32 GB of RAM were employed to conduct the experiments. The operating system was Windows 11. For every model, the batch size was set to fully utilize the graphics card memory. The models were constructed based on the MMSegmentation [75] toolbox and the PyTorch [76] framework, and a total of 40,000 iterations were performed.
The evaluation metrics of the road extraction methods were precision and intersection over union (IoU). The IoU describes the accuracy of road area segmentation and is calculated as IoU = |A ∩ B| / |A ∪ B|, where A is the road reference and B is the road obtained by segmentation. Figure 7 shows the predicted results of Light Roadformer. The images in the left column are image samples; the middle column is the labeled ground truth; the right column is the predicted image of the model. In Figure 7a, although the road pixels take up a small proportion of the whole sample, the IoU of the road is 89%. In Figure 7d,g, the model reached high IoUs of 97% and 96% despite the cars acting as noise.
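The IoU formula above translates directly to code; a minimal sketch over flattened binary masks (the function name is ours, not from the paper's codebase):

```python
def road_iou(pred, label):
    # IoU = |A ∩ B| / |A ∪ B| over flattened 0/1 road masks.
    inter = sum(1 for p, t in zip(pred, label) if p == 1 and t == 1)
    union = sum(1 for p, t in zip(pred, label) if p == 1 or t == 1)
    # Empty union (no road in either mask) is scored as a perfect match.
    return inter / union if union else 1.0
```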

Prediction Experiment
To test the Light Roadformer model's performance beyond the dataset's test set, two road segments were selected from the study area that were not included in the dataset. First, the remote sensing image was divided into image patches, each with a size of 256 × 256 pixels. Then, the road segments were extracted with Light Roadformer, and the image patches were recombined into one image. After that, we applied the road post-processing to obtain the final predictions; the results are shown in Figures 8 and 9.
Figures 8a and 9a are the original remote sensing images of the two road segments; the corresponding labels and the segmentation results of the Light Roadformer model are shown in Figure 8b,c and Figure 9b,c. Since the image is divided into multiple patches, the corners of some patches are wrongly segmented, and parts of the river and mountain shadow are also classified as roads due to their striped shape, which is similar to the road objects. The road IoUs are 65.46% and 79.55%. After post-processing of the predicted roads, the wrongly classified road segments are removed, and the IoUs rise to 84.47% and 87.11%, a significant boost.
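The tile-splitting and recombination used in this prediction pipeline can be sketched in plain Python; a minimal version for 2D arrays whose sides are divisible by the tile size (the per-tile model inference itself is omitted):

```python
def split_tiles(img, t):
    # Cut a 2D image into a row-major grid of t x t tiles.
    H, W = len(img), len(img[0])
    return [[[row[x:x + t] for row in img[y:y + t]]
             for x in range(0, W, t)]
            for y in range(0, H, t)]

def stitch_tiles(grid):
    # Reassemble a grid of tiles into a single image.
    out = []
    for tile_row in grid:
        for r in range(len(tile_row[0])):
            out.append([v for tile in tile_row for v in tile[r]])
    return out
```

Stitching the unmodified tiles reproduces the original image exactly; in the actual pipeline, each tile would be replaced by its predicted road mask before stitching.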

Comparison with the Existing Datasets
RDCME was compared with other public road extraction datasets: DeepGlobe [77] and Massachusetts [39]. The DeepGlobe road extraction dataset consists of RGB images with a size of 1024 × 1024 pixels, was collected by DigitalGlobe's satellite, and has a 0.5 m resolution. The Massachusetts roads dataset includes 1171 remote sensing images, each with a size of 1500 × 1500 pixels covering an area of 2.25 square kilometers. The Massachusetts roads dataset mainly contains various urban, suburban, and rural areas. RDCME targets roads in mountainous areas and excludes roads in villages and cities. Table 1 shows the comparison of RDCME, DeepGlobe, and Massachusetts.
The IoU during the training process is shown in Figure 10. The Light Roadformer model reached 89.6%, surpassing the other semantic segmentation models; the IoU of Light Roadformer on the test set was 88.8%. The parameter counts and road IoUs of the models are shown in Table 2. The parameter count of Light Roadformer was 68,719,810. By comparison, Light Roadformer outperformed the other road extraction models while maintaining a moderate parameter count, achieving the highest accuracy of the compared network models.

Conclusions
In the current study, aiming at road extraction in mountain areas, a mountain road extraction dataset was first manually labeled. Then, we proposed the Light Roadformer model to address road extraction in the mountain environment. The model uses a self-attention module to focus on road edge details and a pyramid structure to obtain high-resolution coarse features and low-resolution fine features, providing better segmentation of mountain roads. A post-process module is also used to remove incorrectly segmented roads based on road topological features. The model reached an 88.8% IoU for roads on the manually labeled dataset, outperforming other road extraction models. The validation on the remote sensing images showed good potential for road extraction in mountainous areas.
Author Contributions: X.Z., conceptualization, data curation, methodology, writing-original draft, and writing-review and editing; Y.J., conceptualization, data curation, methodology, and writing-original draft; L.W., conceptualization, funding acquisition, methodology, writing-review and editing, and project administration; W.H., investigation and data curation; R.F. (Ruyi Feng), writing-review and editing; R.F. (Runyu Fan), writing-review and editing; S.W., writing-review and editing. All authors have read and agreed to the published version of the manuscript.

Conflicts of Interest:
The authors declare no conflict of interest.