1. Introduction
Using remote sensing imagery to obtain road information has become a research hotspot. With the development of remote sensing technology, the resolution of remote sensing images has continuously improved [1,2,3]. Sandy roads are an indispensable part of fragile ecological environments, such as deserts and grasslands [4], and extracting sandy roads in field environments benefits field ecological inspections, field wind farm inspections, and path planning in complex environments [5,6]. However, due to intricate textural elements, occlusion effects, and low contrast with surrounding objects, sandy roads are challenging to extract [7]. Deep learning technology is maturing and is widely employed in feature extraction for remote sensing images [8,9,10]. Using deep learning to extract sandy roads from remote sensing images therefore has practical application value and potential in this field.
Previous researchers have studied many road extraction techniques, the most common of which are pixel-based extraction methods [11,12] and object-oriented road extraction methods [13,14]. Li et al. [15] used a watershed road segmentation algorithm based on threshold labeling, which utilizes the watershed transform to extract road information. Zhou et al. [16] used an object-oriented image analysis method to extract roads and achieved some results. As can be seen, these techniques work well in certain application settings, but because they depend on manually chosen, subjective thresholds, they do not generalize to road extraction tasks across multiple scenarios.
In terms of urban road extraction, given the need for navigation and urban planning, as well as the wide application of neural network technology in remote sensing information processing in recent years [17,18,19], methods for extracting urban roads using neural networks are gradually developing [20,21,22]. G. Zhou et al. [23] introduced a split depth-wise (DW) Separable Graph Convolutional Network (SGCN), tailored to scenarios involving closed tarmac or tree-covered roads. The SGCN extracts road features effectively and suppresses noise interference throughout the road extraction process. Wang et al. [24] used a convolutional neural network to extract roads from high-resolution remote sensing images and repaired the road breaks appearing in the extraction results, finally obtaining complete road extraction results. Although deep learning methods have made some progress in road extraction, they still face challenges. Scholars are now adopting new approaches to improve accuracy, among which a common strategy is to increase the depth of the network and introduce feature enhancement modules that strengthen the extraction ability of the model [25,26,27]. Jing et al. [28] enhanced the network’s learning ability by deepening the network with a residual stacking module and obtained better extraction results. Wu et al. [29] proposed a dense global spatial pyramid pooling module, based on atrous spatial pyramid pooling, to obtain multi-scale information, which enhances the perception and aggregation of contextual information. Qi et al. [30] proposed the AT replaceable module, which incorporates features at different scales; it utilizes the rich spatial and semantic information in remote sensing imagery to improve the extraction of roads. Liu et al. [31] introduced an improved asymmetric convolutional block (ACB)-based initiation structure, which extended low-level features in the feature extraction layer to reduce computational effort and obtain better results. It can be seen that extraction methods based on convolutional neural networks have made great progress in urban road extraction.
The formation of sandy roads is influenced by various factors, including geography, weathering, hydrological processes, and human activities, which give sandy roads distinct characteristics. Despite these distinctive attributes, there is a paucity of datasets dedicated to sandy road extraction, and scholarly research in this domain is relatively limited. The extraction of sandy roads therefore faces major limitations. The U-Net network is commonly used for image segmentation tasks due to its encoding–decoding and skip-connection structure, as well as its advantages in training speed, inference speed, and accuracy; however, the original U-Net alone is not well suited to the task of sandy road extraction. The influencing factors are shown in Figure 1. (a) The boundary between vegetation and bare soil is ambiguous. (b) The road has poor continuity. (c) The road is obscured by extraneous features, such as trees and rocks. (d) Sandy roads can be easily mistaken for other features like rivers or canals. (e) The road spans a large scale and has a long-range banded structure [32,33]. Therefore, because these complex structural features interfere with road extraction, convolutional neural networks alone do not achieve the desired results.
To address these issues, the following work was conducted in this study:
- (a) This study proposes a sandy road extraction model, PAM-Unet, based on an improved U-Net [34,35,36,37]. To address the poor continuity of sandy roads, PAM-Unet employs stacked residual modules in the encoder section to enhance the model’s feature extraction capability. Meanwhile, at the end of the model encoder, the ASPP module proposed in the DeepLab series of models [38,39,40,41] is combined with the stripe pooling module [42] to better perceive multi-scale features [43] and to adapt to the long-range banded features of sandy roads. To handle occlusion by other targets in the field environment, the parallel attention mechanism (PAM) is proposed and adopted in the feature fusion part to improve the restoration of the feature map.
- (b) This study proposes the RSISR dataset, which covers a variety of complex sandy road scenarios, including bare soil, grassland, and forests. The final dataset contains 12,252 samples. It provides strong support and a reliable baseline for this study and for the analysis of sandy roads.
- (c) The PAM-Unet model was tested and analyzed several times on the RSISR dataset and the DeepGlobe dataset, which showed that the model is effective for sandy road extraction and that the added modules improve performance. PAM-Unet achieved good extraction results on the sandy road dataset, with an IoU value of 0.762 and high F1 and recall values, while the results on the DeepGlobe dataset further demonstrated the positive effects of the model’s modules.
This paper is structured as follows:
Section 1 briefly introduces the background of road extraction research, analyzes the difficulties of sandy road extraction, introduces the proposed extraction method, and describes the contributions of this paper.
Section 2 describes the proposed PAM-Unet model in detail. First, the overall structure of PAM-Unet is presented, followed by the parallel attention mechanism (PAM) and the SASPP structure.
Section 3 describes the construction of the sandy road dataset, the use of the DeepGlobe dataset, and the experimental setup.
Section 4 presents the results of comparison and ablation experiments on the datasets to demonstrate the feasibility and effectiveness of the model. The last two sections present the discussion and conclusions.
2. Research Methodology
2.1. Basic U-Net Structure
The U-Net network structure is a highly symmetrical end-to-end segmentation network [44]. It consists of an encoder–decoder framework and skip connections, which effectively transfer information between low-level and high-level features. The overall structure of the network is shown in Figure 2.
In the encoder section, successive downsampling performs step-by-step feature extraction: each convolution is followed by the ReLU activation function and then by a max-pooling layer with a stride of 2. As the resolution of the feature map gradually decreases, the number of channels increases, so that global information is captured better. In the decoder section, the feature maps are gradually restored to the original size to recover the features lost in the encoder, and two convolutional layers are used at each upsampling step. At each stage, the upsampled feature maps are concatenated with the feature maps from the corresponding downsampling stage to recover low-level semantic information. Finally, a convolution adjusts the number of channels to obtain the segmentation result.
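To make the encoder arithmetic concrete, the following Python sketch tracks how channel count and resolution change across downsampling stages and shows a 2 × 2 max pooling with a stride of 2 on a small map. The base width of 64 channels, the four stages, and the channel doubling per stage are assumptions chosen for illustration; they are not taken from this paper.

```python
def unet_encoder_shapes(height, width, base_channels=64, stages=4):
    """Return (channels, height, width) at the input level and after each stage.

    Assumes every stage doubles the channels and ends with a stride-2 max
    pooling that halves the spatial size (hypothetical defaults).
    """
    shapes = [(base_channels, height, width)]
    channels = base_channels
    for _ in range(stages):
        channels *= 2
        height //= 2
        width //= 2
        shapes.append((channels, height, width))
    return shapes


def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a 2D list with even height and width."""
    return [
        [
            max(feature_map[i][j], feature_map[i][j + 1],
                feature_map[i + 1][j], feature_map[i + 1][j + 1])
            for j in range(0, len(feature_map[0]), 2)
        ]
        for i in range(0, len(feature_map), 2)
    ]


shapes = unet_encoder_shapes(512, 512)
pooled = max_pool_2x2([[1, 2, 3, 4],
                       [5, 6, 7, 8],
                       [9, 10, 11, 12],
                       [13, 14, 15, 16]])
```

Under these assumptions, a 512 × 512 input ends at 32 × 32 with 1024 channels, which is the trade of resolution for channel depth that the skip connections later compensate for.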
2.2. PAM-Unet Structure
The PAM-Unet proposed in this study is an improvement of the U-Net structure, which is shown in Figure 3. In the encoder section, the model employs a stacked residual network to perform the initial extraction of sandy road features. In the feature fusion phase, the model first concatenates the feature maps in high and low dimensions to capture multi-dimensional features. Subsequently, the model incorporates the parallel attention mechanism (PAM) to enhance feature extraction in both channel and spatial dimensions. Finally, in the last stage of the encoder, the model integrates the SASPP module. This operation enlarges the receptive field to strengthen contextual connections within the model, reduces road fragmentation in segmentation results, and enhances the extraction of subtle roads.
PAM-Unet is made up of three parts: encoder, feature fusion, and decoder. In the encoder section, we employ a residual stacking unit, which preserves detailed target features effectively. This unit is integrated into the encoder component of the backbone extraction network, ensuring accurate feature extraction. Details of the overall structure of the stacked residual network used in the encoding layers of the PAM-Unet model are shown in Table 1. In the feature fusion component of the model, the concatenation of high- and low-dimensional feature maps yields multi-dimensional features. After the fusion of these multi-dimensional features, the parallel attention module, referred to as PAM, is embedded. This module consists of spatial attention and channel attention components. In the spatial component, average pooling and max-pooling operations are applied separately to the input feature maps, while in the channel component, global average pooling is applied. These operations compute weight values for both spatial and channel dimensions, which are used to re-weight the original feature maps, preserving the integrity of the extracted roads. Towards the end of the backbone network, the ASPP (Atrous Spatial Pyramid Pooling) module obtains multi-scale information through dilated convolutions, contributing to the acquisition of global features. However, because roads are long and narrow, square pooling kernels are insufficient for capturing their stripe-like features. Therefore, the stripe pooling module is integrated into the ASPP structure to enhance the focus on road features and suppress background noise. The decoder takes the feature maps obtained from the previous two parts, performs feature fusion through dimensional orientation, and restores them to the size of the original image. The SASPP (Spatial Attention Spatial Pyramid Pooling) structure plays a vital role in acquiring fine road details and reducing road fragmentation. Meanwhile, the parallel attention mechanism (PAM) reconstructs road information while reducing the interference of irrelevant features. By integrating these three components, PAM-Unet enhances road extraction accuracy and reduces external noise interference, enabling the precise extraction of sandy roads.
2.3. Parallel Attention Mechanism (PAM)
The downsampling network, composed of stacked residual units, excels at finely extracting features. In the feature fusion phase, each stage of the downsampling process generates its own feature map. These feature maps possess distinct characteristics, and overlaying them enhances their diversity [45,46]. After the feature maps are superimposed, the parallel attention mechanism (PAM) is applied to the superimposed feature maps to strengthen attention to the target. This module obtains the global features by compressing the channels and enhancing the spatial information, which, in turn, improves the overall accuracy of the model.
The model adopts the feature fusion and attention mechanism enhancement module in
Figure 4. The high- and low-dimensional features are fused to obtain a feature map $F$ of size $C \times H \times W$ ($C$, $H$, and $W$ representing the number of channels, height, and width of the feature map, respectively), which is fed into two parallel modules: the spatial information extraction part mainly enhances spatial-dimension information, while the channel part compresses the feature map to extract channel features and improves cross-channel interaction through an adaptive convolution kernel. In the spatial information extraction part, keeping the spatial dimension unchanged, two feature maps $F_{\max}$ and $F_{\mathrm{avg}}$ are obtained by max-pooling and average pooling of the feature map $F$. The dimension of each pooled feature map is thereby converted from $C \times H \times W$ to $1 \times H \times W$. Subsequently, the feature maps are concatenated and convolved, and the weights are obtained through the sigmoid activation function. The weights are multiplied with the input feature map to obtain the result of the spatial feature extraction part. The spatial attention calculation formula is shown in Equation (1):
$$S = \sigma\left(f\left(\left[F_{\max};\, F_{\mathrm{avg}}\right]\right)\right) \tag{1}$$
In Equation (1), the sigmoid activation function is denoted by $\sigma$ while the convolution process is denoted by $f$, $F$ represents the feature map (with $F_{\max}$ and $F_{\mathrm{avg}}$ its max-pooled and average-pooled versions), and $S$ represents spatial attention.
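As a rough illustration of the spatial branch described above, the following Python sketch pools a small feature map over its channels, turns the two pooled maps into a weight map with a sigmoid, and re-weights the input. The mixing weights `w_max`, `w_avg`, and `bias` are hypothetical stand-ins for the learned convolution (reduced here to a 1 × 1 kernel, since the kernel size is not stated in this section), so this is a simplified sketch rather than the module used in PAM-Unet.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def spatial_attention(feature, w_max=1.0, w_avg=1.0, bias=0.0):
    """Apply spatial attention to a feature map given as feature[c][h][w].

    The two pooled maps are mixed by a 1x1 convolution (w_max, w_avg, bias),
    a simplification standing in for the learned convolution.
    """
    channels = len(feature)
    height, width = len(feature[0]), len(feature[0][0])
    weights = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            values = [feature[c][i][j] for c in range(channels)]
            pooled_max = max(values)              # C x H x W -> 1 x H x W
            pooled_avg = sum(values) / channels   # C x H x W -> 1 x H x W
            weights[i][j] = sigmoid(w_max * pooled_max + w_avg * pooled_avg + bias)
    # Every channel is re-weighted with the same spatial weight map.
    output = [
        [[feature[c][i][j] * weights[i][j] for j in range(width)]
         for i in range(height)]
        for c in range(channels)
    ]
    return weights, output


# Two channels, one row, two columns.
toy = [[[0.0, 2.0]], [[0.0, -2.0]]]
toy_weights, toy_output = spatial_attention(toy)
```

The weight map has one value per pixel, so the output keeps the full $C \times H \times W$ shape while positions with strong responses in any channel are emphasized.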
In the channel information compression and extraction part, the input feature map size is also $C \times H \times W$. Firstly, the feature map $F_1$ is subjected to an average pooling operation for each channel to obtain the pooled feature map $F_1^{\mathrm{avg}}$, and the size of the feature map is converted to $C \times 1 \times 1$. Then, the dynamic convolution operation is carried out. Unlike ordinary channel attention, this module does not perform a dimensionality reduction operation but uses a 1D convolution of size 1 for fast implementation. This operation effectively captures cross-channel interaction information. Subsequently, the output feature map is passed through the sigmoid function to obtain the weights, and the weights are multiplied by the original feature map to obtain the result. The channel characteristic compression calculation formula is shown in Equation (2):
$$c = \sigma\left(f\left(F_1^{\mathrm{avg}}\right)\right) \tag{2}$$
In Equation (2), the sigmoid activation function is denoted by $\sigma$ while the convolution process is denoted by $f$, $F_1$ represents the feature map (with $F_1^{\mathrm{avg}}$ its average-pooled version), and $c$ represents channel attention.
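The channel branch can be sketched in the same spirit. The Python function below averages each channel to a single value, slides a 1D kernel along the channel axis without any dimensionality reduction, and scales each channel by the resulting sigmoid weight. The kernel values and length are hypothetical parameters: a length of 1 re-weights each channel independently, while a longer odd-length kernel mixes neighbouring channels.

```python
import math


def channel_attention(feature, kernel):
    """Channel attention on feature[c][h][w] without dimensionality reduction.

    kernel is an odd-length list of 1D convolution weights slid across the
    channel axis with zero padding; its values and length are hypothetical.
    """
    channels = len(feature)
    # Average pooling per channel: C x H x W -> C x 1 x 1.
    pooled = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature
    ]
    pad = len(kernel) // 2
    padded = [0.0] * pad + pooled + [0.0] * pad
    weights = []
    for c in range(channels):
        response = sum(k * padded[c + t] for t, k in enumerate(kernel))
        weights.append(1.0 / (1.0 + math.exp(-response)))
    # Scale every value of a channel by that channel's weight.
    output = [
        [[value * weights[c] for value in row] for row in feature[c]]
        for c in range(channels)
    ]
    return weights, output


toy = [[[1.0, 1.0]], [[2.0, 2.0]], [[3.0, 3.0]]]
w_single, _ = channel_attention(toy, [1.0])            # each channel on its own
w_mixed, out_mixed = channel_attention(toy, [1.0, 1.0, 1.0])  # neighbours mixed
```

Because the kernel acts directly on the $C$ pooled values, the number of extra parameters equals the kernel length, which is what keeps this form of channel attention lightweight.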
Finally, the feature maps from the two branches, which integrate spatial and channel information, undergo an addition operation to obtain the enhanced feature map.
2.4. Improved ASPP Module
In 2015, Kaiming He et al. [47] first proposed the ASPP structure, which consists of multiple atrous (dilated) convolutions with different dilation rates and a global average pooling module. Combining atrous convolutions with different dilation rates enlarges the receptive field of the entire feature map without loss of resolution. Therefore, the multi-scale features of the target can be captured and the multi-scale feature maps fused afterwards. The original ASPP structure is shown in Figure 5.
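The effect of the dilation rate can be illustrated with a short Python sketch. A kernel of size $k$ with dilation rate $r$ covers a span of $k + (k - 1)(r - 1)$ positions while still using only $k$ weights, and the 1D convolution below keeps the output the same length as the input. The rates 6, 12, and 18 in the usage lines are illustrative values only; the rates used in this paper are not given in this section.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by a dilated kernel: k + (k - 1) * (r - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv1d(signal, kernel, dilation):
    """Same-length 1D atrous convolution (cross-correlation) with zero padding.

    Expects an odd-length kernel so that the window is centred on each sample.
    """
    half = (effective_kernel_size(len(kernel), dilation) - 1) // 2
    output = []
    for i in range(len(signal)):
        total = 0.0
        for t, weight in enumerate(kernel):
            idx = i - half + t * dilation  # taps are spaced `dilation` apart
            if 0 <= idx < len(signal):
                total += weight * signal[idx]
        output.append(total)
    return output


spans = [effective_kernel_size(3, r) for r in (1, 6, 12, 18)]
smoothed = dilated_conv1d([1, 2, 3, 4, 5], [1, 1, 1], dilation=2)
```

With a dilation of 2, each output sums the samples two positions apart, so the resolution of the signal is preserved while the window widens.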
Roads exhibit long-range banded features, and employing a large square pooling window would result in the inclusion of noise from irrelevant regions. Furthermore, due to the extensive span of road targets, long-range convolutions can weaken contextual connections between feature maps. The stripe pooling structure, which maintains a relatively narrow kernel shape in the spatial dimension, is capable of capturing long-distance relationships within isolated regions. The ASPP (Atrous Spatial Pyramid Pooling) structure enlarges the receptive field, enabling the capture of multi-scale information regarding the overall target [48,49]. Combining these two modules brings together the advantages of both. Therefore, in the overall model, the stripe pooling structure is introduced into ASPP to expand the receptive field while establishing long-distance relationships to extract the multi-scale features of the road.
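The idea behind stripe pooling can be sketched in a few lines of Python. Instead of a square window, each row is averaged into a 1 × W strip summary and each column into an H × 1 strip summary; the two are broadcast back to H × W, added, and passed through a sigmoid to gate the input. The learned 1D convolutions that usually follow the pooling are omitted here, so this is a simplified sketch under that assumption, not the module used in SASPP.

```python
import math


def strip_pooling(feature_map):
    """Gate a 2D map with row-wise and column-wise strip averages.

    Returns the row means, the column means, and the gated map. Learned
    convolutions after pooling are deliberately left out of this sketch.
    """
    height, width = len(feature_map), len(feature_map[0])
    row_means = [sum(row) / width for row in feature_map]
    col_means = [
        sum(feature_map[i][j] for i in range(height)) / height
        for j in range(width)
    ]
    gated = []
    for i in range(height):
        gated_row = []
        for j in range(width):
            gate = 1.0 / (1.0 + math.exp(-(row_means[i] + col_means[j])))
            gated_row.append(feature_map[i][j] * gate)
        gated.append(gated_row)
    return row_means, col_means, gated


# A horizontal "road" running through the middle row.
road = [[0, 0, 0],
        [3, 3, 3],
        [0, 0, 0]]
row_means, col_means, gated = strip_pooling(road)
```

The row containing the road gets a high strip average across its whole length, so every pixel along that band is reinforced together, which a small square window centred on one pixel would not capture.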
The SASPP structure consists of six parallel convolutional blocks, and the overall structure is shown in Figure 6. The number of channels in the first and last convolutional blocks remains unchanged, preserving the original information of the feature map. Meanwhile, the middle four convolutional blocks have 256 output channels each to enhance feature extraction capabilities. Since different dilation rates help the model to perceive features at different scales, we slightly adjusted the dilation rate r in the intermediate convolutional blocks of SASPP so that they combine effectively with the stripe pooling module and balance global context against local details. Finally, a concatenation operation fuses the features obtained at the different dilation rates, and the concatenated feature maps are passed through a 1 × 1 convolution to generate the output feature maps.
5. Discussion
The results from the ablation experiments highlight the effectiveness of the introduced parallel attention mechanism (PAM) in enhancing road continuity, consistency, and completeness. The SASPP module is better adapted to the banded characteristics of roads and enlarges the receptive field during the extraction process. The use of stacked residual networks, in turn, ensures the accuracy of the whole extraction process and lays the foundation for the extraction of sandy roads. In this study, stacked residual networks were used as the basis and these two feature enhancement modules were introduced to construct the PAM-Unet network model, with which better experimental results were achieved. In this regard, we believe that PAM-Unet is valuable for field ecological patrols and path planning. In the comparison experiments, PAM-Unet was tested against other semantic segmentation models using the RSISR dataset. The IoU value obtained with PAM-Unet was 0.762, which was higher than those of the other semantic segmentation models (evaluation values shown in Table 3).
Figure 10 compares the results of this model with those of the other models on the test set, from which it can be seen that PAM-Unet extracts occluded roads and subtle road features better than the other models. This is thanks to the combined application of the parallel attention mechanism (PAM) and the SASPP structure. To verify the effectiveness of the improvement, an ablation test of each module was carried out on the DeepGlobe dataset. It can be seen in Table 5 that the optimal result was obtained after adding both the SASPP structure and the parallel attention mechanism, with an IoU value of 0.658. By comparison, the IoU value when adding only the parallel attention mechanism was 0.651, and the IoU value when adding only the SASPP structure was 0.644. This shows that adding the two modules together yields the best result and supports their complementary roles. Meanwhile, from
Figure 11, it can be seen that the improved model extracts the roads more completely and outperforms the other models in terms of comprehensive performance. To further validate the performance difference between the improved and original methods, we conducted a statistical hypothesis test using Student’s t-test on multiple sets of experimental values obtained with PAM-Unet and U-Net. We selected IoU and F1 score as the key performance indicators. The calculation formula for the t-test is shown in Equation (8), aimed at ascertaining whether there was a statistically significant difference in performance between the improved model and the original model.
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}} \tag{8}$$
in which $\bar{X}_1$ and $\bar{X}_2$ are the means of the two sample groups, $S_1$ and $S_2$ are the standard deviations of the two sample groups, and $n_1$ and $n_2$ are the sample sizes of the two sample groups.
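A two-sample t-statistic of this kind can be computed with the Python standard library, as in the sketch below. It assumes the unpooled form, in which the difference of the two means is divided by the square root of $S_1^2/n_1 + S_2^2/n_2$; for equal group sizes this coincides with the pooled-variance form. The sample values are invented for illustration and are not results from this paper, and the p-value (which requires the t distribution) is not computed here.

```python
import math
import statistics


def t_statistic(sample_1, sample_2):
    """Two-sample t-statistic from means, standard deviations, and sizes.

    Uses the unpooled denominator sqrt(S1^2/n1 + S2^2/n2), an assumption
    about the exact form of the test.
    """
    mean_1, mean_2 = statistics.mean(sample_1), statistics.mean(sample_2)
    s1, s2 = statistics.stdev(sample_1), statistics.stdev(sample_2)
    n1, n2 = len(sample_1), len(sample_2)
    return (mean_1 - mean_2) / math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)


# Hypothetical scores for two models over three runs each.
improved_runs = [2.0, 4.0, 6.0]
baseline_runs = [1.0, 2.0, 3.0]
t_value = t_statistic(improved_runs, baseline_runs)
```

A positive value means the first group has the higher mean; whether the difference is significant then depends on comparing the statistic against the t distribution with the appropriate degrees of freedom.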
In this analysis, we focused on the t-statistic values and corresponding p-values for IoU and F1 score. For IoU, the t-statistic value was 2.158, and the p-value was 0.043. At a significance level of 0.05, the p-value was below the significance level, so the null hypothesis can be rejected: the improved algorithm shows a statistically significant difference in IoU compared to the original model. For F1 score, the t-statistic value was 2.605, and the p-value was 0.017, which is likewise below the 0.05 significance level and leads to the rejection of the null hypothesis. Thus, for both IoU and F1 score, the PAM-Unet model differs significantly from the U-Net model, and the results indicate that PAM-Unet outperformed the original model. These hypothesis tests strengthen the credibility of our improvements, supporting the improved model’s performance advantages and suggesting stronger generalizability across multiple applications. We hope that our improved model design will inspire future researchers to advance the work on sandy road extraction, ultimately contributing to more precise sandy road extraction in the future.
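For reference, the two indicators used in these tests can be computed from a predicted and a ground-truth binary mask as sketched below in Python. The sketch uses the standard definitions, IoU = TP / (TP + FP + FN) and F1 = 2TP / (2TP + FP + FN), where road pixels are marked 1; the tiny masks are invented for illustration, and returning 1.0 when both masks are empty is a convention chosen here.

```python
def iou_and_f1(predicted, target):
    """Return (IoU, F1) for two binary masks given as 2D lists of 0/1."""
    tp = fp = fn = 0
    for pred_row, true_row in zip(predicted, target):
        for p, t in zip(pred_row, true_row):
            if p and t:
                tp += 1      # road predicted and present
            elif p and not t:
                fp += 1      # road predicted but absent
            elif t and not p:
                fn += 1      # road present but missed
    if tp + fp + fn == 0:
        return 1.0, 1.0      # both masks empty: treated as a perfect match
    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return iou, f1


predicted_mask = [[1, 1, 0],
                  [0, 1, 0]]
target_mask = [[1, 0, 0],
               [0, 1, 1]]
iou, f1 = iou_and_f1(predicted_mask, target_mask)
```

Since F1 counts the true positives twice, it is never lower than IoU for the same prediction, which is worth keeping in mind when comparing the two columns of a results table.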
In summary, compared to traditional manual road extraction methods, sandy road extraction through semantic segmentation offers the advantage of accurately identifying sandy roads in complex environments. This capability proves valuable for tasks like vehicle navigation, path planning, and ecological patrols in challenging landscapes, making it highly adaptable. However, our model has limitations in harsh geographic environments and bad weather in real-world applications, and further improving recognition accuracy remains a challenge. Future research should focus more on cross-data source fusion and real-time semantic segmentation, and investigate how to integrate different data sources to improve the robustness of road extraction. We also propose developing efficient semantic segmentation models that can run in real-time environments. Furthermore, researchers may investigate semi-supervised and unsupervised learning methods to reduce the reliance on large amounts of labeled data. Finally, there is research scope to combine semantic segmentation with other perceptual modalities (e.g., object detection and speech recognition) to improve the comprehensiveness of scene understanding.
6. Conclusions
The main aim of this study was to address the problem of sandy road extraction. We first established a sandy road extraction dataset, RSISR, and then improved on the U-Net network by proposing PAM-Unet. Compared to the baseline U-Net network, our model yields significantly improved prediction results. Notably, PAM-Unet excels in enhancing the extraction of fine sandy road features, improving road continuity, and reducing the interference from extraneous features. The achieved results align with our expectations: PAM-Unet improves the completeness and continuity of the road extraction, which makes the sandy road extraction result clearer and more holistic. The comparison experiments verify that the model outperforms other semantic segmentation models, and the ablation experiments demonstrate the usefulness of the individual improvement modules. In summary, the challenges in sandy road extraction can be dealt with effectively by reasonably designing the network structure and feature fusion method, which improves the accuracy and robustness of road extraction from remote sensing images. However, some limitations of the model still exist, which relate to the size and diversity of the dataset and to the rationale for the combination of network structures. Future research can explore these issues to further improve the performance and generalizability of road extraction, and it is hoped that the results of this paper will provide useful references and insights for further research and applications in the field of sandy road extraction from remote sensing imagery.