We conducted extensive experiments to evaluate the proposed PLS strategy. This section first introduces the experimental datasets and elaborates on the implementation settings. We then perform parameter sensitivity analysis on the mini validation sets of the two datasets to explore feasible configurations of patch size S and patch number K. With the optimal parameters determined, we train D-LinkNet and Segformer as backbone networks on both datasets, and analyze their performance with and without the PLS module embedded. We also compare the performance of the GCE noise-robust loss function when applied to D-LinkNet, and further conduct direct comparisons with state-of-the-art noise-robust road extraction methods including RCFSNet and UGD-DLinkNet. Finally, we implement cross-dataset generalization validation experiments, which comprehensively demonstrate the superiority of the proposed PLS strategy.
5.1. Datasets
5.1.1. CH4P (China Four Provinces) Dataset
CH4P is a large-scale dataset constructed to evaluate model robustness under severe real-world underlabeling. It contains 13,498 fixed-size remote sensing images with a ground sampling distance of approximately 0.5 m/pixel. These images were collected from four representative Chinese provinces: Shandong, Shanxi, Gansu, and Guangxi. The selected provinces cover diverse terrains, including coastal plains, loess plateaus, arid inland areas, and karst landforms, which ensures rich and representative rural road morphology.
The dataset construction pipeline is detailed as follows. We first retrieved road centerline coordinates from OpenStreetMap (OSM) covering the four target provinces. Along the extracted road network, we randomly sampled seed points and took their latitude and longitude as the centers of candidate image tiles. For each seed point, we downloaded a satellite image tile via the Mapbox Static Images API at zoom level 17, corresponding to a spatial resolution of about 0.5 m/pixel.
For each downloaded image tile, we extracted OSM road annotations within the corresponding geographic bounding box. We retained line features tagged as highway, bridge, or tunnel as road vector primitives, and rasterized these vector data into binary segmentation masks consistent with the spatial resolution of the remote sensing imagery. Since OSM lacks explicit width information for most minor rural roads, we adopted a fixed default road width of 6 pixels, following the standard setting of the Massachusetts Roads Dataset [2]. The raw annotations directly derived from public map data naturally contain realistic label noise, such as missing road segments, coordinate offsets, and inaccurate road widths.
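The rasterization step can be illustrated with a minimal NumPy sketch. It assumes the centerline has already been projected from longitude and latitude into pixel coordinates (that projection is omitted here), and the function name `rasterize_centerline` and the distance-threshold rule are illustrative choices, not the authors' implementation.

```python
import numpy as np

def rasterize_centerline(points, height, width, road_width_px=6.0):
    """Rasterize a polyline of (x, y) pixel coordinates into a binary mask.

    A pixel is marked as road when the distance from its center to the
    nearest polyline segment is at most half the nominal road width.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    pix = np.stack([xs, ys], axis=-1).astype(np.float64)  # (H, W, 2)
    mask = np.zeros((height, width), dtype=bool)
    pts = np.asarray(points, dtype=np.float64)
    for a, b in zip(pts[:-1], pts[1:]):
        ab = b - a
        denom = float(ab @ ab)
        if denom == 0.0:
            # Degenerate segment: treat it as a single point.
            t = np.zeros((height, width))
        else:
            # Projection parameter of every pixel onto the segment, clipped to [0, 1].
            t = np.clip(((pix - a) @ ab) / denom, 0.0, 1.0)
        nearest = a + t[..., None] * ab
        dist = np.linalg.norm(pix - nearest, axis=-1)
        mask |= dist <= road_width_px / 2.0
    return mask.astype(np.uint8)

# Toy example: one horizontal centerline in a 64 x 64 tile.
mask = rasterize_centerline([(10, 32), (54, 32)], height=64, width=64)
```

Because the width is a fixed default rather than a measured value, any real road that is wider or narrower than this nominal width produces the width noise mentioned above.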
The proposed China Four Provinces (CH4P) dataset consists of 13,498 high-resolution remote sensing image-mask pairs. We partitioned the whole dataset into a training set of 11,296 samples and a validation set of 2202 samples, accounting for 83.7% and 16.3% of the total volume, respectively. Geographically, CH4P spans a broad spatial range across China, with longitude ranging from 93.03°E to 122.00°E and latitude from 20.92°N to 42.25°N, ensuring abundant geographic diversity and complex rural road scenarios. In terms of road density statistics, the training set has an average road coverage of 3.94% with a standard deviation of 3.10%, while the validation set has a mean road density of 4.15% and a standard deviation of 3.27%. Both subsets share similar road proportion distributions, with a minimum road density of 0.10% and a maximum of around 25%.
To support reliable quantitative evaluation with clean ground truth, we additionally constructed a refined subset containing 150 manually annotated images sampled from the same geographic distribution as CH4P. The manual refinement mainly focuses on retrieving missing road segments and optimizing road boundary accuracy. We adopted a dual-annotator cross-verification annotation protocol: two annotators independently revised each image, and all annotation inconsistencies were settled through collective discussion and consensus confirmation. From these 150 high-quality refined images, we randomly selected 25 samples to form CH4P-mini-val for hyperparameter tuning, and reserved the remaining 125 samples as CH4P-mini-test for final model evaluation.
To quantitatively analyze the annotation quality of raw OSM labels, we conducted a comparative evaluation against the manually refined annotations on the 150-image subset. Statistically, 18.4% of real road pixels are missing in the original OSM annotations. Among the pixels marked as roads in raw labels, 94.2% are verified as correct by manual refinement, demonstrating that positive road annotations remain highly reliable despite widespread underlabeling. These imperfections arise from coordinate offsets where annotated road centerlines deviate from actual road locations, width biases caused by inappropriately narrow or broad default road settings, and outdated map geographic data. Several typical cases of such raw annotation errors are illustrated in Figure 3.
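The two statistics above correspond to a missing rate (refined road pixels absent from the raw label) and a positive precision (raw road pixels confirmed by refinement). The sketch below shows one way to compute them for a pair of masks. The function name `label_quality` is hypothetical, and whether the reported figures pool pixels across all 150 images or average per-image values is not stated in the text, so the pooling choice is left to the caller.

```python
import numpy as np

def label_quality(raw_mask, refined_mask):
    """Compare a raw (noisy) road mask against a manually refined one.

    Returns (missing_rate, positive_precision):
      missing_rate       = refined road pixels not labeled as road in raw / refined road pixels
      positive_precision = raw road pixels confirmed by refinement / raw road pixels
    """
    raw = np.asarray(raw_mask, dtype=bool)
    ref = np.asarray(refined_mask, dtype=bool)
    true_road = ref.sum()
    raw_road = raw.sum()
    missing_rate = np.logical_and(ref, ~raw).sum() / true_road if true_road else 0.0
    positive_precision = np.logical_and(raw, ref).sum() / raw_road if raw_road else 0.0
    return float(missing_rate), float(positive_precision)

# Toy example (not data from the paper): two true roads, one of them unlabeled.
refined = np.zeros((8, 8), dtype=np.uint8)
refined[2, :] = 1
refined[5, :] = 1
raw = np.zeros_like(refined)
raw[2, :] = 1      # correctly labeled road
raw[7, 0] = 1      # one spurious road pixel
missing, precision = label_quality(raw, refined)
```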
These results verify that the CH4P dataset well reproduces the inherent annotation defects of public map resources. It thus provides a realistic and challenging benchmark to evaluate the noise robustness of road extraction algorithms under practical real-world label corruption.
5.1.2. DeepGlobe Road Extraction Dataset
The DeepGlobe Road Extraction Dataset [33] provides 6226 high-resolution satellite images (1024 × 1024 pixels, 0.5 m/pixel) with pixel-wise road annotations, covering rural and urban areas in Thailand, Indonesia, and India. We randomly split the 6226 annotated images into a training set of 5189 images and a noisy validation set of 1037 images. The dataset also contains an additional 1243 unannotated images originally intended for benchmark evaluation.
To obtain clean evaluation data with near-complete road labels, we randomly selected 100 images from the 1243 unannotated images and manually refined their road annotations to correct primarily missing roads, such as narrow, unpaved, or partially occluded segments. The refinement protocol followed the same dual-annotator cross-verification procedure used for CH4P: two independent annotators corrected each image and all discrepancies were resolved through consensus review to minimize subjective bias.
From these 100 refined images, we randomly split off 20 images as DG-mini-val, which are used exclusively for hyperparameter selection (e.g., patch size S and number of patches K). The remaining 80 images constitute the DG-mini-test set, which is used to evaluate the practical capability of different methods, including our proposed approach, in extracting underlabeled roads.
5.2. Implementation Details
We adopted D-LinkNet34 and Segformer as the backbone networks to validate the effectiveness of the proposed PLS strategy. The training pipeline follows the original implementations of D-LinkNet34 and Segformer, with a batch size of 8 and standard data augmentation, including random flipping, rotation, and color jitter. All models were trained for 100 epochs, with early stopping based on validation loss. For the PLS strategy, we cropped K patches of size S × S per image, each centered on a positively labeled road pixel. The loss function combined binary cross-entropy loss and Dice loss and was computed exclusively within these sampled patches. Parameter sensitivity analysis (Section 5.3) was performed on the DG-mini-val and CH4P-mini-val subsets, and the optimal parameters obtained were fixed for all subsequent experiments. Notably, the mini-val subsets of the DeepGlobe and CH4P datasets were used solely for hyperparameter tuning, while the corresponding mini-test subsets were reserved exclusively for the final comparative experiments.
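The patch-restricted loss described above can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' Algorithm 1: the helper names (`sample_positive_centers`, `patch_mask`, `pls_loss`) are hypothetical, patches are clipped at image borders, overlapping patches are merged into one supervision mask, and the BCE and Dice terms are summed with equal weight.

```python
import numpy as np

def sample_positive_centers(label, k, rng):
    """Draw k patch centers (row, col) from positively labeled pixels."""
    ys, xs = np.nonzero(label)
    if len(ys) == 0:
        return np.empty((0, 2), dtype=int)
    # Sample with replacement only if there are fewer positives than k.
    idx = rng.choice(len(ys), size=k, replace=len(ys) < k)
    return np.stack([ys[idx], xs[idx]], axis=1)

def patch_mask(shape, centers, s):
    """Boolean mask that is True inside the union of s x s patches (clipped at borders)."""
    h, w = shape
    m = np.zeros(shape, dtype=bool)
    half = s // 2
    for cy, cx in centers:
        y0, x0 = max(cy - half, 0), max(cx - half, 0)
        y1, x1 = min(cy - half + s, h), min(cx - half + s, w)
        m[y0:y1, x0:x1] = True
    return m

def pls_loss(prob, label, k, s, rng, eps=1e-7):
    """BCE + Dice evaluated only on pixels inside the sampled patches."""
    m = patch_mask(label.shape, sample_positive_centers(label, k, rng), s)
    if not m.any():
        return 0.0
    p = np.clip(prob[m], eps, 1 - eps)
    y = label[m].astype(np.float64)
    bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    dice = 1 - (2 * np.sum(p * y) + eps) / (np.sum(p) + np.sum(y) + eps)
    return float(bce + dice)

# Toy example: one labeled road (row 10) and one unlabeled road (row 50).
label = np.zeros((64, 64), dtype=np.uint8)
label[10, :] = 1
prob = np.full((64, 64), 0.01)
prob[10, :] = 0.99
prob_extra = prob.copy()
prob_extra[50, :] = 0.99  # confident prediction on the unlabeled road
loss_a = pls_loss(prob, label, k=4, s=16, rng=np.random.default_rng(0))
loss_b = pls_loss(prob_extra, label, k=4, s=16, rng=np.random.default_rng(0))
```

In this toy case the two losses coincide, because row 50 lies outside every sampled patch: predicting the unlabeled road is not penalized, which is the behavior the local supervision is meant to achieve.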
We report standard segmentation metrics: Intersection over Union (IoU), F1-score, Precision, and Recall. IoU measures the overlap between predicted and ground truth road pixels, while F1-score balances precision and recall. Precision reflects the accuracy of positive predictions and recall indicates the fraction of true road pixels captured by the model. All metrics are computed on the original validation splits, and we also report results on the manually refined samples to assess performance under corrected labels.
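For reference, the four metrics can be computed from binary masks as in the following sketch. The function name `segmentation_metrics` and the small `eps` stabilizer are illustrative choices; the aggregation across images (per-image averaging versus pooled pixel counts) is not specified in the text and is left to the caller.

```python
import numpy as np

def segmentation_metrics(pred, target, eps=1e-9):
    """Pixel-level Precision, Recall, F1, and IoU for binary road masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = np.logical_and(pred, target).sum()    # road predicted and labeled
    fp = np.logical_and(pred, ~target).sum()   # road predicted, background labeled
    fn = np.logical_and(~pred, target).sum()   # background predicted, road labeled
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)
    return {"precision": float(precision), "recall": float(recall),
            "f1": float(f1), "iou": float(iou)}

# Toy example: 4 road pixels, 3 of them found, plus 1 false positive.
target = np.array([[1, 1, 1, 1, 0, 0]])
pred = np.array([[1, 1, 1, 0, 1, 0]])
m = segmentation_metrics(pred, target)
```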
5.3. Analysis of Parameter Sensitivity
In accordance with the implementation workflow of Algorithm 1, it is essential to determine two core hyperparameters: the size of local patches S and the number of sampled patches K. To quantitatively evaluate the hyperparameter sensitivity of the proposed PLS strategy, we trained models with various combinations of S and K on both the DeepGlobe and CH4P datasets, and assessed their inference performance on the manually refined mini-val validation subsets.
5.3.1. Effect of Patch Size S
Table 1 reports IoU, F1, Precision, and Recall for the tested values of S with K fixed, along with the baseline full-image training. The choice of powers of two for S is not a restriction but merely a convenience, for instance to allow potential future extensions such as hierarchical or multi-scale patch processing.
On the two manually refined mini-val subsets, the influence of patch size S on model performance exhibits a consistent trend. Precision increases monotonically as S grows, while Recall decreases accordingly with increasing S. These ablation results reveal that the patch size S serves as an effective knob to balance the model’s sensitivity to false negatives (i.e., missing roads). As S increases, the model incorporates richer contextual information around each positive anchor, which boosts Precision by suppressing spurious predictions, yet incurs a decline in Recall. This is because larger patches may introduce background noise that overwhelms weak road signals. Conversely, an excessively small S forces the model to compute loss over a higher proportion of positive samples within a much smaller receptive field, making the model more sensitive to positive annotation noise (i.e., mislabeled road pixels). Therefore, by tuning S, PLS can be adapted to diverse data characteristics and application requirements. For instance, we can prioritize Recall for comprehensive road network mapping in rural areas, or emphasize Precision for urban planning scenarios where false positives incur higher costs. This inherent flexibility renders PLS a versatile framework for robust road extraction under diverse real-world conditions.
Finally, the model achieves its best overall performance at the same patch size S on both manually refined mini-val subsets (Table 1). Accordingly, this value of S is fixed for the PLS strategy in all subsequent experiments.
5.3.2. Effect of Number of Patches K
With the patch size S fixed at the value selected in Section 5.3.1, we varied the number of sampled patches K, trained the PLS-based model on both the DeepGlobe and CH4P datasets, and evaluated its inference performance on the manually refined mini-val subsets. The corresponding quantitative metrics are summarized in Table 2. Theoretically, the value of K governs the supervision intensity of the locally sampled patches: a larger K corresponds to a higher probability of applying effective supervision in valid road regions within the image. However, the impact of K on model performance stems mainly from the training convergence process rather than from any change in the model’s feature extraction capability. For this reason, its influence on final segmentation performance is less pronounced than that of the patch size S.
Quantitative results show that the maximum performance fluctuation across the tested range of K is less than 6% in IoU for both datasets. The best-performing value of K differs between the DeepGlobe and CH4P datasets (Table 2), and a single value of K is fixed for all subsequent test experiments. Overall, compared with the patch size S, the number of patches K has a relatively mild impact on model performance, indicating that PLS is not highly sensitive to the exact choice of K within a reasonable range. Notably, an excessively large K leads to performance degradation. This is because repeated local sampling within sparse road regions may cause local overfitting, which in turn impairs the overall segmentation performance.
5.4. Comparative Experiments
To comprehensively demonstrate the superiority of the proposed PLS strategy, we conducted a series of comparative experiments as follows.
First, we adopted D-LinkNet (a representative encoder–decoder architecture) and Segformer (a representative Transformer-based architecture) as the backbone networks, to verify that the PLS strategy can bring robust performance, especially a significant improvement in the extraction capability of underlabeled roads.
Second, with D-LinkNet as the backbone network, we tested the noise-robust loss function GCE, as well as RLS, a variant of the PLS strategy that samples patches randomly from the whole image regardless of the label of the patch center (random sampling). This group of experiments is designed to verify the superiority of PLS over noise-robust loss functions and other local supervision strategies, and further demonstrate the effectiveness of local supervision based on positive sample sampling.
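The only difference between PLS and RLS described above is how patch centers are chosen. The sketch below contrasts the two samplers on a toy label map. The function name `sample_centers` and its `positive_only` flag are hypothetical, and the toy road density is chosen for illustration only.

```python
import numpy as np

def sample_centers(label, k, rng, positive_only=True):
    """Sample k patch centers as (row, col) pairs.

    positive_only=True  -> PLS-style: centers drawn from labeled road pixels.
    positive_only=False -> RLS-style: centers drawn uniformly over the image,
                           regardless of the label at the center.
    """
    h, w = label.shape
    if positive_only:
        ys, xs = np.nonzero(label)
    else:
        ys, xs = np.divmod(np.arange(h * w), w)
    if len(ys) == 0:
        return np.empty((0, 2), dtype=int)
    idx = rng.integers(0, len(ys), size=k)
    return np.stack([ys[idx], xs[idx]], axis=1)

# Toy label map with a single road row (road density 1/64).
rng = np.random.default_rng(0)
label = np.zeros((64, 64), dtype=np.uint8)
label[10, :] = 1
pls_centers = sample_centers(label, 1000, rng, positive_only=True)
rls_centers = sample_centers(label, 1000, rng, positive_only=False)
# Fraction of centers that fall on a labeled road pixel.
pls_hit = label[pls_centers[:, 0], pls_centers[:, 1]].mean()
rls_hit = label[rls_centers[:, 0], rls_centers[:, 1]].mean()
```

With sparse roads, uniformly sampled centers rarely land on a labeled road, so RLS patches are anchored mostly in background, whereas every PLS patch is anchored on a reliable positive annotation.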
Finally, we conducted direct comparisons with two recent state-of-the-art road extraction methods that are robust to annotation noise, namely RCFSNet [27] and UGD-DLinkNet [26], to validate the superior performance of the proposed PLS strategy. Both methods were fully trained using the publicly released source codes from their authors.
5.4.1. Results on DeepGlobe
All models were trained on the DeepGlobe training set, and the quantitative performance of different methods on the DG-mini-test subset is presented in Table 3.
First, with the PLS strategy integrated, our method outperforms the vanilla D-LinkNet backbone by 0.124 in F1 score and 0.123 in IoU, and outperforms the vanilla Segformer backbone by 0.076 in F1 score and 0.082 in IoU. These quantitative results solidly demonstrate the superiority of the proposed PLS strategy. The significant performance gain mainly comes from the sharp rise in recall, which validates our core claim that PLS can effectively improve the model’s ability to extract underlabeled roads.
Compared with the noise-robust GCE loss function, D-LinkNet + PLS outperforms D-LinkNet + GCE by 0.128 in F1 score and 0.150 in IoU. D-LinkNet + GCE even performs slightly worse than the vanilla D-LinkNet backbone, which indicates that the statistical noise smoothing mechanism of GCE is poorly suited to scenarios with widespread road underlabeling and offers no clear advantage over the standard loss function.
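For context, the generalized cross-entropy loss takes the form (1 − p_y^q)/q, where p_y is the probability assigned to the labeled class; it approaches standard cross-entropy as q → 0 and a scaled mean absolute error at q = 1. The sketch below is a per-pixel binary version. The value q = 0.7 is an assumption for illustration; the setting used in the experiments is not stated in this section.

```python
import numpy as np

def bce_loss(prob, label, eps=1e-7):
    """Standard per-pixel binary cross-entropy, averaged over pixels."""
    p = np.clip(prob, eps, 1 - eps)
    p_y = np.where(label == 1, p, 1 - p)  # probability of the labeled class
    return float(np.mean(-np.log(p_y)))

def gce_loss(prob, label, q=0.7, eps=1e-7):
    """Generalized cross-entropy (1 - p_y**q) / q, averaged over pixels."""
    p = np.clip(prob, eps, 1 - eps)
    p_y = np.where(label == 1, p, 1 - p)
    return float(np.mean((1 - p_y ** q) / q))

# A single underlabeled pixel: true road predicted with 0.99, but labeled background.
prob = np.array([0.99])
label = np.array([0])
penalty_bce = bce_loss(prob, label)
penalty_gce = gce_loss(prob, label)   # bounded above by 1 / q
```

The per-pixel penalty is bounded by 1/q, so GCE softens the gradient from each mislabeled pixel, but it still pushes predictions on unlabeled roads toward background rather than excluding those pixels from the loss.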
Compared with different local supervision strategies, D-LinkNet + PLS outperforms D-LinkNet + RLS by 0.093 in F1 score and 0.113 in IoU. This result strongly validates the effectiveness of positive-sample-based local sampling, and further justifies our core assumption that “positive annotations are reliable”. Meanwhile, it is worth noting that D-LinkNet + RLS achieves slightly better performance than the vanilla D-LinkNet backbone. This is because random sampling passively discards part of the underlabeling areas during the training process, thus reducing the adverse impact of erroneous gradients from underlabeled samples to a certain extent.
Compared with state-of-the-art (SOTA) noise-robust road extraction methods, both our D-LinkNet + PLS and Segformer + PLS significantly outperform RCFSNet and UGD-DLinkNet, with a minimum lead of 0.063 in F1 score and 0.065 in IoU. Two key conclusions can be drawn from this comparison: First, RCFSNet exhibits notable robustness, which is reflected in its processing of semantic consistency of similar appearance features. This enables it to mine and extract potential road features through global context, making it outperform all other compared methods except the proposed PLS. Second, although UGD-DLinkNet is designed to address the same underlabeling problem as our work, its complex design and restrictive assumptions lead to degraded performance. It tends to down-weight the features with high uncertainty, which results in fewer underlabeled road regions participating in gradient update during training.
Representative visualization examples are presented in Figure 4, which intuitively corroborate the quantitative conclusions summarized in Table 3.
Figure 4a,b illustrate the extraction performance of rural unpaved trails in diverse environments. Within the orange bounding boxes, the proposed PLS method correctly extracts most of these rural roads, followed by RCFSNet which retrieves a portion of the target roads, while all other compared methods fail. This performance gap arises because such low-grade roads are partially underlabeled as background in the training set.
Figure 4c presents a challenging scenario where roads are occluded by dense tree cover. Here too, only the proposed PLS method achieves complete road extraction; RCFSNet achieves partial extraction, and all other methods fail to retrieve the occluded road segments. Finally, Figure 4d depicts another common scenario: the extraction of arterial and secondary roads in dense road network regions, where some secondary roads are also prone to underlabeling. In this scenario, all methods successfully extract most of the road segments. Our method retrieves some underlabeled secondary roads, but its advantage is not prominent. From this qualitative analysis, we conclude that the core advantage of the proposed PLS strategy lies in extracting low-grade rural roads that are far from arterial roads and frequently affected by underlabeling. In contrast, the advantage diminishes in dense urban road regions. This is because the sampled patches cover a wider area in dense road regions, so even underlabeled road segments are highly likely to fall inside the patches used for loss calculation under the PLS framework and thus cannot be excluded from gradient optimization.
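The coverage effect described above can be made concrete with a small geometric sketch. It measures the fraction of an image that falls inside the union of S × S patches for a sparse and a dense road layout. The image size, patch size, and anchor positions below are toy values chosen for illustration, and `coverage_fraction` is a hypothetical helper.

```python
import numpy as np

def coverage_fraction(shape, centers, s):
    """Fraction of image pixels inside the union of s x s patches (clipped at borders)."""
    h, w = shape
    m = np.zeros(shape, dtype=bool)
    half = s // 2
    for cy, cx in centers:
        m[max(cy - half, 0):min(cy - half + s, h),
          max(cx - half, 0):min(cx - half + s, w)] = True
    return float(m.mean())

shape, s = (256, 256), 64

# Sparse scene: 16 anchors along a single horizontal road at row 128.
sparse_centers = [(128, x) for x in range(8, 256, 16)]
# Dense scene: 16 anchors spread over a grid of roads.
grid = [32, 96, 160, 224]
dense_centers = [(y, x) for y in grid for x in grid]

sparse_cov = coverage_fraction(shape, sparse_centers, s)
dense_cov = coverage_fraction(shape, dense_centers, s)
```

With the same number of anchors, the sparse layout supervises a quarter of the image while the dense layout supervises all of it, so in the dense case the loss is effectively computed over the full image and underlabeled pixels are no longer shielded.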
5.4.2. Results on CH4P
The CH4P dataset exhibits even more severe label noise, reflecting real-world map data that have not been manually verified. On this dataset, we test the robustness of the road extraction methods for large-scale road extraction applications in real scenarios.
The quantitative results of different methods on the CH4P-mini-test subset are summarized in Table 4.
First, compared with the vanilla backbone networks, D-LinkNet + PLS outperforms the original D-LinkNet by 0.104 in F1 score and 0.104 in IoU; Segformer + PLS outperforms the original Segformer by 0.083 in F1 score and 0.086 in IoU. Consistent with the observations on the DeepGlobe dataset, the performance gain is mainly attributed to a sharp rise in recall, accompanied by a moderate decline in precision, which reflects the inherent trade-off between precision and recall under noisy annotation scenarios. Overall, the PLS-based methods achieve a substantial lead over their corresponding backbone networks in comprehensive evaluation metrics.
Second, compared with the noise-robust loss function, D-LinkNet + PLS outperforms D-LinkNet + GCE by 0.299 in F1 score and 0.258 in IoU. The conclusions here are consistent with the analysis on the DeepGlobe dataset, and thus will not be elaborated further.
Subsequently, compared with the random local supervision strategy, D-LinkNet + PLS significantly outperforms D-LinkNet + RLS in comprehensive metrics, further validating the effectiveness of the positive-sample-based sampling strategy.
Finally, compared with the two SOTA methods, both of our PLS-based variants outperform SOTA robust road extraction methods by a minimum margin of 0.071 and 0.082 in F1 score, demonstrating the superiority of the proposed method. Notably, RCFSNet achieves a lower quantitative score, which is inconsistent with the visualization results presented in subsequent sections. This is because its segmentation outputs contain tiny holes and artifacts in regions with high uncertainty, which reduces its pixel-level evaluation scores, while the structural accuracy of the extracted road network is actually favorable.
Figure 5 presents the road extraction results of representative samples from the four provinces covered by the CH4P dataset. The results demonstrate that our method can effectively extract low- and medium-grade roads, including the rural roads in Figure 5a, the mountain trails in Figure 5b, the dense roads in villages and towns in Figure 5c, and the secondary roads in Figure 5d. This confirms that the proposed PLS strategy also outperforms the compared methods on the CH4P dataset with inherent positive annotation noise, and experimentally validates the correctness of the theoretical analysis in Section 4.3.
Meanwhile, it should be noted that the performance difference between different methods is no longer significant in dense urban road scenarios, and the prominent superiority of PLS is difficult to observe from visualization examples. This phenomenon has been analyzed in Section 5.4.1. We further analyze the performance variation of each method under different road densities in the following subsection to verify this conclusion, and clarify the applicable boundary of the superiority of PLS.
5.4.3. Analysis with Road Density
We statistically analyzed the performance of different methods on samples with varying road densities, with the results shown in Figure 6. In the left plot for the DeepGlobe dataset, the D-LinkNet + PLS variant achieves excellent performance across all road density intervals. It only underperforms the vanilla D-LinkNet on high-density samples with road density ranging from 0.15 to 0.30, while the two PLS-based variants rank in the top two for low- and medium-density samples with road density below 0.10. This is because these low- and medium-density regions are exactly where rural low-grade roads occur most frequently, and these roads contribute the main performance gain for the PLS-based methods. In the right plot for the CH4P dataset, we draw consistent conclusions: the PLS strategy exhibits prominent superiority in low- and medium-density road scenarios with a positive sample proportion below 10%. Notably, the two datasets show opposite trends in performance variation with road density. For the DeepGlobe dataset, the performance of all methods improves as road density increases, while for the CH4P dataset, performance declines with rising road density. This discrepancy arises because most samples in the CH4P dataset lack explicit road width information, which makes the model more prone to width estimation errors in dense road scenarios such as urban arterial roads, resulting in lower quantitative scores. In summary, the core superiority of PLS lies in the extraction of rural low-grade roads in low- and medium-road-density scenarios, while its advantage over baseline methods diminishes in high-road-density scenarios. This is because positive-sample-guided local sampling may cover most regions of the image in high-density scenarios, and its ability to shield the model from adverse gradients of underlabeled samples degrades accordingly.
On the other hand, as another core contribution of this work, the CH4P dataset is of great value for the training of rural low-grade road extraction tasks, rather than for model training in high-density road scenarios (mainly urban road types).
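A per-density breakdown of this kind can be produced by binning test images on their road density and averaging a per-image score within each bin. The sketch below uses `numpy.digitize`. The bin edges and the scores are toy values for illustration, not the intervals or results behind Figure 6, and `mean_score_by_density` is a hypothetical helper.

```python
import numpy as np

def mean_score_by_density(densities, scores, edges):
    """Average a per-image score within road-density intervals.

    `edges` are increasing bin boundaries. Densities below the first edge
    fall into the first bin and densities at or above the last edge fall
    into the last bin, because only the interior edges are used for binning.
    """
    densities = np.asarray(densities, dtype=float)
    scores = np.asarray(scores, dtype=float)
    bin_idx = np.digitize(densities, edges[1:-1])
    result = {}
    for i in range(len(edges) - 1):
        sel = bin_idx == i
        key = (edges[i], edges[i + 1])
        result[key] = float(scores[sel].mean()) if sel.any() else float("nan")
    return result

# Toy inputs: road density = fraction of road pixels in each refined mask.
edges = [0.0, 0.05, 0.10, 0.15, 0.30]        # assumed intervals
densities = [0.02, 0.04, 0.07, 0.12, 0.20]
scores = [0.5, 0.7, 0.6, 0.8, 0.9]           # e.g. per-image IoU (made-up values)
by_density = mean_score_by_density(densities, scores, edges)
```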
5.5. Analysis of Cross-Dataset Generalization Ability
To further verify the generalization ability of the proposed PLS method, we perform zero-shot cross-dataset evaluation. All models are trained on the source dataset and directly evaluated on the unseen target dataset without additional fine-tuning. Quantitative results are reported in Table 5 and Table 6, while qualitative visualizations are provided in Figure 7 and Figure 8.
The quantitative results in Table 5 consistently demonstrate the strong generalization capability of the proposed PLS strategy on unseen data, which significantly outperforms its corresponding vanilla backbones and other compared methods. We attribute this robustness to the suppression of overfitting to negative sample features achieved by the positive-sample-guided local sampling of PLS. This confirms that the PLS method can learn robust road extraction capability from datasets with higher annotation noise, enabling it to substantially outperform the compared methods on unseen datasets. Meanwhile, the overall performance of the compared methods warrants further in-depth analysis: models trained on the CH4P dataset yield overall lower quantitative metrics on the DG-mini-test subset. The visualization results in Figure 7a,b reveal that this performance degradation stems from the failure of these models to effectively extract the abundant low-grade rural roads present in the DeepGlobe dataset. In addition, the insufficient annotation of road width in the CH4P dataset also constrains model performance, as shown in Figure 7c,d, where all compared methods fail to achieve accurate road width prediction.
Compared with the cross-dataset evaluation on the DG-mini-test subset, the cross-dataset validation on the CH4P-mini-test subset presents a notably different phenomenon. Table 6 shows that for most methods, the road extraction performance achieved via cross-dataset transfer (i.e., models trained on the DeepGlobe dataset and tested on the CH4P-mini-test subset) is even superior to that of models with in-domain training and testing (i.e., models trained and tested on the CH4P dataset). Comparing Table 6 with Table 4, the vast majority of methods achieve better or comparable results in the cross-dataset setting. The representative examples presented in Figure 8 also explicitly support this conclusion: models trained on the DeepGlobe dataset mostly exhibit superior visual road extraction performance on the CH4P-mini-test subset.
This phenomenon can be attributed to two key factors. First, the road annotations in the DeepGlobe dataset include explicit width information, which is consistent with the annotation format of the CH4P-mini-test subset. Thus, training with DeepGlobe data enables the model to learn more accurate road width estimation, leading to improved pixel-level quantitative scores. Second, the DeepGlobe dataset has higher annotation quality with less positive annotation noise, which has been verified in Section 4.2, further contributing to the performance improvement.
We then conduct further analysis on the comparison results in Table 6. First, both PLS-based variants achieve consistent performance gains over their corresponding vanilla backbones. Specifically, D-LinkNet + PLS outperforms the vanilla D-LinkNet backbone by 0.009 in F1 score and 0.007 in IoU, while Segformer + PLS outperforms the vanilla Segformer backbone by 0.015 in F1 score and 0.016 in IoU. Meanwhile, the D-LinkNet backbone outperforms Segformer + PLS. This result reveals that cross-dataset transfer can improve generalization performance, because the sample noise distributions differ between the two datasets: the biased samples present in the DeepGlobe dataset rarely appear in the CH4P dataset.
Second, D-LinkNet + PLS still significantly outperforms D-LinkNet + GCE and D-LinkNet + RLS, and these two methods still fail to achieve competitive generalization performance on unseen data.
Finally, compared with the two SOTA schemes, both PLS-based variants outperform the two SOTA compared methods, which further demonstrates the superior generalization capability of our proposed PLS on unseen datasets.
In addition, the visualization examples shown in Figure 8 further confirm that the proposed PLS method maintains stronger low-grade road extraction capability and better overall performance in cross-dataset scenarios.
Overall, the most critical conclusion is drawn from the zero-shot cross-dataset generalization experiment with models trained on the CH4P dataset and tested on the DeepGlobe dataset: the proposed PLS strategy possesses stronger noise robustness, and is capable of learning more robust road representation capability from real-world noisy data compared with all other compared methods.
5.6. Limitations
While the proposed PLS strategy achieves superior performance in extracting low-grade rural roads, several limitations and application caveats should be noted as follows.
Despite the improved extraction capability for low-grade rural roads, the proposed method still has failure cases. As shown in Figure 9, none of the methods, including the proposed PLS strategy, the vanilla backbones, and the other compared methods, successfully extracts the road segments within the orange bounding boxes. This failure is attributed to two main factors. First, these target road segments are inherently challenging: the ambiguity of their semantic features makes effective identification difficult. Second, the fixed patch size S and the fixed shape of the local supervision patches adopted in this work are overly rigid when dealing with more complex sample distributions. As a result, the positive-guided sampling inevitably includes some underlabeled road segments, meaning that the adverse gradients caused by underlabeling are only mitigated rather than completely eliminated. For future work, we will explore the adaptive adjustment of S, K, and the shape of the sampled patches according to road density or the uncertainty of feature extraction, as well as more intelligent positive sample anchor selection strategies, to further mitigate the adverse impact of underlabeled samples.
In addition, it should be noted that the core superiority of the PLS strategy is concentrated in the extraction of low-grade rural roads. Therefore, this method is not the optimal choice for application scenarios that focus on extracting drivable arterial roads, such as high-precision map construction for autonomous driving.