1. Introduction
In remote sensing image analysis, road extraction is a crucial task, with road information widely applied in fields such as geographic information system construction [1], urban planning [2], and disaster assessment [3]. However, remote sensing images are often affected by complex surface features, shadows, and diverse climatic conditions, resulting in blurred road boundaries or confusion with surrounding environments. These challenges make it extremely difficult to accurately extract roads from high-resolution remote sensing images [4,5,6].
Traditional road extraction methods include threshold segmentation [7], edge detection [8], and machine learning [9]. Threshold segmentation methods exploit image grayscale characteristics, setting scene-specific thresholds to separate road pixels from the background and extract road information. Bajcsy et al. [10] used high and low grayscale thresholds together with a road width threshold to segment road areas in images; the road width threshold was derived from the image resolution and actual road width, while the grayscale thresholds were determined by histogram analysis. Edge detection methods use operators such as Canny, Sobel, and Roberts to identify road edges. For instance, Ma et al. [11] employed a Retinex-based algorithm to enhance high-resolution but low-contrast images, followed by segmentation with an improved Canny edge detector. Although edge detection performs well in certain scenarios, its effectiveness typically declines in complex environments. Road extraction methods based on machine learning usually rely on manually extracted features, with support vector machines (SVM) being commonly used. Yang et al. [12] achieved high completeness and accuracy in road extraction from high-resolution remote sensing images of industrial park scenes by combining a 3D wavelet transform with an optimized support vector machine. Soni et al. [13] proposed a multistage framework based on LS-SVM, mathematical morphology, and road shape features to accurately extract road networks from remote sensing images, removing non-road elements through morphological processing; their experiments showed this framework outperformed comparable methods. However, due to the spectral similarity of objects such as roads, buildings, and parking lots, along with limitations in model generalization, these traditional methods are inefficient in complex road scenes and often fail to yield satisfactory results.
With the rapid development of deep learning in fields such as image classification [14,15,16], image segmentation [17], and object detection [18], deep learning has also offered novel approaches to road extraction. Sofal et al. [19] proposed a method combining UNet [20] with a Squeeze-and-Excitation (SE) module for road recognition, where the SE module reweights UNet feature maps to emphasize useful channels. Wang et al. [21] proposed a Deeplabv3+ [22] based road extraction method, using ResNeSt [23] as the backbone and incorporating the Atrous Spatial Pyramid Pooling (ASPP) module for multi-scale feature extraction, thereby enhancing extraction accuracy. To retain more spatial information on roads, Zhou et al. [24] proposed D-LinkNet based on LinkNet [25], adding dilated convolution layers to expand the receptive field and facilitate multi-scale feature fusion, though it shows limited performance when roads are occluded. Consequently, Wu et al. [26] introduced a coordinate attention module to enhance feature representation in the central part of D-LinkNet and replaced linear feature fusion with an attention feature fusion module to improve detail extraction. Kampffmeyer et al. [27] introduced a directional awareness module to predict pixel connectivity with neighboring pixels, while Maji et al. [28] proposed a deep learning generator with a guided decoder, enhancing the predictive capability of the decoding layers through a weighted guided loss to improve output precision and road connectivity.
Although deep learning methods have achieved notable success in road extraction, many still fall short in capturing deeper-level detail and are limited by occlusions from trees or buildings. Even with modules specifically designed to address these issues [29,30,31,32,33,34], serious omissions and misclassifications persist, which distort road morphology and compromise the suitability of the extraction results for spatial decision-making and analysis. To optimize deep learning-based extraction results, several studies have combined deep learning with traditional post-processing methods. For instance, Wang et al. [35] improved road connectivity by using augmented and expanded sample data as input for UNet to train the model for optimal road extraction, and then applied polynomial curve fitting to correct road discontinuities in the extraction results, compensating for the network’s limitations in road completeness. Gao et al. [36] proposed a refined deep residual convolutional neural network for road extraction from optical satellite images, utilizing residual connections and dilated perception units for symmetric output, and optimizing post-processing with mathematical morphology and tensor voting algorithms; their experiments show that this method outperforms other network architectures in complex scenes.
Although traditional post-processing combined with deep learning can improve road connectivity to some extent, such approaches still struggle to capture complex road topologies. For severely fragmented, misclassified, or terrain-complicated roads, the effectiveness of post-processing corrections is limited, as these methods heavily rely on hand-crafted rules and are thus vulnerable to noise. Furthermore, despite the notable success of deep learning methods in recent years, they continue to face significant limitations. For example, under occlusions caused by trees or buildings, models often produce broken or missing segments; in complex backgrounds, false positives and misclassifications frequently occur. These challenges reveal that existing approaches remain inadequate in achieving global consistency and generalization. Therefore, there is a pressing need for a new framework that can effectively fuse multi-modal information, thereby enhancing road connectivity, robustness, and adaptability to diverse scenarios. To address these challenges, we propose DRMNet, a deep learning model specifically designed to optimize road extraction results. The model’s encoder adopts a dual-branch structure, with RGB images as input for one branch and a coarse road prediction MASK for the other. Through this dual-branch setup, it achieves fine-grained road extraction by fusing both modalities. To achieve better modality fusion, we introduce an attention-based MAF module, which fuses information across multiple scales, further enhancing road contextual information and feature consistency. To further improve road connectivity, we use multi-loss function weighting to balance BCE loss [37] and Dice loss [38], thus addressing the issue of sparse road target pixels and achieving optimal segmentation performance.
In summary, our main contributions are as follows:
We propose a novel dual-branch encoder that jointly processes RGB images and preliminary road masks (MASK), effectively fusing spatial and semantic information to enhance road extraction capability in complex scenarios.
We design the MAF module, which dynamically refines feature fusion through channel and spatial attention mechanisms, significantly recovering fragmented roads while suppressing noise and improving global consistency.
We propose DRMNet, an end-to-end model that directly refines initial segmentation results without relying on traditional post-processing, achieving substantial improvements in both road connectivity and accuracy on public benchmarks. DRMNet demonstrates strong robustness to occlusions and complex backgrounds, effectively restoring broken road segments and enhancing the structural integrity of road networks. Because the practical value of road extraction lies not only in pixel-level accuracy but also in maintaining the topological connectivity of road networks, recovering fragmented or disconnected roads is crucial for applications such as path planning, urban development, emergency response, and automated mapping. Compared with traditional post-processing methods that depend heavily on hand-crafted rules, the proposed framework exhibits greater generalization and adaptability, making it well-suited for practical applications such as geographic information system (GIS) construction and disaster emergency response.
2. Materials and Methods
The DRMNet road extraction model was developed based on the LinkNet framework. First, we propose a dual-branch encoder structure, with each encoder comprising a 7 × 7 convolution layer with a stride of 2 and five ResNet [39] residual modules. The dual-branch encoder takes a 512 × 512 RGB image and a coarse MASK image as inputs; this coarse MASK can be generated either by a single-branch DRMNet or by other road segmentation models. The pretrained network uses ResNet-34, and each encoder block in both branches is followed by an MAF module for refined feature extraction; the fused features are then passed to the next block in the RGB branch. The MAF module enhances the model’s information extraction ability from both channel and spatial perspectives, improving multi-level feature extraction and integration. In the decoding stage, a residual structure with 1 × 1 convolution kernels is used to reduce computational complexity, while full convolution restores the image to its original size. The DRMNet network structure is shown in Figure 1.
2.1. Encoder
In the proposed DRMNet, RGB and MASK features are extracted simultaneously through dual-branch encoders. Each branch consists of five ResNet convolutional blocks, with an MAF module added after each block. Since standard ResNet models are designed for three-channel RGB images, they are not suitable for single-channel MASK images. To address this issue, we modified the first convolutional layer to accept a single-channel input, ensuring compatibility with MASK images.
By using a coarse MASK as input, combined with the RGB image to generate a refined MASK, the main purpose is to provide preliminary positional information, enabling the model to focus on fine feature extraction in key areas. Compared to a single-branch model, this dual-branch structure enables refinement based on the coarse localization, reducing the search space and improving computational efficiency and edge precision. Additionally, this design provides extra contextual information when image quality is poor or background complexity is high, enhancing model robustness.
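As an illustration of the single-channel modification described above, the following minimal PyTorch sketch adapts a ResNet-34 stem to accept a one-channel coarse MASK. The helper name, the use of torchvision, and the handling of the new layer's weights are assumptions for illustration only, not the authors' released code.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_mask_encoder():
    # Start from an ImageNet-pretrained ResNet-34, matching the backbone used by DRMNet.
    backbone = models.resnet34(weights="IMAGENET1K_V1")
    # Replace the 3-channel 7x7 stem with a single-channel version so that the
    # branch accepts the coarse MASK image; the new layer is randomly initialised.
    backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    # Keep the stem and the four residual stages; the classification head is dropped.
    return nn.Sequential(
        backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
        backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4,
    )

mask_branch = build_mask_encoder()
coarse_mask = torch.randn(1, 1, 512, 512)     # a 512 x 512 single-channel coarse mask
features = mask_branch(coarse_mask)           # -> torch.Size([1, 512, 16, 16])
```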
Effectively extracting features from RGB images and coarse MASK images is the focus of this study. Inspired by [40], we designed an MAF module that leverages attention components to learn discriminative features from the fused data, thereby improving prediction accuracy. As shown in Figure 1, the MAF module is applied after each convolutional layer in both encoder streams to enhance feature compatibility. The MAF module consists of two sequential components: channel attention and spatial attention.
Let the input feature map be denoted as $F \in \mathbb{R}^{C \times H \times W}$. The channel attention mechanism aims to generate a channel-wise weighting vector to recalibrate the importance of each feature channel. First, global max pooling is applied to compress the spatial information:

$$z = \mathrm{MaxPool}(F), \quad z \in \mathbb{R}^{C} \quad (1)$$

A non-linear transformation $g(\cdot)$, typically implemented as a fully connected layer followed by a sigmoid activation, is used to compute the attention weights:

$$M_{c} = g(z) = \sigma(Wz + b) \quad (2)$$

where $W$ and $b$ are learnable parameters, and $\sigma$ denotes the sigmoid function. The input features are then reweighted accordingly:

$$F' = M_{c} \otimes F \quad (3)$$

Next, the spatial attention component focuses on enhancing region-level responses. It begins by applying average and max pooling across the channel dimension to generate two spatial descriptors:

$$F^{s}_{\mathrm{avg}} = \mathrm{AvgPool}_{c}(F'), \qquad F^{s}_{\mathrm{max}} = \mathrm{MaxPool}_{c}(F') \quad (4)$$

These descriptors are concatenated and fed into a convolutional layer with a $k \times k$ kernel, followed by a sigmoid activation to obtain a spatial attention map:

$$M_{s} = \sigma\left(f^{k \times k}\left([F^{s}_{\mathrm{avg}}; F^{s}_{\mathrm{max}}]\right)\right) \quad (5)$$

This spatial attention map is broadcast across all channels and applied to the intermediate feature map:

$$F'' = M_{s} \otimes F' \quad (6)$$
In summary, the MAF module utilizes both channel-wise and spatial-wise recalibration to progressively enhance feature representations. The channel attention focuses on foreground semantic cues extracted by the convolutional layers, while the spatial attention strengthens contextual awareness across global regions, improving continuity in occluded road sections.
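For reference, a minimal PyTorch sketch of the channel-then-spatial attention sequence described above is given below. How the two branch features are merged before attention (element-wise addition here) and the spatial kernel size (7 × 7) are assumptions, not details stated in the paper.

```python
import torch
import torch.nn as nn

class MAF(nn.Module):
    """Sketch of the MAF module: channel attention (global max pooling, fully
    connected layer, sigmoid) followed by spatial attention (channel-wise avg/max
    pooling, k x k convolution, sigmoid). The element-wise addition used to merge
    the RGB and MASK branch features and the 7 x 7 spatial kernel are assumptions."""

    def __init__(self, channels, kernel_size=7):
        super().__init__()
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, channels),   # learnable W, b of the channel attention
            nn.Sigmoid(),
        )
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, f_rgb, f_mask):
        f = f_rgb + f_mask                            # assumed cross-modal fusion
        b, c, _, _ = f.shape
        z = torch.amax(f, dim=(2, 3))                 # global max pooling over space
        m_c = self.channel_fc(z).view(b, c, 1, 1)     # channel attention weights
        f = f * m_c                                   # channel re-weighting
        avg_map = f.mean(dim=1, keepdim=True)         # channel-wise average descriptor
        max_map = f.amax(dim=1, keepdim=True)         # channel-wise max descriptor
        m_s = self.spatial_conv(torch.cat([avg_map, max_map], dim=1))
        return f * m_s                                # spatial re-weighting

# Example: fuse stage features (256 channels) of the two encoder branches.
maf = MAF(256)
fused = maf(torch.randn(1, 256, 64, 64), torch.randn(1, 256, 64, 64))
```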
2.2. Decoder
After computing the multi-level features from the two encoder streams, we obtained the final feature maps for the RGB and coarse MASK inputs. The decoder is designed primarily to make efficient use of multi-layer information to refine pixel details. Our decoder architecture is an improvement upon the LinkNet decoder, allowing feature maps to be restored to the original image size.
Multiple downsampling operations in the encoder result in the loss of some spatial information, and relying solely on the encoder’s downsampling output makes it difficult to recover this lost information. To address this, we bypass the input of each encoder layer to the corresponding output of the decoder. This design aims to recover the lost spatial information, which can then be utilized by the decoder and its upsampling operations. Additionally, because the decoder shares the knowledge learned at each encoder layer, it can achieve this with fewer parameters.
Each decoder block consists of two 1 × 1 convolutions and one 3 × 3 full convolution, as shown in Figure 2. Each convolution layer takes three parameters as input, with Table 1 listing the values of m and n for each decoder layer, allowing the corresponding convolutional layer parameters to be derived.
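For clarity, the sketch below shows a LinkNet-style decoder block matching this description (1 × 1, 3 × 3 transposed, 1 × 1 convolutions), together with a usage example of the skip-connection addition discussed above. The m/4 bottleneck width and the BatchNorm/ReLU placement follow the original LinkNet design and are assumptions with respect to DRMNet.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """LinkNet-style decoder block: a 1x1 convolution shrinks the channel width,
    a 3x3 transposed ("full") convolution upsamples by a factor of 2, and a second
    1x1 convolution projects to n output channels."""

    def __init__(self, m, n):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(m, m // 4, kernel_size=1),
            nn.BatchNorm2d(m // 4), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(m // 4, m // 4, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.BatchNorm2d(m // 4), nn.ReLU(inplace=True),
            nn.Conv2d(m // 4, n, kernel_size=1),
            nn.BatchNorm2d(n), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

# Example: upsample the deepest encoder output and add the bypassed encoder feature
# from the previous stage, mirroring the skip connections described above.
e4, e3 = torch.randn(1, 512, 16, 16), torch.randn(1, 256, 32, 32)
d4 = DecoderBlock(512, 256)(e4) + e3   # -> torch.Size([1, 256, 32, 32])
```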
To provide a clearer overview of the entire DRMNet pipeline, we summarize the complete processing steps in Algorithm 1. The pseudocode illustrates the overall workflow, including feature extraction, multi-modal attention fusion, decoding, and optimization.
Algorithm 1. DRMNet Road Extraction Framework
1  Inputs: RGB image I_rgb, coarse mask M_coarse
2  Outputs: Refined road mask M_refined
3  Feature Extraction:
4    F_rgb ← Encoder_RGB(I_rgb)
5    F_mask ← Encoder_MASK(M_coarse)
6  Feature Fusion:
7    For each stage l do
8      F_rgb[l] ← MAF(F_rgb[l], F_mask[l])
9    End For
10 Decoding:
11   F_dec ← Decoder(F_rgb, F_mask)
12   M_pred ← Conv(F_dec)
13 Loss Calculation:
14   L_bce ← BCE(M_pred, GroundTruth)
15   L_dice ← Dice(M_pred, GroundTruth)
16   L_total ← α × L_bce + β × L_dice
17 Optimization:
18   Update network parameters using the Adam optimizer
19 Return: M_refined = Sigmoid(M_pred)
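To make the mapping from Algorithm 1 to a concrete training iteration explicit, the following hedged PyTorch sketch mirrors its steps. The `model`, `criterion`, and `optimizer` objects are assumed interfaces (e.g., `criterion` is the weighted BCE + Dice loss of Section 2.3 and `optimizer` a `torch.optim.Adam` instance), not the authors' released code.

```python
import torch

def train_step(model, criterion, optimizer, rgb, coarse_mask, gt):
    """One training iteration following Algorithm 1. `model` is assumed to take the
    RGB image and coarse mask and return per-pixel logits."""
    model.train()
    optimizer.zero_grad()
    logits = model(rgb, coarse_mask)       # steps 3-12: encode, MAF fusion, decode
    loss = criterion(logits, gt)           # steps 13-16: alpha*L_bce + beta*L_dice
    loss.backward()                        # steps 17-18: Adam parameter update
    optimizer.step()
    refined_mask = torch.sigmoid(logits)   # step 19: refined road mask
    return loss.item(), refined_mask
```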
2.3. Loss Function
The algorithm in this paper employs both the BCE loss function and the Dice loss function. The BCE loss function measures the difference between the prediction and the ground truth using cross-entropy, as shown in Equation (7). As the “penalty” on the current model increases, the logarithmic term causes the loss value to grow sharply, which encourages the model to adjust the predicted output $\hat{y}_{i}$ toward the true label:

$$L_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_{i}\log \hat{y}_{i} + (1 - y_{i})\log(1 - \hat{y}_{i})\right] \quad (7)$$

where $y_{i}$ is the real category of pixel $i$, and $\hat{y}_{i}$ is the prediction for the corresponding pixel.
The Dice loss function is derived from the Dice similarity coefficient, which ranges from 0 to 1, with a higher value indicating a greater overlap between the predicted and true road pixel sets. The loss is defined as shown in Equation (8):

$$L_{Dice} = 1 - \frac{2\sum_{i=1}^{N} y_{i}\hat{y}_{i}}{\sum_{i=1}^{N} y_{i} + \sum_{i=1}^{N} \hat{y}_{i}} \quad (8)$$

where $y_{i}$ is the real category of pixel $i$, and $\hat{y}_{i}$ is the prediction for the corresponding pixel.
The total loss function is defined in Equation (9):

$$L_{total} = \alpha L_{BCE} + \beta L_{Dice} \quad (9)$$

Through a series of comparative experiments, we found that setting the weights to α = 1 and β = 4 yields the best performance. Therefore, unless otherwise specified, subsequent experiments in this paper adopt α = 1 and β = 4.
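A minimal PyTorch sketch of this weighted combination is shown below. The logits-based BCE and the Dice smoothing term are implementation assumptions not specified in the text; the weights default to the reported α = 1 and β = 4.

```python
import torch
import torch.nn as nn

class BCEDiceLoss(nn.Module):
    """Weighted combination L_total = alpha*L_BCE + beta*L_Dice (Equation (9))."""

    def __init__(self, alpha=1.0, beta=4.0, smooth=1.0):
        super().__init__()
        self.alpha, self.beta, self.smooth = alpha, beta, smooth
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, logits, target):
        bce_loss = self.bce(logits, target)
        prob = torch.sigmoid(logits)
        intersection = (prob * target).sum()
        dice = (2.0 * intersection + self.smooth) / (prob.sum() + target.sum() + self.smooth)
        return self.alpha * bce_loss + self.beta * (1.0 - dice)

# Example usage on a dummy prediction/label pair.
logits = torch.randn(2, 1, 512, 512)
labels = torch.randint(0, 2, (2, 1, 512, 512)).float()
loss = BCEDiceLoss()(logits, labels)
```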
4. Discussion
To validate the effectiveness of the dual-branch structure and the MAF module, we designed comparative experiments to assess the impact of single-branch structure, dual-branch structure, and the presence of the MAF module on the road mIoU metric.
Table 4 presents the quantitative comparison results of different model variants on the DeepGlobe dataset. On the test set, the LinkNet model achieves an mIoU of 0.797. The RGB variant performs similarly, with an mIoU of 0.799, as DRMNet is based on LinkNet. After incorporating the MAF structure into the RGB model, the mIoU improves to 0.809. Similarly, in the MASK model, adding the MAF structure leads to a significant improvement, with the mIoU rising from 0.806 to 0.820. The RGB-MASK model with a dual-branch encoder structure outperforms the single-branch RGB model by 1.6%, and the mIoU of the dual-branch RGB-MASK model increases by a further 2.2% after incorporating the MAF module.
The variant models are listed in the “variants” column of Table 4 and Table 6. “RGB” indicates the encoder uses only the RGB branch; “RGB-MAF” means the MAF module is added to the RGB branch; “MASK” indicates the encoder uses only the MASK branch; “MASK-MAF” means the MAF module is added to the MASK branch; “RGB-MASK” refers to the dual-branch structure of RGB and MASK without the MAF module; “RGB-MAF-MASK” uses the dual-branch structure of RGB and MASK with the MAF module added to the RGB branch; “RGB-MASK-MAF” uses the dual-branch structure of RGB and MASK with the MAF module added to the MASK branch; the final variant, corresponding to the full structure proposed in this paper (DRMNet), uses the dual-branch structure of RGB and MASK with MAF modules added to both branches.
Table 5 further reports the robustness evaluation of DRMNet on the DeepGlobe dataset under different perturbation scenarios, including Gaussian noise with varying probabilities and random occlusion with different block sizes. As can be observed, the mIoU decreases slightly as the noise probability or occlusion size increases, but the overall performance degradation remains limited. For example, under 10% noise, the mIoU only drops from 0.835 to 0.778, and even with a large occlusion of 256 × 256, the mIoU is still maintained at 0.803. These results demonstrate that DRMNet exhibits strong robustness against noise interference and partial occlusion, ensuring reliable performance in real-world applications.
Table 6 shows the quantitative comparison results of different model variants on the Massachusetts dataset. On the test set, the LinkNet model achieves an mIoU of 0.793. The RGB model yields similar performance, with an mIoU of 0.795. After adding the MAF structure to the RGB model, the mIoU increases to 0.823. In the MASK model, incorporating the MAF structure similarly leads to a substantial improvement, with the mIoU rising from 0.819 to 0.858. The RGB-MASK model with a dual-branch encoder structure improves performance by 5% compared to the single-branch RGB model, and the mIoU of the dual-branch RGB-MASK model improves by a further 4.4% after incorporating the MAF module.
Table 7 further presents the robustness evaluation of DRMNet on the Massachusetts dataset under different perturbation conditions. Specifically, Gaussian noise with different probabilities and random occlusion with varying block sizes were applied to the test images. As shown in the results, the mIoU only decreases slightly as the noise level or occlusion size increases. For instance, under 10% Gaussian noise, the mIoU drops from 0.904 to 0.881, while with a large occlusion of 128 × 128, the mIoU remains as high as 0.896. Even when noise and occlusion are combined, the performance degradation is limited, with the mIoU still reaching 0.883. These findings indicate that DRMNet exhibits strong robustness against both noise interference and partial occlusion on the Massachusetts dataset, maintaining reliable segmentation accuracy under challenging conditions.
Table 8 summarizes the comparison of model complexity and performance with different baselines under an input size of 256 × 256. Although DRMNet requires relatively higher computational cost (44.57 M parameters and 52.89 G FLOPs) than the baseline models, it achieves the highest segmentation accuracy, reaching an mIoU of 0.904. Specifically, DRMNet outperforms D-LinkNet by 11%, demonstrating its superiority in restoring fragmented and occluded roads and ensuring global connectivity. The higher FLOPs are the main drawback of DRMNet; nevertheless, the substantial improvements in connectivity and accuracy make this trade-off worthwhile. In particular, the FLOPs of DRMNet scale quadratically with input resolution, approximately 211 G at 512 × 512 and 846 G at 1024 × 1024, consistent with the fourfold growth expected each time the input side length is doubled. Despite this increase in computational cost, DRMNet consistently delivers robust performance and practical feasibility, underscoring the effectiveness of the proposed dual-branch multi-modal attention fusion design under real-world conditions.
Overall, the results in Table 4, Table 5, Table 6, Table 7, and Table 8 comprehensively demonstrate the effectiveness and robustness of DRMNet. On both the DeepGlobe and Massachusetts datasets, incorporating the MAF module and adopting the dual-branch encoder consistently improve segmentation accuracy compared with baseline models. Under perturbation scenarios with Gaussian noise and random occlusion, DRMNet maintains stable performance, with only slight decreases in mIoU, thereby validating its robustness against noise interference and partially missing information. Although DRMNet introduces higher computational cost, the improvements in connectivity and accuracy make this trade-off worthwhile. Moreover, the ablation studies confirm that the MASK branch contributes more significantly than the RGB branch, while cross-modal fusion through the MAF module further refines road extraction. Taken together, these findings highlight that DRMNet achieves superior road extraction performance with strong robustness and practical feasibility in real-world applications.
5. Conclusions
Existing methods often lack global consistency in road feature perception, leading to fragmentation and omission in road extraction results. To address this issue, we propose the DRMNet model for fine-grained road extraction. The model consists of two main components: the RGB-MASK dual-branch encoder structure and the MAF feature enhancement module. The RGB-MASK dual-branch encoder extracts features from the input RGB image and coarse MASK, enabling the fusion of two modalities of data. The MAF module enhances the features extracted from each branch via an attention mechanism, optimizing the feature fusion of the two modalities.
Comparative experiments and analysis were conducted on the Massachusetts and DeepGlobe datasets. The results show that DRMNet effectively reduces fragmentation in road extraction, capturing finer roads and producing smoother road boundaries. After feeding the coarse masks produced by UNet, DeepLabV3, LinkNet, and D-LinkNet into the DRMNet model for refined extraction, the average improvements in Precision, Recall, F1-score, and mIoU on the Massachusetts dataset were 11.88, 11.63, 11.75, and 9.25 percentage points, respectively. On the DeepGlobe dataset, the average improvements in Precision, Recall, F1-score, and mIoU were 2.33, 9.15, 4.78, and 3.35 percentage points, respectively.
To further validate the effectiveness of the proposed dual-branch structure and MAF module, ablation experiments were conducted comparing single-branch structures, multi-branch structures, and the presence or absence of the MAF module. The results show that in the Massachusetts dataset, the RGB-MASK dual-branch structure outperforms the RGB single-branch model by 5% in mIoU, and incorporating the MAF module into the RGB-MASK dual-branch structure improves the mIoU by 4.4%. In the DeepGlobe dataset, the RGB-MASK dual-branch structure outperforms the RGB single-branch model by 1.6% in mIoU, and adding the MAF module to the RGB-MASK dual-branch structure results in a 2.2% improvement in mIoU.
The experimental results demonstrate that the DRMNet model effectively addresses issues such as omission, fragmentation, and jagged boundaries in current remote sensing road extraction methods. This confirms the feasibility of using DRMNet for fine-grained road extraction in high-resolution remote sensing images. Looking ahead, future research will focus on further enhancing the model’s ability to recover extremely narrow road segments with widths of only a few pixels, reducing computational complexity to improve efficiency, and exploring lightweight architectures that can be more readily deployed in large-scale or real-time applications. In addition, integrating self-supervised learning or domain adaptation strategies may help reduce reliance on extensive labeled data and improve generalization across diverse geographic environments.