Article

Road Extraction from Remote Sensing Images Using a Skip-Connected Parallel CNN-Transformer Encoder-Decoder Model

Linger Gui, Xingjian Gu, Fen Huang, Shougang Ren, Huanhuan Qin and Chengcheng Fan
1 College of Artificial Intelligence, Nanjing Agricultural University, Nanjing 210095, China
2 Innovation Academy for Microsatellites of CAS, Shanghai 201210, China
3 Shanghai Engineering Center for Microsatellites, Shanghai 201210, China
4 Key Laboratory for Satellite Digitalization Technology of CAS, Shanghai 201210, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(3), 1427; https://doi.org/10.3390/app15031427
Submission received: 25 December 2024 / Revised: 25 January 2025 / Accepted: 26 January 2025 / Published: 30 January 2025
(This article belongs to the Special Issue Deep Learning and Digital Image Processing)

Abstract

Extracting roads from remote sensing images holds significant practical value across fields like urban planning, traffic management, and disaster monitoring. Current Convolutional Neural Network (CNN) methods, praised for their robust local feature learning enabled by inductive biases, deliver impressive results. However, they face challenges in capturing global context and accurately extracting the linear features of roads due to their localized receptive fields. To address these shortcomings of traditional methods, this paper proposes a novel parallel encoder architecture that integrates a CNN Encoder Module (CEM) with a Transformer Encoder Module (TEM). The integration combines the CEM’s strength in local feature extraction with the TEM’s ability to incorporate global context, achieving complementary advantages and overcoming limitations of both Transformers and CNNs. Furthermore, the architecture also includes a Linear Convolution Module (LCM), which uses linear convolutions tailored to the shape and distribution of roads. By capturing image features in four specific directions, the LCM significantly improves the model’s ability to detect and represent global and linear road features. Experimental results demonstrate that our proposed method achieves substantial improvements on the German-Street Dataset and the Massachusetts Roads Dataset, increasing the Intersection over Union (IoU) of road class by at least 3% and the overall F1 score by at least 2%.

1. Introduction

As critical infrastructure connecting cities, rural areas, and various resources, roads are essential for urban planning [1], geographic information system updates [2], autonomous driving [3], and beyond. The rapid pace of urbanization and increasing transportation demands have made efficient and accurate road extraction a core technological requirement. Advances in high-resolution remote sensing and deep learning have significantly revolutionized road extraction methods [4,5,6], improving both accuracy and efficiency while minimizing manual labor and accelerating information updates. These developments provide solid support for urban and rural development, underscoring the vital role of precise, automated road extraction in intelligent and efficient infrastructure management.
Road extraction is a specialized form of semantic segmentation that involves binary classification of roads versus background, which introduces challenges not typically encountered in general semantic segmentation. Although standard segmentation models are applicable, the binary nature of the task adds complexity. Advancing road extraction further requires overcoming three major challenges: complex environments degrade extraction accuracy, subtle inter-class variance causes misclassification, and category imbalance limits model sensitivity. Complex urban and rural environments, as illustrated in Figure A1a, frequently feature roads obscured by trees, buildings, or shadows, compromising continuity and topological integrity. Subtle inter-class variances, depicted in Figure A1b, cause urban roads to resemble rooftops and parking lots, while rural roads can look similar to exposed soil or farmland, increasing the likelihood of misclassification. Furthermore, the significant category imbalance shown in Figure A1c and quantified in Table A1 can lead models to overlook minority classes, reducing their sensitivity to roads and resulting in poorer road extraction performance.
Research on road extraction methods has advanced significantly, with various algorithms utilizing diverse techniques for road recognition and optimization. Consequently, road extraction from remote sensing images can be broadly classified into three categories: early-generation methods, CNN-based methods, and Transformer-based methods.
Early-generation methods relied on traditional image processing and basic feature extraction techniques. These approaches identified roads using fundamental visual characteristics such as color, shape, edges, and texture, employing mathematical and morphological algorithms. Anil et al. [7] combined the Snake model with a median filter to reduce noise while preserving edges, iteratively generating curves to represent road contours. Abraham et al. [8] applied level set methods and fuzzy inference to extract road networks from low-quality satellite images. While adaptable to complex scenes, these methods are computationally intensive and sensitive to parameter settings and initial conditions.
V. Mnih et al. [9] introduced restricted Boltzmann machines (RBMs) for road segmentation in high-resolution images, pioneering deep learning in this field. CNNs later improved practicality by reducing manual tuning and computational complexity while enhancing generalization. The Fully Convolutional Network (FCN) [10], an encoder-decoder model for semantic segmentation, replaced fully connected layers with convolutional layers and used upsampling in the decoder to counteract the size reduction caused by convolution and pooling. The DeepLab [11] series expanded the receptive field with dilated convolutions to avoid the resolution loss caused by pooling and enhanced multi-scale context capture through spatial pyramid pooling. DeepLabv3+ [12] further refined the encoder-decoder structure, progressively restoring spatial resolution and significantly improving boundary segmentation accuracy. The introduction of U-Net [13] further improved road extraction accuracy in remote sensing images. U-Net’s skip connections enable multi-scale feature fusion, retaining high-level semantic information alongside low-level details. Due to its versatility and efficiency, U-Net has become one of the most popular models in image segmentation. Variants such as U-Net++ [14] with dense connections, ResUNet [15] with residual connections, and U-SegNet [16] with SegNet’s [17] pooling-index transfer mechanism improve gradient propagation and computational efficiency and have been adapted for complex segmentation tasks. CNNs have been extensively applied in road extraction with notable success. However, their reliance on convolution operations for local feature extraction inherently limits the receptive field, leading to insufficient contextual information and capping performance. Expanding the receptive field to capture global context requires deeper networks, which can cause issues such as gradient vanishing or explosion. NL-LinkNet [18] pioneered the incorporation of non-local operations into LinkNet [19], using an attention-like mechanism to directly compute the similarity between any two positions in the feature map. This captures global context while preserving local details, making it especially suitable for long-span targets like roads. The approach enhanced global consistency and sparked interest in using attention mechanisms, including Transformers, for road extraction.
In recent years, the successful application of Vision Transformer (ViT) [20] across various computer vision tasks has drawn significant attention to Transformers in the field [21,22,23]. Transformer models replace CNN’s convolutional operations with self-attention mechanisms, enabling better global context capture from images without stacking layers to expand the receptive field. SETR [24] first applied Transformers to semantic segmentation by treating it as a sequence-to-sequence prediction task. It introduced and evaluated three decoder architectures, ultimately developing a comprehensive model that overcomes FCN limitations and achieving top performance on multiple datasets. Pyramid Vision Transformer (PVT) [25] employed a progressively shrinking pyramid structure with spatially reduced attention layers to generate multi-scale features, achieving strong results in object detection and segmentation. Twins [26] enhanced PVT by replacing fixed positional encoding with conditional encoding, minimizing the impact of image resolution on model performance and computational load, thus allowing flexible handling of multi-scale features. CCNet [27] introduced a criss-cross attention mechanism, improving computational efficiency and reducing complexity without sacrificing segmentation quality. Given roads’ extensive geometric characteristics, capturing global features is essential in remote sensing images, prompting more researchers to integrate Transformers into road extraction tasks. RNGDet [28] combined Transformers with graph-based depth-first search algorithms to iteratively generate road networks from aerial images, improving the model’s ability to capture road segments near complex intersections. Wang et al. [29] addressed road shape features by using an efficient strip Transformer module to model long-range road dependencies, effectively capturing the elongated nature of roads. SAM-Road [30] introduced a lightweight graph neural network built on a Transformer architecture, integrating Non-Maximum Suppression (NMS) to extract road network vertices. This design significantly enhances the speed of road extraction tasks while maintaining high accuracy.
The analysis of datasets reveals significant class imbalance in road data, emphasizing the importance of global context for road recognition. This is precisely why Transformers have been widely applied and deeply explored in road extraction. While they excel at capturing global context, the unique shape and distribution of roads also require local detail features to enhance extraction accuracy by reducing misclassification and improving topological connectivity.
This paper proposes a stepped parallel encoder architecture that combines a CNN Encoder Module (CEM) and a Transformer Encoder Module (TEM). By leveraging CNN’s inductive bias for spatial locality and translation invariance, alongside the Transformer’s global feature learning capability, the design effectively integrates local details and global context, enhancing feature representation and model generalization. Feature fusion between the CEM and TEM facilitates knowledge sharing, allowing the attention mechanism to flexibly focus on key features. To address the linear characteristics of roads, the decoder integrates a Linear Convolution Module (LCM), which improves the capture of linear features while significantly reducing parameters and computational complexity.
Our contributions are summarized as follows:
  • Novel Stepped Parallel Architecture: We propose a stepped parallel encoder architecture that integrates the CNN Encoder Module (CEM) and Transformer Encoder Module (TEM), enabling the fusion of local and global information, compensating for the limitations of conventional CNNs.
  • Enhanced Encoder Module Design: We redesign the CNN and Transformer encoder modules from classic models to focus on local feature capture and global context collection, respectively. This design maintains their feature-capturing strengths while enabling integration into a parallel architecture.
  • Innovative Linear Decoder Module: We specifically design the Linear Convolution Module (LCM) in the decoder for linear shape characteristics and large-span distribution features of roads, which improves the edge integrity while reducing computational complexity.

2. Methods

2.1. Architecture Overview

To address CNN’s limitations in capturing long-range dependencies and the constraints of columnar Transformers for semantic segmentation, we propose a parallel encoder-decoder structure, as shown in Figure 1. Both the encoder and decoder adopt a stepped design instead of the traditional columnar approach, decoupling model performance from input resolution and reducing computational cost. The encoder progressively extracts features by reducing spatial dimensions while increasing feature depth, allowing multi-scale contextual information to be captured.
Inside the encoder, the multi-head attention mechanism in TEMs alleviates gradient correlation issues from stacked CNN layers, effectively capturing global context. Features learned by CEMs are passed to subsequent stages to strengthen the Transformer’s inductive bias, thereby enhancing learning and generalization capabilities. To retain maximum feature information and enrich the model’s feature representation for subsequent extraction, tensor concatenation fuses the outputs of the encoders. Notably, the CNN and Transformer modules learn the image features in parallel rather than sequentially. This approach avoids the potential for irreversible defects in feature maps caused by previous stages of a sequential process. Parallel learning preserves the CNN’s strength in capturing local features while utilizing the Transformer’s capability to understand global context.
Inside the decoder, LCM leverages the continuous distribution and shapes of roads, gathering long-distance information from multiple directions to enhance segmentation accuracy. Each decoder layer directly corresponds to an encoder layer, preserving local details through skip-connections and preventing information loss during upsampling due to feature sparsity and context reduction. This skip-connected trapezoidal structure achieves efficient multi-scale feature fusion, utilizing both low-level and high-level feature information to improve road edge and detail segmentation. Additionally, it enhances noise resilience and robustness.
At each stage, features are processed in parallel through both the CEM and TEM, then fused and passed to the next stage for further learning. At lower scales, local details such as edges, textures, and fine features are captured; at medium scales, broader shapes, road contours, and environmental context are represented; and at higher scales, global features including road networks and major highways are captured, providing long-range dependencies. This parallel approach generates multi-scale feature maps that effectively capture both global context and fine-grained details.
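To make the stage-wise parallel processing concrete, the following PyTorch-style sketch (our own illustrative simplification, not the released implementation; `ParallelEncoderStage`, `cem_stage`, and `tem_stage` are placeholder names) shows how one encoder stage could run a CNN branch and a Transformer branch in parallel and fuse their outputs by channel concatenation before handing them to the next stage.

```python
import torch
import torch.nn as nn

class ParallelEncoderStage(nn.Module):
    """One encoder stage: a CNN branch and a Transformer branch process the same input
    in parallel; their outputs are concatenated along channels and projected before
    being passed to the next stage and to the matching decoder level via a skip connection."""

    def __init__(self, out_ch, cem_stage, tem_stage):
        super().__init__()
        self.cem = cem_stage                                       # local-feature branch (convolutions)
        self.tem = tem_stage                                       # global-context branch (self-attention)
        self.fuse = nn.Conv2d(2 * out_ch, out_ch, kernel_size=1)   # channel fusion after concatenation

    def forward(self, x):
        local_feat = self.cem(x)                                   # e.g. B x out_ch x H/2 x W/2
        global_feat = self.tem(x)                                  # same spatial size as local_feat
        fused = torch.cat([local_feat, global_feat], dim=1)        # tensor concatenation
        return self.fuse(fused)

# Toy usage with stand-in branches (real CEM/TEM stages would go here):
stage = ParallelEncoderStage(
    out_ch=32,
    cem_stage=nn.Sequential(nn.MaxPool2d(2), nn.Conv2d(3, 32, 3, padding=1)),
    tem_stage=nn.Sequential(nn.MaxPool2d(2), nn.Conv2d(3, 32, 3, padding=1)),
)
fused = stage(torch.randn(1, 3, 256, 256))                         # -> (1, 32, 128, 128)
```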

2.2. CNN Encoder Module

The design of the CNN encoder in this paper, illustrated in the left half of Figure 2a, draws inspiration from the VGG16 [31], ResNet [32], and U-Net [13] architectures. Each stage comprises stacked convolutions, a Depthwise Separable Convolution (DSC) [33], and ReLU activations.
Before performing convolution, a max-pooling layer is applied to downsample the feature map, reducing both spatial dimensions and computational demands. The convolution operation uses a 3 × 3 kernel for feature extraction, with zero padding [34] preserving the original positional information while capturing local contextual details. Stacking multiple convolutions enables the progressive extraction of complex features and adjustment of channel dimensions as needed. Using multiple small kernels (kernel_size = 3) instead of a single large kernel (kernel_size = 5 or 7) achieves an equivalent receptive field with fewer parameters and lower computational cost. Specifically, depthwise separable convolutions apply a 3 × 3 kernel independently to each input channel, followed by point-wise convolutions that use a 1 × 1 kernel across channels. This approach reduces the number of parameters and computational load while effectively capturing feature information. The ReLU activation function introduces non-linearity, allowing the network to learn complex patterns and representations while mitigating gradient vanishing.
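A minimal PyTorch sketch of one CEM stage under these assumptions (channel widths, layer counts, and module names are illustrative, not taken from the paper's code):

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise convolution applied per channel, followed by a 1x1 pointwise convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class CEMStage(nn.Module):
    """One CEM stage: max-pool downsampling, a 3x3 convolution with zero padding,
    a depthwise separable convolution, and ReLU activations."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.MaxPool2d(kernel_size=2),                          # halve spatial resolution
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),   # small 3x3 kernel, zero padding
            nn.ReLU(inplace=True),
            DepthwiseSeparableConv(out_ch, out_ch),               # cheaper than a second full 3x3 conv
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)
```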

2.3. Transformer Encoder Module

The Transformer encoder in this paper shown in the right half of Figure 2a is inspired by PVT [25] and designed in a stepped format. This design prevents the impact of resolution changes typical in columnar structures while ensuring compatibility with the CNN encoder architecture. Each stage of the TEM includes spatial embedding, positional embedding, and attention operations.
Spatial embedding transforms images into lower-dimensional vector representations through basic 3 × 3 convolution operations, effectively abstracting high-level features. Low-level features in the images aid in defining object boundaries, while high-level features represent complex semantic concepts such as shapes and inter-object relationships. This enables the model to interpret image content in a manner similar to human perception. To prevent boundary information loss during patch segmentation, overlapping patches are employed, differing from standard image partitioning methods. This approach preserves more contextual information, reduces blocking artifacts, and enhances local continuity.
As previously noted, Islam et al. [34] demonstrated that CNNs can implicitly learn and encode absolute positional information without explicit encoding. This design eliminates the inefficiency and rigidity associated with fixed-resolution positional embeddings in practical applications. Positional embedding adjusts feature dimensions via a 1 × 1 convolution, while a 3 × 3 convolution implicitly encodes positional information to model spatial relationships more effectively. Additionally, GELU is employed instead of ReLU for activation, as it preserves more information through Gaussian smoothing rather than discarding small values. Its smooth gradient transitions further enhance stability during optimization.
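The two embedding steps could be sketched as follows; the stride of the overlapping patch embedding and the use of a depthwise 3 × 3 convolution for the implicit positional encoding are assumptions made for illustration, not details from the released code:

```python
import torch.nn as nn

class OverlapSpatialEmbed(nn.Module):
    """Overlapping patch embedding: a strided 3x3 convolution whose windows overlap,
    preserving boundary context between neighbouring patches."""
    def __init__(self, in_ch, embed_dim, stride=2):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=3, stride=stride, padding=1)

    def forward(self, x):
        return self.proj(x)

class ConvPositionalEmbed(nn.Module):
    """Convolution-based positional embedding: a 1x1 convolution adjusts channel dimensions
    and a 3x3 convolution lets the network encode position implicitly, followed by GELU."""
    def __init__(self, dim):
        super().__init__()
        self.channel_adjust = nn.Conv2d(dim, dim, kernel_size=1)
        self.pos_conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)  # depthwise (assumption)
        self.act = nn.GELU()

    def forward(self, x):
        return self.act(self.pos_conv(self.channel_adjust(x)))
```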
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V \quad (1)
The attention operation, as shown in Figure 2b and Equation (1), generates Q (query), K (key), and V (value) matrices from the input features. It performs a dot product between Q and K to compute attention scores, which can be interpreted as a cosine-like similarity for each pixel pair. Pixels with higher similarity receive larger attention scores, indicating they should allocate more “attention” to each other, thereby facilitating mutual feature learning. Before the softmax, the scores are scaled by the square root of the feature dimension d to ensure numerical stability, prevent vanishing gradients, and keep the numerical range under control during training. The dot product of the attention weights with V then aggregates global contextual information. After computing global attention, a linear layer further adjusts the feature dimensions and transforms the attention output, enhancing the model’s representational power.
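A single-head PyTorch sketch of Equation (1), including the final linear projection (the paper uses multi-head attention and, following PVT, likely spatial reduction of K and V, both omitted here for clarity):

```python
import math
import torch
import torch.nn as nn

class GlobalAttention(nn.Module):
    """Scaled dot-product self-attention from Equation (1): softmax(QK^T / sqrt(d)) V,
    followed by a linear projection that adjusts the output feature dimensions."""
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)                # final linear layer on the attention output

    def forward(self, x):                              # x: (batch, num_pixels, dim)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # pairwise similarity, scaled by sqrt(d)
        attn = scores.softmax(dim=-1)                  # attention weights for every pixel pair
        return self.proj(attn @ v)                     # aggregate global context and project
```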

2.4. Linear Convolution Decoder Module

As shown in Figure 3, this paper employs a multi-stage decoder corresponding to the encoder, enabling comprehensive multi-scale image understanding and enhancing the model’s ability to distinguish between background and road pixels. The gradual restoration of spatial information minimizes loss of positional and spatial details during upsampling, improving reconstruction accuracy and detail retention. This multi-stage approach also effectively reduces checkerboard artifacts, which can otherwise impair model performance, and enhances output quality through layered processing. Additionally, the decoder retains edge information and fine details during feature fusion, improving edge detection accuracy. The encoder-decoder structure, augmented with skip-connections, preserves the spatial structure and contextual information of the original feature maps, ensuring robust and detailed reconstructions.
The ground-truth in road datasets reveals the continuous distribution and broad spans of roads in images, typically exhibiting linear characteristics and shapes like ‘Y’, ‘T’, and ‘+’ [4]. CCNet [27] utilizes criss-cross attention to model horizontal and vertical context, balancing computational efficiency with contextual integration. To more effectively capture these linear road features, the decoder incorporates a specialized module using parallel linear convolutions. This module applies parallel linear convolutions in four directions: horizontal, vertical, left diagonal, and right diagonal. The features from these convolutions are then fused and output after adjusting channel dimensions. Horizontal convolutions use 1 × n kernels, vertical convolutions use n × 1 kernels, and diagonal convolutions apply weights only along the diagonal of an n × n kernel. The first two configurations expand the receptive field to better match road shapes while maintaining parameter efficiency. Diagonal convolutions further reduce the parameter count while effectively capturing road features.
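A possible PyTorch sketch of such a four-directional module is given below; the kernel length n is an assumption, and the diagonal branches are emulated by masking full n × n kernels purely for brevity (a parameter-efficient version would store only the n diagonal weights):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearConvModule(nn.Module):
    """Four-directional linear convolutions: horizontal 1xn, vertical nx1, and two diagonal
    branches built by masking nxn kernels, fused by a 1x1 convolution."""
    def __init__(self, ch, n=9):
        super().__init__()
        self.horizontal = nn.Conv2d(ch, ch, kernel_size=(1, n), padding=(0, n // 2))
        self.vertical = nn.Conv2d(ch, ch, kernel_size=(n, 1), padding=(n // 2, 0))
        self.diag_main = nn.Conv2d(ch, ch, kernel_size=n, padding=n // 2)
        self.diag_anti = nn.Conv2d(ch, ch, kernel_size=n, padding=n // 2)
        eye = torch.eye(n)
        # masks keep weights only along the two diagonals of the nxn kernels
        self.register_buffer("mask_main", eye.view(1, 1, n, n))
        self.register_buffer("mask_anti", torch.flip(eye, dims=[1]).view(1, 1, n, n))
        self.fuse = nn.Conv2d(4 * ch, ch, kernel_size=1)           # channel adjustment after fusion

    def forward(self, x):
        d_main = F.conv2d(x, self.diag_main.weight * self.mask_main,
                          self.diag_main.bias, padding=self.diag_main.padding)
        d_anti = F.conv2d(x, self.diag_anti.weight * self.mask_anti,
                          self.diag_anti.bias, padding=self.diag_anti.padding)
        feats = torch.cat([self.horizontal(x), self.vertical(x), d_main, d_anti], dim=1)
        return self.fuse(feats)
```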

3. Results

3.1. Dataset Introduction

We evaluated the model on two datasets: the German-Street dataset and the Massachusetts roads dataset.
  • The German-Street dataset, a subset of the CITY-OSM dataset, was published by Kaiser et al. [35] and includes images from various cities and regions, with annotations generated via OpenStreetMap. It consists of 4000 street images measuring 512 × 512 pixels, divided into 3600 for training, 40 for validation, and 360 for testing.
  • The Massachusetts Roads dataset, created by Mnih et al. [36], consists of 1171 aerial images of Massachusetts, each measuring 1500 × 1500 pixels and covering an area of 2.25 square kilometers. The dataset includes a wide range of landscapes such as urban, suburban, and rural areas and is randomly split into 1108 for training, 14 for validation, and 49 for testing.
The two datasets encompass diverse urban and rural road scenarios, presenting challenges like road occlusion, inter-class similarity with the background, and scattered yet continuous road distributions. These characteristics facilitate evaluation of a model’s local precision and global coherence in road extraction. The datasets’ differences in geographic distribution, road features, and remote sensing styles further enable assessment of a model’s generalization across complex urban and varied rural environments.

3.2. Evaluation Metrics

We evaluated the model using four metrics: Intersection over Union (IoU), Recall, Precision, and F1-Score. IoU measures the overlap between predictions and ground truth; Recall quantifies the proportion of true positives correctly identified; Precision assesses the accuracy of positive predictions; and the F1-Score, the harmonic mean of Precision and Recall, provides a balanced performance assessment. For single-category assessments, we simplified the evaluation by focusing on IoU and Recall, with IoU reflecting overall performance and Recall focusing on road continuity and completeness.
The evaluation metrics are formulated as follows, where TP, FP, and FN denote true positives, false positives, and false negatives, respectively:
\mathrm{IoU} = \frac{TP}{TP + FP + FN}
\mathrm{Recall} = \frac{TP}{TP + FN}
\mathrm{Precision} = \frac{TP}{TP + FP}
\mathrm{F1\text{-}Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
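For reference, a minimal NumPy sketch of these road-class metrics computed from binary masks (the small epsilon guarding against empty masks is our addition):

```python
import numpy as np

def road_metrics(pred, gt):
    """Compute IoU, Recall, Precision, and F1-Score for the road class from binary masks
    (1 = road, 0 = background), following the formulas above."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()       # road pixels predicted as road
    fp = np.logical_and(pred, ~gt).sum()      # background pixels predicted as road
    fn = np.logical_and(~pred, gt).sum()      # road pixels predicted as background
    eps = 1e-8                                # guards against division by zero on empty masks
    iou = tp / (tp + fp + fn + eps)
    recall = tp / (tp + fn + eps)
    precision = tp / (tp + fp + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return {"IoU": iou, "Recall": recall, "Precision": precision, "F1-Score": f1}
```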

3.3. Implementation Details

All experiments were conducted using MMSegmentation, a PyTorch-based semantic segmentation toolbox from OpenMMLab, which provides diverse mainstream models and supports various dataset formats. The experimental environment included CUDA 11.3, PyTorch 1.12.0, Python 3.9, and a single NVIDIA GeForce RTX 4090 GPU with 24 GB of memory. Our method uses the AdamW optimizer [37] with an initial learning rate of 0.00006. For the first 1500 iterations, we used LinearLR to gradually increase the learning rate, followed by PolyLR [38], which progressively decreases the learning rate over 80k iterations. The batch size is set to 4, and validation is performed every 4k iterations to balance training and validation.
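A plain-PyTorch approximation of this schedule is sketched below; the actual experiments were configured through MMSegmentation, the weight decay and warm-up start factor are assumptions, and torch.optim.lr_scheduler.PolynomialLR requires a newer PyTorch release than the 1.12.0 listed above:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR, PolynomialLR, SequentialLR

model = torch.nn.Conv2d(3, 2, kernel_size=3)                         # placeholder for the actual network
optimizer = AdamW(model.parameters(), lr=6e-5, weight_decay=0.01)    # weight decay value is an assumption

warmup = LinearLR(optimizer, start_factor=1e-3, total_iters=1500)    # linear warm-up for 1500 iterations
decay = PolynomialLR(optimizer, total_iters=80_000, power=1.0)       # PolyLR-style decay over 80k iterations
scheduler = SequentialLR(optimizer, schedulers=[warmup, decay], milestones=[1500])

for it in range(2000):                          # loop shortened; the paper trains for 80k iterations
    optimizer.step()                            # stands in for a full forward/backward training step
    scheduler.step()
```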

3.4. Experimental Data and Result Visualization

3.4.1. Experiments Based on the German-Street Dataset

Table 1 compares the evaluation results on the German-Street dataset. IoU and Recall for the background class are generally high, while those for the road class are lower, reflecting the challenges of binary semantic segmentation with imbalanced samples, where models tend to prioritize background regions. The CCNet model, based on criss-cross attention, performs relatively poorly overall but achieves a low omission rate for roads. Other methods demonstrate balanced performance, though PSPNet converges more slowly. U-Net outperforms other methods in accuracy, convergence speed, and parameter efficiency but falls slightly behind our method. Overall, our method achieves the highest scores across all four metrics, demonstrating superior performance and effectively balancing segmentation between background and roads.
Figure 4 presents a visual comparison of results from different models on the German-Street dataset. Red boxes highlight areas where the ground truth fails to correctly label roads, such as misidentifying rooftops and parking lots as roads or missing roads due to occlusion. In contrast, our method accurately identifies these areas, effectively mitigating the impact of inter-class similarity and complex backgrounds on road extraction. Furthermore, our method achieves notable improvements in road continuity and completeness compared to other models.

3.4.2. Experiments Based on the Massachusetts Roads Dataset

Table 2 compares the evaluation results on the Massachusetts Roads dataset. The proportion of positive samples is only half that of the German-Street dataset, making class imbalance more pronounced. With a larger sample size, the IoU and Recall for the background class are 4% higher on average than in Table 1, reflecting better identification of background regions. Our method achieves the highest IoU and Recall for the road class, with an IoU improvement of 3.21% over Deeplabv3+, demonstrating superior performance in capturing road region details and completeness. Overall, PSPNet and PSANet perform poorly, while SegFormer, which recognizes non-road areas more accurately, performs better. Our method effectively balances segmentation performance for both background and road classes under class imbalance, boosting the F1-score by 3% and significantly surpassing other models.
Figure 5 presents a visual comparison of results from different models on the Massachusetts Roads dataset. The Massachusetts Roads dataset includes roads from urban, suburban, and rural areas, with relatively dispersed distributions, emphasizing the importance of maintaining road continuity in recognition. Visual results show that other methods often produce fragmented roads and incomplete edges in dense areas or under challenges like tree shadows and similarities with exposed soil. In contrast, our method achieves superior continuity and classification performance.

3.4.3. Ablation Study

Table 3 presents ablation experiments on the CEM, TEM, and LCM, validating the effectiveness of each component in our method. Although Transformers and CNNs learn complementary features from different perspectives, standalone Transformers perform worse than standalone CNNs in road extraction on both datasets. However, combining the two modules improved road IoU by over 1.64% on the German-Street dataset and 1.38% on the Massachusetts Roads dataset, along with an overall mIoU increase of over 0.71% and 0.66%, respectively. Incorporating the LCM, designed to capture road shape features, further improves road extraction performance, with a more significant improvement on the Massachusetts Roads dataset, where road IoU increases by 1.64%.

4. Discussion

This paper addresses the challenges in road extraction from remote sensing images by integrating global context and local feature information through parallel CEM and TEM modules, alongside an LCM decoder. These modules tackle issues such as complex backgrounds, high inter-class similarity, and extreme sample imbalance, thereby improving the continuity and edge integrity of road extraction results.
Future work could focus on optimizing feature fusion mechanisms to process feature channels more effectively, reducing redundant features from Transformers and CNNs to accelerate model training and increase the proportion of effective features. Additionally, lightweighting the CEM and TEM could improve the computational speed of the model and reduce its computational cost.

5. Conclusions

This paper begins by conducting a thorough examination of existing datasets and summarizing the limitations observed in experimental results from current methodologies. Building on this foundation, we propose the Parallel CNN-Transformer Encoder-Decoder Model.
In our method, the CEM leverages locality and translation equivariance to allow neighboring pixels to contribute to target pixel classification, enhancing consistent feature recognition and local feature extraction. The TEM integrates global context via attention mechanisms without stacking layers, enabling features from the whole image to assist pixel classification. Multi-head attention optimizes this process, reducing computation time and increasing efficiency. To maximize the integration of the CEM and TEM, we adopt a parallel structure instead of the common serial pipeline. This approach mitigates each module’s limitations during feature learning, improving real-time feature extraction and overall robustness. For road segmentation, the LCM adjusts pixel importance within regions, ensuring continuous segmentation and complete edges. Compared with traditional stacked convolution kernels, the parallel use of linear convolutions in multiple directions achieves a larger receptive field, better matches road shapes, and reduces parameter count and computational complexity.
Experiments on the German-Street dataset and Massachusetts Roads dataset, featuring diverse urban, suburban, and rural scenes, demonstrate our method’s superior accuracy in distinguishing roads from backgrounds with enhanced completeness and continuity. This robust performance facilitates rapid and efficient road network extraction from satellite imagery, potentially reducing manual labor costs and accelerating follow-up tasks.
In summary, our method significantly improves road extraction accuracy, serving as a valuable reference for existing research. Future work will focus on advancing high-resolution remote sensing road extraction.

Author Contributions

Conceptualization, X.G. and L.G.; methodology, X.G. and L.G.; validation, X.G., L.G. and F.H.; formal analysis, L.G.; investigation, X.G.; resources, S.R. and C.F.; data curation, X.G., L.G. and F.H.; writing—original draft preparation, X.G. and L.G.; writing—review and editing, F.H.; visualization, H.Q.; supervision, S.R. and C.F.; project administration, H.Q. and C.F.; funding acquisition, C.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Defense Science and Technology Outstanding Youth Science Fund (2021-JCJQ-ZQ-017).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are publicly available and can be accessed as follows: (1) The German-Street dataset: https://zenodo.org/records/1154821 (accessed on 31 October 2024); (2) The Massachusetts Roads dataset: https://www.cs.toronto.edu/~vmnih/data/ (accessed on 31 October 2024).

Acknowledgments

The authors would like to express their gratitude for the valuable feedback and suggestions provided by all the anonymous reviewers and the editorial team.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A underscores the challenges in road extraction posed by complex backgrounds, high inter-class similarity, and extreme sample imbalance across various datasets. It includes visual examples and statistical data to elucidate these issues.
Figure A1. Existing challenges in road extraction illustrated by the CHN6-CUG, the DeepGlobe, the German-Street, and the Massachusetts Roads dataset: (a) Complex environments, such as trees and shadows, impact extraction accuracy. (b) The subtle inter-class differences between road and certain background objects can easily cause misclassification. (c) Category imbalance limits the model’s ability to accurately identify minority classes.
Table A1. Statistics of positive and negative sample pixels in road dataset.
Datasets | Number of Road Pixels | Number of Background Pixels | Positive and Negative Sample Ratio (%) | Positive Sample Ratio (%)
CHN6-CUG | 85,356,849 | 1,097,174,735 | 7.78 | 7.22
DeepGlobe | 276,640,131 | 6,251,794,045 | 4.42 | 4.24
Massachusetts | 58,409,723 | 1,169,472,773 | 4.99 | 4.76
German-Street | 115,195,161 | 933,380,839 | 12.34 | 10.99

References

1. Sun, Z.; Wu, J.; Yang, J.; Huang, Y.; Li, C.; Li, D. Path Planning for GEO-UAV Bistatic SAR Using Constrained Adaptive Multiobjective Differential Evolution. IEEE Trans. Geosci. Remote Sens. 2016, 54, 6444–6457.
2. Mckeown, D.M. The Role of Artificial Intelligence in the Integration of Remotely Sensed Data with Geographic Information Systems. IEEE Trans. Geosci. Remote Sens. 1987, GE-25, 330–348.
3. Xu, W.; Wei, J.; Dolan, J.M.; Zhao, H.; Zha, H. A Real-Time Motion Planner with Trajectory Optimization for Autonomous Vehicles. In Proceedings of the 2012 IEEE International Conference on Robotics and Automation, Saint Paul, MN, USA, 14–18 May 2012.
4. Wang, W.; Yang, N.; Zhang, Y.; Wang, F.; Cao, T.; Eklund, P. A review of road extraction from remote sensing images. J. Traffic Transp. Eng. (Engl. Ed.) 2016, 3, 271–282.
5. Lian, R.; Wang, W.; Mustafa, N.; Huang, L. Road Extraction Methods in High-Resolution Remote Sensing Images: A Comprehensive Review. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 5489–5507.
6. Mo, S.; Shi, Y.; Yuan, Q.; Li, M. A Survey of Deep Learning Road Extraction Algorithms Using High-Resolution Remote Sensing Images. Sensors 2024, 24, 1708.
7. Anil, P.N.; Natarajan, S. A Novel Approach Using Active Contour Model for Semi-Automatic Road Extraction from High Resolution Satellite Imagery. In Proceedings of the Second International Conference on Machine Learning & Computing, Bangalore, India, 9–11 February 2010.
8. Abraham, L.; Sasikumar, M. A fuzzy based road network extraction from degraded satellite images. In Proceedings of the 2013 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Mysore, India, 22–25 August 2013.
9. Mnih, V.; Hinton, G.E. Learning to Detect Roads in High-Resolution Aerial Images; Springer: Berlin/Heidelberg, Germany, 2010.
10. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 39, 640–651.
11. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
12. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018.
13. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv 2015, arXiv:1505.04597.
14. Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. DLMIA ML-CDS 2018; Springer: Cham, Switzerland, 2018.
15. Jha, D.; Smedsrud, P.H.; Johansen, D.; De Lange, T.; Johansen, H.D.; Halvorsen, P.; Riegler, M.A. A Comprehensive Study on Colorectal Polyp Segmentation with ResUNet++, Conditional Random Field and Test-Time Augmentation. arXiv 2021, arXiv:2107.12435.
16. Kumar, P.; Nagar, P.; Arora, C.; Gupta, A. U-SegNet: Fully Convolutional Neural Network based Automated Brain tissue segmentation Tool. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018.
17. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
18. Wang, Y.; Seo, J.; Jeon, T. NL-LinkNet: Toward Lighter But More Accurate Road Extraction With Nonlocal Operations. IEEE Geosci. Remote Sens. Lett. 2021, 19, 3000105.
19. Chaurasia, A.; Culurciello, E. LinkNet: Exploiting encoder representations for efficient semantic segmentation. In Proceedings of the 2017 IEEE Visual Communications and Image Processing (VCIP), St. Petersburg, FL, USA, 10–13 December 2017.
20. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021.
21. Liu, Y.; Zhang, Y.; Wang, Y.; Hou, F.; Yuan, J.; Tian, J.; Zhang, Y.; Shi, Z.; Fan, J.; He, Z. A Survey of Visual Transformers. arXiv 2021, arXiv:2111.06091.
22. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Tao, D. A Survey on Visual Transformer. arXiv 2023, arXiv:2012.12556.
23. Tay, Y.; Dehghani, M.; Bahri, D.; Metzler, D. Efficient Transformers: A Survey. ACM Comput. Surv. 2023, 55, 109.1–109.28.
24. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Zhang, L. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. arXiv 2020, arXiv:2012.15840.
25. Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. arXiv 2021, arXiv:2102.12122.
26. Chu, X.; Tian, Z.; Wang, Y.; Zhang, B.; Ren, H.; Wei, X.; Xia, H.; Shen, C. Twins: Revisiting the Design of Spatial Attention in Vision Transformers. arXiv 2021, arXiv:2104.13840.
27. Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. CCNet: Criss-Cross Attention for Semantic Segmentation. In Proceedings of the International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019.
28. Xu, Z.; Liu, Y.; Gan, L.; Sun, Y.; Liu, M.; Wang, L. RNGDet: Road Network Graph Detection by Transformer in Aerial Images. arXiv 2022, arXiv:2202.07824.
29. Wang, C.; Xu, R.; Xu, S.; Meng, W.; Wang, R.; Zhang, J.; Zhang, X. Towards accurate and efficient road extraction by leveraging the characteristics of road shapes. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4404616.
30. Hetang, C.; Xue, H.; Le, C.; Yue, T.; Wang, W.; He, Y. Segment Anything Model for Road Network Graph Extraction. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 17–18 June 2024.
31. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556.
32. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
33. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861.
34. Islam, M.A.; Jia, S.; Bruce, N.D.B. How Much Position Information Do Convolutional Neural Networks Encode? arXiv 2020, arXiv:2001.08248.
35. Kaiser, P.; Wegner, J.D.; Lucchi, A.; Jaggi, M.; Hofmann, T.K. Learning Aerial Image Segmentation From Online Maps. IEEE Trans. Geosci. Remote Sens. 2017, 55, 6054–6068.
36. Mnih, V. Machine Learning for Aerial Image Labeling. Ph.D. Thesis, University of Toronto, Toronto, ON, Canada, 2013.
37. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2017, arXiv:1711.05101.
38. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587.
39. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239.
40. Zhao, H.; Zhang, Y.; Liu, S.; Shi, J.; Loy, C.C.; Lin, D.; Jia, J. PSANet: Point-Wise Spatial Attention Network for Scene Parsing; Springer: Cham, Switzerland, 2018.
41. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. arXiv 2021, arXiv:2105.15203.
Figure 1. The overall architecture comprises a multi-level encoder with parallel CEM and TEM and LCM connected to the corresponding encoder levels via skip connections.
Figure 2. Illustration of the encoder module structure: (a) The left shows the CEM, and the right shows the TEM. (b) The attention mechanism of the TEM.
Figure 3. The LCM uses four-directional convolutions to capture contextual information, with skip-connections to retain key pixel data.
Figure 4. Visual comparison of various methods on the German-Street dataset. Red boxes highlight areas with ground truth misidentifications.
Figure 5. Visual comparison of various methods on the Massachusetts Roads dataset. Red boxes highlight areas where our method outperforms others in road extraction continuity.
Table 1. Summary of the results on the German-Street dataset.
Model Name | Background IoU (%) | Background Recall (%) | Road IoU (%) | Road Recall (%) | mIoU (%) | mPA (%) | Accuracy (%) | F1-Score (%)
PSPNet [39] | 92.26 | 96.91 | 47.84 | 59.58 | 70.03 | 78.25 | 92.74 | 84.88
PSANet [40] | 92.22 | 97.11 | 47.05 | 57.83 | 69.63 | 77.47 | 92.72 | 84.41
CCNet [27] | 92.32 | 97.33 | 46.95 | 56.91 | 69.64 | 77.12 | 92.81 | 84.24
Deeplabv3+ [12] | 91.95 | 96.49 | 47.51 | 60.75 | 69.73 | 78.62 | 92.49 | 84.99
Unet [13] | 92.52 | 96.76 | 50.57 | 63.57 | 71.54 | 80.17 | 93.05 | 86.13
Segformer [41] | 92.48 | 96.83 | 50.05 | 62.66 | 71.72 | 79.75 | 93.01 | 85.87
Our method | 92.85 | 96.68 | 53.24 | 67.29 | 73.05 | 81.98 | 93.39 | 87.31
The bolded portions indicate the best method under the specific evaluation criterion.
Table 2. Summary of the results on the Massachusetts Roads dataset.
Model Name | Background IoU (%) | Background Recall (%) | Road IoU (%) | Road Recall (%) | mIoU (%) | mPA (%) | Accuracy (%) | F1-Score (%)
PSPNet | 96.31 | 98.88 | 38.06 | 46.61 | 67.19 | 72.75 | 96.39 | 82.92
PSANet | 96.07 | 98.34 | 39.65 | 52.86 | 67.86 | 75.6 | 96.17 | 84.65
CCNet | 96.35 | 98.59 | 41.66 | 53.44 | 69 | 76.01 | 96.44 | 85.01
Deeplabv3+ | 96.77 | 99.06 | 46.55 | 57.51 | 71.66 | 78.17 | 96.86 | 85.52
Unet | 96.97 | 99.36 | 44.85 | 50.63 | 70.91 | 74.99 | 97.04 | 84.6
Segformer | 97 | 99.48 | 44.19 | 48.76 | 70.59 | 74.12 | 97.07 | 84.06
Our method | 96.83 | 98.58 | 49.71 | 63.85 | 73.27 | 81.21 | 96.93 | 88.38
The bolded portions indicate the best method under the specific evaluation criterion.
Table 3. Ablation experiment results of each module of our method on the German-Street dataset and Massachusetts Roads dataset.
Dataset | CEM | TEM | LCM | Road IoU (%) | Overall mIoU (%)
German-Street | ✓ | | | 50.76 | 71.71
German-Street | | ✓ | | 49.77 | 71.33
German-Street | ✓ | ✓ | | 52.4 | 72.42
German-Street | ✓ | ✓ | ✓ | 53.24 | 73.05
Massachusetts | ✓ | | | 46.69 | 71.86
Massachusetts | | ✓ | | 45.98 | 71.27
Massachusetts | ✓ | ✓ | | 48.07 | 72.52
Massachusetts | ✓ | ✓ | ✓ | 49.71 | 73.27
A ✓ symbol indicates that the module has been added.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
