A Review of Deep Learning-Based Methods for Road Extraction from High-Resolution Remote Sensing Images

Abstract: Road extraction from high-resolution remote sensing images has long been a focal and challenging research topic in the field of computer vision. Accurate extraction of road networks holds extensive practical value in various fields, such as urban planning, traffic monitoring, disaster response and environmental monitoring. With rapid development in the field of computational intelligence, particularly breakthroughs in deep learning technology, road extraction technology has made significant progress and innovation. This paper provides a systematic review of deep learning-based methods for road extraction from remote sensing images, focusing on analyzing the application of computational intelligence technologies in improving the precision and efficiency of road extraction. According to the type of annotated data, deep learning-based methods are categorized into fully supervised learning, semi-supervised learning, and unsupervised learning approaches, each further divided into more specific subcategories. They are comparatively analyzed based on their principles, advantages, and limitations. Additionally, this review summarizes the metrics used to evaluate the performance of road extraction models and the high-resolution remote sensing image datasets applied for road extraction. Finally, we discuss the main challenges and prospects for leveraging computational intelligence techniques to enhance the precision, automation, and intelligence of road network extraction.


Introduction
Road networks are a fundamental component of urban and rural infrastructure, playing a crucial role in promoting economic development and improving the quality of life for residents. Accurate extraction of road information holds significant practical application value in various fields, including urban planning [1][2][3], traffic monitoring [4][5][6], disaster emergency response [7][8][9], and environmental monitoring [10][11][12]. With the continuous advancement of remote sensing technology, we now have access to a greater amount of clear image data [13]. The acquisition cycle for high-resolution remote sensing images is becoming shorter, which offers a rich dataset for the automatic extraction of roads [14].
High-resolution images capture finer details, and additional color bands increase the data volume. This requires more computing power and more efficient algorithm design. In addition, high-resolution images can introduce more noise and error.
Specifically, the application of high-resolution remote sensing imagery in road extraction has garnered widespread attention in recent years. However, the complexity of the ground information introduces noise from trees, buildings, vehicles, and spectral variances [15]. To address these challenges, researchers have designed various computational intelligence-based methods for road extraction.
This review utilizes the Google Scholar database, employing "road extraction" and "remote sensing" as keywords to filter relevant literature from 2012 onward. Recent review articles summarize road extraction techniques in remote sensing imagery based on different classification criteria. For instance, based on road features and selected road models, Wang et al. [16] categorized road extraction methods into clustering, knowledge-based, morphological, active contour models, and dynamic programming. Lian et al. [17] divided them into heuristic and data-driven methods based on design principles and data types. Chen et al. [18] classified methods according to whether they use 2D earth observing images or 3D LiDAR point clouds. In 2D optical images, road targets are divided into road areas and road lines. In 3D point clouds, road extraction methods are categorized into MLS-based, ALS-based, and TLS-based methods. Two-dimensional optical images are characterized by their low cost and mature technical research. With the exceptional performance of deep learning in various fields, the interest in its methodological research has significantly surpassed that of traditional techniques [19]. Therefore, this review focuses on the application of deep learning in extracting road information from 2D high-resolution optical images.
According to the different deep learning models used, Abdollahi et al. [20] classified the approaches into those that utilize GAN, deconvolution, FCN, and patch-based CNN models. However, with the emergence of methods based on different network models, this classification approach has become insufficiently detailed. Pruthi et al. [21] divided the task of road extraction into four categories based on road features and target extraction: edge extraction, centerline extraction, surface extraction, and their combinations. Nonetheless, our analysis shows that most networks concentrate on road surface extraction. Liu et al. [22] and Mo et al. [23] classified methods into fully supervised learning, semi-supervised learning, and unsupervised learning based on the type of data annotation and learning approach.
Integrating the advantages of the above reviews and analyzing their deficiencies, this review categorizes deep learning methods for road extraction from high-resolution remote sensing images based on the type of data annotation into fully supervised learning, semi-supervised learning, and unsupervised learning. Fully supervised methods are divided into six types based on network models: Patch-CNN, Encoder-Decoder, GAN, Graph, Transformer, and Mamba. Based on the different annotated data, semi-supervised methods are divided into methods based on limited labeled data and methods based on weakly labeled data. Unsupervised methods are divided into those based on models with fewer parameters and those based on large remote sensing models. Figure 1 illustrates the specific categorization of deep learning methods for road extraction in remote sensing images, and Figure 2 displays the organization of chapters in this review.

Background
In this section, we discuss the development of deep learning-based methods in the field of computer vision, particularly focusing on their application in extracting road networks from high-resolution remote sensing images. Figure 3 displays a roadmap of the methods used in the relevant literature, and Figure 4 illustrates the sources of literature used in this review. Road extraction methods typically classify each pixel of the image as "road" or "non-road" [116]. Road areas are often obscured by surrounding objects such as cars, buildings, and trees. Although these obstructions make road area identification more complex, they also provide valuable background information that aids in identifying road areas in complex scenarios.
The development of road extraction has three main stages: morphological feature-based, manual feature-based, and deep learning approaches [117]. Initial traditional methods are often costly in terms of time and resources, and their reliance on manual analysis tends to restrict their accuracy [22]. As deep learning technology advances, methods that leverage it are progressively being refined, leading to ongoing improvements in both the precision and efficiency of road extraction. Therefore, this section focuses on the development of deep learning methods in the field of road extraction. In 2011, convolutional neural networks (CNNs) showed initial success in optical character recognition (OCR) tasks, and subsequently, in the 2012 ImageNet competition, Hinton et al. [24] achieved remarkable results using a CNN. With the improvement of several open-source platforms and the open-sourcing of models, CNN-based methods began to show better results in various image-related tasks. Given the focus of our review on deep learning, we collected all relevant literature from 2012 to the present. The earliest neural network-based road extraction method in the past decade was proposed by Yuan et al. [118], who designed a network called LEGION, which emphasizes local information while suppressing global information. However, there was a gap in research on road extraction methods based on deep learning between 2011 and 2017, with few related works emerging during this period. Starting from 2017, a substantial number of deep learning-based algorithms have been applied in the field of road extraction. Recently, the notion of large remote sensing models has emerged. These models leverage deep learning algorithms and extensive remote sensing data collections to markedly improve the capabilities in tasks like road extraction.

Fully Supervised Methods for Road Extraction
In this section, fully supervised methods are categorized into five categories based on the different backbones used, including those based on Patch-CNN, Encoder-Decoder, GAN, Graph, and Transformer.

Methods Based on Patch-CNNs
The process of road extraction from remote sensing images using a patch-based CNN model primarily involves several steps. First, the image is preprocessed and segmented into patches. Then, the patches are fed into CNNs to extract features and identify the patches containing road information. Finally, the road patches are aggregated and the complete road network is output. Figure 5 illustrates the general architecture of a patch-based CNN model. SAR images exhibit distinctive geometric and scattering properties, offering landmark information that differs from that of optical images [119]. Popescu et al. [25] proposed a combined radiometric/structure-driven method based on spectral descriptors for SAR images. Specifically, they utilized a feature extraction approach using 200 × 200 pixel image patches to recognize targets. Li et al. [26] introduced a CNN-based framework. Initially, a CNN model was employed to extract road features from small patches of SAR images and identify candidate road areas. Subsequently, an enhanced Radon transform was applied to group the candidate roads, followed by the utilization of a Markov random field (MRF) for global road network connectivity.
Alshehhi et al. [120] presented a patch-based CNN model that extracts road and building areas from remote sensing images. The model uses fully connected layers and simple linear iterative clustering (SLIC) to enhance features and refine results. Unlike the method that integrates features from both low and high layers, as presented in [120], Chen et al. [121] proposed a coarse-to-fine road extraction strategy that integrates gray-value distribution and structural features. It employs a local Dirichlet mixture model for initial segmentation and a high-order deep-learning approach to capture road context.
Saito et al. [27] devised a novel output function called channel-wise inhibited softmax (CIS) to effectively train the network. Sun et al. [28] designed experiments to analyze the influence of different patch sizes and input image resolutions on segmentation accuracy and proposed a multi-scale collective fusion (MSCF) method to extract information from multiple resolutions.
In contrast to the aforementioned methods for surface extraction, the extraction of road centerlines is also a typical task in road extraction. Li et al. [122] employed a CNN model based on 32 × 32 patches to extract road centerlines from high-resolution remote sensing images. They combined common image processing operators to obtain the road centerlines and designed line integral convolution (LIC) to optimize the extracted road network. Differing from [122], Liu et al. [123] proposed a four-stage approach for road centerline extraction, where road centerlines are extracted using Gabor filtering models and multi-directional non-maximum suppression methods.
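The three steps of the patch-based pipeline (split, classify, aggregate) can be illustrated with a minimal sketch. It is not taken from any of the cited methods: the function names are hypothetical, the image is synthetic, and `toy_patch_classifier` is a brightness threshold standing in for a trained CNN.

```python
import numpy as np

def split_into_patches(image, patch):
    """Cut an (H, W, C) image into non-overlapping patch x patch tiles."""
    h, w = image.shape[:2]
    tiles, origins = [], []
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            tiles.append(image[r:r + patch, c:c + patch])
            origins.append((r, c))
    return np.stack(tiles), origins

def toy_patch_classifier(tiles, threshold=0.5):
    """Hypothetical stand-in for a trained CNN: flags bright patches as road."""
    return tiles.mean(axis=(1, 2, 3)) > threshold

def aggregate(labels, origins, shape, patch):
    """Paint each patch-level decision back into a full-size binary mask."""
    mask = np.zeros(shape, dtype=bool)
    for is_road, (r, c) in zip(labels, origins):
        if is_road:
            mask[r:r + patch, c:c + patch] = True
    return mask

# Synthetic 8x8 RGB image with a bright horizontal band standing in for a road.
image = np.zeros((8, 8, 3))
image[4:8, :, :] = 1.0
tiles, origins = split_into_patches(image, patch=4)
labels = toy_patch_classifier(tiles)
road_mask = aggregate(labels, origins, image.shape[:2], patch=4)
```

Because every pixel in a patch inherits the same label, the output resolution is tied to the patch size, which is one reason the patch size matters for segmentation accuracy.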

Methods Based on Encoder-Decoder
The Encoder-Decoder network architecture is a type of deep learning model designed to efficiently extract object information from input images. Recently, it has been the most commonly used semantic segmentation model in road extraction tasks from remote sensing images. The encoder is typically a pre-trained classification network used to extract features from input remote sensing images, transforming them into high-dimensional feature representations. The decoder part, combined with the features extracted by the encoder, restores the feature map size using upsampling techniques and then reconstructs the road network label map as output. Figure 6 illustrates the general architecture of an Encoder-Decoder model. As most fully supervised methods in recent years have been implemented based on this structure, for clarity, this section further categorizes them into five groups based on the variations in their decoders.

Methods Based on FCNs
Long et al. [29] introduced the fully convolutional network (FCN) in 2015, which replaces the fully connected layers of existing classification networks with convolutional layers. Unlike patch-based CNN models, an FCN makes end-to-end predictions on images. It accepts images of any size. The decoder of an FCN uses bilinear interpolation filters to restore the feature map to the same size as the input image. It retains spatial information and represents the membership relationship among pixels.
Subsequently, several methods improved on FCNs for the task of road extraction. First, these methods standardize remote sensing images to fit network inputs. Then, an FCN is utilized for layer-wise feature extraction and output generation. Finally, post-processing is applied to improve the accuracy. According to the different methods used by the network, the classifications are elaborated as follows.
(1) Feature Fusion. Zhong et al. [124] designed a model that integrates low-level semantic information with high-level semantic information. It also adds the output of pooling layers to the final score layer to enhance the overall accuracy of the model. Fu et al. [125] proposed an improved FCN model, which is divided into segmentation and classification stages. It primarily fuses multi-scale features of roads by designing skip connections.
(2) Different Loss Functions. Wei et al. [30] introduced an RSRCNN, which constructs a unique road structure loss function. It was the first CNN-based method for aerial image road extraction to use a structure-based loss, which is built on the minimum Euclidean distance. Henry et al. [31] devised FCN-8s with class-weighted mean squared error (MSE) loss and control parameters for model spatial tolerance to improve network performance. To address the issue of sample imbalance between the road and background in aerial images, Zhang et al. [32] proposed an ensemble method based on an FCN with spatial consistency (SC). The main idea of this method is to increase the weight of misclassified pixels. Li et al. [33] developed a noise probability model named RDNN to tackle the problem of noise in training data. The purpose was to leverage the relationship between input images, noisy labels, and true labels to learn from the noisy data. RDNN trains effectively on the noisy dataset using a loss function based on regularization methods.
(3) Data Augmentation. Chen et al. [34] suggested an improved CNN named MCN-NTL. This network employs data augmentation, transfer learning, data preprocessing, and backpropagation algorithms to enhance road extraction accuracy. To address the challenge of low accuracy in extracting unpaved and narrow roads, Babaali et al. [35] designed DAA-SSEG. It utilizes a novel data augmentation technique based on geometric transformation and image refinement.
(4) Innovative Architecture. Varia et al. [126] employed the FCN-32 variant and GAN for road extraction from UAV remote sensing datasets. Kestur et al. [36] introduced UFCN, a U-shaped FCN architecture characterized by symmetric convolution and deconvolution operations with skip connections to retain local information. This model is similar to the UNet discussed in Section 3.2.2. Chen et al. [127] introduced CR-HR-RoadNet, which fuses local and global information for comprehensive road network analysis. It has a specialized encoder for detail retention and uses multi-scale, residual learning for spatial detail extraction. A compact coordinate attention module enhances global context awareness and infers relationships between segments.
(5) Training Speed Enhancement. Zhang et al. [128] proposed an MFFCN for road extraction in mountainous remote sensing images. MFFCN builds on the FCN and removes six convolution layers to improve training speed. Similarly, to boost the efficiency of the model, Pan et al. [129] proposed an automatic road centerline extraction method based on an FCN, using atrous convolution instead of pooling layers.
(6) Multi-Output Network. In the field of road extraction, there are three typical tasks: road surface segmentation, road centerline extraction, and road edge detection [130]. Some studies can simultaneously obtain two or more outputs through a multi-output network. Wei et al. [131] introduced a framework for simultaneous road surface and centerline extraction. It employs an FCN for initial segmentation and refines details through the iterative application of a lightweight FCN. The method utilizes a multi-seed point-tracking mechanism for road tracking and integrates segmentation and tracking to generate the final road network. Liu et al. [37] designed RoadNet, a multitask CNN that simultaneously predicts road surfaces, edges, and centerlines. It employs a specially designed cascaded network for learning multi-scale features by end-to-end training.
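To make the size-restoration step concrete, the following sketch upsamples a coarse score map with plain bilinear interpolation. It is an illustration only: it uses fixed interpolation with an align-corners convention (an assumption), whereas an FCN decoder applies such filters inside the network, and the function name is hypothetical.

```python
import numpy as np

def bilinear_upsample(feat, out_h, out_w):
    """Resize a 2D map to (out_h, out_w) with align-corners bilinear interpolation."""
    in_h, in_w = feat.shape
    # Map each output coordinate back to a fractional input coordinate.
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]   # vertical blend weights
    wx = (xs - x0)[None, :]   # horizontal blend weights
    top = feat[y0][:, x0] * (1 - wx) + feat[y0][:, x1] * wx
    bottom = feat[y1][:, x0] * (1 - wx) + feat[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

coarse = np.array([[0.0, 1.0],
                   [2.0, 3.0]])
restored = bilinear_upsample(coarse, 3, 3)
# restored is [[0, 0.5, 1], [1, 1.5, 2], [2, 2.5, 3]]
```

Each output value is a weighted average of its four nearest input values, so the restored map is smooth but cannot recover detail that was lost during downsampling.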
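The class-weighted losses discussed under (2) address the imbalance between the few road pixels and the many background pixels. The sketch below shows one simple form of a class-weighted MSE; the weight values and the normalization are assumptions chosen for illustration and are not the formulation of any cited paper.

```python
import numpy as np

def class_weighted_mse(pred, target, road_weight=5.0, background_weight=1.0):
    """Mean squared error in which road pixels count more than background pixels."""
    weights = np.where(target > 0.5, road_weight, background_weight)
    return float(np.sum(weights * (pred - target) ** 2) / np.sum(weights))

target = np.array([[0.0, 0.0, 1.0, 0.0]])
miss_road = np.array([[0.0, 0.0, 0.0, 0.0]])    # fails on the single road pixel
false_alarm = np.array([[1.0, 0.0, 1.0, 0.0]])  # one background pixel marked as road

loss_miss = class_weighted_mse(miss_road, target)     # 5 / 8 = 0.625
loss_false = class_weighted_mse(false_alarm, target)  # 1 / 8 = 0.125
```

Both predictions get exactly one pixel wrong, yet missing the road pixel is penalized five times more, which discourages the trivial all-background solution.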

Methods Based on UNet
The UNet, proposed by Ronneberger et al. [38] in 2015, consists of a downsampling path for capturing context and an upsampling path for precise localization, both structured symmetrically resembling a "U". The UNet model employs convolution operations in the upsampling path to reconstruct the details and structures of images. Since the majority of recent approaches are based on this architecture, they are further divided into several categories according to the specific techniques they utilize.
(1) Tailored Loss Function. Mosinska et al. [132] designed an iterative refinement method for road topology extraction. It uses a novel loss function to identify high-order topological features of roads. He et al. [133] introduced a structural similarity (SSIM) loss function to refine extraction clarity. Ding et al. [39] proposed DiResNet with a loss function utilizing angular operators for directional mapping based on road direction. Constantin et al. [134] merged the UNet and atrous convolution architectures, using binary cross-entropy (BCE) and Jaccard distance in their loss functions. Buslaev et al. [135] combined the intersection over union (IoU) with BCE. Xin et al. [136] suggested a Dense-UNet model with a weighted loss function to emphasize foreground pixels and improve precision. Qi et al. [40] developed DSCNet, a U-shaped network structure-based model, incorporating a continuity constraint loss function derived from persistent homology for enhanced topological continuity extraction.
(2) Multi-scale Contextual Information Fusion. Li et al. [41] proposed an HCN that consists of three subnets that extract features of roads at different granularities. Then, a shallow convolutional subnet is used to integrate them. Zhu et al. [42] introduced GCB-Net, incorporating a global context-aware (GCA) block into the network to capture global contextual information of roads. To better utilize spatial information, Tan et al. [137] utilized scale-sensitive and fusion modules to merge multi-scale information and learn the weight tensors of features. Hu et al. [138] offered DCANet comprising a discriminative context-aware feature module, which not only captures contextual information but also aggregates local information at multiple scales. RCFSNet, designed by Yang et al. [43], consists of MSCE and FSFF modules to enhance the feature representation of roads. Gao et al. [44] proposed an improved deep residual CNN named RDRCNN. It consists of a residual connected unit (RCU) and a dilated perception unit (DPU).
Wu et al. [45] devised a DGRN aimed at improving the utilization of spatial information. The model incorporates a dense global spatial pyramid pooling (DGSPP) module based on ASPP to capture contextual information. Doshi et al. [139] summarized three approaches they employed in the 2018 DeepGlobe Road Extraction Challenge. The first model maintained a constant number of 128 feature maps throughout the entire network. This enabled the model to tolerate a reduction in representational power within the encoder, as the presence of skip connections allowed the decoder to access low-level features. Zhang et al. [140] combined the advantages of residual learning and UNet to propose a new network for road extraction. The rich skip-connection structures within the model facilitated information propagation and enhanced performance while reducing parameters. Hong et al. [46] advocated a road centerline extraction method named Road-RCF, which is based on richer convolutional features (RCF). The RCF model processes the entire image to obtain high-level semantic information. It then leverages complementary information from different convolutional layers for precise extraction of road networks. Wang et al. [141] designed a feature extraction algorithm called dual feature fusion (DFF), which is based on context fusion and self-learning sampling. This method can suppress redundant features. Furthermore, they proposed a dense feature convolutional network (DFC-UNet).
(3) Diverse Attention Mechanisms. Xu et al. [47] proposed the GL-Dense-UNet for extracting roads of different widths. The model includes feature attention blocks to extract local and global information. Dong et al. [48] put forward BMDANet, which combines cross-layer information exchange with the block multi-dimensional attention (BMDA) module. Akhtarmanesh et al. [142] utilized both hard attention and soft attention to assist in designing an improved UNet. Xiao et al. [49] recommended RATT-UNet to extract mine roads, which incorporates a RATT module that integrates residual connections and attention to reduce parameters. Dai et al. [50] advocated RADANet, which includes a road augmentation module (RAM) and a deformable attention module (DAM) to obtain multi-scale semantic information. Mei et al. [51] designed CoANet, which includes a connectivity attention module (CoA) to predict the connectivity of the eight pixels adjacent to a given pixel. Utilizing the spectral representation of images, Yang et al. [52] put forward AFUNet with modulation learning (MoL) for modulating spectral features across different granularities. Patil et al. [53] introduced Tiny-AAResUNet, a method that combines the advantages of self-attention mechanisms and the residual UNet architecture to achieve higher accuracy and capture long-range dependencies.
(4) Specialized Network Architecture. Wang et al. [54] introduced the dual decoder UNet (DDUNet), incorporating a novel dilated convolution attention module (DCAM) that facilitates the fusion of multi-scale features between the encoder and decoder. Similar to DDUNet [54], Wang et al. [143] integrated the squeeze-and-excitation mechanism into a small decoder to extract the information for roads. This information is then passed to another standard decoder, which refines the contextual understanding of the road network. Luo et al. [55] introduced AD-RoadNet, an auxiliary decoding network for road extraction. It mainly comprises the hybrid receptive field module (HRFM) and the topological feature representation module (TFRM) to better utilize road details.
Xu et al. [144] offered a road extraction method leveraging the advantages of UNet on top of the deep residual network. It introduces a multitask network to handle remote sensing images at different scales. Fan et al. [145] presented a deep residual-based U-shaped network model to address the problem of existing methods ignoring high-dimensional features in remote sensing images.
(5) Lightweight Model. Sun et al. [56] addressed the challenge of excessive parameters in the existing models by introducing LRSR-net. This model utilizes an expanded joint convolution module to mitigate the loss associated with pooling layers and to reduce the number of parameters. Sultonov et al. [57] designed two lightweight networks for road network extraction from UAV images. They integrated UNet, depth-wise separable convolutions, ConvMixer layers, and initialization modules. Han et al. [58] introduced a lightweight target-aware network named LOANet. The encoder of LOANet utilizes a lightweight, dense connection network.
(6) Road Topology Focus. Hao et al. [59] proposed a geometric-aware deep recurrent neural network called Geo-DRNN for hyperspectral classification. This network is built on the foundation of UNet and recurrent neural networks (RNN). Additionally, the model introduces a Net-Gated GRU and geometric-aware ResNet loss to better encode complex geometric shapes. Ge et al. [60] introduced deep FR TransNet, which was designed to improve the learning capabilities of road contours. The encoder incorporates a novel deep feature review (FR) module, which learns the contour features of roads to minimize road fragmentation resulting from weight parameter loss. Qiu et al. [61] presented a dual-branch semantic-geometric framework named SGNet. The semantic-dominant branch collects dense semantic information about roads from the input, while the geometric-dominant branch generates sparse boundary features of the image. Finally, the information generated by the two branches is adaptively fused. Shao et al. [146] designed MCTN-Net, which is capable of recognizing railways, roads, sidewalks, and bridges. The network employs a dense feature-sharing encoder (DFSE) to extract directional and semantic features. These features are integrated into the orientation-guided stacking module (OGSM) to enhance connectivity detection.
(7) Multi-source Fusion. Luo et al. [62] combined LiDAR images with high-resolution images to build a dual-encoder cross-modal complementary network named DECCFNet. The encoder includes a cross-modal feature fusion (CMFF) module designed to blend features from different sources. Furthermore, a multi-direction strip convolution (MDSC) module was created to help the network concentrate more sharply on road features. Wang et al. [147] designed the DelvMap framework, which leverages delivery courier paths and satellite data to generate complete road maps. The framework operates in two steps. It first uses the dual signal fusion network (DSFNet) to create an inferred map by merging both types of data and then applies a map completion algorithm to integrate this inferred map with the existing road map, effectively filling in any missing details.
(8) Multi-Output Network. Cheng et al. [63] highlighted the significance of road detection and centerline extraction. They proposed CasNet, a cascaded CNN that addresses both tasks concurrently. It is composed of a main sub-network designed for efficient road detection, complemented by a secondary sub-network that utilizes the feature maps generated by the primary sub-network to delineate road centerlines. The model employs a refinement algorithm to enhance the centerline output. CasEANet, an improvement of CasNet designed by Liu et al. [148], introduces an edge perception module (ESM) and an attention module (AM) to refine road edges and enhance global contextual information. Lin et al. [149] presented a dual-task CNN adapted to road shape and scale variations. This network includes a residual encoder and is equipped with a multi-scale, multi-direction strip convolutional module (MSMD-SCM) within the decoder to improve the accuracy of road extraction. Additionally, Liu et al. [150] developed LRDNet, a lightweight road detection method. It incorporates a multi-scale convolutional attention network (MSCAN) and a coupled decoder head. This design aims to achieve efficient detection and smooth edge output, addressing efficiency and connectivity issues in occluded scenes. Guo et al. [151] proposed CRIN, which extracts roads and buildings concurrently through their complementary relationship. The model features an MTI module for task-specific information exchange and a CSI module for learning varying receptive fields across different structures.
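The interaction of the two paths can be traced with a shape-level sketch. It deliberately omits all learned convolutions and keeps only one pooling step, one upsampling step, and the skip connection that concatenates encoder and decoder features; the helper names and tensor sizes are illustrative assumptions.

```python
import numpy as np

def downsample(x):
    """2x2 max pooling on a (C, H, W) tensor with even H and W."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def upsample(x):
    """Nearest-neighbour 2x upsampling on a (C, H, W) tensor."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

encoder_feat = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
bottleneck = downsample(encoder_feat)    # (2, 2, 2): more context, coarser grid
decoder_feat = upsample(bottleneck)      # (2, 4, 4): resolution restored, detail lost
# Skip connection: stack the untouched encoder features onto the decoder features.
merged = np.concatenate([encoder_feat, decoder_feat], axis=0)  # (4, 4, 4)
```

The concatenated tensor carries both the fine spatial detail of the encoder and the coarser context of the decoder, which is what later convolutions in the upsampling path would combine.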
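Several of the losses listed under (1) combine a pixel-wise term with an overlap term. A minimal sketch of a BCE plus soft Jaccard loss follows; the equal weighting, the smoothing constant, and the function names are assumptions, not the exact formulations of the cited works.

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy averaged over all pixels."""
    p = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(p) + (1 - target) * np.log(1 - p)))

def soft_jaccard_distance(pred, target, eps=1e-7):
    """1 - soft IoU, computed on probabilities so that it stays differentiable."""
    intersection = np.sum(pred * target)
    union = np.sum(pred) + np.sum(target) - intersection
    return float(1.0 - (intersection + eps) / (union + eps))

def combined_loss(pred, target, jaccard_weight=1.0):
    """Pixel-wise term plus region-overlap term."""
    return bce(pred, target) + jaccard_weight * soft_jaccard_distance(pred, target)

target = np.array([[0.0, 1.0, 1.0, 0.0]])
good = np.array([[0.1, 0.9, 0.8, 0.1]])
bad = np.array([[0.6, 0.4, 0.3, 0.7]])
loss_good = combined_loss(good, target)
loss_bad = combined_loss(bad, target)
```

The BCE term treats every pixel independently, while the Jaccard term scores the overlap of the whole predicted road region, so it is less dominated by the large background class.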

Methods Based on FPNs
In 2017, Lin et al. [64] proposed the feature pyramid network (FPN), which is a framework with lateral connections that operates in a top-down manner. Its introduction was primarily aimed at improving feature fusion.
For example, Gao et al. [65] proposed a network called the multi-feature pyramid network (MFPN) for road extraction. The MFPN utilizes feature pyramids and an improved pyramid pooling module to extract multi-level semantic features of roads. In the optimization phase, a weighted, balanced loss function is implemented to tackle the issue of significant variance in pixel distribution between roads and the background within images. Yu et al. [152] designed a new model called CS-CapsFPN, which integrates context enhancement techniques with self-attention capsule feature pyramid networks to enhance the representational capacity of features. The model primarily enhances the representation of road features by extracting and fusing higher-order capsule features from various levels and scales.
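The top-down pathway with lateral connections can be sketched in a few lines. The feature map sizes and the random 1×1 lateral weights below are hypothetical, and nearest-neighbour upsampling is used for simplicity; this shows the general mechanism, not MFPN or CS-CapsFPN.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling on a (C, H, W) tensor."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def lateral(x, weight):
    """1x1 convolution: mixes channels at every pixel, (C_in, H, W) -> (C_out, H, W)."""
    return np.einsum("oc,chw->ohw", weight, x)

def top_down(features, laterals):
    """Merge backbone maps from coarsest to finest, as in a feature pyramid."""
    merged = [lateral(features[-1], laterals[-1])]
    for feat, w in zip(features[-2::-1], laterals[-2::-1]):
        merged.append(lateral(feat, w) + upsample2x(merged[-1]))
    return merged[::-1]  # finest level first, matching the input order

rng = np.random.default_rng(0)
# Hypothetical backbone outputs: channels grow as resolution halves.
features = [rng.normal(size=(8, 16, 16)),
            rng.normal(size=(16, 8, 8)),
            rng.normal(size=(32, 4, 4))]
laterals = [rng.normal(size=(4, f.shape[0])) for f in features]
pyramid = top_down(features, laterals)
```

Every level of the resulting pyramid has the same channel count, and each finer level contains semantic information propagated down from the coarser ones.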

Methods Based on SegNet
In 2017, Badrinarayanan et al. [66] devised SegNet, an innovative and practical network structure based on FCN. The architecture comprises an encoder, a decoder, and a pixel-wise classification layer. In contrast to FCN, the decoder of SegNet implements non-linear upsampling by leveraging pooling indices calculated during the max-pooling steps of its encoder, which minimizes the additional overhead associated with learning upsampling modules.
Panboonyuen et al. [67] utilized SegNet as the backbone and designed DCED. The model replaces the commonly used rectified linear unit (ReLU) with an exponential linear unit (ELU) to enhance network accuracy. Furthermore, landscape metric thresholds are applied to eliminate excessively detected roads. The same group of authors proposed an enhanced version of SegNet in [68], drawing parallels with DCED [67] by employing ELU activation functions and landscape metric thresholds. Distinct from DCED [67], their approach introduces a conditional random field (CRF) to hone the road network extraction by considering the low-level information gleaned from the local interactions between pixels and edges.
To confront challenges like indistinct object boundaries, erroneous classifications, and irregularities, Zhao et al. [69] proposed a model called DANet, utilizing two atrous spatial pyramid pooling (ASPP) structures for multi-scale feature fusion. Akhtar et al. [70] replaced the basic convolution blocks with dense residual blocks to achieve context information fusion and employed geometric shape analysis to filter out non-road segments after segmentation.
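The index-based upsampling can be illustrated on a single channel. In this sketch the encoder's 2×2 max pooling records where each maximum came from, and the decoder places the pooled values back at those positions; the function names are hypothetical and the convolutions that follow unpooling in a real decoder are omitted.

```python
import numpy as np

def max_pool_with_indices(x):
    """2x2 max pooling on an (H, W) map; also returns the flat argmax positions."""
    h, w = x.shape
    blocks = (x.reshape(h // 2, 2, w // 2, 2)
               .transpose(0, 2, 1, 3)
               .reshape(h // 2, w // 2, 4))
    local = blocks.argmax(axis=-1)                  # position inside each 2x2 window
    rows = 2 * np.arange(h // 2)[:, None] + local // 2
    cols = 2 * np.arange(w // 2)[None, :] + local % 2
    return blocks.max(axis=-1), rows * w + cols

def max_unpool(pooled, indices, out_shape):
    """Place each pooled value back at its recorded position; the rest stays zero."""
    out = np.zeros(out_shape, dtype=pooled.dtype).ravel()
    out[indices.ravel()] = pooled.ravel()
    return out.reshape(out_shape)

x = np.array([[1., 2., 0., 0.],
              [3., 9., 0., 7.],
              [4., 0., 5., 0.],
              [0., 0., 0., 6.]])
pooled, idx = max_pool_with_indices(x)
restored = max_unpool(pooled, idx, x.shape)
```

Only the indices need to be stored, not the full encoder feature maps, and the unpooling step itself has no learnable parameters.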
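The difference between the two activation functions is easiest to see side by side. The sketch below uses the standard definitions with the common default alpha = 1, which is an assumption here since the source does not state the value used in DCED.

```python
import numpy as np

def relu(x):
    """Zero for negative inputs, identity for positive ones."""
    return np.maximum(x, 0.0)

def elu(x, alpha=1.0):
    """Identity for positive inputs, smooth saturation towards -alpha for negative ones."""
    # np.minimum keeps expm1 from overflowing on large positive inputs.
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

x = np.array([-3.0, -1.0, 0.0, 2.0])
relu_out = relu(x)
elu_out = elu(x)
```

Unlike ReLU, ELU keeps a non-zero output and gradient for negative inputs, so units are not switched off entirely when their pre-activation is negative.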

Methods Based on LinkNet
While current approaches predominantly concentrate on enhancing model accuracy, they frequently neglect the aspect of model efficiency.Therefore, Chaurasia et al. [71] introduced LinkNet in 2017, a model specifically designed for semantic segmentation.Drawing insights from UNet, LinkNet achieves feature learning without substantially increasing parameters, ensuring both speed and precision.Specifically, ResNet18 replaces commonly used encoders like ResNet101 and VGG16.Unlike UNet, LinkNet directly transfers the extracted features of the encoder to the decoder, bypassing pooling or stride convolutions.This refined approach accelerates the process while maintaining feature richness and the accuracy of the outcomes.
In 2018, Zhou et al. [72] introduced D-LinkNet, a variant of LinkNet, by integrating cascaded stacked dilated convolutions into its central layers.This modification enables the network to achieve a larger receptive field while preserving the high resolution of the feature maps.
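The effect of cascading dilated convolutions on the receptive field can be checked with simple arithmetic: each stride-1 layer with kernel size k and dilation d widens the receptive field by (k - 1) · d pixels. The dilation schedule 1, 2, 4, 8 used below is an assumption for illustration, since the source does not list the rates.

```python
def receptive_field(dilations, kernel_size=3):
    """Receptive field (in pixels, per axis) of stride-1 convolutions stacked in series."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

plain = receptive_field([1, 1, 1, 1])    # four ordinary 3x3 layers -> 9
dilated = receptive_field([1, 2, 4, 8])  # assumed dilation schedule -> 31
```

With the same number of layers and parameters, the dilated stack covers 31 pixels per axis instead of 9, and because no pooling or striding is involved, the feature map resolution is unchanged.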
(1) Design of Different Modules.Li et al. [73] enhanced D-LinkNet and devised D-LinkNetPlus by incorporating a bottleneck layer and ESIPs to reduce parameters and remove isolated blocks.Xie et al. [74] introduced HsgNet, which employs bilinear pooling in the intermediate module to capture global context.Deng et al. [75] designed SPD-LinkNet with strip pooling, considering large receptive fields and distant contextual information.Wang et al. [76] proposed FE-LinkNet to handle occlusions with a modified DP-Block for the multi-scale context.Wulamu et al. [153] presented a UNet-based network equipped with ASPP and a LinkNet-like decoder.Lu et al. [77] developed a global-aware deep network (GAN) featuring a spatial-aware module (SAM) and a channel-aware module (CAM) for road detection.Jie et al. [78] constructed MECA-Net, an enhanced LinkNet that integrates multi-scale encoding and long-range context for remote sensing road images.
To address the issue of roads in high-resolution remote sensing images being easily confused with surrounding terrain and susceptible to interference from non-road features, Wu et al. [79] introduced NL-DLinkNet. This model incorporates non-local blocks into the D-LinkNet encoder to capture long-distance dependencies among features in the satellite imagery. Wang et al. [80] also proposed a D-LinkNet variant named NL-LinkNet with non-local blocks for road extraction from high-resolution satellite images.
(2) Multi-source Data Fusion.Sun et al. [81] merged crowdsourced GPS data and aerial images to improve road extraction.Their model employs novel techniques such as data augmentation, GPS rendering, and 1D transpose convolution to enhance network performance.Liu et al. [82] proposed a cross-modal message propagation network (CMMPNet) by leveraging aerial images and crowdsourced trajectory data.Zhang et al. [83] introduced FND-Linknet, which merges DLinkNet with filter response normalization (FRN) layers.It also applies transfer learning from multi-source road datasets to enhance the precision of road extraction.To address the issue of the time-consuming and labor-intensive process of obtaining a large dataset with precise annotations, Zhang et al. [154] proposed a method that utilizes the GPS trajectories of floating cars as the training set.
(3) Attention Mechanisms. Wu et al. [84] recommended a dual attention network (DA-LinkNet) that combines the advantages of D-LinkNet and dual attention mechanisms. To better integrate features from different branches and reduce information loss, an attention feature fusion module is used to replace skip connections. Li et al. [155] introduced a D-LinkNet-based cascaded network designed to enhance the precision of road boundary detection. The network leverages spatial attention residual blocks across various scales to maintain long-range dependencies, while channel attention mechanisms are employed to refine the integration of features. Ai et al. [85] applied variance and the coefficient of variation to the squeeze-and-excitation (SE) mechanism, designing a multi-parameter-guided SE module named MPGSE, which was then integrated into the D-LinkNet architecture. Weng et al. [156] introduced an improved D-LinkNet that integrates an edge detection module for the purpose of detecting railway tracks. This module incorporates a channel-spatial dual attention mechanism to expand the receptive field, thereby reducing missed detections.
(4) Design of Specialized Network Architecture. Motivated by strategies employed in lane detection, Hu et al. [86] designed the location-guided network (LGNet), aimed at resolving the problem of disjointed extraction results common in segmentation techniques. They devised an auxiliary road location prediction (RLP) branch, which predicts road positions through row and column anchors. Yang et al. [87] designed RUW-Net, a dual-encoder structure network based on D-LinkNet. They introduced a decoder-encoder combination (DEC) module to connect the two networks and minimize the semantic gap.
(5) Multi-Output Network. Lu et al. [130] proposed CasMT, which can simultaneously perform road surface segmentation, centerline extraction, and edge detection. It leverages topology-aware learning and hard example mining (HEM) loss to enhance accuracy.

Methods Based on DeepLab
DeepLab v1 combines DCNN and DenseCRF, using VGG16 as the base model, and employs dilated convolutions and fully connected conditional random fields to enhance the accuracy of semantic segmentation. DeepLab v2, built upon v1, replaces VGG16 with ResNet101 as the backbone and introduces the ASPP module, providing a powerful plug-and-play module for future semantic segmentation models. DeepLab v3 validates the effectiveness of the parallel ASPP modules and, in the decoding stage, directly upsamples by a factor of 16 to obtain the output.
The latest DeepLab v3+ adopts an encoder-decoder architecture, utilizing Xception as the backbone network with fine-tuning. The decoder neither progressively restores the image size nor directly upsamples by a factor of 16 like v3. Instead, the encoder features are first upsampled by a factor of four. They are then concatenated with the corresponding low-level features that have the same spatial resolution. Finally, the features are refined using convolution and upsampled to the same size as the input image using bilinear interpolation.
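The two-stage upsampling described above can be traced as a sequence of spatial sizes. The following is a minimal sketch that only tracks shapes, assuming an encoder output stride of 16 and low-level features at stride 4; the function name and the stride parameters are illustrative and not taken from any library.

```python
def deeplab_v3plus_decoder_shapes(height, width, output_stride=16, low_level_stride=4):
    """Trace (height, width) through a DeepLab v3+-style decoder (shapes only)."""
    if height % output_stride or width % output_stride:
        raise ValueError("input size must be divisible by the output stride")
    encoder = (height // output_stride, width // output_stride)
    low_level = (height // low_level_stride, width // low_level_stride)
    # First upsampling: encoder features are brought to the low-level resolution.
    factor = output_stride // low_level_stride
    upsampled = (encoder[0] * factor, encoder[1] * factor)
    assert upsampled == low_level, "features must match before concatenation"
    # Second upsampling: refined features are restored to the input size.
    output = (upsampled[0] * low_level_stride, upsampled[1] * low_level_stride)
    return {
        "encoder": encoder,
        "after_first_upsample": upsampled,
        "low_level": low_level,
        "output": output,
    }


shapes = deeplab_v3plus_decoder_shapes(512, 512)
print(shapes)
```

With the assumed strides, both upsampling steps use a factor of four, which is why the encoder features and the low-level features share a resolution at the point of concatenation.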
Building upon the DeepLab v3 framework, Lin et al. [92] introduced nested SE-Deeplab, which incorporates the SE module to refine road network extraction. In addition, the model leverages multi-scale upsampling to integrate data from various levels. Huan et al. [157] proposed SANet with strip attention. Building upon the architecture of DeepLab, they incorporated a strip attention module (SAM) to extract contextual semantic information and spatial positional information of roads. They also added a channel attention fusion module (CAF) to fuse low-level and high-level features.
Lourenço et al. [158] presented an improved method for automatically detecting rural roads. They utilized the road network output from DeepLab v3+ and refined it using morphological methods to obtain the centerlines. Xu et al. [159] introduced P2CNet, which integrates partial maps with satellite images. The network incorporates a gated self-attention module (GSAM) to capture long-range dependencies and introduces a missing part (MP) loss function.

Methods Based on GAN
Recently, methods based on generative adversarial networks (GAN) have made significant progress in road extraction from remote sensing images. This strategy involves training a generator to produce realistic road images while simultaneously training a discriminator to distinguish between real road images and generated ones. The adversarial training process helps to enhance the accuracy and robustness of road extraction. Figure 7 illustrates the general architecture of the GAN model. GANs are utilized in various approaches to optimize the structure of network models, thereby improving the overall performance of the model. A deep convolutional generative adversarial network (DCGAN) was proposed in [93]. Both the generator and the discriminator of a DCGAN are built from deep CNN architectures, such as UNet, SegNet, and FCN, to improve the performance of the entire network. Shi et al. [160] proposed an end-to-end GAN framework for road detection. In the generator, SegNet was employed to produce pixel-level classification results. Zhang et al. [161] offered a refined network with a simpler architecture that does not need large training datasets. The network uses an FCN as the generator and a CNN as the discriminator.
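The adversarial objective can be illustrated with a small numerical sketch. It assumes the standard binary cross-entropy formulation with a non-saturating generator loss, and treats the discriminator outputs as given probabilities; the function names are illustrative, and individual methods cited here may use different losses.

```python
import math


def bce(prob, target, eps=1e-7):
    """Binary cross-entropy for one probability and a 0/1 target."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def discriminator_loss(d_real, d_fake):
    """The discriminator should score reference road maps as 1 and generated ones as 0."""
    real_term = sum(bce(p, 1.0) for p in d_real) / len(d_real)
    fake_term = sum(bce(p, 0.0) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """The generator is rewarded when the discriminator scores its outputs as real."""
    return sum(bce(p, 1.0) for p in d_fake) / len(d_fake)


# Discriminator scores for reference maps and for generated segmentation maps.
confident = discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
undecided = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
print(confident, undecided, generator_loss([0.1, 0.2]))
```

In segmentation-oriented GANs the generator is typically also trained with a pixel-wise segmentation loss against the ground truth; that term is omitted here for brevity.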
In addition to optimizing the network architecture, efforts have been directed towards refining the loss function to strengthen relevant constraints. Pre-processing and post-processing techniques are also utilized to further enhance the effectiveness of road extraction. Gulrajani et al. [162] leveraged the Wasserstein distance in standard GANs and introduced a gradient penalty to ensure a more stable training process. Yang et al. [94] put forward E-WGAN-GP for road extraction. This network uses UNet and BiSeNet as generators, respectively. Furthermore, a spatial penalty term was added to the loss function to solve the class imbalance problem. Abdollahi et al. [95] devised a modified UNet architecture, denoted as MUNet, as the generator for generating road network segmentation maps. This network incorporates a simple pre-processing step involving edge-preserving filtering techniques. Cira et al. [163] presented a lightweight conditional GAN framework based on Pix2pix, which was designed to improve the extraction of road surface areas. The method incorporates a post-processing mechanism to enhance the precision of road extraction outcomes.
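For reference, the critic objective with gradient penalty is commonly written as follows. The notation is the standard one from the WGAN-GP formulation and is not defined elsewhere in this review: D is the critic, P_r and P_g are the real and generated distributions, and \lambda is the penalty weight.

```latex
L_D = \mathbb{E}_{\tilde{x} \sim P_g}\left[ D(\tilde{x}) \right]
    - \mathbb{E}_{x \sim P_r}\left[ D(x) \right]
    + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}
      \left[ \left( \left\lVert \nabla_{\hat{x}} D(\hat{x}) \right\rVert_2 - 1 \right)^2 \right],
\qquad
\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}, \quad \epsilon \sim U[0, 1]
```

The first two terms estimate the Wasserstein distance, while the last term pushes the gradient norm of the critic towards 1 at points interpolated between real and generated samples, which is what stabilizes training.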
To extract more comprehensive features, some multi-stream networks have been proposed.Tao et al. [164] proposed a GAN-assisted two-stream neural network to enhance the effectiveness of feature extraction.The primary stream leverages high-resolution panchromatic images to retain low-level details, while the auxiliary stream uses an unsupervised approach to extract high-level features from multispectral images.Costea et al. [165] designed DH-GAN, a model that operates in two stages involving GANs.In the first stage, a pair of GANs are trained.The first generates road segmentations and the second recognizes intersections concurrently.In the second stage, a graph optimization process based on smoothness is applied to produce the final road map.Liu et al. [166] proposed a novel model called TPEGAN, which combines a segmentation model based on road pixel enhancement with graph inference.During the process of generating pixel-enhanced images, GAN leverages the consistency among road pixels to improve the segmentation accuracy.Furthermore, the multi-scale dual-branch segmentation module employs graph inference to capture the long-range dependencies of roads.
In addition to using GANs to enhance the overall network structure, some studies focus on integrating multi-scale features to improve the accuracy of road network segmentation.When applying GANs to road segmentation, a significant challenge arises when dealing with input data of uniform resolution.In such cases, the network may overlook the interrelationships between pixels, which can lead to incomplete segmentation of road objects and discrepancies in the size and shape of the segmented objects compared to the ground truth.To address this issue, Li et al. [167] put forward a network that integrates GAN with multi-scale context aggregation.By inputting three scale images (0.5n, 1n, and 2n) into the generator, corresponding scale road extraction results are obtained with identical parameters.Lin et al. [168] presented a network designed for road extraction that leverages the combination of multi-scale information.The model integrates the ASPP module and a feature fusion module within the encoder of the generator, allowing for the effective consolidation of multi-scale features and the utilization of background information.Moreover, the generator utilizes an asymmetric encoder-decoder structure to minimize feature redundancy.Zhang et al. [96] designed MsGAN, which improves topological connectivity and spectral structure through multi-scale feature fusion.The network is designed with two discriminators, each containing four sub-discriminators that take the same image at four different scales as input, enabling the network to extract roads of varying widths.Shamsolmoali et al. [169] incorporated a feature pyramid (FP) into GAN.A FP is used to extract features which contains four divisions: feature map fusion (FMF), an optimized u-shape network (OUN), feature transportation division (FTD), and scale-wise feature concatenation (SFC).They cooperate with each other to obtain the final multistage multi-scale output features.

Methods Based on Graph
Currently, the majority of road extraction methods are built upon CNNs. Although these approaches can deliver high-quality road networks, CNN-based techniques often exhibit suboptimal performance in extracting the topological connectivity of road networks due to the inherent constraints of convolution operations. To improve the quality of road network topology, numerous methods resort to sophisticated post-processing techniques for optimization. However, the efficacy of these post-processing steps is frequently limited by the quality of the initial road segmentation results.
Consequently, preserving the topological connectivity of roads remains a significant challenge. In light of this, methods based on graph structures are gaining increased attention. In this context, the term "graph" does not refer to graph neural networks (GNN) but rather emphasizes the topological relationships among roads. Figure 8 shows the general architecture of the graph model.

Methods Based on Graph Representation
A road network can be represented by an undirected graph denoted as G(V,E), where V and E represent the set of road nodes and the set of edges between nodes, respectively. Therefore, the focus of these methods is on finding the key points that make up G and the connectivity between them. The connectivity between points is usually represented by an adjacency matrix. In graph representation-based methods, the nodes and edges that delineate the roads are typically derived from a CNN.
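As a concrete illustration of this representation, the sketch below builds the adjacency matrix of a small undirected road graph. The node coordinates and edges describe a hypothetical T-junction and are not taken from any dataset.

```python
def adjacency_matrix(num_nodes, edges):
    """Symmetric 0/1 adjacency matrix of an undirected graph G(V, E)."""
    adj = [[0] * num_nodes for _ in range(num_nodes)]
    for i, j in edges:
        adj[i][j] = 1
        adj[j][i] = 1  # undirected: connectivity holds in both directions
    return adj


# Hypothetical T-junction: node 1 is the intersection.
nodes = [(0, 0), (10, 0), (20, 0), (10, 10)]  # V as (x, y) pixel coordinates
edges = [(0, 1), (1, 2), (1, 3)]              # E as index pairs into V
adj = adjacency_matrix(len(nodes), edges)
degrees = [sum(row) for row in adj]
print(adj)
print(degrees)
```

Node degree follows directly from the row sums: endpoints have degree 1, while intersections have degree 3 or more.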
Xu et al. [170] proposed a new method for computing vector maps from remote sensing images, which is based on well-defined patched line segment (PaLiS) representations of road graphs with geometric significance.These fragments contain both the location and direction of the road.Xu et al. [171] designed csBoundary, a method that initially generates a keypoint map and subsequently utilizes AfANet to delineate the road edges by predicting the adjacency matrix of the vertices.Zao et al. [172] proposed an end-to-end road extraction approach known as Road2Graph.This method encodes road maps into a seven-dimensional representation that encompasses segmentation maps, vertex maps, midpoint maps, and their respective endpoint displacements.It refines the output by integrating multi-scale features.Finally, a decoding module is employed to recover the topological representation.
To further improve road connectivity and topology, many works combine CNN and G(V,E) to form multi-task branches to ensure the contextual semantic information of features and the connectivity of roads.For example, Li et al. [173] devised a multi-task architecture within an encoder-decoder framework to simultaneously predict the segmentation, anchor points, and connectivity maps.The latter two branches can improve road segmentation performance by enhancing road connectivity and topology.Then, the road network is constructed and simplified based on three predicted maps.Mattyus et al. [97] proposed DeepRoadMapper, a network that involves a two-step road extraction process.It initially employs a CNN to segment aerial images, followed by the generation of a graph that portrays the road topology, where nodes represent road endpoints and edges represent the curves joining them.
Beyond integration with segmentation tasks, certain methods are also designed to create a multi-branch network that encompasses additional operations such as direction extraction and node extraction. Wu et al. [174] introduced Bi-HRNet, which contains three parts: the "top-to-down" and "down-to-top" road direction prediction branches and a node heatmap prediction branch. Chen et al. [175] suggested a multi-task network which combines three branches: a boundary auxiliary branch, a road extraction backbone, and a node inferring branch. All of them are trained together, and the latter two branches are trained with equal-weighted loss. This network incorporates the road boundary details and road junction information. Zhang et al. [176] offered a method for extracting road nodes and inferring the connectivity between them, known as NodeConnect. This method predicts road nodes by learning a confidence map and simultaneously proposes a multi-task framework to learn the connectivity map for the nodes. Zao et al. [98] proposed TopoRoad, a method that learns road topological maps to extract road networks. It comprises three main components: road vertex prediction, direction graph prediction, and segmentation graph prediction. After a unified decoding process, these three components are able to obtain the vertices and edges of the final road map. This method effectively addresses the issues of excessive parameters and low computational efficiency.
There are also some methods based on graph neural networks (GNN) and graph convolutional networks (GCN) for road extraction. Liu et al. [177] introduced RDPGNet, a network that integrates a CNN for feature extraction with a GCN for information interaction, centered around a GCN-based dual-view perceptor (GDVP). A GDVP includes an RFSG for reweighting regional features during graph inference and an RSHS to detect long-range road dependencies. They also implement an MVFA strategy to effectively consolidate road information. Zhou et al. [178] designed a split depth-wise (DW) separable GCN named SGCN to obtain spatial and channel features. The network then uses a GCN to capture global contextual information and constructs the adjacency matrix of the feature map with the Sobel gradient operator.

Methods Based on Iterative Detection
Iterative detection-based methods formulate road extraction as an iterative graph generation process. They start by defining an initial vertex, then iteratively predict the next vertex and ultimately obtain the entire road network. Two issues need to be addressed with this approach: how to obtain the initial vertices and how to locate the subsequent point or the direction of advancement.
RoadTracer [179] is the first method to employ iterative detection for road extraction. It begins at an initial vertex, predicting one of the fixed angles at each step and moving by a fixed step size. However, relying solely on the information from the current location to identify the next step may lead to a deviation between the extracted road and the actual road. To enhance road connectivity, Tan et al. [99] proposed a point-based iterative graph exploration scheme that integrates segmentation cues and a flexible step approach. This method employs a point-based detector capable of learning an appropriate step size through point-based supervisory encoding. Lian et al. [100] proposed DeepWindow, which utilizes a CNN to identify central points within patches and progressively determines subsequent center points.
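The fixed-angle, fixed-step exploration can be sketched as a simple search loop. This is a simplified illustration rather than the implementation of any cited method: the `is_road` callback is a hypothetical stand-in for the learned decision function, and the step size, number of angles, and merge distance are assumed values.

```python
import math


def trace_road(is_road, start, step=5, num_angles=8, max_moves=1000):
    """Grow a road graph from `start` by moving a fixed step along fixed angles."""
    vertices = [start]
    edges = []
    stack = [0]  # indices of vertices that may still have unexplored directions
    moves = 0
    while stack and moves < max_moves:
        current = stack[-1]
        cx, cy = vertices[current]
        advanced = False
        for k in range(num_angles):
            theta = 2 * math.pi * k / num_angles
            nx = round(cx + step * math.cos(theta))
            ny = round(cy + step * math.sin(theta))
            if not is_road(nx, ny):
                continue
            # Skip candidates that land on an already explored part of the road.
            if any(math.hypot(nx - vx, ny - vy) < step / 2 for vx, vy in vertices):
                continue
            vertices.append((nx, ny))
            edges.append((current, len(vertices) - 1))
            stack.append(len(vertices) - 1)
            moves += 1
            advanced = True
            break
        if not advanced:
            stack.pop()  # dead end: backtrack to the previous vertex
    return vertices, edges


# Toy example: a straight horizontal road along y = 0 for x in [0, 20].
vertices, edges = trace_road(lambda x, y: y == 0 and 0 <= x <= 20, start=(0, 0))
print(vertices)
print(edges)
```

Because each decision uses only local evidence around the current vertex, a single wrong step can pull the trace away from the true road, which is the drift problem that later methods address with segmentation cues or adaptive step sizes.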
To tackle the challenges of inaccuracies and inefficiencies encountered during the iterative procedure, Xu et al. [101] applied imitation learning to road detection, training agents to mimic expert policies using initial vertex candidates from segmentation and heatmaps. They introduced a training algorithm combining exploration methods for robust generation. Later, Xu et al. [180] developed a novel model called RNGDet, leveraging a CNN and transformer for feature extraction and vertex prediction. Enhanced with instance segmentation, RNGDet++ [181] refines both the training and the inference of the network by utilizing multi-scale features. Cheng et al. [182] introduced JTFN, which extracts curve-structured objects via an iterative feedback strategy. JTFN employs the object boundary to provide global topological regularization for the predicted mask. It also integrates a feature interchange model (FIM) to facilitate better feature exchange in segmentation and boundary detection. Additionally, a Gaussian attention unit (GAU) is included for feature enhancement.

Methods Based on Polygon Boundary
Iterative detection methods maintain road network topology but face challenges due to time-consuming vertex-by-vertex boundary generation. The narrow and elongated nature of roads necessitates global information for feature extraction, which CNNs struggle to capture. To overcome these issues, research has shifted towards treating road boundary extraction as a polygon identification problem, focusing on direct shape prediction from images.
Some models have been proposed to directly predict polygons from the input images using CNNs, such as PolygonRNN [102] and its improved variant PolygonRNN++ [183]. In the encoder, a CNN is employed to extract features that predict the initial vertex, which are then passed to a recurrent decoder. The RNN predicts additional vertices in the decoder, thereby constructing polygons incrementally. PolygonRNN++ builds upon PolygonRNN with several enhancements. It incorporates a novel CNN encoder, employs reinforcement learning for training, and utilizes a GNN to enhance the resolution of the output. Some studies enhance global perception and continuity by adding modules or loss constraints. Hu et al. [103] introduced PolyRoad, which uses a transformer for parallel road boundary detection and proposed a polyline matching cost and additional losses for improved topology.
Numerous polyline detection methods focus on particular targets and may not perform well across a diverse range of categories. Yang et al. [184] designed TopDiG, a model that adapts to diverse boundary extractions, including road boundaries. It involves a topological-concentrated node detector for initial extraction, dynamic graph supervision for label generation, and a directional graph generator for constructing topological graphs, offering a general approach to boundary detection.

Methods Based on Transformer
In the field of road extraction from remote sensing images, numerous studies have investigated methods that combine CNNs with Transformers to extract richer features. This integration strategy leverages the advantages of CNNs in capturing spatial information and the capabilities of Transformers in processing sequence data and addressing long-distance dependencies. Previous research has already demonstrated the effectiveness of the encoder-decoder network architecture. Within these methods, the majority of CNN frameworks adopt a "U"-shaped structure. Wang et al. [185] integrated a CNN-Transformer into the UNet architecture to enhance feature extraction. They connected the Transformer in series after the CNN and introduced a dual up-sampling module to improve performance. RoadCT, designed by Liu et al. [104], fuses CNN and Transformer features in a two-step decoder for road extraction. Li et al. [186] proposed a MACN with a mixed attention and convolutional Transformer (MACT) layer for efficient feature capture. Meng et al. [105] introduced an axial Transformer module (ATM) and a multilayer attention fusion module (MLAF) on UNet for feature learning and a channel attention module (CAM) for enhanced feature representation. Jamali et al. [187] combined residual learning with UNet and ViT in ResUNetFormer, employing a neighborhood attention Transformer for local feature enhancement.
There are some works that combine Transformer into modules to extract features with multi-scale, multi-stage, and rich contextual information.Luo et al. [188] introduced BDT-Net, which uses a Transformer-enhanced BDTM module to capture multi-scale contextual information of roads, followed by a feature refinement module (FRM).Hu et al. [189] proposed MDTNet, incorporating a multi-scale deformable Transformer (MDTB) module for comprehensive feature capture, blending Transformer and deformable convolution.Wang et al. [190] integrated a Transformer-based ESTM into the neck of their model for global context modeling.In addition, they introduced the GDEM for the automatic extraction of contextual information within the model.Alongside these, they proposed REF loss to improve the accuracy of road extraction under conditions of sample imbalance.
A mix-Transformer enhances the capability of road extraction in remote sensing images through its hybrid attention and local-global fusion features.Deng et al. [191] designed UMiT-Net.It consists of four mix-Transformer blocks for global feature extraction and a dilated attention module (DAM) for semantic feature fusion.The decoder employs multiscale self-adaptive modules (MSAM) to boost segmentation precision, concatenating multi-scale features and refining outputs through attention mechanisms, resulting in more connected and accurate road segmentation.
The Swin-Transformer, known for its efficient multi-head and shifted window self-attention, streamlines road extraction computations. Ge et al. [192] integrated it into a U-shaped architecture to boost global learning. TransRoadNet [193] employs the Swin-Transformer in a CIEM framework for feature map downsampling. Zhang et al. [194] presented a Transformer-based approach with modules dedicated to detailed road feature extraction and fusion of global/local contexts. Yang et al. [195] presented SSEANet, a framework that jointly trains the CNN and Swin-Transformer with the aid of consistency loss to improve their cross-supervised capabilities.
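The efficiency gain of window-based self-attention can be made concrete with the complexity expressions given in the Swin Transformer paper, which count multiply-accumulate operations for an h × w feature map with C channels and windows of M × M tokens. The sketch below simply evaluates those two expressions; the feature map size, channel count, and window size used in the example are assumed values.

```python
def global_msa_cost(h, w, c):
    """Cost of global multi-head self-attention: quadratic in the number of tokens h * w."""
    return 4 * h * w * c**2 + 2 * (h * w) ** 2 * c


def window_msa_cost(h, w, c, m):
    """Cost of window-based self-attention with m x m windows: linear in h * w."""
    return 4 * h * w * c**2 + 2 * m**2 * h * w * c


# Assumed setting: a 64 x 64 token map, 96 channels, 7 x 7 windows.
small_global = global_msa_cost(64, 64, 96)
small_window = window_msa_cost(64, 64, 96, 7)
large_window = window_msa_cost(128, 128, 96, 7)
print(small_global, small_window, large_window)
```

Quadrupling the number of tokens quadruples the windowed cost, whereas the attention term of the global variant grows sixteen-fold, which matters for the large tiles typical of high-resolution remote sensing imagery.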
Yuan et al. [106] proposed RRSIS, a model that generates segmentation masks from natural language descriptions using a Transformer-based LAVT model with an LGCE module for better detection of small targets.

Methods Based on Mamba
In recent years, breakthroughs in large language models and foundation vision models have drawn scholars' attention to large-scale remote sensing models. In the research field of road extraction from remote sensing images, methods based on VMamba [196] have been widely applied. These methods not only improve the efficiency of road network extraction through the deep learning ability of large models but also highlight the enormous potential of large models in remote sensing applications.
The VMamba-based approach retains the advantages of ViT while achieving linear time complexity. This method effectively captures global information in two-dimensional images through variants of the cross-scanning module. For example, Chen et al. [197] proposed RSMamba for remote sensing scene classification, which includes roads. RSMamba integrates the advantages of global receptive fields and linear complexity modeling and designs a dynamic multi-path activation mechanism to enhance the modeling capability for two-dimensional image data. Zhao et al. [198] proposed RSM, a remote sensing Mamba that captures global contextual information with only linear complexity. RSM is designed for dense prediction tasks in high-resolution remote sensing imagery, including road detection. RSM mitigates the loss of contextual information caused by input image segmentation and employs an omnidirectional selective scan module for global modeling from multiple directions. Ma et al. [107] developed RS3Mamba, a dual-branch network that enhances CNNs and Transformers with an auxiliary VSS block for global information and an inter-branch collaboration completion module for feature enhancement and fusion. Zhu et al. [108] proposed Samba for semantic segmentation of high-resolution remote sensing images. They used Samba blocks as encoders and an FPN-based UperNet as the decoder. In Samba blocks, Mamba substitutes for the multi-head self-attention of ViT and is combined with multiple MLPs for efficient image feature extraction. The UperNet decoder effectively captures multi-level semantic information.
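The idea behind cross-scanning can be sketched without any learned components. The example below unfolds a 2D grid into four 1D sequences and merges them back; the four traversal orders (row-major, column-major, and their reverses) are an assumption about the scheme, and the state-space model that would process each sequence in linear time is deliberately omitted.

```python
def cross_scan(grid):
    """Unfold a 2D grid into four 1D sequences traversed in different directions."""
    h, w = len(grid), len(grid[0])
    row_major = [grid[i][j] for i in range(h) for j in range(w)]
    col_major = [grid[i][j] for j in range(w) for i in range(h)]
    return [row_major, col_major, row_major[::-1], col_major[::-1]]


def cross_merge(sequences, h, w):
    """Map the four sequences back onto the grid and sum them per position."""
    row_f, col_f, row_b, col_b = sequences
    n = h * w
    merged = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            r = i * w + j  # position in the row-major sequence
            c = j * h + i  # position in the column-major sequence
            merged[i][j] = row_f[r] + col_f[c] + row_b[n - 1 - r] + col_b[n - 1 - c]
    return merged


grid = [[1, 2], [3, 4]]
sequences = cross_scan(grid)
merged = cross_merge(sequences, 2, 2)
print(sequences)
print(merged)
```

Since every pixel appears in all four sequences, each position can receive context from every direction after merging, even though each individual scan is causal along its own traversal order.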

Comparison of Six Models Based on Fully-Supervised Learning Methods
Specifically, each model within the fully supervised learning approach presents its own set of strengths and weaknesses. We provide a concise yet comprehensive overview in Table 1 that includes the quantity of relevant literature and clearly delineates the advantages and limitations of these six distinct models.

Semi-Supervised Methods for Road Extraction
Supervised-learning methods remain the predominant approach for road extraction from remote sensing imagery, continuously achieving breakthroughs in performance. However, these methods necessitate extensive datasets with clear labels, which can be both time-consuming and resource-intensive to compile. Consequently, semi-supervised learning methods have emerged as a viable alternative. Within this paradigm, weakly supervised learning techniques represent a significant branch. We categorize semi-supervised learning approaches into two types based on the nature of the training data: those utilizing partially labeled data and those using imprecisely labeled data.

Methods Based on Less Labeled Data
Methods based on less labeled data leverage a small amount of labeled data alongside a large volume of unlabeled data to train the network. The primary idea is to mine deep and useful information from the vast pool of unlabeled data, thereby reducing annotation costs [199].
Xia et al. [200] focused on creating representative datasets and a semi-supervised technique to leverage deep learning for road extraction from satellite images. He et al. [201] presented ClassHyPer, a semi-supervised method using hybrid perturbation to improve model performance with limited data, incorporating boundary information and implicit pseudo-supervision without extra threshold settings.
Han et al. [202] introduced a semi-supervised learning (SSL) method for road detection using GAN alongside a weakly supervised learning (WSL) approach based on conditional GAN. In SSL, the generator produces road detection results for both labeled and unlabeled images, with the discriminator determining labeling. WSL predicts road shapes to guide both the generator and discriminator. Chen et al. [203] presented SemiRoadExNet, a GAN-based method that overcomes the limitations of previous SSL methods in utilizing pseudo label information. It features one generator and two discriminators on UNet, extracting features and producing road segmentation and entropy maps. The discriminators enforce feature consistency between predictions, with the generator refined through adversarial training by leveraging unlabeled data.
Yang et al. [195] designed a semi-supervised edge-aware network combining CNNs and Transformers for road segmentation named SSEANet, focusing on road edges to overcome limitations of traditional self-training methods. Cheng et al. [109] developed an algorithm that integrates semi-supervised segmentation with multi-scale filtering and multidirectional non-maximum suppression for road centerline extraction. Xiao et al. [204] proposed a semi-supervised FCN algorithm that optimizes labeled and unlabeled sample losses to prevent overfitting. Further advancing the field, You et al. [205] presented FMWDCT, which integrates road information into a dual network, combining semi-supervised training and data perturbation to address overfitting and class imbalance.
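Many of the approaches in this subsection share a common objective structure: a supervised loss on the labeled samples plus a weighted loss on the unlabeled ones. The sketch below shows this generic structure for per-pixel road probabilities, assuming that pseudo-labels for the unlabeled pixels are supplied externally (for example by a teacher model or an earlier training round). It is not the loss of any specific cited method, and the weight of 0.5 is an arbitrary illustrative value.

```python
import math


def bce(prob, target, eps=1e-7):
    """Binary cross-entropy for one pixel probability and a 0/1 target."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def semi_supervised_loss(labeled_probs, labels, unlabeled_probs, pseudo_labels,
                         unlabeled_weight=0.5):
    """Supervised loss on labeled pixels plus a weighted loss on pseudo-labeled pixels."""
    supervised = mean(bce(p, y) for p, y in zip(labeled_probs, labels))
    unsupervised = mean(bce(p, y) for p, y in zip(unlabeled_probs, pseudo_labels))
    return supervised + unlabeled_weight * unsupervised


supervised_only = semi_supervised_loss([0.9, 0.2], [1, 0], [0.6, 0.3], [1, 0],
                                       unlabeled_weight=0.0)
combined = semi_supervised_loss([0.9, 0.2], [1, 0], [0.6, 0.3], [1, 0])
print(supervised_only, combined)
```

The weight on the unlabeled term controls how strongly potentially noisy pseudo-labels influence training; it is often increased gradually as the model's predictions become more reliable.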

Methods Based on Weak Labeled Data
Methods based on weakly labeled data do not require detailed annotations; even scribble labels can suffice, significantly reducing the cost and time required for data labeling [206]. Currently, datasets based on partial road maps can provide incomplete road network labels, enabling models to learn and infer the complete road network despite missing information [159]. These methods make full use of available resources and leverage algorithmic intelligence to compensate for the lack of annotation information, thereby achieving broader road network extraction under resource constraints [206].
Wang et al. [207] proposed CRAUP, an object segmentation method based on imprecise annotation in remote sensing images, using consistency regularization (CR) and average update of pseudo labels (AUP) to refine the semantic segmentation network with pseudo and accurate labels. They enhanced CRAUP [207] with the RanPaste algorithm and mean teacher approach [208] for higher accuracy. Bonafilia et al. [110] merged weakly supervised learning and semi-supervised learning to detect buildings and roads from OpenStreetMap (OSM), using D-LinkNet with weakly supervised methods for robust road extraction on noisy datasets. This work is considered the first to investigate global pre-training with OSM without fine-tuning. Chen et al. [209] introduced SW-GAN, which employs a weakly supervised network within a GAN framework, enhancing performance with a mix of weak and clear labels. Wu et al. [116] proposed MD-ResUNet, a weakly supervised method for road extraction, relying on OSM centerlines and outperforming fully supervised counterparts. Meng et al. [210] developed a segmentation model that leverages OSM road data and satellite imagery to mitigate the need for precise pixel-level annotations and enhance generalization. Leveraging data annotated with road center points, Lian et al. [211] designed a method based on point annotations for road extraction. They employed a CNN for the detection of road seeds and trained it solely with point annotations.
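The mean teacher approach mentioned above maintains a teacher model whose weights are an exponential moving average (EMA) of the student's weights. The following is a minimal sketch of that update on plain dictionaries of scalar parameters; the parameter names and the smoothing factor are illustrative assumptions.

```python
def ema_update(teacher, student, alpha=0.99):
    """Return new teacher weights: alpha * teacher + (1 - alpha) * student."""
    return {name: alpha * teacher[name] + (1.0 - alpha) * student[name]
            for name in teacher}


teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}

# One update moves the teacher slightly towards the student.
one_step = ema_update(teacher, student, alpha=0.9)

# With a fixed student, repeated updates converge to the student's weights.
converged = dict(teacher)
for _ in range(200):
    converged = ema_update(converged, student, alpha=0.9)
print(one_step, converged)
```

Because the teacher changes slowly, its predictions on unlabeled or imprecisely labeled images tend to be more stable than the student's and can therefore serve as consistency targets.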
Hu et al. [111] introduced a weakly supervised GAN-based method for road extraction, using a ResNet generator and WGAN-GP optimization with threshold post-processing. Hua et al. [212] proposed a semantic segmentation framework based on sparse scribble annotations. The framework utilizes a feature and spatial relational regularization method, designing an unsupervised learning signal that combines spatial and feature term neighborhood structures to complement the supervised task. Wei et al. [213] put forward ScRoadExtractor, a dual-branch weakly supervised road extraction model. This model can learn features from scribble annotations, which are relatively easy to obtain, eliminating the need for large datasets with pixel-level annotations. Zhou et al. [214] designed SOC-RoadNet, a dual-branch network for weakly supervised learning based on structural and directional consistency. The segmentation branch of this network is capable of learning road surface features using only scribble labels.
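One common way to train on scribble labels is to compute the supervised loss only over annotated pixels. The following is a minimal sketch of such a partial binary cross-entropy, not the loss of any specific method cited above; the label convention (1 for road, 0 for background, -1 for unlabeled) and the function name `partial_bce` are assumptions made for illustration.

```python
import math

# Label convention (an assumption for this sketch):
# 1 = road scribble, 0 = background scribble, -1 = unlabeled pixel.
UNLABELED = -1

def partial_bce(probs, scribbles, eps=1e-7):
    """Binary cross-entropy averaged over annotated pixels only."""
    total, count = 0.0, 0
    for prob_row, label_row in zip(probs, scribbles):
        for p, y in zip(prob_row, label_row):
            if y == UNLABELED:
                continue  # unlabeled pixels contribute no supervised loss
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
            count += 1
    return total / count if count else 0.0

# Toy 2x2 prediction with one road scribble and one background scribble.
probs = [[0.9, 0.2], [0.6, 0.1]]
scribbles = [[1, -1], [-1, 0]]
loss = partial_bce(probs, scribbles)
```

In practice, such a term is usually combined with a regularizer (for example, a consistency or neighborhood term) that propagates information to the unlabeled pixels.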
In addition to these standard models, some research has also explored large models. The segment anything model (SAM) [215] proposed by Meta AI is a powerful tool that can improve segmentation efficiency without the need for completely labeled data. Some studies adapt SAM to make it suitable for road extraction in remote sensing images. For example, Osco et al. [216] tested SAM across multi-scale datasets with various input prompts and implemented an automated technique that combines text prompts derived from general examples with one-shot training to improve accuracy. Hetang et al. [112] adapted SAM by designing SAM-ROAD to extract road networks from remote sensing imagery. They modified the encoder of SAM, used non-maximum suppression to extract the vertices of the road map, and used a lightweight Transformer-based GNN to predict the topology of the graph. Ma et al. [217] introduced a semantic segmentation model for remote sensing images that leverages SAM to integrate target and boundary constraints. The model produces SAM-generated objects (SGO) and SAM-generated boundaries (SGB) and uses them to improve accuracy through object consistency and boundary preservation losses. Because the SAM-based phase is added to traditional models, SGO and SGB are generated directly, enhancing segmentation performance.
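The vertex extraction step mentioned above relies on non-maximum suppression. The following is a generic sketch of greedy non-maximum suppression over scored points, given only to illustrate the idea; it is not the SAM-ROAD implementation, and the function name `point_nms` and the `radius` parameter are hypothetical.

```python
import math

def point_nms(candidates, radius):
    """Greedy non-maximum suppression over scored 2D points.

    candidates: list of (x, y, score); radius: suppression distance in pixels.
    """
    kept = []
    # Visit candidates from highest to lowest score.
    for x, y, score in sorted(candidates, key=lambda c: c[2], reverse=True):
        # Keep a point only if no stronger kept point lies within `radius`.
        if all(math.hypot(x - kx, y - ky) > radius for kx, ky, _ in kept):
            kept.append((x, y, score))
    return kept

# Two clusters of candidate vertices; one point survives per cluster.
candidates = [(10, 10, 0.95), (11, 10, 0.80), (30, 12, 0.70), (31, 13, 0.90)]
vertices = point_nms(candidates, radius=3.0)
```

The surviving points can then serve as graph nodes whose connections are predicted by a separate topology model.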

Unsupervised Methods for Road Extraction
Road extraction methods based on unsupervised learning do not rely on labeled datasets [116]. Instead, they explore the inherent structures and features within input images to identify road regions. Early unsupervised road extraction methods were based on traditional image processing techniques such as edge detection, mathematical morphology, and template matching, which depend heavily on geometric and radiometric features. However, these methods often adapt poorly to complex scenes [218]. With the advent of deep learning, approaches that use autoencoders to learn intricate road features, or GANs to generate more precise road networks, have emerged as viable options. Notably, self-supervised learning methods, a subtype of unsupervised learning, train models by formulating prediction tasks (e.g., predicting missing image regions), thereby indirectly learning features pertinent to road extraction [219].

Methods Based on Models with Fewer Parameters
Zhang et al. [113] introduced the category-anchored guided UDA (CAG-UDA) model for semantic segmentation to mitigate bias in unsupervised domain adaptation (UDA) classifiers. It employs category-anchored feature alignment and utilizes pixel-level and discrimination losses to improve target domain identification, enhancing inter-class variance while reducing intra-class variance. To address the domain shift (DS) challenge, Zhang et al. [220] designed RoadDA, a two-stage unsupervised domain adaptation network. Initially, the generator, equipped with a feature pyramid fusion module (FPFM), predicts segmentation for unlabeled target data, with the discriminator identifying domain labels. In the subsequent stage, the model generates pseudo-labels to refine segmentation and minimize domain discrepancies.
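Pseudo-label generation of the kind used in such second stages is often implemented by keeping only confident predictions. The sketch below shows this generic confidence-thresholding idea under stated assumptions; it is not the RoadDA procedure, and the `IGNORE` marker, the function name, and the threshold value are hypothetical.

```python
IGNORE = -1  # hypothetical marker for pixels left out of the loss

def make_pseudo_labels(road_probs, threshold=0.9):
    """Turn predicted road probabilities into pseudo-labels.

    Pixels are labeled road (1) or background (0) only when the model is
    confident; everything in between is marked IGNORE.
    """
    labels = []
    for row in road_probs:
        out = []
        for p in row:
            if p >= threshold:
                out.append(1)        # confident road
            elif p <= 1.0 - threshold:
                out.append(0)        # confident background
            else:
                out.append(IGNORE)   # uncertain, excluded from training
        labels.append(out)
    return labels

pseudo = make_pseudo_labels([[0.97, 0.55], [0.04, 0.91]], threshold=0.9)
```

A stricter threshold yields fewer but cleaner pseudo-labels, which is the usual trade-off when self-training on an unlabeled target domain.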
Deng et al. [206] proposed an adversarial learning framework for semantic segmentation in remote sensing images that reduces the need for extensive labeled data. Their GAN-like framework includes a segmentation network and a discriminator to handle distribution shifts between datasets. Initially, the segmentation network is trained in a supervised manner on labeled source data, followed by unsupervised fine-tuning on the target dataset using adversarial loss from the discriminator. Similarly, Cira et al. [221] designed a cGAN to enhance road feature representation in semantic segmentation through unsupervised generative learning, validating the approach through qualitative perception.
Han et al. [222] proposed a self-supervised technique, termed segmentation and reconstruction, designed to overcome the constraints of standalone segmentation models regarding the preservation of road connectivity and the attainment of boundary smoothness. Their architecture includes a segmentation model for initial road extraction from remote sensing images and a reconstruction model based on an all-visible denoising autoencoder (AV-DAE) for refining the results. The AV-DAE, trained without additional constraints, effectively improves road topology as a post-processing step.

Methods Based on Large Remote Sensing Models
Cha et al. [223] proposed a billion-scale foundational remote sensing image model. Their research investigated how the size of model parameters affects the performance of tasks like semantic segmentation. They pretrained foundational models with varying numbers of parameters, including 86 M, 605.26 M, 1.3 B, and 2.4 B, to determine whether the performance of downstream tasks improves with an increase in parameters. Additionally, they introduced an enhanced Transformer approach that improves parallelism.
Yan et al. [224] designed RingMo-SAM, a foundational model for multimodal remote sensing image segmentation that can handle object segmentation and target classification in both optical and SAR data. They constructed a large-scale training set using multiple open-source datasets. The model features a classification decoupling mask decoder (CDMDecoder) for accurate classification and segmentation. Furthermore, it introduces a prompt encoder that optimizes the precision of multi-object segmentation and enhances the segmentation performance of SAR images.
Sun et al. [225] developed RingMo, a foundational model for remote sensing images that leverages generative self-supervised learning. They constructed a large-scale dataset with 2 million images and used the PIMask strategy and RingMo MIM method, which effectively handle dense small targets in complex scenes. The encoder, once trained, is suitable for various optical remote sensing tasks and uses ViT and Swin Transformer architectures to optimize reconstruction accuracy through L1 regression loss.
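To make the idea of masked image modeling with an L1 reconstruction loss concrete, the following is a minimal sketch using plain random patch masking. It does not implement the PIMask strategy or any RingMo component; the function names, the toy patch representation, and the mask ratio are assumptions for illustration only.

```python
import random

def random_patch_mask(num_patches, mask_ratio, seed=0):
    """Return a list of booleans; True marks a patch hidden from the encoder."""
    rng = random.Random(seed)
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]

def masked_l1_loss(original, reconstructed, mask):
    """Mean absolute error computed over masked patches only."""
    total, count = 0.0, 0
    for patch, recon, hidden in zip(original, reconstructed, mask):
        if not hidden:
            continue  # visible patches are not part of the reconstruction target
        for a, b in zip(patch, recon):
            total += abs(a - b)
            count += 1
    return total / count if count else 0.0

# Toy "image": four patches of two pixel values each.
original = [[0.1, 0.2], [0.5, 0.5], [0.9, 0.8], [0.3, 0.4]]
reconstructed = [[0.1, 0.2], [0.4, 0.7], [0.9, 0.8], [0.0, 0.4]]
mask = random_patch_mask(len(original), mask_ratio=0.5)
loss = masked_l1_loss(original, reconstructed, mask)
```

Restricting the loss to hidden patches forces the encoder to infer missing content from context, which is the self-supervised signal this family of methods relies on.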

Metrics
Road extraction from remote sensing images is a binary classification problem, where road pixels are positive samples and background pixels are negative samples. The performance of a model is assessed based on a series of important metrics. These metrics are typically derived from the four fundamental elements within the confusion matrix of the classification results: TP (the number of pixels correctly predicted as roads), TN (the number of pixels correctly predicted as non-roads), FP (the number of pixels incorrectly predicted as roads), and FN (the number of pixels incorrectly predicted as non-roads) [95].
Among them, Accuracy, Precision, Recall, F1 score, IoU, and mIoU are the most commonly used evaluation metrics, which are calculated based on the elements within the confusion matrix.
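As an illustration of how these metrics follow from the confusion matrix, the sketch below counts TP, TN, FP, and FN for a pair of binary masks and derives the six metrics from them using their standard definitions. The function names, the zero-denominator convention in `safe_div`, and the toy masks are assumptions of this sketch; mIoU is computed here as the average of the road and background IoU.

```python
def confusion_counts(pred, truth):
    """Count TP, TN, FP, FN for binary masks (1 = road, 0 = background)."""
    tp = tn = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 0 and t == 0:
                tn += 1
            elif p == 1 and t == 0:
                fp += 1
            else:
                fn += 1
    return tp, tn, fp, fn

def safe_div(num, den):
    """Return num / den, or 0.0 for a zero denominator (a convention of this sketch)."""
    return num / den if den else 0.0

def road_metrics(pred, truth):
    """Derive pixel-level metrics from the confusion-matrix counts."""
    tp, tn, fp, fn = confusion_counts(pred, truth)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    iou_road = safe_div(tp, tp + fp + fn)
    iou_background = safe_div(tn, tn + fp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "iou": iou_road,
        "miou": (iou_road + iou_background) / 2,
    }

pred = [[1, 1, 0, 0],
        [0, 1, 0, 0]]
truth = [[1, 0, 0, 0],
         [1, 1, 0, 0]]
metrics = road_metrics(pred, truth)  # TP = 2, TN = 4, FP = 1, FN = 1
```

Because road pixels are usually a small fraction of an image, Accuracy can remain high even when many road pixels are missed, which is why IoU and F1 are typically reported alongside it.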

Accuracy
Accuracy refers to the proportion of samples correctly predicted by the model relative to the total number of samples, as defined by Equation (1). A higher value indicates that the model classifies more pixels correctly, meaning that its predictions are more closely aligned with the actual road positions.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (1)$$

Precision
Precision refers to the proportion of pixels predicted by the model as roads that actually belong to roads, as defined by Equation (2). It is suitable for situations where reducing the number of false positives is important. A higher value indicates that a larger share of the predicted road pixels are truly roads.
$$\text{Precision} = \frac{TP}{TP + FP} \quad (2)$$

Recall
Recall refers to the proportion of pixels correctly identified by the model as roads relative to the total number of actual road pixels, as defined by Equation (3). It measures the completeness of the predictions and is suitable for scenarios where minimizing false negatives is crucial. A higher value indicates a stronger ability of the model to capture all actual road pixels.
$$\text{Recall} = \frac{TP}{TP + FN} \quad (3)$$

F1 Score
The F1 Score is the harmonic mean of Precision and Recall, considering both the accuracy and completeness of the model, making it suitable for scenarios where balancing accuracy and completeness is important. As defined by Equation (4), a high value indicates that the model has achieved a good balance in predicting road pixels, minimizing both false positives and false negatives as much as possible.
$$\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \quad (4)$$

IoU
The IoU indicator represents the degree of overlap between the road area predicted by the model and the actual road area, as defined by Equation (5). Higher IoU values generally correspond to higher Accuracy, Precision, Recall, and F1 Score. A higher value reflects that the model prediction results are closer to the actual situation.
$$\text{IoU} = \frac{TP}{TP + FP + FN} \quad (5)$$

mIoU
The mIoU in road extraction tasks calculates the average IoU value between road pixels and background pixels. Its formula is shown as Equation (6). Similar to IoU, the value of mIoU ranges between 0 and 1.
$$\text{mIoU} = \frac{1}{2}\left(\frac{TP}{TP + FP + FN} + \frac{TN}{TN + FN + FP}\right) \quad (6)$$

APLS
APLS is used to measure the similarity between the extracted road network and the real road network. It is defined by Equation (7). By comparing the shortest path lengths between corresponding node pairs in the two networks, the accuracy and completeness of the road extraction results can be evaluated, determining whether the topology of the road network is consistent with the real situation. In the definition of APLS, N is the number of unique paths, and L(a, b) is the length of path (a, b). The node a′ represents the node in the predicted graph closest to the location of ground-truth node a (source), and the node b′ represents the node in the predicted graph closest to the location of ground-truth node b (target). A higher value indicates that the road extraction result is closer to the real road map.
$$\text{APLS} = 1 - \frac{1}{N}\sum \min\left\{1, \frac{\left|L(a, b) - L(a', b')\right|}{L(a, b)}\right\} \quad (7)$$
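The following sketch shows how an APLS-style score can be computed once the shortest path lengths have been obtained for matched node pairs. It is a simplified, one-directional illustration that assumes the standard form APLS = 1 - (1/N) Σ min(1, |L(a, b) - L(a′, b′)| / L(a, b)); path finding and node matching are assumed to be done beforehand, all ground-truth lengths are assumed positive, and the function name and dictionary layout are hypothetical.

```python
def apls(gt_lengths, pred_lengths):
    """Average path length similarity from matched shortest-path lengths.

    gt_lengths:   {(a, b): L(a, b)} for paths in the ground-truth graph.
    pred_lengths: {(a, b): L(a', b')} for the matched nodes in the predicted
                  graph; a missing key or None means no path exists there.
    """
    if not gt_lengths:
        return 0.0
    penalty = 0.0
    for pair, gt_len in gt_lengths.items():
        pred_len = pred_lengths.get(pair)
        if pred_len is None:
            penalty += 1.0  # a missing path receives the maximum penalty
        else:
            # Relative length difference, capped at 1.
            penalty += min(1.0, abs(gt_len - pred_len) / gt_len)
    return 1.0 - penalty / len(gt_lengths)

gt = {("a", "b"): 100.0, ("a", "c"): 200.0, ("b", "c"): 150.0}
pred = {("a", "b"): 110.0, ("a", "c"): 200.0}  # path b-c is broken in the prediction
score = apls(gt, pred)
```

A single broken connection costs a full unit of penalty for every path that depended on it, which is why this kind of metric is sensitive to topology errors that pixel-level metrics barely register.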

ECM
ECM evaluates object connectivity in remote sensing road extraction by quantifying pixel relationships based on entropy. It is defined by Equation (8), where C_i denotes the connectivity of the ith ground-truth instance, α_i denotes the completion of the ith ground-truth instance, M_i is the number of predicted road-boundary instances, p_j is the dominance of the jth predicted instance, and N is the total number of ground-truth instances. The larger the value, the more connected the road network.

CC
The CC metric assesses the degree of connectivity among road pixels within the segmented road network. It is calculated using Equation (9), where N_c is the total number of connected road pixels, and N_t is the total number of road pixels in the segmented region. A higher value indicates that the connectivity of the extracted road network is higher.

Datasets
Numerous datasets dedicated to road extraction from remote sensing images have emerged.They play a pivotal role in model training.
Table 2 presents a chronologically organized overview of the details for various remote sensing road datasets. It includes information such as the size and resolution of the images within the datasets, as well as the number of images in the training, testing, and validation sets for each dataset. Additionally, the table provides information on the source of the datasets and the year of publication.
Figure 10 illustrates an overview of a subset of datasets. For each dataset, columns (a) and (b) are relatively simple graphs, while columns (c) and (d) are more complex graphs.

Massachusetts
The Massachusetts dataset is a diverse collection of remote sensing images designed for road extraction tasks. It covers a range of scene types, including urban, suburban, and rural areas. It also includes a variety of terrain and landform features. The moderate scale of the dataset makes it suitable for training medium-scale models without the burden of handling large amounts of data. However, the dataset also presents challenges, especially in dealing with occlusions between roads and adjacent objects, and in maintaining high accuracy in image recognition under different lighting and weather conditions.

DeepGlobe
The DeepGlobe dataset includes over 10,000 satellite images, providing a rich resource for road extraction tasks. It covers countries such as Thailand, Indonesia, and India, encompassing environments from urban and rural to coastal and tropical rainforest. This environmental variety is beneficial for developing robust road extraction algorithms that can adapt to different conditions. However, extracting roads from satellite imagery is challenging. Roads often appear as narrow strips in these images and can be mistaken for other linear features like rivers and railways.

SpaceNet
The SpaceNet dataset is continually expanded and updated. Currently, there are eight versions, among which v3 and v5 are specifically designed for road extraction. SpaceNet v3 covers Las Vegas, Paris, Shanghai, and Khartoum. In addition to these four cities, v5 adds Moscow, Mumbai, San Juan, and an undisclosed "mystery" city. Compared to v3, the SpaceNet v5 dataset has advantages in resolution, coverage, and data diversity.
The SpaceNet dataset covers detailed information such as road centerline, road type, pavement type, bridges, and number of lanes. However, due to the perspective of satellite images, changes in lighting conditions, and the complexity of urban environments, the roads in the dataset may appear narrow or difficult to identify, which increases the challenge of accurately extracting roads from remote sensing images.

CHN6-CUG
The CHN6-CUG dataset, compiled and shared by Zhu Qiqi's team from the China University of Geosciences, is a suite of remote sensing images centered on urban road extraction in China. It features six distinct Chinese cities: Beijing's Chaoyang District, Shanghai's Yangpu District, the central region of Wuhan, Shenzhen's Nanshan District, the Sha Tin District of Hong Kong, and Macau. The dataset includes meticulously annotated road information, encompassing both covered and uncovered roads, as well as a detailed classification of road types, such as railways, highways, urban streets, and rural paths.

Discussion
This review categorizes road extraction tasks from remote sensing images into three primary types based on the requirements for annotated information within the datasets: fully supervised learning, semi-supervised learning, and unsupervised learning. Table 3 summarizes the quantity of relevant literature, annotation requirements, advantages, and limitations of these methods.
Fully supervised learning methods employ comprehensive annotated information for model training, leading to the efficient extraction of road details and a significant enhancement in accuracy. This approach currently predominates in technological applications. In contrast, semi-supervised and unsupervised learning methods reduce the reliance on large-scale annotated datasets, significantly lowering the cost of data preparation and bolstering the generalization of the model to new datasets. Although these methods may not match the precision of fully supervised learning, they are capable of actively exploring and mining the implicit structures and features within the input images. However, given that these methods are still in their nascent stage, there is room for improvement in terms of segmentation accuracy and adaptability.
To provide a comprehensive and intuitive assessment of the performance of various models, this review employs a suite of common performance metrics for comparison. We selected two widely recognized datasets, DeepGlobe and Massachusetts, and compared the performance of several representative models on these datasets. The comparative results for the DeepGlobe dataset are summarized in Table 4, while the results for the Massachusetts dataset are presented in Table 5. Since the F1 score is a comprehensive indicator adopted by most methods, both tables are sorted in descending order based on the F1 score. This allows for a clearer view of the advanced methods in the current field. Through the analysis of these two tables, a significant difference is observed in the F1 scores of RoadCT [104] on the DeepGlobe and Massachusetts datasets. There are two reasons for this phenomenon. Firstly, there are significant differences in geographic coverage, image resolution, road types, and complexity among datasets. Secondly, the generalization ability of the model is insufficient, resulting in performance degradation on new datasets. Therefore, the model chosen should match the characteristics of the selected dataset and the actual application requirements.

Conclusions
This review systematically collates deep learning algorithms employed in the field of road extraction from remote sensing images over the past 13 years. We examine a series of road extraction methods proposed in approximately 232 relevant articles and categorize the deep learning-based approaches into three primary categories based on the differing requirements for annotated datasets: fully supervised learning, semi-supervised learning, and unsupervised learning. For each category, we provide a comprehensive summary and in-depth analysis. In light of the literature analysis indicating that the majority of current methods still rely on fully supervised learning, we further subdivide the fully supervised learning approach into five subcategories and conduct a detailed comparison and analysis of the performance of each subcategory. Moreover, this review summarizes the evaluation metrics and datasets that are commonly utilized in the field.
Currently, models for road extraction from remote sensing images perform well in images taken under clear and well-lit conditions but struggle when faced with road occlusion, adverse weather, and other challenging scenarios. Major challenges include the complexity of remote sensing images, the high cost associated with data annotation, model generalization ability, and robustness. In the era of large-scale models and multimodal data, road extraction from remote sensing images holds significant importance. Therefore, this review looks forward to further research and development in the following aspects.

1. Multi-modal Data Fusion. As technology continues to progress, the effective fusion of multi-modal data from different sensors, such as remote sensing images, LiDAR images, and videos, is becoming a focal point of current research. The integration of multi-modal data not only offers a wealth of information but also addresses the limitations of relying on a single data source, leading to a better capture of road features. For example, LiDAR data can provide highly accurate terrain information, while high-definition video data are capable of capturing dynamic changes on the roads. The combination of various data modalities allows for more precise identification of road positions, shapes, and features, thus improving the robustness and generalization of road extraction models.

2. Semi-supervised Networks or Unsupervised Networks. Currently, most road extraction methods are based on fully supervised models, which rely on manually annotated datasets. This process is time-consuming and labor-intensive, and the annotated data are often limited in size, leading to potential performance issues when applied to other datasets. Therefore, the exploration of semi-supervised and unsupervised approaches, which aim to understand the internal structure of data or facilitate adaptive training without human annotation, remains a prominent research focus. Presently, methods based on GAN can automatically generate data annotations to bridge the gap between synthetic and real images, making them a significant direction for future research.

3. Adaptive Modeling in Complex Scenarios. The adaptability of road extraction models on remote sensing images is crucial when encountering complex scenarios. This adaptability enables models to effectively extract road information in diverse environments, including urban settings with building occlusions, tree cover, and uneven lighting conditions. By learning and understanding complex scenes, models can adapt to different geographical environments and remote sensing image conditions, thereby improving the accuracy and robustness of road extraction. Techniques such as multi-modal data fusion, data augmentation, and adversarial training can be employed to continuously enhance model structures and algorithms, enabling them to better adapt to various challenges and changes.

4. Lightweight Networks. Many road extraction methods, such as graph-based and Transformer-based approaches, encounter challenges related to large computational requirements. Therefore, designing lightweight networks is necessary. Lightweight networks can significantly reduce model parameters and computational complexity while maintaining high accuracy. Leveraging knowledge distillation techniques, key knowledge about road features can be extracted from large and complex models and transferred to lightweight networks, enabling them to learn effective road feature representations.
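As a rough illustration of the knowledge distillation idea, the sketch below computes a soft-target distillation loss for a single pixel, using the widely used temperature-softened KL divergence between teacher and student outputs. This is a generic formulation rather than a method from the reviewed literature; the two-class logits, the temperature value, and the function names are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened outputs, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    return kl * temperature ** 2

# Per-pixel logits for the classes (road, background).
teacher = [2.0, -1.0]
student = [1.0, 0.0]
loss = distillation_loss(student, teacher)
```

In a full training setup this term would typically be averaged over all pixels and added to the ordinary supervised segmentation loss with a weighting factor.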

Figure 1. Classification of road extraction approaches based on deep learning.

Figure 2. The organization of this review.

Figure 4. The proportion of literature sources in this review.

Figure 5. The general architecture of a patch-based CNN model.

Figure 6. The general architecture of an Encoder-Decoder model.

Figure 7. The general architecture of the GAN model.

Figure 8. The general architecture of the Graph model.
Road networks in remote sensing images are extensive but relatively small in scale, which often leaves traditional methods lacking in global context and localization accuracy. The Transformer architecture excels in acquiring global information and leveraging contextual cues from the input imagery. It employs a self-attention mechanism to capture relationships across different positions in the input sequence, enabling parallel computation. Although numerous methods based on Transformers have been proposed in recent years, few rely solely on Transformer for road extraction. Most methods integrate Transformer within a neural network architecture, allowing it to interact with other components to collaboratively accomplish the task of road extraction. Figure 9 illustrates the general architecture of a neural network-fused Transformer model.

Figure 9. The general architecture of a neural network-fused Transformer model.

Figure 10. Schematic illustrations of representative samples from a subset of the datasets. Columns (a,b) are relatively simple graphs; columns (c,d) are more complex graphs.

Funding: The work was jointly supported by the National Science and Technology Major Project under grant No. 2022ZD0117103, the National Natural Science Foundation of China under grant No. 62272364, the Guangxi Key Laboratory of Trusted Software under grant No. KX202061, the provincial Key Research and Development Program of Shaanxi under grant No. 2024GH-ZDXM-47, and the Fundamental Research Funds for the Central Universities under grant No. XJSJ24021.

Table 1. The quantity of relevant literature, advantages, and limitations of fully supervised models.

Table 2. Datasets for road extraction from remote sensing images.

Table 3. Comparison of fully supervised, semi-supervised, and unsupervised learning approaches.

Table 4. Comparative performance of representative models on the DeepGlobe dataset.

Table 5. Comparative performance of representative models on the Massachusetts dataset.