Article

PWFNet: Pyramidal Wavelet–Frequency Attention Network for Road Extraction

1 Beijing Laboratory of Water Resources Security, Capital Normal University, Beijing 100048, China
2 College of Resources Environment and Tourism, Capital Normal University, Beijing 100048, China
3 State Key Laboratory of Urban Environmental Processes and Digital Simulation, Capital Normal University, Beijing 100048, China
4 Key Laboratory of 3D Information Acquisition and Application, Ministry of Education, Beijing 100048, China
5 CCCC Xingyu Technology Co., Ltd., Beijing 102200, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(16), 2895; https://doi.org/10.3390/rs17162895
Submission received: 4 July 2025 / Revised: 18 August 2025 / Accepted: 19 August 2025 / Published: 20 August 2025

Abstract

Road extraction from remote sensing imagery plays a critical role in applications such as autonomous driving, urban planning, and infrastructure development. Although deep learning methods have achieved notable progress, current approaches still struggle with complex backgrounds, varying road widths, and strong texture interference, often leading to fragmented road predictions or the misclassification of background regions. Given that roads typically exhibit smooth low-frequency characteristics while background clutter tends to manifest in mid- and high-frequency ranges, incorporating frequency-domain information can enhance the model’s structural perception and discrimination capabilities. To address these challenges, we propose a novel frequency-aware road extraction network, termed PWFNet, which combines frequency-domain modeling with multi-scale feature enhancement. PWFNet comprises two key modules. First, the Pyramidal Wavelet Convolution (PWC) module employs multi-scale wavelet decomposition fused with localized convolution to accurately capture road structures across various spatial resolutions. Second, the Frequency-aware Adjustment Module (FAM) partitions the Fourier spectrum into multiple frequency bands and incorporates a spatial attention mechanism to strengthen low-frequency road responses while suppressing mid- and high-frequency background noise. By integrating complementary modeling from both spatial and frequency domains, PWFNet significantly improves road continuity, edge clarity, and robustness under complex conditions. Experiments on the DeepGlobe and CHN6-CUG road datasets demonstrate that PWFNet achieves IoU improvements of 3.8% and 1.25% over the best-performing baseline methods, respectively. In addition, we conducted cross-region transfer experiments by directly applying the trained model to remote sensing images from different geographic regions and at varying resolutions to assess its generalization capability. The results demonstrate that PWFNet maintains the continuity of main and branch roads and preserves edge details in these transfer scenarios, effectively reducing false positives and missed detections. This further validates its practicality and robustness in diverse real-world environments.

1. Introduction

Road extraction is a fundamental task in the field of remote sensing [1], as road networks serve as critical infrastructure for a wide range of applications, including urban planning [2], intelligent transportation systems [3], and emergency response [4]. To meet the increasing demand for high-precision road information in practical scenarios, road extraction techniques have been extensively studied and continuously refined. However, optical remote sensing imagery often contains shadows, building occlusions, and blurred or inconsistent road textures, which pose persistent challenges. These factors frequently lead to the performance degradation of existing methods, particularly in complex and cluttered urban environments.
In recent years, the extensive application of deep learning in image semantic segmentation tasks has led to its increased utilization in road extraction from remote sensing imagery [5]. This is a marked departure from traditional methods such as thresholding, edge detection, and rule-based morphological or machine learning techniques that heavily depend on handcrafted features and road texture heuristics [6,7]. Deep learning models are capable of automatically extracting more discriminative representations, thereby significantly enhancing the efficiency and accuracy of road extraction [8]. Early deep learning-based methods for road extraction were primarily built upon Fully Convolutional Networks (FCNs) [9]. As network architectures advanced, encoder–decoder frameworks like U-Net [10] became widely adopted with further enhancements through attention mechanisms [11] to improve feature representation and boost the discrimination of road regions. However, due to the inherently limited receptive field of convolution operations, these models often struggle to capture long-range dependencies [12]. The advent of Vision Transformers [13] brought stronger global modeling capabilities to the field, enabling significant progress in capturing long and continuous road structures [14]. Nonetheless, transformer-based networks that rely solely on spatial-domain modeling still encounter limitations when handling complex remote sensing scenarios [15]. For instance, in images with occluded or curved roads, or those disturbed by vegetation and building textures, models may become overly sensitive to local variations. This often leads to the misclassification of frequency-similar background regions, resulting in broken connectivity and blurred boundaries—particularly around narrow roads and intricate intersections. To address these challenges in spatial structure and frequency-domain representation, recent studies have begun to incorporate spectral information into road extraction tasks. It has been observed that road regions typically exhibit smooth, continuous low-frequency characteristics in remote sensing images [16], while background elements such as buildings and vegetation contain abundant mid- and high-frequency edge textures. Therefore, frequency-domain modeling has emerged as an effective strategy to enhance the precision and robustness of road extraction.
In this context, wavelet convolution has increasingly been recognized as an analytical tool that can simultaneously capture spatial and frequency locality. This technique stems from the multi-scale filtering system of wavelet transformation, which is based on scaling and translation operations. Wavelet convolution dissects input features, creating sub-band representations across various frequencies and resolutions [17]. This demonstrates its unique advantages in tasks like road extraction. Unlike the global frequency analysis offered by Fourier transformation, wavelet transformation retains spatial structural information while facilitating the hierarchical modeling of both high-frequency details and low-frequency structures. Specifically, the low-frequency components depict the overall connectivity trend of roads, whereas the high-frequency components capture local features such as edges and textures. This combined spatial-frequency modeling capability makes wavelet convolution particularly suited for road extraction in remote sensing images. These images are often characterized by large-scale variations, severe occlusions, and strong texture interference, requiring enhanced feature discrimination and structural restoration capabilities, which wavelet convolution can offer.
In summary, this paper introduces a novel remote sensing road extraction technique that synergistically combines frequency modeling and multi-scale feature enhancement. The objective is to optimize the utilization of spatial and frequency attributes inherent in road structures within remote sensing imagery. Central to this approach is the Pyramidal Wavelet Convolution (PWC) module, which conducts multi-scale downsampling and a stepwise wavelet decomposition of feature maps. This architecture facilitates targeted modeling and the reconstruction of frequency components at varying scales. Capitalizing on the time-frequency localization capabilities of the wavelet transform, the model ensures that low-frequency sub-bands augment the perception of overarching road networks. In contrast, high-frequency sub-bands are adept at discerning intricate edge details and texture nuances, enhancing the depiction of both primary roads and finer branches across multiple scales. Furthermore, given that road regions in remote sensing images predominantly exhibit smooth low-frequency traits, while non-road elements like buildings and vegetation present intricate high-frequency interferences, a Frequency-aware Adjustment Module (FAM) has been incorporated. This module employs the Fourier transform to transpose the input into the frequency domain, where predefined frequency masks are deployed to isolate residual features from distinct spectral bands. By determining adaptive weights for each frequency band, FAM judiciously amplifies road-centric low-frequency structures, concurrently minimizing extraneous high-frequency background disturbances. Collectively, the wavelet-driven local multi-scale frequency analysis and the Fourier-driven global frequency modulation constitute a harmonious frequency modeling framework. This dual-module system is seamlessly integrated into an encoder–decoder structure, with Res2Net serving as its foundational component, culminating in a frequency-enhanced road extraction network.
Our contribution consists of the following aspects:
  • We propose a novel road extraction model that integrates multi-scale spatial and frequency features. By jointly leveraging spatial continuity and frequency structural information, the model effectively addresses challenges such as complex textures and severe occlusions in remote sensing imagery, thereby enhancing robustness and representational capacity in complex scenes. The source code is publicly available at https://github.com/zong3124/PWFNet (accessed on 18 August 2025).
  • We design a Pyramidal Wavelet Convolution (PWC) module, which introduces Discrete Wavelet Transform (DWT) at multiple spatial scales to perform high–low frequency decomposition and the reconstruction of features. This enhances the model’s ability to capture both global road structures and local details, making it particularly effective for road extraction tasks with significant scale variations.
  • We propose a Frequency-aware Adjustment Module (FAM), which conducts multi-band decomposition and the weighted modulation of features in the Fourier domain. By learning adaptive weights for each frequency band, the module strengthens road-related low-frequency components while suppressing high-frequency background interference, improving extraction accuracy in texture-rich environments.

2. Related Works

2.1. Road Extraction in Deep Learning

As a representative cross-disciplinary task between computer vision and remote sensing, road extraction has undergone a significant evolution—from traditional image processing techniques to deep learning-based approaches. Early studies primarily relied on explicit features such as texture and geometric shape in images, using manually defined thresholds, edge detection, or morphological operations to identify road regions [18,19,20], or incorporating machine learning methods such as Support Vector Machines (SVM) for road extraction [21,22]. However, these methods typically depend on handcrafted features and exhibit limited generalization ability, making them inadequate for handling the complex and diverse scenarios often encountered in remote sensing imagery [23].
With the development of deep learning, researchers have begun applying it to the task of road extraction in remote sensing. Initial approaches were mostly based on Convolutional Neural Networks (CNNs), such as the classic Fully Convolutional Network (FCN) [24] and the widely adopted U-Net architecture [25]. These models extract and fuse multi-scale features through encoding and decoding processes, significantly improving road extraction accuracy. Subsequently, lightweight architectures such as LinkNet [26] were proposed, maintaining competitive recognition performance while enhancing computational efficiency. To further improve performance, various enhancements were introduced, including attention mechanisms, dilated convolutions, and residual connections. For instance, DA-RoadNet [27] incorporated a Dual Attention Module (DAM) to capture road features and their global dependencies in remote sensing images. D-LinkNet [28] added dilated convolutions between the encoder and decoder to expand the receptive field. CoANet [29] proposed a Strip Convolution Module (SCM) to capture directional road features and introduced a Connectivity Attention Module (CoA) to enhance structural integrity and connectivity in road extraction. With further advances in deep learning, the transformer architecture, originally successful in natural language processing, has also been introduced into image segmentation tasks. Zhu et al. [30] combined the Swin Transformer with ResNet as a hybrid encoder for road extraction, capturing both global and local road information. Ge et al. [31] replaced the conventional multi-head self-attention (MSA) in U-Net with window-based (W-MSA) or shifted-window-based (SW-MSA) attention modules to better capture contextual information. Additionally, hybrid models combining CNNs and transformers have recently been proposed to exploit the local feature extraction strength of CNNs and the global modeling capacity of transformers, aiming for more accurate and robust road extraction in complex scenarios. For example, RoadCT [32] adopted a CNN–transformer hybrid framework with a relation fusion block to integrate road features at different receptive fields. Wang et al. [33] proposed an enhanced hybrid decoder with dual upsampling modules and employed the hard-swish activation function to improve generalization and nonlinear feature extraction while mitigating gradient vanishing.
Beyond architectural innovations, other studies have focused on optimizing the training paradigm. In recent years, semi-supervised, weakly supervised, and unsupervised road extraction methods have emerged, aiming to reduce dependence on large volumes of pixel-level annotations and improve model adaptability in low-label scenarios. For instance, SemiRoadExNet [34] utilizes a Generative Adversarial Network (GAN)-based framework, where a discriminator guides the extractor to generate more realistic road predictions, enabling semi-supervised road extraction. MCMCNet [35] proposed a Guided Contrastive Learning Module (GCLM) to enhance feature discrimination between road and background regions. In addition to the conventional segmentation head, it introduces a Road Skeleton Prediction Head (RSPH) and an Adaptive Road Augmentation Module (ARAM), which together improve the modeling of fine-grained road structures and connectivity. SOC-RoadNet [36] adopted a weakly supervised strategy using simplified “scribble” annotations to train the model, enabling effective road representation learning from limited labels. Lu et al. [37] developed a fully unsupervised framework by introducing a graph-based label propagation algorithm that automatically generates pseudo-masks with road, non-road, and unknown classes. Combined with auxiliary boundary priors extracted from the imagery, this approach significantly reduces reliance on manual annotations. Despite continuous evolution in network architectures and training paradigms, bringing improvements in adaptability, precision, and training efficiency, mainstream methods still largely rely on spatial-domain convolution operations. These approaches often fail to fully exploit the frequency characteristics of road structures in remote sensing imagery. As a result, they struggle to effectively capture key attributes such as spatial continuity, geometric regularity, and the multi-scale distribution of roads in complex environments. These limitations have become a major bottleneck in achieving further performance gains.

2.2. Wavelet Transform in CNNs

The wavelet transform (WT) has increasingly been recognized in recent years for its frequency analysis capabilities, particularly in the fields of image processing and deep learning. This mathematical tool offers superior spatial-frequency localized modeling capabilities compared to the Fourier transform (FT). While the FT employs sinusoidal basis functions to analyze global frequency components with high frequency resolution, it falls short in effectively capturing localized spatial features, especially in regions with sudden texture changes or blurred boundaries. The WT, in contrast, utilizes compactly supported wavelet basis functions such as Haar and Daubechies. These functions facilitate multi-scale decomposition and the reconstruction of input features via scaling and translation operations. As a result, the WT can extract high-frequency edge details and low-frequency structural contours without sacrificing spatial localization information. This makes it particularly adept at modeling image scenes where complex textures and multi-scale semantic information coexist.
In deep learning-based semantic segmentation tasks, wavelet transforms have been progressively incorporated to enhance frequency modeling capabilities [38]. Representative models such as ACM-UNet [39] and Frequ-FNet [40] have successfully integrated wavelets for semantic segmentation in medical imaging. Moreover, wavelet-based approaches have demonstrated their generality and effectiveness in fine-grained structural modeling across various tasks, including image classification [41], super-resolution reconstruction [42], receptive field expansion [43] and feature compression [44]. However, most existing studies focus on natural or medical images, while the systematic exploration of wavelet-based modeling remains limited for remote sensing road extraction—a task characterized by unique structural patterns and distinct frequency distribution properties.
In remote sensing imagery, road regions typically exhibit continuous and smooth low-frequency characteristics, representing the main structural layout and large planar areas of the roads [45], while background elements such as buildings and vegetation contain complex mid- to high-frequency textures, which often lead to misclassification or omission [46]. High-frequency components also capture edges, contours, and fine details of roads, including branch roads and lane boundaries, which are essential for preserving connectivity and structure. Furthermore, main roads and branch roads differ significantly in scale and texture complexity, requiring models to have strong multi-scale structural modeling and frequency-aware discrimination capabilities [47]. Traditional convolutional models, constrained by fixed receptive fields and purely spatial-domain operations, often fail to fully exploit these frequency characteristics, highlighting the need for frequency-aware mechanisms—such as wavelet or Fourier-based modules—to enhance feature discriminability and improve road extraction performance [48]. Despite the integration of Fourier convolution or wavelet transforms into deep learning models in previous studies to enhance frequency-domain awareness, there still exist two fundamental limitations. Firstly, global Fourier convolution-based methods typically depend on fixed low-pass or high-pass filters, which do not allow for the adaptive weighting for specific frequency bands [16]. This makes it challenging to simultaneously maintain low-frequency road structures while reducing high-frequency noise. Secondly, current wavelet-based approaches predominantly concentrate on compression or denoising rather than the adaptive modeling of road-specific frequency distributions [41,42]. As a result, they often fail to preserve details of branch roads and edges in regions with complex textures.
This paper proposes a novel remote sensing road extraction model that addresses existing challenges by integrating frequency awareness and multi-scale enhancement. Initially, we construct a PWC module that performs wavelet decomposition and localized convolution across various spatial scales. The low-frequency sub-bands of the PWC module preserve primary road structures, while its high-frequency sub-bands capture detailed directional edge information. This compensates for the limitations of global frequency-domain methods in capturing local details. Subsequently, we introduce an FAM module, which performs fine-grained frequency partitioning in the Fourier domain and adaptively enhances low-frequency components while suppressing mid- and high-frequency interference through learnable frequency-band weights. This effectively overcomes the adaptability constraints of traditional fixed frequency-band methods. The PWC and FAM modules collaborate synergistically, combining local spatial-frequency reconstruction with global frequency-band response optimization. This establishes a complementary frequency modeling mechanism that balances the fidelity of low-frequency structures and the preservation of high-frequency details. Our approach significantly enhances the accuracy and robustness of road extraction in complex backgrounds.

3. Methods

This section introduces the overall framework of the proposed network. Subsequently, we detail the key modules used within the framework.

3.1. Overall Structure of PWFNet

As shown in Figure 1, the overall architecture of PWFNet adopts a classic encoder–decoder framework. The encoder uses Res2Net [49] as the backbone network for multi-scale semantic feature extraction, offering strong representational capacity and fine-grained structural awareness. In the second and third stages, a PWC module and FAM are introduced to more comprehensively integrate local and global spatial-frequency features. The PWC module applies multi-scale wavelet decomposition and reconstruction to extract structural information across different frequency components at various spatial resolutions, effectively enhancing the representation of edges and texture details. The FAM module performs frequency band partitioning and modulation on the feature maps, assigning learnable weights to different frequency components such as low-frequency structures and high-frequency textures, thereby improving the model’s responsiveness to texture complexity and structural variation. At the deepest encoding layer (the fourth stage), PWC is retained to maintain high-order frequency information for decoder support. This design enables PWFNet to effectively capture information across multiple scales and frequency levels while maintaining a large receptive field, thus improving its ability to identify roads and boundaries in remote sensing imagery.
The decoder part adopts the design principles of D-LinkNet and is composed of four cascaded decoder blocks. Each decoder block employs a skip connection to fuse features from the corresponding stage of the encoder, thereby mitigating feature loss and enhancing semantic guidance. Within each decoder block, a 1 × 1 convolution (conv1) is first used for dimensionality reduction, which is followed by batch normalization and a ReLU activation function for normalization and nonlinear mapping. Then, a transposed convolution (deconv2) is applied to restore spatial resolution, which is followed again by normalization and activation. Finally, another 1 × 1 convolution (conv3) is used to restore the channel dimension, completing the feature reconstruction process in the decoder. The final output layer includes one upsampling transposed convolution and two convolutional operations to progressively generate prediction results that match the input image size. A Sigmoid activation function is applied to produce the final binary segmentation map for the road extraction task.
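As a concrete illustration of the decoder block described above, the following is a minimal PyTorch sketch, assuming a D-LinkNet-style channel reduction by a factor of four and additive skip fusion; the exact channel widths and fusion strategy in PWFNet may differ.

```python
import torch
import torch.nn as nn
from typing import Optional

class DecoderBlock(nn.Module):
    """Decoder block sketch: 1x1 conv -> transposed conv -> 1x1 conv, each with BN + ReLU."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        mid = in_channels // 4  # bottleneck width, an illustrative choice
        self.conv1 = nn.Sequential(                      # dimensionality reduction
            nn.Conv2d(in_channels, mid, kernel_size=1),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.deconv2 = nn.Sequential(                    # restore spatial resolution (x2)
            nn.ConvTranspose2d(mid, mid, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True))
        self.conv3 = nn.Sequential(                      # restore the channel dimension
            nn.Conv2d(mid, out_channels, kernel_size=1),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True))

    def forward(self, x: torch.Tensor, skip: Optional[torch.Tensor] = None) -> torch.Tensor:
        x = self.conv3(self.deconv2(self.conv1(x)))
        # skip connection: fuse the upsampled feature with the corresponding encoder feature
        return x + skip if skip is not None else x
```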

3.2. Pyramidal Wavelet Convolution Module

The wavelet transform (WT) serves as a potent tool for signal processing and analysis. With the evolution of deep learning, wavelet transforms have been incorporated into deep network architectures for a myriad of tasks. However, traditional wavelet convolutions encounter constraints in capturing multi-scale contextual information, particularly in complex visual tasks such as road extraction. Prior research has demonstrated that utilizing wavelets to separately model high-frequency details and low-frequency structures during feature extraction can augment the network’s ability to represent non-stationary patterns. Building on WTConv, we introduce a pyramidal structure to construct the PWC module. As depicted in Figure 2, this module executes layer-wise wavelet decomposition, localized convolution operations, and hierarchical reconstruction on features at varying resolutions, thereby facilitating the multi-scale coupling of road information and preservation of intricate details. It enhances the capture of low-frequency information (e.g., shape features), thereby boosting shape sensitivity and compensating for the convolutional neural network’s predilection toward high-frequency textures. As depicted in Figure 3, the input feature initially undergoes average pooling to construct a multi-scale pyramid representation. Each level of the pyramid corresponds to a distinct spatial resolution and is fed into a dedicated WTConv module for multi-scale frequency domain modeling and enhancement. Ultimately, the outputs from each scale are upsampled back to the original resolution and concatenated along the channel dimension to form the enhanced feature representation.
Specifically, as shown in Figure 3, PWC first performs multi-scale downsampling on the input features (using average pooling at different scales) to generate pyramid features at multiple resolutions. For each downsampled feature, the module performs the following operations sequentially:
It applies multi-level wavelet decomposition (using Haar wavelets) to divide the features into low-frequency components, which represent the main structural information, and high-frequency components, which capture local detail information.
$$X^{(l)} = \mathrm{AvgPool}_{2^{l}}(X), \quad l = 0, 1, \ldots, L-1$$
where $X$ denotes the original input feature map of size $H \times W \times C$ (height × width × channels), $X^{(l)}$ denotes the low-frequency feature map obtained at decomposition level $l$, $\mathrm{AvgPool}_{2^{l}}$ denotes an average pooling operation with a downsampling factor of $2^{l}$ in both spatial dimensions, and $L$ is the total number of decomposition levels.
Each decomposed low-frequency and high-frequency sub-band is processed independently using grouped convolutions. The relative contribution of each sub-band is then adaptively modulated through learnable scaling parameters, enabling the network to balance structural and detail information during feature reconstruction.
$$W^{(l,m)} = \mathrm{GConv}\big(X_{LL}^{(l,\,m-1)}\big)$$
where $\mathrm{GConv}$ represents group convolution, and $X_{LL}^{(l,0)} = X^{(l)}$ represents the low-frequency sub-band obtained from the input at decomposition level $l$.
The wavelet transform at stage $m$ produces the following:
Low-frequency component: $\mathrm{GConv}\big(X_{LL}^{(l,m)}\big) \in \mathbb{R}^{\frac{H}{2^{m}} \times \frac{W}{2^{m}} \times B \times C}$.
High-frequency components: $\mathrm{GConv}\big(X_{H}^{(l,m)}\big) = \{X_{LH}, X_{HL}, X_{HH}\}$, where $X_{LH}$, $X_{HL}$, and $X_{HH}$ represent the horizontal, vertical, and diagonal detail sub-bands, respectively.
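To make the decomposition step concrete, the following is a minimal sketch of a single-level 2D Haar decomposition implemented as a fixed grouped (depthwise) convolution with stride 2. The orthonormal 1/2 filter scaling and the sign conventions of the detail filters are assumptions for illustration, not the paper’s exact implementation.

```python
import torch
import torch.nn.functional as F

def haar_dwt2d(x: torch.Tensor):
    """Single-level 2D Haar decomposition of a (B, C, H, W) tensor (H and W assumed even).

    Returns the low-frequency band LL and the horizontal, vertical, and diagonal
    detail bands (LH, HL, HH), each of shape (B, C, H/2, W/2).
    """
    b, c, h, w = x.shape
    ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
    lh = torch.tensor([[-0.5, -0.5], [0.5, 0.5]])   # horizontal details
    hl = torch.tensor([[-0.5, 0.5], [-0.5, 0.5]])   # vertical details
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])   # diagonal details
    kernels = torch.stack([ll, lh, hl, hh]).unsqueeze(1)            # (4, 1, 2, 2)
    kernels = kernels.repeat(c, 1, 1, 1).to(x.dtype).to(x.device)   # (4C, 1, 2, 2)
    # grouped convolution with stride 2 applies the four fixed Haar filters
    # to every input channel independently
    out = F.conv2d(x, kernels, stride=2, groups=c)                  # (B, 4C, H/2, W/2)
    out = out.view(b, c, 4, h // 2, w // 2)
    return out[:, :, 0], out[:, :, 1], out[:, :, 2], out[:, :, 3]
```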
Perform inverse wavelet reconstruction on the convolved sub-bands to restore the reconstructed features at the current scale:
$$F^{(l,m)} = \gamma^{(l,m)} \cdot \mathrm{GConv}\big(\mathrm{Concat}(X_{LL}, X_{LH}, X_{HL}, X_{HH})\big)$$
where $\mathrm{GConv}$ represents group convolution, $\gamma^{(l,m)}$ represents a learnable scaling parameter, and $\mathrm{Concat}$ refers to concatenation along the channel dimension, fusing the low-frequency sub-band ($X_{LL}$) with the three high-frequency sub-bands ($X_{LH}$, $X_{HL}$, $X_{HH}$) to integrate structural and detailed information before reconstruction.
The reconstructed features from the different spatial scales of WTConv are upsampled to restore the original input resolution:
$$\hat{X}^{(l,\,m-1)} = \mathrm{IWT}\big(F^{(l,m)}\big)$$
where $\hat{X}^{(l,\,m-1)}$ represents the reconstructed feature map at the $l$-th layer and $(m-1)$-th scale level, and $\mathrm{IWT}(\cdot)$ represents the inverse wavelet transform at the $l$-th layer and $m$-th scale level, which restores wavelet-domain features to a higher resolution.
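A matching sketch of the inverse step: for the orthonormal Haar filters assumed above, the inverse wavelet transform can be realized as a grouped transposed convolution with the same fixed kernels. This is an illustrative companion to `haar_dwt2d` rather than the exact IWT used in PWFNet.

```python
import torch
import torch.nn.functional as F

def haar_idwt2d(ll, lh, hl, hh):
    """Inverse of haar_dwt2d: reassemble (B, C, H, W) sub-bands into a (B, C, 2H, 2W) map."""
    b, c, h, w = ll.shape
    k_ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
    k_lh = torch.tensor([[-0.5, -0.5], [0.5, 0.5]])
    k_hl = torch.tensor([[-0.5, 0.5], [-0.5, 0.5]])
    k_hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
    kernels = torch.stack([k_ll, k_lh, k_hl, k_hh]).unsqueeze(1)      # (4, 1, 2, 2)
    kernels = kernels.repeat(c, 1, 1, 1).to(ll.dtype).to(ll.device)   # (4C, 1, 2, 2)
    # interleave the sub-bands so that each group holds the four bands of one channel
    coeffs = torch.stack([ll, lh, hl, hh], dim=2).reshape(b, 4 * c, h, w)
    # grouped transposed convolution with stride 2 inverts the orthonormal forward transform
    return F.conv_transpose2d(coeffs, kernels, stride=2, groups=c)
```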
Subsequently, the upsampled features from all spatial-frequency scales are concatenated to construct a unified multi-scale representation.
This fused feature map is further processed by a 3 × 3 convolutional block consisting of convolution, batch normalization (BN), and ReLU activation, producing an output that preserves the original spatial resolution and number of channels. By integrating information across multiple frequency bands and spatial scales, the proposed PWC module effectively captures both global contextual structures and fine-grained local details while maintaining computational efficiency. This design significantly enhances the segmentation performance, particularly in preserving road boundaries, connectivity regions, and texture-rich areas.
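Putting the steps together, the following hedged sketch traces the PWC flow: pyramid construction by average pooling, per-level wavelet decomposition, grouped convolution of the stacked sub-bands with a learnable scale, inverse reconstruction, upsampling, and fusion by a 3 × 3 convolutional block. It reuses the `haar_dwt2d`/`haar_idwt2d` helpers sketched above, applies a single decomposition stage per pyramid level, and treats the number of levels as an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PWCBlock(nn.Module):
    """Hedged sketch of the Pyramidal Wavelet Convolution (PWC) flow."""

    def __init__(self, channels: int, levels: int = 3):
        super().__init__()
        self.levels = levels
        # one grouped convolution per pyramid level, applied to the stacked
        # LL / LH / HL / HH sub-bands (4 * channels in total)
        self.subband_conv = nn.ModuleList(
            nn.Conv2d(4 * channels, 4 * channels, 3, padding=1, groups=4 * channels)
            for _ in range(levels))
        # learnable scaling gamma balancing structural and detail information
        self.gamma = nn.ParameterList(
            nn.Parameter(torch.ones(1, 4 * channels, 1, 1)) for _ in range(levels))
        # 3x3 conv + BN + ReLU fusing the concatenated multi-scale outputs
        self.fuse = nn.Sequential(
            nn.Conv2d(levels * channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        outs = []
        for l in range(self.levels):
            xl = F.avg_pool2d(x, 2 ** l) if l > 0 else x          # pyramid level l
            ll, lh, hl, hh = haar_dwt2d(xl)                        # wavelet decomposition
            bands = torch.cat([ll, lh, hl, hh], dim=1)
            bands = self.gamma[l] * self.subband_conv[l](bands)    # grouped conv + scaling
            rec = haar_idwt2d(*torch.chunk(bands, 4, dim=1))       # inverse reconstruction
            outs.append(F.interpolate(rec, size=(h, w),            # back to input resolution
                                      mode="bilinear", align_corners=False))
        return self.fuse(torch.cat(outs, dim=1))
```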

3.3. Frequency-Aware Adjustment Module

To further enhance the network’s ability to perceive and represent multi-frequency features of road structures, this study proposes an FAM module tailored for the task of road extraction in remote sensing imagery. Based on the two-dimensional Fourier transform, the FAM module decomposes the input features in the frequency domain and, considering the multi-scale nature of road morphology, divides the frequency spectrum into several sub-bands using predefined road-related thresholds (with band factors of 2, 4, and 8). Specifically, the low-frequency components capture the continuous global layout of roads, the mid-frequency components focus on edge details, and the high-frequency components emphasize the texture patterns of road networks. By progressively extracting residual information from both low- and high-frequency bands, the module effectively enhances the network’s frequency-domain modeling capability for road topology, thereby improving the representation of fine-scale roads and complex intersections.
As shown in Figure 4, the FAM module first applies a two-dimensional real-valued Fast Fourier Transform (2D-FFT) to the input feature map to obtain its frequency spectrum representation.
$$\hat{X} = \mathrm{FFT}(X)$$
where $X \in \mathbb{R}^{B \times C \times H \times W}$ denotes the input feature map; $B$ is the batch size, $C$ is the number of channels, and $H$ and $W$ are the height and width of the feature map, respectively; and $\mathrm{FFT}(\cdot)$ denotes the two-dimensional real-valued Fast Fourier Transform.
Let $\hat{X} \in \mathbb{C}^{B \times C \times H \times W_{1}}$ denote the complex-valued frequency spectrum, where $W_{1} = \frac{W}{2} + 1$ represents the spectral width after applying the two-dimensional real-valued Fast Fourier Transform (2D-FFT).
Based on the frequency radius $r(u,v)$, a set of band-pass masks is generated to extract the low-frequency components within each frequency band $k_{i}$. Each mask is defined as follows:
$$\mathrm{mask}_{i}(u,v) = \begin{cases} 1, & \text{if } r(u,v) < \dfrac{1}{2k_{i}} \\ 0, & \text{otherwise} \end{cases}$$
where $u$ and $v$ are the horizontal and vertical coordinates in the frequency spectrum; $r(u,v) = \sqrt{u^{2} + v^{2}}$ is the frequency radius, measuring the radial distance in the frequency domain; $k_{i}$ is the bandwidth control factor for the $i$-th frequency band (smaller values produce wider bands); and $\mathrm{mask}_{i}$ is a binary matrix with ones at the positions to be preserved and zeros elsewhere.
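A small sketch of the mask construction, assuming the half-spectrum layout of a real-valued 2D FFT and band factors of 2, 4, and 8; the normalization of $u$ and $v$ to cycles per pixel follows the torch.fft frequency-grid convention and is an assumption rather than the paper’s exact choice.

```python
import torch

def band_masks(h: int, w: int, ks=(2, 4, 8), device="cpu"):
    """Binary masks mask_i(u, v) = 1 where the frequency radius r(u, v) < 1 / (2 * k_i).

    The masks follow the half-spectrum layout of torch.fft.rfft2, i.e. shape (h, w // 2 + 1).
    """
    u = torch.fft.fftfreq(h, device=device)              # vertical frequencies in cycles per pixel
    v = torch.fft.rfftfreq(w, device=device)             # horizontal frequencies (non-negative half)
    r = torch.sqrt(u[:, None] ** 2 + v[None, :] ** 2)    # frequency radius r(u, v)
    return [(r < 1.0 / (2.0 * k)).float() for k in ks]

# usage sketch: extract the per-band low-frequency content of a feature map x of shape (B, C, H, W)
# x_hat = torch.fft.rfft2(x, norm="ortho")
# lows = [torch.fft.irfft2(x_hat * m, s=x.shape[-2:], norm="ortho")
#         for m in band_masks(x.shape[-2], x.shape[-1], device=x.device)]
```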
The masked frequency spectrum is then transformed back to the spatial domain via the two-dimensional inverse Fourier transform, yielding the low-frequency features corresponding to that frequency band. The low-frequency component of the $i$-th frequency band is extracted as
$$X_{\mathrm{low}}^{(i)} = \mathrm{IFFT}\big(\hat{X} \cdot \mathrm{mask}_{i}\big)$$
where $\mathrm{IFFT}(\cdot)$ denotes the two-dimensional inverse Fourier transform, $\hat{X}$ is the original frequency spectrum, and “$\cdot$” denotes element-wise multiplication.
By progressively computing the residual between the low-frequency component and the previous frequency band, the local detail information of each frequency band can be obtained:
$$X_{\mathrm{low}}^{(i)} = X_{\mathrm{pre}}^{(i-1)} - X_{\mathrm{res}}^{(i-1)}$$
where $X_{\mathrm{pre}}^{(i-1)}$ is the input feature map of the $(i-1)$-th band, and $X_{\mathrm{res}}^{(i-1)}$ is the high-frequency residual from the $(i-1)$-th band.
The current low-frequency feature $X_{\mathrm{low}}^{(i)}$ is then used as the input for the next frequency band, serving as the intermediate state in the recursive process, i.e., $X_{\mathrm{pre}}^{(i)} = X_{\mathrm{low}}^{(i)}$.
For each frequency band, the module employs an independent group convolution layer $\mathrm{GConv}$ to adaptively learn a frequency-specific weight $w_{i}$:
$$w_{i} = \mathrm{sigmoid}\big(\mathrm{GConv}_{i}(x_{\mathrm{att}})\big)$$
where $\mathrm{GConv}_{i}$ denotes the group convolution operation for the $i$-th band, and $x_{\mathrm{att}}$ is the attention feature map guiding the adaptive computation of the frequency-band weight.
These weights are then used to modulate the residual features, enhancing informative frequency components while suppressing redundant noise:
$$x_{\mathrm{modify}}^{(i)} = w_{i} \cdot x_{\mathrm{res}}^{(i)}$$
Additionally, a learnable weight $w_{n+1}$ can optionally be applied to the final remaining low-frequency component:
$$x_{\mathrm{modify}}^{\mathrm{low}} = w_{n+1} \cdot x_{\mathrm{pre}}^{(n)}$$
Finally, all weighted outputs from the frequency bands are summed together with the residual low-frequency component to form the final frequency-weighted output of the module:
$$X_{\mathrm{out}} = \sum_{i=1}^{n} x_{\mathrm{modify}}^{(i)} + \begin{cases} x_{\mathrm{modify}}^{\mathrm{low}}, & \text{if } low = \mathrm{true} \\ x_{\mathrm{pre}}^{(n)}, & \text{otherwise} \end{cases}$$
where $low = \mathrm{true}$ indicates that the low-frequency weighting mechanism is enabled.
This explicit multi-band modeling and weighting mechanism enables the FAM module to more precisely capture texture, edge, and repetitive structure information across different frequency ranges, thereby significantly improving road and boundary extraction performance in complex scenes.
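The following condensed sketch traces the FAM flow described above: recursive band extraction via masked inverse FFTs, residual computation, and learned per-band weighting. It reuses the `band_masks` helper sketched above; the use of a global-average-pooled feature as `x_att`, the grouped 1 × 1 convolutions for the weights, and the band factors are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FAM(nn.Module):
    """Hedged sketch of the Frequency-aware Adjustment Module flow."""

    def __init__(self, channels: int, ks=(2, 4, 8), use_low: bool = True):
        super().__init__()
        self.ks, self.use_low = ks, use_low
        n_weights = len(ks) + (1 if use_low else 0)
        # one grouped 1x1 convolution per band to predict a frequency-specific weight w_i
        self.weight_conv = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=1, groups=channels)
            for _ in range(n_weights))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        x_hat = torch.fft.rfft2(x, norm="ortho")                 # frequency spectrum
        masks = band_masks(h, w, self.ks, device=x.device)       # from the sketch above
        x_att = x.mean(dim=(2, 3), keepdim=True)                 # assumed attention feature
        x_pre, out = x, torch.zeros_like(x)
        for i, mask in enumerate(masks):
            x_low = torch.fft.irfft2(x_hat * mask, s=(h, w), norm="ortho")
            x_res = x_pre - x_low                                # band-limited residual
            w_i = torch.sigmoid(self.weight_conv[i](x_att))      # learned band weight
            out = out + w_i * x_res
            x_pre = x_low                                        # recurse on the narrower low band
        if self.use_low:                                         # optional low-frequency weighting
            out = out + torch.sigmoid(self.weight_conv[-1](x_att)) * x_pre
        else:
            out = out + x_pre
        return out
```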

3.4. Comparative Methods

In this study, we selected a variety of representative baseline models for comparison, including both general-purpose semantic segmentation methods and deep networks specifically designed for road extraction. General-purpose models include DeepLab+ [50], which employs atrous separable convolution to effectively capture multi-scale contextual information, and Swin Transformer [51], which is a transformer-based architecture that performs global feature modeling through a window-based attention mechanism while maintaining computational efficiency. Task-specific road extraction models include D-LinkNet, which enhances the decoder’s receptive field using dilated convolutions, and RCFSNet [52], which integrates multi-scale contextual features to improve structural recognition. SGCN [53] introduces depth-wise separable convolutions and Sobel edge features to construct a graph-based representation, modeling global contextual relationships in both channel and spatial dimensions. GAMSNet [54] combines multi-scale residual learning with global-aware mechanisms to achieve coherent road structure recognition under complex backgrounds. Finally, DBRANet [55] employs a dual-branch encoder and regional attention module to strengthen multi-scale road structure representation and connectivity modeling.

4. Experiment and Datasets

4.1. Datasets

In our experiments, three remote sensing datasets were employed for road extraction: the DeepGlobe dataset [56], CHN6-CUG road dataset [57], and LSRV road dataset [54].

4.1.1. DeepGlobe Road Dataset

The DeepGlobe dataset was published as part of the CVPR DeepGlobe Road Extraction Challenge in 2018. The dataset contains 8570 satellite images from India, Indonesia, and Thailand, each with a size of 1024 × 1024 pixels and a spatial resolution of 0.5 m. This dataset covers the land cover of various scenes, including both rural and urban areas. It should be noted that among the 8570 satellite images in the DeepGlobe dataset, 6226 images with labels are provided as the training set, and 2344 images are provided as the verification set and test set, for which no corresponding image labels have been published. According to [58], the 6226 images with labels were divided into 4696 sample pairs as the training set and 1530 sample pairs as the test set in this experiment.

4.1.2. CHN6-CUG Road Dataset

CHN6-CUG contains 4511 labeled images of 512 × 512 pixels, which were divided into 3608 instances for the training set and 903 for the test set. The CHN6-CUG road dataset is a new large-scale satellite image dataset of representative cities in China. Its remote sensing image base map is from Google Earth. Six cities with different levels of urbanization, city size, development degree, urban structure, and history and culture are selected, including the Chaoyang area of Beijing, the Yangpu District of Shanghai, Wuhan city center, the Nanshan area of Shenzhen, the Shatin area of Hong Kong, and Macao. Depending on the degree of road coverage, the annotated roads include both covered and uncovered roads. In terms of physical geographic categories, the annotated roads include railways, highways, urban roads, and rural roads.

4.1.3. LSRV Road Dataset

The LSRV road dataset contains three large-scale remote sensing images: Boston, USA (23,104 × 23,552 pixels), Birmingham, UK (22,272 × 22,464 pixels), and Shanghai, China (16,768 × 16,640 pixels). In this study, all three images were cropped to a size of 512 × 512 pixels. After removing black borders and invalid areas, a total of 4943 valid images were obtained. Compared with publicly available road datasets, the LSRV dataset covers imagery from different regions and varying resolutions, providing a more comprehensive benchmark for evaluating the generalization capability of road detection models.

4.2. Implementation Details

For all experiments, the training images were processed using data augmentation techniques, including horizontal flipping, vertical flipping, and diagonal flipping. The loss function was defined as the sum of binary cross-entropy (BCE) loss and Dice coefficient loss. A fixed threshold of 0.5 was applied to generate the final binary output. All experiments were conducted on an NVIDIA 4090 GPU with 24 GB of memory.
All models were trained using the Adam optimizer with a batch size of 4 for 150 epochs. The initial learning rate was set to 2 × 10−4 and decayed by a factor of 5 at epochs 75, 97, and 120.
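For reference, the following is a minimal training-loop sketch matching the settings above (BCE + Dice loss, Adam with an initial learning rate of 2e-4, batch size 4, 150 epochs, and the learning rate multiplied by 0.2 at epochs 75, 97, and 120). The model, data loader, and Dice smoothing constant are assumed placeholders rather than the exact released implementation.

```python
import torch
import torch.nn.functional as F

def bce_dice_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1.0) -> torch.Tensor:
    """Sum of binary cross-entropy and Dice losses computed on sigmoid probabilities."""
    prob = torch.sigmoid(logits)
    bce = F.binary_cross_entropy(prob, target)
    inter = (prob * target).sum(dim=(1, 2, 3))
    dice = 1.0 - (2.0 * inter + eps) / (prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3)) + eps)
    return bce + dice.mean()

def train(model, train_loader, epochs: int = 150):
    """Training loop sketch: Adam, lr 2e-4, decayed by a factor of 5 at epochs 75, 97, and 120."""
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
    # "decayed by a factor of 5" is read here as multiplying the learning rate by 0.2
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[75, 97, 120], gamma=0.2)
    for _ in range(epochs):
        for image, label in train_loader:   # flipped / augmented image-label batches, batch size 4
            optimizer.zero_grad()
            loss = bce_dice_loss(model(image), label)
            loss.backward()
            optimizer.step()
        scheduler.step()
```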

4.3. Evaluation Metrics

To quantitatively evaluate the performance of different methods in road detection, four widely used evaluation metrics are adopted: precision (P), recall (R), F1-score, and intersection over union (IoU). Precision measures the proportion of correctly predicted positive pixels with respect to all predicted positive pixels, while recall represents the proportion of correctly predicted positive pixels with respect to all ground-truth positive pixels. The F1-score is the harmonic mean of precision and recall, providing a balanced measure of both. The IoU quantifies the ratio of the intersection area to the union area between the predicted segmentation and the ground truth. These metrics are formally defined as follows:
$$\mathrm{recall} = \frac{TP}{TP + FN}$$
$$IoU = \frac{TP}{TP + FP + FN}$$
$$\mathrm{precision} = \frac{TP}{TP + FP}$$
$$F1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
where TP, FP, and FN denote the number of true positives, false positives, and false negatives, respectively.
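The four metrics can be computed directly from the binarized predictions, as in the following sketch; the 0.5 threshold follows the implementation details above, and the small epsilon is an assumption to avoid division by zero.

```python
import torch

def road_metrics(prob: torch.Tensor, target: torch.Tensor, threshold: float = 0.5, eps: float = 1e-8):
    """Precision, recall, F1, and IoU from probability maps and binary ground-truth masks."""
    pred = (prob > threshold).float()   # fixed 0.5 threshold, as in the implementation details
    tp = (pred * target).sum()
    fp = (pred * (1.0 - target)).sum()
    fn = ((1.0 - pred) * target).sum()
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2.0 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)
    return {"precision": precision.item(), "recall": recall.item(),
            "f1": f1.item(), "iou": iou.item()}
```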

4.4. Comparison of the Methods

4.4.1. Experiments on the DeepGlobe Road Dataset

Figure 5 presents a visual comparison of different road detection methods on the DeepGlobe road dataset. From left to right, each sequence includes: the original optical image, the corresponding ground truth label, and the prediction results of DBRANet, Deeplab+, D-LinkNet, GAMSNet, RCFSNet, SGCN, Swin Transformer, and our proposed PWFNet. The odd-numbered rows display the overall road prediction results of different models, while the even-numbered rows show enlarged views of the red box regions for detailed comparison. As shown in the figure, the odd-numbered rows illustrate the global segmentation performance of each method. It can be clearly observed that the proposed method achieves superior results in complex road scenarios. Compared to other baseline models, PWFNet exhibits fewer omissions, misclassifications, and discontinuities in road extraction. The even-numbered rows provide zoomed-in views of selected areas, further highlighting the differences and advantages in fine-grained segmentation. Specifically, the second row depicts a narrow path winding through rural buildings, where the road structure is curved and the texture is blurred. Traditional models tend to produce fragmented and inaccurate predictions in such irregular and small-scale road scenes. In contrast, PWFNet effectively preserves structural continuity through multi-scale wavelet modeling and frequency-domain enhancement. In the fourth row, the roads are partially occluded by dense vegetation with significant background interference. The FAM module introduced in our method enhances the dominant low-frequency structures in the frequency domain, significantly improving recognition in occluded regions. The sixth row shows a case where roads and surrounding buildings exhibit similar color and texture distributions. Conventional models relying on color and edge gradients often misclassify such regions due to feature confusion. By explicitly modeling frequency-domain information and modulating high- and low-frequency residuals, our method improves boundary distinction and segmentation accuracy. Finally, the eighth row presents a typical rural road scenario with unclear road boundaries and complex textures. Leveraging PWC to extract multi-scale contextual features and a learnable frequency-weighting mechanism, PWFNet effectively distinguishes roads from non-road regions, suppresses redundant texture noise, and enhances the overall extraction precision.
Table 1 presents the quantitative results of seven state-of-the-art semantic segmentation and road detection methods, alongside our proposed PWFNet, which are all trained for 150 epochs on the DeepGlobe road dataset. As illustrated in the table, PWFNet consistently outperforms all competing models, particularly in terms of F1-score and intersection over union (IoU), which are key metrics for assessing segmentation accuracy and spatial consistency. Specifically, PWFNet achieves an F1-score of 82.71% and an IoU of 70.52%, significantly outperforming both general-purpose and road-specific models. Compared with Deeplab+ (F1: 79.43%, IoU: 65.87%) and Swin Transformer (F1: 76.31%, IoU: 61.69%), PWFNet achieves improvements of 3.28% to 6.4% in F1-score and 4.65% to 8.83% in IoU. Against road-oriented methods such as GAMSNet, D-LinkNet, DBRANet, RCFSNet, and SGCN, PWFNet shows a consistent performance gain of 2.66% to 4.34% in F1-score and 3.78% to 6.08% in IoU. Moreover, while several baseline methods report relatively high precision (e.g., SGCN: 81.86%, Deeplab+: 81.82%), their recall values are generally lower, indicating a tendency to miss finer or ambiguous road regions. In contrast, PWFNet demonstrates a well-balanced performance, with a precision of 92.69% and recall of 84.67%, indicating its ability to both accurately detect road pixels and reduce false negatives—a crucial capability for real-world road extraction tasks. These results strongly validate the effectiveness of the proposed PWC and FAM modules, which together enhance the model’s capability to capture discriminative features across spatial and frequency domains, thereby improving road structure extraction under diverse and complex environmental conditions.

4.4.2. Experiments on the CHN6-CUG Road Dataset

Figure 6 illustrates a visual comparison of various road detection methods on the CHN6-CUG road dataset. From left to right, each sequence includes the optical image, its corresponding ground truth label, and the prediction results from DBRANet, Deeplab+, D-LinkNet, GAMSNet, RCFSNet, SGCN, Swin Transformer, and the proposed PWFNet. The odd-numbered rows display the overall road prediction results of each method, while the even-numbered rows provide enlarged views of the red-box regions for detailed examination. As shown in the odd-numbered rows, each model’s segmentation result on the full image is visualized. From the comparison, it is evident that the proposed method demonstrates lower fragmentation and higher structural continuity across multiple scenarios, preserving more complete road structures than other methods. The even-numbered rows, which offer zoomed-in views of specific areas, more intuitively reveal the differences in fine-grained performance and segmentation accuracy among the models. Specifically, the second row corresponds to road scenes around densely built-up areas in Macau; the fourth row depicts the port area in Shenzhen, where roads are heavily occluded by containers and boundaries are visually unclear, exhibiting typical high-frequency interference. The sixth row shows urban roads in Beijing covered by vegetation, which are also characterized by strong background noise; and the eighth row presents a complex overpass structure in Wuhan with both vegetation occlusion and significant curvature in road layout. Across these diverse scenes with varying frequency characteristics and structural complexity, models such as SGCN and Swin Transformer exhibit limitations in maintaining road continuity, preserving boundary sharpness, and capturing small-scale structures. These limitations manifest as breaks, misclassifications, and blurred edges. In contrast, the proposed PWFNet maintains superior segmentation quality even in such challenging conditions. This improvement is mainly attributed to the PWC module, which effectively captures multi-scale frequency features, and the FAM module, which explicitly models and modulates frequency components across different bands. The combination of these two modules enables the model to adaptively respond to diverse regional frequency distributions, enhancing its ability to perceive and represent complex road structures. As a result, PWFNet achieves more robust and accurate road extraction, especially in scenes with occlusions, high-frequency clutter, and structural complexity.
Table 2 presents the quantitative comparison results of seven representative semantic segmentation and road detection methods, alongside our proposed PWFNet, which are all trained for 150 epochs on the CHN6-CUG road dataset. As shown in the table, PWFNet consistently outperforms all baseline methods, especially in terms of F1-score and intersection over union (IoU)—two key metrics that reflect the overall segmentation quality and spatial overlap accuracy. Specifically, PWFNet achieves an F1-score of 77.02% and an IoU of 62.63%, ranking highest among all evaluated methods. Compared with general-purpose segmentation models such as Deeplab+ (F1: 75.67%, IoU: 60.87%) and RCFSNet (F1: 76.07%, IoU: 61.38%), PWFNet yields substantial improvements of 1.35% to 15.64% in F1-score and 1.76% to 16.92% in IoU, respectively. Similarly, when compared to road-specialized models like D-LinkNet, GAMSNet, DBRANet, RCFSNet, and SGCN, PWFNet still demonstrates consistent advantages, achieving up to 2.59% higher F1-score and up to 9.2% higher IoU. It is also worth noting that while several baseline models achieve high precision (e.g., SGCN: 79.64%, D-LinkNet: 80.17%), they often suffer from relatively low recall, indicating missed detections and a reduced completeness of road extraction. In contrast, PWFNet achieves the highest recall of 79.26%, with a reasonable precision of 74.90%, reflecting a better balance between detecting true positives and minimizing false negatives. This performance suggests that PWFNet is particularly effective at capturing narrow roads, occluded segments, and structurally complex areas, which are common in the CHN6-CUG dataset. These superior results further highlight the advantages of the proposed PWC and FAM modules, which collaboratively enhance the model’s ability to represent multi-scale spatial-frequency information. By explicitly modeling frequency components and adaptively enhancing key features across bands, PWFNet achieves more robust, complete, and spatially accurate road extraction even in the presence of background noise, occlusion, and intricate road topology.

4.4.3. Ablation Experiments Between Modules

The proposed PWC module and FAM module are two important components in our PWFNet. To demonstrate their effectiveness, we perform ablation experiments on the DeepGlobe and CHN6-CUG road datasets. In this section, we will elaborate on and analyze the effectiveness of the PWC module and FAM module as well as their complementary effects. The Base model only contains Res2Net as the encoder without any frequency modeling modules. We separately added the PWC module or FAM module to form Base_PWC and Base_FAM, and the full version of PWFNet contains both modules. By comparing these settings, we can evaluate the performance improvement brought by each module and their combination.
The first six rows of Figure 7 show the visualization results on DeepGlobe, and the last six rows show the results on CHN6-CUG. We zoom in on the local regions in the even-numbered rows and observe that using only Res2Net as the encoder tends to produce fragmented and discontinuous extractions. Although either PWC or FAM can alleviate this problem to some extent, each alone is still unable to handle blurred boundaries and texture interference in complex scenes. Our full model, PWFNet, shows significantly better connectivity and boundary sharpness across different scenes, which demonstrates its strong ability to represent complex road structures. For example, in the fourth row, roads are severely occluded by vegetation, where conventional methods fail to detect them or tend to produce broken segments, while our PWFNet accurately restores the occluded structures. In the twelfth row, the road winds through high-rise buildings and dense vegetation, showing strong frequency-mixed characteristics. Even under high-frequency noise and clutter, PWFNet still achieves good structural integrity in road detection. These results verify the effectiveness and complementarity of our proposed PWC and FAM modules in both frequency-domain feature modeling and spatial structure modeling. The two modules work together to make the model adaptive to more complex and diverse road patterns in remote sensing images.
Table 3 shows the quantitative results of our ablation study on the DeepGlobe and CHN6-CUG road datasets. As can be seen, although both the PWC and FAM modules individually yield moderate improvements in F1-score and IoU over the baseline model, the gain from a single module is relatively limited. However, a significant performance improvement is observed when both modules are incorporated in the full PWFNet model. On the DeepGlobe dataset, the F1-score and IoU increase by 3.63% and 5.13%, respectively, compared to the baseline model, while on the CHN6-CUG dataset they increase by 3.54% and 4.56%, respectively. The results indicate that the two modules are complementary in feature representation: PWC contributes multi-scale spatial-frequency modeling, while FAM realizes adaptive modulation and selection over different frequency bands. Their effective integration thus enhances the network’s modeling of road structures and fine-grained textures, making it more robust and accurate for road extraction in complex remote sensing scenes.

5. Discussion

5.1. Model Complexity and Time Complexity Analysis

To comprehensively evaluate the computational requirements of the proposed PWFNet and its modules (PWC and FAM), we analyzed both model complexity and theoretical computational overhead. Model complexity is measured by the number of parameters (#Params), while theoretical computational cost is assessed in giga floating-point operations (GFLOPs). GFLOPs were calculated under the same conditions with a batch size of 1, and the parameter settings of PWFNet were consistent with the default configurations described in the manuscript.
As shown in Table 4, PWFNet exhibits moderate model complexity and computational cost. Although its number of parameters is higher than those of models such as GAMSNet and Swin Transformer, its GFLOPs remain much lower than high-complexity models like SGCN and RCFSNet, indicating a good balance between representational capacity and computational efficiency. Furthermore, the FAM module enhances low-frequency structural feature extraction with a negligible impact on parameters and GFLOPs, while the PWC module improves both low- and high-frequency feature representation with a modest increase in computation. Combined with the OA-Decoder design, the overall computational overhead remains within an acceptable range. Overall, PWFNet strengthens the extraction of main road structures and edge details without introducing excessive computational burden, ensuring its practical feasibility for high-resolution remote sensing imagery analysis and addressing concerns regarding the additional cost of wavelet and frequency transform operations.
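For reproducibility, parameter counts can be read directly from the model, and GFLOPs estimated with a profiling tool; the snippet below is a generic sketch, and the 512 × 512 input size and the use of the thop profiler are assumptions rather than the exact measurement protocol of this study.

```python
import torch

def count_parameters_m(model: torch.nn.Module) -> float:
    """Number of trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# GFLOPs can be estimated with an external profiler such as thop (assumed available):
#   from thop import profile
#   flops, params = profile(model, inputs=(torch.randn(1, 3, 512, 512),))
#   print(f"GFLOPs: {flops / 1e9:.2f}, Params: {params / 1e6:.2f} M")
```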

5.2. Cross-Region Transfer Experiments

To evaluate the transferability and generalization capability of the model, we directly tested the model trained on the DeepGlobe dataset on the LSRV road dataset. Figure 8 presents a visual comparison of different road detection methods on the LSRV dataset. From left to right, each sequence includes the optical image, the corresponding ground truth, and the results of DBRANet, Deeplab+, D-LinkNet, GAMSNet, RCFSNet, SGCN, Swin Transformer, and the proposed PWFNet.
As shown in Figure 8, the odd-numbered rows present the original optical images from different regions along with the segmentation results of various methods in the transfer task, while the even-numbered rows display enlarged views of the red boxed regions, allowing for a more intuitive comparison of how each model preserves fine details. The comparison demonstrates that the proposed PWFNet exhibits superior practicality and robustness across a variety of complex scenarios. Compared with other models, PWFNet produces more complete and coherent extraction results, accurately capturing wide main roads while maintaining continuity in densely built urban areas or regions with ambiguous road boundaries. Notably, in areas where roads are partially occluded by trees, shrubs, or building shadows, PWFNet effectively identifies and recovers the obscured road segments, avoiding the breaks or omissions commonly observed in other methods. Furthermore, in images with complex surface textures, PWFNet provides a more precise delineation of road morphology, exhibiting high fidelity in detail preservation. These results convincingly demonstrate that the proposed method possesses strong generalization capability and practical value for cross-region and cross-resolution road extraction tasks.
Table 5 shows that PWFNet significantly outperforms all other comparison methods on the LSRV dataset across multiple evaluation metrics. Specifically, PWFNet achieves 65.88% in IoU, 84.18% in precision (P), 80.28% in recall (R), and 80.15% in F1-score, all of which are the highest among the evaluated models. In terms of IoU, PWFNet surpasses the second-best method, GAMSNet, by approximately 7.91 percentage points, indicating that it can consistently maintain high overall extraction accuracy even under large variations in geographic regions and image resolutions. For precision, PWFNet improves by nearly 9 percentage points over the next highest value (75.25% by SGCN), effectively reducing false positives caused by differences in surface textures during cross-region testing. Regarding recall, PWFNet exceeds the highest comparison value (78.53% by DBRANet) by 1.75 percentage points, demonstrating its ability to recover road structures completely even under complex surface conditions and occlusions. The overall F1-score reaches 80.15%, significantly higher than that of the second-best GAMSNet (73.39%), further validating PWFNet’s robustness and reliability in cross-region transfer tasks. These results indicate that PWFNet not only performs excellently on a single dataset but also possesses strong cross-region generalization capability.

6. Conclusions

In this study, we propose PWFNet, a remote sensing road extraction network based on frequency-domain modeling. The network integrates two key components: the PWC module and the FAM module. These modules explicitly incorporate frequency information into the feature extraction stage to enhance the model's ability to represent multi-scale structures and fine-grained textures. Specifically, the PWC module performs multi-scale downsampling and wavelet decomposition to capture low-frequency structural features and high-frequency edge details, thereby achieving a joint spatial–frequency representation. Meanwhile, the FAM module decomposes feature maps in the frequency domain into distinct bands and adaptively applies modulation weights. This enables the model to better respond to different semantic components across frequency bands, such as the overall shape, edge textures, and background interference. The synergy between these two modules significantly improves the connectivity and boundary clarity of the extracted road regions while also reducing false positives and missed detections under complex backgrounds. This makes PWFNet particularly effective in challenging scenarios such as urban–rural transitions, curved or occluded paths, and images affected by vegetation or building texture interference.
We evaluate PWFNet on two public datasets, DeepGlobe and CHN6-CUG, where it achieves superior performance compared to existing state-of-the-art methods. To validate the contribution of each proposed module, we conduct ablation experiments by progressively removing the PWC and FAM modules. Four variants are constructed: the BaseLine model (using only the Res2Net encoder), BaseLine_FAM (with FAM only), BaseLine_PWC (with PWC only), and the complete PWFNet (with both modules). Experimental results show that both modules independently improve performance: PWC enhances multi-scale structural modeling and detail preservation, while FAM improves the adaptive perception of different frequency components. When combined, these modules exhibit strong complementarity, further enhancing the model's adaptability and robustness in complex road scenes. This confirms the overall advantage of PWFNet in frequency-domain modeling and spatial structural awareness.
In addition, we conducted cross-region transfer experiments, directly applying the trained model to remote sensing imagery from different geographic areas to evaluate its generalization capability. In these transfer experiments, PWFNet successfully preserves the integrity of road structures and the clarity of edge details. Even when faced with previously unseen landscapes, varying resolutions, or complex background interference, it can effectively identify both main roads and branch roads while reducing false positives and missed detections. The results further demonstrate PWFNet's practicality and robustness in cross-region and cross-resolution road extraction tasks.
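To make the summary of the PWC module more concrete, the snippet below sketches a single-level 2D Haar decomposition of a feature map, the basic operation that wavelet convolutions build on: the LL band captures the smooth, low-frequency road body, while the LH/HL/HH bands carry edge and texture detail. This is a simplified illustration under assumed tensor shapes, not the authors' implementation; the multi-scale pyramid and the convolutions applied to each band in PWC are omitted.

```python
# Minimal sketch of a single-level 2D Haar decomposition (illustrative only).
import torch
import torch.nn.functional as F

def haar_dwt(x: torch.Tensor):
    """Channel-wise one-level Haar transform; returns (LL, LH, HL, HH) at half resolution."""
    c = x.shape[1]
    filt = 0.5 * torch.tensor([
        [[1.,  1.], [ 1.,  1.]],   # LL: local average (low frequency)
        [[1., -1.], [ 1., -1.]],   # LH: horizontal detail
        [[1.,  1.], [-1., -1.]],   # HL: vertical detail
        [[1., -1.], [-1.,  1.]],   # HH: diagonal detail
    ], dtype=x.dtype, device=x.device)
    weight = filt.unsqueeze(1).repeat(c, 1, 1, 1)      # (4*c, 1, 2, 2)
    out = F.conv2d(x, weight, stride=2, groups=c)      # (B, 4*c, H/2, W/2)
    out = out.view(x.shape[0], c, 4, *out.shape[-2:])
    return out[:, :, 0], out[:, :, 1], out[:, :, 2], out[:, :, 3]

feat = torch.randn(1, 64, 128, 128)                    # assumed feature-map size
ll, lh, hl, hh = haar_dwt(feat)
print(ll.shape)                                        # torch.Size([1, 64, 64, 64])
```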
Despite PWFNet's strong performance on public benchmarks, several limitations remain. First, although the PWC and FAM modules effectively enhance multi-scale structural representation and preserve edge and fine-detail features, thereby addressing the shortcomings of traditional fixed-frequency methods, these frequency-domain operations inevitably introduce additional computational overhead. Our analysis of the model parameters and GFLOPs shows that PWFNet remains moderate in overall cost compared with high-complexity models such as SGCN and RCFSNet; however, inference efficiency may still be a concern in real-time applications. Practical large-scale deployment therefore requires balancing representational capacity against computational efficiency.
Second, the FAM module currently employs a fixed frequency partitioning strategy. While effective on the evaluated datasets, this may limit the model's adaptability to scene-specific frequency distributions. For instance, rural roads, dense urban areas, or complex vegetation regions may exhibit distinct low- and high-frequency characteristics that static frequency partitioning might not fully capture, potentially reducing feature modulation precision and slightly affecting performance in unseen scenarios. Future work could explore dynamic or learnable frequency partitioning to enable adaptive frequency-domain operations, further improving generalization and robustness across diverse real-world environments.
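As an illustration of what a fixed frequency partitioning looks like in practice, the sketch below splits the 2D Fourier spectrum of a feature map into three radial bands and rescales each band with a constant gain (boosting the low band and attenuating the high band). The band edges and gain values are arbitrary placeholders chosen for this example; FAM instead derives its modulation from a learned spatial attention mechanism, which is not reproduced here.

```python
# Minimal sketch of fixed radial band partitioning in the Fourier domain (illustrative only).
import torch

def fixed_band_modulation(x: torch.Tensor, gains=(1.2, 1.0, 0.8)):
    """Scale low / mid / high radial frequency bands of a feature map by constant gains."""
    _, _, h, w = x.shape
    spec = torch.fft.fftshift(torch.fft.fft2(x, norm="ortho"), dim=(-2, -1))
    yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, h, device=x.device),
        torch.linspace(-1, 1, w, device=x.device),
        indexing="ij",
    )
    r = torch.sqrt(xx ** 2 + yy ** 2)                  # normalised radial frequency
    mask = (gains[0] * (r < 1 / 3)
            + gains[1] * ((r >= 1 / 3) & (r < 2 / 3))
            + gains[2] * (r >= 2 / 3)).to(spec.dtype)  # band edges are placeholders
    out = torch.fft.ifft2(torch.fft.ifftshift(spec * mask, dim=(-2, -1)), norm="ortho")
    return out.real

feat = torch.randn(1, 32, 64, 64)                      # assumed feature-map size
print(fixed_band_modulation(feat).shape)               # torch.Size([1, 32, 64, 64])
```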
Future work will explore the following directions: (1) developing more efficient and learnable frequency decomposition operators that enable adaptive band partitioning, improving frequency-domain expressiveness and generalization; (2) integrating frequency-domain modeling with spatial attention mechanisms to construct more interpretable and scene-adaptive architectures; and (3) incorporating multi-source remote sensing data (e.g., SAR, infrared, and panchromatic imagery) to improve the model's robustness and general applicability in diverse geographical environments.

Author Contributions

Conceptualization: J.Z. and Y.S.; Methodology: J.Z. and Y.S.; Experimental: J.Z.; Model visualization: J.Z., Y.S. and D.X.; Validation: R.W., J.Z. and X.Z.; Data curation: R.W. and X.Y.; Writing—original draft: J.Z. and Y.S.; Writing—review and editing: J.Z. and R.W.; Visualization: D.X.; Funding acquisition: Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China (U2344225).

Data Availability Statement

The data presented in this study are available upon request from the corresponding author. The data are not publicly available because of permissions issues.

Conflicts of Interest

Author Xiaolin Zhao was employed by the company China Communications Construction Company. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Liu, R.; Wu, J.; Lu, W.; Miao, Q.; Zhang, H.; Liu, X.; Lu, Z.; Li, L. A Review of Deep Learning-Based Methods for Road Extraction from High-Resolution Remote Sensing Images. Remote Sens. 2024, 16, 2056. [Google Scholar] [CrossRef]
  2. Boyko, A.; Funkhouser, T. Extracting roads from dense point clouds in large scale urban environment. ISPRS J. Photogramm. Remote Sens. 2011, 66, S2–S12. [Google Scholar] [CrossRef]
  3. Wang, J.; Song, J.; Chen, M.; Yang, Z. Road network extraction: A neural-dynamic framework based on deep learning and a finite state machine. Int. J. Remote Sens. 2015, 36, 3144–3169. [Google Scholar] [CrossRef]
  4. Senthilnath, J.; Varia, N.; Dokania, A.; Anand, G.; Benediktsson, J.A. Deep TEC: Deep transfer learning with ensemble classifier for road extraction from UAV imagery. Remote Sens. 2020, 12, 245. [Google Scholar] [CrossRef]
  5. Wang, X.; Jin, X.; Dai, Z.; Wu, Y.; Chehri, A. Deep Learning-Based Methods for Road Extraction From Remote Sensing Images: A vision, survey, and future directions. IEEE Geosci. Remote. Sens. Mag. 2025, 13, 55–78. [Google Scholar] [CrossRef]
  6. Yuan, X.; Shi, J.; Gu, L. A review of deep learning methods for semantic segmentation of remote sensing imagery. Expert. Syst. Appl. 2021, 169, 114417. [Google Scholar] [CrossRef]
  7. Wang, W.; Yang, N.; Zhang, Y.; Wang, F.; Cao, T.; Eklund, P. A review of road extraction from remote sensing images. J. Traffic Transp. (Engl. Ed.) 2016, 3, 271–282. [Google Scholar] [CrossRef]
  8. Abdollahi, A.; Pradhan, B.; Alamri, A. VNet: An end-to-end fully convolutional neural network for road extraction from high-resolution remote sensing data. IEEE Access 2020, 8, 179424–179436. [Google Scholar] [CrossRef]
  9. Zhong, Z.; Li, J.; Cui, W.; Jiang, H. Fully convolutional networks for building and road extraction: Preliminary results. In Proceedings of the 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Beijing, China, 10–15 June 2016; pp. 1591–1594. [Google Scholar]
  10. Zhang, Z.; Liu, Q.; Wang, Y. Road extraction by deep residual u-net. IEEE Geosci. Remote Sens. Lett. 2018, 15, 749–753. [Google Scholar] [CrossRef]
  11. Wang, Y.; Peng, Y.; Li, W.; Alexandropoulos, G.C.; Yu, J.; Ge, D.; Xiang, W. DDU-Net: Dual-decoder-U-Net for road extraction using high-resolution remote sensing images. IEEE Trans. Geosci. Remote. Sens. 2022, 60, 1–12. [Google Scholar] [CrossRef]
  12. Jiang, X.; Li, Y.; Jiang, T.; Xie, J.; Wu, Y.; Cai, Q.; Jiang, J.; Xu, J.; Zhang, H. RoadFormer: Pyramidal deformable vision transformers for road network extraction with remote sensing images. Int. J. Appl. Earth Obs. Geoinf. 2022, 113, 102987. [Google Scholar] [CrossRef]
  13. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  14. Meng, Q.; Zhou, D.; Zhang, X.; Yang, Z.; Chen, Z. Road Extraction from Remote Sensing Images via Channel Attention and Multi-Layer Axial Transformer. IEEE Geosci. Remote. Sens. Lett. 2024, 21, 5504705. [Google Scholar] [CrossRef]
  15. Yang, H.; Zhou, C.; Xing, X.; Wu, Y.; Wu, Y. A High-Resolution Remote Sensing Road Extraction Method Based on the Coupling of Global Spatial Features and Fourier Domain Features. Remote Sens. 2024, 16, 3896. [Google Scholar] [CrossRef]
  16. Liu, H.; Zhou, X.; Wang, C.; Chen, S.; Kong, H. Fourier-Deformable Convolution Network for Road Segmentation from Remote Sensing Images. IEEE Trans. Geosci. Remote. Sens. 2024, 62, 4415117. [Google Scholar] [CrossRef]
  17. Li, Y.; Liu, Z.; Yang, J.; Zhang, H. Wavelet transform feature enhancement for semantic segmentation of remote sensing images. Remote Sens. 2023, 15, 5644. [Google Scholar] [CrossRef]
  18. Sardar, A.; Mehrshad, N.; Mohammad Razavi, S. Efficient image segmentation method based on an adaptive selection of Gabor filters. IET Image Process. 2020, 14, 4198–4209. [Google Scholar] [CrossRef]
  19. Omati, M.; Sahebi, M.R. Change detection of polarimetric SAR images based on the integration of improved watershed and MRF segmentation approaches. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2018, 11, 4170–4179. [Google Scholar] [CrossRef]
  20. Valero, S.; Chanussot, J.; Benediktsson, J.A.; Talbot, H.; Waske, B. Advanced directional mathematical morphology for the detection of the road network in very high resolution remote sensing images. Pattern Recognit. Lett. 2010, 31, 1120–1127. [Google Scholar] [CrossRef]
  21. Jeong, M.; Nam, J.; Ko, B.C. Lightweight multilayer random forests for monitoring driver emotional status. IEEE Access 2020, 8, 60344–60354. [Google Scholar] [CrossRef]
  22. Song, M.; Civco, D. Road extraction using SVM and image segmentation. Photogramm. Eng. Remote Sens. 2004, 70, 1365–1371. [Google Scholar] [CrossRef]
  23. Huan, H.; Sheng, Y.; Zhang, Y.; Liu, Y. Strip attention networks for road extraction. Remote Sens. 2022, 14, 4516. [Google Scholar] [CrossRef]
  24. Buslaev, A.; Seferbekov, S.; Iglovikov, V.; Shvets, A. Fully convolutional network for automatic road extraction from satellite imagery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 207–210. [Google Scholar]
  25. Pan, D.; Zhang, M.; Zhang, B. A generic FCN-based approach for the road-network extraction from VHR remote sensing images–using OpenStreetMap as benchmarks. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2021, 14, 2662–2673. [Google Scholar] [CrossRef]
  26. Chaurasia, A.; Culurciello, E. Linknet: Exploiting encoder representations for efficient semantic segmentation. In Proceedings of the 2017 IEEE Visual Communications and Image Processing (VCIP), St. Petersburg, FL, USA, 10–13 December 2017; pp. 1–4. [Google Scholar]
  27. Wan, J.; Xie, Z.; Xu, Y.; Chen, S.; Qiu, Q. DA-RoadNet: A dual-attention network for road extraction from high resolution satellite imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote. Sens. 2021, 14, 6302–6315. [Google Scholar] [CrossRef]
  28. Zhou, L.; Zhang, C.; Wu, M. D-LinkNet: LinkNet with pretrained encoder and dilated convolution for high resolution satellite imagery road extraction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 182–186. [Google Scholar]
  29. Mei, J.; Li, R.; Gao, W.; Cheng, M. CoANet: Connectivity attention network for road extraction from satellite imagery. IEEE Trans. Image Process. 2021, 30, 8540–8552. [Google Scholar] [CrossRef]
  30. Zhu, X.; Huang, X.; Cao, W.; Yang, X.; Zhou, Y.; Wang, S. Road extraction from remote sensing imagery with spatial attention based on Swin Transformer. Remote Sens. 2024, 16, 1183. [Google Scholar] [CrossRef]
  31. Ge, C.; Nie, Y.; Kong, F.; Xu, X. Improving road extraction for autonomous driving using swin transformer unet. In Proceedings of the 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC), Beijing, China, 18 September–12 October 2022; pp. 1216–1221. [Google Scholar]
  32. Liu, W.; Gao, S.; Zhang, C.; Yang, B. RoadCT: A hybrid CNN-transformer network for road extraction from satellite imagery. IEEE Geosci. Remote. Sens. Lett. 2024, 21, 105605. [Google Scholar] [CrossRef]
  33. Wang, R.; Cai, M.; Xia, Z.; Zhou, Z. Remote sensing image road segmentation method integrating CNN-Transformer and UNet. IEEE Access 2023, 11, 144446–144455. [Google Scholar] [CrossRef]
  34. Chen, H.; Li, Z.; Wu, J.; Xiong, W.; Du, C. SemiRoadExNet: A semi-supervised network for road extraction from remote sensing imagery via adversarial learning. ISPRS J. Photogramm. Remote. Sens. 2023, 198, 169–183. [Google Scholar] [CrossRef]
  35. Gao, L.; Zhou, Y.; Tian, J.; Cai, W.; Lv, Z. MCMCNet: A Semi-supervised Road Extraction Network for High-resolution Remote Sensing Images via Multiple Consistency and Multi-task Constraints. IEEE Trans. Geosci. Remote. Sens. 2024, 62, 1–16. [Google Scholar] [CrossRef]
  36. Zhou, M.; Sui, H.; Chen, S.; Liu, J.; Shi, W.; Chen, X. Large-scale road extraction from high-resolution remote sensing images based on a weakly-supervised structural and orientational consistency constraint network. ISPRS J. Photogramm. Remote. Sens. 2022, 193, 234–251. [Google Scholar] [CrossRef]
  37. Lu, X.; Zhong, Y.; Zheng, Z.; Wang, J. Cross-domain road detection based on global-local adversarial learning framework from very high resolution satellite imagery. ISPRS J. Photogramm. Remote. Sens. 2021, 180, 296–312. [Google Scholar] [CrossRef]
  38. Finder, S.E.; Zohav, Y.; Ashkenazi, M.; Treister, E. Wavelet feature maps compression for image-to-image CNNs. Adv. Neural Inf. Process. Syst. 2022, 35, 20592–20606. [Google Scholar]
  39. Huang, J.; Zhao, Y.; Li, Y.; Dai, Z.; Chen, C.; Lai, Q. ACM-UNet: Adaptive Integration of CNNs and Mamba for Efficient Medical Image Segmentation. arXiv 2025, arXiv:2505.24481. [Google Scholar]
  40. Xing, R. FreqU-FNet: Frequency-Aware U-Net for Imbalanced Medical Image Segmentation. arXiv 2025, arXiv:2505.17544. [Google Scholar]
  41. Li, Q.; Shen, L.; Guo, S.; Lai, Z. Wavelet integrated CNNs for noise-robust image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 7245–7254. [Google Scholar]
  42. Huang, H.; He, R.; Sun, Z.; Tan, T. Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1689–1697. [Google Scholar]
  43. Finder, S.E.; Amoyal, R.; Treister, E.; Freifeld, O. Wavelet convolutions for large receptive fields. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 363–380. [Google Scholar]
  44. Eliasof, M.; Bodner, B.J.; Treister, E. Haar wavelet feature compression for quantized graph convolutional networks. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 4542–4553. [Google Scholar] [CrossRef]
  45. Liu, H.; Wang, C.; Zhao, J.; Chen, S.; Kong, H. Adaptive fourier convolution network for road segmentation in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–14. [Google Scholar] [CrossRef]
  46. Lin, S.; Yao, X.; Liu, X.; Wang, S.; Chen, H.; Ding, L.; Zhang, J.; Chen, G.; Mei, Q. MS-AGAN: Road extraction via multi-scale information fusion and asymmetric generative adversarial networks from high-resolution remote sensing images under complex backgrounds. Remote Sens. 2023, 15, 3367. [Google Scholar] [CrossRef]
  47. Song, R.; Shi, F.; Du, G.; Zhang, X.; Jiang, C. MG-RoadNet: Road Segmentation Network for Remote Sensing Images Based on Multi-Receptive Field Graph Convolution. Signal Image Video Process. 2025, 19, 679. [Google Scholar] [CrossRef]
  48. Chen, L.; Gu, L.; Li, L.; Yan, C.; Fu, Y. Frequency Dynamic Convolution for Dense Image Prediction. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 30178–30188. [Google Scholar]
  49. Gao, S.; Cheng, M.; Zhao, K.; Zhang, X.; Yang, M.; Torr, P. Res2net: A new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 652–662. [Google Scholar] [CrossRef]
  50. Chen, L.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  51. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  52. Yang, Z.; Zhou, D.; Yang, Y.; Zhang, J.; Chen, Z. Road extraction from satellite imagery by road context and full-stage feature. IEEE Geosci. Remote. Sens. Lett. 2022, 20, 8000405. [Google Scholar] [CrossRef]
  53. Zhou, G.; Chen, W.; Gui, Q.; Li, X.; Wang, L. Split depth-wise separable graph-convolution network for road extraction in complex environments from high-resolution remote-sensing images. IEEE Trans. Geosci. Remote. Sens. 2021, 60, 1–15. [Google Scholar] [CrossRef]
  54. Lu, X.; Zhong, Y.; Zheng, Z.; Zhang, L. GAMSNet: Globally aware road detection network with multi-scale residual learning. ISPRS J. Photogramm. Remote. Sens. 2021, 175, 340–352. [Google Scholar] [CrossRef]
  55. Chen, S.; Ji, Y.; Tang, J.; Luo, B.; Wang, W.; Lv, K. DBRANet: Road extraction by dual-branch encoder and regional attention decoder. IEEE Geosci. Remote. Sens. Lett. 2021, 19, 3002905. [Google Scholar] [CrossRef]
  56. Demir, I.; Koperski, K.; Lindenbaum, D.; Pang, G.; Huang, J.; Basu, S.; Hughes, F.; Tuia, D.; Raskar, R. A challenge to parse the earth through satellite images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 172–181. [Google Scholar]
  57. Zhu, Q.; Zhang, Y.; Wang, L.; Zhong, Y.; Guan, Q.; Lu, X.; Zhang, L.; Li, D. A global context-aware and batch-independent network for road extraction from VHR satellite imagery. ISPRS J. Photogramm. Remote. Sens. 2021, 175, 353–365. [Google Scholar] [CrossRef]
  58. Batra, A.; Singh, S.; Pang, G.; Basu, S.; Jawahar, C.V.; Paluri, M. Improved road connectivity by joint learning of orientation and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10385–10393. [Google Scholar]
Figure 1. The proposed PWFNet framework consisting of PWC and FAM modules integrated into an encoder–decoder architecture.
Figure 2. The architecture of the Pyramidal Wavelet Convolution (PWC) module, which performs multi-scale wavelet decomposition and local feature enhancement.
Figure 3. Detailed structure of the WTCov module at one layer, showing its recursive design with progressive wavelet decomposition and low-frequency propagation.
Figure 4. Detailed structure of the proposed FAM module.
Figure 5. Visual comparison of the different methods on the DeepGlobe road dataset.
Figure 6. Visual comparison of the different methods on the CHN6-CUG road dataset.
Figure 7. A visual comparison of the ablation experiments conducted on PWFNet models using the DeepGlobe and CHN6-CUG datasets. Specifically, images 1 to 3 are from the DeepGlobe dataset, while images 4 to 6 are from the CHN6-CUG dataset.
Figure 8. Visual comparison of the different methods on the LSRV road dataset.
Table 1. Quantitative results of the state-of-the-art methods and the proposed PWFNet method on the DeepGlobe dataset.
Methods             IOU (%)   P (%)   R (%)   F1 (%)
GAMSNet             66.74     81.14   78.99   80.05
DBRANet             65.14     81.54   76.41   78.89
Deeplab+            65.87     81.82   77.17   79.43
D-LinkNet           64.44     81.63   75.36   78.37
SGCN                64.96     81.86   75.88   78.76
RCFSNet             64.94     80.25   77.30   78.75
Swin Transformer    61.69     78.70   74.06   76.31
PWFNet              70.52     92.69   84.67   82.71
Table 2. Quantitative results of the state-of-the-art methods and the proposed PWFNet method on the CHN6-CUG dataset.
Methods             IOU (%)   P (%)   R (%)   F1 (%)
GAMSNet             61.08     78.61   73.26   75.84
DBRANet             60.82     79.35   72.25   75.63
Deeplab+            60.87     78.18   73.32   75.67
D-LinkNet           60.72     80.17   71.46   75.56
SGCN                53.43     79.64   61.89   69.65
RCFSNet             61.38     78.36   73.91   76.07
Swin Transformer    45.71     65.66   57.71   61.38
PWFNet              62.63     74.90   79.26   77.02
Table 3. Ablation results of the PWC and FAM modules on the DeepGlobe and CHN6-CUG datasets.
Model           PWC   FAM   DeepGlobe F1 (%)   DeepGlobe IOU (%)   CHN6-CUG F1 (%)   CHN6-CUG IOU (%)
BaseLine        –     –     79.08              65.39               73.48             58.07
BaseLine_FAM    –     ✓     79.72              66.28               76.33             61.72
BaseLine_PWC    ✓     –     79.94              66.58               76.45             61.88
PWFNet          ✓     ✓     82.71              70.52               77.02             62.63
Table 4. Model complexity and computational cost of the proposed method and several state-of-the-art methods.
Methods             Params       GFLOPs
GAMSNet             29.33 M      58.32 G
DBRANet             47.68 M      53.16 G
Deeplab+            40.29 M      69.14 G
D-LinkNet           217.65 M     120.31 G
SGCN                42.73 M      311.58 G
RCFSNet             58.23 M      182.31 G
Swin Transformer    30.90 M      48.5 G
BaseLine            29.03 M      73.87 G
BaseLine_FAM        29.09 M      72.94 G
BaseLine_PWC        79.94 M      102.20 G
PWFNet              80.00 M      102.31 G
Table 5. Quantitative results of the state-of-the-art methods and the proposed PWFNet method on the LSRV road dataset.
Methods             IOU (%)   P (%)   R (%)   F1 (%)
GAMSNet             57.97     73.29   73.50   73.39
DBRANet             55.79     65.83   78.53   71.62
Deeplab+            49.90     62.70   70.96   66.58
D-LinkNet           56.34     67.02   77.94   72.07
SGCN                53.40     75.25   64.78   69.62
RCFSNet             55.03     72.94   69.14   70.99
Swin Transformer    43.55     75.19   50.86   60.68
PWFNet              65.88     84.18   80.28   80.15
