Article

Forest Road Extraction via Optimized DeepLabv3+ and Multi-Temporal Remote Sensing for Wildfire Emergency Response

1
Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
2
School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
3
Key Laboratory of Target Cognition and Application Technology (TCAT), Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
4
State Key Laboratory of Remote Sensing and Digital Earth, Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100101, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(7), 3228; https://doi.org/10.3390/app16073228
Submission received: 20 February 2026 / Revised: 13 March 2026 / Accepted: 20 March 2026 / Published: 26 March 2026

Abstract

Forest fires occur frequently in China; however, complex terrain and incomplete road networks severely constrain ground rescue efficiency. Accurate forest road information is essential for optimizing emergency response and deploying rescue forces. Existing road extraction algorithms are primarily designed for urban environments and exhibit limited efficacy in forest scenarios due to dense canopy, complex background interference, and the distinctive morphology of forest roads. To address this gap, this study proposes a forest road extraction method based on an enhanced DeepLabv3+ model using multi-temporal, high-resolution satellite imagery. Specifically, a Multi-Scale Dynamic Spatial-Channel Attention (MCSA) mechanism is embedded in the skip connections to suppress background interference, while strip pooling is integrated into the Atrous Spatial Pyramid Pooling (ASPP) module to better capture slender road features. A composite Focal-Dice loss function is also constructed to mitigate sample imbalance. Finally, the model is applied to multi-temporal remote sensing images, and a fusion strategy is introduced to integrate multi-seasonal road masks, enhancing overall accuracy and topological integrity. Experimental results show that the proposed method achieves a precision of 54.1%, an F1-Score of 59.3%, and an IoU of 41.8%, effectively enhancing road continuity and providing robust technical support for fire-rescue decision-making.

1. Introduction

China possesses extensive forest resources characterized by intricate topography and diverse climatic conditions, ranging from northeastern coniferous forests to the vertically zoned forests of the southwest, making it one of the most fire-prone regions globally [1]. Forest fires are marked by their high spontaneity, rapid propagation, and catastrophic impact; once ignited, they frequently result in severe ecological degradation, economic losses, and casualties, posing extreme challenges to prevention and suppression. During tactical rescue operations, ground forces often encounter difficulties in reaching fire lines, leading to significant response lags. Specifically, complex terrain and limited trafficability, exacerbated by the widespread fragmentation or absence of road network data [2], severely impede the mobilization of rescue teams and the logistics of fire-fighting materials. For instance, during the 2022 Chongqing wildfire [3], the combined effects of rugged terrain and insufficient road information significantly hampered ground rescue efforts. Consequently, the accurate extraction of forest road networks is a critical prerequisite for enhancing fire-fighting efficiency. This information not only provides a robust foundation for rescue route optimization and emergency force deployment, but also effectively reduces response times and mitigates disaster impacts [4,5,6].
In contrast to forest road networks, urban roads are characterized by high geometric regularity, relatively homogeneous backgrounds, and significant spectral contrast between road surfaces and surrounding features. These attributes enable semantic segmentation architectures—such as Convolutional Neural Networks (CNNs), Fully Convolutional Networks (FCNs) [7], and the U-shaped Network (U-Net) [8]—to demonstrate substantial potential in urban road extraction. Leveraging recent advancements in high-resolution remote sensing and deep learning [9,10,11,12], the geometric contours and topological structures of urban roads can now be accurately delineated. However, forest roads are typically situated in environments with rugged topography, variable illumination, and highly heterogeneous land cover. Crucially, forest roads are frequently obscured by dense canopy occlusion, resulting in fragmented spectral signatures and concealed spatial distributions. Traditional forest road extraction has primarily relied on manual field surveys or conventional image processing techniques, such as edge detection [13,14] and morphological operations [15,16,17]. While these methods may suffice in simple scenarios, they generally lack the robustness required to adapt to the dynamic terrain, lighting conditions, and complex land-cover characteristics inherent to forested regions.
In recent years, researchers have increasingly leveraged deep learning technologies for the extraction of suburban and rural roads, which share similar semantic and spectral characteristics with forest roads. For instance, Xu et al. [18] proposed the GL-Dense-U-Net architecture for road extraction, which leverages DenseNet and dual-attention units to capture both local pixel-level details and global morphological features. Their approach effectively mitigates interference from complex backgrounds, such as building shadows and roadside vegetation, thereby ensuring the structural continuity and integrity of the extracted road networks. Yang et al. [19] developed UGD-DLinkNet, an enhanced encoder–decoder framework that integrates hybrid attention mechanisms and uncertainty estimation to bolster road segmentation performance. By employing Monte Carlo dropout and an uncertainty-guided knowledge distillation strategy, their approach effectively addresses challenges such as road occlusions and annotation noise, significantly improving extraction accuracy and robustness in complex geospatial scenarios. Building upon these advancements, Li et al. [20] further refined the DeepLabv3+ framework, leveraging its superior multi-scale contextual encoding capabilities. By integrating an edge feature fusion module and a multi-level upsampling strategy, they addressed the common limitation of detail loss in high-resolution UAV imagery. This modification significantly enhances the sensitivity to fine-grained spatial features, optimizing both boundary delineation and the extraction of narrow road segments that traditional architectures often fail to resolve.
While the aforementioned studies have advanced road extraction in suburban and semi-urban contexts, research specifically targeting forest road recognition under dense canopy conditions remains notably scarce. Winiwarter et al. [21] employed CubeSat imagery with a SegNet-based CNN to extract forest road networks in Canadian boreal forests, achieving promising detection rates but noting significant challenges posed by narrow road widths and spectral similarity to surrounding bare soil. Their work highlighted that the majority of deep learning road extraction studies to date have focused on urban road networks rather than rural or forest roads, creating a substantial research gap. Similarly, Kleinschroth et al. [22] utilized multi-sensor satellite imagery and deep learning to monitor road development in Congo Basin forests, emphasizing the critical role of multi-temporal data in capturing roads obscured by rapid vegetation regrowth. These forest-specific approaches share several common limitations: (1) reliance on single-temporal imagery without exploiting seasonal phenological variations; (2) standard network architectures without targeted modifications for the slender, fragmented morphology of forest roads; and (3) no explicit mechanisms to address the extreme class imbalance (<5% road pixels) inherent in forest scenes. The present study addresses these gaps through a comprehensive framework that integrates architectural innovations with a multi-temporal fusion strategy, specifically designed to overcome the unique challenges of forest road extraction under dense canopy occlusion.
Despite the aforementioned progress, the application of existing deep learning models to forest road extraction remains constrained by several critical limitations. Firstly, the challenge posed by high canopy closure in forest regions has not been fundamentally addressed, which results in fragmented or entirely absent road features in optical imagery. Current road detection models are insufficient to compensate for the topological discontinuities caused by such severe occlusion. Secondly, forest roads typically exhibit poor geometric regularity, with erratic trajectories and non-uniform widths. Combined with the complexity of land cover and the spectral similarity among roads, shadows, and bare soil, road extraction accuracy is severely degraded. Furthermore, the scarcity of high-quality, annotated forest road datasets also suppresses the generalization capability of deep learning models.

2. Materials and Methods

2.1. Study Area and Data Preprocessing

2.1.1. Overview of the Study Area

This study selects Jinning District, Kunming, Yunnan Province as the experimental site (Figure 1), spanning geographic coordinates from 102°12′ E to 102°52′ E and 24°23′ N to 24°48′ N. The region is situated within the gently dissected mid-mountain zone of the Central Yunnan Plateau, with a topographical gradient descending from south to north. The vegetation is highly diverse, encompassing evergreen broad-leaved forests, temperate coniferous forests, deciduous broad-leaved forests, and shrublands. The selection of this area is underpinned by its high susceptibility to forest fires. Notably, a major wildfire occurred in April 2024, requiring the deployment of over 2300 emergency personnel [23].
The diverse land cover in Jinning District offers an ideal testbed for verifying the efficacy of multi-temporal extraction methods. This environment allows for a comprehensive assessment of algorithm performance under varying canopy closure and phenological shifts. The estimated distribution of tree species, based on historical vegetation inventories [24,25,26,27,28,29,30], is illustrated in Figure 1. Imagery from different seasons exhibits significant spectral complementarity: while summer imagery may suffer from data gaps due to dense canopy occlusion, the high spectral contrast between roads and vigorous vegetation facilitates precise boundary delineation. Conversely, winter imagery leverages deciduous phenology; reduced foliage cover increases canopy permeability, thereby exposing road segments previously obscured by the overstory. This is crucial for restoring topological connectivity and ensuring network integrity. Furthermore, even in non-deciduous zones, multi-temporal data remains indispensable. Variations in solar zenith angles, topographic shadows, and atmospheric conditions (e.g., cloud/mist interference) differ across timeframes. By integrating optimal observations from multiple phases, we effectively mitigate transient occlusions and enhance the radiometric stability of road features, ultimately enabling the construction of a continuous road network.

2.1.2. Data Acquisition and Temporal Phase Selection

The experimental dataset comprises Level-4 products from the Gaofen-2 (GF-2) satellite (Land Observation Satellite Data Service Platform, China Centre for Resources Satellite Data and Application, Beijing, China). The GF-2 Panchromatic and Multispectral Sensor (PMS) acquires data in one panchromatic band (450–900 nm) at 0.8 m spatial resolution and four multispectral bands at 3.2 m resolution: Band 1 (Blue, 450–520 nm), Band 2 (Green, 520–590 nm), Band 3 (Red, 630–690 nm), and Band 4 (Near-Infrared, 770–890 nm) [31]. These data were projected to the WGS84 coordinate system and underwent orthorectification, fine geometric correction, and radiometric calibration, ensuring high spatial and spectral fidelity for precise feature extraction. To exploit temporal complementarity, six scenes of multi-temporal imagery were acquired between 2022 and 2025, specifically targeting two distinct seasonal windows: summer (May–August) and winter (November–March). This selection captures critical phenological stages—the peak growing season in summer and the dormancy or senescence period in winter—thereby maximizing variance in canopy closure, solar illumination, and spectral response. Such diversity provides a robust basis for mitigating occlusion and ensuring road network continuity. Only imagery with a cloud cover fraction of less than 20% was retained. The final dataset consists of high-quality imagery acquired on 23 August 2022; 13 November 2022; 26 January 2023; 5 March 2024; 26 July 2024; and 13 February 2025.

2.1.3. Data Preprocessing and Dataset Construction

To fully exploit the spatial resolution advantages of the PMS onboard the GF-2 satellite, pansharpening was performed on all acquired imagery. This process fused the 0.8 m resolution panchromatic (PAN) band with the 3.2 m resolution multispectral (MS) bands, yielding sub-meter multispectral imagery that integrates high-frequency spatial details with rich spectral information. Specifically, the Gram-Schmidt (GS) transformation [32]—a robust component substitution (CS) method—was employed. The GS algorithm is particularly effective for forest road extraction as it enhances geometric fidelity and edge definition while minimizing spectral distortion, which is critical for delineating slender linear features in complex environments.
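To illustrate the component-substitution principle underlying Gram-Schmidt pansharpening, the sketch below builds a synthetic intensity as the mean of the (already upsampled) MS bands and injects the PAN detail with a band-specific covariance gain. This is a simplified stand-in for the full GS transform used in the study; the function name, the mean-based intensity, and the flat-list representation are illustrative assumptions.

```python
def gs_pansharpen(ms_bands, pan):
    """Simplified Gram-Schmidt-style component substitution (sketch).

    ms_bands: list of B bands, each a flat list of N pixel values,
              assumed already upsampled to the PAN grid.
    pan:      flat list of N panchromatic values.
    Returns B sharpened bands.
    """
    n = len(pan)
    # Synthetic low-resolution intensity: mean of the MS bands.
    intensity = [sum(b[i] for b in ms_bands) / len(ms_bands) for i in range(n)]

    mean_i = sum(intensity) / n
    var_i = sum((v - mean_i) ** 2 for v in intensity) / n
    # Spatial detail to inject: difference between PAN and the intensity.
    detail = [pan[i] - intensity[i] for i in range(n)]

    sharpened = []
    for band in ms_bands:
        mean_b = sum(band) / n
        # Band-specific injection gain: cov(band, intensity) / var(intensity).
        cov = sum((band[i] - mean_b) * (intensity[i] - mean_i)
                  for i in range(n)) / n
        gain = cov / var_i if var_i > 0 else 1.0
        sharpened.append([band[i] + gain * detail[i] for i in range(n)])
    return sharpened
```

When the PAN signal equals the synthetic intensity, no detail is injected and the MS bands pass through unchanged, which is a useful sanity check on the gain computation.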
The dataset developed for this study comprises two components: a general-purpose benchmark and a domain-specific forest dataset, utilizing a transfer learning strategy of large-scale pre-training followed by targeted fine-tuning.
The DeepGlobe Road Extraction Dataset [33] was collected from DigitalGlobe satellite imagery covering Thailand, Indonesia, and India. It contains 8570 RGB image pairs at 0.5 m spatial resolution with binary road/non-road pixel-level annotations, encompassing diverse road types including highways, rural roads, and unpaved tracks across varied land cover contexts. This dataset provides a broad foundation for learning general road geometric and topological features during pre-training.
It should be noted that the DeepGlobe dataset has a spatial resolution of 0.5 m, while the GF-2 fine-tuning data has a resolution of 0.8 m. Although both datasets use identically sized 1024 × 1024 pixel patches, a DeepGlobe patch covers approximately 512 m × 512 m on the ground, whereas a GF-2 patch covers approximately 819 m × 819 m. Consequently, road features occupy a proportionally larger pixel footprint in the pre-training data, which could bias the model toward wider road representations. This discrepancy is mitigated by the staged transfer learning strategy: the substantially reduced fine-tuning learning rate (1 × 10−5, an 8× reduction from pre-training) constrains parameter updates so that the model gradually recalibrates its scale-sensitive filters to the target resolution without catastrophic forgetting of the general road geometric and topological features acquired during pre-training. The training dynamics presented in the results section confirm the effectiveness of this strategy: during fine-tuning, the validation IoU rose sharply from epoch 1 through epoch 8, with no initial performance degradation, indicating rapid adaptation to the target resolution. These observations confirm that the residual resolution discrepancy between the two datasets does not adversely affect the transfer learning outcome.
The self-prepared Multi-temporal Forest Area Remote Sensing Dataset was constructed from the six processed GF-2 pansharpened scenes (0.8 m resolution, 4 bands). Unlike DeepGlobe, this dataset is characterized by: (1) a substantially lower road pixel proportion (<5% vs. approximately 15% in DeepGlobe); (2) four spectral bands including near-infrared, providing vegetation discrimination capability; (3) multi-temporal coverage enabling cross-seasonal annotation verification; and (4) exclusive focus on forest roads under dense canopy conditions.
The fused GF-2 scenes were partitioned into 1024 × 1024 pixel sub-patches using a sliding window approach with a stride of 717 pixels (30% overlap), preserving spatial continuity at patch boundaries. Cross-temporal geometric consistency is guaranteed by the Level-4 orthorectification: all six temporal scenes share an identical WGS84 geographic reference frame, enabling the same fixed partitioning grid to be applied uniformly across all acquisition dates. The residual co-registration error between temporal phases is within 1 pixel (approximately 0.8 m), well within the tolerance for the logical OR fusion operation.
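The sliding-window partitioning can be reproduced in a few lines; the 717-pixel stride falls directly out of the 30% overlap on 1024-pixel patches (round(1024 × 0.7) = 717). The helper below is a minimal sketch with names of our choosing, clamping the last patch in each axis to the image border so no partial patches are produced.

```python
def tile_origins(width, height, patch=1024, overlap=0.30):
    """Compute top-left (x, y) origins of sliding-window patches.

    A 30% overlap on 1024-pixel patches yields a stride of
    round(1024 * (1 - 0.30)) = 717 pixels. The final patch along each
    axis is clamped so it ends exactly at the image border.
    """
    stride = round(patch * (1 - overlap))

    def axis(size):
        coords = list(range(0, max(size - patch, 0) + 1, stride))
        # Ensure the final patch reaches the image edge.
        if coords[-1] + patch < size:
            coords.append(size - patch)
        return coords

    return [(x, y) for y in axis(height) for x in axis(width)]
```

Applying the same fixed grid to every co-registered temporal scene is what makes the later per-pixel fusion possible.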
From the six temporal scenes, a total of 2351 candidate patches was initially generated. During manual screening, 895 patches (38%) were excluded due to cloud cover exceeding 30%, extreme topographic shadow occlusion, or motion blur artifacts, yielding the final 1456 high-quality image pairs. The road pixel distribution in this dataset is relatively sparse: more than 90% of the patches have a road pixel ratio lower than 5%, and only a small number of patches contain dense road regions. The average road pixel fraction across all patches is 4.1%.
Road annotations were manually delineated by two trained annotators using LabelMe (MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA). A multi-temporal cross-referencing protocol was employed: annotators simultaneously examined imagery from different seasons at identical geographic coordinates to identify road segments obscured by seasonal canopy. To accelerate labeling, a semi-automatic model-in-the-loop strategy was adopted: after initial manual annotation of approximately 30% of the dataset, an intermediate model was trained to generate preliminary masks for the remaining patches, which were then manually refined.
For the DeepGlobe dataset, the official predefined split was adopted [33]. For the forest dataset, the 1456 pairs were divided into training (80%), validation (10%), and test (10%) sets using spatially stratified random sampling. All patches originating from the same geographic scene were assigned to the same partition to prevent data leakage from the 30% spatial overlap, and the split ensured proportional representation of both summer and winter temporal phases.
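The scene-grouped splitting described above can be sketched as follows. The scene identifiers and proportions here are illustrative assumptions; the essential property is that all patches sharing a source scene land in one partition, so the 30% spatial overlap between neighbouring patches can never leak information across splits.

```python
import random

def scene_level_split(patches, train=0.8, val=0.1, seed=42):
    """Split patches into train/val/test with scene-level grouping.

    patches: list of (patch_id, scene_id) tuples, where scene_id is
    a hypothetical identifier for the source acquisition scene.
    Scenes (not individual patches) are shuffled and partitioned, so
    spatially overlapping patches always share a partition.
    """
    scenes = sorted({s for _, s in patches})
    rng = random.Random(seed)
    rng.shuffle(scenes)
    n_train = round(len(scenes) * train)
    n_val = round(len(scenes) * val)
    groups = {
        "train": set(scenes[:n_train]),
        "val": set(scenes[n_train:n_train + n_val]),
        "test": set(scenes[n_train + n_val:]),
    }
    # Assign each patch to the partition of its parent scene.
    return {split: [p for p, s in patches if s in ids]
            for split, ids in groups.items()}
```

With only a handful of scenes, exact 80/10/10 proportions at the patch level are approximate; the leakage guarantee is the point of the construction.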
The statistical characteristics of the finalized dataset are summarized in Table 1.

2.2. Methodology

This study develops a comprehensive, multi-stage framework. The overall operational workflow is illustrated in Figure 2, which depicts the end-to-end pipeline from data acquisition to emergency decision support. The pipeline comprises six stages: (1) multi-temporal GF-2 satellite imagery acquisition across summer and winter seasonal windows; (2) preprocessing including radiometric calibration, orthorectification, and Gram-Schmidt pansharpening; (3) per-scene semantic road extraction using the enhanced DeepLabV3+ model; (4) multi-temporal mask fusion via georeferenced logical OR union with morphological refinement; (5) vector road network generation and quality assessment; and (6) integration into GIS-based emergency management platforms for route optimization and rescue force deployment. The following subsections detail the core technical components of stages (3) and (4), which constitute the principal methodological contributions of this study.
The methodology comprises three components. First, an enhanced DeepLabv3+ architecture is adopted as the foundational framework. Specifically, to bolster its capacity for capturing fine-grained linear features, a Multi-Scale Dynamic Spatial-Channel Attention (MCSA) mechanism is integrated, and the Atrous Spatial Pyramid Pooling (ASPP) module is reconfigured with strip pooling strategies to enlarge the receptive field while preserving spatial precision. Second, to address the extreme class imbalance inherent in forest remote sensing imagery—where road pixels are typically overwhelmed by background vegetation—a composite loss function is designed to mitigate training bias and enhance the model's sensitivity to underrepresented road pixels. Third, capitalizing on precise georeference information, a multi-temporal mask fusion strategy is constructed to merge road extraction results from multi-temporal remote sensing images, enhancing the topological continuity and overall integrity of the recognized road network.

2.2.1. Network Architecture Selection

Currently, mainstream semantic segmentation networks for remote sensing road extraction—including FCN [7], U-Net [8], SegNet [9], and DeepLabv3+ [34]—employ distinct encoder–decoder structural paradigms. While these models excel in urban environments, their intrinsic mechanisms for spatial detail preservation vary significantly when capturing the fine-grained linear structures typical of forest roads. Comprehensive surveys of deep learning-based road extraction [35,36] confirm that these classic architectures remain widely adopted benchmarks in remote sensing road extraction due to their computational efficiency, mature implementations, and robust feature representation.
It should be noted that several recently prominent architectures were considered but not included in the primary comparison. HRNet [37] maintains high-resolution representations throughout the network by connecting multi-resolution subnetworks in parallel. While this approach excels at preserving spatial detail, its computational cost is substantially higher than DeepLabV3+, and its design philosophy—maintaining multiple parallel resolution streams—does not provide a natural insertion point for the strip pooling module, which specifically targets the ASPP’s multi-scale aggregation mechanism. The ASPP module in DeepLabV3+ offers a uniquely modular architecture that allows strip pooling to be integrated as a parallel branch without disrupting the core feature extraction pipeline. YOLO-based segmentation (e.g., YOLOv8-Seg) [38] is primarily designed for instance segmentation in real-time detection scenarios, where the objective is to segment individual object instances with bounding box priors. Forest road extraction, however, is a binary semantic segmentation task where roads form continuous, interconnected networks rather than discrete object instances. The instance-level paradigm of YOLO is architecturally misaligned with the requirement to produce spatially continuous road masks. SAM (Segment Anything Model) [39] is a foundation model designed for interactive, prompt-based segmentation. While powerful for zero-shot generalization, SAM requires user prompts (points, boxes, or text) for each segmentation target and is not designed for fully automatic, end-to-end semantic segmentation of specific categories. Its computational requirements (ViT-H backbone) are also substantially higher than the encoder–decoder architectures evaluated in this study. Recent work [40] has explored fine-tuning SAM for remote sensing tasks, but this remains an emerging area that falls outside the scope of the current encoder–decoder framework comparison.
DeepLabv3+ was specifically selected as the foundational framework because its architecture offers unique advantages for the proposed modifications: the ASPP module provides a natural, modular integration point for the strip pooling branch, and the encoder–decoder structure with explicit skip connections enables seamless insertion of the MCSA mechanism. This architectural compatibility is the primary selection criterion, rather than raw baseline performance, as the goal is to demonstrate the effectiveness of the proposed enhancement modules within a well-understood framework.
Due to its architectural advantages in multi-scale feature representation, this study adopts DeepLabv3+ as its foundational framework. As illustrated in Figure 3, the proposed model builds upon DeepLabv3+ by integrating a Strip Pooling branch into the ASPP module and embedding the MCSA mechanism in the decoder’s skip connection. The atrous convolution within the encoder expands the receptive field by inserting zeros between filter weights (adjusting the effective sampling rate) without increasing the parameter count or sacrificing feature map resolution. This allows the model to capture long-range spatial dependencies while preserving the high-resolution spatial details necessary for road delineation. Furthermore, the ASPP module—the core of the DeepLabv3+ encoder—executes parallel convolutions at multiple atrous rates. This mechanism facilitates the effective aggregation of multi-scale features, enabling the simultaneous perception of localized textures and global topological structures. A comparative analysis of DeepLabv3+ versus other mainstream architectures regarding their adaptability to forest road extraction is summarized in Table 2.
The suitability assessments in Table 2 are grounded in each architecture’s inherent mechanism for spatial detail preservation. FCN [7] performs progressive downsampling through pooling layers, resulting in significant spatial detail loss; its reliance on a single bilinear upsampling step for recovery limits its capacity to reconstruct fine-grained linear features, leading to a ‘Low’ rating. U-Net [8] employs symmetric skip connections that concatenate encoder features with decoder features at corresponding scales, providing a partial compensation mechanism for spatial information loss; however, the bottleneck layer still induces information compression, yielding a ‘Medium’ rating. SegNet [9] preserves max-pooling indices from the encoder and uses them during upsampling to recover spatial positions; while this approach retains positional information, it does not preserve the feature values themselves, resulting in a ‘Medium–Low’ rating for elongated, weak targets. DeepLabV3+ [34] fundamentally differs by employing atrous (dilated) convolutions that expand the receptive field without reducing feature map resolution, thus maintaining spatial detail throughout the encoding process. Combined with its encoder–decoder structure and ASPP module for multi-scale context aggregation, DeepLabV3+ achieves a ‘High’ suitability rating for forest road extraction. These assessments are consistent with comparative analyses reported in the literature [34,35].
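The receptive-field property of atrous convolution can be made concrete: inserting (r − 1) zeros between the taps of a k-tap kernel stretches its effective extent to k + (k − 1)(r − 1) while the parameter count stays at k per axis. A minimal helper (the function name is ours):

```python
def effective_kernel(k, rate):
    """Effective kernel size of an atrous (dilated) convolution.

    Inserting (rate - 1) zeros between the taps of a k-tap kernel
    stretches it to k + (k - 1) * (rate - 1) samples without adding
    any learnable parameters.
    """
    return k + (k - 1) * (rate - 1)

# The ASPP module in DeepLabv3+ commonly applies 3x3 kernels at
# atrous rates such as 6, 12, and 18, giving effective extents of
# 13, 25, and 37 pixels respectively from the same 9 weights.
```

This is why the encoder can aggregate context over tens of pixels, enough to bridge short canopy gaps along a road, without downsampling the feature map.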

2.2.2. Multi-Scale Dynamic Spatial-Channel Attention (MCSA) Module

To mitigate road fragmentation caused by dense canopy occlusion, this study integrates the Multi-scale Dynamic Spatial-Channel Attention (MCSA) mechanism [41] into the skip connections between the encoder and decoder. This mechanism facilitates dynamic feature recalibration, adaptively assigning weights to amplify road-specific representations while suppressing interference from complex heterogeneous backgrounds. As illustrated in Figure 4 and Figure 5, MCSA is architecturally composed of three synergistic components: the Multi-head Embedded Patch (MEP) module, Multi-layer Dynamic Channel Attention (MDCA), and Multi-layer Dynamic Spatial Attention (MDSA).
The MEP module, comprising Patch Adaptive Generation (PAG) and Deep Patch Warping (DPW), is specifically designed to resolve spatial misalignments inherent in skip connections. By concurrently applying MDCA (channel-wise) and MDSA (spatial-wise), the MCSA mechanism optimizes spatial saliency while preserving multi-scale channel context and capturing subtle spectral variations. This dual-domain attention enables the model to discriminate road features from spectrally similar “non-road” elements, such as bare soil and topographic shadows, effectively mitigating the “same-spectrum-different-object” challenge prevalent in complex forest environments and thereby significantly enhancing the topological continuity and structural integrity of the extracted road networks.

2.2.3. Enhanced ASPP Module via Strip Pooling

The standard ASPP module in DeepLabv3+ captures multi-scale contextual information through parallel atrous convolutions with isotropic receptive fields. While effective for compact objects, this design is suboptimal for forest roads—characterized by their slender, fragmented, and blurry-edged profiles. When isotropic kernels expand their receptive field to capture long-range context, they inevitably incorporate a significant amount of irrelevant contextual noise from areas flanking the road (e.g., dense canopy, bare soil, and topographic shadows). This effectively attenuates the feature saliency of the faint linear road signals, exacerbating discontinuities and generating ‘void’ artifacts in the extraction masks.
To mitigate these issues, this study integrates Strip Pooling [42] into the ASPP framework. Strip Pooling employs highly anisotropic 1 × N (horizontal) and N × 1 (vertical) pooling kernels, which aggregate long-range context through one-dimensional global feature averaging. This architectural shift enables the model to establish long-range spatial dependencies between fragmented road segments along their longitudinal axis, while effectively filtering out interference from off-road distractors. To maintain the strengths of the original architecture, the Strip Pooling module is embedded as an auxiliary parallel branch alongside the standard atrous convolution paths. This configuration supplements the model’s ability to capture rectilinear features without sacrificing multi-scale versatility. Following the parallel processing, features from all branches undergo channel-wise concatenation and are subsequently fused via a 1 × 1 convolution for dimensionality reduction. The refined architecture is depicted in Figure 6. Each convolutional block within the module follows the ‘Conv + Batch Normalization (BN) + ReLU’ sequence, ensuring robust gradient flow and feature stability.
The Strip Pooling branch was configured with horizontal (1 × W) and vertical (H × 1) global average pooling kernels. Each pooling output undergoes 1 × 1 convolution for channel reduction (to 256 channels), batch normalization, and ReLU activation. After bilinear upsampling to restore H × W resolution, horizontal and vertical features are fused via element-wise summation, followed by a 1 × 1 convolution with sigmoid activation. The Strip Pooling output is concatenated with the five standard ASPP branches, forming a six-branch concatenation subsequently reduced via 1 × 1 convolution. Its contribution was isolated in the ablation study, where it independently improved IoU by 3.4 percentage points over the baseline.
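A single-channel, pure-Python sketch of the strip pooling mechanism may help clarify the configuration above. It reproduces the row-wise (1 × W) and column-wise (H × 1) global averages, their broadcast fusion by summation, and the sigmoid gating, but omits the 1 × 1 convolutions, batch normalization, and multi-channel handling of the actual module; it is an illustrative reduction, not the trained branch.

```python
import math

def strip_pool_attention(fmap):
    """Minimal single-channel strip pooling sketch.

    fmap: H x W list of lists. Horizontal (1 x W) pooling yields one
    average per row; vertical (H x 1) pooling yields one average per
    column. The two are broadcast back to H x W, summed, squashed by
    a sigmoid, and applied multiplicatively as an attention map.
    """
    h, w = len(fmap), len(fmap[0])
    row_avg = [sum(r) / w for r in fmap]                               # 1 x W kernel
    col_avg = [sum(fmap[i][j] for i in range(h)) / h for j in range(w)]  # H x 1 kernel
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            # Fuse the two directional contexts and gate the input.
            a = 1.0 / (1.0 + math.exp(-(row_avg[i] + col_avg[j])))
            row.append(fmap[i][j] * a)
        out.append(row)
    return out
```

Because each output pixel sees the average of its entire row and column, responses along a faint linear feature reinforce one another across the full image extent, which is the behaviour the anisotropic kernels are meant to provide.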

2.2.4. Loss Function Design

Forest roads represent typical fine-grained and sparse targets in remote sensing imagery, leading to an acute semantic class imbalance. In the forest dataset constructed for this study, the mean road pixel proportion is only 4.1% (detailed in Section 2.1.3), far below that of the dominant background. In such highly imbalanced scenarios, the standard Cross-Entropy (CE) Loss tends to bias the optimization toward the dominant background class [43], as it treats all correctly classified pixels equally regardless of class frequency. Consequently, while the model may achieve high global accuracy, its performance on the core task of road feature extraction remains suboptimal due to significant false negatives.
To mitigate this training bias, this study explored an optimal combination strategy tailored to the class imbalance inherent in forest datasets. We propose a composite objective function ( L total ) that integrates Focal Loss (which prioritizes hard-to-classify pixels) and Dice Loss (which optimizes global structural similarity) through weighted fusion. The formulation is defined in Equation (1):
$L_{\mathrm{total}} = \alpha \cdot L_{\mathrm{Focal}} + \beta \cdot L_{\mathrm{Dice}}$
where α and β represent the weight coefficients assigned to Focal Loss and Dice Loss, respectively. To determine the optimal weight distribution, comparative experiments were conducted with α : β ratios set at 1:0, 0.7:0.3, 0.5:0.5, 0.3:0.7, and 0:1.
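Equation (1) can be sketched in plain Python as follows. The focusing parameter γ = 2 used in the Focal term is the common default and an assumption here, since the text does not state its value; function names are illustrative:

```python
import math

def focal_loss(probs, labels, gamma=2.0):
    """Mean binary focal loss; gamma=2.0 is an assumed common default."""
    total = 0.0
    for p, y in zip(probs, labels):
        pt = p if y == 1 else 1.0 - p            # probability of the true class
        total += -((1.0 - pt) ** gamma) * math.log(max(pt, 1e-7))
    return total / len(probs)

def dice_loss(probs, labels, eps=1e-7):
    """Soft Dice loss: 1 - 2|P ∩ Y| / (|P| + |Y|)."""
    inter = sum(p * y for p, y in zip(probs, labels))
    return 1.0 - (2.0 * inter + eps) / (sum(probs) + sum(labels) + eps)

def composite_loss(probs, labels, alpha=0.7, beta=0.3):
    """L_total = alpha * L_Focal + beta * L_Dice, as in Equation (1)."""
    return alpha * focal_loss(probs, labels) + beta * dice_loss(probs, labels)
```

The defaults α = 0.7, β = 0.3 correspond to the best-performing ratio reported in Section 3.3; in practice the terms operate on per-pixel sigmoid outputs flattened from the prediction map.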

2.2.5. Multi-Temporal Fusion Strategy

To mitigate the discontinuities in road extraction caused by seasonal phenology and topographic shadows, this study develops a multi-temporal path fusion framework anchored by georeferenced alignment. The core logic involves mapping road masks from various temporal phases into a unified geospatial coordinate system. This ensures that observations captured under fluctuating illumination and phenological conditions achieve precise cross-temporal alignment, providing a foundation for restoring topological structures through temporal feature complementarity.
High-precision single-temporal extraction is the prerequisite for this framework. Each preprocessed scene is independently processed using the enhanced DeepLabv3+ model to generate temporal-specific road masks. These masks, along with their corresponding probability maps, were geospatially aligned to the WGS84 coordinate system of the original GF-2 imagery, ensuring a pixel-to-geographic-coordinate correspondence.
Building upon the aligned masks, this study employs a logical union completion strategy to ensure network integrity, capitalizing on the fact that road positions are fixed while occlusions are seasonally dynamic. In this phase, a pixel-wise logical ‘OR’ operation is performed to generate a preliminary candidate road network. While this strategy effectively maximizes recall by capturing road segments across seasons, it inherently increases the potential for false-positive noise from seasonal spectral interference. To resolve this, a rigorous post-processing pipeline—centered on Connected Component Analysis (CCA) and morphological filtering—was integrated. Crucially, this step selectively filters out the fragmented, non-topological pixels present in single-season results (particularly the noise in winter data), ensuring the final output preserves only high-integrity connective backbones as shown in our visualization. For instance, road segments exposed during the winter senescence period can bridge the gaps caused by dense canopy occlusion in summer imagery. Conversely, the high spectral contrast between roads and vigorous vegetation in summer provides sharp boundary constraints for segments that appear blurred in winter imagery.
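The pixel-wise union over co-registered seasonal masks amounts to a logical ‘OR’ at each grid cell; a minimal sketch, with plain Python lists standing in for raster arrays:

```python
def fuse_masks(masks):
    """Pixel-wise logical OR across co-registered seasonal road masks.

    masks: list of H x W binary (0/1) grids, already aligned to the same
    geospatial grid. A pixel is road in the candidate network if ANY
    temporal phase observed it as road.
    """
    H, W = len(masks[0]), len(masks[0][0])
    return [[int(any(m[i][j] for m in masks)) for j in range(W)]
            for i in range(H)]
```

A summer mask broken by canopy occlusion is thus bridged wherever a winter mask observed the hidden segment, at the cost of admitting winter's false positives, which the subsequent post-processing must remove.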
The fused output undergoes morphological refinement to produce the final forest road vector data. Specifically, a morphological closing operation using a disk-shaped structuring element with a radius of 2 pixels was applied to bridge micro-gaps within predicted road segments; it was implemented with the morphologyEx function (MORPH_CLOSE operator) of OpenCV (open-source library, https://opencv.org). Subsequently, Connected Component Analysis (CCA) was performed using OpenCV’s connectedComponentsWithStats function (based on the two-pass labeling algorithm [44]) to label all spatially connected regions in the binary road mask. Components with an area below 200 pixels (approximately 128 m² at 0.8 m resolution) were classified as isolated noise and removed, retaining only road segments with meaningful spatial extent and topological connectivity. The structuring element shape and size, as well as the area threshold, were determined empirically through visual inspection of intermediate results on a held-out validation subset. This framework effectively integrates discrete temporal features into a coherent road network, offering a robust solution for optical remote sensing applications in heavily obscured environments.
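The area-based filtering step can be illustrated with a framework-free stand-in for OpenCV's component labeling; BFS labeling and 8-connectivity are simplifying assumptions for this sketch (OpenCV uses a two-pass algorithm, with 8-connectivity as its default):

```python
from collections import deque

def remove_small_components(mask, min_area=200):
    """Keep only 8-connected components of at least min_area pixels.

    mask: H x W binary (0/1) grid. Mirrors the CCA filtering step: label
    each connected region, measure its area, and drop regions below the
    noise threshold (200 pixels in the paper's configuration).
    """
    H, W = len(mask), len(mask[0])
    seen = [[False] * W for _ in range(H)]
    out = [[0] * W for _ in range(H)]
    nbrs = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0)]
    for i in range(H):
        for j in range(W):
            if mask[i][j] and not seen[i][j]:
                comp, q = [], deque([(i, j)])
                seen[i][j] = True
                while q:                          # BFS over one component
                    ci, cj = q.popleft()
                    comp.append((ci, cj))
                    for di, dj in nbrs:
                        ni, nj = ci + di, cj + dj
                        if 0 <= ni < H and 0 <= nj < W and mask[ni][nj] and not seen[ni][nj]:
                            seen[ni][nj] = True
                            q.append((ni, nj))
                if len(comp) >= min_area:         # retain only sizeable segments
                    for ci, cj in comp:
                        out[ci][cj] = 1
    return out
```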

3. Results

3.1. Experimental Details and Evaluation Metrics

3.1.1. Implementation Details and Training Strategy

All experiments in this study were conducted on a high-performance workstation equipped with an NVIDIA RTX 4090 GPU (40 GB VRAM; NVIDIA Corporation, Santa Clara, CA, USA), utilizing a software environment based on Python 3.8 (Python Software Foundation, Wilmington, DE, USA), PyTorch 1.10 (Meta Platforms, Inc., Menlo Park, CA, USA), and CUDA 11.3 (NVIDIA Corporation, Santa Clara, CA, USA). The segmentation-models-pytorch framework (Pavel Iakubovskii, open-source, available at: https://github.com/qubvel/segmentation_models.pytorch, accessed on 19 February 2026) was employed for model construction. To address the scarcity of forest road samples and mitigate the risk of overfitting, a staged transfer learning strategy was implemented, transitioning from general road features to domain-specific forest characteristics.
The pre-training phase utilized the public DeepGlobe dataset for 10 training epochs. The Adam optimizer was employed with an initial learning rate of 8 × 10⁻⁵, integrated with a CosineAnnealingWarmRestarts scheduler (T₀ = 1, T_mult = 2, η_min = 5 × 10⁻⁵). This configuration promoted the capture of general remote sensing features through periodic learning rate adjustments. Dice Loss was adopted as the objective function, while the Intersection over Union (IoU) at a 0.5 threshold served as the core monitoring metric for evaluating the model’s segmentation performance on linear features such as roads in real time.
The fine-tuning phase was conducted using a self-built forest remote sensing dataset, with the training duration extended to 25 epochs to adapt to specific scene characteristics. While the Adam optimizer was retained, the initial learning rate was reduced to 1 × 10⁻⁵ to avoid excessive fluctuations in the pre-trained parameters. Concurrently, a ReduceLROnPlateau scheduler (mode: max, patience: 3, factor: 0.5) was implemented to automatically decrease the learning rate if the validation IoU stagnated for three consecutive epochs. Considering the complex land cover and the low proportion of road pixels in forested areas, a composite loss mechanism was utilized to balance class imbalance and boundary segmentation precision. Additionally, an early stopping strategy was activated to terminate training if no improvement in validation IoU was observed for 10 consecutive epochs, thereby ensuring optimal model generalization.
The hyperparameter configurations were determined through systematic selection. During pre-training, a higher initial learning rate of 8 × 10⁻⁵ was adopted to facilitate efficient convergence across the large-scale DeepGlobe dataset. The CosineAnnealingWarmRestarts scheduler [45] was selected for its periodic learning rate resets that help escape local minima during this exploratory phase. In contrast, fine-tuning employs a substantially lower initial learning rate of 1 × 10⁻⁵ (an 8× reduction)—a standard practice in transfer learning [46]—to prevent catastrophic forgetting of pre-trained representations. The ReduceLROnPlateau scheduler was chosen for fine-tuning due to its adaptive nature: it monitors the validation IoU and reduces the learning rate by a factor of 0.5 only when the metric stagnates for three consecutive epochs (patience = 3), thereby automatically balancing convergence speed against overfitting risk. The Adam optimizer [47] was selected for its adaptive per-parameter learning rates suited to sparse gradient patterns in road segmentation. The batch size of 8 was determined by GPU memory constraints, and the early stopping patience of 10 epochs was validated through preliminary experiments.
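The warm-restart schedule used in pre-training follows the standard SGDR formula, η_t = η_min + ½(η_max − η_min)(1 + cos(π·t/T_i)), with cycle lengths T₀, T₀·T_mult, T₀·T_mult², …. A small sketch with the stated settings (epoch-level granularity is a simplification here; the PyTorch scheduler can also step fractionally within epochs):

```python
import math

def cosine_warm_restart_lr(epoch, eta_max=8e-5, eta_min=5e-5, T0=1, T_mult=2):
    """Learning rate at a given epoch under CosineAnnealingWarmRestarts
    with the pre-training settings above (T0=1, T_mult=2, eta_min=5e-5)."""
    T_i, t = T0, epoch
    while t >= T_i:          # locate the current restart cycle
        t -= T_i
        T_i *= T_mult        # each cycle is T_mult times longer than the last
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T_i))
```

At every restart boundary the rate jumps back to η_max = 8 × 10⁻⁵, then decays toward η_min = 5 × 10⁻⁵ over the lengthening cycle, which is the periodic-reset behavior credited with escaping local minima.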

3.1.2. Evaluation Metrics

The model performance is assessed by two categories of evaluation metrics:
(1)
Pixel-level Accuracy Metrics: These metrics assess the fundamental performance of the model in pixel-wise classification. In the context of forest road extraction, high Precision indicates high reliability of the predicted roads with fewer false positives (over-detection), while high Recall implies the model’s ability to identify as many ground-truth roads as possible with fewer false negatives (omissions) and better connectivity. The F1-score, as the harmonic mean of Precision and Recall, provides a balanced comprehensive evaluation. Intersection over Union (IoU) is the gold standard for measuring the overlap between the predicted mask and the ground truth, serving as one of the most rigorous and commonly used core metrics in semantic segmentation tasks. The calculation formulas for these metrics are as follows:
$\mathrm{Accuracy} = \dfrac{TP + TN}{TP + FP + TN + FN}$
$\mathrm{Precision} = \dfrac{TP}{TP + FP}$
$\mathrm{Recall} = \dfrac{TP}{TP + FN}$
$\mathrm{F1} = \dfrac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
$\mathrm{IoU} = \dfrac{TP}{TP + FP + FN}$
where TP, FP, FN, and TN represent the number of pixels for true positives, false positives, false negatives, and true negatives, respectively.
(2)
Robustness Metrics for Class Imbalance: Since standard Accuracy can be misleading in class-imbalanced scenarios, Balanced Accuracy and the Matthews Correlation Coefficient (MCC) were introduced. The former treats the sparse foreground (roads) and the dominant background equally by calculating the average recall of both classes, thereby fairly reflecting the model’s recognition capability for rare road categories. The latter, which comprehensively accounts for TP, FP, FN, and TN, is considered one of the most robust evaluation metrics for imbalanced datasets. Its value ranges from −1 (complete misclassification) to +1 (perfect prediction), with 0 representing random guessing. The relevant formulas are as follows:
$\mathrm{Balanced\ Accuracy} = \dfrac{1}{2}\left(\dfrac{TP}{TP + FN} + \dfrac{TN}{TN + FP}\right)$
$\mathrm{MCC} = \dfrac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$
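All seven metrics above can be computed jointly from the four pixel confusion counts; a compact reference implementation (the function name is illustrative):

```python
import math

def segmentation_metrics(tp, fp, fn, tn):
    """Compute the evaluation metrics above from pixel confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    balanced_acc = 0.5 * (tp / (tp + fn) + tn / (tn + fp))
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"Precision": precision, "Recall": recall, "F1": f1, "IoU": iou,
            "Accuracy": accuracy, "BalancedAccuracy": balanced_acc, "MCC": mcc}
```

Note how a road class occupying a few percent of pixels lets Accuracy stay high while IoU and MCC remain low, which is why the latter two anchor the comparisons in Section 3.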

3.2. Ablation Study

To verify the effectiveness of the Multi-Scale Channel Attention (MCSA) mechanism and the improved ASPP module within the proposed architecture, progressive ablation experiments were conducted using the standard DeepLabv3+ as the baseline. All experiments were performed under identical conditions, and the results were compared against classic models, including FCN, SegNet, and U-Net.
The quantitative results are shown in Table 3. A comparison of standard segmentation models revealed that FCN and SegNet exhibited limited performance in complex forest environments, with IoU values of only 29.8% and 30.2%, respectively. U-Net showed a marginal improvement, achieving a Precision of 40.2% and an IoU of 31.8%. The baseline model, while slightly outperforming these traditional frameworks with a Precision of 40.6% and an IoU of 32.2%, was still constrained by false positives and connectivity issues in complex forest backgrounds. In contrast, each enhancement module demonstrated significant performance gains. The independent introduction of the MCSA module effectively attenuated spectral interference from tree canopies, increasing the Precision to 46.8% and the F1-score to 53.7%. The standalone application of the improved ASPP module, leveraging its ability to capture long-range contextual information, enhanced the Recall to 62.4% and the IoU to 35.6%. Ultimately, the complete model integrating both modules achieved the best synergistic effect, with the Precision rising to 51.6%, the Recall reaching 64.1%, and the IoU achieving 39.9%. Furthermore, the Matthews Correlation Coefficient (MCC), which reflects model robustness in class-imbalanced scenarios, improved to 51.5%. As illustrated in Figure 7, the visual comparison of segmentation results further confirms these quantitative findings, demonstrating that the proposed model produces more continuous road structures and fewer false positives than the baseline methods. These comprehensive results validate the complementarity and superiority of the proposed modules for road extraction in complex forest environments.

3.3. Loss Function Modification

To address the extreme class imbalance caused by the low proportion of road pixels (<5%) in forested areas, this study evaluated the impact of different weight configurations within the composite loss function defined in Equation (1). As shown in Table 4, the experiments compared the baseline (CE loss) with five weight configurations. The results indicate that the baseline model was severely hindered by class imbalance, yielding an IoU of only 32.2%. Among the composite configurations, Configuration 2 (α = 0.7, β = 0.3) achieved the best performance, with a Precision of 53.1%, an F1-score of 57.8%, and an IoU of 40.6%, an 8.4 percentage point increase over the baseline. This suggests that a higher Focal Loss weight (α = 0.7) is essential to guide the model’s focus toward hard-to-classify road pixels under the extreme class imbalance inherent in forest datasets, while the auxiliary Dice Loss (β = 0.3) ensures global structural similarity. Appropriately balancing Focal Loss (targeting pixel-level classification difficulty) and Dice Loss (targeting global structural similarity) thus enables the model to maintain high precision while preserving the structural integrity of forest roads. Furthermore, the Matthews Correlation Coefficient (MCC) reached 52.3% under this configuration, indicating enhanced robustness in identifying sparse road targets.
To further verify the cumulative effect of modified loss function, Table 5 presents the superimposed results of network architecture optimization and loss function refinement. The analysis reveals that while individual improvements significantly boosted performance (with IoUs reaching 40.6% and 39.9%, respectively), the combined effect was the most prominent. Ultimately, the fully improved model achieved a Precision of 53.8%, a Recall of 64.9%, an IoU of 41.6%, and an MCC of 53.2%. This performance trajectory confirms the high synergy between architecture-side feature enhancement and loss-function-side sample balancing, collectively forming an effective extraction mechanism for complex forest scenarios.
To provide further insight into the training dynamics of the proposed model, Figure 8 illustrates the loss evolution and validation IoU across both the pre-training and fine-tuning stages. During pre-training on the DeepGlobe dataset, the Dice Loss decreased steadily from 0.56 to 0.22 over 10 epochs, while the validation IoU improved from 0.38 to 0.62, confirming effective acquisition of general road features from the ImageNet-pretrained ResNet-101 backbone. During fine-tuning on the forest dataset, the composite loss—comprising Focal Loss (α = 0.7) and Dice Loss (β = 0.3)—exhibited a consistent downward trend, with Focal Loss converging faster due to its emphasis on hard-to-classify pixels. The validation IoU rose rapidly in the first five epochs before plateauing around 0.41–0.42, reflecting the diminishing marginal gains typical of domain-specific adaptation. Two learning rate reduction events (triggered at approximately epochs 12 and 18 by the ReduceLROnPlateau scheduler with patience = 3) produced modest subsequent improvements, and early stopping was activated at epoch 25. The final metrics reported in Table 3, Table 4, Table 5 and Table 6 were evaluated on the independent test set, which accounts for the minor difference from the validation IoU peak observed during training.
In addition to the improvement in quantitative metrics, the visual enhancement of the extraction results further validates the effectiveness of the proposed strategies. As illustrated in Figure 9, the extraction of forest roads demonstrates significant progress in structural integrity after optimizing the loss function. While the baseline model exhibited numerous fractures and holes due to the impact of class imbalance, the prediction masks generated by the composite loss function tend to be more continuous, effectively outlining the long-distance topological structure of forest paths. Furthermore, Figure 9 presents the extraction results under the joint effect of the composite loss function and the improved architecture. It can be observed that, through the synergy of the MCSA mechanism and the strip pooling-based ASPP module, the model not only accurately identifies weak targets partially obscured by tree canopies but also substantially reduces false positives by suppressing background noise such as forest clearings and shadows. This comprehensive optimization, spanning from pixel-wise classification to geometric morphology, brings the extraction results closer to the ground truth, fully meeting the high requirements for road network continuity in forest fire rescue operations.

3.4. Experimental Results of the Multi-Temporal Fusion Strategy

To address the discrepancies in forest road extraction across multi-temporal remote sensing imagery, this study selected two summer scenes and four winter scenes from GF-2 satellite imagery (spanning 2022 to 2025). We conducted road extraction experiments using an improved DeepLabV3+ model followed by a multi-temporal fusion process. This approach was designed to verify the effectiveness of temporal complementarity in mitigating seasonal vegetation interference. The experimental results are summarized in Table 6 and Figure 10.
Analysis of the single-temporal results reveals distinct complementary characteristics between the summer and winter datasets. Summer imagery, benefiting from the high contrast and clear boundaries between roads and dense vegetation, exhibited superior precision metrics, with Precision and Balanced Accuracy reaching 55.3% and 62.5%, respectively. However, significant obscuration by dense forest canopies led to severe fragmentation and missing segments in the extracted road networks, resulting in a Recall of only 60.2%. In contrast, winter imagery benefited from reduced canopy cover during the deciduous period, which increased the Recall to 67.8% and allowed for the detection of occluded road segments missed in summer. Nevertheless, the reduction in vegetation intensified the spectral confusion between exposed soil and road surfaces, causing the Precision (49.7%) and Intersection over Union (IoU, 39.5%) to fall below the summer levels.
Following the multi-temporal fusion, the comprehensive performance metrics—including the F1-score (59.3%), IoU (41.8%), and MCC (53.5%)—reached their peak values, surpassing any single-season observation. While the fusion results for Precision (54.1%) and Recall (65.5%) followed a balanced trajectory between the seasonal peaks, the strategy effectively addressed the inherent limitations associated with each individual season. Specifically, it mitigated the performance bottlenecks found in single-phase data: the restricted Recall in summer due to canopy occlusion and the diminished Precision in winter caused by spectral confusion. Although the visual alignment between the fusion results and ground truth is high, the quantitative metrics like IoU (41.8%) are constrained by the pixel-level sensitivity of slender features. In forest road extraction, even a minor sub-meter positional shift or a slight mismatch in road width between the prediction and the manually annotated ground truth can lead to significant penalties in IoU, despite preserving the overall topological integrity of the network.
While the logical “OR” union strategy effectively maximizes recall by capturing road segments across different seasons, it inherently increases the potential for false-positive noise. To counter this, a rigorous post-processing pipeline—including morphological refinement and connected component analysis (CCA)—was integrated. Although this filtering process removed a small number of isolated, non-topological segments from the winter data, it significantly enhanced the overall topological consistency and minimized scattered artifacts that could otherwise impede rescue route planning. Consequently, the fused Recall (65.5%) remained significantly higher than that of summer (60.2%) but slightly below the winter peak of 67.8%. The final results demonstrate that the multi-temporal complementary fusion strategy successfully balances the trade-off between commission and omission errors. By effectively repairing road network disconnections caused by phenological obscuration, this approach better fulfills the practical requirements for topological continuity and network integrity in forest fire rescue operations.
As illustrated in Figure 10, the fusion strategy integrates the precise boundary information of summer with the complete network structure of winter through logical union and morphological refinement. The final fused result does not blindly aggregate all winter pixels; instead, it leverages the logical OR operation to capture potential connectivity and subsequently applies CCA to filter out fragmented noise. This ensures that only road segments with high topological consistency are preserved, yielding a cleaner and more actionable network than any single-season observation.

4. Discussion

To address the occlusion, discontinuity, and sample-imbalance problems that hinder road extraction in complex forest environments during fire rescue, this study proposed an extraction method based on multi-temporal remote sensing imagery. By integrating the MCSA mechanism and an improved ASPP module with Strip Pooling into the DeepLabv3+ framework, our model enhanced the capture of slender road features and their connectivity. Experimental results demonstrated that the proposed method achieved a strategic balance between seasonal observations, yielding an F1-score of 59.3% and an IoU of 41.8%. While the fused Precision (54.1%) and Recall (65.5%) fell between the seasonal peaks, the strategy effectively addressed the intrinsic phenological constraints: the omission errors (low Recall) in summer caused by canopy occlusion and the commission errors (low Precision) in winter caused by spectral confusion. Furthermore, the Focal-Dice hybrid loss function and the logical-union fusion strategy ensured road extraction accuracy and topological integrity. These achievements provide practical value for forest fire rescue in high-risk areas such as Jinning District, offering a robust road network reference for rapid deployment and evacuation planning.
There is still room for further optimization and expansion of the method proposed in this study. Existing annotated datasets for forest roads mainly focus on single regions and conventional scenarios, with limited coverage of topographies, vegetation types, and fire interference scenarios. Furthermore, the current study is primarily confined to the Central Yunnan Plateau, and the model was trained on a limited variety of major tree species typical of this region. This geographic and botanical specificity may constrain the model’s generalization capability when applied to forest environments with significantly different canopy architectures or phenological patterns, such as tropical rainforests or boreal taiga. This leads to insufficient generalization ability of the model in cross-region and complex fire situations, making it difficult to adapt to the heterogeneous characteristics of different forest areas. Additionally, while the current evaluation relies on pixel-level segmentation metrics, future work will incorporate operationally meaningful indicators—such as network connectivity analysis (e.g., largest connected component ratio) and network completeness assessment near fire-prone zones—to more directly quantify the usability of the extracted road network for emergency route planning.
Beyond the limitations of optical data, the current model’s accuracy can be compromised by severe environmental factors such as thick smoke, heavy rain, or extreme illumination—scenarios common in active wildfire events. The occasional discontinuities observed in the fusion results, despite being connected in the ground truth, further reflect the inherent limitations of optical remote sensing. While human annotators can infer connectivity through dense canopy based on prior knowledge, the model relies strictly on detectable spectral evidence. In areas with 100% canopy closure or persistent topographic shadows, the lack of signal leads to unavoidable breaks; however, this conservative approach ensures the reliability of the extracted network by avoiding unfounded extrapolations. To address these challenges and meet the requirements for all-weather rescue, future research will focus on the multi-modal fusion of optical imagery with SAR or LiDAR data. Notably, SAR’s capability to penetrate smoke and clouds makes it a critical potential complement for reliable forest road extraction during emergency response operations.
Operationally, such a multi-modal fusion pipeline could be structured as a two-tier system: an optical-first tier that maintains and updates the baseline road network during non-emergency periods using the multi-temporal approach proposed in this study, and a SAR-activated tier that is triggered during active fire events when smoke and cloud cover render optical imagery unusable. In the SAR-activated tier, pre-event optical road masks would serve as spatial priors to constrain SAR-based detection, compensating for the inherently lower spatial resolution and higher speckle noise of SAR imagery. This tiered architecture would enable continuous road network availability throughout the full emergency lifecycle—from pre-event preparedness through active response to post-event damage assessment—thereby meeting the all-weather, real-time requirements of next-generation forest fire rescue systems.

Author Contributions

Conceptualization, Z.G., Z.L. (Ziyang Li) and W.Y.; methodology, Z.G.; software, Z.G.; validation, Z.G.; formal analysis, Z.G. and Z.L. (Ziyang Li); investigation, W.Y.; resources, W.Y.; data curation, Z.G.; writing—original draft preparation, Z.G.; writing—review and editing, W.Y.; visualization, Z.G.; supervision, Z.L. (Ziyang Li) and W.Y.; project administration, T.Z., S.Q. and Z.L. (Zhaoyan Liu). All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by National Key R&D Program of China (No. 2023YFC3006900).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data and materials supporting this study are available at varying levels of access. Publicly available: The DeepGlobe Road Extraction Dataset used for pre-training is publicly accessible from its official repository [33]. The complete model architecture—including the base DeepLabV3+ configuration, the integration positions and structural details of the MCSA module and Strip Pooling branch, all hyperparameter settings, and the training protocol—is fully described in this manuscript to ensure methodological reproducibility. Restricted: The self-constructed Multi-temporal Forest Area Remote Sensing Dataset, including GF-2 satellite imagery and manually annotated road vector data, cannot be publicly released due to geographic information security regulations governing high-resolution satellite products in China. Researchers seeking access for academic purposes may submit a formal request with a detailed research plan to the corresponding author (yaowy@aircas.ac.cn); access will be granted upon verification of compliance with information security and academic ethics requirements.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Qian, C.S.; Wu, Z.Y.; Chen, C.D. The types of vegetation in China. Acta Geogr. Sin. 1956, 37–92. (In Chinese) [Google Scholar]
  2. Cui, R.K.; Qian, L.H.; Wang, Q.H. Research progress on fire fighting function evaluation of forest road networks. World For. Res. 2023, 36, 32–37. (In Chinese) [Google Scholar]
  3. Wang, D.D. Analysis and suggestions on the response work to mountain fire disasters in Chongqing in August 2022. Disaster Reduct. China 2023, 40–43. (In Chinese) [Google Scholar]
  4. Qin, X.H.; Zhao, X.D. Development of forest roads in Australia and its enlightenment. World For. Res. 2021, 34, 112–117. (In Chinese) [Google Scholar]
  5. Wang, L.Z.; Zhang, W.; Liu, C.M.; Wang, Y.P.; Bao, J.Q.; Liu, H. Impacts of the construction, planning and maintenance of forest roads on forest fire prevention work. For. Fire Prev. 2023, 41, 16–18. (In Chinese) [Google Scholar] [CrossRef]
  6. Bai, X.P.; Chen, S.Z.; He, Y.J.; Li, M.; Wang, Y.F. Current situation and enlightenment of foreign forest road development. World For. Res. 2015, 28, 85–91. (In Chinese) [Google Scholar]
  7. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 39, 640–651. [Google Scholar]
8. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015; pp. 234–241.
9. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
10. Ge, X.S.; Cao, W. An improved DeepLabV3+ network method for road extraction from high-resolution remote sensing images. Remote Sens. Inf. 2022, 37, 40–46. (In Chinese)
11. Tao, C.; Qi, J.; Li, Y.; Wang, H.; Li, H. Spatial information inference net: Road extraction using road-specific contextual information. ISPRS J. Photogramm. Remote Sens. 2019, 158, 155–166.
12. Panboonyuen, T.; Vateekul, P.; Jitkajornwanich, K.; Lawawirojwong, S. An enhanced deep convolutional encoder-decoder network for road segmentation on aerial imagery. In Proceedings of the International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, Arras, France, 27–30 June 2017; pp. 191–201.
13. Bajcsy, R.; Tavakoli, M. Computer recognition of roads from satellite pictures. IEEE Trans. Syst. Man Cybern. 1976, 6, 623–637.
14. Fan, C.F.; Li, Z.M. Edge detection method and its real-time implementation in road recognition system. Signal Process. 1998, 337–345+357. (In Chinese)
15. Kass, M.; Witkin, A.; Terzopoulos, D. Snakes: Active contour models. Int. J. Comput. Vis. 1988, 1, 321–331.
16. Mena, J.B. State of the art on automatic road extraction for GIS update: A novel classification. Pattern Recognit. Lett. 2003, 24, 3037–3058.
17. Zhang, Q.; Couloigner, I. Comparing different localization approaches of the radon transform for road centerline extraction from classified satellite imagery. In Proceedings of the 18th International Conference on Pattern Recognition (ICPR’06), Hong Kong, China, 20–24 August 2006; Volume 2, pp. 138–141.
18. Xu, Y.; Xie, Z.; Feng, Y.; Chen, Z. Road extraction from high-resolution remote sensing imagery using deep learning. Remote Sens. 2018, 10, 1461.
19. Yang, P.; Xiao, H.; Lin, C.; Xie, X. UGD-DLinkNet: An enhanced network for occluded road extraction using attention mechanisms and uncertainty estimation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 24144–24161.
20. Li, X.; Li, Y.; Ai, J.; Shu, Z.; Xia, J.; Xia, Y. Semantic segmentation of UAV remote sensing images based on edge feature fusing and multi-level upsampling integrated with Deeplabv3. PLoS ONE 2023, 18, e0279097.
21. Winiwarter, L.; Coops, N.C.; Bastyr, A.; Roussel, J.-R.; Zhao, D.Q.R.; Lamb, C.T.; Ford, A.T. Extraction of Forest Road Information from CubeSat Imagery Using Convolutional Neural Networks. Remote Sens. 2024, 16, 1083.
22. Kleinschroth, F.; Laporte, N.; Laurance, W.F.; Goetz, S.J.; Ghazoul, J. Road expansion and persistence in forests of the Congo Basin. Nat. Sustain. 2019, 2, 628–634.
23. Chinanews Network. The Fire Situation Stabilizes amid Continued Firefighting of the Kunming Jinning Forest Fire. Available online: https://www.yn.chinanews.com.cn/news/2024/0415/76251.html (accessed on 17 November 2025). (In Chinese)
24. Zhu, H. Study on vegetation diversity in Yunnan Province. J. Southwest For. Univ. (Nat. Sci.) 2022, 42, 1–12. (In Chinese)
25. Yang, J.; Dai, J.H.; Yao, H.R.; Tao, Z.X.; Zhu, M.Y. Changes in vegetation distribution and vegetation activity in the Hengduan Mountains from 1992 to 2020. Acta Geogr. Sin. 2022, 77, 16. (In Chinese)
26. Xiong, J.N.; Peng, C.; Cheng, W.M.; Li, W.; Liu, Z.Q.; Fan, C.K.; Sun, H.Z. Analysis of vegetation coverage change in Yunnan Province based on MODIS-NDVI. J. Geo-Inf. Sci. 2018, 20, 1830–1840. (In Chinese)
27. Wu, Z.Y.; Zhu, Y.C. Vegetation in Yunnan; Science Press: Beijing, China, 1987. (In Chinese)
28. Zeng, J.M. A study on the geographical distribution of Pinus and Cunninghamia species in Yunnan. J. Southwest For. Univ. (Nat. Sci.) 2021, 41, 1–12. (In Chinese)
29. Zeng, J.M. A study on the classification system and geographical distribution of natural forests in Yunnan. J. Southwest For. Univ. (Nat. Sci.) 2018, 38, 1–18+231. (In Chinese)
30. Zhu, H. A study on the vegetation geography of evergreen broad-leaved forests in Yunnan. Chin. J. Plant Ecol. 2021, 45, 224–241. (In Chinese)
31. Li, D.R. China’s High-Resolution Earth Observation System (CHEOS): Advances and Perspectives. ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci. 2022, V-3-2022, 583–590.
32. Maurer, T. How to Pan-Sharpen Images Using the Gram-Schmidt Pan-Sharpen Method—A Recipe. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2013, XL-1/W1, 239–244.
33. Demir, I.; Koperski, K.; Lindenbaum, D.; Pang, G.; Huang, J.; Basu, S.; Hughes, F.; Tuia, D.; Raskar, R. DeepGlobe 2018: A challenge to parse the earth through satellite images. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 172–17209.
34. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
35. Abdollahi, A.; Pradhan, B.; Shukla, N.; Chakraborty, S.; Alamri, A. Deep Learning Approaches Applied to Remote Sensing Datasets for Road Extraction: A State-Of-The-Art Review. Remote Sens. 2020, 12, 1444.
36. Hoeser, T.; Kuenzer, C. Object Detection and Image Segmentation with Deep Learning on Earth Observation Data: A Review—Part I: Evolution and Recent Trends. Remote Sens. 2020, 12, 1667.
37. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X.; et al. Deep High-Resolution Representation Learning for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 3349–3364.
38. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLO. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 13 March 2026).
39. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y.; et al. Segment Anything. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 3992–4003.
40. Chen, K.; Liu, C.; Chen, H.; Zhang, H.; Li, W.; Zou, Z.; Shi, Z. RSPrompter: Learning to Prompt for Remote Sensing Instance Segmentation Based on Visual Foundation Model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–17.
41. Bai, H.; Ren, C.; Huang, Z.; Gu, Y. A dynamic attention mechanism for road extraction from high-resolution remote sensing imagery using feature fusion. Sci. Rep. 2025, 15, 17556.
42. Hou, Q.; Zhang, L.; Cheng, M.-M.; Feng, J. Strip pooling: Rethinking spatial pooling for scene parsing. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 4002–4011.
43. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
44. Suzuki, S. Topological structural analysis of digitized binary images by border following. Comput. Vis. Graph. Image Process. 1985, 30, 32–46.
45. Loshchilov, I.; Hutter, F. SGDR: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983.
46. Yosinski, J.; Clune, J.; Bengio, Y.; Lipson, H. How transferable are features in deep neural networks? Adv. Neural Inf. Process. Syst. 2014, 27, 3320–3328.
47. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
Figure 1. Tree Species Distribution Map.
Figure 2. The overall operational workflow.
Figure 3. Network Architecture of the Proposed Enhanced DeepLabV3+ [34].
Figure 4. Overall Architecture of MCSA, adapted from [41].
Figure 5. Structural designs of the MDCA and MDSA modules, adapted from [41].
Figure 6. Improved ASPP Module.
Figure 7. Forest-area road recognition results for the optimized model architecture. White boxes indicate representative regions selected for visual comparison. Red lines denote predicted road pixels.
Figure 8. Training and validation curves for the two-stage transfer learning strategy. Upper panels: pre-training on the DeepGlobe dataset (10 epochs, Dice Loss). Lower panels: fine-tuning on the forest dataset (25 epochs, Focal-Dice composite loss). Dashed orange lines indicate learning rate reduction events; the dotted purple line marks the early stopping point. Validation IoU is evaluated on the respective validation sets; final metrics in Table 3, Table 4, Table 5 and Table 6 are from the independent test set.
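The learning-rate reduction events and early-stopping point marked in Figure 8 can be captured by a simple patience-based rule on validation IoU. The sketch below is illustrative only: the function name, patience, and improvement threshold are assumptions, not values taken from the paper.

```python
def early_stop_epoch(val_ious, patience=3, min_delta=1e-4):
    """Return the epoch at which patience-based early stopping triggers.

    Toy criterion (assumed, not the paper's exact rule): stop after
    `patience` consecutive epochs without an IoU gain above `min_delta`.
    Returns None if training runs to completion.
    """
    best = float("-inf")
    wait = 0
    for epoch, iou in enumerate(val_ious):
        if iou > best + min_delta:
            best, wait = iou, 0   # improvement: reset the patience counter
        else:
            wait += 1
            if wait >= patience:  # plateau long enough: stop here
                return epoch
    return None
```

The same counter, with the learning rate halved instead of training halted, would reproduce the dashed "learning rate reduction" events in the figure.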
Figure 9. Joint effect of the composite loss function and the improved architecture on forest-area road recognition. White boxes highlight representative regions demonstrating improvements attributed to the enhanced model architecture; yellow boxes highlight regions demonstrating improvements attributed to the composite loss function. Red lines denote predicted road pixels.
Figure 10. Multi-temporal Fusion Results.
Table 1. Description of the Datasets.

| Dataset Name | Image Pairs | Spatial Resolution | Image Size | Training/Validation/Test Split | Usage Description |
|---|---|---|---|---|---|
| DeepGlobe Road Extraction Dataset | 8570 | 0.5 m | 1024 × 1024 | 6226 / 1243 / 1101 | Model pre-training, preliminary feature extraction |
| Multi-temporal Forest Area Remote Sensing Dataset | 1456 | 0.8 m | 1024 × 1024 | 1165 / 146 / 145 | Model fine-tuning, forest road training and evaluation |
Table 2. Comparison of Characteristics of State-of-the-Art Semantic Segmentation Models.

| Model | Core Architecture | Spatial Detail Processing Strategy | Suitability for Elongated/Weak Targets |
|---|---|---|---|
| FCN | Encoder–Decoder | Lost first, then roughly recovered | Low |
| U-Net | Symmetric Encoder–Decoder | Sacrificed first, then compensated | Medium |
| SegNet | Encoder–Decoder | Positions recorded first, then recovered | Medium–Low |
| DeepLabV3+ | Encoder–Decoder + Atrous Convolution | Maintained at all times | High |
Table 3. Model Accuracy Analysis with the Introduction of the Multi-temporal Forest Area Remote Sensing Dataset.

| Method | Precision (%) | Recall (%) | F1 Score (%) | IoU (%) | MCC (%) |
|---|---|---|---|---|---|
| FCN | 37.7 | 58.9 | 45.9 | 29.8 | 39.6 |
| SegNet | 38.1 | 59.3 | 46.4 | 30.2 | 40.5 |
| U-Net | 40.2 | 60.9 | 48.4 | 31.8 | 43.1 |
| DeepLabV3+ (Baseline) | 40.6 | 61.1 | 48.8 | 32.2 | 43.5 |
| +MCSA | 46.8 | 62.9 | 53.7 | 36.7 | 48.2 |
| +Improved ASPP | 45.3 | 62.4 | 52.5 | 35.6 | 47.1 |
| Combined Application | **51.6** | **64.1** | **57.1** | **39.9** | **51.5** |

Bold values indicate the best performance in each column.
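The metric columns in Tables 3–6 follow the standard binary-segmentation definitions with road as the positive class. As a quick reference, a minimal NumPy sketch (the helper name `road_metrics` is illustrative, not code from the paper; it assumes both classes are present in the masks):

```python
import numpy as np

def road_metrics(pred, gt):
    """Precision, recall, F1, IoU and MCC for binary road masks.

    pred, gt: array-likes of 0/1 values (1 = road pixel). Assumes both
    road and non-road pixels occur, so no denominator is zero.
    """
    pred = np.asarray(pred).astype(bool)
    gt = np.asarray(gt).astype(bool)
    tp = float(np.sum(pred & gt))    # road predicted and present
    fp = float(np.sum(pred & ~gt))   # road predicted, absent
    fn = float(np.sum(~pred & gt))   # road missed
    tn = float(np.sum(~pred & ~gt))  # background correctly rejected
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"precision": precision, "recall": recall,
            "f1": f1, "iou": iou, "mcc": mcc}
```

Note that precision, recall, and F1 ignore true negatives, whereas MCC accounts for all four confusion-matrix cells, which is why it is reported alongside IoU for these class-imbalanced road masks.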
Table 4. Experimental Results of Loss Function Improvement on the Forest Area Dataset.

| Loss Function Configuration | Weight (α) | Weight (β) | Precision (%) | Recall (%) | F1 Score (%) | IoU (%) | MCC (%) |
|---|---|---|---|---|---|---|---|
| Baseline (CE Loss) | – | – | 40.6 | 61.1 | 48.8 | 32.2 | 43.5 |
| Configuration 1 | 1.0 | 0.0 | 47.8 | 58.2 | 52.4 | 35.5 | 46.9 |
| Configuration 2 | 0.7 | 0.3 | **53.1** | **63.4** | **57.8** | **40.6** | **52.3** |
| Configuration 3 | 0.5 | 0.5 | 49.2 | 61.8 | 54.8 | 37.7 | 49.1 |
| Configuration 4 | 0.3 | 0.7 | 50.3 | 62.5 | 55.7 | 38.6 | 50.2 |
| Configuration 5 | 0.0 | 1.0 | 44.1 | 60.3 | 50.9 | 34.2 | 45.6 |

Bold values indicate the best performance in each column.
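The composite loss in Table 4 weights a Focal term by α and a Dice term by β. The paper's exact formulation is not reproduced in this excerpt, so the sketch below assumes the standard binary Focal loss of Lin et al. [43] with γ = 2 and a soft Dice loss; the defaults follow the best-performing Configuration 2 (α = 0.7, β = 0.3):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, eps=1e-7):
    """Binary focal loss averaged over pixels (assumed standard form, gamma = 2)."""
    p = np.clip(p, eps, 1.0 - eps)
    pt = np.where(y == 1, p, 1.0 - p)  # probability assigned to the true class
    return float(np.mean(-((1.0 - pt) ** gamma) * np.log(pt)))

def dice_loss(p, y, eps=1e-7):
    """Soft Dice loss: 1 minus the Dice overlap of probabilities and labels."""
    inter = np.sum(p * y)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(p) + np.sum(y) + eps))

def focal_dice_loss(p, y, alpha=0.7, beta=0.3):
    """Composite Focal-Dice loss; alpha/beta defaults follow Configuration 2."""
    return alpha * focal_loss(p, y) + beta * dice_loss(p, y)
```

The Focal term down-weights the abundant, easily classified background pixels, while the Dice term rewards overlap with the thin road class directly, which is why the mixed configurations in Table 4 beat either term alone.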
Table 5. Experimental Results of Full Improvements on the Forest Area Dataset.

| Method | Precision (%) | Recall (%) | F1 Score (%) | IoU (%) | MCC (%) |
|---|---|---|---|---|---|
| Loss Function Improvement | 53.1 | 63.4 | 57.8 | 40.6 | 52.3 |
| Model Improvement | 51.6 | 64.1 | 57.1 | 39.9 | 51.5 |
| Combined Effect | **53.8** | **64.9** | **58.8** | **41.6** | **53.2** |

Bold values indicate the best performance in each column.
Table 6. Comparison of Recognition Accuracy Across Different Seasons.

| Dataset | Precision (%) | Recall (%) | F1 Score (%) | IoU (%) | MCC (%) |
|---|---|---|---|---|---|
| Summer Dataset | **55.3** | 60.2 | 57.6 | 41.2 | 53.1 |
| Winter Dataset | 49.7 | **67.8** | 57.3 | 39.5 | 51.8 |
| After Temporal Fusion | 54.1 | 65.5 | **59.3** | **41.8** | **53.5** |

Bold values indicate the best performance in each column.
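Table 6 shows the fusion step trading a little summer precision for winter recall, lifting F1 and IoU overall. The paper's actual fusion strategy (which also enforces topological integrity) is not reproduced in this excerpt; the snippet below is a deliberately simplified stand-in, a pixelwise union of per-season binary masks, to illustrate the basic idea of combining seasonal detections:

```python
import numpy as np

def fuse_masks(masks):
    """Pixelwise union of binary seasonal road masks.

    Simplified illustration only (assumed logic, not the paper's method):
    a pixel is labeled road if any seasonal prediction labels it road.
    `masks` is a list of equally shaped 0/1 arrays.
    """
    stacked = np.stack(masks, axis=0)          # (seasons, H, W)
    return (stacked.sum(axis=0) > 0).astype(np.uint8)
```

A plain union maximizes recall at the cost of precision; the reported fusion results suggest additional filtering on top of the combined mask, so treat this purely as a conceptual sketch.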
Share and Cite

Gao, Z.; Li, Z.; Yao, W.; Zhang, T.; Qiu, S.; Liu, Z. Forest Road Extraction via Optimized DeepLabv3+ and Multi-Temporal Remote Sensing for Wildfire Emergency Response. Appl. Sci. 2026, 16, 3228. https://doi.org/10.3390/app16073228