SAMS-Net: A Stage-Decoupled Semantic Segmentation Network for Forest Fire Detection

Tan, Yuxin; An, Jiazhe; Wang, Yabin; Li, Zhun; Gao, Jia; Yu, Fuxing

doi:10.3390/app16073144


by Yuxin Tan ¹, Jiazhe An ¹, Yabin Wang ¹, Zhun Li ¹, Jia Gao ¹ and Fuxing Yu ¹,²,*

¹ College of Artificial Intelligence, North China University of Science and Technology, Tangshan 063210, China
² Hebei Provincial Key Laboratory of Industrial Intelligent Perception, Tangshan 063210, China
* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(7), 3144; https://doi.org/10.3390/app16073144

Submission received: 1 March 2026 / Revised: 22 March 2026 / Accepted: 22 March 2026 / Published: 24 March 2026

(This article belongs to the Section Computing and Artificial Intelligence)


Abstract

High-precision and real-time monitoring of forest fires is a critical requirement in disaster prevention and mitigation. During fire evolution, significant stage-wise variations occur, which make it difficult for conventional semantic segmentation models to simultaneously achieve robust multi-scale feature extraction and strong interference resistance. To address this issue, this paper proposes a stage-aware multi-head segmentation network, termed SAMS-Net. The proposed network decouples fire-stage recognition from pixel-level segmentation and employs a Hard-Switch Routing mechanism to dynamically activate the stage-specific decoder that matches the current fire phase during inference, while pruning irrelevant branches to reduce computational redundancy. Experimental results show that SAMS-Net achieves 76.16% mIoU, 81.30% Dice, and 90.31% PA, outperforming mainstream segmentation models such as FCN, U-Net++, DeepLabV3, and YOLOv9-Seg. In challenging stages, particularly the early and recession phases, the segmentation performance improves by more than 10% compared with the second-best model. Meanwhile, the proposed method maintains high accuracy with a real-time inference speed of 75.8 FPS. These results support the effectiveness of SAMS-Net for flame-and-ember foreground segmentation on the constructed multi-stage forest-fire benchmark dataset. Broader generalization across independent datasets and real-world deployment scenarios will be further investigated in future work.

Keywords: SAMS-Net; forest fire detection; semantic segmentation; stage-aware learning; deep learning

1. Introduction

Forest fires are among the most frequent and destructive natural disasters worldwide. They are characterized by sudden outbreaks, rapid spread, and extensive damage, posing severe threats to human life and property, regional ecological balance, and the sustainable development of society and the economy. Driven by the combined effects of global climate warming, expanding human activities, and complex terrain conditions, the occurrence and scale of forest fires have intensified in recent years. Catastrophic events such as the 2023 Greek wildfires and the 2019 Australian bushfires not only resulted in significant casualties and ecological devastation but also released massive amounts of greenhouse gases, further exacerbating global warming and causing long-term environmental impacts [1,2]. In this context, the development of efficient and accurate early fire detection technologies has become an urgent and critical task in the field of disaster prevention and mitigation.

Early forest fire monitoring has primarily relied on traditional approaches such as ground-based sensor networks, satellite remote sensing, and fixed observation stations. Fernandes et al. utilized LiDAR technology and optimized radar station deployment to achieve early smoke detection in forest environments [3]. Wang et al. improved the MODIS fire detection algorithm to make it compatible with the infrared sensor of the HJ satellite and verified its feasibility and high accuracy through field experiments [4]. Varela et al. proposed an information-fusion-based wireless sensor network detection method, effectively reducing the high cost and limited real-time capability inherent in conventional approaches [5]. Dwivedi et al. adopted a density-based DBS algorithm to address anomaly detection in sensor cloud data [6]. Prasanna et al. developed an early fire detection system integrating tower-mounted cameras, LoRa communication, and solar power supply, enabling the timely transmission of geolocated warning information to forestry authorities [7].

With the rapid development of deep learning, vision-based intelligent fire detection methods have gradually become the mainstream research direction and have achieved significant improvements in both accuracy and real-time performance. Within object detection frameworks, Zhang et al. combined Faster R-CNN with synthetic smoke images to enable automatic forest fire smoke detection [8]. Zhan et al. integrated deconvolution and dilated convolution to design the recursive feature pyramid network ARGNet, achieving high-precision smoke detection [9]. Hu et al. proposed MVMNet, which incorporates multi-directional detection and attention mechanisms to effectively address the challenges of weak early smoke features and complex backgrounds [10]. Yuan et al. introduced FF-Net with F_Res feature extraction, FLA label assignment, and the KLF loss function, improving detection accuracy and mitigating data imbalance in mid-to-late fire stages under complex environments [11]. Zhu et al. developed the lightweight YOLO-MP model, which significantly reduces parameters and computational cost while improving detection accuracy [12].

In the field of semantic segmentation, Yu et al. proposed CCLNet, a lightweight model for small-target fire detection on UAV platforms by enhancing flame feature extraction through three core modules [13]. Han et al. constructed a lightweight multi-task model and expanded the dataset via image augmentation to improve robustness [14]. Yan et al. developed MAG-FSNet by integrating convolutional networks with Transformers, enhancing early smoke detection accuracy and generalization under complex backgrounds [15]. Regarding feature extraction and satellite remote sensing, Zhao et al. combined motion region extraction with multi-feature discrimination and employed the CSAdaboost algorithm to effectively distinguish fog from smoke [16]. Mambile et al. conducted a systematic comparison of nine CNN architectures on Sentinel-2 satellite imagery and identified MobileNetV2 as the optimal choice for computationally constrained scenarios [17]. In addition, Mahaveerakannan et al. integrated the Internet of Things with EfficientDet and LSTM networks to achieve fire prediction and detection [18]. Mowla et al. constructed the UAV-FFDB dataset containing 15,560 UAV-based forest fire images, providing an important data foundation for AI model training [19]. Giannakidou et al. presented a comprehensive review of artificial intelligence applications in forest fire management and future research directions [20].

To overcome the perceptual limitations of single visual modalities, some studies have introduced multi-sensor information fusion to enhance the robustness of fire detection in complex environments. Sun et al. proposed a physics-driven remote sensing framework for smoke detection and concentration inversion based on scattering–absorption theory and Mahalanobis distance, enabling multidimensional quantitative perception of smoke information [21]. Jin et al. developed the multimodal MM-SRENet model by integrating smoke recognition with risk factors, effectively reducing false positives and false negatives in fire detection [22]. Exaudi et al. converted fire-generated sounds into Mel spectrograms and fed them into a CNN, improving system robustness and lowering false alarm rates through audio–visual joint perception [23]. Krüll et al. combined remote sensing, UAVs, and airships to construct a modular forest fire detection and response system that integrates multiple sensing modalities, including smoke, gas, and microwave radiation, and validated its effectiveness through both indoor and outdoor experiments [24].

In summary, existing studies lack dynamic stage-aware modeling capability, the object detection paradigm has inherent limitations in achieving pixel-level accuracy, and effective joint optimization mechanisms that integrate segmentation with stage recognition remain underexplored. To address these three limitations, this paper proposes a stage-aware multi-head segmentation network, SAMS-Net. The proposed framework decouples fire physical stage recognition and semantic segmentation into two collaboratively optimized sub-tasks. A lightweight classification head is employed to determine the fire stage, and a Hard-Switch Routing mechanism dynamically activates the corresponding stage-specific decoder, thereby improving the adaptability of the segmentation framework to multi-stage fire scenarios. In this work, the segmentation objective is specifically defined as flame-and-ember foreground extraction under different fire evolution stages, while smoke is treated as contextual interference rather than an explicit target category. The main contributions of this work are summarized as follows:

(1): We propose SAMS-Net, a stage-aware decoupled segmentation architecture for multi-stage forest-fire imagery. The framework separates stage classification from flame-and-ember foreground segmentation through a three-tier design consisting of a shared backbone, a routing module, and stage-specific decoders. During inference, hard-switch routing activates only the decoder corresponding to the predicted stage, enabling dynamic pruning and more adaptive segmentation across different fire-evolution stages.
(2): We design stage-specific decoding modules for the distinct visual characteristics of different fire stages. PixelShuffle and Coordinate Attention (CA) are used to preserve fine details of small scattered fire spots in the early stage; a multi-dilation Atrous Spatial Pyramid Pooling (ASPP) module is employed to improve structural completeness in large contiguous flame regions in the middle stage; and the Convolutional Block Attention Module (CBAM) together with Mish enhances robustness under smoke-interfered, low-signal recession-stage conditions.
(3): A differentiable multi-task joint training strategy is proposed to address the non-differentiability of hard-switch routing in end-to-end training. Stage classification and stage-specific segmentation losses are explicitly decoupled, and a weighted joint loss function with hyperparameter sensitivity analysis is designed, allowing the shared backbone to simultaneously learn global stage discrimination and pixel-level reconstruction features. A hybrid Dice + BCE segmentation loss is adopted to effectively handle the severe foreground–background imbalance in forest fire scenarios.

2. Method

2.1. Overall Architecture

In this work, our objective is to improve semantic segmentation accuracy and inference efficiency for flame-and-ember foreground targets across different forest-fire image conditions, while explicitly accounting for the different fire-evolution stages. To this end, SAMS-Net decouples stage recognition from semantic segmentation and incorporates the dense nested skip-connection mechanism of U-Net++ to enhance multi-scale feature reconstruction.

As shown in Figure 1, the processing pipeline of SAMS-Net proceeds from left to right. First, the input forest-fire image is fed into a shared ResNet-50 backbone, which extracts a hierarchy of multi-scale features $\{F_1, F_2, F_3, F_4, F_5\}$. Among them, the shallow features preserve fine spatial details, while the deepest feature $F_5$ contains the strongest semantic abstraction. Next, $F_5$ is sent to the stage-classification branch shown at the top of Figure 1. After global average pooling and a three-layer MLP, this branch predicts the fire-evolution stage of the current image (early, middle, or recession). The predicted stage is then converted into a routing signal by the hard-switch module. Guided by this routing signal, only one of the three stage-specific decoders in Figure 1 is activated. At the same time, the selected decoder receives the multi-scale backbone features through skip connections for progressive feature fusion and mask reconstruction, whereas the other two decoders remain inactive for the current sample. Finally, the activated decoder outputs the segmentation result for flame-and-ember foreground extraction. In this way, Figure 1 illustrates a complete “feature extraction → stage classification → decoder selection → stage-specific segmentation” workflow, where the classification branch determines the routing path and the selected decoder performs the final pixel-level prediction.

2.2. Shared Backbone Feature Extraction

The backbone network serves as the core component for feature extraction, mapping the input images into a high-dimensional feature space and capturing general semantic information, including flame textures, environmental background characteristics, and smoke-related contextual variations. To achieve effective feature extraction for flame-and-ember foreground segmentation in multi-stage forest-fire imagery, ResNet-50 is adopted as the shared encoder backbone in this study. By introducing deep residual learning, ResNet-50 alleviates the gradient degradation problem commonly encountered in deep neural networks, enabling the extraction of more complex nonlinear feature representations. The core functionality and design rationale of the backbone during feature extraction are detailed as follows.

ResNet-50 Backbone Structure and Feature-Level Partitioning:

ResNet-50 consists of an initial convolutional module followed by four residual stages. Given an input image, the backbone progressively performs spatial downsampling through residual connections, producing a set of multi-scale semantic feature representations.

$$F = \{F_1, F_2, F_3, F_4, F_5\} \tag{1}$$

This hierarchical multi-scale feature structure provides a unified and stable foundation for subsequent stage classification and multi-decoder semantic segmentation. The sources of each feature level and their corresponding semantic characteristics are summarized below:

F1: Output from the initial convolution and max-pooling layers, featuring high spatial resolution and primarily representing low-level visual information such as edges and textures.
F2: Output from the Conv2_x stage, encoding local structural features.
F3: Output from the Conv3_x stage, capturing mid-scale flame morphology and smoke-related background interference patterns.
F4: Output from the Conv4_x stage, encoding stronger region-level semantic representations.
F5: Output from the Conv5_x stage, possessing the lowest spatial resolution but the highest semantic abstraction capability, and serving as the basis for global scene-level stage classification.

As shown in Figure 2, the shared ResNet-50 backbone generates a hierarchy of multi-scale feature maps (F1–F5), ranging from low-level spatial details to high-level semantic representations.

This architecture goes beyond the limitations of traditional single-task designs by adopting a fully shared backbone paradigm. Specifically, both the routing branch responsible for stage discrimination and the stage-specific decoders are fed by the same ResNet-50 encoder. Through end-to-end joint optimization, this design encourages the backbone to learn shared feature representations for both routing and segmentation, while reducing parameter redundancy and inference latency, which is beneficial for resource-constrained deployment settings.

Feature Flow and Skip Connections: The multi-scale feature maps extracted by the backbone are forwarded to the decoders via the stage-aware routing mechanism. Concretely, the deepest bottleneck feature F5, which contains high-level semantic information, is first fed into the stage classification sub-head to evaluate the global scene semantics and generate the routing signal indicating the current predicted fire-evolution stage. Guided by this routing decision, the multi-scale feature set {F1, F2, F3, F4, F5} is then delivered to the activated k-th stage-specific decoder. Within the selected decoding path, a dense skip-connection structure inherited from U-Net++ is employed to fuse multi-scale features. Shallow, high-resolution detail features are extensively reused through dense connections, enabling fine-grained reconstruction of fire contours. Meanwhile, the remaining two decoders are excluded from the forward propagation process, ensuring targeted feature flow and efficient utilization of computational resources.

2.3. Stage Classification Module

The primary function of the stage classification head is to map the high-level features extracted by the shared backbone into the three fire-evolution stage categories defined in this study. Considering that fire stage identification is inherently a global semantic recognition task, a lightweight yet stable stage classification structure is designed, as illustrated in Figure 3.

As shown in Figure 3, the stage-classification head is composed of three fully connected layers, denoted as FC1, FC2, and FC3. Specifically, FC1 reduces the 2048-dimensional global feature vector to 1024 dimensions for preliminary semantic compression, FC2 further maps it to 512 dimensions to strengthen stage-discriminative representation, and FC3 outputs a 3-dimensional prediction corresponding to the early, middle, and recession stages.

The input to the stage classification head is taken from the Conv5_x layer of ResNet-50, whose output feature map is denoted as:

$$F_5 \in \mathbb{R}^{H_5 \times W_5 \times C} \tag{2}$$

where $H_5 = W_5 = 16$ and $C = 2048$. This feature map possesses the strongest semantic abstraction capability and effectively represents the macroscopic state of the fire scene.

To eliminate spatial location interference and extract a globally consistent semantic descriptor, Global Average Pooling (GAP) is applied to compress the spatial dimensions of the feature map:

$$f_{global}(c) = \mathrm{GAP}(F_5)(c) = \frac{1}{H_5 \times W_5} \sum_{i=1}^{H_5} \sum_{j=1}^{W_5} F_5(i, j, c), \quad c = 1, \dots, C \tag{3}$$

Through the GAP operation, the three-dimensional feature tensor $F_5 \in \mathbb{R}^{16 \times 16 \times 2048}$ is mapped into a one-dimensional global feature vector $f_{global} \in \mathbb{R}^{2048}$.
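As a minimal sketch of Eq. (3), the following pure-Python function averages each channel of an H × W × C feature map over its spatial positions. The nested-list layout and the name `global_average_pool` are illustrative assumptions rather than the authors' implementation, and a tiny 2 × 2 × 3 tensor with made-up values stands in for the 16 × 16 × 2048 map.

```python
def global_average_pool(feature_map):
    """Average an H x W x C nested list over H and W, returning C values."""
    height = len(feature_map)
    width = len(feature_map[0])
    channels = len(feature_map[0][0])
    pooled = [0.0] * channels
    for row in feature_map:
        for pixel in row:
            for c in range(channels):
                pooled[c] += pixel[c]
    return [value / (height * width) for value in pooled]


# Toy 2 x 2 x 3 feature map (hypothetical values).
toy_map = [
    [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
    [[5.0, 0.0, 2.0], [7.0, 4.0, 2.0]],
]
f_global = global_average_pool(toy_map)  # one value per channel
```

The output has one entry per channel and no spatial axes, which is why the descriptor no longer depends on where in the image the fire appears.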

After obtaining the global feature vector, a Multi-Layer Perceptron (MLP) is employed to perform progressive nonlinear mapping, enabling discriminative modeling from the high-dimensional semantic space to the fire-stage label space. A two-layer MLP provides insufficient nonlinear modeling capacity, while an MLP with four or more layers introduces parameter redundancy and overfitting. Therefore, a three-layer fully connected (FC) structure is adopted. The monotonically decreasing dimensionality (2048 → 1024 → 512 → 3) follows an information compression funnel principle, progressively extracting abstract features for stage discrimination. The specific formulation is given as:

$$\begin{cases} h_1 = \mathrm{ReLU}(\mathrm{BN}(\mathrm{FC1}(f_{global}))) \\ h_2 = \mathrm{Dropout}(\mathrm{ReLU}(\mathrm{BN}(\mathrm{FC2}(h_1))),\ p = 0.3) \\ p_{stage} = \mathrm{Softmax}(\mathrm{FC3}(h_2)) \end{cases} \tag{4}$$

The MLP consists of three fully connected layers. FC1 (2048 → 1024) filters redundant details while preserving global semantic information; FC2 (1024 → 512) further abstracts the features and focuses on the core discriminative attributes; and FC3 (512 → 3) maps the features into the fire-stage probability space. A Dropout layer (p = 0.3) is applied to provide moderate regularization and prevent overfitting. The resulting output is a probability distribution:

$$p_{stage} = [p_{early}, p_{middle}, p_{recession}] \tag{5}$$

where $p_{early}$, $p_{middle}$, and $p_{recession}$ denote the predicted probabilities that the input image belongs to the early, middle, and recession stages, respectively, satisfying $\sum p_{stage} = 1$.
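The dimensionality funnel of Eq. (4) and the normalization of Eq. (5) can be illustrated with a short sketch. It counts the weights and biases of the three fully connected layers (batch-normalization parameters are deliberately left out) and applies a numerically stable softmax to hypothetical logits. The helper names and the logit values are assumptions made for illustration only.

```python
import math


def fc_parameter_count(layer_dims):
    """Count weights plus biases for a chain of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_dims[:-1], layer_dims[1:]):
        total += n_in * n_out + n_out
    return total


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


# FC1, FC2, FC3 only; BN parameters are not counted.
head_params = fc_parameter_count([2048, 1024, 512, 3])

# Hypothetical FC3 outputs for the early, middle, and recession stages.
p_stage = softmax([2.0, 0.5, -1.0])
```

Under these assumptions the three layers hold roughly 2.6 million parameters, most of them in FC1, and the softmax output sums to one as required by Eq. (5).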

To achieve an adaptive mapping from the feature space to the stage-category space while simultaneously ensuring training differentiability and deterministic inference, a hard-switch routing strategy is adopted:

$$k = \arg\max(p_{stage}) = \arg\max([p_{early}, p_{middle}, p_{recession}]) \tag{6}$$

Based on the decision index $k$, only the $k$-th stage-specific decoder is activated during inference. In contrast to conventional soft-weighted routing strategies, which require all decoders to be evaluated and then fused, this hard-switch mechanism significantly reduces computational overhead. However, the argmax operation is non-differentiable, which would block gradient backpropagation in end-to-end training. To address this issue, an explicit multi-task supervision strategy is introduced, decoupling network training into two complementary sub-tasks:

Stage classification task

The classification loss L_cls directly supervises the classification head to improve the reliability of routing decisions. This loss is backpropagated through the differentiable Softmax pathway, updating both the MLP classification head and the shared backbone.

Semantic segmentation task

The segmentation loss L_seg supervises the activated decoder. During training, the decoder corresponding to the ground-truth stage label y_stage is activated instead of the predicted k, thereby avoiding reliance on the non-differentiable argmax operation and encouraging each decoder to specialize in its corresponding stage.

These two gradient pathways converge at the shared backbone, enabling it to simultaneously learn macroscopic discriminative features for stage classification and fine-grained reconstruction features for pixel-level segmentation. The detailed design of the loss functions and their weight configuration is presented in Section 2.5. During inference, once the classification head has converged, the predicted routing index k is expected to reflect the true fire stage, allowing the hard-switch routing mechanism to activate the corresponding stage-specific decoder.
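The difference between training-time and inference-time routing described in this section can be sketched as follows. The decoder callables, the `STAGES` tuple, and the function name `route` are hypothetical placeholders. The only behavior taken from the text is that the ground-truth stage selects the decoder during training, whereas the argmax of the predicted probabilities selects it during inference.

```python
STAGES = ("early", "middle", "recession")


def route(p_stage, decoders, features, y_stage=None):
    """Run exactly one stage-specific decoder.

    Training: pass y_stage (ground-truth index) so no argmax is needed.
    Inference: leave y_stage as None and fall back to argmax(p_stage).
    """
    if y_stage is not None:
        k = y_stage
    else:
        k = max(range(len(p_stage)), key=lambda i: p_stage[i])
    return k, decoders[k](features)


# Placeholder decoders that only record which branch was executed.
calls = []


def make_decoder(name):
    def decoder(features):
        calls.append(name)
        return f"{name}-mask"
    return decoder


decoders = [make_decoder(name) for name in STAGES]

# Inference: the predicted distribution picks the middle-stage decoder.
k_infer, mask_infer = route([0.1, 0.7, 0.2], decoders, features=None)

# Training: the ground-truth label (recession) overrides the prediction.
k_train, mask_train = route([0.1, 0.7, 0.2], decoders, features=None, y_stage=2)
```

In both calls only one decoder runs, so the two inactive branches contribute no computation for that sample.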

2.4. Stage-Specific Decoders

This module consists of three parallel and parameter-independent decoding sub-networks, corresponding to the Early, Middle, and Recession stages of fire evolution. All decoding sub-networks maintain a consistent network topology, utilizing the dense skip connections and progressive upsampling strategy of U-Net++ to facilitate multi-scale feature fusion between encoding and decoding features. Specifically, each node in the decoder receives upsampled features from the preceding layer in addition to the features from all prior nodes within the same layer, establishing nested dense connection paths. Let $X^{(i,j)}$ denote the output feature map of a node; its computation is formalized as follows:

$$X^{(i,j)} = \begin{cases} F_i, & \text{if } j = 0 \\ \mathrm{Conv}\left(\uparrow\left(X^{(i+1,\,j-1)}\right) \oplus \left[X^{(i,k)}\right]_{k=0}^{j-1}\right), & \text{if } j > 0 \end{cases} \tag{7}$$

where $(i, j)$ denotes the node position, with $i$ representing the scale level and $j$ the dense connection depth; $F_i$ is the $i$-th level encoding feature provided by the shared backbone; $\uparrow$ denotes the 2× upsampling operation; $\oplus$ represents concatenation along the channel dimension; $\mathrm{Conv}$ refers to a 3 × 3 convolution; and $[X^{(i,k)}]_{k=0}^{j-1}$ represents the features from all preceding nodes within the same level.

Although the three decoders share the U-Net++ topology, significant differences exist in their key module designs to adapt to the specific feature requirements of different fire stages. As illustrated in the decoder architecture in Figure 4, the primary distinctions lie in the integration of specialized components (PixelShuffle, ASPP, and CBAM) alongside the use of differentiated activation functions.
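To make the connectivity in Eq. (7) concrete, the sketch below lists the inputs that are concatenated at a node X^(i, j). The function name `node_inputs` and the tuple encoding of nodes are illustrative assumptions, and the convolution itself is omitted.

```python
def node_inputs(i, j):
    """Return the sources combined at node X^(i, j), following Eq. (7).

    Nodes are encoded as (level, depth) tuples. For j == 0 the node is the
    backbone feature F_i itself, so the only source is the encoder.
    """
    if j == 0:
        return [("backbone", i)]
    # Upsampled output of the node one level deeper and one step earlier.
    upsampled = [("up", (i + 1, j - 1))]
    # All preceding nodes on the same level: X^(i, 0) ... X^(i, j-1).
    same_level = [("skip", (i, k)) for k in range(j)]
    return upsampled + same_level


inputs_1_0 = node_inputs(1, 0)
inputs_1_2 = node_inputs(1, 2)
```

The number of same-level skip inputs grows with j, which is what makes the connections dense rather than a single encoder-to-decoder skip.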

Early Stage Decoder Design

Based on the characteristics of early-stage flames, this study designs an early-stage decoder centered on signal enhancement and detail reconstruction within the U-Net++ nested decoding framework. Traditional bilinear interpolation is essentially a linear weighting with fixed coefficients and has limited ability to recover high-frequency details. In contrast, PixelShuffle achieves data-driven super-resolution reconstruction by rearranging learned channel features into spatial positions. This enables the network to learn reconstruction rules for the blurred edges of early-stage flames, mitigating the detail-smoothing effect inherent in conventional upsampling [25]. Given the input features $X \in \mathbb{R}^{H \times W \times C r^2}$, PixelShuffle performs periodic rearrangement as follows:

$$\mathrm{PS}(X)[h, w, c] = X\left[\lfloor h/r \rfloor,\ \lfloor w/r \rfloor,\ c \cdot r^2 + (h \bmod r) \cdot r + (w \bmod r)\right] \tag{8}$$

This operation redistributes the $C \cdot r^2$ input channels over an $r$-times finer spatial grid, generating the final output $X' \in \mathbb{R}^{rH \times rW \times C}$.
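The index mapping of Eq. (8) can be written out directly. The following sketch applies it to a nested list in H × W × C layout; the function name `pixel_shuffle` and the toy input are assumptions for illustration, and a real decoder would apply the same rearrangement to the output of a learned convolution.

```python
def pixel_shuffle(x, r):
    """Rearrange an H x W x (C*r*r) nested list into (r*H) x (r*W) x C."""
    height, width = len(x), len(x[0])
    channels = len(x[0][0]) // (r * r)
    out = []
    for h in range(r * height):
        row = []
        for w in range(r * width):
            # Eq. (8): source pixel (h // r, w // r), source channel
            # c * r^2 + (h mod r) * r + (w mod r).
            pixel = [
                x[h // r][w // r][c * r * r + (h % r) * r + (w % r)]
                for c in range(channels)
            ]
            row.append(pixel)
        out.append(row)
    return out


# One input pixel with C * r^2 = 4 channels becomes a 2 x 2 patch with C = 1.
shuffled = pixel_shuffle([[[10, 11, 12, 13]]], r=2)
```

The four channels of the single input pixel land at the four positions of a 2 × 2 output patch, so spatial resolution is gained without interpolating between neighbors.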

Given that early-stage fire spots typically manifest as spatially discrete star-shaped structures, the Coordinate Attention (CA) mechanism is introduced to explicitly encode positional information along both horizontal and vertical directions. This helps preserve positional information for small fire spots, enhances the localization of faint flame targets against background textures, and reduces background interference introduced by naive feature concatenation [26]. CA aggregates features separately along the horizontal and vertical axes:

$$\begin{cases} X_h = \mathrm{AvgPool}_w(X) \\ X_w = \mathrm{AvgPool}_h(X) \\ \mathrm{Attention} = \sigma(\mathrm{Conv}([X_h; X_w])) \end{cases} \tag{9}$$

where $X_h$ denotes pooling along the width dimension to preserve height ($h$) information; $X_w$ denotes pooling along the height dimension to preserve width ($w$) information; $\sigma$ is the Sigmoid activation function; and $[\cdot;\cdot]$ denotes concatenation.
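The two directional pooling steps of Eq. (9) can be sketched in a few lines. Only the pooling is shown; the convolution and Sigmoid that turn the pooled descriptors into attention weights are omitted. The function name `directional_pools` and the toy input are illustrative assumptions.

```python
def directional_pools(x):
    """Return (X_h, X_w) for an H x W x C nested list.

    X_h averages over the width axis and keeps one C-vector per row (H x C).
    X_w averages over the height axis and keeps one C-vector per column (W x C).
    """
    height, width, channels = len(x), len(x[0]), len(x[0][0])
    x_h = [
        [sum(x[h][w][c] for w in range(width)) / width for c in range(channels)]
        for h in range(height)
    ]
    x_w = [
        [sum(x[h][w][c] for h in range(height)) / height for c in range(channels)]
        for w in range(width)
    ]
    return x_h, x_w


# A single bright spot at row 0, column 1 of a 2 x 2 single-channel map.
x_h, x_w = directional_pools([[[0.0], [4.0]], [[0.0], [0.0]]])
```

In this toy case the nonzero entries of the two descriptors identify the row and the column of the spot, which is the positional cue that a single global pooling step would discard.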

Furthermore, in the nonlinear activation stage, the early-stage decoder employs the Parametric Rectified Linear Unit (PReLU) as its activation function. Unlike ReLU, which maps all negative inputs to zero, PReLU allows the network to adaptively learn the slope parameter α for negative inputs:

$$\mathrm{PReLU}(x) = \max(0, x) + \alpha \cdot \min(0, x) \tag{10}$$

This design is particularly critical for the early fire stage, as it ensures that faint flame features near the luminance threshold are not forcibly truncated to zero, thereby maintaining effective gradient propagation and accumulation throughout the network in this high-sensitivity segmentation task.
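A small numeric comparison shows the effect described for Eq. (10). The slope value and the input are hypothetical; in PReLU the slope is a learned parameter rather than a fixed constant.

```python
def prelu(x, alpha):
    """PReLU from Eq. (10): identity for x >= 0, slope alpha for x < 0."""
    return max(0.0, x) + alpha * min(0.0, x)


def relu(x):
    """Standard ReLU for comparison."""
    return max(0.0, x)


faint = -0.2   # hypothetical pre-activation slightly below zero
alpha = 0.25   # hypothetical slope; learned during training in practice

prelu_out = prelu(faint, alpha)  # small negative value is kept
relu_out = relu(faint)           # response is truncated to zero
```

Because the PReLU output still varies with the input for negative values, its gradient with respect to the input is α rather than zero.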

Middle Stage Decoder Design

To address the visual characteristics of the middle fire stage, including high combustion intensity, spatially continuous flame distribution, and structurally complete flame regions, a middle-stage decoder oriented toward global semantic modeling is designed upon the U-Net++ nested decoding framework.

Given that middle-stage flame regions span large areas with highly variable morphology, a single 3 × 3 convolutional receptive field is insufficient to capture the full spatial extent. The Atrous Spatial Pyramid Pooling (ASPP) module addresses this by connecting dilated convolutions with different dilation rates in parallel, enlarging the effective receptive field without adding parameters relative to a standard 3 × 3 kernel [27]. For a standard 3 × 3 convolution, the receptive field is RF = 3; for a single 3 × 3 dilated convolution with dilation rate d, the kernel taps are spaced d pixels apart, so the receptive field becomes RF = 2d + 1. In this work, a three-branch structure with d = {6, 12, 18} is adopted: d = 6 yields a receptive field of 13, capturing local flame texture; d = 12 yields 25, modeling the structural extent of the flame region; and d = 18 yields 37, perceiving the global combustion state. The ASPP output is defined as the concatenated fusion of all branches:

$$F_{ASPP} = \mathrm{Conv}_{1 \times 1}\left(\left[\mathrm{Conv}_{d=6}(F);\ \mathrm{Conv}_{d=12}(F);\ \mathrm{Conv}_{d=18}(F);\ \mathrm{GAP}(F)\right]\right) \tag{11}$$

where the Global Average Pooling (GAP) branch supplies global contextual information to compensate for the gridding artifacts inherent in dilated convolutions. The ASPP module is inserted before the deep decoding nodes, where it hierarchically extracts global context features and large-scale semantic information through the parallel dilated convolution branches and fuses them with the deep features from the backbone. This design substantially enlarges the effective receptive field, ensuring the decoder captures long-range dependencies within flame regions and resolves the hollow and fragmented structures that frequently arise inside large-scale flames.
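The branch-and-fuse structure of Eq. (11) can be sketched for a single-channel input. This is a simplified illustration under stated assumptions: every branch has one channel, so the 1 × 1 fusion convolution reduces to a weighted sum, and the kernel, fusion weights, and input are hypothetical. It is not the authors' implementation.

```python
def dilated_conv3x3(image, kernel, d):
    """Zero-padded 3 x 3 convolution with dilation d on a 2-D nested list."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for ky in range(3):
                for kx in range(3):
                    # Kernel taps are spaced d pixels apart.
                    yy, xx = y + (ky - 1) * d, x + (kx - 1) * d
                    if 0 <= yy < height and 0 <= xx < width:
                        acc += kernel[ky][kx] * image[yy][xx]
            out[y][x] = acc
    return out


def aspp(image, kernel, fuse_weights, dilations=(6, 12, 18)):
    """Fuse three dilated branches and a global-average branch."""
    height, width = len(image), len(image[0])
    branches = [dilated_conv3x3(image, kernel, d) for d in dilations]
    gap = sum(map(sum, image)) / (height * width)
    branches.append([[gap] * width for _ in range(height)])
    # With one channel per branch, the 1 x 1 convolution is a weighted sum.
    return [
        [sum(w * b[y][x] for w, b in zip(fuse_weights, branches))
         for x in range(width)]
        for y in range(height)
    ]


mean_kernel = [[1.0 / 9.0] * 3 for _ in range(3)]  # hypothetical kernel
flat = [[1.0] * 40 for _ in range(40)]             # constant toy input
fused = aspp(flat, mean_kernel, fuse_weights=[0.25, 0.25, 0.25, 0.25])
```

On the constant input, the center pixel sees all nine taps in every branch, while a corner pixel loses the taps that fall outside the image; the global-average branch is unaffected by position.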

Within the decoding module, to further enhance the spatial connectivity and structural coherence of middle-stage flame regions, a hybrid feature extraction strategy combining dilated convolution and standard 3 × 3 convolution is incorporated into the basic convolutional units. By expanding the effective receptive field while keeping the parameter count manageable, this strategy more thoroughly models the spatial dependencies of large-scale flame regions and effectively alleviates the region fragmentation and internal void artifacts characteristic of the middle stage. Compared with a pure channel attention mechanism, this approach operates more directly on the spatial dimension and is consequently more effective at recovering the continuous structural integrity of middle-stage flames.

Recession Stage Decoder Design

To address the visual characteristics of the fire recession stage, including weak flame intensity, scattered ember distribution, and low signal-to-noise ratio, an enhanced decoding strategy is proposed upon the U-Net++ architecture by integrating mixed-domain attention with smooth nonlinear activation.

The Convolutional Block Attention Module (CBAM) enhances discriminative features through sequential channel and spatial attention [28]. The channel attention is computed as follows:

$$M_c = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))) \tag{12}$$

where AvgPool denotes global average pooling, which aggregates the mean response of the feature map; MaxPool denotes global max pooling, which captures peak responses; MLP refers to a shared two-layer fully connected network; and σ denotes the Sigmoid activation function. The inclusion of MaxPool is particularly useful here: for the faint ember signals in the recession stage, AvgPool responses are easily diluted by the surrounding low-activation background, whereas MaxPool helps preserve locally strong responses and reduces the risk of suppressing discriminative channels. The spatial attention is computed as:

$$M_s = \sigma(\mathrm{Conv}_{7 \times 7}([\mathrm{AvgPool}_c(F); \mathrm{MaxPool}_c(F)])) \tag{13}$$

where AvgPool_c and MaxPool_c denote pooling along the channel dimension, [·;·] denotes concatenation, and Conv_7×7 is a 7 × 7 convolution. The spatial attention generates a pixel-wise saliency mask that helps localize ember regions against the recession-stage background of smoke and ash. This mask helps suppress distracting responses associated with smoke edges and ash reflections, while retaining ember-related textures that are more consistent with high-level semantic features, thereby improving localization quality.
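The argument for keeping the MaxPool descriptor in the channel attention of Eq. (12) can be checked numerically. The sketch below computes the two pooled descriptors of one channel that contains a single ember-like response; the 8 × 8 size and the activation values are hypothetical.

```python
def channel_descriptors(channel_map):
    """Return (global average, global max) of a 2-D single-channel map."""
    values = [v for row in channel_map for v in row]
    return sum(values) / len(values), max(values)


# 8 x 8 channel with one strong response on a silent background.
ember = [[0.0] * 8 for _ in range(8)]
ember[3][5] = 1.0

avg_desc, max_desc = channel_descriptors(ember)
```

The average descriptor shrinks in proportion to the map area, whereas the max descriptor keeps the full peak value, so a channel that responds only to a small ember is not indistinguishable from an inactive one.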

To mitigate the feature vanishing problem under low-contrast conditions, the Mish activation function [29] is adopted in the decoding units as a replacement for conventional ReLU. Mish is defined as:

f (x) = x \cdot \tanh (\ln (1 + e^{x}))

(14)

This smooth, non-monotonic formulation retains a nonzero gradient for small negative inputs, so faint flame responses continue to propagate gradients through the deep network instead of being zeroed out as they would be under ReLU. This improves the model’s perceptual sensitivity to extremely weak fire spots and the stability of segmentation under degraded visual conditions.
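The behavior of Equation (14) for weak negative responses can be checked numerically. The sketch below is a minimal standalone implementation for scalar inputs. The helper names and the finite-difference gradient estimate are assumptions made for illustration and are not taken from the SAMS-Net code.

```python
import math

def softplus(x):
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x):
    """Equation (14): f(x) = x * tanh(ln(1 + e^x))."""
    return x * math.tanh(softplus(x))

def relu(x):
    return max(x, 0.0)

def mish_grad(x, eps=1e-5):
    """Central-difference estimate of the Mish derivative."""
    return (mish(x + eps) - mish(x - eps)) / (2 * eps)

# For a weak negative pre-activation, ReLU outputs exactly zero,
# whereas Mish keeps a small nonzero output and a nonzero slope.
weak_response = -1.0
relu_out = relu(weak_response)
mish_out = mish(weak_response)
mish_slope = mish_grad(weak_response)
```

For large positive inputs Mish approaches the identity, so strong flame responses are passed through almost unchanged.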

2.5. Loss Function Design

Since the Hard-Switch Routing mechanism is mathematically non-differentiable, direct end-to-end training would block gradient backpropagation. Owing to the operational stage labels provided under the annotation protocol constructed in this study, an explicit multi-task supervision strategy is adopted. This strategy decouples the network training into two parallel gradient pathways: the classification loss directly supervises the stage classification head to ensure routing decision accuracy, while the segmentation loss supervises the activated decoding branch. The two gradient streams converge at the shared backbone, enabling it to simultaneously learn macro-level discriminative features for stage recognition and micro-level reconstructive features for pixel-wise segmentation.

To jointly optimize the two objectives of fire stage discrimination and fine-grained semantic segmentation within a unified framework, a multi-task joint loss function is formulated. The total loss L_{total} is defined as the weighted sum of the classification loss L_{cls} and the segmentation loss L_{seg}:

L_{total} = L_{seg} + \lambda \cdot L_{cls}

(15)

where λ is a hyperparameter that balances the relative contributions of the two tasks and regulates the magnitude of their respective gradient signals.

Stage Classification Loss

To encourage the model to learn highly discriminative stage-specific features, the cross-entropy loss function is employed for explicit supervision. Let N denote the batch size and M = 3 the number of fire stage categories. For a given input image, let y_{i} \in R^{M} denote the one-hot representation of the ground-truth stage label, and {\hat{p}}_{i} \in R^{M} denote the predicted probability distribution output by the classification head. The classification loss is then defined as:

L_{cls} = - \frac{1}{N} \sum_{i = 1}^{N} \sum_{m = 1}^{M} y_{i, m} \log {\hat{p}}_{i, m}

(16)

This loss is backpropagated through the differentiable Softmax operation along the path ∇L_cls → FC₃ → FC₂ → FC₁ → Backbone. It propagates independently of the segmentation task and directly updates both the MLP classification head and the shared backbone. Consequently, even though the non-differentiable Argmax operation is applied during inference, the classification head receives unambiguous gradient guidance through the cross-entropy loss during training, so that it produces reliable routing signals upon convergence.
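The split between the training-time loss and the inference-time switch can be sketched as follows. This is a minimal illustration: the logits are hypothetical head outputs, integer labels stand in for the one-hot vectors of Equation (16), and the function names are assumptions that do not come from the SAMS-Net implementation.

```python
import math

STAGES = ("early", "middle", "recession")

def softmax(logits):
    """Differentiable mapping from logits to stage probabilities."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(batch_logits, batch_labels):
    """Equation (16); with one-hot labels only the true-class term survives."""
    loss = 0.0
    for logits, label in zip(batch_logits, batch_labels):
        probs = softmax(logits)
        loss -= math.log(probs[label])
    return loss / len(batch_logits)

def route(logits):
    """Inference-time hard switch: index of the most probable stage."""
    return max(range(len(logits)), key=lambda m: logits[m])

# Hypothetical classification-head outputs for a batch of two images.
batch_logits = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
batch_labels = [0, 1]  # early, middle

loss = cross_entropy(batch_logits, batch_labels)      # used during training
routed = [STAGES[route(z)] for z in batch_logits]     # used during inference
```

Only `cross_entropy` participates in gradient computation. `route` is applied after training, which is why its non-differentiability does not affect optimization.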

Stage-Specific Segmentation Loss

Given the challenges of small foreground occupancy ratios in fire scenes, particularly in the early and recession stages where severe class imbalance between foreground and background is present, a single loss function is insufficient to achieve optimal performance. A combined loss integrating Dice Loss and Binary Cross-Entropy (BCE) Loss is therefore adopted:

L_{seg} = L_{bce} + L_{dice}

(17)

L_{bce} = - \frac{1}{H \times W} \sum_{i = 1}^{H} \sum_{j = 1}^{W} [y_{ij} \log ({\hat{y}}_{ij}) + (1 - y_{ij}) \log (1 - {\hat{y}}_{ij})]

(18)

L_{dice} = 1 - \frac{2 \sum_{i = 1}^{H} \sum_{j = 1}^{W} y_{ij} {\hat{y}}_{ij} + ε}{\sum_{i = 1}^{H} \sum_{j = 1}^{W} y_{ij} + \sum_{i = 1}^{H} \sum_{j = 1}^{W} {\hat{y}}_{ij} + ε}

(19)

where y_{ij} \in \{0, 1\} denotes the ground-truth label of pixel (i, j), {\hat{y}}_{ij} \in [0, 1] denotes the predicted probability output by the decoder, and ε is a smoothing term introduced to ensure numerical stability.
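To make the motivation for combining the two terms concrete, the sketch below evaluates Equations (17)–(19) on a toy 8 × 8 mask with two foreground pixels. The value ε = 1e-6, the probability clipping inside the BCE term, and the toy prediction are assumptions chosen for illustration, since the paper does not report these details.

```python
import math

def bce_loss(y_true, y_pred, clip=1e-7):
    """Equation (18): mean binary cross-entropy over all pixels."""
    total, count = 0.0, 0
    for row_t, row_p in zip(y_true, y_pred):
        for y, p in zip(row_t, row_p):
            p = min(max(p, clip), 1.0 - clip)  # assumed clipping to avoid log(0)
            total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
            count += 1
    return -total / count

def dice_loss(y_true, y_pred, eps=1e-6):
    """Equation (19): one minus the soft Dice overlap."""
    inter = sum(y * p for rt, rp in zip(y_true, y_pred) for y, p in zip(rt, rp))
    sum_true = sum(y for rt in y_true for y in rt)
    sum_pred = sum(p for rp in y_pred for p in rp)
    return 1.0 - (2.0 * inter + eps) / (sum_true + sum_pred + eps)

def seg_loss(y_true, y_pred):
    """Equation (17): unweighted sum of the two terms."""
    return bce_loss(y_true, y_pred) + dice_loss(y_true, y_pred)

# Toy early-stage mask: 2 foreground pixels out of 64.
mask = [[0] * 8 for _ in range(8)]
mask[3][3] = mask[3][4] = 1

# Degenerate "all background" prediction with probability 0.01 everywhere.
all_background = [[0.01] * 8 for _ in range(8)]

bce_value = bce_loss(mask, all_background)
dice_value = dice_loss(mask, all_background)
```

In this toy case the BCE term stays small even though every foreground pixel is missed, because the background pixels dominate the average. The Dice term stays close to 1 and therefore still penalizes the missed foreground.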

3. Experiments

3.1. Stage Definition and Dataset Construction

A prerequisite for stage-aware routing is the establishment of operational and quantifiable criteria for each fire-evolution stage category used in this study. Drawing upon forest fire behavior theory [30] and existing visual quantitative segmentation studies on fire [31,32], this work categorizes fire evolution into three distinct stages, defined as follows.

In the Early Stage, flames typically manifest as star-shaped or point-like fire sources with blurred edges, accompanied by high-frequency dynamic flickering. In terms of quantitative indicators, the flame area ratio is generally below 15%, the spatial distribution exhibits a discrete pattern with the number of connected components no less than 3, and the overall mean grayscale brightness remains below 180. As the fire spreads into the Middle Stage, flames gradually coalesce to form continuously burning regions, with the area ratio rapidly expanding to between 15% and 60%. At this stage, flame contours are well-defined, combustion is stable, and the mean brightness rises to 180 or above. When combustion intensity weakens and the fire transitions into the Recession Stage, smoke coverage exceeds 30% and the effective flame area falls below 20%. The image is predominantly occupied by low-brightness embers (grayscale values between 80 and 120) in discrete distributions, accompanied by intermittent faint flickering that is highly susceptible to interference from complex backgrounds. Through these quantitative criteria, the study establishes an operational labeling basis for subsequent stage classification and stage-specific mask generation.
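The quantitative criteria above can be expressed as a simple rule-based labeler. The sketch below is one possible encoding and not the authors' script. The function name, the order in which the rules are checked, and the "ambiguous" fallback for images matching no rule are assumptions. The ember grayscale range of 80 to 120 is omitted for brevity.

```python
def label_stage(flame_ratio, n_components, mean_brightness, smoke_coverage):
    """Assign a fire stage from image-level statistics.

    flame_ratio and smoke_coverage are fractions in [0, 1];
    mean_brightness is a grayscale value in [0, 255].
    """
    # Assumption: recession is tested first, because its flame-area range
    # (< 20%) overlaps the early-stage range (< 15%).
    if smoke_coverage > 0.30 and flame_ratio < 0.20:
        return "recession"
    if flame_ratio < 0.15 and n_components >= 3 and mean_brightness < 180:
        return "early"
    if 0.15 <= flame_ratio <= 0.60 and mean_brightness >= 180:
        return "middle"
    # Assumption: anything else is deferred to manual review.
    return "ambiguous"

# Hypothetical image statistics, one example per outcome.
examples = {
    "early": label_stage(0.05, 4, 150, 0.05),
    "middle": label_stage(0.40, 1, 200, 0.10),
    "recession": label_stage(0.10, 5, 100, 0.45),
    "ambiguous": label_stage(0.70, 1, 220, 0.00),
}
```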

To address the limitation that most publicly available image segmentation datasets cover general scenes and lack fine-grained annotations for fire evolution stages, a dedicated multi-stage annotated dataset is constructed specifically for the forest fire stage-aware segmentation task. The data sources comprise two main channels: the first involves collecting and reorganizing open-source public datasets from the fire image segmentation domain, and the second consists of a simulated forest fire combustion experiment dataset. To improve segmentation quality across different physical evolution stages and to increase data diversity, the data collection and screening process explicitly encompasses fire images captured under diverse lighting conditions, shooting distances, viewpoints, complex backgrounds such as smoke interference, and multiple image resolutions. After aggregating data from both channels and applying rigorous deduplication and cleaning procedures, a dedicated dataset of 2143 high-quality images is ultimately constructed. To ensure objectivity and comprehensiveness in model evaluation, the dataset is randomly partitioned into training, validation, and test sets at a ratio of 8:1:1. While this benchmark incorporates images from reorganized public datasets and simulated combustion experiments, covering diverse lighting conditions, shooting distances, viewpoints, and background complexities, the present evaluation remains confined to a single internally assembled dataset. The reported results therefore reflect the performance of SAMS-Net under this benchmark setting, while broader generalization across different camera systems, geographic environments, or collection protocols has not yet been directly verified. Evaluating the proposed method on independent public datasets, such as UAV-FFDB [19], is a priority direction for future work. Representative images from the dataset are shown in Figure 5 below.

A semi-automatic two-tier annotation scheme is adopted for data labeling. At the image level, stage labels are first automatically generated via HSV rule-based scripts, after which two professional annotators manually review samples with ambiguous boundaries falling within a threshold margin of ±5%. The stage-labeling protocol is based on HSV-derived quantitative criteria, so the resulting classification task is closely tied to the adopted operational definition of fire stages. Under this setting, classification performance reflects the model’s ability to learn the stage taxonomy defined in the dataset. Since transitions between adjacent fire stages are gradual in real scenes, samples near the decision thresholds are more likely to contain ambiguity in stage assignment. To examine this effect, boundary-ambiguous samples within a ±5% margin of the corresponding stage threshold are analyzed separately in a later boundary-ambiguity analysis. The lower classification accuracy and segmentation IoU observed for these samples indicate that transition ambiguity affects both routing performance and segmentation quality. This result is also consistent with the continuous nature of fire evolution, where neighboring stages often share overlapping visual characteristics. Further improvement may benefit from a less rule-dependent annotation protocol together with a more comprehensive sensitivity analysis for transition cases.

At the pixel level, DeepLabV3+ pre-annotations are used as the initial reference, followed by manual boundary refinement in LabelMe and automated quality control through quantitative constraint scripts. The overall annotation process maintains a high level of inter-annotator consistency. To handle complex interference such as smoke, a hierarchical annotation strategy is devised: at the macro level, smoke coverage rate serves as an important criterion for image-level stage classification; at the micro level, smoke is assigned to the background class in the binary mask, with only flames and embers retained as foreground targets. Accordingly, the segmentation task addressed in this study is defined as flame-and-ember foreground segmentation rather than a comprehensive smoke-aware fire scene parsing problem. Under this definition, the proposed method does not explicitly model smoke as an independent detection target and is therefore more suitable for flame-and-ember extraction scenarios than for smoke-dominant early warning settings in which smoke serves as the primary observable cue. As illustrated in Figure 6, a representative sample from the middle stage is presented, where (a) shows the original fire image and (b) shows the corresponding binary mask.

3.2. Experimental Setup and Evaluation Metrics

Experimental Setup

All models, including SAMS-Net and all baseline methods, were trained and evaluated under a unified software environment of PyTorch 2.1.2 and CUDA 11.8. All experiments were conducted on a cloud server equipped with a single NVIDIA Tesla V100 GPU (32 GB VRAM), an Intel Xeon CPU, and 64 GB RAM. All models were trained using the same dataset split, batch size (8), and training budget of up to 150 epochs. Each model was independently trained and evaluated three times using different random seeds, and the principal results are reported as mean ± standard deviation. All models were initialized with ImageNet-pretrained backbone weights when such official or standard pretrained checkpoints were available; otherwise, random initialization was adopted.

The training images were preprocessed and augmented as follows. Each image was first resized to 512 × 512 using bilinear interpolation, and then subjected to random horizontal flipping (p = 0.5), random vertical flipping (p = 0.5), and random rotation within ±10°. Color jittering was applied with brightness and contrast variation factors of 0.2; saturation and hue were not perturbed. Finally, all images were normalized using ImageNet statistics (mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225]). During inference, only resizing to 512 × 512 and ImageNet normalization were applied, without test-time augmentation.

CNN-based baselines, including FCN, U-Net, U-Net++, DeepLabV3, and PSPNet, were optimized using AdamW with an initial learning rate of 1 × 10⁻⁴ and a weight decay of 1 × 10⁻², consistent with the settings used for SAMS-Net. SegFormer adopted a layer-wise learning-rate strategy, with the backbone learning rate set to 6 × 10⁻⁵ and the decode-head learning rate set to 6 × 10⁻⁴, while the weight decay was kept at 1 × 10⁻². To maintain a consistent optimizer family across all methods, YOLOv9-Seg was also trained with AdamW; however, its initial learning rate was set to 1 × 10⁻³ and its weight decay to 5 × 10⁻⁴. For all models, cosine annealing was adopted for learning-rate scheduling.
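The reported optimizer settings and the cosine annealing schedule can be summarized in a short sketch. The dictionary keys and the function name are hypothetical. The schedule assumes a per-epoch update over the full 150-epoch budget, a minimum learning rate of 0, and no warm-up, none of which is specified in the text.

```python
import math

# AdamW settings as reported in the text (keys are illustrative names).
OPTIM_CONFIG = {
    "cnn_baselines_and_sams_net": {"lr": 1e-4, "weight_decay": 1e-2},
    "segformer_backbone": {"lr": 6e-5, "weight_decay": 1e-2},
    "segformer_decode_head": {"lr": 6e-4, "weight_decay": 1e-2},
    "yolov9_seg": {"lr": 1e-3, "weight_decay": 5e-4},
}

def cosine_annealing_lr(base_lr, epoch, total_epochs=150, min_lr=0.0):
    """Learning rate at a given epoch under cosine annealing.

    min_lr = 0 and a single cycle over total_epochs are assumptions.
    """
    cos_term = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return min_lr + (base_lr - min_lr) * cos_term

base_lr = OPTIM_CONFIG["cnn_baselines_and_sams_net"]["lr"]
lr_start = cosine_annealing_lr(base_lr, 0)
lr_middle = cosine_annealing_lr(base_lr, 75)
lr_end = cosine_annealing_lr(base_lr, 150)
```

Under these assumptions the learning rate starts at the reported initial value, reaches half of it at epoch 75, and decays to the minimum at epoch 150.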

For SAMS-Net, the weighting ratio between the classification loss and the segmentation loss was set to λ = 0.5. In addition, early stopping with a patience of 20 epochs based on validation mIoU was applied within a maximum training budget of 150 epochs. Training SAMS-Net for 150 epochs typically required approximately 5.6 h under this configuration. During single-scale inference evaluation, all models loaded the best checkpoint selected on the validation set according to validation mIoU. This training protocol was designed to preserve comparability in the data split, augmentation settings, and training budget, while allowing a small number of architecture-sensitive hyperparameter adjustments where necessary.
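The early-stopping and checkpoint-selection rule described above can be sketched as a small helper. The class name, the strict-improvement criterion, and the toy validation curve are assumptions for illustration. The paper states only the patience of 20 epochs and the use of validation mIoU.

```python
class EarlyStopping:
    """Track the best validation mIoU and stop after `patience` bad epochs."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_miou):
        """Record one epoch; return True when training should stop."""
        if val_miou > self.best_score:  # assumption: ties do not count
            self.best_score = val_miou
            self.best_epoch = epoch     # the checkpoint to reload for testing
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Hypothetical validation curve; patience is shortened to 3 for readability.
stopper = EarlyStopping(patience=3)
toy_curve = [0.50, 0.60, 0.59, 0.58, 0.57, 0.90]
stopped_at = None
for epoch, val_miou in enumerate(toy_curve):
    if stopper.step(epoch, val_miou):
        stopped_at = epoch
        break
```

In the toy run, training stops before the final value of the curve is seen, which illustrates the usual trade-off between patience and training cost.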

Semantic Segmentation Evaluation Metrics

Pixel-level evaluation metrics are adopted for the semantic segmentation task. Intersection over Union (IoU) and the Dice coefficient serve as the primary metrics, supplemented by Pixel Accuracy (PA) for multi-dimensional verification. All metrics are computed from four fundamental statistics derived from the confusion matrix: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).

IoU is the most widely accepted core evaluation metric in segmentation tasks, measuring the degree of overlap between the predicted mask and the ground-truth mask in pixel space. Due to its high sensitivity to both missed detections and false alarms, it serves as the standard benchmark for comparative experiments in this domain:

\mathrm{IoU} = \frac{TP}{TP + FP + FN}

(20)

In the comparative experiments, both the individual IoU of each stage-specific decoder and the mean Intersection over Union (mIoU) across the full dataset are reported:

\mathrm{mIoU} = \frac{1}{C} \sum_{c = 1}^{C} \mathrm{IoU}_{c}

(21)

where C denotes the total number of categories and c indexes the categories. In this study, C = 2 is used for binary segmentation, where the foreground corresponds to flames and embers, and smoke is treated as part of the background. In the ablation experiments, the IoU of each stage-specific decoder, namely the early, middle, and recession stages, is reported independently to validate the effectiveness of the stage-specific decoder architecture design.

The Dice coefficient is highly correlated with IoU but exhibits greater sensitivity to small target regions, offering complementary advantages. In scenarios with severe class imbalance, the Dice coefficient more objectively reflects the model’s fitting quality on foreground features compared with global pixel accuracy:

\mathrm{Dice} = \frac{2 \times TP}{2 \times TP + FP + FN}

(22)

Pixel Accuracy (PA) is included as a standard auxiliary metric to facilitate cross-comparison with baseline methods under different task configurations:

\mathrm{PA} = \frac{TP + TN}{TP + TN + FP + FN}

(23)
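Equations (20)–(23) can be computed directly from the four confusion-matrix counts, as in the following sketch on a flattened toy mask. The function names, the convention of returning 0 for an empty denominator, and the treatment of mIoU as the mean of foreground and background IoU for C = 2 are assumptions. The paper does not state whether counts are accumulated per image or over the whole test set.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for flattened binary masks."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 0:
            tn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn

def safe_div(num, den):
    """Assumed convention: a metric with an empty denominator is 0."""
    return num / den if den else 0.0

def iou(tp, fp, fn):
    return safe_div(tp, tp + fp + fn)

def dice(tp, fp, fn):
    return safe_div(2 * tp, 2 * tp + fp + fn)

def pixel_accuracy(tp, tn, fp, fn):
    return safe_div(tp + tn, tp + tn + fp + fn)

def mean_iou(tp, tn, fp, fn):
    """Mean of foreground and background IoU (C = 2)."""
    foreground = iou(tp, fp, fn)
    background = iou(tn, fn, fp)  # positives and negatives swap roles
    return (foreground + background) / 2

# Toy flattened masks: 3 foreground pixels, one missed, one false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
```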

Stage Classification Evaluation Metric

Classification accuracy (Acc) is defined as the proportion of correctly classified samples out of the total number of samples, and is computed as follows:

\mathrm{Acc} = \frac{N_{correct}}{N_{total}}

(24)

3.3. Comparison with Baseline Methods

To comprehensively evaluate the performance of SAMS-Net, a set of mainstream and representative deep learning semantic segmentation models is selected as baseline methods for comparison, including FCN, U-Net, U-Net++, DeepLabV3, PSPNet, SegFormer, and YOLOv9-Seg. To ensure a fair comparison, all baseline methods were trained and tested under the same dataset split, augmentation strategy, and evaluation protocol, while limited architecture-sensitive hyperparameter adjustments were introduced when necessary following official recommendations or standard practice.

As shown in Table 1, SAMS-Net achieves the best performance across all reported metrics, with a mean mIoU of 76.16 ± 0.31 over three runs, together with a Dice of 81.30 and a PA of 90.31, supporting the effectiveness of stage decoupling and stage-specific decoding under the current benchmark setting. The margin over the strongest baseline, YOLOv9-Seg (71.63 ± 0.38), remains clearly larger than the observed run-to-run variation. SAMS-Net also shows the smallest standard deviation on mIoU among all compared methods, indicating stable performance across repeated runs. The recession stage shows relatively larger standard deviations across models, likely due to the smaller number of test samples and the higher visual ambiguity of this stage. In the early stage, SAMS-Net leverages the PixelShuffle and CA mechanisms to achieve an IoU of 68.21%, surpassing YOLOv9-Seg and other competing methods, enabling precise extraction of small and spatially discrete fire spots. In the middle stage, the integration of the ASPP module effectively enlarges the receptive field, yielding an IoU of 82.31% that outperforms DeepLabV3 and other baselines, successfully mitigating the hollow and fragmented structures within large-scale continuous flame regions. In the recession stage, where dense smoke occlusion and extremely low contrast cause a sharp performance degradation in all other models, SAMS-Net relies on CBAM and the Mish activation function to achieve an IoU of 65.37%, showing stronger resistance to background interference than the compared baselines on the tested recession-stage samples. To further illustrate the qualitative segmentation behavior of the proposed model on representative test samples, Figure 7 presents six representative test samples covering all three fire evolution stages, two per stage, along with a visual comparison of the binary predicted masks generated by SAMS-Net and each baseline model.

The qualitative results in Figure 7 provide an intuitive illustration of the segmentation performance differences among all models across different physical fire stages. In the early stage (rows 1–2), when faced with scattered star-shaped fire spots with blurred edges, FCN and PSPNet suffer from severe missed detections due to the absence of shallow high-resolution feature recovery mechanisms, and the conventional U-Net also fails to preserve a large number of small fire patches. In contrast, SAMS-Net accurately captures and reconstructs isolated pixel-level features with high sensitivity. In the middle stage (rows 3–4), for large-scale continuous flame regions, the predictions of U-Net and other baseline models exhibit pronounced structural fragmentation and internal voids within the flame areas. By comparison, both SAMS-Net and YOLOv9-Seg generate dense and spatially well-connected masks, validating the effectiveness of the ASPP module in the middle-stage decoder for maintaining global structural consistency. In the most challenging recession stage, where dense smoke severely occludes the fire source, FCN, SegFormer, and DeepLabV3 produce significant false detections by misclassifying bright smoke regions as flame foreground. SAMS-Net, in contrast, more effectively suppresses interference from bright smoke regions by separating the smoke background and retaining the faint ember regions in these representative recession-stage examples.

3.4. Ablation Study

3.4.1. Validation of the Stage-Aware Routing Mechanism

To validate the effectiveness of the stage-aware Hard-Switch Routing strategy in SAMS-Net, three comparative experiments are designed to investigate the impact of different routing mechanisms on segmentation accuracy (mIoU), computational complexity (FLOPs), and inference speed (FPS). The three configurations are defined as follows (Table 2):

Baseline

The stage classification head and multi-branch structure are removed, and a single U-Net++ decoder is used to process data from all fire stages, representing the traditional non-decoupled segmentation paradigm.

Soft-Routing

The three stage-specific decoders are retained, but the Argmax hard switch is removed. During inference, all three decoders are simultaneously activated, and the final output is computed as the weighted sum of the three decoder predictions based on the classification head probabilities (P_{early}, P_{middle}, P_{recession}).

Hard-Routing

The proposed method in this work.
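The difference between the Soft-Routing and Hard-Routing configurations can be illustrated with stub decoders that count how often they are executed. Everything in this sketch is hypothetical: the decoders are simple scaling functions rather than the actual stage-specific branches, and the probabilities and features are made-up values.

```python
calls = {"early": 0, "middle": 0, "recession": 0}

def make_decoder(name, scale):
    """Build a stub decoder that records each invocation."""
    def decoder(features):
        calls[name] += 1
        return [min(1.0, scale * f) for f in features]
    return decoder

def hard_route(stage_probs, decoders, features):
    """Hard switch: run only the decoder of the most probable stage."""
    k = max(range(len(stage_probs)), key=lambda i: stage_probs[i])
    return decoders[k](features)

def soft_route(stage_probs, decoders, features):
    """Soft routing: run every decoder and blend outputs by probability."""
    outputs = [dec(features) for dec in decoders]
    return [
        sum(p * out[i] for p, out in zip(stage_probs, outputs))
        for i in range(len(features))
    ]

decoders = [
    make_decoder("early", 1.0),
    make_decoder("middle", 0.5),
    make_decoder("recession", 0.25),
]
stage_probs = [0.1, 0.8, 0.1]  # hypothetical classification-head output
features = [0.2, 0.6, 1.0]     # hypothetical per-pixel features

hard_mask = hard_route(stage_probs, decoders, features)
calls_after_hard = dict(calls)  # only the middle decoder has run
soft_mask = soft_route(stage_probs, decoders, features)
calls_after_soft = dict(calls)  # all three decoders have now run
```

The call counts make the cost difference explicit: the hard switch executes one decoder per image, whereas soft routing executes all three before blending.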

The experimental results first confirm the validity of the divide-and-conquer strategy: SAMS-Net achieves a substantial mIoU improvement of 10.87% over the single universal decoder baseline, indicating that stage-specific decoding is beneficial for jointly handling small fire spots and large-scale semantic structures. A further comparison of routing mechanisms reveals that although Soft-Routing improves accuracy through multi-branch ensemble, the excessive computational redundancy causes its inference speed to drop sharply to 28.5 FPS, making it less favorable for real-time monitoring scenarios in which additional system overhead must also be considered. In contrast, SAMS-Net achieves a stage classification accuracy of 94.5%, enabling the Hard-Switch strategy to selectively activate the corresponding stage-specific decoder while suppressing interference from irrelevant branches. This design attains the highest segmentation accuracy of 76.16% while constraining the computational load to 64.12 GFLOPs through dynamic pruning during inference, ultimately achieving an inference speed of 75.8 FPS and demonstrating a favorable balance between segmentation accuracy and inference efficiency under the current experimental setting.

3.4.2. Analysis of Stage-Specific Decoder Modules

To investigate the effectiveness of the decoding strategies designed for different physical fire stages in SAMS-Net, module ablation experiments are conducted on the independent test set of each stage to observe the IoU improvement contributed by each component.

As shown in Table 3, the results reveal the underlying mechanisms through which each stage-specific decoding module contributes to performance improvement. In the early stage, the baseline model is constrained by the spatial information loss caused by downsampling, yielding an IoU of only 58.45%. PixelShuffle effectively sharpens blurred edges through data-driven sub-pixel channel rearrangement, while PReLU adaptively preserves faint features near the luminance threshold. The subsequent incorporation of CA further enables precise spatial coordinate encoding of discrete fire spots, collectively pushing the IoU to 68.21% and validating the effectiveness of signal enhancement and detail reconstruction strategies for small target segmentation. For the middle stage, where semantic fragmentation frequently occurs within large-scale flame regions, the ASPP module constructs a hybrid receptive field spanning from local texture to global combustion state through multi-rate dilated convolutions, successfully compensating for the gridding artifacts of dilated convolutions and reinforcing spatial connectivity, thereby raising the IoU to 82.31%. In the most challenging recession stage, the global MaxPool introduced within the CBAM precisely captures faint ember responses that would otherwise be diluted by the background averaging effect. Combined with the non-monotonic smooth properties of the Mish activation function under low-contrast conditions, effective gradient propagation is maintained throughout the deep network, significantly enhancing the model’s interference suppression capability under dense smoke occlusion. The IoU ultimately reaches 65.37%, indicating the contribution of mixed-domain attention and smooth nonlinear activation to segmentation performance under the tested low-contrast recession-stage conditions.

3.4.3. Loss Function Configuration Analysis

Sensitivity Analysis of Multi-Task Loss Weights

According to Equation (15), the hyperparameter λ governs the relative emphasis placed on stage discrimination accuracy and segmentation refinement during training. When λ is too small, the classification head fails to converge adequately, causing the Hard-Switch Routing mechanism to malfunction. When λ is too large, the segmentation task gradients are overwhelmed, leading to degraded mask generation quality. A grid search experiment is conducted over the range λ \in [0, 1], and the results are presented in Figure 8.

As shown in Figure 8, when λ is small, the insufficient classification loss weight causes the stage classification accuracy (Acc) to remain at a low level below 80%. Under these conditions, the underfitted classification head misroutes a large proportion of images to mismatched decoders, severely degrading the mIoU to approximately 60%. As λ increases toward 0.5, the classification accuracy rapidly saturates to above 94.5% and routing decisions become stable. At this point, the gradient contribution from the segmentation task remains unimpaired, and mIoU reaches its peak value of 76.16%. Further increasing λ beyond this point maintains classification accuracy above 95%, but the mIoU curve exhibits a steep downward trend. This is attributed to the dominance of classification task gradients during backpropagation, which causes the shared backbone to overfit toward global discriminative features at the expense of the spatial detail information required for pixel-level reconstruction.

Effectiveness Analysis of Segmentation Loss Components

Given the severe foreground-background class imbalance characteristic of forest fire scenes, a single loss function is generally insufficient. The effects of using BCE Loss alone, Dice Loss alone, and their combination (BCE + Dice) are compared systematically.

As shown in Figure 9, the experimental results reveal the limitations of individual loss functions and the necessity of a hybrid strategy. BCE Loss performs acceptably in the middle stage (IoU of 78.12%), but in the early and recession stages it is dominated by the overwhelming number of background negative samples, causing the model to easily collapse into the degenerate all-black prediction local optimum. Dice Loss, by contrast, substantially alleviates the class imbalance problem through region overlap computation, raising the early-stage IoU to 64.15%, but its gradient oscillations lead to blurred boundaries in large-scale middle-stage fires, resulting in an IoU of 76.88%. The combined BCE and Dice strategy achieves an overall mIoU of 76.16% by successfully realizing gradient synergy between global localization and local refinement: Dice provides strong global gradients in the early training phase to address small target missed detections, while BCE supplies stable pixel-level gradients in the later phase to smooth oscillations and sharpen boundaries. Most notably, in the most challenging recession stage, this combination effectively overcomes smoke interference and pushes the IoU to 65.37%, strongly validating its superior capability to simultaneously ensure localization accuracy and boundary precision under complex low-contrast conditions.

3.4.4. Boundary-Ambiguity Analysis

To investigate the influence of stage-transition ambiguity, the internal test set was divided into boundary-ambiguous and non-boundary subsets according to whether the HSV-rule-derived stage criteria fell within a ±5% margin of the corresponding threshold. Classification accuracy and segmentation IoU were then evaluated separately for the two subsets, as reported in Table 4.
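The partition into boundary-ambiguous and non-boundary subsets can be sketched as a margin test around a stage threshold. The paper does not specify whether the ±5% margin is relative to the threshold or absolute. The sketch below assumes a relative margin, and switching to an absolute margin would change only the comparison line. The function names and sample values are hypothetical.

```python
def is_boundary_ambiguous(value, threshold, margin=0.05):
    """True when `value` lies within +/- margin of `threshold`.

    Assumption: the margin is relative, i.e. 5% of the threshold value.
    """
    return abs(value - threshold) <= margin * threshold

def split_by_ambiguity(values, threshold, margin=0.05):
    """Partition measurements into (ambiguous, non-boundary) lists."""
    ambiguous = [v for v in values if is_boundary_ambiguous(v, threshold, margin)]
    clear = [v for v in values if not is_boundary_ambiguous(v, threshold, margin)]
    return ambiguous, clear

# Hypothetical flame-area ratios tested against the 15% Early/Middle threshold.
flame_ratios = [0.05, 0.148, 0.155, 0.40]
ambiguous, clear = split_by_ambiguity(flame_ratios, threshold=0.15)
```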

As shown in Table 4, boundary-ambiguous samples are consistently more challenging for both classification and segmentation. Compared with non-boundary samples, the boundary-ambiguous subset yields lower classification accuracy and lower IoU across all three stages. Specifically, the Early-stage IoU decreases from 69.5% to 60.0%, the Middle-stage IoU decreases from 83.4% to 75.6%, and the Recession-stage IoU decreases from 66.9% to 55.7%.

This performance gap is consistent with the continuous nature of fire evolution, in which transitions between adjacent stages are gradual rather than strictly discrete. Samples located near stage boundaries often exhibit overlapping visual characteristics, making stage assignment less certain and increasing the difficulty of both routing and segmentation. The largest performance drop is observed in the Recession stage, where the visual overlap between late-Middle and early-Recession appearances becomes stronger under dense smoke conditions. In such cases, small variations in flame area ratio and smoke coverage can lead to greater sensitivity around the decision threshold. These findings suggest that improved robustness near stage transitions may require both a less rule-dependent labeling protocol and a more systematic sensitivity analysis of transition samples.

3.5. Preliminary Zero-Shot Cross-Dataset Qualitative Analysis

To provide an initial empirical characterization of cross-dataset behavior, a zero-shot transfer evaluation was conducted on the FLAME dataset, an independently collected public fire-image dataset with acquisition equipment, combustion materials, and scene compositions that differ from those of the training data. Twelve images were randomly sampled from FLAME without stage-stratified selection, retraining, fine-tuning, or domain-specific preprocessing. Apart from the standard 512 × 512 resizing used during inference, no additional transformation was applied. None of the sampled FLAME images was used in any stage of model training or validation. Each sampled image was processed by the complete SAMS-Net inference pipeline. Specifically, the classification head first generated a routing signal corresponding to the predicted fire-evolution stage, after which the Hard-Switch mechanism activated the associated stage-specific decoder to produce the binary segmentation mask. The model-predicted stage label for each sample was recorded together with the segmentation result, so that the model’s internal routing outputs under distribution shift could be inspected. Because external stage annotations are unavailable for the sampled FLAME images, the predicted stage labels should be interpreted as model-internal routing outputs rather than externally verified ground truth. Representative zero-shot transfer results of SAMS-Net on twelve randomly sampled FLAME images are shown in Figure 10.

Each example consists of an original fire-scene image and the corresponding binary segmentation result produced by SAMS-Net. The colored tag indicates the fire-evolution stage predicted by the model’s classification head. These stage labels are model-internal routing outputs rather than independently verified external annotations. The images were sampled without stage stratification, and both successful and challenging cases are shown without cherry-picking to illustrate model behavior under natural distribution shift.

For samples routed by the classification head to the middle-stage decoder, SAMS-Net often produces spatially coherent masks that capture the principal flame regions. This behavior is qualitatively consistent with the intended role of the ASPP-based decoder in preserving large-scale spatial continuity across diverse combustion scenes. For samples routed to the early-stage decoder, discrete small fire regions are localized in most cases; however, one low-luminance sample exhibits partial missed detections, suggesting a sensitivity boundary near the lower end of the brightness range represented in the training distribution.

The clearest degradation is observed in smoke-heavy external samples whose appearance differs substantially from that of the training data, where the predicted masks show a higher false-positive tendency than that observed on the internal test set. This behavior may reflect the increased difficulty of external smoke- and ember-appearance shifts, together with the current task definition, in which smoke is assigned to the background class and the segmentation target is restricted to flame-and-ember foreground regions. Under this formulation, scenes dominated by visually atypical smoke patterns remain inherently challenging regardless of the specific segmentation architecture.

These observations provide preliminary qualitative evidence that the proposed framework retains stage-consistent routing behavior and functional segmentation capability under moderate domain shift. As noted above, the predicted stage labels are model-internal routing outputs rather than externally verified ground truth. Systematic quantitative evaluation on independently annotated public datasets remains necessary for a more rigorous assessment of cross-dataset generalization.
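
As an illustration of what such a quantitative evaluation would compute once external ground-truth masks are available, the following sketch derives foreground IoU, Dice, and pixel accuracy (PA) from a predicted and a reference binary mask. It is an assumption-laden example rather than the evaluation code used in the paper: the function names are hypothetical, masks are nested lists of 0/1 values, and the convention of returning 1.0 when both masks are empty is a choice made here.

```python
from typing import Dict, List, Tuple

Mask = List[List[int]]


def confusion(pred: Mask, target: Mask) -> Tuple[int, int, int, int]:
    """Count true/false positives and negatives over all pixels."""
    tp = fp = fn = tn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def foreground_metrics(pred: Mask, target: Mask) -> Dict[str, float]:
    """Foreground IoU, Dice, and pixel accuracy for one binary mask pair."""
    tp, fp, fn, tn = confusion(pred, target)
    union = tp + fp + fn
    total = union + tn
    return {
        # Assumed convention: two empty masks count as a perfect match.
        "iou": tp / union if union else 1.0,
        "dice": 2 * tp / (2 * tp + fp + fn) if union else 1.0,
        "pa": (tp + tn) / total if total else 1.0,
    }


pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
scores = foreground_metrics(pred, target)
print(scores)  # iou = 0.5, dice and pa = 2/3
```

A false-positive tendency of the kind observed on smoke-heavy samples would appear here as a growing `fp` count, which lowers IoU and Dice more sharply than PA when the background dominates the image.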

4. Conclusions

This paper investigates stage-aware semantic segmentation for forest fire imagery by integrating fire-evolution knowledge with a decoupled segmentation framework. To overcome the lack of stage adaptability in conventional general-purpose segmentation models, a stage-aware multi-head segmentation network, SAMS-Net, is proposed. Compared with traditional models such as U-Net++ and SegFormer, SAMS-Net shows improved adaptability to the morphological variations observed across different fire stages.

After confirming the stage heterogeneity of the fire evolution process, a dynamic architecture was constructed and evaluated experimentally, following the paradigm of shared feature extraction, Hard-Switch Routing, and stage-specific decoding. With ResNet-50 as the shared feature extractor, Hard-Switch Routing dynamically activates the decoder that matches the predicted stage, establishing a multi-stage monitoring and analysis framework that spans early-stage small fire spots, middle-stage large-scale combustion, and recession-stage ember regions. Analysis of the segmentation accuracy of SAMS-Net across the three physical stages of fire evolution indicates that the stage classification head and the stage-specific decoders complement each other, which helps address the difficulty a single model has in handling diverse multi-scale features simultaneously. In the early stage, fire spots are spatially discrete with blurred edges, so their features are easily lost during downsampling, leading to severe missed detections in conventional models. The early-stage branch of SAMS-Net introduces PixelShuffle and Coordinate Attention (CA) to reconstruct the spatial positions of small fire spots more precisely and reduce segmentation errors. In the recession stage, the corresponding branch uses the Convolutional Block Attention Module (CBAM) to suppress smoke noise and reduce the foreground confusion caused by dense smoke occlusion.
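
To make the PixelShuffle step more concrete, the sketch below implements only the sub-pixel rearrangement itself: a tensor with C·r² channels of size H × W is reorganized into C channels of size rH × rW. It is a standalone illustration under stated assumptions, not the early-stage decoder of SAMS-Net: the function name is hypothetical, tensors are nested Python lists, and the channel ordering (input channel c·r² + i·r + j feeds sub-pixel offset (i, j)) follows the convention commonly used by deep learning frameworks.

```python
from typing import List

Tensor3 = List[List[List[float]]]  # indexed as [channel][row][column]


def pixel_shuffle(feat: Tensor3, r: int) -> Tensor3:
    """Rearrange (C*r*r, H, W) into (C, H*r, W*r) without any interpolation."""
    c_in, h, w = len(feat), len(feat[0]), len(feat[0][0])
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r * r")
    c_out = c_in // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):          # vertical sub-pixel offset
            for j in range(r):      # horizontal sub-pixel offset
                src = feat[c * r * r + i * r + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out


# Four channels of size 1 x 2 become one channel of size 2 x 4.
feat = [[[0.0, 1.0]], [[2.0, 3.0]], [[4.0, 5.0]], [[6.0, 7.0]]]
upsampled = pixel_shuffle(feat, 2)
print(upsampled)  # [[[0.0, 2.0, 1.0, 3.0], [4.0, 6.0, 5.0, 7.0]]]
```

Because every output pixel is copied from a distinct input value, the operation increases spatial resolution without the smoothing that bilinear upsampling introduces, which is the property that matters for small, spatially discrete fire spots.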

Systematic comparative experiments and ablation studies are conducted to validate the effectiveness of SAMS-Net. The model achieves an mIoU of 76.16% on the test set while maintaining a real-time inference speed of 75.8 FPS, reducing prediction errors relative to conventional one-size-fits-all approaches under the current benchmark setting. These results offer useful methodological insights for stage-aware wildfire image analysis, and future work involving broader validation across independent datasets and practical deployment conditions will be important for assessing the practical applicability of the proposed framework.

5. Future Work

Although the proposed method compares favorably with currently available approaches, room for improvement remains in terms of adaptability under extreme environmental conditions. The model currently relies primarily on two-dimensional RGB visual features, and limitations persist in feature extraction under nighttime conditions or when visible light is completely obscured by dense smoke. Several directions are identified for future work: (1) incorporating thermal imaging or multispectral data to improve all-weather detection accuracy under nighttime and extreme weather conditions; (2) investigating knowledge distillation and model compression techniques to reduce parameter storage requirements while preserving multi-decoder accuracy, enabling deployment on more compact edge devices; (3) integrating SAMS-Net with UAV flight control systems to support closed-loop monitoring, tracking, and latency-aware decision assistance; (4) extending the evaluation to independent public benchmarks such as UAVs-FFDB and FLAME, with explicit source-wise generalization analysis across different acquisition conditions, camera characteristics, and geographic environments, building on the preliminary zero-shot observations in Section 3.5; (5) refining the stage-labeling protocol through an independent manual annotation procedure that does not rely on HSV-based quantitative criteria, together with a more comprehensive sensitivity analysis of transition samples to better distinguish operational rule learning from physically meaningful stage recognition; (6) integrating more recent instance and semantic segmentation baseline models into the evaluation framework to provide a more comprehensive comparative analysis.

Author Contributions

Conceptualization, Y.T. and F.Y.; Methodology, Y.T. and F.Y.; Software, Y.T., J.A. and Y.W.; Validation, Y.T., Z.L. and J.G.; Formal Analysis, Y.T. and J.A.; Investigation, Y.T., Y.W. and Z.L.; Resources, F.Y.; Data Curation, Y.T., J.G. and J.A.; Writing—Original Draft Preparation, Y.T.; Writing—Review and Editing, Y.T., J.A., Y.W., Z.L., J.G. and F.Y.; Visualization, Y.T. and Z.L.; Supervision, F.Y.; Project Administration, F.Y.; Funding Acquisition, F.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets presented in this article are not readily available because the data are part of an ongoing study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Koukouli, M.E.; Pseftogkas, A.; Karagkiozidis, D.; Mermigkas, M.; Panou, T.; Balis, D.; Bais, A. Extreme wildfires over Northern Greece during Summer 2023–Part B. Adverse effects on regional air quality. Atmos. Res. 2025, 320, 108034.
Dare, M.; Jetten, J.; Selvanathan, H.P.; Crimston, C.R. Sense of Community and Adaptive Capacity: Insights from the 2019/2020 Australian ‘Black Summer’ Bushfires. J. Environ. Psychol. 2026, 110, 102930.
Fernandes, A.M.; Utkin, A.B.; Lavrov, A.V.; Vilar, R.M. Optimisation of location and number of lidar apparatuses for early forest fire detection in hilly terrain. Fire Saf. J. 2006, 41, 144–154.
Wang, S.D.; Miao, L.L.; Peng, G.X. An improved algorithm for forest fire detection using HJ data. Procedia Environ. Sci. 2012, 13, 140–150.
Varela, N.; Ospino, A.; Zelaya, N.A.L. Wireless sensor network for forest fire detection. Procedia Comput. Sci. 2020, 175, 435–440.
Dwivedi, R.K. Density-based machine learning scheme for outlier detection in smart forest fire monitoring sensor cloud. Int. J. Cloud Appl. Comput. 2022, 12, 1–16.
Prasanna, K.R.; Mathana, J.M.; Ramya, T.A.; Nirmala, R. LoRa network based high performance forest fire detection system. Mater. Today Proc. 2023, 80, 1951–1955.
Zhang, Q.; Lin, G.; Zhang, Y.; Xu, G.; Wang, J.J. Wildland forest fire smoke detection based on faster R-CNN using synthetic smoke images. Procedia Eng. 2018, 211, 441–446.
Zhan, J.; Hu, Y.; Zhou, G.; Wang, Y.; Cai, W.; Li, L. A high-precision forest fire smoke detection approach based on ARGNet. Comput. Electron. Agric. 2022, 196, 106874.
Hu, Y.; Zhan, J.; Zhou, G.; Chen, A.; Cai, W.; Guo, K.; Hu, Y.; Li, L. Fast forest fire smoke detection using MVMNet. Knowl.-Based Syst. 2022, 241, 108219.
Yuan, J.; Wang, H.; Yang, T.; Su, Y.; Song, W.; Li, S.; Gong, W. FF-net: A target detection method tailored for mid-to-late stages of forest fires in complex environments. Case Stud. Therm. Eng. 2025, 65, 105515.
Zhu, H.; Ling, W.; Yan, H.; Zhong, X.; Liao, F. YOLO-MP: A lightweight forest fire detection model. Ecol. Inform. 2025, 92, 103516.
Yu, Q.; Zhang, G.; Wang, Y.; Wu, X.; Xiao, J.; Kuang, W.; Zhang, J. CCLNet: An End-to-End Lightweight Network for Small-Target Forest Fire Detection in UAV Imagery. Comput. Mater. Contin. 2026, 86, 58.
Han, R.; Li, J.; Liu, Y.; Liu, H. Forest fire object detection based on multi-task model and extreme weather simulation algorithm. Eng. Appl. Artif. Intell. 2026, 163, 113099.
Yan, C.; Wang, J. MAG-FSNet: A high-precision robust forest fire smoke detection model integrating local features and global information. Measurement 2025, 247, 116813.
Zhao, Y.; Li, Q.; Gu, Z. Early smoke detection of forest fire video using CS Adaboost algorithm. Optik 2015, 126, 2121–2124.
Mambile, C.; Leo, J.; Kaijage, S. Comparative analysis of CNN architectures for satellite-based forest fire detection: A mobile-friendly approach using Sentinel-2 imagery. Remote Sens. Appl. Soc. Environ. 2025, 40, 101739.
Mahaveerakannan, R.; Anitha, C.; Thomas, A.K.; Rajan, S.; Muthukumar, T.; Rajulu, G.G. An IoT based forest fire detection system using integration of cat swarm with LSTM model. Comput. Commun. 2023, 211, 37–45.
Mowla, M.N.; Asadi, D.; Tekeoglu, K.N.; Masum, S.; Rabie, K. UAVs-FFDB: A high-resolution dataset for advancing forest fire detection and monitoring using unmanned aerial vehicles (UAVs). Data Brief 2024, 55, 110706.
Giannakidou, S.; Radoglou-Grammatikis, P.; Lagkas, T.; Argyriou, V.; Goudos, S.; Markakis, E.K.; Sarigiannidis, P. Leveraging the power of internet of things and artificial intelligence in forest fire prevention, detection, and restoration: A comprehensive survey. Internet Things 2024, 26, 101171.
Sun, Y.; Pan, J.; Jiang, L.; Tian, Y.; Zhang, J.; Liu, K. A physics-based remote sensing framework for forest fire smoke detection toward early fire warning. Int. J. Appl. Earth Obs. Geoinf. 2026, 146, 105124.
Jin, P.; Cheng, P.; Liu, X.; Huang, Y. From smoke to fire: A forest fire early warning and risk assessment model fusing multimodal data. Eng. Appl. Artif. Intell. 2025, 152, 110848.
Exaudi, K.; Stiawan, D.; Suprapto, B.Y.; Fakhrurroja, H.; Idris, M.Y.; Alghamdi, T.A.; Budiarto, R. An Improved Forest Fire Detection Model Using Audio Classification and Machine Learning. Comput. Mater. Contin. 2026, 86, 1.
Krüll, W.; Tobera, R.; Willms, I.; Essen, H.; von Wahl, N. Early forest fire detection and verification using optical smoke, gas and microwave sensors. Procedia Eng. 2012, 45, 584–594.
Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883.
Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 13713–13722.
Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 18–24 September 2018; pp. 801–818.
Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 18–24 September 2018; pp. 3–19.
Misra, D. Mish: A self regularized non-monotonic activation function. arXiv 2019, arXiv:1908.08681.
Johnson, E.A.; Miyanishi, K. Forest Fires: Behavior and Ecological Effects; Academic Press: Cambridge, MA, USA, 2001.
Li, H.; Sun, P. Image-based fire detection using dynamic threshold grayscale segmentation and residual network transfer learning. Mathematics 2023, 11, 3940.
Zou, R.; Xin, Z.; Liao, G.; Huang, P.; Wang, R.; Qiao, Y. A fire segmentation method with flame detail enhancement U-net in multispectral remote sensing images under category imbalance. Remote Sens. 2025, 17, 2175.

Figure 1. SAMS-Net Overall Architecture Diagram.

Figure 2. ResNet-50 Feature Map Extraction and Properties.

Figure 3. Architecture of the stage classification module.

Figure 4. Stage-Specific Decoder Architecture Diagram.

Figure 5. Representative images from the dataset.

Figure 6. Representative sample from the middle stage: (a) original fire image; (b) corresponding binary mask, where white indicates the flame-and-ember foreground and black indicates the background.

Figure 7. Segmentation results of all models under different background conditions. White indicates the foreground region, and black indicates the background.

Figure 8. Sensitivity analysis of the multi-task loss weight on model performance. The asterisk indicates the selected optimal setting.

Figure 9. IoU performance comparison of different segmentation loss functions across fire evolution stages.

Figure 10. Zero-shot transfer results of SAMS-Net on twelve randomly sampled images from the FLAME dataset. White denotes the segmented foreground, and black denotes the background.

Table 1. Quantitative comparison of all models on the full test set and across different fire evolution stages.

Method	mIoU (%)	Dice (%)	Early IoU (%)	Middle IoU (%)	Recession IoU (%)	PA (%)
SAMS-Net	76.16 ± 0.31	81.30	68.21 ± 0.45	82.31 ± 0.28	65.37 ± 0.52	90.31
FCN	54.37 ± 0.82	67.82	43.52 ± 0.91	64.18 ± 0.74	32.64 ± 1.05	79.56
U-Net	61.83 ± 0.63	74.26	52.47 ± 0.72	71.93 ± 0.58	41.38 ± 0.84	84.12
U-Net++	65.29 ± 0.54	77.54	56.83 ± 0.65	75.61 ± 0.49	46.74 ± 0.73	86.37
DeepLabV3	63.74 ± 0.58	76.08	54.16 ± 0.68	74.29 ± 0.52	43.85 ± 0.79	85.43
PSPNet	55.21 ± 0.79	69.83	48.38 ± 0.87	70.42 ± 0.61	45.17 ± 0.83	78.21
SegFormer	66.81 ± 0.47	78.32	60.12 ± 0.58	73.21 ± 0.43	44.32 ± 0.71	84.23
YOLOv9-Seg	71.63 ± 0.38	75.12	64.72 ± 0.51	79.21 ± 0.34	55.01 ± 0.62	88.25
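
The stage-wise columns of Table 1 can be read as margins over the strongest competing model. The short sketch below computes, for each stage, the difference in percentage points between SAMS-Net and the best of the other rows. The numbers are copied from the table; the dictionary layout and the function name `margin_over_best` are illustrative choices, not part of the paper.

```python
# Stage-wise IoU (%) from Table 1: (early, middle, recession).
STAGE_IOU = {
    "SAMS-Net": (68.21, 82.31, 65.37),
    "FCN": (43.52, 64.18, 32.64),
    "U-Net": (52.47, 71.93, 41.38),
    "U-Net++": (56.83, 75.61, 46.74),
    "DeepLabV3": (54.16, 74.29, 43.85),
    "PSPNet": (48.38, 70.42, 45.17),
    "SegFormer": (60.12, 73.21, 44.32),
    "YOLOv9-Seg": (64.72, 79.21, 55.01),
}
STAGE_NAMES = ("early", "middle", "recession")


def margin_over_best(model: str, stage_index: int):
    """Return (best competitor, margin in percentage points) for one stage."""
    competitors = {k: v[stage_index] for k, v in STAGE_IOU.items() if k != model}
    best = max(competitors, key=competitors.get)
    return best, STAGE_IOU[model][stage_index] - competitors[best]


for idx, name in enumerate(STAGE_NAMES):
    rival, margin = margin_over_best("SAMS-Net", idx)
    print(f"{name}: +{margin:.2f} points over {rival}")
```

With the values above, the strongest competitor is YOLOv9-Seg in all three stages, and the margins are 3.49, 3.10, and 10.36 percentage points for the early, middle, and recession stages, respectively.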

Table 2. Quantitative comparison of segmentation accuracy and inference efficiency under different routing strategies. “—” indicates not applicable.

Method	Acc (%)	mIoU (%)	FLOPs (G)	FPS (frames/s)
Baseline	—	65.29	55.48	86.3
Soft-Routing	—	74.85	166.44	28.5
Hard-Routing	94.5	76.16	64.12	75.8
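
The efficiency trade-off in Table 2 is easier to see as ratios. The sketch below divides the FLOPs and FPS of one routing strategy by those of another, using only the values in the table; the helper name `relative_to` and the dictionary layout are hypothetical. The observation that the Soft-Routing FLOPs equal three times the Baseline FLOPs is read off the reported numbers and is not a statement taken from the text.

```python
# FLOPs (G) and FPS copied from Table 2.
TABLE2 = {
    "Baseline": {"flops_g": 55.48, "fps": 86.3},
    "Soft-Routing": {"flops_g": 166.44, "fps": 28.5},
    "Hard-Routing": {"flops_g": 64.12, "fps": 75.8},
}


def relative_to(reference: str, name: str) -> dict:
    """FLOPs and FPS of `name` expressed as multiples of `reference`."""
    ref, row = TABLE2[reference], TABLE2[name]
    return {
        "flops_ratio": row["flops_g"] / ref["flops_g"],
        "fps_ratio": row["fps"] / ref["fps"],
    }


soft_vs_base = relative_to("Baseline", "Soft-Routing")
hard_vs_base = relative_to("Baseline", "Hard-Routing")
hard_vs_soft = relative_to("Soft-Routing", "Hard-Routing")

print(f"Soft vs Baseline FLOPs: {soft_vs_base['flops_ratio']:.2f}x")
print(f"Hard vs Baseline FLOPs: {hard_vs_base['flops_ratio']:.3f}x")
print(f"Hard vs Soft FPS:       {hard_vs_soft['fps_ratio']:.2f}x")
```

Under these numbers, Hard-Routing costs roughly 16% more FLOPs than the Baseline, while Soft-Routing costs three times as much; Hard-Routing also runs at about 2.66 times the frame rate of Soft-Routing.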

Table 3. Ablation results of stage-specific decoder modules.

Stage	Configuration	Key Modules	IoU (%)
Early	Baseline	ReLU, Bilinear Upsampling	58.45
Early	+PixelShuffle	Sub-pixel Conv	63.12
Early	SAMS-Net (Full)	+PixelShuffle & CA & PReLU	68.21
Middle	Baseline	Standard 3 × 3 Conv	75.82
Middle	SAMS-Net (Full)	+ASPP	82.31
Recession	Baseline	ReLU, No Attention	52.15
Recession	+CBAM	Channel (MaxPool) & Spatial Attn	60.83
Recession	SAMS-Net (Full)	+CBAM & Mish	65.37

Table 4. Classification accuracy and segmentation IoU for boundary-ambiguous and non-boundary samples on the internal test set.

Sample Type	N (test)	Classification Acc	Early IoU	Middle IoU	Recession IoU
Non-boundary	185	95.4%	69.5%	83.4%	66.9%
Boundary-ambiguous	29	88.4%	60.0%	75.6%	55.7%
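
As a worked check on how the two sample groups in Table 4 combine, the sketch below computes the sample-weighted classification accuracy over both groups. It assumes that the non-boundary and boundary-ambiguous samples together make up the whole internal test set, which the table suggests but does not state explicitly; the variable names are illustrative.

```python
# (sample type, number of test images, classification accuracy in %) from Table 4.
GROUPS = [
    ("Non-boundary", 185, 95.4),
    ("Boundary-ambiguous", 29, 88.4),
]

total = sum(n for _, n, _ in GROUPS)
pooled_acc = sum(n * acc for _, n, acc in GROUPS) / total
boundary_share = 100.0 * GROUPS[1][1] / total

print(f"test images: {total}")
print(f"pooled classification accuracy: {pooled_acc:.2f}%")
print(f"boundary-ambiguous share: {boundary_share:.1f}%")
```

The pooled value is about 94.45%, which rounds to the 94.5% classification accuracy listed for Hard-Routing in Table 2, so the two tables appear mutually consistent under the stated assumption.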

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
