Article

Forest Fire Detection Method Based on Dual-Branch Multi-Scale Adaptive Feature Fusion Network

by Qinggan Wu 1, Chen Wei 2, Ning Sun 1,*, Xiong Xiong 2, Qingfeng Xia 1, Jianmeng Zhou 2 and Xingyu Feng 2

1 School of Automation, Wuxi University, Wuxi 214105, China
2 Jiangsu Collaborative Innovation Center of Atmospheric Environment and Equipment Technology, Nanjing University of Information Science and Technology, Nanjing 210044, China
* Author to whom correspondence should be addressed.
Forests 2025, 16(8), 1248; https://doi.org/10.3390/f16081248
Submission received: 24 June 2025 / Revised: 22 July 2025 / Accepted: 28 July 2025 / Published: 31 July 2025
(This article belongs to the Section Natural Hazards and Risk Management)

Abstract

There are significant scale and morphological differences between fire and smoke features in forest fire detection. This paper proposes a detection method based on a dual-branch multi-scale adaptive feature fusion network (DMAFNet). In this method, a convolutional neural network (CNN) and a Transformer form a dual-branch backbone that extracts local texture and global context information, respectively. To overcome the differences in feature distribution and response scale between the two branches, a feature correction module (FCM) is designed; through spatial and channel correction mechanisms, the two branch features are adaptively aligned. A Fusion Feature Module (FFM) is further introduced to fully integrate the dual-branch features based on a bidirectional cross-attention mechanism and to effectively suppress redundant information. Finally, a Multi-Scale Fusion Attention Unit (MSFAU) is designed to enhance multi-scale detection of fire targets. Experimental results show that the proposed DMAFNet achieves a significant improvement in mAP (mean average precision) over existing mainstream detection methods.

1. Introduction

With the intensification of global climate change and increasingly frequent droughts, forest fires pose a serious threat to the ecological environment and human safety. As the “lungs of the Earth”, forests play an irreplaceable role in maintaining ecological balance and promoting sustainable development by absorbing carbon dioxide and releasing oxygen [1], regulating the water cycle, reducing soil loss, and maintaining biodiversity [2]. It is therefore imperative to build an efficient forest fire monitoring system, in which video-based real-time monitoring is the core means of obtaining fire information. Accurately extracting flame and smoke characteristics from massive volumes of imagery while balancing detection accuracy and real-time performance remains the technical bottleneck for rapid early warning and emergency response.
Early traditional computer vision fire detection methods, which do not need complex training, mainly rely on the color, texture, shape, and motion characteristics of flame and smoke, and have a certain practicality in resource-constrained scenes. Matougui et al. [3] and Ding et al. [4] comprehensively analyzed the color dynamics and contour texture of flames, enhancing detection sensitivity under fixed cameras, but such systems remain limited by camera layout and maintenance costs. Titu et al. [5] recognized fires in real time on common monitoring images, effectively suppressing false alarms caused by moving objects, but real-time performance is difficult to guarantee when processing large-scale video streams. Subsequently, Li et al. proposed a detection framework based on flame flicker characteristics [6] and an autonomous detection algorithm using a Dirichlet process Gaussian mixture model [7], both of which achieved more than 95% accuracy on a variety of videos but still require extensive parameter tuning. Liao et al. [8] proposed the Star Eye model to meet the challenges of complex environments; Vipin [9] proposed an image processing-based forest fire detection method that classifies fire pixels using the RGB and YCbCr color spaces; Khondaker et al. [10] proposed a multi-level detection framework with optical flow estimation.
In general, traditional vision-based fire detection methods do not need large amounts of training data and have low computing resource requirements, so they are suitable for real-time applications and hardware-constrained scenarios. However, such algorithms are vulnerable to environmental changes, require repeated parameter tuning for different monitoring conditions, and suffer from high false positive and false negative rates. These limitations have driven the rapid development of advanced techniques such as deep learning in the field of fire monitoring.
In recent years, researchers have proposed several improvements to the two classic detection frameworks, SSD and Faster R-CNN, to meet the special needs of fire scenarios. The Fire SSD designed by Liau et al. [11] significantly improves the reasoning speed by optimizing feature extraction and context fusion, but still has precision loss in complex backgrounds and severe lighting changes. Zhang et al. [12] customized the Faster R-CNN network structure for the mine rescue scenario, enhancing the ability to locate the fire source target. Chaoxia et al. [13] introduced the anchor box optimization strategy based on flame color, which effectively suppressed false positives by limiting the position of candidate boxes, although it increased the complexity of the algorithm.
At the same time, YOLO-series single-stage detectors are also evolving for fire monitoring. Park et al. and Kumar et al. [14,15] improved YOLOv4 and its lightweight version for on-site fire monitoring, effectively improving the detection rate of small and occluded targets; Mukhiddinov et al. [16] built a fire alarm system with YOLOv4 at its core, which enhanced robustness to changes in smoke and flame morphology, although accuracy still fluctuates in extreme weather. In the forest fire domain, Xu et al. [17] fused YOLOv5 with EfficientDet and proposed a hybrid detection architecture that balances speed and accuracy; experiments show that its overall performance is better than that of a single model, although global context capture remains challenging. Zhang [18] realized joint multi-target detection of flame, smoke, and personnel with the T-YOLOX model based on YOLOX, improving average accuracy by about 2.24%, but the stability of smoke recognition in high-concentration scenes still needs to be optimized. Talaat et al. [19] refined YOLOv8, achieving a higher fire detection success rate and a lower false alarm rate in smart city monitoring; however, in the face of complex fire dynamics, its versatility and generalization ability need to be further strengthened.
By analyzing the development of this field, we find that the method based on deep learning has advantages in performance, but still faces the following two major problems:
1.
Deep learning-based fire detection models are often limited by local receptive fields when facing complex fire scene dynamics. Such models extract features through fixed-size convolution windows and rely only on information from adjacent pixels, making it difficult to establish effective associations between distant pixels or capture global semantic features.
2.
Feature fusion models based on dual-branch networks usually suffer from feature distribution differences when handling multimodal information. In a dual-branch backbone, the local texture features and global context features extracted by the two branches often differ significantly in feature space distribution and response scale, and direct fusion easily leads to information redundancy and feature conflict.
Therefore, this paper introduces a mechanism that can comprehensively model the global features of images, so as to identify key areas over a larger range and mine long-distance dependencies between pixels. Recent research shows that the Transformer is superior to pure convolutional structures in global information extraction: its multi-head attention mechanism can not only dynamically evaluate the correlation strength between any two points, but also highlight important pixels through different attention weights, so that the detection task can focus on key areas while capturing global semantics [20]. In addition, because each pixel participates in the attention calculation to a different degree, the model can adaptively allocate attention resources, further improving its ability to capture important features [21].
To solve the problem of insufficient feature extraction and fusion in existing forest fire detection algorithms, this paper proposes a forest fire detection method based on dual-branch multi-scale adaptive feature fusion network (DMAFNet). The main innovations are as follows:
1.
In the feature extraction phase, a dual-branch backbone network composed of convolutional neural network (CNN) and Transformer is designed. CNN is responsible for capturing the local texture features of flame and smoke, while Transformer focuses on extracting the global context information and realizes the adaptive extraction of multi-level features;
2.
In order to effectively integrate the feature information of dual branches, a feature correction module (FCM) is proposed. Through the two-stage correction mechanism of space and channel, the feature information of CNN and Transformer branches can guide and learn from each other, thus improving the consistency and complementarity of feature representation;
3.
The Fusion Feature Module (FFM) is further designed. Based on the two-way cross-attention mechanism, the interactive expression of fire feature information is enhanced, information redundancy is avoided and the feature expression ability is improved;
4.
A Multi-Scale Fusion Attention Unit (MSFAU) is proposed. Aiming at the significant differences between flame and smoke at different scales, multi-scale feature fusion and a self-attention mechanism are used to achieve accurate multi-scale detection of fire targets, significantly improving the detection of large-, medium-, and small-scale fire targets.

2. Dual-Branch Feature Aggregation Network

2.1. Overall Network Structure

The overall network structure is shown in Figure 1, where FCM is the feature correction module and FFM is the feature fusion module.

2.2. Backbone Feature Extraction Network

This paper proposes a dual-branch multi-scale adaptive feature fusion network, which aims to fully combine the advantages of the convolutional neural network (CNN) and the Transformer and to account for both local and global feature extraction. Specifically, the CNN branch uses convolution operations to capture local texture information and is good at extracting short-range details, but it struggles to capture long-distance dependencies. In contrast, the Transformer branch, based on the self-attention mechanism, can effectively model global features and long-distance dependencies in images, but it often lacks precision when dealing with local details. To balance global and local feature extraction, this paper designs a dual-branch feature extraction framework.
In this design, the backbone network draws on the four-stage feature extraction scheme of YOLOX [22]: on the CNN branch, a simplified CSPDarknet consisting of four stages is constructed (as shown in Figure 2, ×2 indicates that the module is repeated twice), and its internal module structure is shown in Figure 3. Compared with the original CSPDarknet, we reduce the number of block repetitions in the second and third stages, maintaining feature extraction quality while significantly reducing the number of parameters.
In the Transformer branch, the convolutional vision transformer (CvT) [23] is used to build the feature extraction module. CvT realizes feature projection by introducing depthwise separable convolution, which effectively improves the computational efficiency of feature extraction and enables the model to capture global information more accurately. This design allows the model to dynamically focus on important areas of the image and allocate more attention according to global semantic information. With the help of the multi-head attention mechanism, CvT further enhances the modeling of correlations between pixels and can efficiently extract high-level semantic information from images.
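To make the dual-branch design concrete, the following is a minimal PyTorch sketch of one backbone stage: a convolutional path that captures local texture and a simplified CvT-style path whose tokens are produced by a depthwise separable convolution before multi-head self-attention. The channel widths, the stand-in CSP-style block, and the token-mixer details are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ConvTokenMixer(nn.Module):
    """Simplified CvT-style block: depthwise separable projection followed by multi-head self-attention."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),  # depthwise
            nn.Conv2d(dim, dim, kernel_size=1),                         # pointwise
        )
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                                  # x: (B, C, H, W)
        b, c, h, w = x.shape
        tokens = self.proj(x).flatten(2).transpose(1, 2)   # (B, H*W, C) token sequence
        tokens = self.norm(tokens)
        out, _ = self.attn(tokens, tokens, tokens)         # global self-attention over all positions
        return out.transpose(1, 2).reshape(b, c, h, w)

class DualBranchStage(nn.Module):
    """One stage of the dual-branch backbone: CNN path for local texture, transformer path for global context."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.cnn = nn.Sequential(                          # local branch (stand-in for a CSP-style block)
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch), nn.SiLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.SiLU(),
        )
        self.down = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)  # match resolution/channels for the CvT path
        self.vit = ConvTokenMixer(out_ch)

    def forward(self, x_cnn, x_trans):
        return self.cnn(x_cnn), self.vit(self.down(x_trans))

# Usage: feed the same image into both branches and collect a pair of stage features.
stage = DualBranchStage(3, 64)
img = torch.randn(1, 3, 64, 64)
f_cnn, f_trans = stage(img, img)        # each (1, 64, 32, 32)
```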

2.3. Feature Correction Module

In the dual-branch backbone network, the CNN branch and the Transformer branch are good at capturing local texture and global context, respectively, but they often deviate in feature distribution and response scale, which hinders information complementation and can even cause conflicts. To solve this problem, we design a feature correction module consisting of two stages: spatial correction and channel correction. The spatial correction stage relies on local spatial correlation and applies a dynamically weighted offset to the CNN and Transformer features to compensate for their differences in local response. The channel correction stage, based on global semantic relevance, weights and fuses the spatially corrected features again along the channel dimension, so as to align the overall distributions of the two branches. Figure 4 shows the overall architecture of the module. The two submodules are detailed below.
(a) Spatial feature correction: As shown on the left side of Figure 4, we first concatenate the feature map $F_{\mathrm{in}}^{\mathrm{CNN}} \in \mathbb{R}^{H \times W \times C}$ from the CNN branch (where $H$ and $W$ are the height and width of the feature map and $C$ is the number of channels) with the feature map $F_{\mathrm{in}}^{\mathrm{Trans}} \in \mathbb{R}^{H \times W \times C}$ from the Transformer branch along the channel dimension to obtain $X = \mathrm{Concat}(F_{\mathrm{in}}^{\mathrm{CNN}}, F_{\mathrm{in}}^{\mathrm{Trans}}) \in \mathbb{R}^{H \times W \times 2C}$. $X$ is projected to the hidden dimension $C_{\mathrm{mid}}$ (through $\mathrm{Linear}(2C, C_{\mathrm{mid}})$), activated by ReLU, and then mapped to two channels (through $\mathrm{Linear}(C_{\mathrm{mid}}, 2)$) to generate the unnormalized weight precursor:
$F_S = \mathrm{MLP}_{(2C,\,2)}(X) \in \mathbb{R}^{H \times W \times 2}$
Then, the sigmoid function is applied to $F_S$ element-wise and the result is split into two spatial weight maps:
$W_{\mathrm{CNN}}^{S},\, W_{\mathrm{Trans}}^{S} = \mathrm{Split}\big(\sigma(F_S)\big), \quad W_{\mathrm{CNN}}^{S},\, W_{\mathrm{Trans}}^{S} \in \mathbb{R}^{H \times W}$
Finally, a learnable scalar $\alpha \in [0, 1]$ is introduced to balance the two correction strengths, and the features are offset in a weighted form:
$F_{\mathrm{corr}}^{(S,\mathrm{CNN})} = F_{\mathrm{in}}^{\mathrm{CNN}} + \alpha \big(W_{\mathrm{Trans}}^{S} \odot F_{\mathrm{in}}^{\mathrm{Trans}}\big), \quad F_{\mathrm{corr}}^{(S,\mathrm{Trans})} = F_{\mathrm{in}}^{\mathrm{Trans}} + \alpha \big(W_{\mathrm{CNN}}^{S} \odot F_{\mathrm{in}}^{\mathrm{CNN}}\big)$
where $\odot$ denotes the element-wise product. Through this mechanism, the module can adaptively use local spatial associations to compensate for the differences between the CNN and Transformer branches in texture and context response, and $\alpha$ is optimized automatically during training to improve the robustness and expressiveness of the fused features.
(b) Channel feature correction: As shown on the right side of Figure 4, after spatial correction, we concatenate the feature maps $F_{\mathrm{corr}}^{(S,\mathrm{CNN})}, F_{\mathrm{corr}}^{(S,\mathrm{Trans})} \in \mathbb{R}^{H \times W \times C}$ from the CNN and Transformer branches into a tensor along the channel direction:
$X_C = \mathrm{Concat}\big(F_{\mathrm{corr}}^{(S,\mathrm{CNN})}, F_{\mathrm{corr}}^{(S,\mathrm{Trans})}\big) \in \mathbb{R}^{H \times W \times 2C}$
The purpose is to unify the local response information of the two features into the same representation space for subsequent global statistics. Then, three pooling operations are applied to $X_C$: global average pooling (GAP), global max pooling (GMP), and global standard-deviation pooling (GStdP), to obtain channel-level global information from different perspectives:
$p_{\mathrm{avg}} = \mathrm{GAP}(X_C), \quad p_{\mathrm{max}} = \mathrm{GMP}(X_C), \quad p_{\mathrm{std}} = \mathrm{GStdP}(X_C)$
Here $p_{\mathrm{avg}}, p_{\mathrm{max}}, p_{\mathrm{std}} \in \mathbb{R}^{2C}$ reflect, respectively, the overall mean, the most significant response, and the dispersion of the channel distribution. Concatenating the three along the channel dimension gives the global description vector:
$y_C = \mathrm{Concat}\big(p_{\mathrm{avg}}, p_{\mathrm{max}}, p_{\mathrm{std}}\big) \in \mathbb{R}^{6C}$
This design effectively combines the average trend and extreme-value information through multi-strategy pooling and captures the range of channel responses with a dispersion metric, which enhances the model's adaptability to features of different types and scales. Next, a two-layer perceptron maps $y_C$ nonlinearly:
$f_C = W_2\, \mathrm{ReLU}(W_1 y_C) \in \mathbb{R}^{2C}, \quad W_1 \in \mathbb{R}^{C_{\mathrm{mid}} \times 6C}, \; W_2 \in \mathbb{R}^{2C \times C_{\mathrm{mid}}}$
where $C_{\mathrm{mid}}$ controls the network capacity and helps avoid overfitting. After sigmoid activation, $f_C$ is split into two channel weight vectors:
$w_{\mathrm{CNN}}^{C},\, w_{\mathrm{Trans}}^{C} = \mathrm{Split}\big(\sigma(f_C)\big), \quad w_{\mathrm{CNN}}^{C},\, w_{\mathrm{Trans}}^{C} \in \mathbb{R}^{C}$
These weights reflect the degree to which each feature needs to be enhanced or suppressed during fusion, and normalization ensures smooth adjustment within the $(0, 1)$ interval. Finally, a learnable parameter $\beta \in [0, 1]$ is introduced (usually $\beta = 1 - \alpha$, forming a complementary balance with the $\alpha$ used in spatial correction), and the two features are compensated along the channel dimension:
$F_{\mathrm{corr}}^{(C,\mathrm{CNN})} = F_{\mathrm{corr}}^{(S,\mathrm{CNN})} + \beta \big(w_{\mathrm{Trans}}^{C} \odot F_{\mathrm{corr}}^{(S,\mathrm{Trans})}\big)$
$F_{\mathrm{corr}}^{(C,\mathrm{Trans})} = F_{\mathrm{corr}}^{(S,\mathrm{Trans})} + \beta \big(w_{\mathrm{CNN}}^{C} \odot F_{\mathrm{corr}}^{(S,\mathrm{CNN})}\big)$
where $\odot$ denotes channel-wise element-wise multiplication, broadcast automatically over the spatial dimensions. This process uses the global semantics of the opposite branch to compensate for each branch's own channel distribution differences, achieving globally consistent cross-branch alignment while preserving spatial details.
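The two correction stages can be summarized in the PyTorch sketch below, which follows the equations above: spatial weights come from an MLP over the concatenated channels, and channel weights from the avg/max/std pooling statistics. The hidden width, the clamped learnable α, and the choice β = 1 − α are illustrative settings consistent with the text rather than the authors' exact code.

```python
import torch
import torch.nn as nn

class FeatureCorrectionModule(nn.Module):
    """Sketch of the FCM: spatial correction followed by channel correction."""
    def __init__(self, c, c_mid=64):
        super().__init__()
        # Spatial correction: MLP over 2C concatenated channels -> 2 spatial weight maps
        self.spatial_mlp = nn.Sequential(nn.Linear(2 * c, c_mid), nn.ReLU(), nn.Linear(c_mid, 2))
        # Channel correction: MLP over avg/max/std statistics of the 2C channels (6C input)
        self.channel_mlp = nn.Sequential(nn.Linear(6 * c, c_mid), nn.ReLU(), nn.Linear(c_mid, 2 * c))
        self.alpha = nn.Parameter(torch.tensor(0.5))       # learnable balance for spatial correction
        self.c = c

    def forward(self, f_cnn, f_trans):                     # both (B, C, H, W)
        b, c, h, w = f_cnn.shape
        # ----- spatial correction -----
        x = torch.cat([f_cnn, f_trans], dim=1).permute(0, 2, 3, 1)   # (B, H, W, 2C)
        w_s = torch.sigmoid(self.spatial_mlp(x))                     # (B, H, W, 2) spatial weight maps
        w_cnn_s = w_s[..., 0].unsqueeze(1)                           # (B, 1, H, W)
        w_trans_s = w_s[..., 1].unsqueeze(1)
        alpha = torch.clamp(self.alpha, 0.0, 1.0)
        s_cnn = f_cnn + alpha * w_trans_s * f_trans                  # compensate CNN with Transformer response
        s_trans = f_trans + alpha * w_cnn_s * f_cnn
        # ----- channel correction -----
        xc = torch.cat([s_cnn, s_trans], dim=1)                      # (B, 2C, H, W)
        y = torch.cat([xc.mean(dim=(2, 3)), xc.amax(dim=(2, 3)), xc.std(dim=(2, 3))], dim=1)  # (B, 6C)
        w_c = torch.sigmoid(self.channel_mlp(y))                     # (B, 2C) channel weights
        w_cnn_c, w_trans_c = w_c[:, :c], w_c[:, c:]
        beta = 1.0 - alpha                                           # complementary balance
        c_cnn = s_cnn + beta * w_trans_c.view(b, c, 1, 1) * s_trans
        c_trans = s_trans + beta * w_cnn_c.view(b, c, 1, 1) * s_cnn
        return c_cnn, c_trans

# Usage with a pair of 64-channel feature maps from the two branches.
fcm = FeatureCorrectionModule(c=64)
out_cnn, out_trans = fcm(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```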

2.4. Feature Fusion Module

In the dual-branch backbone structure, the CNN branch focuses on capturing local texture features, while the Transformer branch focuses on representing global context information. Although the two branches have their own advantages in feature extraction, directly fusing them often causes information redundancy and noise interference. Previous fusion methods, such as weighted averaging [24,25] and channel-level concatenation [26], are simple to implement but may introduce considerable redundant information, which limits the complementarity between features and makes it difficult to fully capture complex nonlinear cross-modal relationships. In addition, these methods are highly sensitive to noise, which further weakens the expressiveness of the features. To overcome these problems, we propose the Fusion Feature Module (FFM). Based on a multi-head cross-attention mechanism, this module can dynamically and selectively focus on key information in the global scope while suppressing invalid features, achieving efficient cross-branch information exchange. However, the computational cost of traditional cross-attention on high-resolution features is high, with complexity $O(hN^2)$, where $h$ is the number of attention heads and $N$ is the length of the input sequence. To alleviate this burden, we design an efficient feature reduction strategy that significantly reduces the computational complexity while retaining key information.
As shown in Figure 5, the FFM first flattens the corrected features $F_{\mathrm{corr}}^{(C,\mathrm{CNN})} \in \mathbb{R}^{H \times W \times C}$ and $F_{\mathrm{corr}}^{(C,\mathrm{Trans})} \in \mathbb{R}^{H \times W \times C}$ of the CNN and Transformer branches into feature sequences of size $\mathbb{R}^{N \times C}$, where $N = H \times W$. Directly applying cross-attention to the original sequences would incur a high computational burden, so a convolution-based down-sampling strategy is introduced to reduce the sequence length from $N$ to $\alpha N$ (where $\alpha \in (0, 1)$ denotes the reduction ratio), retaining key information while reducing the computational complexity:
$K_{\mathrm{CNN}} = \mathrm{Conv}\big(F_{\mathrm{corr}}^{(C,\mathrm{CNN})}\big), \quad K_{\mathrm{Trans}} = \mathrm{Conv}\big(F_{\mathrm{corr}}^{(C,\mathrm{Trans})}\big)$
The down-sampled feature sequences $K_{\mathrm{CNN}} \in \mathbb{R}^{(\alpha N) \times C}$ and $K_{\mathrm{Trans}} \in \mathbb{R}^{(\alpha N) \times C}$ are used to compute bidirectional multi-head cross-attention. Specifically, the FFM computes the interaction between the two branches on each attention head as follows:
$S_{\mathrm{CNN}}^{i} = \mathrm{Softmax}\!\left(\frac{Q_{\mathrm{CNN}}^{i} \big(K_{\mathrm{Trans}}^{i}\big)^{T}}{\sqrt{d_{\mathrm{head}}}}\right), \quad S_{\mathrm{Trans}}^{i} = \mathrm{Softmax}\!\left(\frac{Q_{\mathrm{Trans}}^{i} \big(K_{\mathrm{CNN}}^{i}\big)^{T}}{\sqrt{d_{\mathrm{head}}}}\right)$
where $Q_{\mathrm{CNN}}^{i}, Q_{\mathrm{Trans}}^{i} \in \mathbb{R}^{N \times d_{\mathrm{head}}}$ are query vectors obtained through linear mapping, and $K_{\mathrm{CNN}}^{i}, K_{\mathrm{Trans}}^{i} \in \mathbb{R}^{(\alpha N) \times d_{\mathrm{head}}}$ are key vectors after convolutional reduction. This design reduces the computational complexity from $O(hN^2)$ to $O(h\alpha N^2)$, greatly reducing the computational cost.
On the basis of bidirectional cross-attention, the FFM concatenates the outputs of all attention heads along the channel dimension to generate the refined CNN branch feature $Z_{\mathrm{CNN}} \in \mathbb{R}^{N \times C}$ and Transformer branch feature $Z_{\mathrm{Trans}} \in \mathbb{R}^{N \times C}$:
$Z_{\mathrm{CNN}} = \mathrm{Concat}_i\big(S_{\mathrm{CNN}}^{i} V_{\mathrm{Trans}}^{i}\big), \quad Z_{\mathrm{Trans}} = \mathrm{Concat}_i\big(S_{\mathrm{Trans}}^{i} V_{\mathrm{CNN}}^{i}\big)$
The refined features are then restored to the original spatial dimension $\mathbb{R}^{H \times W \times C}$ and added to the original features through residual connections:
$F_{\mathrm{enhanced}}^{(\mathrm{CNN})} = Z_{\mathrm{CNN}} + F_{\mathrm{corr}}^{(C,\mathrm{CNN})}, \quad F_{\mathrm{enhanced}}^{(\mathrm{Trans})} = Z_{\mathrm{Trans}} + F_{\mathrm{corr}}^{(C,\mathrm{Trans})}$
To further integrate the information of the two branches, the enhanced CNN and Transformer features are concatenated and compressed to $C$ channels through a bottleneck structure. Specifically, the concatenated feature $E_{\mathrm{Fuse}} \in \mathbb{R}^{H \times W \times 2C}$ is compressed to $C$ channels through a $1 \times 1$ convolution and a nonlinear activation layer:
$F_{\mathrm{Fuse}} = \mathrm{Conv}_{1 \times 1}\big(\mathrm{Concat}\big(F_{\mathrm{enhanced}}^{(\mathrm{CNN})}, F_{\mathrm{enhanced}}^{(\mathrm{Trans})}\big)\big)$
Through this design, the FFM module can reduce the computing cost while preserving the complementarity of CNN and Transformer features, avoiding feature redundancy and information loss.
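A minimal sketch of the FFM under the above formulation is given below: strided convolutions shorten the key/value sequences, bidirectional multi-head cross-attention exchanges information between branches, and a residual connection plus a 1 × 1 bottleneck produces the fused C-channel output. The head count and reduction stride are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionFeatureModule(nn.Module):
    """Sketch of the FFM: bidirectional cross-attention with down-sampled key/value sequences."""
    def __init__(self, c, heads=4, reduce_stride=2):
        super().__init__()
        # Strided convolutions shorten the key/value sequence from N to roughly N / stride^2
        self.reduce_cnn = nn.Conv2d(c, c, kernel_size=reduce_stride, stride=reduce_stride)
        self.reduce_trans = nn.Conv2d(c, c, kernel_size=reduce_stride, stride=reduce_stride)
        self.attn_cnn = nn.MultiheadAttention(c, heads, batch_first=True)    # CNN queries attend to Transformer keys
        self.attn_trans = nn.MultiheadAttention(c, heads, batch_first=True)  # Transformer queries attend to CNN keys
        self.fuse = nn.Sequential(nn.Conv2d(2 * c, c, kernel_size=1), nn.SiLU())  # 1x1 bottleneck fusion

    @staticmethod
    def _tokens(x):                                        # (B, C, H, W) -> (B, H*W, C)
        return x.flatten(2).transpose(1, 2)

    def forward(self, f_cnn, f_trans):                     # both (B, C, H, W)
        b, c, h, w = f_cnn.shape
        q_cnn, q_trans = self._tokens(f_cnn), self._tokens(f_trans)
        kv_cnn = self._tokens(self.reduce_cnn(f_cnn))      # shortened key/value sequences
        kv_trans = self._tokens(self.reduce_trans(f_trans))
        z_cnn, _ = self.attn_cnn(q_cnn, kv_trans, kv_trans)     # cross-branch attention (CNN side)
        z_trans, _ = self.attn_trans(q_trans, kv_cnn, kv_cnn)   # cross-branch attention (Transformer side)
        e_cnn = z_cnn.transpose(1, 2).reshape(b, c, h, w) + f_cnn        # residual connection
        e_trans = z_trans.transpose(1, 2).reshape(b, c, h, w) + f_trans
        return self.fuse(torch.cat([e_cnn, e_trans], dim=1))             # fused C-channel map

# Usage on the pair of corrected features produced by the FCM.
ffm = FusionFeatureModule(c=64)
fused = ffm(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))      # (1, 64, 32, 32)
```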

2.5. Multi-Scale Fusion Attention Unit

The scales of forest fire targets vary greatly, and for fire detection tasks the network's ability to model multi-scale information often determines the detection effect. Most current methods alleviate this problem through multi-scale convolutions, pooling operations, or U-shaped structures. In contrast, this paper designs a Multi-Scale Fusion Attention Unit (MSFAU) based on multi-head self-attention (MHSA), as shown in Figure 6.
To capture the size and shape variations of fire targets more comprehensively, multi-scale features are directly incorporated into the self-attention calculation. The MSFAU receives the feature outputs of the last three stages of the dual-branch backbone, denoted $X_i \in \mathbb{R}^{C_i \times H_i \times W_i}$ ($i = 1, 2, 3$). To unify the number of channels, we first apply a $1 \times 1$ convolution to each of the three features, transforming the channel dimension into the same $C$. We then map the two-dimensional feature maps into one-dimensional token sequences through a reshaping operation, expressed as $T_k^i \in \mathbb{R}^{L_i \times C}$, where $L_i = H_i \times W_i$. Next, $T_k^1$, $T_k^2$, and $T_k^3$ are concatenated along the spatial dimension and passed through $N$ consecutive MHSA layers so that tokens of different scales fully interact through their spatial associations. Thanks to the strong ability of multi-head self-attention to model long-distance dependencies, all pixel relationships within and across scales are included uniformly in the attention weights. However, repeated MHSA calculations change the distribution of the original features, which may adversely affect detection accuracy. For this reason, the MSFAU introduces an attention weighting mechanism: first, the MHSA output is split along the spatial dimension and reshaped back into two-dimensional feature maps; then attention weights are generated by a $1 \times 1$ convolution, batch normalization (BN), and sigmoid activation. In addition, to fuse multi-scale information more fully, the module adds a scale calibration operation: the three aligned features are first average-pooled and upsampled to obtain coarse fusion features at the three resolutions, which are then refined by a $3 \times 3$ convolution, BN, and ReLU to improve their representational ability. Finally, the generated attention weights are multiplied element-wise with the corresponding feature maps to output the fine-grained fusion results $Y_i \in \mathbb{R}^{C \times H_i \times W_i}$. The weighting process can be formally expressed as
$Y_i = \gamma\big(\mathrm{BN}\big(W_{3 \times 3}^{i} E_A^{i}\big)\big) * \sigma\big(\mathrm{BN}\big(W_{1 \times 1}^{i} E_M^{i}\big)\big)$
where $E_A^{i}$ is the coarse fusion feature after scale calibration, $E_M^{i}$ is the MHSA output, $W_{k \times k}^{i}$ is the $k \times k$ convolution operator, $\gamma$ and $\sigma$ denote the ReLU and sigmoid activations, respectively, the symbol $*$ denotes the element-wise product, and $i$ ranges over 1, 2, and 3. In this way, the MSFAU accounts for the multi-scale details of forest fire targets while maintaining global context, improving detection accuracy and robustness.
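The following sketch illustrates the MSFAU pipeline described above: per-scale 1 × 1 alignment, joint multi-head self-attention over the concatenated multi-scale tokens, per-scale attention weights (1 × 1 convolution, BN, sigmoid), scale-calibrated coarse fusion (3 × 3 convolution, BN, ReLU), and element-wise recalibration. The channel width, attention depth, and nearest-neighbor upsampling are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSFAU(nn.Module):
    """Sketch of the MSFAU: cross-scale self-attention followed by attention-weighted recalibration."""
    def __init__(self, in_channels=(128, 256, 512), c=128, heads=4, depth=2):
        super().__init__()
        self.align = nn.ModuleList(nn.Conv2d(ci, c, 1) for ci in in_channels)    # unify channel dimension
        self.mhsa = nn.ModuleList(nn.MultiheadAttention(c, heads, batch_first=True) for _ in range(depth))
        self.attn_head = nn.ModuleList(                                          # 1x1 conv + BN (sigmoid applied later)
            nn.Sequential(nn.Conv2d(c, c, 1), nn.BatchNorm2d(c)) for _ in range(3))
        self.calib_head = nn.ModuleList(                                         # 3x3 conv + BN + ReLU
            nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU()) for _ in range(3))

    def forward(self, feats):                              # feats: three maps (B, C_i, H_i, W_i)
        maps = [a(f) for a, f in zip(self.align, feats)]
        sizes = [m.shape[-2:] for m in maps]
        tokens = torch.cat([m.flatten(2).transpose(1, 2) for m in maps], dim=1)  # concat tokens of all scales
        for attn in self.mhsa:
            tokens = tokens + attn(tokens, tokens, tokens)[0]                    # cross-scale self-attention
        splits = torch.split(tokens, [h * w for h, w in sizes], dim=1)           # split back per scale
        e_m = [s.transpose(1, 2).reshape(m.shape) for s, m in zip(splits, maps)]
        outs = []
        for i, (h, w) in enumerate(sizes):
            fused = torch.zeros_like(maps[i])                                    # coarse fusion at this resolution
            for m in maps:
                if m.shape[-2:] == (h, w):
                    fused = fused + m
                elif m.shape[-1] > w:                                            # higher resolution: average pool down
                    fused = fused + F.adaptive_avg_pool2d(m, (h, w))
                else:                                                            # lower resolution: upsample
                    fused = fused + F.interpolate(m, size=(h, w), mode="nearest")
            e_a = self.calib_head[i](fused)                                      # 3x3 conv + BN + ReLU
            weight = torch.sigmoid(self.attn_head[i](e_m[i]))                    # 1x1 conv + BN + sigmoid
            outs.append(e_a * weight)                                            # element-wise recalibration
        return outs

# Usage on three feature maps from the last three backbone stages.
msfau = MSFAU()
feats = [torch.randn(1, 128, 32, 32), torch.randn(1, 256, 16, 16), torch.randn(1, 512, 8, 8)]
y1, y2, y3 = msfau(feats)                                  # refined multi-scale outputs, each with 128 channels
```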

3. Dataset and Evaluation Indicators

3.1. Experimental Environment and Parameters

To address the issue of insufficient sample size in existing fire detection datasets, we have constructed a new dataset named FSDD (Fire Smoke Detection Dataset). This dataset contains 12,000 images, each with a resolution of 1920 × 1080 pixels, suitable for high-resolution fire detection tasks. All images are captured by multiple high-definition cameras deployed across different monitoring platforms, including both fixed and mobile cameras. The variation in camera angles and positions allows the dataset to cover multiple perspectives, simulating the effect of changing camera angles during a fire event. Specifically, the camera angles are categorized into three main types: vertical angle (approximately 30%), suitable for monitoring from elevated positions and rooftops; horizontal angle (approximately 50%), suitable for ground-level and low-position cameras; and oblique angle (approximately 20%), used for dynamic monitoring scenarios, ideal for capturing large-scale fire events. To comprehensively assess model performance in different time and environmental conditions, the image acquisition interval is set to 1 frame per second to capture rapidly changing fire scenes. Additionally, the dataset includes images taken at different times of the day, with 70% of the images captured during the day and 30% at night, providing a basis for evaluating the model’s performance in low-light and complex environments. Furthermore, we specifically consider the impact of weather conditions on fire detection tasks. The dataset contains fire scenes under various weather conditions, including sunny weather (45%), cloudy weather (35%), and smoggy conditions (20%). These scenes simulate fire behavior under different meteorological conditions, particularly the challenges posed by reduced visibility of flames and smoke during smoggy weather.
The dataset also includes fire scenes under different occlusion conditions, with approximately 40% of the images containing occlusions. These occlusions include trees, buildings, and other objects, reflecting the reality where fires may be partially obscured by surrounding structures. This further complicates the detection task. Specific example scenes are shown in Figure 7, where images of fires at different intensities and scenes are displayed, including small fires (localized, low-intensity flames), medium-intensity fires (larger areas of fire spread), and large fires (wide-scale fires with significant flame and smoke coverage). These scenes illustrate the characteristics of fires at varying intensities and showcase the detection challenges in complex environments, such as occlusions by trees, buildings, etc. By including such scenes, the dataset effectively simulates and evaluates the robustness of fire detection models under different fire intensities, particularly in the presence of occlusions, fire spread, and complex backgrounds. In constructing the dataset, we aimed to provide researchers with a challenging fire detection task, particularly in evaluating the robustness of detection models under complex environmental interference and occlusion issues. As the dataset continues to be refined, we plan to release the FSDD dataset publicly once it has been further improved, along with detailed usage documentation. We will continuously refine the dataset based on real-world feedback to ensure its effectiveness in supporting the development and evaluation of fire detection technologies.

3.2. Model Training Details and Evaluation Indicators

This research builds an experimental environment based on PyTorch (Python 3.8) running on the Windows 10 operating system. The hardware platform is an NVIDIA GeForce RTX 3090 GPU and an Intel Core i7-11700K @ 3.60 GHz CPU. The loss function of the model is consistent with the original YOLOX, and the stochastic gradient descent (SGD) optimizer is used. The momentum coefficient is set to 0.8 and the weight decay rate to $5 \times 10^{-4}$. The initial learning rate is set to $1 \times 10^{-3}$ with a stepwise decay strategy: after every 30,000 iterations, the learning rate is reduced to one tenth of its value. A total of 200 epochs were conducted during training. To ensure the repeatability and reliability of the experimental results, each experiment was repeated 10 times and the averaged results are reported.
Due to the complexity of certain model components, such as the Multi-Scale Fusion Attention Unit (MSFAU), the model is prone to overfitting. To address this challenge, several strategies were implemented to mitigate the risk of overfitting. First, L2 regularization was applied to prevent the model parameters from growing excessively large, which helps in controlling overfitting. Additionally, dropout layers were incorporated during feature fusion to enhance the model’s ability to generalize. An early stopping strategy was also adopted, where training would stop early if the performance on the validation set stopped improving, preventing overfitting caused by prolonged training. Furthermore, data augmentation techniques such as random rotations, cropping, and color variations were applied to increase the diversity of the training data, helping the model better adapt to a wide range of fire detection scenarios. These combined measures effectively alleviated the risk of overfitting and ensured that the model maintained high accuracy and robustness in various fire detection tasks.
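The training recipe above can be expressed compactly in PyTorch as follows; the optimizer and step schedule mirror the stated settings, while the augmentation parameters, early-stopping patience, and placeholder model are illustrative assumptions.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR
from torchvision import transforms

# Optimizer and learning-rate schedule from Section 3.2:
# SGD, momentum 0.8, weight decay 5e-4, lr 1e-3, decayed by 10x every 30,000 iterations.
model = torch.nn.Conv2d(3, 16, 3)            # placeholder standing in for DMAFNet
optimizer = SGD(model.parameters(), lr=1e-3, momentum=0.8, weight_decay=5e-4)
scheduler = StepLR(optimizer, step_size=30_000, gamma=0.1)   # call scheduler.step() once per iteration

# Augmentations of the kind listed above: random rotation, cropping, and color variation.
train_transforms = transforms.Compose([
    transforms.RandomRotation(10),
    transforms.RandomResizedCrop(640, scale=(0.8, 1.0)),
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),
    transforms.ToTensor(),
])

# Early stopping on the validation metric (patience value is an assumption).
best_map, patience, bad_epochs = 0.0, 10, 0
for epoch in range(200):
    # ... training loop: forward pass, loss, backward, optimizer.step(), scheduler.step() ...
    val_map = 0.0                            # placeholder: replace with the validation mAP for this epoch
    if val_map > best_map:
        best_map, bad_epochs = val_map, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```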
In order to comprehensively measure the model’s detection performance, this paper introduces several evaluation metrics, including Precision, Recall, F1 Score, Intersection over Union (IoU) distribution, Average Precision (AP), mean Average Precision (mAP), and the Kappa coefficient. These metrics provide a multi-dimensional assessment of the model’s detection capabilities, allowing us to fully evaluate its performance in various scenarios.
Specifically, Precision and Recall are the fundamental evaluation indicators. Precision measures the proportion of true positive samples among all samples predicted as positive, while Recall measures the proportion of actual positive samples that are correctly identified. Their calculation formulas are as follows:
$\mathrm{Precision} = \dfrac{TP}{TP + FP}$
$\mathrm{Recall} = \dfrac{TP}{TP + FN}$
where $TP$ is the number of true positive samples, $FP$ is the number of negative samples misclassified as positive, and $FN$ is the number of actual positive samples that go undetected. To balance Precision and Recall, the F1 Score is used as a combined metric, which is calculated as
$F1 = \dfrac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
In addition, this paper uses Average Precision (AP) and mean Average Precision (mAP) as comprehensive performance indicators. AP reflects the model’s detection ability for a single category, while mAP is an overall evaluation by averaging the APs for each category, and their calculation methods are as follows:
$AP = \displaystyle\int_{0}^{1} p(r)\, dr$
$mAP = \dfrac{1}{n} \displaystyle\sum_{i=1}^{n} AP_i$
To further assess the model’s detection performance, this study also uses IoU distribution and the Kappa coefficient. The IoU distribution measures the overlap between predicted bounding boxes and ground truth, providing an intuitive understanding of the model’s prediction performance at different confidence levels. The Kappa coefficient measures the agreement between predicted labels and ground truth, accounting for the possibility of random agreement, and is calculated as
$\kappa = \dfrac{p_o - p_e}{1 - p_e}$
where $p_o$ is the observed agreement and $p_e$ is the expected agreement by chance.
Finally, to comprehensively measure the speed and complexity of the algorithm, this paper also reports the inference time for a single image as an evaluation metric. Together, these indicators provide a thorough analysis of the model's performance, ensuring the reliability and practicality of the results.
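For reference, the metrics above can be computed as in the short sketch below; the all-point interpolation used for AP and the toy precision–recall values in the example are illustrative assumptions.

```python
import numpy as np

def precision_recall_f1(tp, fp, fn):
    """Precision, Recall, and F1 from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def average_precision(recalls, precisions):
    """AP as the area under the precision-recall curve (all-point interpolation)."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]              # make the precision envelope non-increasing
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def cohen_kappa(p_o, p_e):
    """Kappa coefficient from observed and chance agreement."""
    return (p_o - p_e) / (1 - p_e)

# Example: mAP over the two classes (fire, smoke) as the mean of per-class APs.
ap_fire = average_precision(np.array([0.2, 0.5, 0.8]), np.array([0.95, 0.90, 0.85]))
ap_smoke = average_precision(np.array([0.3, 0.6, 0.9]), np.array([0.92, 0.88, 0.80]))
m_ap = float(np.mean([ap_fire, ap_smoke]))
```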

4. Experimental Analysis

4.1. Ablation Test

In this section, we systematically quantify the impact of each submodule on the performance of the dual-branch multi-scale adaptive feature fusion network through ablation experiments. All experiments are conducted with the same hyperparameters and training strategy. The results are shown in Table 1 and Table 2.
First, the single-branch CNN backbone model (a) achieves 86.59% mAP with the lowest computation (11.21 GFLOPs) and parameter size (10.54 MB). However, due to the lack of global correlation modeling, its Recall and Precision are only 81.79% and 86.47%, respectively, and false detections and missed detections are relatively frequent.
The single-branch Transformer backbone (b) has an advantage in long-distance dependency modeling but does not describe local details sufficiently, with the mAP slightly reduced to 85.35%, indicating that local texture is still key to fire and smoke recognition.
On this basis, after introducing CNN and Transformer in parallel to form a dual-branch structure (c), the mAP increases to 88.71%, and the AP for fire/smoke increases to 87.31%/90.11%, respectively, indicating that the complementarity of local and global information significantly reduces false detections; however, the difference in semantic distribution between the two branches still limits fusion efficiency.
When the feature correction module FCM (d) is added, the spatial–channel two-level alignment effectively alleviates distribution drift: the mAP is further improved to 90.07%, Recall and Precision increase to 83.84% and 88.89%, and the computational overhead grows by only 3.28 GFLOPs compared with (c).
Then, combined with the Fusion Feature Module (FFM) (e), efficient multi-head cross-attention screens complementary features in the global scope, the mAP climbs to 91.24%, and both APs exceed 90%, verifying the importance of dynamic information selection for fine-grained detection.
Finally, the Multi-Scale Fusion Attention Unit (MSFAU) is introduced into the complete model (f), and multi-scale tokens are injected directly into the attention calculation to capture the scale differences of fire targets. The mAP reaches 92.49%, a significant improvement over the baseline (a), and Recall/Precision rise to 84.65%/89.74%; the inference latency increases from 10.68 ms to 17.01 ms and the computation grows by 16.40 GFLOPs, which remains acceptable for real-time edge deployment.
These trends show that FCM mainly solves cross-branch semantic alignment, FFM handles efficient interaction and suppresses redundancy, and MSFAU further bridges scale differences; together the three form a complete feature-integration pipeline, contributing significant accuracy gains and more robust flame/smoke detection.
Figure 8 gives a visual comparison of the detection results of ablation combinations (a)–(f), which further confirms the gradual improvement contributed by each submodule. With only the CNN backbone in baseline (a), the network is prone to a large number of false positives and missed detections in complex backgrounds such as canopy texture or strong reflections. Using only the Transformer backbone (b) captures the overall structure of the scene and slightly suppresses background false detections, but because local details are analyzed insufficiently, the response to early weak fire spots is not sensitive. When CNN and Transformer are introduced in parallel to form a dual-branch structure (c), the integrity of flame hotspots and smoke contours improves markedly and missed detections drop noticeably, but some missed and false detections remain, indicating that the feature distributions of the two branches are not yet fully aligned. In combination (d), the feature correction module (FCM) is added: spatial correction makes up for local response differences and channel correction balances global semantics. After further overlaying the Fusion Feature Module (FFM) (e), efficient bidirectional cross-attention dynamically filters complementary information in the global scope, so the model maintains high confidence on low-contrast smoke bands and occluded flames. The final complete model (f) introduces the Multi-Scale Fusion Attention Unit (MSFAU) and integrates multi-scale tokens directly into the attention calculation; the prediction boxes closely match the true boundaries and background interference is almost completely suppressed. The visualization results in Figure 8 and the quantitative results in Table 1 and Table 2 show that the increase in mAP from 86.59% to 92.49% is consistent with the gradual reduction in false and missed detections, fully demonstrating the synergistic gains of FCM in cross-branch semantic alignment, FFM in redundancy suppression and information interaction, and MSFAU in scale difference modeling.

4.2. Analysis Experiment of the Feature Interaction Learning Process in Dual-Branch Networks

To quantify and evaluate the mutual learning effects between the CNN and Transformer branches in the dual-branch network, this study introduces Principal Component Analysis (PCA) to visually analyze the features at different stages of the network. PCA, as a dimensionality reduction technique, projects the high-dimensional feature space onto a two-dimensional space, allowing us to intuitively observe the distribution of features from each branch and the similarity between them. This method enables us to quantify and analyze the collaborative effects between the two branches in the feature space and reveal how they gradually achieve information fusion at different stages of training. The results are shown in Figure 9.
At the initial stage, the local texture features extracted by the CNN branch and the global context features extracted by the Transformer branch exhibit significant differences. Because the network has not yet been fully trained, the PCA projections at this stage show a feature similarity of only 0.50, indicating that the feature spaces of the two branches are not yet aligned and that the extracted features differ greatly in both spatial and semantic dimensions.
As the training progresses, the feature correction module (FCM) begins to play a role. Through spatial and channel correction mechanisms, FCM adjusts the feature distributions of the two branches, gradually aligning them in both spatial and semantic dimensions. PCA shows that in the intermediate stage, the features of the two branches gradually converge, with the similarity increasing to 0.74. This indicates that through the effect of FCM, the feature differences between the CNN and Transformer branches are significantly reduced, and the information fusion effect is improved. Particularly in this stage, the model effectively combines local features and global features through feature alignment and complementarity, enhancing its ability to capture fire targets.
In the later stage of training, with the introduction of the Feature Fusion Module (FFM) and Multi-Scale Fusion Attention Unit (MSFAU), the network further optimizes the fusion and alignment of features. PCA further shows that with the effect of the FFM and multi-scale modeling, the feature similarity between the two branches increases significantly, reaching 0.87. This indicates that in the final stage of the model, after feature fusion and multi-scale modeling, the features of the CNN and Transformer branches are highly consistent in both spatial and semantic dimensions. At this point, the network can fully leverage the strengths of both branches, capturing both the details and overall information of fire targets, significantly improving detection accuracy and robustness.

4.3. Comparison Experiment

In the comparative experiment in this section, we evaluated the proposed algorithm against current mainstream single-stage detectors (RetinaNet [27], YOLOv3–v5 [28,29], YOLOX, EfficientDet [30], CenterNet [31], etc.), various models proposed for forest fire detection scenarios, and two Transformer-based models, DINO [32] and DETR [33]. The results are listed in Table 3.
Although the traditional two-stage representative Faster R-CNN reaches 88.52% recall, it is limited by the serial process of region proposal and multi-level regression, and its single-frame inference latency is as high as 146.54 ms, making it difficult to meet the real-time requirements of forest fire early warning.
RetinaNet, YOLOv3, EfficientDet, and other lightweight frameworks maintain inference latencies of 17–18 ms, but lack global context modeling ability because they rely only on single-path CNN features, so their mAP stays at only 71%–76%.
YOLOv4, CenterNet, Fire SSD, and Elastic YOLOv3 improve the mAP to 80%–86% through feature pyramids and channel attention, but remain within the single-branch cascade or shallow-attention paradigm, and their false detection rate is high in the face of interference such as canopy texture and light-spot reflections in forest scenes.
The YOLOX baseline raises the mAP to 87.24% with its decoupled head and anchor-free mechanism, and the subsequent improved versions (T-YOLOX, Improved YOLOX, etc.) mainly superimpose attention or deep convolution on the neck network, failing to fundamentally alleviate the separation of local and global semantics.
In the experiments, we also evaluated two Transformer-based models, DINO and DETR. DINO performs image feature learning through self-supervised learning, which allows it to model global context effectively and improve fire detection accuracy; it achieves an mAP of 89.14% with good precision and recall, but its inference is relatively slow at 37.50 ms per frame, which may introduce latency in real-time monitoring applications with demanding performance requirements. DETR performs object detection with an end-to-end Transformer structure, effectively modeling global context and handling complex fire scenarios; it achieves an mAP of 91.30% with excellent recall and precision, capturing fire boundaries and details more accurately in complex scenes, but its computational cost is larger, with a single-frame inference time of 31.12 ms, requiring more computational resources and longer training time.
To sum up, existing general-purpose or lightweight detectors mostly plateau at around 86% mAP in forest fire scenarios. In contrast, the algorithm in this paper introduces CNN and Transformer in parallel in the encoding phase to jointly model local texture and long-range dependence, uses the feature correction module (FCM) to align cross-branch distributions and the Fusion Feature Module (FFM) to suppress redundancy, and then uses the Multi-Scale Fusion Attention Unit (MSFAU) to model the scale differences of flame and smoke. Without a very deep backbone, the mAP is increased to 92.49%, and a single-frame inference time of 17.01 ms (equivalent to roughly 55–60 fps) still meets real-time forest fire monitoring requirements.
The experimental results show that fully mining the global local complementary information is the key to improve the accuracy and robustness of forest fire detection. The dual-branch collaborative architecture and hierarchical attention strategy proposed in this paper achieve a better compromise between accuracy, efficiency, and model size.
Figure 10 and Figure 11 show the visual detection results of the compared algorithms. RetinaNet responds insufficiently in small flame spread areas and often misses detections; YOLOv4 improves its multi-scale capability with CSP and PAN but easily misjudges background highlights as fire in complex backgrounds such as canopy texture and leaf-litter spots. Fire SSD is lightweight and fast at inference, but its shallow feature representation makes it difficult to distinguish multiple overlapping flame hotspots. YOLOX strengthens the localization accuracy of single-point flames with its anchor-free decoupled head, but lacks global context information, so it still produces overlapping detections or ignores small edge flames when facing dense fire sources. Although T-YOLOX and Improved YOLOX add attention and deep convolution modules to the neck network to enhance some scale features, they are still limited by local receptive fields and struggle to integrate global information systematically, so they cannot simultaneously and accurately identify both small sparks and the overall fire in large-scale fire scenes. In contrast, DMAFNet captures the local details and global trend of the flame through parallel CNN and Transformer branches, adaptively corrects the two features in the spatial and channel dimensions, and dynamically focuses on different fire sources with the multi-scale fusion attention mechanism, realizing high-confidence detection of both small flames and large burning areas and effectively overcoming the limitations of the above single-branch methods in small-target, complex-background, and multi-target fire scenarios.
Figure 12 shows the detection effect of DMAFNet in various complex scenes, including multi-target fire sources (a), no smoke occlusion (d), smoke-occluded flames (b, c, f), and small target fire sources under heavy smoke (e). DMAFNet demonstrates excellent robustness and accuracy, handling a variety of environmental challenges effectively.
In the multi-target fire source scene (Figure 12a), the dual-branch network of DMAFNet extracts local texture and global semantic information for each flame in parallel, and the feature fusion module integrates information from both branches to distinguish adjacent fire sources, enabling simultaneous high-confidence detection of multiple small flames. In the smoke-free scene (Figure 12d), the feature fusion module effectively identifies the flame boundaries, and the multi-scale fusion attention unit (MSFAU) further compensates for interference caused by forest shadows on the fire source, ensuring accurate flame detection under challenging lighting conditions such as uneven shadowing or lighting variations. In the smoke-occluded scenes (Figure 12b,c,f), DMAFNet utilizes the spatial and channel correction modules to compensate for the feature response in blocked areas; these modules dynamically adjust the feature maps to recover flame information in occluded regions. Even when parts of the flames are completely obscured by smoke, the multi-scale fusion attention unit enhances the capture of hidden flame contours, greatly reducing missed detections and allowing the model to detect residual flame details in heavily occluded areas, demonstrating its robustness to partial occlusion under varying levels of smoke coverage. In the small-target fire source scene under heavy smoke (Figure 12e), DMAFNet performs remarkably well by extracting local texture and global semantics in parallel: despite the fire source being covered by dense smoke, the model accurately detects the flame through the effective area of the feature fusion module. To further validate the performance of the model, we downloaded several images from the internet and conducted tests; the results are shown in Figure 12g–i. In these images, DMAFNet similarly exhibits exceptional robustness and accuracy, handling complex environments with multiple targets, occlusions, and varying lighting conditions, which further confirms its ability to perform well under various disturbances.
Overall, DMAFNet demonstrates its ability to tackle a variety of challenging scenarios, including small targets, multiple targets, occlusions, and lighting variations. By leveraging the dual-branch network for parallel extraction of local texture and global semantics, the multi-scale fusion attention mechanism, and the spatial and channel correction modules, DMAFNet achieves superior performance in complex environments. These features make DMAFNet a highly effective system for early forest fire detection with significant real-world application potential.

4.4. Evaluation of the Proposed DMAFNet Algorithm Under Environmental Disturbances

To further evaluate the robustness and accuracy of the proposed DMAFNet model under various environmental conditions, we introduced three common types of interference into the dataset: random noise, lighting variation, and partial occlusion. These disturbances simulate real-world environmental factors and test the model's performance in challenging conditions. The types of interference are as follows:
Random Noise: Gaussian noise with a mean of 0 and a standard deviation of 0.1 is added to the images, simulating sensor noise commonly caused by camera hardware and environmental factors.
Lighting Variation: The brightness of the images is adjusted to simulate different lighting conditions, including low-light and overexposure scenarios.
Partial Occlusion: Random objects such as trees and buildings are overlaid on the images to simulate occlusions caused by environmental elements.
To comprehensively assess DMAFNet's performance under these disturbances (a simple reproduction of the perturbations is sketched below), we conducted tests on the dataset with the added interference and compared the results with those on the original, undisturbed data. The results are summarized in Table 4.
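The three perturbations can be reproduced with simple NumPy operations as follows; the occlusion patch size, brightness factor, and random patch texture are illustrative assumptions rather than the exact protocol used here.

```python
import numpy as np

def add_gaussian_noise(img, std=0.1):
    """Gaussian noise with mean 0 and the stated standard deviation (image normalized to [0, 1])."""
    return np.clip(img + np.random.normal(0.0, std, img.shape), 0.0, 1.0)

def adjust_brightness(img, factor):
    """Lighting variation: factor < 1 simulates low light, factor > 1 simulates overexposure."""
    return np.clip(img * factor, 0.0, 1.0)

def add_occlusion(img, frac=0.2):
    """Overlay a random rectangular patch to mimic occlusion by trees or buildings."""
    h, w = img.shape[:2]
    oh, ow = int(h * frac), int(w * frac)
    y, x = np.random.randint(0, h - oh), np.random.randint(0, w - ow)
    out = img.copy()
    out[y:y + oh, x:x + ow] = np.random.rand(oh, ow, img.shape[2])  # random texture patch
    return out

# Example: apply all three disturbances to one normalized 1080p RGB frame.
image = np.random.rand(1080, 1920, 3)
perturbed = add_occlusion(adjust_brightness(add_gaussian_noise(image), factor=0.6))
```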
The experimental results demonstrate that DMAFNet maintains high detection performance even with the added disturbances. The model’s performance shows only a slight decrease in mAP and recall when exposed to random noise and lighting variation, and the performance drop is more noticeable in partial occlusion scenarios. However, the model still exhibits robust accuracy and effectively handles various environmental disturbances, making it suitable for real-world fire detection tasks.

5. Conclusions

Aiming at the limitations of existing single-branch fire detection algorithms in multi-scale flame feature capture and global semantic fusion, this paper proposes a dual-branch multi-scale adaptive feature fusion network (DMAFNet). Although traditional vision-based fire detection methods do not need large amounts of training data, they are vulnerable to environmental changes and have high false positive and false negative rates. Meanwhile, most existing deep learning methods are based on a single CNN architecture and are limited by local receptive fields, making it difficult to capture global context information effectively and prone to false detection against complex forest backgrounds. To solve these problems, the DMAFNet designed in this paper deploys CNN and Transformer branches in parallel in the backbone, responsible respectively for extracting local texture and global context information, giving full play to the complementary advantages of the two architectures.
In view of the differences in distribution and response scale between the dual-branch features, this paper designs a feature correction module (FCM) that achieves adaptive alignment of the CNN and Transformer branch features through spatial correction and channel correction, effectively alleviating cross-branch semantic distribution drift. The proposed feature fusion module (FFM) is based on an efficient bidirectional cross-attention mechanism; while reducing computational complexity, it dynamically filters and aggregates key information, effectively suppressing redundant features and enhancing cross-modal information interaction. In addition, the multi-scale fusion attention unit (MSFAU) directly introduces multi-scale feature tokens from different stages into the self-attention computation; by modeling pixel relationships in the global scope, it dynamically responds to fire source targets of various scales, significantly improving the detection of large-, medium-, and small-scale fire targets.
Although this research has achieved notable gains in detection accuracy and robustness, some aspects still need improvement. First, in complex backgrounds, especially under extreme weather (such as fog, rainstorms, and strong wind) and different seasonal conditions (such as autumn leaves and winter snow), the generalization ability of the model needs to be further improved. Second, the current model is designed mainly for visible-light images and has limited ability to detect fires at night or under low-light conditions. In addition, the model still has room for improvement when dealing with very small fire sources (such as a flame that has just ignited).
Future research can be carried out in the following directions: (1) building more diversified datasets, covering fire scenarios under different weather, season and lighting conditions, and exploring data enhancement and domain adaptation technologies to improve model generalization performance; (2) fusion of multi-modal information, such as infrared thermal imaging, meteorological data, etc., to build a more robust multi-modal fire detection system; (3) optimize the network structure, further reduce the computational complexity, and enable the model to be deployed on edge devices for distributed monitoring; (4) the reinforcement learning [34,35] technology is used to simulate the dynamic process of fire spread, and the time sequence information is used to predict the fire development trend, so as to provide intelligent decision support for the formulation of fire prevention, control and extinguishing strategies; (5) explore unsupervised or semi-supervised learning methods to reduce dependence on a large number of labeled data and improve the feasibility of the model in practical application. These research directions will further promote the development of forest fire intelligent monitoring technology and provide more reliable technical guarantee for ecological environment protection and human safety.

Author Contributions

Conceptualization, Q.W., C.W. and N.S.; methodology, Q.W. and N.S.; software, Q.W., C.W. and N.S.; validation, Q.W. and C.W.; formal analysis, Q.W. and C.W.; investigation, Q.W., C.W. and Q.X.; resources, N.S. and X.F.; data curation, Q.W., J.Z. and Q.X.; writing—original draft preparation, Q.W., C.W. and N.S.; writing—review and editing, X.X. and N.S.; supervision, N.S. All authors have read and agreed to the published version of the manuscript.

Funding

The research in this article was supported by the National Natural Science Foundation of China (No. 42205150), Jiangsu Postgraduate Innovation Project (No. SJCX25_0484).

Data Availability Statement

The data and code used to support the findings of this study are available from the corresponding author upon request (880219@cwxu.edu.cn).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Xu, H.; Zhang, G.; Chu, R.; Zhang, J.; Yang, Z.; Wu, X.; Xiao, H. Detecting forest fire omission error based on data fusion at subpixel scale. Int. J. Appl. Earth Obs. Geoinf. 2024, 128, 103737. [Google Scholar] [CrossRef]
  2. Tian, Y.; Yang, G.; Wang, Z.; Wang, H.; Li, E.; Liang, Z. Apple detection during different growth stages in orchards using the improved YOLO-V3 model. Comput. Electron. Agric. 2019, 157, 417–426. [Google Scholar] [CrossRef]
  3. Matougui, Z.; Zouidi, M. A temporal perspective on the reliability of wildfire hazard assessment based on machine learning and remote sensing data. Earth Sci. Inform. 2025, 18, 19. [Google Scholar] [CrossRef]
  4. Ding, F.; Tang, G.; Zhang, T.; Wu, W.; Zhao, B.; Miao, J.; Li, D.; Liu, X.; Wang, J.; Li, C. Spaceborne Lightweight and Compact High-Sensitivity Uncooled Infrared Remote Sensing Camera for Wildfire Detection. Remote Sens. 2025, 17, 1387. [Google Scholar] [CrossRef]
  5. Titu, M.F.S.; Pavel, M.A.; Michael, G.K.O.; Babar, H.; Aman, U.; Khan, R. Real-Time Fire Detection: Integrating Lightweight Deep Learning Models on Drones with Edge Computing. Drones 2024, 8, 483. [Google Scholar] [CrossRef]
  6. Li, B.; Wang, X.; Sun, Q.; Yu, S. Forest fire image detection method based on improved CenterNet. In Proceedings of the Second International Symposium on Computer Applications and Information Systems (ISCAIS 2023), Chengdu, China, 24–26 March 2023; SPIE: Bellingham, WA, USA, 2023; Volume 12721, pp. 380–385. [Google Scholar]
  7. Li, Z.; Mihaylova, L.S.; Isupova, O.; Rossi, L. Autonomous flame detection in videos with a Dirichlet process Gaussian mixture color model. IEEE Trans. Ind. Inform. 2017, 14, 1146–1154. [Google Scholar] [CrossRef]
  8. Liao, Z.; Hu, K.; Meng, Y.; Shen, S. An advanced three stage lightweight model for underwater human detection. Sci. Rep. 2025, 15, 18137. [Google Scholar] [CrossRef]
  9. Vipin, V. Image processing based forest fire detection. Int. J. Emerg. Technol. Adv. Eng. 2012, 2, 87–95. [Google Scholar]
  10. Khondaker, A.; Khandaker, A.; Uddin, J. Computer vision-based early fire detection using enhanced chromatic segmentation and optical flow analysis technique. Int. Arab J. Inf. Technol. 2020, 17, 947–953. [Google Scholar] [CrossRef]
  11. Liau, H.; Yamini, N.; Wong, Y. Fire SSD: Wide fire modules based single shot detector on edge device. arXiv 2018, arXiv:1806.05363. [Google Scholar] [CrossRef]
  12. Zhang, J.; Jia, Y.; Zhu, D.; Hu, W.; Tang, Z. Study on the situational awareness system of mine fire rescue using faster Ross Girshick-convolutional neural network. IEEE Intell. Syst. 2019, 35, 54–61. [Google Scholar] [CrossRef]
  13. Chaoxia, C.; Shang, W.; Zhang, F. Information-guided flame detection based on faster R-CNN. IEEE Access 2020, 8, 58923–58932. [Google Scholar] [CrossRef]
  14. Park, M.J.; Ko, B.C. Two-Step Real-Time Night-Time Fire Detection in an Urban Environment Using Static ELASTIC-YOLOv3 and Temporal Fire-Tube. Sensors 2020, 20, 2202. [Google Scholar] [CrossRef] [PubMed]
  15. Kumar, S.; Gupta, H.; Yadav, D.; Ansari, I.A.; Verma, O.P. YOLOv4 algorithm for the real-time detection of fire and personal protective equipments at construction sites. Multimed. Tools Appl. 2022, 81, 22163–22183. [Google Scholar] [CrossRef]
  16. Mukhiddinov, M.; Abdusalomov, A.B.; Cho, J. Automatic fire detection and notification system based on improved YOLOv4 for the blind and visually impaired. Sensors 2022, 22, 3307. [Google Scholar] [CrossRef]
  17. Xu, R.; Lin, H.; Lu, K.; Cao, L.; Liu, Y. A forest fire detection system based on ensemble learning. Forests 2021, 12, 217. [Google Scholar] [CrossRef]
  18. Zhang, J.; Ke, S. Improved YOLOX fire scenario detection method. Wirel. Commun. Mob. Comput. 2022, 2022, 9666265. [Google Scholar] [CrossRef]
  19. Talaat, F.M.; ZainEldin, H. An improved fire detection approach based on YOLO-v8 for smart cities. Neural Comput. Appl. 2023, 35, 20939–20954. [Google Scholar] [CrossRef]
  20. Maurício, J.; Domingues, I.; Bernardino, J. Comparing vision transformers and convolutional neural networks for image classification: A literature review. Appl. Sci. 2023, 13, 5521. [Google Scholar] [CrossRef]
  21. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5999–6009. [Google Scholar]
  22. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar] [CrossRef]
  23. Wu, H.; Xiao, B.; Codella, N.; Liu, M.; Dai, X.; Yuan, L.; Zhang, L. Cvt: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 22–31. [Google Scholar]
  24. Yuan, S.; Li, X.; Yinxia, S.; Qing, X.; Deng, J.D. Improved quantum image weighted average filtering algorithm. Quantum Inf. Process. 2025, 24, 125. [Google Scholar] [CrossRef]
  25. Hazirbas, C.; Ma, L.; Domokos, C.; Cremers, D. Fusenet: Incorporating depth into semantic segmentation via fusion-based cnn architecture. In Proceedings of the Asian Conference on Computer Vision, Taipei, Taiwan, 20–24 November 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 213–228. [Google Scholar]
  26. Xu, X.; Li, W.; Ran, Q.; Du, Q.; Gao, L.; Zhang, B. Multisource remote sensing data classification based on convolutional neural network. IEEE Trans. Geosci. Remote Sens. 2017, 56, 937–949. [Google Scholar] [CrossRef]
  27. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE international Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  28. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  29. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  30. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  31. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850. [Google Scholar]
  32. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605. [Google Scholar]
  33. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  34. Hu, K.; Xu, K.; Xia, Q.; Li, M.; Song, Z.; Song, L.; Sun, N. An overview: Attention mechanisms in multi-agent reinforcement learning. Neurocomputing 2024, 598, 128015. [Google Scholar] [CrossRef]
  35. Hu, K.; Li, M.; Song, Z.; Xu, K.; Xia, Q.; Sun, N.; Zhou, P.; Xia, M. A review of research on reinforcement learning algorithms for multi-agents. Neurocomputing 2024, 599, 128068. [Google Scholar] [CrossRef]
Figure 1. Overall network structure diagram.
Figure 2. Convolutions at each stage of the backbone feature extraction network.
Figure 3. Specific details of each component in the convolutional neural network branch.
Figure 4. Structural diagram of the feature correction module.
Figure 5. Structural diagram of the feature fusion module.
Figure 6. Structural diagram of the Multi-Scale Fusion Attention Unit.
Figure 7. Example scenes from the dataset illustrating various fire scenarios and types of occlusion. (a) Small fires (localized, low-intensity flames); (b,c) medium-intensity fires (larger areas of fire spread); (d–f) large fires (wide-scale fires with significant flame and smoke coverage).
Figure 8. Visualization results of the ablation experiment. (a) Baseline using only the CNN backbone; (b) only the Transformer backbone; (c) CNN and Transformer introduced in parallel to form the dual-branch structure; (d) (c) with the feature correction module (FCM) added; (e) (d) with the feature fusion module (FFM) added; (f) the final complete model.
Figure 9. PCA projection analysis of feature similarity at different stages of the dual-branch network. (a) Initial stage; (b) intermediate stage; (c) final stage.
Figure 10. Visualization results of the comparison experiment.
Figure 11. Visualization results of the comparison experiment.
Figure 12. Detection results of the proposed algorithm in complex scenarios. (a) Multi-target fire sources; (d) flames without smoke occlusion; (b,c,f) smoke-occluded flames; (e) small fire sources under heavy smoke; (g–i) additional test images collected from the internet.
Table 1. Combination method and experimental results of the ablation experiments. CNN and Transformer form the backbone network; FCM, FFM, and MSFAU are auxiliary modules (✓ = included, × = not included).
Combination | CNN | Transformer | FCM | FFM | MSFAU | mAP/% | Time/ms | GFLOPs | Parameters/MB
(a) | ✓ | × | × | × | × | 86.59 | 10.68 | 11.21 | 10.54
(b) | × | ✓ | × | × | × | 85.35 | 11.46 | 13.45 | 12.22
(c) | ✓ | ✓ | × | × | × | 88.71 | 13.21 | 20.24 | 20.41
(d) | ✓ | ✓ | ✓ | × | × | 90.07 | 14.82 | 23.52 | 22.05
(e) | ✓ | ✓ | ✓ | ✓ | × | 91.24 | 16.09 | 25.33 | 24.26
(f) | ✓ | ✓ | ✓ | ✓ | ✓ | 92.49 | 17.01 | 27.61 | 26.93
Table 2. Results of the ablation experiment (all values in %).
Combination | Recall | Precision | mAP | AP (Fire) | AP (Smoke)
(a) | 81.79 | 86.47 | 86.59 | 84.94 | 88.24
(b) | 82.26 | 87.31 | 85.35 | 83.97 | 86.73
(c) | 82.77 | 87.62 | 88.71 | 87.31 | 90.11
(d) | 83.84 | 88.89 | 90.07 | 89.27 | 90.87
(e) | 84.28 | 89.17 | 91.24 | 90.19 | 92.29
(f) | 84.65 | 89.74 | 92.49 | 91.75 | 93.23
Table 3. Comparative experimental results.
Method | mAP/% | Recall/% | Precision/% | F1/% | IoU/% | Kappa | Time/ms | GFLOPs | Parameters/MB
RetinaNet | 71.57 | 72.43 | 71.45 | 71.44 | 68.22 | 0.65 | 114.12 | 77.54 | 30.04
YOLOv3 | 75.32 | 68.27 | 72.81 | 70.53 | 72.15 | 0.70 | 18.23 | 65.60 | 61.57
EfficientDet | 76.21 | 74.62 | 74.25 | 74.43 | 74.02 | 0.72 | 17.47 | 4.63 | 3.83
YOLOv4 | 80.72 | 72.04 | 82.81 | 77.56 | 78.11 | 0.76 | 17.21 | 59.77 | 63.98
Faster R-CNN | 85.12 | 88.52 | 68.12 | 76.57 | 79.98 | 0.81 | 146.54 | 369.74 | 28.36
CenterNet | 86.34 | 85.59 | 87.95 | 86.26 | 84.77 | 0.83 | 20.01 | 69.94 | 32.66
Fire SSD | 85.27 | 82.46 | 86.38 | 84.42 | 84.92 | 0.79 | 18.62 | 32.69 | 49.54
Elastic-YOLOv3 | 85.39 | 82.29 | 86.53 | 84.79 | 85.14 | 0.80 | 14.07 | 67.24 | 6.73
YOLOX | 87.24 | 83.71 | 88.46 | 85.57 | 86.14 | 0.85 | 14.44 | 15.12 | 10.59
T-YOLOX | 88.11 | 83.07 | 87.27 | 85.56 | 87.00 | 0.87 | 16.15 | 18.53 | 15.21
DINO | 89.14 | 85.90 | 86.70 | 87.80 | 88.14 | 0.89 | 37.50 | 64.00 | 89.73
Improved YOLOX | 88.54 | 82.96 | 87.43 | 85.87 | 87.52 | 0.88 | 15.74 | 24.82 | 19.57
DETR | 91.30 | 88.50 | 89.11 | 88.00 | 90.10 | 0.90 | 31.12 | 53.23 | 75.12
Ours | 92.49 | 84.57 | 89.43 | 86.52 | 91.23 | 0.92 | 17.01 | 27.61 | 26.93
Table 4. Performance evaluation after adding interference to the dataset.
Interference Type | mAP (%) | Recall (%) | Precision (%) | F1 (%) | IoU (%) | Kappa
No Interference | 92.49 | 84.57 | 89.43 | 86.52 | 91.23 | 0.92
Random Noise | 90.56 | 83.10 | 87.92 | 85.35 | 89.11 | 0.90
Lighting Variation | 89.62 | 82.47 | 86.73 | 84.77 | 88.93 | 0.89
Partial Occlusion | 88.71 | 81.85 | 85.12 | 83.26 | 87.85 | 0.87
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
