Article

F3-YOLO: A Robust and Fast Forest Fire Detection Model

Pengyuan Zhang, Xionghan Zhao, Xubing Yang, Ziqian Zhang, Changwei Bi and Li Zhang
1 College of Information Science and Technology & Artificial Intelligence, Nanjing Forestry University, Nanjing 210037, China
2 State Grid Electric Power Research Institute, NARI Information & Communication Technology Co., Ltd., Nanjing 211106, China
3 State Key Laboratory of Tree Genetics and Breeding, Co-Innovation Center for Sustainable Forestry in Southern China, Key Laboratory of Tree Genetics and Silvicultural Sciences of Jiangsu Province, Nanjing Forestry University, Nanjing 210037, China
* Authors to whom correspondence should be addressed.
Forests 2025, 16(9), 1368; https://doi.org/10.3390/f16091368
Submission received: 18 July 2025 / Revised: 11 August 2025 / Accepted: 22 August 2025 / Published: 23 August 2025
(This article belongs to the Section Forest Inventory, Modeling and Remote Sensing)

Abstract

Forest fires not only destroy vegetation and directly decrease forested areas, but they also significantly impair forest stand structures and habitat conditions, ultimately leading to imbalances within the entire forest ecosystem. Accurate forest fire detection is therefore critical for ecological safety and for protecting lives and property. However, existing algorithms often struggle to detect flames and smoke in complex scenarios such as sparse smoke, weak flames, or vegetation occlusion, and their high computational costs hinder practical deployment. To address these issues, this paper introduces F3-YOLO, a robust and fast forest fire detection model based on YOLOv12. F3-YOLO introduces conditionally parameterized convolution (CondConv) to enhance representational capacity without incurring a substantial increase in computational cost, improving fire detection in complex backgrounds. Additionally, a frequency domain-based self-attention solver (FSAS) is integrated to combine high-frequency and high-contrast information, thus better handling real-world detection scenarios involving both small distant targets in aerial imagery and large nearby targets on the ground. To provide more stable structural cues, we propose the Focaler Minimum Point Distance Intersection over Union loss (FMPDIoU), which helps the model capture irregular and blurred boundaries caused by vegetation occlusion, flame jitter, or smoke dispersion. To enable efficient deployment on edge devices, we also apply structured pruning to reduce computational overhead. Compared to YOLOv12 and other mainstream methods, F3-YOLO achieves superior accuracy and robustness, attaining the highest mAP@50 of 68.5% among all compared methods on the dataset while requiring only 4.7 GFLOPs of computation and maintaining a compact parameter count of 2.6 M, demonstrating exceptional efficiency and effectiveness. These attributes make it a reliable, low-latency solution well suited for real-time forest fire early warning systems.

1. Introduction

Forests and their diverse ecosystems are vital for soil and water conservation and play a critical role in maintaining the ecological balance of Earth [1]. Strengthening forest protection is therefore essential, with wildfire detection representing a key component of such efforts. Forest fires not only destroy vegetation and directly reduce forest cover but also severely damage forest structures and environmental conditions, leading to an imbalance in the forest ecosystem. Consequently, it is essential to deploy monitoring equipment in the wild for the prevention and early detection of forest fires.
Conventional forest fire detection approaches, including video surveillance systems [2], satellite-based remote sensing [3], and sensor networks [4], often encounter limitations such as high operational costs, delayed response times, and the need for specialized expertise, which impede their ability to provide cost-effective and timely fire alerts. Recent advancements in computer vision have revolutionized forest fire monitoring by enabling more efficient and intuitive detection methods. Vision-based systems utilize specialized cameras deployed in forest areas to process real-time video feeds through advanced image analysis techniques. Chen et al. [5] introduced a chromatic and disorder analysis approach based on the RGB (red, green, and blue) color model to detect fire regions in video sequences. This method identifies potential fire pixels using color information, and then employs motion features to confirm whether the detected region corresponds to an actual flame. Chino et al. [6] proposed BowFire, which integrates color features with superpixel-based texture analysis to identify fires in single images. Overall, traditional image-based fire detection approaches have largely relied on manually designed features, such as color and texture, to distinguish fire regions.
With the advancement of deep learning, neural networks have demonstrated remarkable capability in feature extraction, drawing extensive attention in recent years [7]. Modern fire detection techniques increasingly rely on object detection algorithms based on deep neural networks, which are broadly classified into two paradigms: two-stage and one-stage detectors. Two-stage methods, such as the R-CNN series (e.g., Fast R-CNN [8] and Faster R-CNN [9]), first generate a set of candidate region proposals and subsequently classify them. For instance, Kim et al. [10] applied Faster R-CNN to distinguish fire and non-fire regions by analyzing their spatial characteristics. In contrast, one-stage methods, including the single-shot detector (SSD) [11] and the You Only Look Once (YOLO) family of models [12,13], perform bounding box regression and classification concurrently, yielding substantially higher computational efficiency. The unique challenges of forest fire surveillance, which is often conducted in remote, mountainous terrain where fires can propagate rapidly, demand continuous, real-time monitoring. This operational context necessitates the deployment of lightweight, low-latency detection models on edge devices with constrained computational resources. The YOLO framework, with its balance of high inference speed and competitive accuracy, has emerged as a suitable foundation for such applications. Recent studies have focused on tailoring YOLO for forest fire detection. Wang et al. [14] introduced YOLO-LFD, a specialized framework for detecting forest fires. Liu et al. [15] adapted YOLOv7-tiny to enhance detection precision. Wang et al. [16] further improved the YOLOv8 architecture to develop a novel model tailored for forest fire identification. Yun et al. [17] reduced model complexity by optimizing the parameter efficiency of YOLOv8. Wang et al. [18] proposed DSS-YOLO, advancing the performance of fire detection systems. Zhou et al. [19] tackled the key challenge of maintaining both detection accuracy and real-time processing in forest fire monitoring through the design of YOLOv8n-SMMP.
Despite these advancements, the reliable detection of forest fires remains a considerable challenge. The visual appearance of fires and smoke is often erratic due to environmental factors such as vegetative occlusion, wind, and variable atmospheric conditions. This results in targets that exhibit inconsistent shapes, colors, and scales, which can degrade the performance of existing detection models. As the latest version in the YOLO series, YOLOv12 adopts an attention-centric design [20]. By employing area attention, YOLOv12 avoids the computational overhead typically associated with standard self-attention while preserving its advantages. This allows the model to effectively recognize key regions within the cluttered or dynamic environments commonly encountered in forest fire detection, where vegetation occlusion and smoke diffusion are prevalent, thereby improving detection robustness for small, occluded, or overlapping objects. To investigate the practical applicability of YOLOv12, this paper selects YOLOv12n, the most computationally and memory-efficient variant of the YOLOv12 series, as the baseline model and further optimizes it to meet the specific requirements of forest fire detection. This paper introduces F3-YOLO, an efficient and robust model specifically engineered for fire and smoke detection. The primary contributions of this work are as follows:
  • We establish a new baseline for forest fire detection by adapting the YOLOv12 framework, which integrates area attention modules to isolate targets from cluttered backgrounds, thereby improving the detection of small, occluded, or overlapping instances of fire and smoke.
  • F3-YOLO introduces the dynamically activated CondConv to enhance the model’s feature representation of dynamic targets while preserving real-time performance. Additionally, the FSAS module is integrated to leverage frequency-domain information and relational dependencies to improve detection accuracy.
  • We propose the FMPDIoU loss function to stabilize training and enhance localization accuracy for irregularly shaped targets. Furthermore, structured pruning is applied to eliminate redundant parameters, reducing computational overhead and ensuring compatibility with edge devices.
  • Our experimental results demonstrate that the proposed F3-YOLO achieves state-of-the-art (SOTA) accuracy while incurring the lowest computational overhead, offering a reliable, low-latency solution for forest fire early warning systems.

2. Method

2.1. Baseline Model Selection

As illustrated in Figure 1, the YOLOv12 model is composed of three core components: the backbone, the neck, and the detection head, which refine the designs of earlier YOLO versions. The backbone performs progressive feature extraction and downsampling through a series of stacked convolutional (Conv) layers. It also integrates multiple C3k2 modules at different depths, constructing a multi-scale feature hierarchy that extracts features at different spatial levels. At the end of the backbone, the A2C2f module is employed to capture spatial dependencies through an attention mechanism. Specifically, the Conv module is composed of a standard convolutional layer followed by batch normalization and the SiLU activation function. The C3k2 module employs a parallel convolutional structure consisting of two branches, allowing feature extraction across multiple channels, which enhances both the efficiency and precision of feature extraction. The A2C2f module integrates area attention and residual connections, which not only maintain a large receptive field but also significantly reduce the computational complexity of the attention mechanism. The neck implements a bidirectional feature fusion strategy by combining features from different scales using feature upsampling and concatenation. The A2C2f module is again utilized here to fuse features across different levels, enhancing the integration of high-level semantic information from deeper layers with fine-grained spatial details from shallower layers, which yields more expressive and robust feature representations. The detection head adopts a multi-branch architecture, where each branch is designed to process feature maps at a specific resolution. As shown in the figure, three detection branches (P3, P4, and P5) are constructed to correspond to different levels of the feature hierarchy, allowing the model to detect objects of varying scales effectively.
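To make the multi-scale head concrete, the short Python snippet below prints the grid sizes of the three detection branches for a 640 × 640 input. The strides of 8, 16, and 32 are the conventional YOLO values and are assumed here for illustration rather than taken from the text above.

```python
# Grid sizes of the three detection branches for a 640x640 input.
# Strides of 8, 16, and 32 are the conventional YOLO values (an assumption, not stated above).
input_size = 640
for branch, stride in [("P3", 8), ("P4", 16), ("P5", 32)]:
    side = input_size // stride
    print(f"{branch}: stride {stride} -> {side}x{side} grid")  # P3 80x80, P4 40x40, P5 20x20
```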

2.2. F3-YOLO

2.2.1. Overall Architecture

As shown in Figure 2, we optimized the backbone network of YOLOv12 to better suit the forest fire detection task. To enhance the model's capacity, we incorporated CondConv [21], which leverages dynamically activated mixture-of-experts convolution to avoid the sharp increase in parameters and computational overhead typically associated with deepening network layers or widening convolutional channels. To prevent rapid parameter inflation in deeper layers, we selectively replaced standard Conv modules in YOLOv12 with CondConv modules in the shallow and intermediate layers, striking a balance between model complexity and feature extraction capability. To capitalize on the high internal similarity (low-frequency signal) of flames and smoke, as well as their significant variations relative to the background (high-frequency signal), we introduced the FSAS module [22]. This module computes self-attention scores in the frequency domain, effectively leveraging frequency-domain information together with the spatial relationship modeling capability of attention mechanisms. To maximize the utilization of flame and smoke signals in the original images, we incorporated the FSAS module only in the early stages of forward propagation and removed the C3k2 module from the third layer. Through the integration of the CondConv and FSAS modules, the model achieves enhanced feature extraction and global context awareness, expanding the receptive field available to the area attention in the deep layers.

2.2.2. Introduce the CondConv

The dynamic and diverse appearance of flames and smoke, in terms of both shape and color, combined with varying lighting conditions, significantly affects their visual distinguishability from the background. These factors pose substantial challenges for lightweight models, which often struggle to learn generalized and discriminative semantic features, thereby compromising robustness in complex real-world applications. Although enlarging the model can improve detection performance, it also introduces considerable computational overhead, rendering such approaches impractical for deployment on resource-constrained edge devices. To address this challenge, we introduce CondConv into the model architecture, aiming to enhance representational capacity without incurring a substantial increase in computational cost.
As illustrated in Figure 3, CondConv takes the input feature map and passes it through a routing network, which consists of global average pooling and a fully connected layer followed by a sigmoid function. This network generates a scalar weight for each expert convolution kernel, and these weights are then used to produce a weighted combination of the kernels. The formal representation is as follows:
$\alpha_i = r_i(x) = \mathrm{Sigmoid}\big(\mathrm{GAP}(x)\, R\big)$  (1)
$\mathrm{CondConv}(x) = \sigma\!\left(\Big(\sum_{i=1}^{n} \alpha_i W_i\Big) * x\right)$  (2)
where $\mathrm{Sigmoid}$ denotes the sigmoid activation function and $\mathrm{GAP}$ represents global average pooling. $R$ is a matrix of learned routing weights that maps the pooled inputs to $n$ expert weights via a fully connected layer. Each $\alpha_i = r_i(x)$ is an input-dependent scalar weight computed using a learnable routing function; $n$ is the number of experts, and $\sigma$ is a ReLU activation function. When we adapt a convolution layer to use CondConv, each kernel $W_i$ has the same dimensions as the kernel in the original convolution. Unlike conventional methods that increase model capacity by adding more convolution filters or expanding channel widths, CondConv allows each expert kernel to learn distinct feature patterns. Through the dynamic aggregation of multiple expert kernels, the model can adaptively modify its convolution kernels at runtime based on the input. This mechanism enables the network to adjust its feature extraction strategy in an input-dependent manner, thereby significantly improving model performance while keeping the computational burden low, which is essential for edge-device deployment in forest fire detection scenarios.
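As a concrete illustration of Equations (1) and (2), the following is a minimal PyTorch sketch of a CondConv-style layer. The number of experts, the initialization, and the grouped-convolution trick for applying per-sample kernels in a batch are assumptions; the exact implementation used in F3-YOLO is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CondConv2d(nn.Module):
    """Sketch of a conditionally parameterized convolution (CondConv), following Eqs. (1)-(2)."""

    def __init__(self, in_ch: int, out_ch: int, k: int = 3, stride: int = 1, num_experts: int = 4):
        super().__init__()
        self.stride, self.padding = stride, k // 2
        # One kernel per expert, each with the same shape as a standard convolution kernel
        self.experts = nn.Parameter(torch.randn(num_experts, out_ch, in_ch, k, k) * 0.02)
        self.routing = nn.Linear(in_ch, num_experts)   # learned routing matrix R

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # alpha_i = Sigmoid(GAP(x) R): one scalar weight per expert, per input sample
        alpha = torch.sigmoid(self.routing(x.mean(dim=(2, 3))))            # (B, E)
        # Aggregate expert kernels per sample, then apply them with one grouped convolution
        kernels = torch.einsum("be,eoikj->boikj", alpha, self.experts)     # (B, O, I, k, k)
        out_ch = kernels.shape[1]
        x = x.reshape(1, b * c, h, w)
        kernels = kernels.reshape(b * out_ch, c, *kernels.shape[-2:])
        y = F.conv2d(x, kernels, stride=self.stride, padding=self.padding, groups=b)
        return F.relu(y.reshape(b, out_ch, y.shape[-2], y.shape[-1]))      # sigma = ReLU, as in Eq. (2)
```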
To semantically capture the highly variable characteristics of flames and smoke, we enhance the model capacity of YOLOv12 by introducing CondConv. This enables the network to learn more expressive and adaptive feature representations of fires and smoke, thereby improving its robustness in complex forest fire detection scenarios. Since lower network layers rely on stable feature representations to ensure training convergence and because the number of parameters grows exponentially with depth, we strategically replace the intermediate convolutional layers in the YOLOv12 backbone with CondConv layers. This design choice is guided by two complementary insights. On the one hand, the number of channels in early layers is relatively small (e.g., 64 or 128), which keeps the expert networks in CondConv compact. In contrast, placing CondConv in deeper layers—where channel counts reach into the thousands—would drastically increase the number of parameters, inflating model size by an order of magnitude and exceeding the strict memory constraints typical of real-time and edge applications. On the other hand, shallow layers produce low-level features, where dynamic kernel selection across experts can more effectively adapt to input variations, yielding higher accuracy gains per parameter. In deeper layers, features already encode rich, high-level semantics, making them less sensitive to fine-grained adaptation and thus less beneficial from conditional computation. By confining CondConv to the earlier stages of the backbone, we strike an optimal balance: the model gains enhanced expressiveness to capture diverse input statistics while maintaining computational and memory efficiency suitable for deployment on resource-constrained devices.

2.2.3. Integrate the FSAS Module

To address the challenges of forest fire detection, where flames and smoke exhibit variable scales and complex backgrounds (e.g., small smoke points or sparks in early stages versus large-scale spread in severe cases), a model must effectively capture multi-scale targets at long distances and understand contextual information to accurately identify targets in intricate scenes. Traditional convolutional neural networks are limited in capturing long-range dependencies, while standard self-attention mechanisms incur high computational complexity ($O(N^2)$), leading to excessive computational overhead and memory usage, which hinders real-time performance and deployment feasibility. Given that large areas of flames and smoke typically manifest as smooth, continuous low-frequency signals in images, with their edges representing high-frequency components relative to the forest background, the fast Fourier transform (FFT) can naturally decompose signals into distinct frequency components. This enables the model to efficiently capture and enhance features critical for fire detection. To leverage frequency-domain information and the attention mechanism's ability to model relational dependencies, this study introduces the FSAS module. FSAS transforms the computationally expensive self-attention matrix multiplication from the spatial domain to the frequency domain, where the FFT enables an equivalent self-attention mechanism through highly efficient element-wise multiplication. FSAS significantly reduces the computational complexity of self-attention while preserving the Transformer architecture's powerful global modeling capabilities, thereby lowering resource consumption.
As illustrated in Figure 4, FSAS comprises three branches for computing a Query (Q), a Key (K), and a Value (V). Each branch learns different feature representations using 1 × 1 convolutions to adjust the input channel dimensions and 3 × 3 depth-wise convolutions to capture local features. The resulting Query and Key features undergo an FFT to obtain frequency-domain representations, from which attention scores are computed. These scores are transformed back to the spatial domain via an inverse FFT and multiplied with the Value branch to integrate global contextual information, producing an optimized output feature map. The FSAS module efficiently handles high-resolution inputs, enabling the detection of critical visual cues, such as early-stage small fire points or widespread smoke, while improving accuracy and reducing computational load in forest fire detection.
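A simplified PyTorch sketch of a frequency-domain attention block in this spirit is shown below. It keeps only the core idea of Figure 4 (Q/K/V produced by 1 × 1 and depth-wise convolutions, with attention computed via FFT-based element-wise multiplication); the layer normalization, patch partitioning, and other details of the original FSAS [22] are omitted, and all layer names are assumptions.

```python
import torch
import torch.nn as nn


class FSASLikeBlock(nn.Module):
    """Simplified frequency-domain self-attention block (a sketch, not the exact FSAS of [22])."""

    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convs adjust channels; 3x3 depth-wise convs capture local structure (cf. Figure 4)
        self.to_qkv = nn.Conv2d(channels, channels * 3, kernel_size=1, bias=False)
        self.dw_qkv = nn.Conv2d(channels * 3, channels * 3, kernel_size=3,
                                padding=1, groups=channels * 3, bias=False)
        self.project_out = nn.Conv2d(channels, channels, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.dw_qkv(self.to_qkv(x)).chunk(3, dim=1)
        # Element-wise multiplication in the frequency domain replaces the O(N^2)
        # spatial-domain attention product.
        q_f = torch.fft.rfft2(q.float())
        k_f = torch.fft.rfft2(k.float())
        attn = torch.fft.irfft2(q_f * k_f, s=q.shape[-2:])
        # Crude scale normalization (an assumption; the original module normalizes differently)
        attn = attn / (attn.abs().amax(dim=(-2, -1), keepdim=True) + 1e-6)
        out = attn * v                       # inject global context into the value branch
        return self.project_out(out) + x     # residual connection
```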

2.3. Optimization Strategy

2.3.1. FMPDIoU

To address the challenges of detecting dynamic, irregularly shaped flames and smoke in forest fire detection tasks, a model requires enhanced localization precision. The MPDIoU loss function transforms the loss computation during training by minimizing the distance between the predicted and ground-truth bounding boxes, accounting for factors such as the overlap area and center point distance [23]. This approach mitigates the issue of traditional IoU loss, which yields zero gradients when there is no overlap, enabling the model to learn more precise boundary identification. However, variations in detection scenarios and fire intensity lead to a broad distribution of sample data, causing the model to favor learning from simpler samples with moderately sized flames or smoke. To address this, Focaler-IoU introduces a dynamic scaling mechanism that prioritizes low-IoU samples, improving the balance between easy and difficult samples in regression outcomes [24]. In this study, we propose FMPDIoU, a composite loss function combining MPDIoU and Focaler-IoU, to stabilize the training process and enhance detection performance.
The MPDIoU loss function is defined as follows:
$w = \max(x_2, x_2^{gt}) - \min(x_1, x_1^{gt})$
$h = \max(y_2, y_2^{gt}) - \min(y_1, y_1^{gt})$
$d_1^2 = (x_1 - x_1^{gt})^2 + (y_1 - y_1^{gt})^2$
$d_2^2 = (x_2 - x_2^{gt})^2 + (y_2 - y_2^{gt})^2$
$L_{\mathrm{MPDIoU}} = \mathrm{IoU} - \dfrac{d_1^2}{w^2 + h^2} - \dfrac{d_2^2}{w^2 + h^2}$  (3)
where $(x_1, y_1)$ and $(x_2, y_2)$ are the coordinates of the predicted bounding box's top-left and bottom-right corners, $(x_1^{gt}, y_1^{gt})$ and $(x_2^{gt}, y_2^{gt})$ are the corresponding ground-truth coordinates, $w$ and $h$ are the width and height of the smallest region enclosing the predicted and ground-truth boxes, and IoU is the intersection over union. This formulation ensures non-zero gradients for non-overlapping boxes, improving localization precision.
To focus on diverse samples, Focaler-IoU employs an interval mapping approach to reconstruct the IoU loss, as shown below:
$\mathrm{IoU}^{focaler} = 1 - \mathrm{clamp}\!\left(\dfrac{\mathrm{IoU} - d}{u - d},\, 0,\, 1\right)$  (4)
where d and u denote the lower and upper bounds of the linear scaling range, which are set to 0.05 and 0.95 in this paper, respectively. The clamp(·) function ensures that the output remains within the interval [0, 1], emphasizing challenging samples during training.
FMPDIoU loss combines MPDIoU and Focaler-IoU to balance localization accuracy and sample difficulty, where
$L_{\mathrm{FMPDIoU}} = L_{\mathrm{MPDIoU}} + \mathrm{IoU} - \mathrm{IoU}^{focaler}$  (5)
FMPDIoU enhances forest fire detection by combining MPDIoU’s precise localization with Focaler-IoU’s focus on difficult samples. The proposed loss function stabilizes training and improves detection accuracy, making it a robust solution for real-world forest fire detection tasks.
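To make the combination concrete, the following is a minimal PyTorch sketch of an FMPDIoU-style box loss. It follows the building blocks above but writes the MPDIoU term in the conventional 1 − MPDIoU loss form and uses the standard Focaler mapping clamp((IoU − d)/(u − d), 0, 1) from [24]; these choices may differ in sign convention from Equations (3) and (4), and the reduction over boxes is an assumption rather than the exact training code of F3-YOLO.

```python
import torch


def fmpdiou_loss(pred: torch.Tensor, target: torch.Tensor,
                 d: float = 0.05, u: float = 0.95, eps: float = 1e-7) -> torch.Tensor:
    """Sketch of an FMPDIoU-style loss. pred/target: (N, 4) boxes as (x1, y1, x2, y2)."""
    # Plain IoU
    ix1 = torch.max(pred[:, 0], target[:, 0]); iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2]); iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # MPDIoU terms: squared distances between matching corners, normalized by the
    # squared width/height of the enclosing region (cf. Equation (3))
    w = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    h = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    d1 = (pred[:, 0] - target[:, 0]) ** 2 + (pred[:, 1] - target[:, 1]) ** 2
    d2 = (pred[:, 2] - target[:, 2]) ** 2 + (pred[:, 3] - target[:, 3]) ** 2
    mpdiou = iou - d1 / (w ** 2 + h ** 2 + eps) - d2 / (w ** 2 + h ** 2 + eps)
    loss_mpdiou = 1.0 - mpdiou                     # conventional loss form (assumption)

    # Focaler interval mapping and Focaler-style combination (cf. Equations (4)-(5))
    iou_focaler = torch.clamp((iou - d) / (u - d), min=0.0, max=1.0)
    return (loss_mpdiou + iou - iou_focaler).mean()
```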

2.3.2. Structured Pruning Strategy

To enhance the compatibility of our model with resource-constrained edge devices, we implemented a structured pruning strategy that targets redundant parameters in convolutional layers, drawing inspiration from established methods in model compression. Specifically, we adopted a channel pruning approach [25], which leverages the scaling factors from batch normalization (BN) layers to identify and eliminate less critical channels, thereby reducing computational overhead while preserving model performance.
The pruning process began with an initial training phase in which we introduced $L_1$-norm regularization. This regularization encouraged sparsity in the model weights, making it easier to identify channels with minimal contributions to overall performance. The scaling factor γ associated with each channel in the BN layers served as a proxy for the importance of that channel. Channels with γ values below a predefined threshold, determined through empirical analysis of the distribution of γ values across the network, were deemed redundant. In our implementation, we set this threshold to 0.1.
To ensure the pruned model retained its accuracy, we fine-tuned it on the training dataset for 100 epochs after pruning. This step involved retraining the model with a reduced learning rate so that the remaining parameters could adapt to the new architecture.
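The two ingredients of this strategy, the L1 sparsity penalty on BN scaling factors and the γ-threshold channel selection, can be sketched as follows. The penalty coefficient is an assumption, and the actual channel-removal surgery and fine-tuning are left to the pruning framework.

```python
import torch
import torch.nn as nn


def l1_bn_penalty(model: nn.Module, lam: float = 1e-4) -> torch.Tensor:
    """Sparsity term added to the training loss: lam * sum(|gamma|) over all BN scaling factors.
    The coefficient lam is an assumption; the paper only states that L1 regularization is used."""
    penalty = sum(m.weight.abs().sum() for m in model.modules() if isinstance(m, nn.BatchNorm2d))
    return lam * penalty


def prunable_channels(model: nn.Module, threshold: float = 0.1) -> dict:
    """Collect channels whose BN scaling factor gamma falls below the threshold (0.1 in this work)."""
    plan = {}
    for name, m in model.named_modules():
        if isinstance(m, nn.BatchNorm2d):
            mask = m.weight.detach().abs() < threshold
            if mask.any():
                plan[name] = mask.nonzero(as_tuple=True)[0].tolist()
    return plan  # channel indices to remove per BN layer; surgery and fine-tuning follow
```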

3. Experiment

3.1. Datasets

In this paper, we use the forest fire detection dataset introduced in [19], which consists of a total of 2603 images collected from various online sources. All visible instances of flames and smoke in these images were annotated, yielding more than 4000 labeled flame instances and approximately 2000 labeled smoke instances. The dataset covers a wide range of imaging conditions, spanning different times of day, geographical locations, and imaging methods, and includes day and night scenes as well as close-range and aerial views captured from varying distances. This diversity ensures comprehensive coverage of real-world variability, thereby enhancing the robustness of models trained on the dataset when deployed in complex and unconstrained environments. Furthermore, it supports the generalization of the experimental results to practical forest fire detection scenarios.
To ensure experimental consistency, we follow the data partitioning scheme proposed in [19]. Specifically, the dataset is divided into training, validation, and test sets in a ratio of 8:1:1, resulting in 2083 training images, 260 validation images, and 260 test images.
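A minimal sketch of such an 8:1:1 split is given below; in practice, the exact partition released with [19] should be reused so that the 2083/260/260 counts and image assignments match. The seed value is a placeholder.

```python
import random


def split_8_1_1(image_paths, seed: int = 0):
    """Sketch of an 8:1:1 train/val/test split; with 2603 images this yields 2083/260/260."""
    paths = sorted(image_paths)                  # deterministic starting order
    random.Random(seed).shuffle(paths)
    n_val = len(paths) // 10                     # 10% each for validation and test
    n_train = len(paths) - 2 * n_val             # remaining ~80% for training
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])
```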

3.2. Evaluation Metrics

To evaluate the effectiveness of the proposed improved model in the task of forest fire detection and to ensure its feasibility for real-world deployment, we conduct a comprehensive assessment based on four key metrics: mean Average Precision (mAP), frames per second (FPS), the number of parameters (Params), and floating-point operations (FLOPs). These metrics, respectively, reflect the model’s detection accuracy, inference speed, spatial complexity, and computational cost.
Mean Average Precision (mAP) quantifies the overall detection accuracy across all target categories. To compute mAP, the model’s precision (P) and recall (R) must first be determined, as defined in Equations (6) and (7):
$P = \dfrac{TP}{TP + FP}$  (6)
$R = \dfrac{TP}{TP + FN}$  (7)
where TP (true positive) denotes the number of correctly identified regions containing flames or smoke, FP (false positive) refers to regions incorrectly predicted to contain such targets, and FN (false negative) indicates ground-truth regions with flames or smoke that were not detected by the model.
Based on the computed precision and recall, the Average Precision (AP) for each class is derived. The mAP is then calculated as the mean of APs across all object categories:
$AP = \displaystyle\int_{0}^{1} P(R)\, \mathrm{d}R$  (8)
$\mathrm{mAP} = \dfrac{1}{C} \sum_{i=1}^{C} AP_i$  (9)
where C represents the number of object categories—two in this task: flame and smoke. In our experiments, we evaluate mAP under an IoU (Intersection over Union) threshold of 50%, denoted as mAP@50, to measure the model’s average detection accuracy for flame and smoke targets.
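The short sketch below mirrors Equations (6)-(9). The AP integral is approximated with a simple trapezoidal rule, whereas production mAP implementations typically interpolate the precision-recall curve, and all numeric values shown are illustrative only.

```python
import numpy as np


def precision_recall(tp: int, fp: int, fn: int):
    """P = TP / (TP + FP), R = TP / (TP + FN), as in Equations (6) and (7)."""
    return tp / (tp + fp), tp / (tp + fn)


def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """AP as the area under the precision-recall curve (Equation (8)), via a trapezoidal rule."""
    order = np.argsort(recall)
    r, p = recall[order], precision[order]
    return float(np.sum(np.diff(r) * (p[1:] + p[:-1]) / 2.0))


def mean_average_precision(ap_per_class: dict) -> float:
    """mAP as the mean of per-class APs (Equation (9)); here the classes are flame and smoke."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Illustrative values only, not results from the paper:
print(precision_recall(tp=80, fp=20, fn=25))                       # (0.8, 0.7619...)
print(average_precision(np.array([0.0, 0.5, 1.0]),
                        np.array([1.0, 0.8, 0.6])))                # 0.8
print(mean_average_precision({"flame": 0.70, "smoke": 0.60}))      # 0.65
```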
Frames Per Second (FPS) measures the model’s inference speed, which is defined as the number of images the model can process per second. The calculation is as follows:
$\mathrm{FPS} = \dfrac{N}{\sum_{i=1}^{N} t_i}$  (10)
where $N$ is the total number of test images and $t_i$ is the time taken to process the $i$-th image.
The number of parameters (Params) reflects the model’s spatial complexity, corresponding to the total number of trainable parameters. This metric directly influences memory consumption and storage requirements during deployment. Finally, the model’s computational complexity is evaluated in terms of floating-point operations (FLOPs), representing the total number of arithmetic operations required for a single forward pass. We report this value in GigaFLOPs (GFLOPs), i.e., billions of floating-point operations, to quantify the computational cost under typical forest fire detection scenarios.
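As a sketch of how these efficiency metrics can be measured, the snippet below times per-image inference according to the FPS formula above and counts trainable parameters. FLOPs are typically obtained with an external profiler (e.g., thop or fvcore), which is omitted here; the device handling and the absence of warm-up runs are simplifying assumptions.

```python
import time
import torch


def measure_fps(model: torch.nn.Module, images, device: str = "cuda") -> float:
    """FPS = N / sum(t_i): average single-image inference speed (warm-up runs omitted)."""
    model.eval().to(device)
    times = []
    with torch.no_grad():
        for img in images:                        # each img: tensor of shape (1, 3, H, W)
            img = img.to(device)
            if device.startswith("cuda"):
                torch.cuda.synchronize()
            t0 = time.perf_counter()
            model(img)
            if device.startswith("cuda"):
                torch.cuda.synchronize()
            times.append(time.perf_counter() - t0)
    return len(times) / sum(times)


def count_parameters_m(model: torch.nn.Module) -> float:
    """Params metric: total number of trainable parameters, reported in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
```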
By jointly considering these metrics, our proposed method achieves a favorable trade-off between accuracy and efficiency. It delivers high detection performance while maintaining a lightweight architecture and fast inference speed, making it well-suited for practical deployment in real-time forest fire monitoring systems.

3.3. Experimental Environment and Configuration

3.3.1. Environment

The main configuration of the training environment used in this paper is summarized in Table 1. All experiments were conducted using PyTorch as the deep learning framework, with model training accelerated by the NVIDIA CUDA parallel computing platform on a GPU.

3.3.2. Training Configuration

Models were trained for 300 epochs in total, comprising 200 epochs of sparsity-constrained training and 100 epochs of post-pruning fine-tuning, with a fixed random seed and a batch size of 64. Following the default configuration of YOLOv12, the training phase adopts the automatically selected AdamW optimizer with an initial learning rate of 0.0016, which is progressively decayed to 0.01 times its original value during training. The momentum is set to 0.937, and the weight decay coefficient is 0.0005. A 3-epoch warm-up strategy is employed to facilitate convergence, during which the momentum is initialized at 0.8 and the learning rate is set to 10% of its original value. Standard YOLO data augmentation techniques are used, with a MixUp probability of 0.2, a copy-paste augmentation rate of 0.3, and an image scaling ratio of 0.7. All other hyperparameters remain consistent with the default YOLOv12 configuration. During the fine-tuning stage after pruning, the learning rate is set to 0.0005, and the strength of the aforementioned data augmentation techniques is reduced by half.
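For reference, the hedged sketch below shows how these settings could map onto an Ultralytics-style training call. The argument names follow the public Ultralytics API, while the model and dataset configuration files ("yolo12n.yaml", "fire.yaml") and the seed value are placeholders, not files released with this paper.

```python
from ultralytics import YOLO

# Constrained-training stage; the 100-epoch fine-tuning after pruning uses lr0=0.0005
# and halved augmentation strengths (not shown here).
model = YOLO("yolo12n.yaml")          # placeholder model config
model.train(
    data="fire.yaml",                 # placeholder dataset config
    epochs=200,
    batch=64,
    seed=0,                           # a fixed seed is used; the exact value is not stated
    optimizer="AdamW",                # the text states AdamW is auto-selected
    lr0=0.0016, lrf=0.01,             # initial learning rate and final LR fraction
    momentum=0.937, weight_decay=0.0005,
    warmup_epochs=3, warmup_momentum=0.8,
    mixup=0.2, copy_paste=0.3, scale=0.7,
)
```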

3.4. Experimental Results

3.4.1. Comparison Experiment

To demonstrate the feasibility and superiority of the proposed F3-YOLO model in real-time forest fire detection, we conduct a comprehensive comparative evaluation against several mainstream lightweight models in the YOLO series. The comparison includes YOLOv7-tiny [26], YOLOv8n [27], YOLOv10n [28], YOLOv11n [29], YOLOv12n [20], YOLOv12s [20], and the fast aerial image detection model FBRT-YOLO [30], as well as state-of-the-art (SOTA) models tailored for forest fire detection, namely YOLO-LFD [14], DSS-YOLO [18], and YOLOv8n-SMMP [19]. The performance results of these models are summarized in Table 2.
As shown in Table 2, the native lightweight YOLO series models perform well in the forest fire detection task, with YOLOv12n achieving the highest detection accuracy among them (64.4%). Although FBRT-YOLO has the smallest number of parameters (0.9 M), it also exhibits the lowest detection accuracy (59.4%) and does not offer advantages in terms of computational complexity. Overall, its performance is inferior to that of specialized models designed for forest fire detection, such as YOLO-LFD, DSS-YOLO, and YOLOv8n-SMMP. F3-YOLO outperforms all compared models, achieving the highest mAP@50 of 68.5% and the lowest computational complexity of 4.7 GFLOPs while maintaining a relatively small parameter count of 2.6 M, demonstrating its effectiveness. Compared to the baseline YOLOv12n, F3-YOLO delivers comprehensive improvements, with a 4.1 percentage point increase in detection accuracy, a 27.7% decrease in computational cost, and an unchanged parameter count.
To obtain more intuitive results, Figure 5 presents a visual comparison of detection performance in various scenarios from the test set, involving the baseline model YOLOv12n, state-of-the-art models specifically designed for forest fire detection, and the proposed F3-YOLO. As illustrated in the figure, the first row demonstrates detection performance in a close-range, large-area flame scenario. YOLO-LFD and DSS-YOLO detected only portions of the flames, while YOLOv12n successfully detected the complete flames but exhibited redundant detections; YOLOv8n-SMMP and F3-YOLO achieved the best performance in this scenario. The second row presents flame detection results under smoke and tree coverage. F3-YOLO successfully detected all the smoke and flames present in the scene, whereas YOLOv12n, YOLO-LFD, and DSS-YOLO recognized only the concentrated smoke in the upper part of the image. Although YOLOv8n-SMMP detected the dispersed smoke at the bottom, it failed to identify the flame regions obscured by the smoke. The third row illustrates detection performance in a low-light environment with distant, widely distributed flames. All models performed well in this scenario, with F3-YOLO exhibiting the fewest missed flame detections. The fourth row shows smoke detection results in a distant scenario without obvious flames, where only F3-YOLO successfully detected the faint smoke located at the center of the image. The fifth row demonstrates a forest fire scenario under large-area smoke conditions. All models detected the smoke well; however, YOLOv12n and YOLOv8n-SMMP failed to detect the small flames obscured by the smoke. The sixth row presents detection results under hazy weather conditions. While all models successfully captured the prominent flame regions, the smoke was less distinct due to reduced visibility. In this challenging scenario, only YOLOv8n-SMMP and F3-YOLO managed to detect the smoke, with F3-YOLO demonstrating superior performance by covering a larger extent of the smoke region and producing higher confidence scores. This highlights F3-YOLO's enhanced robustness in low-visibility environments and its improved capability to recognize subtle, diffused fire-related features. Across the above scenarios, F3-YOLO not only achieved the highest detection accuracy but also produced the highest confidence scores, reflecting its superior robustness.

3.4.2. Ablation Experiment

To comprehensively evaluate the impact of each component of F3-YOLO, we incrementally integrated CondConv, FSAS, the FMPDIoU loss function, and model pruning into the YOLOv12n baseline to conduct an ablation experiment. Table 3 illustrates the effects of these modules on detection accuracy, parameter count, computational complexity, and inference speed.
As shown in Table 3, each module contributes to improved detection accuracy, with the best performance achieved when all components are combined. The baseline model, YOLOv12n, yields an mAP@50 of 64.4%, with 2.6 M parameters, 6.5 GFLOPs, and an inference speed of 234 FPS, reflecting its robust performance in lightweight design.
By replacing standard convolutions with CondConv, the model's accuracy improved to 66.3%. CondConv dynamically generates convolutional kernel weights for each input, enabling the model to adaptively adjust feature extraction based on the input image, thereby enhancing expressive capacity and robustness. Although CondConv's mixture-of-experts convolution increases the parameter count, its application to intermediate-depth layers of the backbone limits this increase to 3.0 M parameters. Due to CondConv's dynamic activation, the total computational load decreased to 5.3 GFLOPs, and the inference speed rose to 250 FPS. This shows that CondConv not only improves detection accuracy but also remains well suited to edge devices. Building on CondConv, the FSAS module was introduced to extract high-frequency and low-frequency components. By employing a frequency-domain attention mechanism, FSAS enhances the detection of salient structures and edge information, improving the model's ability to capture the complex, dynamic shapes and textures of flames and smoke and raising the mAP@50 to 67.1%. FSAS performs frequency-domain transformations to extract global statistical information, retaining only low-frequency components for attention weight generation, which reduces the model's computational complexity to 5.0 GFLOPs without significantly affecting the parameter count (remaining at 3.0 M). Although the limited efficiency of cross-domain (FFT) transformations on CUDA slightly reduced inference speed to 241 FPS (a 3.6% decrease), the reduced computational overhead and improved detection accuracy highlight FSAS's viability for real-world forest fire detection. Further improvements were achieved by incorporating the FMPDIoU loss function and applying pruning and fine-tuning to optimize the training strategy. The FMPDIoU loss function refines the training process by focusing on hard-to-classify samples and enhancing target localization, increasing the mAP@50 to 68.1% without additional parameters or computational cost. Structured pruning was then applied to compress redundant channels, offsetting the parameter increase from CondConv (returning to 2.6 M parameters) and further reducing the computational load to 4.7 GFLOPs while improving inference speed to 254 FPS. Post-pruning fine-tuning restored and further enhanced accuracy, achieving a final mAP@50 of 68.5%.

4. Discussion

4.1. Attention Map Analysis

To further illustrate the superior detection capabilities of the proposed F3-YOLO model, we provide attention visualization heatmaps for both the baseline YOLOv12n and F3-YOLO, as shown in Figure 6. Specifically, Figure 6a shows the original scene of a forest fire, where multiple flame and smoke regions are visible on a hillside, illustrating the complexity and interference of real-world detection environments. Figure 6b shows the attention heatmap produced by YOLOv12n. The activations are relatively scattered, indicating that the model attends not only to the fire regions but also to background areas such as soil and the sky. This suggests that YOLOv12n can be distracted by irrelevant features, potentially leading to missed detections or false positives. Figure 6c presents the attention heatmap generated by F3-YOLO. Compared to YOLOv12n, the attention is more concentrated and well localized on the actual flame and smoke regions. This focused activation demonstrates that F3-YOLO is better at capturing discriminative features of fire and smoke, even in the presence of visual clutter and overlapping objects. Meanwhile, the heatmaps also reveal that the model tends to focus on bright-colored regions in the image, particularly orange-red areas resembling flames, such as the clothing in this example. When such regions are large and irregularly shaped, the model may misclassify them as flames, leading to false positives. A higher input resolution is therefore suggested, as it would allow the model to capture clearer semantic details for accurate discrimination.

4.2. Hyperparameter Analysis

To quantify the impact of the two critical hyperparameters of FMPDIoU, we conducted an ablation study on the dataset, as shown in Figure 7. Initially, we fixed the upper threshold u at 0.95 and incrementally increased the lower threshold d. The results illustrated in Figure 7a show that the mAP@50 peaked at 68.5% when d = 0.05, an improvement over the 68.3% obtained at d = 0.00. This suggests that appropriately filtering low-IoU noisy samples enhances the model's ability to focus on effective training samples. However, as d increased to 0.10, 0.15, and 0.20, the mAP@50 progressively declined to 68.1%, 67.7%, and 66.3%, respectively. These results indicate that an excessively high lower threshold suppresses positive samples, thereby hindering the learning process. Subsequently, we fixed the lower threshold at d = 0.05 and varied the upper threshold u. The results presented in Figure 7b demonstrate that the mAP@50 reached its highest value of 68.5% at u = 0.95, whereas at u = 0.90 and u = 1.00 it was slightly lower at 68.1%. This indicates that retaining high-IoU samples is crucial for maintaining robust performance, but excessively high or low u values may degrade the quality of the supervision signal. Based on these results, the configuration of d = 0.05 and u = 0.95 effectively balances the inclusion of challenging and reliable samples, optimizing performance for forest fire detection scenarios. Consequently, this configuration, which preserves hard positives while emphasizing high-quality samples, is adopted as the default setting for F3-YOLO.
To intuitively illustrate the impact of FMPDIoU, we present a comparative visualization of detection results with and without FMPDIoU. As shown in Figure 8, when FMPDIoU is incorporated, the model places greater emphasis on high-quality samples during training, effectively enhancing the learning of discriminative features. This leads to a significant reduction in false positives, particularly in challenging cases where non-flame regions (e.g., reddish terrain) resemble flames in color or texture. By refining localization supervision through adaptive sample weighting, FMPDIoU guides the model to focus on more reliable regions, thereby improving both detection accuracy and robustness in complex real-world scenarios.

5. Conclusions

This paper presents a robust and fast forest fire detection model (F3-YOLO), an optimized adaptation of the YOLOv12 framework designed to address the challenges of forest fire detection in complex environments. We integrate the CondConv module and the FSAS module to strengthen the model's feature representation capabilities. Furthermore, we propose the FMPDIoU loss function to effectively handle the irregular and blurred boundaries resulting from vegetation occlusion, flame jitter, or smoke dispersion. To enable efficient deployment on edge devices, we also apply structured pruning to reduce computational overhead. Experimental results demonstrate that F3-YOLO outperforms YOLOv12 and other state-of-the-art methods in accuracy while incurring lower computational overhead, establishing a new benchmark for real-time forest fire detection and providing a robust and practical solution for forest fire detection systems. Thanks to its lightweight architecture, our model can be efficiently deployed on both backend servers and edge devices, such as surveillance cameras or embedded systems. This enables real-time inference with low latency, allowing rapid feedback of detection results. The reduced computational and memory footprint makes it particularly suitable for resource-constrained environments, facilitating scalable and responsive forest fire monitoring in both centralized and on-site deployment scenarios.

Author Contributions

Conceptualization, P.Z.; data curation, X.Z., X.Y. and Z.Z.; formal analysis, P.Z., X.Z., X.Y. and Z.Z.; funding acquisition, C.B. and L.Z.; investigation, P.Z., X.Z., X.Y. and Z.Z.; methodology, P.Z.; project administration, C.B. and L.Z.; software, P.Z.; supervision, C.B. and L.Z.; validation, X.Y. and Z.Z.; visualization, X.Z.; writing—original draft, P.Z.; writing—review and editing, L.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Jiangsu Province College Students’ Innovation and Entrepreneurship Training Program Project (grant number 202410298174Y) and the Natural Science Foundation of Jiangsu Province (grant number BK20220414).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets analyzed in this study are not publicly available due to privacy and ethical restrictions. De-identified data can be made available upon reasonable request from the corresponding authors.

Conflicts of Interest

Author Ziqian Zhang was employed by the company NARI Information & Communication Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
F3-YOLO: A robust and fast forest fire detection model based on YOLO
CondConv: Conditionally parameterized convolutions
FFT: Fast Fourier transform
FSAS: Frequency domain-based self-attention solver
FMPDIoU: Focaler minimum point distance intersection over union loss
FPS: Frames per second
GFLOPs: Giga floating-point operations

References

  1. Anderegg, W.R.L.; Trugman, A.T.; Badgley, G.; Anderson, C.M.; Bartuska, A.; Ciais, P.; Cullenward, D.; Field, C.B.; Freeman, J.; Goetz, S.J.; et al. Climate-Driven Risks to the Climate Mitigation Potential of Forests. Science 2020, 368, eaaz7005. [Google Scholar] [CrossRef] [PubMed]
  2. Dang-Ngoc, H.; Nguyen-Trung, H. Aerial Forest Fire Surveillance—Evaluation of Forest Fire Detection Model Using Aerial Videos. In Proceedings of the 2019 International Conference on Advanced Technologies for Communications (ATC), Ho Chi Minh City, Vietnam, 17–19 October 2019; pp. 142–148. [Google Scholar]
  3. Güney, C.O.; Mert, A.; Gülsoy, S. Assessing Fire Severity in Turkey’s Forest Ecosystems Using Spectral Indices from Satellite Images. J. For. Res. 2023, 34, 1747–1761. [Google Scholar] [CrossRef]
  4. Zhang, J.; Li, W.; Yin, Z.; Liu, S.; Guo, X. Forest Fire Detection System Based on Wireless Sensor Network. In Proceedings of the 2009 4th IEEE Conference on Industrial Electronics and Applications, Xi’an, China, 25–27 May 2009; pp. 520–523. [Google Scholar]
  5. Chen, T.-H.; Wu, P.-H.; Chiou, Y.-C. An Early Fire-Detection Method Based on Image Processing. In Proceedings of the 2004 International Conference on Image Processing (ICIP), Singapore, 24–27 October 2004; Volume 3, pp. 1707–1710. [Google Scholar]
  6. Chino, D.Y.T.; Avalhais, L.P.S.; Rodrigues, J.F.; Traina, A.J.M. BowFire: Detection of Fire in Still Images by Integrating Pixel Color and Texture Analysis. In Proceedings of the 28th SIBGRAPI Conference on Graphics, Patterns and Images, Salvador, Brazil, 26–29 August 2015; pp. 95–102. [Google Scholar]
  7. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. [Google Scholar] [CrossRef]
  8. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  9. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  10. Kim, B.; Lee, J. A Video-Based Fire Detection Using Deep Learning Models. Appl. Sci. 2019, 9, 2862. [Google Scholar] [CrossRef]
  11. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 8–16 October 2016; pp. 21–37. [Google Scholar]
  12. Jiang, P.; Ergu, D.; Liu, F.; Cai, Y.; Ma, B. A Review of YOLO Algorithm Developments. Procedia Comput. Sci. 2022, 199, 1066–1073. [Google Scholar] [CrossRef]
  13. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  14. Wang, H.; Zhang, Y.; Zhu, C. YOLO-LFD: A Lightweight and Fast Model for Forest Fire Detection. Comput. Mater. Contin. 2025, 82, 1–10. [Google Scholar] [CrossRef]
  15. Liu, H.; Zhu, J.; Xu, Y.; Xie, L. MCAN-YOLO: An Improved Forest Fire and Smoke Detection Model Based on YOLOv7. Forests 2024, 15, 1781. [Google Scholar] [CrossRef]
  16. Wang, Z.; Xu, L.; Chen, Z. FFD-YOLO: A Modified YOLOv8 Architecture for Forest Fire Detection. Signal Image Video Process. 2025, 19, 265. [Google Scholar] [CrossRef]
  17. Yun, B.; Zheng, Y.; Lin, Z.; Li, T. FFYOLO: A Lightweight Forest Fire Detection Model Based on YOLOv8. Fire 2024, 7, 93. [Google Scholar]
  18. Wang, H.; Fu, X.; Yu, Z.; Zeng, Z. DSS-YOLO: An Improved Lightweight Real-Time Fire Detection Model Based on YOLOv8. Sci. Rep. 2025, 15, 8963. [Google Scholar]
  19. Zhou, N.; Gao, D.; Zhu, Z. YOLOv8n-SMMP: A Lightweight YOLO Forest Fire Detection Model. Fire 2025, 8, 183. [Google Scholar]
  20. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
  21. Yang, B.; Bender, G.; Le, Q.V.; Ngiam, J. CondConv: Conditionally Parameterized Convolutions for Efficient Inference. Adv. Neural Inf. Process. Syst. 2019, 32, 1–11. [Google Scholar]
  22. Kong, L.; Dong, J.; Ge, J.; Li, M.; Pan, J. Efficient Frequency Domain-Based Transformers for High-Quality Image Deblurring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 5886–5895. [Google Scholar]
  23. Ma, S.; Xu, Y. MPDIoU: A Loss for Efficient and Accurate Bounding Box Regression. arXiv 2023, arXiv:2307.07662. [Google Scholar]
  24. Zhang, H.; Zhang, S. Focaler-IoU: More Focused Intersection over Union Loss. arXiv 2024, arXiv:2401.10525. [Google Scholar]
  25. Liu, Z.; Li, J.; Shen, Z.; Huang, G.; Yan, S.; Zhang, C. Learning Efficient Convolutional Networks through Network Slimming. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2736–2744. [Google Scholar]
  26. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  27. Varghese, R.; Sambath, M. YOLOv8: A Novel Object Detection Algorithm with Enhanced Performance and Robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Bangalore, India, 21–22 March 2024; pp. 1–6. [Google Scholar]
  28. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. YOLOv10: Real-Time End-to-End Object Detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar]
  29. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar]
  30. Xiao, Y.; Xu, T.; Xin, Y.; Li, J. FBRT-YOLO: Faster and Better for Real-Time Aerial Image Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Denver, CO, USA, 25–28 February 2025; Volume 39, pp. 8673–8681. [Google Scholar]
Figure 1. The YOLOv12 architecture consists of three parts: the backbone for feature extraction, the neck for multi-scale fusion, and the head for classification and regression.
Figure 2. The architecture of F3-YOLO.
Figure 3. Structure of the CondConv module. CondConv achieves mathematical equivalence with the mixture of expert methods by making the convolutional kernel parameters dependent on the input, while only a single convolution operation is needed.
Figure 4. Structure of the FSAS module. DConv denotes depth-wise separable convolutions, while FFT and IFFT refer to the Fourier transform and inverse Fourier transform, respectively.
Figure 5. Visual comparison of detection performance in various scenarios.
Figure 6. Comparison of attention maps in a wildfire scene: (a) the original image, (b) the activation heatmap of YOLOv12n, and (c) the activation heatmap of F3-YOLO.
Figure 7. Influence of hyperparameter settings in FMPDIoU on detection accuracy: (a) the effect of lower threshold d with fixed upper threshold u = 0.95 and (b) the effect of upper threshold u with fixed lower threshold d = 0.05 .
Figure 8. Visualization of detection results with and without FMPDIoU.
Table 1. Experimental environment configuration.
Experimental Environment | Type
CPU | Intel(R) Xeon(R) Platinum 8255C
GPU | RTX 4090 (24 GB)
Operating system | Linux (Ubuntu 20.04)
Python version | Python 3.12
Deep learning framework | PyTorch 2.3.0
CUDA version | CUDA 12.1
Table 2. Results of comparative experiments. Results for the metrics of mAP@50 (%), parameters (M), and GFLOPs are presented. The best values are highlighted in bold, and the second-best ones are marked with an underline.
Model | mAP@50 (%) | Params/10^6 | GFLOPs
YOLOv7-tiny | 62.4 | 6.02 | 13.2
YOLOv8n | 64.3 | 3.2 | 8.9
YOLOv10n | 63.5 | 2.3 | 6.7
YOLOv11n | 63.7 | 2.6 | 6.6
YOLOv12n | 64.4 | 2.6 | 6.5
YOLOv12s | 64.7 | 9.3 | 21.7
FBRT-YOLO | 59.4 | 0.9 | 6.7
YOLO-LFD | 65.5 | 3.7 | 7.8
DSS-YOLO | 66.2 | 3.2 | 7.9
YOLOv8n-SMMP | 67.5 | 2.1 | 5.4
F3-YOLO | 68.5 | 2.6 | 4.7
Table 3. Results of ablation experiments.
YOLOv12n | CondConv | FSAS | FMPDIoU | Prune | mAP@50 (%) | Params/10^6 | GFLOPs | FPS
✓ | - | - | - | - | 64.4 | 2.6 | 6.5 | 234
✓ | ✓ | - | - | - | 66.3 | 3.0 | 5.3 | 250
✓ | ✓ | ✓ | - | - | 67.1 | 3.0 | 5.0 | 241
✓ | ✓ | ✓ | ✓ | - | 68.1 | 3.0 | 5.0 | 243
✓ | ✓ | ✓ | ✓ | ✓ | 68.5 | 2.6 | 4.7 | 254