1. Introduction
Forest fires are sudden, destructive natural disasters that are extremely difficult to contain and extinguish [1]. The catastrophic Daxing’an Mountain range fire of 6 May 1987 consumed a total area of 17,000 km² (including portions across the border), devastated 1.01 million hectares of forest within China, claimed 211 lives, injured 266 people, displaced more than 10,000 households, and left over 50,000 residents homeless. Direct economic losses exceeded RMB 500 million, with indirect losses reaching RMB 6.913 billion. On 7 January 2025, fueled by the Santa Ana winds, wildfires erupted across southern California, becoming one of the most destructive natural disasters in U.S. history; preliminary estimates place the damage and economic toll between USD 250 and 275 billion. Forest fires have long posed an enormous threat to forest resources and human lives. In its earliest stages, a fire’s source is typically small and easy to overlook; if missed, however, it can spread with alarming speed. The rapid and accurate detection of nascent fires, followed by immediate countermeasures, can drastically reduce losses. Developing a fast and effective fire detection system is therefore imperative: timely early warning at the incipient stage of a fire is key to minimizing its overall impact.
Every year, thousands of forest fires erupt worldwide, causing immeasurable losses [2]. Detection of these blazes has traditionally depended on ground patrols, watchtowers, sensor networks, and satellite remote sensing [3]. Owing to limitations in detection performance, economic cost, and practical operability, these conventional methods often fail to detect fires in time and can miss fire events entirely. They also struggle to meet the dual demands of high monitoring precision and complete spatial coverage over vast forested areas.
Most traditional forest fire detection algorithms based on image processing rely on hand-crafted features, such as color, motion, and texture, to delineate flame regions. Sousa et al. [4] present a real-time UAV-based system that detects forest fires in the RGB and YCbCr color spaces, coupling this with an intuitive geolocation module to pinpoint fire coordinates. Zhong et al. [5] introduce Wi-Fire, a device-free detection framework that leverages Channel State Information (CSI) from commercial Wi-Fi equipment, using RF signal fluctuations across existing wireless infrastructure to sense fire events. Zhao et al. [6] build a Gaussian-mixture model to segment candidate flame regions within single images before analyzing temporal variations in color, texture, roundness, area, and contour; these statistics are combined with wavelet-based flicker frequencies extracted from flame-boundary Fourier descriptors. Chino et al. [7] propose a still-image fire detection scheme that fuses color-based classification with the texture analysis of super-pixel regions to improve accuracy.
However, these shallow features often fail to sufficiently characterize complex forest fire scenes, making feature extraction challenging. In recent years, the rapid development of computer vision has provided better solutions for fire detection against complex backgrounds, and deep learning-based detection methods have gradually become a research focus. Object detection algorithms currently applied to fire detection include SSD [8], R-CNN [9], Faster R-CNN [10], and the YOLO series.
Li [11] introduced PDAM-STPNNet, a forest smoke detection network that leverages a Parallel Dual-Attention Mechanism (PDAM) to encode both local and global textures of symmetric smoke plumes, and a Small-scale Transformer Feature Pyramid Network (STPN) to markedly boost the model’s capacity to spot tiny smoke objects. Li et al. [12] proposed a high-precision, edge-focused smoke detection network featuring an SMWE module and a Guillotine Feature Pyramid Network (GFPN), which enhances anti-interference capability and mitigates missed detections. Cao [13] augmented an improved YOLOv5 model with a plug-and-play global attention mechanism, designed a re-parameterized convolution module, and used a decoupled detection head to accelerate convergence; a weighted bidirectional feature pyramid network (BiFPN) [14] was introduced to fuse local feature information, and the Complete IoU (CIoU) loss function was used to optimize the multi-task loss across different types of forest fires. Ma et al. [15] devised a hybrid receptive-field extraction module by integrating a 2D selective scanning mechanism with residual multi-branch structures; they also introduced a dynamic-enhanced downsampling module and a scale-weighted fusion module, replacing SiLU with Mish activation to better capture flame boundaries and faint smoke textures. Soundararajan et al. [16] combined DeepLabV3+ with an EfficientNet-B08 backbone in a deep learning framework that uses satellite imagery to address deforestation and wildfire detection; through advanced multi-scale feature extraction and group normalization, the system delivers robust semantic segmentation even under challenging atmospheric conditions and complex forest structures.
Currently, the YOLO series demonstrates significant advantages in real-time object detection but remains somewhat inferior to two-stage detectors in small-object detection. Moreover, early-stage forest fire smoke and flames are relatively small and easily obscured by vegetation, making them difficult to identify and prone to missed and false detections. The scale differences between fire and smoke in images are also substantial, so the improvements must account for multi-scale detection. Given that the deployment target is a fixed edge device, this study proposes an early forest fire detection algorithm, SFGI-YOLO, to address these challenges. Compared to YOLO11, SFGI-YOLO incorporates the following improvements: (1) a small-object detection head (P2) that exploits shallower feature maps to complement the existing heads across scales; (2) a Feature Enhancement Module (FEM) with a multi-branch structure and dilated convolutions to enrich small-target features and enlarge the receptive field; (3) GhostConv in place of standard Conv layers to reduce parameters and computational cost; and (4) a C3k2_IDC module that combines Inception depthwise convolution with C3k2 to process local and global features in parallel branches while maintaining a large receptive field.
The remainder of this paper is structured as follows:
Section 2 introduces the datasets used in this study as well as the methods and modules employed in the experiments.
Section 3 presents the experimental results, ablation studies, and comparative experiments.
Section 4 discusses and analyzes the model, taking into account its limitations and future work.
Section 5 provides a summary of this research.
3. Results
3.1. Experimental Environment
The configuration of the software and hardware in the experimental environment is presented in Table 1, while the settings for the algorithm training parameters are provided in Table 2.
3.2. Evaluation Criteria
To evaluate the performance of the algorithm, the following evaluation metrics are adopted: Precision, Recall, mean Average Precision at an IoU threshold of 0.5 (mAP@0.5), mean Average Precision averaged over IoU thresholds from 0.5 to 0.95 (mAP@0.5–0.95), the number of parameters, GFLOPs, and FPS.
In calculating Precision, Recall, and mAP, the following quantities are used: True Positives (TP), smoke or flames correctly predicted as such; True Negatives (TN), non-smoke and non-flame regions correctly predicted as such; False Negatives (FN), smoke or flames incorrectly predicted as non-smoke and non-flames; and False Positives (FP), non-smoke and non-flame regions incorrectly predicted as smoke or flames.
Precision indicates the proportion of correctly predicted samples among those predicted as smoke and flames. The calculation formula is as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Recall represents the proportion of smoke and flame samples correctly predicted as such among all smoke and flame samples. The calculation formula is as follows:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
To calculate mAP, it is necessary to first compute the Average Precision (AP). AP integrates Precision and Recall for a comprehensive assessment, and mAP is the mean of the AP across categories. The formulas for AP and mAP are as follows:

$$AP_i = \int_0^1 P_i(R)\,\mathrm{d}R, \qquad mAP = \frac{1}{n}\sum_{i=1}^{n} AP_i$$
where $n$ represents the number of categories ($n = 2$ in this design, with category 0—fire and category 1—smoke), $P_i(R)$ is the precision of category $i$ at recall $R$, and $AP_i$ is the average precision of category $i$.
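For concreteness, the following is a minimal NumPy sketch of these metrics under the common all-point interpolation scheme; the function names and the interpolation choice are illustrative assumptions rather than the exact evaluation code used in these experiments.

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    """Precision and Recall from raw detection counts."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recalls: np.ndarray, precisions: np.ndarray) -> float:
    """Area under the precision-recall curve (all-point interpolation).
    `recalls` must be sorted in ascending order."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # Make the precision envelope monotonically non-increasing.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas wherever recall increases.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class) -> float:
    """mAP is the mean of per-class AP; here n = 2 (fire, smoke)."""
    return float(np.mean(ap_per_class))
```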
FPS denotes frames per second, which reflects the computational speed of the algorithm and is used to evaluate its real-time performance; all algorithms in this experiment were run on the same GPU at a consistent utilization rate. “Parameters” indicates the number of model parameters, and GFLOPs denotes giga floating-point operations, a common measure of a model’s computational cost; both should be as small as possible for deployment on embedded devices.
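FPS can be estimated by timing warmed-up forward passes on a fixed-size input. The sketch below assumes a PyTorch model and a 640 × 640 input; the batch size, warm-up count, and iteration count are illustrative choices.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model: torch.nn.Module, imgsz: int = 640,
                warmup: int = 20, iters: int = 200,
                device: str = "cuda") -> float:
    """Estimate inference FPS with GPU-synchronized timing."""
    model = model.to(device).eval()
    dummy = torch.randn(1, 3, imgsz, imgsz, device=device)
    for _ in range(warmup):      # warm up CUDA kernels / cuDNN autotuning
        model(dummy)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(dummy)
    torch.cuda.synchronize()     # wait for all queued GPU work to finish
    return iters / (time.perf_counter() - start)
```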
During training, detection confidences span the full range from 0 to 1; at test time, the confidence threshold was set to 0.3 to reduce the probability of missed and false detections.
During training, it was observed that an excessive number of epochs could lead to overfitting; after testing, 200 epochs were ultimately selected. In addition, early stopping was applied, halting training when mAP@50–95 showed no improvement for 100 epochs, with the checkpoint achieving the best mAP@50–95 retained as the final model.
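Assuming the standard Ultralytics training interface (the dataset YAML path here is a placeholder), the epoch budget, early-stopping patience, and test-time confidence threshold described above map onto its arguments roughly as follows:

```python
from ultralytics import YOLO

model = YOLO("yolo11n.yaml")  # baseline config; SFGI-YOLO would use a custom YAML

# 200 epochs with early stopping once the validation metric fails to
# improve for 100 consecutive epochs; the best checkpoint is kept.
model.train(data="forest_fire.yaml", epochs=200, patience=100)

# Inference with the confidence threshold raised to 0.3 to
# suppress low-confidence false detections.
results = model.predict("test_images/", conf=0.3)
```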
3.3. Ablation Test
To explore the performance gains of SFGI-YOLO, six groups of ablation experiments were conducted; the results are shown in Table 3. These experiments assess the feasibility and effectiveness of each improved module.
In Exp 1, the baseline YOLO11 already demonstrated good performance; following the introduction of the new modules, performance was further enhanced.
Comparisons revealed that the inclusion of detection head P2 and the FEM led to improvements in Precision by 0.8 and 0.9, Recall by 0.9 and 0.1, mAP50 by 0.9 and 0.2, and mAP50–95 by 0.6 for both, respectively. However, this resulted in an increase in parameters and GFLOPs, with parameters rising by 0.1 and 0.8, GFLOPs increasing by 3.9 and 7.3, and a decrease in FPS by 28.5 and 36.5. This demonstrates that while the introduction of detection head P2 and the FEM increases the computational load and parameter count, they significantly enhance the model’s ability to detect flames and smoke, effectively reducing the chances of missed detections and false alarms.
In Exp 3, when replacing the model’s Conv with GhostConv, the model’s Precision improved by 0.8, Recall decreased by 0.5, mAP50 dropped by 0.2, mAP50–95 increased by 0.1, parameters decreased by 0.3, GFLOPs reduced by 0.8, and FPS dropped by 10.1. It is evident that GhostConv is advantageous for reducing the parameter and computational load of the module, with minimal impact on other evaluation metrics, thus mitigating the increase in parameters and computation caused by the introduction of additional modules.
When the C3k2 module was replaced with C3k2_IDC, a comparison with Exp 1 showed that Precision increased by 1.4, Recall decreased by 0.2, mAP50 rose by 0.2, mAP50–95 remained unchanged, parameters increased by 0.3, GFLOPs increased by 1.3, and FPS decreased by 10.1. The C3k2_IDC module processes local and global features through four branches, maintaining a large receptive field and enhancing the detection capability for small objects and multiple scales, effectively improving the detection ability for flames and smoke across different scales.
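The four-branch split described here follows the Inception depthwise convolution (IDC) design of InceptionNeXt. The PyTorch sketch below is a minimal illustration; the channel split ratio and kernel sizes are assumptions rather than the exact SFGI-YOLO settings.

```python
import torch
import torch.nn as nn

class InceptionDWConv2d(nn.Module):
    """Four parallel branches: identity, square depthwise conv, and two
    orthogonal band depthwise convs that enlarge the receptive field."""
    def __init__(self, channels: int, square_k: int = 3,
                 band_k: int = 11, branch_ratio: float = 0.125):
        super().__init__()
        gc = int(channels * branch_ratio)  # channels per conv branch
        self.dw_hw = nn.Conv2d(gc, gc, square_k, padding=square_k // 2, groups=gc)
        self.dw_w = nn.Conv2d(gc, gc, (1, band_k), padding=(0, band_k // 2), groups=gc)
        self.dw_h = nn.Conv2d(gc, gc, (band_k, 1), padding=(band_k // 2, 0), groups=gc)
        self.splits = (channels - 3 * gc, gc, gc, gc)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_id, x_hw, x_w, x_h = torch.split(x, self.splits, dim=1)
        return torch.cat((x_id, self.dw_hw(x_hw),
                          self.dw_w(x_w), self.dw_h(x_h)), dim=1)
```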
A comprehensive comparison of Exp 1, Exp 6, Exp 7, and Exp 8 shows that the smoke AP50 started at a high 99.4% and improved by a further 0.1% as the modules were added sequentially, while the fire AP50 increased from 88.6% to 91.4%; other metrics also improved. This indicates that the model’s ability to detect the specific elements of fire and smoke has been enhanced.
Ultimately, SFGI-YOLO achieved increases of 1.8 in Precision, 1.7 in Recall, 1.4 in mAP50, and 1.8 in mAP50–95 while keeping the parameter count unchanged; GFLOPs increased by 8.2 and FPS decreased by 28.5. Although the new modules increase the computational load and lower the FPS, the overall performance remains strong, with high accuracy in fire and smoke detection and speed that still meets real-time detection requirements.
3.4. Comparative Experiments and Analysis
In this study, targets smaller than 32 × 32 pixels were classified as small targets, targets between 32 × 32 and 96 × 96 pixels as medium targets, and targets larger than 96 × 96 pixels as large targets.
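A minimal sketch of this size partition (area-based, in the COCO style, using the thresholds stated above) might look as follows:

```python
def size_class(width: float, height: float) -> str:
    """Classify a detection target by pixel area using the
    32x32 and 96x96 thresholds adopted in this study."""
    area = width * height
    if area < 32 * 32:
        return "small"
    if area < 96 * 96:
        return "medium"
    return "large"
```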
The training dataset consisted of 20,776 images, of which 11,172 contained flames and smoke and 9604 were comparative images. It contained 45,095 detection targets: 8780 small, 12,748 medium, and 23,567 large.
The validation dataset contained 5407 images, of which 3014 contained flames and smoke and 2393 were comparative images. It contained 11,655 detection targets: 2415 small, 3457 medium, and 5783 large.
The comparative images in the training and validation sets include cloud interference images used to enhance the ability to distinguish between clouds and smoke, images without flames and smoke from the same scene, and other forest images from different scenes that enrich the dataset.
3.4.1. Comparative Experiment
To comprehensively evaluate and validate the detection capabilities of SFGI-YOLO, comparative experiments were conducted against YOLOv9n, YOLOv10n, YOLO11n, YOLOv12n, RTDETR50, and other models, comparing performance in terms of Precision, Recall, mAP50, mAP50–95, parameters, GFLOPs, and FPS. The comparison results are presented in Table 4. The analysis shows that SFGI-YOLO exhibits an increase of 1.8% in Precision, 1.7% in Recall, 1.4% in mAP50, and 1.8% in mAP50–95 compared to YOLO11n, indicating significant improvements.
These comparative experiments show that SFGI-YOLO delivers superior accuracy compared with other mainstream models; although its FPS is slightly lower, it still meets real-time requirements. SFGI-YOLO thus effectively balances detection accuracy, detection speed, and model complexity.
3.4.2. Image Comparison Experiment
To validate the performance of SFGI-YOLO and YOLO11n in practical scenarios, a comparison was conducted under various conditions, including small targets, long distances, and multiple scales. The results are demonstrated through visualized images.
As shown in Figure 8, under cloud interference, YOLO11 failed to detect one instance of smoke, while SFGI-YOLO detected all smoke instances with higher confidence than YOLO11. SFGI-YOLO effectively reduced the interference of clouds in smoke detection, demonstrating the algorithm’s robustness: the presence of clouds did not impair its ability to detect smoke.
Small-target detection is of the utmost importance in forest fire detection. As shown in Figure 9 and Figure 10, SFGI-YOLO’s confidence on small targets was significantly higher than that of YOLO11. In Figure 9, both YOLO11 and SFGI-YOLO detected the flames, but SFGI-YOLO did so with higher confidence. Under the nighttime conditions of Figure 10, YOLO11 failed to detect the flames, whereas SFGI-YOLO detected them with high confidence.
In multi-scale and occlusion scenarios, as shown in Figure 11 and Figure 12, YOLO11’s ability to distinguish smoke boundaries and detect flames obscured by occlusions is slightly inferior to that of SFGI-YOLO. Moreover, SFGI-YOLO detects flames and smoke with higher confidence than YOLO11.
It can be seen that, across scenarios involving small targets, long distances, and multiple scales, SFGI-YOLO outperforms YOLO11 in detecting smoke boundaries and obscured flames, as well as in overall detection accuracy.
4. Discussion
At present, the issue of forest fires remains serious globally. The frequency of fires caused by climatic and human factors is continually increasing. These fires not only destroy vast areas of forest and damage the habitats of fauna and flora, but also pose threats to human safety and property. Therefore, the ability to detect and identify forest fires in a timely and accurate manner is crucial for effectively controlling the spread of fires.
Deep neural networks, particularly object detection algorithms, can automatically learn and identify the characteristics of flames and smoke, analyzing complex patterns in image and video data to achieve rapid and precise detection of fire scenarios. Object detection algorithms currently used in fire detection include SSD, Faster R-CNN, and the YOLO series, among others. Although YOLO11, a single-stage algorithm, has lower accuracy than two-stage detection systems, it offers significant advantages in real-time performance, making it more suitable for forest fire detection.
In this study, YOLO11 was improved by adding a small-object detection head (P2) that operates on shallower feature maps, addressing the multi-scale issue in conjunction with the other detection heads. A Feature Enhancement Module (FEM) with a multi-branch structure extracts more discriminative semantic information, increasing feature richness; its dilated convolutions capture richer local context, expanding the receptive field and strengthening the detection of small objects across multiple scales. The lightweight GhostConv generates a portion of intrinsic feature maps with a small number of standard convolutions and then applies inexpensive linear operations to produce additional ghost features; concatenating the intrinsic and ghost feature maps yields as many feature maps as a traditional convolution while reducing computational cost and parameter count. Finally, combining Inception DWConv with the C3k2 module and utilizing multiple parallel branches further enlarges the receptive field. The improved algorithm achieves high accuracy and a fast frame rate, meeting the requirements for high-precision, real-time forest fire monitoring.
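As a sketch of the Ghost mechanism described above (following the GhostNet design; the 5 × 5 depthwise kernel and SiLU activation are assumptions, not confirmed SFGI-YOLO settings), half the output channels come from a standard convolution and the other half from a cheap depthwise operation applied to them:

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Half the outputs are 'intrinsic' maps from a standard conv;
    the other half are 'ghost' maps from a cheap depthwise conv."""
    def __init__(self, c_in: int, c_out: int, k: int = 1, s: int = 1):
        super().__init__()
        c_half = c_out // 2
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_half, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())
        self.cheap = nn.Sequential(   # cheap linear operation: 5x5 depthwise conv
            nn.Conv2d(c_half, c_half, 5, 1, 2, groups=c_half, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.primary(x)
        return torch.cat((y, self.cheap(y)), dim=1)
```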
Considering environmental factors, response plans for natural disasters are outlined as follows. Humid air can corrode metal and age circuits in electronic modules, so devices exposed to clouds and fog for extended periods require moisture-proofing from the initial production phase. One approach is to “block the invasion paths” by selecting active protective measures such as waterproof enclosures and moisture-proof boxes, supplemented by passive measures such as desiccants. Mounting carriers should be replaced regularly to prevent accidents such as collapse or falling. In the event of natural disasters such as landslides or floods, drones should be deployed to inspect damaged equipment so that monitoring can resume within a short time frame; once conditions stabilize, a new site can be selected to rebuild the detection network.
However, this model still has room for improvement. Although its parameter count is comparable to that of YOLO11n, its GFLOPs are noticeably higher, so there remains potential for progress in accuracy, speed, and model size. Environmental factors such as strong winds, dense fog, and nighttime conditions can affect the detection of fire and smoke; it is therefore essential to consider how to gather data and expand the dataset to enhance the model’s robustness. Additionally, this model is expected to be deployed in embedded modules to build monitoring systems, so the distribution of cameras according to forest conditions and the assurance of timely alerts under network instability are important topics for future research.
5. Conclusions
This design is based on SFGI-YOLO, an improved YOLO11 model that aims to overcome the shortcomings of previous methods for flame and smoke detection in forest fires. The model introduces a small-target detection head (P2) to extract shallower feature information and utilizes a Feature Enhancement Module (FEM) to strengthen the representation of small-target features. Conv layers in the algorithm are replaced with GhostConv to reduce parameters and lower computational costs. Finally, the IDC module is combined with the C3k2 module to create the C3k2_IDC module, which processes local and global features in parallel branches while maintaining a large receptive field, enhancing the detection of small targets and multi-scale objects.
The SFGI-YOLO model achieves a precision of 93.6%, a recall of 92.4%, an mAP50 of 95.4%, and an mAP50–95 of 77.6% on the forest fire dataset used in this design. The model has 2.8 million parameters, 14.5 GFLOPs, and an FPS of 263.2. Although it is slightly slower than YOLO11, it demonstrates superior accuracy and performance, making it more suitable for deployment in embedded devices. Future work will involve expanding the dataset to include various fire scenarios in order to further improve the model’s detection accuracy.