1. Introduction
In recent years, forest fires have posed severe threats to global ecosystems, human communities, and biodiversity. These disasters decimate critical forest habitats, eradicate native vegetation, and disrupt ecological balance, triggering cascading declines in wildlife populations and potentially driving vulnerable species to extinction. Furthermore, forest fires intensify soil degradation through erosion and nutrient depletion, severely compromising agricultural productivity. Post-fire landscapes also exhibit reduced water retention capacity, amplifying the risk of secondary disasters, such as debris flows and flash floods, which endanger both ecosystems and human settlements. Therefore, timely detection and effective prevention of forest fire spread are critical [1,2,3].
During the initial phase of a forest fire, flames propagate rapidly under wind influence. This process emits dense smoke plumes that disperse into the atmosphere. Traditional forest fire detection methods primarily rely on sensors, with the most common approach being the use of fire alarm systems for monitoring. These systems typically integrate smoke detectors, temperature and humidity sensors, and flame detectors [4]. When abnormal conditions are detected, the system automatically triggers an alarm to indicate the occurrence of a fire. However, limited sensor coverage in forest environments often leads to delayed fire detection. Additionally, fire alarm systems depend on costly hardware installations. Moreover, sensors cannot provide real-time visual data from the fire site, hindering rapid situational assessment and making this method unsuitable for real-time forest fire detection. In contrast, high-definition camera-based monitoring provides a more effective solution. With the support of drone technology [5,6,7], larger forest areas can be monitored in real time. This method enables real-time surveillance and captures comprehensive visual data on forest fires. It offers a more accurate and timely assessment of fire dynamics compared to sensor-based approaches [8].
With the continuous development of computer vision technologies, researchers have focused on applying image processing techniques to forest fire scenarios. Research has shown that, during a forest fire, the motion and texture features of smoke and flames can be used to effectively detect fire occurrences. Additionally, suitable color models such as RGB [9], YUV [10], HSV [11], and YCbCr [12,13] can be employed for fire detection. Shidik et al. [14] employed multiple color features from the RGB, HSV, and YCbCr color spaces, combined with background subtraction and time frame selection, to achieve rapid fire detection. Hossain et al. [15] proposed a fire detection method for UAV-captured forest fire images, which utilizes flame and smoke features. Their method employs color and multi-color space local binary patterns to identify flames and smoke in UAV images. Ding et al. [16] enhanced flame detection by optimizing the color space using chaos theory and the k-medoids particle swarm optimization algorithm. Khondaker et al. [17] proposed a multi-level fire detection framework based on computer vision. Their approach utilizes advanced fire color detection rules through majority voting to obtain regions of interest, dynamically verifies pixel authenticity through shape change, and evaluates turbulence using an enhanced optical flow analysis algorithm to identify fire. Tung et al. [18] proposed a four-stage smoke detection algorithm for fire video images. The algorithm includes motion region segmentation, smoke candidate region clustering, parameter extraction, and a support vector machine classifier, utilizing the color and dynamic characteristics of smoke.
With the development of computer software and hardware, as well as advancements in computer algorithms, more researchers have utilized deep learning techniques for forest fire detection. Muhammad et al. [19] proposed an efficient system based on Convolutional Neural Networks (CNNs) for fire detection in videos recorded under uncertain monitoring environments. This system employs a lightweight network architecture without dense fully connected layers, making it highly suitable for mobile edge devices and embedded systems. Kaliyev et al. [20] proposed a forest fire detection method based on CNNs and drones, which are particularly useful for continuous patrols in fire-prone areas. Mowla et al. [21] introduced a novel architecture called an Adaptive Hierarchical Multi-Headed Convolutional Neural Network with Modified Convolutional Block Attention Module (AHMHCNN-mCBAM). This architecture integrates adaptive pooling, concatenated convolution, and an improved attention mechanism, effectively addressing challenges related to varying scales, resolutions, and complex spatial dependencies in wildfire datasets. Wang et al. [22] addressed the limitations of real-time small-target flame or smoke detection in forest fire scenarios by designing an efficient and lightweight architecture. They enhanced multi-scale flame and smoke detection capabilities through the Dilation Repconv Cross-Stage Partial Network and improved detection accuracy and robustness in complex forest backgrounds using the Global Mixed-Attention model with Cross-Feature Pyramid and the Lite-Path Aggregation Network. Yuan et al. [23] developed a target detection algorithm called FF-net (F_Res, Fire Label Assignment) to tackle the insufficient detection accuracy in the later stages of forest fires in complex environments. The F_Res (Fire ResNet) and F_fire activation functions enhance feature extraction and nonlinear fitting capabilities. Additionally, the Fire Label Assignment method reduces network complexity while maintaining detection accuracy, and Kullback–Leibler Focal Loss addresses data imbalance and gradient issues. Li et al. [24] proposed a high-precision and robust forest fire smoke recognition method to effectively solve the early detection problem of forest fire smoke. This method uses a Swin multidimensional window extractor to enhance information exchange between windows in horizontal and vertical dimensions for extracting global texture features. Furthermore, the guillotine feature pyramid network reduces redundant features, improving the model’s resistance to interference. The method also employs a contour-adaptive loss function to handle the sparsity and irregularity of smoke at the edges.
Although the existing deep learning methods have shown potential in forest fire detection, most rely on complex network architectures to improve performance. However, the excessively high FLOPs and large parameter counts of these models create a significant accuracy–efficiency trade-off, hindering their deployment on mobile devices for real-time monitoring. To address this, we propose a lightweight YOLOv11n-based model that maintains high detection accuracy while significantly reducing computational complexity and storage demands. First, the backbone employs a C3k2MBNV2 block, which utilizes inverted residual structures and depthwise separable convolutions to optimize channel operations, enabling efficient fire feature extraction with minimal overhead. Second, we integrate a spatial-channel decoupled downsampling (SCDown) block into both the backbone and neck. This block decouples spatial and channel dimensions during downsampling, retaining critical details while reducing feature map dimensions. Finally, the neck incorporates a C3k2WTDC block, which integrates wavelet transforms into convolutional operations to capture multi-scale fire details with lower computational costs.
3. Proposed Algorithm
3.1. C3k2MBNV2 Block
In YOLOv11, the backbone network integrates the C3k2 block based on Cross-Stage Partial (CSP) technology. By employing two smaller convolutional kernels, the C3k2 block enhances feature extraction capabilities, enabling the network to capture richer multi-scale features critical for detecting fire and smoke in complex forest fire scenarios. However, this architecture introduces excessive computational complexity, increasing the model’s resource demands. To address this, we draw inspiration from the lightweight MobileNetV2 [31] and design the C3k2MBNV2 block. By leveraging inverted residuals and linear bottleneck structures, the C3k2MBNV2 reduces parameters and FLOPs, optimizes feature propagation, and preserves representation power.
Unlike traditional residual structures, the inverted residual integrates a pointwise convolution and a depthwise separable convolutional layer. This structure first expands feature channels through pointwise convolution, applies depthwise convolution for spatial filtering, and then compresses channels via linear projection, effectively reducing computational complexity while preserving representational capacity. Its skip connections serve dual roles: bypassing convolutional layers to preserve raw features and enhancing information flow through channel expansion followed by dimension restoration. However, mapping high-dimensional features to low dimensions through nonlinear activations can cause feature degradation. To address this, the inverted residual replaces nonlinear activations with linear functions during dimensionality reduction, effectively mitigating gradual information loss.
In this study, we optimized the architecture of YOLOv11. In the MBNV2 block, we first used a 1 × 1 convolutional kernel to expand the input channels, which enhanced the network’s ability to capture high-level features. In the depthwise (DW) block, we employed a 5 × 5 depthwise convolutional kernel to extract spatial features. Thanks to the depthwise convolution design, we effectively reduced the model’s computational complexity. Meanwhile, the Cross-Bottleneck (CB) convolution block utilized a 1 × 1 convolutional kernel and removed the SiLU activation function. This compressed the feature map’s channels back to the original dimension while retaining the residual connection step, where the input and output were added together. Additionally, we replaced the original bottleneck block in the backbone with the MBNV2 block. The inverted residual and linear bottleneck structures extracted representative features from images. Specifically, the inverted residual structure mitigated dimensional collapse in low-dimensional representations, significantly reducing the model’s parameters and computational load, thereby improving network efficiency. The enhanced network demonstrated superior detection capabilities for small targets and complex scenes, enabling more accurate identification of flame and smoke features in forest fires under challenging backgrounds. The improved structure of the C3k2MBNV2 block is illustrated in Figure 3.
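To make the block structure concrete, the following PyTorch sketch outlines an inverted-residual bottleneck of the kind described above (1 × 1 channel expansion, 5 × 5 depthwise convolution, and a linear 1 × 1 projection with a residual connection). The class name, expansion ratio, and normalization/activation choices are illustrative assumptions rather than the exact implementation used in our model.

```python
import torch
import torch.nn as nn

class MBNV2Block(nn.Module):
    """Sketch of an inverted-residual bottleneck: 1x1 expand -> 5x5 depthwise -> 1x1 linear project."""
    def __init__(self, channels: int, expand_ratio: int = 2):  # expand_ratio is an assumed value
        super().__init__()
        hidden = channels * expand_ratio
        # 1x1 pointwise convolution expands the channel dimension (BN + SiLU).
        self.expand = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.SiLU(inplace=True),
        )
        # 5x5 depthwise convolution filters each channel independently (groups=hidden).
        self.depthwise = nn.Sequential(
            nn.Conv2d(hidden, hidden, kernel_size=5, padding=2, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.SiLU(inplace=True),
        )
        # Linear bottleneck: 1x1 projection back to the input width with no SiLU,
        # so the low-dimensional output is not distorted by a nonlinearity.
        self.project = nn.Sequential(
            nn.Conv2d(hidden, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual (skip) connection adds the input back to the projected output.
        return x + self.project(self.depthwise(self.expand(x)))
```

The C3k2MBNV2 block is then obtained by substituting such a bottleneck into the CSP-style C3k2 structure, i.e., splitting the channels, stacking the inverted-residual bottlenecks on one branch, and concatenating with the bypass branch.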
3.2. SCDown Block
In CNNs, deep-layer feature maps contain rich semantic information; however, their lower spatial resolution diminishes the ability to extract features from small objects. Therefore, to enhance the detection performance for small objects in complex scenes, effective fusion of shallow and deep feature maps is crucial.
To address this issue, we integrate the SCDown [32] block into the backbone and neck structures of YOLOv11. The SCDown block first adjusts the input feature map’s channel dimensions via pointwise convolution while preserving spatial resolution. Next, a depthwise separable convolution layer downsamples the spatial resolution, retaining channel-wise information. Finally, the output feature map has reduced spatial dimensions but maintains the original channel count. By independently processing spatial information per channel during downsampling, the SCDown block mitigates information loss common in traditional convolutions and preserves fine-grained details. This design enhances multi-scale feature fusion by effectively combining shallow and deep features. As a result, the model achieves higher detection accuracy in complex scenarios and improved small-target recognition. Additionally, depthwise separable convolution reduces computational costs by operating on individual channels instead of all channels simultaneously, as in standard convolutions. The overall architecture with the SCDown block is illustrated in Figure 4.
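The following PyTorch sketch illustrates the spatial-channel decoupling idea: a 1 × 1 pointwise convolution handles the channel transformation at full resolution, and a stride-2 depthwise convolution then reduces the spatial resolution one channel at a time. The kernel size, stride, and class name are assumptions made for illustration, not the exact configuration of the released model.

```python
import torch
import torch.nn as nn

class SCDown(nn.Module):
    """Sketch of spatial-channel decoupled downsampling: pointwise channel change, then depthwise stride-2 downsampling."""
    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3, stride: int = 2):
        super().__init__()
        # Channel transformation at full spatial resolution.
        self.pointwise = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.SiLU(inplace=True),
        )
        # Spatial downsampling applied independently per channel (groups=out_channels).
        self.depthwise = nn.Sequential(
            nn.Conv2d(out_channels, out_channels, kernel_size, stride=stride,
                      padding=kernel_size // 2, groups=out_channels, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.depthwise(self.pointwise(x))

# Example: a 64-channel 80x80 map becomes a 128-channel 40x40 map.
# SCDown(64, 128)(torch.randn(1, 64, 80, 80)).shape -> torch.Size([1, 128, 40, 40])
```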
3.3. C3k2WTDC Block
In forest fire detection tasks, flame and smoke features exhibit distinct multi-scale characteristics: flames show dynamic local high-frequency patterns, while smoke displays diffuse low-frequency variations. Traditional CNNs typically extract features within limited receptive fields, focusing on local regions. This makes it challenging for CNN-based models to capture multi-scale interactions between large and small features in dynamic, complex scenarios. Additionally, standard convolutions suffer from aliasing artifacts during downsampling, leading to high-frequency information loss and degraded feature representation. Although larger kernels expand the receptive field, they increase computational complexity and parameter count, hindering real-time deployment on resource-constrained devices. To address these limitations, we integrate wavelet transform convolution (WTConv) [33] into the neck structure and propose the C3k2WTDC block.
The wavelet transform convolution (WTC) layer integrates the wavelet transform (WT) with convolutional operations, enabling CNNs to capture local and global information more effectively through multi-frequency processing of input data. First, the input feature map is decomposed by the wavelet transform into multi-scale sub-bands across different frequencies. Convolutional layers then extract hierarchical features from these sub-bands. By incorporating WTConv, the model leverages multi-frequency responses to dynamically expand receptive fields. This design allows small-kernel convolutions to operate across multiple scales, enhancing the detection of flame and smoke regions with varying sizes in complex forest fire scenarios.
Building on this, we designed the wavelet transform pointwise convolution (WTPC) block to enhance feature extraction efficiency. In the WTPC layer, a 1 × 1 convolutional kernel processed the feature maps from the WTC layer, adjusting channel dimensions and fusing channels while retaining spatial details and reducing computational costs (e.g., FLOPs). This design ensures efficient utilization of multi-scale features while minimizing model complexity. In the wavelet transform depthwise convolution (WTDC) layer, depthwise convolution was applied to the WTPC output. Unlike standard convolution, depthwise convolution independently operates on each channel, drastically reducing computations. This method optimizes efficiency while preserving per-channel feature representations, enhancing robustness for detecting flames and smoke across scales. By integrating wavelet transforms with multi-convolution operations, we constructed the C3k2WTDC block. This block improves multi-scale feature extraction and reduces computational overhead without sacrificing accuracy. The enhanced network architecture is illustrated in Figure 5.
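As a rough illustration of the wavelet-convolution idea, the sketch below performs a one-level Haar decomposition (implemented with fixed grouped convolutions), filters the resulting sub-bands with depthwise and pointwise convolutions, and reconstructs the original resolution with the inverse transform. It is a simplified stand-in for the WTConv-based C3k2WTDC block: the single-level Haar basis, the processing order, the kernel sizes, and the `WTDCBlock` name are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def haar_filters(channels: int) -> torch.Tensor:
    """Per-channel 2x2 orthonormal Haar analysis filters (LL, LH, HL, HH)."""
    h = torch.tensor([
        [[0.5, 0.5], [0.5, 0.5]],     # LL: low-frequency approximation
        [[0.5, 0.5], [-0.5, -0.5]],   # LH: horizontal detail
        [[0.5, -0.5], [0.5, -0.5]],   # HL: vertical detail
        [[0.5, -0.5], [-0.5, 0.5]],   # HH: diagonal detail
    ])
    return h.repeat(channels, 1, 1).unsqueeze(1)  # shape: (4*channels, 1, 2, 2)

class WTDCBlock(nn.Module):
    """Sketch: Haar decomposition -> depthwise + pointwise convs on sub-bands -> inverse transform."""
    def __init__(self, channels: int):
        super().__init__()
        self.channels = channels
        self.register_buffer("wt", haar_filters(channels))
        # Depthwise 3x3 conv acts on each of the 4*C sub-band channels separately.
        self.depthwise = nn.Conv2d(4 * channels, 4 * channels, kernel_size=3,
                                   padding=1, groups=4 * channels, bias=False)
        # Pointwise 1x1 conv fuses information across channels and sub-bands.
        self.pointwise = nn.Conv2d(4 * channels, 4 * channels, kernel_size=1, bias=False)
        self.act = nn.SiLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Analysis: (B, C, H, W) -> (B, 4C, H/2, W/2); H and W are assumed even.
        subbands = F.conv2d(x, self.wt, stride=2, groups=self.channels)
        subbands = self.act(self.pointwise(self.depthwise(subbands)))
        # Synthesis (inverse Haar transform) restores the input resolution.
        return F.conv_transpose2d(subbands, self.wt, stride=2, groups=self.channels)
```

Because the sub-band tensor has half the spatial resolution, a 3 × 3 convolution on it covers the same spatial extent as a larger kernel on the original map, which is how the wavelet path enlarges the effective receptive field at low cost.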
3.4. Improved YOLOv11n Model
Our model achieves lightweight feature extraction and reduces computational complexity through three architectural innovations: the C3k2MBNV2 block in the backbone network integrates inverted residuals and linear bottlenecks to maintain flame and smoke feature extraction while minimizing redundant computations; the SCDown block in both backbone and neck decouples spatial and channel dimensions to compress feature maps while preserving critical details, thereby enhancing detection accuracy in complex forest fire scenarios; and the C3k2WTDC block in the neck combines wavelet transform convolution (WTConv) with multi-convolutional operations to capture fine-grained flame and smoke patterns.
The C3k2MBNV2 optimizes computational efficiency through bottleneck channel compression without sacrificing feature fidelity. The SCDown block independently processes spatial and channel information during downsampling, mitigating aliasing artifacts and improving robustness in cluttered fire environments. Meanwhile, the C3k2WTDC leverages WTConv’s multi-frequency responses to dynamically expand receptive fields, enabling multi-scale feature fusion. Compared to traditional convolutions, this design significantly reduces model complexity while improving detection accuracy across diverse fire scales. By synergistically balancing spatial detail retention, channel efficiency, and multi-scale analysis, our model achieves superior computational efficiency and detection performance in resource-constrained forest fire monitoring.
Through the proposed blocks, the improved model effectively reduces computational costs while enhancing detection accuracy—a critical advantage in resource-constrained forest fire monitoring. It can be deployed on mobile or embedded devices for real-time monitoring, achieving high accuracy with minimal computational overhead. The architecture of the improved model is shown in Figure 6.
3.5. Performance Evaluation
To evaluate the performance of the improved model and validate the effectiveness of each enhancement, we adopt the following key evaluation metrics: precision (P), recall (R), average precision (AP), mean average precision (mAP), F1-score, parameter count (Params), floating-point operations (FLOPs), and detection time per image. Among these, Params and FLOPs quantify computational complexity, while P, R, mAP, and F1-score measure detection accuracy.
In model evaluation, TP (true positives) denotes the regions that the model correctly predicts as fire and smoke, meaning the model’s prediction matches the ground truth. FP (false positives) denotes regions where the model predicts fire and smoke but the ground truth is background, i.e., background incorrectly identified as fire and smoke. FN (false negatives) denotes regions where the model fails to detect fire and smoke, meaning the model predicts background while the ground truth is fire and smoke. In this study, TP + FP represents the total number of fire and smoke detections produced by the model, while TP + FN represents the total number of fire and smoke instances that actually exist in the image.
In forest fire detection, P is a key metric to evaluate the model’s ability to avoid false alarms. It calculates the proportion of true fire predictions (correctly identified fires) among all predicted fires, as expressed in Equation (1):

$$P = \frac{TP}{TP + FP} \tag{1}$$
R is a metric that quantifies a model’s ability to correctly identify positive samples among all actual positives, with higher values indicating fewer missed detections. It is formally defined by Equation (2):

$$R = \frac{TP}{TP + FN} \tag{2}$$
AP is calculated as the area under the precision–recall (PR) curve. The detection threshold determines the trade-off between P and R. In single-class detection, AP aggregates PR values across all thresholds, as defined in Equation (3):

$$AP = \int_{0}^{1} P(R)\,dR \tag{3}$$
Here, P(R) denotes the precision at recall R. In multi-class object detection, the mAP is computed by averaging the AP values across all categories. The formula for mAP is defined in Equation (4):

$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i \tag{4}$$

where $AP_i$ is the AP for the i-th category, and N is the total number of categories.
The F1-score is a critical metric that evaluates the balance between precision and recall in fire detection models. It combines both metrics to provide a comprehensive performance assessment. The F1-score is calculated as the harmonic mean of precision and recall, as defined in Equation (5):

$$F1 = \frac{2 \times P \times R}{P + R} \tag{5}$$
In this study, a higher mAP value suggests superior detection performance and improved recognition accuracy of the model. By optimizing these metrics, the model enhances its practical effectiveness, particularly in complex fire and smoke detection scenarios.
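For reference, a minimal Python sketch of these metrics is given below. It computes P, R, and F1 from raw TP/FP/FN counts and approximates AP by trapezoidal integration of the PR curve; practical object-detection evaluation additionally involves IoU-based matching of predictions to ground truth and interpolation of the PR curve, which are omitted here. The helper names are hypothetical.

```python
import numpy as np

def detection_metrics(tp: int, fp: int, fn: int):
    """Precision, recall, and F1-score from per-class detection counts (Eqs. (1), (2), (5))."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

def average_precision(precisions: np.ndarray, recalls: np.ndarray) -> float:
    """Area under the precision-recall curve (Eq. (3)), integrated over recall."""
    order = np.argsort(recalls)
    return float(np.trapz(precisions[order], recalls[order]))

# mAP (Eq. (4)) is the mean of the per-class AP values, e.g. over the fire and smoke classes:
# map_value = np.mean([ap_fire, ap_smoke])
```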
5. Discussion
5.1. Flame Detection Performance
At the early stages of a forest fire, flames are typically small and often exhibit irregular patterns. This irregularity poses a challenge for forest fire detection models. Enhancing the model’s ability to detect small flame targets enables timely detection of early-stage fires, which improves the model’s overall capability. We trained the improved model on both our dataset and the FASDD_UAV dataset and compared the performance of both versions in detecting small-target flames. Importantly, the small-target flame images used for testing were excluded from both datasets.
As clearly illustrated in Figure 9, the same fire detection model trained on our dataset successfully detects most small flame targets, whereas its FASDD_UAV-trained version fails to identify these targets. This highlights that our dataset is more suitable for complex forest fire detection tasks, enabling the model to reduce false negatives and improve detection accuracy.
5.2. Smoke Detection Performance
During a forest fire, rising smoke is common, making it a key indicator for fire detection. The smoke produced by burning materials varies in form and exhibits dynamic changes, with colors typically ranging from white to gray to black. In practical detection, we found that smoke shares similar colors and shapes with clouds. This similarity may lead the fire detection model to mistakenly identify clouds as smoke, resulting in false positives. We trained the improved fire detection model on both our dataset and the FASDD_UAV dataset and compared its performance in detecting smoke in forest fires. The selected images depict both smoke and clouds, and they were excluded from both datasets.
As clearly shown in Figure 10, the fire detection model trained on our dataset can effectively identify smoke without misclassifying clouds as smoke. In contrast, the model trained on the FASDD_UAV dataset fails to detect smoke. This indicates that our model is less likely to misclassify clouds as smoke. The improved model successfully recognizes smoke across diverse scenarios.
5.3. Model Performance Under Various Conditions
To perform real-time detection tasks in forest environments, the model must be capable of identifying forest fires under diverse conditions. To evaluate the performance of our fire detection model in recognizing flames and smoke under varying conditions, we selected forest images captured under different scenarios, including daytime, nighttime, cloudy, and hazy conditions. These images were not included in our training dataset.
As shown in Figure 11, our model effectively identifies flames and smoke in forest fires under various environmental conditions, including daytime, nighttime, hazy, and cloudy conditions. This demonstrates that the model exhibits strong robustness and adaptability, whether in complex lighting or high-background-noise environments. Under varying illumination and weather conditions, the model accurately detects flames and smoke while distinguishing fire signals from environmental interference. By reducing false positives and false negatives, our model provides reliable fire monitoring outcomes, ensuring early detection and timely response to forest fires.
The stability and accuracy of these capabilities not only enhance the reliability of fire detection but also significantly improve the model’s applicability in real-world scenarios. This is particularly evident in real-time monitoring and forest fire prevention, where the model rapidly captures fire signals and delivers timely alerts, a capability critical for responding to sudden forest fire incidents, minimizing damage, and protecting ecosystems. Furthermore, the model’s consistent performance across diverse environmental conditions provides robust technical support for practical deployment and long-term monitoring, highlighting its potential for broader applications in other domains.
6. Conclusions
Fire smoke detection is critical for preventing forest fires. This paper proposed a real-time forest fire detection algorithm based on deep learning, which incorporates innovative lightweight and efficient feature extraction blocks, C3k2MBNV2 and C3k2WTDC, to ensure high accuracy while significantly reducing model parameters and FLOPs. Additionally, a lightweight SCDown block was introduced to maximize information retention during downsampling, maintaining high detection accuracy with a lightweight architecture. Through these improvements, we conducted experimental validation on our custom forest fire dataset. Compared to the YOLOv8n baseline, our algorithm achieved a 3.3% increase in mAP while reducing the parameters and FLOPs by 53.2% and 28.6%, respectively.
Our custom forest fire dataset includes various fire types, such as early small fires, dynamic smoke patterns, and irregular fire spread. Special attention was paid to annotating small flame targets and dynamic smoke patterns, and cloud images were explicitly labeled to prevent mistaking clouds for smoke. The dataset’s diversity closely simulates complex early-stage forest fire scenarios, thereby enhancing the model’s sensitivity to subtle features. In contrast, the FASDD_UAV dataset focuses on large-scale open flame detection and lacks early-stage fire data, limiting the model’s generalization in complex scenarios. Therefore, although the improved algorithm trained on our dataset achieves a lower overall mAP than the model trained on the FASDD_UAV dataset, it excels in detecting small targets in early-stage fires and significantly reduces the misclassification of clouds as smoke.
The experimental results demonstrate that the proposed algorithm effectively balances computational complexity, detection accuracy, and efficiency, making it highly suitable for practical deployment. Currently, drones can autonomously take off, recharge, and patrol along pre-set paths. When integrated with our model, drones perform routine forest fire monitoring tasks, enabling efficient patrols and real-time surveillance. Additionally, cameras can be deployed in high-risk fire-prone forest regions, where our model monitors and analyzes captured images, enabling timely detection of potential fire hazards.