1. Introduction
In recent decades, the frequency and intensity of forest fires have continued to increase, posing significant threats to ecosystems, the carbon cycle, and public safety [1,2]. In this context, effective forest fire prevention and control requires real-time and accurate detection technologies, as undetected fires may rapidly escalate into large-scale disasters [3]. Edge devices such as unmanned aerial vehicles (UAVs) [4], robots [5], and cameras [6] have been widely adopted due to their high mobility and responsiveness, and because they enable real-time detection of forest fires when integrated with artificial intelligence algorithms. However, edge devices are often deployed in remote and mountainous forest areas with complex environments, where they face significant constraints in computing resources, memory, power budgets, and network stability [7,8]. Accordingly, a lightweight detection model is urgently needed to enable reliable, accurate, and continuous real-time forest fire detection [9].
Early forest fire detection methods deployed on edge devices primarily relied on traditional machine learning algorithms, such as Random Forests [10] and Support Vector Machines [11]. These methods depended heavily on manually designed features, making it difficult for them to adapt to the diverse appearances of smoke and flames as well as complex backgrounds. With rapid advances in deep learning, deep learning-based methods have become the dominant approach to forest fire detection. Their key advantage over conventional methods lies in learning highly discriminative feature representations in an end-to-end manner, thereby enabling timely and reliable early warning. Convolutional Neural Networks (CNNs) and Transformers are the two primary deep learning architectures employed for forest fire detection. CNN-based detectors include Faster R-CNN [12], SSD [13], and RetinaNet [14], while models such as IFS-DETR [15] and FTA-DETR [16] are built upon the Transformer architecture. Although these detectors perform robustly across a variety of fire detection tasks, their high computational complexity often compromises efficiency and inference speed, making them less suitable for deployment on resource-constrained edge devices and for real-time detection scenarios.
Since the emergence of the YOLO family, YOLO-based detectors have gradually become a major research focus in object detection due to their strong real-time performance and promising accuracy [17,18,19,20]. This advantage has also led to their increasingly widespread adoption in forest fire detection, significantly improving detection accuracy [21,22,23,24]. However, many improvements built upon the YOLO architecture often increase the parameter count and computational complexity, thereby undermining the feasibility of deployment on edge devices. Given the limited computational capacity of edge devices and the substantial computational overhead incurred by existing high-accuracy YOLO variants, there is an urgent need to develop more lightweight models to better meet the requirements of edge device deployment.
To meet the requirements of diverse application scenarios, several lightweight YOLO variants have been developed. For example, a YOLOv5-based benthic species recognition method reduces computational cost and model size by 66.0% and 40.5%, respectively [25]; YOLOv8-MDN-Tiny decreases parameter count and memory usage by 90.1% and 88.9%, improving passion fruit disease detection on handheld devices [26]; and Ji et al. [27] target UAV-based welding defect detection, achieving 93.7% accuracy while reducing parameters by 21.6%, outperforming mainstream methods. Forest fire scenarios impose stricter demands on real-time performance, model compactness, and detection accuracy. To address these demands, Zhou and Jiang [28] redesigned the C3k2 structure with the FasterBlock module, which stacks multi-scale convolutions serially in a single path. This reduces parameters by 11.2%, improving suitability for resource-limited devices. However, the serial design also restricts multi-scale feature extraction efficiency and sacrifices detection accuracy. Zhu et al. [29] proposed the GERB module in the neck network. The module first expanded the input features via a 1 × 1 convolution, split them into a transformation branch and an identity branch, and introduced RepConv to enrich feature representation. Finally, the two branches were simply concatenated for output. However, this lightweight design relied solely on simple concatenation, lacked contextual collaborative modeling, and thus struggled to suppress background interference, making it prone to false positives. Yu et al. [30] adopted composite scaling and a bidirectional feature pyramid to construct a lightweight detection head. It simultaneously scaled all model dimensions using a single scaling factor to improve accuracy while controlling computational cost. However, although this detection head achieved improved inference speed, it only reached 27.2 FPS, which can hardly meet the strict real-time requirements of edge devices in forest fire scenarios.
Beyond architectural modifications to the YOLO backbone, neck, and head, network pruning provides another practical route to model lightweighting [31]. In particular, layer-wise sparsity pruning has been widely adopted for its favorable trade-off between sparsity and detection accuracy [32,33,34]. However, many existing pruning approaches rely on heuristic rules and extensive hyperparameter tuning, which increases optimization costs and may compromise generalization and deployment robustness. Lee et al. [35] proposed Layer-Adaptive Magnitude-based Pruning (LAMP), which allocates layer-wise sparsity by minimizing $\ell_2$ distortion and avoids subjective tuning. This strategy reduces redundant parameters and computation while preserving detection performance, making it well suited to real-time deployment on edge devices.
Although deep learning has greatly improved the accuracy of forest fire detection, high-accuracy models typically incur substantial computational overhead, making them difficult to deploy effectively on resource-constrained edge devices. Existing lightweight models largely rely on redesigning the YOLO architecture, while pruning strategies are less frequently adopted. Therefore, achieving an effective balance between detection accuracy and efficiency in forest fire scenarios remains challenging.
To address the challenges mentioned above, this study proposes a real-time lightweight fire detection network (RLFNet) for forest fire detection on edge devices. Built upon YOLOv11, RLFNet introduces systematic lightweight improvements to the backbone, neck, and head, and further applies the LAMP strategy for holistic optimization. The main contributions and innovations of this study are as follows:
A Diverse Fire Scenarios (DFS) dataset is constructed to cover various fire types, viewpoints, and environmental conditions. It mitigates common limitations of public datasets, such as limited data volume, narrow scenario diversity, and unreliable annotations, thereby enhancing model robustness and generalization in forest fire scenarios.
A Parallel Multi-Scale Extraction Block (PMEB) is proposed, which designs a channel-grouping strategy to preserve a low-cost branch and perform parallel multi-kernel convolutions on grouped channels, enhancing multi-scale representation with low overhead and avoiding the constraints on feature extraction from lightweight serial single-branch kernel stacking.
A Bidirectional Cross Fusion Module (BCFM) is presented, which overcomes the inherent limitations of conventional feature concatenation by designing a context-aware cross-gating mechanism to achieve complementary cross-stage channel fusion, thereby significantly enhancing robustness against background interference.
A Faster Inference Detection Head (FIDH) is devised, which enhances localization accuracy through structural re-parameterization, while incorporating group normalization to stabilize small-batch real-time inference, thereby improving the model’s inference efficiency and stability on edge devices.
The rest of this study is organized as follows:
Section 2 introduces the materials and methods, including a detailed explanation of each module;
Section 3 presents the experimental setup, ablation study, comparison results, and visualization;
Section 4 discusses the limitations of this study and outlines future work;
Section 5 summarizes the study.
2. Materials and Methods
2.1. Overall Framework of RLFNet for Forest Fire Detection
The overall workflow of the proposed method is illustrated in Figure 1a. First, the DFS dataset is constructed by collecting diverse fire scenarios, which improves the model’s robustness and generalization in complex forest fire environments. Next, targeting remote and complex forest environments, this study proposes RLFNet, a lightweight detection network tailored for edge device deployment, reducing computational cost while maintaining high detection accuracy. The LAMP strategy is further applied to RLFNet to adaptively prune redundant parameters across layers while preserving a compact architecture, thereby improving its practicality and deployability on resource-constrained platforms. Finally, detection examples on forest fire images further demonstrate that RLFNet has strong potential for edge deployment and is suitable for UAV and other edge-device scenarios. As shown in Figure 1b, RLFNet lies in the upper-left region, suggesting that it delivers high accuracy and fast inference under a low computational budget, which highlights its efficient lightweight design. The corresponding metrics are summarized in Table 1. In all subsequent tables, the optimal value of each evaluation metric is marked in bold.
2.2. The Diverse Fire Scenarios (DFS) Dataset Construction
High-quality datasets are a crucial foundation for advancing deep learning-based UAV forest fire detection research [45,46]. To this end, the Diverse Fire Scenarios (DFS) dataset is constructed with a custom web crawler to collect fire-related images and videos from various public online sources. A standardized preprocessing pipeline is adopted to guarantee data quality and diversity.
Specifically, considering the rapid morphological changes and motion blur exhibited by flame and smoke in forest fire scenarios, one frame is extracted every three frames from video data, which reduces information redundancy caused by excessive inter-frame similarity when the sampling interval is too small, while avoiding the loss of key information when the interval is excessively large. Next, in the image screening stage, the structural similarity index (SSIM) is adopted for image-level deduplication. To achieve an optimal balance between removing redundant data and preserving sample diversity, image pairs with an SSIM value greater than 0.4 are identified as highly similar, and only one representative image is retained to suppress redundancy. Subsequently, corrupted files and low-quality samples are manually inspected and removed to ensure data reliability and reduce sampling bias. After cleaning, all remaining images are annotated according to uniform criteria and classified into two categories: fire and smoke. The final constructed DFS dataset contains a total of 5005 images, which are divided into training, validation, and test sets with a ratio of 7:2:1 to ensure the fairness and reliability of experimental evaluation.
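The preprocessing steps described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names are hypothetical, images are represented as flat lists of grayscale intensities, `global_ssim` computes a single-window SSIM rather than the usual locally windowed average, and the rounding rule used for the 7:2:1 split is assumed because the text does not state the per-split counts.

```python
def sample_frames(frames, step=3):
    """Keep one frame out of every `step` consecutive frames."""
    return frames[::step]


def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM of two equally sized grayscale images (flat lists).

    Simplification: standard SSIM averages this statistic over local windows.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )


def deduplicate(images, threshold=0.4):
    """Greedy filter: keep an image only if its SSIM to every kept image
    does not exceed the threshold (0.4 in the text)."""
    kept = []
    for img in images:
        if all(global_ssim(img, k) <= threshold for k in kept):
            kept.append(img)
    return kept


def split_counts(n, ratios=(7, 2, 1)):
    """Train/validation/test sizes; the test set absorbs the rounding remainder
    (an assumed convention)."""
    total = sum(ratios)
    train = n * ratios[0] // total
    val = n * ratios[1] // total
    return train, val, n - train - val
```

Under the assumed rounding rule, `split_counts(5005)` yields 3503 training, 1001 validation, and 501 test images.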
To more accurately quantify the scale distribution characteristics of the targets, statistics on the normalized width and height are performed for the 10,062 annotated boxes in the DFS dataset. As shown in
Figure 2, more than 70% of the target boxes are concentrated within the normalized width range of 0.0–0.6 and normalized height range of 0.4–0.8. In this region, the height of target boxes is generally greater than the width, which is highly consistent with the vertically extended morphological characteristics of flame and smoke. Further statistics indicate that the mean aspect ratio of all target boxes is 0.72 and the median is 0.68, demonstrating that the size ratios of most target boxes are reasonably distributed without obvious extreme outliers. Meanwhile, the DFS dataset covers targets with diverse width and height scales, which better adapts to detection requirements in complex scenarios and provides sufficient and reliable sample support for robust detection in complicated forest fire environments.
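As a small illustration of how such statistics can be obtained, the sketch below computes the mean and median aspect ratio from normalized box sizes. It assumes that the aspect ratio is defined as width divided by height, which is consistent with boxes that are taller than wide having ratios below 1; the box values are hypothetical and are not taken from the DFS dataset.

```python
from statistics import mean, median


def aspect_ratio_stats(boxes):
    """boxes: iterable of (width, height) pairs normalized to [0, 1].

    Returns the mean and median of width / height (assumed definition).
    """
    ratios = [w / h for w, h in boxes if h > 0]
    return mean(ratios), median(ratios)


# Hypothetical boxes for illustration only.
example_boxes = [(0.30, 0.50), (0.20, 0.40), (0.45, 0.60), (0.36, 0.45)]
mean_ratio, median_ratio = aspect_ratio_stats(example_boxes)
```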
Sample instances of the DFS dataset are illustrated in
Figure 3, including forest fire and non-forest fire scenarios such as urban areas, roads, and open fields. Both scene types exhibit rich semantic variations including multi-scale targets and complex background interference, which support comprehensive feature learning of fire and smoke for the model.
2.3. Overview of RLFNet
The overall architecture of RLFNet is illustrated in
Figure 4. This study improves the original C3k2 modules with PMEB to enhance multi-scale feature representation while reducing parameter overhead. Then, BCFM is introduced to effectively fuse shallow spatial details with deep semantic information, alleviating the insufficient information coordination caused by conventional concat-based fusion. Finally, the original detection head is redesigned as FIDH to improve localization accuracy and enhance real-time inference performance on edge devices.
2.4. Parallel Multi-Scale Extraction Block (PMEB)
In forest fire detection, flames and smoke targets exhibit pronounced multi-scale variability, posing significant challenges to accurate detection, particularly under the strict computational constraints of edge device platforms. To address the inefficiency of conventional single-branch and multi-kernel architectures in multi-scale forest fire detection, this study proposes a Parallel Multi-Scale Extraction Block (PMEB), which reorganizes feature channels into kernel-specific subspaces. This design significantly reduces the number of parameters while improving detection accuracy.
Figure 5 presents the structure of PMEB, which is described in detail in the following section.
Given an input feature map $X \in \mathbb{R}^{B \times C \times H \times W}$ (where $B$ is the batch size, $C$ the number of channels, and $H$ and $W$ the height and width of $X$), PMEB first introduces a channel-splitting-driven feature decoupling strategy, which explicitly divides the input feature into a base feature branch with low computational cost and a feature branch that is highly sensitive to scale variations:

$$[X_{\mathrm{base}},\ X_{\mathrm{ms}}] = \mathrm{Split}(X)$$

Unlike traditional multi-scale methods that directly stack multi-branch or multi-kernel convolutions in a shared feature space, this design starts from the feature organization level and transforms the multi-scale modeling problem into a channel-level structured division of labor. Here, $X_{\mathrm{base}}$ serves as a lightweight straight-through branch, which preserves basic semantic information and controls the overall computational overhead, while $X_{\mathrm{ms}}$ is allocated to subsequent multi-scale feature extraction, thereby improving the model’s ability to extract scale-sensitive features without significantly increasing the number of parameters. Subsequently, the scale-sensitive branch $X_{\mathrm{ms}}$ is explicitly rearranged into a structured form suitable for parallel multi-scale processing, with dimensions $B \times G \times C_g \times H \times W$, where $G$ denotes the number of convolutional kernel groups and $C_g$ denotes the number of channels in each group. This rearrangement is not merely a tensor transformation, but a structured reorganization of feature channels according to scale requirements, which allocates independent and non-interfering channel subspaces to different convolution kernels. Formally, the reorganized feature representation can be expressed as:

$$X_{\mathrm{ms}} \rightarrow \{X_1, X_2, \ldots, X_G\}, \quad X_g \in \mathbb{R}^{B \times C_g \times H \times W}$$

Through kernel-specific channel partitioning, each scale operates exclusively on its corresponding channel subspace, thereby avoiding cross-scale feature interference and significantly reducing computational redundancy. Each channel subgroup $X_g$ is then processed by a convolution with its corresponding kernel size $k_g$:

$$Y_g = \mathrm{Conv}_{k_g \times k_g}(X_g), \quad g = 1, \ldots, G$$

Specifically, the $3 \times 3$ convolution branch focuses on capturing fine-grained textures and irregular flame boundaries, which helps reduce missed detection of small-scale fire targets, while the $5 \times 5$ convolution branch enlarges the receptive field to enhance feature perception of low-contrast and diffusive smoke regions. Since convolutions at different scales are constrained to independent channel subspaces, parallel multi-scale feature extraction is achieved without introducing additional branch overhead, thereby effectively exploiting scale complementarity. The outputs of all scale-specific branches are subsequently stacked into a tensor of shape $B \times G \times C_g \times H \times W$ and then rearranged back to $B \times (G \cdot C_g) \times H \times W$, forming a unified multi-scale feature representation:

$$Y_{\mathrm{ms}} = \mathrm{Rearrange}\big(\mathrm{Stack}(Y_1, \ldots, Y_G)\big)$$

The resulting multi-scale feature $Y_{\mathrm{ms}}$ is concatenated with the lightweight bypass branch $X_{\mathrm{base}}$:

$$Y_{\mathrm{cat}} = \mathrm{Concat}(X_{\mathrm{base}},\ Y_{\mathrm{ms}})$$

and a $1 \times 1$ convolution is applied to fuse the complementary multi-scale features. While maintaining an overall compact structure, this fusion process effectively integrates information across scales and further reduces parameter count and computational complexity. Finally, the overall mapping function of the PMEB module can be expressed as:

$$\mathrm{PMEB}(X) = \mathrm{Conv}_{1 \times 1}\big(\mathrm{Concat}(X_{\mathrm{base}},\ Y_{\mathrm{ms}})\big)$$
The performance advantage of PMEB mainly stems from its channel-splitting strategy, which decouples scale-sensitive features from low-cost base features and restricts multi-scale convolutions to kernel-specific channel subspaces for parallel execution. Benefiting from this design, PMEB enhances the complementary scale information of fire and smoke targets without introducing additional branch overhead, thereby achieving more accurate and stable forest fire detection performance while maintaining a low computational cost and effectively meeting the lightweight and real-time deployment requirements of edge devices.
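To make the parameter saving of kernel-specific channel subspaces concrete, the following sketch counts convolution weights for a PMEB-like block and for a dense convolution over all channels. Several choices are assumptions made only for illustration: the share of channels routed to the multi-scale branch (`ms_fraction=0.5`) is hypothetical because the text does not state it, the kernel set (3, 5) follows the combination selected in the kernel ablation, and bias and normalization parameters are ignored.

```python
def conv_params(c_in, c_out, k):
    """Weight count of a dense k x k convolution (bias and normalization ignored)."""
    return c_in * c_out * k * k


def partition_channels(channels, kernels=(3, 5), ms_fraction=0.5):
    """Return the bypass channel range and one channel range per kernel size.

    ms_fraction is a hypothetical split ratio, not a value from the paper.
    """
    c_ms = int(channels * ms_fraction)
    c_base = channels - c_ms
    c_g = c_ms // len(kernels)
    groups = {
        k: (c_base + g * c_g, c_base + (g + 1) * c_g)
        for g, k in enumerate(kernels)
    }
    return (0, c_base), groups


def pmeb_params(channels, kernels=(3, 5), ms_fraction=0.5):
    """Weights of the per-group convolutions plus the final 1 x 1 fusion."""
    _, groups = partition_channels(channels, kernels, ms_fraction)
    branch = sum(conv_params(hi - lo, hi - lo, k) for k, (lo, hi) in groups.items())
    fuse = conv_params(channels, channels, 1)
    return branch + fuse


pmeb_64 = pmeb_params(64)            # grouped 3x3 and 5x5 branches + 1x1 fusion
dense_3x3_64 = conv_params(64, 64, 3)  # one dense 3x3 convolution on all channels
```

With 64 channels under these assumptions, the grouped design uses 12,800 weights, compared with 36,864 for a single dense 3 × 3 convolution over all channels.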
2.5. Bidirectional Cross Fusion Module (BCFM)
Feature fusion in many conventional CNN-based detectors is still implemented with simple channel concatenation, which overlooks the relative importance of features from different stages and limits effective coordination between shallow spatial details and deep semantic representations. Under complex forest fire conditions, this limitation often leads to feature fusion bias and increases the risk of false detections. To address this issue, a Bidirectional Cross Fusion Module (BCFM) is proposed. The module enhances cross-stage information interaction through a context-aware cross-gating mechanism, enabling adaptive fusion of complementary features, highlighting key fire-related regions, and suppressing complex background interference. The detailed design of the module is described below.
As illustrated in Figure 6, the input of the BCFM consists of two types of features extracted from different network stages: a shallow feature map $F_s \in \mathbb{R}^{B \times C_s \times H \times W}$, which primarily preserves fine-grained spatial structural information such as flame boundaries and smoke contours, and a deep feature map $F_d \in \mathbb{R}^{B \times C_d \times H \times W}$, which encodes high-level semantic attributes and the global distribution characteristics of flames and smoke. These two feature representations exhibit strong complementarity in terms of semantic level and information emphasis. When the channel dimensions of the two inputs are inconsistent ($C_s \neq C_d$), BCFM first applies a lightweight $1 \times 1$ convolution to the shallow feature map for channel alignment:

$$F_s' = \mathrm{Conv}_{1 \times 1}(F_s)$$

This operation serves solely for feature-space alignment without introducing additional semantic interference, thereby ensuring that subsequent cross-stage interactions are conducted within a unified feature space. Subsequently, the aligned shallow feature $F_s'$ and the deep feature $F_d$ are concatenated along the channel dimension to form a joint feature representation:

$$F_{\mathrm{cat}} = \mathrm{Concat}(F_s',\ F_d)$$

In contrast to traditional approaches that directly treat concatenated features as the final fusion output, BCFM regards the joint feature as an intermediate representation for cross-stage information interaction. Building on this design, an adaptive reweighting strategy driven by global channel-wise context is applied to the joint feature $F_{\mathrm{cat}}$ to mitigate fusion bias caused by uniform weighting of heterogeneous features. This design captures channel-wise dependencies and dynamically modulates the response strength of different features.
Following this, the channel-wise context descriptor $z$ of $F_{\mathrm{cat}}$ is transformed through two nonlinear mappings $\phi_1$ and $\phi_2$ to generate the corresponding channel weight vector:

$$w = \phi_2\big(\phi_1(z)\big)$$

This weight vector reflects the relative importance of different channels within the joint feature and is applied to perform adaptive reweighting:

$$\tilde{F} = w \odot F_{\mathrm{cat}}$$

Through this process, salient information within the joint feature is selectively enhanced, while redundant or interfering responses are effectively suppressed, providing a more stable and discriminative feature basis for subsequent cross-stage information interaction.
Distinct from existing attention-based fusion methods, BCFM does not directly treat the reweighted feature $\tilde{F}$ as the final fusion output. Instead, it further decomposes the reweighted feature along the channel dimension into two complementary attention subspaces:

$$[A_s,\ A_d] = \mathrm{Split}(\tilde{F})$$

Here, $A_s$ and $A_d$ correspond to the contextual guidance weights associated with the shallow and deep features, respectively. Based on the decomposed attention subspaces, BCFM establishes an explicit bidirectional cross fusion mechanism between shallow and deep features. The interaction is formulated as follows:

$$\hat{F}_s = F_s' + A_d \odot F_d, \qquad \hat{F}_d = F_d + A_s \odot F_s'$$

where $A_d \odot F_d$ denotes that deep semantics guide shallow features to complement their missing semantic context; $A_s \odot F_s'$ indicates that shallow details, filtered through weight selection, reversely enhance deep features, enabling them to retain crucial shape boundaries. The final fused feature is obtained through concatenation:

$$F_{\mathrm{out}} = \mathrm{Concat}(\hat{F}_s,\ \hat{F}_d)$$
The BCFM designs a context-aware cross-gating mechanism to overcome the limitation of unidirectional semantic propagation in traditional feature pyramid architectures. This design enables shallow details and deep semantic features to form a complementary and mutually constrained collaborative relationship within a unified module. Specifically, deep features provide stable global semantic context to suppress background interference such as clouds, haze, and illumination variations. Meanwhile, shallow structural details enhance the spatial discriminability of deep features, effectively reducing mismatches between flame boundaries and fire-like textures, making BCFM particularly suitable for robust forest fire detection in complex scenarios.
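The cross-gating idea can be sketched in plain Python as follows. This is a hedged illustration rather than the module's actual implementation: feature maps are lists of channels with flattened spatial values, the two inputs are assumed to be already channel-aligned, the global context is assumed to be global average pooling, the two nonlinear mappings are assumed to be a ReLU layer followed by a sigmoid layer, and the additive form of the cross fusion as well as all function names and weight matrices are hypothetical.

```python
import math


def gap(feat):
    """Global average pooling: one scalar per channel (assumed context descriptor)."""
    return [sum(ch) / len(ch) for ch in feat]


def linear(vec, weights):
    """Matrix-vector product; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def channel_gate(feat, w1, w2):
    """Two nonlinear mappings on the pooled descriptor (assumed: ReLU, then sigmoid)."""
    hidden = [max(0.0, h) for h in linear(gap(feat), w1)]
    return [1.0 / (1.0 + math.exp(-o)) for o in linear(hidden, w2)]


def scale(feat, weights):
    """Multiply every channel by its scalar weight."""
    return [[w * v for v in ch] for ch, w in zip(feat, weights)]


def mul(a, b):
    """Element-wise product of two feature maps with identical shape."""
    return [[x * y for x, y in zip(ca, cb)] for ca, cb in zip(a, b)]


def add(a, b):
    """Element-wise sum of two feature maps with identical shape."""
    return [[x + y for x, y in zip(ca, cb)] for ca, cb in zip(a, b)]


def cross_fuse(shallow, deep, w1, w2):
    joint = shallow + deep                       # channel concatenation
    reweighted = scale(joint, channel_gate(joint, w1, w2))
    a_s = reweighted[:len(shallow)]              # guidance tied to shallow channels
    a_d = reweighted[len(shallow):]              # guidance tied to deep channels
    out_s = add(shallow, mul(a_d, deep))         # deep semantics guide shallow
    out_d = add(deep, mul(a_s, shallow))         # filtered shallow detail refines deep
    return out_s + out_d                         # concatenated output


# Toy input: one channel per stage, two spatial positions, zero gate weights
# so that every channel weight equals sigmoid(0) = 0.5.
fused = cross_fuse([[1.0, 2.0]], [[3.0, 4.0]], w1=[[0.0, 0.0]], w2=[[0.0], [0.0]])
```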
2.6. Faster Inference Detection Head (FIDH)
Although the original detection head of YOLOv11 adopts a decoupled Depthwise Convolution (DWConv) structure [47] to reduce computational complexity to some extent, it still suffers from redundant inference computation and insufficient localization accuracy in UAV-based forest fire detection scenarios. To address these limitations, a forest-fire-specific Faster Inference Detection Head (FIDH) is designed, as illustrated in Figure 7. The proposed FIDH further enhances robustness against complex forest backgrounds, improves the discriminability and localization accuracy of flames and smoke, and simultaneously ensures real-time inference capability on edge devices.
During training, the Diverse Branch Block (DBB) enriches the feature space by aggregating four parallel branches. Through structural re-parameterization, these branches are mathematically fused into an equivalent single convolution for inference, which preserves an efficient single-path implementation while enabling the network to capture more fine-grained edge information across different receptive fields. This design effectively improves detection accuracy while introducing no additional runtime overhead. Specifically, let $X$ denote the input tensor and $Y$ represent the fused feature output of the DBB during training. The mathematical expression of this fusion process is given by:

$$Y = \sigma\left(\sum_{b=1}^{4} \mathcal{F}_b(X)\right)$$

where $\mathcal{F}_b$ denotes the $b$-th parallel branch (including standard convolution, average pooling, $1 \times 1$ convolution, and $1 \times 1$ + $k \times k$ convolution branches), and $\sigma$ stands for the activation function (SiLU in this study). The feature map after DBB re-parameterization is fed into the Convolution with Group Normalization (Conv-GN) module for channel compression and feature normalization. Its convolution process can be expressed as:

$$Z = W * Y$$

where $W$ denotes the convolutional kernel, which is used for lightweight channel fusion. Subsequently, group normalization (GN) is applied to the feature map output by the convolution. In forest fire detection tasks, the large size of input images makes large-batch training memory-intensive. Unlike Batch Normalization (BN), GN is independent of batch size and can maintain more stable behavior under the small-batch conditions commonly encountered on edge device platforms, which is beneficial for improving inference efficiency. Finally, the processed feature map is fed into the decoupled output layer and undergoes scale-adaptive adjustment via the Scale module.
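The batch-size independence of GN can be illustrated with a small pure-Python sketch. It is a simplified illustration under assumptions: learnable affine parameters are omitted, each sample is a list of channels with flattened spatial values, and `batch_norm` uses batch statistics as in training mode (at inference, BN normally relies on stored running statistics instead).

```python
import math


def group_norm(sample, num_groups, eps=1e-5):
    """Normalize one sample using statistics of each channel group only."""
    size = len(sample) // num_groups
    out = []
    for g in range(num_groups):
        group = sample[g * size:(g + 1) * size]
        vals = [v for ch in group for v in ch]
        mu = sum(vals) / len(vals)
        var = sum((v - mu) ** 2 for v in vals) / len(vals)
        out.extend([[(v - mu) / math.sqrt(var + eps) for v in ch] for ch in group])
    return out


def batch_norm(batch, eps=1e-5):
    """Training-mode BN: per-channel statistics computed over the whole batch."""
    num_channels = len(batch[0])
    out = [[None] * num_channels for _ in batch]
    for c in range(num_channels):
        vals = [v for sample in batch for v in sample[c]]
        mu = sum(vals) / len(vals)
        var = sum((v - mu) ** 2 for v in vals) / len(vals)
        for i, sample in enumerate(batch):
            out[i][c] = [(v - mu) / math.sqrt(var + eps) for v in sample[c]]
    return out


sample_a = [[1.0, 2.0], [3.0, 4.0]]
sample_b = [[10.0, 20.0], [30.0, 40.0]]

# GN output for sample_a does not depend on which other samples share the batch.
gn_alone = group_norm(sample_a, num_groups=2)
gn_in_batch = [group_norm(s, num_groups=2) for s in (sample_a, sample_b)][0]

# BN output for sample_a changes when the batch composition changes.
bn_alone = batch_norm([sample_a])[0]
bn_in_batch = batch_norm([sample_a, sample_b])[0]
```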
The FIDH enriches the feature space during training through the DBB, which enhances the representational capacity of convolution during inference. Meanwhile, the Conv-GN module leverages group normalization to maintain stable feature responses under the small-batch inference conditions commonly encountered on edge device platforms. This design reduces computational complexity while preserving effective sensitivity to flame and smoke targets, thereby meeting the real-time detection requirements in complex aerial scenarios.
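The principle behind structural re-parameterization, namely that parallel linear branches can be merged into one kernel, can be checked numerically with the sketch below. It is a minimal single-channel illustration under assumptions: batch normalization and bias terms inside the branches are omitted, the kernel size is fixed to 3 × 3, average pooling uses zero padding with the padded positions included in the average, and all kernel values are hypothetical.

```python
def conv2d_same(image, kernel):
    """Single-channel 2D cross-correlation with zero padding (odd square kernel)."""
    k = len(kernel)
    pad = k // 2
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    if 0 <= y < h and 0 <= x < w:
                        out[i][j] += kernel[di][dj] * image[y][x]
    return out


def add_maps(*maps):
    """Element-wise sum of equally sized 2D lists (feature maps or kernels)."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*maps)]


def scale_map(m, s):
    return [[s * v for v in row] for row in m]


# Hypothetical branch parameters.
k_std = [[0.5, -1.0, 0.25], [1.0, 2.0, -0.5], [0.0, 0.75, 1.5]]   # standard 3x3
k_1x1 = 0.8                                                       # 1x1 branch
k_avg = [[1.0 / 9.0] * 3 for _ in range(3)]                       # 3x3 average pooling
seq_1x1, seq_3x3 = 1.5, [[0.1, 0.2, 0.3], [0.0, -0.4, 0.0], [0.3, 0.2, 0.1]]

image = [[1.0, 2.0, 3.0, 4.0], [0.0, -1.0, 2.0, 5.0], [3.0, 1.0, 0.5, -2.0]]

# Training-time view: four branches evaluated separately and summed.
multi_branch = add_maps(
    conv2d_same(image, k_std),
    scale_map(image, k_1x1),
    conv2d_same(image, k_avg),
    conv2d_same(scale_map(image, seq_1x1), seq_3x3),
)

# Inference-time view: one merged 3x3 kernel applied once.
merged_kernel = add_maps(
    k_std,
    [[0.0, 0.0, 0.0], [0.0, k_1x1, 0.0], [0.0, 0.0, 0.0]],  # 1x1 placed at the center
    k_avg,
    scale_map(seq_3x3, seq_1x1),                             # 1x1 folded into 3x3
)
single_path = conv2d_same(image, merged_kernel)

max_diff = max(
    abs(a - b) for ra, rb in zip(multi_branch, single_path) for a, b in zip(ra, rb)
)
```

Because every branch is linear before the activation, `max_diff` is zero up to floating-point rounding, which is why the merged kernel adds no runtime cost at inference.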
2.7. The LAMP Strategy
Despite the architectural efficiency of PMEB, BCFM, and FIDH, residual computational redundancy remains during inference. Therefore, the LAMP strategy is further introduced to adaptively prune low-contribution parameters, enabling additional reductions in computation and memory costs and improving real-time deployability on resource-constrained edge device platforms. This method is based on the LAMP score, the calculation formula for which is as follows:
$$\mathrm{score}(u; W) = \frac{(W[u])^2}{\sum_{v \geq u} (W[v])^2}$$

where $W$ denotes the weight matrix, $u$ and $v$ are index variables over the weights of a layer sorted in ascending order of magnitude, and $\sum_{v \geq u}$ indicates summation over all indices $v$ from the current index $u$ to the end. This formula calculates the ratio of the squared weight at index $u$ to the sum of the squared weights from index $u$ onward, serving as a measure of the significance at this position.
To facilitate a clearer understanding of the overall LAMP strategy, the complete pruning procedure is illustrated in
Figure 8.
First, for the input network, the LAMP scores for each weight are calculated. Then, we check if the global sparsity constraint is satisfied. If not, we repeat the score calculation and pruning loop. If satisfied, we determine whether to iterate for further optimization: if yes, we loop back to the score calculation stage; if no, we proceed to the next step. Finally, we fine-tune the pruned model to recover its accuracy, forming a complete pruning pipeline from input to optimized output.
This process prunes low-impact parameters and redundant connections, enabling the model to focus on highly informative features. By adaptively retaining critical weights across layers, the LAMP strategy compresses the model effectively while preserving or even slightly boosting detection performance.
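A minimal sketch of the scoring and global selection steps is given below. It assumes unstructured, weight-level pruning on flattened layers and a single pruning pass; the iterative loop and the fine-tuning stage described above are not reproduced, and the function names and example weights are hypothetical.

```python
def lamp_scores(weights):
    """LAMP score of every weight in one layer (flat list).

    Each squared weight is divided by the sum of squared weights whose
    magnitude is at least as large (the suffix sum in ascending order).
    """
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    scores = [0.0] * len(weights)
    suffix = 0.0
    for i in reversed(order):          # visit the largest magnitude first
        suffix += weights[i] ** 2
        scores[i] = weights[i] ** 2 / suffix if suffix > 0 else 0.0
    return scores


def lamp_prune(layers, sparsity):
    """Zero the globally lowest-scoring fraction of weights across all layers.

    layers: dict mapping a layer name to its flat weight list.
    """
    scored = [
        (score, name, idx)
        for name, weights in layers.items()
        for idx, score in enumerate(lamp_scores(weights))
    ]
    scored.sort(key=lambda item: item[0])
    pruned = {name: list(weights) for name, weights in layers.items()}
    for _, name, idx in scored[:int(len(scored) * sparsity)]:
        pruned[name][idx] = 0.0
    return pruned


# Hypothetical two-layer example.
example_layers = {"a": [1.0, 2.0, 3.0], "b": [0.1, 0.2, 10.0]}
example_pruned = lamp_prune(example_layers, sparsity=0.5)
```

In this example the largest weight of each layer receives a score of 1 and therefore survives, which illustrates how the score allocates sparsity per layer without a hand-tuned layer-wise ratio.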
2.8. Evaluation Metrics
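To evaluate the performance of the proposed model, the PASCAL VOC [48] evaluation criteria were adopted, as they are widely accepted and commonly used standards in object detection. This criterion primarily uses mean Average Precision (mAP) as its key performance indicator. The calculation of mAP is based on the model’s Precision and Recall. The specific calculation method is as follows:

$$P = \frac{TP}{TP + FP}$$

$$R = \frac{TP}{TP + FN}$$

$$AP = \int_{0}^{1} P(R)\, dR$$

$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$

where $AP_i$ is the average precision of the $i$-th category and $N$ is the number of categories.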
In this study, the proposed forest fire detection model is assessed with multiple performance metrics. True Positives (TPs) denote correctly identified fire instances, False Positives (FPs) represent non-fire scenarios misclassified as fire, and False Negatives (FNs) indicate missed fire instances. Precision (P) measures detection accuracy, while Recall (R) reflects the ability to capture actual fires. The mAP provides an overall performance evaluation across categories. Beyond accuracy, model complexity and efficiency are assessed through the number of parameters and GFLOPs, with higher GFLOPs indicating greater computational cost. In addition, FPS is used to evaluate real-time capability, which is crucial for fire detection systems requiring rapid response.
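These metrics can be sketched in Python as follows. The sketch assumes that each detection has already been matched to the ground truth at a fixed IoU threshold and labeled as a true or false positive, and it uses all-point interpolation of the precision-recall curve as one common way to evaluate the AP integral; the function names are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(detections, num_gt):
    """Area under the precision-recall curve for one category.

    detections: list of (confidence, is_true_positive) pairs.
    num_gt: number of ground-truth boxes of this category.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Make precision monotonically non-increasing (precision envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """Mean of the per-category AP values (fire and smoke in this study)."""
    return sum(ap_per_class) / len(ap_per_class)


# Hypothetical detections for one category with two ground-truth boxes.
example_ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2)
```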
3. Experiments and Results
3.1. Experimental Setup
All experiments conducted in this study are implemented on a computational platform configured with an AMD Ryzen 9 7945HX CPU (integrated with Radeon Graphics, operating at 2.50 GHz). To ensure efficient execution of computationally intensive tasks, the platform is further equipped with an NVIDIA GeForce RTX 4060 GPU (8 GB memory) and 32 GB of RAM. For the model training process, the SGD algorithm is selected as the optimizer, with key parameters set as follows: a batch size of 16, a learning rate of 0.01, and a total training duration of 200 epochs. Additionally, all input images were uniformly resized to a resolution of pixels prior to training. This well-designed experimental setup provides a reliable foundation for validating the effectiveness of the proposed methodologies.
Notably, all comparison models are trained under the same experimental environment and identical hyperparameter settings as the proposed RLFNet to ensure fair and objective performance comparison.
3.2. Model Performance Evaluation
3.2.1. Comparison Experiments with Different Models
As shown in
Table 1, this study conducts a comprehensive comparison between RLFNet and mainstream object detection models, including lightweight variants of the YOLO series, RetinaNet, and RT-DETR, evaluating their overall performance across three core dimensions: detection accuracy, model efficiency, and inference speed. Here, the suffixes “F” and “S” in the metrics correspond to fire and smoke targets, respectively. RLFNet stands out among all compared models: with a compact model structure featuring only 1.9M parameters and 5.0 GFLOPs, it achieves the highest mAP50 (76.5%) and mAP50-95 (53.3%), fully demonstrating its outstanding detection capability in multi-scale fire and smoke target detection tasks. Even when compared with the latest YOLOv13n, RLFNet still exhibits significant advantages in various core detection metrics, and its inference speed of 224.8 FPS surpasses all compared models, highlighting its robust real-time response capability. In summary, RLFNet successfully achieves an optimal balance between low resource consumption, high real-time inference efficiency, and excellent detection accuracy, fully verifying its practical value and good applicability in real-time fire detection scenarios on edge devices.
3.2.2. Ablation Experiment
Ablations on the proposed methods. This study conducts systematic ablation experiments to rigorously verify the effectiveness of each proposed module and their synergistic effects, and performs independent complexity analysis on each module to validate the rationality of the lightweight design.
Experimental results are shown in
Table 2. Introducing PMEB alone significantly reduces the model parameters while effectively improving Precision, mAP50, and FPS, verifying the advantage of channel grouping and parallel multi-kernel convolution in enhancing the efficiency of multi-scale feature extraction. When introducing BCFM alone, although the parameters and computational complexity increase slightly, the model achieves the highest Precision of 78.2%, which fully demonstrates its ability to suppress background interference in complex scenarios. Introducing FIDH alone improves the inference speed by 24.1% to 202.6 FPS and increases mAP50 by 1.1%, reflecting its excellent performance in balancing localization accuracy and inference efficiency.
Although Recall decreases slightly when each module is used individually, integrating all three modules raises Recall to its highest value of 68.4%. Compared with the baseline model, mAP50 is improved by 3.5%, parameters and computational complexity are reduced by 3.9% and 9.5%, respectively, and FPS is increased by 31.7%. These results validate the effectiveness of multi-module collaborative optimization, which enables the model to achieve a better balance among detection accuracy, real-time performance, and lightweight design.
Ablation experiment results on different kernel settings in PMEB. To verify the rationality of the selected kernel combination in PMEB, this study explores different combination settings composed of 1 × 1, 3 × 3, and 5 × 5 kernel sizes, i.e., [1, 3], [1, 5], and [3, 5]. As shown in
Table 3, when the [1, 3] combination is used, the model becomes lighter and faster, but the small kernels restrict the receptive field and weaken the model’s ability to capture multi-scale object features, resulting in reduced accuracy. For the [1, 5] combination, the receptive field is expanded to a certain extent, but the extraction of detailed features and global information remains unbalanced, leading to limited accuracy improvement. The [3, 5] combination achieves the best balance between accuracy and efficiency, indicating that this kernel combination better matches the feature extraction requirements of the model and is the preferred choice for PMEB.
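The cost side of this comparison can be illustrated with a simple weight count. The sketch below contrasts a hypothetical multi-branch block, in which every kernel size is applied to all channels, with a hypothetical channel-split block, in which each kernel size is applied only to its own channel subset. The channel width of 64, the even split, and the use of standard bias-free convolutions are illustrative assumptions and do not reproduce the actual PMEB configuration.

```python
from typing import Sequence


def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weight count of a k x k convolution without bias."""
    return c_in * c_out * k * k


def multi_branch_params(channels: int, kernels: Sequence[int]) -> int:
    """Each branch convolves all channels with its own kernel size."""
    return sum(conv_params(channels, channels, k) for k in kernels)


def channel_split_params(channels: int, kernels: Sequence[int]) -> int:
    """Channels are split evenly; each group uses a single kernel size."""
    group = channels // len(kernels)
    return sum(conv_params(group, group, k) for k in kernels)


channels = 64  # hypothetical layer width, not taken from the paper
for kernels in ([1, 3], [1, 5], [3, 5]):
    full = multi_branch_params(channels, kernels)
    split = channel_split_params(channels, kernels)
    print(kernels, "multi-branch:", full, "channel-split:", split)
```

Under these assumptions, the channel-split variant needs one quarter of the weights of the multi-branch variant for every kernel pair, and [3, 5] is the most expensive of the three pairs, which is in line with the efficiency ordering described above.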
3.3. Experimental Results with Different Pruning Ratios
To explore the sensitivity of RLFNet to the LAMP strategy, the model is gradually compressed by varying the pruning rate. The experimental hyperparameters, including the number of pruning iterations and fine-tuning epochs, are determined through a series of exploratory experiments and conventional empirical settings widely adopted in network pruning, ensuring effective recovery of the detection performance of the pruned network.
As shown in
Table 4, the accuracy decreases gently when the pruning rate ranges from 1.1 to 1.4, but drops sharply beyond 2.0. This phenomenon reflects the uneven distribution of feature redundancy across different layers of the model, and also validates the idea of layer-adaptive pruning in LAMP: within a low pruning range, most removed channels are redundant shallow-layer channels with little impact on model performance, while excessive pruning damages critical sensitive layers, leading to a rapid decline in accuracy. Therefore, 1.1 is finally selected as the pruning rate in this paper, enabling RLFNet to achieve the optimal balance between detection accuracy and efficiency. The above results demonstrate that structural optimization combined with a reasonable pruning strategy can realize efficient forest fire detection without sacrificing accuracy.
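The layer-adaptive behavior described above can be illustrated with the LAMP score itself: within a layer, each weight is scored by its squared magnitude divided by the sum of squared magnitudes of all weights in that layer that are at least as large, and the globally lowest-scoring weights are removed first. The following is a minimal sketch on scalar weights with two hypothetical layers. It does not reproduce the channel-level pruning pipeline used in this study, and the `sparsity` argument is a plain fraction of removed weights, not the pruning rate reported in Table 4.

```python
def lamp_scores(weights):
    """Return LAMP scores aligned with the input order of `weights`."""
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    scores = [0.0] * len(weights)
    tail = 0.0  # sum of squares of all weights at least as large as the current one
    for i in reversed(order):  # walk from the largest magnitude to the smallest
        sq = weights[i] ** 2
        tail += sq
        scores[i] = sq / tail if tail > 0 else 0.0
    return scores


def global_lamp_prune(layers, sparsity):
    """Zero the globally lowest-scoring fraction of weights across all layers."""
    scored = [(s, name, i)
              for name, w in layers.items()
              for i, s in enumerate(lamp_scores(w))]
    scored.sort(key=lambda t: t[0])
    n_prune = int(len(scored) * sparsity)
    pruned = {name: list(w) for name, w in layers.items()}
    for _, name, i in scored[:n_prune]:
        pruned[name][i] = 0.0
    return pruned


# Hypothetical layers: one dominated by a single weight, one evenly distributed.
layers = {
    "shallow": [0.9, 0.05, 0.04, 0.03],
    "deep": [0.5, 0.4, 0.45, 0.35],
}
pruned = global_lamp_prune(layers, sparsity=0.5)
print(pruned)
```

In this toy case, three of the four removed weights come from the layer dominated by a single large weight, while the evenly distributed layer loses only one. The largest weight of every layer always receives a score of 1, so no layer is removed entirely.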
3.4. Experimental Visualization on the DFS Dataset
As shown in
Figure 9, the top row presents the detection results of the baseline YOLOv11n, while the bottom row shows those of RLFNet. In the forest fire scenario (a), YOLOv11n yields more false positives, whereas RLFNet detects a small fire spot that the baseline misses. In the multi-scale scenario (b), RLFNet covers targets of different sizes and boundary regions with high confidence. In the strongly disturbed scenario (c), RLFNet avoids confusing flame-like bright lights with actual fires, demonstrating better robustness to interference. Overall, RLFNet achieves more reliable fire localization under scale variation, background clutter, and small targets, delivering higher accuracy in complex forest environments and stable real-time detection under resource constraints.
Visualization of Heatmap Experiments
Grad-CAM is used to visualize the feature responses of RLFNet and YOLO series models, assessing their discriminative ability in fire detection. As shown in
Figure 10, YOLOv5n and YOLOv10n exhibit dispersed attention with significant background interference, especially in smoke-filled scenarios, resulting in limited focus on fire regions. YOLOv8n and YOLOv11n show improved attention but remain unstable, with hotspots occasionally spreading into non-fire areas. In contrast, RLFNet demonstrates concentrated and stable attention across all scenarios, effectively suppressing background noise and highlighting fire-related regions. Even under complex backgrounds and smoke interference, it maintains robust target focus. These results indicate that RLFNet offers superior feature extraction and anti-interference capability compared to mainstream YOLO models, consistent with the quantitative experiments, and underscore its practical value in complex forest fire environments.
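For reference, the core Grad-CAM computation behind such heatmaps can be sketched in a few lines: each feature map is weighted by the spatial average of its gradient, the weighted maps are summed, and negative responses are discarded. The sketch below works on plain nested lists. How the scalar score is chosen for a detector, and which layer supplies the activations, are not specified here and are left as assumptions. The toy inputs are invented for illustration.

```python
def grad_cam(activations, gradients):
    """Grad-CAM map from K feature maps and their gradients (K x H x W nested lists)."""
    h, w = len(activations[0]), len(activations[0][0])
    # Channel weight: global average of the gradient over all spatial positions.
    alphas = [sum(sum(row) for row in g) / (h * w) for g in gradients]
    # Weighted sum over channels followed by ReLU.
    cam = [[max(0.0, sum(a * act[y][x] for a, act in zip(alphas, activations)))
            for x in range(w)] for y in range(h)]
    # Scale to [0, 1] for display when the map is not all zero.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


# Toy example: channel 0 supports the score, channel 1 opposes it.
acts = [[[1.0, 0.0], [0.0, 0.0]],
        [[0.0, 0.0], [0.0, 2.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]],
         [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(acts, grads)
print(heatmap)
```

Only the location activated by the positively weighted channel remains in the map, which is the mechanism that makes dispersed or concentrated attention visible in the heatmaps.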
3.5. Generalization Experiment
Collecting large-scale, annotated forest fire imagery is costly, risky, and difficult. To address this, the publicly available M4SFWD dataset [
49], a multi-faceted synthetic dataset for remote sensing forest fire detection, provides diverse simulated fire scenarios across terrains, climates, time periods, illumination conditions, and fire scales, supporting systematic evaluation of the robustness and generalization ability of models.
As shown in
Table 5, in the generalization experiments with lightweight models, RLFNet demonstrates an exceptionally balanced and outstanding overall performance. With only 1.9M parameters and a computational complexity of 5.0 GFLOPs, it achieves the highest mAP50 (87.2%) while maintaining relatively high Precision and Recall, fully demonstrating its excellent detection accuracy in forest fire target recognition. Notably, RLFNet’s FPS (312.5) ranks first among all compared models, highlighting its superior inference speed. Overall, RLFNet still strikes an optimal balance between detection accuracy, inference efficiency, and model complexity in generalization experiments, making it highly suitable for deployment in resource-constrained scenarios like UAV-based forest fire detection.
In
Figure 11, the left and right sides show the results of RLFNet and YOLOv11n, respectively. The scenario involves complex elements such as blurred boundaries and multiple overlapping targets. Although YOLOv11n produces bounding boxes with relatively high confidence, RLFNet separates smoke from the background in the blurred edge areas and thus captures indistinct smoke accurately. The heatmaps show more directly that RLFNet focuses on key fire areas and suppresses background interference. This further supports the model’s generalization ability, as well as its adaptability and accuracy in complex forest environments.
3.6. Inference Performance Analysis Based on PyTorch v.2.12.0 and TensorRT
To validate the lightweight design and efficiency of RLFNet in real-world applications, the model is deployed through a TensorRT pipeline consisting of three steps. First, after training, the model and optimal weights are exported to ONNX format using the Ultralytics API v.8.3.12 (opset = 12, dynamic = True). Second, the ONNX file is parsed with NVIDIA TensorRT’s OnnxParser v.1.22.0 (NVIDIA, Santa Clara, CA, USA) and compiled into an optimized engine file (.engine) with a 1 GB workspace and FP16 precision for GPU acceleration. The input size is fixed at (1, 3, 640, 640) to ensure stable real-time inference. Finally, deployment is implemented using PyCUDA v.2026.1 and the TensorRT Runtime API, including preprocessing, GPU memory allocation, inference execution, and post-processing. Performance is evaluated by averaging 100 runs to obtain a stable FPS value. The experimental results show that RLFNet achieves 415.3 FPS after TensorRT deployment, approximately 29% faster than YOLOv11n (321.2 FPS) and approximately 4.2× and 3.6× the throughput of RetinaNet (98.7 FPS) and RT-DETR (115.7 FPS), respectively. These results confirm its real-time inference capacity and indicate its potential for edge computing applications, particularly on resource-constrained edge devices.
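The averaging procedure and the reported speed ratios can be reproduced with a short, framework-independent sketch. The helper names `measure_fps` and `speedup` are hypothetical, the warm-up phase is an added assumption that the text does not mention, and the callable passed in is assumed to block until the result is ready (for GPU inference this means synchronizing the stream before returning).

```python
import time


def measure_fps(infer, warmup=10, runs=100):
    """Average throughput of `infer` over `runs` calls after a warm-up phase."""
    for _ in range(warmup):  # warm-up is an assumption, not stated in the text
        infer()
    start = time.perf_counter()
    for _ in range(runs):
        infer()  # must block until the result is ready
    elapsed = time.perf_counter() - start
    return runs / elapsed


def speedup(fps_new, fps_ref):
    """Throughput ratio of a model relative to a reference model."""
    return fps_new / fps_ref


# FPS values as reported after TensorRT deployment.
reported = {"RLFNet": 415.3, "YOLOv11n": 321.2, "RetinaNet": 98.7, "RT-DETR": 115.7}
ratios = {name: round(speedup(reported["RLFNet"], fps), 2)
          for name, fps in reported.items() if name != "RLFNet"}
print(ratios)  # {'YOLOv11n': 1.29, 'RetinaNet': 4.21, 'RT-DETR': 3.59}
```

The ratios correspond to the roughly 29% gain over YOLOv11n and the approximately 4.2× and 3.6× figures for RetinaNet and RT-DETR.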
4. Discussion
To meet the lightweight requirements for deployable forest fire detection on edge devices, a real-time lightweight forest fire detection network, termed RLFNet, is proposed. Experimental results show that, compared with the YOLOv11n baseline, RLFNet improves mAP50 by 5.3% on the self-constructed dataset, while reducing the number of parameters and GFLOPs by 25.2% and 20.6%, and achieving an inference speed of 225 FPS. In addition, RLFNet also demonstrates robust generalization performance on the public remote-sensing wildfire dataset (M4SFWD). These results indicate that RLFNet achieves a better balance between accuracy and efficiency, validating its effectiveness and deployment potential for real-time forest fire detection on resource-constrained edge devices.
The superior accuracy–efficiency balance of RLFNet is mainly attributed to three key modules introduced on top of YOLOv11n: PMEB, BCFM, and FIDH. Specifically, (1) PMEB separates low-cost basic features from scale-sensitive features, and performs convolutions with different kernel sizes in parallel within specific channel subspaces, thereby avoiding the computational redundancy caused by traditional multi-branch designs and stacked multi-scale kernels [
24,
28,
50], and significantly improving multi-scale feature extraction efficiency; (2) BCFM is designed with a context-aware cross-gating mechanism to adaptively enhance fused features and better highlight fine-grained discriminative contextual information, overcoming the limitations of simple concatenation-based fusion [
51,
52,
53,
54], thereby strengthening responses in fire-related regions, suppressing background interference, and reducing false alarms; (3) FIDH uses structural re-parameterization to balance high-accuracy training and compact inference, and combines it with GN to improve performance stability under small-batch training. To some extent, this addresses the decline in localization efficiency caused by the introduction of a large number of parameters in existing detection heads [
55,
56,
57], thus effectively balancing inference speed and localization accuracy.
In addition, this study adopts LAMP as a lightweight strategy beyond structural design. Methods that reduce computational cost only by modifying YOLO modules [
17,
20,
21,
23,
58,
59,
60] often fail to fully remove internal parameter redundancy, leaving limited room for further efficiency improvement in edge deployment scenarios. In contrast, LAMP further compresses redundancy through parameter sparsification. As shown in
Table 4, the pruned RLFNet achieves a lower parameter count and computational complexity while its detection accuracy is further improved, demonstrating the effectiveness of the proposed pruning strategy. The key reason is that LAMP adaptively allocates sparsity according to inter-layer sensitivity, applying relatively conservative pruning to critical layers while imposing stronger constraints on redundant layers, thereby achieving a more robust accuracy–efficiency trade-off while reducing hyperparameter tuning costs.
Although encouraging results have been achieved, this study still has several limitations. First, in complex backgrounds, overlapping boundaries between flames and smoke may still lead to imprecise localization. Specifically, on the DFS test set, approximately 52% of localization errors with IoU < 0.5 are caused by the blurred boundaries and overlapping distribution of fire and smoke regions, which degrade the positioning accuracy of the detection boxes. In future work, edge-aware loss functions or contour refinement mechanisms could be introduced to enhance boundary representation. Second, early-stage smoke often occupies only a few pixels in an image; such targets are tiny, have weak features, are easily confused with the background, and are prone to partial occlusion in complex environments, which makes them the main challenge in forest fire detection. Results on the DFS test set show that the miss rate for small fire targets (e.g., width < 32 pixels) reaches 18.3%. Therefore, more accurate small-object detection strategies will be explored to address missed detection and occlusion, enabling effective identification of early-stage fires. This is crucial for reducing ecological damage and improving the efficiency of emergency response. Meanwhile, in future research, more comprehensive comparisons will be conducted with additional lightweight general-purpose object detection models and dedicated fire detection models to further verify the advancement and practicality of RLFNet. Finally, the experiments are conducted on an RTX 4060 GPU without actual embedded deployment; nevertheless, the TensorRT acceleration pipeline in
Section 3.6 ensures high compatibility and lays a solid foundation for future deployment on NVIDIA edge devices such as Jetson Nano. Real-world deployment and field tests in real forest scenarios will also be explored in future work.
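The two criteria used in this error analysis, IoU below 0.5 for a localization error and width below 32 pixels for a small target, can be made concrete with a minimal sketch. Boxes are assumed to be given as (x1, y1, x2, y2) pixel coordinates, and the function names and example boxes are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_localization_error(pred, truth, threshold=0.5):
    """A prediction that overlaps its target with IoU below the threshold."""
    return iou(pred, truth) < threshold


def is_small_target(box, max_width=32):
    """Small-target criterion by box width in pixels."""
    return (box[2] - box[0]) < max_width


truth = (0, 0, 20, 20)   # hypothetical ground-truth box
pred = (10, 0, 30, 20)   # hypothetical prediction shifted by half a box width
print(iou(pred, truth), is_localization_error(pred, truth), is_small_target(truth))
```

A prediction shifted by half the box width already falls to an IoU of one third, which shows how quickly blurred or overlapping boundaries can push a detection below the 0.5 threshold for small targets.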
Future work can be pursued in three directions. First, NMS-free detection paradigms can be explored to reduce duplicate predictions during training, for example by introducing a dual label assignment strategy that combines one-to-one and one-to-many matching, thereby further reducing inference latency. Second, the cross-domain generalization ability of the model can be further improved through domain adaptation. This includes training on richer synthetic data and expanding the training corpus with more forest fire images collected via UAVs, robots, and surveillance equipment, so as to better cover diverse real-world scenarios. Meanwhile, specific data augmentation strategies can be adopted, such as using cross-seasonal forest fire data and simulated scene data, to enhance the model’s adaptability to different scene distributions. Finally, multimodal fusion methods can be further investigated, such as combining visible-light images with near-infrared cues that reflect smoke absorption and scattering characteristics, so as to improve the model’s overall perception of complex forest fire scenes.
5. Conclusions
This study proposes a real-time lightweight forest fire detection network (RLFNet), which adopts a lightweight design for the backbone, neck, and detection head of YOLOv11, and further optimizes the model using the LAMP strategy, making it more suitable for deployment on edge devices. Experiments on our self-constructed dataset show that RLFNet improves mAP50 by 5.3% over the baseline model, while reducing parameters and GFLOPs by 25.2% and 20.6%, respectively, and achieving the fastest inference speed of 225 FPS. Overall, RLFNet outperforms competing methods and achieves a more effective balance between accuracy and efficiency. Furthermore, generalization experiments on the public remote-sensing wildfire dataset M4SFWD demonstrate that RLFNet also achieves the best performance among all compared models, indicating strong generalization capability. With TensorRT acceleration, the inference throughput reaches 415 FPS, confirming its suitability for real-time detection under strict computational budgets.
Overall, RLFNet demonstrates strong engineering application potential on edge platforms such as UAVs, robots, and surveillance cameras. Its high efficiency allows the high-precision fire detection algorithm to run stably for a long time on UAVs or field monitoring devices with limited computing resources and power supply, which is of great practical significance for early fire warning, ecological environment protection and economic loss reduction.