As demonstrated, the proposed model significantly outperforms the baseline architecture. Specifically, it achieves a precision of 87.7%, indicating a high proportion of correct positive predictions among all detections. The recall rises to 85.32%, reflecting the model’s strong ability to detect actual fire and smoke instances. The mAP, a comprehensive measure of detection quality, improves from 79.06% in the baseline to 83.92% in the proposed model, highlighting enhanced localization and classification accuracy. Furthermore, the F1-score, which balances precision and recall, improves from 80.13% to 84.18%, confirming the overall effectiveness and reliability of the modified architecture. These results validate that our enhancements, including the integration of the EfficientNetV2 backbone, ECA attention mechanism, and advanced preprocessing techniques, contribute meaningfully to better fire and smoke detection performance in complex, real-world forest scenarios.
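For reference, the precision, recall, and F1-score cited above follow their standard definitions from true-positive, false-positive, and false-negative counts at a fixed IoU threshold, while mAP additionally averages precision over recall levels and classes. The sketch below only illustrates these formulas; the counts are hypothetical and are not the confusion-matrix values behind the reported results.

```python
def detection_metrics(tp: int, fp: int, fn: int):
    """Standard detection metrics from per-class counts."""
    precision = tp / (tp + fp)          # fraction of detections that are correct
    recall = tp / (tp + fn)             # fraction of true instances that are found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts only, not the values used in our experiments:
p, r, f1 = detection_metrics(tp=900, fp=120, fn=150)
print(f"precision={p:.2%}, recall={r:.2%}, F1={f1:.2%}")
```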
Table 3 expands the evaluation by comparing the proposed model against a range of state-of-the-art (SOTA) object detection architectures, specifically the YOLOv5 through YOLOv9 families.
While each successive YOLO variant demonstrates incremental improvements in detection capability, the proposed model consistently achieves the highest scores across all evaluation metrics. For instance, YOLOv9s, one of the most recent and powerful YOLO versions, records precision, recall, mAP, and F1-score values of 95.06%, 93.12%, 91.01%, and 90.11%, respectively. However, our proposed model exceeds even these advanced benchmarks, achieving a precision of 97.01%, recall of 95.14%, mAP of 93.13%, and F1-score of 92.78%. These superior results not only demonstrate the competitive edge of our approach but also confirm its effectiveness in detecting forest fire and smoke under diverse and often challenging visual conditions such as dense vegetation, haze, or partial occlusion (Figure 4).
Comparison with SOTA Models
In this section, we perform a comprehensive comparison between the proposed YOLOv4-EfficientNetV2-ECA framework and fourteen recent state-of-the-art (SOTA) forest fire detection models. These models span a broad range of architectural innovations, including lightweight detectors, attention-augmented backbones, and multi-scale feature fusion mechanisms. The comparison evaluates model performance based on four standard metrics: precision, recall, mean average precision (mAP), and F1-score, as well as qualitative factors such as computational complexity and adaptability to real-time constraints. To validate the real-time applicability of the proposed system, we conducted runtime evaluations on an NVIDIA RTX 3060 GPU. The model achieved an average inference time of 27 milliseconds per frame, corresponding to approximately 37 frames per second (FPS). These results confirm that the proposed model satisfies the latency requirements of real-time deployment scenarios, such as UAV-based wildfire monitoring. Notably, it outperforms several SOTA models not only in detection accuracy but also in computational efficiency, as summarized in Table 4.
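For transparency about how such a latency figure can be obtained, the sketch below shows a typical GPU timing loop in PyTorch; the model, input resolution, and iteration counts are placeholders rather than the exact benchmarking protocol used in our experiments.

```python
import time
import torch

def measure_latency(model, input_size=(1, 3, 608, 608), warmup=20, iters=200,
                    device="cuda"):
    """Average per-frame inference time in milliseconds for single-image batches."""
    model = model.eval().to(device)
    x = torch.randn(*input_size, device=device)
    with torch.no_grad():
        for _ in range(warmup):           # warm-up to stabilize clocks and caches
            model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()          # wait for all queued kernels to finish
    ms = (time.perf_counter() - start) / iters * 1000.0
    return ms, 1000.0 / ms                # latency (ms) and frames per second
```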
Chen et al. [13] proposed a lightweight model emphasizing parameter reduction for mobile deployment, but its precision degraded in complex backgrounds. Li et al. [17] improved YOLOv4-tiny by adjusting anchor settings and introducing custom loss functions, achieving faster inference but lower accuracy in detecting smoke under partial occlusion. In contrast, our model retains the real-time advantages of YOLO while significantly enhancing feature selectivity and channel calibration through the use of EfficientNetV2 and ECA. Li et al. [18] introduced SMWE-GFPNNet, a multi-branch architecture with global context modules for smoke detection. While it achieved high mAP, its model complexity limits its use in edge deployment. Similarly, Wang et al. [19] proposed YOLOv5s-ACE, integrating attention-based context enhancement modules, and achieved competitive results on synthetic and real-world datasets. However, these models often rely on non-optimized attention mechanisms that increase inference time. In contrast, the ECA module in our design introduces minimal overhead while improving accuracy, as evidenced by our superior F1-score of 92.78% (Figure 5).
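For readers unfamiliar with the mechanism, ECA applies global average pooling followed by a lightweight 1D convolution across the channel descriptor and a sigmoid gate, avoiding the dimensionality reduction used by heavier attention blocks. The following is a generic PyTorch sketch of such a block, not the exact layer configuration inside our network.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: a 1D convolution over the pooled channel
    descriptor, with no dimensionality reduction; the kernel size is adapted
    to the channel count."""
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1                  # enforce an odd kernel size
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                          # x: (N, C, H, W)
        y = x.mean(dim=(2, 3))                     # global average pooling -> (N, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)   # local cross-channel interaction
        return x * self.sigmoid(y)[:, :, None, None]
```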
Chen and Wang [20] developed SmokeFireNet, which jointly detects flames and smoke using a lightweight two-branch architecture. Though computationally efficient, its detection accuracy in heterogeneous forest environments remained suboptimal compared to our model. Zheng et al. [25] employed a customized deep CNN but lacked attention modules, leading to challenges in recognizing weak fire signatures. Cheknane et al. [28] introduced a two-stage Faster R-CNN with hybrid feature extraction, excelling in accuracy but exhibiting significant latency, making it unsuitable for real-time applications. Our architecture achieves higher detection precision while preserving real-time feasibility (Figure 6).
Jandhyala et al. [29] utilized Inception-V3 in combination with SSD for fire detection in aerial imagery. While this model performs well in classification tasks, its object localization accuracy is inferior to that of YOLO-based frameworks. Wang et al. [30] proposed FFD-YOLO, an adaptation of YOLOv8 incorporating fire-specific modules. Despite improved robustness, the lack of lightweight channel attention limits its efficiency. In contrast, our proposed model effectively captures subtle patterns using ECA-enhanced EfficientNetV2, outperforming FFD-YOLO in both mAP and F1-score. Baskara et al. [31] and Wang et al. [32] both proposed YOLOv4-based architectures tailored for specific environments such as wetlands or hazy conditions. These models apply standard preprocessing techniques but do not include spectral enhancement filters such as pseudo-NDVI or CLAHE + Jet, which are integral to our approach; as a result, their generalization across variable fire conditions is reduced. Our preprocessing pipeline substantially increases feature salience, contributing to enhanced robustness (a minimal example is sketched below). Li et al. [33] designed Fire-Net for embedded platforms using UAV imagery, prioritizing speed over precision. Xu et al. [34] introduced CNTCB-YOLOv7, combining ConvNeXtV2 with CBAM to improve detection quality. While their use of CBAM introduces context awareness, it also increases training complexity. Our ECA module offers comparable performance benefits with significantly fewer parameters. Finally, Kong et al. [35] proposed an attention-based dual-encoding network using optical remote sensing inputs. Although effective in large-scale monitoring, the model is not optimized for near real-time detection, particularly under constrained computing environments. Our model’s ability to operate efficiently while maintaining high accuracy makes it more suitable for real-world, low-latency deployment.
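As a concrete illustration of the preprocessing filters mentioned above, the sketch below gives one plausible OpenCV rendering of CLAHE followed by a Jet colormap, together with a simple RGB-only pseudo-NDVI proxy; the parameter values and the exact index definition are assumptions for illustration, not necessarily the settings used in our pipeline.

```python
import cv2
import numpy as np

def clahe_jet(bgr: np.ndarray) -> np.ndarray:
    """CLAHE on the luminance channel, then a Jet colormap (illustrative settings)."""
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)), cv2.COLOR_LAB2BGR)
    gray = cv2.cvtColor(enhanced, cv2.COLOR_BGR2GRAY)
    return cv2.applyColorMap(gray, cv2.COLORMAP_JET)

def pseudo_ndvi(bgr: np.ndarray) -> np.ndarray:
    """RGB-only vegetation proxy (G - R) / (G + R), rescaled to 8-bit; a stand-in
    for NDVI when no near-infrared band is available (assumed formulation)."""
    b, g, r = cv2.split(bgr.astype(np.float32))
    index = (g - r) / (g + r + 1e-6)
    return cv2.normalize(index, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
```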
Table 4 summarizes the quantitative results of all 15 models (including our proposed model) in terms of the four main evaluation metrics. As shown, our architecture consistently outperforms other models, achieving the highest scores in all categories: precision (97.01%), recall (95.14%), mAP (93.13%), and F1-score (92.78%). These results validate the synergy of EfficientNetV2’s representational efficiency and ECA’s attention capabilities, making our approach a compelling candidate for operational forest fire surveillance systems.
To assess the individual contributions of EfficientNetV2 and the Efficient Channel Attention (ECA) module, we performed an ablation study comprising three variants of the detection framework: (1) YOLOv4 with EfficientNetV2 as the backbone but without ECA, (2) YOLOv4 with ECA applied to the original CSPDarknet53 backbone, and (3) the complete proposed model integrating both EfficientNetV2 and ECA (Table 5).
These results demonstrate that both components—EfficientNetV2 and ECA—contribute meaningfully to the performance improvements observed in the proposed model. EfficientNetV2 enhances representational efficiency, while ECA strengthens channel-wise feature calibration. Their combination results in the highest accuracy and robustness, confirming the architectural synergy.
To compare ECA directly against Squeeze-and-Excitation (SE) attention, we built an SE-based variant of the YOLOv4 + EfficientNetV2 framework in which the ECA modules were replaced by SE blocks inserted at the same MBConv stages, and both variants were trained and evaluated under identical settings. As described in Table 6, although the SE variant achieved good detection accuracy, it ran slower than expected because of the fully connected dimensionality-reduction layers it adds to every block. In other words, ECA achieved higher detection performance while preserving the faster inference speed.
These results validate the structural advantage of ECA in scenarios demanding both high accuracy and low-latency inference. We conclude that ECA offers a better balance of performance and efficiency, particularly in real-time applications such as UAV-based forest fire surveillance. While ECA was selected for its low computational cost and efficient channel recalibration, we note that, beyond SE, we did not provide a direct quantitative comparison with other attention mechanisms such as CBAM. These mechanisms offer different trade-offs between complexity and representational capacity and have been widely adopted in fire detection as well as other vision tasks. A prospective future direction is an extended ablation study that compares ECA with SE and CBAM in terms of accuracy, latency, and model size within a unified setting, in order to quantify their relative contributions.
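To make the overhead argument concrete, the following back-of-the-envelope sketch counts the extra learnable parameters each mechanism introduces per attention block; the SE reduction ratio (r = 4) and the ECA kernel size (k = 5) are assumptions chosen only for illustration.

```python
def se_params(channels: int, reduction: int = 4) -> int:
    """Two fully connected layers with dimensionality reduction (biases ignored)."""
    hidden = channels // reduction
    return channels * hidden + hidden * channels

def eca_params(kernel_size: int = 5) -> int:
    """A single shared 1D convolution kernel, independent of channel width."""
    return kernel_size

for c in (64, 256, 1024):
    print(f"C={c:4d}: SE adds {se_params(c):7,d} params, ECA adds {eca_params()}")
```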
To evaluate the model’s generalization in complex real-world scenarios, we designed three subsets of the test data, each representing a unique challenge: (a) low-light conditions (dusk/night), (b) fire regions obscured by smoke or fog, and (c) occluded flames or smoke. In Table 7, we report the proposed model’s results on these subsets.
Despite minor degradation in performance, the model demonstrates robust generalization in visually complex environments. This is attributed to the domain-aware preprocessing and channel attention mechanisms. However, we acknowledge that future work should involve systematic testing on benchmark datasets with controlled variations in environmental conditions to comprehensively validate real-world robustness.