1. Introduction
Forests constitute a vital component of the Earth’s ecosystems, playing a pivotal role in preserving biodiversity, regulating climate, and conserving soil and water. Nevertheless, wildfires pose a persistent and severe threat to these ecosystems. Conventional monitoring techniques, including ground patrols, observation from lookout towers, and satellite remote sensing, exhibit significant limitations: they are constrained by manpower, material resources, and geographical conditions, and they suffer from low detection accuracy and slow response times. With advances in science and technology, drones have been increasingly deployed for wildfire detection. Since the US Forest Service’s (USFS) inaugural deployment in 1961, drone-based forest fire monitoring has advanced substantially. Over several decades, a series of pivotal projects, such as Firebird 2001, the Wildfire Research and Applications Partnership (WRAP) project, NASA’s “Altair” and “Ikhana” (Predator-B) missions, and the First Response Experiment (FiRE) project, have demonstrated UAV capabilities in real-time imaging, data transmission, and operational deployment, validating their effectiveness in real-time disaster data collection [
1]. However, wildfire imagery captured from a drone perspective presents challenges including highly variable target scales, a high proportion of small targets, and severe background interference.
Figure 1 shows an example of a wildfire captured by a drone. As shown in
Figure 1a, the image contains multiple flame targets with significant scale differences;
Figure 1b indicates that flame targets account for a tiny proportion of the image during the early stages of a fire;
Figure 1c,d show that many background elements share similar features with fire targets, such as sunset and nighttime lights resembling flames, and cloud layers resembling smoke. These characteristics reduce wildfire detection precision and increase the likelihood of false alarms. Additionally, since drones are edge devices with limited computational resources, efforts to improve model accuracy must also account for parameter count and computational complexity.
Many scholars have conducted targeted research on the above issues. Addressing the problems of variable target scales in wildfires and the high proportion of tiny targets in the initial stages of a fire, Zhang et al. [
2] enhanced small target recognition through the BRA-MP downsampling module, achieving an AP of 87.0% on a self-built dataset, a 2.3% improvement over the original YOLOv7. Yan et al. [
3] proposed the dense pyramid pooling module MCCL, which improves the model’s small target detection capability by integrating feature maps of different scales, resulting in a 1.1% increase in mAP; Cao et al. [
4] designed the omni-dimensional dynamic convolution spatial pyramid pooling (OD-SPP) module, which dynamically adjusts convolution kernel weights to adapt to multi-view inputs, increasing the model’s mAP50 by 4%; Li et al. [
5] optimized the feature allocation strategy by combining BiFPN with the eSE attention module, significantly reducing the small object miss rate; Jia et al. [
6] enhanced the model’s proficiency in detecting tiny objects by introducing a multi-head attention mechanism; Ye et al. [
7] integrated the k-nearest neighbor (k-NN) attention mechanism into the Swin Transformer, markedly improving the model’s ability to detect small fire and smoke targets and achieving a bbox_mAP50 of 96.4%. Yuan et al.’s [
8] FF module employs three parallel decoding paths to effectively cover the spatial distribution differences from ground fires to crown fires; Zheng et al.’s [
9] GBFPN network fuses contextual information through bidirectional cross-scale connections and Group Shuffle convolutions, improving the mAP for object detection by 3.3%; Li et al. [
10] achieved an mAP50 of 80.92% on the D-Fire dataset by integrating the GD mechanism for multi-scale information fusion; Dai et al. [
11] replaced the backbone network of YOLOv3 with MobileNet; Zhao et al. [
12] augmented the YOLOv3 architecture with an additional detection layer to improve its ability to localize small targets. These studies provide important technical references for multi-scale wildfire detection.
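A common thread in these multi-scale designs is fusing backbone features across resolutions while retaining a high-resolution level for small objects. The following PyTorch sketch is a rough illustration of such a top-down, FPN-style fusion with an extra high-resolution (P2) level; the channel widths and map sizes are assumptions and do not correspond to any of the cited models.

```python
# Minimal, illustrative top-down fusion with an extra high-resolution (P2) level.
# Not the code of any cited work; sizes are assumed for demonstration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    def __init__(self, in_channels=(64, 128, 256, 512), out_channels=64):
        super().__init__()
        # 1x1 convs project each backbone level (P2..P5) to a common width
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # 3x3 convs smooth each fused map
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):
        # feats: backbone maps ordered from high resolution (P2) to low resolution (P5)
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # top-down pathway: upsample coarse maps and add them to finer ones
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        # the extra P2 output (index 0) keeps fine detail useful for small flames
        return [s(x) for s, x in zip(self.smooth, laterals)]

if __name__ == "__main__":
    feats = [torch.randn(1, c, s, s) for c, s in zip((64, 128, 256, 512), (160, 80, 40, 20))]
    outs = TinyFPN()(feats)
    print([o.shape for o in outs])  # four fused levels, including the high-resolution P2 map
```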
Addressing the challenges of complex forest environments and severe background interference, Yuan et al. [
8] enhanced flame region responses through position encoding in their FA mechanism, maintaining an 89.6% recall rate in hazy conditions; Zheng et al. [
9] embedded the BiFormer mechanism to filter key features via dynamic sparse attention, reducing false positive rates by 51.3% in tree canopy occlusion scenarios; Sheng et al. [
13] utilized dilated convolutions to construct a spatial attention map, improving smoke-target detection performance by 2.9%; Pang et al. [
14] introduced a coordinate attention mechanism, increasing recall to 78.69% in foggy scenarios; Yang et al. [
15] enhanced small object detection through a Coordinate Attention (CA) mechanism, improving mAP50 by 2.53%; Qian et al. [
16] adopted the SimAM attention mechanism, achieving an mAP50 of 92.10%, an improvement of 4.31% over the baseline. These attention mechanisms effectively address the issue of complex background interference.
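Most of these mechanisms follow the same pattern: learn “what” (channel) and “where” (spatial) weights and use them to rescale the feature map so that flame and smoke regions dominate over distractors. The sketch below is a minimal CBAM-style example in PyTorch for illustration only; the reduction ratio and kernel size are assumed values, and it is not the exact module of any cited work.

```python
# Minimal sketch of a CBAM-style attention block (channel attention followed by
# spatial attention); illustrative only.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling branch ("what")
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling branch
        return x * torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # channel-wise average and max describe "where" salient features are
        attn = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(attn))

class CBAM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ca, self.sa = ChannelAttention(channels), SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))

if __name__ == "__main__":
    print(CBAM(64)(torch.randn(2, 64, 40, 40)).shape)  # torch.Size([2, 64, 40, 40])
```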
Because drones and other edge devices have limited computational resources, researchers are focusing not only on accuracy but also on developing lightweight models. Zhang et al. [
17] further optimized the neck structure of YOLOv5s, replacing traditional components with the GC-C3 module and SimSPPF, reducing the number of parameters by 46.7%; Ma et al. [
18] introduced the lightweight cross-scale feature fusion module CCFM and the receptive field attention mechanism RFCBAMConv, reducing the number of parameters by 20.2% while maintaining an average accuracy of 90.2%; Yun et al. [
19] combined GSConv with VoV-GSCSP, reducing the computational complexity of the Neck layer by 30.6%; Sheng et al. [
13] replaced standard convolutions with GhostNetV2, achieving parameter compression through inexpensive convolution operations; Chen et al. [
20] used a lightweight initial network and parallel mixed attention mechanism (PMAM) to maintain a low parameter count while improving detection accuracy; Feng et al. [
21] employed GhostNetV2 and a hybrid attention mechanism to reduce the parameter count by 22.77%; Lei et al. [
22] used DSConv and GhostConv to reduce the model parameter count by 41%; Han et al. [
23] combined GhostNetV2 with multi-head self-attention to reduce the parameter count to 2.6 M while achieving 86.3% mAP. These lightweight designs provide feasible solutions for edge device deployment.
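A common thread in these lightweight designs is replacing dense convolutions with cheaper factorized operations. The snippet below is a rough illustration, not any cited implementation: it compares the parameter count of a standard 3×3 convolution with a depthwise-separable equivalent, the idea underlying MobileNet- and GhostNet-style blocks; the channel sizes are arbitrary examples.

```python
# Illustrative parameter-count comparison: standard vs. depthwise-separable convolution.
# Channel sizes are arbitrary example values.
import torch.nn as nn

def count_params(m):
    return sum(p.numel() for p in m.parameters())

c_in, c_out, k = 128, 128, 3

standard = nn.Conv2d(c_in, c_out, k, padding=1)          # dense 3x3 convolution
separable = nn.Sequential(
    nn.Conv2d(c_in, c_in, k, padding=1, groups=c_in),     # depthwise: one 3x3 filter per channel
    nn.Conv2d(c_in, c_out, 1))                            # pointwise: 1x1 channel mixing

print(count_params(standard))   # 128*128*3*3 + 128 = 147,584 parameters
print(count_params(separable))  # 128*3*3 + 128 + 128*128 + 128 = 17,792 parameters
```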
Despite significant progress from these approaches, most existing methods struggle to enhance small-target features while simultaneously accommodating large targets in drone wildfire detection scenarios, which is particularly problematic given the substantial variation in target scales and the high proportion of small objects. Furthermore, although some models incorporate attention mechanisms to mitigate false positives and false negatives caused by complex background interference, these mechanisms often incur additional computational cost, making them ill-suited to the resource constraints of drone platforms. To address these issues, this paper presents DFE-YOLO, a wildfire detection model for drone perspectives based on dynamic frequency-domain enhancement. First, the conventional three-scale detection structure is expanded to four scales: a dedicated small-object feature layer and a corresponding detection head are added, significantly enhancing the model’s ability to detect small-scale targets. Second, two core modules are designed for the feature fusion stage: (1) the lightweight Dynamic Frequency Domain Enhancement Module (DFDEM), which uses frequency-domain analysis to capture detail information overlooked by conventional spatial convolutions, and (2) the feature enhancement module C2f_CBAM, which integrates the CBAM attention mechanism into the C2f structure to dynamically weight key features. Finally, a multi-scale sparse sampling module (MS3) is constructed to substantially reduce computational complexity while preserving feature extraction capability, addressing the resource constraints of drone platforms. Experimental results demonstrate that DFE-YOLO achieves superior performance on two public datasets, the Multiple lighting levels and Multiple wildfire objects Synthetic Forest Wildfire Dataset (M4SFWD) and Fire-detection, attaining mAP50 scores of 88.4% and 88.0%, respectively, with a 23.1% reduction in parameters compared to the baseline model, thereby effectively balancing detection accuracy and computational efficiency. The proposed method combines lightweight design with high precision, enabling accurate identification of multi-scale fire sources and providing a reliable solution for wildfire detection by unmanned aerial vehicles.
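To make the frequency-domain idea concrete, the sketch below outlines one plausible way such an enhancement block could be organized in PyTorch: features are transformed with a 2-D FFT, channels are mixed per frequency bin by a learned transform, and the result is added back through a dynamically predicted gate. It is a simplified conceptual illustration under our own assumptions about layer sizes and gating, not the exact DFDEM implementation.

```python
# Conceptual sketch of frequency-domain feature enhancement with a dynamic gate.
# Simplified illustration; not the exact DFDEM design.
import torch
import torch.nn as nn

class FreqEnhance(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # 1x1 conv applied in the frequency domain: mixes channels independently at every
        # frequency bin (real and imaginary parts are stacked along the channel axis)
        self.freq_mix = nn.Conv2d(channels * 2, channels * 2, kernel_size=1)
        # dynamic gate: predicts per-channel weights controlling how strongly the
        # frequency branch is added back to the spatial features
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid())

    def forward(self, x):
        h, w = x.shape[-2:]
        spec = torch.fft.rfft2(x, norm="ortho")              # (B, C, H, W//2+1), complex
        z = torch.cat([spec.real, spec.imag], dim=1)          # (B, 2C, H, W//2+1), real
        z = self.freq_mix(z)
        real, imag = z.chunk(2, dim=1)
        enhanced = torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm="ortho")
        return x + self.gate(x) * enhanced                     # dynamically weighted residual

if __name__ == "__main__":
    y = FreqEnhance(64)(torch.randn(2, 64, 80, 80))
    print(y.shape)  # torch.Size([2, 64, 80, 80])
```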
4. Discussion
This study proposes DFE-YOLO, a drone-based wildfire detection model that builds on YOLOv8 and incorporates dynamic frequency-domain enhancement. By introducing an additional small object detection layer, it substantially improves detection of small-scale targets. The lightweight Dynamic Frequency Domain Enhancement Module (DFDEM) and the target feature enhancement module C2f_CBAM strengthen the model’s ability to extract features from targets of varying scales and to resist interference from complex backgrounds, while the lightweight Multi-Scale Sparse Sampling Module (MS3) effectively reduces the parameter count. Ablation studies validate the contribution of each module to drone-based wildfire detection. On the M4SFWD dataset, DFE-YOLO achieves 88.4% mAP50 and 50.6% mAP50-95 with only 2.3 million parameters, a 23.1% reduction relative to the YOLOv8 baseline and the lowest parameter count among the compared models. Compared with models such as YOLOv10, Hyper-YOLO, and Drone-YOLO, it delivers superior detection accuracy while achieving a better balance between precision and parameter count.
At the practical application level, DFE-YOLO’s lightweight design enables deployment on embedded airborne hardware platforms with limited computational power. When paired with a Jetson Xavier (produced by NVIDIA Corporation, Santa Clara, CA, USA) on typical drone platforms such as the DJI M300 RTK (manufactured by SZ DJI Technology Co., Ltd., Shenzhen, China), it can meet real-time monitoring requirements. Given typical drone endurance times (25–35 min), the system can continuously monitor approximately 10–15 square kilometers of forest during a single flight, identifying and localizing fires in their early stages and thereby significantly improving firefighting response efficiency. However, the model’s current computational complexity (11.2 GFLOPs) remains demanding for sustained high-performance inference on airborne platforms, and on hardware with tighter power budgets, such as the Jetson Nano (developed by NVIDIA Corporation, Santa Clara, CA, USA), frame rates may drop. Subsequent work will focus on optimizing model parallelism and hardware utilization and on developing dynamic inference mechanisms that adapt the computation path to image complexity, reducing computational overhead in simple scenarios so that the model remains lightweight while inference speeds improve further.
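As a rough back-of-envelope illustration of why frame rate falls on more constrained hardware, achievable throughput can be approximated by dividing a device’s sustained compute by the per-frame cost. In the snippet below, only the 11.2 GFLOPs per frame is taken from this work; the sustained-throughput values are arbitrary illustrative points, not specifications of any particular Jetson device.

```python
# Back-of-envelope frame-rate estimate: fps ≈ sustained device throughput / per-frame cost.
# Only the 11.2 GFLOPs per frame is reported above; throughput values are illustrative.
model_gflops_per_frame = 11.2

for sustained_gflops in (100, 300, 1000):   # hypothetical sustained compute budgets
    fps = sustained_gflops / model_gflops_per_frame
    print(f"sustained {sustained_gflops:4d} GFLOPS -> ~{fps:4.0f} FPS")
```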