1. Introduction
The Earth’s forest ecosystems provide irreplaceable ecological services. Their core value lies in supporting the fundamental habitats that shape biodiversity patterns, participating in the dynamic regulation of the global carbon cycle, acting as carbon sinks in climate regulation, strengthening soil structure stability through root networks, reducing the erosive effects of surface runoff on soil matrices, and forming a crucial replenishment mechanism for regional water cycles. However, forest fires threaten forests by destroying vegetation, reducing biodiversity, worsening global warming, and releasing harmful smoke and particles that impact human health. Therefore, protecting forests and preventing fires have become urgent issues that need to be addressed globally.
Conventional fire detection methods, including smoke, heat, and infrared detection, are mainly intended for identifying indoor and outdoor fire indicators. Smoke and heat sensors detect changes in smoke concentration and temperature, but their operating conditions are restricted, making them more suitable for indoor use. Infrared sensors transform reflected infrared energy into electrical impulses to trigger outdoor fire alerts. For instance, Le et al. [
1] introduced a system to decrease false alarms in traditional manual detection processes. However, these technologies are prone to environmental interference, have limited range, and result in high costs and slow response times, restricting their use in complex environments. Advancements in computer vision and fire detection have shifted the focus from sensor-based to image-based detection, offering faster, more accurate wildfire detection with enhanced perception and positioning capabilities, making this a leading fire detection method.
Optical remote sensing is crucial for extensive surveillance, such as tracking forest fires. Hao et al. [
2] introduced YOLO-SR, a method that fuses super-resolution images with deep learning for infrared small target detection, enhancing input image quality and network structure to boost the detection of faint targets, achieving 95.2% mAP@0.5. Wang et al. [
3] created the Smoke-Unet model, pinpointing RGB, SWIR2, and AOD spectral bands as the most crucial for detecting smoke in Landsat-8 imagery. Ding et al. [
4] proposed FSF Net, using Mask-RCNN-ResNet50-FPN for semantic segmentation and XGBoost for thresholding in forest fire smoke detection with MODIS data. Jin et al. [
5] created a U-Net model incorporating physical local context and global index methods for accurate fire detection, and introduced YOLOv5 to filter false positives, proposing an adaptive fusion algorithm for improved robustness, stability, and generality. Chen et al. [
6] proposed the Att Mask R-CNN model for tree canopy detection and segmentation in remote sensing images. This model outperformed Mask R-CNN and MS R-CNN, achieving an mAP of 65.29%, mIoU of 80.44%, and an overall recognition rate of 90.67% on six tree-species datasets. The study also utilized segmentation masks to count pixels and estimate the vertical projection area of tree canopies.
At present, the monitoring of forest fires mainly relies on camera detection technology. This technology captures real-time images of forest areas through cameras installed at high altitudes, and analyzes the images using image processing and recognition algorithms to detect and warn of potential fire risks in a timely manner. Abdusalomov et al. [
7] proposed an improved YOLOv3 approach for fire detection and categorization. By refining the algorithm and integrating a real-time camera system, they realized precise, all-weather fire detection, identifying 1 m × 0.3 m fire spots up to 50 m away. Lin et al. [
8] proposed LD-YOLO for early forest fire detection, incorporating techniques like GhostConv, Ghost-DynamicConv, DySample, SCAM, and a lightweight self-attention detection head for high accuracy and speed, and low complexity. Geng et al. [
9] introduced YOLOv9-CBM, optimizing YOLOv9 for fewer false alarms and missed detections. Utilizing the new CBM Fire dataset with 2000 images, they enhanced detection with SE Attention, BiFPN, and MPDIoU loss. YOLOv9-CBM boosted recall by 7.6% and mAP by 3.8%. Yun et al. [
10] presented FFYOLO to address low accuracies and high computational costs, using CPDA for feature extraction, a hybrid classification–detection head (MCDH) for accuracy–speed balance, MPDIoU loss for better regression and classification, and GSConv for parameter reduction, with knowledge distillation for improved generalization. He et al. [
11] introduced DCGC-YOLO based on an improved YOLOv5, incorporating dual-channel grouped convolution bottlenecks and IoU K-means clustering for anchor box optimization. Liu et al. [
12] proposed CF-YOLO based on YOLOv7, improving the SPPCSPC module for better small-target detection, combining C2f with DSConv for faster inference. Khan et al. [
13] introduced the FFireNet model, leveraging MobileNetV2 for transfer learning and appending a fully connected layer for image classification, thereby enhancing the precision of forest fire detection and lowering the incidence of false positives.
Compared to cameras installed in fixed locations, unmanned aerial vehicle (UAV) platforms demonstrate the following advantages in fire detection: Firstly, UAVs offer flexibility and mobility; they can be dispatched to designated terrain areas and provide real-time fire information. Secondly, they can be deployed quickly, without being limited by installation location like fixed cameras. Thirdly, the compact size of drones and their ability to carry multiple sensors can greatly expand the monitoring field of view, covering vast and remote forest areas that are difficult to reach manually, thereby providing comprehensive and detailed views of fire scenes. Finally, UAVs have low economic costs, and significantly save manpower and resources compared to traditional monitoring methods, improving monitoring efficiency and cost-effectiveness.
In order to make their model lightweight, the following researchers used different improvement methods. Xiao et al. [
14] developed the FL-YOLOv7 lightweight model to overcome the computational constraints in UAV-based forest fire detection. They integrated C3GhostV2, SimAm, ASFF, and WIoU techniques to reduce parameters and computational burden, simultaneously improving the precision and efficiency of detecting small targets. Small target fires are easily masked by background noise in UAV forest images, making recognition difficult. Meleti et al. [
15] proposed a model for detecting obscured wildfires using RGB cameras on drones, which integrated a pre-trained CNN encoder and a 3D convolutional decoder, achieving a Dice score of 85.88%, precision of 92.47%, and accuracy of 90.67%. Chen et al. [
16] introduced the YOLOv7-based lightweight model LMDFS. By replacing standard convolution with Ghost Shuffle Convolution (GSConv) and creating GSELAN and GSSPPFCSPC modules, their model reduced parameters, sped up convergence, and achieved lightweight performance, with strong results. Zhou et al. [
17] created a forest fire detection technique employing the compact YOLOv5 model. This method substitutes YOLOv5’s backbone with the lightweight MobileNetV3s and incorporates a semi-supervised knowledge distillation algorithm to reduce model memory and enhance detection accuracy. Liu et al. [
18] introduced the FF Net model to tackle issues like the interference and high false positive and negative rates in forest fire detection. It integrates attention mechanisms based on the VIF net architecture. By improving the M-SSIM loss function, it better processes high-resolution images and highlights flame areas. Yang et al. [
19] proposed an advanced model to address the issue of identifying smoke from forest fires in unmanned aerial vehicle imagery. The model utilizes K-means++ to optimize anchor box clustering, PConv technology to improve network efficiency, and introduces a coordinate attention mechanism and small object detection head to identify small-sized smoke. Shamta [
20] constructed a deep-learning-based surveillance system for forest fires, utilizing drones with cameras for image capture, YOLOv8 and YOLOv5 for object detection, and CNN-RCNN for image classification. To improve the smoke detection accuracy in unmanned aerial vehicle forest fire monitoring, researchers have proposed various enhancement methods. Saydirasulovich et al. [
21] improved the YOLOv8 model. Through the incorporation of technologies like GSConv and BiFormer, the model’s detection accuracy was enhanced, thereby significantly improving its ability to detect forest fire smoke. Choutri et al. [
22] proposed a fire detection and localization framework based on drone images and the YOLO model, using YOLO-NAS to achieve high-precision detection, and developed a stereo-vision-based localization method. The model achieved a score of 0.71 for mAP@0.5 and an F1 score of 0.68. Luan et al. [
23] presented an enhanced YOLOX architecture that includes a multi-tiered feature extraction framework, CSP-ML, and CBAM attention module. This upgrade aimed to refine feature detection in intricate fire scenarios, reduce interference from the background, and integrate an adaptive feature extraction component to preserve critical data during the fusion process, thereby bolstering the network’s learning proficiency. Niu et al. [
24] introduced FFDSM, a refined YOLOv5s Seg model that enhances feature extraction in intricate fire scenarios by employing ECA and SPPFCSPC. ECA captures key features and adjusts weights, improving target expression and discrimination. SPPFCSPC extracts multi-scale features for comprehensive perception and precise target localization.
Although UAVs provide significant advantages in forest fire detection, they face a number of technical hurdles. Firstly, unlike traditional images, the images captured by UAVs may appear blurry, due to limitations such as flight altitude and weather conditions, making it difficult to provide sufficiently clear details [
25,
26] when dealing with small-sized target objects, which poses obstacles for early detection and accurate assessment of fires. Secondly, UAV images often feature closely packed objects that merge together, leading to occlusion and increasing detection complexity. Finally, UAVs require models to be as lightweight as possible while maintaining high detection accuracy, in order to adapt to their limited computing resources and storage capabilities.
In response to the aforementioned issues, a lightweight UAV forest fire detection model utilizing YOLOv8 is proposed. Building upon earlier findings, this article presents a series of enhancements, which are outlined below:
- (1)
To enhance detection precision, we propose the MLCA approach, which fortifies the model’s capability to discern local intricacies and global configurations, while lessening the computational burden, suitable for intricate applications such as object detection.
- (2)
To enhance the processing of small targets and low-resolution images, we introduce SPDConv. Its core idea is to preserve fine-grained information during downsampling by replacing the traditional stride convolution and pooling, improving object detection and image analysis performance.
- (3)
To reduce the number of parameters and boost computational efficiency, we introduce the C2f-PConv technique. By integrating partial convolution (PConv) within the C2f module, PConv operates on a subset of the input channels for spatial feature extraction, which effectively enhances efficiency by reducing unnecessary computations and memory usage.
- (4)
To enhance detection capabilities for tiny objects and those subject to occlusions, we present the W-IoU method. W-IoU refines the detection of small objects and the recognition of low-quality images by adaptively modifying the weight allocation of target boxes within the loss computation.
2. Methods
2.1. YOLOv8
Building on the previous background, this paper chose YOLOv8 as the base model. Introduced by the Ultralytics group in 2023, YOLOv8 represents a state-of-the-art model for object detection, with its fundamental structure depicted in
Figure 1. It introduced several optimizations over previous versions of YOLO, aiming to provide higher detection accuracy and faster inference speed. Unlike earlier YOLO versions, YOLOv8 uses an anchor-free object detection method. The model forecasts the center, width, and height of objects directly, rather than using anchor boxes for estimating object positions and dimensions. This approach streamlines calculations, decreases the dependency on anchor boxes, and improves both the pace and precision of detection.
Figure 1 depicts the YOLOv8 framework, which comprises four primary elements: the Input layer, Backbone, Neck, and Head. The Input layer handles image preprocessing, such as scaling and data augmentation. The Backbone network extracts image features utilizing convolutional layers (Conv), C2f blocks, and the Spatial Pyramid Pooling Fast (SPPF) block. The C2f module enhances feature representation by utilizing inter-layer connections. Within the Neck component, YOLOv8 merges the advantages of the Feature Pyramid Network (FPN) with the Path Aggregation Network (PAN) to refine feature extraction. Finally, the Head section employs a decoupled architecture to independently handle object classification and detection responsibilities. YOLOv8 further refines its loss function by incorporating Binary Cross Entropy (BCE) for classification tasks. For regression purposes, it employs Distribution Focal Loss (DFL) alongside Complete Intersection over Union Loss (CIoU). These innovations bolster the model’s efficacy in dealing with complex situations. In summary, YOLOv8 boosts accuracy and speed, while streamlining the network architecture and adopting an anchor-free detection strategy, rendering it more proficient in tackling difficult object detection challenges. Despite YOLOv8’s notable achievements in object detection, it does encounter certain constraints in specialized scenarios. For example, in crowded environments or with severe occlusion, the performance of YOLOv8 may decrease, because its label assignment mechanism and loss function design leave the model less robust in such situations. In addition, YOLOv8 also faces challenges when processing low-resolution images or small targets. Conventional convolution and pooling methods tend to discard detailed information during downsampling, impacting the model’s proficiency in detecting fine details. We have made improvements to the model to address the aforementioned issues.
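To make the loss design described above concrete, the sketch below shows how a BCE classification term and a CIoU regression term can be combined for an anchor-free head. This is not the Ultralytics implementation; the function name, tensor layout, and the gain values are illustrative assumptions, and the DFL term is omitted for brevity.

```python
# Minimal sketch: BCE classification loss + CIoU box loss for an anchor-free head.
# Assumes matched predictions/targets: class logits with float one-hot targets,
# and boxes in (cx, cy, w, h) format. Gain values are illustrative only.
import torch
import torch.nn.functional as F
from torchvision.ops import box_convert, complete_box_iou_loss

def detection_loss(pred_cls, target_cls, pred_boxes, target_boxes,
                   box_weight=7.5, cls_weight=0.5):
    # Classification branch: binary cross-entropy over class logits.
    cls_loss = F.binary_cross_entropy_with_logits(pred_cls, target_cls)
    # Regression branch: CIoU loss on boxes converted to (x1, y1, x2, y2).
    ciou_loss = complete_box_iou_loss(
        box_convert(pred_boxes, "cxcywh", "xyxy"),
        box_convert(target_boxes, "cxcywh", "xyxy"),
        reduction="mean",
    )
    return cls_weight * cls_loss + box_weight * ciou_loss
```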
2.2. Mixed Local Channel Attention
To enhance detection accuracy without adding computational complexity, we introduce Mixed Local Channel Attention (MLCA) [
27,
28]. MLCA [
29] is a lightweight attention mechanism that integrates local spatial and channel information. This approach employs a two-phase pooling combined with 1D convolution, which diminishes computational expenses, while enhancing the model’s capability to encapsulate both local and global characteristics, suited for tasks such as object recognition.
The design objectives of MLCA were to avoid accuracy loss when reducing the channel dimension, while addressing the high computational resource demands of traditional attention mechanisms. Compared to global attention, MLCA combines local pooling and 1D convolution, reducing the need for extensive computations across the entire feature map. This not only conserves resources, but also avoids increasing the network’s parameter count and computational burden. The mechanism improves the representation of key regions, enabling the model to focus on crucial details, offering an efficient and accurate solution for resource-limited environments like embedded systems or mobile devices.
Another key advantage of MLCA is its incorporation of an attention mechanism during information fusion. In each branch, MLCA extracts both global and local features through 1D convolution, then combines these two types of information using a weighting factor of 0.5. The integrated features are restored to their original resolution via unpooling and are then multiplied with the initial input to generate the final output. This merging technique enhances the model’s capacity for information integration and feature representation, as well as boosting the distinguishing capabilities of features. It is especially useful in complex object detection tasks, significantly boosting accuracy, particularly for detecting small or occluded objects.
As illustrated in
Figure 2a, the input features of MLCA undergo a two-stage pooling process. During the initial phase, local spatial details are gathered through local average pooling, resulting in a feature map of dimensions $C \times k \times k$ (with $C$ denoting the channel count and $k$ representing the pooling window’s dimensions). Subsequently, the feature map is split into two streams: one dedicated to extracting global context, and the other to maintaining local spatial information. Both branches undergo 1D convolution and unpooling, and are then fused to create the mixed attention information, as shown in
Figure 2b. In this phase, the kernel size k is linked to the number of channels
C. This configuration enables MLCA to capture local cross-channel interaction information, which is especially useful for detecting small objects. The kernel size $k$ is selected adaptively from the channel count $C$ according to the following formula:
$$k = \left| \frac{\log_2(C)}{\gamma} + \frac{b}{\gamma} \right|_{odd}$$
Here, $\gamma$ is a hyperparameter, typically set to 2, $b$ is the offset of the result, and $|\cdot|_{odd}$ denotes rounding to the nearest odd number.
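For illustration, the following is a minimal PyTorch sketch of an MLCA-style block under the description above: local average pooling to a small grid, a global branch and a local branch sharing an ECA-style 1D channel convolution, fusion with equal 0.5 weights, and restoration to the input resolution. The `local_size` value and other details are assumptions, not the reference implementation.

```python
# A minimal sketch of a Mixed Local Channel Attention (MLCA)-style block.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def adaptive_kernel_size(channels: int, gamma: int = 2, b: int = 1) -> int:
    """k = |log2(C)/gamma + b/gamma|, rounded up to the nearest odd number."""
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    return t if t % 2 == 1 else t + 1

class MLCASketch(nn.Module):
    def __init__(self, channels: int, local_size: int = 5):
        super().__init__()
        k = adaptive_kernel_size(channels)
        self.local_size = local_size
        # Shared ECA-style 1D convolution over the channel dimension.
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def _channel_conv(self, t):
        # Apply the 1D channel convolution independently at each spatial cell.
        n, c, h, w = t.shape
        t = t.flatten(2).permute(0, 2, 1).reshape(n * h * w, 1, c)  # (N*HW, 1, C)
        t = self.conv(t)
        return t.reshape(n, h * w, c).permute(0, 2, 1).reshape(n, c, h, w)

    def forward(self, x):
        _, _, h, w = x.shape
        local = F.adaptive_avg_pool2d(x, self.local_size)  # local branch (B, C, k, k)
        glob = F.adaptive_avg_pool2d(x, 1)                 # global branch (B, C, 1, 1)
        # Fuse the two branches with equal 0.5 weights (global broadcasts over the grid).
        mixed = 0.5 * self._channel_conv(local) + 0.5 * self._channel_conv(glob)
        attention = torch.sigmoid(mixed)
        attention = F.interpolate(attention, size=(h, w), mode="nearest")  # "unpool"
        return x * attention
```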
2.3. Space-to-Depth Convolution
To improve the handling of small objects and images with low resolution, we utilize the SPDConv technique [
30]. SPDConv (Space-to-Depth Convolution), introduced by Sunkara et al. [
31], is a novel CNN module aimed at improving traditional CNN performance in handling small objects and low-resolution images. This technique is substituted for traditional strided convolutions and pooling, preventing the degradation of detailed information during the downsampling process. Such an approach significantly improves object detection and image analysis, being particularly beneficial for images with low resolution and minute objects.
In traditional CNNs, excessive downsampling (e.g., through pooling or convolutions with strides greater than 1) reduces resolution, affecting the model’s responsiveness to minor objects. SPDConv addresses this by replacing strided convolutions and pooling layers, preventing excessive downsampling and preserving more spatial information, making it particularly effective for low-resolution images.
As illustrated in
Figure 3, the architecture of SPDConv comprises two primary elements: the Space-to-Depth (SPD) block and a convolution layer without stride. The process can be outlined in these primary stages:
2.3.1. Space-to-Depth (SPD) Layer
The key operation of the SPD layer is to convert the spatial information of the input feature map into depth information. This process involves dividing the feature map into multiple sub-feature maps, and then merging them along the channel axis, which decreases the spatial dimensions, while boosting the channel count. For example, with an input feature map $X$ of dimensions $(S, S, C_1)$, the SPD layer converts it to a new feature map $X'$ with dimensions $\left(\frac{S}{scale}, \frac{S}{scale}, scale^2 C_1\right)$. This transformation can be represented by the following formula:
$$f_{i,j} = X[i:S:scale,\; j:S:scale], \quad i, j \in \{0, 1, \dots, scale-1\}$$
$$X' = \mathrm{Concat}\left(f_{0,0}, f_{1,0}, \dots, f_{scale-1,scale-1}\right)$$
Here, $scale$ is the downsampling factor. Figure 3c shows an example where the downsampling factor is set to 2: the SPD layer slices the feature map into four sub-feature maps $f_{0,0}$, $f_{1,0}$, $f_{0,1}$, $f_{1,1}$, each with the shape $\left(\frac{S}{2}, \frac{S}{2}, C_1\right)$. Figure 3d represents the resulting new feature map $X'$ with dimensions $\left(\frac{S}{2}, \frac{S}{2}, 4C_1\right)$.
2.3.2. Non-Strided Convolution Layer
After the SPD layer, SPDConv applies a convolutional layer with a stride of 1. Unlike conventional convolutions with strides greater than 1, the non-strided convolution effectively avoids information loss, while further extracting features through learned parameters. Following the convolutional layer, the channel count in the feature map is decreased from $scale^2 C_1$ to $C_2$, with the spatial resolution staying constant at $\frac{S}{scale} \times \frac{S}{scale}$.
Conventional convolution methods (including strided convolutions and pooling layers) usually lead to the sacrificing of detailed data, particularly in the handling of low-resolution imagery or tiny objects. By converting spatial information to depth, SPDConv decreases the spatial dimensions while increasing the channel depth, retaining more data and lowering the chance of data loss. For small object detection, SPDConv improves the identification of small items by shrinking spatial dimensions and augmenting the channel depth.
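As a concrete illustration, the sketch below implements an SPDConv-style block under the formulation above: a space-to-depth rearrangement with factor `scale`, followed by a stride-1 convolution mapping $scale^2 C_1$ channels to $C_2$. The added BatchNorm/SiLU layers and the kernel size are assumptions rather than the exact reference design.

```python
# A minimal sketch of an SPDConv-style block: space-to-depth + stride-1 convolution.
import torch
import torch.nn as nn

class SPDConvSketch(nn.Module):
    def __init__(self, c1: int, c2: int, scale: int = 2, k: int = 3):
        super().__init__()
        self.scale = scale
        # Stride-1 convolution after the space-to-depth step; no information is
        # discarded by striding or pooling.
        self.conv = nn.Conv2d(c1 * scale ** 2, c2, kernel_size=k, stride=1, padding=k // 2)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU()

    def forward(self, x):
        s = self.scale
        # Space-to-Depth: gather the scale^2 sub-feature maps f[i, j] = x[..., i::s, j::s]
        # and concatenate them along the channel axis.
        subs = [x[..., i::s, j::s] for i in range(s) for j in range(s)]
        x = torch.cat(subs, dim=1)  # (B, scale^2 * C1, S/scale, S/scale)
        return self.act(self.bn(self.conv(x)))

# Example: a (1, 64, 80, 80) map becomes (1, 128, 40, 40) with scale=2 and c2=128.
# y = SPDConvSketch(64, 128)(torch.randn(1, 64, 80, 80))
```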
2.4. CSP Bottleneck with 2 Convolutions-Partial Convolution
The C2f module leverages residual connections, skip connections, and multi-layer convolution structures to capture hierarchical features from the input image, improving the model’s efficacy in intricate scene analysis. The method combines features between layers, minimizes the transfer of unnecessary information, and enhances the model’s capability to identify objects at various scales. The feature map is divided along its channel axis, with deep convolution applied to harvest local and global features across different scales. While this architecture improves the efficiency of computational resource usage to a degree, it encounters difficulties in intricate situations such as deformation and occlusion, because of the convolutional kernel’s static receptive field. Traditional convolution (as shown in
Figure 4a) is the most basic form of convolution. It retrieves features by sliding a filter (or convolutional kernel) across the input feature map and computing the dot product of the filter with the corresponding local area at every location. Traditional convolution can capture both the global and local features of the input data. However, UAVs typically have small form factors and limited power budgets, making them unable to bear the heavy computational load and energy consumption brought by traditional convolutions.
Depthwise separable convolution (illustrated in
Figure 4b) is a variant aimed at reducing computational cost. It splits traditional convolution into two steps: depthwise convolution, and pointwise convolution. Although it reduces computational costs, it may compromise expressive power, potentially affecting the model’s performance.
The partial convolution [
32] (PConv, depicted in
Figure 4c) gathers spatial features by applying filters to a selection of input channels, with the remaining channels remaining unaffected. This method can considerably decrease computational expense and memory usage without a substantial loss in performance, and it facilitates the trade-off between efficiency and effectiveness by choosing a suitable portion ratio.
In order to improve detection performance and efficiency, we introduce C2f-PConv [
33]. C2f-PConv employs PConv technology to decrease unnecessary computation and memory usage, while maintaining feature extraction proficiency, thus lowering the model’s computational complexity and parameter count.
It enhances computational efficiency by minimizing redundant computations and memory accesses. Unlike traditional convolution, PConv applies the operation to only a subset of channels in the input feature map, leaving the others unchanged. This reduces the FLOPs of PConv to
$$\mathrm{FLOPs}_{PConv} = h \times w \times k^2 \times c_p^2$$
where $h$ represents the height of the feature map, $w$ is the width, $k$ is the size of the convolution kernel, and $c_p$ is the number of input feature map channels to which the convolution is applied. This approach greatly reduces computational complexity; for a typical partial ratio of $r = c_p/c = 1/4$, the FLOPs of PConv are only $1/16$ of those in conventional convolution.
Additionally, PConv requires less memory access compared to conventional convolution. With a partial ratio of $r = 1/4$, the memory access is only $1/4$ of that needed by standard convolution. The formula is as follows:
$$h \times w \times 2c_p + k^2 \times c_p^2 \approx h \times w \times 2c_p$$
This design enables PConv to effectively avoid unnecessary computations during feature extraction, thereby improving processing speed and efficiency. In the C2f-PConv structure (illustrated in
Figure 5), it replaces the convolution operations in the C2f module with PConv, which reduces redundant computations and memory accesses, while preserving feature extraction capabilities. This enhancement allows the C2f module to maintain efficient computational performance, even when handling larger or more complex objects. Through the flexible sampling mechanism of partial convolution, C2f-PConv can adapt to local variations in the input image, while overcoming the limitations imposed by the fixed receptive field in standard convolution.
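The following minimal sketch illustrates the partial convolution idea described above: only the first $c_p = r \cdot c$ channels are convolved and the remainder pass through unchanged, which is what yields the $1/16$ FLOPs figure at $r = 1/4$. The slicing strategy and kernel size are assumptions, not the authors' exact C2f-PConv module.

```python
# A minimal sketch of partial convolution (PConv): convolve a channel subset only.
import torch
import torch.nn as nn

class PConvSketch(nn.Module):
    def __init__(self, channels: int, ratio: float = 0.25, k: int = 3):
        super().__init__()
        self.cp = max(1, int(channels * ratio))  # channels actually convolved
        self.conv = nn.Conv2d(self.cp, self.cp, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        # Split channels: convolve the first cp channels, keep the rest untouched.
        x1, x2 = x[:, :self.cp], x[:, self.cp:]
        return torch.cat((self.conv(x1), x2), dim=1)

# For c = 256 and ratio = 1/4 (cp = 64), the convolved FLOPs scale with
# cp^2 = c^2 / 16, matching the 1/16 figure quoted above.
```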
2.5. W-IoU Loss
In YOLOv8, W-IoU tackles the challenges of small object detection and low-quality images by dynamically adjusting the weights of different object boxes in the loss function [
34,
35]. In complex tasks like UAV-based forest fire detection and scanning electron microscope (SEM) image object detection, W-IoU improves detection performance, especially in the presence of small objects and occlusions, by optimizing anchor boxes of varying quality.
W-IoU comes in three variants (v1, v2, and v3), each providing tailored optimization approaches for various contexts.
W-IoU v1 introduces a distance metric as the attention criterion, minimizing excessive penalties on geometric metrics like aspect ratio and position. In small object detection, it reduces the overemphasis on high-quality anchor boxes, improving the model’s robustness by adjusting the attention weight of anchor boxes using this distance metric. The formula is as follows:
$$\mathcal{L}_{WIoUv1} = \mathcal{R}_{WIoU}\,\mathcal{L}_{IoU}, \qquad \mathcal{R}_{WIoU} = \exp\!\left(\frac{(x - x_{gt})^2 + (y - y_{gt})^2}{(W_g^2 + H_g^2)^{*}}\right), \qquad \mathcal{L}_{IoU} = 1 - IoU$$
The boundary box loss $\mathcal{L}_{IoU}$ represents the loss based on IoU, where IoU is calculated as the quotient of the overlap area of the predicted and ground truth bounding boxes to the area of their union. When the overlap width and height are both zero, this can cause vanishing gradients during backpropagation. For the predicted bounding box, $x$ and $y$ represent the center coordinates, while $x_{gt}$ and $y_{gt}$ refer to the center coordinates of the ground truth bounding box. $W_g$ and $H_g$ are the width and height of the smallest box enclosing both boxes, and the superscript $*$ indicates that this term is detached from the gradient computation. $\mathcal{L}_{WIoUv1}$ is the weighted intersection over union loss, and $\mathcal{R}_{WIoU}$ is the weight factor for this loss.
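For illustration, a minimal PyTorch sketch of a W-IoU v1-style loss under the reconstruction above is given below. The box format and the detaching of the enclosing-box term are assumptions based on that formulation, not the authors' exact implementation.

```python
# A minimal sketch of a W-IoU v1-style loss: L = R_WIoU * L_IoU, with
# R_WIoU = exp(center_distance^2 / (Wg^2 + Hg^2)*), the * marking detachment.
import torch

def wiou_v1(pred, target, eps=1e-7):
    """pred, target: (N, 4) boxes in (x1, y1, x2, y2) format."""
    # Intersection and union for the plain IoU loss.
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter + eps
    l_iou = 1.0 - inter / union

    # Center coordinates of predicted and ground-truth boxes.
    cx_p, cy_p = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx_t, cy_t = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2

    # Smallest enclosing box; detached so it rescales, rather than drives, gradients.
    enc_lt = torch.min(pred[:, :2], target[:, :2])
    enc_rb = torch.max(pred[:, 2:], target[:, 2:])
    wg, hg = (enc_rb - enc_lt).unbind(dim=1)
    denom = (wg ** 2 + hg ** 2).detach() + eps

    r_wiou = torch.exp(((cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2) / denom)
    return (r_wiou * l_iou).mean()
```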
In W-IoU v2, the model introduces a monotonic attention coefficient aimed at dynamically adjusting the weight, reducing the influence of easy samples on the training process, and accelerating convergence:
$$\mathcal{L}_{WIoUv2} = \left(\frac{\mathcal{L}_{IoU}^{*}}{\overline{\mathcal{L}_{IoU}}}\right)^{\gamma} \mathcal{L}_{WIoUv1}$$
In this context, $\left(\mathcal{L}_{IoU}^{*}/\overline{\mathcal{L}_{IoU}}\right)^{\gamma}$ denotes the dynamically adjusted attention coefficient, while $\mathcal{L}_{WIoUv2}$ refers to the enhanced Intersection over Union loss.
W-IoU v3 is the final version of this approach, building upon the foundation of v2 by introducing an outlier degree ($\beta$) and a dynamic focusing mechanism ($r$). W-IoU v3 introduces the concept of outlier degree, to dynamically assess the quality of anchor boxes and modify the gradient weights for individual boxes accordingly. This approach is especially useful for dealing with low-quality samples and image noise, reducing the influence of poor anchor boxes on gradient updates. As a result, it accelerates model convergence and improves overall performance. W-IoU v3 shows outstanding performance in object detection, especially in tasks involving occlusion, small objects, and complex backgrounds, where it displays strong robustness. The formula for this mechanism is as follows:
$$\beta = \frac{\mathcal{L}_{IoU}^{*}}{\overline{\mathcal{L}_{IoU}}}, \qquad r = \frac{\beta}{\delta \alpha^{\beta - \delta}}, \qquad \mathcal{L}_{WIoUv3} = r\,\mathcal{L}_{WIoUv1}$$
In this context, $\beta$ denotes the outlier degree, reflecting the quality of the anchor boxes, while $r$ is the dynamically adjusted focusing factor ($\alpha$ and $\delta$ are hyperparameters of the focusing mechanism).
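A small sketch of the v3 focusing factor, reusing the per-box plain IoU losses computed inside the v1 sketch above, is shown below; the values chosen for $\alpha$ and $\delta$ are illustrative assumptions.

```python
# A minimal sketch of the W-IoU v3 focusing weights: beta = L_IoU* / mean(L_IoU)
# (detached outlier degree) and r = beta / (delta * alpha ** (beta - delta)).
import torch

def wiou_v3_weights(l_iou: torch.Tensor, alpha: float = 1.9, delta: float = 3.0):
    """Per-box focusing factors r, given the plain IoU losses l_iou of a batch."""
    beta = l_iou.detach() / l_iou.detach().mean().clamp(min=1e-7)
    return beta / (delta * alpha ** (beta - delta))

# Usage: scale the per-box W-IoU v1 losses by these weights before averaging.
```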
The parameter scheme of W-IoU is shown in
Figure 6. W-IoU v3 introduces an anomaly score and a dynamic non-monotonic focusing mechanism, addressing the issue that traditional loss functions struggle to handle low-quality anchor boxes. This optimization improves the gradient update process and enhances the model’s adaptability to complex scenes. W-IoU v3 mitigates the adverse effects of subpar anchor boxes in training, guaranteeing that YOLOv8 sustains robust detection precision and consistency, particularly in the presence of noisy visuals or significant object obstructions.
In summary, W-IoU effectively optimizes the object detection performance of YOLOv8 by introducing a dynamic focusing mechanism and outlier degree. It demonstrates significant advantages in handling small objects, complex backgrounds, low-quality samples, and occlusion issues, providing more stable and precise training results for YOLOv8. By dynamically adjusting the weights, W-IoU speeds up the model’s convergence and boosts its robustness to difficult samples, improving its adaptability and performance in real-world scenarios.
2.6. An Improved UAV Forest Fire Detection Model Based on YOLOv8
This paper proposes an enhanced UAV forest fire detection model based on YOLOv8, tailored for lightweight forest fire detection using UAVs.
First, SPDConv is used to replace conventional static convolutions, decreasing the computational load while improving performance, particularly for small objects and low-resolution images.
Next, MLCA is introduced before the three detection heads. This mechanism combines local and channel-wise information, enhancing the model’s ability to capture key features. As a result, this enhancement significantly boosts the precision and resilience of object detection, particularly in intricate and evolving wildfire scenarios.
Furthermore, a C2f-PConv structure is designed, replacing the traditional C2f with the C2f-PConv. The PConv operation replaces the convolutional operations in the C2f module, minimizing redundant computations and memory access, while preserving feature extraction capabilities. This modification allows the C2f module to maintain efficient computational performance when handling larger-scale or more complex targets.
By incorporating W-IoU, the model becomes more sensitive to the boundaries of fire and smoke regions, leading to better alignment of object boundaries. This enhancement increases detection accuracy and speeds up the model’s convergence in complex scenarios.
The YOLO model presented in this study seeks to improve UAV forest fire detection capability, optimizing the trade-off between detection performance and computational efficiency. Our model begins with the input image and progressively downsamples the feature map using a sequence of convolutional blocks. The C2f-PConv module is integrated into the structure to achieve efficient processing of multi-branch features. In the intermediate stage, the feature map is segmented by a Split operation, and features are then merged using Concat and Upsample operations to achieve multi-scale feature fusion. The feature expression ability is further enhanced through the MLCA module. Finally, the detection results are output through the Detect layer. The structure of the YOLO model is shown in
Figure 7.
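As an illustrative composition only, and not the authors' exact network, the sketch below reuses the module sketches from the preceding subsections to show how SPDConv downsampling, PConv-based feature mixing, and MLCA attention could be chained within a single backbone stage of such a pipeline.

```python
# A highly simplified sketch of one improved backbone stage; assumes the
# SPDConvSketch, PConvSketch, and MLCASketch classes from the earlier sketches
# are in scope. The ordering and channel handling are illustrative assumptions.
import torch.nn as nn

class ImprovedBackboneStageSketch(nn.Module):
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.down = SPDConvSketch(c_in, c_out)  # detail-preserving downsampling
        self.mix = PConvSketch(c_out)           # lightweight partial-channel mixing
        self.att = MLCASketch(c_out)            # mixed local channel attention

    def forward(self, x):
        return self.att(self.mix(self.down(x)))
```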
4. Discussion
This study presents an efficient method for UAV forest fire detection through several innovative improvements to the YOLOv8 model. By introducing the MLCA attention mechanism, the SPDConv convolution module, the C2f-PConv feature extraction structure, and the W-IoU loss function, we markedly enhance the precision and efficiency of the model for detecting forest fires, particularly for small targets and intricate situations.
The empirical findings show that the refined model attains a significant upgrade in detection precision as well as in its lightweight operational capabilities. Specifically, precision improved by 2.17%, recall by 5.5%, and mAP@0.5 by 1.9%. In addition, the number of parameters decreased by 43.8% to only 5.96 M, while the model size and computational cost were only 12.19 MB and 18.1 GFLOPs, which makes the model more practical and reliable in actual UAV monitoring.
Although MLCA, C2f-PConv, SPDConv, and W-IoU perform well, they also have certain limitations. While the Mixed Local Channel Attention (MLCA) mechanism is adept at amalgamating channel and spatial information, along with local and global contextual cues, to enhance the network’s expressive capabilities, its intricate design leads to higher implementation complexity. This complexity can increase the computational demands placed on the model, which may be a concern for efficiency and resource utilization, especially in real-time applications or on devices with limited computational power. Meanwhile, the effectiveness of MLCA also depends on specific tasks and datasets, and may not be applicable to all situations.
Secondly, C2f-PConv performs well in handling situations containing missing values or irregular data, but it has high requirements for data integrity. If the data loss is too severe, PConv may not be able to effectively extract features.
Thirdly, the SPDConv structure improves performance by reducing the number of feature maps, but it is mainly suitable for low resolution and small target scenes. For high-resolution or large-sized targets, SPDConv may not be able to fully leverage its advantages. Furthermore, the incorporation of SPDConv into the model necessitates alterations to the standard convolutional layer. These modifications can add to the overall complexity of the model, potentially affecting its ease of implementation and computational efficiency.
Additionally, the W-IoU loss function introduces weight factors for weighted calculation, making the loss function more widely applicable in object detection tasks. Nonetheless, the computation of W-IoU is somewhat intricate, potentially prolonging the model’s training duration. Additionally, the utility of W-IoU is contingent on the appropriate choice of weighting coefficients. Certainly, improper tuning of the weight factors can adversely affect the model’s efficacy.
Given the current model’s strong ability to detect small objects and to operate on low-resolution imagery, we acknowledge the potential for further improvements, particularly in its fire detection performance during extreme and low-visibility conditions. To tackle these hurdles, future investigations will concentrate on two primary directions. Firstly, efforts will be directed at bolstering the model’s robustness to ensure high accuracy and consistency across a range of weather conditions. The second priority is to investigate and incorporate more sophisticated preprocessing and post-processing methods to further refine the precision and efficacy of fire detection. By delving into cutting-edge algorithms and technologies, including adaptive learning rate scheduling, model pruning, and quantization, we aim to create an effective and robust fire detection system, offering enhanced reliability for the domain of public safety.
5. Conclusions
Based on the YOLOv8 architecture, we innovatively propose an improved UAV forest fire detection model, which is specifically optimized for the challenges in detecting forest fire images captured by UAVs. The primary objective is to address the limitations of conventional models in handling low-resolution and small target images, and to decrease the quantity and dimensions of model parameters to facilitate more efficient deployment on UAV platforms.
To maintain detailed information within the image, we introduce SPDConv. The utilization of this technology effectively prevents information loss during downsampling, substantially enhancing the model’s capability to detect low-resolution and small target images.
We introduce the C2f-PConv module. This component is used to perform convolutional operations on a subset of channels within the input feature map, preserving the other channels unaltered. Such a design not only reduces extraneous computation and memory consumption but also significantly boosts the model’s computational efficiency, concurrently reducing the parameter count.
In order to enhance the detection capability for UAV images while reducing computational burden, MLCA has been introduced. MLCA is a lightweight attention mechanism that combines local spatial information and channel information. The approach refines the model’s ability to discern both local and global features, simultaneously curbing the computational load via a two-stage pooling mechanism and one-dimensional convolutional processes. This strategy leads to an overall enhancement in the performance of image detection tasks.
To tackle the challenge of targets that are obscured and not accurately identified, we introduce the W-IoU loss function. The W-IoU loss function is capable of refining anchor boxes with varying qualities and can dynamically modify the weights associated with different target boxes within the loss computation. This flexibility markedly improves the model’s ability to detect, especially in cases with tiny objects and instances of obstructions.
In summary, to counteract the problem of blurry images obtained from UAVs, we employ SPDConv technology. This innovation helps maintain the detailed information within the images, thereby enhancing the detection ability for low-resolution and small target objects. The integration of MLCA heightens the model’s sensitivity to both local and global feature details without increasing the computational burden. To tackle the issue of occlusion from closely packed objects in images, we introduce W-IoU. In response to the limited computing resources and storage capacity of drones, we introduce the C2f-PConv module, which reduces redundant calculations by flexibly applying convolution operations.
Regarding the validation phase, we allocated the 3253 forest fire images among a training set, a testing set, and a validation set, adhering to an 8:1:1 division. The outcomes of our experiments indicate that our model achieved a precision of 87.5%, a recall of 87.5%, and an mAP@0.5 of 89.3% on the FLAME dataset. These metrics represent improvements of 2.17%, 5.5%, and 1.9% over the baseline YOLOv8 model. Concurrently, the model’s parameter count was decreased by 43.8%, and the GFLOPs were lowered by 36.7%, which streamlines more effective deployment on mobile devices.
In comparison to conventional detection models such as YOLOv3, YOLOv5, YOLOv6, YOLOv7, YOLOv9, YOLOv10, RT-DETR, and YOLO-World, our model exhibits strong performance across key metrics including precision, recall, and mAP@0.5. These advancements enhance the precision of forest fire detection in drone imagery while also decreasing the model’s intricacy and scale, playing a vital role in ensuring swift and precise forest fire detection.