Article

FGYOLO: An Integrated Feature Enhancement Lightweight Unmanned Aerial Vehicle Forest Fire Detection Framework Based on YOLOv8n

1 College of Information Engineering, Henan University of Science and Technology, Luoyang 471023, China
2 Longmen Laboratory, Luoyang 471000, China
3 College of Humanities and Media, Wuhan Polytechnic University, Wuhan 430000, China
* Author to whom correspondence should be addressed.
Forests 2024, 15(10), 1823; https://doi.org/10.3390/f15101823
Submission received: 5 September 2024 / Revised: 29 September 2024 / Accepted: 17 October 2024 / Published: 18 October 2024
(This article belongs to the Special Issue Wildfire Monitoring and Risk Management in Forests)

Abstract:
To address the challenges of complex backgrounds and small, easily confused fire and smoke targets in Unmanned Aerial Vehicle (UAV)-based forest fire detection, we propose an improved forest smoke and fire detection algorithm based on YOLOv8. Considering the limited computational resources of UAVs and the lightweight property of YOLOv8n, the original model of YOLOv8n is improved, the Bottleneck module is reconstructed using Group Shuffle Convolution (GSConv), and the residual structure is improved, thereby enhancing the model’s detection capability while reducing network parameters. The GBFPN module is proposed to optimize the neck layer network structure and fusion method, enabling the more effective extraction and fusion of pyrotechnic features. Recognizing the difficulty in capturing the prominent characteristics of fire and smoke in a complex, tree-heavy environment, we implemented the BiFormer attention mechanism to boost the model’s ability to acquire multi-scale properties while retaining fine-grained features. Additionally, the Inner-MPDIoU loss function is implemented to replace the original CIoU loss function, thereby improving the model’s capacity for detecting small targets. Experimental results on the customized G-Fire dataset reveal that FGYOLO achieves a 3.3% improvement in mean Average Precision (mAP), reaching 98.8%, while reducing the number of parameters by 26.4% compared to the original YOLOv8n.

1. Introduction

The forest is one of the most vital ecological environments in the Earth’s biosphere, playing a crucial role in carbon sequestration, oxygen production, climate regulation, water conservation, and soil stabilization. Forest fires are natural disasters that are destructive, sudden, and difficult to extinguish. Once they occur, they can quickly spread to bushes or tree canopies, causing significant harm and loss to both humans and the entire forest ecosystem. Therefore, monitoring forest fires is essential for safeguarding forest resources and is an urgent problem that must be addressed. Traditional fire monitoring methods include manual inspection and satellite monitoring. While manual inspection is low-cost, its limited monitoring range makes it difficult to detect fires promptly. Satellite monitoring covers a wide area, but its data update frequency is low, the cost is high, and real-time monitoring is challenging. Additionally, satellites are relatively weak at detecting small-area fires [1].
UAVs have emerged in recent years as a new type of airborne platform, and their technology is still maturing. UAVs are now applied in various fields such as meteorological detection, fire monitoring, and environmental remote sensing. The advantages of micro UAVs include low acquisition and operating costs, simple and flexible operation, and the ability to adjust programs and equipment based on real-time site conditions. These characteristics make UAVs particularly suitable for forest fire detection. The most critical technical challenge in UAV-based forest fire detection is efficiently and accurately recognizing flame targets in UAV images. Image-based intelligent perception enhances UAVs’ ability to understand scenes and extract relevant forest information, with target detection technology being a key component of this perception. Currently, deep learning-based target detection technology is continuously advancing, serving as a key driver for UAV development. Deploying effective target detection models in aerial photography equipment, such as UAVs, is an effective means of forest fire detection and prevention.
Given the specific requirements of fire scenarios—where a balance between computational speed and detection accuracy is crucial—the YOLO algorithm emerges as a more suitable choice for UAV detection. Although YOLOv8 performs well in target detection, it still has shortcomings in forest scenes with lush vegetation and complex environments, which affects the detection and early warning of forest fires. To overcome these challenges, this study proposes a new algorithmic model called FGYOLO. This model builds upon the YOLOv8 framework, with improvements across various network modules to achieve high detection accuracy while maintaining lightweight requirements, making it ideal for resource-constrained environments with high real-time demands. To accommodate the network’s need for diverse training data, the model is trained using a comprehensive, customized dataset. The experimental results demonstrate that the FGYOLO model remains robust against morphological changes, interfering objects, and the feature blurring of smoke and fire targets under real-time conditions, meeting the detection requirements in forest scenarios effectively.
The main contributions of this study are as follows:
  • Replacing the traditional convolution in some C2f modules with GSConv, and reconstructing its Bottleneck module to improve the residual structure. This modification significantly reduces redundant features and network parameters, accelerates model convergence, and improves detection accuracy.
  • An improved GBFPN network, based on the Bi-directional Feature Pyramid Network (BiFPN) structure, replaces the Path Aggregation Network (PANet) in the neck layer. This improvement retains bi-directional cross-scale connections, removes redundant branches, and utilizes only the P3, P4, and P5 channels for feature output. Additionally, a new fusion method based on contextual information is introduced to eliminate conflicting information between layers, further optimizing the neck layer structure to enhance computational efficiency.
  • The improved neck network integrates Biformer, a dynamic sparse attention mechanism, to address the challenge of extracting salient features in complex forest environments. This mechanism helps the model focus on important features while suppressing irrelevant background information, thereby improving detection performance.
  • The Inner-IoU algorithm is combined with the Maximum Possible DIoU (MPD) loss function. Inner-MPDIoU more accurately captures target position information, enhancing the model’s generalization ability and improving both regression and classification accuracy.
  • A comprehensive forest fire smoke dataset, encompassing various time periods and multiple scene categories, is established. Advanced software techniques are employed for data augmentation and enhancement to prevent model overfitting. Experiments using this dataset confirm the superiority of the proposed method, achieving a 98.8% mAP, outperforming current mainstream YOLO models and other classical target detection models.

2. Related Works

The most classical target detection algorithms at this stage are divided into two types: two-stage and one-stage algorithms. Prominent two-stage algorithms include Mask region-based CNN (R-CNN) [2], Grid R-CNN [2], Fast R-CNN [3], Faster R-CNN [4], and so on. In contrast, one-stage target detection algorithms, including the Single-Shot MultiBox Detector (SSD) algorithm [5] and the You Only Look Once (YOLO) series [6,7,8,9], are well known for their speed. At present, many studies have explored optimizing forest fire detection models, demonstrating significant potential. Peruzzi et al. [10] used two embedded models to detect fires based on audio and image data and ran them on low-power devices. Benzekri et al. [11] collected environmental data in real time through sensor nodes distributed in the forest and then used the GRU model for early forest fire detection. Cao et al. [12] improved detection model segmentation accuracy by combining an instance segmentation technique with the YOLOv7-tiny target detection algorithm. Choutri et al. [13] constructed a UAV detection system that uses stereo vision for depth estimation to achieve accurate fire localization. Bahhar et al. [14] used a staged YOLO model and an integrated convolutional neural network (CNN) for wildfire and smoke detection, combining the Squeeze-and-Excitation (SE) mechanism with the convolutional layers of YOLOv5s to achieve more accurate smoke target recognition. Zhang et al. [15] used Faster R-CNN and solved the problem of a lack of training data by synthesizing images for a homemade dataset.
In recent years, target detection technology has experienced rapid advancement, with the YOLO algorithm, a prominent representative of one-stage detection methods, widely utilized. UAV-based forest fire detection models built on YOLO also find broad application in this field. Shamtaid et al. [16] compared the target detection performance of YOLOv8 and YOLOv5 and constructed a CNN-RCNN network to classify fire in images. Han et al. [17] used the GhostNetV2 structure to optimize the convolution operations of the backbone layer in YOLOv8n, reducing the complexity and computational requirements of the model and building a model more suitable for deployment on UAVs. Cao et al. [18] designed a new convolution module and a decoupled detection head to improve YOLOv5, taking into account the local and global features of fires. Zhao et al. [19] extended the feature extraction network from a 3D perspective to enhance feature propagation for small target recognition in fires and improve UAV image detection accuracy. Saydirasulovich et al. [20] used GSConv to reduce the number of YOLOv8 parameters and performed bounding box regression with Wise-IoU to improve model performance in forests. Mukhiddinov et al. [21] introduced spatial pyramid pooling with fast additive layers to strengthen the network’s backbone and focus on small-sized wildfire smoke regions, improving the YOLOv5 network architecture and its detection speed.
Deep learning models hold significant potential for UAV-based forest fire detection systems but continue to encounter various challenges. Zhuo et al. [22] designed a lightweight feature extraction module, GhostNetV2, incorporating the SimAM attention mechanism to address the balance between computational cost and detection accuracy. However, false alarm rates in complex natural scenarios remain an issue, and the inference speed and accuracy still heavily rely on hardware performance. Yang et al. [23] applied the K-means++ method to optimize anchor box clustering and reduce classification errors, and introduced a novel small target detection head to improve accuracy for small targets. Despite these advancements, their dataset suffers from an insufficient number of categories and imbalances, which undermine the model’s generalization ability. Forest fire detection continues to face challenges, such as complex natural backgrounds, multi-scale fire and smoke variations, low image resolution, and frequent target occlusion. Solving these issues demands not only improving detection accuracy but also ensuring that the algorithm is lightweight enough to be implemented on remote-sensing devices such as UAVs. In light of these challenges, our improvement of YOLOv8 focuses on enhancing feature fusion for small target detection and reducing the network model’s complexity to enable deployment on UAV platforms.

3. Materials and Methods

3.1. Improved Forest Fire Detection Model

This paper presents FGYOLO, an integrated feature enhancement lightweight Unmanned Aerial Vehicle forest fire detection framework based on YOLOv8. It is designed for deployment on UAVs to more accurately and rapidly detect smoke and fire in forests, enabling early fire detection and prevention. The specific architecture is shown in Figure 1. First, the C2f (faster implementation of a CSP Bottleneck with 2 convolutions) module is improved by introducing GSConv (Ghost Shuffle Convolution) to replace the original Conv and refining the residual structure of C2f, creating a new GS_C2f module. This module replaces part of the original C2f in the backbone. The modification ensures the efficient extraction of flame and smoke features while significantly reducing the model’s size and computational complexity, which in turn increases detection speed. Next, a GBFPN feature fusion network is proposed for the neck layer. GBFPN adopts the BiFPN (Bi-directional Feature Pyramid Network) structure and removes some branches of BiFPN to make the network more lightweight. Also, a new fusion method is employed, utilizing dilated convolutions with varying dilation rates. This captures contextual information from different receptive fields and enables more effective cross-scale feature fusion, which improves the model’s capacity to recognize small targets. Additionally, the concept of remote dependency capture is introduced, and the BiFormer attention mechanism is embedded into the GBFPN module. This allows for the more accurate identification of key information in the input feature map. Lastly, Inner-MPDIoU replaces the CIoU loss function, accelerating model convergence and further improving small target detection accuracy.

3.2. YOLOv8 Network

The model proposed in this paper is based on YOLOv8, which is the latest object detection and image segmentation algorithm following YOLOv5. The entire network structure comprises four main components, as shown in Figure 2. The input segment handles adaptive grayscale filling, adaptive anchor calculation, and mosaic data enhancement for the input image. Then, the backbone network extracts the image’s feature map, which the neck module processes by combining features from multiple layers. Finally, the head predicts the target object and its location, sending the processed image features to the prediction layer.
Compared to YOLOv5, YOLOv8 further enhances the model’s efficiency, primarily by replacing the CSP Bottleneck with 3 convolutions (C3) module in the backbone with the CSPDarknet53 to 2-Stage FPN (C2f) module. The C2f module is an improved version of the C3 module that leverages the Bottleneck module to enhance gradient branching, eliminates one standard convolutional layer, and incorporates the ELAN (Efficient Layer Aggregation Network) structure from YOLOv7 [24]. This approach maintains the model’s lightweight nature while improving the gradient flow, allowing for more detailed data processing. The algorithm’s fundamental architecture is illustrated in Figure 3. The SPPF (Spatial Pyramid Pooling—Fast) module processes the output feature maps by combining them through pooling with varying kernel sizes before sending the information to the neck layer.
In the neck section, YOLOv8 employs a combination of the Feature Pyramid Network (FPN) and the Path Aggregation Network (PAN) structure. This approach retains the PAN concept from YOLOv5 but eliminates the convolution operation that follows upsampling in the PAN structure. These modifications not only streamline the model but also enhance its capability to merge features across different scales through advanced multi-scale fusion techniques, thereby improving detection performance for objects of diverse sizes.
The head section of YOLOv8 is designed with two separate branches: one for target classification and another for bounding box regression. For classification tasks, it utilizes binary cross-entropy loss (BCE loss), while bounding box regression is addressed using a combination of distribution focal loss (DFL) and Complete-IOU (CIoU) [25]. This dual-branch technique not only increases detection accuracy but also speeds up model convergence. Additionally, YOLOv8 moves away from the previous intersection over union (IoU) matching strategy and unilateral proportion allocation, adopting the Task-Aligned Assigner [26] instead. This dynamic allocation method uses regression coordinates and classification scores to improve detection accuracy and robustness. The task alignment metric integrates classification scores with intersection over union (IoU) values to enhance both classification and localization accuracy, thereby minimizing the prevalence of low-quality prediction boxes.
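As a concrete illustration of the Task-Aligned Assigner idea described above, the sketch below combines each candidate's classification score and IoU into a single alignment metric. The power-law form and the exponent values are common choices from the Task-Aligned Assigner literature and are shown here as assumptions, not as this paper's exact settings.

```python
import torch

def task_alignment_metric(cls_scores, ious, alpha=1.0, beta=6.0):
    """Combine classification score and IoU into one alignment metric.

    cls_scores: (N,) predicted scores of the candidates for the GT class
    ious:       (N,) IoU of each candidate box with the GT box
    alpha, beta: weighting exponents (illustrative defaults, not from the paper)
    """
    return cls_scores.pow(alpha) * ious.pow(beta)

# Candidates with the highest metric are taken as positives for that GT box.
scores = torch.tensor([0.9, 0.6, 0.3])
ious = torch.tensor([0.5, 0.8, 0.9])
print(task_alignment_metric(scores, ious))  # with beta=6, localization quality dominates the ranking
```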

3.3. GS_C2f Module

Given the resource limitations of UAVs, including the need for lightweight, low-load, and multifunctional devices, reducing the size of network models is crucial. In YOLOv8, the C2f structure comes in two forms: the Bottleneck component of the C2f in the backbone layer contains a residual structure, while the Bottleneck of the C2f in the neck has no residual structure. The C2f module branches the tandem Bottleneck modules across layers, eliminating the convolution operation in the branches by adding more layer-hopping connections and enriching the feature map information through the Split operation. In this process, each additional Bottleneck increases the computational size. To reduce the scale of parameter computation, the convolution kernel in the Bottleneck is replaced. Traditional convolution slows down detection by applying a convolution kernel to each channel of the input feature map individually and then combining the results to generate a single output feature map; this involves repetitive computations across all channels, resulting in a high computational burden. To streamline this, we introduce Ghost Shuffle Convolution (GSConv), a lightweight convolutional technique that drastically cuts computing costs while retaining performance. GSConv merges the advantages of standard convolution and depthwise separable convolution (DSConv) to process input images of forest fires and smoke more efficiently.
Unlike DSConv, GSConv preserves inter-channel connections, ensuring the model’s accuracy. The resulting features are combined and rearranged to enhance nonlinear representation, which is particularly beneficial for detecting smoke targets that frequently change due to fire and environmental factors. These nonlinear features better capture the deformation and expansion processes of smoke, providing the model with richer learning material and improving its adaptability and resilience. The new Bottleneck structure, denoted as GS_Bottleneck, as shown in Figure 4, removes the residual structure found in the original Bottleneck and replaces the CBS module in the second layer with GSConv.
This GS_Bottleneck structure is then used to replace the entire Bottleneck structure in the original C2f module, referred to as GS_C2f, as shown in Figure 5.
During forward propagation, GS_C2f enhances the network’s depth and receptive field by combining different convolutional layers with GS_Bottleneck modules. GS_Bottleneck first employs a 3 × 3 convolution for dimensionality reduction to remove redundant global channel information. It then applies GSConv to the resulting features, avoiding issues related to excessive channel redundancy and a large number of parameters. Furthermore, GS_C2f cascades multiple GS_Bottleneck structures, providing feature information at different scales. After multi-scale fusion, it strengthens the extraction and transformation of features specific to small targets. Overall, GS_C2f operates without feature loss, enhances feature expression, significantly reduces computational demands, and achieves a more lightweight model.
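A minimal PyTorch sketch of the ideas in this subsection, assuming one plausible layout for GSConv (a dense convolution producing half the output channels, a depthwise convolution on that half, concatenation, and a channel shuffle) and for GS_Bottleneck (a CBS layer followed by GSConv, with no residual shortcut); channel splits, kernel sizes, and class names are illustrative rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class GSConv(nn.Module):
    """Ghost/Group Shuffle Convolution: dense conv for half the channels,
    depthwise conv for that half, then concat + channel shuffle."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        c_half = c_out // 2
        self.dense = nn.Sequential(
            nn.Conv2d(c_in, c_half, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())
        self.dw = nn.Sequential(
            nn.Conv2d(c_half, c_half, 5, 1, 2, groups=c_half, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())

    def forward(self, x):
        y1 = self.dense(x)
        y2 = self.dw(y1)
        y = torch.cat([y1, y2], dim=1)               # (B, c_out, H, W)
        # channel shuffle: interleave the dense and depthwise halves
        b, c, h, w = y.shape
        return y.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)

class GSBottleneck(nn.Module):
    """GS_Bottleneck without a residual connection: CBS followed by GSConv."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.cbs = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, 1, 1, bias=False),
            nn.BatchNorm2d(c_out), nn.SiLU())
        self.gsconv = GSConv(c_out, c_out)

    def forward(self, x):
        return self.gsconv(self.cbs(x))              # no identity shortcut
```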

3.4. GBFPN Network

Forest remote-sensing images often contain small targets and have low resolution, presenting challenges for effective detection. The original YOLOv8n model’s Path Aggregation Feature Pyramid Network (PAFPN) exhibits limited spatial fusion capability, reducing the network’s utilization of shallow features. This deficiency hampers its ability to detect the numerous small targets typically found in UAV-based forest fire monitoring. To mitigate this issue, this paper offers a GBFPN feature fusion network, which is based on the Bi-directional Feature Pyramid Network (BiFPN) [27] and replaces the PANet used in the original YOLOv8n model (Figure 6a). The goal of this enhancement is to improve the generalization capabilities of shallow network spatial features. BiFPN employs a bi-directional feature pyramid architecture that supports both top-down and bottom-up connectivity. This design enhances efficient cross-scale integration and allows for weighted feature fusion. By repeatedly applying bi-directional cross-scale connectivity, BiFPN strengthens the integration of low-level and high-level semantic features. It employs a weighted feature fusion mechanism to prioritize the importance of different features, as illustrated in Figure 6b.
Considering the limited edge computing capabilities of UAVs, this paper proposes an improved version of BiFPN, named GBFPN, to reduce redundant computations, as shown in Figure 6c.
The original BiFPN fuses the features from the third to the seventh layers out of seven feature layers. It assumes that nodes with only one input edge contribute little to the network. Therefore, two feature fusion nodes are removed to streamline the structure, and feature fusion is performed using only the P3 to P5 layers, similar to the original YOLOv8. Since shallow semantic information from the second layer plays a significant role in detecting small targets, we combine the shallow features extracted from the P2 channel with the P3 channel. The output feature maps are generated using only the P3, P4, and P5 channels. This reduces the noise interference from shallow features while preserving essential shallow semantic information for small targets without losing too much deeper semantic information. Simplifying the network structure in this way makes the model more lightweight without compromising feature fusion.
During the feature fusion process, input features at different resolutions contribute to the output in a manner that is not uniform. Given the prevalence of multi-scale smoke targets in UAV remote-sensing images, BiFPN adopts a fast normalized fusion module to balance the weights of these features. This approach enhances the extraction of deep information from smoke targets, reducing misdetections and omissions caused by complex environments. The relationship between input and output features is described by Equation (1):
O = \sum_{i} \frac{\omega_i}{\epsilon + \sum_{j} \omega_j} \cdot I_i
where $\omega_i$ denotes the learnable weight associated with the input feature $I_i$, and a ReLU is applied to each weight to ensure that $\omega_i \ge 0$. To avoid numerical instability, a small constant $\epsilon = 0.0001$ is added to the denominator, ensuring that the normalized weights fall between 0 and 1.
Taking feature layer $P_4$ as an example, its intermediate feature $P_4^{td}$ and output feature $P_4^{out}$ are given by Equations (2) and (3), respectively:
P_4^{td} = \mathrm{Conv}\left( \frac{\omega_1 \cdot P_4^{in} + \omega_2 \cdot \mathrm{Resize}(P_5^{in})}{\omega_1 + \omega_2 + \epsilon} \right)
P_4^{out} = \mathrm{Conv}\left( \frac{\omega_1 \cdot P_4^{in} + \omega_2 \cdot P_4^{td} + \omega_3 \cdot \mathrm{Resize}(P_3^{out})}{\omega_1 + \omega_2 + \omega_3 + \epsilon} \right)
Here, $P_i^{in}$ represents the $i$-th input feature, and the Resize operation refers to upsampling or downsampling.
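A small PyTorch sketch of the fast normalized fusion in Equations (1)–(3): one learnable, ReLU-clamped weight per input, normalized with a small constant before the weighted sum. Module and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FastNormalizedFusion(nn.Module):
    """Weighted fusion of n same-shaped feature maps (Equation (1))."""
    def __init__(self, n_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))   # one learnable weight per input
        self.eps = eps

    def forward(self, feats):
        w = F.relu(self.w)                            # keep weights non-negative
        w = w / (w.sum() + self.eps)                  # normalize to (0, 1)
        return sum(wi * fi for wi, fi in zip(w, feats))

# e.g. P4_td = Conv(fuse([P4_in, resize(P5_in)])) as in Equation (2)
fuse = FastNormalizedFusion(2)
p4_in, p5_up = torch.randn(1, 64, 40, 40), torch.randn(1, 64, 40, 40)
p4_td = fuse([p4_in, p5_up])
```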
However, the fast normalized fusion approach tends to overlook conflicting information between features of different scales and between the target object and the background. This lack of contextual information may hinder further performance improvements [28]. To address this, the study proposes a fusion method based on contextual information. This approach uses convolutions with varying dilation rates to capture contextual information from multiple receptive fields. The obtained features are then convolved using three 1 × 1 convolutions with dilation rates of 1, 3, and 5. These convolved features are subsequently fused using the Concat operation, which integrates features from different layers and eliminates conflicting information. The fusion method is illustrated in Figure 7.
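The contextual fusion step can be sketched as follows. The description above mentions dilation rates of 1, 3, and 5; because dilation only enlarges the receptive field for kernels bigger than 1 × 1, this sketch assumes 3 × 3 dilated convolutions followed by a Concat and a 1 × 1 projection, which is one plausible reading of the description rather than the authors' exact layer layout.

```python
import torch
import torch.nn as nn

class ContextFusion(nn.Module):
    """Capture context at several receptive fields with dilated convs,
    then fuse the branches with Concat + a 1x1 projection."""
    def __init__(self, c_in, c_out, dilations=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(c_in, c_in, 3, padding=d, dilation=d, bias=False)
            for d in dilations)
        self.project = nn.Conv2d(c_in * len(dilations), c_out, 1)

    def forward(self, x):
        ctx = torch.cat([b(x) for b in self.branches], dim=1)
        return self.project(ctx)

x = torch.randn(1, 64, 40, 40)
print(ContextFusion(64, 64)(x).shape)   # torch.Size([1, 64, 40, 40])
```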

3.5. Biformer Dynamic Sparse Attention Mechanism

The vastness of forest areas and the complexity of image backgrounds present significant challenges in UAV forest fire detection, especially when flame targets are small and easily obscured. If a model lacks the ability to suppress irrelevant information, extracting critical information becomes difficult. As a convolutional neural network (CNN) model, YOLO can produce significant noise when extracting features from forest images. This noise may interfere with the long-range dependencies between pixels, thereby impairing the model’s detection and recognition performance and increasing the chances of false positives or missed detections. To enhance the detection and recognition of fire and smoke targets while minimizing noise interference, this paper introduces the concept of capturing remote dependencies. This approach aims to improve the accuracy of UAV-based recognition systems.
Firstly, we encode the input sequence $[a_1, a_2, a_3, \ldots, a_T]$ to obtain $[x_1, x_2, x_3, \ldots, x_T]$; the linear transformation matrices $W^Q$, $W^K$, and $W^V$ are then used to create the three matrices Q, K, and V. The weighted sum is obtained by computing the dot product of the query with the corresponding key, normalizing it with a softmax, and multiplying the result by the matrix V. To keep the gradient from vanishing, a scaling factor $\sqrt{d_k}$ is introduced, where $d_k$ denotes the dimension of the key matrix K. The attention formula is shown in Equation (4):
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V
Traditional attention mechanisms are computationally intensive and memory-demanding, making them unsuitable for deployment in hardware environments with limited resources. To overcome this challenge, BiFormer, a variant of the Transformer model [29], implements a dynamic sparse attention mechanism. This mechanism utilizes bi-level routing attention (BRA) to allow for more adaptable computational distribution and enhanced content awareness. By concentrating on the most critical attention modules, BiFormer effectively reduces the computational load of attention operations. As illustrated in Figure 8, the BRA module serves as the fundamental component of the BiFormer architecture.
To construct a region-level affinity map, the input feature map $X \in \mathbb{R}^{H \times W \times C}$ is divided into $S \times S$ non-overlapping regions to obtain $X^r \in \mathbb{R}^{S^2 \times \frac{HW}{S^2} \times C}$, so that each region contains $\frac{HW}{S^2}$ feature vectors. Linear mappings of the re-partitioned feature map $X^r$ yield the query, key, and value tensors of the attention mechanism, denoted as $Q, K, V \in \mathbb{R}^{S^2 \times \frac{HW}{S^2} \times C}$, as shown in Equations (5)–(7):
Q = X^r W^q
K = X^r W^k
V = X^r W^v
where $W^q$, $W^k$, and $W^v$ are the weight matrices used to produce the query, key, and value tensors, respectively.
Following that, the attentional link between distinct regions is formed by constructing a directed graph and identifying the regions connected to each region. Averaging Q and K within each region yields the region-level queries and keys $Q^r, K^r \in \mathbb{R}^{S^2 \times C}$. The adjacency matrix $A^r \in \mathbb{R}^{S^2 \times S^2}$, which reflects the degree of association between regions, is obtained by multiplying $Q^r$ by the transpose of $K^r$. To keep only the top-$k$ most correlated region connections at each node, $A^r$ is pruned: the least correlated entries are filtered out at the coarse-grained level to obtain the routing index matrix $I^r \in \mathbb{N}^{S^2 \times k}$, as calculated in Equations (8) and (9).
A^r = Q^r (K^r)^T
I^r = \mathrm{topkIndex}(A^r)
Here, the elements of the adjacency matrix $A^r$ measure how similar two regions of the forest image are in terms of feature information, and $\mathrm{topkIndex}(\cdot)$ returns the set of routing indices of the top $k$ connections retained for each region by the top-$k$ operator. Self-attention is then computed on fine-grained tokens within each region using the index matrix $I^r$: the key and value tensors are collected by $\mathrm{gather}(\cdot)$ to obtain $K^g$ and $V^g$, as shown in Equations (10) and (11):
K^g = \mathrm{gather}(K, I^r)
V^g = \mathrm{gather}(V, I^r)
where $K^g, V^g \in \mathbb{R}^{S^2 \times \frac{kHW}{S^2} \times C}$ are the tensors obtained by gathering the K and V tokens of the regions listed in the routing index matrix $I^r$, respectively. Applying token-to-token attention to $K^g$ and $V^g$, the output O is obtained as shown in Equation (12):
O = \mathrm{Attention}(Q, K^g, V^g) + \mathrm{LCE}(V)
where $\mathrm{Attention}(\cdot)$ is the attention function responsible for converting each query into a weighted sum of the values, and $\mathrm{LCE}(\cdot)$ is the local context enhancement term, parameterized with a depthwise convolution with kernel size 5.
With an appropriate region partitioning factor $S$, the complexity of BRA is reduced to $O((HW)^{4/3})$, which is lower than that of traditional attention, $O((HW)^2)$, and of quasi-global axial attention, $O((HW)^{3/2})$, effectively reducing the computational cost of the attention mechanism.
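A condensed, single-head PyTorch sketch of the BRA procedure in Equations (5)–(12): partition into S × S regions, build the region-level affinity $A^r$, keep the top-k routed regions, gather their key/value tokens, and apply token-to-token attention. The LCE term, multi-head handling, and several efficiency details are omitted, and the tensor layout, region count, and top-k value are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiLevelRoutingAttention(nn.Module):
    """Simplified single-head BRA (no LCE term; H and W divisible by the region count)."""
    def __init__(self, dim, num_regions=7, topk=4):
        super().__init__()
        self.s, self.topk = num_regions, topk
        self.qkv = nn.Linear(dim, dim * 3, bias=False)

    def forward(self, x):                           # x: (B, H, W, C)
        B, H, W, C = x.shape
        s, k = self.s, self.topk
        # 1) partition into s*s regions of (H/s)*(W/s) tokens each
        xr = x.view(B, s, H // s, s, W // s, C).permute(0, 1, 3, 2, 4, 5)
        xr = xr.reshape(B, s * s, -1, C)            # (B, s^2, n_tokens, C)
        q, kx, v = self.qkv(xr).chunk(3, dim=-1)
        # 2) region-level queries/keys by averaging tokens, affinity A^r = Q^r (K^r)^T
        qr, kr = q.mean(2), kx.mean(2)              # (B, s^2, C)
        affinity = qr @ kr.transpose(1, 2)          # (B, s^2, s^2)
        idx = affinity.topk(k, dim=-1).indices      # routing index I^r: (B, s^2, k)
        # 3) gather key/value tokens of the top-k routed regions
        n, c = kx.shape[2], kx.shape[3]
        gidx = idx[..., None, None].expand(-1, -1, -1, n, c)
        kg = torch.gather(kx[:, None].expand(-1, s * s, -1, -1, -1), 2, gidx)
        vg = torch.gather(v[:, None].expand(-1, s * s, -1, -1, -1), 2, gidx)
        kg, vg = kg.flatten(2, 3), vg.flatten(2, 3)  # (B, s^2, k*n, C)
        # 4) token-to-token attention within the routed regions
        attn = F.softmax(q @ kg.transpose(-2, -1) / c ** 0.5, dim=-1)
        out = attn @ vg                              # (B, s^2, n_tokens, C)
        out = out.view(B, s, s, H // s, W // s, C).permute(0, 1, 3, 2, 4, 5)
        return out.reshape(B, H, W, C)

x = torch.randn(1, 28, 28, 64)                      # 28 is divisible by s = 7
print(BiLevelRoutingAttention(64)(x).shape)         # torch.Size([1, 28, 28, 64])
```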
The BiFormer module follows the Vision Transformer architecture and adopts a feature pyramid structure. Figure 9a,b illustrate the structure of the BiFormer and the BiFormer Block, respectively. BiFormer is composed of multiple BiFormer Blocks. The input forest fire image is processed through four feature extraction stages, each producing feature maps of a different scale. The first stage consists of $N_i$ patch embedding layers and BiFormer modules, while the remaining three stages are composed of $N_i$ patch merging modules and BiFormer modules. To decrease the dimensionality of the input feature map while preserving feature invariance, each BiFormer block implicitly encodes relative position information using a 3 × 3 depthwise convolution at the outset. The BRA module and the multi-layer perceptron (MLP) module—a 2-layer MLP with expansion ratio $e$—are then applied in sequence to model cross-position relationships and embed features position by position. The plus symbol in Figure 9b represents the connection of two feature vectors.
To harness the potential of this attention mechanism without significantly increasing computational costs, the BiFormer is embedded into the improved neck layer. This integration achieves performance comparable to current sparse transformer models and exceeds expectations in AP for small targets.

3.6. Inner-MPDIoU Loss Function

Bounding box regression aims to refine the detector’s output by adjusting the detection box to better match the Ground Truth (GT) box. IoU has become the standard criterion for assessing prediction box loss in detection tasks, and its formula is provided in Equation (13):
L_{IoU} = 1 - \frac{|A \cap B|}{|A \cup B|}
where A denotes the prediction bounding box and B the GT box; the IoU is the ratio of the intersection area of the two boxes to their union area. In the CIoU loss used by YOLOv8, an additional aspect-ratio penalty term $\alpha v$ is introduced: the weight function is represented by $\alpha$, and $v$ measures the consistency of the aspect ratios. Here, $w$ and $h$ indicate the width and height of the prediction box, whereas $w^{gt}$ and $h^{gt}$ correspond to the width and height of the GT box. Equations (14) and (15) provide the formulas for $\alpha$ and $v$, respectively:
\alpha = \frac{v}{(1 - IoU) + v}
v = \frac{4}{\pi^2} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^2
Several methods have extended IoU by adding new loss terms. For instance, DIoU [25] accelerates convergence by directly regressing the Euclidean distance between the centroids of the prediction and GT boxes. EIoU [30] decomposes the aspect ratio term of CIoU to explicitly measure the differences between three geometric factors—overlapping regions, centroids, and edge lengths—while introducing a focal loss to mitigate the imbalance between hard and easy samples. SIoU [31] evaluates the angle of the vector corresponding to the desired regression and redefines the angle penalty metric, allowing the prediction box to quickly align with the nearest axis and thus reducing the degrees of freedom of the regression, i.e., the number of variables that must be fitted. GIoU [32] addresses the gradient vanishing issue that occurs when the anchor box and GT box do not overlap.
The YOLOv8 model uses the CIoU loss function without considering the balance between difficult and easy samples. For smaller objects, minor positional offsets lead to a sharp decrease in IoU, while for larger objects, the same positional offsets cause significant changes in the IoU value. Additionally, the CIoU loss function consumes significant computational resources due to the involvement of an inverse trigonometric function. To address these limitations, this study replaces the CIoU in the original network with the MPDIoU [33] loss function. By incorporating the minimum point distance, MPDIoU redefines the loss function, decreasing its overall degrees of freedom. The calculation of the MPDIoU loss function is shown in Equation (16):
MPDIoU = IoU - \frac{\rho^2(P_1^{pred}, P_1^{gt})}{w^2 + h^2} - \frac{\rho^2(P_2^{pred}, P_2^{gt})}{w^2 + h^2}
where $P_1^{pred}$ and $P_1^{gt}$ denote the upper-left corners, and $P_2^{pred}$ and $P_2^{gt}$ the lower-right corners, of the prediction and GT boxes, respectively, and $\rho^2(P_1^{pred}, P_1^{gt})$ is the squared distance between the corresponding points. The MPDIoU loss function enhances the prediction box’s alignment with the Ground Truth (GT) box, especially when their centers do not overlap. When the centers of the prediction box and the GT box overlap and maintain the same aspect ratio but differ in width and height, the penalty term in the MPDIoU function remains non-zero. This ensures that the IoU loss does not degrade under these conditions.
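A small sketch of the MPDIoU metric in Equation (16) for axis-aligned boxes given as (x1, y1, x2, y2). The assumption that w and h in the denominator are the width and height of the input image follows the MPDIoU formulation in [33]; the function name and example values are illustrative.

```python
import torch

def mpdiou(pred, gt, img_w, img_h):
    """MPDIoU = IoU - d1^2/(w^2+h^2) - d2^2/(w^2+h^2), boxes as (x1, y1, x2, y2)."""
    # plain IoU
    ix1, iy1 = torch.max(pred[..., 0], gt[..., 0]), torch.max(pred[..., 1], gt[..., 1])
    ix2, iy2 = torch.min(pred[..., 2], gt[..., 2]), torch.min(pred[..., 3], gt[..., 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_g = (gt[..., 2] - gt[..., 0]) * (gt[..., 3] - gt[..., 1])
    iou = inter / (area_p + area_g - inter + 1e-7)
    # squared distances between corresponding top-left (P1) and bottom-right (P2) corners
    d1 = (pred[..., 0] - gt[..., 0]) ** 2 + (pred[..., 1] - gt[..., 1]) ** 2
    d2 = (pred[..., 2] - gt[..., 2]) ** 2 + (pred[..., 3] - gt[..., 3]) ** 2
    norm = img_w ** 2 + img_h ** 2                 # assumed: image width/height, per [33]
    return iou - d1 / norm - d2 / norm

pred = torch.tensor([10.0, 10.0, 50.0, 60.0])
gt = torch.tensor([12.0, 14.0, 48.0, 58.0])
print(mpdiou(pred, gt, img_w=640, img_h=640))
```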
In current object detection tasks, traditional bounding box regression loss functions exhibit some irrelevance between the regressed content and the evaluation standard IoU, leading to incomplete regression attributes and limited generalization in practical applications. To overcome these shortcomings, this study introduces an Inner-IoU [34] loss improvement method. Unlike standard IoU, which examines the overall overlapping region between the prediction and GT boxes, Inner-IoU focuses on the core part of the bounding box, providing a more accurate assessment of the overlap. The approach is explained as follows:
b_l^{gt} = x_c^{gt} - \frac{w^{gt} \cdot ratio}{2}, \quad b_r^{gt} = x_c^{gt} + \frac{w^{gt} \cdot ratio}{2}
b_t^{gt} = y_c^{gt} - \frac{h^{gt} \cdot ratio}{2}, \quad b_b^{gt} = y_c^{gt} + \frac{h^{gt} \cdot ratio}{2}
b_l = x_c - \frac{w \cdot ratio}{2}, \quad b_r = x_c + \frac{w \cdot ratio}{2}
b_t = y_c - \frac{h \cdot ratio}{2}, \quad b_b = y_c + \frac{h \cdot ratio}{2}
inter = \left( \min(b_r^{gt}, b_r) - \max(b_l^{gt}, b_l) \right) \cdot \left( \min(b_b^{gt}, b_b) - \max(b_t^{gt}, b_t) \right)
union = (w^{gt} \cdot h^{gt}) \cdot ratio^2 + (w \cdot h) \cdot ratio^2 - inter
IoU^{inner} = \frac{inter}{union}
In the model, the center of the prediction (anchor) box is denoted by $b$, while the center of the GT box is represented by $b^{gt}$. The center point of the GT box is $(x_c^{gt}, y_c^{gt})$, and the center point of the anchor box is $(x_c, y_c)$. The term ratio is the scale factor, which typically ranges between 0.5 and 1.5.
By using different scales of auxiliary bounding boxes for various datasets and detection tasks, Inner-IoU demonstrates better generalization than the traditional IoU loss. In UAV forest fire detection, where flames are often small targets and smoke exhibits multi-scale features, this study modifies MPDIoU using the Inner-IoU concept. This modification replaces the IoU computational component to increase the model’s capacity to model the details of flame and smoke targets, allowing for better detection of small targets in UAV images and improved multi-scale feature detection performance. The calculation formula is shown in Equation (23):
MPDIoU^{inner} = IoU^{inner} - \frac{\rho^2(P_1^{pred}, P_1^{gt})}{w^2 + h^2} - \frac{\rho^2(P_2^{pred}, P_2^{gt})}{w^2 + h^2}
Inner-MPDIoU overcomes the limitations of the current IoU loss functions in detection tasks, improving the model’s ability to learn from high-loss samples.
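Building on the MPDIoU sketch above, the following sketch implements Equations (17)–(23): Inner-IoU is computed over auxiliary boxes scaled by ratio about each box's center and then substituted for the plain IoU term. The default ratio of 1.15 mirrors the value found best in Section 4.4.1; everything else is an illustrative assumption rather than the authors' exact code.

```python
import torch

def inner_iou(pred, gt, ratio=1.15):
    """Inner-IoU: IoU of auxiliary boxes scaled by `ratio` about each center (Eqs. (17)-(22))."""
    def scaled_corners(box):
        xc, yc = (box[..., 0] + box[..., 2]) / 2, (box[..., 1] + box[..., 3]) / 2
        w, h = box[..., 2] - box[..., 0], box[..., 3] - box[..., 1]
        return xc - w * ratio / 2, yc - h * ratio / 2, xc + w * ratio / 2, yc + h * ratio / 2

    pl, pt, pr, pb = scaled_corners(pred)
    gl, gt_, gr, gb = scaled_corners(gt)
    inter = (torch.min(pr, gr) - torch.max(pl, gl)).clamp(min=0) * \
            (torch.min(pb, gb) - torch.max(pt, gt_)).clamp(min=0)
    w_p, h_p = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    w_g, h_g = gt[..., 2] - gt[..., 0], gt[..., 3] - gt[..., 1]
    union = w_g * h_g * ratio ** 2 + w_p * h_p * ratio ** 2 - inter
    return inter / (union + 1e-7)

def inner_mpdiou(pred, gt, img_w, img_h, ratio=1.15):
    """Equation (23): swap the IoU term of MPDIoU for Inner-IoU."""
    d1 = (pred[..., 0] - gt[..., 0]) ** 2 + (pred[..., 1] - gt[..., 1]) ** 2
    d2 = (pred[..., 2] - gt[..., 2]) ** 2 + (pred[..., 3] - gt[..., 3]) ** 2
    norm = img_w ** 2 + img_h ** 2
    return inner_iou(pred, gt, ratio) - d1 / norm - d2 / norm

pred = torch.tensor([10.0, 10.0, 50.0, 60.0])
gt = torch.tensor([12.0, 14.0, 48.0, 58.0])
print(inner_mpdiou(pred, gt, img_w=640, img_h=640))
```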
During bounding box regression for small flame targets, CIoU considers the following three key geometric factors: the overlap area, the distance between the centers of the prediction and GT boxes, and the aspect ratio (as shown in Figure 10, where $b^{gt}$ and $b^{pred}$ are the centers of the GT and prediction boxes, $w^{gt}$ and $h^{gt}$ are the width and height of the GT box, and $w^{pred}$ and $h^{pred}$ are the width and height of the prediction box). When the centers of the prediction and GT boxes overlap, CIoU optimizes the shape, position, and size differences between them based on the aspect ratio. However, if the prediction and GT boxes share the same aspect ratio but differ in width and height, CIoU fails to reflect the true difference between $w^{gt}$, $w^{pred}$ and $h^{gt}$, $h^{pred}$, thus reducing the effectiveness of the loss function and limiting the model’s convergence speed and accuracy.
In contrast, Inner-MPDIoU (as shown in Figure 11, where $d_1$ represents the distance between the top-left corners of the prediction box and the GT box, and $d_2$ represents the distance between their bottom-right corners) addresses these issues by using a scaling factor in Inner-IoU to adjust the auxiliary bounding box size. For small flame targets, the ratio can be set greater than 1, resulting in an auxiliary box that is larger than the GT box. This adjustment expands the effective range of regression, facilitating faster model convergence. Furthermore, when the prediction and GT boxes differ in width and height, MPDIoU minimizes the distances $d_1$ and $d_2$, effectively optimizing the discrepancy terms. This allows it to handle overlapping and non-overlapping bounding boxes more efficiently, thus enhancing the accuracy of bounding box regression. The effective combination of these two mechanisms significantly enhances the detection performance for smoke and flame targets in forest fire scenarios.

4. Experiment

4.1. Dataset Acquisition and Processing

The dataset used in this experiment comes from two main sources. One part is a public dataset widely used for evaluating UAV-based wildfire detection models, as referenced in the literature [35]. The other part is collected from various online sources, including cameras, websites such as Kaggle and MIVIA, and web scraping, resulting in a total of 2060 images.
Detailed information is shown in Table 1. To minimize overfitting caused by insufficient training data, MATLAB techniques are used to augment the data. Techniques such as random flipping, local cropping, scaling, contrast-limited adaptive histogram equalization, median filtering, adding Gaussian noise, and gamma noise are used to enhance the dataset. Some of the results are shown in Figure 12. Figure 12b,c are examples of the original image after horizontal flipping and a linear transformation with translation, which preserve the image content but increase data diversity. Figure 12d converts the image to the HSV color space, where hue (H), saturation (S), and value (V) control the color type, intensity, and brightness, respectively. Random adjustments to these parameters simulate various fire scenarios under different lighting conditions and weather environments. Figure 12e adds Gaussian noise to the image, with the level of noise controlled by adjusting the standard deviation (sigma) of the Gaussian distribution. After these operations, a total of 10,295 images were generated. Some images undergo significant quality degradation due to augmentation, so manual screening is necessary to select qualified data for the final dataset, removing 260 poor-quality images. In addition, based on the 1% negative sample ratio of the COCO dataset, 100 background images without fire or smoke are added as negative samples, bringing the final total to 10,135 images. The custom dataset, G-Fire, is annotated using the LabelImg tool. G-Fire contains “fire” and “smoke” labels, with 32,315 bounding boxes labeled as fire and 18,814 labeled as smoke. The dataset is diverse, featuring fires with varying colors, shapes, textures, and densities, captured in various complex scenarios such as daytime and nighttime forests, as well as deep forest settings. The images include many small targets. The dataset is randomly divided into training, validation, and test sets in a 7:2:1 ratio.
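The augmentation itself was done in MATLAB; purely for illustration, the sketch below reproduces three of the listed operations (horizontal flipping, HSV jitter, and additive Gaussian noise) in Python with OpenCV and NumPy. The parameter ranges are assumptions, and bounding box labels would need to be updated alongside any geometric transform.

```python
import cv2
import numpy as np

def augment(img_bgr, rng=np.random.default_rng()):
    """Apply a few of the augmentations described above to one BGR image."""
    out = img_bgr.copy()
    if rng.random() < 0.5:                        # horizontal flip (labels must be flipped too)
        out = cv2.flip(out, 1)
    # random HSV jitter: hue shift plus saturation/value scaling (assumed ranges)
    hsv = cv2.cvtColor(out, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 0] = (hsv[..., 0] + rng.uniform(-10, 10)) % 180
    hsv[..., 1:] = np.clip(hsv[..., 1:] * rng.uniform(0.7, 1.3, size=2), 0, 255)
    out = cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)
    # additive Gaussian noise, strength controlled by sigma
    sigma = rng.uniform(0, 15)
    noise = rng.normal(0, sigma, out.shape)
    return np.clip(out.astype(np.float32) + noise, 0, 255).astype(np.uint8)
```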
Figure 13 summarizes the distribution of the annotated bounding boxes in the dataset. The left subplot displays the distribution of object bounding box centroids, which mainly cluster in the middle and lower sections of the images. The right subplot presents a scatter plot of bounding box widths and heights, revealing that the dataset predominantly contains small objects, as indicated by the darkest color in the lower-left corner. This diversity and comprehensiveness of the dataset facilitate in-depth research into forest fire prediction and detection techniques.

4.2. Experimental Environment

In this study, we use the PyTorch deep learning framework to train and test the customized dataset. The operating system is Ubuntu 18.04.1, the CPU is a Ryzen 5800X, the GPU is a GeForce RTX 3060 Ti with 8 GB of video memory, the RAM is 32 GB, the CUDA version is 11.1.1, the Python version is 3.8.1, and the deep learning framework is PyTorch 1.10.0, as shown in Table 2. The pre-trained weights are the official YOLOv8n weights, the number of training epochs is set to 300, and the batch size is set to 64 to fit the graphics memory. The initial learning rate is set to 0.01, and the SGD momentum is 0.937.
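For reference, a minimal training call with the Ultralytics YOLOv8 API using the hyperparameters listed above; the dataset YAML path is a placeholder, and the FGYOLO architectural modifications themselves are not part of the stock package.

```python
from ultralytics import YOLO

# Start from the official YOLOv8n pre-trained weights, as in the paper.
model = YOLO("yolov8n.pt")

# Hyperparameters from Section 4.2; "g_fire.yaml" is a placeholder dataset config.
model.train(
    data="g_fire.yaml",
    epochs=300,
    batch=64,
    optimizer="SGD",
    lr0=0.01,        # initial learning rate
    momentum=0.937,  # SGD momentum
)
```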

4.3. Model Evaluation Indexes

To validate the improvements to YOLOv8n, the model is evaluated using the following indicators: precision (P), recall (R), mean Average Precision (mAP), detection speed (FPS), and the confusion matrix.
The confusion matrix summarizes the relationship between actual and predicted categories in matrix form; for binary classification, it is shown in Table 3.
Precision. The proportion of the samples identified as fire by the model that are truly fire samples. The formula can be expressed as follows:
P = \frac{TP}{TP + FP}
Recall. The proportion of true fire samples that are correctly identified by the model. The calculation formula is shown below:
R = \frac{TP}{TP + FN}
Average Precision (AP). The average precision for a given class (fire or smoke); the formula is as follows:
AP = \int_0^1 P(R) \, dR
mean Average Precision (mAP). The mean of the average precision over all fire and smoke classes, calculated by the following formula:
mAP = \frac{1}{n} \sum_{i=1}^{n} AP_i
where n is the number of categories.
Detection speed. Measures the model’s computing capability in terms of the number of images it can process per second. This study reports the detection speed of the model in FPS (Frames Per Second).
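A compact sketch of how the metrics above can be computed, with AP obtained by integrating a monotonized precision-recall curve; this is a simplified illustration rather than the exact protocol used by the YOLO evaluation tooling.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Precision and recall from confusion-matrix counts."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recall, precision):
    """AP as the area under the precision-recall curve (recall sorted ascending)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]        # make precision monotonically decreasing
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

def mean_average_precision(aps):
    """mAP: mean of the per-class APs (here, fire and smoke)."""
    return float(np.mean(aps))

ap_fire = average_precision(np.array([0.2, 0.6, 0.9]), np.array([1.0, 0.8, 0.7]))
print(mean_average_precision([ap_fire, 0.95]))
```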

4.4. Analysis of Results

4.4.1. Inner-MPDIoU Validation

First, we conduct comparative experiments by adjusting the ratio factor within Inner-IoU to evaluate the effect of different ratios in the Inner-MPDIoU loss function. The goal is to identify the most effective ratio factor for detecting small targets like forest flames and smoke. The results are shown in Table 4.
When the ratio equals 1, the Inner-MPDIoU loss function reduces to the standard MPDIoU loss function. Experimental findings reveal that forest fire images contain many small targets, and even slight shifts in the labeling box can cause a significant decrease in IoU. When the ratio exceeds 1, the auxiliary box becomes larger than the GT box, which facilitates the regression of samples with low IoU. Consequently, the experimental results with a ratio greater than 1 are generally better than those with a ratio less than or equal to 1. The optimal result is achieved when the ratio equals 1.15; however, for ratios greater than 1, the effect does not consistently improve as the ratio grows and fluctuates for ratios between 1.00 and 1.20. Thus, the ratio value needs to be fine-tuned according to the specific dataset used in the experiments.
Next, this experiment aims to validate the effectiveness of the Inner-MPDIoU loss function selected in this study, particularly its optimization effect on small target detection tasks. In UAV-based forest fire detection, since remote-sensing equipment operates from a considerable distance, flames and smoke often appear as small targets in the early stages of a fire. Therefore, improving the detection accuracy of small targets is one of the critical challenges in this field. By comparing different IoU loss functions, we analyze which loss function can better enhance the model’s performance in forest fire detection. Seven YOLOv8n models, each with different IoU loss functions, were evaluated, and the results are shown in Figure 14. The comparison between the original CIoU and Inner-CIoU demonstrates that Inner-IoU improves the accuracy of small target detection. Further comparisons of six different Inner-IoU variants verify the effectiveness of Inner-MPDIoU used in this study. Compared to the original YOLOv8 model, the improved model achieves a 1.6% increase in mAP when using Inner-MPDIoU.

4.4.2. Ablation Experiments

To thoroughly evaluate the effectiveness of the enhanced algorithm, five sets of ablation experiments are conducted on the G-Fire dataset to examine the influence of each newly added or modified module on the overall model performance. All models are trained and tested under identical conditions to maintain consistency and comparability. The baseline is the original YOLOv8n, and additional experiments include YOLOv8n with the modified GS_C2f, YOLOv8n with the modified GBFPN, YOLOv8n with BiFormer, YOLOv8n with the modified Inner-MPDIoU, and the proposed synthesis method. The experimental results are summarized in Table 5.
Table 5 indicates that the integration of the GS_C2f module into YOLOv8n leads to a reduction in both parameter count and computational load, while simultaneously enhancing detection accuracy, thereby validating the effectiveness of the GS_Bottleneck module. Moreover, the inclusion of the BiFormer module significantly boosts the model’s performance, elevating detection precision from 95.1% to 97.5% and overall accuracy from 93.7% to 94.2%. This improvement greatly enhances the model’s ability to detect targets, even in the presence of occlusions. Additionally, the refined GBFPN module achieves approximately a 30% reduction in parameters and a 15% decrease in computational demand, all without compromising accuracy. Notably, it also improves the detection precision of small flame targets, thereby strengthening the model’s generalization capability and robustness.
Additionally, the first set of experiments reveals that without data augmentation, the model performs well on the training set but poorly on the test set, indicating overfitting. This suggests that the model may have adapted too closely to the training data and failed to generalize effectively to new data. Overfitting has a significantly negative impact on the model’s detection performance, as it results in a notable drop in mAP values. In complex forest environments, because the model overfits specific fire or smoke shapes and colors from the training set, it often struggles to correctly identify flames and smoke under different lighting or viewing angles, leading to increased false positives and missed detections.

4.4.3. Comparison Analysis Experiment

Furthermore, to validate the performance improvements brought about by the enhanced C2f module, the integration of the BiFormer module, and the application of the Inner-MPDIoU loss function, the proposed algorithm is rigorously compared with the original YOLOv8n algorithm using the same dataset and experimental settings. Figure 15 illustrates a comparison of the training results, with the left figure showing the recognition results before the improvement and the right figure displaying the recognition results of the enhanced FGYOLO model. The comparison of Figure 15(a-1,a-2) makes it evident that FGYOLO outperforms the original model in detecting small target flames and smoke. The comparison of (b-1) and (b-2) reveals that the original YOLOv8n has instances of missed detections and fails to accurately identify all small flame targets. In contrast, the improved algorithm in this study uses the Inner-MPDIoU loss function, which is more sensitive to the size and position of objects, enhancing localization accuracy, reducing the target leakage rate, and improving discriminative power to detect flame targets accurately. The comparison of Figure 15(c-1,c-2) shows that the fused BiFormer attention mechanism retains fine-grained detail information and combines it with multi-scale information processing. This improvement enhances the model’s visual recognition ability in complex environments with significant background interference. The model effectively focuses on the important features of flame and smoke during feature extraction and is not misled by mixed flame and smoke targets, thereby more accurately identifying flame and smoke targets without causing false detections. In Figure 15(d-2), the improved algorithm demonstrates greater sensitivity in recognizing smoke targets when the flame target is obscured, which aids in identifying hidden fire ignition points and provides reliable technical support for forest safety monitoring.
Through this comparison, it is evident that FGYOLO effectively meets the application scenarios described in this paper. It accurately detects small target flames even under conditions of smoke interference in complex environments with numerous occlusions, such as forests, providing reliable technical support for forest safety monitoring.
Additionally, we compare FGYOLO with other models across various parameters, followed by normalization, to comprehensively analyze the balance between model complexity and performance. This experiment further assesses the efficiency of the proposed model in real-world forest fire detection applications. As shown in Table 6 and Figure 16, the improved YOLOv8 algorithm surpasses the YOLOv5, YOLOv7-tiny, YOLOv9 [36], YOLOv10 [37], YOLOv8n, Faster R-CNN, and SSD algorithms in overall performance in terms of precision, recall, and detection speed (FPS) under the same hardware and development environment.
To further elucidate the effectiveness of the model enhancement strategies, this paper employs Grad-CAM (Gradient-weighted Class Activation Mapping) to visualize the improvements made to the YOLOv8n model. The resulting heatmaps reveal the feature extraction capabilities of the enhanced model in fire detection. Grad-CAM heatmap visualization helps assess FGYOLO’s ability to extract features in fire detection, providing clearer visual explanations. Figure 17 compares heatmaps from the baseline YOLOv8n model with those from our proposed FGYOLO model on the G-Fire dataset. The results show that FGYOLO encompasses a larger detection area for both bright and dim forest fires relative to YOLOv8n. In addition, FGYOLO demonstrates improved focus on positive sample regions of forest fires and smoke within complex environments, effectively addressing missed detections. This suggests that the GBFPN structure enhances the model’s ability to capture image features comprehensively, while the BiFormer attention mechanism significantly boosts its target perception in challenging backgrounds. The experimental outcomes confirm the success of these enhancements.

4.5. Limitations and Future Work

Our FGYOLO model has demonstrated promising results in forest fire detection within this study. However, there are still limitations when applied to real-world scenarios with UAVs. In specific environments, such as during sunrise and sunset, the angle and color of sunlight can resemble the optical properties of fire, especially in forested areas where sunlight penetrating the canopy may create orange-yellow spots similar to flames. These lighting variations may cause false positives, where sunlight is misinterpreted as fire signals. Additionally, during high-speed drone flights, rapid changes in light and shadow increase the difficulty of detection tasks. In response, we plan to enhance model training and consider integrating multiple sensors, such as infrared cameras, thermal imaging, and optical cameras, to improve the detection of fire and smoke. Furthermore, in forest environments, weather conditions like rain or fog can weaken or obscure fire and smoke signals. The fusion of smoke with moisture may also interfere with the YOLOv8 model’s accurate identification of fire and smoke. Although data augmentation techniques and the improved modules mitigate this issue to some extent, the performance may still be inferior under these conditions compared to clear weather. Therefore, we will try to incorporate environmental contextual information (e.g., geographic location, weather data, time of day) into the model, enabling it to make more intelligent fire detection decisions. In future real-world deployments, once the drone or other remote-sensing devices detect a fire, the system is expected to automatically transmit the detection results, along with images or videos of the fire and its geographic location, via wireless communication to nearby forest management stations and firefighting personnel, notifying them for further action. Our long-term goal is to continuously improve this intelligent, integrated forest fire detection system to achieve early fire detection and dynamic tracking in forest environments.

5. Conclusions

The intricacy of forest environments and the variability in environmental conditions present significant challenges for UAV-based forest fire detection, frequently resulting in suboptimal detection performance and false alarms. To tackle these issues, this study proposes FGYOLO. This approach aims to enhance UAV detection models while safeguarding forest resources. FGYOLO enhances the feature extraction network, improves the C2f structure with GSConv, and lightens its design to reduce computational pressure while maintaining efficiency. By combining Inner-IoU with the MPDIoU loss function in YOLOv8, FGYOLO further improves the accuracy and efficiency of bounding box regression and enhances the recognition accuracy of tiny targets. A lightweight and improved GBFPN feature fusion network is proposed to replace the PANet in the neck layer, strengthening the fusion capability of the shallow network, ensuring the UAV’s detection accuracy of small targets, and reducing the number of model parameters and computation. Subsequently, the BiFormer attention mechanism is integrated into the improved neck layer, improving the extraction of features of major flame targets and enhancing the model’s generalization in real-world scenarios. The effectiveness of the algorithm is demonstrated through experimental results on a custom comprehensive dataset. These results encompass an analysis of the loss function, the validation of individual modules, ablation studies, and comparisons with leading algorithms. Following these improvements, the model achieves an average accuracy of 98.8%, with a reduction in parameter count by 26.4%, laying the foundation for accurate forest fire identification. Compared to other detection models, the improved model in this study shows better comprehensive performance, balancing a higher fire detection rate with faster computation speed. In the future, we hope to achieve actual forest fire detection through our model, provide strong technical support for rapid forest fire warning and disaster management, and promote safer and more efficient forest management.

Author Contributions

Conceptualization, Y.Z.; Software, Y.Z.; Validation, Y.Z.; Investigation, Z.G.; Resources, J.L.; Data curation, Y.Z.; Writing—original draft, Y.Z.; Writing—review & editing, F.T.; Visualization, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No. 62301212), the Program for Science and Technology Innovation Talents in Universities of Henan Province (Grant No. 23HASTIT021), and the Henan Higher Education Teaching Reform Research and Practice Project (Grant No. 2024SJGLX0319).

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to ownership reasons.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Qin, X.; Li, X.; Liu, S.; Liu, Q.; Li, Z. Forest fire early warning and monitoring techniques using satellite remote sensing in China. J. Remote Sens. 2020, 24, 511–520.
2. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
3. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
4. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
5. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37.
6. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
7. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271.
8. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
9. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976.
10. Peruzzi, G.; Pozzebon, A.; Van Der Meer, M. Fight fire with fire: Detecting forest fires with embedded machine learning models dealing with audio and images on low power IoT devices. Sensors 2023, 23, 783.
11. Benzekri, W.; El Moussati, A.; Moussaoui, O.; Berrajaa, M. Early forest fire detection system using wireless sensor network and deep learning. Int. J. Adv. Comput. Sci. Appl. 2020, 11.
12. Cao, X.; Su, Y.; Geng, X.; Wang, Y. YOLO-SF: YOLO for fire segmentation detection. IEEE Access 2023, 11, 111079–111092.
13. Choutri, K.; Lagha, M.; Meshoul, S.; Batouche, M.; Bouzidi, F.; Charef, W. Fire detection and geo-localization using UAV's aerial images and YOLO-based models. Appl. Sci. 2023, 13, 11548.
14. Bahhar, C.; Ksibi, A.; Ayadi, M.; Jamjoom, M.M.; Ullah, Z.; Soufiene, B.O.; Sakli, H. Wildfire and smoke detection using staged YOLO model and ensemble CNN. Electronics 2023, 12, 228.
15. Zhang, Q.X.; Lin, G.H.; Zhang, Y.M.; Xu, G.; Wang, J.J. Wildland forest fire smoke detection based on Faster R-CNN using synthetic smoke images. Procedia Eng. 2018, 211, 441–446.
16. Shamta, I.; Demir, B.E. Development of a deep learning-based surveillance system for forest fire detection and monitoring using UAV. PLoS ONE 2024, 19, e0299058.
17. Han, Y.; Duan, B.; Guan, R.; Yang, G.; Zhen, Z. LUFFD-YOLO: A lightweight model for UAV remote sensing forest fire detection based on attention mechanism and multi-level feature fusion. Remote Sens. 2024, 16, 2177.
18. Cao, L.; Shen, Z.; Xu, S. Efficient forest fire detection based on an improved YOLO model. Vis. Intell. 2024, 2, 20.
19. Zhao, L.; Zhi, L.; Zhao, C.; Zheng, W. Fire-YOLO: A small target object detection method for fire inspection. Sustainability 2022, 14, 4930.
20. Saydirasulovich, S.N.; Mukhiddinov, M.; Djuraev, O.; Abdusalomov, A.; Cho, Y.I. An improved wildfire smoke detection based on YOLOv8 and UAV images. Sensors 2023, 23, 8374.
21. Mukhiddinov, M.; Abdusalomov, A.B.; Cho, J. A wildfire smoke detection system using unmanned aerial vehicle images based on the optimized YOLOv5. Sensors 2022, 22, 9384.
22. Xiao, Z.; Wan, F.; Lei, G.; Xiong, Y.; Xu, L.; Ye, Z.; Liu, W.; Zhou, W.; Xu, C. FL-YOLOv7: A lightweight small object detection algorithm in forest fire detection. Forests 2023, 14, 1812.
23. Yang, H.; Wang, J.; Wang, J. Efficient detection of forest fire smoke in UAV aerial imagery based on an improved YOLOv5 model and transfer learning. Remote Sens. 2023, 15, 5527.
24. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475.
25. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000.
26. Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. TOOD: Task-aligned one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 3490–3499.
27. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790.
28. Liu, P.; Qian, W.; Wang, Y. YWnet: A convolutional block attention-based fusion deep learning method for complex underwater small target detection. Ecol. Inform. 2024, 79, 102401.
29. Vaswani, A. Attention is all you need. arXiv 2017, arXiv:1706.03762.
30. Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157.
31. Gevorgyan, Z. SIoU loss: More powerful learning for bounding box regression. arXiv 2022, arXiv:2205.12740.
32. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–19 June 2019; pp. 658–666.
33. Ma, S.; Xu, Y. MPDIoU: A loss for efficient and accurate bounding box regression. arXiv 2023, arXiv:2307.07662.
34. Zhang, H.; Xu, C.; Zhang, S. Inner-IoU: More effective intersection over union loss with auxiliary bounding box. arXiv 2023, arXiv:2311.02877.
35. Shamsoshoara, A.; Afghah, F.; Razi, A.; Zheng, L.; Fulé, P.Z.; Blasch, E. Aerial imagery pile burn detection using deep learning: The FLAME dataset. Comput. Netw. 2021, 193, 108001.
36. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616.
37. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-time end-to-end object detection. arXiv 2024, arXiv:2405.14458.
Figure 1. FGYOLO network framework.
Figure 2. Original YOLOv8 model.
Figure 3. C2f module.
Figure 4. Improved Bottleneck structure: (a) original Bottleneck, (b) improved GS_Bottleneck structure.
Figure 5. GS_C2f structure.
Figure 6. Network structure diagram of PANet, BiFPN, and GBFPN.
Figure 7. Fusion mode.
Figure 8. Bi-level routing attention mechanism.
Figure 9. The details of the BiFormer.
Figure 10. CIoU loss function.
Figure 11. Inner-MPDIoU loss function.
Figure 12. Examples of data enhancement operations: (a) original image, (b) a horizontal flip operation applied to the original image, (c) affine transformation operation, (d) HSV data enhancement operation, (e) adding Gaussian noise to the original image.
Figure 13. Distribution statistics of the training set.
Figure 14. Inner-IoU comparison experiment.
Figure 15. A comparison of the improvement results: the detection results using the original YOLOv8 model are denoted as i-1, while those using FGYOLO are denoted as i-2. (a-1,a-2,b-1,b-2) are detection comparisons in the case of an occluded very small fire target; (c-1,c-2,d-1,d-2) are detection comparisons in the case of background interference and confusion between smoke and fire.
Figure 16. A comparison of the normalization effect of experimental metrics.
Figure 17. Comparison of Grad-CAM for YOLOv8 and FGYOLO.
Table 1. Dataset details.

Target              | Quantity | Source
Fire only           | 1400     | Public dataset
Fire only           | 1825     | Self-accessible
Smoke only          | 1762     | Self-accessible
Both fire and smoke | 2400     | Public dataset
Both fire and smoke | 2648     | Self-accessible
Empty samples       | 100      | Public dataset
Combined datasets   | 10,135   | Public dataset and self-accessible
Table 2. Hardware configuration and development environment.

Item                    | Configuration
Operating system        | Ubuntu 18.04.1
Graphics card           | GeForce RTX 3060 Ti
CUDA version            | 11.1.1
Python                  | 3.8.1
Deep learning framework | PyTorch 1.10.0
Table 3. Confusion matrix of classification results.

Actual Circumstances | Predicted Positive | Predicted Negative
Positive example     | TP                 | FN
Negative example     | FP                 | TN
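For reference, the precision (P), recall (R), and mAP values reported in the following tables follow the standard definitions derived from the entries of Table 3:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
AP = \int_{0}^{1} P(R)\, \mathrm{d}R, \qquad
mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i
```

where N is the number of target classes (here, fire and smoke), and mAP0.5 denotes the mean Average Precision at an IoU threshold of 0.5.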
Table 4. Ablation experiments of the ratio parameter.

ratio | P (%) | R (%) | mAP0.5 (%)
0.5   | 95.3  | 95.7  | 95.5
0.65  | 95.6  | 96.1  | 95.9
0.8   | 95.4  | 95.9  | 95.7
0.9   | 95.8  | 96.1  | 96.1
1.00  | 95.1  | 96.8  | 96.5
1.15  | 95.9  | 97.1  | 96.9
1.20  | 94.5  | 96.8  | 96.7
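As a reminder of what the ratio in Table 4 controls, the Inner-IoU formulation [34] builds an auxiliary box for each predicted and ground-truth box by scaling it about its centre (x_c, y_c); the expressions below restate that construction and are a paraphrase rather than the paper's own notation:

```latex
x_{1}^{\mathrm{in}} = x_{c} - \frac{ratio \cdot w}{2}, \quad
x_{2}^{\mathrm{in}} = x_{c} + \frac{ratio \cdot w}{2}, \quad
y_{1}^{\mathrm{in}} = y_{c} - \frac{ratio \cdot h}{2}, \quad
y_{2}^{\mathrm{in}} = y_{c} + \frac{ratio \cdot h}{2}
```

The IoU term of the loss is then evaluated on these scaled boxes, so ratio > 1 relaxes the overlap criterion (which Table 4 suggests is beneficial for the very small fire targets in this dataset), while ratio < 1 tightens it.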
Table 5. Ablation experimental detection results.

Network                          | P (%) | R (%) | mAP_smoke (%) | mAP_fire (%) | Parameters (×10^6) | GFLOPs
YOLOv8n without data enhancement | 92.8  | 89.6  | 90.4          | 92.5         | 3.01               | 8.2
YOLOv8n                          | 93.7  | 95.1  | 95.1          | 95.3         | 3.01               | 8.2
YOLOv8n + DC_C2f                 | 93.8  | 95.4  | 94.7          | 95.9         | 2.57               | 7.2
YOLOv8n + GBFPN                  | 94.5  | 94.8  | 94.7          | 95.8         | 2.26               | 7.1
YOLOv8n + BiFormer               | 94.2  | 95.6  | 97.5          | 97.1         | 3.22               | 9.5
YOLOv8n + Inner-MPDIoU           | 94.1  | 95.8  | 96.7          | 96.9         | 3.01               | 8.2
FGYOLO                           | 94.5  | 96.3  | 98.7          | 98.8         | 2.38               | 7.3
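The gap between the first two rows of Table 5 reflects the data enhancement pipeline illustrated in Figure 12 (horizontal flip, affine transformation, HSV perturbation, and Gaussian noise). The snippet below is an illustrative OpenCV/NumPy sketch of such a pipeline, not the exact augmentation code or parameter ranges used to build the G-Fire dataset; all numeric ranges are assumptions.

```python
import cv2
import numpy as np

def augment(image):
    """Illustrative versions of the operations shown in Figure 12 (image is a BGR uint8 array)."""
    h, w = image.shape[:2]

    # (b) horizontal flip
    flipped = cv2.flip(image, 1)

    # (c) affine transformation: small random rotation, scale, and shift (ranges are assumed)
    angle = np.random.uniform(-15, 15)
    scale = np.random.uniform(0.9, 1.1)
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    m[:, 2] += np.random.uniform(-0.05, 0.05, size=2) * (w, h)
    affine = cv2.warpAffine(image, m, (w, h), borderMode=cv2.BORDER_REFLECT_101)

    # (d) HSV perturbation: jitter hue, saturation, and value channels
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 0] = (hsv[..., 0] + np.random.uniform(-5, 5)) % 180
    hsv[..., 1:] = np.clip(hsv[..., 1:] * np.random.uniform(0.7, 1.3, size=2), 0, 255)
    hsv_aug = cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)

    # (e) additive Gaussian noise
    noise = np.random.normal(0, 10, image.shape).astype(np.float32)
    noisy = np.clip(image.astype(np.float32) + noise, 0, 255).astype(np.uint8)

    return [flipped, affine, hsv_aug, noisy]
```

In practice, the corresponding bounding box annotations must be transformed together with the images, for example by mirroring x-coordinates for the flip and applying the same affine matrix to the box corners.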
Table 6. A comparison of the performance of the eight algorithms.

Algorithm    | P (%) | R (%) | mAP0.5 (%) | Parameters (×10^6) | GFLOPs
Faster R-CNN | 83.9  | 82.0  | 81.9       | 136.73             | 401.7
SSD          | 89.3  | 78.5  | 84.2       | 26.29              | 62.7
YOLOv5       | 95.9  | 86.3  | 94.0       | 2.51               | 7.2
YOLOv7-tiny  | 91.4  | 87.5  | 93.4       | 5.91               | 12.5
YOLOv9       | 93.1  | 88.2  | 95.4       | 7.01               | 26.2
YOLOv10      | 92.1  | 89.4  | 94.1       | 2.37               | 6.7
YOLOv8n      | 93.2  | 90.3  | 95.6       | 3.01               | 8.2
FGYOLO       | 94.5  | 96.3  | 98.8       | 2.38               | 7.3
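The parameter and GFLOPs columns in Tables 5 and 6 can be reproduced approximately with a model profiler such as thop; the snippet below is a sketch under the assumptions of an Ultralytics-style YOLOv8n checkpoint and a 640 × 640 input, neither of which is stated explicitly in this section, and the checkpoint file name is illustrative.

```python
import torch
from thop import profile            # pip install thop
from ultralytics import YOLO        # assumed model source; pip install ultralytics

# Load the underlying nn.Module of a YOLOv8n checkpoint.
model = YOLO("yolov8n.pt").model
dummy = torch.randn(1, 3, 640, 640)  # assumed input resolution

macs, params = profile(model, inputs=(dummy,), verbose=False)
print(f"Parameters: {params / 1e6:.2f} x 10^6")  # compare with the Parameters column
print(f"GFLOPs:     {2 * macs / 1e9:.1f}")       # using the 1 MAC = 2 FLOPs convention
```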
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
