Article

YOLO-UFS: A Novel Detection Model for UAVs to Detect Early Forest Fires

1 School of Mechanical and Electrical Engineering, Northeast Forestry University, Harbin 150040, China
2 Centre for Research on Forest Operations and the Environment, Harbin 150040, China
* Authors to whom correspondence should be addressed.
Forests 2025, 16(5), 743; https://doi.org/10.3390/f16050743 (registering DOI)
Submission received: 12 March 2025 / Revised: 24 April 2025 / Accepted: 24 April 2025 / Published: 26 April 2025
(This article belongs to the Section Natural Hazards and Risk Management)

Abstract

Forest fires endanger ecosystems and human life, making early detection crucial for effective prevention. Traditional detection methods are often inadequate because of the large areas to be covered and their inherent limitations, whereas drone technology combined with deep learning holds promise. This study investigates the use of small drones equipped with lightweight deep learning models to detect forest fires early. A high-quality dataset constructed through aerial image analysis supports robust model training. The proposed YOLO-UFS network, based on YOLOv5s, integrates the C3-MNV4 module, BiFPN, the AF-IoU loss function, and the NAM attention mechanism. These modifications achieve a 91.3% mAP on the self-built early forest fire dataset. Compared with the original model, YOLO-UFS improves precision by 3.8%, recall by 4.1%, and mean average precision by 3.2%, while reducing floating-point operations by 74.7% and the number of parameters by 78.3%. It outperforms other mainstream YOLO algorithms on drone platforms, balancing accuracy and real-time performance. In generalization experiments on public datasets, the model's mAP0.5 increased from 85.2% to 86.3%, and its mAP0.5:0.95 from 56.7% to 57.9%, an overall mAP gain of 3.3%. The optimized model runs efficiently on the Jetson Nano platform, using 256 MB of runtime memory and 7.4 MB of storage, at an average frame rate of 30 FPS. In this study, airborne visible light images provide a low-cost, high-precision solution for the early detection of forest fires, enabling low-computing-power UAVs to meet the requirements of early detection, early mobilization, and early extinguishment. Future work will focus on multi-sensor data fusion and human–robot collaboration to further improve detection accuracy and reliability.

1. Introduction

Forest fires are a type of natural disaster characterized by their sudden onset, significant destructiveness, and considerable challenges in emergency response. They pose a threat not only to the stability of ecosystems but also to human life and the integrity of infrastructure [1]. According to statistics, in 2022, China experienced 709 forest fires, with an affected forest area of approximately 0.5 million hectares [2]. In 2023, the number of forest fires decreased to 328, and the affected area was about 0.4 million hectares [2]. Research indicates that early forest fire detection technology can identify fire sources at the initial stage of a fire, thereby helping to keep fire losses within an acceptable range [3].
China has vast territories with extensive forest areas and complex terrain, resulting in numerous monitoring blind spots. Moreover, the early signs of forest fires are often not obvious and can be easily obscured by vegetation, which presents significant challenges for fire detection. Traditional fire detection methods typically rely on various sensors to detect early signs of fires [4], such as optical sensors [5], acoustic sensors [6], and gas concentration sensors [7]. However, given the extensive forest coverage in China, deploying a large number of sensors in forest areas is not only costly but also cannot guarantee the accuracy and real-time nature of detection. With the continuous development of digital forestry and intelligent technologies, image-based early forest fire detection technology based on deep learning has gradually gained widespread application. This technology can autonomously learn fire characteristics from a vast amount of image data, overcoming the limitations of traditional manual feature extraction [8]. In addition, systems combining drones with optical sensors can achieve rapid and accurate early fire detection [9]. Visible light sensors, which capture images with distinct features and rich texture details, can intuitively display the situation on-site and are thus widely used in image-based fire detection technology [10].
A forest fire detection system based on aerial visible light images typically consists of three key components: image acquisition, fire recognition, and fire warning [11]. The image acquisition component captures real-time image data of forest areas using drones equipped with visible light cameras. The fire recognition component analyzes the images using target detection algorithms from deep learning to identify the presence of fires. The fire warning component then promptly alerts firefighting personnel upon fire detection. The fire recognition component is the core of the fire detection system, with research mainly focusing on constructing appropriate datasets and optimizing target detection algorithms to improve the accuracy and real-time nature of fire detection.
Deep learning-based target detection algorithms are mainly divided into two categories: two-stage algorithms represented by the R-CNN (Region-CNN) series and one-stage algorithms represented by the YOLO (You Only Look Once) series [12]. Due to the advantage of one-stage detection algorithms in detection speed, they are widely used in fire detection tasks with high real-time requirements. Recent studies on forest fire detection have made notable progress but also face significant limitations. Xue et al. [13] enhanced YOLOv5 by adding small-object detection layers and attention mechanisms, and modifying SPPF and PANet structures. However, this increased model complexity and computational demands, potentially restricting real-time application. Zhao et al. [14] replaced YOLOv3’s backbone with EfficientNet to improve small-object detection but used non-realistic forest datasets, raising concerns about generalizability in actual fire scenarios. Zu Xinping et al. [15] modified YOLOv3 to improve smoke detection precision but struggled with distinguishing smoke from other environmental interferences in complex backgrounds. Su Xiaodong et al. [16] and Pi Jun et al. [17] introduced lightweight networks and attention mechanisms into YOLOv5, optimizing detection performance on aerial forest fire datasets. However, this trade-off led to potential feature extraction insufficiency, impacting accuracy.
Since 2024, YOLO-related technologies have seen a surge in development. Wang et al. [18] combined Transformers and CNNs to enhance detection precision for early-stage fires but faced high computational costs and complex model structures. Chen et al. [19] used a multimodal fusion of infrared and visible light images to boost robustness in complex fire scenes, yet struggled with sensor data synchronization and integration complexity. Zhang et al. [20] introduced dynamic attention mechanisms in YOLOv8 to reduce false alarms in smoke detection but still faced high false-positive rates under adverse weather conditions. Guo et al. [21] improved the detection of small targets like initial flames through self-supervised pretraining, but the high pretraining costs and complex parameter tuning limited its practicality. Other notable efforts include Li et al. [22], who proposed a lightweight edge-computing model for real-time monitoring at the expense of detection accuracy; Zhou et al. [23], who built a federated learning-based distributed detection system for data privacy and fusion but faced challenges in network latency and model consistency; and He et al. [24], who enhanced model robustness against environmental disturbances through adversarial training, although at the cost of increased training complexity and computational load.
Most research on deep learning-based image detection of early forest fires focuses primarily on optimizing algorithms to enhance detection accuracy and real-time performance. However, these studies often overlook the unique characteristics of early forest fires, oversimplify target detection, and fail to fully consider the applicability in low-computing-power UAV scenarios. Additionally, comprehensive assessments of dataset suitability for forest environments are rarely conducted. While these studies contribute to incremental improvements, they generally address only specific aspects of the problem. Our work aims to overcome these limitations by proposing a comprehensive solution that balances accuracy, speed, and practical deployment feasibility.
This study presents a novel approach for early forest fire detection, addressing critical limitations in existing methods. By integrating the unique characteristics of early forest fires with an aerial perspective, we have developed a customized dataset that captures the nuances of these events. Leveraging the low latency and low computing power consumption of YOLOv5s, we propose the YOLO-UFS algorithm, which represents a significant leap forward in detection technology. Our innovations include modifying the network structure and loss function of YOLOv5s, introducing the ObjectBox detector to enhance localization precision, and incorporating the NAM attention mechanism to refine feature extraction, especially in complex and variable environments. These enhancements result in a lightweight yet highly accurate model that overcomes the traditional speed–accuracy trade-off. Unlike previous methods that focused narrowly on single aspects like flame or smoke detection, or relied on incremental changes, YOLO-UFS offers a comprehensive solution. It spans from data acquisition to system integration, significantly improving real-time processing capabilities and operational efficiency in field applications. This study’s holistic approach marks a major breakthrough in early forest fire detection, providing a robust and practical solution for real-world deployment.
The main contributions of this paper are as follows:
  • A New Early Forest Fire Detection Model, YOLO-UFS Model: We propose a novel detection model, YOLO-UFS, designed to enhance drone-based early forest fire and smoke detection by addressing low computational cost, low latency, complex background interference, and the coexistence of smoke and fire.
  • Self-Built Dataset: A custom dataset was created, comprising three types of data: small flames only, smoke only, and combined small flames and smoke. Experiments were conducted to compare its performance with classical algorithms.
  • New Method Improvements: We propose the C3-MNV4 module to replace the original C3 module, effectively reducing the number of parameters while enhancing feature extraction capabilities. Additionally, the AF-IoU loss function is introduced to optimize detection accuracy, particularly for small targets in complex background environments.
  • Integration of Existing Improvements: The NAM attention mechanism is employed to concentrate the kernel within the target region, improving detection precision. Meanwhile, ObjectBox and BiFPN are incorporated to enhance detail retention and model generalization. These upgrades collectively make YOLO-UFS more accurate and efficient for early forest fire detection.
The paper is organized as follows: Section 2.1 details the construction of the early forest fire dataset. Section 2.2 introduces the proposed YOLO-UFS, an improved method based on YOLOv5s. Section 3 describes the experimental environment and evaluation metrics, validates the method's effectiveness through ablation studies and comparisons with classical models, and analyzes the self-built dataset's contribution to early forest fire model training by comparing results from public and self-built datasets. Section 4 presents the visual analysis and discusses future work, and Section 5 concludes the paper.

2. Materials and Methods

2.1. Construct an Early Forest Fire Dataset

2.1.1. Image Acquisition of Early Forest Fires

In the context of early forest fire detection from an aerial perspective, the primary focus is on data collection in forest environments and the detection of fire-related targets. When constructing a dataset for early forest fires, it is essential to consider the characteristics of forest fires and the features of aerial images. The following are key considerations:
a. Target Identification
During the early stages of a forest fire, the flames are typically small and not easily distinguishable. In aerial images captured by drones, small flames can be easily obscured by surrounding vegetation, making them difficult to detect. Relying solely on flames as the detection target can lead to missed detections. However, early fires often produce significant amounts of smoke, which can spread across the forest canopy and is more easily detected. Therefore, incorporating both smoke and flames as detection targets can enhance detection accuracy and reliability [25]. Consequently, the dataset should include three typical types of early forest fire images: scenes with only small flames, scenes with only smoke, and scenes with both flames and smoke, as shown in Figure 1a.
b. Interference with Detection Targets
In forest environments, various objects can interfere with flame and smoke detection, causing false positives. Flames may be mistaken for similar-colored objects like sunlight reflections or reddish-yellow leaves [26]. Smoke characteristics, such as color and shape, vary with environmental conditions. For example, smoke from high oil content materials or high temperatures may appear grayish-black or black, while tree shadows can be misidentified as smoke. Conversely, smoke from low fuel loads or low temperatures may appear blue-white or white, and objects like mist, snow, or clouds can be misidentified as smoke [27]. Including these interferences in the dataset can increase scenario diversity and enhance early forest fire detection accuracy. Examples are shown in Figure 1b.
c. Position of Detection Targets
During forest fire patrols, drones follow predefined flight paths [28], scanning limited areas, and typically do not recheck already scanned regions. This means fire locations may not always be within the camera’s field of view. Additionally, early forest fires are often obscured by surrounding vegetation, complicating detection. Therefore, when collecting data, it is crucial to consider that flames and smoke may appear in various positions within the image and may be partially or fully obscured by vegetation, as shown in Figure 1c.

2.1.2. Data Augmentation Processing

Initially, all images were resized to 640 × 640 pixels to ensure consistency of the image data [29]. The LabelImg tool was then used to annotate the images (the label categories are described in Section 2.1.3). The dataset was divided into training, validation, and test sets at a ratio of 6:3:1. Based on an in-depth analysis of the dataset and the detection targets, sample diversity was further enhanced as described below.
To further enrich the dataset and improve model robustness, a series of data augmentation techniques was applied, including HSV color-space transformation, horizontal and vertical flipping, and contrast adjustment. In addition, to reduce the influence of camera shake on detection accuracy, Gaussian blur from the Albumentations open-source image augmentation library (for detailed information, please refer to the Supplementary Materials) was applied so that the model can tolerate the blur introduced by gusts and camera shake on the UAV platform. The RandomSunFlare transform was also applied to the self-built dataset to simulate solar glare, so that the model learns to distinguish glare from actual targets during training, effectively reducing this type of interference.
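A minimal sketch of such an augmentation pipeline is shown below, assuming the Albumentations library; the transform probabilities and parameter values are illustrative assumptions rather than the exact settings used in this study.

```python
import albumentations as A

# Illustrative augmentation pipeline; probabilities and parameters are assumptions.
augment = A.Compose(
    [
        A.HorizontalFlip(p=0.5),                             # horizontal flip
        A.VerticalFlip(p=0.2),                               # vertical flip
        A.HueSaturationValue(p=0.5),                         # HSV color-space jitter
        A.RandomBrightnessContrast(p=0.5),                   # contrast adjustment
        A.GaussianBlur(blur_limit=(3, 7), p=0.3),            # simulate gust/camera-shake blur
        A.RandomSunFlare(flare_roi=(0, 0, 1, 0.5), p=0.2),   # simulate solar glare
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Usage: augmented = augment(image=img, bboxes=boxes, class_labels=labels)
```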
For images with inconspicuous or small-sized targets, the Mosaic data augmentation method was utilized [30]. This technique involves randomly cropping, scaling, and stitching together four images, making small targets more recognizable to the model. This approach effectively increases the number of samples [31] and significantly reduces the risk of missed detections. An example of an image after data augmentation is shown in Figure 2.

2.1.3. Dataset Composition

In this study, samples containing early forest fire characteristics were defined as positive samples and included three types of data: samples containing only small flames, samples containing only smoke, and samples containing both small flames and smoke. Conversely, samples without early fire features were defined as negative samples; these usually contain only interference objects or forest scenes that resemble a fire but contain no actual fire.
Based on the above image acquisition analysis, an early forest fire dataset was constructed. DJI drones equipped with visible light cameras were employed to simulate UAV patrols at the Maoershan Forest Farm of Northeast Forestry University (127°18′0″–127°41′6″ E, 45°2′20″–45°18′16″ N). Given safety, cultural, and environmental constraints, a multi-layer overlay method was used to extract fire elements from different images and superimpose them onto real forest scenes. To enhance the authenticity of the simulated images, the open-source CGAN-PyTorch model (for detailed information, please refer to the Supplementary Materials) was employed: by training a generator and a discriminator network, the model learned the features and distribution of fire images and generated highly realistic fire images to enrich the dataset. To increase dataset diversity, we collected 64,360 images of early forest fires from online sources and public datasets, representing 61.4% of the total training set. Video frames were sampled with a Python script (in PyCharm), capturing one frame every 60 frames, to obtain 10,476 early forest fire images (positive samples). These images were divided into a training set (10,476 images) and a validation set (7685 images) at a 4:1 ratio. Additionally, 639 forest images (negative samples) were added to the validation set. Table 1 illustrates the distribution of sample types in the training and validation sets. Through these methods, a high-quality and diverse early forest fire dataset was established, providing a solid foundation for subsequent fire detection model training.
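As an illustration of the frame-sampling step described above, the sketch below extracts one frame every 60 frames from a patrol video and resizes it to the 640 × 640 training resolution. It assumes OpenCV, and the file paths and function name are hypothetical.

```python
import cv2
from pathlib import Path

def sample_frames(video_path: str, out_dir: str, step: int = 60) -> int:
    """Save one frame every `step` frames from a patrol video (illustrative sketch)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    saved = idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            # Resize to the 640 x 640 input resolution used for training
            frame = cv2.resize(frame, (640, 640))
            cv2.imwrite(str(out / f"fire_{saved:06d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# Example (hypothetical paths): sample_frames("patrol_01.mp4", "dataset/positive")
```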
To label the early forest fire samples, we used the widely used image annotation tool LabelImg. During labeling, flame and smoke were treated as two independent detection targets: flame features were annotated with the "fire" label and smoke features with the "smoke" label.

2.2. The YOLO-UFS Model

The YOLO-UFS model proposed in this paper optimizes the YOLOv5s baseline algorithm to meet the needs of forest fire detection in drone scenarios. To address image blurring caused by gusts and camera jitter under complex conditions, the "deshake" filter of FFmpeg (for details, refer to the Supplementary Materials) is employed to compensate for frame jitter and correct lens shake. This stabilizes the video frames in the preprocessing stage and reduces interference in subsequent detection. Additionally, the Albumentations open-source image augmentation library is used for a preliminary segmentation of sky glare, reducing the probability of mistaking glare for flames. These steps enhance the model's detection accuracy and its ability to recognize fine details, strengthening its overall robustness. The optimized network structure is shown in Figure 3.
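An illustrative invocation of this stabilization step is shown below: FFmpeg's deshake filter is applied to a clip before it is fed to the detector. The filter options are left at their defaults and the file names are hypothetical.

```python
import subprocess

def deshake(src: str, dst: str) -> None:
    """Stabilize a UAV video clip with FFmpeg's `deshake` filter (illustrative sketch)."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-vf", "deshake", dst],
        check=True,
    )

# deshake("patrol_raw.mp4", "patrol_stable.mp4")
```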

2.2.1. Replace the C3 Module

In embedded devices, computing resources are limited, so the detection model needs to be lightweight. The Universal Inverted Bottleneck (UIB) module proposed in MobileNetV4 [32] provides an effective solution to this problem. The module can be applied to the C3 module in the YOLOv5s backbone network to reduce the number of model parameters.
The UIB module integrates the Inverted Bottleneck (IB) from MobileNetV2 [33], ConvNeXt, and Feed-Forward Network (FFN) blocks with the new Extra Depthwise (ExtraDW) variant introduced in MobileNetV4. IB processes the expanded feature activations through spatial mixing; ConvNeXt performs spatial mixing before feature expansion; ExtraDW increases the depth and receptive field of the network without significantly increasing computational cost; and FFN consists of two stacked 1 × 1 pointwise convolutions with an activation layer and a normalization layer in between. The UIB module therefore enables adaptive mixing of spatial and channel information, flexible adjustment of the receptive field, and maximization of computational efficiency.
The UIB module replaces the bottleneck structure in the C3 module to build a new C3-MNV4 (C3-MobileNetV4) module. This module effectively reduces the parameter count of the C3 module while enhancing its feature extraction ability, thereby reducing the computational burden and ensuring the accuracy of early forest fire detection. The structure of the fused C3-MNV4 is shown in Figure 4.
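The exact C3-MNV4 layout is given in Figure 4; the PyTorch sketch below is only an illustrative reconstruction of a C3-style block whose bottlenecks are replaced by UIB-style (ExtraDW) blocks. Layer widths, the expansion ratio, and the class names are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ConvBNAct(nn.Module):
    """Convolution + BatchNorm + SiLU, the standard YOLOv5-style conv block."""
    def __init__(self, c_in, c_out, k=1, s=1, groups=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, groups=groups, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class UIB(nn.Module):
    """UIB-style block (ExtraDW variant): extra depthwise conv -> 1x1 expand ->
    depthwise conv -> 1x1 project, with a residual connection."""
    def __init__(self, c, expand=2.0):
        super().__init__()
        c_mid = int(c * expand)
        self.dw1 = ConvBNAct(c, c, k=3, groups=c)          # extra depthwise (spatial mixing)
        self.expand = ConvBNAct(c, c_mid, k=1)             # pointwise expansion
        self.dw2 = ConvBNAct(c_mid, c_mid, k=3, groups=c_mid)
        self.project = nn.Sequential(nn.Conv2d(c_mid, c, 1, bias=False), nn.BatchNorm2d(c))

    def forward(self, x):
        return x + self.project(self.dw2(self.expand(self.dw1(x))))

class C3MNV4(nn.Module):
    """C3-style block with its bottlenecks replaced by UIB blocks (sketch)."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_hidden = c_out // 2
        self.cv1 = ConvBNAct(c_in, c_hidden, 1)
        self.cv2 = ConvBNAct(c_in, c_hidden, 1)
        self.m = nn.Sequential(*(UIB(c_hidden) for _ in range(n)))
        self.cv3 = ConvBNAct(2 * c_hidden, c_out, 1)

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))
```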

2.2.2. Introduction of the Attention Mechanism NAM

In the training of neural networks, the attention mechanism plays a key role in suppressing less prominent features in both the channel and spatial dimensions. Previous research has mainly used attention operators for feature extraction, which can reveal feature information across different dimensions. The contribution factor of the weights helps to suppress insignificant features so that prominent features stand out; however, earlier methods did not consider this factor sufficiently. Targeting the contribution factor of the weights is therefore an effective way to enhance the attention mechanism. This can be achieved by using the scaling factor in Batch Normalization to represent the importance of the weights [34], which avoids adding extra fully connected or convolutional layers, as required by methods such as SE, BAM, and CBAM [35].
Therefore, this study adopts a normalization-based attention mechanism, NAM [36]. NAM is a lightweight attention mechanism that integrates the spatial and channel attention modules of CBAM and adjusts them so that NAM can be embedded at the end of each network block; for residual networks, it can be incorporated at the end of the residual structure. In the channel attention submodule, the scaling factor from Batch Normalization is used, as shown in Formula (1).
$B_{out} = \mathrm{BN}(B_{in}) = \gamma \dfrac{B_{in} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta$    (1)
where $B_{in}$ and $B_{out}$ denote the input and output of the module, respectively, $\mu_B$ and $\sigma_B$ are the mean and standard deviation of the mini-batch $B$, and $\gamma$ and $\beta$ are trainable affine transformation parameters (scale and shift). The scaling factor indicates the degree of variation of each channel's information and thus reflects the importance of each channel [37].
Specifically, a larger variance indicates more significant changes in the channel, meaning it contains more useful information and therefore has more importance. On the other hand, channels with small variance change less and contain less information; thus, they are less important. The channel attention mechanism is shown in Figure 5b and expressed in Formula (2).
$M_c = \sigma\big(W_\gamma \cdot \mathrm{BN}(F_1)\big)$    (2)
where $M_c$ is the output feature and $W_\gamma$ is the channel weight. For the spatial dimension, the Batch Normalization scaling factor is applied to measure pixel importance, which is referred to as pixel normalization. The spatial attention mechanism is shown in Figure 5c and expressed in Formula (3).
$M_s = \sigma\big(W_\lambda \cdot \mathrm{BN}_s(F_2)\big)$    (3)
where $M_s$ is the output feature, $\lambda$ is the scaling factor, and $W_\lambda$ is the weight.
$Loss = l\big(f(x, W), y\big) + p\, g(\gamma) + p\, g(\lambda)$    (4)
Formula (4) introduces a regularization term into the loss function to suppress insignificant weights, where $x$ is the input, $y$ is the output, $W$ denotes the network weights, $l(\cdot)$ is the loss function, and $g(\cdot)$ is the $\ell_1$-norm penalty function. The parameter $p$ balances the penalties $g(\gamma)$ and $g(\lambda)$.
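A minimal PyTorch sketch of the channel branch described by Formulas (1) and (2) is shown below; it reweights the normalized feature map by the relative magnitude of the Batch Normalization scale factors. This is an illustrative reconstruction rather than the authors' implementation, and the spatial (pixel-normalization) branch would follow the same pattern.

```python
import torch
import torch.nn as nn

class NAMChannel(nn.Module):
    """Channel attention weighted by Batch Normalization scale factors (sketch)."""
    def __init__(self, channels):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = x
        x = self.bn(x)
        # Per-channel importance: |gamma| normalized over all channels
        gamma = self.bn.weight.abs()
        w = gamma / gamma.sum()
        x = x * w.view(1, -1, 1, 1)
        # Gate the original features with the sigmoid of the reweighted map
        return torch.sigmoid(x) * residual
```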

2.2.3. Bidirectional Characteristic Pyramid Network (BiFPN)

BiFPN (Bidirectional Feature Pyramid Network) [38] is an efficient multi-scale feature fusion architecture, which is further optimized on the basis of the PANet structure that fuses FPN (Feature Pyramid Network) [39] and PAN (Path Aggregation Network). The structural design of BiFPN and PANet is shown in Figure 6.
In the PANet architecture, the FPN structure transmits the strong semantic information from the top layer to the bottom layer through the top-down upsampling operation, while the PAN structure transmits the position information from the bottom layer to the top layer through the bottom-up downsampling operation. This combination method enables the parameter aggregation of features of different detection layers, so that the feature maps of different sizes contain both semantic information and position information of the image. BiFPN is optimized and improved on the basis of PANet. It not only retains the advantages of bidirectional connection and allows information fusion between features of different scales, but also introduces a weighted feature fusion mechanism. This mechanism enables the network to pay more attention to features with a larger amount of information, so as to improve the efficiency and effect of feature fusion. In addition, BiFPN removes nodes with only one input, adds connections between input and output nodes at the same level, and treats each bidirectional path as a network feature layer to optimize cross-scale connections. These improvements make BiFPN more structurally lean and significantly improve the network’s ability to handle targets of different sizes and complexity.
In this study, small flames and smoke were used as detection targets. Due to the great difference in features between the two, the BiFPN structure can be used to replace the original PANet structure in the neck of YOLOv5s, which can better realize multi-scale feature fusion and improve the processing ability of different detection targets. This not only improves the accuracy of detection but also reduces the weight of the network model.
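The weighted fusion performed at each BiFPN node can be sketched as a fast normalized fusion with learnable non-negative weights, as below; the class name and epsilon value are illustrative assumptions.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Fast normalized fusion at a BiFPN node: learnable non-negative weights
    let the node favour the more informative input features (sketch)."""
    def __init__(self, n_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps

    def forward(self, features):
        # features: list of tensors already resized to a common scale
        w = torch.relu(self.w)            # keep weights non-negative
        w = w / (w.sum() + self.eps)      # normalize so the weights sum to ~1
        return sum(wi * fi for wi, fi in zip(w, features))
```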

2.2.4. ObjectBox Detector

In the YOLOv5s model, the object detection head consists of three detectors that use a grid-based anchor mechanism to perform detection on multi-scale feature maps. With a 640 × 640 input image, the network outputs feature maps at three scales: 80 × 80, 40 × 40, and 20 × 20. The 80 × 80 feature map is a shallow feature containing more low-level target information and is suited to detecting small targets, so small-scale anchors are configured for it. The 20 × 20 feature map represents deep features and contains more high-level information, such as contours and structures, making it suitable for detecting large targets, so large-scale anchors are configured. The 40 × 40 feature map is used to detect medium-sized targets.
In this study, however, we adopt ObjectBox [40], an innovative single-stage, anchor-free, and highly versatile object detection method. Existing anchor-free detectors are usually biased towards targets of a specific scale during label assignment: they first find candidate positive samples in a certain region through spatial and scale constraints and then select positive samples according to scale, which is limited because targets of different sizes and shapes can lead to different target boxes. ObjectBox instead relies only on the central position of the target as a positive sample and treats all targets equally at all feature levels, regardless of size or shape.
ObjectBox therefore offers a fairer treatment by regressing only from the center of the target. To achieve this, the regression targets are defined as the distances from the two corners of the grid cell containing the target center to the bounding box boundaries. Figure 7 illustrates how label assignment in the original anchor-free detectors compares with the ObjectBox detector used in this study: the latter extends the range of positive samples by regressing only from the central position.
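The following sketch illustrates the regression-target definition described above: distances from the two corners of the center grid cell to the four box boundaries, measured in units of the feature-map stride. Variable names are illustrative and the exact normalization used by ObjectBox may differ.

```python
def objectbox_targets(box_xyxy, cx_cell, cy_cell, stride):
    """Illustrative anchor-free regression targets: distances from the corners
    of the grid cell containing the object centre to the box boundaries.
    `box_xyxy` is (x1, y1, x2, y2) in image pixels; (cx_cell, cy_cell) is the
    integer index of the centre cell on the feature map."""
    x1, y1, x2, y2 = box_xyxy
    # Corners of the centre grid cell in image coordinates
    cell_x1, cell_y1 = cx_cell * stride, cy_cell * stride
    cell_x2, cell_y2 = (cx_cell + 1) * stride, (cy_cell + 1) * stride
    # Distances from the cell corners to the box boundaries, in stride units
    dl = (cell_x1 - x1) / stride   # left
    dt = (cell_y1 - y1) / stride   # top
    dr = (x2 - cell_x2) / stride   # right
    db = (y2 - cell_y2) / stride   # bottom
    return dl, dt, dr, db
```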

2.2.5. Optimize the Loss Function

The loss function is an important tool for measuring the difference between the predicted output of the model and the actual target. Although the existing GIoU (Generalized Intersection over Union) loss can compute the IoU (Intersection over Union) at a broader level, it cannot accurately reflect the actual situation when one box contains the other, in which case GIoU degrades into an ordinary IoU metric [41]. In addition, GIoU requires computing the minimum enclosing box for each prediction box and ground-truth box, which not only increases computational complexity but also limits the convergence speed of the loss function.
To address these problems, the improved network adopts the Adaptive Focus Intersection over Union (AF-IoU) loss function. AF-IoU improves the generalization ability of the model by reducing the weight of the position term when the anchor box largely coincides with the target box, thereby reducing interference during pre-training. As a bounding box regression loss, AF-IoU introduces a dynamic non-monotonic focusing mechanism and a reasonable gradient-gain allocation strategy, which effectively avoids the excessively large or harmful gradients that extreme samples may produce. The AF-IoU bounding box regression is illustrated in Figure 8 and is calculated as follows:
$\mathrm{AF\text{-}IoU} = L_{WIoUv3} = r \, L_{WIoUv1}$
$r = \dfrac{\beta}{\delta\, \alpha^{\beta - \delta}}$
$L_{WIoUv1} = R_{WIoU} \, L_{IoU}$
$R_{WIoU} = \exp\!\left(\dfrac{(x - x_{gt})^2 + (y - y_{gt})^2}{W_g^2 + H_g^2}\right)$
where $r$ is the gradient gain, $\beta$ is the non-monotonic focusing coefficient, $\alpha$ and $\delta$ are hyperparameters, $(x, y)$ is the center point of the prediction box, $(x_{gt}, y_{gt})$ is the center point of the target box, and $W_g$ and $H_g$ are the width and height of the smallest box enclosing the prediction box and the target box.
$R_{WIoU}$ amplifies the weight of ordinary-quality anchor boxes, so that the model pays more attention to anchor boxes with low overlap between the prediction box and the target box. $L_{IoU}$ reduces the weight of high-quality anchor boxes, lowering the attention paid to the center-point distance when the overlap between the prediction box and the target box is already high. In addition, detaching $W_g$ and $H_g$ from the computation graph prevents $R_{WIoU}$ from producing gradients that hinder convergence, as shown in Figure 8.
Since $L_{IoU}$ is dynamic, the quality classification criterion for anchor boxes also changes dynamically, which enables AF-IoU to flexibly adjust its gradient-gain allocation strategy according to the current training state and thus effectively improve overall detection performance.
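A minimal PyTorch sketch of a loss of this form (a WIoU-v3-style IoU loss scaled by a detached distance term $R$ and a dynamic non-monotonic gain $r$) is given below. The hyperparameter values and the way $\beta$ is estimated from the batch are illustrative assumptions, not the exact AF-IoU implementation used in this study.

```python
import torch

def af_iou_loss(pred, target, alpha=1.9, delta=3.0):
    """Illustrative AF-IoU / WIoU-v3-style bounding-box loss.
    Boxes are (x1, y1, x2, y2); hyperparameters are assumptions."""
    # Intersection and union
    ix1 = torch.max(pred[..., 0], target[..., 0])
    iy1 = torch.max(pred[..., 1], target[..., 1])
    ix2 = torch.min(pred[..., 2], target[..., 2])
    iy2 = torch.min(pred[..., 3], target[..., 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + 1e-7)
    l_iou = 1.0 - iou

    # Smallest enclosing box; detached so R does not produce harmful gradients
    wg = (torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])).detach()
    hg = (torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])).detach()
    # Centre-distance focusing term R_WIoU
    px, py = (pred[..., 0] + pred[..., 2]) / 2, (pred[..., 1] + pred[..., 3]) / 2
    tx, ty = (target[..., 0] + target[..., 2]) / 2, (target[..., 1] + target[..., 3]) / 2
    r_wiou = torch.exp(((px - tx) ** 2 + (py - ty) ** 2) / (wg ** 2 + hg ** 2 + 1e-7))

    # Dynamic non-monotonic gradient gain r = beta / (delta * alpha**(beta - delta))
    beta = l_iou.detach() / l_iou.detach().mean().clamp(min=1e-7)
    r = beta / (delta * alpha ** (beta - delta))
    return (r * r_wiou * l_iou).mean()
```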

3. Experiments and Analysis of Results

3.1. Test Conditions and Indicators

In this experiment, an AutoDL server was used to train the early forest fire detection model. The software environment was PyTorch 1.11.0 with CUDA 11.3 and Python 3.8 (Ubuntu 20.04), and the hardware configuration consisted of an RTX 4060 GPU (8 GB) and 8 GB of RAM. The input image size for model training was 640 × 640, the number of training epochs was 300, the initial learning rate was 0.01, and the Adam optimizer was used. For early forest fire detection based on aerial visible images, in addition to ensuring detection accuracy, this paper considers embedding the detection model into the UAV platform; the model is therefore made lightweight to increase computational speed and ensure real-time processing of detection information.
In cases of imbalanced samples, using accuracy alone does not fully reflect the model's performance. Moreover, since both flame and smoke features are treated as detection targets in this study, we evaluate the model's accuracy in early wildfire detection using three metrics: Precision (P), Recall (R), and Mean Average Precision (mAP).
Frames per second (FPS) varies with the performance of the host computer, so to assess whether the detection model is lightweight and real-time capable, this study uses the number of model parameters (Params) and floating-point operations (FLOPs). The formulas for Precision (P), Recall (R), and Mean Average Precision (mAP) are as follows:
$P = \dfrac{TP}{TP + FP}$
$R = \dfrac{TP}{TP + FN}$
$mAP = \dfrac{1}{N} \sum_{i=1}^{N} \int_{0}^{1} P_i(R)\, dR$
where $TP$ is the number of true positives correctly predicted by the model, $FP$ is the number of false positives (negative samples predicted as positive), $FN$ is the number of false negatives (positive samples predicted as negative), and $N$ is the number of target classes.
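As a small worked example of the first two formulas, the sketch below computes precision and recall from detection counts; the numbers are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from counts of true positives, false positives,
    and false negatives, following the formulas above."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical counts: 91 correct detections, 9 false alarms, 6 missed targets
# -> precision = 0.91, recall ≈ 0.938
print(precision_recall(91, 9, 6))
```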

3.2. Comparative Experiments

In order to verify the advantages of the optimized YOLOv5s model in early forest fire detection, this paper uses the same experimental conditions on a self-constructed early forest fire dataset and compares it with YOLOv3 Tiny [42], YOLOv8, and YOLOXs [16] proposed in similar studies. The experimental results are shown in Table 2.
From Table 2, it can be seen that the YOLOv3-Tiny model is lighter than YOLOv5s, but its accuracy in early forest fire detection is lower. Compared with YOLOv5s, YOLOv8 increases the mAP value for early forest fire detection by 3.5%, but its parameter count and floating-point operations are much higher than those of YOLO-UFS. The YOLOXs model proposed in other similar studies can only detect flame targets, and its parameter count and floating-point operations are also higher than those of the detection model proposed in this study. In contrast, the YOLO-UFS model outperforms the other YOLO algorithms in both accuracy and model size. This suggests that YOLO-UFS has an advantage when working with visible images from a bird's-eye view: applied to a UAV platform, it can process detection information more efficiently and ensure real-time detection. To further verify the robustness of the detection network structure, forest fire experiments in other periods will be carried out in the future to evaluate its application in real scenarios.

3.3. Ablation Experiments

To verify whether introducing ObjectBox, BiFPN, NAM, the AF-IoU loss, and C3-MNV4 improves the YOLO-UFS network structure, making the model lighter while improving the accuracy and real-time performance of early forest fire detection, we carried out ablation experiments on the self-built early forest fire dataset. The results are shown in Table 3. The first group of experiments was based on the original YOLOv5s model; groups 2 to 6 tested each optimization method individually; groups 7 to 12 tested combinations of some optimization methods; and group 13 applied all optimization methods simultaneously.
According to the data in Table 3, ObjectBox and BiFPN improve the accuracy of early forest fire detection while significantly reducing the number of parameters and floating-point operations, successfully making the network model lightweight. While the AF-IoU loss has no direct effect on model size, it effectively improves detection accuracy. Specifically, compared with the original model, the precision of ObjectBox and BiFPN decreases slightly, but recall and mean average precision improve significantly, and the C3-MNV4 module reduces the number of parameters while enhancing feature extraction, ensuring the accuracy of early forest fire detection. NAM attention enhances detection in noisy environments. In early forest fire detection, it is important to follow the principle that a false alarm is preferable to a missed detection; maximizing recall while maintaining precision is therefore key, and mAP, which accounts for both precision and recall, is used as the overall accuracy metric.
Comparing the trials in groups 7 to 12 shows that the different optimization methods differ significantly in their ability to improve precision, recall, and mAP. The fusion of the optimization methods provides a good balance between precision, recall, and mAP. Compared with the original model, precision (P) is improved by 3.8%, recall (R) by 4.1%, and mean average precision (mAP) by 3.2%, while floating-point operations (FLOPs) are reduced by 74.7% and the number of parameters by 78.3%. The results show that the optimization methods proposed in this paper not only maintain accuracy but also make the early forest fire detection model lightweight, improving recall as much as possible while ensuring precision.

3.4. Generalization Experiment

3.4.1. Generalization Comparison Experiments

To further verify the generalization of the improved algorithm, network models configured for other fire stages in a public dataset environment, namely YOLOv3, YOLOv4, YOLOv5, YOLOv7, and YOLOX, were compared with the proposed algorithm. In the experiment, all five networks use the same CIoU loss function to ensure a fair comparison. The evaluation indicators of the comparative test follow the index system described above.
As shown in Table 4, the YOLO-UFS network model shows obvious advantages over the other network models in forest fire detection tasks. Compared with the YOLOv5s model, its mAP value is higher by 3 percentage points, which is of great significance in practical applications. Compared with YOLOv7, YOLO-UFS has an mAP value about 3.6 percentage points higher, and its advantage over YOLOX is even more obvious, with an mAP value about 6.3 percentage points higher.
The results of these comparative experiments strongly show that YOLO-UFS has better results in identifying forest fires, complex environments, low latency, and small targets. These characteristics also correspond to those of early forest fires (see Table 5 for detailed data).
The detection effect of the YOLO-UFS model on forest fire data samples is shown in Figure 9. From the figure, it can be clearly seen that the model is able to accurately perform the recognition operation when facing targets of different scales and shapes. This indicates that the YOLO-UFS model not only outperforms other similar models in terms of performance indexes but also can better cope with a variety of complex detection scenarios in the actual forest fire detection task, which provides stronger technical support for the early warning and rapid response of forest fire.

3.4.2. Generalized Ablation Experiments

When generalized ablation experiments were performed on publicly available datasets, the precision and recall of YOLO-UFS and of the different modules as a function of the number of iterations are shown in Figure 10.
Precision reflects the proportion of predicted positives that are actually correct; higher values indicate that the model is more accurate in judging positive samples and can filter relevant targets more effectively, improving the relevance of the results. Recall focuses on the proportion of all true positive targets correctly predicted by the model, which intuitively reflects how well the model covers the positive samples; the higher the value, the more likely it is that a true target in the image is successfully identified.
As can be seen in Figure 10a, the YOLO-UFS model outperforms YOLOv5s. Figure 10b shows that several models rise very rapidly before three epochs, fall rapidly between epochs 3 and 7, and then trend upward. Specifically, the combination of C3-MNV4 and the AF-IoU loss function is mainly responsible for reducing computation and improving precision, although its improvement in recall is also better than that of the baseline algorithm, while NAM attention and ObjectBox improve recall more noticeably. The conclusions from training the complete YOLO-UFS model on the public dataset are essentially the same as those from the self-built dataset, demonstrating the improved generalization ability and multi-environment applicability of the algorithm.
The mAP trends obtained by each algorithm on the public experimental dataset are shown in Figure 11, where 0.50 and 0.95 are IoU threshold settings and the mAP value is the average precision computed over all target classes. The mAP values show a general upward trend as the model is optimized step by step: mAP0.5 increases from 85.2% for the original model to 86.3%, and mAP0.5:0.95 increases from 56.7% to 57.9%, a significant improvement in mean average precision. With the loss function kept as the initial YOLOv5s function, the data augmentation mode of the network was then varied; the baseline model benefits little from the Mixup augmentation method, but after changing the original network to the anchor-free mode with the more accurate ObjectBox detector and the NAM attention mechanism, the improvement in the evaluation indicators is more obvious. Overall, the experimental results show that the total mAP value increased by 3.3%.
In summary, the replacement of the C3 module, the modified data augmentation method, the revised loss function, and the anchor-free detector adopted in this experiment all yield good results in forest fire detection. The improvements are well targeted, and the detection effect is significant for scenes captured by the drone, demonstrating the excellent performance of the model in handling different target environments.

3.5. Actual Machine Verification Experiments

After preliminary laboratory verification, the algorithm shows good performance and reliability. To further test its practical effect on a low-computing-power, low-power UAV platform, we carried out an experiment to verify its lightweight, low-cost, and high-efficiency characteristics and ensure its practical feasibility. Prior to the experiment, we reported to the university as required and received approval. This experiment provides strong support for optimizing and promoting drones in this scenario and for extending their application to more practical settings.

3.5.1. Field Experiment Environment and UAV Configuration

In this experiment, considering environmental constraints, safety, pollution, and human factors, we avoided simulating a fire scene in a dense forest. Instead, we conducted a single ignition experiment with a safety officer present at a small forest farm near the river beach in Maoershan to simulate an early forest fire scenario. To ensure safety and minimize environmental impact, we pre-deployed comprehensive fire extinguishing measures around the site and immediately extinguished the fire after the experiment.
For our experimental drone platform, we prioritized market adaptability, cost control, feasibility, and market promotion potential. We constructed a drone with a traditional frame and propulsion system, as shown in Figure 12. It is equipped with a PX4 flight control system featuring GPS for precise flight control. The onboard computer is an affordable Jetson Nano 4GB, which includes a quad-core ARM Cortex-A57 MPCore CPU (up to 1.43 GHz) and 128 CUDA cores, sufficient for basic experiments. Additionally, we used the OpenMV camera (model OV7725) designed for low-power drone platforms. This configuration, which balances computing power and power consumption, is a common market choice and costs around USD 450. It effectively validates the deployability of algorithms in real-world environments and supports future market promotion.

3.5.2. Real-World Experimental Analysis

In the above real-world environment, we deployed the drone platform for an actual forest fire search mission. After pressing the one-key take-off button on the remote control, we did not manually intervene in the search process; instead, we simply monitored the display over the Bluetooth connection to verify the feasibility of the algorithm in real-world applications. The experimental environment and search task are shown in Figure 13a (the red bucket is the actual fire source and is used to keep the flame under control), and the top-down images captured after the burning ignition source is detected are shown in Figure 13b.
In a real-world environment, we conducted experiments on the Jetson Nano platform. By leveraging TensorRT to convert the FP32 model from the PC to FP16, we successfully reduced the model size and enhanced inference speed, making it more suitable for onboard deployment. The optimized model is only 7.4 MB in size. We meticulously analyzed the memory usage, including model parameters (64 MB), input images and intermediate feature maps (160 MB), and output results (32 MB), totaling 256 MB on the Jetson Nano platform. With an average power consumption of just 5.5 watts, our algorithm is significantly more energy-efficient than industrial-grade drone GPUs running YOLOv5 (which typically consume tens of watts). This low power consumption enables prolonged operation on small drones without compromising endurance, crucial for long-duration on-site tasks. Further optimization reduced YOLO-UFS’s memory usage from 512 MB to 256 MB, primarily by minimizing the intermediate feature map. Although additional reductions in feature map resolution and channels could lower memory usage further, this might negatively impact accuracy and efficiency. Balancing these factors, our algorithm achieves a frame rate of approximately 30 FPS, meeting the real-time object detection requirements in low-compute scenarios.
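An illustrative FP16 conversion path for this kind of Jetson Nano deployment is sketched below: the trained PyTorch model is first exported to ONNX and a TensorRT engine is then built with trtexec. The file names are hypothetical and the exact export settings used in this study are not reproduced here.

```python
import subprocess

# Build a half-precision TensorRT engine from an ONNX export of the model
# (the ONNX file is assumed to have been produced with torch.onnx.export).
subprocess.run(
    [
        "trtexec",
        "--onnx=yolo_ufs.onnx",                 # hypothetical exported model
        "--saveEngine=yolo_ufs_fp16.engine",    # serialized engine for the Jetson Nano
        "--fp16",                               # enable FP16 precision
    ],
    check=True,
)
```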
Based on the verification using the computer-side dataset and actual tests on the low-computing UAV platform, the algorithm achieves a good balance between low power consumption, low memory usage, and accuracy, meeting the research requirements. This optimization enables the algorithm to run efficiently on resource-constrained UAV platforms, providing a low-cost, high-efficiency solution for field applications such as forest fire fighting, with significant practical value.

4. Visual Analysis and Discussion

4.1. Visual Analysis

In this experiment, we conducted an in-depth comparative analysis of three different detection models. The three models are an open flame detection model based on the open flame dataset and YOLOv5s, an early forest fire detection model based on the self-built early forest fire dataset and YOLOv5s, and an early forest fire detection model based on the self-built early forest fire dataset and YOLO-UFS. A closer look at the visualization results in Figure 14 shows that although all three models exhibit high accuracy in the detection task of flame and smoke targets, the optimized early forest fire detection model in this study performs particularly well on several key performance indicators. In particular, in terms of confidence, the model shows a significant advantage in identifying relevant targets for early forest fires with a higher level of confidence. At the same time, it effectively reduces the phenomenon of duplicate detection, which fundamentally improves the accuracy and reliability of detection. In addition, compared with the self-built dataset and the model trained on the public dataset, the optimized model has stronger performance in target directionality, and shows a higher sensitivity to the typical feature of early forest fires, “big smoke and small fire”, and can more keenly capture this key feature, so as to send out early warning signals in time in the early stage of fires.

4.2. Discussion of Future Work

Despite the significant breakthroughs in accuracy and recall of the model, we must also be aware that there are still some shortcomings when using public datasets for testing. This indicates that the current model may still have some limitations when dealing with diverse and complex practical scenarios. Therefore, future research work should further optimize the model architecture and actively expand the scale and diversity of datasets. By introducing more real-world scene data, the model will be able to better learn the fire characteristics in different environments, thereby further improving its detection performance and adaptability in the real world.
Looking to the future, forest fire detection will advance in four key directions to achieve comprehensive coverage and zero false negatives. The integration of infrared sensors—capable of accurately detecting flames and high-temperature regions—and multispectral sensors—used for analyzing the spectral characteristics of smoke—forms an effective multi-dimensional perception method. Infrared sensors provide precise temperature information of flames and hotspots, while multispectral sensors analyze the spectral signatures of smoke to identify its composition and concentration. This combination effectively addresses the limitations of relying solely on visible light data in complex environments. Meanwhile, drones equipped with video surveillance will transmit real-time data to ground control stations, where human operators will process complex images and situations to determine whether a fire is present and coordinate follow-up actions. This human–robot collaboration minimizes false positives by leveraging the efficiency of AI algorithms and the discernment of human operators. Additionally, model compression algorithms will be optimized to ensure real-time analysis capabilities on UAV edge devices with limited computing power, aiming to achieve a frame rate of 60 frames per second for timely detection and response. Lastly, a large-scale, cross-regional, multi-meteorological dataset with millions of labeled entries will be developed. Fine-grained annotation will enhance the model’s generalization performance in complex forest areas. Ultimately, these advancements will lead to the construction of an integrated early warning system that embodies “accurate perception, rapid response, and all-weather monitoring”.

5. Conclusions

This study enhances early forest fire detection accuracy and efficiency through improved deep learning models. Replacing the C3 module with C3-MNV4 reduces parameters while boosting feature extraction. The AF-IoU loss function improves precision for small targets, and enhancements like the NAM attention mechanism, ObjectBox, and BiFPN further elevate the model’s focus, detail retention, and generalization capabilities. These improvements offer a cost-effective, real-time detection solution using drones, supporting low-cost UAV monitoring in forest environments.
Future work will focus on human–machine collaboration to optimize workflows and reduce resource waste. We also plan to integrate additional sensor data processes to further enhance accuracy and minimize resource consumption, improving the system’s reliability and applicability in diverse forest fire scenarios.

Supplementary Materials

Public datasets: https://par.nsgov/biblio/10497556 (accessed on 23 December 2024); self-built dataset: https://www.heywhale.com/mw/dataset/67ce7b2fe64dcf03bf8e08a1; Albumentations (GitCode mirror): https://gitcode.com/gh_mirrors/al/albumentations/overview (accessed on 13 March 2025); CGAN-PyTorch (GitHub): https://github.com/ChenKaiXuSan/CGAN-PyTorch (accessed on 17 November 2024); FFmpeg deshake filter documentation: https://ffmpeg.org/ffmpeg-filters.html#deshake (accessed on 10 March 2025).

Author Contributions

The conceptualization of the study was carried out by Z.L., H.X., and Z.J. The methodology was developed by Z.L. and H.X., while the software was handled by Z.L., C.C., and C.Z. Validation was performed by H.X. and Z.J., and formal analysis was conducted by H.X., Z.L., and Z.J. The investigation was led by C.Z., H.X., and Z.J., with resources provided by Z.L., C.C., and H.X. Data curation was managed by C.Z. and H.X. The original draft preparation was carried out by Z.L., C.C., Y.X., and H.X. The writing—review and editing was completed by Z.L., H.X., and C.Z. Visualization was conducted by C.Z., Z.J., and Z.L. Supervision was provided by H.X. and C.C. Project administration was managed by Z.L. and Y.X. Funding acquisition was secured by Z.L. and Y.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the University Student Innovation Training Program (National Level) of China, grant number 202410225290, and the Heilongjiang Northeast Forestry University Innovation Program, Open Funding Item No. 0824.

Data Availability Statement

The original contributions of this study are incorporated into the article. For further inquiries, please contact the corresponding author.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

  1. Cui, R.K.; Qian, L.H.; Wang, Q.H. Research progress on fire protection function evaluation of forest road network. World For. Res. 2023, 36, 32–37. [Google Scholar]
  2. National Bureau of Statistics. Statistical Bulletin of the People’s Republic of China on National Economic and Social Development for 2023. 2024. Available online: https://www.stats.gov.cn/sj/zxfb/202402/t20240228_1947915.html (accessed on 4 November 2024).
  3. Abid, F. A survey of machine learning algorithms based forest fires prediction and detection systems. Fire Technol. 2021, 57, 559–590. [Google Scholar] [CrossRef]
  4. Alkhatib, A.A.A. A review on forest fire detection techniques. Int. J. Distrib. Sens. Netw. 2014, 10, 597368. [Google Scholar] [CrossRef]
  5. Bao, C.; Cao, J.; Hao, Q.; Cheng, Y.; Ning, Y.; Zhao, T. Dual-YOLO architecture from infrared and visible images for object detection. Sensors 2023, 23, 2934. [Google Scholar] [CrossRef] [PubMed]
  6. Zhang, S.; Gao, D.; Lin, H.; Sun, Q. Wildfire detection using sound spectrum analysis based on the internet of things. Sensors 2019, 19, 5093. [Google Scholar] [CrossRef] [PubMed]
  7. Lai, X.L. Research and Design of Forest Fire Monitoring System Based on Data Fusion and Iradium Communication. Ph.D. Thesis, Beijing Forestry University, Beijing, China, 2015. [Google Scholar]
  8. Nan, Y.L.; Zhang, H.C.; Zheng, J.Q.; Yang, K.Q. Application of deep learning to forestry. World For. Res. 2021, 34, 87–90. [Google Scholar]
  9. Yuan, C.; Zhang, Y.M.; Liu, Z.X. A survey on technologies for automatic forest fire monitoring, detection, and fighting using unmanned aerial vehicles and remote sensing techniques. Can. J. For. Res. 2015, 45, 783–792. [Google Scholar] [CrossRef]
  10. Han, Z.S.; Fan, X.Q.; Fu, Q.; Ma, C.Y.; Zhang, D.D. Multi-source information fusion target detection from the perspective of unmanned aerial vehicle. Syst. Eng. Electron. Technol. 2025, 47, 52–61. Available online: https://link.cnki.net/urlid/11.2422.tn.20240430.1210.003 (accessed on 4 November 2024).
  11. Barmpoutis, P.; Papaioannou, P.; Dimitropoulos, K.; Grammalidis, N. A review on early forest fire detection systems using optical remote sensing. Sensors 2020, 20, 6442. [Google Scholar] [CrossRef]
  12. Li, D. The Research on Early Forest Fire Detection Algorithm Based on Deep Learning. Ph.D. Thesis, Central South University of Forestry & Technology, Changsha, China, 2023. [Google Scholar]
  13. Xue, Z.; Lin, H.; Wang, F. A small target forest fire detection model based on YOLOv5 improvement. Forests 2022, 13, 1332. [Google Scholar] [CrossRef]
  14. Zhao, L.; Zhi, L.; Zhao, C.; Zheng, W. Fire-YOLO: A small target object detection method for fire inspection. Sustainability 2022, 14, 4930. [Google Scholar] [CrossRef]
  15. Zu, X.P. Research on Forest Fire Smoke Recognition Method Based on Deep Learning. Ph.D. Thesis, Northeast Forestry University, Harbin, China, 2023. [Google Scholar]
  16. Su, X.D.; Hu, J.X.; Chenlin, Z.T.; Gao, H.J. Fire image detection algorithm for UAV based on improved YOLOv5. Comput. Meas. Control. 2023, 31, 41–47. [Google Scholar]
  17. Pi, J.; Liu, Y.H.; Li, J.H. Research on lightweight forest fire detection algorithm based on YOLOv5s. J. Graph. 2023, 44, 26–32. [Google Scholar]
  18. Wang, Y.; Li, J.; Chen, M. Hybrid Transformer-CNN architecture for enhanced forest fire detection. IEEE Trans. Ind. Inform. 2024, 20, 1345–1357. [Google Scholar]
  19. Chen, X.; Liu, Y.; Zhao, Q. Multi-modal fusion for early fire detection using infrared and visible images. Sensors 2024, 24, 3215–3227. [Google Scholar]
  20. Zhang, L.; Guo, S.; Sun, J. Adaptive YOLOv8 with dynamic attention mechanism for smoke detection in forest fires. IEEE Access 2025, 13, 102345–102359. [Google Scholar]
  21. Guo, H.; Deng, F.; Zhou, R. Self-supervised pre-training for improved small object detection in aerial fire monitoring. Neurocomputing 2025, 495, 443–455. [Google Scholar]
  22. Li, W.; Zhang, T.; Ma, F. Edge-cloud collaborative lightweight model for real-time forest fire detection. IEEE Internet Things J. 2024, 11, 3241–3252. [Google Scholar]
  23. Zhou, Q.; Sun, Y.; Tang, D. Federated learning based distributed fire detection system for forest environments. J. Netw. Comput. Appl. 2024, 186, 103133. [Google Scholar]
  24. He, J.; Liu, Z.; Wang, R. Adversarial training for robust YOLO-based detection in complex fire scenarios. Pattern Recognit. 2025, 121, 108055. [Google Scholar]
  25. Sun, X.; Sun, L.; Huang, Y. Forest fire smoke recognition based on convolutional neural network. J. For. Res. 2020, 32, 1921–1927. [Google Scholar] [CrossRef]
  26. Lu, K.; Huang, J.; Li, J.; Zhou, J.; Chen, X.; Liu, Y. MTL-FFDET: A multi-task learning-based model for forest fire detection. Forests 2022, 13, 1448. [Google Scholar] [CrossRef]
  27. Prema, C.E.; Vinsley, S.S.; Suresh, S. Multi-feature analysis of smoke in YUV color space for early forest fire detection. Fire Technol. 2016, 52, 1319–1342. [Google Scholar] [CrossRef]
  28. Xu, Y.Q.; Li, J.M.; Zhang, F.Q. A UAV-based forest fire patrol path planning strategy. Forests 2022, 13, 1952. [Google Scholar] [CrossRef]
  29. Zu, X.; Li, D. Forest fire smoke recognition method based on UAV images and improved YOLOv3-SPP algorithm. J. For. Eng. 2022, 7, 142–149. [Google Scholar] [CrossRef]
  30. Zhang, Q.; Liu, Y.; Gong, C.; Chen, Y.; Yu, H. Applications of deep learning for dense scenes analysis in agriculture: A review. Sensors 2020, 20, 1520. [Google Scholar] [CrossRef]
  31. Bu, H.; Fang, X.; Yang, G. Object detection algorithm for remote sensing images based on multi-dimensional information interaction. J. Heilongjiang Inst. Technol. 2022, 22, 58–65. [Google Scholar] [CrossRef]
  32. Qin, D.; Leichner, C.; Delakis, M.; Fornoni, M.; Luo, S.; Yang, F.; Wang, W.; Banbury, C.; Ye, C.; Akin, B.; et al. MobileNetV4: Universal models for the mobile ecosystem. arXiv 2024, arXiv:2404.10518. [Google Scholar]
  33. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  34. Yang, Y. Research on Image Data Augmentation Method Based on Generative Adversarial Network. Ph.D. Thesis, Strategic Support Force Information Engineering University, Zhengzhou, China, 2022. [Google Scholar]
  35. Xiu, Y.; Zheng, X.; Sun, L.; Fang, Z. FreMix: Frequency-based Mixup for data augmentation. Wirel. Commun. Mob. Comput. 2022, 2022, 5323327. [Google Scholar] [CrossRef]
  36. Liu, Y.; Shao, Z.; Teng, Y.; Hoffmann, N. NAM: Normalization-based attention module. arXiv 2021, arXiv:2111.12419. [Google Scholar]
  37. Yao, X. Research on Airtight Water Inspection Method of Closed Container Based on Depthwise Separable Convolutional Neural Network. Ph.D. Thesis, Xijing University, Xi’an, China, 2022. [Google Scholar] [CrossRef]
  38. Tan, M.X.; Pang, R.M.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
  39. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  40. Zand, M.; Etemad, A.; Greenspan, M. ObjectBox: From centers to boxes for anchor-free object detection. In Lecture Notes in Computer Science; Springer Nature Switzerland: Cham, Switzerland, 2022; pp. 390–406. [Google Scholar] [CrossRef]
  41. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding box regression loss with dynamic focusing mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
  42. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
Figure 1. Sample composition analysis of the self-built early forest fire dataset.
Figure 2. Data augmentation and examples of augmented images.
Figure 3. YOLO-UFS network structure diagram.
Figure 4. Structure of C3-MNV4.
Figure 5. Attention mechanism.
Figure 6. Structure of PANet and BiFPN.
Figure 7. Anchor-free detector.
Figure 8. Schematic diagram of bounding box regression.
Figure 9. Comparison of forest fire recognition accuracy of the improved model.
Figure 10. Comparison of the precision and recall of the models. (a) Precision comparison experiments; (b) recall comparison experiments.
Figure 11. Comparison of mAP0.5 and mAP0.5:0.95 values for each model. (a) mAP0.5; (b) mAP0.5:0.95.
Figure 12. The UAV platform built for the field validation.
Figure 13. Actual scenario of UAV image return. (a) Photographs of the real-world environment experiments. (b) Returned image containing the flame.
Figure 14. Comparison of the recognition performance of the three detection models.
Table 1. The number of samples in the dataset.

Dataset          | Sample Type                                  | Number of Samples | Total
Training set     | Positive sample: only flames                 | 2720              |
                 | Positive sample: only smoke                  | 1776              |
                 | Positive sample: flames and smoke coexist    | 5980              | 10,476
Validation set   | Positive sample: only flames                 | 2181              |
                 | Positive sample: only smoke                  | 1141              |
                 | Positive sample: flames and smoke coexist    | 4098              | 7685
Negative samples |                                              | 639               | 639
Total            |                                              |                   | 18,800
Table 2. Comparison of YOLO series algorithms.

Method      | mAP/% | FLOPs/G | Parameters
YOLOv3-Tiny | 79.7  | 13.2    | 8,849,182
YOLOv5s     | 87.7  | 15.8    | 7,015,519
YOLOv8      | 89.2  | 28.4    | 11,126,358
YOLOXs      | 86.9  | 9.6     | 2,975,226
YOLO-UFS    | 91.3  | 4.0     | 1,525,465
Table 3. Performance comparison of different modules after change.

Number | ObjectBox | BiFPN | NAM | AF | C3-MNV4 | Weight/MB | Precision/% | Recall/% | mAP/% | FLOPs/G | Parameters
1      |           |       |     |    |         | 14.2      | 85.3        | 80.4     | 88.4  | 15.8    | 7,015,519
2      |           |       |     |    |         | 14.0      | 85.7        | 80.7     | 88.6  | 3.5     | 1,630,157
3      |           |       |     |    |         | 14.1      | 85.6        | 80.6     | 88.7  | 4.6     | 1,685,145
4      |           |       |     |    |         | 14.1      | 85.4        | 80.5     | 88.6  | 4.3     | 1,944,973
5      |           |       |     |    |         | 14.0      | 86.2        | 81.3     | 88.8  | 5.3     | 5,015,519
6      |           |       |     |    |         | 14.0      | 87.8        | 81.2     | 89.2  | 4.4     | 1,014,517
7      |           |       |     |    |         | 14.2      | 85.4        | 80.7     | 88.4  | 3.8     | 1,447,897
8      |           |       |     |    |         | 14.0      | 87.1        | 80.6     | 88.6  | 4.5     | 1,944,973
9      |           |       |     |    |         | 14.1      | 88.3        | 82.3     | 90.4  | 4.3     | 3,499,378
10     |           |       |     |    |         | 14.1      | 88.4        | 81.6     | 90.6  | 3.9     | 1,447,897
11     |           |       |     |    |         | 14.1      | 87.8        | 81.4     | 90.3  | 4.0     | 1,525,465
12     |           |       |     |    |         | 14.2      | 88.2        | 82.6     | 91.4  | 4.0     | 1,477,634
13     |           |       |     |    |         | 14.0      | 88.6        | 83.7     | 91.3  | 4.0     | 1,525,465
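The efficiency gains reported for the full configuration can be checked directly from Table 3, where row 1 is the unmodified YOLOv5s baseline and row 13 is YOLO-UFS with all modules applied. A minimal arithmetic sketch in Python (values copied from the table; the variable names are illustrative only):

# Baseline YOLOv5s (Table 3, row 1) vs. full YOLO-UFS configuration (row 13)
baseline_gflops, baseline_params = 15.8, 7_015_519
ufs_gflops, ufs_params = 4.0, 1_525_465

# Relative reductions, in percent
gflops_reduction = (baseline_gflops - ufs_gflops) / baseline_gflops * 100
params_reduction = (baseline_params - ufs_params) / baseline_params * 100

print(f"FLOPs reduced by {gflops_reduction:.1f}%")       # about 74.7%
print(f"Parameters reduced by {params_reduction:.1f}%")   # about 78.3%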
Table 4. Performance comparison of different algorithm models.

Method   | Weight/MB | Input Size | Precision/% | Recall/% | mAP/% | F1/% | Recognition Rate/(frame·s−1)
YOLOv3   | 120.5     | 640        | 70.9        | 59.9     | 63.1  | 64.9 | 39.7
YOLOv4   | 18.1      | 640        | 73.2        | 59.3     | 64.4  | 65.5 | 71.0
YOLOv5s  | 14.2      | 640        | 75.4        | 57.3     | 62.3  | 65.1 | 85.6
YOLOv7   | 135.0     | 640        | 76.2        | 52.9     | 79.4  | 62.4 | 35.3
YOLOX    | 15.5      | 640        | 74.2        | 53.5     | 77.4  | 62.2 | 169.5
YOLO-UFS | 14.0      | 640        | 74.9        | 58.7     | 82.3  | 65.8 | 172.4
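For reference, the F1 column in Table 4 is consistent with the standard harmonic mean of the listed precision and recall. A short Python check using the YOLO-UFS row (values taken from the table; this is only a verification sketch, not part of the detection pipeline):

# F1 score as the harmonic mean of precision and recall (YOLO-UFS row, Table 4)
precision, recall = 74.9, 58.7           # percent
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.1f}")                  # about 65.8, matching the table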
Table 5. The experimental results of different sizes based on the Microsoft COCO standard.

Model    | PA-S | PA-M | PA-L | RA-S | RA-M | RA-L
YOLOv5s  | 18.3 | 27.0 | 22.1 | 30.7 | 43.8 | 27.6
YOLO-UFS | 23.8 | 29.8 | 27.6 | 34.2 | 47.2 | 37.2
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

