1. Introduction
Forests are a critical natural resource, providing abundant forestry products and supporting ecological balance, climate regulation, air purification, and soil conservation. Among forest disturbances, wildfires are one of the most devastating threats.
China has widely distributed forest resources, with wildfire risks across all provinces and seasons, showing distinct seasonal variations [
1]. Spring is the peak wildfire season, mainly due to agricultural activities (e.g., slash-and-burn farming). Autumn is the second most high-risk period, primarily due to drought and ignition sources. Winter wildfires mostly result from human activities (e.g., ritual burning), while summer wildfires are mainly triggered by lightning.
Forest fires spread rapidly and are highly destructive, making timely identification of potential fire zones critical for prevention. Various detection systems have been proposed. Kumar, S.K. et al. [
2] developed an intelligent forest fire detection system using LoRaWAN sensors and machine/deep learning for early warning in remote areas. Gobinda Prasad Acharya et al. [
3] proposed a Forest Protection System (FPS) integrating IoT and LoRa communication to detect intrusions and fires. Nahar, A. et al. [
4] introduced an IoT system combining Smart EcoHydrant Stations (SEHS), UAVs, and blockchain for wildfire monitoring and deforestation control.
Early fire detection algorithms determine monitoring system effectiveness. Remote sensing and machine vision are increasingly used for forest fire monitoring. Ghaderpour, E. et al. [
5], Jang, E. et al. [
6], and Marsha, A. L. et al. [
7] used remote sensing for near-real-time surveillance, but these methods require significant infrastructure and maintenance costs.
Recently, drone-based monitoring with machine vision has emerged as a promising approach [
8]. The accurate identification of potential fire zones in forest imagery is essential. Early forest fire image recognition relied on manually extracted color features. Pathare et al. [
9] analyzed the R-component intensity of flame pixels in the RGB space, while Hossain, F. M. A. et al. [
10] used the YUV space for flame detection. Yuan, C. et al. [
11] developed a color-based fire detection method for drone patrols.
The success of deep learning in image processing has spurred research into early forest fire detection [
12]. Current approaches face three challenges.
First, models lack generalization. Forest environments vary in vegetation, terrain, and weather, requiring adaptable models. Marques, E.Q. et al. [
13] studied vegetation effects, and Mambile, C. et al. [
14] explored weather impacts, but generalization across regions and seasons remains understudied [
15]. Huang, J. et al. [
16] showed that the detection performance degrades significantly under low-light dawn/dusk scenarios and dense fog. Environmental sensitivity, caused by seasonal color shifts, varying humidity, and ambient temperature, remains one of the most cited limitations in current deep-learning fire detection systems [
17].
Second, the subtle features of early-stage forest fires and partial occlusion often lead to false negatives or misclassifications. To address this, Lee, Y. et al. [
18] improved accuracy by integrating global and local features in Faster R-CNN. Xu, R. et al. [
19] combined YOLOv5 and EfficientDet with a classifier for verification. Recent advancements have further improved forest fire detection accuracy through optimizations to the YOLO algorithm, including backbone network enhancements [
20,
21], attention mechanism integration [
20,
22], Neck structure modifications [
21,
22], and loss function refinements [
22,
23]. Wang F. [
24] enhanced small-target detection with attention mechanisms. Zhu et al. [
25] replaced full convolution with partial convolution in YOLOv8’s C2f module, reducing complexity. Wang, H. et al. [
26] proposed the DSS-YOLO to address missed detections and real-time issues for small flames and smoke.
Third, near-real-time monitoring requires faster inference. Model lightweighting and hardware acceleration improve efficiency. Honglin Wang et al. [
27] used Depthwise Separable Convolutions to simplify YOLOv5. Lin et al. [
28] introduced a lightweight self-attention detection head (SADH) for YOLOv8. Jin, L. et al. [
29] applied RepViT and SimAM for lightweighting. Briley, A. A. et al. [
30] achieved faster inference via hardware acceleration.
A key challenge is the lack of public, high-quality datasets for early fire detection [
31]. Forest fires’ unpredictability and regional heterogeneity hinder data collection. Currently, researchers predominantly rely on data augmentation techniques to expand and diversify forest fire image datasets, facilitating model training and optimization [
32]. To address the lack of real forest fire data, Soliman [
33] proposed collecting data by simulating small-scale forest fires to approximate large-scale wildfire scenarios.
To address the poor performance and low accuracy of models in identifying suspicious regions of forest fire, caused by limited dataset sample sizes, dispersed sample features, and small target sizes, this paper builds on related research and makes the following contributions:
- (1)
Designed the suspicious regions of forest fire dataset (SRFFD), incorporating diverse early-fire images across seasons, times, and weather conditions.
- (2)
Introduced an embedding layer to integrate seasonal (S) and temporal (T) information as additional channels, improving adaptability to environmental variations.
- (3)
Replaced standard convolutions in YOLOv8’s C2f module with dilated convolutions, enlarging receptive fields without parameter increase.
- (4)
Incorporated the convolutional block attention module (CBAM) in the Neck to enhance the focus on small targets and critical regions.
- (5)
Modified feature fusion by linking Backbone outputs (P2, P4, and SPPF) to the Neck, improving multi-scale detection.
These innovations collectively address challenges in small-target detection, generalization, and real-time processing, advancing forest fire monitoring systems.
2. Materials and Methods
2.1. Dataset Construction
To ensure the diversity and representativeness of the data, an initial dataset was established using early forest fire videos and images sourced from the internet, supplemented by datasets from the Corsican Fire Database [
34], the Dataset for Fire and Smoke detection (DFS) [
35], the FLAME dataset from Northern Arizona University [
36], and the VISFIRE dataset of Bilkent University. The initial dataset was collected from forest environments across multiple typical regions, including the Mediterranean, North American forests, and Asian forests. During the image selection process, fire scenarios under seasonal variations (spring, summer, autumn, and winter) and diurnal differences (day/night) were comprehensively considered, while also covering various weather conditions such as clear, cloudy, overcast, and light rain to simulate the complex meteorological environments encountered in real-world fire detection. The selected images include not only the characteristic features of early-stage fires (e.g., small flames) but also partially capture mid-to-late-stage fire conditions (e.g., large-scale burning). These fire images serve as positive samples, while non-fire images with similar scene characteristics, collected from the internet, are used as negative samples.
The initial dataset contains 2392 images comprising forest early-fire images and their corresponding negative samples. The image size in the dataset was uniformly resized to 480 × 480 pixels. The positive samples were labeled as “Yes”, while the negative samples were labeled as “No”. For positive sample images, LabelImg (
https://github.com/HumanSignal/labelImg (accessed on 8 January 2025)) was employed to annotate the corresponding suspected fire areas.
Subsequently, we selected three image data augmentation techniques to enhance the images in the initial dataset, resulting in the SRFFD. The three image augmentation techniques—affine transformation, HSV random enhancement, and copy-paste augmentation—generate diverse training image data by adjusting image angles and colors. The SRFFD comprises 64,584 samples, including 36,180 positive samples and 28,404 negative samples. The distribution of samples in the SRFFD is shown in
Table 1.
Representative images of positive and negative samples in the SRFFD are displayed in
Figure 1 and
Figure 2.
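To make the three augmentation operations concrete, the following is a minimal sketch using OpenCV and NumPy; the function names, parameter ranges, and the simple rectangular copy-paste are illustrative assumptions rather than the exact pipeline used to build the SRFFD.

```python
import cv2
import numpy as np

def random_affine(img, max_deg=15, max_scale=0.2, max_shift=0.1):
    """Affine transformation: random rotation, scaling, and translation."""
    h, w = img.shape[:2]
    angle = np.random.uniform(-max_deg, max_deg)
    scale = 1.0 + np.random.uniform(-max_scale, max_scale)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    M[:, 2] += np.random.uniform(-max_shift, max_shift, 2) * (w, h)
    return cv2.warpAffine(img, M, (w, h), borderMode=cv2.BORDER_REFLECT)

def random_hsv(img, h_gain=0.015, s_gain=0.7, v_gain=0.4):
    """HSV random enhancement: jitter hue, saturation, and value."""
    gains = 1.0 + np.random.uniform(-1, 1, 3) * (h_gain, s_gain, v_gain)
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 0] = (hsv[..., 0] * gains[0]) % 180          # hue wraps at 180 in OpenCV
    hsv[..., 1:] = np.clip(hsv[..., 1:] * gains[1:], 0, 255)
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)

def copy_paste(src_img, src_box, dst_img):
    """Copy-paste augmentation: paste a labeled fire patch onto another image."""
    x1, y1, x2, y2 = src_box
    patch = src_img[y1:y2, x1:x2]
    H, W = dst_img.shape[:2]
    px = np.random.randint(0, W - (x2 - x1))
    py = np.random.randint(0, H - (y2 - y1))
    out = dst_img.copy()
    out[py:py + (y2 - y1), px:px + (x2 - x1)] = patch
    return out, (px, py, px + (x2 - x1), py + (y2 - y1))  # new image and pasted box
```

Each augmented image inherits (and, for copy-paste, adds) the corresponding bounding-box annotations, so label files must be updated alongside the pixels.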
2.2. YOLOv8
You Only Look Once (YOLO) is an advanced real-time object detection algorithm renowned for its efficiency and accuracy. The primary strength of the YOLO lies in its speed, enabling real-time detection while maintaining high precision. The YOLOv8 was developed and released by the Ultralytics team in 2023 (
https://docs.ultralytics.com/zh/models/yolov8 (accessed on 2 December 2024)). The architecture of the YOLOv8 primarily consists of a Backbone, Neck, and Head, as illustrated in
Figure 3.
It can be observed that the C2fLayer, ConvModule, and Spatial Pyramid Pooling Fast (SPPF) are the main modules in the YOLOv8. The C2fLayer in the YOLOv8 is an enhanced feature extraction module that integrates the Cross-Stage Partial (CSP) structure and multi-branch convolution to improve the model’s feature extraction capability and computational efficiency. The SPPF module in the YOLOv8 is an improved spatial pyramid pooling structure designed to extract multi-scale features for object detection tasks. Compared to the traditional SPP module, the SPPF reduces computational overhead by sequentially stacking pooling layers (e.g., repeatedly applying 5 × 5 max pooling) while retaining the effectiveness of multi-scale feature extraction. This design significantly enhances computational efficiency.
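As an illustration of the sequential pooling idea, the following is a minimal PyTorch sketch of an SPPF-style block; the real Ultralytics module additionally wraps the 1 × 1 convolutions with batch normalization and SiLU activation, so this is a simplified approximation.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Spatial Pyramid Pooling - Fast: three stacked 5x5 max-pools whose outputs
    are concatenated, approximating SPP's parallel 5/9/13 pooling branches."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = nn.Conv2d(c_in, c_hidden, 1, 1)
        self.cv2 = nn.Conv2d(c_hidden * 4, c_out, 1, 1)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)    # effective 5x5 pooling window
        y2 = self.pool(y1)   # effective 9x9
        y3 = self.pool(y2)   # effective 13x13
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))
```

Reusing one pooling layer three times is what lets the SPPF match the multi-scale coverage of the original SPP at a lower computational cost.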
The YOLOv8 retains the efficiency characteristic of the YOLO series while achieving higher detection accuracy and speed through structural optimizations and technical enhancements. Notably, the YOLOv8 exhibits distinct advantages in lightweight design and small object detection [
37,
38,
39].
2.3. Dilated Convolution
Dilated convolution is a convolutional operation that expands the receptive field by introducing a dilation rate into the standard convolution kernel [
40].
The core concept involves enlarging the convolution kernel’s receptive field without additional parameters, thereby enabling the capture of broader contextual information. This is achieved by inserting “holes” (zeros) between the kernel weights in the standard convolution. The dilation rate serves as the key parameter in dilated convolution, controlling the degree of kernel expansion. Specifically, the dilation rate (
r) determines the spacing between the kernel weights. When
r = 1, the dilated convolution becomes equivalent to standard convolution. When
r > 1, (
r − 1) zeros are inserted between the kernel weights, effectively expanding the receptive field. For a given input feature map (
I) and kernel (
K), the output (
O) of the dilated convolution operation can be mathematically expressed by Equation (1) [40]:

$O(x, y) = \sum_{i} \sum_{j} I(x + r \cdot i,\; y + r \cdot j) \cdot K(i, j)$  (1)

where (x, y) indicates a position on the output feature map, (i, j) represents kernel indices, and r specifies the dilation rate.
The receptive field of dilated convolution is given by Equation (2) [40]:

$RF = k + (k - 1)(r - 1)$  (2)

where k is the kernel size and r denotes the dilation rate.
For a 3 × 3 kernel, when r = 1, the kernel covers a 3 × 3 region corresponding to a 3 × 3 receptive field. When r = 2, the kernel spans a 5 × 5 region while computing only 9 pixels, yielding a 5 × 5 receptive field. At r = 3, the kernel encompasses a 7 × 7 region with a 7 × 7 receptive field. Standard convolution performs dense sampling on the input feature map, whereas dilated convolution achieves sparse sampling through the insertion of holes, thereby covering an expanded area. This approach enlarges the receptive field without increasing the number of parameters, which facilitates the capture of broader contextual information. Furthermore, dilated convolution preserves the feature map resolution, making it particularly suitable for tasks requiring high-resolution outputs. While maintaining computational costs comparable to standard convolution, dilated convolution provides a significantly larger receptive field.
Dilated convolution expands the receptive field without increasing the number of parameters by introducing a dilation rate, while maintaining resolution and computational efficiency.
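The following minimal PyTorch sketch illustrates these points: with a 3 × 3 kernel, dilation rates of 1, 2, and 3 yield effective receptive fields of 3 × 3, 5 × 5, and 7 × 7 (Equation (2)) while the parameter count stays constant; the channel counts and padding choice are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Dilated 3x3 convolutions with r = 1, 2, 3: the parameter count stays constant
# while the effective receptive field grows as k + (k - 1)(r - 1).
x = torch.randn(1, 16, 480, 480)
for r in (1, 2, 3):
    conv = nn.Conv2d(16, 16, kernel_size=3, dilation=r, padding=r)  # padding=r keeps 480x480
    n_params = sum(p.numel() for p in conv.parameters())
    rf = 3 + (3 - 1) * (r - 1)  # Equation (2): 3, 5, 7
    print(f"r={r}: output {tuple(conv(x).shape)}, params {n_params}, receptive field {rf}x{rf}")
```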
2.4. Convolutional Block Attention Module (CBAM)
The CBAM [
41] is a simple yet effective attention module designed for feedforward convolutional neural networks. By incorporating an attention mechanism, the CBAM enhances the model’s representation capability. This module specifically improves the model’s focus on regions containing small targets. The CBAM comprises two sequentially connected sub-modules: a channel attention sub-module and a spatial attention sub-module.
Figure 4 illustrates the structure of the CBAM.
The input of the CBAM is the input feature map (F), with dimensions C × H × W, where C denotes the number of channels, H represents the height of the feature map, and W corresponds to the width of the feature map.
In the channel attention sub-module, Global Average Pooling (GAP) and Global Max Pooling (GMP) are first applied to the input feature map to generate two 1D vectors. For each channel, the average value across all spatial positions is computed, yielding a 1D vector (Favg) with dimensions C × 1 × 1. The calculation is expressed as follows (Equation (3)) [41]:

$F_{avg}^{c} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F(c, i, j)$  (3)

Here, c represents the channel index, while i and j denote spatial positions.
For each channel, the maximum value across all spatial positions is computed, resulting in a 1D vector (Fmax) with dimensions C × 1 × 1. The calculation is expressed as follows (Equation (4)) [41]:

$F_{max}^{c} = \max_{i, j} F(c, i, j)$  (4)
The pooling operations capture the global average and the most salient features of each channel, respectively. The resulting vectors, Favg and Fmax, are then processed by a shared two-layer MultiLayer Perceptron (MLP) to generate channel attention weights. The MLP outputs are combined through element-wise addition, followed by a sigmoid activation function to produce the channel attention weights Mc(F) with dimensions C × 1 × 1. Finally, the channel refined feature map F′ is obtained by multiplying the input feature map F with the attention weights Mc(F) in a channel-wise manner.
The CBAM spatial attention sub-module processes the channel-refined feature map through a series of operations to generate the spatially weighted feature map. First, the input feature map (F′) undergoes both average pooling and max pooling along the channel dimension, yielding two 2D feature maps (F′avg and F′max). These two feature maps are then concatenated along the channel dimension to form a composite 2D feature map. A convolutional operation is subsequently applied to this concatenated feature map, and the result is normalized with the sigmoid function to produce the spatial attention weights (Ms(F′)). Finally, these weights are multiplied element-wise with F′ across all spatial positions, resulting in the spatially weighted feature map (F″). This mechanism enhances salient spatial regions while suppressing less important ones, thereby refining the feature representation for downstream tasks.
The CBAM is an efficient attention mechanism that refines feature maps by combining channel and spatial attention along both dimensions. Being lightweight and highly versatile, it can be easily integrated into existing architectures, serving as a powerful tool to enhance the performance of CNNs in various computer vision tasks.
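The following is a compact PyTorch sketch of the CBAM described above (shared MLP on pooled channel descriptors, followed by a 7 × 7 convolution on channel-wise pooled maps); the reduction ratio and spatial kernel size are the commonly used defaults from the original paper, not values taken from this study.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention followed by spatial attention (Woo et al., 2018)."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # Shared two-layer MLP applied to the GAP and GMP channel descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        # 7x7 convolution over the concatenated channel-wise avg/max maps.
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2, bias=False)

    def forward(self, f):
        # Channel attention: Mc(F) from GAP and GMP (Equations (3) and (4)) + shared MLP.
        avg = self.mlp(f.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(f.amax(dim=(2, 3), keepdim=True))
        f = f * torch.sigmoid(avg + mx)            # F' = Mc(F) ⊗ F
        # Spatial attention: Ms(F') from channel-wise mean/max maps.
        s = torch.cat([f.mean(dim=1, keepdim=True), f.amax(dim=1, keepdim=True)], dim=1)
        return f * torch.sigmoid(self.spatial(s))  # F'' = Ms(F') ⊗ F'
```

Because the module preserves the input shape, it can be dropped in front of an existing layer, such as a C2fLayer in the Neck, without other architectural changes.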
2.5. Our Improved Model
Considering the unique characteristics of forest fire detection, we propose an identification model for suspicious regions of forest fire based on the YOLOv8, named SRoFF-Yolover.
The architecture of SRoFF-Yolover is illustrated in
Figure 5.
The SRoFF-Yolover takes 480 × 480 color images as input and outputs the bounding box coordinates along with classification results for suspicious regions of forest fire.
Through the analysis of images in the SRFFD, we observe that suspicious regions of forest fire exhibit distinct characteristics compared to the background across different seasons and time periods. To enhance the model’s adaptability to temporal variations and improve its generalization capability across seasons, we introduce an embedding layer that integrates seasonal and temporal information into the image representation.
The embedding layer is achieved by augmenting the existing three-channel color image with two additional channels dedicated to seasonal (
S) and temporal (
T) information. The detailed embedding methodology is formalized in Equation (5):

$EX = \mathrm{Concat}(X, S, T)$  (5)

where X is the input of the SRoFF-Yolover, a color image of dimensions 480 × 480 × 3; S represents a seasonal information matrix of dimensions 480 × 480 × 1; and T denotes a temporal information matrix of dimensions 480 × 480 × 1.
The matrices S and T are implemented as learnable parameters initialized with random values sampled from a normal distribution (mean = 0, std = 0.01). During training, these matrices are optimized via backpropagation alongside other model parameters. The seasonal matrix S is designed to capture recurring patterns (e.g., dry vs. wet seasons), while T encodes time-of-day variations (e.g., daylight vs. nighttime). To ensure meaningful learning, we constrain the values of S and T to the range [−1, 1] using a tanh activation function. This prevents extreme values from dominating the original RGB features.
The three matrices X, S, and T are concatenated along the last dimension to form a new matrix EX with dimensions 480 × 480 × 5. The first three channels of EX retain the RGB color information of the original image, the fourth channel encapsulates seasonal information, and the fifth channel encodes temporal information. The embedding layer therefore introduces seasonal and temporal information as prior knowledge into the SRoFF-Yolover.
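A minimal sketch of this embedding layer is given below, assuming a single learnable matrix each for S and T that is shared across images and constrained to [−1, 1] with tanh, as described above; a variant that selects different matrices according to an image's season or time-of-day label would follow the same concatenation step. (PyTorch is channel-first, so the concatenation is along the channel dimension.)

```python
import torch
import torch.nn as nn

class SeasonTimeEmbedding(nn.Module):
    """Append learnable seasonal (S) and temporal (T) channels to an RGB image,
    producing the 5-channel tensor EX = Concat(X, tanh(S), tanh(T))."""
    def __init__(self, height=480, width=480):
        super().__init__()
        # Learnable matrices initialized from N(0, 0.01), optimized by backpropagation.
        self.S = nn.Parameter(torch.randn(1, 1, height, width) * 0.01)
        self.T = nn.Parameter(torch.randn(1, 1, height, width) * 0.01)

    def forward(self, x):                              # x: (B, 3, 480, 480)
        b = x.shape[0]
        s = torch.tanh(self.S).expand(b, -1, -1, -1)   # constrain values to [-1, 1]
        t = torch.tanh(self.T).expand(b, -1, -1, -1)
        return torch.cat([x, s, t], dim=1)             # EX: (B, 5, 480, 480)
```

The first convolution of the Backbone must then accept five input channels instead of three.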
To improve the SRoFF-Yolover’s accuracy in detecting small targets while reducing its complexity, we replaced the C2fLayer in the Backbone with the DCC2fLayer. The DCC2fLayer is created by substituting the standard convolution in the C2fLayer with the dilated convolution. The standard convolution kernels (e.g., 3 × 3) employed in the C2fLayer possess a relatively limited receptive field. Moreover, these conventional convolutions incur high computational costs, particularly when processing high-resolution inputs, which may consequently impact the model’s inference speed. In contrast, dilated convolution can capture multi-scale contextual information while reducing computational complexity and enhancing the detection capability for small targets. Therefore, the DCC2fLayer is used to improve small target detection performance. The hybrid strategy is employed to adjust the dilation rate, ensuring that the sampling points of the convolutional kernel cover continuous spatial regions and thereby avoiding feature discontinuity caused by a fixed large dilation rate. Experimental results demonstrate that setting the dilation rate (r) to 2 in the DCC2fLayer achieves an optimal balance between preserving local details and moderately expanding the receptive field.
To improve multi-scale target adaptation, we enhanced the Backbone-to-Neck linkage. The feature map at the P2 level exhibits a higher resolution, enabling it to capture more detailed information and exhibit greater sensitivity to small targets, particularly in the recognition of small objects within complex backgrounds. The SPPF module integrates multi-scale feature information, and its connection further enhances the model’s capability for multi-scale target detection. Therefore, in the SRoFF-Yolover, the linkage between the Backbone and Neck is implemented using the P2, P4, and SPPF outputs from the Backbone.
By integrating the CBAM before the C2fLayer in the YOLOv8’s Neck, we aim to enhance the SRoFF-Yolover’s capacity to focus on crucial channels and spatial regions, which is particularly advantageous for small object detection. Small targets typically occupy only a minor proportion of an image. The primary reason for the low accuracy in small target detection lies in the fact that convolutional networks processing images with extensive background areas tend to accumulate substantial redundant information. This adaptation enables the SRoFF-Yolover to preferentially emphasize and amplify relevant features for small targets during both feature extraction and fusion stages.
2.6. Experimental Settings
The SRFFD was split into training, validation, and test sets at an 8:1:1 ratio. Further specifics regarding the SRFFD are provided in
Table 2. The training set was used for model training, the validation set for hyperparameter tuning, and the test set for evaluating model performance. The experimental conditions in this paper are shown in
Table 3. The hyperparameters were initially set based on theoretical principles and then fine-tuned experimentally. The hyperparameter configurations are detailed in
Table 4.
2.7. Evaluation Metrics
The models were assessed based on two criteria: complexity and predictive performance. The model complexity encompasses spatial complexity, temporal complexity, and model forward inference time. The metrics used to evaluate the model’s complexity properties include the following: (1) #Params (K): The total number of model parameters, which is positively correlated with the model’s spatial complexity. (2) Giga Floating-Point Operations (GFLOPs): A metric for evaluating the computational complexity of deep learning models. It represents the number of floating-point operations required for a single forward pass through the model. GFLOPs provide valuable insight into the computational demands of a model, facilitating performance comparison and optimization. For models with similar architectures, a lower GFLOP count generally indicates reduced computational cost and faster inference on the same hardware. (3) Frames Per Second (FPS): A critical metric for evaluating the inference speed of object detection models, FPS represents the number of image frames a model can process per second. Higher FPS values indicate better real-time performance.
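As a practical reference, the following sketch shows how #Params and FPS can be measured for a PyTorch detection model; GFLOPs are typically obtained with a profiling tool such as thop or fvcore. The warm-up and run counts and the assumed single 480 × 480 input are illustrative choices, matching per-frame inference.

```python
import time
import torch

def count_params(model):
    """#Params: total number of learnable parameters."""
    return sum(p.numel() for p in model.parameters())

@torch.no_grad()
def measure_fps(model, img_size=480, n_warmup=20, n_runs=200, device="cuda"):
    """FPS: single-image forward passes per second on the given device."""
    model = model.to(device).eval()
    x = torch.randn(1, 3, img_size, img_size, device=device)
    sync = torch.cuda.synchronize if device == "cuda" else (lambda: None)
    for _ in range(n_warmup):      # warm-up to exclude one-off initialization costs
        model(x)
    sync()
    start = time.perf_counter()
    for _ in range(n_runs):
        model(x)
    sync()                         # wait for queued GPU work before stopping the timer
    return n_runs / (time.perf_counter() - start)
```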
In this study, we evaluate the model’s predictive performance using the PASCAL VOC assessment criteria, a widely recognized standard in object detection research. The PASCAL VOC evaluation metrics are quantified by the mean average precision (
mAP) [
42] at an Intersection over Union (IoU) threshold of 0.5, denoted as
mAP@0.5. The calculation of
mAP@0.5 requires the determination of precision and recall values, which are computed using the following formulas [42]:

$Precision = \frac{TP}{TP + FP}$,  $Recall = \frac{TP}{TP + FN}$

$AP = \int_{0}^{1} P(r)\,dr$,  $mAP = \frac{1}{n} \sum_{i=1}^{n} AP_{i}$

In these formulas, TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively. P(r) denotes precision at a specific recall value r, representing a point on the precision-recall curve where precision is plotted on the y-axis and recall on the x-axis. Here, n indicates the total number of categories. Average Precision (AP) is defined as the area under the precision-recall curve, and mAP represents the mean of AP values across all categories.
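For reference, a minimal NumPy sketch of the all-point interpolation used in PASCAL VOC-style AP computation is given below; the recall and precision arrays are assumed to have been accumulated from detections matched to ground truth at IoU ≥ 0.5 and sorted by confidence.

```python
import numpy as np

def average_precision(recall, precision):
    """AP as the area under the precision-recall curve (all-point interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Make precision monotonically decreasing, then integrate over recall steps.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    """mAP: mean of the per-class AP values."""
    return float(np.mean(ap_per_class))
```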
3. Results
3.1. Ablation Experiments
The experimental procedure for each ablation experiment is conducted as follows. First, we configure the training hyperparameters according to
Table 4 and train the suspicious regions of forest fire detection models using the training set. Subsequently, we evaluate the trained models on the test dataset to assess key performance metrics, including
mAP@0.5, FPS, number of parameters (#Params), and GFLOPs.
The ablation experiment results are summarized in
Table 5.
To systematically evaluate the efficacy of our proposed enhancements, we implemented four consecutive modifications (Experiments 2–5) to the baseline model (YOLOv8s). Experiment 2 replaced the C2fLayer in the backbone with the DCC2fLayer for achieving lightweight; Experiment 3 incorporated a CBAM attention module before the C2fLayer in the Neck to enhance accuracy; Experiment 4 modified the Backbone-Neck linkage via P2, P4, and SPPF to improve detection performance; and Experiment 5 added an embedding layer to the YOLOv8s to boost accuracy. The results from Experiments 2 to 5 demonstrate the specific contributions of each module. Experiment 2 shows that the lightweight module (DCC2f) reduces parameters by 9.5% (6.3 MB→5.7 MB), decreases GFLOPs by 9.0% (7.8→7.1), and improves mAP@0.5 by 4.4% (0.764→0.798). This indicates that the DCC2f effectively reduces computation but has a limited impact on accuracy improvement. The CBAM (Experiment 3) achieves the highest single-module mAP improvement (+10.5%, 0.764→0.844), though with increased parameters and computation. The CBAM significantly enhances small-target detection accuracy. The multi-scale feature reconstruction through P2/P4/SPPF optimization (Experiment 4) increases mAP@0.5 by 5.8% and FPS by 4%, validating the effectiveness of multi-scale feature fusion. The spatiotemporal embedding layer designed for seasonal/diurnal variations (Experiment 5) improves mAP@0.5 by 7.6% (0.764→0.822), enhancing detection robustness in winter nighttime scenarios. These results demonstrate that all four improvements can enhance the suspicious regions of the forest fire detection model in terms of accuracy, inference speed, and model lightweighting to varying degrees.
In Experiments 6–8, we progressively integrated the aforementioned four improvements. The results demonstrate that the detection accuracy (mAP@0.5) shows a continuous improvement, as all four modifications contribute to enhancing detection performance to varying degrees. The DCC2f + CBAM combination (Experiment 6) achieves 12.8% mAP@0.5 improvement (0.764→0.862), exceeding the individual module gains, demonstrating synergistic effects. The model in Experiment 7 achieves the lowest parameter count and the fastest inference speed. The complete model (Experiment 8) maintains real-time performance (113 FPS) while achieving 0.902 mAP@0.5, an 18.1% improvement over baseline, with fewer parameters (6.1 MB) than the original YOLOv8s (6.3 MB). These results confirm that the SRoFF-Yolover delivers superior comprehensive detection performance in forest fire detection.
The curves depicting the variation in
mAP@0.5 with the number of iterations for different structural improvements are presented in
Figure 6.
Figure 6 shows the trend of
mAP@0.5 for different models during the training process. The models involved are the YOLOv8s (Experiment 1), the model equipped with the DCC2f (Experiment 2), the model integrated with the CBAM (Experiment 3), the model with the modified Backbone-Neck linkage (Experiment 4), and the model with the embedding layer (Experiment 5). The figure shows that, as the number of training rounds increases, the mAP@0.5 value of each model rises, with different growth rates at different stages. The models from Experiments 2 to 5 all achieve improvements in mAP@0.5. Among them, the CBAM-based model in Experiment 3 performs best, with the largest gain in mAP@0.5. This result shows that the attention mechanism introduced by the CBAM effectively guides the model to learn key features, thereby substantially improving its object detection performance. After roughly 300 training rounds, the mAP@0.5 values of all models stabilize, indicating that the models have essentially converged and the training process has stabilized.
The trained SRoFF-Yolover was employed to identify the suspicious regions of forest fire in the test set. Some representative results are shown in
Figure 7.
Nighttime detection (
Figure 7e–h) generally outperforms daytime detection (
Figure 7a–d), likely because flames are more distinct against dark backgrounds at night. Winter scenes (
Figure 7d,h) show the lowest false detection rates, possibly due to the sparse vegetation reducing interference. In contrast, summer scenes (
Figure 7b,f) exhibit lower detection accuracy, likely because of noise from high temperatures.
3.2. The Comparative Experiments
Considering that suspicious regions of forest fire typically consist of scattered small targets with significant seasonal variations, we selected the YOLO series (v5/v8s/v10) and the Faster R-CNN [
43] as the comparative models to validate the applicability of our proposed model. As a classic two-stage detection model, the Faster R-CNN preserves more detailed information through its Region Proposal Network (RPN), while its improved RoIAlign method effectively mitigates feature misalignment issues, making it particularly effective for small target detection. However, it suffers from a slower inference speed and higher computational costs. In contrast, the YOLO series, as single-stage detection models, demonstrate faster inference speeds. The YOLOv5 achieves multi-scale feature fusion through its FPN + PAN architecture, providing certain small target detection capabilities. YOLOv8s enhances feature propagation for small targets by optimizing the FPN structure and cross-stage connections. The newly released YOLOv10 [
44] incorporates a lightweight Transformer module to explore its performance in detecting scattered small targets.
To ensure a fair comparison, all competing models (YOLOv5, YOLOv8s, and YOLOv10) were fine-tuned on the SRFFD under identical training protocols, including data augmentation, optimizer settings (AdamW), and learning rate schedules. Default anchor boxes and hyperparameters were adjusted to match the scale of forest fire detection tasks, and all models were trained until convergence. The comparative experiment results are shown in
Table 6.
Our proposed model (SRoFF-Yolover) achieved the highest mAP@0.5 (0.902) among all the compared models, while the YOLOv10 demonstrated the fastest inference speed (128 FPS).
Compared with the Faster R-CNN, the SRoFF-Yolover exhibited a 9.9% improvement in mAP@0.5 (0.902 vs. 0.821) and a 4.2× acceleration in inference speed (113 FPS vs. 27 FPS), highlighting the advantages of single-stage detectors. When evaluated against the YOLO series, the SRoFF-Yolover maintained a comparable model size (6.1 MB vs. YOLOv10’s 5.6 MB) while outperforming the YOLOv10 by 3.3% in mAP@0.5 (0.902 vs. 0.873), demonstrating the efficacy of its customized design.
Furthermore, we randomly selected 10 images from four typical scenarios in the test set: winter night, autumn night, summer day, and spring day. Each of the five models was used for prediction on these 10 images. The average prediction results (
mAP@0.5) are recorded in
Table 7.
The results show that the SRoFF-Yolover maintains comprehensive superiority (average mAP@0.5: 0.872), while the YOLOv5 performs the worst. The best-performing scenario is the winter night, whereas the most challenging scenario is the summer day. Further analysis indicates that winter achieves the highest detection performance due to simpler backgrounds, while the summer day suffers from decreased detection accuracy due to increased flame occlusion caused by dense vegetation.
Although the comparative models—including the YOLOv5, YOLOv8s, YOLOv10, and Faster R-CNN—achieved relatively high accuracy in target detection, they are primarily designed for general-purpose object detection tasks and lack specialized modeling for the unique characteristics of early-stage forest fires. In contrast, the SRoFF-Yolover is not merely an accuracy-oriented improvement over the YOLOv8 but a task-specific innovation that directly addresses key challenges in forest fire detection, such as early occurrence, small and scattered targets, and significant seasonal variability. First, the SRoFF-Yolover is tailored for early-stage forest fire detection. It incorporates a novel embedding layer to integrate seasonal and temporal information into image features, significantly enhancing detection robustness across different seasons and time periods—an aspect overlooked by most baseline models. Second, the model applies dilated convolutions in the backbone network to expand the receptive field, which strengthens its capability to detect small, dispersed fire sources—common in early fire stages—and compensates for the limitations of models like the YOLOv5 and the YOLOv8s in this regard. Third, a CBAM attention module is integrated before the C2fLayer in the neck, which increases the model’s sensitivity to critical fire regions, and effectively reduces false positives caused by environmental noise—this mechanism is not included in the baseline models. Finally, the SRoFF-Yolover features a redesigned backbone-to-neck connection using P2, P4, and SPPF layers, optimizing the transmission and fusion of multiscale features while maintaining high computational efficiency. In summary, the SRoFF-Yolover demonstrates not only competitive mAP performance but also significant structural innovations and task adaptability, offering a more practical and effective solution for early forest fire detection in real-world scenarios.
4. Discussion
We conducted a systematic evaluation of various image enhancement techniques, with a particular focus on analyzing their impact on the performance of both the YOLOv8s and the SRoFF-Yolover. Through comparative studies, we identified the most suitable dataset augmentation methods for improving suspicious regions of forest fire detection capabilities, thereby providing a foundation for subsequent data augmentation efforts. Furthermore, we thoroughly investigated the SRoFF-Yolover’s detection performance for suspicious regions of forest fire across different seasons to assess their seasonal adaptability.
4.1. The Effects of Data Augmentation Techniques on Model Detection Performance
Applying various forms of image augmentation to the dataset can increase the quantity and diversity of sample data, effectively mitigate the issue of model overfitting, and improve the model’s generalization ability. We employed three data augmentation techniques—affine transformation, HSV random enhancement, and copy-paste augmentation—to augment the dataset and form the SRFFD. The experiments were conducted to validate the impact of various data augmentation techniques on YOLOv8s and SRoFF-Yolover detection performance, respectively. Based on the initial image dataset (containing 2392 images), we designed the following experimental scheme. In Experiments 1–3, three individual data augmentation methods were separately applied to expand the initial dataset by six times (resulting in 14,352 images); in Experiments 4–6, pairwise combinations of the three augmentation methods were used to similarly expand the initial dataset by six times; in Experiment 7, all three augmentation methods were simultaneously applied to expand the initial dataset by six times. Using the aforementioned dataset, we trained the YOLOv8s and the SRoFF-Yolover according to
Table 3 and
Table 4, respectively. The
mAP@0.5 obtained by testing the trained model on the test set is listed in
Table 8.
In Table 8, “Yes” indicates that the corresponding image enhancement technique was applied, while “No” indicates that it was not.
The experimental results demonstrate that data augmentation techniques play a critical role in improving model performance, particularly for complex tasks like forest fire detection. Among individual augmentation methods, HSV random enhancement performs the best, indicating that color features are central to forest fire detection. Flames (e.g., red, orange, and yellow) exhibit distinct separability in the HSV space. Random adjustments to hue (H), saturation (S), and value (V) simulate flame appearances under varying lighting conditions, enhancing the model’s robustness to color variations. Additionally, the diversity of forest backgrounds (e.g., green leaves, dry wood) may be better represented through HSV augmentation, reducing false detections. Copy-paste augmentation, while increasing sample quantity, is slightly less effective than HSV enhancement. This may be because simple copy-paste operations fail to fully simulate realistic interactions between flames and backgrounds. Affine transformations (e.g., rotation, scaling, and translation) show the weakest performance when used alone, likely because flame shapes and orientations follow certain natural patterns, and excessive transformations may disrupt these patterns.
Experimental results demonstrate that combining multiple augmentation techniques maximizes model performance. Specifically, the “HSV + copy-paste + affine” combination achieves the highest mAP@0.5 (0.902) on the SRoFF-Yolover. This suggests a synergistic effect among different augmentation strategies: color enhancement enriches flame appearance diversity, copy-paste increases sample quantity, and affine transformations improve adaptability to geometric variations. This combined strategy provides a valuable reference for data augmentation in complex scenarios.
4.2. Results of Suspicious Regions of Forest Fire Detection
Generally, forest fires occur randomly. To validate the adaptability of the SRoFF-Yolover across different seasons and times of day, a series of experiments were conducted. As presented in
Table 9, the SRoFF-Yolover demonstrates robust performance in detecting suspicious regions of forest fire under varying temporal conditions.
Seasonal and temporal adaptability in forest fire detection is a critical issue in practical applications. Experimental results demonstrate that the SRoFF-Yolover maintains stable performance across different times and seasons, highlighting its adaptability.
First, the SRoFF-Yolover exhibits significant day-night performance differences, with consistently better nighttime detection. This can be attributed to the following: (1) In low-light conditions, flames become prominent light sources with higher contrast against backgrounds, significantly improving detection. (2) Reduced nighttime interference factors (e.g., sunlight reflection, cloud cover) effectively lower false alarm rates.
Second, the SRoFF-Yolover achieves optimal performance in winter, particularly at night (mAP@0.5: 0.951). Our analysis reveals key influencing factors: (1) Environmental factors: Sparse winter vegetation significantly reduces background complexity, making flame features easier to identify. In contrast, autumn shows relatively lower performance (daytime mAP@0.5: 0.836), mainly due to increased background complexity from fallen leaves and dry vegetation. (2) Spectral characteristics: Winter flames exhibit stable H-value distribution (concentrated in the red channel) in the HSV color space, while summer flames show multi-channel dispersion. This difference directly affects detection difficulty, confirming that our HSV-based data augmentation effectively improves winter detection performance. (3) Combustion properties: Winter flames maintain more stable burning states with lower flickering frequencies due to colder temperatures, facilitating feature capture. Conversely, summer flames display more dynamic variations under high temperatures, potentially increasing miss rates. (4) Data balance: The seasonally and diurnally balanced dataset ensures the temporal adaptability of the model. (5) Model architecture: The SRoFF-Yolover’s multi-scale feature extraction effectively captures flame features at various distances and sizes, significantly enhancing the model’s adaptability to lighting variations, particularly its temporal adaptability.
Based on the flame pixel area in the test set, we classified the samples into three groups: (1) small targets (<32 × 32 pixels, 68% of samples, average mAP@0.5 = 0.894), (2) medium targets (32 × 32 to 64 × 64 pixels, 23% of samples, average mAP@0.5 = 0.899), and (3) large targets (>64 × 64 pixels, 9% of samples, average mAP@0.5 = 0.911).
The results show that the CBAM performs similarly for small and medium targets but achieves better detection for large targets. This suggests the CBAM’s ability to adapt to different target scales, which can be explained by two factors: (1) channel attention evaluates feature importance independently of absolute scale, and (2) spatial attention uses dual-path pooling to capture multi-scale features.
Meanwhile, we conducted a comparison between our work and other related studies. Both the SRoFF-Yolover and the FEDS-YOLOv11n [
45] optimize the detection of obscured or concealed fire sources using attention mechanisms, but their focuses differ. The SRoFF-Yolover performs exceptionally well in severely obscured scenarios, such as winter nights. It employs the CBAM, which is more suitable for low-visibility conditions (e.g., nighttime). The FEDS-YOLOv11n adopts the SEAM attention module, effectively improving the detection capability for obscured targets while maintaining a lightweight design. It achieves a balance between accuracy and efficiency, making it more suitable for real-time detection. The YOLOGX [
46] employs a GD mechanism and Focal-SIoU to address environmental sensitivity, achieving better performance in complex fire recognition scenarios. The SRoFF-Yolover is specifically optimized for early fire detection (small flames, occluded fires), demonstrating superior performance for winter nighttime scenarios compared to daytime detection. This improvement stems from its seasonal embedding layer, which enhances temporal sensitivity. While the YOLOGX exhibits greater innovation in its loss function and feature fusion mechanism, the SRoFF-Yolover provides more refined data augmentation and temporal information embedding.
This study systematically analyzes the impact of data augmentation and temporal adaptability on forest fire detection. Combining color and geometric augmentation maximizes model performance, while nighttime and winter conditions favor detection. These findings provide a scientific basis for optimizing forest fire monitoring systems and a reference for similar detection tasks.
5. Conclusions
We conducted research addressing the key challenges in suspicious regions of forest fire detection, including small-target detection capability, real-time performance, and generalization ability (particularly seasonal adaptability). The study was carried out from two perspectives: dataset construction and model optimization. We developed the SRFFD to address the lack of diverse early-stage fire imagery. Additionally, we proposed the SRoFF-Yolover, an enhanced YOLOv8-based model for detecting suspicious regions of forest fire, aiming to achieve both accurate detection and a lightweight design.
The initial dataset comprises early-stage forest fire videos and images gathered from online sources, supplemented by several additional public datasets. The SRFFD is generated by augmenting and expanding the initial dataset through three data augmentation techniques: affine transformations, HSV random enhancement, and copy-paste augmentation. The SRoFF-Yolover, based on the YOLOv8, achieves performance improvements through four key modifications: the incorporation of an embedding layer, Backbone modification, modification to the linkage between the Backbone and Neck, and Neck modification. Compared to the baseline model (the YOLOv8s), the SRoFF-Yolover achieves an 18.1% improvement in mAP@0.5, a 4.6% increase in FPS, a 2.6% reduction in GFLOPs, and a 3.2% decrease in #Params. For the same season, the SRoFF-Yolover exhibits a higher detection rate for suspected forest fire areas at night compared to daytime. Across different seasons, the model achieves the highest detection accuracy in winter and the lowest in autumn.
This study still has several limitations. First, the lack of dataset diversity restricts further improvements in model performance, particularly due to insufficient image samples under extreme lighting conditions (e.g., intense midday sunlight in the summer) and adverse weather (e.g., heavy rain, dense fog, and other low-visibility scenarios). Additionally, negative samples from easily confusable scenes (e.g., sunset vs. fire, welding sparks vs. flames) are underrepresented. Second, the SRoFF-Yolover’s computational complexity remains high, making efficient deployment on resource-constrained edge devices (e.g., mobile chips or embedded systems) challenging.
To address these issues, future work could explore generative AI techniques, such as Generative Adversarial Networks (GANs) or diffusion models, to synthesize diverse training samples and expand dataset coverage. Furthermore, model compression techniques—including neural network pruning, quantization, and knowledge distillation—could be employed to optimize the model architecture, reduce computational overhead, and enhance compatibility with edge computing devices. This study is conducted based on visible light images, where the model exhibits certain levels of missed detections and false alarms in summer day scenarios. The future incorporation of thermal infrared images for multimodal fusion may potentially improve detection accuracy under such conditions.