An Efficient and Lightweight Detection Model for Forest Smoke Recognition

Abstract: Massive wildfires have become more frequent, seriously threatening the Earth's ecosystems and human societies. Recognizing smoke from forest fires is critical to extinguishing them at an early stage. However, edge devices have limited computational capacity and suboptimal real-time performance, which constrains model inference and deployment. In this paper, we establish a forest smoke database and propose an efficient and lightweight forest smoke detection model based on YOLOv8. Firstly, to improve the feature fusion capability in forest smoke detection, we fuse a simple yet efficient weighted feature fusion network into the neck of YOLOv8, which also greatly reduces the number of parameters and the computational load of the model. Then, the simple and parameter-free attention module (SimAM) is introduced to address the complex backgrounds and environmental disturbances that images in forest smoke datasets may contain; it improves the detection accuracy of the model without introducing additional parameters. Finally, we introduce focal modulation to increase the attention paid to hard-to-detect smoke and to improve the running speed of the model. The experimental results show that the mean average precision of the improved model is 90.1%, which is 3% higher than that of the original model. The number of parameters and the computational complexity of the model are 7.79 MB and 25.6 GFLOPs (giga floating-point operations), respectively, which are 30.07% and 10.49% less than those of the unimproved YOLOv8s. This model significantly outperforms other mainstream models on the self-built forest smoke detection dataset and also has great potential in practical application scenarios.


Introduction
Forests provide important ecosystem services, such as climate regulation, the recharge of aquifers, food production, and greenhouse gas sequestration [1]. However, large-scale forest fires have occurred frequently in recent years, causing serious damage to the Earth's ecosystems and human society. In Australia, Russia, the United States, Canada, and other countries, the burnt area of mega forest fires in recent years has reached tens of thousands or even hundreds of thousands of square kilometers, with serious impacts on local ecosystems and the personal safety of local residents. According to the statistics of the Fire and Rescue Bureau of China's Ministry of Emergency Management, 709 forest fires occurred nationwide in 2022, affecting a forest area of about 4689.5 hectares [2]. Wildfires occur primarily in wilderness areas with high levels of flammable materials. In summer and fall in particular, when the temperature rises and the water content of the forest is low, the likelihood of a fire increases. Additionally, when the wind is strong, fires spread faster. At the same time, forests are open and oxygen-rich, making it easier for fires to start and spread. Once a fire has spread, it is difficult to control due to changes in terrain, which can pose serious risks to personal safety and cause economic losses. If a fire occurs on a large scale, it will cause incalculable casualties and economic losses. Therefore, there is an urgent need for a fast and effective fire detection solution.
• We propose an efficient and lightweight forest smoke detection model to successfully detect and localize smoke in the early stages of forest fire spread. Edge devices with limited computing resources and memory can achieve real-time object detection. New ideas and insights are provided for forest fire detection.
• We introduce a BiFPN module in the neck of YOLOv8. This enables the network to aggregate features at different resolutions by incorporating learnable weights that learn the importance of different input features, thus improving the performance of forest smoke detection.
• The SimAM is introduced into the YOLOv8 model to effectively suppress the interference of redundant information in the network by enhancing the attention of the down-sampling unit and the basic unit in the convolutional neural network. To expand the receptive field, the SPPF is replaced with focal modulation, which learns coarse-grained spatial information and fine-grained feature information to improve the performance of the network.
• The proposed detection model significantly outperforms other methods on the self-built forest smoke detection dataset. Compared with the original YOLOv8s, the mean average precision of forest smoke detection is improved by 3%. The number of parameters and the computational complexity of the model are reduced by 30.07% and 10.49%, respectively.

Forest Smoke Detection Based on YOLOv8 Architecture
YOLOv8 [20] was published in January 2023 by Ultralytics, the company that developed YOLOv5. YOLOv8 is designed as a fast, single-stage object detection algorithm and is provided in five scaled versions: YOLOv8n (nano), YOLOv8s (small), YOLOv8m (medium), YOLOv8l (large), and YOLOv8x (extra large). It supports image classification, object detection, and instance segmentation tasks and runs on a variety of hardware platforms, from the Central Processing Unit (CPU) to the Graphics Processing Unit (GPU). The network structure of YOLOv8 consists of four parts: input, backbone, neck, and head. Input: This mainly consists of mosaic data augmentation, adaptive anchor frame calculation, and adaptive image scaling. Backbone: The C3 structure of YOLOv5 is replaced by the C2f structure, which has richer gradient flow, and the number of channels is adjusted for models of different scales to achieve further lightweighting. At the same time, the SPPF [21] module from YOLOv5 is used to fine-tune the model for different scales, which greatly improves the model's performance. Neck: YOLOv8 builds on the idea of PANet and introduces a PAN [22] and FPN [23] architecture in its neck component to enhance the feature fusion capability of the model. This architecture uses top-down upsampling and bottom-up downsampling to fuse strong semantic information with strong localization information, which helps improve the accuracy and robustness of object detection. Head: The original coupled head is replaced by the now-mainstream decoupled head structure, which separates the classification and detection heads, and the anchor-based approach is replaced by an anchor-free approach. Since the coupling of the detection head affects the performance of the model, replacing the detection head of YOLO with a decoupled head significantly improves the convergence speed of the model. In addition, loss computation includes two branches: classification and regression. The bounding box regression prediction task employs complete intersection over union (CIoU) and distribution focal loss (DFL) [24], while the classification task uses binary cross-entropy loss (BCE Loss). This design choice helps to improve recognition accuracy and accelerate model convergence. The improved YOLOv8s architecture is shown in Figure 1.
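As a concrete illustration of the regression branch, the CIoU term combines box overlap, normalized center distance, and aspect-ratio consistency. The following is a minimal sketch with our own variable names (boxes given as (x1, y1, x2, y2) corners), not the exact YOLOv8 implementation:

```python
import math

def ciou(box1, box2, eps=1e-9):
    """Complete IoU between two axis-aligned boxes (x1, y1, x2, y2).

    Returns a value in (-1, 1]; 1.0 means a perfect match.
    """
    x1, y1, x2, y2 = box1
    X1, Y1, X2, Y2 = box2

    # Intersection and union areas.
    iw = max(0.0, min(x2, X2) - max(x1, X1))
    ih = max(0.0, min(y2, Y2) - max(y1, Y1))
    inter = iw * ih
    union = (x2 - x1) * (y2 - y1) + (X2 - X1) * (Y2 - Y1) - inter
    iou = inter / (union + eps)

    # Squared center distance, normalized by the squared diagonal
    # of the smallest enclosing box.
    cw = max(x2, X2) - min(x1, X1)
    ch = max(y2, Y2) - min(y1, Y1)
    c2 = cw ** 2 + ch ** 2 + eps
    rho2 = ((x1 + x2 - X1 - X2) ** 2 + (y1 + y2 - Y1 - Y2) ** 2) / 4

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (
        math.atan((X2 - X1) / (Y2 - Y1 + eps))
        - math.atan((x2 - x1) / (y2 - y1 + eps))
    ) ** 2
    alpha = v / (1 - iou + v + eps)
    return iou - rho2 / c2 - alpha * v
```

The corresponding loss is simply 1 − CIoU, so identical boxes give zero loss while distant boxes are penalized by the center-distance term even when the IoU is zero.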

Bidirectional Feature Pyramid Network (BiFPN)
FPN [23] is a top-down approach. It upsamples the semantically stronger feature maps at the higher pyramid levels and fuses them with the coarse location information at the lower levels to produce higher-resolution features. PAN [22] complements the FPN by downsampling from the bottom up so that the top-level features also contain image location information. Because the FPN uses summation for feature fusion, some detail is lost in the fusion process; therefore, it may not perform as well as the PAN in scenes that require high-precision detection. The PAN uses a cascading approach to feature fusion, which preserves more detail but increases computational complexity.
The bidirectional feature pyramid network (BiFPN) was proposed by Tan et al. in 2020 [14]. Based on the PANet, the BiFPN first removes nodes that have only a single input edge and thus contribute little to feature fusion. Then, for nodes whose inputs and outputs are located in the same layer, an extra edge is added to fuse more features without increasing the cost. The BiFPN effectively assigns a different weight to each layer for fusion, allowing the network to pay more attention to the important layers, and it also removes some unnecessary node connections. In this way, it can better balance feature information at different scales and improve the detection accuracy of small targets. The BiFPN architecture is shown in Figure 2.
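The per-node weighted fusion described above can be sketched as the fast normalized fusion used in BiFPN: learnable scalar weights are kept non-negative with a ReLU and normalized by their sum rather than a softmax. A minimal numpy sketch (function name and shapes are our own):

```python
import numpy as np

def fast_normalized_fusion(features, weights, eps=1e-4):
    """Weighted fusion of same-shape feature maps, BiFPN-style.

    features: list of arrays with identical shape (already resized)
    weights:  one learnable scalar per input feature
    """
    w = np.maximum(weights, 0.0)   # ReLU keeps fusion weights non-negative
    w = w / (w.sum() + eps)        # fast normalization instead of softmax
    return sum(wi * f for wi, f in zip(w, features))
```

With equal weights the node reduces to a plain average of its inputs; during training, the weights learn to emphasize the more informative resolution.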



Focal Modulation
Focal modulation was presented by Yang et al. [25] in 2022. It incorporates a multilevel feature fusion mechanism in the module to learn both coarse-grained spatial information and fine-grained feature information, improving the performance of the network. Focal modulation can adapt to the object size to increase the focus on hard-to-detect targets, thus improving the detection accuracy of the model.
Self-attention is formulated as Equation (1):
y_i = M_1(T_1(x_i, X), X), (1)
where the aggregation M_1 over the context X is performed after the attention scores between the query and its targets are computed through the interaction process T_1. In contrast, focal modulation is formulated as Equation (2):
y_i = T_2(M_2(i, X), x_i), (2)
where the context features are first aggregated using M_2 at each location i, and then the query interacts with the aggregated feature through T_2 to form y_i. Focal modulation is defined with Equation (3):
y_i = q(x_i) ⊙ m(i, X), (3)
where q denotes the query mapping function; ⊙ indicates element-wise multiplication; and m is the context aggregation operation, which consists of two steps: hierarchical semantics in Equation (4) and gated aggregation in Equation (5).
In Equation (4), f_a^l is the context function of the l-th layer, generated by depth-wise convolution with kernel size k_l and the GeLU activation function. Hierarchical semantics extracts context information from local to global ranges through different levels of granularity. In Equation (5), G_l ∈ R^(H×W×1) is a slice of G for level l. Specifically, we use a linear layer to obtain spatial- and level-aware gating weights: G = f_g(X) ∈ R^(H×W×(L+1)). Then, we perform a weighted sum through an element-wise multiplication to obtain a single feature map Z_out, which has the same size as the input X. Gated aggregation condenses context features at different levels of granularity into a single feature vector, the modulator. Combining the previous interaction and aggregation, the focal modulation formula can be expressed as Equation (6), where g_i^l and Z_i^l are the gating values and visual features at position i of G_l and Z_l, respectively. Focal modulation and its detailed explanation can be found in Figure 3.
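The gated aggregation step of Equation (5) can be illustrated with a simplified numpy sketch; the shapes and the channel-last layout here are our own assumptions, not the authors' implementation:

```python
import numpy as np

def gated_aggregation(Z, G):
    """Condense L+1 context maps into a single modulator map.

    Z: (L+1, H, W, C) context features at increasing granularity
    G: (L+1, H, W, 1) spatial- and level-aware gating weights
    Returns Z_out of shape (H, W, C), the same size as one level.
    """
    # Element-wise gate each level, then sum over the level axis.
    return (G * Z).sum(axis=0)
```

Each spatial location thus receives its own mixture of local-to-global context, which is what lets the modulator emphasize coarse or fine information per position.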


Simple and Parameter-Free Attention Module (SimAM)
The images in the forest smoke dataset may contain interference from complex backgrounds and environmental factors, such as trees and terrain, which increases the difficulty of object detection. At the same time, as the number of network layers increases, the weight of interfering information in the feature maps also increases, which negatively impacts the model. Transformers have achieved great success in natural language processing, image classification, object detection, and image segmentation in recent years. The reason is that self-attention plays a key role, supporting the global interaction of input information. However, its complexity grows quadratically with the number of visual tokens. The high computational complexity of self-attention, especially with high-resolution inputs, may require more computational resources and data for training. Therefore, it is not practical to deploy on forest smoke edge computing devices.
In this paper, we introduce the simple and parameter-free attention module (SimAM) [26], which can efficiently generate full 3D attention weights to enhance the attention of the downsampling and basic units without additional parameters. Compared to other attention mechanisms, the SimAM operates in a more concise manner while maintaining its lightweight properties. Therefore, in this work, the SimAM is introduced into the YOLOv8 model to effectively suppress the interference of redundant information in the network and extract essential feature information from the complex backgrounds of forest smoke images. This enables the network to pay more attention to the essential features of forest smoke images, enhancing the perceptual ability and adaptivity of the model, thus improving detection accuracy while keeping network complexity low. The calculation is given in Equations (7)-(11).
where t̂ = w_t t + b_t and x̂_i = w_t x_i + b_t are linear transforms of t and x_i; t and x_i denote the target neuron and the other neurons in a single channel of the input feature, respectively; i is the index over the spatial dimension; and M = H × W is the number of neurons in that channel. w_t and b_t are the weight and bias of the transform. Minimizing the energy yields the closed-form solution e*_t = 4(σ̂² + λ)/((t − μ̂)² + 2σ̂² + 2λ), where μ̂ = (1/(M−1)) Σ_i x_i and σ̂² = (1/(M−1)) Σ_i (x_i − μ̂)² denote the mean and variance, respectively, computed over all neurons in the channel except t. Equation (10) shows that the lower the minimal energy e*_t, the more distinctive the neuron t is from its surrounding neurons and the more important it is for visual processing.
Equation (11) gives the augmented feature tensor, where E groups all e*_t across channels and spatial dimensions, and ⊙ denotes element-wise multiplication. Excessively large values in E are limited by adding a sigmoid function, which does not affect the relative importance of each neuron. The complete 3D weights of the SimAM are shown in Figure 4.
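The weighting of Equations (7)-(11) can be sketched in a few lines of numpy. This mirrors the closed-form inverse-energy formulation; the default regularization constant λ is an assumption on our part:

```python
import numpy as np

def simam(X, lam=1e-4):
    """Simple, parameter-free attention over a (C, H, W) tensor.

    Computes an inverse-energy score per neuron and applies
    sigmoid gating, following the spirit of Equations (7)-(11).
    """
    C, H, W = X.shape
    n = H * W - 1                                # neurons other than t
    mu = X.mean(axis=(1, 2), keepdims=True)      # per-channel mean
    d = (X - mu) ** 2                            # squared deviation
    var = d.sum(axis=(1, 2), keepdims=True) / n  # per-channel variance
    inv_e = d / (4 * (var + lam)) + 0.5          # importance ~ 1 / e*_t
    return X * (1.0 / (1.0 + np.exp(-inv_e)))    # sigmoid-gated output
```

Neurons far from their channel mean receive larger gates, which is exactly the "distinctive neurons are more important" intuition above, and the module introduces no learnable parameters.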


Datasets and Implementation Details
The Forest Smoke Dataset has a total of 8176 images. The dataset was divided into a training set (6540 images), a validation set (817 images), and a test set (819 images) according to a ratio of 8:1:1. The images cover three types of scenes: forests, residential areas, and fields. Figure 5 shows some representative sample images from the Forest Smoke Dataset. In this work, the model experimental environment is shown in Table 1, and the training parameters of the forest smoke detection model are shown in Table 2. In particular, we selected YOLOv8s as the benchmark for our evaluation among the various versions of YOLOv8.
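The 8:1:1 split above can be reproduced with a simple shuffled partition. The function below is an illustrative sketch (the seed and names are ours, not from the paper); for 8176 items, integer truncation yields exactly the 6540/817/819 split reported:

```python
import random

def split_dataset(paths, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle image paths and split them into train/val/test sets."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    paths = list(paths)
    rng.shuffle(paths)
    n = len(paths)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    # Remainder goes to the test set, absorbing rounding leftovers.
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])
```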

Evaluation Metrics
In the experiments in this study, we used the mean average precision (mAP), precision, and recall to evaluate the performance of forest smoke detection. A true positive (TP) was considered correct only when the intersection over union (IoU) was ≥ 0.5. The calculations for precision and recall are given in Equations (12) and (13), respectively.
Precision = TP / (TP + FP) (12)

Recall = TP / (TP + FN) (13)

where TP denotes the number of correctly detected smoke samples, and FP represents the number of non-smoke samples falsely detected as smoke. The higher the precision, the lower the false detection rate. FN indicates the number of missed smoke samples.
AP = (Σ_{i ∈ {0.5, 0.55, ..., 0.95}} AP_i) / 10 (14)

The AP is the area under the P-R (precision-recall) curve. It is used to evaluate the object detection performance for a single class, as shown in Equation (14). The mAP is the average of the AP values over all categories, which takes into account the differences between categories and reflects the model performance more comprehensively. In this study, the accuracy refers to mAP50. The F1 score is the harmonic mean of precision and recall; a higher F1 score means a better forest smoke detection effect.
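Equations (12) and (13), together with the F1 score, reduce to a few lines of code. A minimal sketch with hypothetical detection counts:

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and F1 from detection counts (Eqs. (12)-(13))."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For example, 8 correctly detected smoke plumes with 2 false alarms and 2 misses give precision, recall, and F1 all equal to 0.8.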

Effect of Different Attention Mechanisms
The effect of integrating different attention modules at the end of the YOLOv8s backbone, i.e., after the SPPF, on the model detection performance is shown in Table 3. Adding only the attention mechanism results in a slight increase in the computational cost and the number of parameters of the model, while the accuracy shows a significant change.
The GAM [27] attention mechanism considers channel, height, and width information simultaneously, resulting in better capture of cross-dimensional interactions. The mean average precision of the model improves by up to 3.1%, but the computational cost increases by 58.71%. On the other hand, the CBAM [28] attention mechanism considers information in both the channel and spatial dimensions but ignores channel-space interactions, thus losing cross-dimensional information; nevertheless, it yields a certain improvement in model performance. In comparison, the SimAM attention mechanism achieves a good trade-off among the metrics analyzed in this paper, with a 3.3% improvement in the mean average precision and no significant change in the number of parameters or the computational cost.

Effects of Different Detection Models
To evaluate the performance of the improved model, comparative experiments with mainstream object detection models were conducted on our self-constructed forest smoke dataset. The selected models were a two-stage detector based on candidate regions (Faster R-CNN) and one-stage detectors based on regression (SSD, EfficientDet, YOLOv3-tiny, and YOLOv5). As shown in Table 4, EfficientDet and YOLOv3-tiny achieved the second- and third-best results, with accuracies of 89.37% and 87.9% and F1 scores of 0.8 and 0.841, respectively. In comparison, YOLOv5-ShuffleNet [32] and SSD [13] showed lower performance, with accuracies of 75.9% and 76.28%, respectively. In contrast, our improved YOLOv8s model exhibited significant results, with a mean average precision as high as 90.1% and an F1 score of 0.89, significantly higher than the other object detection models. Additionally, the improved YOLOv8s had stronger generalization performance and robustness.

Ablation Experiment
As shown in Table 5, reconstructing the neck of YOLOv8 using the BiFPN resulted in a 33.84% reduction in the parameter count and an 11.89% reduction in computational complexity. In addition, the mAP50 improved by 1.8%. Replacing the SPPF with focal modulation had little effect on the parameter count and computational complexity, while the model's mean average precision improved by 1.4% and the detection speed improved by 8 FPS. The introduction of the SimAM attention module left the parameter count and computational complexity largely unaltered, and it is worth noting that the mean average precision of the model improved by 3.3%, thus improving the performance of the forest smoke detection model. The combination of the BiFPN and the focal modulation module improved the accuracy of the model while greatly reducing the computation and number of parameters of the network. Overall, on the same dataset, the improved version of YOLOv8 combining the BiFPN, focal modulation, and SimAM modules outperformed the original YOLOv8 model in terms of detection accuracy, computational complexity, and number of parameters. The improved YOLOv8 model achieved a 3% improvement in the mean average precision, a 30.07% reduction in the number of parameters, and a 10.49% reduction in computational complexity. We selected a few representative images from our test set to better demonstrate the feasibility of the model. The comparison of the detection results between YOLOv8 and our improved forest fire and smoke detection model is shown in Figures 6 and 7.
Considering the detection scene, whether a residential area or a field, our model is able to identify the target well. In particular, for the small smoke plume in Figure 6c, the accuracy of the improved model is significantly higher than that of the previous models.

Figure 7 compares the missed detection problem when monitoring ground smoke. The improved YOLOv8s detects forest smoke more accurately, while the unimproved YOLOv8s is less accurate, missing one smoke target (Figure 7b,d). This may be because the improved YOLOv8s is more robust and has better object detection performance.

Effects of Different Datasets
In this section, we present the results of a systematic experimental comparison of the EfficientDet, YOLOv3-tiny, YOLOv4, YOLOv5, YOLOv8s, and improved YOLOv8s models on the self-built forest smoke dataset and the publicly available dataset used by Venancio et al. [33]. As can be seen in Table 6, the improved YOLOv8s performs better in terms of detection accuracy and robustness on the forest smoke dataset. This can be attributed to the fact that the publicly available dataset used by Venancio et al. [33] has a low pixel resolution, which limits its effectiveness in real scenes. In contrast, our self-constructed dataset has a higher pixel resolution and a larger number of diverse scenarios, which allows for a more comprehensive evaluation of the model's performance in a complex forest environment.

Discussion
In this study, a self-constructed forest smoke dataset was used as training and test data. However, an actual forest fire scenario has extremely complex features, such as trees, terrain, and other disturbing factors, which makes it more difficult to achieve significant results. In addition, the detection sensitivity of the algorithm will differ between large and small forest fires. Future studies should continue to expand the dataset to further improve the performance of the model. On the other hand, the proportion of positive and negative samples in the dataset also affects the detection results. If the proportion of positive samples is high, the model may focus excessively on the positive samples, resulting in poor performance when faced with new negative samples. A balanced ratio of positive and negative samples helps the model learn the object and background more comprehensively, improving model robustness and generalization performance. In real scenarios, the ratio of positive and negative samples must be adjusted according to the specific scenario and the distribution of the dataset.
In the future, we need to further optimize the forest smoke detection algorithm to better adapt to the practical application scenarios of edge computing platforms. Firstly, we will focus on model compression algorithms for neural network models on edge devices. This includes techniques to reduce the memory footprint and computational complexity of the model, such as model pruning [34], knowledge distillation [35], quantization [36], and low-rank decomposition [37]. In addition, the great success of deep learning relies heavily on increasingly large training datasets. Dataset compression can be used to construct a minimal subset of the training data from the entire training dataset without significantly affecting the performance of the model [38]. Finally, we will explore automated neural network architecture design through neural architecture search. This approach allows for the adaptive generation of optimal network structures based on specific scenario requirements, thus improving the applicability and efficiency of the algorithms. These research directions aim to reduce computational cost and memory requirements and improve the efficiency and utility of object detection. By implementing efficient smoke detection algorithms on edge devices, we can respond to fire threats in a more timely manner and minimize losses.
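To give a flavor of the pruning technique mentioned above, one-shot magnitude pruning zeroes the smallest-magnitude entries of a weight tensor. This is only a hedged numpy sketch; real pipelines typically prune iteratively and fine-tune between pruning steps:

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the smallest-magnitude fraction of a weight tensor.

    weights:  numpy array of model weights
    sparsity: fraction of entries to set to zero (0.0-1.0)
    """
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # Threshold at the k-th smallest absolute value.
    thresh = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= thresh] = 0.0
    return pruned
```

The resulting sparse tensor can then be stored in a compressed format or exploited by sparsity-aware inference kernels on edge hardware.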
Compared to the original YOLOv8 model, the proposed improved YOLOv8 model shows significant improvement in terms of accuracy, mAP50, number of parameters, and computational complexity. However, some of these improvements come at the cost of longer inference times. Experimental results show that the introduction of the BiFPN structure increases the inference time and reduces the recognition speed of the model. To solve this problem, the focal modulation structure was introduced to replace the original SPPF structure; the performance of the network is improved by learning coarse-grained spatial information and fine-grained feature information. The experimental results show that the introduction of the focal modulation structure improves the detection speed of the model. In addition, the experimental results in Table 3 show that attention mechanisms can effectively improve detection performance. However, most of them have the disadvantage of increasing the computational and parameter complexity as well as the inference time, which is demanding for real-time detection tasks. In comparison, the SimAM presented in this paper does not introduce additional parameters and improves the detection accuracy of the model, making it superior to the other attention mechanisms.

Conclusions
Forest fires are natural disasters with extensive destructive power and rapid spread, posing a great threat to human life, property, and the ecological environment. Therefore, early detection of and rapid response to forest fires are crucial.
In this paper, we presented an improved version of the original YOLOv8 model. Firstly, a weighted path aggregation network for multiscale forest smoke detection was fused into the neck to balance the feature information of different scales and to greatly optimize the number of parameters and the computational load of the model. Next, the simple and parameter-free attention mechanism (SimAM) was introduced to address the problem that images in forest smoke datasets may contain complex background and environmental clutter and to improve the perceptual ability and adaptivity of the model. Finally, focal modulation was introduced to increase the focus on hard-to-detect smoke while improving the model's runtime speed. Compared with the original YOLOv8s, the mean average precision of forest smoke detection improved by 3%, and the number of parameters and the computational complexity of the model were reduced by 30.07% and 10.49%, respectively. The proposed detection model significantly outperforms other existing object detection networks on the self-built forest smoke detection dataset and also has advantages among lightweight models. The improved model can be applied to edge devices, such as mobile devices and UAVs, to realize real-time monitoring and early warning and to improve the response speed and accuracy for fire events, which is of great significance for the early detection of and response to forest fires.
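The weighted feature fusion mentioned above can be illustrated by a fast normalized weighted sum in the style of BiFPN, in which each input scale receives a non-negative learnable weight that is normalized across inputs. The sketch below is a minimal NumPy illustration under that assumption; the weight values are purely illustrative and are not taken from the trained model.

```python
import numpy as np

def weighted_fuse(features, weights, eps=1e-4):
    """Fast normalized weighted fusion of same-shaped feature maps.

    Each input feature map gets a non-negative weight; the weights
    are normalized so that the fused output remains bounded.
    """
    w = np.maximum(np.asarray(weights), 0.0)   # ReLU keeps weights >= 0
    w = w / (w.sum() + eps)                    # fast normalization
    return sum(wi * f for wi, f in zip(w, features))

# Two resized pyramid levels fused with equal (illustrative) weights
p3 = np.ones((64, 32, 32))
p4 = 2 * np.ones((64, 32, 32))
fused = weighted_fuse([p3, p4], [1.0, 1.0])
```

Compared with softmax-based weighting, this ReLU-plus-normalization form avoids the exponential and is therefore cheaper on edge hardware while behaving similarly in practice.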

Data Availability Statement:
The data presented in this study are available upon request from the corresponding author. The dataset and code cannot be shared publicly for specific reasons.

Figure 1.
Figure 1. The architecture of the smoke detection network based on improved YOLOv8s.

Figure 3.
Figure 3. Detailed explanation of context aggregation (b) in focal modulation (a). The aggregation procedure consists of two steps: hierarchical contextualization to extract contexts from local to global ranges at different levels of granularity and gated aggregation to condense all context features at different granularity levels into the modulator.

Figure 4.
Figure 4. SimAM structure diagram. The feature X expansion generates 3D weights, which are normalized with a function. The weights of the target neurons are multiplied by the features of the initial feature map to obtain the final output feature map. The same color indicates that a single scalar is used for each point on that feature map.

Figure 5.
Figure 5. Partial samples from the Forest Smoke Dataset: (a) smoke generated before forest fires; (b) forest fire; (c) smoke from residential areas; (d) field smoke.

Figure 6.
Figure 6. Comparison of smoke detection results before (right) and after (left) improvement: (a,b) and (c,d) show the smoke detection results in residential areas and fields, respectively. The smoke detection results at medium and small scales are also shown.

Figure 7
Figure 7 compares the missed detection problem when monitoring ground smoke. The improved YOLOv8s can detect forest smoke more accurately, whereas the detection results of the unimproved YOLOv8s are less accurate, missing one smoke target (Figure 7b,d). This may be because the improved YOLOv8s has greater robustness and better target detection performance.

Figure 7.
Figure 7. Comparison of the results for the detection of smoke leakage: (a,c) the smoke detection results using our model; (b,d) the smoke detection results using the original YOLOv8s.

Author Contributions: X.G.: conceptualization, methodology, software, writing-original draft preparation. T.H.: funding acquisition, writing-review and editing. Y.C.: writing-review and editing. All authors have read and agreed to the published version of the manuscript.
Funding: National Key R&D Program Strategic International Science and Technology Innovation Cooperation Key Project: 2018YFE0207800.

Table 2.
The training parameters of the forest fire detection model.

Table 3.
Comparison of the performance of different attention mechanisms.

Table 4.
Comparison of the performance of different object detection models.

Table 5.
Comparison results of ablation experiments.

Table 6.
Comparison results of different datasets.