A Forest Fire Smoke Monitoring System Based on a Lightweight Neural Network for Edge Devices

Abstract: Forest resources are among the indispensable resources of the earth and form the basis for the survival and development of human society. With the swift advancement of computer vision and artificial intelligence technology, deep learning for smoke detection has achieved remarkable results. However, existing deep learning models perform poorly in forest scenes and are difficult to deploy because of their numerous parameters. Hence, we introduce an optimized forest fire smoke monitoring system for embedded edge devices based on a lightweight deep learning model. The model makes full use of the multi-scale deformable attention mechanism of the Transformer architecture to strengthen image feature extraction. Considering the needs of application scenarios, we propose an improved lightweight network model, LCNet, for feature extraction, which reduces the number of parameters and enhances detection ability. In order to improve running speed, a simple semi-supervised label knowledge distillation scheme is used to enhance the overall detection capability. Finally, we design and implement a forest fire smoke detection system on an embedded device, including the Jetson NX hardware platform, a high-definition camera, and a detection software system. The lightweight model is transplanted to the embedded edge device to achieve rapid forest fire smoke detection. In addition, an asynchronous processing framework is designed to make the system highly available and robust. The improved model reduces the number of parameters by three-fourths and increases speed by 3.4 times with accuracy similar to that of the original model. This demonstrates that our system meets the precision demand and detects smoke in time.


Introduction
Forest resources are among the most important resources on the earth, with extremely important ecological, economic, and social values [1]. However, forests are often disturbed by climate change, deforestation, fires, and pests. Forest fire is a serious natural disaster that not only burns trees and other forest resources but also causes serious harm to the ecological environment, wildlife, human life, and property [2]. In a complex forest environment, an early fire is easily occluded, whereas fire smoke has more obvious characteristics than the early fire itself. The smoke rises before the flames become visible and spreads with time, which makes it an important sign for detecting forest fires. Therefore, accurately and promptly finding the smoke produced by early forest fires plays a vital role in personal safety, the social economy, and ecological environmental protection.
When monitoring forest fires, traditional methods rely heavily on manual inspection and smoke sensor networks. However, manual inspection consumes a lot of material resources and is inefficient, dangerous, and unsatisfactory in preventing forest fires. In the past decades, wireless sensor networks composed of various sensors have been widely used in fire smoke detection [3]. The single-point smoke sensor has a good indoor smoke detection effect [4]. When a fire occurs, the sensor can trigger a change in its internal physical components according to the temperature and smoke concentration. However, the investment cost of establishing a wireless sensor network across a whole forest is too high, and the sensors are easily disturbed and damaged by the environment. Besides the single-point smoke sensor, satellite remote sensing sensors, which are unaffected by most environmental factors, have also been widely used in forest fire smoke detection [5]. Unfortunately, because of the long detection period and limited resolution, satellite sensors cannot detect early forest fires immediately; they can only monitor large-scale fires and the direction of fire spread.
As optical equipment and artificial intelligence progress rapidly, deployable forest fire smoke detection systems based on video image analysis have emerged as an advanced alternative to traditional approaches [6]. Fire smoke detection models based on traditional image processing usually adopt pattern recognition and classify video images after discovering and extracting artificially designed features. Kim et al. [7] use a Gaussian Mixture Model (GMM) as a background estimation algorithm to remove the background, extract Haar-like features and pixel statistical features from the foreground image, and finally use Adaboost to identify smoke in images. Ye et al. [8] use an adaptive background removal method to extract moving smoke and flame targets in a video and then classify the moving targets through several statistical difference methods in the frequency domain.
In recent years, with the significant advancement in computing capabilities, deep learning technology has witnessed exponential growth and refinement. Compared with the traditional fire smoke detection method based on image processing, the deep learning method can extract deeper image features and semantic information, which has the advantages of high speed and precision and is more suitable for diverse detection scenarios. Bouchouicha et al. [9] use the same semantic segmentation network to detect smoke on two different smoke datasets, emphasizing the importance of rich and varied datasets with accurate labels. Kwak and Ryu [10] use color conversion and corner detection methods to preprocess the fire area and use dark-channel prior and optical flow to detect the smoke area and eliminate unnecessary backgrounds. After preprocessing the image in this way, they use a classical object detection model to detect smoke. A well-designed lightweight network can achieve good accuracy and speed in forest fire detection, which makes it attractive for practical applications. Jiao et al. [11,12] propose a lightweight detection strategy based on YOLOv3 and obtain good efficiency and precision by using an unmanned aerial vehicle (UAV). Sun et al. [13] propose a lightweight and high-precision detection network (AERNet) aimed at multi-scale smoke. The SE-Ghost module and multi-scale detection module used in the network are the keys to detecting smoke and greatly improving speed.
Given the exponential advancements in the field of natural language processing (NLP), the Transformer model [14] stands out and has become one of the excellent solutions for text translation, text generation, semantic analysis, and other tasks [15]. The Transformer model is a network architecture based on a self-attention mechanism, which is implemented by multi-head attention modules and fully connected layers. Dosovitskiy et al. [16] propose the Vision Transformer (ViT) model, which no longer combines the attention mechanism with a convolutional neural network or embeds it as a component but directly inputs the image block sequence into the Transformer model for training. After that, the Detection Transformer (DETR) [17], a landmark end-to-end object detection model, creatively uses the global modeling ability of the Transformer and formulates the output as a set prediction. Based on DETR, subsequent research has introduced a series of enhancements, such as Conditional DETR [18], DAB-DETR [19], and Deformable DETR [20]. By decoupling the cross-attention mechanism, introducing learnable object queries, and using sparse and efficient attention mechanisms, among other methods, these models have greatly improved training speed and detection accuracy and outperform most existing detection models. Following our previous research, our baseline is the improved Deformable DETR [21]. Our experiments show the effectiveness of the proposed improvement. The previously proposed forest fire smoke detection methods yield favorable results on our dataset, but they still suffer from a large number of model parameters and slow running speed, which cannot meet practical application requirements. Because of the limited resources and power budget of edge computing devices, it is difficult to deploy large models. Therefore, we improve the model by designing a lightweight feature extraction network and using a knowledge distillation scheme [22,23]. After that, we design a complete forest fire smoke monitoring system to detect smoke in real applications.
The contributions of our paper are as follows:
• We propose a forest fire smoke monitoring system based on a lightweight neural network for edge devices, which involves an improved LCNet as a feature extraction network and a Transformer. The lightweight network efficiently reduces model parameters while maintaining the original accuracy. Additionally, various techniques are employed to enhance precision and execution efficiency.
• In order to further improve the speed and accuracy, we use a simple semi-supervised label knowledge distillation scheme (SSLD) to compress the model size and eliminate redundant parameters.
• In order to meet the application demand, a complete forest fire smoke monitoring system is designed, including an asynchronous processing framework and embedded edge devices with deployed models. The system can complete image acquisition, algorithmic inference, early-warning display, and other functions.
The subsequent sections of this paper are structured as follows. Section 2 includes the composition of our smoke dataset, the overall architecture of the lightweight smoke detection model, and implementation details. Section 3 analyzes the experimental results after introducing the experimental environment configuration. The discussion in Section 4 and the conclusion in Section 5 follow.

Dataset and Annotation
As deep learning technology continues to develop rapidly, data collection and annotation remain an indispensable part of model development. The quality and quantity of a dataset directly affect the accuracy and generalization ability of the model. Forest fire smoke detection technologies based on deep learning mostly use self-built datasets because of the lack of public standard datasets. We gathered over 10,000 images from public fire smoke datasets, such as HPWREN FIgLib [24], the Wildfire Smoke Video Database [25], and the Smoke Detection Dataset [26], amounting to a total of 12,720 images from different perspectives. The dataset comprises diverse forest fire smoke images across various types and scenes. We labeled smoke targets in the images and generated annotation files with the LabelImg 1.8.1 software [27]. Then, the annotation files were converted to COCO format [28]. Furthermore, we randomly divided the dataset into two portions, with 80% serving as the training set and the remaining 20% as the validation set. Some sample images are shown in Figure 1, and the composition of the dataset is shown in Table 1.
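The random 80/20 split described above can be sketched as follows. This is only a minimal illustration, assuming a COCO-style dictionary with `images` and `annotations` lists; the function name `split_coco`, the fixed seed, and the toy data are hypothetical and are not taken from our pipeline.

```python
import random

def split_coco(coco, train_ratio=0.8, seed=0):
    """Randomly split a COCO-style dict into train and validation dicts by image."""
    images = list(coco["images"])
    random.Random(seed).shuffle(images)  # fixed seed for reproducibility (assumption)
    n_train = int(len(images) * train_ratio)
    parts = {"train": images[:n_train], "val": images[n_train:]}
    result = {}
    for name, imgs in parts.items():
        ids = {img["id"] for img in imgs}
        result[name] = {
            "images": imgs,
            # keep only the annotations whose image landed in this part
            "annotations": [a for a in coco["annotations"] if a["image_id"] in ids],
            "categories": coco.get("categories", []),
        }
    return result["train"], result["val"]

# Toy dataset: 10 images with one smoke box each (made-up values)
toy = {
    "images": [{"id": i, "file_name": f"img_{i}.jpg"} for i in range(10)],
    "annotations": [
        {"id": i, "image_id": i, "category_id": 1, "bbox": [0, 0, 10, 10]}
        for i in range(10)
    ],
    "categories": [{"id": 1, "name": "smoke"}],
}
train, val = split_coco(toy)
print(len(train["images"]), len(val["images"]))
```

Splitting by image rather than by annotation keeps all boxes of one image in the same subset, which avoids leakage between training and validation.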

Model Architecture
The proposed model for detecting forest fire smoke combines a streamlined feature extraction network with a Transformer architecture. The model architecture is shown in Figure 2; the overall structure is end-to-end and includes a feature extraction network, an encoder, and a decoder. The output contains the prediction result and its bounding box. The input image is initially processed by the LCNet to obtain multi-scale abstract feature maps. In the encoder part of the Transformer, the input multi-scale feature map is transformed into a sequence, and the position information of the image is supplemented by position coding and multi-scale hierarchical coding. Then, each encoder layer uses a multi-scale deformable attention module and a feed-forward network for feature fusion and learning. The decoder processes the output features and object queries with a multi-head self-attention module and a multi-scale deformable attention module. Each decoder layer processes all object queries in parallel to generate different prediction results. The final outputs are produced by a prediction feed-forward network that generates prediction boxes and category information. The prediction feed-forward network consists of a three-layer perceptron with the ReLU activation function and a linear projection layer. The linear layer uses the softmax function to predict the category label, and the feed-forward network outputs the normalized center coordinates, length, and width of the prediction box. Because the decoder outputs a fixed number N of prediction boxes, and N significantly exceeds the number of target objects present in the image, the model uses an additional category, the none category, to indicate that no target object is detected in an area. It plays essentially the same role as the background class in traditional object detection models.
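The role of the none category can be illustrated with a small post-processing sketch. It assumes that each of the N queries yields class logits whose last entry is the none category, together with a normalized (cx, cy, w, h) box; the function name `decode_predictions`, the 0.5 score threshold, and the toy values are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode_predictions(logits_per_query, boxes_per_query, img_w, img_h, score_thr=0.5):
    """Keep only queries whose most likely class is not the trailing 'none' class."""
    detections = []
    for logits, (cx, cy, w, h) in zip(logits_per_query, boxes_per_query):
        probs = softmax(logits)
        label = max(range(len(probs)), key=probs.__getitem__)
        none_index = len(probs) - 1
        if label == none_index or probs[label] < score_thr:
            continue  # this query predicts "no object" or is not confident enough
        # normalized centre/size -> absolute corner coordinates
        x1, y1 = (cx - w / 2) * img_w, (cy - h / 2) * img_h
        x2, y2 = (cx + w / 2) * img_w, (cy + h / 2) * img_h
        detections.append((label, probs[label], (x1, y1, x2, y2)))
    return detections

# Three queries, two classes: index 0 = smoke, index 1 = none
logits = [[4.0, 0.0], [0.0, 3.0], [-1.0, 2.0]]
boxes = [(0.5, 0.5, 0.2, 0.4), (0.1, 0.1, 0.1, 0.1), (0.8, 0.8, 0.1, 0.1)]
dets = decode_predictions(logits, boxes, img_w=640, img_h=480)
print(dets)
```

In this toy case only the first query survives, since the other two assign their highest probability to the none category.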

LCNet Structure
The basic module of LCNet is the depthwise separable convolution (DepthSepConv) module of MobileNetV1 [29], which contains no residual network structure. There is no need for additional operations such as concatenation or matrix operations. Integrating a residual structure into a model slows it down but does not necessarily enhance its performance. In addition to this basic module, the original LCNet has an average pooling layer, a flattening layer, and a fully connected layer.
In order to improve the feature extraction ability of LCNet and adapt it to the subsequent encoder-decoder structure of the model, the network structure of LCNet is redesigned in this paper. Specifically, we reduce the number of DepthSepConv layers and remove the last average pooling layer and the flattening layer. The first layer of the network is still a standard convolution module, followed by 5 depthwise separable convolution modules using 3 × 3 convolution kernels and 7 depthwise separable convolution modules using 5 × 5 convolution kernels. Finally, the model ends with a 1 × 1 convolution. Table 2 gives the comprehensive details of the model architecture, where SE (squeeze-and-excitation) indicates that the SE module is used in a given layer [30]. Figure 3 illustrates the composition of our enhanced lightweight network model LCNet. The overall structure is a stack of multiple basic modules. Compared with the original LCNet, the improved model optimizes two parts. First, the improved model only uses five 3 × 3 depthwise separable convolution modules and removes the global average pooling layer and the fully connected layer. Second, the model enlarges the output features of the depthwise separable convolution modules with different convolution kernel sizes. These adjustments result in a model that has fewer parameters, higher computational efficiency, and enhanced extracted features. Furthermore, the elimination of the global average pooling layer enables the model to capture a richer array of background and edge features, thereby rendering it more effective as a feature extraction network. Employing LCNet as a replacement for the original network results in a significant parameter reduction while maintaining model accuracy.
By splitting the correlation between spatial and channel dimensions, depthwise separable convolution remarkably reduces the computational cost of standard convolution, thereby enhancing computational efficiency. During standard convolution, the input feature map I is transformed by the convolution kernel F to yield the output feature map O, where the size of I is H1 × W1 × C1, the size of F is K × K, and the size of O is H2 × W2 × C2. The required calculation amount and parameter amount can be expressed as Formulas (1) and (2), respectively:
$$\mathrm{FLOPs}_{\mathrm{std}} = K \times K \times C_1 \times C_2 \times H_2 \times W_2 \tag{1}$$
$$\mathrm{Params}_{\mathrm{std}} = K \times K \times C_1 \times C_2 \tag{2}$$
Since the size of the convolution kernel is usually 3 × 3 or 5 × 5 and the input feature map size is fixed, the number of channels mainly determines the calculation amount and the parameter amount in high-dimensional convolution operations. The depthwise separable convolution module can reduce network parameters and improve calculation efficiency by decomposing the ordinary convolution operation into a depthwise convolution (DW) and a pointwise convolution (PW). Figure 4 shows a feature map transformation example of depthwise separable convolution.
In a conventional convolution operation, each convolution kernel needs to convolve every input channel. In depthwise convolution, by contrast, each convolution kernel convolves only one input channel. The outputs of the individual convolution kernels are concatenated to obtain the final output. In this case, the number of channels in the resulting feature map matches that of the input feature map.
Next, the dimension of this feature map is changed into that of the output feature map through pointwise convolution. In essence, pointwise convolution is a standard convolution with a 1 × 1 kernel whose number of output channels equals that of the output feature map. Using a 1 × 1 convolution kernel reduces the amount of calculation.
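Combined with Formulas (1) and (2), the calculation amount and the parameter amount of depthwise separable convolution are, respectively:
$$\mathrm{FLOPs}_{\mathrm{dsc}} = K \times K \times C_1 \times H_2 \times W_2 + C_1 \times C_2 \times H_2 \times W_2 \tag{3}$$
$$\mathrm{Params}_{\mathrm{dsc}} = K \times K \times C_1 + C_1 \times C_2 \tag{4}$$
Comparing them with the calculation amount and parameter amount of standard convolution, we can determine that:
$$\frac{\mathrm{FLOPs}_{\mathrm{dsc}}}{\mathrm{FLOPs}_{\mathrm{std}}} = \frac{1}{C_2} + \frac{1}{K^2} \tag{5}$$
$$\frac{\mathrm{Params}_{\mathrm{dsc}}}{\mathrm{Params}_{\mathrm{std}}} = \frac{1}{C_2} + \frac{1}{K^2} \tag{6}$$
As can be seen from Equations (5) and (6), the calculation amount and the parameter amount of depthwise separable convolution are several times smaller than those of standard convolution. For example, when the number of output channels C2 is large, a depthwise separable convolution with a 3 × 3 convolution kernel reduces the calculation amount and parameters to nearly 1/9. The implementation details of the network structure are depicted in Figure 5, in which standard convolution is replaced with depthwise separable convolution.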
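The reduction factor can be checked numerically with a short sketch. It assumes the usual counting convention (one multiply-accumulate per kernel weight per output position, bias terms ignored); the function names and the example layer size are hypothetical and do not correspond to a specific layer of our network.

```python
def standard_conv_cost(k, c_in, c_out, h_out, w_out):
    """Multiply-accumulates and weights of a K x K standard convolution."""
    flops = k * k * c_in * c_out * h_out * w_out
    params = k * k * c_in * c_out
    return flops, params

def depthwise_separable_cost(k, c_in, c_out, h_out, w_out):
    """Cost of a K x K depthwise convolution followed by a 1 x 1 pointwise one."""
    dw_flops = k * k * c_in * h_out * w_out   # one kernel per input channel
    pw_flops = c_in * c_out * h_out * w_out   # 1 x 1 convolution mixes channels
    params = k * k * c_in + c_in * c_out
    return dw_flops + pw_flops, params

# Hypothetical layer: 3 x 3 kernel, 64 -> 128 channels, 56 x 56 output
std_flops, std_params = standard_conv_cost(3, 64, 128, 56, 56)
dsc_flops, dsc_params = depthwise_separable_cost(3, 64, 128, 56, 56)
flops_ratio = dsc_flops / std_flops
params_ratio = dsc_params / std_params
print(round(flops_ratio, 4), round(params_ratio, 4))
```

Both ratios equal 1/C2 + 1/K², which for this layer is slightly above 1/9 because of the 1/128 contribution of the pointwise convolution.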
The SE module has been widely used in various kinds of networks since it was proposed. It contains a global average pooling layer, fully connected layers, and activation functions. It can learn and strengthen the relationships among feature channels to improve the network's representation ability, which is embodied in a feature recalibration mechanism. Through this mechanism, the network can use global information to emphasize informative features and suppress unimportant ones.
The SE module is a computing unit that can transform the input features. First, global information is embedded by compression (squeeze). The input feature map with the size of H × W × C is compressed into a feature vector of 1 × 1 × C through a global average pooling operation. The resulting channel-level statistics encompass contextual details, thereby mitigating the issue of channel dependence. Subsequently, an adaptive recalibration is performed via excitation, employing a two-stage gating mechanism composed of two consecutive fully connected layers. In the first stage, the feature vector is compressed from C channels to fewer channels to minimize computational cost, after which a nonlinear activation function is applied. The second fully connected layer restores the number of channels to C and then uses an activation function to obtain the 1 × 1 × C weight vector describing the input feature map. Finally, the module multiplies each channel of the feature map by the corresponding weight to obtain the final output with the size of H × W × C, which completes the recalibration of the input features in the channel dimension.
A large number of SE modules in the model lowers the inference speed. Therefore, SE modules are only used in the last few layers, which gives a better tradeoff between detection accuracy and inference speed.
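The squeeze, excitation, and scaling steps can be sketched in plain Python as follows. The sketch assumes ReLU after the first fully connected layer and a sigmoid gate after the second, with no bias terms; the weight matrices, the reduction to a single hidden unit, and the name `se_block` are hypothetical choices for illustration only.

```python
import math

def se_block(x, w1, w2):
    """x: list of C channels, each H x W; w1: (C/r) x C weights; w2: C x (C/r) weights."""
    # Squeeze: global average pooling turns H x W x C into a length-C vector
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # Excitation, stage 1: reduce C -> C/r, then ReLU
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    # Excitation, stage 2: restore C/r -> C, then sigmoid gate in (0, 1)
    gates = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, hidden))))
             for row in w2]
    # Scale: multiply every channel by its gate
    out = [[[g * v for v in row] for row in ch] for g, ch in zip(gates, x)]
    return out, gates

# Two channels of size 2 x 2, reduction to one hidden unit (made-up weights)
x = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]]
w1 = [[0.5, 0.5]]
w2 = [[1.0], [-1.0]]
out, gates = se_block(x, w1, w2)
print([round(g, 3) for g in gates])
```

The output keeps the H × W × C shape of the input; only the relative weighting of the channels changes.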

Re-Parameterization Strategy and the H-Swish Activation Function
To improve the model's extraction capability and inference speed, we adopt a re-parameterization strategy, employ a better activation function named h-swish for optimization, and modify the pointwise convolution and residual structure.
The size of the convolution kernel determines the receptive field in the convolution operation. By employing convolution kernels of varying sizes to generate the feature map, we can extract multi-scale features and subsequently integrate the outputs. Therefore, depthwise convolutions with 1 × 1, 3 × 3, and 5 × 5 convolution kernels are used in parallel in the C4 and C5 layers of LCNet. However, considering the reduction in inference efficiency caused by multiple calculations, a re-parameterization strategy is chosen to fuse the multiple convolution operations in the same layer, so that only a 5 × 5 depthwise convolution is retained, as shown in Figure 6. This reduces the calculation cost by converting the convolutions with multiple kernels of different sizes into a single convolution with the largest kernel.
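Because convolution is linear, parallel branches with different kernel sizes can be merged by zero-padding the smaller kernels to 5 × 5 and summing them. The sketch below illustrates this for a single channel. It is a simplified assumption-laden example: batch normalization and bias terms, which a full implementation would also need to fold into the fused kernel, are ignored, and all function names and kernel values are hypothetical.

```python
def pad_kernel(kernel, size):
    """Embed a small square kernel in the centre of a size x size zero kernel."""
    k = len(kernel)
    off = (size - k) // 2
    out = [[0.0] * size for _ in range(size)]
    for i in range(k):
        for j in range(k):
            out[off + i][off + j] = kernel[i][j]
    return out

def fuse_kernels(kernels, size=5):
    """Sum zero-padded kernels into a single size x size kernel."""
    fused = [[0.0] * size for _ in range(size)]
    for kern in kernels:
        padded = pad_kernel(kern, size)
        for i in range(size):
            for j in range(size):
                fused[i][j] += padded[i][j]
    return fused

def conv2d_same(image, kernel):
    """Single-channel cross-correlation, stride 1, zero padding ('same' output size)."""
    h, w, k = len(image), len(image[0]), len(kernel)
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    yy, xx = y + i - r, x + j - r
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out

image = [[float((3 * y + 5 * x) % 7) for x in range(6)] for y in range(6)]
k1 = [[2.0]]
k3 = [[0.1 * (i - j) for j in range(3)] for i in range(3)]
k5 = [[0.01 * (i * 5 + j) for j in range(5)] for i in range(5)]

# Training-time view: three parallel branches whose outputs are added
branches = [conv2d_same(image, k) for k in (k1, k3, k5)]
branch_sum = [[sum(b[y][x] for b in branches) for x in range(6)] for y in range(6)]

# Inference-time view: a single fused 5 x 5 convolution
fused = fuse_kernels([k1, k3, k5])
fused_out = conv2d_same(image, fused)

max_diff = max(abs(branch_sum[y][x] - fused_out[y][x])
               for y in range(6) for x in range(6))
print(max_diff < 1e-9)
```

The fused kernel produces the same output as the three branches up to floating-point rounding, while requiring only one convolution at inference time.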
In a convolutional neural network, using an activation function to introduce nonlinearity improves the expression ability. Employing the ReLU function can bring about substantial improvements in the model's performance. However, the ReLU function is not without its drawbacks, primarily because it can only output values greater than or equal to zero. During training, if a significant negative gradient is present, the zero output of the ReLU function can lead to the disappearance of the gradient. This phenomenon can cause neurons to become permanently inactive, thereby adversely affecting the overall training process. In recent years, many variants of the ReLU function have been proposed, such as Leaky-ReLU [31] and ELU [32]. In 2017, Google proposed the swish activation function [33], which combines the advantages of Sigmoid and ReLU with lower computing consumption. In this paper, we use its optimized form, the h-swish (hard-swish) activation function, whose specific implementation is shown in Equation (7) and Figure 7.
$$\text{h-swish}(x) = \begin{cases} 0, & x \le -3 \\ x, & x \ge 3 \\ \dfrac{x\,(x + 3)}{6}, & -3 < x < 3 \end{cases} \tag{7}$$
As shown in Equation (7) and Figure 7, the function is defined piecewise: it is linear away from zero and uses a simple nonlinear (quadratic) segment near zero, which keeps the gradient calculation cheap. The non-monotonicity of the h-swish activation function helps to alleviate node saturation effectively and has a regularizing effect on the network. Because the function contains no exponential operations, it is inexpensive to compute, while the network can still learn rich features and remain robust to noise interference. Using the h-swish function effectively improves the inference speed while the network accuracy is almost unaffected, so it is selected as the activation function.
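The piecewise definition in Equation (7) is equivalent to the compact form x · ReLU6(x + 3) / 6, which a minimal sketch can verify. The `swish` function below is included only for comparison and uses the common definition x · sigmoid(x); the sample points are arbitrary.

```python
import math

def relu6(x):
    """Clamp x to the interval [0, 6]."""
    return min(max(x, 0.0), 6.0)

def h_swish(x):
    """Hard-swish: zero for x <= -3, identity for x >= 3, quadratic in between."""
    return x * relu6(x + 3.0) / 6.0

def swish(x):
    """Swish with beta = 1, i.e. x * sigmoid(x); needs an exponential."""
    return x / (1.0 + math.exp(-x))

for x in (-4.0, -1.5, 0.0, 1.0, 3.0, 5.0):
    print(f"x={x:5.1f}  h_swish={h_swish(x):8.4f}  swish={swish(x):8.4f}")
```

The value at x = -1.5 is lower than the values at both x = -3 and x = 0, which is the non-monotonic dip mentioned above.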
In addition, in order to strengthen the fitting ability of depthwise separable convolution, two pointwise convolution layers are used to replace the single pointwise convolution layer in the C4 layer. The first pointwise convolution compresses the feature map's dimensionality, while the second restores it. The residual structure can improve performance in most convolutional neural networks. However, because it involves element-wise addition, it considerably reduces the inference efficiency of the model. We therefore use it only in the last depthwise separable convolution layer of the model.

Simple Semi-Supervised Label Knowledge Distillation Scheme
Knowledge distillation is a method used to transfer the feature knowledge learned by large-scale models with strong learning ability to small-scale models, which is essentially a process of model compression [34]. The goal is to facilitate knowledge transfer from a large teacher model to a compact student model, ensuring that the student's output aligns with the teacher's output. With distillation training on a suitable model, the distilled model can generally outperform the original model. The basic framework of knowledge distillation is shown in Figure 8.
The simple semi-supervised label distillation scheme (SSLD) used in this paper improves the overall performance of the target model by distilling the knowledge of an existing large-scale pre-trained model. It can be directly extended to downstream applications encompassing object detection, transfer learning, and semantic segmentation [23].
In the SSLD scheme, the output of the student model only needs to match the soft label predicted by the teacher model. Given that a single image may contain multiple target objects, the characteristics of these objects cannot be effectively captured using manually designed hard labels. The SSLD scheme uses labeled training sets and unlabeled datasets for training, and the knowledge gained from the teacher model is imparted to the student model. The specific framework implementation is shown in Figure 9.
The teacher model can be any well-trained classification model whose performance is better than that of the student model. The teacher model needs to filter the unlabeled data and select valuable unlabeled data to ensure the reliability of the training results. In the process of knowledge distillation, both the training set and the unlabeled dataset are employed. The labels from the training set continue to supervise the prediction results of the student model. Furthermore, the prediction outputs of the student and teacher models are constrained to align, ensuring consistency in their responses. This is achieved by minimizing the JS divergence between the two outputs.
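To make the alignment constraint concrete, the following minimal sketch computes the Jensen-Shannon (JS) divergence between a teacher and a student class distribution. The logits are made-up values and the helper names are illustrative; this is not the authors' training code.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """JS(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m the midpoint distribution."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical logits for one image over three classes.
teacher_probs = softmax([2.0, 0.5, -1.0])
student_probs = softmax([1.0, 1.0, 0.0])

loss = js_divergence(teacher_probs, student_probs)
print("JS divergence between teacher and student:", loss)
```

The JS divergence is symmetric and bounded above by ln 2, so it stays finite even when the two distributions differ strongly.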
The framework represents the training set and validation set as T = {(x_i, y_{x_i})} and V = {(x_i, y_{x_i})}, respectively, where x_i is the input image and y_{x_i} is the corresponding label; the unlabeled dataset is U = {x_i}. The teacher model Q is parameterized by Equation (8), in which (x_i, y_{x_i}) represents the training set data, Q(x_i | θ_Q) represents the classification probability of the teacher model Q, L is the cross-entropy loss function, and α is the weight decay factor. Assuming that the teacher model is well trained, its generalization error R_{Q*} can be less than a small value R_0, as expressed in Equation (9). The framework optimizes the student model by obtaining θ_P* according to Equation (10), where β is the weight decay factor.
Each part of the distillation scheme is introduced below, including the teacher model, soft label, regularization, and the unlabeled dataset. The teacher model is the key to the knowledge distillation scheme. The better its performance, the better the student model's performance will be. In this paper, ResNet50-D is selected as the teacher model, which obtains a top-1 accuracy in image classification as high as 83% [35].
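For reference, one form of Equations (8) to (10) that is consistent with the definitions in this section is sketched below. These expressions are an assumption reconstructed from the surrounding description (cross-entropy loss L, JS divergence alignment, and L2 weight decay with factors α and β), not a verbatim copy of the original formulas. In particular, the use of the validation set V in Equation (9) and the summation over T ∪ U in Equation (10) are assumptions, and a supervised term on the labeled set T may also be present in the original.

```latex
\begin{align}
\theta_Q^{*} &= \arg\min_{\theta_Q} \sum_{(x_i,\, y_{x_i}) \in T}
  L\bigl(y_{x_i},\, Q(x_i \mid \theta_Q)\bigr) + \alpha \lVert \theta_Q \rVert_2^2 \tag{8} \\
R_{Q^{*}} &= \mathbb{E}_{(x_i,\, y_{x_i}) \in V}
  \Bigl[ L\bigl(y_{x_i},\, Q(x_i \mid \theta_Q^{*})\bigr) \Bigr] \le R_0 \tag{9} \\
\theta_P^{*} &= \arg\min_{\theta_P} \sum_{x_i \in T \cup U}
  \mathrm{JS}\bigl(Q(x_i \mid \theta_Q^{*}) \,\Vert\, P(x_i \mid \theta_P)\bigr)
  + \beta \lVert \theta_P \rVert_2^2 \tag{10}
\end{align}
```

Here P denotes the student model with parameters θ_P, mirroring the notation Q and θ_Q used for the teacher.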
In the SSLD scheme, the student model is exclusively linked to the output of the teacher model, enabling the use of any data to enhance performance without the need for annotation work.One of the reasons for using soft labels is the characteristics of image processing tasks.There can be several targets in one image, but most datasets only provide one single hard label, which leads to the loss of image information.
Regularization serves as a crucial factor in mitigating and preventing model overfitting.One of the mainstream regularization methods is to add L2 regularization to model parameters, as shown in Equation (10).By reducing the value of hyperparameter β properly, the overall training loss can be reduced, and the reduction in training loss may lead to an improvement in validation accuracy without a serious over-fitting phenomenon.
The dataset used by the teacher model in this paper is divided into labeled data and unlabeled data.The labeled data is the ImageNet-1K training dataset that contains 1.2 million images.The unlabeled data consists of 4 million images selected from ImageNet-22K by the teacher model ResNeXt101.In the filtering process, the SIFT method is used to remove the similar images between ImageNet-1K and ImageNet-22K, so as to avoid overlap or similarity between the training and validation data.Then, the teacher model is used to obtain the prediction results of ImageNet-22K images.The images in each category are sorted according to the model output score, and the top 4000 images in each category are taken to form the final unlabeled dataset.
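The per-category selection step can be sketched as follows: group the teacher's predictions by predicted category, sort by score, and keep the top k images in each category. With k = 4000 and the 1000 categories of ImageNet-1K, this yields the 4 million images mentioned above. The tuple layout, function name, and sample data are hypothetical.

```python
from collections import defaultdict


def select_top_k(predictions, k):
    """Keep the k highest-scoring images per predicted category.

    predictions: iterable of (image_id, predicted_label, score) tuples.
    Returns a dict mapping each label to a list of image ids, best first.
    """
    by_label = defaultdict(list)
    for image_id, label, score in predictions:
        by_label[label].append((score, image_id))

    selected = {}
    for label, scored in by_label.items():
        scored.sort(key=lambda item: item[0], reverse=True)
        selected[label] = [image_id for _, image_id in scored[:k]]
    return selected


# Hypothetical teacher outputs on unlabeled images.
teacher_outputs = [
    ("img_001", "smoke", 0.91),
    ("img_002", "smoke", 0.42),
    ("img_003", "cloud", 0.77),
    ("img_004", "smoke", 0.88),
    ("img_005", "cloud", 0.35),
]

unlabeled_subset = select_top_k(teacher_outputs, k=2)
print(unlabeled_subset)
```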

Overview of the Monitoring System
Deep learning models have clear advantages for inference on edge devices. In unmanned monitoring, real-time inference on edge devices can quickly analyze the input data and return results, which is important in forest fire monitoring scenarios. Results returned by on-device inference can therefore meet the practical needs of forest fire monitoring.
The forest fire smoke monitoring system designed in this paper includes a software system and a hardware platform. The software part includes a human-computer interaction interface, an asynchronous task processing framework, and a transplanted forest fire smoke detection model. The hardware part includes a high-definition camera and Jetson NX edge computing equipment produced by NVIDIA (NVIDIA Corporation, Santa Clara, CA, USA). The overall framework of the monitoring system is shown in Figure 10.
The Jetson NX combines high performance with low power consumption, which makes it well suited to AI tasks in resource-limited environments. The specific configuration of Jetson NX is shown in Table 3.
We use the official Jetson SDK Manager (v2.0, NVIDIA Corporation, Santa Clara, CA, USA) to install the Ubuntu operating system and its supporting files. After this, we use TensorRT (v8.2.1.9, NVIDIA Corporation, Santa Clara, CA, USA), NVIDIA's deep learning inference framework, to transplant the proposed model. The model based on the PyTorch (v1.5.1, Facebook AI Research, New York, NY, USA) framework is converted into a TensorRT model and quantized. The inference engine is implemented in C++, which can greatly improve the inference speed and meet the practical application requirements.
In this paper, a lightweight asynchronous task-processing framework is designed, which splits the overall synchronous multi-stage tasks into asynchronous tasks for processing. It improves the utilization of system resources and makes it more convenient to manage task configuration and information. The overall architecture is divided into a service layer and an execution layer. As a producer, the service layer provides task creation, task occupation, task status setting, and related functions. As a consumer, the execution layer provides the task scheduling function, which obtains tasks from the service layer and executes them. The service layer is implemented as a Web server that provides interface services over HTTPS. The execution layer is realized by a thread pool, a Redis cache, and a distributed lock. The specific design is shown in Figure 12.
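The producer/consumer split described above can be sketched in a single process as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: an in-memory queue and dictionary stand in for the Web service and Redis cache, a `threading.Lock` stands in for the distributed lock, and the class, function, and status names are hypothetical.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


class TaskService:
    """Producer side: creates tasks and tracks their status (stand-in for the service layer)."""

    def __init__(self):
        self._queue = queue.Queue()
        self._status = {}
        self._lock = threading.Lock()  # stand-in for the distributed lock
        self._next_id = 0

    def create_task(self, payload):
        with self._lock:
            task_id = self._next_id
            self._next_id += 1
            self._status[task_id] = "pending"
        self._queue.put((task_id, payload))
        return task_id

    def claim_task(self):
        """Occupy the next pending task, or return None when the queue is empty."""
        try:
            task_id, payload = self._queue.get_nowait()
        except queue.Empty:
            return None
        self.set_status(task_id, "running")
        return task_id, payload

    def set_status(self, task_id, status):
        with self._lock:
            self._status[task_id] = status

    def status(self, task_id):
        with self._lock:
            return self._status[task_id]


def run_workers(service, handler, num_workers=4):
    """Consumer side: a thread pool that drains the queue and executes each task."""
    results = {}
    results_lock = threading.Lock()

    def worker():
        while True:
            claimed = service.claim_task()
            if claimed is None:
                return
            task_id, payload = claimed
            output = handler(payload)
            with results_lock:
                results[task_id] = output
            service.set_status(task_id, "done")

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(worker) for _ in range(num_workers)]
        for future in futures:
            future.result()  # re-raise any worker exception
    return results


# Hypothetical usage: each task is a frame name; the handler stands in for detection.
service = TaskService()
task_ids = [service.create_task(f"frame_{i}") for i in range(8)]
results = run_workers(service, handler=lambda name: name.upper())
print(results)
```

Decoupling task creation from execution in this way lets image capture continue while detection runs, which is the resource-utilization benefit the framework aims for.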

Model Training
The details of the proposed model are shown in Table 4. The model is built with PyTorch 1.5.1 and trained on an RTX 3070 Ti GPU (NVIDIA Corporation, Santa Clara, CA, USA).
The training hyperparameter settings are shown in Table 5. The parameters are set so that the training model converges. The specific training configuration of the SSLD scheme is as follows. Because ImageNet is used for training, data augmentation is carried out through random-size cropping to 224 × 224 pixels and random horizontal flipping. For optimization, we adopt the SGD optimizer with the momentum set to 0.9 and the batch size set to 256. In the student model, the weight decay factor of the improved LCNet is set to 2 × 10^−5, and the initial learning rate is set to 0.1. Meanwhile, the warm-up strategy and the cosine learning rate updating strategy are used to ensure the stability of the model [35,36]. In the first stage, the number of epochs for distillation training using the labeled and unlabeled datasets is set to 300. To improve the distillation efficiency, LCNet is additionally trained for 30 epochs using only labeled data to optimize the network.
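The warm-up and cosine learning rate strategy can be sketched as a function of the epoch index. The base learning rate of 0.1 and the 300 epochs come from the configuration above; the number of warm-up epochs (5 here) and the linear warm-up shape are assumptions made for illustration, since they are not stated in the text.

```python
import math


def learning_rate(epoch, total_epochs=300, base_lr=0.1, warmup_epochs=5):
    """Linear warm-up followed by cosine decay.

    warmup_epochs is an assumed value; the source does not specify it.
    """
    if epoch < warmup_epochs:
        # Ramp linearly up to base_lr over the warm-up epochs.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine decay from base_lr toward 0 over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


schedule = [learning_rate(epoch) for epoch in range(300)]
for epoch in (0, 4, 5, 150, 299):
    print(f"epoch {epoch:3d}: lr = {schedule[epoch]:.6f}")
```

The warm-up phase avoids large, unstable updates at the start of training with a batch size of 256, and the cosine phase lowers the rate smoothly toward the end.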

Comparison and Evaluation
In this section, we employ the Microsoft COCO evaluation metrics, which are widely used for assessing object detection tasks. The model is trained on the divided training set and evaluated on the validation set. The Microsoft COCO evaluation metrics include detection accuracy indexes for targets of different scales, all of which are based on AP and AR [21]. Among them, mAP is the average precision over all categories, obtained as a weighted average of the AP of all target categories, and mAR is the average recall over all categories, calculated in a similar way.
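Two building blocks of these metrics can be sketched briefly: the intersection over union (IoU) used to match predictions to ground truth, and the COCO area ranges that define small, medium, and large targets (below 32², between 32² and 96², and above 96² pixels). The box format (x1, y1, x2, y2), the function names, and the equal category weights in the averaging helper are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def coco_scale(box):
    """Classify a box as small, medium, or large using the COCO area ranges."""
    x1, y1, x2, y2 = box
    area = (x2 - x1) * (y2 - y1)
    if area < 32 ** 2:
        return "small"
    if area <= 96 ** 2:
        return "medium"
    return "large"


def mean_over_categories(per_category_ap):
    """Average AP over categories (equal weights assumed here)."""
    return sum(per_category_ap.values()) / len(per_category_ap)


# COCO averages AP over IoU thresholds 0.50, 0.55, ..., 0.95.
iou_thresholds = [0.5 + 0.05 * i for i in range(10)]

# Hypothetical prediction and ground-truth boxes.
overlap = iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))
print("IoU:", overlap, "scale:", coco_scale((0.0, 0.0, 20.0, 20.0)))
```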
Several mainstream object detection models are trained on the dataset constructed in this paper, and their metrics are compared. The specific experimental results are shown in Table 6. As shown in Figure 13, forest fire smoke targets of different scales can be recognized well with high confidence. Table 6 shows that the proposed LCNet incurs only a slight accuracy drop, a 2.8% mAP reduction, compared with the original model. However, it greatly improves the parameter count and inference speed, with the parameters reduced to one-fourth and the inference speed increased by 3.4 times. Compared with YOLOv5s, which is known for its robust performance, there is a 4.2% improvement in mAP and a 5.8% increase in APs, indicating improvements in both overall detection accuracy and the ability to detect smaller targets. Based on the comparative experiments, the detection accuracy and running speed of the improved model are better than those of the commonly used mainstream models, and the lightweight feature extraction network and knowledge distillation scheme have achieved remarkable results. We transplanted the model to a hardware platform and established an interactive interface to monitor forest fire smoke. Figure 14 shows the results of monitoring using the proposed system on Jetson NX.

Ablation Experiments
To assess the impact of the h-swish function and the SE module in LCNet, ablation studies were conducted to analyze their effects on model performance.In the experiment, when the h-swish activation function is not used, the network uses the ReLU function instead of the knowledge distillation scheme to compress the model.The experimental results are shown in Table 7.The ablation experiment results show that using the h-swish function improves model performance without increasing inference time.The SE module can enhance the average accuracy of the model by strengthening key information and suppressing irrelevant information, which verifies the effectiveness of the improvement.
The SE module makes full use of the channel attention mechanism to improve model performance, but excessive use of the SE module slows down model inference [22]. Positioning the SE module appropriately achieves a favorable accuracy-speed tradeoff. In this paper, ablation experiments were carried out to analyze the influence of placing the SE module at different positions in the model. The specific experimental results are shown in Table 8. In Table 8, 12-bit binary numbers represent the placement of the SE module within the model, from the first layer to the twelfth. In this representation, a "1" indicates that the SE module is used in that layer, whereas a "0" denotes its absence. The results show that, when the SE module is used in only two consecutive layers, the largest improvement is obtained by placing it in the last two layers of the model. Although the average accuracy is slightly higher when the SE module is used in every layer, placing it only in the last two layers gives a more reasonable time consumption. Therefore, we use the SE module in the last two layers of the model.
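The 12-bit placement code used in Table 8 can be decoded with a few lines. This sketch assumes that the leftmost bit corresponds to the first layer, as the description "from the first layer to the twelfth" suggests; the function name is illustrative.

```python
def se_layers(mask):
    """Return the 1-based indices of the layers that use the SE module.

    mask: 12-character string of '0' and '1'; the leftmost bit is assumed
    to correspond to the first layer.
    """
    if len(mask) != 12 or set(mask) - {"0", "1"}:
        raise ValueError("mask must be a 12-character string of 0s and 1s")
    return [index + 1 for index, bit in enumerate(mask) if bit == "1"]


# SE module only in the last two layers, the configuration adopted in this paper.
print(se_layers("000000000011"))
```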

Discussion
Smoke often serves as the initial indicator of a fire. Accordingly, the early detection of smoke can enable the identification of forest fires sooner than the detection of flames. The proposed forest fire smoke detection model exhibits good performance on the self-built dataset. However, it still suffers from a large number of parameters and slow inference speed, which cannot meet the practical application requirements. As the feature extraction ability of a model improves, the corresponding increase in parameters makes it difficult to use edge devices for real-time detection. Therefore, it is necessary to develop an excellent lightweight network model for detection on edge devices. In addition, a high-precision model can be adapted to meet practical application requirements through techniques such as model pruning, quantization, and knowledge distillation [37].
After calculating the parameters of each part of the model, we find that the ResNet50 used in the original model is the main reason for the excessive parameters. To meet the demand for real-time, high-precision detection, we choose a lightweight network to optimize model performance. In this paper, an improved LCNet is designed for the Transformer architecture to replace the original feature extraction network. By reconstructing the network architecture, employing superior activation functions, and integrating the SE module, the system becomes better suited for real-time forest fire smoke detection tasks.
From the comparative experiment in Table 6, it can be found that the model still maintains superior detection accuracy and greatly improves the speed after using the improved LCNet. Fewer network layers, no residual structure, and fast calculation of the activation function are the key points in achieving the speed breakthrough. Compared with the SSLD scheme, the use of LCNet brings a greater speed and accuracy improvement to the model. LCNet uses a better network structure than the original model, and the SSLD scheme allows for compressing the model parameters and improving the accuracy slightly.
The effectiveness of the proposed improvements is verified by the ablation experiments. As shown in Table 7, the h-swish activation function and the SE module can effectively improve the model's ability. The SE module enhances key features and suppresses irrelevant ones, significantly boosting the accuracy of smoke detection across all scales. The SE module improves model performance by transforming the feature maps, whereas the h-swish function reduces the loss of information in training and inference. The SE module directly rescales the feature maps, which has a great impact on the inference of subsequent modules, while the h-swish function improves the model capability indirectly through more efficient computation. This improvement is evident from the model's capacity to accurately identify various types of smoke targets in the detection samples presented in Figure 13.
We also study the influence of the SE module's position on model performance. Based on the analysis in Table 8, it can be concluded that using the SE module in the last two layers gives the best effect. The reason may be that when the SE module is used in the first few layers, it is easily influenced by the subsequent network layers, which reduces its role.
Finally, we design a forest fire smoke monitoring system based on the Jetson NX platform, which is suitable for edge equipment. It makes full use of hardware resources by using the transplanted model and the asynchronous task framework. However, resources such as memory access cost and platform characteristics differ between hardware platforms, which may degrade the model's effectiveness. For different hardware platforms, users can fine-tune the model structure according to their needs to adapt to different memory access costs, computing efficiency, and storage requirements.

Conclusions
In this paper, we propose a forest fire smoke monitoring system based on a lightweight model, which can be deployed on edge devices for real-time monitoring. First, we devise a Transformer-based lightweight model for forest fire smoke detection and propose the lightweight feature extraction network LCNet. The stacked depthwise separable convolution modules reduce the model parameters while improving the feature extraction ability. The h-swish activation function and the re-parameterization strategy also improve the computational efficiency of the model. Second, we apply the SSLD knowledge distillation approach to compress and optimize the model, which transfers the superior ability of the large-scale model to our model and eliminates some redundant parameters. Finally, a forest fire smoke monitoring system based on Jetson NX equipment is designed, and the model is transplanted and deployed. Extensive experiments show that the lightweight model proposed in this paper achieves better accuracy and speed than the mainstream models in forest fire smoke detection tasks, and that the designed system can carry out real-time monitoring tasks on edge devices.
Next, we plan to use multi-modal fusion technology and integrate information such as polarized cameras and sensors for analysis, which will be helpful for rapid edge device inspection and early warning of fire.

Forests 2024, 19

Figure 1. Samples from the self-built dataset (images contain smoke objects in various scenes, colors, and scales).

Figure 2. Proposed model architecture in this paper.

Figure 8. The basic framework of knowledge distillation.

Figure 9. The procedure of simple semi-supervised label knowledge distillation.

Figure 10. The framework of the monitoring system.

2.4.1. Hardware Devices
NVIDIA Jetson NX is an efficient embedded AI computing platform that makes it possible to realize real-time deep learning inference on embedded devices. NVIDIA Jetson NX includes the Jetson Xavier NX core module, which provides powerful computing power. The platform is shown in Figure 11.

Figure 13. Detection results of our model. The first row shows large-scale smoke images; the second row shows small-scale smoke images.

Table 1. Overview of our forest fire smoke dataset.

Table 2. Architecture details of LCNet used in this paper.

Table 6. Comparison of experimental results. ResNet50 as the backbone. YOLOv5s uses C3+SPPF as a backbone. "+" indicates that the ablation experiment is conducted on the baseline; "++" indicates that it is conducted on top of the previous ablation experiment. The speed is the inference time consumption of the model on the Jetson NX platform.

Table 7. Different improved ablation experiments of LCNet.

Table 8. Ablation experiment using the SE module in different positions.