Real-Time Forest Fire Detection by Ensemble Lightweight YOLOX-L and Defogging Method

Forest fires can destroy forest and inflict great damage to the ecosystem. Fortunately, forest fire detection with video has achieved remarkable results in enabling timely and accurate fire warnings. However, the traditional forest fire detection method relies heavily on artificially designed features; CNN-based methods require a large number of parameters. In addition, forest fire detection is easily disturbed by fog. To solve these issues, a lightweight YOLOX-L and defogging algorithm-based forest fire detection method, GXLD, is proposed. GXLD uses the dark channel prior to defog the image to obtain a fog-free image. After the lightweight improvement of YOLOX-L by GhostNet, depth separable convolution, and SENet, we obtain the YOLOX-L-Light and use it to detect the forest fire in the fog-free image. To evaluate the performance of YOLOX-L-Light and GXLD, mean average precision (mAP) was used to evaluate the detection accuracy, and network parameters were used to evaluate the lightweight effect. Experiments on our forest fire dataset show that the number of the parameters of YOLOX-L-Light decreased by 92.6%, and the mAP increased by 1.96%. The mAP of GXLD is 87.47%, which is 2.46% higher than that of YOLOX-L; and the average fps of GXLD is 26.33 when the input image size is 1280 × 720. Even in a foggy environment, the GXLD can detect a forest fire in real time with a high accuracy, target confidence, and target integrity. This research proposes a lightweight forest fire detection method (GXLD) with fog removal. Therefore, GXLD can detect a forest fire with a high accuracy in real time. The proposed GXLD has the advantages of defogging, a high target confidence, and a high target integrity, which makes it more suitable for the development of a modern forest fire video detection system.


Introduction
Forest fire, as one of the most frequent and serious natural disasters, not only destroys the forest, but also causes extensive damage to the ecosystem [1]. Forest fire occurs frequently in China. According to the statistics of the Fire Rescue Bureau of the Ministry of Emergency Management of China, 616 forest fires in China destroyed approximately 4292 hectares of forest just in 2021 [2]. If a forest fire is not detected in time, it can easily cause an uncontrollable disaster, resulting in more casualties and economic losses [3]. Therefore, accurate, efficient, and timely forest fire detection is imperative to prevent the forest fire.
At present, several forest fire detection methods have been implemented, such as manual patrol [4], the satellite remote sensing-based method [5], and the video monitoring-based method [6]. Among them, manual patrol requires the forest ranger to continuously patrol the forest and report the fire in time [7]. However, the patrol area is limited, and it is difficult to achieve all-weather monitoring [8]. Optical satellite remote sensing can detect a forest fire over a wide spatial range; however, it is difficult to monitor the forest fire with a high spatial resolution in real time due to the conflict between the spatial and temporal resolution of satellite remote sensing systems [9].
[...] fire detection performance of the forest fire video detection system and reduce the construction cost.

Study Area
The study areas in this paper are Mianning County and Xide County, Liangshan Prefecture, Sichuan Province, China. Their geographical locations are shown in Figure 1. Mianning County and Xide County are located in the mountainous area south of the Sichuan Basin, in the subtropical monsoon climate zone. The average altitude of the whole region is over 1500 m. Solar radiation is high during the day, and the temperature difference between day and night is large. Every year from January to June, prolonged hot, dry weather with little precipitation makes forest fires likely to occur. Mianning County and Xide County planned to start planned burning in January 2022, which allowed us to collect a large amount of real and effective forest fire data within a month.

Figure 1. Location of the study area (Mianning County and Xide County); the background color map marks the study area.

Forest Fire Dataset
The specific process of data processing for the forest fire dataset is shown in Figure 2. Data collection in this study mainly includes open-source forest fire data, field experiment data, and forest fire video monitoring data provided by the local forestry and grass bureau. Establishment of the forest fire dataset mainly includes data preprocessing, network training, and test data processing; we will introduce the data processing procedure in detail.

Data Collection
In order to obtain more data from the early stage of forest fires, the data were collected in the planned burning areas of Mianning County and Xide County from January 4 to January 6, 2022. To ensure data diversity, data at multiple angles and distances were captured with a digital single-lens reflex (DSLR) camera and an unmanned aerial vehicle (UAV); the shooting distance of the camera was set between 2 and 5 km, and the UAV flight altitude was kept between 50 and 150 m. The specific information of the capture devices is shown in Table 1. We captured 103 videos and 115 images, with a total size of 10.0 GB. The specific information of the 11 captured areas is shown in Table 2. During the collection process, the weather was sunny on January 4 and 5, but cloudy and foggy on January 6. Hand-held GPS equipment was used to record the longitude and latitude of each capture location. In addition, 205 forest fire monitoring videos with a total size of 94.7 GB, recorded from December 2021 to January 2022, were acquired from the local forestry and grass bureau. The specific information of the four areas captured by the video monitoring system is shown in Table 3.

The open-source forest fire data of the national laboratory of fire science, University of Science and Technology of China [22] and the Bilkent EE Signal Processing group [23] were screened to obtain 1147 images that feature high image resolution, long monitoring distances, and early-stage fires, and are therefore suitable for this experiment.

Establishment of Forest Fire Dataset
In the phase of data preprocessing, the image data are cropped to remove watermarks and blurred areas. We take a screenshot of each video every 3 min to obtain more image data with different smoke shapes. Because the images obtained from the screenshots are real forest fire images, they are not cropped, so as to make full use of such data. Part of the data used is shown in Figure 3. The specific information of the dataset is shown in Table 4.
In the phase of network training and test data processing, the open-source tool LabelImg is used to label the images, and the dataset is divided into a training set and a test set at a 9:1 ratio for network training and testing. To ensure the validity of the test set, we selected forest fire images from different scenes as the test data.

The Proposed Forest Fire Detection Method
The overall design process of GXLD (GhostNet-YOLOX-L-Light-Defog) is shown in Figure 4. The core parts of GXLD are YOLOX-L-Light and the dark channel defogging method. YOLOX-L-Light is a lightweight version of YOLOX-L, obtained by replacing the Backbone network with GhostNet, replacing some ordinary convolutions in the Neck and Prediction modules with depth separable convolutions, and integrating the SE attention mechanism at the Backbone output. The dark channel defogging method, which is based on the dark channel prior, obtains the fog-free image by calculating the dark channel image, estimating the transmittance, and calculating the atmospheric light value. These two core parts are described in detail below.

YOLOX-L-Light
The YOLOX network is a target detection framework proposed by Megvii in 2021 [24], which is mainly an improvement of the YOLOv3 network. The improvements mainly concern the backbone network structure, the decoupled head for classification and regression, the anchor-free mechanism, and the dynamic matching of positive samples. The YOLOX network is composed of four modules: Input, Backbone, Neck, and Prediction. Two powerful data augmentation techniques, Mixup [25] and Mosaic [26], are used at the input. Mosaic can effectively improve the detection of small targets, and Mixup is an additional augmentation strategy applied on top of Mosaic. The Backbone of the YOLOX network is consistent with that of the original YOLOv3 [27] network, that is, Darknet53 is adopted. The Neck also adopts the Feature Pyramid Network (FPN) structure for feature fusion. Prediction consists of the decoupled head, the anchor-free detector, the label assignment strategy, and the loss calculation. YOLOX can be divided into standard and lightweight network structures by adjusting the width and depth of the network. In this paper, the YOLOX-L network, which has the best performance among the standard network structures, is selected, and lightweight improvements are made to it to obtain YOLOX-L-Light.
The lightened YOLOX-L model (YOLOX-L-Light) is shown in Figure 5. Firstly, we replace the Backbone of the YOLOX-L network with GhostNet. GhostNet [28] maintains comparable recognition performance while reducing the number of convolution operations. GhostNet can surpass MobileNet [29] and SSD [30] in accuracy and efficiency with relatively few network parameters. We use GhostNet as the feature extraction network of YOLOX-L. As shown in Figure 5, Conv in GhostNet represents a two-dimensional convolution of the input feature map, and Ghost BN represents the Ghost bottleneck, which is the basic unit of GhostNet. Feat1, Feat2, and Feat3 represent the feature maps at three scales: 80 × 80 × 40, 40 × 40 × 112, and 20 × 20 × 160, respectively. These outputs are fed into the Neck for feature extraction in the next step.
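To make the parameter saving of a Ghost module concrete, the following is a minimal counting sketch. It assumes the formulation of the GhostNet paper [28], in which a primary convolution produces a fraction of the output channels and cheap depthwise operations generate the rest. The function names, the ratio of 2, the 3 × 3 cheap-operation kernel, and the 112 → 160 channel sizes are illustrative assumptions, not values taken from Figure 5; biases and batch normalization are ignored.

```python
import math


def conv_params(c_in, c_out, k):
    """Weights of an ordinary k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def ghost_module_params(c_in, c_out, k=1, ratio=2, dw_k=3):
    """Weights of a Ghost module (hypothetical helper, bias and BN ignored).

    A primary convolution produces ceil(c_out / ratio) intrinsic feature maps;
    each intrinsic map then yields (ratio - 1) extra "ghost" maps through a
    cheap dw_k x dw_k depthwise operation.
    """
    intrinsic = math.ceil(c_out / ratio)
    primary = conv_params(c_in, intrinsic, k)
    cheap = intrinsic * (ratio - 1) * dw_k * dw_k
    return primary + cheap


# Illustrative layer: 112 input channels, 160 output channels, 3 x 3 kernel.
ordinary = conv_params(112, 160, 3)
ghost = ghost_module_params(112, 160, k=3)
print(ordinary, ghost, round(ordinary / ghost, 2))
```

With a ratio of 2, the Ghost module needs roughly half the weights of the ordinary convolution for the same number of output channels, which is the source of the saving described above.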
To further reduce the parameters of YOLOX-L, we replace ordinary convolution with depth separable convolution. Depth separable convolution [29] differs from ordinary convolution in that it consists of a depthwise convolution followed by a pointwise convolution. Previous studies have shown that replacing the ordinary 3 × 3 convolutions in a CNN with depth separable convolutions can effectively reduce the number of network parameters [29]. Following the placement of depth separable convolution in the YOLOX-nano model, we replace some ordinary convolutions in the Neck and Prediction modules of YOLOX-L. The specific location of the CBS_DW module is shown in Figure 5.
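The parameter reduction from this replacement can be checked with simple arithmetic. The sketch below compares the weight count of an ordinary convolution with that of a depthwise convolution followed by a 1 × 1 pointwise convolution. The function names and the 256-channel layer are illustrative assumptions and do not correspond to a specific CBS_DW module in Figure 5; biases and batch normalization are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Ordinary convolution: every output channel sees every input channel."""
    return c_in * c_out * k * k


def depth_separable_conv_params(c_in, c_out, k):
    """Depthwise k x k convolution (one filter per input channel)
    followed by a 1 x 1 pointwise convolution that mixes channels."""
    depthwise = c_in * k * k
    pointwise = c_in * c_out
    return depthwise + pointwise


# Illustrative layer: 256 -> 256 channels with a 3 x 3 kernel.
std = standard_conv_params(256, 256, 3)
dsc = depth_separable_conv_params(256, 256, 3)
print(std, dsc, round(std / dsc, 2))
```

For a 3 × 3 kernel and a large channel count, the ratio approaches 9, which is why the replacement is most effective in the wide layers of the Neck and Prediction modules.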
The attention mechanism is a structure that improves the network's attention to the spatial and channel information of features. Adding an attention mechanism strengthens the network's ability to extract key features from a large amount of feature information, which can improve the network's performance [31]. At present, the mainstream attention mechanisms can be divided into three types: channel attention, spatial attention [32], and self-attention [33]. SENet is a typical channel attention mechanism [34]; it models the relationship between channels so that the feature layers of different channels receive different weights. As a plug-and-play module, the attention mechanism can, in theory, be placed after any feature layer. SENet is introduced in this study to extract the important features in the output of the Backbone.
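The channel reweighting performed by an SE block can be sketched in a few lines of NumPy. This is a minimal forward pass under stated assumptions, not the implementation used in YOLOX-L-Light: the function name `se_block`, the random weights, and the reduction ratio of 16 are hypothetical, and the feature map uses the 160-channel, 20 × 20 shape of Feat3 only as an example.

```python
import numpy as np


def se_block(x, w1, w2):
    """Squeeze-and-excitation forward pass for one feature map.

    x  : array of shape (C, H, W)
    w1 : array of shape (C // r, C), first fully connected layer
    w2 : array of shape (C, C // r), second fully connected layer
    Returns the reweighted feature map and the per-channel weights.
    """
    z = x.mean(axis=(1, 2))                # squeeze: global average pooling
    h = np.maximum(w1 @ z, 0.0)            # excitation, step 1: FC + ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ h)))    # excitation, step 2: FC + sigmoid
    return x * s[:, None, None], s         # scale each channel by its weight


rng = np.random.default_rng(0)
channels, reduction = 160, 16              # reduction ratio is an assumption
x = rng.standard_normal((channels, 20, 20))
w1 = rng.standard_normal((channels // reduction, channels)) * 0.1
w2 = rng.standard_normal((channels, channels // reduction)) * 0.1
y, weights = se_block(x, w1, w2)
print(y.shape, float(weights.min()), float(weights.max()))
```

Because the output has the same shape as the input, such a block can be placed after any feature layer, which is what makes it plug-and-play.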

Defogging Using Dark Channel Prior Theory
Dark channel prior theory was first proposed by He et al. [35]. They derived a prior rule from experiments on a large number of fog-free images. This rule states that in most clear, fog-free color images, after excluding the sky and some high-brightness areas, each local non-haze region has at least one color channel that contains a large number of pixels (called dark pixels) with an intensity close to 0. This channel is named the dark channel, which is defined by Equations (1) and (2):
$$J^{dark}(x) = \min_{y \in \Omega(x)} \left( \min_{c \in \{r,g,b\}} J^{c}(y) \right) \tag{1}$$
$$J^{dark}(x) \rightarrow 0 \tag{2}$$
where $J^{dark}$ is the dark channel image, $\Omega(x)$ represents the area around the pixel $x$, $J^{c}$ is color channel $c$ of the fog-free image, and $c$ ranges over the red, green, and blue components of the visible light image.
The image taken by the camera consists of two parts. The first part is the light reflected by the target, which is attenuated by scattering and absorption in the atmosphere. The other part is the scattered atmospheric light. The atmospheric scattering model can be expressed as Equation (3):
$$I(x) = J(x)\,t(x) + A\,(1 - t(x)) \tag{3}$$
where $I(x)$ is the foggy image, $J(x)$ is the fog-free image, $A$ is the atmospheric light value, and $t(x)$ is the transmittance. By combining the dark channel prior with the atmospheric scattering model, the fog-free image $J(x)$ in Equation (3) can be recovered. Assuming that the transmittance within a local area is constant and the atmospheric light value $A$ is known, Equation (3) can be divided by the atmospheric light value of each color channel to obtain Equation (4):
$$\frac{I^{c}(x)}{A^{c}} = t(x)\,\frac{J^{c}(x)}{A^{c}} + 1 - t(x) \tag{4}$$
Taking the minimum of both sides of Equation (4) over the color channels and the local area gives Equation (5):
$$\min_{y \in \Omega(x)} \left( \min_{c} \frac{I^{c}(y)}{A^{c}} \right) = t(x) \min_{y \in \Omega(x)} \left( \min_{c} \frac{J^{c}(y)}{A^{c}} \right) + 1 - t(x) \tag{5}$$
where $t(x)$ is constant in the area around pixel $x$ and is therefore not minimized, $A^{c}$ is the atmospheric light value of color channel $c$, and $J$ is the fog-free image to be obtained.
According to Equations (1) and (2), the dark channel of the fog-free image approaches 0, so the first term on the right-hand side of Equation (5) vanishes and we can deduce Equation (6):
$$t(x) = 1 - \min_{y \in \Omega(x)} \left( \min_{c} \frac{I^{c}(y)}{A^{c}} \right) \tag{6}$$
In order to make the defogged image look more natural, it is necessary to preserve the depth-of-field information in the image. Therefore, a constant coefficient $\omega$ is introduced into Equation (6), and a rough transmittance is obtained by means of Equation (7):
$$t(x) = 1 - \omega \min_{y \in \Omega(x)} \left( \min_{c} \frac{I^{c}(y)}{A^{c}} \right) \tag{7}$$
where $\omega$ is usually set to 0.95. A common method for estimating the atmospheric light value $A^{c}$ is to directly take the maximum pixel intensity of the image. This method is simple and often effective. However, an outdoor image may contain a large proportion of sky or gray-white objects, which strongly interferes with this estimate and results in a large deviation between the estimated atmospheric light value and the real scene. The dark channel defogging method instead first extracts the 10% of pixels with the lowest values in the previously obtained transmittance image. These pixels have the highest fog concentration, and their gray value can be taken as an approximation of the atmospheric light value.
With the transmittance $t(x)$ and the atmospheric light value $A$ obtained from the previous steps, the fog-free image can be recovered by substituting the two values into Equation (8):
$$J(x) = \frac{I(x) - A}{\max(t(x),\, t_{0})} + A \tag{8}$$
where $t_{0}$ is the lower bound of the transmittance. In order to prevent the overall whitening of the image caused by very small values of $t(x)$, it is generally set to 0.1.
The defogging result of the method is shown in Figure 6. The dark channel defogging method removes fog effectively while retaining the characteristics of thick smoke, which provides images with less fog interference for subsequent forest fire detection.
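The defogging steps described in this section (dark channel, atmospheric light, transmittance, recovery) can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions rather than the implementation used in GXLD: images are assumed to be floating-point RGB arrays in [0, 1], the 15 × 15 patch and the lower bound t0 = 0.1 follow the typical values suggested by He et al. [35], and picking the brightest of the haziest 10% of pixels as the atmospheric light is one possible reading of the description above. All function names are hypothetical.

```python
import numpy as np


def min_filter(channel, patch):
    """Local minimum over a patch x patch window (patch must be odd)."""
    r = patch // 2
    padded = np.pad(channel, r, mode="edge")
    h, w = channel.shape
    out = np.full((h, w), np.inf)
    for dy in range(patch):
        for dx in range(patch):
            out = np.minimum(out, padded[dy:dy + h, dx:dx + w])
    return out


def dark_channel(img, patch=15):
    """Minimum over the color channels, then over the local patch."""
    return min_filter(img.min(axis=2), patch)


def atmospheric_light(img, dark, fraction=0.10):
    """Brightest pixel among the haziest `fraction` of pixels (assumption)."""
    n = max(1, int(dark.size * fraction))
    idx = np.argsort(dark.ravel())[-n:]           # largest dark-channel values
    candidates = img.reshape(-1, 3)[idx]
    return candidates[candidates.sum(axis=1).argmax()]


def defog(img, omega=0.95, t0=0.1, patch=15, fraction=0.10):
    """Return the recovered image and the clamped transmittance map."""
    dark = dark_channel(img, patch)
    A = np.maximum(atmospheric_light(img, dark, fraction), 1e-6)
    t = 1.0 - omega * dark_channel(img / A, patch)    # rough transmittance
    t = np.maximum(t, t0)                             # lower bound t0
    J = (img - A) / t[..., None] + A                  # scene recovery
    return np.clip(J, 0.0, 1.0), t


rng = np.random.default_rng(0)
hazy = rng.uniform(0.2, 1.0, size=(32, 32, 3))        # synthetic stand-in image
clear, transmittance = defog(hazy, patch=7)
print(clear.shape, float(transmittance.min()), float(transmittance.max()))
```

The rough transmittance is used directly here; a refinement step such as guided filtering is commonly added in practice but is not described in this section, so it is omitted.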

Experimental Setting and Evaluation Index
In this study, all the experiments were performed on a Windows 10 64-bit platform with an Intel Core i7-8700 CPU, 16 GB of RAM, and an NVIDIA GeForce RTX3060 graphics card with 12 GB of VRAM. The proposed model is implemented with the PyTorch 1.2.0 deep learning framework. The device configuration used in this experiment is listed in Table 5.
During the experiment, we used the same hyperparameters to train YOLOX-L-Light, YOLOX-L, YOLOX-Tiny, YOLOv4, and YOLOv4-Tiny. The specific values of the hyperparameters are listed in Table 6. In addition, we also included YOLOv4-Light proposed by Fan [36] and trained it with the same hyperparameters.
To evaluate the performance of the network models and GXLD, the mean average precision (mAP) of Pascal VOC was used to evaluate the detection accuracy. mAP is the mean of the average precision (AP) values calculated for each category. AP is a general evaluation index in target detection that assesses the accuracy of both classification and positioning. Classification judges whether the prediction is smoke or flame, and positioning judges whether the intersection over union (IoU) between the predicted box and the manually labeled box meets the requirement. The AP value is the area under the precision-recall curve, where precision and recall are defined by Equations (9) and (10):
$$Precision = \frac{TP}{N_{d}} \tag{9}$$
$$Recall = \frac{TP}{N_{g}} \tag{10}$$
where $TP$ is the number of correctly detected targets (true positives) in the detection results, $N_{d}$ is the number of detection boxes after non-maximum suppression, and $N_{g}$ is the number of manually labeled boxes.
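As an illustration of how the IoU test feeds into precision and recall, the sketch below matches detections to labeled boxes and evaluates both ratios. It is a simplified version of the Pascal VOC procedure for a single image and a single class; the IoU threshold of 0.5, the greedy matching by descending confidence, the box format (x1, y1, x2, y2), and all names are assumptions made for this example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(detections, ground_truth, iou_thr=0.5):
    """detections: list of (score, box) after NMS; ground_truth: list of boxes.

    Each labeled box can be matched by at most one detection, so duplicate
    detections of the same target count as false positives.
    """
    matched = set()
    tp = 0
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_j, best_iou = -1, iou_thr
        for j, gt in enumerate(ground_truth):
            if j in matched:
                continue
            value = iou(box, gt)
            if value >= best_iou:
                best_j, best_iou = j, value
        if best_j >= 0:
            matched.add(best_j)
            tp += 1
    precision = tp / len(detections) if detections else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    return precision, recall


labels = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(0.9, (1, 1, 10, 10)), (0.8, (50, 50, 60, 60)), (0.6, (20, 20, 29, 30))]
p, r = precision_recall(dets, labels)
print(round(p, 3), round(r, 3))
```

In this toy case two of the three detections match a labeled box, so precision is 2/3 and recall is 1. Sweeping the confidence threshold and integrating precision over recall would then give the AP of the category.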

The Experiment Results of the YOLOX-L-Light
Firstly, the number of model parameters is evaluated, which is calculated by using the summary module of the Python deep learning framework. The results are shown in Table 7. The results indicate that YOLOX-L-Light has fewer parameters not only than YOLOX-L, YOLOv4, and YOLOv4-Light, but also than the official lightweight YOLOX-Tiny and YOLOv4-Tiny. This shows that the proposed lightweight strategies can greatly reduce the number of network parameters. In order to compare the average precision (AP) and the mean average precision (mAP) of each model, we tested all the trained models on the same dataset; the statistical results of detection accuracy are also shown in Table 7. According to the results, the mAP of all models is more than 85%, indicating that the models work well for forest fire detection. Among them, YOLOX-L-Light has the highest mAP (86.81%), which is 1.8% higher than that of YOLOX-L and 0.78% higher than that of YOLOv4-Light. In addition, the AP of each category of YOLOX-L-Light is higher than that of the other models: the AP of its flame category is 84%, and the AP of its smoke category is 89.62%. The results indicate that the improved lightweight network YOLOX-L-Light can effectively increase the accuracy of forest fire detection with fewer parameters.
Ablation experiments were conducted on the improved structure to demonstrate the effectiveness of each of the proposed improvements to the YOLOX-L network. The experimental results are shown in Table 8, where GhostNet-YOLOX-L-dsc is the network obtained by introducing GhostNet and depth separable convolution into YOLOX-L, GhostNet-YOLOX-L-SE is the network obtained by introducing GhostNet into YOLOX-L and integrating the SE attention mechanism, and YOLOX-L-dsc-SE is the network obtained by introducing depth separable convolution into YOLOX-L and integrating the SE attention mechanism. The results of the ablation experiments indicate that the introduction of GhostNet effectively improves the accuracy of the network and removes most of the network parameters. The introduction of depth separable convolution further reduces the number of network parameters without reducing the accuracy of the network. The introduction of the SE attention mechanism effectively improves the network accuracy.

The Experiment Results of the GXLD
We tested GXLD on the test dataset and obtained the detection accuracy statistics shown in Table 9. The mAP of GXLD is 87.47%, which is 2.46% higher than that of the original YOLOX-L and 0.66% higher than that of YOLOX-L-Light. The specific detection results are shown in Figure 7.
In order to verify the real-time performance of GXLD, we manually selected 12 monitoring videos and 12 camera-shot videos for the FPS test, 11 of which were recorded in a foggy environment. The average duration of each video is 3 min, and the average original FPS of the videos is 29.33. We adjusted the input image size to 1280 × 720 and 720 × 480, respectively, using the resize operation in OpenCV; the frame extraction interval of GXLD was set to 8, which means that one frame is detected every eight frames. As shown in Table 10, when the input image size is 1280 × 720, the maximum FPS of GXLD is 30.51, the minimum FPS is 25.14, and the average FPS is 26.33. When the input image size is 720 × 480, the maximum FPS of GXLD is 68.12, the minimum FPS is 50.51, and the average FPS is 56.41. This shows that GXLD can achieve real-time detection at input sizes of both 1280 × 720 and 720 × 480.
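The frame extraction setting described above (one detection every eight frames) can be illustrated with a small timing loop. How GXLD implements this internally is not stated, so the sketch below is only an assumption: the function `run_with_frame_skip` and the stand-in detector are hypothetical, intermediate frames simply reuse the most recent detection result, and video decoding and resizing are left out.

```python
import time


def run_with_frame_skip(frames, detect, interval=8):
    """Run `detect` on every `interval`-th frame and reuse the result in between.

    Returns the measured frames per second and the number of detector calls.
    """
    last_result = None
    calls = 0
    processed = 0
    start = time.perf_counter()
    for index, frame in enumerate(frames):
        if index % interval == 0:
            last_result = detect(frame)    # full defogging + detection step
            calls += 1
        # Intermediate frames would be displayed with `last_result` overlaid.
        processed += 1
    elapsed = time.perf_counter() - start
    fps = processed / elapsed if elapsed > 0 else float("inf")
    return fps, calls


def dummy_detect(frame):
    """Stand-in for the real detector; returns an empty list of boxes."""
    return []


fps, detector_calls = run_with_frame_skip(range(24), dummy_detect, interval=8)
print(detector_calls)
```

With 24 frames and an interval of 8, the detector runs three times, so the cost of defogging and detection is spread over eight displayed frames.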
According to the data in Tables 9 and 10, GXLD has an excellent forest fire detection effect and real-time detection capability. In addition, GXLD has certain advantages in target confidence and target integrity. In Figures 8 and 9, the left image shows the detection result of YOLOX-L-Light, and the right image shows the detection result of GXLD. Figure 8 shows that, at the same moment, GXLD effectively detects the smoke, while YOLOX-L-Light does not. In Figure 9, although both models detect the smoke, GXLD's target confidence is 0.86, while YOLOX-L-Light's target confidence is 0.78. In terms of target integrity, the box predicted by GXLD frames the smoke more completely.

Conclusions
This research proposes a lightweight forest fire detection method (GXLD) with fog removal. GXLD can achieve real-time and high-accuracy forest fire detection. It has the advantages of defogging, a high target confidence, and a high target integrity, which make it more suitable for the development of a modern forest fire video detection system.
First, a high-quality forest fire dataset is built from open-source datasets, outdoor experiment data, and video monitoring data. Then, a lightweight model, YOLOX-L-Light, is proposed by improving YOLOX-L. With the same hyperparameters, we trained and tested YOLOX-L-Light, YOLOX-L, YOLOv4, YOLOv4-Tiny, and YOLOv4-Light. The experimental results show that the proposed YOLOX-L-Light outperforms the other models in terms of both precision (mAP = 86.13%) and parameter quantity (about 4 MB). The ablation experiments show that the proposed lightweight strategies can significantly reduce the number of network parameters and enhance the feature extraction ability of the network.
In addition, this study combines YOLOX-L-Light with the dark channel defogging method to obtain GXLD and evaluate its performance. The results show that the mAP of GXLD on the test dataset is 87.47%. The average fps is 26.33 when the input image size is 1280 × 720. GXLD also has excellent performance in target confidence and target integrity.
In the experiment, we also found there are still some limitations in GXLD. The detection performance of GXLD is poor in a very serious dense foggy scene. The future research will make more in-depth lightweight improvement on YOLOX-L-Light and needs to

Conclusions
This research proposes a lightweight forest fire detection method (GXLD) with fog removal. GXLD can achieve real-time and high accuracy forest fire detection. It has the advantages of defogging, a high target confidence, and a high target integrity, which are more suitable for the development of a modern forest fire video detection system.
First, a high-quality forest fire dataset is built with open-source datasets, an outdoor experiment dataset, and a video monitoring data system. Then, a lightweight method YOLOX-L-Light model is proposed by improving YOLOX-L. With the same hyperparameter, we trained and tested YOLOX-L-Light, YOLOX-L, YOLOV4, YOLOV4 Tiny, and YOLOV4-Light. Experiment results show that the proposed YOLOX-L-Light outperforms other models in terms of both precision (mAP = 86.13%) and parameter quantity (about 4 MB). The ablation experiment proved that the proposed lightweight strategies can significantly reduce the number of network parameters and enhance the network feature extraction ability.
In addition, this study combines YOLOX-L-Light with the dark channel defogging method to obtain GXLD and evaluate its performance. The results show that the mAP of GXLD on the test dataset is 87.47%. The average fps is 26.33 when the input image size is 1280 × 720. GXLD also has excellent performance in target confidence and target integrity.
In the experiment, we also found there are still some limitations in GXLD. The detection performance of GXLD is poor in a very serious dense foggy scene. The future research will make more in-depth lightweight improvement on YOLOX-L-Light and needs to

Conclusions
This research proposes GXLD, a lightweight forest fire detection method with fog removal. GXLD achieves real-time, high-accuracy forest fire detection. It offers defogging, a high target confidence, and a high target integrity, which make it more suitable for the development of a modern forest fire video detection system.
First, a high-quality forest fire dataset is built from open-source datasets, an outdoor experiment dataset, and video monitoring data. Then, a lightweight model, YOLOX-L-Light, is proposed by improving YOLOX-L. Using the same hyperparameters, we trained and tested YOLOX-L-Light, YOLOX-L, YOLOV4, YOLOV4 Tiny, and YOLOV4-Light. The experimental results show that the proposed YOLOX-L-Light outperforms the other models in terms of both precision (mAP = 86.13%) and parameter quantity (about 4 MB). The ablation experiment shows that the proposed lightweight strategies significantly reduce the number of network parameters and enhance the feature extraction ability of the network.
In addition, this study combines YOLOX-L-Light with the dark channel defogging method to obtain GXLD and evaluates its performance. The results show that the mAP of GXLD on the test dataset is 87.47%, and that its average FPS is 26.33 when the input image size is 1280 × 720. GXLD also performs well in terms of target confidence and target integrity.
In the experiments, we also found that GXLD still has some limitations: its detection performance is poor in scenes with very dense fog. Future research will pursue further lightweight improvements to YOLOX-L-Light and a more in-depth study of the defogging method, in order to achieve a forest fire detection method with better performance and to serve forest fire prevention and control.