Real-Time Detection of Full-Scale Forest Fire Smoke Based on Deep Convolution Neural Network

Abstract: To reduce the losses caused by forest fires, it is very important to detect forest fire smoke in real time so that early and timely warnings can be issued. Machine vision and image processing technology is widely used for detecting forest fire smoke. However, most traditional image detection algorithms require manual extraction of image features and thus cannot run in real time. This paper evaluates the effectiveness of deep convolutional neural networks for detecting forest fire smoke in real time. The target detection algorithms evaluated are EfficientDet (EfficientDet: Scalable and Efficient Object Detection), Faster R-CNN (Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks), YOLOv3 (You Only Look Once V3), and SSD (Single Shot MultiBox Detector), all advanced CNN (convolutional neural network) models. YOLOv3 showed a detection speed of up to 27 FPS, indicating that it is a real-time smoke detector. A comparison with existing forest fire smoke detection algorithms shows that the deep convolutional neural network algorithms achieve better smoke detection accuracy. In particular, the EfficientDet algorithm achieves an average detection accuracy of 95.7%, the best forest fire smoke detection accuracy among the evaluated algorithms.


Introduction
Forest fire is one of the most frequent and harmful natural disasters in the world [1]. It usually spreads quickly and is difficult to control, causing heavy losses of human life and property. For example, the fires in the Amazon rainforest, the "Lungs of the Earth", burned a total of 4920 square kilometers of rainforest to the ground, an area larger than half a million football fields, bringing incalculable damage to human beings and the natural environment. To control and mitigate forest fires effectively, early warning of their initiation is of particular importance. Compared with flames, smoke appears earlier, spreads faster, and has a larger volume, which makes it easier to identify visually [2][3][4]. The rapid deployment of field surveillance cameras in forests and the growth of computational capacity can substantially reduce the cost of monitoring forest fires with machine vision [5]. This provides a potentially economical and effective way to detect forest fires early, provided that efficient and effective machine vision algorithms are available.
In general, smoke detection algorithms fall into two categories: traditional methods and deep learning methods [6][7][8][9][10][11]. Traditional smoke detection methods are usually based on manually extracted features such as color, texture, shape, and motion. For instance, color features are usually extracted from different color spaces, such as RGB, HSV, and YCbCr, and texture features are extracted by fractal analysis [12][13][14][15][16][17], wavelet decomposition [18][19][20][21][22], Gabor transform [23][24][25][26][27], and histogram of local gradient direction [28][29][30][31] methods. In addition to single-feature extraction, multiple smoke features can be considered together to improve the robustness of the detection algorithm. Han et al. [32] combined a Gaussian mixture model with multiple color spaces in a forest fire detection method with improved detection accuracy. Recently, Gao et al. [33,34] developed a smoke root extraction strategy for full-scale conditions based on a fluid mechanics model, which is effective in early smoke detection, although it still suffers from missed detections of the candidate connected domains of the smoke root in near and far conditions. However, since these traditional algorithms rely on extensive expert knowledge for manual feature selection, they may be highly subjective and complex in operation. In addition, because the manually extracted features vary greatly between scenes, the detection cannot meet the required accuracy and has poor robustness for wide application. To date, there is still a great need for feature extraction algorithms that can cope with complex and changeable field scenes.
In the past decade, wireless communication has made it possible for users to obtain large amounts of remote camera vision data. In addition, continuous advances in computing capacity, especially the development of the graphics processing unit (GPU), have lowered computation costs and brought deep learning neural networks within reach of various applications. Therefore, neural networks have been introduced to develop self-learning algorithms for feature extraction from fire images [1][2][3][4][5][35][36][37][38]. Smoke and flame detection algorithms based on various CNN models such as AlexNet, VGG, Inception, and ResNet have also been investigated [36,38]. Time series information was introduced into the algorithm [6] to detect smoke and flame simultaneously by modifying the VGG network [7]. In addition, multi-layer convolutional neural networks were investigated for detecting smoke and fire [39][40][41].
However, deep learning algorithms also have their limitations. As shown in Figure 1, most existing deep learning algorithms treat fire detection as a classification problem and ignore region identification, so that the entire image is classified into one category. During the early stages of a fire, however, smoke and flames cover only a small portion of the image, and their characteristics are not very obvious. Using features from the entire image without region proposals reduces detection accuracy and delays the detection and alarm of fire events.
To address inaccurate early fire detection, some algorithms generate candidate regions through manually selected features and then classify these regions with neural networks. In such algorithms, the candidate regions are generated by separate calculations, and the detection is not performed end to end by a neural network, resulting in a large amount of computation and slow detection speed. Combining CNNs with these region and feature extraction algorithms may address the above challenges for early fire detection. Although the color, texture, and shape of smoke are effective cues for distinguishing smoke from the background, no single manual feature can cover all scenes. CNN algorithms, in contrast, can automatically extract features from a large number of samples, which can then be passed to other deep learning algorithms for fire detection. As shown in Figure 2, Wang et al. [42] converted RGB images into HSI images and fed them into two residual networks for fire detection. Zhao et al. [43] obtained candidate areas through saliency detection and then determined whether there was smoke in the candidate areas using the AlexNet network [44][45][46]. To investigate which deep CNN algorithm performs best for early fire detection, this paper implements and compares four deep CNN algorithms for real-time fire detection. These algorithms were developed and trained using a large fire database with more than 12,000 images. Based on validation testing, the algorithm with the best detection performance among the four was determined, which can provide alternative ways to detect forest fires with high accuracy in real time.

The Proposed Framework

Convolutional Neural Network
Figure 3 shows the algorithm flow for forest fire smoke detection based on a convolutional neural network (CNN). The detection includes several steps. First, the CNN analyzes the input images and proposes candidate regions through operations such as convolution and pooling. Second, it uses region-based target detection to determine whether there is fire in each proposed region through the convolution layer, the pooling layer, and the fully connected layer. The convolutional layer is the core of the CNN. Unlike other neural networks that use connection weights and weighted sums, the convolution layer uses an image transform filter called a convolution kernel to generate a feature map of the original image. The convolution layer is in fact a set of convolution kernels. Each kernel slides over the image and generates a feature map by weighting the pixels under it and computing new pixel values. Each feature map reflects one aspect of the original image. The output feature map (y) of the convolution layer can be computed as:

$$y_{m,n} = \sum_{j=0}^{J-1} \sum_{i=0}^{I-1} w_{ij}\, x_{m+i,\,n+j} + b \qquad (1)$$
where x represents the input image with a size of W × H; W represents the width of the image; H represents the height of the image; w_{ij} represents the convolution kernel of size J × I; and b represents the bias. In practice, the values of w and b are determined by training on the image data sets.
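The sliding-window computation described above can be illustrated with a minimal pure-Python sketch. It assumes a single-channel image, stride 1, and no padding; the function name `conv2d_valid` and the toy edge kernel are illustrative choices, not taken from the paper.

```python
def conv2d_valid(image, kernel, bias=0.0):
    """Slide `kernel` over `image` (both lists of rows) and return the feature map.

    Assumes stride 1 and no padding, so the output shrinks by
    (kernel size - 1) along each axis.
    """
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    feature_map = []
    for m in range(img_h - k_h + 1):          # top-left row of the window
        row = []
        for n in range(img_w - k_w + 1):      # top-left column of the window
            acc = bias
            for j in range(k_h):
                for i in range(k_w):
                    acc += kernel[j][i] * image[m + j][n + i]
            row.append(acc)
        feature_map.append(row)
    return feature_map


# Toy 3 x 4 image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical 2 x 2 kernel that responds to dark-to-bright vertical edges.
edge_kernel = [[-1, 1], [-1, 1]]

feature_map = conv2d_valid(image, edge_kernel)
print(feature_map)  # [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]
```

The response is non-zero only where the window straddles the edge, which is the kind of simple feature that an early convolutional layer tends to pick up.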
In accordance with Equation (1), Figure 4a-c show an example image of forest fire smoke, the 32 kernels of the first convolutional layer in the Inception ResNet, and the corresponding 32 feature maps of the fire image generated by these kernels. From Figure 4, it can be seen that the number of feature maps equals the number of convolution kernels. For example, if there are three convolution kernels in a layer, three feature maps will be generated. Additionally, the color of each pixel illustrates the degree of activation, with black pixels representing strong negative activation, gray pixels representing weak activation, and white pixels indicating strong positive activation. Compared with the original image in Figure 4a, Figure 4c shows that, in this example, the feature map generated by convolution kernel number 14 of this layer was activated at the edges. In addition, the feature map generated by convolution kernel number 26 was activated on the orange areas. Thus, the early layers mainly learn and extract simple features such as colors and edges. However, this example demonstrates that such simple features may not be able to distinguish fire from disturbances in complex scenes or in scenes with multiple disturbing events. Therefore, it is necessary to develop more advanced fire detection algorithms that can extract complex image features for fire detection in practical scenes.

Deep Convolutional Neural Network
To address the limitations of the CNN algorithm, deep convolutional neural networks can be used to detect forest fire smoke. For the same example as in Figure 4a-c, the kernel samples in the first, third, and sixth convolutional layers of Inception ResNet V2 can be compared.
Specifically, two deep CNN feature extraction networks were selected in this paper: Inception ResNet V2 [11] and Darknet-53 [47]. These networks have 235 and 53 convolutional layers, respectively. In addition, four image target detection networks, namely Faster R-CNN, SSD, YOLOv3, and EfficientDet [47,48], were selected to construct the image fire detection algorithms. These networks are expected to perform well in both detection accuracy and speed. The following sections introduce the fundamentals of these four image target detection networks.

Faster R-CNN
Figure 5 shows the structure of the Faster R-CNN algorithm. The Faster R-CNN has two stages. In the first stage, the feature map of the original image is generated by a feature extraction network such as VGG, ResNet, Inception, or Inception ResNet, and is passed to the region proposal network (RPN). Proposed regions with objectness scores and positions are predicted from the feature maps of selected intermediate convolution layers. This stage outputs only scores that estimate the probability of each proposal being an object or non-object, together with box regression, through a SoftMax layer and a robust loss function (smooth L1). In the second stage, the locations of the proposed regions are used to crop features from the same intermediate feature map through ROI pooling. The regional feature map of each proposed region is fed to the rest of the network to predict class-specific scores and refine the box locations. Such a network shares part of the computation by cropping the first-stage proposals from the feature maps generated by the same intermediate convolution layer. This avoids feeding each proposed region through the front-end CNN to compute its regional feature map. However, each proposed region must still pass through the rest of the network separately. Therefore, the detection speed depends strongly on the number of regions proposed by the RPN. In addition, because the Faster R-CNN is a two-stage target detection network, its detection speed is relatively low.
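As an illustration of the robust box regression loss mentioned above, the following sketch implements the smooth L1 function in its commonly used form (quadratic for errors below 1, linear otherwise). The helper names and the example offsets are hypothetical; the paper does not specify its exact loss configuration.

```python
def smooth_l1(x):
    """Smooth L1 penalty for a single regression error x."""
    ax = abs(x)
    if ax < 1.0:
        return 0.5 * ax * ax   # quadratic near zero: smooth gradients
    return ax - 0.5            # linear for large errors: robust to outliers


def box_regression_loss(predicted, target):
    """Sum of smooth L1 penalties over the four box offsets."""
    return sum(smooth_l1(p - t) for p, t in zip(predicted, target))


# Hypothetical predicted and target box offsets (tx, ty, tw, th).
predicted = [0.0, 0.0, 0.0, 0.0]
target = [0.5, 2.0, 0.0, -3.0]
print(box_regression_loss(predicted, target))  # 4.125
```

Compared with a squared-error loss, a large offset error such as 3.0 contributes 2.5 rather than 4.5, so a few badly localized proposals dominate the gradient less.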

SSD
The SSD is a one-stage target detection network that predicts object classes and locations through a forward CNN. The SSD structure can be divided into three parts:
1. The basic convolutional layers, consisting of VGG, ResNet, Inception, Inception ResNet V2, or another feature extraction network. The middle convolution layer of this part generates a large-scale feature map that can be divided into more cells and has a smaller receptive field for detecting smaller objects;
2. The additional convolutional layers, connected to the last layer of the basic convolutional network, which generate multi-scale feature maps with a larger receptive field for detecting larger objects;
3. The prediction convolution layers, which use small convolution kernels to predict the positions and confidences of bounding boxes for multiple categories.
From these steps, it can be seen that to maintain translation variance, the SSD network selects earlier layers to generate large-scale feature maps for detecting small objects. The features in these early layers may not be complex enough, resulting in relatively poor detection of smaller objects.

YOLOv3
To improve the detection accuracy for smaller objects, YOLOv3 was developed with reference to the residual network. YOLOv3 is also a one-stage detector and therefore has a high detection speed. Its architecture is as follows. It uses Darknet-53 without the last three layers to generate a small-scale feature map that is down-sampled 32 times from the original image. For example, if the original image is 416 × 416 in size, the feature map will be 13 × 13 in size. Small-scale feature maps are used to detect large objects. Unlike the SSD, which selects an earlier layer to generate a large-scale feature map, YOLOv3 generates a large-scale feature map by up-sampling a small-scale feature map and concatenating it with the feature map of an earlier layer. Such large-scale feature maps, which combine the location information of earlier layers with the complex features of deeper layers, are used to detect small objects. The three scales of the feature map are down-sampled 8, 16, and 32 times from the original image.
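The relation between input size and the three feature-map scales can be checked with a short calculation. The sketch below assumes a square input whose side is divisible by all three down-sampling factors; the figure of three anchor boxes per grid cell comes from the original YOLOv3 design and is an assumption here, since the text does not state it.

```python
def yolo_grid_sizes(input_size, strides=(8, 16, 32)):
    """Return the feature-map side length for each down-sampling factor."""
    for stride in strides:
        if input_size % stride != 0:
            raise ValueError(f"input size {input_size} is not divisible by {stride}")
    return {stride: input_size // stride for stride in strides}


def total_predictions(input_size, anchors_per_cell=3):
    """Count candidate boxes over all scales (anchors_per_cell is an assumption)."""
    grids = yolo_grid_sizes(input_size)
    return sum(side * side * anchors_per_cell for side in grids.values())


grids = yolo_grid_sizes(416)
print(grids)                   # {8: 52, 16: 26, 32: 13}
print(total_predictions(416))  # 10647
```

The 52 × 52 grid has sixteen times as many cells as the 13 × 13 grid, which is why the finer scale is the one suited to small, distant smoke plumes.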

EfficientDet
The EfficientDet is a scalable detection network with a distinctive design. It combines desirable characteristics from several other successful neural networks. The three characteristics combined in EfficientDet are:
1. The residual structure, which allows the depth of the neural network to be increased so that features are extracted by a deeper network;
2. Changing the number of features extracted by each layer, that is, increasing the width of the network, to obtain more features;
3. Increasing the resolution of the input image so that the network can learn and express richer information, which is conducive to improving accuracy.
The EfficientDet scales the baseline model by adjusting the depth, width, and input image resolution together to complete the network design. In MobileNet, model scaling is realized with a scaling factor, α. Different values of α result in different precision, and α = 1 represents the baseline model. ResNet also has a baseline model, which is scaled by changing the depth of the network.
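To make joint scaling of depth, width, and resolution concrete, the sketch below follows the compound scaling rule published for EfficientNet, the backbone family of EfficientDet. The coefficients 1.2, 1.1, and 1.15 are those reported for EfficientNet and are an assumption here; this paper does not state which values it used, and EfficientDet applies its own variant of the rule to the detection head.

```python
# Base multipliers reported for EfficientNet (assumed, not from this paper).
ALPHA = 1.2   # depth multiplier base
BETA = 1.1    # width multiplier base
GAMMA = 1.15  # resolution multiplier base


def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


def flops_factor(phi):
    """Approximate growth in computation: depth * width^2 * resolution^2."""
    return (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi


for phi in range(4):
    depth, width, resolution = compound_scale(phi)
    print(phi, round(depth, 3), round(width, 3), round(resolution, 3),
          round(flops_factor(phi), 2))
```

With these coefficients, each increment of phi roughly doubles the computation, so a single knob trades accuracy against cost instead of three independent ones.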

Image Dataset
Although advances in deep learning provide potential new solutions for visual forest fire detection, budget limitations may make it impossible to conduct the large number of experiments needed to obtain a real forest fire image data set. Thus, this study used three types of data sources: computer-simulated smoke based on fluid dynamics, open data collected from the web with a crawler, and forest fire smoke images taken by the authors. From these three sources, and with image augmentation, a data set of 17,840 images was obtained. It includes 12,640 "forest fire smoke" images and 5200 non-forest-fire-smoke images. Figure 6 shows examples of the data. Of these data, 70% were used for training and 30% for testing, selected at random. The data set used in the experiments can be freely downloaded via ZHENG data set 2021.
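A random 70/30 split of this kind can be sketched as follows. The function name, the fixed seed, and the use of integer indices in place of image files are illustrative assumptions; the resulting counts are simply what a 70/30 split of 17,840 items implies, not figures quoted from the paper.

```python
import random


def split_indices(n_items, train_percent=70, seed=0):
    """Shuffle item indices and split them into training and testing subsets."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)         # seeded for reproducibility
    n_train = n_items * train_percent // 100     # integer arithmetic avoids rounding drift
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(17840)
print(len(train_idx), len(test_idx))  # 12488 5352
```

Fixing the seed keeps the split identical across runs, which matters when several detectors are to be compared on the same test images.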

Image Pre-Processing
Image pre-processing is a necessary step for image recognition and classification. Image normalization is a typical pre-processing step that reduces the influence of affine transformations and accelerates gradient descent toward the optimal solution. In this study, the experimental data set was pre-processed by normalization using Equation (2).
where img_n and img are the normalized and original pixel values of the image, respectively. The normalization using Equation (2) was computed pixel by pixel. After normalization, flipping and cropping were applied. Flipping and cropping are among the earliest and most widely used methods of image augmentation. Flipping an image left to right usually does not change the category of the object. In addition, random cropping makes the object appear at different positions and in different proportions of the image, which reduces the sensitivity of the model to the target location. Additional pre-processing that changes image color characteristics such as brightness, contrast, saturation, and hue was also conducted, followed by the image augmentation.
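The normalization, flipping, and cropping steps can be sketched on a small grayscale image stored as a list of rows. Dividing by 255 is only one common normalization and is an assumption here, since the exact form of Equation (2) is not reproduced in this text; the function names are likewise illustrative.

```python
import random


def normalize(img, max_value=255.0):
    """Scale every pixel to [0, 1], pixel by pixel (assumed normalization)."""
    return [[pixel / max_value for pixel in row] for row in img]


def flip_left_right(img):
    """Mirror the image horizontally; the object category is unchanged."""
    return [row[::-1] for row in img]


def random_crop(img, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    height, width = len(img), len(img[0])
    top = rng.randint(0, height - crop_h)    # randint is inclusive at both ends
    left = rng.randint(0, width - crop_w)
    return [row[left:left + crop_w] for row in img[top:top + crop_h]]


img = [
    [0, 51, 102, 255],
    [255, 102, 51, 0],
    [0, 0, 255, 255],
]
rng = random.Random(42)
augmented = random_crop(flip_left_right(normalize(img)), 2, 3, rng)
print(len(augmented), len(augmented[0]))  # 2 3
```

Because the crop position changes from call to call, the same smoke plume is presented to the network at different locations, which is the effect the paragraph above describes.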

Transfer Learning and Training
Since the four deep CNN algorithms introduced in Section 2 had been trained on large-scale image data sets and showed excellent performance in image target detection, this paper transferred the networks pre-trained on large-scale image data to the smoke detection task. The transfer learning strategy retains the front end of the feature extraction network and only fine-tunes the network on the training and validation data.
The training was performed on a platform with an Intel(R) Xeon(R) W-2223 CPU @ 3.60 GHz, 16 GB of DDR4 RAM at 2400 MHz, CUDA 10.2, and a distributed training unit of NVIDIA Quadro RTX 5000 and Quadro RTX 4000 GPUs. The operating system was Ubuntu 18.04. Figures 11-14 show some examples of smoke detection after 100 training iterations. It can be seen that the confidence of Faster R-CNN is lower than that of YOLOv3, SSD, and EfficientDet.

Validation Testing and Discussion
With the training completed, validation testing was performed on the remaining 30% of the data to evaluate the performance of the deep CNN algorithms. The data composition is shown in Table 1.

Evaluation Index
To better evaluate the accuracy of forest fire smoke recognition, three indicators were used for model performance evaluation: pixel accuracy, category average accuracy, and FPS. Larger values of these indicators correspond to better recognition. Precision (P) and Recall (R) are the two simplest evaluation indicators. Precision is the proportion of correct detections among all detections, and recall is the proportion of correct detections among all targets that should have been detected. With TP, FP, and FN denoting the numbers of true positives, false positives, and false negatives, they are defined as:
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$
The Mean Average Precision (mAP) provides a comprehensive measure of the average accuracy of the detected targets. It is the mean of the Average Precision (AP) over all categories, that is, the AP values of all N categories are summed and divided by the number of categories:
$$mAP = \frac{1}{N} \sum_{k=1}^{N} AP_k$$
Figure 15 shows the average accuracy and detection time for fire smoke detection based on the four deep CNN algorithms. All four deep CNN methods achieved an average accuracy of more than 85%, indicating that it is feasible to detect forest fires in images using deep CNNs. Among the investigated algorithms, the EfficientDet method showed the highest accuracy of 95.7%. In addition, the one-stage algorithms YOLOv3 and SSD were faster, with more than 15 frames per second, indicating that they can detect fire smoke in real time. Among the four algorithms, YOLOv3 showed the highest detection speed of 27 FPS.
Comparing the detection accuracy, its mean value, and the detection time of the four investigated deep CNN algorithms shows that the EfficientDet method has the highest average detection accuracy and that YOLOv3 has the fastest real-time detection speed.
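The evaluation indicators described in this section can be computed with a few lines of code. The sketch below uses the standard definitions of precision, recall, and mAP; the counts and per-category AP values in the example are made-up illustrations, not results from the paper.

```python
def precision(true_pos, false_pos):
    """Fraction of detections that are correct."""
    total = true_pos + false_pos
    return true_pos / total if total else 0.0


def recall(true_pos, false_neg):
    """Fraction of ground-truth targets that were detected."""
    total = true_pos + false_neg
    return true_pos / total if total else 0.0


def mean_average_precision(ap_per_category):
    """Mean of the per-category Average Precision values."""
    return sum(ap_per_category.values()) / len(ap_per_category)


# Hypothetical counts: 90 correct detections, 10 false alarms, 30 missed targets.
print(precision(90, 10))  # 0.9
print(recall(90, 30))     # 0.75

# Hypothetical per-category AP values.
example_ap = {"smoke": 0.9, "non-smoke": 0.8}
print(round(mean_average_precision(example_ap), 4))  # 0.85
```

For an early-warning system, recall is usually the more critical of the two, since a missed plume costs more than a false alarm that an operator can dismiss.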

Discussion
Faster R-CNN has higher detection accuracy, while the YOLO series is faster. Faster R-CNN uses a two-stage scheme to detect the target: features are first extracted by the best-performing network, and the bounding box is then adjusted. In contrast, the YOLO series completes these two steps in a single stage. The core idea of the Faster R-CNN is to find the best-performing networks and then assemble them to produce better results. Owing to this multi-network fusion scheme, the features of Faster R-CNN are very precise, but the computation is slow, which is detrimental to real-time forest fire smoke detection.
The emergence of YOLOv3 addressed this challenge. The most significant feature of YOLOv3 is that it is faster and more accurate than the Faster R-CNN. The forest fire smoke detection results in Figure 10 show that the maximum detection speed of YOLOv3 reached 27 frames/second, while the real-time detection performance of the Faster R-CNN was the worst among the four target detection methods, with a detection speed of 5 frames/second. Compared with the Faster R-CNN detection model, which requires object proposals, the SSD method completely eliminates the stages of proposal generation and pixel or feature resampling, making it easier to optimize training and to integrate the detection model into a system. Although the SSD detection speed of 16 frames per second meets the requirements for real-time detection, its detection accuracy of 87.5% is the lowest among the four models.
Building on the scalable neural network EfficientNet, EfficientDet combines a new bi-directional feature pyramid network (BiFPN) with new scaling rules to achieve state-of-the-art accuracy. Compared with the previous most advanced detection algorithm, EfficientDet's model size is reduced to one-ninth, and the computation time is also greatly reduced. This study developed a small EfficientDet-D0 baseline from the D0 to D7 models to improve the detection accuracy gradually while the computation effort was also decreased. According to the experimental results, the detection speed is 12 frames/second, and the detection accuracy is up to 95.7%.
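The reported frame rates translate directly into a per-frame time budget, which is often the more useful figure when sizing a monitoring server. The sketch below uses the speeds quoted in this section; the 15 FPS real-time threshold is taken from the statement in the evaluation that speeds above 15 frames per second indicate real-time detection, and the function names are illustrative.

```python
# Detection speeds reported in this section, in frames per second.
REPORTED_FPS = {
    "YOLOv3": 27,
    "SSD": 16,
    "EfficientDet": 12,
    "Faster R-CNN": 5,
}


def frame_time_ms(fps):
    """Average processing time per frame in milliseconds."""
    return 1000.0 / fps


def is_real_time(fps, threshold_fps=15):
    """True if the detector exceeds the assumed real-time threshold."""
    return fps > threshold_fps


for name, fps in REPORTED_FPS.items():
    print(f"{name}: {frame_time_ms(fps):.1f} ms per frame, real time: {is_real_time(fps)}")
```

Under this threshold, only YOLOv3 and SSD qualify as real-time, while EfficientDet trades roughly 46 ms of extra latency per frame relative to YOLOv3 for its higher accuracy.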
Through their special network architecture, deep learning neural networks make it possible to detect forest fire smoke without depending on manual feature extraction. To achieve such end-to-end detection, a neural network model is constructed using generalized patterns, and a large number of environmental data sets is introduced for training. This study makes two major contributions to the field:
1. Because of the special nature of forest fire smoke, a real forest fire smoke data set is hard to obtain through experiments. This paper develops a computer simulation model based on the Navier-Stokes equations of fluid dynamics to simulate smoke, which can be used to supplement the forest fire smoke training data set by combining different smoke patterns obtained from the simulation with real field scenes. Such a simulation model addresses the difficulty of obtaining forest fire smoke data.
2. By constructing a large forest fire smoke data set to train deep learning target detection models, four forest fire smoke detection models with good generalization performance were obtained in this study, which makes the detection of forest fire smoke in complex scenes feasible.
This study shows that it is very effective to use simulated smoke to supplement the data set for training neural networks when forest fire smoke data are scarce. The trained models can detect forest fire smoke in real time through front-end remote video monitoring equipment. In practice, the trained models can easily be deployed on a server supported by onsite cameras to monitor fire occurrence in all weather and provide early warning of fire in forest areas.
Forest fire smoke varies widely in color, texture, and shape, so it is crucial to establish a standard forest fire smoke database. Data augmentation techniques are usually used to expand the smoke data set, but they do not add new data or new video surveillance scenes. This may reduce the robustness and effectiveness of the trained model in recognizing forest fire smoke scenes that are not included in the training set. Thus, accurate recognition of smoke is still challenging. In addition, increasing the recognition accuracy of the target detection model requires more computation power and memory. Therefore, while ensuring recognition accuracy, future studies are needed to improve the computational efficiency of the detection network structures and make them more convenient to deploy in practical field scenes.

Conclusions
To improve the performance of machine vision forest fire smoke detection, this paper investigated the feasibility of using the advanced object detection deep convolutional neural networks Faster R-CNN, SSD, YOLOv3, and EfficientDet to detect forest fire smoke. The deep CNN algorithms can automatically extract complex image features for fire detection in different scenes. The experimental evaluation results show that all four investigated algorithms achieved acceptable average accuracy, with the EfficientDet showing the highest mAP, up to 95.7%. The one-stage algorithms YOLOv3 and SSD achieved real-time detection at 16 frames/s or more, with YOLOv3 the fastest at up to 27 frames/s.

Figure 1. Deep learning convolutional neural network for forest fire smoke classification.

Figure 3. Flow of forest fire smoke detection algorithm based on CNN.

Figure 4. (a) Original image; (b) Convolution kernel; (c) Feature map.

Figure 5. Schematic diagram of the Faster R-CNN fire detection algorithm based on deep CNN.

Figure 6. Example of part of the used data sets.

Figures 7-10 show the process of loss decline during 100 training iterations, from which it can be seen that EfficientDet converged the fastest.

Figure 11. Faster R-CNN close-range detection effect of forest fire smoke.

Figure 12. YOLOv3 remote detection effect of forest fire smoke.

Figure 13. EfficientDet remote detection effect of small-target forest fire smoke.

Figure 14. SSD remote detection effect of forest fire smoke.

Figure 15. Histogram of forest fire smoke mAP and AP.

Table 1. Data sets' quantity statistics and division.