An Efficient Smoke Detection Algorithm Based on Deep Belief Network Classifier Using Energy and Intensity Features

Abstract: Smoke detection plays an important role in forest safety warning systems and fire prevention. Complicated changes in the shape, texture, and color of smoke make it a substantial challenge to identify smoke in a given image. In this paper, a new algorithm using the deep belief network (DBN) is designed for smoke detection. Unlike popular deep convolutional networks (e.g., Alex-Net, VGG-Net, Res-Net, Dense-Net, and the denoising convolutional neural network (DNCNN) specifically devoted to detecting smoke), our proposed end-to-end network is mainly based on the DBN. Most traditional smoke detection algorithms follow the pattern recognition process, which basically consists of feature extraction and classification. After extracting the candidate regions, the main idea is to perform both smoke recognition and smoke/no-smoke region classification using static and dynamic smoke characteristics. However, manual smoke detection cannot meet the requirements of a high smoke detection rate and has a long processing time. Convolutional neural network (CNN)-based smoke detection methods are significantly slower due to the max-pooling operation. In addition, the training phase can take a lot of time if the computer is not equipped with a powerful graphics processing unit (GPU). Thus, the contribution of this work is the development of a preprocessing step that uses a new combination of features (smoke color, smoke motion, and energy) to extract the regions of interest, which are then fed to a simple DBN architecture. Our proposed method is able to reliably classify and localize the smoke regions, with a favorable computation time and improved performance metrics. First, the Gaussian mixture model (GMM) is employed to capture the frames containing a large amount of motion. After applying RGB rules to smoke pixels and analyzing the energy behavior of smoke regions, the extracted features are used to feed a DBN for classification.
Experimental results obtained on the publicly available smoke detection database confirm that the DBN reaches a high detection rate, exceeding 96% on average, when tested on different videos containing smoke-like objects, which make smoke recognition more challenging. The proposed methodology provided high detection ratios and low false alarm rates, and its robustness was verified by evaluating accuracy, F1-score, and recall for images with and without noise.


Introduction
Damage caused by forest fires to vegetation, animals, and humans can have disastrous consequences for nature. The danger of forest fires is real and poses a threat to people, animals, and plants. Each summer, many countries suffer from the impacts of large forest fires.
Protection from these dangers using video images for smoke and fire detection is a challenging task for a surveillance system. The presence of smoke can be the first indication of a fire, since all fires produce smoke. The smoke detection method that we propose hereafter is essentially based on a deep learning technique called the deep belief network (DBN) [1,2]. A large number of algorithms have recently been developed to ensure reliable smoke detection. Although some progress has been made, most algorithms are only effective in the rapid detection of wildfires. The features extracted from smoke images can cause false alarms, and the smoke detection process is slow. Conventional smoke detectors are not ideally suited to effective smoke detection because they do not specify the size, location, or direction of the smoke. Compared to video-assisted smoke detection, conventional smoke detectors provide significantly less information. Thus, it is necessary to develop new visual smoke detection systems with a better detection rate.
In general, smoke detection is carried out in two main phases. First, it is necessary to define the target regions, extracted using the Gaussian mixture model (GMM) [3][4][5], optical flow [6], or background subtraction [7]. Smoke is generally characterized by both static and dynamic characteristics. The second step consists of choosing certain characteristics (such as color, motion, energy/flicker analysis, and texture) and grouping them into multidimensional feature vectors. These vectors are then classified with different classification techniques, such as the support vector machine (SVM) [8,9], the Markov model [10], and Bayesian classifiers [11]. This classification is a crucial step to separate regions with and without smoke in the studied images. Although many methods can be applied in ideal visual conditions, some are not suitable for forest fires, as there may be reflections of sunlight and other moving objects [3][4][5]. Many researchers have proposed several avenues for the development of intelligent smoke detection systems using video cameras [12]. These methods are intended for the visible domain, in which the distance between the camera and the target is less than 100 m. Recently, researchers have also used infrared cameras to detect smoke, which involves a high cost compared to visual domain cameras. Most proposed smoke detection methods follow the same approach as that detailed previously. Following the extraction of smoke candidate regions, researchers use a set of selected features. The greater the number of features, the more complex the method, hence the need for high-performing classifiers.
Zhao et al. [13] extracted the flutter feature to perform classification using the Cost Sensitive Adaboost algorithm. Feature extraction from candidate regions is established by analyzing spatial and temporal characteristics of smoke videos. All of the extracted features are combined into a single input vector used in the Cost Sensitive Adaboost algorithm. The candidate smoke regions are extracted and updated to detect motion regions. Zhao et al. selected the flutter feature and computed the flutter direction angle to label the centroid motion of candidate regions. Using this approach, the direction of the centroid from bottom to top shows the presence of smoke regions. In addition, the presence of smoke is marked by a small change of high frequency information. Then, a threshold is applied to the neighborhood of each pixel based on the center value and the result is considered a binary number. The local binary pattern extracts both dynamic and appearance features of dynamic textures and is used to compute the pattern on each block of the considered frame. This method improves the performance of smoke detection but several potential improvements could be made to extract more effective flutter features [13]. Xiong et al. [14] extracted other features: background subtraction, flicker frequency, and contours. For further assessment, they classified smoke regions by the criteria of perimeter and area of smoke sequence. This method is one of the first smoke detection methods applicable to open spaces. The limitation of this method is that the extracted smoke feature is seldom adaptive to the background scene. In addition, the dataset used cannot be used to efficiently evaluate and compare the performance [14]. Toreyin et al. [15] investigated appropriate features such as the energy behavior of regions with and without smoke. The input frames were divided into sub-images using discrete wavelet transform [16]. 
This is a crucial step to calculate the energy ratio and, hence, determine the energy behavior. The energy of the input frame is compared to that of a reference frame used as a background model. Moving objects are detected using background estimation. Additionally, for further assessment, the authors used the criterion of color detection. A decrease in the U and V components of the YUV (luminance and chrominance) color space in a grayish scene is a sign of the presence of smoke. This not only identifies smoke features but also allows the flicker frequency to be computed. Finally, all of these features were combined to make a final decision using a hidden Markov model (HMM) [10].
A novel Bayesian approach is developed by Calderara et al. [17] to detect smoke regions in a scene analyzing image energy through the wavelet transform coefficients and color information. The Bayesian classifier is employed on energy and color information. They developed a statistical model of gray image energy to compare the color of the input frame and the color of the chosen reference frame. This technique detects any change of energy image in the scene [11]. In the work of Yuanbin [18,19], the quality of the image is firstly enhanced by fuzzy logic, then it is adopted to extract dynamic regions from video frames. A swaying identification algorithm based on centroid calculation is used to distinguish candidate smoke regions from other dynamic regions. The support vector machine classifier is employed after that to classify the input vector of extracted static and dynamic features. This method is reliable, but it needs a high computational time in the processing stage. Similarly, Favorskaya et al. [20] applied the basic traditional spatial and temporal features used in other methods. The traditional features describing the spatial ones are color, energy, transparency, and shape, while the temporal ones are motion, flicker, and frame difference estimator [18,19].
Visual saliency detection developed by Xu et al. [21] focused on the most important object regions in an image. The pixel-level and object-level salient convolutional neural networks are combined to extract the informative smoke saliency map [21]. The saliency map is added to the deep feature map to predict the existence of smoke in an image. Unlike popular deep convolutional networks (e.g., Alex-Net, VGG-Net, Res-Net, Dense-Net, and the DNCNN specifically devoted to detecting smoke), the end-to-end network developed by Gu et al. [22] is mainly composed of dual channels of deep sub-networks. The first sub-network connects multiple convolutional layers and max-pooling layers and extracts the detailed information of smoke, such as texture. In the second sub-network, they inserted two important components. The first avoids the vanishing gradient and improves feature propagation. The second reduces the number of parameters and solves the over-fitting problem. Based on the augmented data obtained by rotating the training images, smoke detection using the denoising convolutional neural network (DNCNN) can stably converge to a perfect performance [21]. To overcome the shortfalls of the existing methods, we have decided in the present work to use smoke videos captured in the visual domain by low-cost cameras, and we have improved the smoke detection by implementing a powerful classifier. The smoke regions are classified using the deep belief network (DBN) [1,2] to obtain the probabilities of smoke and no-smoke presence in the studied frames. The DBN uses a combination of three selected features: smoke color, motion, and image energy. Furthermore, we propose to extract the movement in the different images while eliminating static images. Thus, the proposed method allows us to localize the smoke regions in each frame, in contrast to the existing methods mentioned above, which can only detect the existence of smoke regions.
The present paper is organized as follows. The different smoke features proposed in previous works are described in Section 2. Then, we present the methodology of the proposed technique in Section 3. The experimental results of the proposed method are reported and discussed in Section 4. Finally, conclusions and perspectives are presented in the last section.

Smoke Motion
Labeling smoke motion is the first step in extracting candidate regions and determining the nature of the motion (ordinary or chaotic). Motion detection can be realized by techniques such as background subtraction [7] and optical flow [6]. The best-known and most efficient technique for background subtraction is the Gaussian mixture model [3][4][5], which iteratively subtracts the background image from the current frame to find the moving objects. In this approach, the camera is stationary. Each pixel in the frame is modeled with a mixture of K Gaussian distributions. The probability that a pixel has the intensity $X_t$ is defined with:
$$P(X_t) = \sum_{i=1}^{K} \omega_{i,t}\, \eta\left(X_t, \mu_{i,t}, \Sigma_{i,t}\right)$$
where $\omega_{i,t}$ is the weight and $\mu_{i,t}$ is the mean of the ith distribution at time t, $\Sigma_{i,t}$ is the covariance for the ith distribution, and $\eta$ is a Gaussian probability density function.
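As a minimal illustration of the K-Gaussian mixture described above, the sketch below evaluates the mixture probability for a single grayscale pixel intensity. The function names, the one-channel (scalar variance) simplification, and the example weights, means, and variances are assumptions made for illustration; they are not the configuration used in the experiments.

```python
import math


def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mixture_probability(x, weights, means, variances):
    """P(x) = sum_i w_i * N(x; mean_i, var_i) for a K-component mixture."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))


# Hypothetical two-component background model for one pixel.
weights = [0.7, 0.3]
means = [100.0, 180.0]
variances = [25.0, 100.0]

# An intensity close to a background mode is far more probable than an outlier,
# so the outlier would be treated as a foreground (moving) pixel.
p_background = mixture_probability(102.0, weights, means, variances)
p_outlier = mixture_probability(30.0, weights, means, variances)
```

In practice, a pixel whose intensity is poorly explained by the background components is labeled as foreground, which is how the moving regions are obtained.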

Smoke Color Feature
Color detection is one of the used criteria for smoke detection. The color of the smoke can be gray, white, black, or dark gray. Color features can be treated in different spaces (RGB, YUV, YCbCr, etc.). The smoke RGB and YUV rules are defined as [11]: where Sy, Su, and Sv are the experimentally selected thresholds.

Smoke Energy
The temporal behavior of smoke is characterized through wavelet energy analysis. Let It be the current frame, BGt the background image, and Bk the kth block of the frame. Many studies show that a decrease in the energy ratio, i.e., the energy of the current frame E(Bk, It) divided by the background energy E(Bk, BGt), is a pertinent characteristic of smoke transparency, because smoke gradually softens the edges in an image. Researchers apply the discrete wavelet transform (DWT) [16] to the current frame It and to the background image taken before the appearance of the smoke, and compute the energy with:
$$E(B_k, I_t) = \sum_{(i,j) \in B_k} \left[ LH_t(i,j)^2 + HL_t(i,j)^2 + HH_t(i,j)^2 \right], \qquad \alpha(B_k, I_t, BG_t) = \frac{E(B_k, I_t)}{E(B_k, BG_t)}$$
The DWT [16] creates four images after convolving the intensity image with filter banks: the horizontal high band/vertical high band HH, the horizontal low band/vertical high band LH, the horizontal high band/vertical low band HL, and the horizontal low band/vertical low band LL. The frame It is divided into blocks of arbitrary size. The contribution of Lee et al. [23] is the choice of the clustering technique called boosted random forests (RBF) [24] for the smoke and no-smoke regions. This classifier offers better performance than the SVM classifier. Furthermore, Lee et al. [23] combine temporal and spatial features (color, moving object, flicker/energy analysis, disorder analysis, etc.) and insert them into a feature vector for clustering to finally find smoke regions.
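The energy analysis described above can be sketched as follows. This is a minimal example under stated assumptions: a one-level Haar wavelet is used (the text does not specify the wavelet), the whole toy image is treated as a single block, and the function names and pixel values are hypothetical.

```python
def haar_dwt2(img):
    """One-level 2D Haar DWT of a list-of-lists image with even dimensions.

    Returns the (LL, LH, HL, HH) sub-bands, each half the input size.
    """
    ll, lh, hl, hh = [], [], [], []
    for r in range(0, len(img), 2):
        ll_row, lh_row, hl_row, hh_row = [], [], [], []
        for c in range(0, len(img[0]), 2):
            a, b = img[r][c], img[r][c + 1]
            d, e = img[r + 1][c], img[r + 1][c + 1]
            ll_row.append((a + b + d + e) / 2.0)  # horizontal low, vertical low
            lh_row.append((a + b - d - e) / 2.0)  # horizontal low, vertical high
            hl_row.append((a - b + d - e) / 2.0)  # horizontal high, vertical low
            hh_row.append((a - b - d + e) / 2.0)  # horizontal high, vertical high
        ll.append(ll_row)
        lh.append(lh_row)
        hl.append(hl_row)
        hh.append(hh_row)
    return ll, lh, hl, hh


def high_frequency_energy(img):
    """Sum of squared LH, HL, and HH coefficients (image treated as one block)."""
    _, lh, hl, hh = haar_dwt2(img)
    return sum(x * x for band in (lh, hl, hh) for row in band for x in row)


def energy_ratio(current, background):
    """Energy of the current frame divided by the energy of the background."""
    background_energy = high_frequency_energy(background)
    if background_energy == 0:
        raise ValueError("background has no high-frequency energy")
    return high_frequency_energy(current) / background_energy


# Hypothetical data: sharp vertical edges, then the same edges softened by smoke.
background = [[0, 255, 0, 255] for _ in range(4)]
blurred = [[64, 191, 64, 191] for _ in range(4)]
ratio = energy_ratio(blurred, background)  # below 1 because the edges are softer
```

A ratio below 1 indicates that edges have been softened relative to the background, which is the behavior attributed to semi-transparent smoke.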

Smoke Disorder
Disorder is a dynamic feature that defines the direction and propagation of smoke, which is considered a chaotic phenomenon. It is defined with: where Perimeter(i,j) and Area(i,j) are the perimeter and the area of object i in the jth frame, respectively. The dynamic features of the smoke like area and perimeter disorder of segmented smoke region are calculated for differentiating between smoke and non-smoke region. Once the smoke features are fixed as required in each method, the smoke and no-smoke regions are classified using different types of adequate classifiers.

Smoke and No-Smoke Classification Regions Methods
Various classification techniques are applied to the feature vector to separate smoke and no-smoke regions, such as the support vector machine [8,9], Bayesian classifiers [11], and Markov models [10]. Most studies used the support vector machine [8,9] as a classification method for multidimensional feature vectors. Recently, researchers have reduced the number of features and inserted a better classifier for smoke detection. The comparative study between approaches is based on the criterion of the detection rate of smoke frames, defined as:
$$\text{Detection Rate} = \frac{\text{number of positive detected frames} - \text{number of false detected frames}}{\text{number of smoke frames}} \quad (7)$$
Some of the techniques currently used for smoke detection are shown in Table 1 and detailed in our published paper [25]. Researchers in forest smoke detection have recently tried to minimize the number of features and insert a higher-performing classifier to obtain the best detection rates. The added value of our proposed method is that it detects smoke with an optimized number of features. A simple comparison between classical smoke detection methods and our implemented method demonstrates the superiority of our method in terms of detection rate and detection time. In the following, we present our methodology based on classification with a DBN.
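The detection rate criterion can be computed with a few lines of code. The sketch below assumes that the numerator is the number of positively detected frames minus the number of falsely detected frames; the function name and the example counts are hypothetical.

```python
def detection_rate(positive_detected, false_detected, smoke_frames):
    """Detection rate: (positive detections - false detections) / smoke frames."""
    if smoke_frames <= 0:
        raise ValueError("smoke_frames must be positive")
    return (positive_detected - false_detected) / smoke_frames


# Hypothetical counts for one video: 980 detections, 20 of them false,
# out of 1000 frames that actually contain smoke.
rate = detection_rate(980, 20, 1000)
```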

Dataset Presentation and Proposed Methodology Description
The proposed program was implemented using Python and the open source computer vision library OpenCV. The dataset used is available at [29]. First of all, we divided the videos into frames. The training set is used to fit the models, while the validation set is used to estimate prediction errors for model selection. In order to estimate how well our model has been trained, we evaluated model properties (mean error for numeric predictors; classification errors, recall, and precision for classifiers). The training of a DBN requires a lot of computation and a large number of patterns. For the training, we used a data set of ten videos containing smoke and six videos without smoke. Some frames of these videos are shown in Figures 1-3. We used 70% of the extracted frames for training, whereas the remaining frames were used for testing.
At this stage, we extracted frames from smoke-based videos. The size of each frame of the videos is set to 320 × 240 pixels. We used a total of 20,000 frames with an equal distribution of non-smoke and smoke classes. The general idea of our work is simple. The pre-processing that we propose to apply to the database greatly reduces the complexity of the detection method. The main idea of our work is to select the frames containing movement by using the GMM and to eliminate the static frames. We calculated the differences between the R, G, and B components of each frame, and we kept the frames in which these components are close (according to an experimental threshold). We calculated the ratio between the energy of each frame and that of a reference frame taken before the appearance of the smoke. If the contours are blurred, the energy ratio is less than 1. In that case, we kept the smoke frame and extracted the regions of interest. These regions of interest were then inserted in a vector, which was provided to the DBN. The proposed method achieved high performance metrics (detection rate, precision, F1 score, accuracy, and recall) with a favorable computation time and the capability of localizing smoke regions. The proposed smoke detection method is detailed by the flowcharts of Figures 4 and 5.
To the best of our knowledge, the only research paper using a DBN for smoke detection is that of Pundir [28]. The difference between the proposed approach and the method developed in [27] is that we are interested only in color, motion, and energy features, which we feed to the DBN classifier, in contrast to the aforementioned method [28], which is based on motion, color, and texture analysis. For this reason, we propose, in this paper, to show the importance of using energy as a relevant smoke feature in the classification framework by comparing the performances of both methods. In addition, we think that Pundir's algorithm is more complex than our approach in terms of feature computation since it requires, in addition to color and motion, texture analysis in each processed frame, by evaluating the local extrema co-occurrence patterns (LECoP). The LECoP gives texture and intensity features by extracting the local directional information using the gray level co-occurrence matrix. Consequently, we believe that the computation of this matrix to extract textural attributes will increase the processing time of the video, and that this will present a significant problem when the scheme is implemented on real-time devices for wildland forest fire detection. Previous works do not report the true location of the smoke and do not specify the regions of interest; they feed the frames to the classifier as they are. The real contribution here is to delimit the regions of interest before feeding them into the DBN and to calculate the intersection over union (IoU) criterion.
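For reference, the intersection over union (IoU) criterion between a predicted smoke region and a ground-truth region can be computed as in the minimal sketch below. The box format (x_min, y_min, x_max, y_max) and the function name are assumptions made for illustration.

```python
def intersection_over_union(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical predicted and ground-truth smoke boxes in pixel coordinates.
predicted = (0, 0, 10, 10)
ground_truth = (5, 5, 15, 15)
iou = intersection_over_union(predicted, ground_truth)  # 25 / 175
```

An IoU of 1 means the predicted rectangle coincides with the ground truth, and 0 means the two do not overlap.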

Pre-Processing of Smoke Images
The frames containing smoke are characterized by close R, G, and B components. The idea is to set an experimental threshold below which R-G, R-B, and G-B are considered close to zero. After applying the RGB rules to the foreground objects containing high motion, extracted using the Gaussian mixture model as presented in Figure 6, we compute the ratio between the energy of the current frame and the energy of the background frame taken before the appearance of the smoke. Because the contours of the smoke regions become blurred, the energy of the frames containing smoke regions decreases. If the ratio between the energy of the current frame and that of the background image is less than 1, the possible presence of smoke is flagged. For this step, we used the discrete wavelet transform to obtain the HH (high-high), HL (high-low), LH (low-high), and LL (low-low) sub-bands. The energy ratio is computed as shown in Equation (5).
The proposed methodology is based essentially on the choice of characteristics such as color, movement, and energy, to specify the regions of interest and locate the smoke by the trained DBN.
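The color and energy tests of the pre-processing step can be combined as in the sketch below. It is only an illustration: the threshold value of 20 gray levels, the function names, and the sample pixels are hypothetical, and the energy ratio is assumed to have been computed beforehand from the wavelet sub-bands.

```python
def is_grayish(r, g, b, threshold=20):
    """RGB rule: the pairwise differences R-G, R-B, and G-B are all close to zero."""
    return max(abs(r - g), abs(r - b), abs(g - b)) <= threshold


def is_smoke_candidate(pixel, energy_ratio, threshold=20):
    """A moving pixel is kept if it is grayish and its block energy ratio is below 1."""
    r, g, b = pixel
    return is_grayish(r, g, b, threshold) and energy_ratio < 1.0


# Hypothetical foreground pixels (already selected by the motion step).
smoke_like = is_smoke_candidate((200, 205, 198), energy_ratio=0.6)   # gray, blurred
flame_like = is_smoke_candidate((220, 120, 40), energy_ratio=0.6)    # strongly colored
sharp_gray = is_smoke_candidate((200, 205, 198), energy_ratio=1.2)   # gray, not blurred
```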

Localization of Smoke Regions
In this work, the chaotic movement of smoke is analyzed using GMM tools [3][4][5] to extract regions of high motion. We added to this criterion the color feature calculated from the matrix difference between the R, G, and B components and the energy ratio. This approach led us to easily locate the smoke in the video frames. The region of smoke is shown by a green rectangle in Figure 7.

The Use of the Deep Belief Network
The deep belief network [1,2] is a probabilistic generative model composed of many layers of hidden variables. Each layer captures high-order correlations. The generative approach is to model the joint distribution of observed and latent variables, typically using a log-likelihood based criterion. This approach does not require labeled data. The DBN used to learn our data is composed of one visible layer and two hidden layers. In addition, it is fully connected and has 100 nodes per layer. The DBN is a stack of restricted Boltzmann machines (RBMs), which learns the features (pre-training phase) and then propagates them to fine-tune the network. After the pre-training, our network acts like a multi-layer perceptron (MLP). Finally, after adding an output layer, we insert a logistic regression as a classifier. In the following, we present the different training steps.

The Pre-Training and Feature Extraction
The RBM [30,31], created by G. Hinton [2], is an algorithm suitable for classification, dimensionality reduction, and feature learning. It can be considered the basic building block of deep belief networks. RBMs are usually used as feature extractors and as stand-alone non-linear classifiers, as well as logistic regression [32] and SVMs [8,9,33]. Each visible node takes a low-level feature from an item in the dataset to be learned. The hyper-parameters of our network are detailed in Table 2.
The energy E(v,h) of a joint configuration of the visible units v and the hidden units h is defined as:
$$E(v,h) = -\sum_{i=1}^{100} a_i v_i - \sum_{j=1}^{3} b_j h_j - \sum_{i=1}^{100} \sum_{j=1}^{3} v_i w_{ij} h_j$$
where $w_{ij}$ is the weight connecting $v_i$ and $h_j$, and $a_i$ and $b_j$ are the corresponding biases. Energy-based probabilistic models define a probability distribution through the energy function, given by:
$$P(v,h) = \frac{e^{-E(v,h)}}{Z}, \qquad Z = \sum_{v,h} e^{-E(v,h)}$$
For the training, a loss function depending on this energy is defined as the negative log-likelihood of the N training samples:
$$\mathcal{L} = -\frac{1}{N} \sum_{n=1}^{N} \ln P\left(v^{(n)}\right), \qquad \ln P(v) = \ln \sum_{h} e^{-E(v,h)} - \ln \sum_{v',h} e^{-E(v',h)}$$
Training the network consists of updating its parameters (weights and biases) in the direction of the gradient.
Thus, this approach learns the model that best fits the training data [34,35]. The two terms of $\ln P(v)$ are called the positive and negative phases. These names do not refer to their signs but to their effect on the probability density defined by the model. The first term increases the probability of training data (by reducing the corresponding energy), while the second decreases the probability of generated samples. Thus, we obtain the following equation for training:
$$\frac{\partial \ln P(v)}{\partial w_{ij}} = \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{reconstructed}$$
where i = 1, ..., 100, j = 1, 2, 3, and the symbols data and reconstructed are used to represent the expected values under the data and under the reconstructed model, respectively. The weights are updated with:
$$\Delta w_{ij} = L_r \left( \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{reconstructed} \right)$$
where $L_r$ is the learning rate.
This parameter will be fixed in the following section. It depends on the behavior of the pre-training cost per epoch. Gibbs sampling is a crucial step in training, used to draw samples from the model. We can start with a random state in one of the layers and then perform alternating Gibbs sampling [36]. All the units in one layer are updated in parallel given the current states of the units in the other layer, and this is repeated until the system is sampled. This step is the same as generating data from the infinite belief net. Gibbs sampling reduces the Kullback-Leibler divergence [34,35]. Samples of P(x) can be obtained by running a Markov chain [33] to convergence using Gibbs sampling. Thus, the visible units are sampled simultaneously given the hidden values (and vice versa) as follows:
$$h^{(n+1)} \sim P\left(h \mid v^{(n)}\right), \qquad v^{(n+1)} \sim P\left(v \mid h^{(n+1)}\right)$$
where the units of v are conditionally independent given h, and vice versa. The terms $h_j^{(n+1)}$ and $v_i^{(n+1)}$ are randomly chosen to be 0 or 1 with the respective probabilities:
$$P\left(h_j = 1 \mid v\right) = \sigma\Big(b_j + \sum_i v_i w_{ij}\Big), \qquad P\left(v_i = 1 \mid h\right) = \sigma\Big(a_i + \sum_j w_{ij} h_j\Big)$$
where $\sigma(x) = 1/(1+e^{-x})$ is the logistic sigmoid. The second RBM layer is trained on the activities of the hidden units of the first RBM layer, given the data and keeping the weights of the first layer fixed. The hidden units in the second RBM level tend to have strong positive weights to similar features in the first layer. The second RBM layer extracts higher-level features. After the pre-training stage using the RBM, the network operates like an MLP [34].
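The alternating Gibbs sampling and the data/reconstruction weight update described above correspond to one step of contrastive divergence (CD-1). The following sketch is a minimal pure-Python illustration, assuming binary units with sigmoid activations; the toy layer sizes (6 visible, 3 hidden), the learning rate of 0.1, the two training patterns, and all function names are hypothetical and much smaller than the network used in this work.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W, b):
    """P(h_j = 1 | v) for every hidden unit j."""
    return [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b))]


def visible_probs(h, W, a):
    """P(v_i = 1 | h) for every visible unit i."""
    return [sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(a))]


def sample_binary(probs, rng):
    """Draw each unit as 1 with its probability, otherwise 0."""
    return [1.0 if rng.random() < p else 0.0 for p in probs]


def cd1_update(v0, W, a, b, lr, rng):
    """One CD-1 step on a single sample; updates W, a, b in place.

    Returns the squared reconstruction error for monitoring.
    """
    ph0 = hidden_probs(v0, W, b)       # positive phase (data)
    h0 = sample_binary(ph0, rng)       # Gibbs step: sample hidden given visible
    pv1 = visible_probs(h0, W, a)      # reconstruction of the visible layer
    ph1 = hidden_probs(pv1, W, b)      # negative phase (reconstruction)
    for i in range(len(a)):
        for j in range(len(b)):
            W[i][j] += lr * (v0[i] * ph0[j] - pv1[i] * ph1[j])
        a[i] += lr * (v0[i] - pv1[i])
    for j in range(len(b)):
        b[j] += lr * (ph0[j] - ph1[j])
    return sum((x - y) ** 2 for x, y in zip(v0, pv1))


rng = random.Random(0)
n_visible, n_hidden = 6, 3
W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)] for _ in range(n_visible)]
a = [0.0] * n_visible
b = [0.0] * n_hidden
data = [[1.0, 1.0, 1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]]

# Reconstruction error per epoch, the quantity usually monitored during pre-training.
errors = [sum(cd1_update(v, W, a, b, 0.1, rng) for v in data) for _ in range(50)]
```

Using the reconstruction probabilities rather than binary samples in the negative phase is a common practical choice, not something stated in the text.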

The Multi-Layer Perceptron
The multi-layer perceptron is a feedforward artificial neural network trained by a supervised learning algorithm, usually employed to solve binary and multi-class classification problems. It is composed of at least three layers of nodes. Each node represents a neuron that uses a nonlinear activation function. For training, the MLP generally uses the backpropagation algorithm. We used a simple fully connected model. The hidden layers are first pre-trained by a generative restricted Boltzmann machine (GRBM). The principle of greedy layer-wise unsupervised training in a DBN with stacks of RBMs is as follows:
1. Train the first layer as an RBM that models the raw input v = h0 as its visible layer.
2. Use the first layer to obtain a representation of the input, i.e., the activities of its hidden units, which serves as data for the second layer.
3. Train the second layer as an RBM on these activities, keeping the weights of the first layer fixed.
4. Fine-tune all parameters with respect to the DBN log-likelihood.

Fine-Tuning the Parameters
After the pre-training phase, we added an output layer from the parameters of the last hidden layer. We fine-tuned the weights of the pre-trained model by continuing the backpropagation, as in the classical feed-forward neural network [37]. This technique is still used to train large deep learning networks. The principle of the backpropagation approach [37,38] is to model a given function by modifying internal weights of input signals to produce an expected output signal. It is possible to fine-tune all the layers of the DBN, as it is possible to keep some of the earlier layers fixed (due to overfitting concerns) and only fine-tune some higher-level portion of the network. Once we finished the training step, we added an output layer for the classification step by using logistic regression.

Classification Using Logistic Regression/Adam Optimizer

a. Logistic Regression
The classification step is established by adding a logistic regression [32] as a classifier trained with conditional maximum likelihood estimation. We choose the parameters W that maximize the log probability of the y labels in the training data given the observations v. Logistic regression is a discriminative classifier that models the decision boundary between classes [32]. A generative model explicitly estimates the actual distribution of each class. The key idea here is to predict the probability that nodes in the input layer belong to a certain class, as shown in Figure 8. The probability is defined with where i = 1, ..., 100, j = 1, ..., 3 For training, we should define a loss function as For an individual training observation v in the training set, the optimal weights are estimated with:
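The classification step can be illustrated with a minimal logistic regression sketch. It assumes binary labels (smoke = 1, no smoke = 0), a sigmoid output, and full-batch gradient descent on the mean negative log-likelihood; the one-dimensional toy data, the learning rate, and the function names are hypothetical and do not reproduce the configuration of this work.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """P(y = 1 | x) for a feature vector x."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))


def mean_nll(w, b, X, y):
    """Mean negative log-likelihood (cross-entropy) over the data set."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(w, b, xi)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(X)


def gradient_step(w, b, X, y, lr):
    """One full-batch gradient-descent step on the mean NLL."""
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for xi, yi in zip(X, y):
        err = predict_proba(w, b, xi) - yi   # derivative of the NLL w.r.t. the logit
        grad_b += err
        for k, xk in enumerate(xi):
            grad_w[k] += err * xk
    n = len(X)
    new_w = [wk - lr * gk / n for wk, gk in zip(w, grad_w)]
    return new_w, b - lr * grad_b / n


# Hypothetical one-feature data: low values are "no smoke", high values are "smoke".
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
w, b = [0.0], 0.0
initial_loss = mean_nll(w, b, X, y)
for _ in range(500):
    w, b = gradient_step(w, b, X, y, lr=0.5)
final_loss = mean_nll(w, b, X, y)
```

Minimizing the mean negative log-likelihood is equivalent to maximizing the log probability of the labels given the observations, which is the criterion stated in the text.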

b. Adam Optimizer
The optimization algorithm is the main tool used for training a machine learning model to minimize its error rate. Therefore, there are two metrics to determine the robustness of this algorithm: the speed of convergence and generalization on new data. Algorithms such as adaptive moment estimation (Adam) [39,40], or stochastic gradient descent (SGD) can cover one or the other metric for optimization.
Adam is an optimizer intended for gradient-based optimization of stochastic objective functions. It combines the advantages of two SGD extensions, root mean square propagation (RMSProp) [40] and the adaptive gradient algorithm (AdaGrad), and calculates individual adaptive learning rates for different parameters. Despite the widespread popularity of Adam, we noted that it fails to converge to an optimal solution under specific settings, such as extreme and even intermediate learning rate values. The comparison between these two optimizers is presented in Figure 9a,b, which shows their influence on the variation of the accuracy and of the cost function. That is why we later opted to use SGD. As a perspective, we will use an optimizer that combines the speed of Adam with the generalization capability of SGD.
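To make the difference between the two optimizers concrete, the sketch below writes out one update of SGD with momentum and one update of Adam for a single scalar parameter. The default hyper-parameter values shown are the commonly used ones and are assumptions here, as are the function names and the toy quadratic objective.

```python
import math


def sgd_momentum_step(theta, grad, velocity, lr=0.01, momentum=0.9):
    """SGD with momentum: the velocity accumulates past gradients."""
    velocity = momentum * velocity - lr * grad
    return theta + velocity, velocity


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: per-parameter step scaled by bias-corrected moment estimates (t >= 1)."""
    m = beta1 * m + (1.0 - beta1) * grad          # first moment (mean of gradients)
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second moment (uncentered variance)
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


def grad_f(theta):
    """Gradient of the toy objective f(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


# First update from theta = 0 with each optimizer.
sgd_theta, sgd_velocity = sgd_momentum_step(0.0, grad_f(0.0), 0.0)
adam_theta, adam_m, adam_v = adam_step(0.0, grad_f(0.0), 0.0, 0.0, t=1)

# Running SGD with momentum for many steps on the toy objective.
theta, velocity = 0.0, 0.0
for _ in range(500):
    theta, velocity = sgd_momentum_step(theta, grad_f(theta), velocity)
```

On the first step, Adam moves by roughly the learning rate regardless of the gradient magnitude, whereas the SGD step is proportional to the gradient.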

The Network Parameters Tuning
In the present work, we extend our method previously reported in [25] by improving its performance and using a larger database containing more smoke and no-smoke videos. The simulations described below helped us to fix the parameters of our network. Fixing the network hyperparameters enables us to have a robust network with the best results. We also had to tune the following parameters: the learning rate Lr, the number of hidden layers, and the number of pre-training epochs. We tested several values of the learning rate and opted for Lr = 0.0006.
The difference between epoch and iteration is that epoch describes the number of times the algorithm sees the entire data while iteration describes the number of times a batch of data passes through the algorithm. Each time the algorithm goes through all the samples in the dataset, an epoch has completed. The variation of pre-training loss according to the number of epochs is shown in [23]. In addition, the number of pre-training epochs is one of the parameters for training. We tested many values of epochs and noted the time of pre-training Tp.
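The distinction between epochs and iterations can be made concrete with a short calculation. The sketch below assumes 14,000 training frames (70% of the 20,000 frames), a mini-batch size of 64, and 400 epochs; the mini-batch size in particular is an assumption made for illustration, and the function names are hypothetical.

```python
import math


def iterations_per_epoch(n_samples, batch_size):
    """Number of mini-batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)


def total_iterations(n_samples, batch_size, epochs):
    """Number of weight updates when one update is made per mini-batch."""
    return iterations_per_epoch(n_samples, batch_size) * epochs


n_train = 20000 * 70 // 100                    # 14,000 training frames
per_epoch = iterations_per_epoch(n_train, 64)  # 219 mini-batches per epoch
total = total_iterations(n_train, 64, 400)     # 87,600 weight updates
```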
To obtain the best training results, we should have the lowest pre-training time with a decreasing cost function and the highest detection rate Dr. In our experiments, we achieved the lowest pre-training time with a number of epochs equal to 100. There is no specific number of hidden layers that guarantees the best results. This number is determined by testing, noting the variation of the training cost function over the epochs, and modifying the number of layers in each test. After fixing the learning rate and the number of pre-training epochs, we have to fix the number of hidden layers, which is one of the most important parameters for pre-training the neural network. We tested various numbers of hidden layers to observe the behavior of the pre-training reconstruction cost, as shown in Figure 10, and to note the influence of the number of hidden layers on the pre-training processing time Tp, on the fine-tuning processing time Tf, and on the detection rate. It is a crucial step to test the efficiency of the developed method with the chosen number of hidden layers. Compared to our simulations presented in [25], we have improved the pre-training and fine-tuning times, which are optimized by choosing an appropriate number of hidden layers and pre-training epochs.
The training time, which is the sum of the pre-training time Tp and the fine-tuning time Tf, varies from 90 s to 130 s across the various tests. In addition, the pre-training and fine-tuning times increase, whereas the detection rate decreases, if we add more than two hidden layers to our network.
After testing various numbers of layers, we notice that the loss function converges after 100 epochs. Consequently, we trained the network using the SGD algorithm with momentum, and then with the Adam optimizer, on 64 mini-batches over 400 epochs. With 400 epochs, the algorithm passes through the entire dataset 400 times, and the weights are updated after each mini-batch. The L2 weight cost on the softmax layer is fixed at 0.1. To prevent overfitting and to achieve good results, noise is added to the inputs. In order to demonstrate the performance of the implemented network, we show in Figure 11 the variation of the detection rate with the number of epochs for training and validation.

The Detection Rate and Loss Analysis
The loss per epoch is computed on the training and validation data sets. Its interpretation depends on how well the model performs on these two sets: the better the model, the lower the loss. Unlike accuracy, the loss is not a percentage. It is defined as the sum of the errors made for each example in the training or validation set. The variation of the training and validation loss with the epoch is illustrated in Figure 11a,b. The loss function is the negative log-likelihood for classification and the residual sum of squares for regression. The first objective in a learning model is to minimize the value of the training loss function with respect to the model's parameters by changing the weight vector values through the backpropagation algorithm.
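As an illustration of the negative log-likelihood used for classification, the following minimal sketch computes the loss of a two-class (no-smoke/smoke) softmax output. The function names, the class ordering, and the example scores are assumptions made for illustration; this is not the authors' implementation.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(batch_logits, labels, average=False):
    """Sum (or mean) of -log p(true class) over a batch.

    labels are class indices; here 0 = no-smoke and 1 = smoke (assumed order).
    """
    loss = 0.0
    for logits, label in zip(batch_logits, labels):
        probs = softmax(logits)
        loss += -math.log(probs[label])
    return loss / len(labels) if average else loss


# Hypothetical scores for three samples: [no-smoke score, smoke score].
batch_logits = [[0.2, 2.5], [1.8, -0.3], [0.0, 0.0]]
labels = [1, 0, 1]
print(negative_log_likelihood(batch_logits, labels))
```

A confident correct prediction contributes a loss close to zero, while an undecided output such as `[0.0, 0.0]` contributes ln 2, which is why a better model yields a lower loss.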
The loss function indicates how well or poorly a model behaves after each iteration of optimization. The accuracy of our model, in turn, is determined once the parameters have been learned and fixed: the test samples are provided to the model, and the number of mistakes is recorded by comparison with the true targets. These tests are presented in Figure 11.

Training and Validation Using the Proposed Method on the Studied Database
In the present section, several datasets are selected, from the easiest to the most difficult in terms of classification. Then, the smoke and no-smoke samples are classified with the proposed method described in Section 3. Table 3 contains the results of testing on a set of 12 videos containing smoke, fogs, or clouds. In order to show the effectiveness and the robustness of the proposed method, the tested videos are chosen as follows: the first six tests in the database represent videos containing only smoke. The next three tests are videos containing clouds, fogs, and moving people. The last three tests illustrate videos without smoke. We note here that there is no standard database for smoke detection. Therefore, our proposed method of smoke detection is compared to the work of W. Yuanbin [18], denoted M1 and discussed in the state of the art, which uses the SVM as a classifier. The results of the comparison on the same database are given in Table 3, where the accuracy, precision, recall, and F1 score values are presented. These criteria are defined as

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{20}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{21}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{22}$$

$$\text{F1 Score} = 2 \, \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{23}$$

where TP is the number of true positive frames, TN is the number of true negative frames, FP is the number of false positive frames, and FN is the number of false negative frames.
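The four criteria can be computed directly from the frame counts. The sketch below is a minimal illustration; the function name and the example counts are hypothetical, and returning zero when a denominator vanishes is a convention assumed here, not one stated in the text.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, precision, recall, and F1 score from frame counts.

    A ratio whose denominator is zero is reported as 0.0 (assumed convention).
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for a video containing smoke.
print(classification_metrics(tp=90, tn=80, fp=10, fn=20))

# Hypothetical counts for a video without smoke: no positive frames exist,
# so precision, recall, and F1 fall back to 0.0.
print(classification_metrics(tp=0, tn=50, fp=0, fn=0))
```

For a video without any smoke frame there are no true positives, so under this convention precision, recall, and F1 score are zero while the accuracy is driven entirely by the true negative and false positive frames.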
In addition, the intersection over union (IoU) metric is reported in Table 3 as an evaluation of the smoke localization provided by the proposed segmentation method. It is the area of overlap between the predicted segmentation (Spred) and the ground truth (Strue), divided by the area of union between these two segmentations:

$$\text{IoU} = \frac{|S_{pred} \cap S_{true}|}{|S_{pred} \cup S_{true}|}$$

The IoU ranges from 0 to 1 (0-100%), with 0% signifying no overlap and 100% signifying perfectly overlapping segmentation. The processing time of smoke detection using the proposed method is also presented in Table 3. Table 3 shows that the first six tests, containing only smoke, have fairly high smoke detection values, ranging from 90% to 96% using the proposed method (M2). For the same videos, the accuracy of method M1 lies in the range 82-94%. For the three tests containing smoke, moving people, or clouds, we found that the presence of these objects decreases the detection rate. The smoke detection values in these cases range from 89% to 91%. For the last three tested videos, which do not contain smoke, the methods do not detect smoke and the detection values are zero. In this case, the accuracy and the detection rate depend essentially on the true negative frames and false positive frames, given the absence of smoke in the videos. As shown in the obtained results, and after several tests comparing the three smoke detection methods, our proposed method generally has the highest number of true positive frames and the lowest number of true negative frames. The presence of constraints such as moving people, fogs, or clouds increases the number of true negative frames. Consequently, the detection rate and accuracy decrease compared to videos with smoke only. Additionally, the processing time of the videos containing smoke varies between 0.43 s and 0.82 s, depending essentially on the total number of frames, the number of hidden layers, and the number of nodes.
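The IoU between a predicted smoke mask and its ground truth can be sketched as follows. This is a minimal illustration on binary masks stored as nested lists; the function name, the toy masks, and the value returned when both masks are empty are assumptions, not part of the described method.

```python
def intersection_over_union(pred_mask, true_mask) -> float:
    """IoU of two binary masks of identical shape (nested lists of 0/1).

    If both masks are empty, 0.0 is returned (assumed convention).
    """
    intersection = 0
    union = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            p, t = bool(p), bool(t)
            intersection += int(p and t)  # pixel marked as smoke in both masks
            union += int(p or t)          # pixel marked as smoke in either mask
    return intersection / union if union else 0.0


# Toy 2 x 3 masks (hypothetical): 1 = smoke pixel, 0 = background.
predicted = [[1, 1, 0],
             [0, 1, 0]]
ground_truth = [[1, 0, 0],
                [0, 1, 1]]
print(intersection_over_union(predicted, ground_truth))  # 0.5
```

In this toy example two pixels are shared and four pixels are covered by at least one mask, which gives an IoU of 0.5.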

Comparison of Smoke Detection Results Using Support Vector Machine and Deep CNN
In the smoke detection methods based on SVM [8,9,19,33], smoke regions are extracted by the Gaussian mixture model. Static and dynamic features are then extracted, and the feature vectors are classified by the SVM for smoke recognition, after fixing the hyperparameters of the SVM. This approach is compared with our proposed method on the same dataset. We use the SVM following the smoke detection method developed by Yuanbin [18], on the same data, with texture features as inputs.
Furthermore, since the deep CNN used by Pundir et al. [28] remains the best alternative for smoke detection, the performance of the proposed method is compared to [28] in this section. Our goal is not to beat the robustness criteria values achieved by this method, but to propose an alternative for smoke detection with a simpler architecture. The deep CNN-based method extracts smoke features using two deep learning frameworks and classifies smoke and non-smoke regions.
The first deep learning framework uses a superpixel algorithm to extract smoke features (color, texture, disorder, sharp edges, etc.). The second deep learning framework computes the smoke motion feature after applying optical flow [41] to capture the chaotic motion of smoke. The features extracted from both frameworks are presented to the deep CNN and are combined to train the SVM [33].
The training process of the deep CNN method is summarized in Figure 12 [28]. Unlike this method, our proposed smoke detection technique only feeds the chosen feature vector to the DBN for classification and directly obtains the smoke and no-smoke probabilities (Figure 5). Apart from its complexity, the method developed in [28] requires a large amount of data to achieve good classification results, and its training process takes a lot of time, especially if the computer is not equipped with a good GPU. On the other hand, the proposed smoke detection method does not require as much data to achieve a good classification rate, and it is able to locate the smoke before classifying the smoke and no-smoke areas.

Robustness of the Proposed Method in the Noisy Case
The present work is evaluated on a challenging smoke dataset. Despite its high performance, the method is affected by noise such as clouds and moving people, which slightly decreases the detection rate in some cases.

Figure 12. Block diagram of the Pundir method (smoke detection and classification with two deep learning frameworks) [28].
Thus, we test the robustness of the implemented work by adding noise to the inputs and observing the behavior of the detection process. In fact, noise is traditionally added to the inputs, but it can also be added to weights, gradients, and even activation functions.
Adding noise to the inputs makes the input space smoother and easier to learn. It can also make the training process more robust and reduce the generalization error. Moreover, adding noise expands the size of the training dataset, so that adding noise to input samples is a simple form of data augmentation. At first, we propose to add Gaussian noise to test the robustness of the proposed method. This allows us to use variance-stabilizing methods on smoke inputs. The results of the additive noise are presented in Table 4. The noise level is quantified by the signal-to-noise ratio (SNR) value [42]. Three levels of noise are added to the inputs: low noise (SNR = 20 dB), medium noise (SNR = 5 dB), and high noise (SNR = 1 dB).
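The following minimal sketch shows one way to add zero-mean Gaussian noise at a prescribed SNR. It assumes the common definition SNR = 10 log10(P_signal / P_noise) with the signal power taken as the mean square of the input; the exact definition used in [42], the function names, and the synthetic input are assumptions made for illustration.

```python
import math
import random


def add_gaussian_noise(signal, snr_db, rng=None):
    """Return a noisy copy of `signal` with approximately the requested SNR (dB)."""
    if rng is None:
        rng = random.Random()
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)  # standard deviation of the added noise
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR (dB) between a clean input and its noisy version."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# Synthetic stand-in for a feature vector (hypothetical input).
clean = [math.sin(0.01 * i) for i in range(20000)]
rng = random.Random(0)
for level in (20, 5, 1):  # low, medium, and high noise levels in dB
    noisy = add_gaussian_noise(clean, level, rng)
    print(level, round(measured_snr_db(clean, noisy), 2))
```

The lower the SNR, the larger the noise variance relative to the signal power, so the 1 dB case corresponds to noise almost as strong as the input itself.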
Finally, to show the difference between the aforementioned classifiers, we apply these methods to videos containing smoke (without moving people and fogs), and we calculate the following classical criteria: accuracy, precision, recall, and F1 score, defined in Equations (20)-(23), as well as the training time for each method. The results of this experiment are shown in Table 5. The obtained values of recall, precision, and F1 score demonstrate the robustness of the proposed method. In addition, the criteria values are very close to those obtained with the deep CNN classifier and better than those of the SVM classifier, even in the presence of noise. The difference between our technique and the deep CNN classification technique is that our method can easily localize the smoke regions with a feature vector fed into a simple architecture after extracting the smoke features: motion, color, and energy analysis. This advantage allows us to classify smoke zones more easily and rapidly than the deep CNN classifier.
In fact, the deep CNN models [28] found in the literature for smoke detection are usually very deep, with many layers, and are inspired by models designed for multi-class classification.
Moreover, deep CNN requires a lot of computation time, as well as a lot of data that is sometimes not easily available. DBNs have historically demonstrated their usefulness through Hinton's work in the 2000s [1,2] and have been successfully used for classification problems.
The deep CNN based method extracts smoke features using two deep learning frameworks and classifies smoke and non-smoke regions. This aforementioned technique requires a high computation time.
Our proposed method uses a generative model combined with logistic regression. Additionally, unlike CNN-based methods, which are fully supervised, we have moved towards a semi-supervised approach, in which we exploit the ability of the DBN to obtain a high-level hierarchical representation of the tested data.
The proposed method achieves high values of accuracy, precision, recall, and F1 score compared with the SVM-based smoke detection techniques discussed in the state of the art. Apart from this advantage, we can locate and classify regions of interest at the same time, and we have calculated the IoU as a smoke localization criterion, which was not considered in the works mentioned in the state of the art. The real value of our work is to locate and classify regions of interest with good performance metrics while limiting the computation time and resources needed for training and testing, bearing in mind that the main purpose of smoke extraction is to locate and classify smoke with minimal execution time and resources.
After using many assessment criteria, we conclude that the proposed method is far more effective than other widely used methods for smoke detection, owing to its robustness and accuracy. In particular, the proposed method yields good classification and detection results, especially in noiseless conditions. In the presence of noise, these results are slightly affected.

Conclusions
In this paper, a novel smoke detection and localization technique is proposed by combining different smoke features: color, motion, and energy. In contrast to current methods, we prepare our training inputs so as to localize the smoke regions and also to classify smoke and no-smoke regions using a DBN classifier. In addition, the proposed system can be used for real-time smoke detection. The processing time measured on our dataset varies from 0.43 s to 0.82 s, depending essentially on the total number of frames, the number of hidden layers, and the number of nodes.
The advantages of the proposed method with respect to other smoke detection methods can be summarized in two points. The first one is the easy and fast localization of the smoke regions, with an IoU ranging from 0.85 to 0.94 for videos containing smoke regions. The proposed method can easily and rapidly localize and classify smoke and no-smoke regions, compared to conventional smoke detection methods and to deep CNN, which requires a lot of computation time as well as a lot of data that is sometimes not easily available. In addition, we have worked on a challenging database by adding to the smoke and no-smoke videos further inputs containing moving people, clouds, and fogs. Compared to most smoke detection methods using deep CNN [28], the proposed method has the highest values of detection rate, which can reach 96%, even in a noisy context. We conclude that our method is more reliable than many other methods and can provide a high detection rate. Moreover, the capacity to process a huge number of frames is a real advantage of this approach. In future work, we propose the use of a dual DBN to improve the detection rate and the smoke localization in each frame. As a perspective, we will use an optimizer that combines the speed of Adam with the generalization capacity of SGD on new data. A possible improvement to the developed system is the incorporation of AdaBound to substantially improve the detection rate, together with more images to train the network for higher accuracy.