Automatic Fire and Smoke Detection Method for Surveillance Systems Based on Dilated CNNs

Abstract: The technologies underlying fire and smoke detection systems play a crucial role in ensuring and delivering optimal performance in modern surveillance environments. In fact, fire can cause significant damage to lives and properties. Because the majority of cities have already installed camera-monitoring systems, we were encouraged to take advantage of the availability of these systems to develop cost-effective vision detection methods. However, this is a complex vision detection task from the perspective of deformations, unusual camera angles and viewpoints, and seasonal changes. To overcome these limitations, we propose a new method based on a deep learning approach, which uses a convolutional neural network that employs dilated convolutions. We evaluated our method by training and testing it on our custom-built dataset, which consists of images of fire and smoke that we collected from the internet and labeled manually. The performance of our method was compared with that of methods based on well-known state-of-the-art architectures. Our experimental results indicate that the classification performance and complexity of our method are superior. In addition, our method is designed to generalize well to unseen data, which reduces the number of false alarms.


Introduction
Despite the rapid growth of technologies and smart systems, certain problems remain unsolved or are solved with methods that deliver poor performance. One of these problems is the unexpected outbreak of a fire, an abnormal situation that can rapidly cause significant damage to lives and properties. According to the Korea Statistical Information Service, the National Fire Agency recorded that, during the three years from 2016 to 2018, 129,929 fires occurred in South Korea, resulting in 1020 deaths, 5795 injuries, and damage to properties estimated at USD 2.4 billion [1].
The latest technological advancements in sensors and sensing technologies have inspired businesses to determine whether these improvements can help to reduce the damage and harm caused by fire. This is the most frequent and widespread threat to public and social development as well as to individuals' lives. Although fire prevention is the top priority to ensure fires do not occur in the first place, it is nonetheless essential to spot fires and to extinguish them before they have serious consequences. In this regard, a large number of methods were introduced and tested for early fire detection to reduce the number of fire accidents and the extent of the damage. Accordingly, different types of detection technologies in automated fire alarm systems have been formulated and are widely implemented in practice.
The main contributions of this study are as follows:
(1) We propose a CNN-based approach that uses a dilated CNN to eliminate the time-consuming effort of designing handcrafted features, because our method automatically extracts a group of practical features for training. As it is essential to use a sufficient amount of data for the training process, we assembled a large collection of images of different scenes depicting fire and smoke obtained from many sources. Images were also selected from a well-known dataset [9]. Our dataset is available for further research.
(2) We used dilated convolutional layers to build our network architecture and briefly explain the principles thereof. Dilated convolution makes it possible to avoid a much deeper network, because it enlarges the receptive field and thereby helps to learn larger features while skipping smaller ones.
(3) Small window sizes are used to aggregate valuable values from fire and smoke scenes. The use of smaller window sizes in deep learning is known to enable smaller but complex features in an image to be captured, and it offers improved weight sharing. Therefore, we decided to use a smaller kernel size for the training process.
(4) We determined the number of layers that is well suited to this task. Four convolutional layers were employed, because an excessive number of layers makes the model unnecessarily deep. This choice reflects the fact that the task is a simple binary classification rather than a classification over a very large number of classes, so employing many layers would exacerbate the overfitting problem. Section 5 demonstrates that overfitting does occur. By contrast, other recent studies used a larger number of layers, mostly six [6].
The remainder of this paper is organized as follows. In Section 2, information about fire and smoke detection approaches is introduced. The features of our custom dataset are presented in Section 3. A comprehensive explanation of our proposed method is provided in Section 4. In Section 5, we discuss all the experimental results. Section 6 highlights a few limitations of the proposed method. Finally, Section 7 concludes the manuscript with final remarks.

Computer Vision Approaches for Fire and Smoke Detection
Many researchers who studied traditional fire and smoke detection systems focused on extracting crucial features from images. Many of these investigations focused on detecting geometrical characteristics of flames [10,11] and fires in the images [12,13]. For example, Bheemul et al. [11] suggested an efficient approach for extracting edges by detecting changes in the brightness of an image of a fire. Jian et al. [13] presented an enhanced edge detection operator, a Canny edge detector, which uses a multi-stage algorithm. However, the abovementioned computer vision-based methods were only applicable to images of simple and steady fires and flames. Other researchers applied methods based on the fast Fourier transform (FFT) and the wavelet transform to analyze the contours of forest fires in video scenes [14]. Previous research has indicated that these approaches are suitable only under certain conditions. Changes in fires were also analyzed using the red-green-blue (RGB) and hue-saturation-intensity (HSI) color models. For example, Chen [15] used a color-based approach to detect the discrepancy among sequential images. Celik et al. [16] proposed a generic rule-based approach that uses the YCbCr color space to separate luminance from chrominance to identify a variety of smoke and fires in images. Yu et al. [17] also used motion and color features simultaneously for detection purposes. The use of YCbCr can increase the detection rate of fire in images compared to RGB, because it can separate luminance more effectively than the RGB color space [18]. However, color-based fire and smoke detection methods are not reliable in practice, because these approaches are not independent of environmental factors such as lighting, shadows, and other distortions. In addition, color-based approaches are vulnerable to the short-term dynamic behavior of fire and smoke, even though fire and smoke also exhibit longer-term dynamic behavior.
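To make the luminance and chrominance separation concrete, the following is a minimal sketch of a per-pixel RGB to YCbCr conversion using the full-range ITU-R BT.601 weights (the variant used in JPEG). The function names and the chrominance rule in looks_like_flame are hypothetical illustrations of what a color-based rule can look like; they are not the rule sets of [15-18].

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to full-range YCbCr (BT.601 / JPEG weights)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b                 # luminance
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b    # blue-difference chroma
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b    # red-difference chroma
    return y, cb, cr


def looks_like_flame(r, g, b):
    """Hypothetical rule: red chroma and luminance both exceed blue chroma."""
    y, cb, cr = rgb_to_ycbcr(r, g, b)
    return cr > cb and y > cb


print(looks_like_flame(255, 140, 0))   # orange, flame-like pixel
print(looks_like_flame(80, 120, 220))  # blue, sky-like pixel
```

A fixed rule of this kind depends on lighting and camera settings, which is the weakness of color-based methods noted above.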
The disadvantage of these methods is that they require specific knowledge to extract and explore the features of fire and smoke in images. In addition, almost all conventional fire detection methods use color-based, edge-detection, or motion-based techniques, and these approaches are infeasible for analyzing tiny and noisy images. Therefore, these methods are limited, because they rely on a small set of characteristics of fire and smoke in images, such as motion, color, and edges. Furthermore, extracting these characteristics is also challenging when the quality of the video or image is poor.
Atmosphere 2020, 11, 1241

Deep Learning Approaches for Fire and Smoke Detection
In recent years, deep learning has emerged significantly because of advances in hardware, the ability to process large-scale data, and substantial advances in the design of network structures and training strategies. Additionally, deep learning has been effectively implemented in various fields such as natural language processing (NLP), network filtering, games, medicine, and vision. Several deep learning applications have been shown to outperform human experts in certain cases [7,19,20]. In vision-related tasks, computers have already achieved human-level performance. Several studies have been carried out to detect fire and smoke in images using deep learning approaches to enhance the reliability and results of these methods.
These approaches for fire and smoke detection differ from those based on traditional computer vision in various ways. First, deep learning performs automatic feature extraction, using a massive amount of training data and the discriminative features learned by the neural network to detect a fire or smoke. Another advantage is that deep neural networks can be flexibly and successfully implemented in various fields, and the time otherwise spent on feature extraction can instead be devoted to constructing a robust dataset and an appropriate network structure.
Recently, Abdulaziz [6] introduced a fire and smoke detection network with limited data based on CNNs and used it with a generative adversarial network (GAN) [21] for augmentation purposes. Instead of using the traditional activation function, Abdulaziz et al. employed adaptive piecewise linear units as an activation function. Abdulaziz [6] conducted a number of experiments to show an increase in detection. Sebastien et al. [22] also suggested a model that uses a multilayer perceptron-type neural network to learn features by an iterative process of learning. In addition, Muhammad et al. [23] experimented with different fine-tuned versions of various CNN models, such as AlexNet [7], SqueezeNet [24], GoogleNet [25], and MobileNetV2 [26]. Our proposed model, which allows fire scenes to be semantically understood, is based on the SqueezeNet architecture. Although the abovementioned deep-learning-based models improved the fire detection accuracy with minimum false alarms, the complexity and size of the model are comparatively large, that is, 238 MB [23]. All of these studies utilized Foggia's dataset [9] as the main source of their training data. Ba et al. [27] proposed a new convolutional neural network (CNN) model, SmokeNet, which incorporates spatial and channel-wise attention in the CNN to enhance feature representation for scene classification. In this study, we show that using a small kernel size and a small number of layers can improve the performance and generalizability on the current task. In fact, by conducting a number of experiments, we show that this approach can overcome the overfitting problem for a small number of data samples.
In the image/video classification fields, CNN has outpaced and showed superior performance compared with other approaches because of its powerful feature extraction techniques and robust model structure. Consequently, in terms of performance, traditional computer vision methods are being replaced by deep learning methods. Our proposed method adopts a model to classify fire or smoke in images/videos. Misclassification of images or videos leads to an increase in false fire alarms because of variations in perspective distortions, shadows, and brightness. We detected images showing fire and smoke using a model based on dilated CNNs to learn and extract the robust features of a frame.

Dataset
One of the main limitations of vision-related tasks is the insufficiency of robust data for evaluating and analyzing the suggested method. To find a suitable dataset, we examined datasets that were used in prior studies. One of the datasets, provided by Foggia et al. [9], contains fourteen fire and seventeen non-fire videos. However, this video data is not diverse enough to be suitable for training, and we cannot expect it to deliver good fire detection performance in realistic scenarios. Thus, we attempted to create a diverse dataset by extracting frames from fire and smoke videos and collecting images from internet sources. Our training set consists of fire images sampled from Foggia's dataset and from images on the internet. Images of smoke taken from different internet sources diversified our dataset. We extracted frames from videos and randomly sampled a few images from each video to build our final fire-smoke dataset for use in this study. Table 1 contains information on the number of fire and smoke images in our dataset. Examples of images that were collected are shown in Figure 1. The red-green-blue (RGB) images are stored in JPG format at a size of 100 × 100 pixels.

Brief Summary of Well-Known Network Architectures
We propose a novel model for fire and smoke detection. The model is constructed on the basis of dilated convolutions, and its performance was evaluated against the following well-known architectures: AlexNet [7], VGGNet [6], ResNet50 [28], and Inception V3 [29]. In 2012, Krizhevsky et al. published work [7] that was a turning point for vision-related tasks in deep learning. This work was an advanced variant of LeNet [30] and won the ImageNet LSVRC-2012 [31] competition. The AlexNet model is constructed with five convolutional layers, and a max-pooling operation is applied after the convolutional layers. The output of the last fully connected layer feeds a 1000-way softmax, which produces a probability distribution over the 1000 class labels. The success of AlexNet started a revolution in deep learning, and subsequently VGGNet and the inception architecture of GoogLeNet achieved similarly high performance in the ImageNet LSVRC-2014 [31] classification challenge, where VGGNet took second place after GoogLeNet. In 2015, ResNet introduced a "bottleneck" architecture that employs skip connections to pass the input from the previous layer to the next layer without changing it. This enabled a deeper network, which became the winner of ImageNet LSVRC-2015 [31] as well as the winner of MS COCO 2015 [32]. Although the aforementioned network models perform well on such tasks, they are too deep for our two-class classification task. This motivated us to build a model with high robustness by emphasizing the extraction of useful and specific characteristics of images. In our case, it is necessary to detect whether fire or smoke appears in a given image.

Dilated Convolution
The purpose of utilizing convolutions is to aggregate learnable features from the input images. In computer vision, several different filters are used to extract features through convolution. Each type of filter is responsible for extracting different aspects or features from the input data, for example horizontal, vertical, and diagonal edges. Correspondingly, a CNN uses convolutional layers to extract different features using various filters whose weights are automatically updated during the learning process. All the extracted or learned features are then merged to make decisions concerning the input data. In addition, convolution takes the spatial relationship of pixels into consideration, which is especially helpful in computer vision tasks. In a recent development [31], an additional hyperparameter referred to as dilation was introduced to the convolutional layer, as illustrated in Figure 2. The convolution operator is adapted to apply the filters in a different manner in convolutional layers. The modified version of the convolution operator is referred to as the dilated convolution operator. The standard (vanilla) convolution is shown in Figure 3. Equation (1) expresses the standard convolution and Equation (2) the dilated convolution.
In the summation, the condition $s + lt = p$ indicates that certain points are skipped during convolution. Furthermore, dilated convolution allows the network to obtain more information from the context and requires less computational time with fewer parameters, and it allows the model to execute faster than a comparable model that uses standard convolution. A common use of dilated convolutions is image segmentation, where each pixel is labeled with its corresponding category. Therefore, the network output needs to have the same size as the input image. This is the origin of the concept of employing convolutions with a dilation rate to solve the described problem. An excellent idea proposed by René et al. [33] is that of multi-scale context aggregation. Convolutional layers with a dilation rate have been implemented in various fields, such as text-to-speech [34] and text interpretation [35]. These methods used dilated convolutions to aggregate multi-scale context features from the input with fewer parameters. The former of these two methods employs dilated convolutions to generate speech and music from a raw audio waveform. Moreover, this method is also applied to recognize speech from a raw audio waveform.
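As a minimal sketch of the two operators, assume the usual definitions $(F * k)(p) = \sum_{s+t=p} F(s)\,k(t)$ for the standard convolution and $(F *_{l} k)(p) = \sum_{s+lt=p} F(s)\,k(t)$ for the dilated convolution, where F is the input, k is the filter, and l is the dilation rate. The one-dimensional case can then be written in Python as follows. The function name and the restriction to one dimension and to "valid" positions are illustrative choices, not part of the proposed method.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid-mode 1-D dilated convolution.

    out[p] = sum over t of signal[p - dilation * t] * kernel[t],
    i.e. the sum over all (s, t) with s + dilation * t = p.
    """
    span = dilation * (len(kernel) - 1)  # distance between first and last tap
    out = []
    for p in range(span, len(signal)):
        out.append(sum(signal[p - dilation * t] * kernel[t]
                       for t in range(len(kernel))))
    return out


ramp = [1, 2, 3, 4, 5, 6]
difference = [1, 0, -1]
print(dilated_conv1d(ramp, difference, dilation=1))  # [2, 2, 2, 2]
print(dilated_conv1d(ramp, difference, dilation=2))  # [4, 4]
```

With a dilation rate of 2, the same three weights span five input positions instead of three, which is how the operator enlarges the context without adding parameters.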

Proposed Network Architecture
As mentioned earlier, our task is not a classification of 1000 classes; hence, we built a model with fewer layers, as shown in Figure 4. All convolutional layers in this architecture use small receptive field sizes (3 × 3), and dilated convolutions with a rate of 2 are employed. The fourth convolutional layer is followed by two fully connected layers with 2024 nodes and a final output layer with two nodes. The architecture of the proposed method is provided in Table 2. The input layer takes input data with a fixed shape of 100 × 100 × 3 (width × height × color channel), and all data points are resized to fit the given shape.
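The layer settings above determine the spatial size of the first feature map. The following is a minimal sketch of that arithmetic using the standard output-size formula with the effective extent of a dilated kernel. It assumes a stride of 1 (which is consistent with the 96 × 96 output reported for the first layer) and one bias per filter; both are assumptions, and the function names are hypothetical.

```python
def conv_output_size(size, kernel, stride=1, padding=0, dilation=1):
    """Spatial output size of a convolution along one axis."""
    effective_kernel = dilation * (kernel - 1) + 1  # extent covered by the dilated kernel
    return (size + 2 * padding - effective_kernel) // stride + 1


def conv_param_count(kernel, in_channels, filters):
    """Trainable weights of a square-kernel layer, plus one bias per filter (assumed)."""
    return kernel * kernel * in_channels * filters + filters


# First layer: 100 x 100 x 3 input, 3 x 3 kernel, dilation 2, "valid" padding (padding=0).
print(conv_output_size(100, kernel=3, dilation=2))  # 96
print(conv_param_count(3, 3, 128))                  # 3584
```

Without dilation, the same 3 × 3 kernel would give a 98 × 98 map; the dilation rate changes the covered extent but not the number of weights.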
The output shape of the first convolutional layer is 96 × 96 × 128. The calculation of the feature map is given by Equation (3) [36]. In this equation, (width × height) is the input shape, ($F_{width} \times F_{height}$) is the filter size, $S_{width}$ and $S_{height}$ are the strides, and $P$ is the padding (chosen as "valid" padding in our case).
We employed a rectified linear unit (ReLU) [37] as the activation function after all four convolutional layers. The mathematical form of the rectified linear unit is expressed in Equation (4). The advantage of a ReLU is that it is cheaper to compute than other nonlinear activation functions; in addition, because its gradient is either 0 or 1, it does not saturate for positive inputs, which mitigates the vanishing gradient problem. After each convolutional layer, we employed a max-pooling layer for downsampling purposes. Max pooling has been reported to be more effective than average pooling for computer vision tasks such as classification, segmentation, and object detection. Our proposed method doubles the number of filters from one layer to the next up to the third convolutional layer: the first layer employs 128 kernels with a dilation rate of 2, the second layer 256 kernels, and the third and fourth layers have the same depth, that is, 512 filters.
A common problem in computer vision is overfitting. To prevent it, we use dropout regularization [38] after the final convolutional layer and after each fully connected layer. AlexNet [7] additionally employs local response normalization, which normalizes over local input regions. Our network architecture is shallower than that of AlexNet, and the amount of data used to train the model is considerably smaller. Therefore, the application of any normalizing technique might lead to the loss of the essential relationship between data points. Finally, we employed sigmoid activation as the activation function of the output, as presented in Equation (5), to indicate the probability of the evaluation result.
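The two activation functions referenced above can be sketched as follows, assuming that Equations (4) and (5) take their standard forms, $f(x) = \max(0, x)$ for the ReLU and $\sigma(x) = 1 / (1 + e^{-x})$ for the sigmoid. The helper names are illustrative.

```python
import math


def relu(x):
    """Rectified linear unit: passes positive inputs, zeroes out the rest."""
    return max(0.0, x)


def relu_grad(x):
    """Gradient of the ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def sigmoid(x):
    """Logistic sigmoid: maps a real-valued score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


print(relu(-2.0), relu(3.0))            # 0.0 3.0
print(relu_grad(-2.0), relu_grad(3.0))  # 0.0 1.0
print(sigmoid(0.0))                     # 0.5
```

The gradient of the ReLU stays at 1 for any positive input, whereas the slope of the sigmoid shrinks toward 0 for large positive or negative inputs, which is why the sigmoid is used here only at the output.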
We trained our model using Keras, a high-level API of the TensorFlow framework, in our experiments. Keras is an open-source neural network library written in Python. The model was trained on a workstation with a 3.4 GHz AMD Ryzen Threadripper 1950X 16-core processor and an NVIDIA GeForce GTX 1080 Ti GPU with 11 GB of memory. During training, data augmentation techniques were also used, and we set the number of epochs and the batch size to 250 and 64, respectively. We employed the stochastic gradient descent (SGD) algorithm [39] to optimize the training process and set the parameters as follows: the momentum was 0.99, the learning rate was $10^{-5}$, and the L2 regularization was $5 \times 10^{-4}$. We used 80% of the data to train the model and the remainder of the data to evaluate the model performance.
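To illustrate how these hyperparameters interact, the following is a minimal sketch of one SGD update with classical momentum and an L2 penalty, written in plain Python rather than Keras. It assumes the penalty has the form $(\lambda / 2)\,w^2$, so that it contributes $\lambda w$ to the gradient; the exact scaling used by a given framework may differ, and the function name is hypothetical.

```python
def sgd_momentum_step(weights, grads, velocity, lr=1e-5, momentum=0.99, l2=5e-4):
    """One update: v <- momentum * v - lr * (g + l2 * w); w <- w + v."""
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g_total = g + l2 * w              # assumed L2 term: gradient of (l2 / 2) * w**2
        v = momentum * v - lr * g_total   # accumulate a decaying sum of past gradients
        new_velocity.append(v)
        new_weights.append(w + v)
    return new_weights, new_velocity


weights, velocity = [1.0], [0.0]
for _ in range(3):
    weights, velocity = sgd_momentum_step(weights, [2.0], velocity)
print(weights, velocity)
```

With a momentum of 0.99, each step keeps 99% of the previous velocity, so a persistent gradient direction is amplified over many steps even though the learning rate itself is very small.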

Investigating the Optimum Method for Fire and Smoke Detection
To analyze the efficiency of the model, we carried out extensive attempts to select the appropriate kernel size, dilation rate, and number of convolutional layers. We used the well-known machine learning library, Keras, built on top of TensorFlow. Initially, we compared two neural network models without dilation and with dilated convolutional layers. Figure 5 provides an indication of the training accuracy.
Atmosphere 2020, 11, x FOR PEER REVIEW 9 of 16 We trained our model using Keras, a high-level API of the TensorFlow framework, in our experiments. Keras is an open-source neural network library written in Python. The model was trained on a workstation with a 3.4 GHz AMD Ryzen Thread ripper 1950X 16-Core Processor and an NVDIA GeForce GTX 1080Ti GPU with 11 GB of memory. During training, data augmentation techniques were also used, and we set the number of epochs and batch size to 250 and 64, respectively. We employed a stochastic gradient descent algorithm (SGD) [39] to optimize the training process and set the parameters as follows: initially, we set the momentum to 0.99, the learning rate was10 , 2, and regularizationwas 5 × 10 . We used 80% of the data to train the model, and the remainder of the data to evaluate the model performance.

Investigating the Optimum Method for Fire and Smoke Detection
To analyze the efficiency of the model, we carried out extensive attempts to select the appropriate kernel size, dilation rate, and number of convolutional layers. We used the well-known machine learning library, Keras, built on top of TensorFlow. Initially, we compared two neural network models without dilation and with dilated convolutional layers. Figure 5 provides an indication of the training accuracy. Figures 5a and 5b show that, after the final epochs, the training accuracy for the model (without dilated convolutional layers) with kernel sizes of three and five was 98.86% and 98.63%, respectively. Compared with the other two models, the model with dilated convolutional layers delivered higher performance on training with 99.60% accuracy. The training and testing accuracies of these networks are provided in Table 3. These results indicate that the training and testing accuracies of the network that implements dilated convolutional layers are higher than those of the other models. One of the contributions of our study is the use of dilated convolutions, as we previously mentioned. We carried out a number of experiments to prove the advantages of using convolutional layers to which dilation is applied instead of using no dilation, as shown in Figure 5. As mentioned previously, a dilation operator is adapted to predict each label for each pixel in the images, because it has the capability of expanding the receptive field without losing coverage.  Figure 5a,b show that, after the final epochs, the training accuracy for the model (without dilated convolutional layers) with kernel sizes of three and five was 98.86% and 98.63%, respectively. Compared with the other two models, the model with dilated convolutional layers delivered higher performance on training with 99.60% accuracy. The training and testing accuracies of these networks are provided in Table 3. 
These results indicate that the training and testing accuracies of the network that implements dilated convolutional layers are higher than those of the other models. One of the contributions of our study is the use of dilated convolutions, as we previously mentioned. We carried out a number of experiments to prove the advantages of using convolutional layers to which dilation is applied instead of using no dilation, as shown in Figure 5. As mentioned previously, a dilation operator is adapted to predict each label for each pixel in the images, because it has the capability of expanding the receptive field without losing coverage. We experimented on models by changing the number of layers to identify the model that performs the best on this task. Figure 6 compares the performance of the model by varying the number of convolutional layers. At first sight, it is obvious which model has the highest accuracy, especially when comparing the models with three and five convolutional layers, the performance of which is similar. According to Figure 6, the plotted line for the model using four layers indicates the highest accuracy relative to the other three models. We experimented on models by changing the number of layers to identify the model that performs the best on this task. Figure 6 compares the performance of the model by varying the number of convolutional layers. At first sight, it is obvious which model has the highest accuracy, especially when comparing the models with three and five convolutional layers, the performance of which is similar. According to Figure 6, the plotted line for the model using four layers indicates the highest accuracy relative to the other three models. With minor variance, the model with four convolutional layers delivers the best performance. Accordingly, the lowest scores belong to the model two convolutional layers. 
The training and testing scores for the model with two convolutional layers are 98.52% and 98.03%, respectively, whereas the training scores for the models with three and five convolutional layers are 99.38% and 99.36%, respectively. These results are provided in Table 4. However, the generalization ability of the networks with three and five convolutional layers is slightly lower than that of the model with four layers. Thus, we adopted four convolutional layers in our proposed method and report its training and testing accuracies accordingly.
We mentioned above that employing small kernel sizes in dilated convolutional layers might improve model performance. Because the kernel size that performs optimally on this task was not known to us in advance, we conducted several experiments to find it. The selected kernel size proved to be the best option among those tested. The performance of the models is compared in Figure 7.
We started by training the model with different kernel sizes, from 3 × 3 to 13 × 13. The curves for kernel sizes 11 and 13 show the lowest training scores, along with lower testing scores. The model with a kernel size of seven demonstrates average performance, as shown in Figure 7. The training scores of models that employ smaller kernel sizes are higher; thus, these models are more efficient in terms of both training and evaluation.
The testing and training accuracies of the models with kernel sizes of 3, 5, 7, 9, 11, and 13 are summarized in Table 5. Although the differences between the respective models are small, the models with smaller kernel sizes performed more accurately on our task. We found a kernel size of 3 × 3 to be the most appropriate option for the fire and smoke detection problem.
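One likely contributor to the efficiency of small kernels is the number of weights per layer, which grows quadratically with the kernel size. The following sketch counts the weights of a single 2D convolutional layer for the kernel sizes in Table 5. The channel width of 64 is a hypothetical value used only for illustration and is not the configuration of our network; `conv2d_params` is an illustrative helper.

```python
def conv2d_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights plus biases of one 2D convolutional layer with a square kernel."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    biases = out_channels
    return weights + biases


# Hypothetical channel width; only the relative growth matters here.
CHANNELS = 64

for k in (3, 5, 7, 9, 11, 13):
    print(f"{k:>2} x {k:<2} kernel: {conv2d_params(k, CHANNELS, CHANNELS):>7} parameters")
```

Note that dilation does not change these counts: a dilated 3 × 3 kernel has the same number of weights as an ordinary 3 × 3 kernel.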

Comparison of Our Network Model with Well-Known Architectures by Conducting Experiments on Our Dataset
Our experiments mainly aimed to evaluate the performance of our proposed model against renowned deep learning models, such as VGGNet, AlexNet, ResNet, and Inception V3, all of which performed exceptionally well with respect to classification in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). We compared the performance of our method in terms of accuracy. The training accuracies of these networks are shown in Figure 8.

As indicated in Figure 8, higher training accuracies were obtained by deeper models. Table 6 provides the detailed training and testing scores of all the models. We trained VGG16, VGG19, and their fine-tuned versions to evaluate their performance on our custom dataset. The results in Table 6 lead to a few conclusions. The highest training and testing scores of 99.6% and 99.53%, respectively, were achieved by our proposed network model, whereas the scores for VGG19 (fine-tuned) were the lowest, i.e., 94.6% and 94.88%, respectively. The performance of VGG16 was higher than that of VGG19, even though it is a less deep network. Among the well-known architectures, the highest accuracies were obtained by Inception V3 and ResNet50.
We additionally calculated other metrics, namely the F1 score, precision, and recall. The F1 score is the harmonic mean of precision and recall; hence, it accounts for both false positives and false negatives. Intuitively, it is not as easy to understand as accuracy, but it is often more informative. Accuracy is best used when false positives and false negatives have similar costs. If these costs differ, it is more useful to consider both precision and recall. Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. Recall is the ratio of correctly predicted positive observations to all observations in the actual class, as shown in Equation (6). As indicated in Table 6, the F1 score of the proposed method is the highest overall, with a lower score on recall and precision. In particular, the F1 score of the proposed method is 0.9892, compared with the lowest result of 0.7513, which was recorded by AlexNet. We calculated the precision and recall rates as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN} \tag{6}$$
where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively.
Figure 8 (a, b). The x- and y-axes show the number of epochs (250) and the training accuracies of the models, respectively.
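The precision, recall, and F1 definitions described above can be sketched in a few lines. The function name `precision_recall_f1` and the counts in the example are hypothetical and serve only to illustrate the arithmetic; they are not values from Table 6.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, chosen only to illustrate the calculation.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

Because the harmonic mean is dominated by the smaller of its two arguments, a model cannot obtain a high F1 score by maximizing only precision or only recall.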
The computational cost of deeper models, such as AlexNet, VGGNet, and ResNet50, is high, and they require much more computational time than shallower models. For example, the AlexNet and VGGNet architectures have 60 M and 138 M parameters, respectively, whereas our proposed model has only 24 M parameters. Our experiments showed that, for our two-class classification task, much deeper networks generalized less well. For instance, the well-known ResNet50 and Inception V3 network architectures do not generalize well to our custom-built dataset. Moreover, deeper networks are more complex, which increases both training time and prediction time.
The deeper networks were much more time consuming in both training and prediction (Table 7). The prediction time in the table is given in seconds, and the results reflect the complexity of each model. Our model spent less time predicting the entire test set, namely 1.9 s; in contrast, Inception V3 took the longest of all the models, 8.7 s.
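A prediction time over an entire test set, as reported above, can be measured with a wall-clock timer around the prediction loop. The sketch below is a generic illustration rather than our measurement script: `time_prediction` is a hypothetical helper, and `dummy_predict` is a stand-in for a trained model's prediction call.

```python
import time


def time_prediction(predict, samples):
    """Run `predict` on every sample and return (outputs, elapsed seconds)."""
    start = time.perf_counter()
    outputs = [predict(sample) for sample in samples]
    elapsed = time.perf_counter() - start
    return outputs, elapsed


def dummy_predict(sample):
    """Stand-in classifier: label 1 if the feature sum exceeds 1.0, else 0."""
    return int(sum(sample) > 1.0)


# Tiny hypothetical test set of feature vectors.
test_set = [[0.2, 0.3], [0.9, 0.8], [0.5, 0.6]]
labels, seconds = time_prediction(dummy_predict, test_set)
print(f"predicted {len(labels)} samples in {seconds:.6f} s")
```

For a fair comparison between models, the same test set, batch size, and hardware should be used for every measurement.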

Limitations
The proposed method may make errors in the early stages when the pixel values in the fire and smoke images are very close to those of the background. Our method mainly experiences this problem when the weather is cloudy. In an attempt to overcome this problem, we are currently experimenting with datasets containing satellite imagery of smoke (USTC_SmokeRS), which consist of RGB images from more complex land covers. In our research area, dataset images play a significant role in smoke scene detection.

Conclusions
We presented a new robust deep learning model architecture for classifying fire and smoke images captured by a camera or nearby surveillance systems. The proposed method is fully automatic, requires no manual intervention, and is designed to generalize well to unseen data, which reduces the number of false alarms. Our contributions comprise four main features: the use of dilation filters, a small number of layers, small kernel sizes, and a custom-built dataset, which was used in our experiments. This dataset is expected to be a useful asset for future research that requires images of fire and smoke. However, we are far from concluding that this is the best solution for this task, because all experiments were conducted on our custom dataset. We conducted several experiments to demonstrate that employing a dilation operator and a small number of layers can boost the performance of the method by extracting valuable features. Moreover, using a small number of layers and less deep networks would allow the model to be used on devices with low computational power. During the experiments, we assessed the performance and generalization ability of well-known CNN architectures in comparison with those of our proposed method. The experimental results showed that the performance of our proposed method on our dataset was slightly superior to that of well-known neural network architectures.
In future work, we plan to build a lightweight model with robust detection performance that can be deployed on embedded devices with low computational capabilities.