Flame Detection Using Appearance-Based Pre-Processing and Convolutional Neural Network

It is important for fire detectors to operate quickly in the event of a fire, but existing conventional fire detectors sometimes fail to work properly, or false alarms for non-fire events occur frequently. Therefore, in this study, HSV color conversion and Harris corner detection were used in the image pre-processing step to reduce the incidence of false detections. In addition, among the detected corners, the vicinity of the corner points facing the upward direction was extracted as a region of interest (ROI), and the presence of fire was determined using a convolutional neural network (CNN). These methods were designed to detect the appearance of flames based on their top-pointing shape, which resulted in higher accuracy and higher precision than when raw input images were used directly in conventional object detection algorithms. This also reduced the false detection rate for non-fires, enabling high-precision fire detection.


Introduction
In the case of fire, death is more often caused by the inhalation of toxic substances such as carbon monoxide than by direct injury caused by burns. Therefore, it is important to detect and respond to the occurrence of a fire in the early stages. Additionally, because the precise operation of detectors is directly related to the saving of human life, the study of new fire detection methods with higher performance than conventional sensor-based detection is urgently needed. Existing sensor-based fire detectors include flame detectors that detect infrared (IR) and ultraviolet (UV) energy, and heat detectors that detect heat sources.
However, these sensor-based fire detection methods are limited in indoor environments, and the more sensitive such detectors are to IR, UV, and heat, the more easily they react to other factors, resulting in unnecessary manpower consumption due to false alarms. In addition, there are limitations such as the inability to provide information about the location and size of fires, and frequent false alarms occur either when the physical sensor is close to the source of the fire or, in contrast, when factors make the operation of the fire detector too sensitive. Moreover, when fires occur over wide areas such as large factories and mountains, early fire detection is difficult with existing sensor-based fire detection systems.
Therefore, to address these existing problems, this study aimed to present a supplemented image preprocessing method to detect fire hazards quickly and automatically, and a flame detection method that reduced the misdetection rate via CNN. Machine learning is a branch of artificial intelligence in which computers train on their own to develop predictive models, and deep learning is a machine learning method using deep neural network theory [1][2][3].
Deep learning has shown excellent performance in various fields such as pattern recognition, computer vision, speech recognition, and translation. In particular, the CNN used in this work is an artificial neural network built on deep learning technology, which has the advantage of training while maintaining the spatial structure of the image. This avoids the loss of features from the original data that occurs when an image is flattened into one-dimensional data for a conventional fully connected layer [4][5][6][7][8].
In the field of artificial intelligence, object recognition and region-based object detection using deep learning are active areas of research, and deep learning image recognition algorithms rank at the top of competitions such as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Among deep learning-based models, algorithms that not only classify images but also detect the regions where objects exist within them, such as the Single Shot MultiBox Detector (SSD) and Region-based Convolutional Neural Network (R-CNN), have recently emerged [9][10][11][12][13][14].
However, even if the latest deep learning-based image recognition or object detection algorithms are used, without separate robust image pre-processing the results may be less accurate than expected in areas requiring high reliability, such as fire detection. Therefore, good results can be obtained if unnecessary background regions are removed as much as possible through image pre-processing before detection is attempted with a deep learning model [15][16][17][18]. For example, Zhong et al. [19] filtered the predicted area where a flame exists in the input image through an RGB model corresponding to flame colors, and detected the flame with a CNN applied to that area. Cai et al. [20] similarly performed color-based pre-processing filtering on the input images, extracting flame regions via HSV color transformation, YCbCr filtering, and Canny edge detection. Additionally, they minimized overfitting by removing the fully connected layer traditionally used as the last layer of a convolutional neural network and applying global average pooling instead.
The pre-processing method used in this study first converted the input image to HSV color channels and filtered only the color distribution of the flame. Moreover, when detecting the corner point using the Harris corner detector, pre-processing was performed based on the appearance characteristics of finding the sharp part of the flame. In other words, only the top-direction corner points, not the side or bottom corner points, were detected and used as a ROI; finally, a performance evaluation of flames from real fire images via a CNN was conducted.

Overview of Proposed Approach
The most basic flame detection method using a CNN trains iteratively on a flame dataset and then classifies a new input image as flame or non-flame. However, classification using a CNN has limitations: when a whole image is simply classified, accuracy may decrease when multiple objects exist in one image. In addition, repeated training on many datasets cannot significantly reduce the ratio of false negatives or false positives, resulting in reduced accuracy. Therefore, an appropriate image pre-processing process was applied to offset these problems and improve the detection accuracy for flames.
In this study, the proposed method to increase the detection accuracy of a flame was divided into two main procedures. First, flame and non-flame image datasets were collected and trained using Inception-V3 among the CNN models, as shown on the left in Figure 1. When classifying flames directly from input images without any pre-processing, it is difficult for a model trained on flames to reliably classify flame or non-flame images, because the images contain non-flame objects or unnecessary background areas. Therefore, the first image pre-processing step separates objects from the input image by filtering the HSV color regions where flames exist. Subsequently, the second image pre-processing step uses a Harris corner detector to detect corner points. Only corner points oriented in the 45 to 135 degree direction, reflecting the sharp shape at the top of a flame, were retained; bounding boxes were drawn around dense clusters of such points, and each boxed area was extracted as an ROI.


Image Pre-Processing
In this study, HSV color transformation was performed with the first image preprocessing. The HSV color model can handle color features in a similar way to how humans perceive colors, so it can be used to identify colors of objects in many applications, in addition to image pre-processing. These properties of HSV color models make them ideal tools for developing image processing algorithms based on color sensing properties [21,22].
In HSV color models, hue represents the distribution of colors based on the longest wavelength of red, and saturation represents the degree to which pure colors contain white light.
Value is used to measure the intensity of light. Because value is a single independent component whose range can be controlled separately, algorithms built on it can be made robust to lighting changes.
In Equation (1), a pixel value of 1 for H, S, and V indicates that the pixel falls within the color space where flames could exist at that image location, and pixels in that range were extracted as the ROI of the first image pre-processing. A pixel value of 0 means the pixel is classified as a non-flame area. Figure 2 shows the HSV color conversion: Figure 2a is the original flame image, and Figure 2b is the resultant image after applying the HSV color conversion. However, even after HSV color conversion, responses from items other than flames, such as leaves or other objects containing light yellow, remained. Therefore, to filter these out additionally, a Harris corner detector was used for the second image pre-processing.
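The first pre-processing step can be sketched as follows. This is a minimal per-pixel illustration using only the standard library; the specific H, S, and V bounds below (red-to-yellow hue, high saturation and brightness) are illustrative assumptions, since the paper does not list the exact thresholds of Equation (1).

```python
import colorsys

def flame_color_mask(rgb_image):
    """First pre-processing step (cf. Equation (1)): mark a pixel 1 when its
    HSV values fall inside a flame-like color range, 0 otherwise.

    rgb_image is a list of rows of (r, g, b) triples in 0..255.
    The bounds below are illustrative, not the paper's exact values.
    """
    mask = []
    for row in rgb_image:
        mask_row = []
        for (r, g, b) in row:
            h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            # Hedged bounds: hue up to ~60 deg (red-yellow), strong saturation,
            # high brightness -- typical of flame pixels.
            is_flame = h <= 60 / 360 and s >= 0.5 and v >= 0.7
            mask_row.append(1 if is_flame else 0)
        mask.append(mask_row)
    return mask
```

In practice this filtering would be done with a vectorized call such as OpenCV's `inRange` on an HSV-converted image; the loop form above only makes the thresholding of Equation (1) explicit.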
First, when there is a reference point (x, y) in the image and the window is shifted by (u, v) from the reference point, the change can be expressed as Equation (2), where I is the brightness and (x_i, y_i) are the points inside the Gaussian window W.

A Taylor series expansion allows the area shifted by (u, v) to be organized as in Equation (3).

The first-order derivatives in the x and y directions, I_x and I_y, can be obtained via convolution using S_x, the Sobel x kernel, and S_y, the Sobel y kernel, as shown in Figure 3.

If Equation (3) is substituted into Equation (2), it can be expressed as Equation (4). If the matrix M is defined as M = Σ_W [I_x², I_x I_y; I_x I_y, I_y²], properties such as Equations (5) and (6) are satisfied.

Finally, Equation (7), R = det(M) − k(trace(M))², allows us to determine whether a point is an edge, a corner, or flat. k is an empirical constant, and a value of 0.04 was used in this paper.
Each pixel location has a different value, and the final calculated R(x, y) is compared with the following conditions to distinguish between edge, corner, and flat regions [23][24][25][26]:

• When |R| is small, which happens when λ1 and λ2 are both small, the point belongs to a flat region;
• When R < 0, which happens when one eigenvalue of λ1 and λ2 is much bigger than the other, the point belongs to an edge;
• When R has a large value, the point is a corner.

Figure 4a shows the original input images, and Figure 4b shows the corners detected in the HSV color-converted images using the Harris corner detector; corner points are still detected for non-flame objects. To select only the areas most likely to be flames among the corner points of the detected objects, this study additionally applied the appearance characteristics of flames. Therefore, this paper proposes a method to further filter out all but the upward-facing corners among the detected corners.
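The Harris response and the three-way classification above can be sketched as follows. This is an illustrative implementation under stated simplifications: a flat 3 × 3 window stands in for the Gaussian window W, the convolution helper is a plain cross-correlation, and the `corner_thresh` value is an assumption, since the paper does not give its threshold.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def conv3(img, kernel):
    """3x3 sliding-window filtering (cross-correlation) with zero padding."""
    h, w = img.shape
    padded = np.pad(img, 1)
    out = np.zeros_like(img, dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
    return out

def harris_response(gray, k=0.04):
    """Harris response R = det(M) - k * trace(M)^2 (cf. Equation (7)),
    with the structure tensor M summed over a flat 3x3 window; k = 0.04
    as used in the paper."""
    ix = conv3(gray, SOBEL_X)      # first-order derivative I_x
    iy = conv3(gray, SOBEL_Y)      # first-order derivative I_y
    ones = np.ones((3, 3))
    sxx = conv3(ix * ix, ones)     # window sums of I_x^2, I_y^2, I_x*I_y
    syy = conv3(iy * iy, ones)
    sxy = conv3(ix * iy, ones)
    det = sxx * syy - sxy ** 2
    trace = sxx + syy
    return det - k * trace ** 2

def classify(r, corner_thresh):
    """Large positive R -> corner, strongly negative R -> edge, else flat."""
    if r > corner_thresh:
        return "corner"
    if r < -corner_thresh:
        return "edge"
    return "flat"
```

On a synthetic image whose bottom-right quadrant is bright, the response is large and positive at the quadrant's corner, negative along its edges, and near zero in uniform regions, matching the three bullet conditions.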
In order to select only the corners facing the top direction among the detected corners, the angle of the corner-facing direction was calculated using Equation (8). I_x and I_y are the first-order derivatives in the x and y directions, respectively, and can be obtained via convolution using S_x, the Sobel x kernel, and S_y, the Sobel y kernel.
Therefore, the Sobel filter values can be used to calculate the angle of the direction toward the corner point through the arctangent, and only corners from 45 to 135 degrees, as in Equation (9), are retained (Figure 4c). As a result of this pre-processing, most non-flame objects were removed, but in some cases the corners of non-flame objects remained. Therefore, the area where the corners are concentrated was designated as the ROI, and finally, flames and non-flames were classified using the Inception-V3 CNN model.
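The angle filtering of Equations (8) and (9) can be sketched as follows. The function name and the data layout (a list of corner points plus a mapping to their Sobel derivatives) are hypothetical; note also that whether positive I_y means "up" depends on the image coordinate convention, which the paper does not spell out.

```python
import math

def upward_corners(corners, gradients):
    """Keep only corner points whose facing direction, the arctangent of
    I_y over I_x (cf. Equation (8)), lies between 45 and 135 degrees,
    i.e. points toward the top (cf. Equation (9)).

    corners: list of (x, y) points from the Harris step.
    gradients: dict mapping each point to its (I_x, I_y) pair.
    """
    kept = []
    for pt in corners:
        gx, gy = gradients[pt]
        theta = math.degrees(math.atan2(gy, gx))  # corner-facing angle
        if 45 <= theta <= 135:                    # upward-facing only
            kept.append(pt)
    return kept
```

A bounding box around dense clusters of the surviving points would then give the ROI passed to the CNN.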


Inception-V3 CNN Model
When training through deep learning, high precision is commonly obtained by using deep layers and wide nodes. However, in that case, the number of parameters and the computational load increase considerably, and over-fitting or gradient vanishing problems occur. Therefore, the connections between nodes were made sparse while the matrix operations were kept dense. Reflecting this, the Inception structure makes the overall network deep, but not difficult to operate [27][28][29].
The left side of Figure 5 shows the structure of the Inception A, Inception B, and Inception C modules, each including a 1 × 1 convolution filter. The 1 × 1 convolution filter changes neither height nor width, and even when convolution is performed on a plane, there is no spatial feature loss. The role of this filter is therefore to control the number of channels that grow as convolution is performed across several layers. Placing a 1 × 1 filter before a 3 × 3 or 5 × 5 filter reduces the number of parameters. Thus, the Inception-V3 model has the advantage of being deeper than other CNN models without a relatively large number of parameters.

Table 1 shows the configuration of the CNN layers built from the Inception modules. The input image size was set to 299 × 299, and the first five general convolutional layers, called the stem, are more efficient at this early stage than Inception modules. After the nine Inception modules, the features are reduced to a size of 1 × 2048 and passed through a fully connected layer. In conventional convolutional neural networks, pooling was used between modules or layers to reduce the size of the feature maps, but to solve the representational bottleneck problem of increasing feature loss, the dimensional reduction method shown in Figure 6 was used. If the stride was set to 1 and convolution was performed, the operation cost was 2d²k²; to reduce this cost, the stride could be set to 2 and convolution performed with pooling, giving an operation cost of 2(d/2)²k². However, a stride of 2 causes a representational bottleneck. Therefore, the two forms were mixed appropriately to compensate for these shortcomings. Finally, because the final layer solves a binary classification problem of flame versus non-flame, a sigmoid activation function was used.
Table 1. Configuration of the CNN layers built from Inception modules.

Module          | Structure       | Output Size
Inception A × 3 | As in Figure 5a | 35 × 35 × 192
Reduction       | As in Figure 6  | 35 × 35 × 228
Inception B × 3 | As in Figure 5b | 17 × 17 × 768
Reduction       | As in Figure 6  | 17 × 17 × 768
Inception C × 3 | As in Figure 5c | 8 × 8 × 1280
Sigmoid         | --              | --

Figure 6. Structure of the reduction module used to reduce feature size.
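The parameter saving from placing a 1 × 1 filter before a larger filter can be checked with simple arithmetic. The channel counts below (192 in, 64 out, bottleneck of 32) are illustrative assumptions, not values taken from the paper's configuration.

```python
def conv_params(in_ch, out_ch, k):
    """Weight count of a k x k convolution layer (biases ignored)."""
    return in_ch * out_ch * k * k

# Direct 5x5 convolution from 192 to 64 channels:
direct = conv_params(192, 64, 5)
# Same in/out mapping with a 1x1 bottleneck down to 32 channels first:
bottleneck = conv_params(192, 32, 1) + conv_params(32, 64, 5)

# The bottlenecked version needs far fewer weights than the direct one.
assert bottleneck < direct
```

With these example numbers, the direct path needs 307,200 weights while the bottlenecked path needs 57,344, illustrating why Inception-V3 can be deep without a correspondingly large parameter count.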


Dataset Configuration for Training
In order to detect objects through deep learning-based convolutional neural networks, the sufficient collection of image datasets is essential for training the model. The composition of the datasets used in this study is shown in Table 2, divided into two classes: flame and non-flame. The flame dataset comprised 10,153 images and the non-flame dataset 10,024 images, and the training and test datasets were used in a ratio of 8:2.
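The 8:2 split described above can be sketched as follows; the helper function and its fixed seed are assumptions for illustration, not the paper's actual procedure.

```python
import random

def split_dataset(paths, train_ratio=0.8, seed=0):
    """Shuffle image paths and split them into training and test sets
    (8:2 by default, matching the ratio in Table 2)."""
    rng = random.Random(seed)       # fixed seed for a reproducible split
    shuffled = paths[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```

Applied to the 10,153 flame images, for example, this would yield roughly 8,122 training and 2,031 test images.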


Experimental Setup and Training
In the experimental environment, the CPU was an Intel i7-8700 processor, the GPU was a GeForce RTX 3070, and the OS was Linux Ubuntu 18.04. Figure 7 shows the CNN training results on the flame dataset, where Figure 7a is accuracy and Figure 7b is loss. The red curve corresponds to the training dataset, drawn from the collected flame images, and shows the accuracy achieved on that dataset. The blue curve corresponds to the test dataset, which, unlike the training dataset, was not used in the training process and serves only for accuracy evaluation. If high accuracy and low loss appear on the training dataset but relatively low accuracy and high loss on the test dataset, overfitting has occurred and training has not progressed properly. However, the training results of this work show that both datasets have similarly high accuracy and low loss, indicating that the model was well trained on flames. Training was stopped at 3000 steps, where the accuracy no longer increased and had largely converged.

Figure 8 shows the equipment used for the actual fire test, and Figure 9 shows the results of the flame detection evaluation. For a specific performance evaluation of the proposed flame detection algorithm, fire test pictures were taken, and pictures of actual fire sites in different environments (indoors and outdoors, at night and during the day) were used, forming a varied set of evaluation pictures.
The presented photos show some of the correct flame detection results, with the green bounding boxes indicating the areas finally determined to be flames by the CNN model. From the left, Figure 9a shows detection by the method presented in this study, Figure 9b detection by Faster R-CNN, and Figure 9c detection by SSD. The model presented in this study detected most of the flames even when the flame occupied only a small part of the image; Faster R-CNN also correctly judged most flames, but in some cases misjudged a non-flame object as a flame. SSD took less computation time than Faster R-CNN, but in many cases failed to detect small objects in the picture.
The test images comprised 100 fire and non-fire photographs, and the detection results were evaluated with the following expressions for a specific and objective accuracy evaluation [30], where TP is the number of correctly detected flames, FN the number of flames missed, TN the number of correctly detected non-flame objects, and FP the number of non-flame objects incorrectly detected as flames. Equation (10) computes accuracy: the number of correctly classified flame and non-flame cases divided by all cases. Equation (11) computes precision: TP divided by the sum of TP and FP. However, it is not appropriate to evaluate the performance of an object detection model with accuracy and precision alone. Therefore, the recall (detection rate) and the F1 score were additionally calculated using Equations (12) and (13).
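Assuming Equations (10) to (13) are the standard confusion-matrix definitions the text describes, they can be computed as:

```python
def detection_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 score from confusion-matrix
    counts (cf. Equations (10)-(13))."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)          # Equation (10)
    precision = tp / (tp + fp)                          # Equation (11)
    recall = tp / (tp + fn)                             # Equation (12), detection rate
    f1 = 2 * precision * recall / (precision + recall)  # Equation (13)
    return accuracy, precision, recall, f1
```

For example, with 8 true positives, 2 false positives, 9 true negatives, and 1 false negative, accuracy is 0.85 and precision is 0.8; the F1 score balances precision against recall rather than rewarding either alone.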
The calculations showed 97.5% accuracy, 98.5% precision, and a 98.5% F1 score. These results are compared with Faster R-CNN and SSD in Table 3; the compared object detection models were trained on the same image dataset.

Figures 10-12 show the receiver operating characteristic (ROC) curves and precision-recall (PR) curves for a specific accuracy comparison of the three detection models. The ROC curves are expressed through two parameters, the TPR (true positive rate) and FPR (false positive rate), and show how the curve changes as the threshold criterion is varied. When the classification threshold is lowered, the TPR and FPR of a general classification model increase together. Therefore, a curve with a higher TPR at a lower FPR indicates a better classification model. Likewise, PR curves evaluate the performance of classification models by varying the threshold, and the higher the precision and recall values, the better the model [31,32].

The Faster R-CNN and SSD models, both relatively recent deep learning-based object detection algorithms, showed little difference in accuracy, but SSD was weaker in responding to small objects in the image; detection took about 1.2 s per frame for Faster R-CNN and 0.32 s for SSD.
Furthermore, with regard to the flame detection model presented in this study, it took an average of 0.38 s per frame to reach the final CNN-based inference due to low latency in exploring the ROI, and all objects assumed to be flames were extracted as an ROI. In addition, if the proposed pre-processing was applied to Faster R-CNN and SSD, the false detection rate could be decreased. However, due to the large time delay of object detection algorithms using Faster R-CNN and SSD, it is difficult to expect a rapid response when adding the pre-processing.
Thus, for the model presented in this work, the responsiveness was not much different from the other models, while the accuracy and detection rates were significantly better, because non-flame objects are filtered out in advance through feature points that are likely to be flames. Therefore, both the precision and the detection rate improved due to the high proportion of TP, while the proportion of FP was lower than in the models presented for comparison.

Conclusions
In this paper, in order to detect flames accurately, image pre-processing based on the appearance characteristics of flames was performed, and a CNN made the final determination of whether a region contained flames. Through this image pre-processing, object detection and flame classification were performed accurately in the regions of the image where flames were expected to exist. In flame detection for fires, precision is the highest priority: frequent erroneous detections must not occur, and actual fires must not go undetected. Reflecting these requirements, accuracy and precision were improved by significantly reducing false positives and false negatives through the advance filtering of non-flame objects. In addition, the proposed model showed higher accuracy than the Faster R-CNN and SSD models, which performed object detection on the raw input image. If the CNN model is further developed and applied, this could become a fire detection method more accurate than human judgment. Future studies should apply appropriate image pre-processing methods to reduce the misdetection rate for smoke, which is more difficult to detect than flames, and should improve the computation time incurred during the image pre-processing stage.