False Positive Decremented Research for Fire and Smoke Detection in Surveillance Camera using Spatial and Temporal Features Based on Deep Learning

Abstract: Fire must be extinguished early, as it leads to economic losses and the loss of precious lives. Vision-based methods face many difficulties in algorithm research due to the atypical nature of fire flame and smoke. In this study, we introduce a novel smoke detection algorithm that reduces false positive detections using spatial and temporal features based on deep learning, applied to factory-installed surveillance cameras. First, we calculated the global frame similarity and mean square error (MSE) to detect the movement of fire flame and smoke in the input surveillance video. Second, we extracted fire flame and smoke candidate areas using a deep learning algorithm (Faster Region-based Convolutional Network, Faster R-CNN). Third, the final fire flame and smoke area was decided by local spatial and temporal information: frame difference, color, similarity, wavelet transform, coefficient of variation, and MSE. This research proposes a new algorithm using global and local frame features, which represent object information well, to reduce false positives in deep learning-based detection. Experimental results show that the proposed algorithm reduced false positive detections by about 99.9% while maintaining smoke and fire detection performance. This confirms that the proposed method has excellent false detection performance.


Introduction
Many civilian fire injuries and deaths occur each year due to intentionally set fires and naturally occurring fires, which also cause much property damage. Fires are classified into structure fires (home structures, which include one- and two-family homes, manufactured homes, and apartments), non-residential structure fires (public assembly, school and college, store and office, industrial facilities, and other structures), and outdoor fires (bush, grass, forest, rubbish, and vehicle fires) [1]. Research on automatic fire detection and monitoring has long focused on interior structure fires and non-residential structure fires to protect people and property from fire damage.
Smoke is very important because it indicates the start of a fire. However, sometimes the flames start first; thus, both smoke and flames require early detection to extinguish a fire in time. Many methods of detecting smoke and flames early have been studied. To reduce the damage caused by fire, early fire detection systems using heat sensors, smoke sensors, and flame detection sensors that detect flames by their infrared and ultraviolet spectra are frequently used [2,3]. Sensors used in buildings, factories, and interior spaces detect the presence of nearby particles produced by fire flames and smoke through a chemical reaction by ionization.
As mentioned above, malfunctions of smoke and flame detection using image processing have been drastically reduced by the development of deep learning, but malfunctions still exist due to the limitations of deep learning itself. The goal of most existing approaches is to detect either smoke or fire from images, but, as explained, they suffer from a variety of limitations. To address these limitations, in this paper a Faster R-CNN model is proposed with object attributes to increase smoke and fire flame detection and decrease the false positive rate. This method is capable of detecting both smoke and fire flame in images at the same time, and offers many advantages and better performance than other existing visual recognition CNN models for recognizing fire flame and smoke in images. Additionally, we researched a novel algorithm robust to rapid changes in the natural environment to reduce false positive smoke detection based on advanced deep learning, as shown in Figure 1.
This paper is organized as follows. We propose a deep learning model architecture for flame and smoke detection in surveillance cameras in Section 2. Section 3 explains several theories to reduce the rate of false alarms and improve the detection rate. Our experimental results and discussion are presented in Section 4. Finally, the manuscript presents a brief conclusion and future research directions in Section 5.

Deep Learning (Faster R-CNN)
It is often more difficult to detect objects within an image than to classify images. Deep learning using the R-CNN method takes several steps. Once the R-CNN creates a region proposal, i.e., a bounding box for an area where an object may exist, it unifies the size of the extracted bounding box to use it as input to a CNN. Next, the model uses an SVM (Support Vector Machine) to classify the selected region. Finally, it uses a linear regression model so that the bounding box of the categorized object obtains exact coordinates. The CNN for training data is divided into three parts. Figure 2 depicts the full flow of the proposed system. In Figure 2, an RPN (Region Proposal Network) is used to find a predefined number of regions (bounding boxes) that may contain objects, using features computed by the CNN. The next step is to get a list of possible related objects and their locations in the original image. We apply region of interest pooling (ROIP), using bounding boxes for the features and related objects extracted from the CNN, and extract the features corresponding to related objects as new tensors. Finally, this information is used to classify the contents of each bounding box, and the bounding box coordinates are adjusted in the R-CNN module. As a result of the Faster R-CNN, a bounding box of related objects is displayed on the screen. The proposed algorithm part is added at the end of Faster R-CNN. We finally select the case where the FD (Final Decision) is greater than a threshold (TH) using several features in the bounding box.

Labeling Dataset
Labeling of the fire flame and smoke in the images was done using the LabelImg program. This paper used a variety of labeling sizes, including fire flame and smoke, to train the images, as shown in Figure 3. The labeling results are stored in an .xml file with the image file name along with the four-point coordinates of each rectangle. For the labeling dataset, there are two things to be considered. First, a list of classes is necessary for the dataset. Second, bounding boxes (Xmin, Ymin, Xmax, Ymax) are generated by the labeling program according to the classes for the images.
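The LabelImg .xml output described above follows the Pascal VOC layout, so the class name and four-point box of each object can be read back with the standard library. A minimal sketch (the sample annotation below is illustrative, not from the paper's dataset):

```python
# Minimal sketch of reading one LabelImg (Pascal VOC) annotation.
# Element names (object/name, bndbox/xmin, ...) follow the standard
# LabelImg .xml layout; the sample document is a hypothetical example.
import xml.etree.ElementTree as ET

def parse_labelimg_xml(xml_text):
    """Return (class name, (xmin, ymin, xmax, ymax)) for each labeled object."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        name = obj.findtext("name")
        bb = obj.find("bndbox")
        coords = tuple(int(bb.findtext(k)) for k in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((name, coords))
    return boxes

sample = """<annotation>
  <filename>fire_0001.jpg</filename>
  <object><name>smoke</name>
    <bndbox><xmin>12</xmin><ymin>30</ymin><xmax>200</xmax><ymax>180</ymax></bndbox>
  </object>
</annotation>"""
print(parse_labelimg_xml(sample))  # [('smoke', (12, 30, 200, 180))]
```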


Training Data with Faster R-CNN
Faster R-CNN [48] applies a new method called the Region Proposal Network (RPN), which integrates the generation of region proposals within the model itself. This is a new application of the RPN network for object detection. The function of the RPN is to output the rectangle and object score of the part of the input image that proposes an object. It is a fully convolutional network and is designed to share convolutional layers with Faster R-CNN. A trained RPN improves the quality of the proposed regions and thus the accuracy of object detection. Whereas earlier models perform slow external selective search by CPU calculation, Faster R-CNN speeds this up by using an internal, fast RPN computed on the GPU. The RPN comes after the last convolutional layer, followed by ROIP, classification, and bounding box regression, as shown in Figure 4. The RPN extracts 256 or 512 features from the input image by convolution with a 3 × 3 window. These are then fed to a box-classification layer and a box-regression layer. The predefined reference boxes used as bounding box candidates at each position of the sliding window are used for box regression. The RPN extracts features by applying predefined anchor boxes of various ratios/sizes around the center position while moving a sliding window of the same size. In our model, we used nine anchor boxes (three sizes and three proportions), and each box is considered as a candidate for the bounding box at each position of the sliding window in the image.
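The nine anchor boxes per sliding-window position can be sketched as follows. The scale and ratio values are the common Faster R-CNN defaults, not values taken from this paper:

```python
# Hedged sketch: generating 9 anchor boxes (3 scales x 3 aspect ratios)
# per RPN sliding-window position. Scales/ratios are common defaults,
# not the paper's own settings.
import itertools

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (xmin, ymin, xmax, ymax) anchors centered at (cx, cy)."""
    anchors = []
    for s, r in itertools.product(scales, ratios):
        # keep the area s*s constant while varying the aspect ratio w/h = r
        w = s * r ** 0.5
        h = s / r ** 0.5
        anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

print(len(make_anchors(0, 0)))  # 9 anchors per position
```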

Creating Inference Graph
An inference graph, also known as a frozen model, is saved for further processing. While training the dataset, the model saves a pair of files at different time steps: one holding the weights (".data") and another holding the graph (".meta"). The labeled image information is processed using the Faster R-CNN model described above, and the ".meta" file is generated as the training result. The next step is making the graph file (".pb" file) from the ".meta" file generated in the previous step. Finally, when we use the ".pb" file to detect objects in images, the result image, including the bounding box and object score, is displayed on the monitor.

Structural Similarity
SSIM (Structural Similarity) [49] is a measure of the similarity between an original image and one distorted by compression and transformation. It is widely used in signal processing because it has higher accuracy than the Mean Square Error (MSE) method, which measures the difference between the pixel values of two images. We evaluated the test image (X) against the original image (Y) to quantify visual similarity. The more similar the test image is to the original image, the closer the value is to 1.0; the more different it is, the closer the value is to 0.0. The SSIM formulas are defined as follows:

L(x, y) = (2µxµy + K1) / (µx² + µy² + K1)  (1)
M(x, y) = (2σxσy + K2) / (σx² + σy² + K2)  (2)
N(x, y) = (σxy + K3) / (σxσy + K3)  (3)

where µx and µy are the means of the pixels, σx and σy are the standard deviations, and σxy is the covariance. K1, K2, and K3 are constants that prevent the denominator and numerator from becoming zero.
L(x, y) expresses the brightness difference, M(x, y) the contrast difference, and N(x, y) the similarity of the structural change between x and y. The structural similarity is shown in Equation (4):

SSIM(x, y) = [L(x, y)]^α · [M(x, y)]^β · [N(x, y)]^γ  (4)

where α, β, and γ represent the importance of each term; 1.0 was used for each in this paper.
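The L, M, and N terms above can be computed globally over two images in a few lines. This is a minimal NumPy sketch with α = β = γ = 1; the stabilizing constants K1..K3 are illustrative values, not the paper's:

```python
# Minimal global SSIM sketch following the L/M/N decomposition.
# K1..K3 are small illustrative stabilizers, not the paper's values.
import numpy as np

def ssim_global(x, y, k1=1e-4, k2=9e-4, k3=4.5e-4):
    x = x.astype(np.float64); y = y.astype(np.float64)
    mx, my = x.mean(), y.mean()
    sx, sy = x.std(), y.std()
    sxy = ((x - mx) * (y - my)).mean()
    L = (2 * mx * my + k1) / (mx ** 2 + my ** 2 + k1)   # luminance term
    M = (2 * sx * sy + k2) / (sx ** 2 + sy ** 2 + k2)   # contrast term
    N = (sxy + k3) / (sx * sy + k3)                     # structure term
    return L * M * N  # alpha = beta = gamma = 1, as in the paper

a = np.arange(64, dtype=np.float64).reshape(8, 8)
print(ssim_global(a, a))  # identical images -> 1.0
```

In practice a windowed SSIM (computed over local patches and averaged) is more common than this global form.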

RGB Color Histogram
Generally, smoke is grayish (dark gray, gray, light gray, and white). Black smoke is produced by unburned materials or combustion at high temperatures, which means that a certain time has passed since the fire occurred. This paper focuses on smoke at its initial generation and sets the conditions shown in Equation (5) to use smoke colors ranging from gray to white, where C is the output image, R is the red image, G is the green image, and B is the blue image. This research set C_L to a minimum value between 80 (τ1) and 150 (τ2), and C_H to an upper-range value between 180 (τ3) and 250 (τ4). The average image C is histogrammed into 256 bins (0 to 255) for each pixel. The values stored in each bin of the histogram are normalized using the input image size, and their sum is obtained, as in Equation (6), where H_S is the RGB color histogram result, b_i is a histogram bin from 0 to 255 that falls inside the range of Equation (5), and h and w are the height and width of the input image. The grayish color is distributed intensively between 80 and 250. Fire flames are usually bright orange or red (red → orange → yellow → white). For flames, this paper used HSV color instead of RGB color. The ranges of HSV color used in this paper are as follows:
• H: 0 to 40
• S: 100 to 255
• V: 80 to 255
As in the smoke color extraction, the average value of the HSV color image is calculated over the filtered range, and the HSV histogram is obtained by Equation (6).
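The grayish-smoke cue can be sketched as follows: average the R, G, B planes, histogram the result into 256 bins, normalize by the image size, and sum the bins inside the grayish band. The exact Equation (5) conditions are not reproduced in this text, so the simple band mask below is an assumption for illustration:

```python
# Hedged sketch of the grayish-smoke colour histogram score.
# The band limits (80..250) follow the range quoted in the text;
# the single-band mask itself is an illustrative simplification.
import numpy as np

def gray_smoke_score(rgb, lo=80, hi=250):
    """Fraction of pixels whose R/G/B average falls in the grayish band."""
    c = rgb.astype(np.float64).mean(axis=2)            # average image C
    hist, _ = np.histogram(c, bins=256, range=(0, 256))
    hist = hist / c.size                               # normalize by h * w
    return hist[lo:hi + 1].sum()                       # H_S over the band

gray_patch = np.full((4, 4, 3), 128, dtype=np.uint8)   # mid-gray block
print(gray_smoke_score(gray_patch))  # 1.0 (every pixel is in the band)
```

The same histogram-and-sum step applies to the HSV flame ranges listed above.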

Coefficient of Variation (CV)
The coefficient of variation (CV) is a statistic that represents the precision or scatter of a sample, like the variance and standard deviation, in that it shows how scattered the distribution is relative to the mean. CV is a measure of how large the standard deviation is relative to the mean. The coefficient of variation is useful for comparing the spread of two sets of data and for comparing variability when the differences between data are large. It is also used by economists and investors to determine the volatility of economic models and securities, as well as in areas such as engineering or physics when conducting quality assurance research and ANOVA gauge R&R [50].
The coefficient of variation is the standard deviation divided by the mean, as shown in Equation (7):

CV = σ / m  (7)

where σ is the standard deviation and m is the mean. Images with smoke and fire flame regions have lower CV values; in contrast, regions causing false alarms show higher CV values, as shown in Figure 5. This paper adopted the CV as the coefficient (weighting value) of the wavelet transform to remove false alarm cases. For fire flames, we used the R channel of the RGB color space, and for smoke regions we adopted the Y channel of the YCbCr color space.
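Equation (7) applied to a candidate region is a one-liner; the example patches below illustrate the smooth-versus-cluttered contrast described above:

```python
# Coefficient of variation of a candidate region, Equation (7): CV = sigma / m.
# Smooth smoke/flame regions give a low CV; textured false-positive
# regions give a higher one.
import numpy as np

def coeff_of_variation(region):
    region = region.astype(np.float64)
    m = region.mean()
    return region.std() / m if m != 0 else 0.0

smooth = np.full((8, 8), 200.0)            # uniform (smoke-like) patch
noisy = np.tile([50.0, 250.0], (8, 4))     # high-contrast (clutter-like) patch
print(coeff_of_variation(smooth), coeff_of_variation(noisy))
```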

Wavelet Transform
In general, smoke is blurry and uneven; thus, it is difficult to detect its contour using contour detection methods. The DWT (Discrete Wavelet Transform) [51,52] supports multiple resolutions and can express contour information of the vertical, horizontal, and diagonal components, respectively. Using this feature, smoke represented by DWT energy is more apparent than with conventional edge detection methods.
When translucent smoke occurs, the smoke-covered part of the image frame becomes less sharp and the high-frequency component in that area is reduced. Wavelet algorithms are generally suitable for expressing the image textures and edge characteristics of smoke and fire flames. Background images generally have lower wavelet energy and few moving objects. In contrast, edges in smoke-covered images become less visible and may disappear from the scene after a certain time, which means that the high-frequency energy of the background scene decreases. To identify smoke in a scene, any decrease in high frequency within the detected blob images in each frame was monitored by a spatial wavelet transform algorithm.
As shown in Figure 6, if smoke spreads over edges in the image, it may be difficult to see initially, and the smoke may darken over time, causing part of the background to disappear [53,54]. This means there is a high probability that smoke is present and that smoke detection will be easier, as shown in Figure 6. Therefore, this paper evaluated the sub-image energy by applying a first-stage wavelet transform to the image and summing the squares of the coefficient images, as in Equation (8):

E(x, y) = LH(x, y)² + HL(x, y)² + HH(x, y)²  (8)

where x and y represent positions within the image, and LH, HL, and HH each contain contour information of the high-frequency components of the DWT. LH is the horizontal low-band, vertical high-band; HL is the horizontal high-band, vertical low-band; and HH is the horizontal high-band, vertical high-band. E(x, y) is the wavelet energy at each pixel in the candidate region detected by the deep learning algorithm within each frame.
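A one-level 2-D Haar transform, written directly in NumPy as a hedged stand-in for the paper's DWT (the wavelet family is not specified in the text), yields the LH/HL/HH detail bands and the per-pixel energy of Equation (8):

```python
# One-level 2-D Haar transform in NumPy (a hedged stand-in for the
# paper's unspecified DWT), returning per-pixel high-frequency energy
# E = LH^2 + HL^2 + HH^2 as in Equation (8).
import numpy as np

def haar_energy(img):
    x = img.astype(np.float64)
    a = x[0::2, 0::2]; b = x[0::2, 1::2]    # 2x2 neighbourhood samples
    c = x[1::2, 0::2]; d = x[1::2, 1::2]
    lh = (a - b + c - d) / 2.0   # vertical detail
    hl = (a + b - c - d) / 2.0   # horizontal detail
    hh = (a - b - c + d) / 2.0   # diagonal detail
    return lh ** 2 + hl ** 2 + hh ** 2

flat = np.full((8, 8), 100.0)          # featureless (smoke-blurred) area
edges = np.tile([0.0, 255.0], (8, 4))  # strong vertical edges
print(haar_energy(flat).sum(), haar_energy(edges).sum())
```

A featureless region has zero high-frequency energy, while sharp edges produce large energy, matching the smoke-occlusion reasoning above.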

Experimental Results
We propose a new algorithm using the similarity and color histograms of the global and local areas in the frame to reduce the smoke false positive rate of deep learning-based fire detection systems using ONVIF cameras. In this paper, we used a computer with an Intel Core i7-7700 (3.5 GHz) CPU, 16 GB of memory, and a GeForce TITAN X GPU to perform the experiments. The flame and smoke databases used in this study were obtained from the internet and from directly recorded ground and factory video. The video recording devices were a mobile phone camera, a Canon G5 camera, and a Raspberry Pi camera. Python 3.5, TensorFlow, and OpenCV were used in this paper.
To implement the proposed algorithm, the following process was carried out. The first step is labeling the dataset from the training database using the LabelImg program, as shown in Figure 4. The labeling categories used in this paper are flame, smoke, grinder, welding, and human. The labeling results are stored in an .xml file that contains the object class name and the four-point coordinates of the object area.
The second step is the training process with labeled images. In the training process, the input image is a JPEG or PNG file, and the .xml file must be converted into TensorFlow's training data format. Since the metadata and labels of these images are stored in a separate file and must be read separately from the images, the code becomes complicated when reading the training data. Additionally, performance degradation can occur if the image is read in JPEG or PNG format and decoded each time. The TFRecord file format avoids this performance degradation and makes development easier. The TFRecord file format stores the height and width of the image, the file name, the encoding format, the image binary, and the rectangle coordinates of the objects in the image. Through this process, the entire dataset is split and stored as 70% training data and 30% validation data. We used Faster R-CNN with ResNet (Deep Residual Network) as the primary model for training, which is characterized by the smallest number of missed objects and the highest detection rate. The fire images used in training consisted of 21,230 pieces.
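The 70%/30% train/validation split described above can be sketched in a few lines. The file names here are hypothetical placeholders:

```python
# Simple sketch of the 70%/30% train/validation split described above.
# File names are hypothetical placeholders, not the paper's dataset.
import random

def split_dataset(items, train_frac=0.7, seed=0):
    items = list(items)
    random.Random(seed).shuffle(items)   # deterministic shuffle
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]

files = [f"fire_{i:05d}.jpg" for i in range(10)]
train, val = split_dataset(files)
print(len(train), len(val))  # 7 3
```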
Finally, we extracted the training model. The learning process stores a check pointer that represents the learning result for the predetermined pointer. Each check pointer has meta information about the Tensorflow model file format and can be learned again. However, because there is a lot of unnecessary information in the ".meta" file, the .meta file needs to be improved to use the actual model. Finally, a ".pb" file is generated that combines the weights except for the unnecessary data in the ".meta" file.
In this paper, we used factory-recorded video images and video from a mobile camera, a Raspberry Pi camera, and a general camera as the experimental data. Figure 7 shows an example of continuous frames of video used in the experiment. The fire detection experiment was performed using the ".pb" file based on the Faster R-CNN model. Figure 8 shows fire and smoke detection results, including true positives and false positives, using general deep learning. Figure 8a shows the results of the experiment to detect fire and smoke using various videos; the detection threshold of Faster R-CNN was 30% or higher. Figure 8b,c shows false positive detections produced by the trained deep learning model. Although false positives appeared in many places, they fall into three types. First, smoke or flame is detected in reflections of sunlight. Second, facilities inside and outside the factory show shapes and colors similar to smoke and fire. Third, when objects move around, the deep learning system recognizes them as fire flame or smoke because their shapes are similar to the trained fire flame and smoke, as shown in Figure 8c. Table 1 shows the fire and smoke detection results for several videos. Table 1. The results of the video test using general Faster R-CNN (frames).

In Table 1, Ground Truth represents the total number of frames in the video; True Positive (TP) indicates that fire flame and smoke are detected as fire flame and smoke; True Negative (TN) indicates that non-fire objects are not detected as fire flame and smoke; and False Positive (FP) is the case where non-fire objects are detected as fire. F/S signifies a video containing fire flame and smoke frames, and NON signifies a non-fire video without fire and smoke frames.
In the case of these videos, false positives are not generated in continuous frames. Since the video is 30 fps, this can be sufficiently compensated. However, in the case of Video 3, Video 7, and the non-fire videos, the alarm rings continuously and the stress on the workers increases. To reduce the false positives generated in these videos, we use the following characteristics. The first is a global check. We check the motion characteristics before performing deep learning using the mean square error and a three-frame difference [55]. Since there is motion when a fire occurs, if a block of moving pixels is generated, the frame is registered as a fire candidate state. If the fire candidate frame status is true, the deep learning process is performed, as shown in Figure 1.
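The global check above can be sketched as a motion gate: mean square error between consecutive frames plus a three-frame difference mask. The threshold values here are illustrative assumptions, not the paper's:

```python
# Hedged sketch of the global motion check: frame-to-frame MSE plus a
# three-frame difference mask. Thresholds are illustrative, not the paper's.
import numpy as np

def mse(f1, f2):
    return ((f1.astype(np.float64) - f2.astype(np.float64)) ** 2).mean()

def moving_mask(f_prev, f_cur, f_next, tau=20):
    d1 = np.abs(f_cur.astype(np.float64) - f_prev.astype(np.float64)) > tau
    d2 = np.abs(f_next.astype(np.float64) - f_cur.astype(np.float64)) > tau
    return d1 & d2                      # pixels moving in both differences

def is_fire_candidate(frames, mse_th=1.0, pixel_th=10):
    f0, f1, f2 = frames
    return mse(f0, f1) > mse_th and moving_mask(f0, f1, f2).sum() > pixel_th

static = [np.zeros((16, 16))] * 3
print(is_fire_candidate(static))  # no motion: skip the deep learning step
```

Only frames that pass this gate are handed to the Faster R-CNN stage, which saves GPU work on static scenes.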
where FSG is the global decision parameter. The second is a local check for the area (bounding box) detected by deep learning. If a trained class is present in the input frame image, a bounding box is created and stored as a local area of interest. The next step is to verify this local area of interest again. In this paper, we determine the final fire region using the color histogram H, the SSIM index, the mean square error (MSE), the coefficient of variation, and the wavelet transform against other frames, as the following equation: where k denotes the frame index, and fth1 to fth4 are threshold values determined by experiment. C_R_HH and C_Y_HH are the wavelet transform HH coefficients for the RGB and YCbCr color spaces, and R_H sum and C_H sum are the R-channel and Y-channel histogram sums for the local region. We compared the local region (bounding box area) of interest using the three-frame difference algorithm (first, middle, and last frames) from the 10 stored frames.
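Three of the local features named above can be sketched in a few lines. This is a hedged illustration: the Haar HH sub-band stands in for the paper's wavelet transform (computed here without an external wavelet library), and the histogram distance and bin count are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def haar_hh_energy(gray):
    """Energy of the single-level Haar HH sub-band (high frequency in both
    directions). Smoke blurs background edges, so HH energy drops inside a
    genuine smoke region relative to earlier frames."""
    g = gray.astype(np.float64)
    g = g[: g.shape[0] // 2 * 2, : g.shape[1] // 2 * 2]  # even dimensions
    a, b = g[0::2, 0::2], g[0::2, 1::2]
    c, d = g[1::2, 0::2], g[1::2, 1::2]
    hh = (a - b - c + d) / 4.0
    return float(np.sum(hh ** 2))

def coefficient_of_variation(region):
    """CV = std / mean of pixel intensities in the candidate region;
    flame and smoke flicker gives a higher CV than rigid objects."""
    r = region.astype(np.float64)
    m = r.mean()
    return float(r.std() / m) if m > 0 else 0.0

def hist_sum_diff(region_a, region_b, bins=32):
    """L1 distance between intensity histograms of the same bounding box
    in two different frames; static false positives score near zero."""
    ha, _ = np.histogram(region_a, bins=bins, range=(0, 256))
    hb, _ = np.histogram(region_b, bins=bins, range=(0, 256))
    return float(np.abs(ha - hb).sum())
```

Each feature is compared against its experimentally chosen threshold (fth1 to fth4 in the text) before the bounding box is accepted as a true fire or smoke region.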
For the final smoke region, in common with fire detection, we applied the same sequence as the following equation: This paper added the following conditions to remove false positives: where CVS and CVF are the coefficients of variation of the local smoke and fire regions, respectively. In this paper, the frame is regarded as fire if FD is satisfied, as shown in the following equation: We describe the results of applying the proposed algorithm in Table 2. In these videos, the false positive rate dropped to 0%, and the fire detection of Video 1 to Video 6 persisted. Even though Video 3 and Video 4 missed a few fire frames, this is not a problem because the misses are not continuous, and the alarm system can still send a warning signal to the operator if one or two frames are missed. As shown in Table 2, the proposed algorithm using the color histogram, wavelet transform, and coefficient of variation was able to eliminate the false positives (similar-shaped and similar-colored objects, sun and light reflections, moving objects, etc.) shown in Figure 8b,c. The proposed algorithm, combining the color histogram, the high-frequency components of the wavelet transform (which discriminate the background from smoke and fire flame), and the coefficient of variation, showed a higher false alarm removal ratio than the traditional deep learning method. However, in the cases of Video 7 and Video 8, the missed frames must be considered seriously. Additionally, we tested other factory and office videos; the proposed method also achieved a zero false positive rate on them. The false positive removal rate for the additional 16 videos was 99%, and examples of the images used in this experiment are shown in Figure 9: Figure 9a shows office and factory videos and Figure 9b shows fire and smoke videos. Since these videos involve a lot of movement, this likely contributed to the missed frames in Video 7 and Video 8.
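The final decision FD described above can be sketched as a simple combination of the local-check results and the CVS/CVF conditions. The threshold values and function signature here are illustrative placeholders, not the paper's reported parameters.

```python
def final_decision(fire_flag, smoke_flag, cv_fire, cv_smoke,
                   cv_fire_th=0.1, cv_smoke_th=0.05):
    """FD-style final decision (illustrative): raise the alarm only when a
    candidate region passed the local spatial/temporal checks (fire_flag or
    smoke_flag) AND its coefficient of variation (CVF or CVS) exceeds a
    class-specific threshold, suppressing static false positives."""
    fire_ok = fire_flag and cv_fire > cv_fire_th
    smoke_ok = smoke_flag and cv_smoke > cv_smoke_th
    return fire_ok or smoke_ok
```

Requiring both the local checks and the CV condition is what removes detections of fire-colored but temporally static objects, at the cost of occasionally missing isolated frames, as seen in Video 7 and Video 8.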

Conclusions
Fires resulting from small sparks can cause terrible disasters that lead to both economic losses and the loss of human lives. In this paper, we described a new fire flame and smoke detection method that removes false positive detections using spatial and temporal features based on deep learning from surveillance cameras. In general, a deep learning method relying on the shape of an object frequently generates false positives, where a general object is detected as fire or smoke. To solve this problem, first, we used motion detection with the three-frame difference algorithm as the global information.
We then applied frame similarity using SSIM and MSE. Second, we adapted the Faster R-CNN algorithm to find smoke and fire candidate regions in the detected frames. Third, we determined the final fire flame and smoke area using the spatial and temporal features of the candidate region: the wavelet transform, coefficient of variation, color histogram, frame similarity, and MSE. Experiments have shown that the probability of false positives with the proposed algorithm is significantly lower than that of conventional deep learning methods.
For future work, it is necessary to study the analysis of moving-camera videos and to experiment with frame correlation and the deep learning model to further reduce false positives and missed fire and smoke frames.