AdViSED: Advanced Video SmokE Detection for Real-Time Measurements in Antifire Indoor and Outdoor Systems

Abstract: This paper proposes a video-based smoke detection technique for early warning in antifire surveillance systems. The algorithm is developed to detect the smoke behavior in a restricted video surveillance environment, both indoor (e.g., railway carriage, bus wagon, industrial plant, or home/office) and outdoor (e.g., storage area or parking area). The proposed technique exploits a Kalman estimator, color analysis, image segmentation, blob labeling, geometrical features analysis, and an M of N decisor in order to extract an alarm signal within a strict real-time deadline. This new technique requires just a few seconds to detect fire smoke, and it is 15 times faster compared to the requirements of fire-alarm standards for industrial or transport systems, e.g., the EN50155 standard for onboard train fire-alarm systems. Indeed, the EN50155 considers a response time of up to 60 s for onboard systems. The proposed technique has been tested and compared with state-of-the-art systems using the open access Firesense dataset developed as an output of a European FP7 project, including several fire/smoke indoor and outdoor scenes. All the detection metrics (recall, accuracy, F1 score, precision, etc.) improve when comparing Advanced Video SmokE Detection (AdViSED) with other video-based antifire works recently proposed in literature. The proposed technique is flexible in terms of input camera type, frame size, and frame rate, and has been implemented on a low-cost embedded platform to develop a distributed antifire system accessible via web browser.


Introduction
Recent reports of the NFPA (National Fire Protection Association) show that the average number of fires per year is about 1.3 million in the US alone, with a high cost in terms of lives (more than 3000 civilian fire fatalities) and economic losses: the cost for fire losses is estimated to be about 55 billion US Dollars (USD) per year [1]. Hence, with the advent of the Internet of Things (IoT) and the growing interest in safety in public places, an early fire-smoke detection system should be implemented for the benefit of all citizens. To this end, a video-based approach based on the recognition of fire smoke is a promising method. Indeed, the video signal offers a wide view of the area under investigation, and closed-circuit television (CCTV) systems for surveillance purposes are often already installed in smart buildings, in public places in smart cities, or onboard passenger vehicles of public transport systems. Exploiting an already existing video infrastructure avoids the purchase/installation costs of additional add-on products, increasing only the complexity of the algorithm used to detect the smoke.
Smoke is the first sign of a fire hazard, since it appears before the flames. Standard smoke sensors, based on chemical, temperature, or PIR (Passive InfraRed) detectors [2–7], trigger an alarm within several minutes, when combustion is already producing flames and increasing the environmental temperature. In [8,9], a photoelectric smoke detector with an actual smoke chamber was combined with smoke temperature sensing. Current standards such as EN50155 for onboard train safety set a delay of 1 min between the start of fire and its detection. This is why the commercial solution in [10] for onboard train antifire systems uses point-based optical and temperature smoke detectors. These sensors aim at revealing the presence of a fire with a time delay within 1 min. The system in [10] exploits the fact that hot air produced by a fire moves from the bottom to the top of the train wagon, and hence the smoke moves toward the sensor placed on the roof. In [11], flames were identified by measuring absolute temperature and its gradient through a smart IR sensor. The detection systems presented above have the drawback of reacting slowly. Moreover, they need active fans or air-conditioning to speed up the smoke/fire detection process and avoid an excessive measuring latency. Instead, the aim of this work is to develop an innovative video-based smoke detection technique able to trigger the alarm within a few seconds. Such an algorithm has already been implemented in several IoT embedded devices to develop a distributed antifire system accessible via web browser and able to signal a fire alarm from different camera nodes, as discussed in [12]. With this paper, the authors intend to extend that discussion by focusing more on the smoke detection algorithm aspects.
Energies 2020, 13, 2098
Hereafter, the paper is organized as follows. Section 2 deals with a detailed review of video-based smoke/fire measurements. Section 3 presents the new Advanced Video SmokE Detection (AdViSED) technique discussing the global architecture and then each of the algorithms used in the multiple video processing steps. Section 4 discusses the AdViSED thresholds configuration. Section 5 presents the evaluation of the complexity of the algorithm and the performance compared with the state of art systems. Implementation results of AdViSED in x86-based and ARM (Advanced RISC Machine)-based platforms are analyzed in Section 6. Conclusions are drawn in Section 7.

State of the Art Review of Video-Based Smoke Detection Algorithm
Video-based fire measuring and warning systems may help to reduce the detection time compared to other available sensors (smoke chemicals or temperature sensors) in both indoor and outdoor scenarios. A video camera can monitor "volumes" without the transport delays that traditional "point" sensors suffer from. In particular, CCTV cameras are suitable for fire detection within 100 m, which is the case of fire onboard passenger vehicles or in houses, offices, factories, or cities. Instead, for wildfire in forests or rural environments [13], which is out of the scope of this work, other techniques must be used, optimized for scenes observed at distances of several km. Many studies have been recently proposed in literature for video smoke detection [14–25].
In [24], a commercial solution was proposed exploiting both a 160 × 120-pixel, 9 fps, long wave infrared camera for thermal imaging, and a 1920 × 1080-pixel color CMOS (Complementary Metal Oxide Semiconductor) camera to exploit the thermal energy emitted from the objects and to analyze their dynamic features. Hence, absolute and gradient temperatures plus size and motion of the objects were analyzed by the system in [24]. However, the purchase and installation of such a solution is too expensive for exhaustive use in smart cities, homes/offices, and intelligent transport systems. In [25], a CCD (charge-coupled device) camera was used but only as a temperature sensor. The system in [25] was used to estimate the edges of a combustion flame and not as a fast fire/smoke detector. The authors in [13] implemented a background subtraction using frame differencing and working with the smoke features to discover the smoke areas. In the same direction is the work of O. Barnich and M. Van Droogenbroeck, which in [14] generates the background subtraction using Visual Background Estimation (ViBe). The ViBe technique was used also in [15] in combination with the dissipation function to estimate the background. Another smoke recognition technique was studied in [16], combining fuzzy logic, Gaussian Mixture Model (GMM), and Support Vector Machine (SVM) for recognition. In [19], the background and the foreground were estimated with the use of the Kalman filter. The work in [20] proposed an early fire detection system using Internet Protocol (IP) camera technology with Motion JPEG codec, with the Discrete Cosine Transform (DCT) operating in the frequency domain for smoke features. Instead, the work in [21] focused the algorithm on the appearance-based method and then a background subtraction was applied. In [26], edge detection image processing was used to reveal the edge of flames, rather than implementing a fire warning algorithm.
In [27], a multiexpert decision algorithm was used for fire detection, combining the results of three expert systems exploiting motion, shape, and color information, which achieved a detection accuracy higher than 90% over the testing dataset. However, the complexity of the technique proposed in [27] limits the real-time algorithmic execution on a Raspberry Pi embedded system to only 3 fps, using only low-resolution images of 320 × 240 pixels. A convolutional neural network (CNN) approach for video-based antifire surveillance systems was proposed in [18,28], but also in this case the implementation complexity, although reduced compared to other CNN-based techniques, is still too high. For example, [28] required a system equipped with an NVIDIA GeForce GTX TITAN X with 12 GB onboard memory, a deep learning framework, and an Intel Core i5 CPU with 64 GB RAM. Such a platform has a cost in the order of hundreds of USD with a power consumption of hundreds of Watts. A hybrid computing architecture combining a GPU and a general-purpose processor has also been adopted in [29]. Here, the algorithms exploited motion estimation based on background subtraction and color probability analysis to detect candidate smoke blobs. Some of the techniques seen above [14–17,21,29–31] basically perform a background subtraction, so they only aim to recognize the foreground. Other papers, like [19], implement more sophisticated video processing techniques, like Kalman-based motion estimation, but with fixed parameters that are not aware of the specific camera configuration. Techniques such as those in [18,28], based on CNNs, are difficult to implement in real scenarios due to the lack of a large dataset of training videos.
This work also outperforms a preliminary algorithm, studied and presented by the authors in [22,23], which was using SAD (sum of absolute differences) block matching as motion estimation, plus other image processing tasks, such as morphological filter, bounding boxes extraction, features extraction, correct bounding boxes selection, and space and time analysis.

AdViSED Fast Measuring System
The new approach, discussed in Sections 3 and 4, refines the motion estimation method by replacing it with a Kalman estimator, properly tuned for smoke detection. After color analysis, image segmentation, blob labeling, features analysis, and an M out of N decisor, an alarm signal is generated. When comparing the new algorithm with that in [22,23], it can be observed that there is an increase of all the evaluation metrics, such as recall, accuracy, F1 score, precision, and MCC (Matthews correlation coefficient), and a reduction of the algorithm implementation complexity that allows implementing the new measuring system in real-time and with low power. Figure 1 shows the logical flow of the proposed video smoke detection algorithm, which is based on motion detection, segmentation, features extraction, and elaboration of the video frames. The algorithm is designed as a chain of image and video processing tasks. The sequence of elaborations ends with the computation of a real-time alarm signal. A binary decision-maker for the fire alarm is implemented using a thresholding technique combined with the M of N classifier over the real-time alarm signal. When implementing the signal processing chain of Figure 1 in a computing platform, the motion-detection and color-segmentation tasks can run in parallel, and so can the time-based and edge-based analysis tasks, while the rest of the functions have to be calculated according to a sequential flow. Hereafter, the main video processing steps of the workflow in Figure 1 are detailed.

Motion Detection
A Kalman filter is used for motion detection, and a simplified account of the underlying theory is reported here, in which the background is obtained through a prediction step followed by an estimation step. The background prediction is given by Equation (1), where $\widetilde{BG}_k$ is the background prediction for the current frame $I$, $\widehat{BG}_{k-1}$ is the background estimate at the previous frame, and $a = A/(1 - \beta)$ with $\beta = 1/(1 + \tau_\beta \cdot f_r)$. The coefficient $f_r$ is the frame rate in fps of the processed video, $\tau_\beta$ is a time constant set to 10 s, and $A$ is a constant set to 0.618.
The background estimate $\widehat{BG}_k$ of the frame $I$ is obtained from Equation (2), where $\widetilde{BG}_k$ is known from Equation (1), and $K_1$ and $K_2$ are defined in Equation (3).
The above formulas work at pixel level. During the initialization, which is performed when the first frame is received, we set $\widehat{BG}_k$ equal to the initial frame $I$ and its derivative $\dot{\widehat{BG}}_k$ equal to zero. According to Figure 1, the pixel-wise logic AND with the color segmentation is activated by the motion-estimation block only when the foreground pixel $FG_k$ is higher than a proper threshold $THR_{foreg}$. In the above equations, $FG_k$ is the foreground of the frame $I$ and $\Lambda = 1/(1 + \tau_\alpha \cdot f_r)$, where $\tau_\alpha$ is a time constant set to 16 s. The empirical threshold $THR_{foreg}$ is set to 0.08. It is noted that the Kalman estimator proposed in [19] has the variable $a = 0.7$ fixed, while our variable $a = A/(1 - \beta)$ depends on $\beta$, which in turn depends on the constants $A$ and $\tau_\beta$ and on the frame rate $f_r$. In the proposed algorithm, the frame rate is extracted from the input processed video, so the variable $a$, and hence the matrix in Equation (1), depends on the value of the video frame rate. It is noted that in the test videos considered in this work, $f_r$ ranges from 10 to 60 fps. The coefficient $K_1 = K_2$ depends on the coefficients $\Lambda$ and $\beta$, selected to quickly filter out into the background the objects that are faster than the smoke (like people in movement) and to filter out into the background the objects that are static or slower than the smoke. The value of $A = 0.618$ has been found, in the case of high frame rate values (with $f_r = 60$, then $\beta \sim 0$ and $a \sim 0.618$), starting from the value of 0.7 proposed in [19] and refining it to maximize the accuracy performance with the test video set considered in Sections 4 and 5. The values of $\tau_\alpha$ and $\tau_\beta$ influence the values of $\Lambda$ and $\beta$ and determine the observation times of foreground and background scenes. Because we aim at an early detection of smoke, their value is limited to the order of a few seconds (i.e., 300 frames observed with a typical $f_r$ = 30 fps). Empirical measurements with the test video set considered in Sections 4 and 5 prove that the detection accuracy is maximized with $\tau_\alpha = 1.6\,\tau_\beta$, and hence $\tau_\beta$ = 10 s and $\tau_\alpha$ = 16 s.
The proposed values for the time constants $\tau_\alpha$ and $\tau_\beta$, chosen to filter out objects moving faster than smoke and objects that are static or moving slower than smoke, are also in line with experiments carried out in literature [32].
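As a minimal sketch of how the frame-rate-dependent coefficients described above could be evaluated, the Python fragment below computes β, Λ, and a from the frame rate, using the constants quoted in the text (A = 0.618, a 10 s time constant for β, a 16 s time constant for Λ, and a foreground threshold of 0.08). The function and parameter names are illustrative assumptions, not the authors' implementation, and the state update of Equations (1)–(3) is deliberately not reproduced.

```python
THR_FOREG = 0.08  # empirical foreground threshold quoted in the text


def kalman_coefficients(frame_rate, tau_alpha=16.0, tau_beta=10.0, big_a=0.618):
    """Return (a, beta, lam) for a video with the given frame rate in fps.

    beta = 1 / (1 + tau_beta * f_r), lam = 1 / (1 + tau_alpha * f_r),
    a = A / (1 - beta). Names are hypothetical; values come from the text.
    """
    if frame_rate <= 0:
        raise ValueError("frame_rate must be positive")
    beta = 1.0 / (1.0 + tau_beta * frame_rate)
    lam = 1.0 / (1.0 + tau_alpha * frame_rate)
    a = big_a / (1.0 - beta)
    return a, beta, lam


def is_foreground(fg_value, threshold=THR_FOREG):
    """A pixel enables the AND with the color mask only above the threshold."""
    return fg_value > threshold


if __name__ == "__main__":
    # Frame rates spanning the 10-60 fps range of the test videos.
    for fps in (10, 30, 60):
        a, beta, lam = kalman_coefficients(fps)
        print(f"fps={fps}: a={a:.4f}, beta={beta:.5f}, lambda={lam:.5f}")
```

Because β shrinks as the frame rate grows, a approaches A at high frame rates, which matches the remark that a is close to 0.618 at 60 fps.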

Color Segmentation and Blob Labeling
In this phase, the RGB color frames are converted to gray scale and then to the HSV color space, where H is the hue, S is the saturation, and V is the value. We use only the saturation of the frames and, through a saturation threshold $THR_{sat}$ (set to 0.2 in the range from 0 to 1), we select only those parts of the scene with $frame_{sat} < THR_{sat}$. We choose this parameter because smoke changes color according to the background, so saturation is a more suitable variable than the color itself. The threshold value of 0.2 was found empirically as the value maximizing the estimation accuracy using the dataset detailed in Sections 4 and 5.
The output pixels from motion detection (the foreground pixels) and from color segmentation are combined with a pixel-wise logic AND. After this phase, the output logic mask is filtered with a median filter to remove the isolated pixels, which are considered noise. We also label the agglomerations of pixels, called "blobs". This labeling is used to extract the region properties in order to decide, in the next sub-block, which blobs have the right characteristics to pass the thresholds.
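The saturation test and the pixel-wise AND described above can be sketched as follows. This is only an illustration under stated assumptions: frames are represented as nested lists of 8-bit RGB tuples, the saturation is taken from the standard-library `colorsys.rgb_to_hsv` conversion, and the function names are hypothetical. The median filtering and blob labeling steps are omitted.

```python
import colorsys

THR_SAT = 0.2  # saturation threshold quoted in the text (range 0 to 1)


def low_saturation_mask(rgb_frame, thr_sat=THR_SAT):
    """Mark pixels whose HSV saturation is below the threshold.

    rgb_frame is a list of rows, each row a list of (r, g, b) tuples in 0..255.
    """
    return [
        [colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)[1] < thr_sat
         for (r, g, b) in row]
        for row in rgb_frame
    ]


def candidate_mask(foreground_mask, saturation_mask):
    """Pixel-wise logic AND between the motion mask and the color mask."""
    return [
        [fg and sat for fg, sat in zip(fg_row, sat_row)]
        for fg_row, sat_row in zip(foreground_mask, saturation_mask)
    ]


if __name__ == "__main__":
    frame = [[(128, 128, 128), (255, 0, 0)],
             [(200, 190, 180), (0, 0, 0)]]
    moving = [[True, True], [False, True]]
    print(candidate_mask(moving, low_saturation_mask(frame)))
```

A gray or nearly gray pixel passes the saturation test regardless of its brightness, while a vivid red pixel is rejected even if it is moving.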

Features Extraction
After motion estimation, color segmentation, and blob labeling, the following features of the blobs are extracted inside the feature-extraction block of Figure 1:
• Area is the number of pixels in the region studied.
• Turbulence is calculated as $Perimeter_{blob}^2 / Area_{blob}$.
• Extent is the ratio between the number of pixels in the region and in the bounding box.
• Convex area is the region inside a bounding box delimited by a polygon without concave angles; the corners are defined by the most external pixels.
• Eccentricity is the eccentricity of the ellipse of the bounding box.
The set of thresholds used to select the correct blobs is reported hereafter. We chose these thresholds as a trade-off between the video characteristics.
• $THR_{area} = W \cdot H / 2000$
• $THR_{turb} = 55$
• $THR_{ext\_min} = 0.2$
• $THR_{ext\_max} = 0.95$
• $THR_{eccentr} = 0.99$
• $THR_{convArea} = 0.5$
After setting the thresholds, for each blob we apply the following rules to decide whether it survives to the next analysis:
• $Area_{blob} > THR_{area}$
• $Turbulence_{blob} < THR_{turb}$
• $THR_{ext\_min} < Extent_{blob} < THR_{ext\_max}$
• $ConvArea_{blob} > THR_{convArea}$
• $Eccent_{blob} < THR_{eccentr}$
The turbulence factor we introduce is an amplification of the Boundary Area Roughness (BAR) factor used in other works in literature [32]. The BAR index relates the perimeter of the region to the square root of its area, and the turbulence index is its squared version. For example, the turbulence index is 4π for a circle, 16 for a square, and about 23 for a triangle with sides L, L, and L·sqrt(2). Hence, the heuristic value of $THR_{turb} = 55$ used in this work for smoke blobs is also theoretically justified by being 3 to 4 times higher than that of "regular" figures like a circle. The area threshold $THR_{area}$ is calculated so that for a VGA frame (W = 640, H = 480) its value is about 150 pixels, which is in line with the size of about 100 pixels (10 × 10) considered as typical in literature [32] for video-based detection systems with an observed area within a distance of 100 m.
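The blob selection rules and the turbulence examples above can be checked with a short Python sketch. The `BlobFeatures` container and the function names are illustrative assumptions; in particular, the convex-area feature is assumed here to be already expressed on the same normalized scale as its 0.5 threshold, which the text does not spell out. Evaluating the formula for the right isosceles triangle gives roughly 23.3.

```python
import math
from dataclasses import dataclass


@dataclass
class BlobFeatures:
    """Hypothetical container for the features of one labeled blob."""
    area: float          # number of pixels in the region
    perimeter: float     # boundary length in pixels
    extent: float        # region pixels / bounding-box pixels
    convex_area: float   # assumed normalized to be comparable with 0.5
    eccentricity: float


def turbulence(perimeter, area):
    """Turbulence index: squared perimeter divided by area."""
    return perimeter ** 2 / area


def area_threshold(width, height):
    """THR_area = W * H / 2000, so it scales with the frame size."""
    return width * height / 2000.0


def passes_thresholds(blob, width, height):
    """Apply the five selection rules with the threshold values in the text."""
    return (
        blob.area > area_threshold(width, height)
        and turbulence(blob.perimeter, blob.area) < 55
        and 0.2 < blob.extent < 0.95
        and blob.convex_area > 0.5
        and blob.eccentricity < 0.99
    )


if __name__ == "__main__":
    side = 1.0
    print("circle  :", turbulence(2 * math.pi, math.pi))
    print("square  :", turbulence(4 * side, side ** 2))
    print("triangle:", turbulence(2 * side + side * math.sqrt(2), side ** 2 / 2))
    print("VGA area threshold:", area_threshold(640, 480))
```

Since the turbulence index is scale invariant, the same threshold applies to small and large blobs, while the area threshold follows the frame resolution.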
It is worth noting that some papers listed in the introduction use some static smoke features. For example, [14] used the distance between the pixels and the clusters, [17] used the thermal turbulence and diffusion and the complexity of the edges, and [20] used the expansion direction of smoke regions. Different from state-of-the-art systems, the algorithm proposed in this work does not use just a single feature, but combines several geometric features: area, turbulence, extent, convex area, and eccentricity. By considering more parameters, the proposed AdViSED technique can reach better performance in terms of accuracy of the alarm prediction; see Section 5.

Bounding Box Extraction and Time/Edge Analysis
After the feature extraction and first decision step discussed above, the surviving blobs are analyzed dynamically (time-based analysis step in Figure 1), considering the characteristics of the bounding box that contains them. The bounding box table consists of the list of all bounding boxes; for each of them we store the dimensions, the x and y origin, and a counter called kill time. The kill time is initialized to the same value for each box (i.e., kill_time = frame_rate). The counter is decreased by one for each frame analyzed thereafter. When the counter reaches zero, the bounding box is deleted. This means that each bounding box has a lifetime of one second, e.g., kill_time = 30 for a frame_rate of 30 fps.
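The kill-time bookkeeping described above can be sketched as follows. The `TrackedBox` record and the helper names are assumptions introduced for illustration; only the rule itself (initialize the counter to the frame rate, decrease it at every frame, delete the box at zero) comes from the text.

```python
from dataclasses import dataclass


@dataclass
class TrackedBox:
    """Hypothetical entry of the bounding box table."""
    x: int
    y: int
    width: int
    height: int
    kill_time: int  # remaining lifetime in frames


def new_box(x, y, width, height, frame_rate):
    """Create a box whose lifetime is one second (kill_time = frame_rate)."""
    return TrackedBox(x, y, width, height, kill_time=int(frame_rate))


def age_boxes(table):
    """Decrease every counter by one and drop the boxes that reach zero."""
    survivors = []
    for box in table:
        box.kill_time -= 1
        if box.kill_time > 0:
            survivors.append(box)
    return survivors


if __name__ == "__main__":
    table = [new_box(10, 20, 40, 30, frame_rate=30)]
    frames = 0
    while table:
        table = age_boxes(table)
        frames += 1
    print("box deleted after", frames, "frames")
```

Tying the counter to the frame rate keeps the lifetime at one second regardless of the camera settings.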
In parallel to the countdown, there is also a test to discard the smoke blobs without marked edges (edge-based analysis step in Figure 1). This edge test deletes the bounding boxes that do not respect the constraint, without waiting for the kill time to expire.
To implement this task, we use the Sobel edge detection method in Equation (5). With $I$ being the source image, $G_x$ and $G_y$ represent the vertical and horizontal derivative approximations, where * denotes the two-dimensional signal processing convolution operation. The resulting gradient approximations can be combined to give the gradient magnitude using Equation (6).
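For reference, a Sobel gradient magnitude can be sketched in plain Python as below. The 3 × 3 kernels are the standard Sobel masks and the magnitude is the usual sqrt(G_x² + G_y²); they are assumed to correspond to Equations (5) and (6), which are not reproduced in this text. The image is a nested list of gray levels, border pixels are left at zero for simplicity, and the function name is hypothetical.

```python
import math

# Standard Sobel masks (assumed to match the kernels of Equation (5)).
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1],
           [0, 0, 0],
           [1, 2, 1]]


def sobel_magnitude(image):
    """Return the gradient magnitude of a gray-level image (list of rows).

    The masks are applied as a correlation; flipping them, as a strict
    convolution would, only changes the sign of G_x and G_y, so the
    magnitude is the same.
    """
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0.0
            for dy in range(3):
                for dx in range(3):
                    pixel = image[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * pixel
                    gy += SOBEL_Y[dy][dx] * pixel
            out[y][x] = math.hypot(gx, gy)
    return out


if __name__ == "__main__":
    step = [[0, 0, 1, 1] for _ in range(4)]  # vertical edge in the middle
    for row in sobel_magnitude(step):
        print(row)
```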

Pre-Alarm Signal and Smoke Alarm
At runtime, if smoke is present in the scene captured by the camera, multiple bounding boxes are expected to overlap. Therefore, we count the number of overlaps: for each pixel, the algorithm counts how many bounding boxes it belongs to. An index called overlap takes the maximum of these values, and this overlap index is used to generate the final smoke alarm.
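A direct, unoptimized sketch of the overlap index is given below: each bounding box increments a per-pixel counter, and the index is the maximum counter value. Boxes are assumed to be (x, y, width, height) tuples in pixel coordinates, and the function name is hypothetical.

```python
def overlap_index(boxes, frame_width, frame_height):
    """Maximum number of bounding boxes covering any single pixel.

    boxes is an iterable of (x, y, width, height) tuples; parts of a box
    that fall outside the frame are ignored.
    """
    counts = [[0] * frame_width for _ in range(frame_height)]
    for (x, y, w, h) in boxes:
        for row in range(max(y, 0), min(y + h, frame_height)):
            for col in range(max(x, 0), min(x + w, frame_width)):
                counts[row][col] += 1
    return max((c for row in counts for c in row), default=0)


if __name__ == "__main__":
    boxes = [(0, 0, 4, 4), (2, 2, 4, 4), (3, 3, 2, 2)]
    print("overlap index:", overlap_index(boxes, 8, 8))
```

A persistent smoke plume keeps producing boxes in roughly the same region, so the index grows, while an isolated moving object yields few overlapping boxes.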
A threshold is used to generate a binary signal from the prealarm signal, i.e., the overlap index. We chose $THR_{overlap} = 7$, so at least seven overlapping bounding boxes are needed to generate a logical one in this comparison. To activate the smoke alarm, we require the signal to stay above the threshold for some seconds within a sliding temporal window. The selected time window has a size of 3 s, and the overlap index must exceed the threshold for at least 1.5 s inside the window. If this happens, the smoke alarm is generated. This is implemented with the M of N decision algorithm, where M is the number of detected smoke frames in 1.5 s and N is the number of frames in 3 s.
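The M of N decision described above can be sketched with a fixed-length sliding window. The class name and its parameters are illustrative assumptions; the comparison uses "at least seven" overlapping boxes, following the wording of the text, and M and N are obtained by rounding 1.5 s and 3 s worth of frames at the given frame rate.

```python
from collections import deque


class MOfNAlarm:
    """Raise the alarm when at least M of the last N frames are positive."""

    def __init__(self, frame_rate, thr_overlap=7, window_s=3.0, min_active_s=1.5):
        self.thr_overlap = thr_overlap
        self.n = max(1, round(window_s * frame_rate))      # frames in 3 s
        self.m = max(1, round(min_active_s * frame_rate))  # frames in 1.5 s
        self.window = deque(maxlen=self.n)                 # sliding window

    def update(self, overlap_index):
        """Push the prealarm value of one frame and return the alarm state."""
        self.window.append(overlap_index >= self.thr_overlap)
        return sum(self.window) >= self.m


if __name__ == "__main__":
    alarm = MOfNAlarm(frame_rate=10)
    for frame in range(20):
        state = alarm.update(8)
        if state:
            print("alarm raised at frame", frame)
            break
```

The sliding window tolerates short dropouts of the prealarm signal while still rejecting isolated spikes, at the price of a minimum detection latency of 1.5 s.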

The proposed technique has been tested with a large set of test videos from 320 × 240 to 1920 × 1440 pixels/frame, while the frame rate is between 10 fps and 60 fps (including all intermediate formats usually present in the state-of-art systems). Hereafter, we show the pictures of the output processed videos for three outdoor cases (Figures 2i and 3a,c) and one indoor case (Figure 3b) representing both the case of true positive (Figures 2i and 3b) and negative (Figure 3a,c) conditions. Purple boxes in the figures refer to detected blobs. The alarm is only generated when the red circle appears on the top left of the image (Figures 2i and 3b); otherwise the circle displayed is green.
Figure 2 is an example application of a smart city antifire system showing the output mask of each processing step already described. Figure 3a is related to an intelligent transportation system with antifire safety features. Figure 3b is an example of an active antifire home alarm. Figure 3c is an example of forest fire prevention. In the elaboration of Figures 2i and 3b, the smoke alarm is activated, while in Figure 3a,c the smoke alarm is not activated. This algorithm, different from [14,16–18,20,21], surrounds the smoke blobs and examines the evolution of their bounding boxes. This way, AdViSED can reach better performance in terms of accuracy of the alarm prediction.

AdViSED Thresholds Analysis
To evaluate the performance of the algorithm, we ran it over a selection of videos available in two datasets, one from [33,34] and the other developed in the FP7 EU project Firesense and largely used in literature [35]. The final dataset is the union of the test videos in [33–35] (some scenes are present both in [33,34] and in [35]), plus other videos, which are not publicly available, provided by the Engineering of Trenitalia, the Italian national railway company. Such videos have been acquired during specific fire/smoke tests on railway wagons, done at the Trenitalia testing facility in Osmannoro, Italy. The final dataset is composed of both videos with smoke presence and videos with no smoke presence. In both cases, we have outdoor and indoor videos of different quality (i.e., different levels of noise) and different frame rates, from 10 to 60 fps. Most of the videos used are often adopted in literature to compare the smoke/fire measurements and detection performance of state-of-the-art techniques. For the metrics computation, we need to define the true/false and positive/negative terms. We define a positive video from the dataset as a video with smoke presence, while a negative video is without it. These terms are defined in Table 1.
The metrics chosen to evaluate the goodness of the algorithm are recall, precision, accuracy, F1 score, and MCC; see the formulas in Equations (7)–(11). The recall indicates how many relevant items are selected. The precision indicates how many selected items are relevant. The accuracy shows how close the prediction is to the true value. The F1 score is the harmonic mean of precision and recall. The MCC parameter has a range that starts at −1, when there is a complete misalignment between predicted and true value, and ends at +1, when there is a complete alignment between predicted and true value. An MCC equal to 0 means that the prediction is random compared to the true values.
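As a minimal sketch, the five metrics can be computed from the counts of true/false positive and negative videos as shown below. The formulas are the standard textbook definitions, assumed here to coincide with Equations (7)–(11), which are not reproduced in this text; the function name and the example counts are hypothetical and are not results from the paper.

```python
import math


def detection_metrics(tp, fp, tn, fn):
    """Recall, precision, accuracy, F1 score, and MCC from confusion counts.

    Ratios with a zero denominator are reported as 0.0 by convention.
    """
    total = tp + fp + tn + fn
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"recall": recall, "precision": precision,
            "accuracy": accuracy, "f1": f1, "mcc": mcc}


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    for name, value in detection_metrics(tp=8, fp=2, tn=9, fn=1).items():
        print(f"{name}: {value:.3f}")
```

Unlike accuracy, the MCC uses all four counts symmetrically, so it stays informative when the numbers of positive and negative videos are unbalanced.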
In this section, the analysis of the thresholds is reported. For each threshold type, many tests were performed until the best values were found. As an example of the tests that have been done, we report in Figure 4 the metrics of some non-optimized threshold sets, comparing the achieved results to set no. 6, which is the final selected one optimizing all metrics. The results in Figures 5 and 6 are obtained as an average over the whole set of testing videos. Each set of thresholds is detailed in Table 2. The $THR_{sat}$ value is set using the real smoke saturation color. The $THR_{area}$, $THR_{turb}$, $THR_{eccentr}$, $THR_{ext\_min}$, $THR_{ext\_max}$, and $THR_{convArea}$ values are set using the real features of the smoke blobs in the frame. The kill time is set to delete the bounding box within a reasonable time. The $THR_{overlap}$ is set considering a reasonable number of overlapping bounding boxes in a real smoke scene. The $THR_{foreg}$ is set considering the motion detection operation result $I − BG_{k−1}$ and filtering this value with the threshold according to the best background subtraction. After these considerations, we adjusted the thresholds to find the values that maximize the metrics. In each threshold set in Table 2, just a single threshold type is changed, while the other threshold values are kept fixed. Sets 1A and 1B use $THR_{overlap}$ = 14 and 4; sets 4A and 4B use $THR_{sat}$ = 0.6 and 0.1; sets 3A and 3B use $THR_{foreg}$ = 0.16 and 0.04; sets 2A and 2B adopt $THR_{area}$ = Area/3000 and Area/1000 and $THR_{turb}$ = 70 and 40; sets 5A and 5B use $THR_{eccentr}$ = 1 and 0.7, $THR_{convArea}$ = 0.9 and 0.1, $THR_{ext\_min}$ = 0.1 and 0.9, and $THR_{ext\_max}$ = 0.1 and 0.4; set 6 in Table 2 is the optimal one.

Performance and Complexity Results
Figure 5 shows a comparison in terms of estimation parameters between the algorithm proposed in [22,23] and AdViSED. Another comparison in terms of computational complexity is provided in Figure 6. The latter is evaluated as normalized execution time when running the algorithms in software on an Intel Core i3-4170 CPU, with Intel HD Graphics 4400 GPU and 8 GB DDR3 RAM, equipped with the Windows 10 Pro x64 operating system. Results in Figures 5 and 6 represent the average of the results obtained over the whole set of test videos.
Figure 5. Performance metrics comparison of [22,23] and AdViSED.
The results in Figure 5 show an improvement of AdViSED in terms of all parameters (MCC, F1 score, accuracy, precision, and recall) compared to the state-of-the-art techniques [22,23]. This is mainly due to the improved Kalman-based motion estimator in Section 3.1 and to the fact that the decisor in Section 3.3 takes into account multiple geometrical parameters (together with the color parameters). The new edge and time-based analysis in Section 3.4 also yields an improvement with respect to [22,23]. The results in Figure 6 show a relevant improvement of the computational complexity. Indeed, using the same dataset and the same hardware condition, the total processing time is reduced to roughly a third (about 63% less, as annotated in Figure 6). The improvement is justified by the fact that a more accurate Kalman motion estimation avoids the generation of too many smoke blobs that must be iteratively processed in the next steps. Figure 7 shows how the computational cost is shared between the different functions of the AdViSED workflow, averaging the results achieved on the considered test video set. From Figure 7 it is clear that the most computing-intensive tasks are the feature extraction and alarm decisor steps, since they are applied iteratively on the preselected blobs.

AdViSED was also compared with the state-of-the-art techniques in [20,21,36–44] using the set of test videos shown in Figure 8. The results in Table 3 show that AdViSED has improved recognition capabilities in terms of correctly detected frames and has no false alarms. Table 4 shows instead a comparison with the methods of Millan-Garcia [20], Yu [42], and Toreyin [36,40,41] in terms of delay (measured in number of frames) in revealing the presence of smoke/fire. Table 4 shows that AdViSED is faster than the other video-based algorithms reported in the table, except for video n.5, due to the fence in front of the smoke. Even in the worst case of Table 4, an early alarm is generated in 130 frames, i.e., about 4 s at 30 fps, about 15 times faster than EN50155-standardized techniques, which react in 60 s.
Table 3. Comparison of correctly detected frames and false alarms with respect to state-of-the-art techniques.
Note that we used different reference works in Tables 3 and 4, since in the literature the same set of test videos in Figure 8 is used but the results are reported in different ways. For example, Wang and Toreyin report performance as the number of correct frames and false alarms (Table 3), while the works in Table 4 report the delay in smoke detection measured in number of frames.
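The conversion from a detection delay in frames to a latency in seconds, and the comparison with the 60 s reference, can be checked with a short calculation. This is only a worked example; the helper names are assumptions.

```python
def alarm_latency_s(delay_frames, fps):
    """Convert a detection delay expressed in frames into seconds."""
    return delay_frames / fps

def speedup(reference_s, latency_s):
    """Ratio between a reference response time and the measured latency."""
    return reference_s / latency_s

worst_case = alarm_latency_s(130, 30)      # worst case of Table 4
print(round(worst_case, 2))                # 4.33
print(round(speedup(60, worst_case), 1))   # 13.8
print(speedup(60, 4))                      # 15.0 when the latency is rounded to 4 s
```

The factor of 15 quoted in the text follows from rounding the worst-case latency to 4 s; the unrounded ratio is slightly lower.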

Real-Time Embedded Platform Implementation
The AdViSED video-based measuring technique has been implemented on a single-board embedded computer, a Raspberry Pi 3 Model B (RPI), to test its performance in a real-world scenario. The Raspberry Pi, shown in Figure 9, is small, powerful, and low cost, at about 30 USD per unit. For these reasons, it is a well-suited platform for applications such as distributed and networked measuring nodes in smart city or intelligent transport system scenarios. The board is equipped with a Broadcom BCM2837, a System on Chip (SoC) including a 1.2 GHz 64-bit quad-core ARM Cortex-A53 processor with a 512 KB L2 cache, 1 GB of DDR2 RAM, a VideoCore IV GPU, 4 USB 2.0 ports, onboard 2.4 GHz 802.11n WiFi, Bluetooth 4.1 Low Energy, 40 GPIO pins, and many other features [45]. It runs Raspbian OS, a Debian-based Linux distribution available for download [46]. The camera module used in the implementation is an RPI camera board v1.3 that plugs directly into the CSI (camera serial interface) connector on the Raspberry Pi board. The module is able to deliver a 5 MP resolution image or 1080p HD video recording at 30 fps [47]. The MATLAB "Run on Hardware" support package is adopted to generate C code from the MATLAB algorithm and run it on the hardware as a stand-alone application.
Unfortunately, only a subset of MATLAB built-in functions and toolboxes is supported for efficient C/C++ code generation on the embedded Broadcom SoC. Therefore, the algorithm was examined and modified with a low-level C implementation in mind. Hereafter, we report the main modifications made to the algorithm description in order to embed it in the Broadcom SoC.
For example, the original algorithmic description in Section 3 adopts variable-size data structures for the storage of the candidate blobs (which are then eliminated during the time/edge-based analysis in Section 3.4) and for the storage of overlapping structures in Section 3.5. However, variable-size data structures are critical for embedded systems. To implement fixed-size data structures on the stack of the RPI, a statistical analysis of the memory required by AdViSED was carried out using different test videos. During these tests, we observed that the maximum number of bounding boxes, including blobs, displayed together never exceeded 200 in the whole dataset. Therefore, a fixed size has been used for the data structure containing the coordinates of the boxes and the kill frames. This means that for each single frame we can track information for a maximum of 200 boxes.
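The fixed-capacity storage described above can be illustrated as follows. This is a minimal sketch in Python rather than the generated C code; the class name `FixedBoxTable` and the row layout (x, y, width, height, kill frame) are assumptions made for illustration, and only the capacity of 200 comes from the text.

```python
MAX_BOXES = 200  # upper bound observed on the dataset, as reported in the text

class FixedBoxTable:
    """Preallocated table of bounding boxes; it never grows at runtime."""

    def __init__(self, capacity=MAX_BOXES):
        self.capacity = capacity
        # Each row holds [x, y, width, height, kill_frame] (assumed layout).
        self.rows = [[0, 0, 0, 0, 0] for _ in range(capacity)]
        self.count = 0

    def clear(self):
        """Reset the table for a new frame without reallocating memory."""
        self.count = 0

    def add(self, x, y, w, h, kill_frame):
        """Store one box; return False (and drop it) when the table is full."""
        if self.count >= self.capacity:
            return False
        row = self.rows[self.count]
        row[0], row[1], row[2], row[3], row[4] = x, y, w, h, kill_frame
        self.count += 1
        return True

    def active(self):
        """Return the boxes stored for the current frame."""
        return [tuple(row) for row in self.rows[:self.count]]
```

Rejecting boxes beyond the capacity keeps memory usage bounded, at the price of ignoring extra candidates in scenes more crowded than those observed in the dataset.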
Moreover, the arrays used for the sliding windows were declared with a number of elements equal to the product of the camera frame rate and the width of the window in seconds (parameter "N" in Section 3), instead of declaring a variable-size buffer as in the original algorithmic description. The new data structures are managed as circular buffers to avoid any type of error at runtime. The sizes of the frame matrices (BG, etc.) depend on the input frame size, and thus, they are declared statically at the beginning according to the camera resolution.
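A fixed-size circular buffer of this kind, combined with an M-of-N style decision, could look like the sketch below. The class name, the running-sum update, and the interpretation of the threshold `m` as a number of positive frames inside the window are assumptions; only the buffer size (frame rate times window width in seconds) follows the text.

```python
class SlidingWindowDecisor:
    """Fixed-size circular buffer of per-frame detection flags."""

    def __init__(self, fps, window_s, m):
        self.size = fps * window_s      # number of frames in the window
        self.flags = [0] * self.size    # allocated once, never resized
        self.head = 0                   # index of the oldest entry
        self.total = 0                  # running count of positive flags
        self.m = m                      # assumed alarm threshold (in frames)

    def push(self, detected):
        """Insert the newest flag, overwrite the oldest, return alarm state."""
        value = 1 if detected else 0
        self.total += value - self.flags[self.head]
        self.flags[self.head] = value
        self.head = (self.head + 1) % self.size
        return self.total >= self.m

# Example: a 2 s window at 2 fps (4 slots), alarm with at least 3 positives.
decisor = SlidingWindowDecisor(fps=2, window_s=2, m=3)
states = [decisor.push(flag) for flag in (True, True, True, False, False)]
print(states)  # [False, False, True, True, False]
```

Because the index wraps with a modulo operation, writes can never fall outside the preallocated array, which is the kind of runtime error the fixed-size design is meant to avoid.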
Furthermore, some functions adopted in the original AdViSED description, such as "mat2gray" and the "convex area" feature of the "regionprops" function, are not yet supported for C/C++ code generation on ARM cores. The absence of "mat2gray" was solved by simply scaling the values of the matrix from the range 0-255 to the range 0-1. For the "convex area" feature, instead, it was not possible to find a direct workaround; therefore, this feature was removed when implementing AdViSED in the embedded system. For that reason, the THR overlap factor was increased from 7 to 8. The new ready-to-deploy code was tested again on the whole dataset to check for possible differences with respect to the previous version of the code. The same metrics already shown in Sections 4 and 5 were obtained.
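The fixed-range scaling used in place of "mat2gray" can be sketched as follows, here in Python on a nested list for illustration; the function name `scale_to_unit` is an assumption.

```python
def scale_to_unit(frame):
    """Map 8-bit pixel values (0-255) to floats in the range 0-1."""
    return [[pixel / 255.0 for pixel in row] for row in frame]

gray = scale_to_unit([[0, 255], [51, 102]])
```

Note that MATLAB's "mat2gray", when called without explicit limits, normalizes between the minimum and maximum of the input, so a fixed division by 255 gives the same result only when the data span the full 8-bit range.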
We tested the final implementation using different video resolutions and frame rates, porting AdViSED both to the Broadcom SoC with Raspbian OS and to an x86-based 64b general purpose processor (GPP) running Windows 10, as shown in Table 5. The PC is equipped with an Intel Core i3-4170 running at 3.70 GHz and 8 GB of RAM. To acquire video frames on the PC, we used a simple GitUp Git2 action camera connected via USB. The performance of the two platforms was compared in terms of average frames per second (fps) processed over the whole test video set introduced in Section 4. In real-time, we obtain 19 fps for the x86 64b GPP, while the RPI reaches 10.3 fps for the 320 × 480 frame size. For the 320 × 240 frame size, the maximum frame rate processed in real-time is about 36 fps for the x86 64b GPP and about 19 fps for the RPI platform.
The data presented in Table 5 refer to the worst case, in which each frame is displayed during processing on a screen connected to the GPP and on the RPI LCD display. Visualizing the output images requires considerable computation time; therefore, the application can be sped up simply by closing the visualization windows. For this reason, we ran further tests without showing the processed frames, thus reducing the total overhead. In any case, it was always possible to retrieve the alarm, overlap index, and processing time information in the terminal window. The real-time fps results without output image visualization are shown in Table 6. Table 6 shows a performance increase on both platforms for all the resolutions. In real-time, we obtain about 30 fps for the x86 64b GPP, while the RPI reaches 13.4 fps for the 320 × 480 frame size. For the 320 × 240 frame size, the maximum frame rate processed in real-time is about 47 fps for the x86 64b GPP and about 25 fps for the RPI platform.
The latter value is about eight times higher than the result achieved by the state-of-the-art technique in [27], where a Raspberry-based implementation of a video-based smoke measuring system was limited to a maximum of 3 fps for the same input video resolution.
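The average frame rates discussed above are throughput figures, i.e., processed frames divided by elapsed time. A minimal sketch of such a measurement is given below; `average_fps`, `speed_ratio`, and the dummy processing function are illustrative assumptions, not the authors' benchmark code.

```python
import time

def average_fps(process_frame, frames):
    """Run process_frame on every frame and return the average throughput."""
    start = time.perf_counter()
    count = 0
    for frame in frames:
        process_frame(frame)
        count += 1
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")

def speed_ratio(fps_a, fps_b):
    """How many times faster platform A is with respect to platform B."""
    return fps_a / fps_b

# Dummy workload standing in for the per-frame processing.
fps = average_fps(lambda frame: sum(frame), [[1, 2, 3]] * 1000)

# Ratio between the RPI result (about 25 fps) and the 3 fps reported for [27].
print(round(speed_ratio(25, 3), 1))  # 8.3
```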
In Table 6, the implementation on the x86-based 64b GPP ensures, of course, the best performance in terms of maximum frame rate processed in real-time. However, the RPI retains the advantages of lower cost (30 USD instead of hundreds of USD), lower power consumption, and portability over a desktop computer, while still offering relatively good performance. Figure 10 shows power consumption measurements of the Raspberry Pi platform using different configurations and frame sizes. During the tests, a 5 V bench power supply was used and the measurements were made by tracking the current absorption in mA. The board was disconnected from any peripherals, such as keyboard and mouse, except for the camera. In Figure 10, the term "display" refers to the test configuration using a 5-inch display attached to the RPI board (that of Table 5). The other configuration refers to the version that sends data about smoke alarms to the console (that of Table 6). From Figure 10, we can observe that in the idle state the mean power consumption is around 2.9 W with the display connected, while during the execution of the application the power consumption increases to a maximum of 4.36 W. The terminal version of the application (without displaying processed frames) has a power consumption below 1.5 W in idle mode and within 2.4 W in processing mode. Such values are orders of magnitude lower than typical GPP or GPU power costs.
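Since the measurements track the current drawn from a 5 V supply, power follows from P = V · I. The short sketch below converts between the two quantities; the function names are assumptions, and the current values are derived from the reported power figures rather than taken from Figure 10.

```python
SUPPLY_VOLTAGE_V = 5.0  # bench supply voltage stated in the text

def power_from_current(current_ma, supply_v=SUPPLY_VOLTAGE_V):
    """Power in watts from a current reading in milliamperes."""
    return supply_v * current_ma / 1000.0

def current_from_power(power_w, supply_v=SUPPLY_VOLTAGE_V):
    """Current in milliamperes corresponding to a given power in watts."""
    return power_w / supply_v * 1000.0

# Derived values: 4.36 W at 5 V corresponds to about 872 mA,
# and 2.4 W corresponds to about 480 mA.
peak_display_ma = current_from_power(4.36)
peak_terminal_ma = current_from_power(2.4)
```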
Such an embedded device has also been considered as the final platform for implementing a distributed antifire and surveillance system exploiting an IoT architecture with several camera nodes [12].

Conclusions
The paper proposes AdViSED, a novel video smoke detection algorithm for antifire surveillance systems, considering both outdoor and indoor application scenarios. To reduce installation costs, the application scenario considers a fixed single camera, working in the visible spectral range, already installed in a closed-circuit television system for surveillance purposes. Thanks to the adoption of a Kalman-based motion detection technique, color analysis, image segmentation, blob labeling, time/edge-based blob analysis, geometrical features analysis, and an M out of N decisor, the measurement system is able to generate an alarm signal with better response latency and measurement metrics than state-of-the-art techniques. The metrics considered are F1, accuracy, precision, recall, and MCC. For example, when compared to [22,23] on a set of test videos available from the Firesense EU project, the accuracy and precision metrics are improved by about 20%, the MCC score doubles, and the recall increases up to 1. The computational complexity of the proposed technique is reduced by 63% compared with the work in [22,23], when considering the same hardware and software computing platform. With respect to the works in [20,36,40–42], AdViSED ensures a reduced response latency, while achieving equal or better measurement accuracy metrics. Several tests, carried out with different frame rates and frame sizes, have confirmed the scalability of the proposed measurement technique to different input camera sensors. AdViSED has been implemented on platforms using both x86 64b GPP processors and embedded ones based on ARM cores, such as the Raspberry Pi 3. The latter achieves a real-time performance eight times better than state-of-the-art works targeting the same embedded unit [27]. Power measurements of the embedded implementation show that its power cost is below 2.4 W.
The low cost and power of the final implementation platform, which also includes Wi-Fi and Bluetooth connection capabilities, make it suitable for the implementation of a distributed measuring system for smoke/fire surveillance in application scenarios like smart cities or intelligent transport systems. As a future step to improve the camera-based antifire surveillance system, we will try adopting lightweight deep learning networks such as MobileNet, ShuffleNet, and SqueezeNet [32–35]. Although these deep neural networks are not directly related to smoke detection, a transfer learning solution could be used to improve the algorithm at the base of the proposed distributed antifire and surveillance system [12].
Author Contributions: A.G. and S.S. conceived and designed the experiments, performed the experiments, analyzed the data and wrote the paper. All authors have read and agreed to the published version of the manuscript.
Funding: This research has been partially supported by Crosslab IoT-Dipartimenti di Eccellenza project by University of Pisa/MIUR and by POR FSE EFEST project (Tuscany Region and Solari di Udine S.P.A.).