A Two-Stream CNN Model with Adaptive Adjustment of Receptive Field Dedicated to Flame Region Detection

Abstract: Convolutional neural networks (CNNs) have yielded state-of-the-art performance in image segmentation. Their application in video surveillance systems can provide very useful information for extinguishing fire in time. Current studies have mostly focused on CNN-based flame image classification and have achieved good accuracy, whereas research on CNN-based flame region detection is extremely scarce because of the bulky network structures and high hardware configuration requirements of state-of-the-art CNN models. Therefore, this paper presents a two-stream convolutional neural network for flame region detection (TSCNNFlame). TSCNNFlame is a lightweight CNN architecture comprising a spatial stream and a temporal stream for detecting flame pixels in video sequences captured by fixed cameras. The static features from the spatial stream and the dynamic features from the temporal stream are fused by three convolutional layers to reduce false positives. We replace the convolutional layer of the CNN with the selective kernel (SK)-Shuffle block, constructed by integrating SK convolution into the deep convolutional layer of ShuffleNet V2. The SK blocks can adaptively adjust the size of a receptive field according to the proportion of the region of interest (ROI) within it. The grouped convolution used in ShuffleNet solves the problem that the multi-branch structure of SKNet causes the network parameters to double with the number of branches. The resulting CNN dedicated to flame region detection therefore balances efficiency and accuracy through its lightweight architecture, its temporal-spatial feature fusion, and the advantages of the SK-Shuffle block. The experimental results, which are evaluated by multiple metrics and analyzed from many angles, show that this method achieves significant performance while reducing the running time.


Introduction
Fire is one of the major disasters threatening human life and property. With the rapid development of image processing, many fire video surveillance systems have been deployed in the wild and in densely populated places, such as forests, tourist attractions, and public buildings. The rapid and accurate detection of fire can provide useful fire-extinguishing information in time, which is of great significance for reducing casualties and property losses.
Traditional fire image detection schemes extracted handcrafted features, including static characteristics (color and texture features) [1,2] and dynamic characteristics (such as blinking characteristics) [3,4]; these features were then fused into feature vectors in order to classify fire images with machine learning classification algorithms [5]. The extraction steps of these handcrafted features are cumbersome, and the generalization of such methods is limited [6,7]. With the improvement of hardware computing power, convolutional neural networks (CNNs) have developed rapidly and have been widely applied in the field of image processing because they outperform handcrafted feature schemes in feature extraction.

Related Works
Many researchers have been devoted to extracting the static and dynamic features of flame videos according to color, texture, and blinking characteristics. For example, Chen et al. [13] combined RGB color space with the saturation channel of HSI color space and extracted moving regions based on the inter-frame difference. Celik et al. [1] explored YCbCr color space and novel rules for separating the chrominance and luminance components. BoWFire, proposed by Chino et al. [2], used uniform-pattern LBP to reduce false positives. To improve the precision of flame detection, motion characteristics are utilized alone or combined with static characteristics. Wang et al. [14] proposed two new spatiotemporal features: spatiotemporal structural features and spatiotemporal contour dynamics features. Toreyin et al. [15] presented a method extracting flame flicker by computing the spatial wavelet transform of moving fire-colored regions. The major problem with handcrafted features is that some moving flame-colored pixels may be falsely recognized as flame pixels. In particular, the color of the inner flame is sometimes white, and inner-flame pixels do not show distinct movement characteristics, which leads to incomplete detection of flame regions by color rules or motion models.
To overcome the disadvantages of handcrafted features, some researchers tried to use CNNs to classify flame images. For example, two-stage schemes were presented in [10,11,16]. In the first stage, a candidate flame image was detected by a CNN classifier [16] or by judging whether suspected flame regions existed in the image [10,11]. In the second stage, a CNN was trained to distinguish flame images from the candidate flame images. Jivitesh et al. [9] showed that a traditional CNN performs relatively poorly when tested on an imbalanced dataset and thus proposed to use even deeper CNNs for fire detection in images. Obviously, high accuracy was achieved at the expense of increased running time. A few lightweight CNN structures were proposed to reduce the processing time [17-19]. In [17], a lightweight CNN with no dense fully connected layers was built to make the network computationally inexpensive. The CNN architecture was also simplified by reducing and optimizing parameters experimentally (called the experimentally defined CNN in [18]). Moreover, other deep neural networks such as Faster RCNN [19,20], YOLO [21], and GoogleNet [22] were used to detect flame images.
Compared with the above achievements on flame image detection, research on flame region detection is extremely scarce; only [11,18] proposed CNN architectures for flame region detection and semantic understanding. There are some state-of-the-art deep convolutional neural networks dedicated to image segmentation, such as FCN [23], Mask RCNN [24], and the DeepLab series [25-28]. However, FCN gains performance at the cost of increases in parameter quantity and computational complexity. Mask RCNN has a bulky network structure and requires a high hardware configuration. Although the DeepLab series has a relatively simple architecture, these networks are designed for multi-target detection, so redundancy is inevitable if they are used for a single-object detection task such as flame region detection.
The most closely related work is [12]. To the best of our knowledge, CNNFire used a simple and lightweight convolutional network to solve the task of flame region detection. In [12], the scheme was a two-stage process: the first stage recognizes flame images, and the second stage detects flame regions in the recognized flame images. However, the scheme designed dedicated decision rules and set the corresponding thresholds to classify flame images and pixels. In fact, not all flame images and pixels match these rules, and the performance can also degrade with inappropriate thresholds. Moreover, CNNFire did not consider the dynamic characteristics of flames in a video sequence.
To achieve good accuracy with reduced running time and model size, this paper presents a novel CNN architecture based on the fusion of selective kernel network (SKnet) and Shuffle network (ShuffleNet).
Our contributions. The key original contributions can be summarized as threefold: (1) We construct a two-stream network to utilize spatial and temporal features for flame region detection. The spatial features represent color features, texture features, and so on. The temporal features represent dynamic characteristics such as blinking characteristics. (2) We pre-divide the input video frame into several equal cells and design a lightweight network that can identify all flame cells through one-stage detection, thus avoiding the complex units of segmentation networks designed to adapt to multiple targets. (3) We replace the convolutional layers with a convolutional block combining SKNet and ShuffleNet V2. The SK convolution is integrated into the deep convolutional layer of ShuffleNet V2 so that the convolutional block can adaptively adjust the size of its receptive fields, which is of great significance for very small flame areas captured in a distant view or small fragment-like flame regions.
Organization. The rest of the paper is organized in the following way. Section 3 presents the two-stream network for flame region detection in detail. Our experimental results on benchmark datasets and a feasibility analysis of the proposed work are discussed in Section 4. Finally, conclusions and future research directions are drawn in Section 5.

The Proposed Model
In this paper, we focus on flame region detection in a surveillance video. The state-of-the-art deep convolutional networks for image segmentation are very time-consuming and redundant for our task. To balance efficiency and accuracy, we built a two-stream convolutional neural network for flame region detection (TSCNNFlame).

Pipeline Overview
The pipeline of the proposed network is depicted in Figure 1. The architecture of the network is lightweight and consists of a spatial stream and a temporal stream. The input of TSCNNFlame has two parts: one is an image, and the other is two differential images calculated at an interval of three frames before and after the image. Considering that each small piece of flame has the characteristics of the whole flame, we divide the video frame into several cells of equal size, take these cells as candidate regions to judge whether there is a flame in each cell, and finally integrate the flame cells into one intact flame region. The network outputs a foreground probability map at a reduced resolution, indicating the likelihood that a cell is foreground.
The two-stream backbone of this network is built with VGG network elements, but we replace two convolutional layers with the SK-Shuffle convolutional block. The SK-Shuffle block, followed by a pooling layer, combines the depthwise convolutional layer of ShuffleNet V2 with the SK convolutional layer to adjust the sizes of receptive fields adaptively while reducing running time. The architecture of the SK-Shuffle block is elaborated in Sections 3.2 and 3.3. The feature maps from the two streams are concatenated. Then, three 1 × 1 convolutional layers and softmax are applied to obtain the classification result of each small cell. In this two-stream network, the spatial stream effectively extracts the static characteristics of the flame directly from static frames. The temporal stream takes the differential images as input, which is simpler and faster than optical flow and can be a good supplement to static appearance information [29]. In general, this is a simple and clear network. It can be effective thanks to the special property that a flame cell has almost the same visual features as the whole flame region, which allows us to skip the calculation of candidate regions and directly judge whether each cell is a flame cell.
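As a concrete illustration of this input construction, the following minimal NumPy sketch (ours, not the authors' released code) builds the two differential images for the temporal stream, assuming grayscale frames and clamped indexing at the sequence borders:

    import numpy as np

    def temporal_input(frames, t, interval=3):
        # frames: sequence of grayscale frames, each an (H, W) array.
        # Differential images at `interval` frames before and after frame t;
        # using the absolute difference is our assumption.
        prev = frames[max(t - interval, 0)].astype(np.float32)
        cur = frames[t].astype(np.float32)
        nxt = frames[min(t + interval, len(frames) - 1)].astype(np.float32)
        # Stack the two difference maps as a two-channel input (H, W, 2).
        return np.stack([np.abs(cur - prev), np.abs(nxt - cur)], axis=-1)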


Adaptive Adjustment of the Receptive Field Based on SKnet
Region proposal plays a key role in deep convolutional networks designed for semantic segmentation and instance segmentation. For example, Mask-RCNN provides a Region Proposal Network (RPN) to produce candidate regions. Compared with deep networks such as Mask-RCNN, pre-dividing the image into cells has a similar function to the RPN and avoids complicated calculations, but the size of the candidate regions cannot be adjusted adaptively. To match the lightweight architecture, we introduce the SK convolution block [30] to improve the ability to adaptively adjust the size of the receptive field. Compared with fixed receptive fields, variable receptive fields can reduce the missing rate for very small flame areas captured in a distant view or small fragment-like flame regions.
As shown in Figure 2, SK convolution provides a mechanism for automatic selection among multiple kernel sizes via three operators: Split, Fuse, and Select. The Split operator generates multiple paths with various kernel sizes, which correspond to different receptive field (RF) sizes of neurons. The Fuse operator combines and aggregates the information from the multiple paths to obtain a global and comprehensive representation for weight selection. The Select operator aggregates the feature maps of the differently sized kernels according to the selected weights. It is precisely because the Fuse operation generates adaptive weights for the convolution results of the different kernel sizes that the block behaves as if a convolution kernel of adaptive size convolved the given feature map to produce the final result. After the Split operation, the first step of the Fuse operation merges the results from the multiple branches through element-wise summation and then embeds global information by global average pooling to generate channel-wise statistics. Finally, the choice of weight parameters is made adaptive through a three-layer fully connected module.
The purpose of the Select operator is to use the output of the fully connected layers to select the weight parameters for the corresponding channels of the multiple branches. In the first step of the Select operation, the output of the fully connected layers is equally distributed among the branches, and the number of neurons allocated to each branch is the number of channels C of the final output feature map, so that each neuron corresponds to one channel of the feature map on one branch (which requires the number of neurons in the last layer to be C × N_b; in Figure 2, there are only two branches, N_b = 2). In the second step, the softmax operation is applied to the neurons corresponding to the same channel on different branches:

$w_{ci} = \frac{e^{V_{ci}}}{\sum_{j=1}^{N_b} e^{V_{cj}}}$ (1)

Here, w_{ci} is the weight value of the c-th channel on the i-th branch, N_b is the total number of branches, and V_{cj} is the neuron value corresponding to the c-th channel on the j-th branch. The final feature map O is obtained through the attention weights on the various kernels:

$O_c = \sum_{i=1}^{N_b} w_{ci} \, U_{ci}$ (2)

where U_{ci} is the c-th channel of the feature map after convolution on the i-th branch.
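Equations (1) and (2) amount to a per-channel softmax across branches followed by a channel-wise weighted sum of the branch feature maps. The following NumPy sketch of this step is our illustration; the array shapes and names are assumptions:

    import numpy as np

    def sk_select(U, V):
        # U: branch feature maps, shape (Nb, H, W, C).
        # V: neuron values from the last fully connected layer, shape (Nb, C).
        w = np.exp(V) / np.exp(V).sum(axis=0, keepdims=True)  # Equation (1): softmax over branches
        O = (U * w[:, None, None, :]).sum(axis=0)             # Equation (2): weighted sum, (H, W, C)
        return O

For N_b = 2 this reduces to the two-branch case of Figure 2.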
From the above, SK convolution is based on the design of multiple convolution kernels. SKNet mixes the results of the multiple kernels by generating adaptive weights for each channel of each kernel and thus obtains the final feature map O. The adaptive weights ensure the adaptive adjustment of the receptive field. Figure 3 gives an example explaining the correlation among the receptive field, the detection region, and the detection result. We divide an input image into a number of small cells, such as the grid marked as the detection region on the flame image in Figure 3, and the size of each cell is 16 × 16 pixels. The receptive field (marked in Figure 3) is used to determine whether a cell is a flame region. Generally, the size of the receptive field is larger than that of a 16 × 16 pixel block. Since these cells serve as candidate regions and are neatly arranged in a grid, we obtain the judgments for all cells at once instead of examining each cell individually.
In the convolution process, the size of the receptive field tends to increase with the network depth. We can calculate the receptive field of the output feature map of each layer by the following equation [31]:

$rf_l = rf_{l-1} + (K_l - 1) \prod_{j=1}^{l-1} s_j$ (3)

where l is the index of the convolution layer, rf_l and K_l are respectively the receptive field and the convolution kernel size of layer l, and s_j is the stride of layer j. It can be seen from Equation (3) that the deeper the layer of the network, the larger the receptive field. If the size of one cell is 16 × 16, the receptive field size of the last layer in this paper is larger than 16 × 16; that is to say, the information used to judge a flame region is extracted from the receptive field rather than limited to the cell.
We adaptively adjust the value of the variable K_l in Equation (3) with the weight values w_{ci} in Equation (1). The variable K_l is calculated as follows:

$K_l = \sum_{i=1}^{N_b} w_{ci} \, k_i$ (4)

where k_i is the size of the convolution kernel of the i-th branch in the SK convolution, and w_{ci} denotes the weight of the c-th channel on the i-th branch. Due to the self-adaption of the weights w_{ci} of the convolutional layers in the SKNet block, as shown in Equation (1), the receptive field can be adjusted adaptively.
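Both equations are easy to verify numerically; the sketch below (ours) computes Equation (3) with rf_0 = 1 and the per-channel effective kernel size of Equation (4):

    def receptive_field(kernel_sizes, strides):
        # Equation (3): rf_l = rf_(l-1) + (K_l - 1) * prod(s_1 ... s_(l-1)).
        rf, jump = 1, 1
        for K, s in zip(kernel_sizes, strides):
            rf += (K - 1) * jump  # jump = product of the strides of earlier layers
            jump *= s
        return rf

    def effective_kernel(branch_kernels, branch_weights):
        # Equation (4): K_l = sum_i w_ci * k_i for one channel c.
        return sum(w * k for w, k in zip(branch_weights, branch_kernels))

For example, receptive_field([3, 3, 3], [2, 2, 2]) returns 15, and effective_kernel([3, 5], [0.75, 0.25]) returns 3.5.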

Lighten the Network by ShuffleNet

While SK convolution improves network performance, its multi-branch structure causes the network parameters to double with the number of branches, making the network cumbersome. We can use grouped convolution to replace the convolution operation on each branch to reduce the parameters of the convolution kernel. Denoting the group size by G, both the number of parameters and the computational cost are divided by G compared with ordinary convolution. Furthermore, many lightweight models such as MobileNet [32,33] and ShuffleNet [34,35] are based on grouped convolution, so the modules of these networks can be combined with SK convolution to make the entire network lightweight.
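The factor-G saving is easy to check; this short sketch uses the groups argument of tf.keras.layers.Conv2D (available in recent TensorFlow releases; the channel counts here are arbitrary):

    import tensorflow as tf

    c, G = 64, 8
    ordinary = tf.keras.layers.Conv2D(filters=c, kernel_size=3, padding="same")
    grouped = tf.keras.layers.Conv2D(filters=c, kernel_size=3, padding="same", groups=G)
    x = tf.zeros([1, 32, 32, c])
    ordinary(x)
    grouped(x)  # calling the layers once builds their weights
    # Kernel parameters: 3*3*c*c = 36,864 versus 3*3*(c/G)*c = 4,608,
    # i.e., a factor-G reduction (the printed counts also include c biases).
    print(ordinary.count_params(), grouped.count_params())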
In this paper, we combine ShuffleNet V2 with SK convolution to construct a new convolution block. The building blocks of ShuffleNet V2 are shown in Figure 4a. In the convolution unit of ShuffleNet V2, the input feature map is first divided into two branches along the channel dimension through a channel split. The left branch is an identity mapping, and the right branch contains three consecutive convolutions. After the convolutions, the two branches are concatenated. Next, a channel shuffle operation is applied to the concatenated result to ensure the exchange of information between the two branches. The building block of ShuffleNet V2 contains three convolutional layers, two of which are 1 × 1 convolutions that do not change the size of the receptive field; the remaining 3 × 3 depthwise convolution is a special grouped convolution whose number of groups is exactly equal to the number of channels. Therefore, we only need to integrate the 3 × 3 depthwise convolution into the SK convolution to give the convolution block the ability to adaptively adjust the size of the receptive field with little parameter growth. Regarding the convolution operations producing U_1 and U_2 in Figure 2, we replace them with depthwise convolutions and then embed the new SK convolution in the position shown in Figure 4a to complete the transformation. Ignoring the parameters of the fully connected layers, the compression ratio of SK convolution to SK-Shuffle convolution follows from replacing the ordinary convolution on each branch (a k_j × k_j kernel over c channels, i.e., on the order of k_j^2 c^2 parameters) with a depthwise convolution (k_j^2 c parameters), so the ratio grows on the order of the channel number c, with 2 ≪ c. Similarly, assuming that the number of parameters of the convolution kernel on each branch of the SK convolution is equivalent to that of an ordinary 3 × 3 convolution, the compression ratio of ordinary 3 × 3 convolution to SK-Shuffle convolution grows with c in the same way. The complete network structure of TSCNNFlame is shown in Table 1. Since it does not include a fully connected layer, the network supports input of any size. The video size (in pixels) used in the experimental tests in this paper ranges from 300 × 300 to 1000 × 1000, and the adaptive adjustment interval (in pixels) of the receptive field is given in Table 1.
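For reference, a condensed TensorFlow sketch of the stride-1 ShuffleNet V2 unit of Figure 4a follows. It is our simplification: batch normalization is omitted, and in the SK-Shuffle block the marked depthwise layer would be replaced by the SK-style multi-branch depthwise selection:

    import tensorflow as tf

    def channel_shuffle(x, groups=2):
        # Interleave channels so information is exchanged between branches.
        h, w, c = x.shape[1], x.shape[2], x.shape[3]
        x = tf.reshape(x, [-1, h, w, groups, c // groups])
        x = tf.transpose(x, [0, 1, 2, 4, 3])
        return tf.reshape(x, [-1, h, w, c])

    def shufflenet_v2_unit(x, out_channels):
        # Channel split; identity left branch; 1x1 -> 3x3 depthwise -> 1x1 right branch.
        left, right = tf.split(x, num_or_size_splits=2, axis=-1)
        c = out_channels // 2
        right = tf.keras.layers.Conv2D(c, 1, activation="relu")(right)
        right = tf.keras.layers.DepthwiseConv2D(3, padding="same")(right)  # replaced by SK branches in SK-Shuffle
        right = tf.keras.layers.Conv2D(c, 1, activation="relu")(right)
        return channel_shuffle(tf.concat([left, right], axis=-1))

    y = shufflenet_v2_unit(tf.zeros([1, 32, 32, 64]), 64)  # shape (1, 32, 32, 64)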

Experiments and Analysis
In this section, we present the experimental settings and implementation details. The results obtained with our approach are compared with the state of the art.

Experimental Settings
Experimental configuration. We use TensorFlow to implement the flame region detection network model and stochastic gradient descent (SGD) to update the parameters of each layer. The initial learning rate is 0.0001. We train and test on an ordinary computer with an Intel Core i9 processor and an NVIDIA RTX 2080 Ti GPU (11 GB).
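In TensorFlow, the corresponding optimizer setup might look as follows (a sketch; the momentum value and the loss choice are our assumptions, as the paper only states the optimizer and initial learning rate):

    import tensorflow as tf

    optimizer = tf.keras.optimizers.SGD(learning_rate=1e-4, momentum=0.9)  # initial learning rate 0.0001
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()  # assumed cell-wise classification loss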
Training Datasets. Since the purpose of our algorithm is to find flame regions in image sequences, we mainly use datasets containing flame videos for training. As shown in Figure 5, we recorded flame videos from 13 different scenes and 11 flame-like videos, and we selected a total of 4000 flame video sequences and 1200 flame-like video sequences to train the network. We randomly divide the video sequences of each scene into ten groups and select one group from each scene to compose the validation set. The remaining video sequences are used as the training set. Although TSCNNFlame supports input of any size, the images in the training sets are resized to 400 × 400 in order to support multi-batch training.
The labeling process for the training sets is shown in Figure 6. First, we set the values of the flame pixels to 255 and the values of the background pixels to 0 in Figure 6a to get the binary image shown in Figure 6b. Then, we perform a convolution operation on the binary image. We divide one input image into small cells of size 16 × 16 (but not limited to 16 × 16). We take the size of the convolution kernel as 16 × 16 × 1 with a stride of 16, and all element values of the convolution kernel are 1. Thus, as the convolution kernel slides over the image, each convolution computation is equivalent to counting the number of flame pixels in each cell. Finally, a threshold is set to determine the label:

$label_{(x,y)} = \begin{cases} 1, & e_{(x,y)} \geq \frac{size^2}{2} \times 255 \\ 0, & e_{(x,y)} < \frac{size^2}{2} \times 255 \end{cases}$

where e_{(x,y)} represents the value of each element after convolution, size is the side length of the 16 × 16 cell, and label_{(x,y)} is the label value of a cell in Figure 6c. The value e_{(x,y)} is equal to (size^2/2) × 255 when flame pixels account for exactly half of a cell, so the threshold is set to (size^2/2) × 255. The above formula indicates that a cell is labeled as background if the number of flame pixels is less than half of the pixels in the cell; otherwise, the cell is labeled as a flame region.
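The labeling rule can be reproduced in a few lines; the NumPy sketch below (ours) uses a reshape-and-sum that is equivalent to the stride-16 all-ones convolution described above:

    import numpy as np

    def make_cell_labels(binary_mask, cell=16):
        # binary_mask: (H, W) array with flame pixels = 255 and background = 0;
        # H and W are assumed to be divisible by `cell`.
        h, w = binary_mask.shape
        blocks = binary_mask.astype(np.int64).reshape(h // cell, cell, w // cell, cell)
        e = blocks.sum(axis=(1, 3))               # e_(x,y): per-cell pixel sum
        threshold = (cell * cell // 2) * 255      # (size^2 / 2) * 255
        return (e >= threshold).astype(np.uint8)  # 1 = flame cell, 0 = background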

Adaptive Adjustment of the Receptive Field
SKNet is used to adjust the size of the receptive fields when a flame video contains flame regions smaller than the corresponding receptive fields of the proposed network. To evaluate this ability, we select challenging flame videos, such as a small flame area captured in a distant view (shown in Figure 7a) or some small fragment-like flame regions (shown in Figure 7b). In the experiments, we compute the size of the receptive field using Equations (3) and (4). In Figure 7, when the size of the detected flame block is about 20 × 20, the size of the receptive field is reduced from about 140 × 140 to roughly 90 × 90. As a result, the proportion of flame pixels in the receptive field increases, which decreases the missing rate. Therefore, the proposed scheme can automatically reduce the size of the receptive field to improve the ability to capture small flame areas.


Comparison of Recognition Performance
To better test the recognition performance of our proposed model, we select the public datasets widely used for evaluating flame detection methods as most of our testing sets. The public datasets are sourced from Foggia et al. [36] and the website https://github.com/steffensbola/furg-fire-dataset (accessed on 1 February 2021). We also recorded four challenging videos to evaluate the anti-interference performance of our model. We randomly extracted 100 to 200 video blocks from each video, and each video block contains seven frames that are used as the input of the temporal stream of the proposed model.
For a comparison of our results with state-of-the-art methods, we selected three related works: CNNFire, FCN, and DeepLab V3+. Both DeepLab V3+ and FCN are semantic segmentation networks with encoder-decoder structures, and the two models have been proven to have outstanding performance.

Comparison of Recognition Accuracy
We use Dataset1 to evaluate the recognition accuracy. All the videos in Dataset1 are from public datasets, as shown in Figure 8. The videos in Dataset1 all contain flame objects, and the flame areas in each video account for a relatively large proportion of the video images.
Both F-score and mean Intersection over Union (mIoU) are used to evaluate the performance of the four models. IoU is an effective metric; here, it is defined as the area of overlap between the detected flame region and the ground truth divided by the area of their union [37]. F-score is the most widely used evaluation metric and is defined as the harmonic mean of precision and recall, taking into account true positives, false positives, and false negatives in the detected regions [37]. IoU and F-score values range from 0 to 1, reaching their best values at 1.
Tables 2 and 3 show the F-score and IoU of our model and the three other models, respectively. As seen from the rightmost columns of Tables 2 and 3, TSCNNFlame is superior to the other three models in terms of both F-score and IoU. CNNFire designs special decision rules relying on thresholds, which leads to poor generalization for various flame videos such as Video3, Video4, and Video7, as shown in Figure 9. FCN generates many more false positive pixels than TSCNNFlame when a video contains a large number of interference pixels, such as Video2 and Video7 shown in Figure 9. DeepLab V3+ still suffers from false detections and missing detections when a flame video such as Video4 contains a blurry flame contour or multiple non-flame instances, which leads to the low F-score of Video7 in Table 2 and the low IoU of Video2 in Table 3. TSCNNFlame adopts a dual-stream structure to supplement spatial features with frame difference features, so compared with the other methods, it is more robust.
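Concretely, both metrics can be computed per frame from the predicted and ground-truth masks as follows (a minimal sketch assuming boolean NumPy masks and non-degenerate inputs):

    import numpy as np

    def iou_and_fscore(pred, gt):
        # pred, gt: boolean masks of the same shape (True = flame pixel).
        tp = np.logical_and(pred, gt).sum()    # overlap: true positives
        fp = np.logical_and(pred, ~gt).sum()   # false positives
        fn = np.logical_and(~pred, gt).sum()   # false negatives
        iou = tp / (tp + fp + fn)              # overlap / union
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        fscore = 2 * precision * recall / (precision + recall)
        return iou, fscore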

Figure 9. Output of each method on Dataset1; from top row to bottom row, Video1 to Video8, respectively. (a) Input frame; (b) Ground truth; (c) TSCNNFlame; (d) CNNFire [12]; (e) FCN [23]; (f) DeepLab V3+ [28].
As shown in the third row (Video3) and the fifth row (Video5) in Figure 9, the edges of the flame regions in these two videos look clear and have many complex curvatures due to the simple background. For these two videos, TSCNNFlame has lower F-score and IoU values than the other three methods because using pixel blocks as the detection unit cannot accurately extract the edges of the flame region, although we use Gaussian filtering to smooth the flame contours. However, the backgrounds of the two videos are relatively simple and do not contain flame-like objects. Combining Figure 9 with Tables 2 and 3, it can be seen that all the methods including TSCNNFlame can detect flame regions effectively. It is worth mentioning that the contour of the outer flame is often blurred by the influence of a complex background, and thus, the detection error caused by block-wise detection can generally be ignored.

Comparison of Anti-Interference
In this section, we evaluate the anti-interference performance of our model using Dataset2, which is composed of eight challenging videos: four flame videos (shown in the top row of Figure 10) and four flame-like normal videos (shown in the bottom row of Figure 10). In Dataset2, there are many more interference pixels in the flame images than in Dataset1 in Section 4.3.2. The two columns on the left in Figure 10 show four public videos from the website given in the first paragraph of Section 4.3, and the remaining four videos were recorded by us. As seen from Figure 11, compared with the other three methods, our method utilizes the temporal features to reduce the false detection rate of positive pixels because static flame-like pixels are correctly detected as background pixels. Moreover, the SKNet block makes the receptive field change with the proportion of flame pixels in it, which helps increase the accuracy of positive pixel detection in the small flame regions shown in the top row of Figure 11.
The Receiver Operating Characteristic (ROC) space [2] shows the relation between the false-positive rate and the true-positive rate. We compute the number of overlapping flame pixels in the detection maps and ground truth images and use that as the true positives. Similarly, we determine the number of non-overlapping flame pixels in the detection maps and take that as the false positives. Figure 12 shows the ROC space on Dataset2 for the four methods. As seen from Figure 12, our approach maintains a better balance between the true positive rate and the false positive rate than the other three methods.
The temporal stream in TSCNNFlame provides a certain degree of interference suppression. In addition, TSCNNFlame uses 16 × 16 pixel blocks as the detection unit, and its receptive field size is smaller than those of the other methods. Therefore, in the convolution process, the feature difference between adjacent pixel blocks is relatively large, and the network is not prone to continuous misdetection across batches of blocks.
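Following this pixel-level convention, one point in ROC space per method can be obtained as below (our sketch, again assuming boolean masks):

    import numpy as np

    def roc_point(pred, gt):
        # Returns (true-positive rate, false-positive rate) for one detection map.
        tp = np.logical_and(pred, gt).sum()
        fp = np.logical_and(pred, ~gt).sum()
        fn = np.logical_and(~pred, gt).sum()
        tn = np.logical_and(~pred, ~gt).sum()
        return tp / (tp + fn), fp / (fp + tn)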

Running Time
TSCNNFlame adopts the idea of directly classifying candidate regions. As shown in Figure 13, compared with encoder-decoder structures such as FCN and DeepLab V3+, TSCNNFlame removes the decoder-based complex structure, reduces the number of network layers, and adds the frame difference input stream. The architecture of the temporal stream in TSCNNFlame is the same as that of the spatial stream, and thus, parallel calculation in the processor makes the running time of the network much less than that of serial encoder-decoder networks.

Figure 13. Comparison of the two-stream structure and the encoder-decoder structure.
We use Fps (frames per second) to evaluate the detection speed of the four compared methods. The testing data used for all four methods are the same video sequences from Dataset1 and Dataset2. As shown in Table 4, the detection speed of TSCNNFlame is much faster than those of FCN and DeepLab V3+, and it is nearly comparable with CNNFire, which has a single-stream structure and no upsampling calculation. However, as shown in Tables 2 and 3 and Figures 9 and 11, TSCNNFlame has higher detection accuracy than CNNFire.
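The Fps figures can be reproduced with a rough timing loop of the following form (our sketch; `detect` stands for one forward pass of whichever model is measured, and warm-up runs are ignored):

    import time

    def measure_fps(detect, frames):
        start = time.perf_counter()
        for frame in frames:
            detect(frame)  # one forward pass per frame (or frame block)
        return len(frames) / (time.perf_counter() - start)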


Conclusions and Limitations
In this paper, we proposed a novel CNN architecture for flame region detection. We take two main measures to achieve a good compromise between accuracy and efficiency. One measure is the lightweight network with the fusion of temporal-spatial features. The other is the SK-Shuffle block, which adaptively adjusts receptive fields while reducing the network parameters; this is of great significance for very small flame areas captured in a distant view or small fragment-like flame regions. The proposed method achieves comparable or superior performance with respect to state-of-the-art methods while being less time-consuming. In the experimental results, it achieves an average F-score of 82.95%, an average IoU of 82.54%, and a detection speed of 35 Fps. In particular, our method is outstanding for videos containing a large number of flame-like interference objects.
Although the proposed method achieves a pleasing trade-off between flame region segmentation accuracy and efficiency, errors are sometimes inevitable when the edge of a flame region has a changeable and complex curvature, due to the block-wise detection scheme. Future studies may segment a frame into small superpixel blocks instead of fixed-size blocks. The pixels in one superpixel block have similar characteristics, which may help to improve the accuracy.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.