A Semantic Segmentation Method for Early Forest Fire Smoke Based on Concentration Weighting

Abstract: Forest fire smoke detection based on deep learning has been widely studied. Labeling the smoke images is a necessity when building datasets for target detection and semantic segmentation. The uncertainty in labeling forest fire smoke pixels, caused by the non-uniform diffusion of smoke particles, will affect the recognition accuracy of the deep learning model. To overcome this labeling ambiguity, a weighting idea is proposed in this paper for the first time. First, the pixel-concentration relationship between the gray value and the concentration of forest fire smoke pixels in the image was established. Second, the loss function of the semantic segmentation method based on concentration weighting was built and improved; thus, the network could pay attention to smoke pixels differently, weighting the loss calculation of smoke pixels to better segment smoke. Finally, based on the established forest fire smoke dataset, the optimum weighting factors were selected through experiments. The mIoU of the weighted method increased by 1.52% compared with the unweighted method. The weighted method can not only be applied to the semantic segmentation and target detection of forest fire smoke, but is also of certain significance to other dispersive target recognition.


Introduction
The security risks and destruction of ecological balance caused by forest fires have increased dramatically in recent years in terms of both frequency and scale [1][2][3][4]. Forest fire monitoring and detection are of great significance for reducing these hazards. However, it is very laborious to rely solely on manual monitoring and detection of forest fires. The development of science and technology has made it possible to monitor and detect forest fires automatically [5][6][7].
Many researchers have been working on automatic smoke detection to reduce damage, since smoke can provide earlier clues for forest fire alarms than flames [8][9][10][11]. Many forest fire detection methods based on smoke recognition have been proposed in the past decade, among which image-based methods are the most widely used [12][13][14][15][16][17][18][19]. Strictly speaking, image-based smoke detection for forest fires can be divided into three categories. The first is to judge only whether there is forest fire smoke in an image or not, known as whole-image forest fire smoke recognition. The second is not only to recognize whether there is forest fire smoke, but also to indicate the locations of forest fire smoke with bounding boxes [20]; this category is called forest fire smoke detection. The third is to densely classify each pixel in an image, known as forest fire smoke segmentation.
Forest fire smoke segmentation is a far more difficult task than forest fire smoke recognition and detection. It requires accurate separation of forest fire smoke from background scenes in an image at the pixel level. Forest fire smoke segmentation outputs a mask with detailed edges, involving object classification, localization and boundary delineation. Traditional forest fire smoke segmentation methods mainly use handcrafted features, such as smoke color, texture and motion [21][22][23][24][25][26][27][28][29][30][31]. Nevertheless, it is difficult to define, design or choose useful features due to large variations in forest fire smoke appearance, resulting in quite poor segmentation performance. Furthermore, some forest fire smoke segmentation methods extract dynamic features from videos [32]; however, they are extremely unstable in bad weather. Therefore, forest fire smoke segmentation from static images plays a very important role in the visual monitoring and detection of forest fire smoke.
In recent years, many methods based on convolutional neural networks (CNNs) have attracted attention due to their outstanding performance in image segmentation [33]. A CNN-based semantic segmentation network, taking an image of arbitrary size as input, utilizes a set of convolutional layers, non-linear activation functions, pooling and upsampling layers to output a predicted image [34][35][36][37][38]. Moreover, CNNs have achieved many significant results in the field of visual detection of forest fire smoke [39,40].
For a forest fire smoke segmentation method based on CNN, it is necessary to manually label each pixel as forest fire smoke or background in all training images. However, the fuzziness, translucency and diversified concentration of forest fire smoke make it extremely difficult to label forest fire smoke accurately, resulting in subjectivity and ambiguity in the labels; thus, annotating such a training dataset has become a bottleneck in applying these models to forest fire detection.
The labeling problem is widespread in other recognition tasks based on deep learning and has been studied by many researchers [41][42][43][44][45]. However, in the field of forest fires, the labeling problem has not been studied. This paper focuses on how to reduce the impact of the uncertainty for labeling forest fire smoke on smoke segmentation.
In order to improve the accuracy of semantic segmentation of forest fire smoke images and eliminate the impact of labeling ambiguity on the recognition results, a semantic segmentation method based on concentration weighting was first proposed in this paper. By introducing a weighted factor as a measure of the labeling uncertainty, this method can avoid treating all labeled pixels equally so as to improve the accuracy of the model. The weighted method was tested and evaluated on the forest fire smoke dataset.

Materials and Methods
For the semantic segmentation of forest fire smoke, the influence of smoke concentration was considered and the idea of weighting was introduced in this paper. By establishing the pixel-concentration relationship of forest fire smoke in the image, the influence of the labeling ambiguity caused by non-uniform smoke diffusion is alleviated and the recognition accuracy of forest fire smoke is improved. The framework of the method is shown in Figure 1. Figure 1. The main framework of the weighted method proposed in this paper: the network is a pre-existing deep learning semantic segmentation network, with MobileNet [46] as the encoder and PSPNet [47] as the decoder. The network was initialized with the weights of a MobileNet pre-trained on ImageNet. The final optimization loss includes the weighted loss and the cross-entropy loss.

Forest Fire Smoke Labeling Based on Weight
The input of the semantic segmentation network consisted of original images and the corresponding ground truth (GT) images. The pixel value of a forest fire smoke pixel in the GT images was labeled as 1 [48,49] and that of a non-smoke pixel was labeled as 0, as shown in Figure 2a,b. The concentration of forest fire smoke varies from pixel to pixel because of the non-uniform diffusion of smoke particles. Due to the influence of environmental factors, the concentration of smoke particles gradually decreases during diffusion, which blurs the edges of the smoke in the image or mixes the smoke with background such as cloud and fog, causing uncertainty in the labeling. It is impossible to reflect this kind of uncertainty by simply labeling pixels as 1 or 0 without distinction, and such labeling inaccuracy will cause the trained network model to misidentify smoke.
The idea of weighting in this paper is to integrate the weight into the original method so that the network understands that forest fire smoke differs in concentration. A weighted factor is introduced as a measure of the uncertainty of the labeled pixels, avoiding treating all labeled pixels equally and identifying forest fire smoke more accurately.
The forest fire smoke concentration has a direct correlation with the smoke pixel value in the forest fire smoke image. Differences in smoke concentration within the same image are represented by differences in smoke pixel value. For white smoke, the higher the smoke concentration, the higher the smoke pixel value in the image, while for black smoke the opposite holds.
Therefore, establishing the relationship between the pixel value and the concentration distribution of the forest fire smoke pixel in the image is necessary for the introduction of weight. A normalization method to establish the pixel-concentration relationship was adopted in this paper as shown in Equations (1) and (2).

where G(x, y) is the pixel value of the smoke area, shown as the white area in Figure 2b; min(G(x, y)) is the minimum pixel value of the smoke area; G_r(x, y) is the relative pixel value of the smoke area; R_gc is the basic pixel-concentration coefficient; and max(G_r(x, y)) is the maximum relative pixel value of the smoke region.
In order to discriminate between smoke, cloud, and fog, the background information of the smoke should be included in the pixel-concentration relationship. Therefore, the contrast coefficient k was introduced, as shown in Equation (3). The greater the gap between the average pixel value of the forest fire smoke area and the average pixel value of the entire image, the larger the contrast coefficient, so that it is much easier to identify the smoke area.
where G is the pixel mean of the whole image, G_p is the pixel mean of the smoke area, G_ns is the pixel mean of the non-smoke area, and k is the contrast coefficient. The value range of k is [0, 1], and it reflects the relative distance between the pixel mean of the smoke area and the pixel mean of the whole image. Finally, the weighted coefficient reflecting the pixel-concentration relationship is given by Equation (4), where λ is the concentration weight. The value range of λ is [1 − k, 1]. Equation (4) raises the lower limit of the pixel-concentration relationship, which can enhance the confidence of the model for smoke. The weighted image is shown in Figure 2c.
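As a concrete illustration, the weighting steps above can be sketched in Python. Since Equations (1)-(4) themselves are not reproduced in this text, the form of the contrast coefficient k and the combination λ = k·R_gc + (1 − k) below are assumptions chosen to match the stated value ranges ([0, 1] for k and [1 − k, 1] for λ), not necessarily the authors' exact formulas.

```python
import numpy as np

def concentration_weights(gray, mask):
    """Sketch of the pixel-concentration weighting described above.

    gray: 2-D grayscale image; mask: boolean GT smoke mask.
    Assumed forms: k = |smoke mean - image mean| / 255 (Eq. 3) and
    lambda = k * R_gc + (1 - k) (Eq. 4), matching the range [1 - k, 1].
    """
    g = gray.astype(float)
    smoke = g[mask]
    # Eq. (1): relative pixel value of the smoke area
    g_rel = smoke - smoke.min()
    # Eq. (2): basic pixel-concentration coefficient, normalized to [0, 1]
    r_gc = g_rel / g_rel.max() if g_rel.max() > 0 else np.zeros_like(g_rel)
    # Eq. (3) (assumed form): contrast between smoke-area mean and image
    # mean, scaled to [0, 1] by the full gray-value range
    k = abs(smoke.mean() - g.mean()) / 255.0
    # Eq. (4) (assumed form): concentration weight, lying in [1 - k, 1]
    lam = k * r_gc + (1.0 - k)
    weights = np.zeros_like(g)
    weights[mask] = lam
    return weights
```

The weight map equals zero on background pixels, 1 on the densest smoke pixel, and 1 − k on the thinnest, so thin smoke still contributes to the loss but with reduced influence.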

Improvement for the Loss Function
In the training process of the semantic segmentation network, the contribution of each pixel in the smoke area to the loss value should be evaluated according to the weighted image. The improved loss function is given in Equation (5), where L is the overall loss function, L_CE is the cross-entropy loss function [50,51], L_W is the weighted loss function, which calculates the loss between the predicted value and the weighted value, and α is the control coefficient. The proportion of the weighted part in the overall loss function is determined by α; the type of L_W and the value of α were determined by experiments. When the weighted loss was not considered, the network was trained with the cross-entropy loss function alone, as shown in Equation (6); when the weighted loss was considered, the loss function was Equation (5).
where y_i ∈ {0, 1} is the category label of a real image and ŷ_i ∈ [0, 1] is the predicted probability that the corresponding category label is 1.
Since the weight is a discrete value distributed in a certain interval, calculating the weighted loss is a regression problem. Common regression loss functions include the Mean Absolute Error loss (L_MAE) [52], the Mean Squared Error loss (L_MSE) [53] and the Cosine Proximity loss (L_CP) [54]. On this basis, the corresponding weighted loss functions were constructed, as shown in Equations (7)-(9).
where N is the number of samples and ỹ_i ∈ [0, 1] is the weight when the corresponding category label is 1. L_MAE and L_MSE are also known as the L1 loss and the L2 loss, respectively. The original L_CP is the negative of the cosine similarity between the predicted value and the weight. Because the minimum of the original L_CP is −1, it is not suitable for combination with other loss functions; in this paper, 1 was added to L_CP so that its minimum is 0. The optimal type of weighted loss function L_W was determined by experimental analysis.
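The loss terms above can be sketched as follows. The combination L = L_CE + α·L_W is an assumed reading of Equation (5), and the shifted cosine loss implements the "+1" adjustment described for L_CP.

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-7):
    # Eq. (6): binary cross-entropy averaged over N pixels
    return -sum(y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
                for y, p in zip(y_true, y_pred)) / len(y_true)

def l_mae(w, y_pred):
    # Eq. (7): L1 loss between the prediction and the weight
    return sum(abs(a - b) for a, b in zip(w, y_pred)) / len(w)

def l_mse(w, y_pred):
    # Eq. (8): L2 loss between the prediction and the weight
    return sum((a - b) ** 2 for a, b in zip(w, y_pred)) / len(w)

def l_cp(w, y_pred):
    # Eq. (9): cosine proximity loss shifted by +1 so its minimum is 0
    dot = sum(a * b for a, b in zip(w, y_pred))
    norm_w = math.sqrt(sum(a * a for a in w))
    norm_p = math.sqrt(sum(b * b for b in y_pred))
    return 1.0 - dot / (norm_w * norm_p)

def total_loss(y_true, y_pred, w, alpha=0.1, l_w=l_mae):
    # Eq. (5) (assumed combination): L = L_CE + alpha * L_W
    return cross_entropy(y_true, y_pred) + alpha * l_w(w, y_pred)
```

A perfect prediction drives every term to zero; increasing α shifts the optimization toward matching the concentration weights rather than the hard 0/1 labels.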

Experimental Platform
The experimental environment was Ubuntu 16.04 and the deep learning framework was Keras. The hardware configuration included an E5-2620 CPU and a GeForce GTX 1080 Ti GPU. The encoder of the semantic segmentation network was MobileNet [46] and the decoder was PSPNet [47]. The batch size was 4 during training. The initial learning rate was set to 0.0001 and the optimization method was Adam. The learning rate was dynamically adjusted according to the loss value of the validation set: once the loss value of the validation set stopped decreasing, the learning rate decayed by a factor of 0.9. The network input consisted of RGB images, GT images and weighted images, all resized to 576 × 576.
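The plateau-based learning-rate schedule can be sketched as a minimal stand-alone class (in Keras this behavior is typically obtained with the ReduceLROnPlateau callback; the patience value below is an assumption, as the paper does not state one):

```python
class PlateauDecay:
    """Sketch of the schedule above: when the validation loss stops
    decreasing, multiply the learning rate by 0.9. The patience of one
    epoch is an assumed default."""

    def __init__(self, lr=1e-4, factor=0.9, patience=1):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best = float("inf")  # best validation loss seen so far
        self.wait = 0             # epochs without improvement

    def step(self, val_loss):
        # Call once per epoch with the validation loss; returns the lr.
        if val_loss < self.best:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr *= self.factor
                self.wait = 0
        return self.lr
```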

Forest Fire Smoke Dataset
Considering the environment, shooting angle, shooting distance and interference such as the coexistence of clouds and smoke, the forest fire smoke dataset was composed of 176 forest fire smoke images collected from the literature and websites. According to the smoke concentration and cloud interference, the dataset was divided into four categories: thick smoke (TKS), thin smoke (TNS), thick smoke and clouds (TKSC), and thin smoke and clouds (TNSC), as shown in Figure 3. The distribution across categories, shown in Table 1, is essentially the same as that in the literature and websites, since the images were collected randomly without regard to smoke concentration. 10% of the images in the dataset were randomly selected as the test set, and the remaining images were randomly divided into the training set and the validation set at a ratio of 1:1.
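The split described above can be sketched as follows; the random seed and the rounding of the 10% test fraction are assumptions for illustration:

```python
import random

def split_dataset(images, test_frac=0.1, seed=0):
    """Sketch of the split above: 10% held out for testing, the
    remainder divided 1:1 into training and validation sets."""
    rng = random.Random(seed)        # fixed seed for reproducibility
    imgs = list(images)
    rng.shuffle(imgs)
    n_test = round(len(imgs) * test_frac)
    test = imgs[:n_test]
    rest = imgs[n_test:]
    half = len(rest) // 2
    return rest[:half], rest[half:], test  # train, validation, test
```

With 176 images this yields 18 test images and 79 images each for training and validation.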

Evaluation Index
To better verify the accuracy of the semantic segmentation network for forest fire smoke recognition, the mean intersection over union (mIoU) was used to evaluate the model performance in this paper. The larger the mIoU, the better the recognition performance.
mIoU is a standard indicator of semantic segmentation tasks [55]. In the semantic segmentation field, IoU is essentially a method to quantify the overlap percentage between the target mask and the prediction mask. Specifically, it refers to the ratio of the number of pixels in the common area of the target mask and the prediction mask to the total number of pixels between them. mIoU is the average of IoUs for each category, as shown in Equation (10).
mIoU = (1 / (k + 1)) · Σ_i [ p_ii / (Σ_j p_ij + Σ_j p_ji − p_ii) ], with i and j running from 0 to k, (10)
where, since each pixel in the image has a category label, the total number of categories is assumed to be k + 1, including k object categories and 1 background category; p_ij represents the number of pixels of category i predicted to be category j. In this paper, k is 1.
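Equation (10) can be computed directly from the p_ij counts, as in this sketch:

```python
def miou(gt, pred, num_classes=2):
    """Mean IoU per Equation (10): per-class IoU from the pixel counts
    p_ij, averaged over the k + 1 classes (k = 1: smoke + background)."""
    # p[i][j]: number of pixels of class i predicted as class j
    p = [[0] * num_classes for _ in range(num_classes)]
    for g, q in zip(gt, pred):
        p[g][q] += 1
    ious = []
    for i in range(num_classes):
        # union = row sum + column sum - diagonal (counted twice)
        union = sum(p[i]) + sum(row[i] for row in p) - p[i][i]
        if union > 0:
            ious.append(p[i][i] / union)
    return sum(ious) / len(ious)
```

For example, with ground truth [1, 1, 0, 0] and prediction [1, 0, 0, 0], the background IoU is 2/3, the smoke IoU is 1/2, and the mIoU is 7/12.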

The Segmentation Results for the Weighted Method
Three main factors affect the weighted method: the pixel-concentration relationship of forest fire smoke, the type of weighted loss function L_W and the control coefficient α. There are two candidate pixel-concentration relationships, namely R_gc and λ. The former is simply the normalized relative pixel value of forest fire smoke, which is approximately regarded as the concentration distribution of forest fire smoke, while the latter multiplies the former by the contrast coefficient k and then raises the lower limit of the pixel-concentration relationship.
When the pixel-concentration distribution of forest fire smoke is R_gc or λ, the experimental segmentation results with different types of weighted loss function L_W and control coefficient α are shown in Figure 4.


The results of mIoU for the three types of weighted loss function L_W are shown in Table 2. All the experiments in Table 2 were repeated 20 times and the average value was taken as the final result. When the pixel-concentration relationship is λ, the optimal weighted loss function is L_MAE and the optimal control coefficient is 0.1; the mIoU of this weighted method is 75.49%, the highest among the weighted methods. Table 2. Segmentation results for the pixel-concentration relationship, the type of weighted loss function and the control coefficient.
Figure 5 shows the corresponding segmentation images of these methods and provides a visual comparison between the segmentation images and the corresponding GT images. The results obtained by the method proposed in this paper are more similar to the GT images than those of the method without weighting. Figure 5d,h are the results of the weighted method.
In order to evaluate the statistical significance of the classification differences, 10-fold cross-validation was performed so that the average (Ave) and standard deviation (SD) of the experimental results could be calculated over 10 runs for the different approaches. As shown in Figure 6, the dataset was randomly divided into ten equal parts, one of which was used as the test set, while all images in the remaining nine parts were randomly allocated to the training set and the validation set at a ratio of 1:1. Each test set is different, ensuring that each image in the dataset is used in a test set exactly once. Table 4. The segmentation performance of the proposed algorithm in the four categories.
First, 10-fold cross-validation was used to evaluate the performance of the model with (w.) and without (w/o) weights.
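The 10-fold partition described above can be sketched as follows; the shuffling seed is an assumption:

```python
import random

def ten_fold_indices(n, seed=0):
    """Sketch of the 10-fold protocol above: shuffle the image indices,
    then split them into ten near-equal folds; each fold serves as the
    test set exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # Striding by 10 gives folds whose sizes differ by at most one.
    return [idx[i::10] for i in range(10)]
```

With 176 images, six folds contain 18 images and four contain 17, and every image appears in exactly one test fold.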
The experimental results in Table 3 showed that the average mIoU of the weighted method was 1.52% higher than that of the method without weighting, which verified the effectiveness of the weighted method. At the same time, the weighted method has a lower standard deviation, indicating that it makes the network more stable.

Then the performance of the proposed algorithm was investigated by considering the segmentation performance in the four categories (thick smoke, thin smoke, thick smoke and clouds, and thin smoke and clouds) separately, as shown in Table 4. For TKS, the weighted method performs well compared with the unweighted method. For TNS and TKSC, the mean mIoU of the weighted method is higher than that of the unweighted method, but its standard deviation is larger; the reason may be that the amount of data for TNS and TKSC is too small, which leads to larger fluctuations in the model. The result for TNSC further verified that the weighted method can fail when the data for a category are severely insufficient.

Discussion
In order to evaluate the effectiveness of the weighted method, comparative experiments with and without the weight were conducted on several common semantic segmentation networks, namely FCN [56], SegNet [57] and UNet [58], as well as a forest fire smoke detection method, Frizzi [39]. The control coefficient and the type of weighted loss function for all the tested network architectures were determined by experiments similar to those in Section 3.4, as shown in Table 5. The comparative results for the above segmentation methods with and without weighting are shown in Table 6: for each semantic segmentation network, the mIoU of the weighted method is improved over that of the unweighted method to some degree. The experimental results also showed that the optimal type of weighted loss function and the optimal control coefficient may differ between segmentation methods. For a specific dataset, the three parameters of the weighted method, i.e., the pixel-concentration relationship, the type of loss function and the control coefficient, should be determined by experiments.
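The parameter selection described above amounts to a grid search over the three factors. A hypothetical sketch follows; the candidate α values and the `evaluate` function (assumed to train the network with a given setting and return the validation mIoU) are illustrative assumptions, not details from the paper:

```python
import itertools

def select_weighted_params(evaluate):
    """Hypothetical grid search over the three weighted-method factors:
    pixel-concentration relationship, weighted loss type and control
    coefficient alpha. `evaluate(rel, loss, alpha)` is assumed to return
    the validation mIoU for that configuration."""
    relationships = ["R_gc", "lambda"]
    loss_types = ["L_MAE", "L_MSE", "L_CP"]
    alphas = [0.1, 0.3, 0.5, 0.7, 0.9]   # assumed candidate values
    return max(itertools.product(relationships, loss_types, alphas),
               key=lambda cfg: evaluate(*cfg))
```

Since each configuration requires a full training run, averaging repeated runs per configuration (as done 20 times in Table 2) reduces the variance of the selection.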
The above experiments show that the amount of data is an important factor affecting the weighted method. If the amount of data is too small, the performance of the network fluctuates greatly, as evidenced by the larger standard deviation of the weighted method under 10-fold cross-validation when the data are insufficient. The amount of data will be further expanded in future research. In addition, as a multi-objective optimization problem, the selection of the specific parameters of the weighted method will be a focus of further research.

Conclusions
The semantic segmentation method based on concentration weighting was proposed for the first time in this paper. After building a semantic segmentation dataset of forest fire smoke, the pixel-concentration relationship between the gray value and the concentration of forest fire smoke pixels in the image was established. Then the loss function of the semantic segmentation method based on concentration weighting was built and improved. Finally, the optimum weighting factors were selected through experiments. In the segmentation experiments based on the weighted method, mIoU increased by 1.52% compared with the unweighted method. It can be concluded that the weighted method has better segmentation and recognition performance than the unweighted method and can reduce the influence of labeling ambiguity on segmentation results to a certain extent. The weighted method can not only be applied to the semantic segmentation and target detection of forest fire smoke, but is also of certain significance to other dispersive target recognition.
Funding: This research was funded by the National Natural Science Foundation of China, grant number 31971668.

Conflicts of Interest:
The authors declare no conflict of interest.