Comparative Research on Forest Fire Image Segmentation Algorithms Based on Fully Convolutional Neural Networks

Abstract: In recent years, frequent forest fires have plagued countries all over the world, causing serious economic damage and human casualties. Faster and more accurate detection of forest fires and timely interventions have become a research priority. With the advancement of deep learning, fully convolutional network architectures have achieved excellent results in the field of image segmentation. More researchers adopt these models to segment flames for fire monitoring, but most of this work is aimed at fires in buildings and industrial scenarios. There are few studies on the application of various fully convolutional models to forest fire scenarios, and comparative experiments are inadequate. In view of the above problems, on the basis of constructing a dataset from remote-sensing images of forest fires captured by unmanned aerial vehicles (UAVs) and optimizing the data enhancement process, four classical semantic segmentation models and two backbone networks are selected for modeling and testing analysis. By comparing the inference results and the evaluation indicators of the models, such as mPA and mIoU, we identify the models that are more suitable for forest fire segmentation scenarios. The results show that the U-Net model with Resnet50 as the backbone network has the highest forest fire segmentation accuracy and the best comprehensive performance, and is more suitable for scenarios with high accuracy requirements; the DeepLabV3+ model with Resnet50 is slightly less accurate than U-Net, but still ensures satisfying segmentation performance at a faster running speed, which suits scenarios with high real-time requirements. In contrast, FCN and PSPNet have poorer segmentation performance and, hence, are not suitable for forest fire detection scenarios.


Introduction
The major causes of socio-economic losses and human casualties include traffic accidents [1,2], forest fires [3,4], and natural and geological disasters [5,6]. Among them, the ecological damage caused by forest fires is huge, irreversible and long-lasting compared to the economic damage caused to human society, even endangering the fragile global ecosystem. Forest fires occur suddenly and are wide-ranging; however, due to the lack of observation methods, forest fires are always detected after they have spread over a large scale, making them arduous to control and extinguish. Therefore, since the end of the last century, many scholars have carried out research on forest fire image recognition and monitoring technology.
Traditional forest fire detection methods include man-powered watchtowers, patrol aircraft, and remote-sensing satellites for woodland monitoring [7,8]. Among them, although satellites have a wide range of applications, their detection periods are long and they lack flexibility. In general, these methods not only require a large amount of resource input, but are also severely limited by weather and environmental conditions. Some related work has carried out comparative analysis of fire segmentation in industrial environments with different backbone networks, experimenting in specific scenarios; however, such a study, while comparing different backbones, still lacks a comparison with other models and is only applicable to industrial scenarios. At present, due to the shortage of remote-sensing images for forest fire segmentation, there is little related research on forest fire scenarios, and comparative experiments on different semantic segmentation models applied in forest fire scenarios are inadequate; hence, the comprehensive performance of these models in forest fire scenarios cannot be evaluated from multiple perspectives.
In view of the above problems, in this study, we collected a large number of remote-sensing images of forest fires taken by UAVs, manually annotated some of the data, constructed a flame segmentation dataset, and established a data enhancement process applicable to forest fire images to improve the data quality. Four classical semantic segmentation models and two backbone networks were selected for theoretical research, experimental design and comparative analysis. From different perspectives, we tested the performance of each model on forest fire segmentation tasks, explored the most suitable semantic segmentation model for forest fire scenarios, and provided forestry departments and authorities with more effective application means, data and decision support.

Materials and Methods
At present, the systems adopted by most countries for forest fire identification and segmentation include five stages: image capturing, data processing, model training and testing analysis, cloud computing, and alerts and intervention. This work focuses on the optimization and comparative research of the data-processing and model-analysis stages. The overall process of this experiment is shown in Figure 1, and the specific contents of each stage are as follows:
• Stage 1: Image capturing. Usually, when a forest fire occurs, UAVs, forest surveillance and cruise helicopters capture and collect images of the fire and upload them after collation, thus obtaining a real-time forest fire scene.
• Stage 2: Data augmentation. In this paper, the traditional data-processing method is optimized, and a set of data enhancement processes for forest fire images is proposed to improve the data quality and the accuracy of the model. In this stage, the images collected by UAVs are first imported. During augmentation, the images are flipped and rotated, then cropped and resized; subsequently, the images undergo color jittering in batches. To improve the generalization capacity of the model, random noise is added to the images, and the interference images are then fused with the dataset before it is divided.
• Stage 3: Model training and testing analysis. In this part, four widely used fully convolutional network models are selected and trained with two backbone networks each. Meanwhile, we adopt four different indicators to evaluate the models' segmentation performance, record the models' segmentation results on forest fire and interference images, and compare and analyze them from various perspectives.
• Stage 4: Cloud computing. In an actual forest-fire-monitoring application, the main computing tasks, such as model training and real-time segmentation of forest fire images, can be performed in a cloud computing center after 5G high-speed data transmission, which greatly reduces the computing load on single-point servers and adapts to regions with insufficient server resources.
• Stage 5: Alerts and intervention. Once the cloud server completes the segmentation task, it can detect whether a forest fire has occurred in the currently captured area. If a fire area is segmented, an alarm is immediately raised to notify the relevant forestry and fire departments so that they can intervene in a timely manner and take measures at the fire site.
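The augmentation steps of Stage 2 (flip, rotate, crop and resize, color jitter, additive noise) can be sketched as follows. This is a minimal NumPy-only illustration; the crop ratio, jitter range and noise level used here are assumptions for demonstration, not the paper's actual settings.

```python
import numpy as np

def augment(img, rng):
    """One pass of the augmentation steps described above (illustrative parameters).

    img: H x W x 3 float array in [0, 1].
    """
    # 1. Random horizontal flip and 90-degree rotation
    if rng.random() < 0.5:
        img = np.flip(img, axis=1)
    img = np.rot90(img, k=rng.integers(0, 4))

    # 2. Random crop to 90% of the (possibly rotated) size, then resize back
    h, w = img.shape[:2]
    ch, cw = int(h * 0.9), int(w * 0.9)
    top, left = rng.integers(0, h - ch + 1), rng.integers(0, w - cw + 1)
    crop = img[top:top + ch, left:left + cw]
    # Nearest-neighbour resize back to the original resolution
    rows = (np.arange(h) * ch / h).astype(int)
    cols = (np.arange(w) * cw / w).astype(int)
    img = crop[rows][:, cols]

    # 3. Colour jitter: scale each channel by a random factor in [0.8, 1.2]
    img = img * rng.uniform(0.8, 1.2, size=3)

    # 4. Additive Gaussian noise to improve generalization
    img = img + rng.normal(0.0, 0.02, size=img.shape)
    return np.clip(img, 0.0, 1.0)

rng = np.random.default_rng(0)
out = augment(np.ones((64, 64, 3)) * 0.5, rng)
print(out.shape)  # (64, 64, 3)
```

In practice, a library such as torchvision or Albumentations would supply these transforms; the sketch only shows the order of operations described in the text.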
On the basis of improving the data augmentation module to obtain high-quality data, this work focuses on the comparative analysis of common semantic segmentation algorithms, and explores models that are more suitable for forest fire scenarios.

Construction of Forest Fire Dataset
In this study, 4200 remote-sensing images of forest fires were collected as the basis of the flame segmentation dataset. A large proportion of the images came from Northern Arizona University's forest fire dataset FLAME [25], captured by UAVs. However, most of these forest fire images were taken in sunny or clear weather, under relatively simple environmental conditions. To improve the robustness of the system under various conditions, images taken under four common environmental conditions (foggy, stormy, blizzard, and smoky) were selected and added to the dataset. These data came from the Internet and from forest fire videos, captured by forest surveillance and firefighting helicopters, that were sampled frame-by-frame. In addition, to improve the anti-interference ability of the models, 200 interference images similar to forest fires and their background colors were also added to the dataset. As shown in Table 1, 95.23% of the 4200-image dataset contains forest fires and 4.76% contains similar interferences; the respective proportions of images taken in the various conditions are also shown. Figure 2 shows some of the forest fire images, with a resolution of 3840 × 2160; Figure 3 shows some of the interference images, with a resolution of 800 × 450.



Forest Fire Segmentation Models
In this study, four fully convolutional network models, FCN, U-Net, PSPNet and DeepLabV3+, which are widely used in image semantic segmentation tasks, were selected for comparative analysis. Among them, FCN (fully convolutional networks) [26] applied deep-learning technology to the field of image segmentation for the first time, achieved end-to-end pixel-level processing, and opened a new era of semantic segmentation. U-Net [27] solved the problem of partial loss of pixel spatial information through its encoder-decoder structure. PSPNet (pyramid scene parsing network) [28], a semantic segmentation model based on feature fusion, improved segmentation accuracy by fusing global and local information. DeepLabV3+ [29] fused spatial pyramid pooling and a decoder to further refine segmentation results. In summary, the four models adopt four different, representative core architectures.

FCN
In 2014, Long proposed fully convolutional networks, the first time deep-learning technology was used to fundamentally solve image semantic segmentation tasks. The FCN structure is shown in Figure 4. The convolutional neural network (CNN) structure [30,31] is mostly used for target classification, applying fully connected layers after the convolutional layers to obtain a fixed-length feature vector. The FCN structure, however, replaces the fully connected layers with convolutional layers for the first time, uses skip-layer connections to fuse the features extracted by different layers, and then uses deconvolution, bilinear interpolation and other means to up-sample and restore the image size, thereby achieving end-to-end processing of images.
Compared with CNN, the FCN model can accept image inputs of any size without pre-sizing. Secondly, the FCN model is much more efficient, which can effectively solve the problems related to repeated storage and convolution calculations caused by the use of pixel blocks.
The FCN structure transforms the classification network into a structure for semantic segmentation for the first time. The FCN model's segmentation accuracy is greatly improved compared with previous methods, which drives the image semantic segmentation technology into a new era and provides inspirations for subsequent research on semantic segmentation algorithms.
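The key substitution described above, replacing fully connected layers with convolutions, can be illustrated with a small sketch. A 1 × 1 convolution applies the same weight matrix to every pixel's feature vector, which is exactly a per-pixel fully connected layer; this is what lets FCN accept inputs of any size. The weights here are random placeholders, not a trained model.

```python
import numpy as np

rng = np.random.default_rng(1)
C_in, C_out = 8, 2                   # e.g. 2 classes: flame vs. background
W = rng.normal(size=(C_out, C_in))   # the same weights serve both views

def fc_per_pixel(feat):
    """Apply the 'fully connected' weights to one pixel's feature vector."""
    return W @ feat

def conv1x1(feat_map):
    """The same weights applied as a 1x1 convolution over an H x W x C_in map."""
    return np.einsum('oc,hwc->hwo', W, feat_map)

# Works for any spatial size -- this is what lets FCN take arbitrary inputs
for h, w in [(4, 4), (7, 5)]:
    fmap = rng.normal(size=(h, w, C_in))
    out = conv1x1(fmap)
    assert out.shape == (h, w, C_out)
    # Per-pixel equivalence with the fully connected form
    assert np.allclose(out[0, 0], fc_per_pixel(fmap[0, 0]))
print("1x1 conv == per-pixel FC")
```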

U-Net
In the field of image segmentation, the encoder-decoder structure [27,32] is often used to solve the problem of spatial pixel loss. The structure mainly contains two parts: the encoder and decoder. The encoder mainly includes convolutional layers and downsampling layers, which gradually reduce the size of feature maps and capture higher-level semantic information through convolution. The decoder includes an up-sampling layer, a convolution layer and a fusion layer, and gradually recovers the target detail information and spatial dimension through deconvolution. The whole structure exploits the multi-scale features of the autoencoder and recovers the spatial resolution from the decoder.
In 2015, the U-Net structure proposed by Ronneberger et al. became one of the classic representatives of the encoder-decoder structure.
The U-Net structure is shown in Figure 5, which mainly includes a contraction path to capture contextual information and a symmetric expansion path for precise localization. The encoder part on the left side of the model consists of four sub-modules; each submodule contains two convolutional layers, and is followed by a max-pooling layer to achieve down-sampling. The decoder part on the right side of the model also consists of four sub-modules, which recover the target details and spatial dimensions through up-sampling. The skip connection structure in U-Net connects the up-sampling results with the features of the same resolution in the encoder, and performs cross-layer feature fusion as the input to the next sub-module of the decoder. U-Net achieves multi-scale feature fusion and can complete end-to-end processing with a small amount of data.
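The four-stage contraction/expansion with skip connections described above can be traced as a shape walk-through. This is only a sketch: max-pooling and nearest-neighbour up-sampling stand in for U-Net's conv blocks and learned up-convolutions, and the channel counts are illustrative.

```python
import numpy as np

def down(x):
    """2x2 max-pooling: halves the spatial resolution (stand-in for conv+pool)."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def up(x):
    """Nearest-neighbour up-sampling: doubles the spatial resolution."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

x = np.random.default_rng(0).random((64, 64, 4))

# Contracting path: keep each stage's output for the skip connection
skips = []
for _ in range(4):
    skips.append(x)
    x = down(x)            # 64 -> 32 -> 16 -> 8 -> 4

# Expanding path: up-sample, then concatenate the matching-resolution skip
for skip in reversed(skips):
    x = up(x)
    x = np.concatenate([x, skip], axis=-1)   # cross-layer feature fusion

print(x.shape)  # back to 64 x 64, with channels grown by the fused skips
```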

PSPNet
In 2017, Zhao H proposed the PSPNet structure, which is shown in Figure 6. Based on FCN, it makes full use of global semantic connection to improve the reliability of prediction by aggregating contextual information in different regions. It is characterized by the use of a pyramid pooling module, which fuses four feature maps with different scales to extract local and global information of the image, cascading and integrating the features of the same level.
In the pyramid pooling module, the outputs of the different scale levels contain feature maps of different sizes, and the channel dimension is reduced to 1/N using a 1 × 1 convolutional layer to maintain the weight of the global region. The low-dimensional feature maps are then up-sampled by bilinear interpolation to obtain features of equal size; these features of different levels are concatenated into the pyramid pooling global feature, and, finally, pixel-by-pixel prediction results are output through a convolutional layer.
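The pooling, channel reduction, up-sampling and concatenation steps above can be sketched as follows. This is an assumption-laden NumPy illustration: the pooling scales (1, 2, 3, 6) follow the published PSPNet design, nearest-neighbour up-sampling stands in for bilinear interpolation, and the 1 × 1 convolution is stood in by a random projection.

```python
import numpy as np

def avg_pool_to(x, bins):
    """Adaptive average pooling of an H x W x C map down to bins x bins x C."""
    h, w, c = x.shape
    return x.reshape(bins, h // bins, bins, w // bins, c).mean(axis=(1, 3))

def upsample_to(x, h, w):
    """Nearest-neighbour up-sampling back to H x W."""
    return x.repeat(h // x.shape[0], axis=0).repeat(w // x.shape[1], axis=1)

rng = np.random.default_rng(0)
feat = rng.random((24, 24, 8))          # backbone feature map
levels = [1, 2, 3, 6]                   # the four PSPNet pooling scales
n = len(levels)

branches = [feat]
for bins in levels:
    pooled = avg_pool_to(feat, bins)
    # 1x1 conv reducing channels to C/N, sketched here as a random projection
    proj = rng.normal(size=(feat.shape[2], feat.shape[2] // n))
    reduced = pooled @ proj
    branches.append(upsample_to(reduced, 24, 24))

fused = np.concatenate(branches, axis=-1)
print(fused.shape)  # 24 x 24 global feature: 8 + 4 * (8 // 4) channels
```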

To a certain extent, PSPNet overcomes the main disadvantage of the FCN structure for image semantic segmentation, namely that it does not consider contextual information, thereby extracting a complete feature representation and further improving the segmentation accuracy.

DeepLabV3+
In 2018, on the basis of DeepLabV3, Chen proposed DeepLabV3+, which adds a decoder structure. The structure is shown in Figure 7. DeepLabV3 is used as the encoder to extract rich contextual information, and the newly added decoder refines the segmentation results and restores the semantic information of the object.
The core of the encoder is the atrous spatial pyramid pooling (ASPP) module, which processes the DCNN features in 5 different ways, including 3 convolutions with different atrous (dilation) rates, pooling, and dimensionality reduction, and then fuses the resulting features. The decoder concatenates the shallow-layer features of the DCNN with the features extracted by the ASPP module and performs deconvolution to obtain the final result.
The DeepLabV3+ structure adds depthwise separable convolutions to the ASPP module and the decoder, which improves the network's running speed and robustness and greatly improves segmentation accuracy, achieving a relative balance between model accuracy and algorithm time complexity.
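The atrous convolutions at the heart of ASPP can be illustrated with a small sketch: the same 3 × 3 kernel applied at increasing dilation rates covers larger receptive fields at identical parameter cost. The single-channel implementation and the rates (1, 3, 6) are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def dilated_conv2d(x, kernel, rate):
    """'Same'-padded 2D convolution with a dilation (atrous) rate, single channel.

    A 3x3 kernel with rate r covers a (2r+1) x (2r+1) window with no extra weights.
    """
    k = kernel.shape[0]                      # assume a square, odd-sized kernel
    pad = rate * (k // 2)
    xp = np.pad(x, pad)
    out = np.zeros_like(x, dtype=float)
    for i in range(k):
        for j in range(k):
            di, dj = i * rate, j * rate
            out += kernel[i, j] * xp[di:di + x.shape[0], dj:dj + x.shape[1]]
    return out

x = np.random.default_rng(0).random((16, 16))
kernel = np.full((3, 3), 1.0 / 9.0)          # averaging kernel for illustration

# The ASPP idea: one 3x3 kernel applied at several rates sees receptive
# fields of 3x3, 7x7 and 13x13 while the output size stays the same.
outs = [dilated_conv2d(x, kernel, r) for r in (1, 3, 6)]
print([o.shape for o in outs])  # all (16, 16)
```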


Methods
This study used the forest fire dataset constructed above, and randomly selected 80% of the data for training, 10% for validation, and 10% for testing.
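The 80/10/10 split can be sketched as below; the file names and seed are hypothetical, and only the shuffling-then-slicing pattern reflects the text.

```python
import random

def split_dataset(paths, seed=42):
    """Shuffle and split image paths 80/10/10 for train/validation/test."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)   # fixed seed for a reproducible split
    n = len(paths)
    n_train, n_val = int(n * 0.8), int(n * 0.1)
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])

files = [f"img_{i:04d}.jpg" for i in range(4200)]   # hypothetical file names
train, val, test = split_dataset(files)
print(len(train), len(val), len(test))  # 3360 420 420
```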
FCN, U-Net, PSPNet, and DeepLabV3+ were used to conduct comparative segmentation research on forest fires, each trained with VGG16 and Resnet50 as backbones. The structure of VGG16 is relatively simple: stacked small convolution kernels replace the larger ones of the classical network, giving fewer parameters and a faster running speed. Resnet50 has a deeper structure and adopts residual modules to ameliorate model degradation, so more convolutional layers can be built to extract deeper features.
In the experiments, the binary cross-entropy loss function was used as Loss. The experimental environment and configuration are shown in Table 2, and the network parameter settings are shown in Table 3.
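The binary cross-entropy loss used here has a standard definition, sketched below on per-pixel flame probabilities; the example masks are made up for illustration.

```python
import numpy as np

def bce_loss(pred, target, eps=1e-7):
    """Binary cross-entropy averaged over pixels.

    pred: predicted flame probabilities in (0, 1); target: 0/1 flame mask.
    eps clips the probabilities to avoid log(0).
    """
    p = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(p) + (1 - target) * np.log(1 - p)))

# Toy 2x2 "image": two flame pixels and two background pixels
pred = np.array([[0.9, 0.1], [0.8, 0.3]])
target = np.array([[1.0, 0.0], [1.0, 0.0]])
print(round(bce_loss(pred, target), 4))  # 0.1976
```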

Evaluation Indicators
The evaluation indicators recorded during model testing include PA, mPA, mIoU and FWIoU. In the following formulas, TP (true positive) refers to the number of flame pixels correctly predicted by the model; FP (false positive) refers to the number of pixels predicted to be flames, but the true value is the woodland background; FN (false negative) indicates that the woodland background is predicted, but the true value is the number of pixels of the flame; TN (true negative) represents the number of correctly predicted woodland background pixels.

Pixel Accuracy (PA)
PA refers to the proportion of correctly classified flame and woodland-background pixels among the total pixels in the image, representing the accuracy of the model's pixel classification. The greater the PA, the more accurately the model segments the flame and woodland background.

Mean Pixel Accuracy (mPA)
mPA is to calculate the prediction accuracy of the flame and woodland background pixels separately, and take the average value of the two, which can better reflect the segmentation accuracy of the model for the overall semantics. The higher the mPA, the higher the segmentation accuracy of the flame and the woodland background, and the more precise the model.

Mean Intersection over Union (mIoU)
The IoU refers to the ratio of the intersection and union of the target ground-truth pixel set and the prediction set. The mIoU is to calculate the IoU of the flame and woodland background, respectively, and take the average value. The larger the mIoU, the higher the predicted coincidence between the flame and the woodland background and the real area.

Frequency-Weighted Intersection over Union (FWIoU)
The frequency-weighted IoU is a weighted summation of the various intersection ratios according to the frequency of the flame and forest background pixels. Compared with mIoU, FWIoU sets weights for each category according to the frequency of occurrence, which better improves the overall semantic evaluation. The larger the FWIoU, the stronger the overall segmentation performance.
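The four indicators above can be computed directly from the TP/FP/FN/TN counts defined earlier. The sketch below follows those verbal definitions for the binary flame/background case; it assumes both classes are present in the ground truth and omits zero-division handling.

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """PA, mPA, mIoU and FWIoU for a binary flame (1) / background (0) task."""
    tp = np.sum((pred == 1) & (gt == 1))   # flame pixels predicted as flame
    fp = np.sum((pred == 1) & (gt == 0))   # background predicted as flame
    fn = np.sum((pred == 0) & (gt == 1))   # flame predicted as background
    tn = np.sum((pred == 0) & (gt == 0))   # background predicted correctly
    total = tp + fp + fn + tn

    pa = (tp + tn) / total
    # Per-class accuracy, then the mean over {flame, background}
    acc_flame, acc_bg = tp / (tp + fn), tn / (tn + fp)
    mpa = (acc_flame + acc_bg) / 2
    # Per-class IoU, then the mean
    iou_flame = tp / (tp + fp + fn)
    iou_bg = tn / (tn + fn + fp)
    miou = (iou_flame + iou_bg) / 2
    # FWIoU: weight each class IoU by its ground-truth pixel frequency
    freq_flame, freq_bg = (tp + fn) / total, (tn + fp) / total
    fwiou = freq_flame * iou_flame + freq_bg * iou_bg
    return pa, mpa, miou, fwiou

gt = np.zeros((10, 10), dtype=int); gt[:2, :5] = 1      # 10% flame pixels
pred = gt.copy(); pred[0, 0] = 0; pred[5, 5] = 1        # one miss, one false alarm
pa, mpa, miou, fwiou = segmentation_metrics(pred, gt)
print(f"PA={pa:.3f} mPA={mpa:.3f} mIoU={miou:.3f} FWIoU={fwiou:.3f}")
```

Note how PA and FWIoU stay high even with errors on the small flame class, while mPA and mIoU are pulled down by them; this mirrors the class-imbalance discussion later in the paper.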

Results
The test set contains a total of 420 pictures, including various environmental conditions such as clear, foggy, stormy, blizzard and smoky, and also includes interference images similar to the background color of flames and woodland. Based on the trained models above, we aimed to test and record the segmentation results and evaluation indicators from multiple perspectives.

Forest Fire Segmentation
Shown in Figure 8 are the segmentation results of each model with VGG16 as the backbone network; shown in Figure 9 are the segmentation results of each model with Resnet50 as the backbone network.

From the perspective of the different fully convolutional network models, column (d) shows that the U-Net model produces the most precise segmentation of the flame edge, with the highest overlap with the ground-truth flame area. U-Net also segments the outer flame satisfactorily and can segment flame areas occluded by trees well. Column (c) shows that the FCN model has the worst flame segmentation among all models: its segmentation of the flame contour and outer flame is poor, only the inner flame can be identified, and it cannot segment flames occluded by trees well. Columns (e) and (f) show that the PSPNet and DeepLabV3+ structures can identify the main body and part of the outer flame; however, for irregular outer-flame shapes, the forest background is sometimes incorrectly classified as flame, and the segmentation of flame details is not as good as that of the U-Net structure, causing errors.

Anti-Interference Segmentation
In the test with interference images similar to the background colors of flames and forest, the models with VGG16 as the backbone predicted the interference images precisely: background pixels were not incorrectly predicted as flames, and the segmentation results coincided completely with the all-black label map.
However, the models with Resnet50 as the backbone segmented the interference images slightly differently. Figure 10 shows that the PSPNet model with Resnet50 was unaffected by the interference images: its output contains no flame pixels (a fully black image), the same as the ground truth. U-Net misidentified flames in all three interference images, causing errors. The FCN and DeepLabV3+ models performed similarly: they segmented the different interference images correctly, and their false-pixel-prediction frequency was lower than U-Net's.

Comparison of Models
From the data in Table 4, it can be seen that the evaluation indicators of the four models are generally high, and the above four models have good performance in the forest fire segmentation. U-Net with Resnet50 has higher indicators than the other models in different forest fire scenarios, and the segmentation result is the best. The FCN model with VGG16 is inferior to other models in all indicators, and the segmentation result is the poorest. The overall performance of PSPNet and DeepLabV3+ with Resnet50 is moderate and has a relatively good segmentation result.

Table 5 shows the size of each model's weight file. Combined with the experimental results in Table 4, it can be seen that the weight files of FCN and PSPNet are small, but their segmentation performance is relatively poor. U-Net's weight file is moderate and its segmentation is the best. DeepLabV3+ has the largest weight file, but its segmentation results are slightly lower than U-Net's. Table 6 shows the time each trained model takes to infer a single image when running on a CPU.

Discussion
In forest fire detection, more and more systems adopt flame segmentation algorithms to replace classification algorithms. Table 7 shows some representative algorithms for fire classification and segmentation and their results. In comparison, the fire segmentation algorithms have higher accuracy and stronger detection performance. Secondly, for the forest background, a segmentation algorithm can better locate the fire site, then analyze and predict the fire situation in combination with the surrounding conditions. In addition, classification algorithms mainly focus on detecting flames, while most segmentation algorithms can combine flames with smoke to further improve detection performance.
Comparing the segmentation results of each model with different backbone networks in Figures 8 and 9, on the whole, all of the above models could segment the flame from the forest background well, but there are still some differences in the accuracy of the outline. U-Net with Resnet50 as the backbone network was the most accurate for forest fire segmentation and described the details best; the FCN model with VGG16 as the backbone network was the least accurate for flame segmentation, but it could still identify the flame's main body.
From the perspective of different backbones, compared with VGG16, each model with Resnet50 as the backbone had more refined segmentation results, which could better segment the flame from the forest background, and the segmentation results were more complete and closer to ground-truth.
In terms of interference-image segmentation, U-Net misidentified flames in the interference images during the test due to its excessive sensitivity to fire features. In contrast, DeepLabV3+ outperformed U-Net on the interference images, with a lower flame misidentification rate. Although PSPNet had the lowest misidentification rate on interference images, it is not suitable for use because of its relatively poor segmentation accuracy.
From the analysis of the model evaluation indicators, PA and FWIoU better reflect the model's overall segmentation performance, that is, the prediction of all pixels, including the flames and the forest background. However, because the forest fire pictures used in the experiment had high resolutions and many pixels, most pixels belonged to the forest background, and only about 5-10% of the pixels in each image were forest fire. Therefore, in the test, these two indicators were dominated by the forest background pixels, resulting in higher values.
In contrast, mPA and mIoU better reflect a model's segmentation performance in the flame area. As shown above, each model segmented the forest background, which accounts for the great majority of pixels, well. Because these two indicators average the per-class results, the small flame class carries the same weight as the background; mPA and mIoU therefore correlate strongly and positively with flame-area segmentation accuracy.
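The contrast between the overall indices (PA, FWIoU) and the class-averaged indices (mPA, mIoU) follows directly from their definitions. A short sketch, using the standard confusion-matrix formulations (the toy numbers are invented for illustration, assuming class 0 = forest background and class 1 = flame):

```python
# Sketch of the four evaluation indices computed from a per-class
# confusion matrix, where cm[i, j] = pixels of true class i predicted as j.
import numpy as np

def segmentation_metrics(cm: np.ndarray):
    total = cm.sum()
    tp = np.diag(cm)                     # correctly classified pixels per class
    gt = cm.sum(axis=1)                  # ground-truth pixels per class
    union = cm.sum(axis=1) + cm.sum(axis=0) - tp

    pa = tp.sum() / total                # pixel accuracy over ALL pixels
    cpa = tp / gt                        # per-class pixel accuracy
    iou = tp / union                     # per-class intersection over union
    mpa = cpa.mean()                     # mean PA: classes weighted equally
    miou = iou.mean()                    # mean IoU: classes weighted equally
    fwiou = (gt / total * iou).sum()     # IoU weighted by class frequency
    return pa, mpa, miou, fwiou

# Toy case: 95% background pixels, 5% flame pixels, flame poorly segmented.
cm = np.array([[9400, 100],   # background: 9400 correct, 100 called flame
               [ 200, 300]])  # flame: 300 correct, 200 missed
pa, mpa, miou, fwiou = segmentation_metrics(cm)
# pa ≈ 0.970 and fwiou ≈ 0.946 stay high (background dominates),
# while mpa ≈ 0.795 and miou ≈ 0.735 expose the weak flame segmentation.
```

Even with 40% of the flame pixels missed, PA and FWIoU remain above 0.94, whereas the class-averaged mPA and mIoU drop sharply, which is exactly the behavior discussed above.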
The main reason for the differences in the segmentation results of each model is that different semantic segmentation network models have different fusion methods and utilization levels of deep and shallow features.
The FCN structure performs only two cross-layer feature fusions on the extracted features and lacks multi-scale operations, so its segmentation results were poor. The core of PSPNet is to pool the extracted features at four different scales and then merge the pooled maps with the original features; compared with the other models, PSPNet lacks the fusion of shallow features and has a low feature utilization rate. As an encoder-decoder structure, U-Net performs multiple down-sampling steps in the encoder, so the collected deep features capture the overall contextual semantic information. In the decoder it performs multiple up-sampling operations and concatenates the extracted features with the down-sampling results of the same stage, rather than directly up-sampling the high-level semantic features alone, which ensures that more shallow features are fused into the feature map. U-Net therefore retains and combines both deep and shallow features, which is why it achieves better results in forest fire segmentation tasks. DeepLabV3+ performs multi-scale operations on the deep features in its encoder and combines them with shallow features; since it also exploits both deep and shallow features, its indicators are better than those of FCN and PSPNet, second only to U-Net.
Based on the above data, U-Net with Resnet50 performs best, with the highest indicators, but its weight file is the largest and its running speed relatively slow. It can be used in forest fire segmentation scenarios that require high-precision flame details, such as fire attribute analysis and trend prediction for forestry departments, and related scientific research. The FCN and PSPNet models performed relatively poorly, with lower indicators, so they are not suitable for forest fire segmentation scenarios. The indicators of DeepLabV3+ with Resnet50 are satisfactory, only slightly lower than those of U-Net, while its weight file is smaller and its running speed faster. It is suitable for scenarios with high real-time requirements, such as real-time forest fire detection and monitoring and real-time fire warning systems.

Conclusions
In this paper, we compile remote-sensing forest fire images captured by UAVs under multiple conditions to construct a dataset, and optimize the data enhancement process for forest fire image features. On the basis of this improved data quality, four fully convolutional networks are each combined with two backbone networks for training, analysis and testing. The focus is a comparative study of semantic segmentation methods applied to forest fire segmentation scenarios, analyzing and discussing their application value and functional focus.
The results show that, among the four widely used deep semantic segmentation models, the U-Net model performs best in forest fire segmentation, with a better description of semantic details such as flame outlines and contours, but its single-image inference speed is relatively slow, so it is more suitable for application scenarios requiring high accuracy, such as forest fire attribute analysis and trend prediction. The DeepLabV3+ model offers relatively faster inference while maintaining high segmentation accuracy; although its description of flame details is slightly inferior to that of U-Net, it is still satisfactory, making it suitable for scenarios requiring high real-time performance, such as real-time forest fire detection and alerts. The FCN and PSPNet models are relatively unsuitable for forest fire segmentation scenarios because of their poor segmentation results and low evaluation indices. In addition, the comparative study found that the detection accuracy of forest fire segmentation methods is generally higher than that of classification methods; therefore, U-Net and DeepLabV3+ are more applicable in forest fire detection systems.
Image segmentation is a key area for forest fire detection. Faster, more accurate and less resource-intensive forest fire segmentation and recognition in remote-sensing woodland images is the focus of current research in this field. Although existing techniques have achieved certain results, future research still needs improvement in the following aspects: combining smoke and forest fires for identification and analysis, optimizing the identification of forest fires under complex backgrounds and occlusion, and adopting edge computing technology to improve inference speed.