Forest Fire Segmentation from Aerial Imagery Data Using an Improved Instance Segmentation Model

Abstract: In recent years, forest-fire monitoring methods based on deep learning have developed rapidly. Using drone technology and optimizing existing models to improve forest-fire recognition accuracy and segmentation quality are of great significance for understanding the spatial distribution of forest fires and protecting forest resources. Because fire spreads and has an irregular shape, it is extremely difficult to detect accurately in a complex environment. Based on the aerial imagery dataset FLAME, this paper analyzes methods for two deep-learning problems: (1) Video frames are classified into two classes (fire, no-fire) according to the presence or absence of fire. A novel image classification method based on a channel-domain attention mechanism was developed, which achieved a classification accuracy of 93.65%. (2) We propose a novel instance segmentation method (MaskSU R-CNN) for incipient forest-fire detection and segmentation based on the MS R-CNN model. In the optimized model, the MaskIoU branch is reconstructed as a U-shaped network in order to reduce the segmentation error. Experimental results show that MaskSU R-CNN reached a precision of 91.85%, a recall of 88.81%, an F1-score of 90.30%, and a mean intersection over union (mIoU) of 82.31%. Compared with many state-of-the-art segmentation models, our method achieves satisfactory results on the forest-fire dataset.


Introduction
Forest fires have caused substantial economic losses, air pollution, environmental degradation, and other problems all over the world, wreaking havoc on human life, animals, and plants [1][2][3]. According to incomplete statistics, over 200,000 forest fires occur worldwide, destroying approximately 10 million hectares of forest land [4]. Furthermore, the risk of forest fires in China is increasing due to factors such as the wide distribution of forests, the complex topography of forest areas, and outdated monitoring technology [5]. Rapid detection of forest fires can therefore reduce damage to ecosystems and infrastructure. In particular, incipient forest-fire recognition technology can provide forest firefighters with more accurate data on fire behavior, thereby helping to prevent the spread of fires [6].
Traditional forest-fire recognition methods include setting up watchtowers in forest areas for manual monitoring or using infrared instruments carried by helicopters for forest fire detection. These methods are time-consuming, laborious, and generally inefficient in recognition. Thanks to breakthroughs in the field of artificial neural networks, deep neural networks (DNNs) have recently emerged as the most advanced technology in some computer vision challenges [7]. The current success of these types of architectures in very complicated tasks has broadened the scope of their prospective applications and paved the way for their application to real-world problems [8,9]. Nonetheless, recognizing forest fires using aerial imagery is challenging due to fire's various shapes, scopes, and spectral overlaps [10,11].

The main contributions of this paper are as follows:
• We design a novel attention mechanism module, which consists of two independent branches for learning semantic information between different channels to enrich feature representation capability;
• We utilize a U-shaped network to reconstruct the MaskIoU branch of MS R-CNN with the aim of correcting forest-fire edge pixels and reducing segmentation errors; and
• Experimental results show that the proposed MaskSU R-CNN outperforms many existing CNN-based models on forest-fire instance segmentation.
The remainder of this paper is organized as follows. Section 2 introduces experimental materials and methods for two problems, namely, fire classification and fire instance segmentation. Section 3 provides the experimental results and analysis. Section 4 presents the discussion. Finally, Section 5 makes a few concluding remarks.


Materials and Methods
This section details the data source and the annotation of the training set, and then presents two approaches that address two different challenges. The first challenge is fire and no-fire classification using a deep-learning (DL) method. The second challenge is fire instance segmentation, which complements the first: for images classified as containing fire, we further identify the fire regions and assign a class to each pixel.

Data Source
Data obtained through wireless sensor networks and infrared technology have been widely used in the detection, monitoring, and evaluation of forest fires. Furthermore, the aerial vantage point of drones allows us to better understand the forest topography and the location of the fire [31]. As a result, we selected the FLAME dataset as our data source, which can be obtained from the website (https://ieee-dataport.org/open-access/flame-dataset-aerial-imagery-pile-burn-detection-using-drones-uavs, accessed on 16 December 2021). The FLAME dataset was gathered by drones during pile burns in Arizona pine forests, and it includes video frames and heatmaps taken by infrared cameras in palettes such as WhiteHot and GreenHot. Figure 1 shows some representative images from fire, no-fire, and thermal videos.
To examine the effect of the training set on the model's performance, we employed four different proportions of the training set to train the model. The validation and testing sets in these experiments use the same images throughout, each at a proportion of 10%. The implementation details are recorded in Table 1.
The instance segmentation task can be defined as a pixel-level binary classification problem, in which each pixel is labeled as fire or no-fire (background). To complete the segmentation of the forest fires, the images labeled as fire in Table 1 are used as the training set. In addition, to train the instance segmentation model and guarantee the quality of the segmentation, we extracted the ground truth of each image using the Labelme software. Figure 2 shows the annotation result of a training sample.

Fire Image Classification Using DSA-ResNet
The principle of image classification using deep neural networks differs from that of traditional digital image processing techniques [14,32,33]. Traditional methods mostly use mathematical modeling or shallow networks for processing and then recognition, which often fail to break the recognition-rate bottleneck and suffer from missed and false detections in practical applications [26]. By contrast, training a CNN for this image classification task also helps the network learn scene elements unrelated to the fire. Among CNNs, ResNet [29], with its many residual blocks, allows for smooth gradient flow and improves classification accuracy. Furthermore, the attention mechanism provides new momentum for the advancement of CNNs by extracting more useful information [34,35]. Experiments show that attention mechanisms based on the channel domain or spatial domain, such as SENet [36] and CBAM [37], can significantly improve network recognition ability. Forest-fire recognition is a challenging task due to the interference of smoke and the translucent nature of flames. Considering the complexity of forest-fire characteristics, we further exploit multi-scale information based on the SE module to enhance the representation of the model. Unlike previous studies, we propose a novel module that applies an attention mechanism to convolution kernels and can dynamically select and fuse feature maps from convolution kernels of different scales, termed the Dual Semantic Attention (DSA) module. More specifically, we implement the DSA module via three operators (Separate, Fuse, and Select), as shown in Figure 3.

Remote Sens. 2022, 14, x FOR PEER REVIEW 5 of 21
Separate: An input feature map $X \in \mathbb{R}^{H \times W \times C}$ is converted by two transformations, each a series of operations such as grouped convolution, batch normalization (BN), and the Rectified Linear Unit (ReLU) activation function, to obtain the outputs $\tilde{F}: X \rightarrow \tilde{U} \in \mathbb{R}^{H \times W \times C}$ and $\hat{F}: X \rightarrow \hat{U} \in \mathbb{R}^{H \times W \times C}$. Note that, for the purpose of learning the weight relations of the different branches, we use convolution kernel sizes of 3 × 3 and 5 × 5 for feature extraction.
Fuse: The purpose of fusion is to learn the channel weights between different feature streams by adaptively adjusting the convolutional kernels (neurons). Firstly, we fuse the outputs of $\tilde{F}$ and $\hat{F}$ using element-wise summation:
$$U = \tilde{U} + \hat{U} \tag{1}$$
Then we obtain the global representation $s \in \mathbb{R}^{C}$ by global average pooling:
$$s_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} U_c(i, j) \tag{2}$$
where $U_c$ denotes the feature map of the c-th channel. Furthermore, a fully connected (FC) layer is applied to obtain a compact tensor $z \in \mathbb{R}^{d}$ and reduce computational effort:
$$z = \delta(\mathrm{BN}(W s)) \tag{3}$$
where $\delta$ is the ReLU activation function, $\mathrm{BN}(\cdot)$ represents batch normalization, $W \in \mathbb{R}^{\frac{C}{r} \times C}$ is a linear mapping (so that $d = C/r$), and $r$ is the reduction ratio, which is set to 16 in our experiments.
Select: Two independent FC layers are used to embed the attention information, followed by normalization using the softmax function:
$$\mathrm{att}_i = \frac{e^{W_i z}}{\sum_{j=1}^{2} e^{W_j z}} \tag{4}$$
where $\mathrm{att}_i \in \mathbb{R}^{C}$ denotes the i-th softmax attention, $i \in \{1, 2\}$, $W_i \in \mathbb{R}^{C \times d}$ is the weight of the i-th FC layer, and the exponential is applied element-wise. Finally, the output $V$ of the DSA module is obtained from the channel-based attention weights and their corresponding feature maps, denoted as:
$$V = \mathrm{att}_1 \cdot \tilde{U} + \mathrm{att}_2 \cdot \hat{U} \tag{5}$$
Note that the above formula is implemented with two branches, and one can easily derive the case with more branches by extending Equations (1), (4), and (5).
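The Fuse and Select steps can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes the two branch outputs have already been computed, omits batch normalization for brevity, and uses random weights. The function name `dsa_fuse_select` and all argument names are hypothetical.

```python
import numpy as np

def dsa_fuse_select(u_a, u_b, w_reduce, w_a, w_b):
    """Fuse and Select steps for two branch outputs of shape (H, W, C)."""
    # Fuse: element-wise summation of the two branches.
    u = u_a + u_b
    # Global average pooling over the spatial axes gives s in R^C.
    s = u.mean(axis=(0, 1))
    # Compact descriptor z in R^d with d = C / r (BN omitted in this sketch).
    z = np.maximum(w_reduce @ s, 0.0)
    # Select: two independent FC layers, then a softmax across the branches.
    logits = np.stack([w_a @ z, w_b @ z])        # shape (2, C)
    logits -= logits.max(axis=0, keepdims=True)  # numerical stability
    att = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
    # Channel-wise weighted sum of the branch feature maps (broadcast over H, W).
    v = att[0] * u_a + att[1] * u_b
    return v, att

rng = np.random.default_rng(0)
H, W, C, r = 7, 7, 32, 16
d = C // r
u_a = rng.standard_normal((H, W, C))
u_b = rng.standard_normal((H, W, C))
v, att = dsa_fuse_select(
    u_a, u_b,
    rng.standard_normal((d, C)),   # reduction FC weights
    rng.standard_normal((C, d)),   # FC weights for branch 1
    rng.standard_normal((C, d)),   # FC weights for branch 2
)
print(v.shape, att.shape)
```

Because the softmax is taken across the two branches, the attention weights for each channel sum to one, so the output is a per-channel convex combination of the 3 × 3 and 5 × 5 branch features.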
The DSA module described above is integrated into ResNet, and the network structure of the resulting 50-layer DSA-ResNet is shown in Figure 4. Similar to other CNNs, our DSA-ResNet50 model consists of three primary components: (1) the input feature matrix, (2) the feature extraction layers, and (3) the output layer. During the training phase, the input matrix X is first resized to 230 × 230 × 3, which depends on the image size and its channels. Then, the data are augmented using random rotation, horizontal flip, and other techniques to improve the generalization ability of the model and to avoid overfitting. The feature extraction layers consist of a large number of convolution blocks (DSA-Residual modules), and each block is followed by an identity mapping and a ReLU activation function [38]. Batch normalization helps to accelerate the convergence of the loss function and enables the model to learn different distributions of the data. The output of the last feature extraction layer is 7 × 7 × 2048, which is then reduced to 1 × 1 × 2048 by global pooling. Because fire classification is a binary classification problem, we use the sigmoid activation function to output its probabilities (fire, no-fire), denoted as:
$$p(\hat{y}) = \sigma(\phi(\theta)) = \frac{1}{1 + e^{-\phi(\theta)}}$$
where $\phi(\theta)$ represents the output of the FC layer, which is obtained using the input matrix X, the pixel values for each channel, and all weights across the entire feature extraction layer, and $\theta$ is the weight of the last layer. The output is the probability of fire recognition, with a threshold set to 0.5.
To train our DSA-ResNet50 model, a loss function is used to improve network accuracy and find the best weight matrix, which is defined as a binary cross-entropy:
$$L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p(\hat{y}_i) + (1 - y_i) \log\left(1 - p(\hat{y}_i)\right) \right]$$
where N represents the total number of samples used for each epoch, y is the ground truth label for each image, labeled as fire (y = 1) or no-fire (y = 0) in the training phase, and $p(\hat{y})$ represents the predicted probability that an image belongs to the fire class. In addition, training is carried out with the Adam optimizer [39] to update the network weights, with the $L_2$ regularization set to $1 \times 10^{-4}$.
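As a minimal illustration of the sigmoid output and the binary cross-entropy loss described above, the following pure-Python sketch applies both to a few made-up logits. The example values and the `eps` clipping constant are assumptions for illustration only, not part of the original method.

```python
import math

def sigmoid(x):
    """Map the FC-layer output to a fire probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def binary_cross_entropy(labels, probs, eps=1e-12):
    """Mean binary cross-entropy over a set of samples."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)

# Hypothetical FC-layer outputs and ground truth labels (1 = fire, 0 = no-fire).
logits = [2.0, -1.0, 0.3]
labels = [1, 0, 1]

probs = [sigmoid(v) for v in logits]
preds = [1 if p >= 0.5 else 0 for p in probs]  # threshold of 0.5
loss = binary_cross_entropy(labels, probs)
print(preds, round(loss, 4))
```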

Fire Instance Segmentation Using MaskSU R-CNN
As an improvement of Mask R-CNN [40], MS R-CNN (Mask Scoring R-CNN) is among the most advanced instance segmentation methods at present. Mask R-CNN regards classification confidence as the criterion for segmentation quality. However, experimental evidence demonstrates that there is no significant correlation between predicted mask quality and classification confidence. To solve this problem, MS R-CNN adds a MaskIoU branch to Mask R-CNN, which learns to predict the quality of the segmentation result, i.e., the segmentation confidence. The MaskIoU branch takes the corresponding feature map together with the mask branch's output as input (14 × 14 × 257) and then predicts the Intersection over Union (IoU) between the predicted mask and its corresponding ground truth label in the image. Consequently, MS R-CNN is capable of achieving more precise segmentation results than Mask R-CNN.
UNet [41] is a semantic segmentation model based on FCN, which mainly includes three parts: down-sampling, up-sampling, and a concatenation operation. Down-sampling consists of convolution-pooling blocks and is used to reduce the spatial size of the feature map. Each block has two convolutional layers with ReLU activation functions and one pooling layer. Up-sampling, in turn, doubles the size of the feature map and halves the number of channels by transposed convolution. After that, the output is concatenated with the feature map of the same size obtained by down-sampling. Throughout the entire process, the concatenation of feature maps helps to integrate information from both shallow and deep layers of the network.
Inspired by the UNet structure, we reconstructed the MaskIoU branch of MS R-CNN using a U-shaped network. The resulting model is called MaskSU R-CNN in this paper, and its structure is presented in Figure 5.

Feature Extraction Network
Given the outstanding performance of the attention mechanism on the image classification task, we adopt DSA-ResNet50 as the backbone of our MaskSU R-CNN. In addition, the introduction of the feature pyramid network (FPN) [42] for multi-scale fusion contributes to extracting more effective features.

Region Proposal Network (RPN) and Region of Interest (RoI) Align
The RPN [43] is composed of two convolution blocks (3 × 3, 1 × 1): The 3 × 3 convolution block extracts features from the output of the backbone network using 32 convolution kernels; the 1 × 1 convolution block is used to adjust the parameters of each anchor box and determine whether there is an object in it. During the training phase, the RPN generates nine anchor boxes for each pixel on the feature map, and then a series of initial proposal regions are screened using Non-Maximum Suppression (NMS).
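The Non-Maximum Suppression step mentioned above can be sketched in a few lines of pure Python. This is the standard greedy formulation, shown here only for illustration; the IoU threshold and the example boxes are assumptions, since the text does not state the values used.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.7):
    """Greedy NMS: keep the best-scoring box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# Hypothetical proposals: the first two overlap heavily, the third is separate.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores, iou_threshold=0.5)
print(kept)
```

With a threshold of 0.5, the second box (IoU of about 0.68 with the first) is suppressed, while the third box survives because it does not overlap the others.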
Unlike the RoI Pooling in the work of [43], RoI Align [40] is used here to scale the RPN output to the same size. RoI Align utilizes bilinear interpolation to calculate grid point coordinates, which effectively preserves the edge pixels of the object to obtain predicted masks with high quality.
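To make the role of bilinear interpolation in RoI Align concrete, the sketch below samples a single-channel feature map at a fractional location. It illustrates only the interpolation step, not the full RoI Align operator (which also divides each RoI into bins and pools several sample points per bin); the function name and the toy feature map are hypothetical.

```python
import math

def bilinear_sample(feature, y, x):
    """Sample a 2D feature map (list of rows) at a fractional location (y, x)."""
    h, w = len(feature), len(feature[0])
    # Clamp the location to the valid range of the feature map.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two neighboring rows, then along y.
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

feature = [[0.0, 1.0],
           [2.0, 3.0]]
center = bilinear_sample(feature, 0.5, 0.5)  # average of the four cells: 1.5
print(center)
```

Because no coordinate is rounded to the nearest cell, sub-pixel offsets of the proposal are preserved, which is why RoI Align retains edge detail better than RoI Pooling.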

Multi-branch Prediction for Classes, Bounding Boxes, and Masks
The multi-branch prediction network contains three branches: the R-CNN branch for classification and bounding-box regression, the mask branch for generating predicted masks, and the MaskIoU branch for segmentation evaluation. The R-CNN branch is consistent with most object detection methods, using the softmax function for classification and Smooth L1 loss for bounding-box regression.
In the Mask R-CNN model, the segmentation quality is equated with the classification confidence, which is unreliable in practical situations. To address this problem, MS R-CNN introduces the MaskIoU branch, which evaluates the quality of segmentation by concatenating the feature map (14 × 14 × 256) with the output of the mask branch (14 × 14 × 1). During successive convolutions, some of the features from the shallow layers of the network are lost. To better perform feature fusion and reduce this feature loss, a U-shaped network is adopted in this paper to reconstruct the MaskIoU branch. The novel MaskIoU branch (Table 2) consists of eight convolutional layers, including down-sampling for channel compression, up-sampling for feature expansion, and a concatenation operation. Finally, the IoU value between the predicted mask and the corresponding ground truth is predicted by three FC layers. The feature concatenation in the U-shaped network integrates the information from different feature maps, which is important for segmenting the edge pixels of fire.

Model Training and Loss Function
COCO [44] is a large image dataset in deep learning, with more than 220 k images in 80 categories. Before training our MaskSU R-CNN model, we initialized it from a COCO-pretrained model via transfer learning [45] to make the training of the deep neural network more stable and efficient. During training, we adopted the method used in [40]. Thirty-two anchors are randomly selected from each image in the batch, and the loss is generated based on their positional relationship with the ground truth label. If the IoU value is greater than 0.5, the RoI is considered a positive sample; otherwise, it is a negative sample. The ratio of positives to negatives is 1:3.
To generate the regression target in the MaskIoU branch, the predicted mask is first binarized with a threshold of 0.5. Then we regress the MaskIoU between the predicted mask and its corresponding ground truth using the $\ell_2$ loss. In addition, the mask loss $L_{mask}$ and the MaskIoU loss $L_{maskiou}$ are defined on positive samples only.
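The construction of the MaskIoU regression target can be sketched as follows: binarize the predicted mask at 0.5, compute its IoU with the ground truth mask, and penalize the branch's prediction with a squared error. This is a simplified pure-Python illustration on a toy 2 × 2 mask; the function names, the strict `>` comparison at the threshold, and the example values are assumptions.

```python
def maskiou_target(pred_mask, gt_mask, threshold=0.5):
    """IoU between the binarized predicted mask and the binary ground truth."""
    inter, union = 0, 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            b = 1 if p > threshold else 0  # binarize the soft mask
            inter += b & g
            union += b | g
    return inter / union if union else 0.0

def l2_loss(predicted_iou, target_iou):
    """Squared error between the predicted and the target MaskIoU."""
    return (predicted_iou - target_iou) ** 2

# Hypothetical soft mask from the mask branch and its ground truth.
pred = [[0.9, 0.6],
        [0.2, 0.1]]
gt = [[1, 0],
      [1, 0]]

target = maskiou_target(pred, gt)   # one shared pixel out of three: 1/3
loss = l2_loss(0.5, target)         # 0.5 stands in for the branch's output
print(target, loss)
```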
During the training phase, the loss curve visually reflects the convergence of the model. The total loss of MaskSU R-CNN is composed of two components: (1) the loss in the RPN; and (2) the loss generated by the multi-branch prediction network, which can be described as:
$$L = L_{rpn} + L_{mul-branch}$$
The RPN loss $L_{rpn}$ is composed of the classification loss (softmax loss) and the bounding-box regression loss (Smooth L1 loss). It is used to generate proposals (the output of the RPN), which involves identifying whether or not there is a real object in each anchor and adjusting the anchor position parameters. $L_{rpn}$ is computed as follows:
$$L_{rpn} = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda^* \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$$
$L_{mul-branch}$ is generated by the different branches, including the classification loss (softmax loss), the bounding-box regression loss (Smooth L1 loss), the mask loss $L_{mask}$, and the MaskIoU loss $L_{maskiou}$. The formula is expressed as follows:
$$L_{mul-branch} = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda^* \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*) + \gamma^* \left( L_{mask} + L_{maskiou} \right)$$
where the classification term is normalized by the mini-batch size (i.e., $N_{cls} = 256$) and the regression term is normalized by the number of anchor locations (i.e., $N_{reg} \approx 2400$); $\lambda^*$ and $\gamma^*$ are hyperparameters used to balance the loss of anchor or bounding-box regression and the loss of mask generation during the training phase, which are set to 10 and 1 in our implementation. The classification loss $L_{cls}$, regression loss $L_{reg}$, mask loss $L_{mask}$, and MaskIoU loss $L_{maskiou}$ are as follows:
$$L_{cls}(p_i, p_i^*) = -\log\left[ p_i^* p_i + (1 - p_i^*)(1 - p_i) \right]$$
$$L_{reg}(t_i, t_i^*) = \mathrm{smooth}_{L1}(t_i - t_i^*), \quad \mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$$
$$L_{mask}(s, s^*) = -\left( s^* \odot \log s \oplus (1 - s^*) \odot \log(1 - s) \right)$$
$$L_{maskiou} = \left( S_{maskiou} - \mathrm{IoU}(s, s^*) \right)^2$$
where $p_i$ represents the probability that the predicted result of anchor i is the ground truth. Since the RPN is used to detect the presence of a target (foreground or background) rather than to classify it, the value of $p_i^*$ is 1 when anchor i is a positive sample; otherwise, it is 0. $t_i$ represents the regression parameters of anchor i, including the center coordinates of the bounding box (x, y), the width w, and the height h, and $t_i^*$ indicates the ground truth corresponding to anchor i. $s$ and $s^*$ represent the binary matrices of the predicted mask and the ground truth, respectively. $\odot$, $\oplus$, and $\log$ denote the pixel-based product, summation, and logarithm, respectively.
Finally, the segmentation quality of each target can be expressed by the mask score $S_{mask}$:
$$S_{mask} = S_{cls} \cdot S_{maskiou}$$
where $S_{cls}$ represents the classification confidence obtained from the R-CNN branch, and $S_{maskiou}$ is the output of the MaskIoU branch.

Results
This section presents the performance of the two deep neural network models on forest-fire image classification and segmentation. All experiments are based on Python 3.6 and PyTorch on Windows. The hardware used is an AMD R7-5800H and an NVIDIA RTX 3070 with 16 GB memory.

Accuracy Assessment
To observe the performance of the model at different training-set proportions, we compared several existing deep classification networks (VGGNet [46], GoogleNet [47], ResNet [29], and SE-ResNet50 [36]) with our DSA-ResNet50. The performance of these models trained with different proportions (20%, 40%, 60%, and 80%) of training images is shown in Figure 6. We evaluate the classification performance using four metrics: accuracy (Acc), Kappa coefficient (K), Omission Error (OE), and Commission Error (CE). The calculation formulas are as follows:
$$Acc = \frac{TP + TN}{TP + FP + FN + TN}, \quad K = \frac{p_o - p_e}{1 - p_e}, \quad OE = \frac{FN}{TP + FN}, \quad CE = \frac{FP}{TP + FP}$$
where TP and FP represent the number of fire and no-fire images classified as fire, respectively; FN and TN represent the number of fire and no-fire images classified as no-fire, respectively; $p_o$ is the overall classification accuracy, and $p_e$ is the expected agreement by chance.
According to Figure 6, as the training set grows, Acc and K keep increasing while OE and CE decrease. This is because the model can learn more relevant features from a larger training set. In addition, OE is slightly higher than CE, which indicates that some of the fine fire points are highly similar to the surrounding soil or obscured by vegetation, making it difficult for the model to classify them accurately. It is worth noting that our DSA-ResNet50 is superior to the other models at every proportion of training images.
Table 3 presents the results of the comparison models trained with 80% of the images, where our DSA-ResNet50 performs the best (Acc = 93.65%, K = 0.864, OE = 20.59%, and CE = 4.23%). In addition, with a slight increase in network parameters (1.8 million), the addition of the DSA module increased Acc and K by 2.37% and 0.025, and decreased OE and CE by 9.28% and 4.12%, respectively, suggesting that the proposed attention mechanism can capture more features and thus improve the classification ability of the model.
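The four classification metrics can be computed from a binary confusion matrix as in the sketch below. It assumes the standard remote-sensing definitions of omission and commission error and the usual chance-agreement term for Cohen's kappa; the confusion-matrix counts are made up for illustration and are not results from the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (Acc, Kappa, OE, CE) for a binary fire / no-fire confusion matrix."""
    n = tp + fp + fn + tn
    acc = (tp + tn) / n  # overall accuracy, also p_o
    # Chance agreement p_e from the predicted and actual class marginals.
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (acc - pe) / (1 - pe)
    oe = fn / (tp + fn)  # omission error: fire images missed
    ce = fp / (tp + fp)  # commission error: false fire alarms
    return acc, kappa, oe, ce

# Hypothetical counts for illustration only.
acc, kappa, oe, ce = classification_metrics(tp=40, fp=10, fn=10, tn=40)
print(acc, kappa, oe, ce)
```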

Visualization Analysis
To better understand CNN's decision on image classification, the visualization method Gradient-weighted Class Activation Mapping (Grad-CAM) [48] was used to generate a heatmap for evaluating important regions in each input image. Figure 7 shows the Grad-CAM visualizations of forest-fire images taken from different UAV angles (shown in the first and third columns), based on DSA-ResNet50 model trained with 80% training images. According to each input image and its corresponding visualization, it can be seen that DSA-ResNet50 can easily focus on areas with fire points (marked with red boxes), indicating that the DSA module enhances the network's ability to recognize fire areas.

Evaluation Metrics
The accuracy of segmentation results is often measured by IoU, which represents the overlap rate between the predicted result and its corresponding ground truth label; the closer the value is to 1, the better the segmentation performance. For a fair comparison, we adopt the mean value of IoUs on the testing set to measure the model, denoted as:

$$mIoU = \frac{1}{N}\sum_{i=1}^{N}\frac{|P_i \cap G_i|}{|P_i \cup G_i|}$$

where $P_i$ and $G_i$ denote the predicted result and corresponding ground truth label for the i-th image, respectively, and $N$ is the number of testing images. In our experiment, if the IoU is 0.5 or above, the target is considered a positive sample, otherwise negative. In addition, the F1-score is used as another evaluation metric, denoted as:

$$f = \frac{2TP}{2TP + FP + FN}$$

where TP, FP, and FN are defined in Section 3.1.1. The larger the value of f, the better the accuracy of the model.
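The two metrics can be illustrated with a short sketch. It assumes binary masks stored as equally sized nested lists of 0/1 values; the function names are hypothetical, and treating two empty masks as a perfect match is a convention chosen here, not one stated in the paper.

```python
def iou(pred, truth):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    inter = union = 0
    for p_row, g_row in zip(pred, truth):
        for p, g in zip(p_row, g_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(preds, truths):
    """Mean of the per-image IoUs over a set of images."""
    scores = [iou(p, g) for p, g in zip(preds, truths)]
    return sum(scores) / len(scores)


def f1_score(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    return 2 * tp / (2 * tp + fp + fn)


def f1_from_pr(precision, recall):
    """Equivalent harmonic-mean form of F1."""
    return 2 * precision * recall / (precision + recall)


pred = [[1, 1], [0, 0]]
truth = [[1, 0], [1, 0]]
print(iou(pred, truth))            # one shared pixel out of three covered pixels
print(f1_from_pr(0.9185, 0.8881))  # precision and recall reported for MaskSU R-CNN
```

Applying the harmonic-mean form to the reported precision (91.85%) and recall (88.81%) gives approximately 90.30%, which matches the reported F1-score.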

Performance Analysis and Comparison
The proposed MaskSU R-CNN is an instance segmentation model that implements parallel processing for object detection and segmentation. From the experimental results (Figure 8b), it can be found that our model can correctly identify forest-fire targets and achieve good segmentation results. Compared with the ground truth labels (Figure 8c), they remain almost the same except for some defects in flame details, and the minor differences are mainly caused by the translucency of forest fires and the interference of occlusions. Figure 9 demonstrates the loss curves over 120 epochs for both the training and validation sets. It can be seen that our model shows an overall smooth decreasing trend and gradually converges after about 80 epochs.
To demonstrate the superiority of MaskSU R-CNN on forest-fire segmentation, we compared our method with several DL-based semantic segmentation models, including SegNet [49], UNet [41], PSPNet [50], and DeepLabv3 [51].
Notably, the same dataset and configurations were used to train all models so that the predictions are comparable.
We selected a representative subset of images from the testing set for display; the predicted results are shown in Figure 10. Visually, the segmentation results of our MaskSU R-CNN are better than those of the comparison models, especially on images that are hard for humans to recognize. For apparent forest-fire targets that differ significantly from the background, most methods produced relatively accurate segmentation results. For forest fires with small targets and high concealment, as marked with green boxes in Figure 10a, most models showed some degree of under-segmentation, except for DeepLabv3 and our MaskSU R-CNN. It is worth noting that SegNet showed serious mis-segmentation (marked with blue boxes), mainly because the model does not take full advantage of contextual semantic relationships. Table 4 lists the results of the quantitative analysis for the comparison models. Our model obtained the highest f and mIoU on the testing set. Moreover, unlike the above semantic segmentation methods, our model also differentiates individual forest fires, which makes the fire segmentation more interpretable.
Furthermore, to verify the effectiveness of our improved model MaskSU R-CNN, we compared it with the original models Mask R-CNN and MS R-CNN; some of the segmentation results are shown in Figure 11. It can be seen that our MaskSU R-CNN achieves the best segmentation results, followed by MS R-CNN. The main reason is that both models add a new MaskIoU branch to Mask R-CNN, and we further improve the quality of the predicted mask by reconstructing the MaskIoU branch with a U-shaped network, which explains why our segmentation results are the best.
In particular, our model has a remarkable advantage in the correction of edge pixels in the fire regions, especially on inconspicuous fire images, such as fire points 1, 2, 6, and 8 in Figure 11e. As for those highly occluded fire targets in Figure 12, the segmentation confidence of our method is more reasonable, which is determined by the segmentation quality, rather than directly using the classification confidence. Meanwhile, the novel branch MaskIoU greatly improves the fine-grained characterization capability of our model.
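The idea of a segmentation confidence driven by mask quality can be sketched as follows. The sketch follows the Mask Scoring R-CNN formulation, in which the mask score is the product of the classification score and the predicted MaskIoU; whether MaskSU R-CNN uses exactly this product is an assumption here, and the function name, dictionary keys, and values are hypothetical.

```python
def rescore_masks(detections):
    """Rank detections by classification score multiplied by predicted MaskIoU."""
    rescored = []
    for det in detections:
        scored = dict(det)  # copy so the input list is left unchanged
        scored["mask_score"] = det["cls_score"] * det["pred_mask_iou"]
        rescored.append(scored)
    return sorted(rescored, key=lambda d: d["mask_score"], reverse=True)


# Hypothetical detections: "a" is classified confidently but poorly segmented.
detections = [
    {"id": "a", "cls_score": 0.99, "pred_mask_iou": 0.50},
    {"id": "b", "cls_score": 0.90, "pred_mask_iou": 0.80},
]
ranked = rescore_masks(detections)
```

With classification confidence alone, detection "a" would rank first; after rescoring, the better-segmented detection "b" ranks higher, which is the behavior described for occluded fire targets.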

To demonstrate the importance of the novel MaskIoU branch in our model, we conducted a series of ablation experiments. According to the ablation results in Table 5, the MaskIoU branch significantly improves the segmentation quality of forest fires (mIoU). In particular, after adding the novel MaskIoU branch with the U-shaped network, our mIoU reached 80.77%. Meanwhile, introducing the attention mechanism (DSA module) into the ResNet backbone can further mine the intrinsic information of the features and improve the performance of the model. In addition to evaluating the segmentation performance of the comparative methods, we also calculated the model size and running time. Our novel MaskIoU branch has about 0.63 G FLOPs, compared with 0.39 G FLOPs in MS R-CNN. We used one 3070 GPU to assess the running time (s/frame). With the DSA-ResNet50 backbone, the running time is about 0.235 for Mask R-CNN and 0.238 for both MS R-CNN and MaskSU R-CNN. Hence, the additional computation cost of the MaskIoU branch is negligible.
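The claim that the extra cost is negligible can be checked with simple arithmetic on the reported figures. The sketch below uses only the running times (s/frame) and branch FLOPs quoted above; the helper name is hypothetical.

```python
def relative_overhead(baseline, candidate):
    """Fractional increase of candidate over baseline."""
    return (candidate - baseline) / baseline


# Reported running times in seconds per frame (DSA-ResNet50 backbone).
t_mask_rcnn = 0.235
t_masksu = 0.238

time_overhead = relative_overhead(t_mask_rcnn, t_masksu)  # about 1.3%
fps_mask_rcnn = 1 / t_mask_rcnn                           # about 4.26 frames/s
fps_masksu = 1 / t_masksu                                 # about 4.20 frames/s

# Reported FLOPs of the MaskIoU branch, in GFLOPs.
extra_gflops = 0.63 - 0.39                                # about 0.24 GFLOPs more

print(f"time overhead: {time_overhead:.2%}")
print(f"throughput: {fps_mask_rcnn:.2f} vs {fps_masksu:.2f} frames/s")
print(f"extra branch cost: {extra_gflops:.2f} GFLOPs")
```

The reconstructed branch adds roughly 0.24 GFLOPs over the MS R-CNN branch, while the per-frame running time rises by about 1.3% relative to Mask R-CNN and is unchanged relative to MS R-CNN.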

Discussion
In contrast to fixed-form objects, forest fires are dynamic objects with variable shapes and hard-to-depict textures [52]. A forest fire usually begins as a small-scale fire, develops into a medium-scale fire, and then becomes a large-scale fire. Typologically, it starts as a ground fire, then spreads to the trunk, and finally to the tree crown [53,54]. Therefore, the detection of incipient fires is particularly important. Unfortunately, there is little research on this aspect, and most existing research data sets consist of images of medium-scale or large-scale fires. To address this gap, we adopted a forest-fire data set based on UAV aerial photography, with small fire points and strong flame concealment, which closely simulates incipient fires. This study proposed a novel method that improves an existing instance segmentation model in order to provide more accurate fire-behavior data; it exploits both the shallow information and the deep higher-order semantics in image features and achieves high-precision recognition of incipient forest fires.
In terms of forest-fire recognition, previous methods have advantages in detection speed and accuracy [55]. However, difficulties arise under complicated conditions: capturing fires from a drone's perspective increases the misdetection rate, and inconspicuous fire points with a small target or high concealment are not easily discovered. To address these issues, we reconstructed the MaskIoU branch of the existing MS R-CNN model by adding a U-shaped network. Specifically, the improved branch cascades feature maps of the same size from the encoding and decoding phases, allowing for better integration of pixel location features in the shallow network and pixel category features in the deep network, which provides some correction for edge pixels of forest-fire targets.
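The skip connection at the heart of a U-shaped network can be illustrated with a toy sketch. It assumes that "cascades" refers to channel-wise concatenation of same-size encoder and decoder feature maps, as in U-Net; it contains no learned convolutions and is not the authors' MaskIoU branch. All function names are hypothetical, and feature maps are plain nested lists with even height and width.

```python
def max_pool2(fmap):
    """2 x 2 max pooling on one H x W map (H and W assumed even)."""
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]), 2)]
            for i in range(0, len(fmap), 2)]


def upsample2(fmap):
    """Nearest-neighbour 2x upsampling on one map."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def skip_concat(encoder_maps):
    """Concatenate encoder maps with decoder maps of the same spatial size."""
    pooled = [max_pool2(m) for m in encoder_maps]  # encoder: lower resolution
    decoded = [upsample2(m) for m in pooled]       # decoder: restored resolution
    return encoder_maps + decoded                  # channel-wise concatenation


shallow = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
fused = skip_concat([shallow])  # two channels, both 4 x 4
```

The first channel keeps the precise pixel positions of the shallow map, while the second carries the coarser, more aggregated response; combining the two is what allows edge pixels to be corrected.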
To illustrate the soundness of the proposed model, our MaskSU R-CNN is compared with the original models Mask R-CNN and Mask Scoring R-CNN from several perspectives. The convergence comparison in Figure 9 shows that the overall training loss of our method is slightly lower than that of the other two models with the same training samples. The visualization comparison in Figure 11 reveals that the segmentation mask of our method matches the actual shape of the forest fire most closely and has obvious advantages in processing forest-fire edge pixels. The quantitative comparison in Table 5 shows that our method achieves state-of-the-art performance in terms of both detection accuracy and segmentation quality. In addition, the fixed structure of our MaskSU R-CNN allows for end-to-end training. Therefore, it is feasible to prune our method and deploy it to mobile devices while preserving recognition accuracy.
Future research will focus on the model's recognition capacity in satellite remote-sensing imagery and on the fusion of satellite multimodal data for forest-fire detection.

Conclusions
In this study, we present two solutions regarding the classification and segmentation of forest-fire images, with the main contributions as follows: (1) we design a novel attention mechanism (DSA module) to enhance the representation ability of feature channels and further improve the classification accuracy of incipient forest fires; (2) we merge the DSA module into ResNet as the backbone network of the instance segmentation model to improve the feature extraction capability; and (3) we reconstruct the MaskIoU branch of MS R-CNN using a U-shaped network, aiming to reduce the segmentation error. Experiments show that our MaskSU R-CNN outperforms many state-of-the-art segmentation models with a precision of 91.85%, recall 88.81%, F1-score 90.30%, and mIoU 82.31% in incipient forest-fire detection and segmentation. Our method, with its flexible structure and excellent performance, represents a shift toward the possibility of unmanned fire monitoring in a large area of forest.

Data Availability Statement:
This work uses the publicly available dataset FLAME, see reference [28] for data availability. More details about the data are available under Section 2.1.