MTL-FFDET: A Multi-Task Learning-Based Model for Forest Fire Detection

Abstract: Deep learning-based forest fire vision monitoring methods have developed rapidly and are becoming mainstream. The existing methods, however, depend on enormous amounts of data and suffer from weak feature extraction, poor small-target recognition, and many missed and false detections in complex forest scenes. To solve these problems, we propose a multi-task learning-based forest fire detection model (MTL-FFDet), which contains three tasks (detection, segmentation and classification) and shares the feature extraction module. In addition, to improve detection accuracy and decrease missed and false detections, we propose a joint multi-task non-maximum suppression (NMS) processing algorithm that fully utilizes the advantages of each task. Furthermore, considering the objective fact that divided flame targets in an image are still flame targets, our proposed data augmentation strategy of a diagonal swap of random origin is a good remedy for the poor detection caused by small fire targets. Experiments showed that our model outperforms YOLOv5-s in terms of mAP (mean average precision) by 3.2%, AP_S (average precision for small objects) by 4.8%, AR_S (average recall for small objects) by 4.0%, and other metrics by 1% to 2%. Finally, visualization analysis showed that, during feature extraction, our multi-task model focuses on the target region better than the single-task model, with superior extraction ability.


Introduction
Forests provide a fundamental habitat for terrestrial plants and animals, and they play an important role in maintaining ecological balance. Forest fires are among the most devastating forestry disasters, harming the global carbon cycle, soil characteristics and species richness, as well as contributing to climate change [1]. Forest fires can even endanger human lives and public property, resulting in economic and resource losses. Forest fires spread swiftly and are currently difficult to effectively control or prevent [2]. Therefore, it is crucial to locate fires quickly and put them out before they become serious incidents.
Traditionally, forest fire monitoring has mainly relied on manual inspection [3] and conventional sensor monitoring [4,5]. However, manual inspection not only consumes human and material resources, but also struggles to position fires accurately. Although sensor monitoring can locate fires quickly and precisely, the expense of covering an entire forest with a wireless sensor network of traditional sensors, such as temperature, humidity, wind and rain sensors [6], is extremely high. With the rapid developments in computer vision, remote sensing and artificial intelligence, optical sensors are now widely used in forest fire detection due to their low cost, broad coverage, real-time processing and efficient recognition [2]. Many optical sensor-based forest fire detection methods have accordingly been proposed.


However, existing methods still perform poorly in detecting small flame targets. In particular, due to limitations of the visual perspective, the flame subject may not be in the middle of the image but rather distributed at the edges as a small and incomplete target. If these small as well as incomplete targets can be effectively detected, the performance of forest fire detection models will greatly improve. Therefore, our model focuses on improving the abilities of feature extraction, few-shot learning and small-target detection.
The main contributions of our paper are three-fold:

•
We propose a multi-task learning-based forest fire detection model (MTL-FFDet) that involves three tasks to enhance its feature extraction and its learning from small samples for better performance.

•
In order to minimize the occurrence of missed and false detections in complex forest scenes, a joint multi-task NMS processing algorithm is proposed to filter out redundant and poor-quality prediction boxes.

•
A data augmentation approach involving a diagonal swap of random origin is proposed to increase the number of small targets and improve the detection performance for small flame targets, particularly incomplete small flame targets at the edge of the viewpoint.
The rest of this paper is organized as follows: in Section 2, we introduce our data set, the MTL-FFDet model and its optimizations in detail; in Section 3, we provide the experimental results and visualization analysis. The discussion and conclusions are presented in Sections 4 and 5.

Data set and Annotation
The preparation of the data set is an essential component of implementing the algorithm in this paper. We collected images from several public forest fire data sets, including VisiFire [29], ForestryImages [30], FiSmo [31], BowFire [32], Firesense [33] and EFD-Dataset [34]. This self-built forest fire data set contains day fire, night fire, aerial fire, fixed-shot fire, mountain fire, surface fire, trunk fire, canopy fire, etc., as well as natural forest images with disturbances. The diversity of the data set enables the algorithm to generalize better in complex forest environments. Our data set contains a total of 6595 images, of which 3987 are images with forest fires and 2608 are non-fire images in forest scenes. We randomly divided the entire data set into a training set and a validation set in the ratio of 8:2 for the training and testing processes, respectively. Some representative samples are shown in Figure 1.
Our detection model is based on the multi-task learning proposed in this paper and implements three tasks: object detection, semantic segmentation and image classification. Each image in our data set was therefore annotated with three types of annotations. The labeling for image classification was done during collection, while the annotation for the other two tasks was slightly more complex and required some caveats. For object detection, we framed the flame region with a rectangular box using the LabelImg tool [35]; it is worth noting that the four boundaries of the annotation had to fit the flame target with an error of no more than two pixels. For segmentation, which is a pixel-level classification, we outlined the flame target with a polygon annotation using the Labelme tool [36]. Annotation examples are shown in Figure 2.
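The three annotation types per image can be pictured with a small sketch. This is a minimal, hypothetical container (the field names and the `box_fits` helper are illustrative, not the paper's actual annotation format); it also checks the stated two-pixel fitting tolerance for detection boxes against a polygon outline.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical container for the three annotation types described above;
# field names are illustrative, not the paper's actual file format.
@dataclass
class FireAnnotation:
    is_fire: bool                              # image-level class label
    boxes: List[Tuple[int, int, int, int]]     # (xmin, ymin, xmax, ymax) from LabelImg
    polygons: List[List[Tuple[int, int]]]      # flame outlines from Labelme

ann = FireAnnotation(
    is_fire=True,
    boxes=[(120, 80, 260, 210)],
    polygons=[[(120, 95), (180, 80), (260, 150), (200, 210), (130, 190)]],
)

# A box annotation is only useful if it tightly fits the flame; the paper
# allows at most two pixels of slack on each boundary.
def box_fits(box, polygon, tol=2):
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    tight = (min(xs), min(ys), max(xs), max(ys))
    return all(abs(b - t) <= tol for b, t in zip(box, tight))

print(box_fits(ann.boxes[0], ann.polygons[0]))
```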


MTL-FFDet
Multi-task learning (MTL) seeks to improve generalization and feature extraction by drawing on the domain-specific information found in related tasks; this approach is in contrast to traditional single-task learning, which strives to fulfill a task using a particular model [37]. MTL brings several other advantages. It contains an implicit data augmentation mechanism that can improve the representation by averaging the noise patterns of the different tasks [38]. Meanwhile, it may be challenging for a model to separate useful from irrelevant features when a task is very noisy, has little data, or is high-dimensional [39]; MTL can help the model concentrate on key features, since the other tasks provide additional evidence for their relevance or irrelevance. Additionally, inductive bias is introduced as a regularization term [40-42]: both the risk of overfitting and the complexity of the model decrease when several related tasks share complementary information and serve as regularizers for one another. Furthermore, by sharing layers across multiple tasks, duplicated computation is minimized, inference speed is increased and memory utilization is decreased.
In this paper, our MTL-FFDet model consists of three tasks: forest fire object detection, forest fire semantic segmentation and image classification. The detection task is the primary task, while the other two are secondary tasks. The three tasks share the convolutional neural network-based backbone for better extraction of fire features, while the detection and segmentation tasks also share the multi-scale fusion network for better feature expression. The shared layers boost both the learning capacity on a small number of samples and the feature extraction capabilities for forest fires, leading to greater performance and generalization. The architecture of our MTL-FFDet is shown in Figure 3, and the design details of the backbone, neck and head are specified below.
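The hard parameter sharing described above can be sketched in a few lines. This is a toy numeric stand-in, not the actual convolutional model: every function below is an illustrative placeholder, but the control flow mirrors the architecture (one shared backbone pass per image, a fusion neck shared by detection and segmentation, and three task heads).

```python
# Toy numeric sketch of hard parameter sharing: one shared backbone feeds
# three task heads, so its features are computed only once per image.
# All functions are illustrative stand-ins for the convolutional modules.
def backbone(image):                 # shared feature extraction
    return [x * 2 for x in image]

def neck(feats):                     # shared multi-scale fusion (detection + segmentation)
    return [x + 1 for x in feats]

def detect_head(fused):              # primary task: predict a "box score"
    return max(fused)

def segment_head(fused):             # secondary task: per-"pixel" labels
    return [x > 3 for x in fused]

def classify_head(feats):            # secondary task: image-level fire/no-fire
    return sum(feats) > 10

def mtl_ffdet(image):
    feats = backbone(image)          # computed once, reused by every head
    fused = neck(feats)
    return detect_head(fused), segment_head(fused), classify_head(feats)

print(mtl_ffdet([1, 2, 3]))
```

Because the backbone and neck run once regardless of how many heads consume them, inference cost grows only with the small task-specific heads, which is the efficiency argument made above.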

Backbone
The backbone is used as a shared layer to extract features for later modules and serves as the encoder of our MTL-FFDet model. In our backbone design, the cross-stage-partial (CSP) bottleneck [43] with three convolutions (C3) is used due to its excellent performance in YOLOv5. Additionally, influenced by the lightweight network architecture GhostNet [44], we replace the Conv module with the GhostConv module, which uses fewer parameters at the same precision. In our tests, the overall parameter count dropped from 8.9 M to 4.8 M, a reduction of nearly one-half, while accuracy was maintained. Figure 4 illustrates the backbone architecture. The input image size was set to 416 × 416 pixels so that the model could achieve a better trade-off between inference latency and accuracy. In the first layer of the backbone, the Conv module converts information from the width and height dimensions to the channel dimension, which expands the receptive field without losing features from the raw image. After that, the GhostConv module and the C3Ghost module are stacked to make the network deeper and extract more exact features. The final spatial pyramid pooling-faster (SPPF) structure [45] handles different input sizes that would otherwise produce different output dimensions, and reduces the risk of overfitting.
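The roughly one-half parameter reduction from swapping Conv for GhostConv can be illustrated with back-of-the-envelope counts. A Ghost module produces a fraction of the channels with an ordinary convolution and the rest with cheap depthwise operations; the layer sizes below are illustrative, not MTL-FFDet's actual configuration, and bias terms are ignored.

```python
# Rough parameter-count comparison between a standard convolution and a
# Ghost module (GhostNet), illustrating why replacing Conv with GhostConv
# roughly halves the backbone's parameters. Illustrative sizes only.
def conv_params(c_in, c_out, k=3):
    return k * k * c_in * c_out

def ghost_params(c_in, c_out, k=3, d=3, s=2):
    primary = k * k * c_in * (c_out // s)      # ordinary conv, 1/s of the filters
    cheap = d * d * (c_out // s) * (s - 1)     # depthwise "cheap" operations
    return primary + cheap

c_in, c_out = 128, 256
print(conv_params(c_in, c_out))                               # 294912
print(ghost_params(c_in, c_out))                              # 148608
print(ghost_params(c_in, c_out) / conv_params(c_in, c_out))   # ~0.50
```

With the default ratio s = 2, the Ghost module costs just over half the parameters of the standard convolution, consistent with the 8.9 M to 4.8 M drop reported above.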


Neck
The neck fuses the feature maps with 8×, 16× and 32× down-sampling rates from the backbone. Our neck network follows the design ideas of the feature pyramid network (FPN) [46] and the path aggregation network (PAN) [47], which perform top-down fusion and bottom-up fusion, respectively, as shown in Figure 5. The fusion combines the semantic features and spatial features of a forest scene to obtain better feature expression.


Head
The head serves as a decoder for our MTL-FFDet to solve the different tasks. Our model has three tasks, so we designed a specific head for each task, as shown in Figure 6.
The segmentation head takes the fused feature map from the neck and up-samples it to the original size (416 × 416). The number of output channels represents the categories to which the pixels can belong. In our segmentation task, background and flame are the two categories.
For the detection head, we adopted an anchor-based scheme for multi-scale detection, similar to the YOLO series [22,48]. The first two output dimensions represent the grid size at each scale, and the third dimension represents the number of anchors. The last dimension represents the number of predictions, including the coordinates of the predicted bounding box, the confidence score and the categories of the detected object.
The classification head is used to do binary classification on the input image, in order to distinguish whether the image is a forest fire image or not.
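The output shapes of the three heads follow directly from the description above. This sketch assumes a 416 × 416 input, 3 anchors per scale and a single object class ("flame"); these are plausible YOLO-style settings, not an exact dump of MTL-FFDet's implementation.

```python
# Sketch of the output tensor shapes for the three heads, assuming a
# 416 x 416 input, 3 anchors per scale and 1 object class ("flame").
def detect_shape(img=416, stride=8, anchors=3, classes=1):
    grid = img // stride
    # per anchor: 4 box coordinates + 1 confidence + class scores
    return (grid, grid, anchors, 4 + 1 + classes)

def segment_shape(img=416, classes=2):   # background + flame
    return (classes, img, img)

def classify_shape(classes=2):           # fire / no-fire
    return (classes,)

for s in (8, 16, 32):                    # the three fused scales from the neck
    print(detect_shape(stride=s))        # (52,52,3,6), (26,26,3,6), (13,13,3,6)
print(segment_shape())
print(classify_shape())
```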


Loss Function
The design of the loss function is crucial for the training of deep neural networks. Since there are three tasks in our model, the multi-task loss contains three parts: the detection loss L_det, the segmentation loss L_seg and the classification loss L_cls.
Considering that the detection task not only needs to identify the fire but also needs to frame the fire region with a rectangle, the detection loss L_det is composed of the object classification loss L_dcls, the object confidence loss L_dobj and the bounding box loss L_dbox, as a weighted sum of the three, as shown in Equation (1):

L_det = α1 · L_dcls + α2 · L_dobj + α3 · L_dbox, (1)

where L_dcls and L_dobj use the weighted binary cross-entropy (W-BCE) [49] loss to cope with the sample imbalance problem, and L_dbox uses the complete intersection over union (CIoU) [50] loss, which takes the overlap rate, the distance and the aspect ratio between the predictions and the ground truth into overall consideration.
The segmentation loss L_seg considers two metrics. Firstly, we want each pixel to be classified correctly; secondly, we expect fewer false positives (FP) and true negatives (TN). Its expression can be written as Equation (2):

L_seg = β1 · L_scls + β2 · L_siou, (2)

where L_scls uses the cross-entropy (CE) [51] loss and L_siou = 1 − TP/(FP + TP + TN). Similarly, we use the CE loss as the classification loss L_cls. Thus, our total loss is a weighted sum of the three task losses, as shown in Equation (3):

L_total = λ1 · L_det + λ2 · L_seg + λ3 · L_cls, (3)

where α1, α2, α3, β1, β2, λ1, λ2 and λ3 in the preceding three equations are adjustable parameters used to balance each loss.
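The three weighted sums are straightforward; a numeric sketch using the coefficient values reported in the Training section makes the balancing concrete. The individual loss values below are made-up placeholders, not measured training losses.

```python
# Numeric sketch of the weighted multi-task loss in Equations (1)-(3),
# using the coefficient values from the Training section; the individual
# loss values are made-up placeholders.
def det_loss(l_dcls, l_dobj, l_dbox, a=(0.5, 1.0, 0.05)):
    return a[0] * l_dcls + a[1] * l_dobj + a[2] * l_dbox        # Eq. (1)

def seg_loss(l_scls, l_siou, b=(0.5, 0.4)):
    return b[0] * l_scls + b[1] * l_siou                        # Eq. (2)

def total_loss(l_det, l_seg, l_cls, lam=(1.0, 0.1, 0.2)):
    return lam[0] * l_det + lam[1] * l_seg + lam[2] * l_cls     # Eq. (3)

l_det = det_loss(0.8, 0.6, 2.0)          # 0.4 + 0.6 + 0.1  = 1.1
l_seg = seg_loss(0.5, 0.3)               # 0.25 + 0.12      = 0.37
loss = total_loss(l_det, l_seg, 0.4)     # 1.1 + 0.037 + 0.08 = 1.217
print(round(loss, 3))
```

Backpropagating this single total loss is the second of the two training schemes discussed in the Training section, and the detection weight λ1 = 1 dominating λ2 and λ3 reflects detection being the primary task.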

Joint Multi-Task NMS
The complex environment of the forest makes the characteristics of forest fires change. For example, irregularities in the shape of the flame target due to leaf and branch shading can render previously learned shape features inapplicable. Another complicating factor is that overexposure or underexposure in strong or low light causes losses in the texture and color characteristics of forest fire images. These external changes inevitably produce many situations that lead to missed and false detections. Considering that our model is composed of multiple tasks, we combined their respective characteristics to obtain better performance and reduce false and missed detections. Although the segmentation and detection tasks share the feature extraction and feature fusion modules, the tasks are different and so is their utilization of features: the semantic segmentation task is pixel-based and focuses more on fine-grained category differentiation, while the object detection task is regression box-based and focuses more on regional differentiation. As shown in Figure 7, the false and missed detections that occur in the detection task are well identified by the segmentation task, while the missed detections that occur in the segmentation task are well detected by the detection task.
Facing this problem, we propose an improved non-maximum suppression method, the joint multi-task non-maximum suppression (JM-NMS) algorithm, which filters redundant and low-confidence boxes in conjunction with the segmentation task.
First, for each generated prediction box, we calculate the pixel ratio R_obj of the object within the box using Equation (4):

R_obj = n_obj / N_t, (4)

where N_t represents the number of all pixels in the prediction box, and n_obj represents the number of pixels belonging to the flame object.

Second, we map the pixel ratio R_obj onto a nonlinear space. We use a sigmoid function, as shown in Equation (5), as our mapping function:

f(R_obj) = 1 / (1 + e^(−(a · R_obj − b))), (5)

where a and b are the scaling and bias factors, respectively. In our experiments, we found that the effect is best when a = 6 and b = 1. The function graph is shown in Figure 8. The majority of the high-quality prediction boxes are maintained when R_obj is larger than 0.5, since their coefficients are larger and differ less from one another. When R_obj is less than 0.5, however, the coefficients are lower and more spread out, making it easier to censor prediction boxes of poor quality. Additionally, the coefficients are not zero when R_obj is close to 0, preventing certain normal boxes from being incorrectly censored as a result of poor segmentation results. Based on this, we improved the original NMS [52] algorithm from confidence ranking to mixed score ranking. The mixed score P_obj is calculated by Equation (6):

P_obj = S_obj · f(R_obj), (6)

Our JM-NMS algorithm is shown in Algorithm 1.
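The JM-NMS idea can be sketched as follows, under stated assumptions: `mask` is the segmentation output (1 = flame pixel), boxes are integer (x1, y1, x2, y2) tuples with detection confidences, and the suppression loop is standard greedy NMS; only the ranking score is replaced by the mixed score. This is an illustrative reconstruction, not the paper's Algorithm 1 verbatim.

```python
import math

# Sketch of JM-NMS: rank boxes by the mixed score P_obj = S_obj * f(R_obj)
# instead of the raw confidence S_obj, then suppress greedily by IoU.
def pixel_ratio(box, mask):                       # Eq. (4): R_obj = n_obj / N_t
    x1, y1, x2, y2 = box
    total = (x2 - x1) * (y2 - y1)
    flame = sum(mask[y][x] for y in range(y1, y2) for x in range(x1, x2))
    return flame / total

def coeff(r, a=6.0, b=1.0):                       # Eq. (5): sigmoid mapping
    return 1.0 / (1.0 + math.exp(-(a * r - b)))

def iou(p, q):
    ix = max(0, min(p[2], q[2]) - max(p[0], q[0]))
    iy = max(0, min(p[3], q[3]) - max(p[1], q[1]))
    inter = ix * iy
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(p) + area(q) - inter)

def jm_nms(boxes, scores, mask, iou_thr=0.5):
    mixed = [s * coeff(pixel_ratio(b, mask)) for b, s in zip(boxes, scores)]  # Eq. (6)
    order = sorted(range(len(boxes)), key=lambda i: -mixed[i])
    keep = []
    while order:
        i = order.pop(0)                          # highest remaining mixed score
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < iou_thr]
    return keep
```

A box with a high confidence but almost no flame pixels inside it (R_obj near 0) gets a coefficient of roughly 0.27 rather than 0, so it is heavily down-ranked but not automatically discarded when the segmentation output happens to be poor.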
In Equation (6), S_obj represents the confidence score of the prediction box.

In our study, we found that current algorithms are less effective at recognizing small flame targets, including those formed by occlusion and those appearing at the edges of the image due to viewpoint limitations. This is caused by two factors: firstly, the number of pixels that can represent small-target features is low; secondly, the number of small targets in the forest fire data set is low. Thus, we propose a new data augmentation approach, a diagonal swap of random origin, to enhance the identification of small targets in forest fire detection from the perspective of the data set. A diagram of this approach is shown in Figure 9. The approach is comparatively simple but practical. Initially, we determine whether the image has only one target label and whether that target is a medium-sized object with an area between 32² and 96² pixels. The next step is to generate a random origin in the area between 25% and 75% of the label width and height (the green area in Figure 9); the image is then divided into four parts based on this origin. The two diagonal parts are then swapped to form a new image. This approach not only increases the number of small targets, but also increases the number of scenarios with incomplete flame targets due to viewpoint limitations, improving the generalization of the model. Note that the data augmentation is applied randomly, and its probability should vary with the data set; it should not be too high, generally between 10% and 20%.
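The augmentation above can be sketched in a few lines. This is a toy version that treats the image as a list of rows of pixel values; `random_origin` and `diagonal_swap` are illustrative helper names, not the paper's code.

```python
import random

# Toy sketch of the diagonal-swap augmentation: pick an origin inside the
# central 25-75% band of the single label's extent, quarter the image
# there, and swap the diagonal quarters.
def random_origin(box, rng=random):
    x1, y1, x2, y2 = box
    ox = rng.randint(x1 + (x2 - x1) // 4, x1 + 3 * (x2 - x1) // 4)
    oy = rng.randint(y1 + (y2 - y1) // 4, y1 + 3 * (y2 - y1) // 4)
    return oy, ox

def diagonal_swap(img, origin):
    oy, ox = origin
    tl = [row[:ox] for row in img[:oy]]
    tr = [row[ox:] for row in img[:oy]]
    bl = [row[:ox] for row in img[oy:]]
    br = [row[ox:] for row in img[oy:]]
    # swap top-left <-> bottom-right and top-right <-> bottom-left:
    # the new image is [BR | BL] stacked on top of [TR | TL]
    return [b + a for b, a in zip(br, bl)] + [b + a for b, a in zip(tr, tl)]
```

Applied to an image whose single medium-sized flame straddles the origin, the swap splits the flame into up to four smaller, partly truncated flame regions at the new image borders, which is exactly the small/incomplete-target case the augmentation aims to enrich.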


Training
Our forest fire detection model, MTL-FFDet, was built with PyTorch (v1.7.1, Facebook AI Research, New York, NY, USA) and trained on an NVIDIA RTX 3080 (NVIDIA Corporation, Santa Clara, CA, USA) with 10 GB of memory. In training, we used warmup and a cosine learning rate schedule, which help to mitigate early overfitting, keep the distribution smooth and maintain the stability of a model with deep layers. There are two ways to train multiple tasks: one is to backpropagate the loss of each task step by step; the other is to backpropagate the total loss, a weighted sum of the individual task losses. In terms of performance the two approaches are about the same, but the second is more convenient. The coefficients α1, α2, α3, β1, β2, λ1, λ2 and λ3 in the total loss (see Section 2.2.4) were set to 0.5, 1, 0.05, 0.5, 0.4, 1, 0.1 and 0.2, respectively. The other training parameters are listed in Table 1.
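The warmup plus cosine learning-rate schedule mentioned above can be sketched as a single function. The warmup length, base rate, final rate and epoch count below are illustrative values, not the paper's Table 1 settings.

```python
import math

# Sketch of a linear-warmup + cosine-decay learning-rate schedule;
# all numeric hyperparameters are illustrative assumptions.
def lr_at(epoch, total=100, warmup=3, base_lr=0.01, final_lr=0.0001):
    if epoch < warmup:                       # linear warmup from near zero
        return base_lr * (epoch + 1) / warmup
    # cosine decay from base_lr down to final_lr over the remaining epochs
    t = (epoch - warmup) / max(1, total - warmup - 1)
    return final_lr + 0.5 * (base_lr - final_lr) * (1 + math.cos(math.pi * t))

print(lr_at(0))    # small rate during warmup
print(lr_at(2))    # last warmup epoch reaches base_lr
print(lr_at(99))   # end of training decays to final_lr
```

The short warmup avoids large early updates on randomly initialized heads (mitigating early overfitting), while the smooth cosine tail keeps the deep shared layers stable late in training.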

Comparisons
In order to demonstrate the performance of our MTL-FFDet model, we used the validation set to evaluate each task. In our experiments, our model was not only compared with other models, but its single-task and multi-task detection performance were also compared (see Table 2 and Figure 10). Since the detection task is the principal task of our model while the other two act as auxiliaries, performance improvement in the detection task was considered the most important. It is worth noting that when training our model on a single task, it was necessary to freeze the other two task branches. The subscript 50 means under the condition of IoU = 0.50, and the subscripts S, M and L represent small objects (area < 32²), medium objects (32² < area < 96²) and large objects (area > 96²), respectively. Note that mAP, AP_50, AP_S, AP_M, AP_L, mAR, AR_S, AR_M and AR_L are all percentages, and that speed was tested on an NVIDIA RTX 3080. In Table 2, bolded numbers indicate the best performance in the comparison; + marks ablation experiments on the basis of MTL-FFDet (multi-task); * marks MTL-FFDet with the detection (Det) task only.
Microsoft COCO criteria [54] is widely used to evaluate object detection tasks.In the criteria, there are two main metrics, average precision (AP) and average recall (AR); their formulas are shown in Equations ( 7)- (10).
where TP, FP and FN represent the numbers of true positives, false positives and false negatives, respectively, and P and R represent precision and recall, respectively. The variable n represents the number of recall levels (the criteria use 11 levels, ranging from 0.0 to 1.0 in intervals of 0.1), and o represents the IoU between the prediction box and the ground-truth box.
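Since Equations (7)–(10) are not reproduced here, the standard definitions just described can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function names are ours, and the 11-point interpolation follows the recall levels stated above:

```python
def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP), R = TP / (TP + FN)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def ap_11_point(pr_pairs):
    """11-point interpolated AP: average, over recall thresholds
    t in {0.0, 0.1, ..., 1.0}, of the maximum precision among
    (precision, recall) pairs whose recall is at least t."""
    ap = 0.0
    for t in (i / 10 for i in range(11)):
        prec = [p for p, r in pr_pairs if r >= t]
        ap += max(prec) if prec else 0.0
    return ap / 11
```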
As shown in Table 2, our MTL-FFDet model using the data augmentation and JM-NMS proposed in this paper is significantly better than the other methods, including YOLOv5-s, YOLOv3-tiny and NanoDet-g, reaching 56.3% in mAP. Compared with single-task learning, the shared feature extraction module for multi-task learning yielded excellent performance, with an improvement of 3.6%. With the data augmentation strategy of the diagonal swap of random origin, the detection of small targets greatly improved, by 4.8% in APS and 4.0% in ARS (compared with YOLOv5-s). In addition, the joint multi-task NMS processing algorithm allowed some missed and false targets to be correctly identified, improving most metrics by roughly 1% to 2%. Our multi-task model (Figure 10a,e) showed advantages over the other models in both the detection task and the segmentation task, especially in the recognition of small fire targets (IMG1, IMG3 and IMG4). It also reduced the missed and false detections, such as the firefighter in IMG2 and the red hat in IMG3, as a result of the joint multi-task NMS processing algorithm. According to the segmentation results (Figure 10e–g), the fire target profile is more precise and fire-like targets are effectively distinguished. For the other two tasks, we used three metrics, Acc, IoU and mIoU [55], to evaluate the performance of the segmentation task, and the P, R and F1-score [56] metrics for the classification task. As seen in Table 3, though these two tasks are not the main task, their metrics still show a slight improvement compared with other models. The evaluation metrics are calculated by Equations (11)–(14), where Np represents the number of all pixels in the test image and k represents the number of classes.
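The segmentation metrics can be illustrated with a minimal sketch of per-class IoU and mIoU over flattened label maps (a generic implementation of the standard definitions, not the authors' code):

```python
def segmentation_iou(pred, gt, cls):
    """Per-class IoU between two flat label lists:
    |pred ∩ gt| / |pred ∪ gt| for the pixels labeled cls."""
    inter = sum(1 for p, g in zip(pred, gt) if p == cls and g == cls)
    union = sum(1 for p, g in zip(pred, gt) if p == cls or g == cls)
    return inter / union if union else 0.0

def mean_iou(pred, gt, num_classes):
    """mIoU: the per-class IoU averaged over the k classes."""
    return sum(segmentation_iou(pred, gt, c) for c in range(num_classes)) / num_classes
```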

Visualization and Analysis
In order to investigate why our multi-task model, MTL-FFDet, performs better than the single-task model, we used Eigen-CAM [57] for visualization. Considering that the biggest difference between the multi-task and single-task models is the shared feature extraction module, the final layer of that module, the eighth layer (see Section 2.2.1), was analyzed for feature visualization. As shown in Figure 11, the multi-task model focused more precisely on the flame region, while the single-task model was limited to extracting the rough flame region and contained some redundant features. This difference in the accuracy of flame feature extraction led to the discrepancy in the subsequent feature fusion and detection modules of the two models.

Discussion
Forest fire detection is more challenging to carry out than other vision inspection tasks (e.g., face detection, defect detection, lane line detection, etc.). The irregular shape of the flame target varies from moment to moment, and the interference of the many features brought about by the complex forest environment makes the detection task difficult. A delayed or even missed detection may turn into a large-scale fire, causing devastating losses. Therefore, the use of computer vision technology instead of manual inspection is an advantageous and feasible solution, and improving its detection accuracy is one of the key objectives.
In ideal circumstances, color gamut-based studies [8,9] are indeed straightforward and efficient for the visual detection of forest fires, as the backgrounds in forests mostly contrast with flame targets. However, many misidentifications are brought on by changes in seasons, lighting and the environment, which make forest fire detection systems malfunction. Consequently, numerous researchers have added shape [58], texture [59] and spatio-temporal [60] features in order to reduce the occurrence of missed and false detections and improve the reliability of the forest fire detection task. Similarly, from the perspective of reducing missed and false detections, Xu et al. [26] used an ensemble of three deep-learning models, which achieved higher accuracy. However, this ensemble learning-based model is too large to deploy on edge computing devices for real-time detection. Real-time, high-accuracy detection models enable early identification and early warning, which are critical in controlling the development and spread of fires in the field of forest fire safety.
In order to improve detection accuracy, we began in this paper by improving feature extraction. We proposed a novel multi-task learning-based forest fire detection model, built with hard parameter sharing [41], i.e., using the same feature extraction layers. The aim is to extract more accurate feature information through the joint learning of multiple tasks (the detection task, the segmentation task and the classification task), especially when the sample size is not particularly large. Through a series of experiments (Tables 2 and 3), our model achieved better results than other common models in each task, with improvements in most metrics. With its lightweight design, the shared feature extraction module reduces memory usage and enables efficient real-time detection. In addition, the visualization analysis (Figure 11) of our model in the multi-task versus single-task settings showed that multi-task learning in the feature extraction backbone can focus better and more accurately on the flame region.
Furthermore, we proposed two strategies to further improve flame detection accuracy. Considering that our model produces outputs from three tasks, the joint multi-task NMS processing algorithm was proposed to make full use of them and consequently reduce the occurrence of missed and false detections. In an experimental comparison, mAP improved by 2.5%, while other metrics achieved increments ranging from 0.5% to 2%. The other strategy was the diagonal swap of random origin for data augmentation. The detection of small targets has always been a tricky problem in vision-based detection, and small forest fire targets are often obscured by trees or captured incompletely at image edges due to view limitations. The limited pixel representation of small flame targets, and the few small targets contained in the forest fire data set, cause poor performance in extracting and learning small flame targets. Considering the objective fact that divided flame targets in images are still flame targets, our proposed data augmentation strategy compensates well for the small number of small flame targets and improves the APS and ARS metrics by 5.4% and 3.1% (compared with the original model), respectively.
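The diagonal-swap idea can be sketched roughly as follows. This is a simplified illustration on a 2-D grid; the paper's exact procedure, including the remapping of bounding boxes and masks to the new quadrant positions, is omitted here:

```python
import random

def diagonal_swap(img, origin=None):
    """Split a 2-D grid (list of rows) at a random origin point and
    exchange the two diagonal quadrant pairs (top-left <-> bottom-right,
    top-right <-> bottom-left). Flame regions cut by the split remain
    valid, smaller flame targets."""
    h, w = len(img), len(img[0])
    if origin is None:
        origin = (random.randrange(1, h), random.randrange(1, w))
    oy, ox = origin
    tl = [row[:ox] for row in img[:oy]]   # top-left quadrant
    tr = [row[ox:] for row in img[:oy]]   # top-right quadrant
    bl = [row[:ox] for row in img[oy:]]   # bottom-left quadrant
    br = [row[ox:] for row in img[oy:]]   # bottom-right quadrant
    # reassemble with the diagonal pairs swapped
    top = [br[i] + bl[i] for i in range(h - oy)]
    bottom = [tr[i] + tl[i] for i in range(oy)]
    return top + bottom
```

Because the split point is random, repeated application yields many new, smaller flame fragments from the same annotated images, which is what balances the scarcity of small targets in the data set.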
However, there are still some limitations in our research. Firstly, small target detection is fundamentally difficult to address, due to the small amount of pixel information available; our method only improves detection quality from the perspective of data balancing. Secondly, smoke is a relatively common feature of early fires, and our model is not adapted for smoke. Finally, due to the complex environments in forests, shading by vegetation can render our model useless, while traditional sensors such as temperature and humidity sensors, infrared cameras, etc., are not affected. Therefore, multi-sensor fusion is one of the main directions for future study.

Conclusions
In this study, we proposed a novel forest fire detection model, MTL-FFDet, and two improvement strategies based on it, namely the joint multi-task NMS processing algorithm and the diagonal swap of random origin data augmentation. The main contributions are as follows: (1) the feature extraction module shared by the three tasks makes the network more sensitive and attentive to flame features, improving the performance of forest fire detection; (2) the joint multi-task NMS processing algorithm takes advantage of the differences between the tasks and combines the strengths of each to reduce the occurrence of missed and false detections; (3) the number of small targets is enriched by the diagonal swap of random origin data augmentation, which greatly improves the detection accuracy of our model for small targets. Experiments show that our model substantially outperforms other models in most metrics. It also achieves an excellent trade-off between performance and efficiency, thanks to the shared backbone and lightweight convolutional modules, making it suitable for real-time detection tasks deployed on UAVs for forest fire inspection.
For further study, we intend to concentrate our research efforts on multi-sensor fusion, in order to improve the detection of forest fires, and deploy our system on low-power and high-performance edge devices.

Figure 1. Samples of the data set. (a) Forest fire images; (b) non-fire images in the forest scene.

Forests 2022, 20

Figure 2. An example of data annotation, where x and y denote the center point coordinates of the rectangular box, and w and h denote the width and height of the rectangular box, respectively.


Figure 3. The architecture of our MTL-FFDet. It contains three tasks: the detection task, the segmentation task and the classification task.


Figure 4. The architecture of the backbone in our MTL-FFDet. The parameter 'k' denotes the kernel size, 's' denotes the stride and 'g' denotes the group size.


Figure 5. The architecture of the neck in our MTL-FFDet. Its input comes from three different scales of the backbone. The up-sampling layer uses the bilinear interpolation method.


Figure 6. The architecture of the three heads in our MTL-FFDet. The inputs of the segmentation head and the detection head come from the corresponding outputs in the neck, while the input of the classification head comes from the backbone.


Figure 7. Samples of false and missed detections in the detection task and the segmentation task. The blue and green rectangles represent the missed detection and the false detection in the detection task, respectively, and the yellow rectangle represents a missed detection in the segmentation task. (a) has one missed detection and one false detection; (b) has one missed detection; (c) has two missed detections in the segmentation task.


Algorithm 1 Joint Multi-task Non-Maximum Suppression (JM-NMS)
Input: B = {b_1, ..., b_N}, S = {s_1, ..., s_N}, M = {m_1, ..., m_N}, T_nms
    B is the list of initial detection boxes
    S contains the corresponding detection scores
    M contains the corresponding segmentation areas in the detection boxes
Begin:
    R = Cal(M, B)
    C = S × P(R)
    D ← {}
    While B ≠ empty do
        i ← argmax(C)
        I ← b_i
        D ← D ∪ I; B ← B − I
        For b_j in B do
            If CIoU(I, b_j) ≥ T_nms then
                B ← B − b_j; C ← C − c_j; S ← S − s_j
            End
        End
    End
    Return D, C, S
End

2.4. Data Augmentation
Data augmentation is a common means of improving model performance when training deep learning models. In the training of our model, HSV enhancement, left-right flip, top-down flip, random scale transformation and Mosaic [53] data augmentation were used to improve model robustness, increase the training data, and avoid overfitting and sample imbalance.
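A rough Python sketch of Algorithm 1 (JM-NMS) follows. It is illustrative only: plain IoU stands in for the CIoU used in the paper, and the unspecified weighting P(R) is taken to be the coverage ratio R itself:

```python
def iou(a, b):
    """Plain IoU of two boxes given as (x1, y1, x2, y2).
    The paper uses CIoU; we simplify here."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def jm_nms(boxes, scores, seg_ratio, t_nms=0.5):
    """Greedy NMS on detection scores weighted by the segmentation
    coverage of each box (C = S × P(R), with P(R) = R as an assumption).
    Returns the indices of the kept boxes."""
    conf = [s * r for s, r in zip(scores, seg_ratio)]
    idx = sorted(range(len(boxes)), key=lambda i: conf[i], reverse=True)
    keep = []
    while idx:
        i = idx.pop(0)                    # highest remaining joint score
        keep.append(i)
        # suppress boxes overlapping the kept one beyond the threshold
        idx = [j for j in idx if iou(boxes[i], boxes[j]) < t_nms]
    return keep
```

Weighting the score by segmentation coverage is what lets the auxiliary task veto boxes the detector alone would have kept, which is the mechanism behind the reduced false detections reported above.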

Figure 9. Diagram of the diagonal swap of random origin strategy.


Table 1. Training parameters of our model.


Table 2. Comparison and ablation experiments for the detection tasks using our data set.

Table 3. Comparison results between different models in the segmentation task and the classification task.