MS-FRCNN: A Multi-Scale Faster RCNN Model for Small Target Forest Fire Detection

Abstract: Unmanned aerial vehicles (UAVs) are widely used for small target detection of forest fires because of their low risk, low cost and high ground coverage. However, the detection accuracy for small target forest fires is still not ideal because such fires have irregular shapes, vary in scale and are easily occluded by obstacles. This paper proposes a multi-scale feature extraction model (MS-FRCNN) for small target forest fire detection by improving the classic Faster RCNN target detection model. In the MS-FRCNN model, ResNet50 replaces VGG-16 as the backbone network of Faster RCNN to alleviate the gradient explosion or gradient dispersion (vanishing gradient) that VGG-16 can exhibit when extracting features. The feature map output by ResNet50 is then input into the Feature Pyramid Network (FPN), whose multi-scale feature extraction improves the ability of the MS-FRCNN to capture detailed feature information. At the same time, the MS-FRCNN uses a new attention module, PAM, in the Region Proposal Network (RPN). By running channel attention and spatial attention in parallel, PAM reduces the influence of complex image backgrounds, so that the RPN can pay more attention to the semantic and location information of small target forest fires. In addition, the MS-FRCNN model uses the soft-NMS algorithm instead of the NMS algorithm to reduce the erroneous deletion of detected boxes. The experimental results show that the proposed MS-FRCNN achieved better detection performance for small target forest fires, with a detection accuracy 5.7% higher than that of the baseline models. This indicates that multi-scale image feature extraction, combined with a parallel attention mechanism that suppresses interfering information, can improve the performance of small target forest fire detection.


Introduction
With the change in the global climate, the probability of forest fires has increased sharply. Forest fires cause great harm and loss to forest ecosystems and human life [1][2][3]. Therefore, effective forest fire monitoring is crucial for protecting forest resources and the safety of human life and property [4].
Most forest fires occur in the field environment, and the uncertainty and unpredictability of wildfires once they start increase the difficulty of putting them out [5]. In the early stage of a wildfire, the flame area is small and the spread is slow, which makes it the best time to fight the fire [6][7][8]. However, traditional wildfire detection methods have obvious limitations in actual forest fire monitoring [9]. Early research on forest fire detection was mostly based on fire heat detectors [10,11]. Some studies even used animals as mobile biosensors for forest fire detection [12]. Because of their limited detection range and high coverage cost, sensors cannot be widely installed in the field environment [13][14][15]. At the same time, sensor-based wildfire detection cannot provide effective visual information to help firefighters accurately judge the fire location or grasp the wildfire situation in real time. Satellite remote sensing images have the advantages of wide coverage and a large detection area, and in recent years they have been widely used to detect forest fires [16,17]. Flasse et al. [18] proposed a fire detection algorithm based on remote sensing images that identifies the fire area by comparing fire pixels with their neighboring pixels. Xie et al. [19] proposed a spatio-temporal context model (STCM), which detects forest fires in remote sensing images provided by the Himawari-8 satellite by analyzing the spatial and temporal dimensions of the images. Zhang et al. [20] proposed a forest smoke detection network, PDAM-STPNNet, based on remote sensing images to identify small target fire smoke. However, most of these models target large-scale fire detection, and they are difficult to apply directly to small target forest fire detection [21].
Unmanned aerial vehicles (UAVs) have the advantages of wide coverage, low cost and real-time monitoring, which make them well suited to the detection of forest fires [22]. On the basis of UAV aerial images, researchers have proposed various fire detection models, such as models based on the Faster RCNN [23], VGG (visual geometry group) [24], ResNet [25] and YOLO [26]. Zhang et al. [27] proposed a forest fire recognition model based on transfer learning to address the limited scale and manual annotation of UAV forest fire images. Jiao et al. [28] proposed a YOLOv3 forest fire detection algorithm based on UAV aerial photography data to improve the real-time detection performance for forest fires. Yuan et al. [29] proposed a UAV-based forest fire detection and tracking model to extract fire pixels and track fire areas. Guan et al. [30] proposed an early forest fire semantic segmentation method based on an MS R-CNN model to segment the fire area in UAV images. These models make full use of the advantages of UAV aerial photography data to improve forest fire detection. However, they still ignore the characteristics of small target forest fires and cannot be directly applied to their detection.
With the development of deep learning, various deep learning models have been applied to small target detection tasks in different fields. Li et al. [31] proposed a new semantic segmentation network, ICTNet, which improves the quality of the edge segmentation of objects in remote sensing images by comprehensively capturing context information. Based on a YOLOv5 model, Wu et al. [32] proposed a local fully convolutional neural network for detecting small objects in remote sensing images. Dou et al. [33] proposed a small target detection model for SAR images. Based on the YOLOv3 algorithm, Natheer et al. [34] detected the K-complex in EEG signals. However, these small target detection models pay more attention to small targets with regular shapes and cannot be directly applied to the detection of small target forest fires. Small target forest fire detection faces two problems: (1) the shape of the ignition point is irregular and its size varies; (2) small fire spots are easily covered by smoke and trees, resulting in missed or false detections. Therefore, eliminating the interference of obstacles and extracting as much information as possible about small target fire points is key to detecting small target forest fires.
Taking UAV aerial images as the data source, this paper discusses the above problems in depth. Considering the characteristics of small target forest fires, this paper proposes a detection model, MS-FRCNN, that is suited to small target forest fires and is based on an improvement of the classical Faster RCNN target detection model. The main contributions of the MS-FRCNN model are as follows.
(1) ResNet50 replaced VGG-16 as the backbone network of the MS-FRCNN to alleviate the gradient explosion and gradient dispersion that can easily occur during network training. The feature map output by ResNet50 is input into the Feature Pyramid Network (FPN). Because the FPN extracts multi-scale image features, both the deep semantic features and the shallow detail features of the image are captured as fully as possible, providing more comprehensive features for small target fire detection.
(2) A new attention module (PAM) was used in the Region Proposal Network (RPN) to distinguish the contributions of different features. PAM runs channel attention and spatial attention in parallel, which can effectively suppress the influence of complex backgrounds. This strategy helps the model pay more attention to the characteristics of the fire points and improves the accuracy of small target forest fire detection.
Forests 2023, 14, 616
(3) The soft-NMS algorithm was used instead of the NMS algorithm to filter the detection boxes, which reduces the erroneous deletion of detection boxes and improves the generalization ability of the model for small target forest fire detection.
The structure of this paper is as follows. Section 2 discusses the data set and the main components of the MS-FRCNN model. Section 3 introduces the experimental process and discusses the experimental results. Section 4 summarizes the full text.

Data Set Introduction
This paper uses the FLAME (Fire Luminosity Airborne-based Machine learning Evaluation) data set [35] as the experimental data set. This data set provides fire scene images collected by UAVs during the prescribed burning of piled debris in a pine forest in Arizona. Using FLAME-Det, the annotation for each detection frame (bounding rectangle) could be obtained automatically. Of the 2003 annotated images, 85% were selected as the training set and the remaining 15% as the test set (1703 and 300 images, respectively). K-fold cross-validation (K = 5) was used to evaluate the detection effect. Figure 1 shows sample forest fire images taken at intervals from the same video.
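The split arithmetic above can be checked with a minimal sketch. The random shuffle, the seed, the function name `split_and_folds` and the choice to build the five folds from the training portion are assumptions made for illustration; the text does not say how images were assigned or which subset the folds cover.

```python
import random

def split_and_folds(n_images=2003, train_frac=0.85, k=5, seed=0):
    """Return (train_ids, test_ids, folds) for an 85/15 split plus K folds.

    Hypothetical helper: the shuffling scheme and the decision to fold only
    the training portion are assumptions, not taken from the paper.
    """
    ids = list(range(n_images))
    random.Random(seed).shuffle(ids)
    n_train = round(n_images * train_frac)  # 2003 * 0.85 = 1702.55 -> 1703
    train_ids, test_ids = ids[:n_train], ids[n_train:]
    # Fold j takes every k-th training image, so fold sizes differ by at most one.
    folds = [train_ids[j::k] for j in range(k)]
    return train_ids, test_ids, folds

train_ids, test_ids, folds = split_and_folds()
print(len(train_ids), len(test_ids), [len(f) for f in folds])
```

Because the images are frames sampled from the same videos, neighbouring frames can be very similar, so a purely random assignment like this one is only a starting point.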

The Structure of the Faster RCNN Model and Its Defects Analysis
The MS-FRCNN model proposed in this paper was obtained by improving the classical Faster RCNN target detection model. Faster RCNN is a two-stage target detection model that can be trained end to end [36]. It introduces an RPN [37], a network structure that differs from those of RCNN [38], SPPNet [39] and Fast-RCNN [40], to break the time-consuming bottleneck of these models. Figure 2 shows the main architecture of the Faster RCNN model. The Faster RCNN model is mainly composed of four modules: the feature extraction module, the Region Proposal Network (RPN) module, the RoI pooling module and the classification and regression module.
(1) Feature extraction module. The Faster RCNN uses VGG-16 as its backbone feature extraction network, in which 16 weight layers (13 convolutional layers and 3 fully connected layers) are used to extract the features. However, as the network depth increases, VGG-16 may exhibit overfitting, gradient dispersion and other phenomena, making the neural network very difficult to train. At the same time, simply increasing the number of layers leads to an increase in the target detection error rate.
Moreover, in the Faster RCNN, the detection process is based only on the feature map output by the last layer of the VGG-16 network. The deep convolution operations in the VGG network cause this feature map to contain more deep semantic features of the image while missing the shallow detail features. In detecting small target forest fires, shallow and deep features are equally important: they play an important role in target location and target classification, respectively. Therefore, using only the information provided by this final feature map cannot guarantee the effect of small target forest fire detection.
(2) Region Proposal Network (RPN). In this module, the feature map generated by the VGG is input into the RPN. The anchor mechanism performs a sliding window operation on the feature map, generates candidate frames that may contain targets and performs a binary classification of whether each frame contains an object. The RPN divides the process of detecting the target box into two steps. First, the anchors are classified using Softmax into foreground and background. Then, the bounding-box regression offset of each anchor region is calculated to obtain an accurate candidate frame. The RPN can therefore determine both the size and the position of the candidate box, which achieves target location. However, in small target forest fire detection, the fire target is small and blends easily into the surrounding environment. This leads to a large amount of noise in the foreground proposal boxes extracted by the RPN, makes the target boundary difficult to distinguish and thus degrades small target fire detection.
(3) Region of interest (RoI) pooling. Based on the feature map output by the VGG and the candidate frames output by the RPN, the RoI pooling layer generates the feature map of each candidate frame and inputs it to the classification layer for target detection. However, when there are two adjacent detection boxes with high confidence, the NMS algorithm adopted by the Faster RCNN tends to delete one of them directly, which may lead to an erroneous deletion and affect the detection of small target forest fires.
(4) Classification and regression. This module is composed of a fully connected layer and a Softmax layer. On the basis of the feature map of each candidate frame, the classification operation detects the content of each candidate frame and outputs the probability that it belongs to each target type. The regression operation performs a boundary regression on each candidate frame, further corrects the position and size of the boundary and obtains the final accurate position of the detected frame.
It can be seen that, in the framework of the Faster RCNN model, many components may affect the detection of small target forest fires. According to the characteristics of small target forest fires, this paper improves these components to enhance the detection performance of the model for small target forest fires.

Backbone Feature Extraction Network ResNet50
The backbone feature extraction network is the basis of the target detection model. The VGG-16 backbone network adopted in the Faster RCNN model carries risks such as gradient saturation or vanishing gradients. In the MS-FRCNN model, ResNet50 was used instead of VGG-16 as the backbone feature extraction network. ResNet50 is a residual neural network, and its structure is shown in Figure 3. Assuming that the input of the neural network is x and the expected output of the network is H(x), the learning objective of the residual network is the residual function F(x) = H(x) − x. The residual units use skip connections that jump over layers, so that during backpropagation the gradient can bypass several layers, which alleviates the vanishing gradient phenomenon.
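The residual formulation can be illustrated with a minimal sketch. A unit outputs H(x) = F(x) + x, so its derivative with respect to x is F'(x) + 1, and the identity term gives the gradient a direct path. The scalar "layers", the slope value and the depth below are hypothetical and chosen only for illustration; they are not ResNet50 itself.

```python
def residual_block(x, residual_fn):
    """Return H(x) = F(x) + x, where residual_fn computes the residual F(x)."""
    fx = residual_fn(x)
    return [f + xi for f, xi in zip(fx, x)]

def plain_gradient(slope, depth):
    """Gradient through `depth` stacked scalar layers y = slope * x."""
    return slope ** depth

def residual_gradient(slope, depth):
    """Gradient through `depth` stacked residual layers y = x + slope * x."""
    return (1.0 + slope) ** depth

# Hypothetical residual branch whose weights have been driven to zero: F(x) = 0,
# so the block reduces to the identity mapping.
zero_branch = lambda x: [0.0 for _ in x]

x = [1.0, -2.0, 3.0]
print(residual_block(x, zero_branch))   # same values as x
print(plain_gradient(0.01, 20))         # shrinks towards zero
print(residual_gradient(0.01, 20))      # stays close to one
```

With a small per-layer slope, the plain stack multiplies the gradient by that slope at every layer, whereas the residual stack multiplies it by one plus the slope, which is the intuition behind the skip connection.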

ResNet50 is built from one convolutional layer, one fully connected layer and four groups of residual modules. The four groups contain three, four, six and three blocks, respectively, and each block has three convolution layers, which gives 1 + 3 × (3 + 4 + 6 + 3) + 1 = 50 weight layers. Table 1 shows the structure of ResNet50.

Table 1. The structure of ResNet50 (columns: Layer Name, Output Size, ResNet50).
Unlike the Faster RCNN model, the MS-FRCNN model does not input the feature map output by ResNet50 directly into the RPN, but inputs it into the Feature Pyramid Network (FPN). By exploiting the multi-scale feature extraction of the FPN, the MS-FRCNN can fully extract and fuse the deep and shallow features of the image to improve its detection ability for small target forest fires.

Feature Pyramid Network (FPN)
The feature maps of the shallow convolution layers of a deep convolutional neural network are large in size and contain more geometric information, which is conducive to target location. The feature maps of the deep layers are smaller in size and contain richer semantic information, which is helpful for target classification. However, the Faster RCNN pays more attention to the extraction of the deep features of the image, so it cannot guarantee the effect of detecting small target forest fires.
In the MS-FRCNN model proposed in this paper, a Feature Pyramid Network (FPN) is used to extract and fuse features of different scales from the feature maps output by ResNet50. Through this process, the deep and shallow features of the image can be fully extracted, so that the model does not lose image details and its feature extraction becomes more robust, which helps improve the detection accuracy for small target forest fires. The multi-scale feature extraction process in the FPN includes two parts. The first part is the bottom-up feature extraction process, and the second part is the top-down and horizontal fusion process. Figure 4 shows the structure of the FPN.

To reduce the calculation, the output feature maps are marked as {C2, C3, C4, C5}. In the top-down operation, the feature map of the top layer is enlarged to the same scale as the feature map of the previous layer through up-sampling. M5 is obtained from C5; M5 is then up-sampled by a factor of two and added to C4 to obtain M4. M3 and M2 are calculated according to the same process. Next, to eliminate the aliasing effect caused by the fusion of the deep and shallow features, a convolution is applied to each of {M2, M3, M4, M5}, with a kernel size of 3 × 3, 256 output channels and a stride of 1. The horizontal connection structure thus combines the semantic features of the upper layers with the detailed features of the lower layers to obtain P2, P3, P4 and P5 in turn, and P6 is obtained through a 1 × 1 maximum pooling operation on P5. {P2, P3, P4, P5, P6} are used as the input of the RPN. This enables the RPN to take advantage of both the semantic features of the top layers and the rich geometric information of the lower layers when selecting candidate frames.
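The top-down pathway described above can be sketched on toy data. This is a minimal illustration under stated assumptions: each feature map is a single-channel 2-D list, up-sampling is nearest-neighbour, and the 3 × 3 smoothing convolution and any channel alignment are omitted. The helper names (`upsample2x`, `add_maps`, `top_down`, `constant_map`) are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour up-sampling of a 2-D map by a factor of two."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out

def add_maps(a, b):
    """Element-wise sum of two maps of equal size."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def top_down(c_maps):
    """c_maps = [C2, C3, C4, C5], finest first; returns [M2, M3, M4, M5]."""
    m = [c_maps[-1]]                    # M5 is taken from C5
    for c in reversed(c_maps[:-1]):     # visit C4, then C3, then C2
        m.insert(0, add_maps(upsample2x(m[0]), c))
    return m

def constant_map(size, value):
    return [[value] * size for _ in range(size)]

# Toy pyramid: each level halves the spatial size of the one below it.
c2, c3, c4, c5 = (constant_map(s, v)
                  for s, v in [(8, 1.0), (4, 2.0), (2, 3.0), (1, 4.0)])
m2, m3, m4, m5 = top_down([c2, c3, c4, c5])
print(len(m2), m2[0][0])  # 8 10.0
```

In the toy run, M2 keeps the 8 × 8 resolution of C2 while accumulating the values propagated down from C5, which mirrors how fine-resolution maps receive deep semantic information.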
At the same time, the MS-FRCNN adds the parallel attention module (PAM) to the RPN and extracts richer semantic and spatial features from the feature map by running spatial attention and channel attention in parallel, which improves the classification and regression performance of the RPN for small targets.

Parallel Attention Mechanism (PAM)
In the MS-FRCNN model, the parallel attention module (PAM) contains two branches: the channel attention branch and the spatial attention branch. The channel attention models the importance of each feature channel, which helps to highlight the feature information that is valuable for classification. The goal of the spatial attention is to highlight the spatial positions in the feature map that are valuable for target recognition. In other words, the channel attention focuses on the semantic information, "what", of the target, while the spatial attention focuses on the location information, "where", of the target. Therefore, the information provided by these two kinds of attention is complementary.
Suppose the original feature is denoted $O_{feature}$, the spatial attention feature $S_{feature}$ and the channel attention feature $C_{feature}$. $\Phi_{spatial}$ and $\Phi_{channel}$ represent the spatial attention function and the channel attention function, respectively, and the feature extraction can be expressed as

$$S_{feature} = \Phi_{spatial}(O_{feature}) \tag{1}$$

$$C_{feature} = \Phi_{channel}(O_{feature}) \tag{2}$$

As shown in the improved channel and spatial attention module in Figure 5, the original feature $O_{feature}$ first passes through the spatial attention function $\Phi_{spatial}$ and the channel attention function $\Phi_{channel}$ to obtain the spatial attention feature and the channel attention feature, respectively. The two are fused to obtain the attention feature $A_{feature}$, which is then combined with the original feature $O_{feature}$ to obtain the final fusion feature $F_{feature}$.
Woo et al. [41] proposed an attention mechanism, the CBAM (convolutional block attention module), that integrates channel and spatial attention. Unlike the PAM, the CBAM adopts a serial mechanism, with the channel attention first and the spatial attention second. Figure 6 shows the structure diagram of the CBAM mechanism. The feature map first enters the channel attention, which calculates the channel attention weights by aggregating over the width and height of the feature map. A series of aggregation and convolution operations is then performed on the feature map output by the channel attention to recalibrate it along the spatial dimension. Therefore, the new feature map obtained after the CBAM module carries attention weights derived from the relationships among features in both the channel and spatial dimensions. The PAM module proposed in this paper runs the channel and spatial attention in parallel, as shown in Figure 5. Compared to the CBAM module, this mechanism does not require the order of the channel and spatial attention to be set subjectively, and it integrates the channel and spatial features well. The experimental results in this paper show that, in the MS-FRCNN model, the PAM mechanism improves small target fire detection more than the CBAM mechanism does, which indicates that the PAM achieves better feature calibration and fusion.
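The structural difference between the parallel (PAM) and serial (CBAM) arrangements can be made concrete with a toy sketch. Everything inside the two attention functions is an assumption made for illustration: sigmoid-gated global average pooling for the channel branch, a sigmoid-gated channel mean for the spatial branch, element-wise addition as the fusion and a second element-wise addition to combine with the original feature. The text does not give these operators, and the real CBAM uses learned layers that are omitted here.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def channel_attention(feat):
    """Toy channel branch: scale each channel by sigmoid of its global average."""
    out = []
    for ch in feat:
        n = sum(len(row) for row in ch)
        w = sigmoid(sum(sum(row) for row in ch) / n)
        out.append([[w * v for v in row] for row in ch])
    return out

def spatial_attention(feat):
    """Toy spatial branch: scale each position by sigmoid of its channel mean."""
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    weights = [[sigmoid(sum(feat[k][i][j] for k in range(c)) / c)
                for j in range(w)] for i in range(h)]
    return [[[weights[i][j] * ch[i][j] for j in range(w)] for i in range(h)]
            for ch in feat]

def add(a, b):
    """Element-wise sum of two features shaped [channel][row][col]."""
    return [[[x + y for x, y in zip(ra, rb)] for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]

def pam_parallel(o_feature):
    """Both branches read the same original feature; no ordering is imposed."""
    s_feature = spatial_attention(o_feature)
    c_feature = channel_attention(o_feature)
    a_feature = add(s_feature, c_feature)  # assumed fusion: element-wise sum
    return add(a_feature, o_feature)       # assumed combination with the original

def cbam_serial(o_feature):
    """Serial order: channel attention first, spatial attention on its output."""
    return spatial_attention(channel_attention(o_feature))

o = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]]]  # 2 channels, 2 x 2
print(pam_parallel(o))
print(cbam_serial(o))
```

In the parallel form, swapping the two branches leaves the result unchanged, whereas in the serial form the second branch sees a feature already reweighted by the first, which is the ordering choice the text refers to.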

Improved Non-Maximum Suppression Algorithm (Soft-NMS)
The Faster RCNN uses the NMS algorithm to filter the detected target frames. The NMS algorithm works as follows: select the detection frame with the highest confidence in the region of interest, compute the degree of coincidence of all the other detection frames with this frame and remove the frames with a high degree of coincidence, thereby reducing duplicate detection boxes. The Intersection over Union (IoU) ratio is a measure for evaluating positioning accuracy that defines the degree of overlap between two detection frames. The calculation formula of the IoU is shown in Formula (5).

$$ IoU(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{5} $$
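As a minimal sketch of the IoU computation, the function below assumes axis-aligned boxes given as `(x1, y1, x2, y2)` corner coordinates with `x1 < x2` and `y1 < y2`. This box convention and the function name are assumptions for illustration; the paper does not specify a coordinate format.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For example, two 2 × 2 boxes offset by one unit in each direction share an area of 1 out of a union of 7, so their IoU is 1/7.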
The processing method of the traditional NMS algorithm can be intuitively expressed through the score reset function of Formula (6).

$$ s_i = \begin{cases} s_i, & iou(A, B_i) < N_t \\ 0, & iou(A, B_i) \geq N_t \end{cases} \tag{6} $$

where $s_i$ represents the score of the i-th detection frame, $A$ represents the detection frame with the highest confidence in the region of interest, $B_i$ represents the i-th detection frame, $iou(A, B_i)$ represents the degree of coincidence between the i-th detection frame and $A$, and $N_t$ represents the calibrated coincidence threshold.

Therefore, the NMS algorithm strictly limits the overlap of the detection boxes. When it is applied to small target forest fire detection, if the bounding frame B containing a small target fire intersects the bounding frame A with the highest confidence, and the IoU of the two frames is higher than the calibration threshold $N_t$, B will be forcibly deleted. As a result, the small target B is deleted by mistake. Figure 7 shows the diagram for calculating the IoU.
In view of the shortcomings of the traditional NMS algorithm in fire detection, the MS-FRCNN model uses the soft-NMS algorithm to replace the NMS algorithm. When the IoU of a detection frame is higher than the calibration threshold $N_t$, the soft-NMS algorithm does not delete the frame directly but reduces the confidence of the frame according to the value of the IoU. The soft-NMS algorithm can be intuitively expressed by the score reset function shown in Equation (7):

$$ s_i = \begin{cases} s_i, & iou(A, B_i) < N_t \\ s_i \left(1 - iou(A, B_i)\right), & iou(A, B_i) \geq N_t \end{cases} \tag{7} $$

Thus, in small target fire image detection, when the IoU of detection frames B and A is higher than the calibration threshold $N_t$, the score of detection frame B is attenuated as a linear function of its degree of coincidence with A, which avoids deleting it by mistake. However, this score reset function is not continuous: the score of a detection frame jumps abruptly at the threshold. Therefore, this study introduces a Gaussian function, and the final improved function is as follows.

$$ s_i = s_i \, e^{-\frac{iou(M, b_i)^2}{\sigma}}, \quad \forall b_i \notin D $$

In the formula, $\sigma$ is a hyperparameter, $D$ is a set of detection frames and $iou(M, b_i)$ represents the degree of coincidence between the i-th detection frame and $M$. Through the improvement of the NMS algorithm, the mistaken deletion of detection frames by the traditional NMS algorithm can be reduced and the generalization ability of the MS-FRCNN model for the small target detection box of forest fires can be improved.
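The three score reset rules discussed in this section (hard NMS, linear soft-NMS and Gaussian soft-NMS) can be compared side by side in a short sketch. The default values `n_t=0.5` and `sigma=0.5`, the function names and the use of a precomputed IoU matrix are assumptions made for illustration and are not settings reported in this paper.

```python
import math


def hard_reset(score, overlap, n_t=0.5):
    # Traditional NMS: the score is zeroed once the IoU reaches the threshold.
    return score if overlap < n_t else 0.0


def linear_reset(score, overlap, n_t=0.5):
    # Linear soft-NMS: the score decays in proportion to the IoU above the threshold.
    return score if overlap < n_t else score * (1.0 - overlap)


def gaussian_reset(score, overlap, sigma=0.5):
    # Gaussian soft-NMS: a continuous decay with no threshold.
    return score * math.exp(-(overlap ** 2) / sigma)


def rescore_all(scores, overlaps, reset):
    """Greedy rescoring loop.

    overlaps[i][j] is the IoU between frames i and j. At each step the
    highest-scoring remaining frame is selected and the scores of the
    frames still remaining are reset according to their IoU with it.
    Returns the final scores in the original frame order.
    """
    final = list(scores)
    remaining = set(range(len(scores)))
    while remaining:
        top = max(remaining, key=lambda i: final[i])
        remaining.remove(top)
        for i in remaining:
            final[i] = reset(final[i], overlaps[top][i])
    return final
```

With three frames scored 0.9, 0.8 and 0.3, where only the first two overlap (IoU 0.6), the hard rule zeroes the second frame, while both soft rules keep it with a reduced score, so a later score cut-off can still retain it.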

Overall Architecture of the MS-FRCNN Model
Figure 8 shows the overall architecture of the MS-FRCNN model proposed in this paper. Similar to the Faster RCNN model, the MS-FRCNN model is a two-stage model. The first stage extracts the image features. In the second stage, the target boxes are generated and selected based on the extracted image features. Compared to the Faster RCNN model, however, the following components of the MS-FRCNN model have been improved.
(1) The MS-FRCNN uses ResNet50 instead of VGG-16 as the backbone network of the model to alleviate the gradient explosion, gradient vanishing and network degradation problems that may occur during the training of a deep neural network. However, although ResNet50 can extract richer features from the image, deepening the network causes more shallow semantic features to be lost, which affects the performance of small target forest fire detection. To solve this problem, the MS-FRCNN inputs the feature maps output by ResNet50 into the FPN. By extracting feature maps at different levels, the deep and shallow features of the image can be obtained at multiple scales, thus improving the detection performance of the MS-FRCNN model for small target forest fires.
(2) The MS-FRCNN model uses the PAM, a mechanism that runs the channel attention and the spatial attention in parallel, in the RPN network. Through the parallel connection and complementarity of the spatial attention and channel attention, the features and spatial location information related to the target in the image are highlighted, while the features of the non-target parts (background) are suppressed. This helps the RPN network obtain more accurate target candidate frames, improves the performance of target classification and regression and helps solve the problems of target occlusion, target ambiguity and complex backgrounds in small target forest fire recognition.
(3) The MS-FRCNN model uses the soft-NMS algorithm instead of the NMS algorithm, which avoids the mistaken deletions caused by the NMS algorithm forcibly removing detection frames and helps improve the generalization performance of small target forest fire detection.

Evaluation Indicators
In order to evaluate the forest fire small target detection model MS-FRCNN proposed in this paper, the accuracy (Acc), precision (Pre) and recall rate (Rec) were used as the evaluation indicators, as shown in Equations (10)-(12).

$$ Acc = \frac{TP + TN}{TP + TN + FP + FN} \tag{10} $$

$$ Pre = \frac{TP}{TP + FP} \tag{11} $$

$$ Rec = \frac{TP}{TP + FN} \tag{12} $$

Among them, TN represents true negative, indicating the number of samples that were actually negative and also predicted to be negative. FP stands for false positive, indicating the number of samples that were actually negative and predicted to be positive. FN is false negative, indicating the number of samples that were actually positive and predicted to be negative. TP is true positive, indicating the number of samples that were actually positive and predicted to be positive.

The MS-FRCNN proposed in this paper was compared to the Faster RCNN model. Figure 9 shows the precision recall curves for small target forest fire detection under the two models. Compared to the Faster RCNN model, the MS-FRCNN model achieved better detection results. The precision recall curve obtained from the MS-FRCNN model was always above the curve of the Faster RCNN, indicating that the improvements made in the MS-FRCNN model were reasonable. Compared to the Faster RCNN, the MS-FRCNN model significantly improved the detection performance for small target forest fires and was insensitive to the detection threshold.
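The accuracy, precision and recall used in this section follow directly from the four confusion-matrix counts. The sketch below uses the standard definitions of these indicators; the function name and the choice to return 0.0 when a denominator is zero are assumptions made for illustration.

```python
def detection_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0
    pre = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    return acc, pre, rec
```

For instance, with hypothetical counts TP = 8, TN = 5, FP = 2 and FN = 5, the accuracy is 13/20, the precision is 8/10 and the recall is 8/13.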

K-Fold Cross-Validation
K-fold cross-validation [42] randomly divides the data set into K subsets with nearly the same number of samples. In each of K rounds, one subset that has not yet been used for testing serves as the test set and the remaining K-1 subsets form the training set. The accuracy and other evaluation indexes are calculated for each round, and the average of the K results is taken to evaluate the performance of the model. The experiment in this paper adopted five-fold cross-validation, as shown in Figure 10.
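The splitting and averaging procedure described above can be sketched with the standard library alone. The function names, the fixed random seed and the `evaluate` callback (which stands in for training and testing a detector on the given indices) are hypothetical and are not part of the experimental code of this paper.

```python
import random


def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding gives k folds whose sizes differ by at most one sample.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_samples, evaluate, k=5, seed=0):
    """Average the score returned by evaluate(train, test) over the k folds."""
    scores = [evaluate(train, test)
              for train, test in k_fold_splits(n_samples, k, seed)]
    return sum(scores) / len(scores)
```

Each sample appears in exactly one test fold, so every image contributes to the averaged score once as test data and K-1 times as training data.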

In the experiment of this paper, the performance of the two models (the Faster RCNN and the MS-FRCNN) in the small target forest fire detection task was evaluated using five-fold cross-validation. Table 2 shows the test accuracy obtained by the two models in each of the five folds. Across the individual tests, the minimum detection accuracy of the MS-FRCNN was 82.6, while the minimum accuracy of the Faster RCNN was 76.7. The average detection accuracy of the MS-FRCNN was 82.9, while the average accuracy of the Faster RCNN was 77.8. This shows that the MS-FRCNN model, obtained by improving the Faster RCNN model, is better suited to the detection of small target forest fires and achieves better detection results.

Experimental Comparison
In order to objectively evaluate the advantages of the MS-FRCNN model in small target forest fire detection, recent target detection models were selected as baselines for comparison. These include the single-stage target detection model YOLOv4, the two-stage target detection models Faster RCNN and FPN, and variants of the two-stage models with different feature extraction networks (e.g., VGG-16, Inception, ResNet). In addition, several recently proposed target detection models with ResNet50 as the backbone network were also selected for comparison.

Model               Backbone    Accuracy
... [44]            ResNet50    82.4
Libra R-CNN [45]    ResNet50    83.5
Grid R-CNN [46]     ResNet50    83.9
FCOS [47]           ResNet50    83.7

It can be seen that the MS-FRCNN model proposed in this paper achieved the best performance in small target forest fire detection. Among the baseline models, the one-stage YOLO model obtained better detection results than the two-stage Faster RCNN model. When the two-stage FPN model used VGG-16 or Inception as the feature extractor, its detection accuracy was lower than that of YOLO. However, when the ResNet50 network with the residual structure was used to extract the features, the FPN achieved a better detection effect than YOLO. The reason was that the different feature extraction networks had different capabilities in capturing the discriminative information in the image. The YOLO model used Darknet53, which has a residual learning structure, as the extractor, while the VGG-16 and Inception networks did not use residual learning. As the network layers deepened, network degradation easily occurred in the VGG-16 and Inception networks. When the feature extraction network was replaced by ResNet50, the detection ability of the model improved. The reason is that, after the introduction of the residual connections, the ResNet50 network gained the ability to learn identity mappings and, to some extent, avoided the gradient vanishing problem that tends to occur in the middle and later stages of training. Among the baseline models, those with ResNet50 as the backbone, such as the Cascade RPN, Guided Anchor, Libra R-CNN, Grid R-CNN and FCOS, also obtained good detection results.
Compared to these baseline models, the MS-FRCNN model proposed in this paper achieved the best detection performance because it made full use of the advantages of the ResNet50, FPN and RPN networks. The MS-FRCNN model used the ResNet50 network as the backbone for feature extraction, which avoids the dispersion and vanishing of the gradient during model training. The FPN was used to enhance the fusion of high-level semantic information and low-level detail information. The high-quality candidate frames generated by the RPN network also contributed to the subsequent positioning and regression process. At the same time, the parallel attention module (PAM) was used in the MS-FRCNN model to highlight the semantic and spatial location information of small target forest fires. All these helped the MS-FRCNN model obtain the best performance when detecting small target forest fires.

Ablation Experiment
In order to verify the effectiveness of the various improvement strategies adopted by the MS-FRCNN model compared to the Faster RCNN model, ablation experiments were conducted in this section. The experimental results are shown in Table 4. "Model 0" refers to the baseline model, that is, the standard Faster RCNN model without any improvement strategy. "Model 1" to "Model 6" refer to the models obtained by introducing a single improvement strategy; "Model 7" to "Model 11" are the models that combine several improvement strategies.

The experimental results from "Model 1" to "Model 4" show that adding an attention mechanism to the Faster RCNN model improved the detection effect of the model on small target forest fires. However, the different attention modules (SE, SAM, CBAM and PAM) led to different detection performance. Among them, the SE (squeeze-and-excitation) focused on the relationship between the channels and obtained the important features from the different channels. The SAM (spatial attention mechanism) kept the spatial dimension unchanged and compressed the channel dimension, focusing on the location information of the target. The CBAM (convolutional block attention module) derived the attention map in a serial manner along two independent dimensions (channel and space). The PAM was the parallel channel and spatial attention module proposed in this paper. Compared to the baseline model, the detection accuracy of the model after adding the four attention modules increased by 1.4, 0.6, 1.8 and 2.8 percentage points, respectively. Among them, the addition of the PAM most significantly improved the detection effect of the model on small target forest fires. This shows that the PAM effectively highlighted the feature information related to forest fires in the image, thus helping the model obtain a better detection performance. "Model 5", which adds only the feature pyramid structure (FPN) to the baseline, achieved a detection accuracy 3.1% higher than the baseline model, which verified the necessity of extracting multi-scale features of the image and generating multi-scale candidate boxes based on them.
The experimental results of "Model 6" showed the influence of the post-processing operation on the detection accuracy. Compared to the baseline model, although the detection accuracy was only improved by 0.6%, the soft-NMS is easy to implement, has low complexity and is convenient to embed into an established model. This scheme can improve the detection accuracy without changing the structure of the Faster RCNN model. On the basis of the Faster RCNN model, the soft-NMS algorithm was adopted to process the candidate boxes, which avoided the missed detections caused by the "harsh" threshold setting.

"Model 7" to "Model 11" showed the experimental results after combining the above improvement strategies. The results showed that combining multiple strategies did not necessarily bring further performance improvements: "Model 7" and "Model 8", for example, showed a decline in performance compared to "Model 5". However, "Model 10", generated by combining the parallel attention module (PAM) with the multi-scale module, achieved an 84.6% detection accuracy. "Model 11" further added the soft-NMS to "Model 10", which improved the detection accuracy by 0.6 percentage points and achieved the best detection effect for small target forest fires.

Visualize the Results
This section visualizes the effect of each model on small target forest fire detection. Taking five forest fire images as examples, Figure 11 shows the detection effects of the Faster RCNN model, the FPN model and the MS-FRCNN model proposed in this paper. The target prediction box and the target truth box are marked with green and red rectangles, respectively.

According to the visualization results, the MS-FRCNN model can achieve a more accurate location and classification prediction of small target forest fires. The Faster RCNN model had many problems with missed detection and false detection when detecting small target forest fires. Specifically, in the example of "Image20", the Faster RCNN detection model ignored two fire source targets, resulting in missed detections. In the example of "Image321", the Faster RCNN model incorrectly marked a non-fire area, which was a false detection. The FPN model had similar problems of false detection and missed detection. This is because there were great inter-class similarities and intra-class differences in the forest fire scene. Surrounding objects, such as "branches" and "smoke", interfered with the real "fire source", which increased the difficulty of feature recognition and ultimately affected the detection effect of these models.

The multi-scale feature enhancement FPN module introduced in the MS-FRCNN model helped the detection and recognition of small fire source targets in the forest fire images. The FPN enhanced the extraction of the low-level detail features and high-level semantic information in the model, and helped the MS-FRCNN model locate and mark fire source targets more accurately. At the same time, the parallel channel and spatial attention mechanism adopted in the MS-FRCNN model also enhanced its ability to extract task-related features and obtain representative, discriminative features from a more comprehensive perspective, which further helped the MS-FRCNN model reduce the risk of misjudging the fire source.

Conclusions
Small target forest fires have characteristics such as irregular shape and varying scale, and they are easily blocked by obstacles, which leads to missed or false detections and affects the accuracy of small target fire detection. Based on the classic target detection model, the Faster RCNN, this paper proposed a small target forest fire detection model, the MS-FRCNN, which integrates the multi-scale features of the images. Similar to the Faster RCNN, the MS-FRCNN model is a two-stage end-to-end detection model. The first stage extracts the image features and the second stage identifies the small target forest fires. However, unlike the Faster RCNN, the MS-FRCNN uses ResNet50 instead of VGG-16 as the backbone network for feature extraction to avoid gradient saturation as the network depth increases. At the same time, the MS-FRCNN integrates the feature pyramid module (FPN) into the feature extraction module. Through multi-scale information extraction and information fusion, the MS-FRCNN model can more comprehensively capture the shallow details and deep semantic information in the image. In the forest fire detection stage, the MS-FRCNN integrates the parallel attention module (PAM) into the RPN. By running the channel and spatial attention in parallel, the model pays more attention to the characteristics of the fire spots and suppresses the impact of complex interfering backgrounds. At the same time, the MS-FRCNN replaces the NMS algorithm with the soft-NMS algorithm to reduce the mistaken deletion of detection frames by the RPN and improve the generalization ability of the MS-FRCNN for the detection frames of small target forest fires. The experimental results show that, compared to the baseline model, the MS-FRCNN model can effectively reduce missed detections and false alarms for small target forest fires and achieves the best detection of small target forest fires.

Figure 1. Schematic diagram of forest fire detection in continuous video frames.
shows the main architecture of the Faster RCNN model. The Faster RCNN model is mainly composed of four modules: the feature extraction module, the Region Proposal Network (RPN) module, the RoI pooling module and the classification and regression module.

Figure 3. The structure diagram of the residual module.

Figure 4. Schematic diagram of the FPN module structure.
The bottom-up process corresponds to the convolution operation process of ResNet50. ResNet50 includes five convolution modules, namely Conv1_x, Conv2_x, Conv3_x, Conv4_x and Conv5_x. The output feature maps {Conv1 Res, Conv2 Res, Conv3 Res, Conv4 Res, Conv5 Res} of the last residual block of each convolution module are abbreviated as {C1, C2, C3, C4, C5}. {C2, C3, C4, C5} are input into the FPN module as the feature maps of the corresponding levels. After entering the FPN, the feature maps first pass through a 1 × 1 convolution layer (kernel size 1 × 1; 256 output channels; stride 1) to reduce the computation, and the output feature maps are denoted {C2′, C3′, C4′, C5′}. In the top-down operation, the feature map of the top layer is enlarged to the same scale as the feature map of the previous layer through upsampling. Where M5 is C5′, perform

Figure 5. Structure diagram of the PAM module.

Figure 6. Structure diagram of the CBAM module.

Figure 7. Schematic diagram of calculating the IoU.

Figure 8. Overall architecture of the MS-FRCNN model.

Figure 9. Precision recall curves for small target forest fire detection (Baseline is the Faster RCNN model; Ours is the MS-FRCNN model).

Figure 10. Schematic diagram of the five-fold cross-validation process.
shows the detection effects of the Faster RCNN model, the FPN model and the MS-FRCNN model proposed in this paper.The target prediction box and the target truth box are marked with green and red rectangles, respectively.

Figure 11. Examples of the visualized results of the detection of small target forest fire images in each model.

Author Contributions: Conceptualization, data curation, investigation, validation, resources, writing (original draft): L.Z.; software, supervision, visualization: M.W.; writing (review and editing): M.W., Y.D. and X.B.; formal analysis and methodology: L.Z. and Y.D. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the National Natural Science Foundation of China, grant number 71473034, and the Heilongjiang Provincial Natural Science Foundation of China, grant number LH2019G001.

Table 1. Structure of the ResNet50 network.

Table 2. Five-fold cross-validation of the Faster RCNN and the MS-FRCNN on the training set and the test set.
Table 3 lists the experimental results of small target forest fire detection under each model.

Table 3. Performance comparison of the different detection schemes.