A Forest Fire Recognition Method Using UAV Images Based on Transfer Learning

Timely detection of forest wildfires is of great significance to the early prevention and control of large-scale forest fires. Unmanned aerial vehicles (UAVs) equipped with cameras offer a wide monitoring range and high flexibility, making them well suited to early forest fire detection. However, the varying visual angle and distance of the UAV during image sampling, together with the limited number of labeled UAV images, constrain the accuracy of forest fire recognition based on UAV imagery. This paper proposes FT-ResNet50, a model based on transfer learning. The model transfers a ResNet network trained on the ImageNet dataset, together with its initialization parameters, to the target task of forest fire identification from UAV images. Adapting to the characteristics of the target dataset, the Adam optimizer and Mish activation function are used to fine-tune three convolution blocks of ResNet, and a focal loss function and adjusted network structure parameters are introduced to optimize the network, so that deep semantic information can be extracted from fire images more effectively. The experimental results show that, compared with baseline models, FT-ResNet50 achieved better accuracy in forest fire identification: its recognition accuracy was 79.48%, which is 3.87% higher than ResNet50 and 6.22% higher than VGG16.


Introduction
Wild forest fires occur frequently all over the world. Forest fires are typically high-risk and highly destructive, posing great harm to social and economic development, environmental protection, and ecosystems. Unlike other fires, forest fires exhibit specific damage modes because of the environment in which they occur: in open terrain with abundant oxygen, fires ignite and spread more easily, causing serious risks to personal safety and heavy economic losses. Early fire detection is the only effective way to reduce the harm of forest fires [1]. Therefore, research on forest fire identification and early warning has attracted extensive attention.
At present, forest fire detection is mainly realized through monitoring towers, aviation and satellite systems, optical sensors, digital cameras, and wireless sensor networks [2,3]. However, forest fire detection based on monitoring towers depends largely on the experience of observers, and it is difficult for the monitoring range to cover a large area of wild forest. Satellite remote sensing is very effective for detecting large-scale forest fires, but it struggles to identify early, localized fires [4,5]. Fire detection systems based on sensor networks perform well in indoor spaces but are difficult to install and maintain in wild forest areas due to high hardware costs [6,7]. In addition, owing to the limitations of sensor materials, environmental interference may lead to false positives, and wireless sensor networks cannot provide the visual information that helps firefighters track a fire scene. In recent years, with the development of machine vision technology, researchers have proposed various fire detection models based on image processing [8,9]. Building on these methods, this paper proposes the FT-ResNet50 model, which uses an enhanced UAV forest fire dataset to realize effective identification of forest fires. The main contributions of this paper are: (1) The FT-ResNet50 model adopts transfer learning to address the shortage of labeled UAV forest fire images; it achieves high-performance forest fire recognition even when the labeled UAV samples are limited in size and unevenly distributed. (2) FT-ResNet50 selects ResNet50 as the base network for transfer learning on the basis of experimental results. By fixing the shallow layers of ResNet50 and fine-tuning its deep layers, we obtained the configuration of ResNet50 best suited to the target dataset; the model can thus extract deep semantic features from UAV images, improving the accuracy of forest fire recognition. (3) The FT-ResNet50 model combines mixup-based sample enhancement with traditional sample enhancement to expand the UAV image sample set and enhance the generalization ability of the model.
The structure of this paper is organized as follows. Section 2 presents the dataset used in the experiments and discusses the structure of the FT-ResNet50 model in detail. Section 3 introduces the experimental configuration and experimentally verifies the influence of configuration parameters such as network depth, loss function, activation function, and optimizer on forest fire identification, to explain the framework of the FT-ResNet50 model. Section 4 discusses and analyzes the experimental results in depth; Section 5 summarizes the work.

Dataset
The FLAME (Fire Luminosity Airborne-based Machine learning Evaluation) dataset is a collection of fire images captured by UAV in an Arizona pine forest [27]. Different UAVs and cameras were used to collect image samples of forest fires. Table 1 describes the technical specifications of the UAVs and cameras used in the FLAME dataset and the resolution of the collected samples. The dataset includes video recordings and heat maps taken by an infrared camera; each frame of the video is labeled as an image. In this paper, 31,501, 7874, and 8617 image samples were extracted from the FLAME dataset as the training set, validation set, and test set of this experiment, respectively. Figure 1 shows some examples of typical forest fire images in the FLAME dataset.

Mixup
To give the forest fire recognition model better generalization ability, this paper presents an expansion strategy for the forest fire image training samples. Increasing the number of training samples broadens their distribution and improves the robustness of the model to noise.

Zhang et al. [28] proposed a sample enhancement method based on mixup. Mixup is a sample expansion algorithm for computer vision that enlarges a dataset by blending images of different classes. Two image samples are randomly selected from the training dataset, and their pixel values and labels are combined with a given weight. Specifically, mixup builds virtual training samples as follows:

x̃ = λx_i + (1 − λ)x_j,  ỹ = λy_i + (1 − λ)y_j,

where (x_i, y_i) and (x_j, y_j) are two examples drawn at random from the training data, and λ ∈ [0, 1] follows a Beta distribution, namely λ ∼ Beta(α, α). This mixup-based data enhancement smooths decision boundaries, provides smoother predictions, and enhances the prediction ability of the model beyond the range of the training dataset. Figure 2 shows the process of mixup-based image sample augmentation. The experiments show that the best data-fusion effect is achieved when λ = 0.5 and α = β.
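The mixing rule above can be sketched in a few lines of NumPy (the function name and the NumPy-array representation of images and one-hot labels are illustrative, not taken from the paper's code):

```python
import numpy as np

def mixup(x_i, y_i, x_j, y_j, alpha=0.5, rng=None):
    """Build one virtual sample by convexly mixing two real samples.

    lambda is drawn from Beta(alpha, alpha); the same mixing weight is
    applied to the pixel values and to the one-hot labels.
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x = lam * x_i + (1.0 - lam) * x_j   # pixel-wise blend of the two images
    y = lam * y_i + (1.0 - lam) * y_j   # soft label with the same weight
    return x, y, lam
```

For example, mixing a "fire" image (label [1, 0]) with a "no fire" image (label [0, 1]) yields an interpolated image with a soft label such as [0.7, 0.3].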


Residual Network (ResNet-50)
Following the success of VGGNet architectures [29], researchers believed that deeper models would outperform shallower ones. However, as the number of layers increases, the complexity and training difficulty of the model also increase, and accuracy can degrade. In 2016, Kaiming He and colleagues at Microsoft Research addressed the problems of vanishing and exploding gradients by building ResNet, making the training of much deeper networks feasible. They introduced a new learning framework to simplify the training of deeper networks [30], called residual learning; accordingly, a model using this framework is called a residual network (ResNet). ResNet allows the original input to be connected directly to subsequent neurons, and takes as its goal the minimization of the difference (residual) between input and output. Specifically, let the original input to the network be x and the final desired output be H(x). When the original input x is passed directly to the tail of the network as the initial result, the objective to be learned becomes F(x) = H(x) − x. Figure 3 illustrates the principle of residual learning in ResNet.
This paper is devoted to extracting deeper semantic information from forest fire images, beyond color and structural features, so the ResNet-50 network was selected as the backbone of our model. Table 2 lists the architecture of ResNet-50. ResNet-50 contains 49 convolution layers, a 3 × 3 max-pooling layer, an average-pooling layer, and a fully connected layer. The classical ResNet-50 model involves 25.56 million parameters; the rectified linear unit (ReLU) activation function and batch normalization (BN) are applied after the convolution layers in the "Bottleneck" blocks, and the softmax function is applied to the fully connected layer.
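The identity-shortcut idea can be illustrated with a minimal sketch (a toy fully connected residual block in NumPy, not the actual Bottleneck layout): the block outputs relu(F(x) + x), so when the learned residual F(x) is zero it reduces to the identity on non-negative inputs, which is what makes very deep stacks trainable.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F(x) = relu(x @ w1) @ w2.

    The shortcut adds the unchanged input x back onto the learned
    residual F(x) = H(x) - x before the final activation.
    """
    f = relu(x @ w1) @ w2   # the residual branch F(x)
    return relu(f + x)      # identity shortcut: add x back
```

With zero weights the residual branch vanishes and the block simply passes its (non-negative) input through, so stacking more such blocks cannot make the representation worse than the identity mapping.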

Transfer Learning
The idea of transfer learning was introduced to address the limited sample size of UAV forest fire images. In this study, the ResNet50 network trained on the ImageNet dataset [31] was transferred to the experimental dataset of UAV forest fires. The ImageNet dataset contains about 1.2 million images in 1000 categories; a network pre-trained on such a large dataset can be transferred effectively to a wide variety of image classification tasks [32]. The ResNet50 network trained on ImageNet was taken as the preliminary model, and its optimal configuration was obtained by fixing the convolution blocks responsible for shallow feature extraction, fine-tuning the convolution blocks responsible for deep feature extraction, and adjusting the Mish and Adam settings, to complete feature extraction and recognition of UAV forest fire images.
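The fix-and-fine-tune scheme can be sketched as a single update step (a minimal NumPy sketch with illustrative block names; an actual implementation would use a deep learning framework's parameter groups and optimizer):

```python
import numpy as np

def finetune_step(params, grads, frozen, lr=0.01):
    """One gradient step that updates only the fine-tuned blocks.

    Parameters whose names appear in `frozen` keep their pretrained
    (ImageNet) values; all others move along the negative gradient.
    """
    return {name: w if name in frozen else w - lr * grads[name]
            for name, w in params.items()}
```

Freezing the shallow blocks preserves the generic edge/texture/color filters learned on ImageNet, while the deep blocks adapt to fire-specific semantics.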

Adam Optimizer
In this study, the Adam optimizer was used to accelerate the convergence of the FT-ResNet50 model. Adam is a first-order, gradient-based algorithm for stochastic objective function optimization [33]. Adam combines the advantages of the AdaGrad [34] and RMSProp [35] algorithms; the former suits sparse gradient problems, and the latter suits nonlinear and non-stationary optimization objectives. Adam is easy to implement, computationally efficient, and has low memory requirements [36]. Its gradient diagonal scaling is invariant, so it is suitable for problems with large-scale data or parameters. For different parameters, Adam adaptively adjusts the learning rate and iteratively updates the weights of the neural network according to the training data [37,38]. The calculation process and pseudocode of the Adam algorithm are shown in Algorithm 1.
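The core of the update in Algorithm 1 follows Kingma and Ba's published rule, which can be sketched as a single step (default hyperparameters shown are the standard ones, not values taken from the paper):

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameters w given gradient g at step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * g       # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * g * g   # second-moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)          # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

Because the step is scaled by the square root of the second-moment estimate, the first update has magnitude close to lr regardless of the raw gradient's scale, which is what makes Adam robust across parameters with very different gradient magnitudes.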

Focal Loss
The Focal Loss function [39] is mainly used to address problems such as class imbalance and unequal sample difficulty. When training the FT-ResNet50 model, Focal Loss was used as the loss function to update ω and b. The Focal Loss function is defined as follows:

FL(p_t) = −(1 − p_t)^γ log(p_t),

where p_t reflects the proximity to the ground truth: the larger p_t is, the closer the prediction is to the ground truth, i.e., the more accurate the classification. γ is an adjustable factor, and (1 − p_t)^γ is the modulating factor. For an accurately classified sample, p_t → 1 and the modulating factor approaches 0; for an inaccurately classified sample, 1 − p_t → 1 and the modulating factor approaches 1. Compared with the traditional cross-entropy loss, Focal Loss leaves the loss of inaccurately classified samples almost unchanged while decreasing the loss of accurately classified samples; overall, this is equivalent to increasing the weight of the inaccurately classified samples in the loss function. p_t also reflects the difficulty of classification: the larger p_t, the higher the classification confidence and the easier the sample; the smaller p_t, the lower the confidence and the harder the sample. Focal Loss therefore increases the weight of difficult samples in the loss function, biasing training toward them, which helps to improve accuracy on difficult samples and the learning ability of the network for the current task.
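The definition above can be written directly in NumPy for the binary case (a sketch; the paper does not give its implementation):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    """FL(p_t) = -(1 - p_t)^gamma * log(p_t) for binary labels y in {0, 1}.

    p is the predicted probability of the positive class; p_t is the
    probability the model assigns to the true class.
    """
    p_t = np.where(y == 1, p, 1.0 - p)
    return -((1.0 - p_t) ** gamma) * np.log(p_t)
```

With γ = 0 this reduces to the ordinary cross-entropy; with γ > 0 the loss of well-classified samples (p_t near 1) is sharply down-weighted, so hard samples dominate the total loss.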

Mish
The Mish function [40] is a novel self-regularized, non-monotonic activation function. Its shape and properties are similar to those of Swish, and it plays an important role in the performance and training dynamics of neural networks. The Mish activation function can be expressed as follows:

Mish(x) = x · tanh(ln(1 + e^x)).

Compared with ReLU, the most common activation function in neural networks, Mish is differentiable everywhere in its domain, so there is no hard turning point at zero. Figure 4 shows the curve of the Mish function.
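The closed form above is a one-liner in NumPy (np.log1p is used for numerical stability at small x):

```python
import numpy as np

def mish(x):
    """Mish(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + e^x))."""
    return x * np.tanh(np.log1p(np.exp(x)))
```

Unlike ReLU, Mish is smooth through zero, approaches the identity for large positive inputs, and dips slightly below zero for negative inputs, which is the non-monotonic behavior referred to above.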

The Proposed Forest Fire Identification Model-FT-ResNet50
This section introduces the FT-ResNet50 model in detail. Figure 5 shows the architecture of the FT-ResNet50 model. FLAME-Cls is the extended dataset obtained after sample enhancement. The FT-ResNet50 model uses five levels of residual blocks for feature extraction. The first two residual blocks mainly extract the edge, texture, and color features of the image; because the extraction of these features is highly universal across image types, the structure of the first two residual blocks in FT-ResNet50 is the same as in ResNet50. The next three levels of residual blocks mainly extract the abstract semantic features of the image, which are the key to improving the accuracy of forest fire recognition. The FT-ResNet50 model adjusts the last three residual blocks of the ResNet50 network and applies the Adam stochastic gradient descent optimizer to residual blocks 3, 4, and 5 to keep training from falling into local optima and to ensure that the model obtains more accurate recognition results. The feature map output by the last convolution layer of the FT-ResNet50 model is converted into a 2048-dimensional vector through global average pooling, and the forest fire identification results are output as probabilities through the softmax function.
Meanwhile, in the FT-ResNet50 model, the original ReLU activation function was replaced by the Mish function to mitigate the gradient vanishing problem during training. In addition, Focal Loss was employed in place of the traditional binary cross-entropy loss; Focal Loss pays more attention to the training of difficult samples, which helps to improve the learning ability of the model.
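The pooling-and-softmax head described above can be sketched as follows (shapes are illustrative; w and b stand in for the trained fully connected layer, which the paper does not spell out):

```python
import numpy as np

def classify_head(feature_map, w, b):
    """Global average pooling followed by softmax classification.

    feature_map: (H, W, C) output of the last convolution layer;
    w: (C, n_classes) weights; b: (n_classes,) bias.
    """
    vec = feature_map.mean(axis=(0, 1))   # (H, W, C) -> (C,) pooled vector
    logits = vec @ w + b
    e = np.exp(logits - logits.max())     # numerically stable softmax
    return e / e.sum()
```

Replacing a flattened fully connected head with global average pooling is what keeps the parameter count low: the head needs only C × n_classes weights regardless of the spatial size of the feature map.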


Experimental Condition Configuration
Table 3 lists the experimental conditions. To verify the performance of the FT-ResNet50 model on the enhanced FLAME-Cls dataset, this study compared the recognition performance of the FT-ResNet50 model with that of VGG, Inception, and ResNet. Table 4 shows the hyperparameter settings of the FT-ResNet50 model.

Evaluation Indicators
To evaluate comprehensively the effect of the forest fire identification method proposed in this paper, we used accuracy (Acc), precision (Pre), recall (Rec), specificity (Spe), and F1 score as evaluation indicators, as shown in the equations beginning with Equation (6).
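These five indicators follow directly from the counts of the binary confusion matrix (TP, FP, FN, TN), and can be computed as:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the five indicators from binary confusion-matrix counts."""
    acc = (tp + tn) / (tp + fp + fn + tn)   # fraction of all correct predictions
    pre = tp / (tp + fp)                    # correctness among predicted positives
    rec = tp / (tp + fn)                    # coverage of actual positives (sensitivity)
    spe = tn / (tn + fp)                    # coverage of actual negatives
    f1 = 2 * pre * rec / (pre + rec)        # harmonic mean of precision and recall
    return acc, pre, rec, spe, f1
```

For example, with 50 true positives, 10 false positives, 10 false negatives, and 30 true negatives, accuracy is 0.80 and precision equals recall at 50/60.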

Sample Augmentation
In this study, the traditional sample augmentation method was combined with the mixup-based sample augmentation method to expand the FLAME dataset, yielding a new forest fire dataset, FLAME-Cls. Figures 6 and 7 show the expansion effects of the traditional and mixup-based sample augmentation methods, respectively.

Using ResNet50 as the forest fire identification model (with the parameters of each convolution block determined on the ImageNet dataset), the effects of different sample augmentation methods were verified; the results are shown in Table 5. As an online augmentation strategy applied as training proceeds, mixup does not change the number of training samples in each round. The traditional scheme (offline expansion that supplements the basic training set) improves identification accuracy to a certain extent (from 73.66% to 75.18%) but also brings additional training costs. We finally adopted a combination of the two augmentation strategies and achieved a further performance improvement, namely 77.47% recognition accuracy.

Table 6 shows the recognition accuracy and loss of the proposed method on the training, validation, and test sets. The method achieved relatively good results on both the training and validation sets, while performance on the test set was lower, reflecting a large domain shift between the test data and the training and validation data, and raising the generalization requirements of the model. Domain shift can be understood as the difference in data distribution between two sample sets; generalization refers to how well a model performs on different distributions. When there is a large difference between training and test samples, a robust model is needed to generalize to samples it has not seen before.

The convergence curves of the loss function and the recognition accuracy over the training iterations are shown in Figures 8 and 9. The performance of the proposed network on the validation set also improved, reflecting relatively reliable generalization.

Influence of Different Backbone Networks on Recognition Accuracy
This research used VGG, Inception, and ResNet50 as backbone networks to test forest fire recognition rates. As shown in Table 7, compared with the other two structures, choosing ResNet50, with its residual structure, as the backbone alleviates the problem of gradient disappearance to a certain extent, which benefits model training. ResNet50 also replaces the fully connected layer used in VGG with global average pooling at the final output, which greatly reduces the number of network parameters and the risk of overfitting. Figure 10 compares the confusion matrices of the three backbone structures; ResNet50 achieved more accurate predictions than the other methods.

Influence of Different Backbone Networks on Recognition Accuracy
This research used VGG, Inception, and ResNet50 as the backbone network to test forest fire recognition rates. As shown in Table 7, compared with the other two structures, choosing ResNet50 with residual structure as the backbone network can to a certain extent alleviate the problem of gradient disappearance, which is beneficial for model training. ResNet50 replaces the fully connected layer in VGG with global average pooling in the final output, which greatly reduces the network parameters and the risk of overfitting. Figure 10 shows the comparison of confusion matrices for three different backbone network structures. It can be seen that ResNet50 achieved more accurate predictions than other methods.

Influence of Different Backbone Networks on Recognition Accuracy
This research used VGG, Inception, and ResNet50 as the backbone network to test forest fire recognition rates. As shown in Table 7, compared with the other two structures, choosing ResNet50 with residual structure as the backbone network can to a certain extent alleviate the problem of gradient disappearance, which is beneficial for model training. ResNet50 replaces the fully connected layer in VGG with global average pooling in the final output, which greatly reduces the network parameters and the risk of overfitting. Figure 10 shows the comparison of confusion matrices for three different backbone network structures. It can be seen that ResNet50 achieved more accurate predictions than other methods.


Influence of ResNet Network Depth on Identification Accuracy
This research explored the impact of network depth on model performance in terms of recognition accuracy and inference time. The experimental results are shown in Table 8. The backbone used in this experiment was ResNet, and four configurations were tested: ResNet18, ResNet34, ResNet50, and ResNet101. ResNet18 and ResNet34 use the BasicBlock structure, while the latter two use the Bottleneck structure. For deeper convolutional networks, the Bottleneck structure reduces the number of parameters to a certain extent and helps prevent overfitting. The data show that as the network deepens, inference time increases, i.e., image recognition becomes less efficient. Although ResNet101 achieved the highest recognition accuracy, its inference was about one third slower than that of the second-ranked ResNet50. In summary, it is important to choose the most appropriate network depth for a specific task. In this study, after balancing model complexity, training cost, test accuracy, and other factors, ResNet50 was finally selected as the backbone network for feature extraction.
To further explore the impact of training methods on recognition accuracy, this study tested different activation functions and optimizers; the results are shown in Table 9. The experimental results show that the best recognition effect was obtained by the combination of Mish and Adam. A schematic diagram of the activation functions and the convergence curves of the different optimizers are shown in Figures 11 and 12, respectively.


The Effect of Transfer Learning Strategy on Identification Accuracy
This paper explores the impact of transfer learning strategies on model testing accuracy under different network architectures. ResNet50 and VGG16 were used as backbone networks to extract features, both composed of five convolution blocks, denoted ConvBlock1-ConvBlock5. The transfer learning strategy used the weight parameters of the original network trained on the ImageNet dataset as the initialization parameters for the corresponding part of the FT-ResNet50 model proposed in this paper, then selected fixed and fine-tuned parameters on this basis to accelerate network convergence. To this end, the authors designed six transfer learning schemes for each network model, as shown in Table 10.
To explore how the proposed model extracts the features of a given input image, the features captured by the backbone network ResNet50 were visualized, as shown in Figure 13. ResNet50 can extract rich features; the combination of low-level edge information and high-level semantic information guides feature acquisition toward target-related regions more specifically, thereby improving the accuracy of target classification.
Figure 14 shows the visualization of the 64 convolution kernels of ConvBlock1 in the ResNet50 network. Different convolution kernels capture different image features, which ensures that the FT-ResNet50 model has strong feature acquisition ability.

Discussion
The experimental results in Table 5 show that the proposed sample expansion strategy had a positive impact on the accuracy of forest fire identification. Expanding the sample size made the training data more diverse, which can reduce the domain shift between training and testing to a certain extent. Therefore, compared with the dataset before sample expansion, the expanded dataset achieved higher forest fire recognition accuracy.
The choice of loss function also affects the accuracy of forest fire identification. The experimental results in Tables 6 and 7 show that, compared with the traditional cross-entropy loss function, the Focal Loss function focused more on the training of difficult samples, which helped to improve the learning ability of the network.
The experimental results in Table 8 indicate that the depth of the ResNet network affects both the recognition accuracy and the operational performance of the model. As the network deepens, the accuracy of forest fire identification improves, but the time consumed by inference also increases. After weighing model complexity, training cost, test accuracy, and other factors, this study finally selected ResNet50 as the backbone network for feature extraction.
The selection of activation function and optimizer affects forest fire recognition accuracy. The experimental results in Table 9 show that better recognition results can be obtained by fine-tuning the convolution blocks with Mish as the activation function and Adam as the optimizer. This is because the Mish activation function can effectively alleviate gradient vanishing in network training, and the Adam optimizer can keep the model from falling into local optima during training.
In this study, six transfer learning schemes were designed for each network model, given in the second column of Table 10. Specifically, "Scheme 0" represents the baseline scheme, in which the pre-trained weights of all five convolution blocks were fixed and the network required no fine-tuning during training; "Scheme 1" to "Scheme 4" denote fine-tuning schemes that progressively unlock fine-tuning from the deepest layers toward the shallower ones; and "Scheme 5" fine-tunes all the pre-trained parameters. Observing the results in Table 10, we can see that the ResNet50 network achieved the highest recognition accuracy.

Discussion
The experimental results in Table 5 show that the proposed sample expansion strategy had a positive impact on the accuracy of forest fire identification. Expanding the sample size made the training data more diverse, which reduces, to a certain extent, the domain shift between training and testing. Therefore, compared with the dataset before sample expansion, the expanded dataset achieved higher forest fire recognition accuracy.
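The sample expansion strategy includes, per the Conclusions, a mixup-based augmentation alongside traditional methods. A minimal sketch of mixup, assuming NumPy arrays and one-hot labels; the `alpha` default is a commonly used value, not necessarily the authors' setting:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Mixup augmentation: blend two samples and their one-hot labels.

    A mixing coefficient lam is drawn from Beta(alpha, alpha); the mixed
    sample and label are convex combinations of the two inputs.
    """
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y, lam
```

Because the mixed images interpolate between real samples, the effective training distribution is smoother than the raw dataset, which is one way such augmentation improves generalization.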
The choice of loss function also affects the accuracy of forest fire identification. The experimental results in Tables 6 and 7 show that, compared with the traditional cross-entropy loss function, the focal loss function focused more on the training of difficult samples, which helped to improve the learning ability of the network.
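The focal loss achieves this by down-weighting well-classified examples so that hard samples dominate the gradient. A minimal binary-classification sketch, assuming NumPy arrays; the `gamma` and `alpha` defaults are the commonly used values, not necessarily those behind Tables 6 and 7:

```python
import numpy as np

def focal_loss(probs, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t).

    probs: predicted probability of the positive class; targets: 0/1 labels.
    The (1 - p_t)^gamma factor shrinks the loss of easy examples.
    """
    probs = np.clip(probs, 1e-7, 1.0 - 1e-7)
    p_t = np.where(targets == 1, probs, 1.0 - probs)
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))
```

For a confidently correct prediction (p_t near 1) the modulating factor is near zero, while a misclassified sample keeps almost its full cross-entropy contribution.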
The experimental results shown in Table 8 indicate that the depth of the ResNet network affects the recognition accuracy and operational performance of the model. As the network deepens, the accuracy of forest fire identification improves, but the computation time also increases. After weighing model complexity, training cost, test accuracy and other factors, this study finally selected ResNet50 as the backbone network for feature extraction.
The selection of activation function and optimizer also affects forest fire recognition accuracy. The experimental results in Table 9 show that better recognition results can be obtained by fine-tuning the convolution blocks with Mish as the activation function and Adam as the optimizer. This is because the Mish activation function helps preserve gradient flow during network training, while the Adam optimizer helps the model avoid becoming trapped in local optima.
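Mish is defined as x · tanh(softplus(x)); unlike ReLU's hard zero cut-off, its smooth, non-monotonic shape lets small negative inputs pass a small signal and gradient. A minimal NumPy sketch:

```python
import numpy as np

def mish(x):
    """Mish activation: x * tanh(softplus(x)).

    softplus(x) = log(1 + e^x); the result is smooth everywhere,
    approximately x for large positive inputs, and slightly negative
    (rather than exactly zero) for moderate negative inputs.
    """
    return x * np.tanh(np.log1p(np.exp(x)))
```

Note that `np.log1p(np.exp(x))` can overflow for very large inputs; a production implementation would use a numerically stable softplus.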
In this study, six transfer learning schemes were designed for each network model, given in the second column of Table 10. Specifically, "Scheme 0" is the baseline, in which the pre-trained weights of all five convolution blocks were fixed and the network required no fine-tuning during training; "Scheme 1" to "Scheme 4" denote fine-tuning schemes that progressively unlock fine-tuning from the deepest layers toward the shallower ones; and "Scheme 5" fine-tunes all the pre-trained parameters. Observing the results in Table 10, we can see that the ResNet50 network achieved the highest recognition accuracy by fixing the first two ConvBlocks and fine-tuning the last three, while the VGG16 network achieved its best detection result by fixing the first three ConvBlocks and fine-tuning the last two. The accuracy of "Scheme 1" and "Scheme 7", which fine-tuned only the last convolution block, is lower than that of the schemes that fine-tuned the last two convolution blocks; e.g., in the ResNet architecture, "Scheme 1" was 0.81 percentage points lower than "Scheme 2" and 2.5 percentage points lower than "Scheme 3". This is because features acquired by deeper layers are often abstract and strongly category-correlated; fixing their parameters therefore inhibits, to a certain extent, the ability of the network to capture discriminative information for the current task. In other words, the images in the ImageNet dataset tend to depict natural landscapes, animals, plants, etc., so the high-level semantic information the backbone learns from them has low applicability when transferred to the current forest fire image task. Low-level information such as edges, texture and color, by contrast, is largely universal: regardless of the image type, it covers similar features to a certain extent.
Transferring and fixing trained shallow network weights in the proposed model helps improve network convergence and prevents training from deteriorating. However, the number of convolution blocks whose parameters should be fixed requires specific analysis for each problem; it is difficult to identify, at the theoretical level, the exact cut-off most beneficial to the transfer effect, and the answer differs between models. For example, for the current forest fire image recognition problem, it is better for ResNet50 to fix the first two convolution blocks, and for VGG16 to fix the first three. "Scheme 5" and "Scheme 11" fine-tune all the transferred parameters, and the results show that this training method achieved relatively good performance. Nevertheless, "Scheme 5" and "Scheme 11" are worse than "Scheme 3" and "Scheme 8", because fine-tuning all parameters is prone to large fluctuations, especially in the initial stage: the features propagated layer by layer influence each other during training, and errors can accumulate among the parameters being optimized, resulting in instability in the training process. Finally, we selected "Scheme 3" as the fine-tuning method for the transfer learning strategy in the subsequent experiments.
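The fix-versus-fine-tune choice behind these schemes amounts to toggling `requires_grad` on each ConvBlock's parameters. A PyTorch-style sketch, using a tiny stand-in for the five ConvBlocks rather than the actual pre-trained ResNet50 backbone:

```python
import torch.nn as nn

def make_blocks():
    """Tiny stand-in for the five ConvBlocks of the backbone.

    Channel sizes are illustrative only; in the paper these would be the
    ImageNet-pre-trained ConvBlock1-ConvBlock5 of ResNet50.
    """
    return nn.ModuleList(
        nn.Sequential(nn.Conv2d(3 if i == 0 else 8, 8, 3, padding=1), nn.ReLU())
        for i in range(5)
    )

def apply_scheme(blocks, n_fixed):
    """Freeze the first n_fixed blocks; leave the rest trainable.

    "Scheme 3" in the paper corresponds to n_fixed=2 for ResNet50
    (fix ConvBlock1-2, fine-tune ConvBlock3-5).
    """
    for i, block in enumerate(blocks):
        for p in block.parameters():
            p.requires_grad = i >= n_fixed
    return blocks
```

Only the trainable parameters would then be handed to the optimizer, e.g. `torch.optim.Adam(p for p in blocks.parameters() if p.requires_grad)`.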
In order to verify that the FT-ResNet50 model can effectively extract image features, this paper visualizes the features captured by FT-ResNet50. The results in Figure 13 show that as network depth increases, so does the level of the extracted features: low-level features such as edges, texture and color, initially extracted by ConvBlock1 and ConvBlock2, are transformed in ConvBlock3 and subsequent convolution blocks into high-level abstract features with stronger task relevance. Figure 14 shows that different convolution kernels also capture different features of the image; the more varied the convolution kernels, the better the feature extraction ability. Therefore, the FT-ResNet50 model proposed in this paper can extract richer features from fire images, and by combining low-level edge information with high-level semantic information can better improve the accuracy of forest fire classification.
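Intermediate feature maps of the kind shown in Figures 13 and 14 are typically captured with forward hooks. A PyTorch sketch, assuming a generic model and illustrative layer names rather than the authors' exact visualization code:

```python
import torch
import torch.nn as nn

def capture_features(model, layer_names, x):
    """Run x through model and collect the outputs of the named submodules.

    Forward hooks record each requested layer's activation during a single
    forward pass; the captured tensors can then be plotted as feature maps.
    """
    feats, hooks = {}, []
    for name, module in model.named_modules():
        if name in layer_names:
            hooks.append(module.register_forward_hook(
                lambda m, inp, out, name=name: feats.__setitem__(name, out.detach())))
    model(x)
    for h in hooks:
        h.remove()  # clean up so later passes are unaffected
    return feats

# Usage on a toy model (layer names "0", "1" come from nn.Sequential indexing):
model = nn.Sequential(nn.Conv2d(3, 4, 3), nn.ReLU())
feats = capture_features(model, {"0"}, torch.randn(1, 3, 8, 8))
```

For a real ResNet50 the names would instead be stage identifiers such as `"layer1"` through `"layer4"` in torchvision's naming.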

Conclusions
Forest fire recognition based on image processing is an important way to assist the early detection of forest fires, and deep learning is an important research direction for forest fire identification. Taking an improved ResNet50 as the backbone, this paper proposes a forest fire identification model, FT-ResNet50. FT-ResNet50 combines mixup-based augmentation with traditional sample augmentation methods to increase the number of training samples, thereby improving the generalization ability of the model. Considering the effects of the loss function, backbone network, network depth and transfer learning strategy on the accuracy of forest fire identification, this study determined the optimal configuration of the model. This paper also discusses the influence on forest fire recognition of fine-tuning different convolution blocks. The experimental results show that, based on UAV images with limited labeled samples, the proposed FT-ResNet50 model can achieve high-performance forest fire recognition.