PDAM–STPNNet: A Small Target Detection Approach for Wildland Fire Smoke through Remote Sensing Images

Abstract: The target detection of smoke through remote sensing images obtained by means of unmanned aerial vehicles (UAVs) can be effective for monitoring early forest fires. However, smoke targets in UAV images are often small and difficult to detect accurately. In this paper, we use YOLOX-L as a baseline and propose a forest smoke detection network based on a parallel spatial domain attention mechanism and a small-scale transformer feature pyramid network (PDAM–STPNNet). First, to enhance the proportion of small forest fire smoke targets in the dataset, we use component stitching data enhancement to generate small forest fire smoke target images in a scaled collage. Then, to fully extract the texture features of smoke, we propose a parallel spatial domain attention mechanism (PDAM) to consider the local and global textures of smoke with symmetry. Finally, we propose a small-scale transformer feature pyramid network (STPN), which uses the transformer encoder to replace all CSP_2 blocks in turn on top of YOLOX-L's FPN, effectively improving the model's ability to extract small-target smoke. We validated the effectiveness of our model on a self-made dataset, the Wildfire Observers and Smoke Recognition Homepage, and the Bowfire dataset. The experiments show that our method has better detection capability than previous methods.


Introduction
Forest fires can cause widespread forest mortality, bringing huge losses to forest ecological resources and the social economy, and serious forest fires can even lead to human casualties [1,2]. In recent years, the frequency of forest fires has increased, and the extent of the damage has been increasing year by year. Wildfires in Australia burned more than 1300 houses and approximately 6 million hectares of land in January 2020 [3]. As combustible material in forests is not usually dry, burning produces large amounts of fine solid particles that form smoke [4]. Early smoke areas are larger than flame areas, and fires can easily be covered by smoke, making monitoring smoke an effective means of conducting early forest fire monitoring [5]. If forest fires are not responded to in a timely manner, they cause greater damage and increase the cost of fire suppression [6]. If we can detect the distinctive visual feature of smoke in the early stages of a forest fire, we can control small fires that have not yet spread and reduce the damage they cause to a minimum.
To meet the demand for real-time performance and accuracy in forest fire monitoring tasks, researchers have conducted research on forest fire smoke monitoring using UAVs. The main contributions of this paper are as follows:
(1) Component stitching data enhancement is used to generate images with smaller-scale targets in a scaled collage. The collage generates images of the same size as the original images, ensuring that the model can effectively detect small targets of forest fire smoke without incurring additional overheads.
(2) A parallel spatial domain attention mechanism is proposed, which contains a parallel local attention mechanism module and a global attention mechanism module as its sub-modules. The local attention mechanism module explores the local deep texture features of smoke and the relationships between features, while the global attention mechanism module focuses on the global texture features of smoke; each takes half of the channels of the feature map, and concat fusion of the two fully considers the smoke texture features and improves the results.
(3) A small-scale transformer feature pyramid network is proposed to capture rich global and contextual information, with the aim of improving the detection of small targets in forest fire smoke detection tasks and avoiding, as far as possible, the misdetection of small-target smoke.
We designed PDAM-STPNNet to improve the model's feature extraction and feature fusion for smoke. The application of PDAM-STPNNet to the forest fire smoke detection task aims to improve smoke detection accuracy and reduce the error rate of detection, which is important for the timely monitoring of forest fires. The working principle is shown in Figure 1: the UAV forest fire monitoring system captures the input images and performs the target detection of smoke in the images. First, component stitching data enhancement is used to increase the number of small target samples. Next, basic features are extracted using a traditional backbone. Then, the parallel spatial domain attention mechanism is used to combine local and global texture features. Finally, a small-scale transformer feature pyramid network is used to enhance the fusion effect of small target features. After these steps, the UAV forest fire monitoring system obtains the final detection results with the help of component stitching data enhancement and PDAM-STPNNet.
(1) Huangfengqiao Forestry is located in the east and west of You County, Hunan Province, and is dominated by low and medium mountainous landscapes, with a maximum elevation of 1270 m and a minimum elevation of 115 m. The forest type is mainly fir plantation.
(2) Qingyang Lake State Forestry Field is located in the remnants of the Xuefeng Mountains, the main mountain system of Ningxiang, in a zone of transition from low and medium mountains to hills. The forest type is mainly subtropical deciduous broad-leaved forest. The woodland is interspersed with residential areas, with obvious traces of human activity. The different background characteristics of the two woodlands provide a good test of the effectiveness of our method in different complex environments. Moreover, the two woodlands are located in a subtropical monsoon climate with a wide variety of trees and are prone to fires in the dry summer and autumn, making them valuable for field testing.

Component Stitching Data Enhancement
The definition of image enhancement is very broad. Usually, image enhancement is used to purposefully emphasise the overall or local features of an image and improve the clarity of the image. When drones monitor forest fires, they are often far away from the fire source for reasons of flight safety and the monitoring range of the drone [26]. At this time, the smoke captured by the UAV's camera usually corresponds to small targets. In order to enhance the detection of small targets of forest fire smoke, we use component stitching data enhancement to emphasise the differences between the features of different targets in the images and to balance the proportion of targets of different sizes in the dataset during model training [27].
In order to ensure that the images of objects such as smoke and trees in a forest environment are not distorted, the aspect ratio needs to be maintained when stitching the images. This is achieved by reducing and stitching together k regular images arranged in the same number of rows and columns to form a stitched image, where k is a perfect square, e.g., 1, 2^2 = 4, or 3^2 = 9. The spatial resolution of the original individual images is (h, w), and the spatial resolution of each component image after scaling is (h/row, w/col), where row represents the number of rows in the collage, col represents the number of columns in the collage, and k = row × col represents the total number of components. Experiments show that a collage of two rows and two columns gives the best result, i.e., k = 4, as shown in Figure 3.
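The component-size arithmetic above can be sketched in plain Python; this is an illustrative reconstruction, with the image resolution (1080 × 1920) chosen purely as an example:

```python
import math

def component_size(h, w, k):
    """Size of each component image in a k-part collage; per the scaling
    rule above, k must be a perfect square (e.g. 1, 4, 9)."""
    rows = cols = math.isqrt(k)
    assert rows * cols == k, "k must be a perfect square"
    return h // rows, w // cols

def collage_size(h, w, k):
    """The stitched image keeps the original resolution."""
    rows = cols = math.isqrt(k)
    ch, cw = component_size(h, w, k)
    return ch * rows, cw * cols

# With the best-performing setting k = 4 (two rows, two columns),
# each component is half the original height and width, while the
# collage matches the original image size:
print(component_size(1080, 1920, 4))  # (540, 960)
print(collage_size(1080, 1920, 4))    # (1080, 1920)
```

Because the collage has the same resolution as a regular input image, it adds no cost to the forward pass.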

It can be seen that the image is scaled to half of its original length and width and then collaged, with the collaged image maintaining its original size. The collage increases the proportion of small targets in the dataset by creating targets with smaller scales. As the composite image remains the same size as a regular image, no additional overhead is involved in the forward propagation of the network model. An image collage is not an infinite augmentation of the images in the dataset. To determine exactly how many images need to be collaged, a feedback paradigm is set. During the training of PDAM-STPNNet, the proportion of the loss caused by small targets can be calculated after each forward propagation. If the proportion of the loss caused by small targets in the current iteration is less than a threshold, the collaged images are used in the next iteration; otherwise, no image collage is performed, i.e.,

I_{t+1} = I_c if r_s^t < τ, and I_{t+1} = I otherwise,

where I_{t+1} denotes the input of the next iteration, I_c denotes the use of a collage, I denotes the use of the original images, τ denotes the set threshold, and r_s^t denotes the percentage of the loss caused by small targets in the current iteration.
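The feedback rule can be sketched as a small helper; the threshold value tau = 0.3 is an illustrative assumption, not a value reported in the text:

```python
def next_iteration_input(r_small, tau=0.3):
    """Feedback rule sketched from the text: if the share of the loss
    contributed by small targets (r_small, i.e. r_s^t) falls below the
    threshold tau, the next iteration trains on collaged images;
    otherwise it trains on the original images.
    tau = 0.3 is an illustrative placeholder."""
    return "collage" if r_small < tau else "original"

print(next_iteration_input(0.1))  # collage
print(next_iteration_input(0.5))  # original
```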

PDAM-STPNNet
Given the operational characteristics of UAV forest patrols, where the UAV carries image-capture equipment to collect images of the forest in real time while flying at high altitude, the large area covered by the field of view requires models that can accurately identify forest fire features and locate them precisely. Forest fires usually develop inconspicuously, with smoke being produced when a fire occurs. The timely and accurate detection of smoke is difficult due to the small size of the fire, the lightness of the smoke, and the presence of trees, swaying branches, exposed grey and white rocks, and other smoke-like objects.
In order to efficiently and accurately identify smoke in different stages and states from UAV aerial images and to minimize forest fire damage, the objectives of this paper are to overcome the difficulty that conventional methods have in detecting the small targets presented by forest fire smoke and to identify the location of a fire's starting point through the positioning information obtained by UAVs. Therefore, this paper improves the model architecture based on YOLOX-L and proposes PDAM-STPNNet for target detection based on UAV aerial photography of forest fire smoke. Our model structure is shown in Figure 4.

The convolution blocks in the dashed black box all belong to the backbone, after which comes the FPN structure. In contrast to YOLOX-L, we added the PDAM (enclosed by the red dashed box at the end of the backbone) in order to take full account of the local and global texture features of smoke. The features extracted from the backbone are later fused using the STPN (enclosed by the purple dashed box), which uses a transformer encoder instead of the original CSP_2 to enhance the detection of small targets of forest fire smoke. For more information, see the following three subsections.

Parallel Spatial Domain Attention Mechanism (PDAM)
A. Local attention mechanism module (LAM)
The local attention mechanism module is based on deep local texture features and the relationships between them. Texture features express the spatial distribution and composition of the target, and fully extracting the local texture features of an object improves the stability of the model and distinguishes the object from nearby backgrounds that can easily be confused with smoke. The enhanced feature extraction is intended to target the local texture features of the smoke specifically.
As shown in Figure 5, the LAM assigns horizontal weight coefficients to each row of features through the horizontal attention mechanism and vertical weight coefficients to each column of features through the vertical attention mechanism. Each row feature obtained by dimensionality reduction is symmetrical with each column feature, so the transmission of the data stream in the LAM is symmetrical, which is more conducive to extracting complete and effective features.

We process the weighted features on the rows and columns as follows, mining the deep feature information and expanding the weight coefficients by multiplication with a minimum-term penalty to obtain the extended features:

EF = c_I · c_II − min(c_I, c_II) (5)

Here, the local maximum feature is taken as the significant feature and summed with α multiples of the minimum-value feature, where α is a decimal between 0 and 1. In this way, the maximum value serves as the main factor while the other feature is still taken into account, yielding comprehensive features.
The LAM integrates the processed feature information by means of concatenation in the following steps.
where e_ij is the weight coefficient of the LAM, i indexes the temporal feature, j indexes the sequence feature, h_j is the hidden-layer information of sequence feature j, c_I = {c_1, c_2, ..., c_{i−1}, c_i} is the sequence of features in the column dimension, and c_II = {c_1, c_2, ..., c_{i−1}, c_i} is the sequence of features in the row dimension. EF stands for the deep (extended) feature information, max for the maximum operation, min for the minimum operation, and CF for the combined feature information.
The LAM weight assignment procedure is shown in Figure 6.
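As a sketch of how Equation (5) and the max/min combination described above act element-wise, assuming plain Python lists stand in for the row- and column-weighted feature tensors (the combined_features form of the α-weighted sum is our reading of the text, not a formula given in it):

```python
def extended_features(c_row, c_col):
    """Element-wise sketch of Equation (5): EF = c_I * c_II - min(c_I, c_II),
    combining row- and column-weighted features with a minimum-term penalty."""
    return [a * b - min(a, b) for a, b in zip(c_row, c_col)]

def combined_features(c_row, c_col, alpha=0.5):
    """Our reading of the max + alpha * min combination: the local maximum
    feature dominates while alpha (a decimal in (0, 1)) keeps the other
    feature in play. alpha = 0.5 is an illustrative placeholder."""
    return [max(a, b) + alpha * min(a, b) for a, b in zip(c_row, c_col)]

# Integer toy features keep the arithmetic easy to check by hand:
print(extended_features([3, 1], [2, 4]))          # [4, 3]
print(combined_features([3, 1], [2, 4], 0.5))     # [4.0, 4.5]
```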
B. Global attention mechanism module (GAM)
The global attention mechanism module focuses on spatial features from a global perspective, fully considering the contextual relationship between forest fire smoke and the forest background. It is designed to comprehensively capture discriminative information; exclude the interference of redundant information, occlusion, and blurring in the image; improve the adaptability of the model; effectively integrate more comprehensive features of forest fire smoke; and preserve the similar features of different smoke targets during training. The structure of the GAM is shown in Figure 7.
(1) In general, larger convolution kernels are better at perceiving large target objects, while smaller convolution kernels are better at extracting features from small targets. However, the diffusion direction and concentration of different kinds of smoke vary significantly; some backgrounds are complex, and targets are not easy to find. Therefore, we add branches with convolution kernels of different sizes, using kernels of sizes 3 × 3, 5 × 5, and 7 × 7 to improve recognition accuracy.
(2) The GAM structure divides the feature map obtained after 1 × 1 convolution equally into four scales, where the 3 × 3 convolution uses depthwise separable convolution to reduce the number of parameters and the computational effort.
(3) Remote sensing images of forest fire smoke have diverse scenes, and to adapt to the complex and variable forest background we use the integrated normalization method of switchable normalization (SN) instead of the traditional batch normalization (BN) layer.
The statistics of SN are obtained by first calculating the statistics of BN, LN, and IN; six weighting parameters (corresponding to the means and variances, respectively) are then introduced to calculate the weighted mean and weighted variance as the mean and variance of SN [28]. The weights are normalised using the softmax function. The input of a hidden convolutional layer of PDAM-STPNNet can be represented as a feature map with four dimensions (N, C, H, W), which represent the minibatch size, the number of channels, and the height and width of the channels, respectively. h_ncij denotes a pixel, and ĥ_ncij denotes the normalized result of the corresponding pixel h_ncij; w_k and w′_k are the weighting factors for the means and variances; μ_k is the mean; and σ²_k is the variance. The model learns the scaling factor γ and the offset factor β, while ε is a small constant that prevents division by zero.
where λ_k denotes the control parameters corresponding to the three sets of statistics.
The control parameters are all initialised to 1 and optimised during backpropagation. The control parameters λ_k are normalised using the softmax function, and the weight coefficients w_k are calculated from them.
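The softmax weighting of the SN statistics can be sketched as follows; the per-normalizer means and variances used here are illustrative placeholders, not values from the model:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of control parameters."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def sn_statistics(mu, var, lam_mu, lam_var):
    """Sketch of switchable normalization's statistic mixing: the control
    parameters lambda_k for BN, LN, and IN pass through softmax to give the
    weights w_k (and w'_k), and the SN mean/variance are the weighted sums
    of the three per-normalizer statistics."""
    w_mu = softmax(lam_mu)
    w_var = softmax(lam_var)
    mean = sum(w * m for w, m in zip(w_mu, mu))
    variance = sum(w * v for w, v in zip(w_var, var))
    return mean, variance

# Equal control parameters (all initialised to 1) give equal weights of 1/3:
mean, variance = sn_statistics([0.0, 0.3, 0.6], [1.0, 1.0, 1.0], [1, 1, 1], [1, 1, 1])
print(round(mean, 3), round(variance, 3))  # 0.3 1.0
```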
(4) The SiLU activation function is used instead of the commonly used ReLU or sigmoid activation functions to improve the learning convergence of the model. The SiLU activation function is calculated as f(x) = x · sigmoid(x).
The structure allows the neural network model to focus more on the global texture features of smoke, such as granular smoke with particles of similar colour and size and its spatial distribution, and also to distinguish backgrounds that are similar to smoke features, improving the accuracy of the extraction of detailed smoke features. Figure 8 shows a detailed design schematic of the GAM structure.
The multi-scale convolutional structure, while expanding the receptive field, also increases the number of computational parameters. At the same time, the GAM uses a multi-scale structure, multiple 1 × 1 and 3 × 3 small-sized convolution kernels, and a deeper network model. Therefore, we use depthwise separable convolution to build the GAM, which allows the complex multi-scale structure to operate efficiently.
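For reference, a minimal sketch of the SiLU activation, f(x) = x · sigmoid(x):

```python
import math

def silu(x):
    """SiLU (swish) activation: f(x) = x * sigmoid(x) = x / (1 + exp(-x))."""
    return x / (1.0 + math.exp(-x))

print(silu(0.0))            # 0.0
print(round(silu(1.0), 4))  # 0.7311
```

Unlike ReLU, SiLU is smooth and non-monotonic for negative inputs, which is what aids convergence.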
Depthwise separable convolution splits the traditional convolution operation into two steps: first, depthwise convolution, i.e., a one-to-one two-dimensional convolution for each channel of the input feature map, which reduces the parameter computation; then, point-wise convolution, in which a 1 × 1 convolution kernel combines the features of each channel, as in the traditional (3D) convolution operation. The structure of depthwise separable convolution is shown in Figure 9. Assuming that the size of the input feature map is S_in × S_in, the number of channels is C, the size of the convolution kernels is S_K × S_K, and the total number of 3D convolution kernels is N, the computational costs of conventional convolution and depthwise separable convolution are, respectively,

F_conv = S_in^2 × S_K^2 × C × N and F_dsc = S_in^2 × S_K^2 × C + S_in^2 × C × N.
Thus, the computational ratio of depthwise separable convolution to conventional convolution is

(S_in^2 × S_K^2 × C + S_in^2 × C × N) / (S_in^2 × S_K^2 × C × N) = 1/N + 1/S_K^2.

It can be seen that the reduction in computational effort for depthwise separable convolution depends on the size S_K × S_K of the 2D convolution kernels and the total number N of 3D convolution kernels.
In practice, depthwise separable convolution generally uses a convolution kernel of size 3 × 3, in which case conventional convolution is roughly 10 times more computationally intensive than depthwise separable convolution. The PDAM is placed at the end of the backbone; the feature channels generated by the LAM and GAM are each compressed to half of their original size, and the two are fused symmetrically by concatenation.

Small-Scale Transformer Feature Pyramid Network (STPN)
In the remote sensing image dataset of forest fire smoke, many very small smoke subjects are included. In this paper, a small-scale transformer feature pyramid network, with the specific structure shown in Figure 4, is proposed to adapt to the single-classification task of forest fire smoke detection and to enhance the prediction capability for small forest fire smoke targets. Different flight altitudes of drones and varying fire sizes often lead to drastic changes in the scale of smoke objects. The structure mitigates the negative effects caused by these scale changes, thus enhancing the feature fusion capability for smoke images of different scales. Figure 10 shows the specific architecture of the transformer encoder. The transformer has achieved excellent results in the fields of image recognition, target detection, and semantic segmentation [29]. We employed the transformer encoder to replace all the CSP_2 blocks in the neck section. Compared to the CSP_2 blocks in CSPDarknet53, the transformer encoder can capture global information and rich contextual information. Each transformer encoder contains two sub-layers, the multi-headed attention layer and the multilayer perceptron (MLP) layer, which are connected using residuals. The multi-headed attention layer helps the model to focus on different locations to accommodate multiple fires in the image.
The multi-headed attention is computed as MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V) and the projections W_i^Q, W_i^K, W_i^V, and W^O are parameter matrices. We set the number of heads in the multi-headed attention layer to 8. At the same time, the transformer encoder can capture different local information and search for features through a self-attention mechanism [30]. Therefore, the transformer encoder exhibits better performance in the detection of small forest fire smoke targets.
Based on YOLOX-L, we use the transformer encoder to replace the original CSP_2 blocks; the introduction of the transformer encoder and the addition of a prediction head together form the STPN. The STPN is applied to the low-resolution feature maps to reduce computational and memory costs.
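The encoder block described above, two sub-layers with residual connections, can be sketched in PyTorch. This is a minimal illustration, not the authors' exact implementation; the normalisation placement and MLP width are our assumptions:

```python
import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    """One encoder block: multi-head attention and an MLP sub-layer,
    each wrapped with a residual connection (a sketch of the structure
    described in the text, with assumed pre-norm layout)."""
    def __init__(self, dim, num_heads=8, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.SiLU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):  # x: (batch, tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual 1
        x = x + self.mlp(self.norm2(x))                    # residual 2
        return x

# A (B, C, H, W) feature map is flattened to (B, H*W, C) tokens before the block.
x = torch.randn(1, 8 * 8, 256)
y = TransformerEncoderBlock(256, num_heads=8)(x)
print(y.shape)  # torch.Size([1, 64, 256])
```

Because the block preserves the token shape, it can drop in where a CSP_2 block sat, after flattening and before reshaping back to a feature map.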

UAV Forest Fire Monitoring System
Traditional technical means of forest fire monitoring have various problems, such as blind monitoring areas, poor real-time performance, high operating costs, and high resource consumption [31]. To address these problems, this paper presents a UAV forest fire monitoring system based on PDAM-STPNNet which detects smoke from images returned by UAVs in real-time to determine fire conditions. When a fire occurs, an alarm can be issued and the UAV can be operated from a distance to photograph the situation around the fire, providing information for decision making in relation to rescue and firefighting operations.
The UAV forest fire monitoring system has three sub-systems: (1) a UAV-gimbal camera system, (2) a ground control system, and (3) a ground station terminal monitoring system. (1) The UAV-gimbal camera system uses the YK6-30 model six-rotor UAV. This model is equipped with jiYi K++v2 for flight control; its maximum load is 30 kg and its effective flight time is 20 min, which meets the mission requirements. The UAV experiences high-frequency vibration and angular oscillation in flight and is therefore equipped with a gimbal camera for image stabilisation and angular compensation. (2) The ground control system is mainly used for entering the inspection range and inspection altitude, starting inspection tasks, viewing the UAV flight status, and other functions. (3) The ground station terminal monitoring system receives high-definition images from the gimbal camera and processes the images for smoke target detection.
The UAV forest fire monitoring system designed in this paper uses a UAV equipped with a high-definition camera and a ground control system which uploads a planned route. After receiving the automatic mission command, the UAV is controlled in ortho mode (gimbal pitch axis: vertical, down) and then takes off and ascends to the route altitude, using the GPS positioning system to feed real-time position information to the ground station terminal monitoring system. While the drone is flying on the route, the pod transmits the image information collected in real time to the ground station terminal monitoring system, which uses PDAM-STPNNet to detect smoke and analyse whether a fire has occurred. When a fire is judged by the monitoring system to have occurred, the ground station sends an alert to the fire service. The workflow is shown in Figure 11.

Results
This section experimentally verifies the effectiveness of PDAM-STPNNet for smoke detection and compares it with other related models on the same test set. This section covers dataset acquisition, evaluation metrics, the experimental environment and setup, performance analysis, the analysis of method effects, a comparison between different models, ablation experiments, and practical application testing.


Dataset Acquisition
In order to train the model proposed in this paper and test its effectiveness in the smoke detection task, a home-made dataset of remotely sensed images of forest fire smoke from aerial photographs taken by drones was developed. The dataset is divided into three parts.
The first part of the dataset comes from a number of publicly available video smoke datasets, such as (1) a public dataset published by the Signal Processing Group of Bilkent University, Turkey [32]; (2) a smoke dataset published by the Machine Intelligence Laboratory of the University of Salerno, Italy [33]; (3) a public dataset published by Professor Yuan Feiniu [34]; and (4) a computer vision and pattern recognition laboratory public dataset [35]. We collected 6928 smoke images from the video datasets using screenshots. Of these, we removed some low-resolution images in which it was difficult for the human eye to distinguish the smoke targets, leaving 5935 images with a total of 2638 small smoke targets. The smoke in these images has obvious features that facilitate the extraction and learning of smoke features by the model. They cover various forms of smoke on the ground and can better reflect the high transparency and lack of obvious edges of the smoke itself, but this ground dataset of close observation scenes differs in texture, colour, and background from images taken by a UAV, and it is difficult to adapt to the overhead characteristics of aerial photography using ground data alone.
The second part of the dataset comes from the FLAME dataset [36], a UAV aerial smoke video dataset, from which we produced an image dataset by extracting video frames (5814 images in total). Of these, we removed some images with camera blur and low-quality frame extraction, leaving 4652 images with a total of 2394 images of small target smoke. These images were captured with a UAV camera in the air and have the characteristics of remote sensing images; they better reflect the characteristics of small smoke targets and long-distance overhead views under UAV remote sensing conditions and are more suitable for the target detection task of UAV aerial photography scenes. However, there is only one data scene, from a single location in an Arizonan pine forest.
The third part of the dataset is derived from images obtained from aerial photography of simulated forest fire smoke scenarios using drones. In order to improve the robustness of the model, we lit smoke cakes made of fresh branches, dried grass, flour, rosin, and ammonium chloride in various scenarios ranging from school woods to the tops of buildings to open fields in the countryside. We used a Phantom 4 Pro, manufactured by DJI. We flew the drone between 10 m and 90 m in the air and took a total of 1399 images of the smoke. Of these, we removed some poorly angled, poorly lit and blurred images, leaving 1093 images and a total of 539 images of small target smoke. These images simulate the production of smoke when a forest fire occurs, with a background similar to the forest. The purpose of capturing these images was to enrich the forest fire scenario under UAV remote sensing and to enhance the generalisation of the model. The different flight altitudes of the UAV and the multiple smoke situations can better simulate a realistic UAV inspection scenario. The captured scenes are shown in Figure 12.

As can be seen from the figure, the smoke target in Figure 12a is too small and too similar to the background, making it difficult for YOLOX-L to extract features and detect the presence of smoke. The smoke in Figure 12b is not well defined against the surrounding trees, making it difficult for YOLOX-L to extract global features of the image. The smoke in Figure 12c is diffuse and its local texture features are more complex.
The analysis shows that the YOLOX-L detection method has difficulty in accurately detecting smoke similar to that in Figure 12a-c. Therefore, it is necessary to design a forest fire smoke detection algorithm to extract global and local textures from the images and to enhance the small target smoke handling capability. In Section 3.6, we show a comparison of the visualisation results for the model we designed.

Assessment Indicators
When the IoU between the detection box and the labelled box is greater than the threshold, we consider it to be detected as a positive sample by the model; otherwise, it is considered a negative sample. Based on the above settings, we can classify the sample results of the target detection model as true positive (TP), false positive (FP), true negative (TN), and false negative (FN).
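The IoU test above can be made concrete with a small helper; the (x1, y1, x2, y2) corner convention used here is an assumption about the labelling format:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as
    (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A detection counts as a positive sample when IoU exceeds the threshold (0.5 here).
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.1429
```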
In this paper, the performance of the model is evaluated by precision (P), recall (R), mAP, AR, FPS, parameter size, and FLOPs. To compare the performance of PDAM-STPNNet with other models, we use the commonly used evaluation metrics of precision (P), recall (R), F1-score (F1), parameter size, and FLOPs.
Precision indicates the proportion of true positives among the detections returned by the model, and recall indicates the proportion of positive samples in the test set that were correctly detected:

P = TP / (TP + FP), R = TP / (TP + FN).

The F1-score, the harmonic mean of precision and recall, is also used to evaluate the performance of the model:

F1 = 2 * P * R / (P + R).

Detection speed is measured as FPS = 1/t, where t is the average time taken to process each image. mAP is averaged over multiple categories; in contrast, the target detection method studied in this paper focuses on the detection of a single category of smoke targets, on the basis of which mAP and AP have the same meaning. For convenience, in this paper we only refer to mAP. mAP is the single most important measure of performance in target detection and is calculated as:

AP = ∫_0^1 P(R) dR, mAP = (1/K) Σ_k AP_k,

where K is the number of categories (K = 1 here). The average recall rate (AR) is mainly used to measure the degree of model detection failure and is calculated as the recall averaged over the set T of IoU thresholds used in evaluation:

AR = (1/|T|) Σ_{t∈T} R_t.
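These metrics follow directly from the TP/FP/FN counts; a minimal sketch (the counts in the usage line are illustrative, not results from the paper):

```python
def precision(tp, fp):
    """Fraction of model detections that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Fraction of ground-truth positives that were detected."""
    return tp / (tp + fn) if tp + fn else 0.0

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

p, r = precision(80, 20), recall(80, 10)
print(round(p, 3), round(r, 3), round(f1_score(p, r), 3))  # 0.8 0.889 0.842
```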

Experimental Environment
In order to verify the performance of the proposed PDAM-STPNNet, all experiments in this paper were conducted in the same hardware environment and software environment and with the same imaging equipment, with the specific environmental parameters shown in Table 1.

Experimental Setup
To verify the accuracy and effectiveness of the PDAM-STPNNet model, the model was trained using the Pytorch framework, and the trained model was used to predict aerial smoke images. The hardware environment for this experiment was an NVIDIA GeForce GTX 3080Ti GPU and the software environment was Windows 10.
Prior to the start of this experiment, we produced the available aerial forest fire smoke dataset. To ensure that the model could fully extract the features of smoke, we divided the home-made dataset into 70% for the training set, 15% for the validation set, and 15% for the test set. During training, each layer of the model was initialised from a Gaussian distribution. Considering the GPU memory size and time cost, we set the batch_size to 16, the momentum to 0.9, and the initial learning rate to 0.005, and we adjusted the learning rate with the Adam optimizer. We also set the decay to 0.002 and the number of iterations to 150 epochs (see Table 2).
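Under these settings, the optimizer configuration can be sketched in PyTorch as follows. The model here is a hypothetical stand-in, and mapping the momentum of 0.9 to Adam's first-moment coefficient (beta1) and the decay of 0.002 to weight decay is our assumption:

```python
import torch

# Hypothetical stand-in module; the paper trains PDAM-STPNNet itself.
model = torch.nn.Conv2d(3, 16, kernel_size=3)

# Hyperparameters from the experimental setup (Table 2); treating
# momentum 0.9 as Adam's beta1 and decay 0.002 as weight decay is
# an assumption on our part.
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.005,             # initial learning rate
    betas=(0.9, 0.999),   # beta1 = 0.9 ("momentum")
    weight_decay=0.002,   # "decay"
)
```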

Comparison with YOLOX-L
We conducted a series of performance evaluation experiments on a home-made dataset to verify the advantages of PDAM-STPNNet over YOLOX-L for aerial forest fire smoke detection tasks. To prevent the interference of different IoU thresholds in the experiments, an IoU threshold of 0.5 was set in this paper. The results of the comparison experiments between PDAM-STPNNet and YOLOX-L on the home-made dataset are shown in Table 3. This paper also compares the accuracy of PDAM-STPNNet and YOLOX-L at two IoU thresholds (see Figure 13). The accuracy of PDAM-STPNNet is significantly improved compared to YOLOX-L for both threshold values of 0.5 and 0.75, because PDAM-STPNNet has stronger feature extraction and feature fusion capabilities.


Comparison between Different Models
To validate the detection performance of the PDAM-STPNNet model for the aerial photography of forest fire smoke scenes, this paper compares some target detection models that have performed well in recent years with the PDAM-STPNNet proposed in this paper on the same test environment and home-made dataset. The mAP, mAP 50 , mAP 75 , AR, and FPS of different models for the test set are shown in Table 4. In an emergency situation during a forest fire, the two-stage detector has significant shortcomings compared to the single-stage detector due to its poor real-time performance.
Compared to single-stage detectors, two-stage detectors show good accuracy but have difficulty in meeting the rapid response requirements of forest fire detection tasks. Among the single-stage detectors, SSD512, DSSD513, and YOLOv5-L excel in speed but lack accuracy. FSAF and NAS-FPN have high detection accuracy but poor FPS performance and are not suitable given the urgency of actual forest fire monitoring scenarios. YOLOX-L outperformed YOLOv5-L for mAP by 1.96%, for mAP 50 by 2.22%, and for mAP 75 by 2.14%. Compared to the other YOLOX models, YOLOX-L achieves a trade-off between speed and accuracy. We improve on YOLOX-L and show that PDAM-STPNNet is second only to NAS-FPN in terms of accuracy and significantly better than the other models, while meeting the criteria for real-time detection in terms of speed. We found that the accuracy of NAS-FPN is higher than that of PDAM-STPNNet. To further explore the optimal model, we conducted experiments after replacing the backbones of YOLOX-L and PDAM-STPNNet with AmoebaNet (the backbone of NAS-FPN) to see whether the models could be further optimised. The experimental results show that replacing YOLOX-L's backbone with AmoebaNet slightly improves its performance over CSPDarknet53, by 1.78% on mAP, 1.60% on mAP 50 , and 1.67% on mAP 75 , but at a significantly lower speed. After replacing the backbone of PDAM-STPNNet with AmoebaNet, the model's accuracy was not as good as before the replacement and the speed was significantly reduced. The reason for the reduced accuracy might be that AmoebaNet searched the network architecture with the evolutionary strategy of aging evolution, which disordered PDAM's weight assignment. In summary, PDAM-STPNNet is the most suitable model for aerial forest fire smoke detection.
We analyse the reasons why our proposed model outperforms other models: (1) PDAM-STPNNet is improved based on YOLOX-L, integrating many speed-optimised solutions, and its model architecture is simple and outstanding in terms of real-time performance. (2) The four improvement strategies proposed in this paper are all designed according to the smaller target characteristics of the aerial forest fire smoke task, which are highly targeted and have significant improvement effects. (3) The home-made dataset in this paper eliminates some blurred and low-quality images, and its images are clear and conducive to the training of the model.

Exploring the Effects of a Single Approach
In this section, the experimental results of component stitching data enhancement, LAM, GAM, and STPN in this paper are presented in detail to demonstrate the process of investigating the effects of each method and the way in which they were combined in our experiments. The comparison of their model performance before and after the addition of YOLOX-L is also analysed. The results of the comparative experiments are presented in the following four subsections.
(a) Component stitching data enhancement. YOLOX-L uses Mosaic and MixUp to achieve better data enhancement results. However, the forest fire smoke studied in this paper was photographed by an unmanned aircraft, and the targets were often far away from the camera. In this paper, component stitching data enhancement is added to the original data enhancement method to improve the image enhancement stage. To demonstrate the effectiveness of this method and to investigate the optimal parameters, we added component stitching data enhancement to the YOLOX-L model and set different k values for comparison experiments. The results are shown in Table 5. The experimental results show that collage augmentation is beneficial in balancing the proportion of small targets and improving the training effect. It can be seen that when k = 2^2, the model has optimal accuracy and the speed is not significantly reduced compared to k = 1^2, indicating that component stitching data enhancement can alleviate the imbalance of the dataset due to the under-representation of small targets. Therefore, PDAM-STPNNet uses the component stitching data enhancement approach with k = 2^2.
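As a rough illustration of the collage idea, the sketch below scales images down and tiles them into one canvas so that targets shrink and small-target examples increase. The function name, the tiling order, the nearest-neighbour resize, and reading k = 2^2 as four sub-images per collage are our assumptions, not the paper's exact procedure:

```python
import numpy as np

def component_stitch(images, k=2, out_size=640):
    """Tile k*k images into one out_size x out_size canvas, each image
    scaled to 1/k of the output side (a sketch of scaled-collage
    augmentation, not the authors' exact implementation)."""
    tile = out_size // k
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    for idx, img in enumerate(images[: k * k]):
        r, c = divmod(idx, k)
        # nearest-neighbour resize to tile x tile (no external deps)
        ys = np.arange(tile) * img.shape[0] // tile
        xs = np.arange(tile) * img.shape[1] // tile
        canvas[r * tile:(r + 1) * tile, c * tile:(c + 1) * tile] = img[ys][:, xs]
    return canvas

imgs = [np.full((100, 120, 3), i * 60, np.uint8) for i in range(4)]
print(component_stitch(imgs, k=2).shape)  # (640, 640, 3)
```

Because every pasted target is shrunk by a factor of k, the bounding boxes must be rescaled and offset accordingly when the labels are stitched alongside the images.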
(b) Local attention mechanism module (LAM). In Section 2.3.1, the comprehensive features are obtained through Equation (6), where the value of α represents the extent to which the local feature information is taken into account; it has an important impact on the effectiveness of LAM, since it governs the allocation of the weights. In order to investigate the appropriate value of α, we added LAM to YOLOX-L and set different values of α for testing.
The experimental results are shown in Figure 14. Experiments have verified that LAM can enhance the local texture features of smoke. It can be seen that the optimal value is α = 0.25. When the value of α is too large, the smaller-value feature vectors are over-considered and attention is not fully focused on the important information; when α is too small, the local information is neglected and attention is over-focused on the global information, which affects the effectiveness of LAM feature extraction.
(c) Global attention mechanism module (GAM). GAM employs SN and the SiLU activation function to achieve optimal accuracy enhancement. To verify the feasibility and effectiveness of this choice, this paper adds GAM to YOLOX-L and conducts ablation experiments on GAM to investigate its performance under different normalisation methods and activation functions; the results are shown in Table 6. The experiments verify that GAM can enhance the global texture features of smoke. From the experimental results, it can be seen that the GAM selects the appropriate normalisation method and activation function to ensure that more effective feature maps are retained, which facilitates the improvement of accuracy. The optimal SN + SiLU combination improves mAP by 0.84%, mAP 50 by 0.74%, and mAP 75 by 0.86% compared to the BN + Sigmoid combination.
(d) Small-scale transformer feature pyramid network (STPN). In this paper, an additional detection head was added, and a transformer encoder was used to replace all the CSP_2 blocks in the neck section of YOLOX-L. To investigate its effectiveness, three improvement experiments were carried out on YOLOX-L, and the test results are shown in Table 7. When SE attention and CBAM blocks are added after the CSP_2 block, detection accuracy is slightly improved by the attention mechanism module, but the improvement is not significant, and its introduction inevitably increases the number of parameters. From the results, it is clear that replacing the CSP_2 block with the transformer encoder is an effective and feasible way to improve the accuracy of aerial forest fire smoke detection while reducing the number of parameters in the network.

Ablation Experiments with PDAM-STPNNet
In order to verify the overall effectiveness of the method proposed in this paper with optimal parameters, we designed ablation experiments for PDAM-STPNNet based on YOLOX-L, using the control variables method to combine component stitching data enhancement, LAM, GAM, and STPN in 16 sets of experiments covering these four improvement points. The results of the experiments are shown in Table 8. PDAM can improve the feature discrimination ability of the feature extraction network and effectively prevent the smoke and its background from being confused with each other. Among them, LAM mainly extracts local texture features, and adding LAM to YOLOX-L can improve mAP by 3.22%, mAP 50 by 3.08%, and mAP 75 by 3.24%. GAM mainly extracts global texture features, and adding GAM to YOLOX-L can improve mAP by 3.08%, mAP 50 by 3.16%, and mAP 75 by 3.16%. Comparing the two, LAM improves detection accuracy more, while GAM helps the model to localise more accurately. STPN focuses on the detection of small targets designed for forest fire smoke and replaces the original YOLOX-L block to reduce the number of parameters in the model, resulting in improvements in speed, the number of parameters, and accuracy. In summary, PDAM-STPNNet improved by 10.52% for mAP, 6.86% for mAP 50 , and 8.67% for mAP 75 compared to YOLOX-L. There was a small reduction in speed, but it still met the criteria for real-time detection. The results of the 16 sets of experiments demonstrate the roles of component stitching data enhancement, LAM, GAM, and STPN. Based on the above experiments, we conclude that PDAM-STPNNet is the most suitable target detection model for forest fire smoke detection tasks using UAV aerial photography.

Comparison of Visualisation Results
For a more visual analysis of PDAM-STPNNet, we visualise the detection results for YOLOX-L, PDAM-STPNNet, and YOLOX-L with component stitching data enhancement, LAM, GAM, and STPN added, respectively. The detection frames, categories, and confidence levels are shown in the detection results graph in Table 9.
Table 9. Visual comparison of test results.

From this, we can see that for Figure a in Table 9, the smoke is far away from the UAV camera and has a high similarity to the surrounding background, and YOLOX-L has difficulty detecting it accurately.

Comparison of Visualisation Results
For a more visual analysis of PDAM-STPNNet, we visualise the detection results for YOLOX-L, PDAM-STPNNet, and YOLOX-L, with component stitching data enhancement, LAM, GAM, and STPN added, respectively. The detection frames, categories, and confidence levels are shown in the detection results graph in Table 9. Table 9. Visual comparison of test results. From this, we can see that for Figure a in Table 9, the smoke is far away from the UAV camera and has a high similarity to the surrounding background. YOLOX-L does periments, we conclude that PDAM-STPNNet is the most suitable target detection model for forest fire smoke detection tasks using UAV aerial photography.

Comparison of Visualisation Results
For a more visual analysis of PDAM-STPNNet, we visualise the detection results for YOLOX-L, PDAM-STPNNet, and YOLOX-L, with component stitching data enhancement, LAM, GAM, and STPN added, respectively. The detection frames, categories, and confidence levels are shown in the detection results graph in Table 9. Table 9. Visual comparison of test results. From this, we can see that for Figure a in Table 9, the smoke is far away from the UAV camera and has a high similarity to the surrounding background. YOLOX-L does periments, we conclude that PDAM-STPNNet is the most suitable target detection model for forest fire smoke detection tasks using UAV aerial photography.

Comparison of Visualisation Results
For a more visual analysis of PDAM-STPNNet, we visualise the detection results for YOLOX-L, PDAM-STPNNet, and YOLOX-L, with component stitching data enhancement, LAM, GAM, and STPN added, respectively. The detection frames, categories, and confidence levels are shown in the detection results graph in Table 9. Table 9. Visual comparison of test results. From this, we can see that for Figure a in Table 9, the smoke is far away from the UAV camera and has a high similarity to the surrounding background. YOLOX-L does YOLOX-L with GAM periments, we conclude that PDAM-STPNNet is the most suitable target detection model for forest fire smoke detection tasks using UAV aerial photography.

Comparison of Visualisation Results
For a more visual analysis of PDAM-STPNNet, we visualise the detection results for YOLOX-L, PDAM-STPNNet, and YOLOX-L, with component stitching data enhancement, LAM, GAM, and STPN added, respectively. The detection frames, categories, and confidence levels are shown in the detection results graph in Table 9. Table 9. Visual comparison of test results. From this, we can see that for Figure a in Table 9, the smoke is far away from the UAV camera and has a high similarity to the surrounding background. YOLOX-L does periments, we conclude that PDAM-STPNNet is the most suitable target detection model for forest fire smoke detection tasks using UAV aerial photography.

Comparison of Visualisation Results
For a more visual analysis of PDAM-STPNNet, we visualise the detection results for YOLOX-L, PDAM-STPNNet, and YOLOX-L, with component stitching data enhancement, LAM, GAM, and STPN added, respectively. The detection frames, categories, and confidence levels are shown in the detection results graph in Table 9. Table 9. Visual comparison of test results. From this, we can see that for Figure a in Table 9, the smoke is far away from the UAV camera and has a high similarity to the surrounding background. YOLOX-L does periments, we conclude that PDAM-STPNNet is the most suitable target detection model for forest fire smoke detection tasks using UAV aerial photography.

Comparison of Visualisation Results
For a more visual analysis of PDAM-STPNNet, we visualise the detection results for YOLOX-L, PDAM-STPNNet, and YOLOX-L, with component stitching data enhancement, LAM, GAM, and STPN added, respectively. The detection frames, categories, and confidence levels are shown in the detection results graph in Table 9. Table 9. Visual comparison of test results. From this, we can see that for Figure a in Table 9, the smoke is far away from the UAV camera and has a high similarity to the surrounding background. YOLOX-L does YOLOX-L with STPN component stitching data enhancement, LAM, GAM, and STPN. Based on the above experiments, we conclude that PDAM-STPNNet is the most suitable target detection model for forest fire smoke detection tasks using UAV aerial photography.

Comparison of Visualisation Results
For a more visual analysis of PDAM-STPNNet, we visualise the detection results for YOLOX-L, PDAM-STPNNet, and YOLOX-L, with component stitching data enhancement, LAM, GAM, and STPN added, respectively. The detection frames, categories, and confidence levels are shown in the detection results graph in Table 9.  From this, we can see that for Figure a in Table 9, the smoke is far away from the UAV camera and has a high similarity to the surrounding background. YOLOX-L does component stitching data enhancement, LAM, GAM, and STPN. Based on the above experiments, we conclude that PDAM-STPNNet is the most suitable target detection model for forest fire smoke detection tasks using UAV aerial photography.

Comparison of Visualisation Results
For a more visual analysis of PDAM-STPNNet, we visualise the detection results for YOLOX-L, PDAM-STPNNet, and YOLOX-L, with component stitching data enhancement, LAM, GAM, and STPN added, respectively. The detection frames, categories, and confidence levels are shown in the detection results graph in Table 9.  From this, we can see that for Figure a in Table 9, the smoke is far away from the UAV camera and has a high similarity to the surrounding background. YOLOX-L does component stitching data enhancement, LAM, GAM, and STPN. Based on the above experiments, we conclude that PDAM-STPNNet is the most suitable target detection model for forest fire smoke detection tasks using UAV aerial photography.

Comparison of Visualisation Results
For a more visual analysis of PDAM-STPNNet, we visualise the detection results for YOLOX-L, PDAM-STPNNet, and YOLOX-L, with component stitching data enhancement, LAM, GAM, and STPN added, respectively. The detection frames, categories, and confidence levels are shown in the detection results graph in Table 9.  From this, we can see that for Figure a in Table 9, the smoke is far away from the UAV camera and has a high similarity to the surrounding background. YOLOX-L does PDAM-STPNNet component stitching data enhancement, LAM, GAM, and STPN. Based on the above experiments, we conclude that PDAM-STPNNet is the most suitable target detection model for forest fire smoke detection tasks using UAV aerial photography.

Comparison of Visualisation Results
For a more visual analysis of PDAM-STPNNet, we visualise the detection results for YOLOX-L, PDAM-STPNNet, and YOLOX-L, with component stitching data enhancement, LAM, GAM, and STPN added, respectively. The detection frames, categories, and confidence levels are shown in the detection results graph in Table 9.  From this, we can see that for Figure a in Table 9, the smoke is far away from the UAV camera and has a high similarity to the surrounding background. YOLOX-L does component stitching data enhancement, LAM, GAM, and STPN. Based on the above experiments, we conclude that PDAM-STPNNet is the most suitable target detection model for forest fire smoke detection tasks using UAV aerial photography.

Comparison of Visualisation Results
For a more visual analysis of PDAM-STPNNet, we visualise the detection results for YOLOX-L, PDAM-STPNNet, and YOLOX-L, with component stitching data enhancement, LAM, GAM, and STPN added, respectively. The detection frames, categories, and confidence levels are shown in the detection results graph in Table 9.  From this, we can see that for Figure a in Table 9, the smoke is far away from the UAV camera and has a high similarity to the surrounding background. YOLOX-L does component stitching data enhancement, LAM, GAM, and STPN. Based on the above experiments, we conclude that PDAM-STPNNet is the most suitable target detection model for forest fire smoke detection tasks using UAV aerial photography.

Comparison of Visualisation Results
For a more visual analysis of PDAM-STPNNet, we visualise the detection results for YOLOX-L, PDAM-STPNNet, and YOLOX-L, with component stitching data enhancement, LAM, GAM, and STPN added, respectively. The detection frames, categories, and confidence levels are shown in the detection results graph in Table 9.  From this, we can see that for Figure a in Table 9, the smoke is far away from the UAV camera and has a high similarity to the surrounding background. YOLOX-L does a b c

Experimental
For Figure a in Table 9, the smoke is far from the UAV camera and closely resembles the surrounding background. YOLOX-L does not recognise the target but identifies it as background, so a missed detection occurs. After component stitching data enhancement, the smoke is detected successfully, but with low confidence. When LAM is used to enhance local texture extraction, the target is detected, but a false detection occurs at a spot where the background is similar to the smoke. This problem is solved when GAM is used for global texture extraction, or when STPN is used to enhance the detection of small smoke targets. PDAM-STPNNet, which integrates all four improvement strategies, yields accurate localisation and high confidence.
For Figure b in Table 9, the smoke density is low and the local features are not obvious. YOLOX-L localises the smoke inaccurately and with low confidence, and component stitching data enhancement alone does not significantly improve the result. LAM, GAM, and STPN each improve the detection of such low-density smoke, and PDAM-STPNNet gives accurate localisation and high confidence.
For Figure c in Table 9, the smoke spreads over a large area under strong illumination. YOLOX-L has difficulty recognising the exact location of the smoke and produces redundant detection frames. Component stitching data enhancement improves the result, but the localisation is still inaccurate; LAM, GAM, and STPN all improve the results without localising precisely, while PDAM-STPNNet localises the smoke accurately and obtains a high confidence level. Based on the above experiments, we conclude that PDAM-STPNNet is the most suitable target detection model for forest fire smoke detection tasks using UAV aerial photography.

Model Performance Comparison on Public Datasets
The metrics obtained on the home-made dataset alone may not fully and objectively evaluate the performance of PDAM-STPNNet. To verify its superiority over commonly used target detection models of recent years, we trained the model with 85% of the home-made dataset as the training set and 15% as the validation set, and then used two public datasets, the Wildfire Observers and Smoke Recognition Homepage [37] and the Bowfire dataset [38], as test sets for a comprehensive evaluation experiment.

Wildfire Observers and Smoke Recognition Homepage
The Wildfire Observers and Smoke Recognition Homepage dataset was created and is maintained by the Wildfire Research Centre, part of the School of Electrical Engineering, Mechanical Engineering and Shipbuilding. While early fire detection traditionally relied on human wildfire observers, this dataset supports modern information and communication technologies (ICT) that can replace human wildfire observation. The site's data are divided into two categories: (1) a wildfire smoke image database and (2) a wildfire smoke video database. Since the image data in the video database are frame-by-frame screenshots of the videos, we dropped the wildfire video database and selected every second image from the wildfire smoke image database as test data. In total, 3482 images were selected, and six metrics (mAP, mAP50, mAP75, AR, FPS, and GFLOPs) were measured; the results are shown in Table 10. Analysis of these evaluation results shows that YOLOX-L still excels in terms of speed, but its accuracy no longer meets the requirements of forest fire smoke detection, and it is prone to missed and false detections. YOLOv4, YOLOv5, and YOLOR fall too far short in accuracy to meet the requirements of forest fire smoke detection.
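The paper does not publish its selection script; a minimal sketch of the every-second subsampling described above might look like the following (the function name `select_every_second` is illustrative, not from the paper):

```python
def select_every_second(filenames):
    """Keep every second image (sorted by filename) as test data,
    mirroring the subsampling applied to the wildfire smoke image database."""
    # Restrict to common image extensions, then sort for a deterministic order.
    images = sorted(f for f in filenames
                    if f.lower().endswith((".jpg", ".jpeg", ".png")))
    return images[::2]  # every second image
```

Sorting first makes the split deterministic, so the same test set is produced on every run.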

Bowfire Dataset
The Bowfire dataset is a classical dataset of only 227 images that contains many negative samples easily confused with smoke. We used it directly as a test set to examine the interference resistance of PDAM-STPNNet. Because of the dataset's small size, we trained several commonly used target detection models of recent years on the home-made dataset and tested them on Bowfire, measuring six metrics: mAP, mAP50, mAP75, AR, FPS, and GFLOPs. The results are shown in Table 11. The many smoke-like objects in Bowfire make detection difficult and prone to missed and false detections. Nevertheless, PDAM-STPNNet significantly improved its mAP, mAP50, and mAP75 scores compared with YOLOX-L, YOLOv4, YOLOv5, and YOLOR, achieving 60.52% mAP, 74.22% mAP50, 63.12% mAP75, and 43.62 FPS. These experimental results show that PDAM-STPNNet retains strong detection capability in difficult scenarios, which is important for forest fire smoke detection in complex forest terrain.
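The mAP metrics above follow the usual detection convention: per-class average precision (AP) is computed from a confidence-ranked precision-recall curve, with mAP50 and mAP75 matching detections to ground truth at IoU thresholds of 0.5 and 0.75, and mAP averaging over classes (and, in the COCO convention, over IoU thresholds). A minimal all-point-interpolation AP sketch, assuming detections have already been matched to ground truth (this is not the authors' evaluation code):

```python
def average_precision(scores, matches, num_gt):
    """AP for one class. `scores` are detection confidences; `matches[i]` flags
    whether detection i matched a ground-truth box at the chosen IoU threshold
    (e.g. 0.5 for mAP50); `num_gt` is the number of ground-truth boxes."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    precisions, recalls = [], []
    for i in order:
        if matches[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Precision envelope: make precision monotonically non-increasing.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the interpolated precision-recall curve.
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_r)
        prev_r = r
    return ap
```

mAP is then the mean of this AP over all classes; with a single smoke class, mAP50 reduces to the AP at IoU 0.5.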

Practical Application Tests
To test the generalisation ability and practicality of the model, we conducted field tests at Huangfengqiao State Forestry Farm, with good results. Huangfengqiao State Forestry Farm lies in a hilly area spanning the east and west of You County, Hunan Province, China, as shown in Figure 15. The area is dominated by densely distributed fir plantations with relatively low water content, which are prone to large fires and in which foliage and smoke obscure each other, degrading detection. There are also bare rocks in the hilly areas that are similar in colour to smoke and can easily be misidentified. The site thus meets the complex conditions under which actual forest fires occur. A 40-day simulation was carried out at Huangfengqiao State Forest, creating smoke by lighting smoke cakes with flour, rosin, and ammonium chloride as the main ingredients, and PDAM-STPNNet was used to detect the captured images.
A comparison of the detection results of YOLOX-L and PDAM-STPNNet for the three categories of smoke is shown in Figure 16. For each category, 50 images of real scenes were selected for testing, and a detection result was considered accurate if its IoU with the actual location of the smoke was greater than 0.7. As the figure shows, the recognition accuracies of YOLOX-L and PDAM-STPNNet were 86% and 98% in category A, 58% and 84% in category B, and 16% and 74% in category C, respectively.
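The IoU > 0.7 correctness criterion used above can be made concrete; a minimal sketch for axis-aligned boxes follows (illustrative helper names, not the authors' code):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_correct(pred_box, gt_box, threshold=0.7):
    """A detection counts as accurate when IoU with the ground truth exceeds 0.7."""
    return iou(pred_box, gt_box) > threshold
```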

Figure 16. Comparison of identification results for PDAM-STPNNet and YOLOX-L at the Huangfengqiao state-owned forestry site: Class A refers to cases with medium smoke concentration and size; Class B refers to cases with low smoke concentration and less distinctive features; Class C refers to cases where the smoke is distant and in the early stages of burning.

Training and Test Datasets
In order to investigate the effect of PDAM-STPNNet on forest fire smoke detection in practical application scenarios, this paper developed a home-made dataset of remote sensing images of forest fires taken by UAVs and applied it to the training and testing of PDAM-STPNNet. To further illustrate the model's detection effectiveness, we tested PDAM-STPNNet on the home-made dataset and on the Wildfire Observers and Smoke Recognition Homepage and Bowfire datasets, achieving good evaluation results that demonstrate its performance and generalisation capability. Owing to the smoke cake material used to produce the simulated forest fire smoke, the smoke images in this dataset are usually light-coloured, transparent white smoke. In actual forest fires, the burned area may lack oxygen, in which case the smoke contains large amounts of carbon particles and takes on a dark black colour. In the future, darker black smoke should be included in the dataset to enhance the model's ability to detect fires when there is insufficient oxygen in the burning area.

Application and Future Work Directions
In this paper, PDAM-STPNNet is deployed in a UAV forest fire monitoring system comprising three sub-systems, the UAV-gimbal camera system, the ground control system, and the ground station terminal monitoring system, which were built and put into practical use for forest fire smoke monitoring. The system uses GPS positioning combined with image target detection to roughly locate the UAV and the fire. The operational airspace for UAVs in forest fire monitoring is mainly concentrated in mountainous woodland, where signal problems make it difficult to establish long-lasting, reliable, and timely communication with UAVs and to determine their precise locations, a long-standing problem for UAV practitioners. In the future, the signal in mountainous woodland should be further strengthened so that the UAV and fire locations can be pinpointed, enhancing the practicality and feasibility of the UAV forest fire monitoring system in real applications.

Advantages of the Method in This Paper
In the training phase of the model, some of the images used had a large smoke coverage area occupying many pixels. We use component stitching data enhancement to pre-process the images, inputting k images into the neural network as one collage. This reduces the scale of the smoke targets while keeping the length and width of the collaged image the same as the original, allowing the model to learn the smoke features of multiple fire sources, which improves its performance in practical applications. PDAM mines deep features through both local and global textures, strengthening the feature extraction capability of the original backbone, which is of great significance for practical applications. STPN uses the transformer encoder to replace CSP_2 in the FPN, enabling the model to perform well on small forest fire smoke targets. The methods proposed in this paper are all designed for the UAV aerial photography forest fire smoke detection task, taking into account the long distance to the target during aerial photography, the indistinct colour characteristics of the smoke, and the complexity of the forest environment, with a focus on improving detection in remote sensing images of forest fire smoke. The feasibility and practicality of PDAM-STPNNet are demonstrated in our experiments. However, in practice, forest fires do not only occur during the daytime; smoke is a prominent visual feature by day but is difficult to detect at night.
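The paper does not publish its stitching code; assuming the common case of a 2x2 collage with k = 4 equally sized images, the idea can be sketched as follows (the helper names `stitch_four` and `remap_box` are illustrative, not from the paper):

```python
import numpy as np

def stitch_four(images):
    """Collage four equally sized HxWxC images into one HxW image by
    downscaling each to half size (naive nearest-neighbour, stride-2 slicing)
    and tiling them 2x2, so the stitched image keeps the original resolution
    while every target inside it shrinks to half scale."""
    h, w = images[0].shape[:2]
    halves = [img[::2, ::2] for img in images]
    top = np.concatenate(halves[:2], axis=1)
    bottom = np.concatenate(halves[2:], axis=1)
    out = np.concatenate([top, bottom], axis=0)
    return out[:h, :w]  # guard against odd input sizes

def remap_box(box, quadrant, w, h):
    """Map a (x1, y1, x2, y2) box from an original image into the collage.
    quadrant 0..3 = top-left, top-right, bottom-left, bottom-right."""
    dx = (quadrant % 2) * (w // 2)
    dy = (quadrant // 2) * (h // 2)
    # Halve every coordinate, then offset x-coordinates by dx and y-coordinates by dy.
    return tuple(c // 2 + (dx if i % 2 == 0 else dy) for i, c in enumerate(box))
```

The box remapping matters in practice: without rescaling and offsetting the annotations alongside the pixels, the collaged small targets would be mislabelled during training.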
In the future, in order to improve the forest fire detection task and to ensure that forest fires can be detected in a timely manner at all times of day and night, a deep learning model should be trained and deployed on the UAV forest fire monitoring system to complete the task of detecting flames at night.

Analysis and Outlook
Analysing falsely detected data is very important for improving the performance of our network, so we examined a sample of smoke images misdetected by PDAM-STPNNet. Figure 17 shows a blurred, white, transparent object formed by reflections on car glass, whose irregular shape and white transparency resemble smoke. Evidently, PDAM-STPNNet is not good at distinguishing small targets that have a shape, colour, and transparency similar to smoke. To address this problem, our future work will study the properties of object reflections to further reduce the false detection of small targets. Although the PDAM-STPNNet proposed in this paper has achieved good results in smoke detection from UAV aerial photography, in reality forest fires do not only occur during the daytime; smoke is a prominent visual feature by day but is difficult to detect at night. In the future, to improve forest fire detection and ensure that fires can be detected in time in all conditions, flames, which are prominent features at night, should be adopted as the research object of deep learning models deployed in a UAV forest fire monitoring system combined with sensors, such as thermal cameras, to complete the forest fire detection task at night.

Conclusions
In order to solve the problem that small target smoke in UAV remote sensing images is under-represented in datasets and easily confused with its background, this paper proposes a forest fire smoke detection model, PDAM-STPNNet, for a UAV forest fire monitoring system. We also constructed a forest fire smoke target detection dataset based on UAV images, containing 11,680 forest fire smoke images, of which 5571 are small target smoke images. PDAM-STPNNet improves on YOLOX-L. Component stitching data enhancement balances the proportion of small targets in the dataset, increasing mAP by 1.87%, mAP50 by 1.91%, and mAP75 by 1.87% without changing the original network structure. PDAM improves the feature discrimination ability of the feature extraction network and effectively prevents smoke and background from being confused with each other: adding LAM to YOLOX-L improves mAP by 3.22%, mAP50 by 3.08%, and mAP75 by 3.24%, while adding GAM improves mAP by 3.08%, mAP50 by 3.16%, and mAP75 by 3.16%. STPN mitigates the negative effects of drastic scale changes and has stronger feature fusion capability; adding STPN to YOLOX-L improves mAP by 1.92%, mAP50 by 1.82%, and mAP75 by 2.51%. We chose two public datasets to validate the effectiveness of PDAM-STPNNet, and the experimental results show that it achieves high detection accuracy and speed.
In forest fire control, PDAM-STPNNet can be mounted on a UAV forest fire monitoring system to detect smoke based on remote sensing images captured by UAVs to locate forest fire smoke, which is of great importance for the protection of forest ecology.


Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available because not all authors agreed to their release.