A Small-Target Forest Fire Smoke Detection Model Based on Deformable Transformer for End-to-End Object Detection

Abstract: Forest fires have continually endangered personal safety and social property. To reduce the occurrence of forest fires, it is essential to detect forest fire smoke accurately and quickly. Traditional forest fire smoke detection based on convolutional neural networks (CNNs) requires many hand-designed components and shows poor ability to detect small and inconspicuous smoke in complex forest scenes. Therefore, we propose an improved early forest fire smoke detection model based on the deformable transformer for end-to-end object detection (deformable DETR). We use deformable DETR as a baseline, combining sparse spatial sampling for smoke via deformable convolution with the relation modeling capability of the transformer. We integrate a Multi-scale Context Contrasted Local Feature module (MCCL) and a Dense Pyramid Pooling module (DPPM) into the feature extraction module for perceiving features of small or inconspicuous smoke. To improve detection accuracy and reduce false and missed detections, we propose an iterative bounding box combination method that generates precise bounding boxes covering the entire smoke object. In addition, we evaluate the proposed approach quantitatively and qualitatively on a self-made forest fire smoke dataset, which includes forest fire smoke images at different scales. Extensive experiments show that our improved model's forest fire smoke detection accuracy is significantly higher than that of mainstream models. Compared with deformable DETR, our model improves mAP (mean average precision) by 4.2%, AP_S (AP for small objects) by 5.1%, and other metrics by 2% to 3%. Our model is adequate for early forest fire smoke detection, with high detection accuracy for smoke objects at different scales.


Introduction
Forest resources are essential for the global environment and human society. In addition to improving the quality of the atmospheric environment, forests also play a crucial role in the global carbon cycle, soil properties, and climate regulation [1]. The increasing occurrence of forest fires is destroying the world's forest resources and impacting human society, causing considerable losses of human lives and public property [2,3]. Because forest fires are difficult to rapidly control and extinguish once they occur, effective detection of early forest fires is an urgent need. When a forest fire breaks out, smoke has more obvious characteristics and always appears earlier than flames. It is therefore of great significance for fire detection if forest fire smoke can be detected quickly and accurately.
Traditional forest fire smoke monitoring is based on manual inspection and smoke sensor monitoring [4]. However, manual inspection consumes substantial human and material resources with low efficiency and unsatisfactory results. Various sensors have been used to detect fire and smoke over the last two decades. Point sensors [5,6] obtain remarkable results indoors, but deploying a fire smoke wireless sensor network over an entire forest is too expensive, and the sensors are easily interfered with and damaged. Smoke sensors require close proximity to the forest fire because the alarm needs particles to trigger; however, by the time the particle concentration reaches the threshold, the forest fire might be too strong to control. Unmanned Aerial Vehicles (UAVs) can collect important visual information for early forest fire smoke detection during patrols [7]. Satellite sensors [8] have been widely used in forest fire smoke detection and are not affected by various environmental factors, but they can only monitor large-scale fires; due to infrequent revisit periods and resolution limitations, they cannot immediately detect forest fires. Currently, with the development of computer vision technology, video surveillance systems that can be installed in forests have become a suitable alternative to previous detection methods, with lower cost, convenient deployment, and high detection efficiency. Watchtowers [9] and UAVs [4] equipped with cameras are appropriate for automatically monitoring forest fire smoke. Previous forest fire smoke detection methods based on computer vision usually make use of color and motion characteristics of pixels from surveillance video frames. They mostly adopt human-designed pattern recognition processes, including feature extraction and classification. After candidate areas are extracted, static and dynamic smoke features are used for smoke recognition. Gubbi et al.
[10] used wavelets to extract smoke characteristics and then classified them using SVM (Support Vector Machines). ByoungChul et al. [11] trained two random forests for wildfire smoke classification using RGB (Red Green Blue) color, wavelet coefficients, motion orientation, and a histogram of oriented gradients as independent temporal and spatial feature vectors. Prema et al. [12] proposed an image-processing approach using the YUV color space, wavelet energy, and the correlation and contrast of smoke to detect smoke. However, such methods depend heavily on human prior knowledge and are limited in various scenarios due to complex, changeable forest environments, small-target smoke, and low-contrast flame and smoke.
Deep learning methods have attracted more attention in recent years than traditional image processing methods. Compared with traditional fire smoke detection methods, fire smoke detection methods based on deep learning can extract more abstract and high-level features, and they have the advantages of fast speed, high accuracy, and strong robustness in complex forest environments. Convolutional neural networks (CNNs) have become prevalent in object detection due to their outstanding performance in image recognition [13]. Frizzi et al. [14] proposed a CNN for fire and smoke detection and classification by extracting features from video. Wu et al. [15] used classical object detection models to detect forest fires. The adopted models included You Only Look Once (YOLO) [16], Single Shot multi-box Detector (SSD) [17], and Faster Region-CNN (R-CNN) [18,19], and the experiments showed that an improved YOLO model could detect early forest fires efficiently. Semantic segmentation is also a common method to detect smoke; its task is to classify the input image pixel by pixel and mark pixel-level objects. Pan et al. [20] introduced a collaborative region detection and grading framework for forest fire and smoke using weakly supervised fine segmentation and a lightweight Faster R-CNN. Frizzi et al. [21] compared network performance on two smoke semantic segmentation databases. Semantic segmentation and object detection have similar task objectives: both mark objects and their classification information. The difference is that semantic segmentation marks objects at the pixel level, while object detection marks their bounding boxes. With object detection, there is no need for tedious pixel-level annotation when preparing the smoke dataset, nor to classify every pixel in the image during detection, which greatly improves running speed.
With the development of natural language processing (NLP), the transformer [22] model has become a preferred solution for machine translation, text generation, etc. [23-25]. Its self-attention mechanism can gather global information and attend to important elements quickly and efficiently. Inspired by the success of the transformer model and the self-attention mechanism, Dosovitskiy et al. [26] proposed the Vision Transformer (ViT) for image recognition. The first end-to-end object detection method based on the transformer (DETR) demonstrated accuracy and speed on par with the well-established Faster R-CNN on the COCO dataset [27]. DETR has a simple architecture with a CNN backbone and transformer encoder-decoders. However, DETR needs more epochs than Faster R-CNN to converge and performs poorly in detecting small targets. Deformable DETR [28], which modifies DETR with a deformable attention module, obtains satisfactory results in object detection tasks, especially in detecting small targets. Here, we set deformable DETR as our baseline for forest fire smoke detection and demonstrate its efficiency through experiments.
In previous studies of forest fire smoke detection, many detection models have been used and have obtained good results. However, many problems remain for early forest fire smoke detection in forest environments due to complex backgrounds and the difficulty of extracting smoke features. Firstly, forest images usually contain not only smoke but also irrelevant background objects with similar characteristics to smoke, such as clouds, lake surfaces, fog, etc. Light changes in the natural environment also cause interference, altering some image features and affecting subsequent feature extraction and recognition. Secondly, it is challenging to detect early smoke precisely given its dynamic characteristics and small, fuzzy shape. Therefore, in this paper, we aim to address these critical issues by improving feature extraction and small-object detection abilities using a Multi-scale Context Contrasted Local Feature module (MCCL), a Dense Pyramid Pooling module (DPPM) [29-31], and an iterative bounding box combination method.
The contributions of our paper are as follows:
• We propose an improved deformable DETR model to detect forest fire smoke, which incorporates a Multi-scale Context Contrasted Local Feature module (MCCL) and a Dense Pyramid Pooling module (DPPM). The modules enhance low-contrast smoke by capturing locally discriminative features, enabling the detection of small and inconspicuous smoke.
• An iterative bounding box combination method is proposed to obtain precise boxes for smoke objects, yielding more accurate localization and bounding boxes for semitransparent, blurred smoke.
• In order to evaluate our model, we build a forest fire smoke dataset from public resources, including various kinds of smoke and smoke-like objects in complex forest environments.
The rest of the paper is organized as follows. Section 2 describes our dataset and the details of the improved deformable DETR model, Section 3 presents the experimental results and performance analysis, the discussion is given in Section 4, and finally Section 5 concludes this paper.

Dataset and Annotation
It is well known that the quality and size of a dataset are essential to a deep learning model's performance. However, there are few public datasets of forest fire smoke, or smoke datasets suitable for forest environments. Therefore, we built a forest fire smoke dataset (FFS dataset) by crawling forest fire smoke images (JPG format) from open data on the Internet. Our self-built dataset contains forest fire smoke images from different views and at different scales. We manually labeled the smoke areas in the images and converted the annotations to COCO [32] format. The dataset contains 10,250 images in total, which we randomly divided 9:1; thus, 90% of the dataset was used as a training set and 10% as a validation set. Some sample images are shown in Figure 1.
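The 9:1 random split described above can be sketched as follows; this is our own minimal illustration (the function name and the fixed seed are assumptions, not from the paper):

```python
import random

def split_dataset(image_ids, train_ratio=0.9, seed=42):
    """Shuffle image ids reproducibly and split them into train/val sets."""
    rng = random.Random(seed)
    ids = list(image_ids)
    rng.shuffle(ids)
    cut = int(len(ids) * train_ratio)
    return ids[:cut], ids[cut:]

# 10,250 images split 9:1, as for the FFS dataset
train_ids, val_ids = split_dataset(range(10250))
```

The two id lists would then be used to write two separate COCO-format annotation files, one per split.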

Deformable DETR
Recently, DETR has demonstrated very competitive performance in the object detection field as a truly end-to-end detector. In contrast to other modern object detection models, it does not need any hand-crafted components such as anchor generation and non-maximum suppression (NMS), and it has a very simple architecture: a CNN backbone and an encoder-decoder transformer model. However, DETR has its own issues. Firstly, DETR needs many epochs to converge, which is mainly due to the difficulty of training the attention module on image features. When the model is initialized, the cross-attention module gives nearly uniform attention to the whole feature map; only after training does the attention module attend to feature maps sparsely. Secondly, it is hard for DETR to detect small objects: the self-attention module in the encoder part of the transformer cannot handle high-resolution feature maps without unacceptable complexity. Zhu et al. [28] proposed deformable DETR, which achieved satisfactory results in small object detection and reduced training epochs by almost a factor of 10. The authors introduced a deformable attention module in which each query attends to a small, fixed number of key locations that the network considers most informative. This alleviates the heavy computation caused by high-resolution feature maps.
The deformable attention feature is calculated by:

$$\mathrm{DeformAttn}(z_q, p_q, x) = \sum_{m=1}^{M} W_m \Big[ \sum_{k=1}^{K} A_{mqk} \cdot W'_m\, x(p_q + \Delta p_{mqk}) \Big] \quad (1)$$

In the formula, x is the input feature map, q indexes the query element with content feature z_q and 2-d reference point p_q, and k indexes the sampled keys, with sampling offset Δp_{mqk} and normalized attention weight A_{mqk} for the k-th sampling point in the m-th attention head. W_m and W'_m are learnable linear projections for each head, the K sampling offsets Δp_{mqk} are predicted by a linear layer, and together with p_q they determine the selected points in the neighborhood.
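To make Equation (1) concrete, the following is a minimal NumPy sketch for a single query (our own loop-based simplification for illustration; the real module is a batched, learned PyTorch/CUDA operator):

```python
import numpy as np

def bilinear_sample(x, p):
    """Bilinearly sample feature map x of shape (H, W, C) at fractional
    location p = (row, col); assumes p lies inside the map."""
    H, W, _ = x.shape
    y0, x0 = int(np.floor(p[0])), int(np.floor(p[1]))
    wy, wx = p[0] - y0, p[1] - x0
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    return ((1 - wy) * (1 - wx) * x[y0, x0] + (1 - wy) * wx * x[y0, x1]
            + wy * (1 - wx) * x[y1, x0] + wy * wx * x[y1, x1])

def deform_attn(x, p_q, offsets, A, Wm, Wm_prime):
    """Eq. (1): sum over M heads and K points of A_mqk * W'_m x(p_q + dp).

    x: (H, W, C) feature map; p_q: (2,) reference point (row, col);
    offsets: (M, K, 2) sampling offsets; A: (M, K) attention weights
    (normalized over K); Wm, Wm_prime: (M, C, C) per-head projections.
    """
    M, K, _ = offsets.shape
    out = np.zeros(x.shape[-1])
    for m in range(M):
        head = np.zeros(x.shape[-1])
        for k in range(K):
            head += A[m, k] * (Wm_prime[m] @ bilinear_sample(x, p_q + offsets[m, k]))
        out += Wm[m] @ head
    return out
```

With zero offsets, uniform weights, and identity projections, the output reduces to M copies of the feature at the reference point, which is a useful sanity check.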
Furthermore, the deformable attention module is extended to a multi-scale deformable attention module in the encoder of deformable DETR. The output feature maps of the encoder have the same resolutions as the input feature maps. The input feature maps $\{x^l\}_{l=1}^{L-1}$ (L = 4) of the encoder are extracted from the backbone's output feature maps of stages C_3 to C_5 (e.g., ResNet [33]), transformed by 1 × 1 convolution. The resolution of stage C_l is 2^l times lower than the input image. The authors added a C_6 stage, obtained by a 3 × 3 stride-2 convolution on the C_5 stage. To identify which feature level each query pixel lies on, a scale-level embedding is added to the feature representation.
Then, the multi-scale deformable attention module is applied as:

$$\mathrm{MSDeformAttn}(z_q, \hat{p}_q, \{x^l\}_{l=1}^{L}) = \sum_{m=1}^{M} W_m \Big[ \sum_{l=1}^{L} \sum_{k=1}^{K} A_{mlqk} \cdot W'_m\, x^l\big(\phi_l(\hat{p}_q) + \Delta p_{mlqk}\big) \Big] \quad (2)$$

On the basis of the deformable attention module, l indexes the input feature level, $\{x^l\}_{l=1}^{L}$ are the input feature maps divided by level, $\hat{p}_q$ are normalized coordinates (not equivalent to the reference points p_q in the deformable attention module), and the function φ_l rescales the normalized coordinates at every feature level to locate the points sampled at different levels. The remaining elements are the same as in Equation (1) except for the additional level index l.
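The coordinate rescaling φ_l is what lets a single normalized reference point address corresponding locations on every pyramid level. A small sketch (we assume an align-corners convention here; implementations differ in this detail):

```python
import numpy as np

def phi(p_hat, level_shape):
    """phi_l: rescale a normalized point p_hat in [0, 1]^2 to fractional
    pixel coordinates on a feature level of shape (H_l, W_l)."""
    H, W = level_shape
    return np.array([p_hat[0] * (H - 1), p_hat[1] * (W - 1)])

# The same normalized reference point maps onto every pyramid level,
# so one query can sample keys from all L levels at once.
levels = [(64, 64), (32, 32), (16, 16), (8, 8)]
p_hat = np.array([0.5, 0.25])
points = [phi(p_hat, s) for s in levels]
```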
The network structure of deformable DETR is shown in Figure 2. By replacing the traditional transformer attention module with the multi-scale deformable attention module, deformable DETR processes feature maps in a way that naturally extends to aggregating multi-scale features.

Multi-Scale Context Contrasted Local Feature Module
As we know, context information can improve performance in scene labeling. CNNs provide high-level context features which contain abstract and global information about the whole image for object recognition [31,33]. However, there are many inconspicuous targets in complex natural environments, and context features from CNNs usually focus on the dominant objects in the image, so they are not guaranteed to be useful for recognizing inconspicuous objects. The Context Contrasted Local Feature (CCL) module solves this problem well by computing the contrast between local and context information, which not only makes full use of context but also foregrounds local information. This imitates human behavior: we concentrate on an object while attending to its surrounding context. The contrast is computed by:

$$\mathrm{CCL} = F_{local}(x;\,\Theta_{local}) - F_{context}(x;\,\Theta_{context}) \quad (3)$$

where $F_{local}$ is a standard convolution capturing local features and $F_{context}$ is a dilated convolution capturing surrounding context. To process subsequent high-level feature maps conveniently, we resize the input features to 16 × 16 and restore them at the output block. This module contains 4 different levels of dilated convolution blocks with dilation rates 1, 2, 4, and 6, respectively. We then concatenate the output feature maps of each pair of blocks, and use a Dense Pyramid Pooling Module (DPPM) to extract more abstract information from the concatenated multi-scale feature maps.
Confusing categories are a common problem in classification; it is an enormous challenge to distinguish between smoke and smoke-like objects such as clouds and haze. Zhao et al. [34] proposed a Pyramid Pooling Module (PPM) that constructs a global scene prior upon the high-level feature maps and efficiently obtains context information between sub-regions. The Dense Pyramid Pooling Module (DPPM) goes further, processing feature maps efficiently with fewer parameters and a larger receptive field, as shown in Figure 4.
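The contrast computation can be sketched in single-channel NumPy form; this is our own simplified illustration (the real module uses learned multi-channel convolutions, and one contrast branch per dilation rate):

```python
import numpy as np

def conv3x3(x, w, rate=1):
    """Single-channel 3x3 convolution with zero padding and dilation `rate`."""
    H, W = x.shape
    xp = np.pad(x, rate)
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += w[i, j] * xp[i * rate:i * rate + H, j * rate:j * rate + W]
    return out

def context_contrasted(x, w_local, w_context, rate):
    """Contrast of Eq. (3): local response minus dilated-context response."""
    return conv3x3(x, w_local, 1) - conv3x3(x, w_context, rate)

def mccl(x, w_local, w_context, rates=(1, 2, 4, 6)):
    """One contrast map per dilation rate, stacked (concatenated in the paper)."""
    return np.stack([context_contrasted(x, w_local, w_context, r) for r in rates])
```

On a uniform region the local and context responses agree and the contrast vanishes, while near a small bright target the local branch responds more strongly than its dilated context, which is exactly what foregrounds inconspicuous objects.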
The module pools features at four different scales. We use four average pooling layers with different kernels and strides to generate feature maps (of size 1 × 1, 2 × 2, 4 × 4, and 8 × 8, respectively) over different sub-regions. After that, we use a 1 × 1 convolution layer to reduce the dimension of the features, which keeps the weight of the global feature consistent. We then upsample the feature maps from the different pyramid levels directly to the input size via bilinear interpolation. Finally, we concatenate these feature maps as the multi-scale features.
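The four-scale pooling pyramid can be sketched as follows; this is a single-channel simplification (the paper uses bilinear interpolation and a 1 × 1 convolution for dimension reduction, while we use nearest-neighbour upsampling via `np.kron` to keep the sketch short):

```python
import numpy as np

def adaptive_avg_pool(x, bins):
    """Average-pool a square map x (H, H) into bins x bins sub-regions."""
    H = x.shape[0]
    out = np.zeros((bins, bins))
    for i in range(bins):
        for j in range(bins):
            ys, ye = i * H // bins, (i + 1) * H // bins
            xs, xe = j * H // bins, (j + 1) * H // bins
            out[i, j] = x[ys:ye, xs:xe].mean()
    return out

def pyramid_features(x, bin_sizes=(1, 2, 4, 8)):
    """Pool at the four scales and upsample back to the input size,
    then stack everything, as a simplified view of the pooling pyramid."""
    H = x.shape[0]
    maps = [x]
    for b in bin_sizes:
        pooled = adaptive_avg_pool(x, b)
        up = np.kron(pooled, np.ones((H // b, H // b)))  # nearest upsample
        maps.append(up)
    return np.stack(maps)  # (1 + len(bin_sizes), H, H)
```

The 1 × 1 bin contributes a global average (the scene prior), while the finer bins preserve progressively more sub-region structure.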
As we discussed in Section 2.2, the input feature maps $\{x^l\}_{l=1}^{L-1}$ (L = 4) of the encoder are extracted from the backbone's output feature maps of stages C_3 to C_5 (e.g., ResNet [33]), transformed by 1 × 1 convolution. The input multi-scale feature maps are obtained via 1 × 1 stride-1 convolutions on the C_3, C_4, and C_5 stages. In addition, we use the MCCL module to process the lowest-resolution feature maps of the final C_5 stage, then use a 3 × 3 stride-2 convolution to obtain the highest-dimensional feature maps, as illustrated in Figure 5. The numbers below each layer represent the size and dimension of the feature maps.

Iterative Bounding Box Combination Method
Forest fire smoke is easily affected by complex forest environments, and its characteristics change easily. Early smoke is usually semitransparent, which leads to a blurred boundary. Unlike general object detection, it is difficult to obtain a precise bounding box for smoke. These uncertainties inevitably lead to missed and false detections, as shown in Figure 6. In previous object detection models, Non-Maximum Suppression (NMS) is used to select bounding boxes based on their scores. However, NMS is not necessary for DETR: it lowers AP (Average Precision) in the final layers and only improves AP_50 (AP at IoU = 0.5) slightly [27]. Deformable DETR uses iterative bounding box refinement to obtain precise bounding boxes: each decoder layer predicts bounding boxes based on the predictions of the previous layer, and different layers compute their parameters independently [28]. As shown in Equation (4), for the boxes from the d-th decoder layer, the key elements are sampled relative to the boxes predicted by the (d−1)-th decoder layer, and the new reference points are set as $(b^{d-1}_{jx}, b^{d-1}_{jy})$. The box $b^d_j$ is defined as:

$$b^d_j = \Big\{ \sigma\big(\Delta b^d_{jx} + \sigma^{-1}(b^{d-1}_{jx})\big),\; \sigma\big(\Delta b^d_{jy} + \sigma^{-1}(b^{d-1}_{jy})\big),\; \sigma\big(\Delta b^d_{jw} + \sigma^{-1}(b^{d-1}_{jw})\big),\; \sigma\big(\Delta b^d_{jh} + \sigma^{-1}(b^{d-1}_{jh})\big) \Big\} \quad (4)$$

where $d \in \{1, 2, \ldots, D\}$, $\Delta b^d_{j\{x,y,w,h\}}$ are the predictions of the d-th decoder layer relative to the predictions of the (d−1)-th layer, and σ(·) and σ⁻¹(·) denote the sigmoid function and inverse sigmoid function, respectively. However, these methods are not well suited to box proposals for blurred smoke. Considering that our goal is to detect early smoke rapidly and locate it accurately in images, we propose an iterative bounding box combination method based on NMS and iterative bounding box refinement, which achieves satisfactory results and decreases the occurrence of missed and false detections. Our algorithm generates bounding boxes that do not overlap with each other and that surround the whole smoke object. Ablation results are shown in Figure 6. Firstly, we take the D decoder layers of deformable DETR (e.g., D = 6), and the bounding box predictions box_j from every decoder layer are sorted by their confidences.
Secondly, we delete each box_j whose confidence is lower than 0.01. Then we calculate the Intersection over Union (IoU) between bounding box_i and box_j:

$$\mathrm{IoU}(\mathrm{box}_i, \mathrm{box}_j) = \frac{|\mathrm{box}_i \cap \mathrm{box}_j|}{|\mathrm{box}_i \cup \mathrm{box}_j|}$$

We keep box_j as a new bounding box if its IoU with every kept box equals zero. Moreover, we combine bounding box_i and box_j into a new bounding box_{i+1} if box_j coincides with only one kept bounding box and the IoU between the two boxes is less than 0.7; we also require the new box to remain independent of the other kept bounding boxes. Based on these rules, we improve the bounding box generation algorithm; our iterative bounding box combination algorithm is shown in Algorithm 1.
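Algorithm 1 is not reproduced here, but the combination rules described above can be sketched as follows (our own simplification; thresholds as stated in the text, boxes assumed in (x1, y1, x2, y2) format, and the check that a merged box stays independent of the other kept boxes is omitted for brevity):

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter == 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def merge(a, b):
    """Smallest box covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def combine_boxes(predictions, conf_thr=0.01, iou_thr=0.7):
    """Iterative combination sketch: predictions is a list of
    (confidence, box) pairs pooled from all decoder layers."""
    preds = sorted((p for p in predictions if p[0] >= conf_thr),
                   key=lambda p: -p[0])
    kept = []
    for conf, box in preds:
        overlaps = [i for i, kb in enumerate(kept) if iou(kb, box) > 0]
        if not overlaps:
            kept.append(box)  # non-overlapping: keep as a new box
        elif len(overlaps) == 1 and iou(kept[overlaps[0]], box) < iou_thr:
            kept[overlaps[0]] = merge(kept[overlaps[0]], box)  # combine
        # otherwise discard: box coincides with several kept boxes,
        # or nearly duplicates one of them (IoU >= iou_thr)
    return kept
```

Merging a partially overlapping proposal into the kept box grows the box to cover the whole smoke plume, which is the intended behavior for blurred, semitransparent smoke.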

Loss Function
In terms of the loss function, our model follows deformable DETR. We use three loss components: classification loss, bounding box distance loss, and GIoU loss [35]. The classification loss, necessary for training the classification task, is the cross-entropy loss. The bounding box distance loss is the L1 loss, which measures the distance between the predicted box and the ground truth and propagates gradients. Furthermore, we use the GIoU loss to bring the predicted box closer to the ground truth:

$$L_{GIoU} = 1 - \mathrm{GIoU} = 1 - \left( \frac{|A \cap B|}{|A \cup B|} - \frac{|C \setminus (A \cup B)|}{|C|} \right)$$

where A, B, and C represent the prediction box, the ground truth box, and the smallest box enclosing A and B, respectively. Thus, our total loss is a weighted sum of the three losses:

$$L = \lambda_{cls} L_{cls} + \lambda_{L1} L_{L1} + \lambda_{GIoU} L_{GIoU}$$
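A scalar sketch of the GIoU loss for axis-aligned boxes (our own illustration of the standard formulation; deep learning frameworks compute this on tensors with gradients):

```python
def giou_loss(pred, gt):
    """GIoU loss for boxes in (x1, y1, x2, y2) format:
    L = 1 - IoU + |C \ (A u B)| / |C|, with C the smallest enclosing box."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_p + area_g - inter
    iou = inter / union
    # smallest enclosing box C
    cx1, cy1 = min(pred[0], gt[0]), min(pred[1], gt[1])
    cx2, cy2 = max(pred[2], gt[2]), max(pred[3], gt[3])
    area_c = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (area_c - union) / area_c
    return 1.0 - giou
```

Unlike plain IoU, the loss stays informative even when the boxes do not overlap: the enclosing-box penalty grows as the prediction drifts away from the ground truth.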

Training
The details of our experimental environment are shown in Table 1. The training parameters of our model were set based on deformable DETR, as shown in Table 2. Furthermore, we set the number of heads M of the multi-scale deformable attention module to 8 and the number of sampled keys K to 4.

Comparison and Evaluation
In order to analyze and demonstrate the early forest fire smoke detection performance of our improved deformable DETR model, we used the Microsoft COCO evaluation metrics, which are widely used for object detection tasks. Our model trains on the training set and is evaluated on the validation set. The two main metrics, AP (Average Precision) and AR (Average Recall), are calculated from Precision (P) and Recall (R) as shown in Equations (9)-(12):

$$P = \frac{TP}{TP + FP} \quad (9)$$

$$R = \frac{TP}{TP + FN} \quad (10)$$

$$AP = \int_0^1 P(R)\, dR \quad (11)$$

$$AR = 2 \int_{0.5}^{1} R(o)\, do \quad (12)$$

TP, FP, and FN represent the numbers of true positive, false positive, and false negative samples, respectively. In Equation (12), the variable o indexes the IoU between the prediction box and the ground truth box.
The Microsoft COCO evaluation metrics include detection accuracies for objects of different area sizes. Therefore, we use AP and AR for comparison. mAP is the mean Average Precision and mAR is the mean Average Recall over all categories; AP_S, AP_M, and AP_L represent the AP for small objects (area < 32²), medium objects (32² < area < 96²), and large objects (area > 96²), respectively. AP_50 is the average precision at IoU = 0.5, and the AR indicators are defined analogously to AP. The units of AP and AR are percentages. We also conducted ablation experiments. The experimental results are shown in Table 3.
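Equations (9)-(11) can be sketched as follows. This is a simplified, all-point-interpolation version for illustration; the actual COCO evaluation (pycocotools) additionally averages over IoU thresholds from 0.5 to 0.95 and samples 101 recall points:

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true/false positives and false negatives."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def average_precision(precisions, recalls):
    """AP as the area under the precision-recall curve, using all-point
    interpolation (precision made monotonically non-increasing in recall)."""
    order = sorted(range(len(recalls)), key=lambda i: recalls[i])
    r = [0.0] + [recalls[i] for i in order]
    p = [precisions[i] for i in order]
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])  # interpolate precision
    return sum((r[i + 1] - r[i]) * p[i] for i in range(len(p)))
```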

Detection Performance and Analysis
Compared with other notable detection models, extensive experiments indicate that our improved deformable DETR model with the MCCL module and the iterative bounding box combination method achieves satisfactory results in early forest fire smoke detection. We also used YOLOv5s and DETR, which are widely used in object detection, for comparison. Compared with Faster R-CNN + FPN, DETR shows higher detection accuracy but needs much more training time to converge and delivers low accuracy in detecting small smoke. Our baseline, deformable DETR, achieves more satisfactory performance on small targets with fewer training epochs. Compared with the baseline, the Multi-scale Context Contrasted Local Feature module improves overall performance, especially AP_S, by 5.1%. After adding the iterative bounding box combination method, our model achieves higher accuracy on forest fire smoke, improving mAP by 4.2%, AP_50 by 2.6%, and mAR by 6.1% over the baseline, and other metrics by about 3%. Based on these experiments, we conclude that our improved deformable DETR model is competent for small and inconspicuous smoke detection, and its detection accuracy for smoke at different scales is higher than that of other common models. Some detection results are shown in Figures 7-11.
As shown in Figure 7, the detection results of the improved model contain no false or missed detections, and the bounding boxes cover the entire smoke objects with high accuracy. We also used YOLOv5s, DETR, and the baseline to detect ultra-small smoke targets in the wild under strong interference (such as the strong direct sunlight in Figures 7 and 8); they all missed detections, but our model detected the targets accurately. The series of images on the left shows that small gray smoke can be detected well by common models. As shown in the images on the right, ultra-small white smoke under strong light is very difficult for general models to detect, but our model detects it well. To investigate our improvement in feature extraction and to better understand the multi-scale attention module, we visualize the sampling points and attention weights of the last encoder layer. As shown in Figure 12, compared with the baseline, our improved model focuses more precisely on the inconspicuous smoke region by giving it larger attention weights, while the original model attends only roughly to the boundary of the smoke. The attention weights and the positions of the sampling points lead to the differences in the subsequent learning and detection modules of the two models.

Discussion
It is very important to detect forest fires quickly and accurately. Smoke, as a significant feature of early fires, deserves more attention during detection. However, objects such as smoke and flames have irregular shapes and are easily disturbed by complex forest environments. Delayed or missed detection of forest fire smoke can lead to the rapid spread of fire, causing immeasurable losses. Over the last two decades, the development of computer vision has made it possible for high-precision automatic inspection to replace manual inspection. Because of the translucency and blurred boundary of smoke, it is easily influenced by other factors such as light and wind. Previous smoke detection methods based on deep learning have mainly studied the texture and spatiotemporal characteristics of smoke videos to achieve more accurate detection results [36-38]. Smoke detection can also adopt another strategy, namely data pre-processing such as dark channel prior, optical flow, and super-pixel segmentation of images [20,39].
Our improved deformable DETR model concentrates on feature extraction in order to obtain higher smoke detection accuracy. Through comparisons and ablation experiments, we found that our model is more suitable for early forest fire smoke detection than other common models, as shown in Table 3. The MCCL module provides precise multi-scale features of small and inconspicuous smoke objects for high-level feature processing, and it has more dilated convolution blocks and fewer parameters than CCL. We used the DPPM module, expanded from the Pyramid Pooling Module, to generate more features with fewer parameters than the Pyramid Pooling Module. As shown in Figure 4, our DPPM computes multi-scale features naturally by upsampling at each stage; it combines efficient feature extraction with fewer parameters. In Figure 12, we visualize the sampling points and attention weights of the last encoder layer: our improved model focuses more precisely on smoke objects, while the MCCL module extracts more useful features for the subsequent feature learning and detection modules. Compared with the original model, the more accurate sampling points and attention weights show the advantage of our method in feature extraction. Additionally, the detection performance of our model demonstrates its advantages in this task (as shown in Figures 7-11). Our detection samples contain ultra-small smoke objects under strong interference (such as strong direct sunlight and smoke-like clouds in Figure 11). Because the further processing of high-dimensional feature maps by the MCCL module and DPPM greatly reduces the possibility of misclassification, the model can more accurately capture inconspicuous smoke features and distinguish smoke from smoke-like objects.
In vision-based target detection, small target detection has always been a difficult problem, and mis-detections of our model occur when detecting small targets. Early small smoke targets tend to be easily covered by trees and dissipate quickly. Limited pixel representations of early smoke flow and interference from smoke-like objects usually lead to mis-detections in the original model. In order to improve the detection performance on inconspicuous smoke targets, we use several dilated convolutions with different rates to obtain useful context information while also attending to the local information of inconspicuous targets. The proposed improvement obtained satisfactory results in detecting early smoke targets and improved the AP_S metric by 5.1% (compared with the original model).
The previous bounding box generation method is obviously not well suited to smoke in forest fire smoke detection tasks: the generated bounding box always has a smaller or larger offset to the ground truth, which leads to high training loss. Considering this, we used an iterative bounding box combination method to generate bounding boxes more consistent with the ground truth, which reduced the occurrence of false and missed detections and improved mAP by 4.2%. With the addition of our bounding box generation method, the detection results in Figure 6 become more accurate than the baseline. Furthermore, we constructed a large forest fire smoke dataset to evaluate our method. Four common object detection models were compared in the experiments, with good performance on forest fire smoke detection, which makes it possible to detect forest fire smoke in the wild.
However, our model still has some disadvantages to address. Small object detection is a difficulty not only for forest fire smoke detection but for computer vision in general. We extracted features from high dimensions to detect small smoke, but this is still limited by the lack of pixel information for small targets. Complex environments, such as foggy weather, greatly affect the detection performance of our model, whereas smoke sensors retain high accuracy in detecting smoke. Therefore, combining computer vision with traditional smoke sensor networks may make smoke detection more accurate.

Conclusions
In this paper, we propose an improved end-to-end deformable DETR model for forest fire smoke detection. Firstly, in order to capture the information of small and inconspicuous smoke, a feature extraction module with the Multi-scale Context Contrasted Local Feature module and the Dense Pyramid Pooling module is used. Several dilated convolutions with different rates make full use of the context information and local information of inconspicuous objects, which improves the performance of early forest fire smoke detection. Secondly, we propose an iterative bounding box combination method to reduce the occurrence of false and missed detections and to generate bounding boxes for forest fire smoke that are more faithful to the ground truth. Lastly, due to the lack of relevant public datasets, we established a forest fire smoke dataset to verify the performance of our model quantitatively and qualitatively. Ablation experiments show that our improved model for detecting forest fire smoke is superior to mainstream detection models in most metrics. Our model not only achieves high smoke detection accuracy but can also detect early forest fire smoke that is too small and inconspicuous to be detected by common models.
In the next stage, we plan to conduct joint detection of early fire and smoke, and then prune and distill our improved model so that it can be deployed on edge devices such as UAVs and watchtowers for real-time detection with fewer parameters and higher processing speed.

Figure 1 .
Figure 1. Samples of the FFS dataset (images show smoke objects at different scales; the first row shows light smoke, the second row shows dense smoke).

Figure 2 .
Figure 2. The network structure of deformable DETR.

Figure 3 .
Figure 3. Illustration of Multi-scale Context Contrasted Local Feature module.

Figure 6 .
Figure 6. Different detection samples before and after using the iterative bounding box combination method. (a,c) Original detection results; (a) contains one missed detection and (c) contains one false detection. (b,d) The updated detection results, where bounding boxes are generated by our method; the bounding boxes cover the whole smoke accurately in both (b,d).

Figure 7 .
Figure 7. Detection results of our improved deformable DETR model. The first row shows small-target smoke images; the second row shows large-target smoke images.

Table 3 .
Experimental results. Comparison of our improved model with other detection models on our FFS dataset. Training epochs are set to different values for the best training results of each model. The bolded numbers indicate the best performance in the comparison. "+" denotes ablation experiments based on deformable DETR.