Article

RT-DETR-Smoke: A Real-Time Transformer for Forest Smoke Detection

1 School of Computer and Artificial Intelligence, Hefei Normal University, Hefei 230601, China
2 Hefei Institute for Public Safety Research, Tsinghua University, Hefei 230601, China
3 State Key Laboratory of Fire Science, University of Science and Technology of China, Hefei 230026, China
4 School of Artificial Intelligence and Big Data, Hefei University, Hefei 230601, China
* Author to whom correspondence should be addressed.
Fire 2025, 8(5), 170; https://doi.org/10.3390/fire8050170
Submission received: 22 February 2025 / Revised: 10 April 2025 / Accepted: 23 April 2025 / Published: 27 April 2025

Abstract

Smoke detection is crucial for early fire prevention and the protection of lives and property. Unlike generic object detection, smoke detection faces unique challenges due to smoke’s semitransparent, fluid nature, which often leads to false positives in complex backgrounds and missed detections—particularly around smoke edges and small targets. Moreover, high computational overhead further restricts real-world deployment. To tackle these issues, we propose RT-DETR-Smoke, a specialized real-time transformer-based smoke-detection framework. First, we designed a high-efficiency hybrid encoder that combines convolutional and Transformer features, thus reducing computational cost while preserving crucial smoke details. We then incorporated an uncertainty-minimization strategy to dynamically select the most confident detection queries, further improving detection accuracy in challenging scenarios. Next, to alleviate the common issue of blurred or incomplete smoke boundaries, we introduced a coordinate attention mechanism, which enhances spatial-feature fusion and refines smoke-edge localization. Finally, we propose the WShapeIoU loss function to accelerate model convergence and boost the precision of the bounding-box regression for multiscale smoke targets under diverse environmental conditions. As evaluated on our custom smoke dataset, RT-DETR-Smoke achieves a remarkable 87.75% mAP@0.5 and processes images at 445.50 FPS, significantly outperforming existing methods in both accuracy and speed. These results underscore the potential of RT-DETR-Smoke for practical deployment in early fire-warning and smoke-monitoring systems.

1. Introduction

Fire incidents pose a significant threat to both human lives and the environment. The 2023 wildfires in Canada, for instance, devastated over 156,000 square kilometers of land—surpassing the baseline established in 1995—and released massive amounts of air pollutants and greenhouse gases, exacerbating climate change [1]. Early and accurate fire detection is therefore critical to reducing casualties and mitigating damage to property and ecosystems [2,3].
In the initial stages of a fire, smoke often emerges before visible flames and is less likely to be obstructed, making smoke detection an effective early-warning strategy. Traditional methods relying on visible-light images or meteorological data [4,5] have proven inadequate in complex environments, where variations in lighting, weather, and background conditions can cause inconsistent detection performance. In recent years, deep learning-based smoke-detection methods [6] have demonstrated an improved capability to learn rich features from diverse visual contexts, leading to more accurate detection and fewer false alarms. However, achieving high accuracy while maintaining low false-positive rates remains a formidable challenge in practice [7].
Deep learning-based approaches to smoke detection generally fall into two categories. The two-stage frameworks, such as RCNN [8], Fast R-CNN [9], Faster R-CNN [10], Cascade R-CNN [11], and Sparse R-CNN [12], offer high accuracy and robust feature extraction, even in cluttered scenes; however, their reliance on region proposals tends to incur substantial computational overhead, hindering real-time applications [13]. Conversely, single-stage detectors, including SSD [14] and the YOLO series [15,16], are well-suited for real-time monitoring due to their efficiency. However, they often struggle with small or blurred targets, causing performance degradations in smoke-detection scenarios.
In an effort to bridge the gap between high accuracy and real-time performance, the DETR (Detection Transformer) algorithm was introduced [17]. By leveraging Transformer architectures, DETR removes the need to use manually designed anchor boxes and simplifies the detection pipeline. Nonetheless, the vanilla DETR suffers from slow training convergence and limited spatial resolution. Although multiple variants—Deformable DETR [18], Sparse DETR [19], Lite-DETR [20], Focus-DETR [21], and RT-DETR [22]—have alleviated some of these issues, they are generally designed for conventional benchmarks like COCO and often underperform in fire or smoke detection, where the scene can be chaotic and ever-changing. Moreover, smoke is inherently fluid and semitransparent, featuring large variations in scale, shape, and deformation that blur its edges and make detection exceedingly difficult.
To address these challenges, we propose RT-DETR-Smoke, a novel Transformer-based, end-to-end smoke-detection framework designed for real-time performance. Specifically, we integrated the following: a high-efficiency hybrid encoder that captures both coarse- and fine-grained features while reducing the computational cost, aided by an uncertainty minimization strategy in query selection; a Coordinate Attention mechanism that enhances feature fusion for improved smoke-edge localization; a WShapeIoU loss function that accelerates convergence and stabilizes detection performance under the significant shape variance and complexity inherent in smoke scenarios.
Our main contributions are summarized as follows:
(1)
Proposal of RT-DETR-Smoke: We extended RT-DETR to create a dedicated smoke-detection framework by introducing a hybrid encoder and integrating a coordinate attention mechanism specifically tailored for smoke-edge recognition.
(2)
Introduction of WShapeIoU Loss Function: We designed a specialized bounding-box regression loss function to expedite model convergence and improve detection stability in dynamic fire scenes.
(3)
Creation of a Custom Smoke-Detection Dataset: To comprehensively evaluate our method, we constructed a challenging smoke dataset containing diverse scene variations drawn from real-world surveillance videos and meticulously annotated.
The remainder of this paper is organized as follows: Section 2 provides a survey of related research; Section 3 describes our proposed method in detail, with emphasis on the coordinate attention mechanism and the WShapeIoU loss function; Section 4 outlines our experimental design, presents results, and discusses the effectiveness of the approach; and Section 5 concludes the paper and offers future research directions.

2. Related Works

Traditional machine-learning approaches to smoke detection often rely on hand-crafted features, which can limit their scalability and robustness in highly variable scenarios. In contrast, deep learning-based methods have shown remarkable improvements in both accuracy and generalization, making them well-suited for complex tasks such as smoke and fire detection. This section reviews the latest one-stage algorithms, two-stage algorithms, DETR-based algorithms, and task-specific methods, discussing their applicability and limitations in addressing smoke-detection challenges.

2.1. One-Stage Algorithms

One-stage detectors have gained substantial popularity in fire and smoke detection due to their faster inference speed and lower hardware demands. Wu et al. [23] successfully employed traditional object-detection models, including Faster R-CNN, the YOLO series, and SSD, to handle real-time forest-fire detection and reduce false alarms. Notably, improvements to the YOLO architecture enhanced detection accuracy, while SSD stood out for its balance of speed and precision. Guo et al. [24] further optimized single-stage performance by introducing DF-SSD, which integrates deep fire modules into a SqueezeNet backbone, thereby reducing computational overhead.
Among single-stage methods, the YOLO family [25] remains the most prominent. Successive iterations [26,27] have improved detection accuracy and speed. Abdusalomov et al. [28] benchmarked YOLOv3, YOLOv4, and their lightweight versions in fire localization, highlighting YOLOv3 as a strong candidate due to its robust performance. Zheng et al. [29] compared EfficientDet, Faster R-CNN, YOLOv3, and SSD for forest-fire smoke localization; EfficientDet achieved the highest mean average precision (mAP), whereas YOLOv3 offered the best speed. To enhance feature extraction and handle challenging samples, Wang et al. [30] proposed SASC-YOLOX, incorporating a lightweight self-attention mechanism to focus on critical smoke characteristics. Smadi et al. [31] found that YOLOv5x surpassed other YOLO variants in mAP at the expense of slower speed. More recently, Yang et al. [32] introduced an improved YOLOv8 for forest-fire detection, integrating deformable convolutions, SCConv modules, and coordinate attention mechanisms to tackle the scale and complexity of fire scenes.
These studies confirm that one-stage YOLO variants remain a compelling choice for smoke detection, particularly when real-time monitoring is prioritized; however, challenges persist for small or highly distorted smoke regions, and these often lead to performance drops in more chaotic or visually noisy environments.

2.2. Two-Stage Algorithms

In contrast to single-stage methods, two-stage approaches focus on higher detection accuracy, albeit at a computational cost. Barmpoutis et al. [33] combined deep learning with higher-order linear dynamic systems for multidimensional texture analysis, first using Faster R-CNN to identify fire regions and then mapping them to a Grassmannian manifold for finer analysis. Similarly, Chaoxia et al. [34] introduced a color-guided anchoring strategy, leveraging color cues from flames to guide anchor placement in Faster R-CNN and thereby reducing false positives.
Despite such enhancements, anchor-based two-stage models often struggle with irregularly shaped or amorphous targets [35]. Given that both flames and smoke can present highly variable boundaries and transparency levels, anchor definitions become less reliable. Consequently, while two-stage approaches can yield strong accuracy, their limited adaptability to fluid, semitransparent smoke and their high computational overhead restrict their suitability for real-time or resource-constrained deployments.

2.3. DETR

DETR (Detection Transformer) [17] marked a significant shift in object detection by streamlining the pipeline into an end-to-end framework. Its key innovation is the elimination of manually tuned anchor boxes and Non-Maximum Suppression (NMS); instead, it involves directly predicting bounding boxes and class labels via the Transformer architecture. Although DETR excels in simplifying the detection process, it exhibits slow convergence, suboptimal small-object detection, and high reliance on large datasets and substantial computational resources.
To mitigate these issues, numerous DETR variants have emerged. Deformable DETR [18] employs deformable attention to capture sparse, relevant features, accelerating convergence. Sparse DETR [19] adopts sparse attention for a more computationally efficient approach, while Lite DETR [20] reduces model complexity and parameters. Focus-DETR [21] prioritizes attention on critical regions, trimming redundant calculations. RT-DETR [22] further pushes real-time capability yet may underperform established methods like YOLO in scenarios with highly localized or small-scale targets.
In response to these challenges, Huang et al. [36] proposed a fire-and-smoke-detection model based on Deformable DETR to address the issues of small-object detection and the effective capture of smoke features. Their approach enhances the representation of smoke characteristics by incorporating deformable convolution alongside the powerful modeling capabilities of Transformers. Meanwhile, Li et al. [37] developed a fire-and-smoke-detection algorithm that integrates both CNN and Transformer frameworks. By incorporating a normalized attention module, their method strengthens feature representation and accelerates convergence through the use of multiscale deformable attention. However, despite these improvements, their approach still relies on static images and fails to fully capture the temporal dynamics of smoke diffusion, which may compromise its real-time warning capabilities. In contrast, Liang et al. [38] introduced the FSH-DETR model, which optimizes multiscale smoke detection within the DETR framework by employing a Single-Scale Feature Interaction (SSFI) module and a Cross-Scale Feature Fusion Module (CCFM). Additionally, Sun et al. [39] proposed the Smoke-DETR model, which utilizes a lightweight architecture (such as ECPConv) and multiscale feature-fusion techniques (including EMA and MFFPN), thereby significantly enhancing both its accuracy and its real-time performance in smoke detection.
Despite these advances, applying DETR-based models to smoke detection remains challenging. Smoke’s semitransparency, dynamic shape deformation, and blurred edges often exceed the representational capacity of standard DETR pipelines, which can result in missed detections or inaccurately localized bounding boxes. In the chaotic, scale-varying context of fire scenes, achieving both fast inference and high accuracy continues to be difficult.

2.4. Task-Specific Method

In response to challenges encountered in specific scenarios, researchers have proposed more targeted solutions. Ren et al. [40] introduced the Significant Feature Guided Decoupling Network (SFGDN) to address flame detection in UAV remote-sensing images under conditions of smoke occlusion. The core innovations of their method include the Strong Significant Feature Guided Subnetwork (SSFGS) and the Multi-Task Information Decoupling Detection Head (MID-Head), which together enhance the method’s ability to detect concealed flames. Zhao et al. [41] proposed the FSDF framework, which synergistically combines traditional image-processing techniques—such as the HSV color space and Complete Local Binary Patterns (CLBP)—with state-of-the-art deep learning components like YOLOv8 and Vector Quantized Variational Autoencoder (VQ-VAE), significantly boosting both the accuracy and the robustness of fire detection. Additionally, Ding et al. [42] presented an event-camera-based flame-detection method to overcome the limitations of conventional RGB cameras, including background interference and motion blur. Their approach introduced the FlaDE dataset, a Recursive Event Denoiser (RED) module, and a BEC-SVM flame-detection algorithm, thereby markedly improving both detection accuracy and processing speed.
Despite the progress achieved by these specialized fire-detection methods, challenges remain regarding generalization, real-time performance, and robustness in practical applications, largely due to the diverse nature of datasets and hardware constraints. Consequently, developing a universal fire-detection method that strikes an optimal balance among robustness, real-time capability, and low resource requirements continues to be a critical focus of ongoing research.
Based on a thorough analysis of one-stage algorithms, two-stage algorithms, DETR-based algorithms, and task-specific methods, it is clear that none of the current approaches fully meets the requirements of robustness, low-latency operation, and ease of deployment for smoke detection. This situation motivates our work to develop an approach that retains DETR’s end-to-end advantages while enhancing small-object sensitivity, addressing scale invariances, and reducing computational complexity—ultimately offering a more practical solution for real-world smoke detection.

3. Research Method

Inspired by RT-DETR [22], we propose RT-DETR-Smoke, a real-time, end-to-end smoke-detection framework specifically optimized for the unique challenges of detecting semitransparent, irregularly shaped smoke in dynamic fire scenes. As illustrated in Figure 1, our framework consists of four primary components:
  • A backbone network integrated with Coordinate Attention (CoordAtt) for precise spatial-feature encoding.
  • A hybrid encoder that fuses multiscale features via an Attention-based Intra-Feature Interaction (AIFI) and a CNN-based Cross-Scale Feature Fusion (CCFF) module.
  • A Transformer decoder leveraging an uncertainty-minimization strategy for query selection and iterative auxiliary heads for bounding-box refinement.
  • A novel WShapeIoU loss function that enhances bounding-box regression performance by adapting to smoke’s fluid morphology.
These components collectively aim to boost detection accuracy and retain real-time performance, even under complex or fast-changing environmental conditions.

3.1. Overall Architecture

(1)
Coordinate Attention in the Backbone
Detection of smoke in cluttered backgrounds is hindered by its semitransparency and blurred edges. To address these issues, we integrated the CoordAtt mechanism into the P4 and P5 layers of the backbone. Unlike conventional channel or spatial attention modules, CoordAtt encodes positional information directly into channel attention, thereby highlighting the following:
  • Long-range correlations along horizontal and vertical dimensions; these are crucial for capturing the elongated or diffuse shapes of smoke.
  • Relevant local patterns; these are highlighted by suppressing background noise and reinforcing smoke-boundary details.
This enhancement effectively improves feature extraction for ambiguous smoke regions while imposing minimal additional computational overhead.
(2)
Hybrid Encoder with Multi-Scale Feature Fusion
After the backbone stage, the features from P3, P4, and P5 were passed into a hybrid encoder designed to capitalize on both attention-based and convolution-based strategies:
  • Attention-based Intra-Feature Interaction (AIFI) refines features within each scale, allowing the network to better capture the subtle variations in texture or transparency that characterize smoke.
  • CNN-based Cross-Scale Feature Fusion (CCFF) merges the refined features from multiple scales, enabling robust detection of small, medium, and large smoke plumes in various environments.
By harmonizing these modules, the encoder produces discriminative and scale-aware feature sequences that facilitate effective bounding-box regression in subsequent stages.
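As a rough illustration of how these two stages can be chained, the PyTorch sketch below applies an AIFI-style self-attention step to the deepest feature map and then fuses scales with simple convolutions; the module names, channel widths, and fusion pattern are illustrative assumptions rather than the exact RT-DETR-Smoke encoder.
  import torch
  import torch.nn as nn
  import torch.nn.functional as F
  class HybridEncoderSketch(nn.Module):
      # Illustrative hybrid encoder: AIFI-like attention within the deepest scale,
      # followed by CCFF-like convolutional cross-scale fusion. Channel sizes and
      # the fusion pattern are assumptions, not the exact RT-DETR-Smoke encoder.
      def __init__(self, channels: int = 256, nhead: int = 8):
          super().__init__()
          self.aifi = nn.TransformerEncoderLayer(d_model=channels, nhead=nhead,
                                                 dim_feedforward=1024, batch_first=True)
          self.fuse_p4 = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
          self.fuse_p3 = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
      def forward(self, p3, p4, p5):
          b, c, h, w = p5.shape
          # AIFI: flatten P5 into a token sequence, apply self-attention, reshape back.
          tokens = p5.flatten(2).permute(0, 2, 1)                    # (B, H*W, C)
          p5 = self.aifi(tokens).permute(0, 2, 1).reshape(b, c, h, w)
          # CCFF: top-down fusion of the refined P5 into P4, then P4 into P3.
          p4 = self.fuse_p4(torch.cat([p4, F.interpolate(p5, size=p4.shape[-2:])], dim=1))
          p3 = self.fuse_p3(torch.cat([p3, F.interpolate(p4, size=p3.shape[-2:])], dim=1))
          return p3, p4, p5
  # Example: P3/P4/P5 maps for a 640 x 640 input with 256 channels per scale.
  enc = HybridEncoderSketch()
  outs = enc(torch.randn(1, 256, 80, 80), torch.randn(1, 256, 40, 40), torch.randn(1, 256, 20, 20))
  print([o.shape for o in outs])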
(3)
Uncertainty Minimization in Query Selection
Smoke targets can be elusive and are frequently missed in detection. To mitigate this, we employed an uncertainty-minimization strategy for query selection. Specifically, we generated a pool of candidate query features and selected those with higher uncertainty for initialization in the decoder. This tactic ensures that harder-to-detect or ambiguous smoke regions are prioritized, compelling the decoder to devote greater attention to regions prone to false negatives or complex shape deformations.
Specifically, uncertainty is measured using the predicted classification confidence scores and bounding-box regression variances. Queries with higher uncertainty scores—indicating lower confidence or higher variance in predicted bounding-box coordinates—are dynamically prioritized, prompting the model to iteratively refine predictions on these challenging smoke regions. The feature uncertainty U is quantified by the divergence between the predicted distributions of localization P and classification C (Equation (1)). To prioritize uncertain queries during training, U is incorporated into the gradient-driven loss function (Equation (2)), enabling the model to dynamically focus on hard-to-detect regions.
$$U(\hat{x}) = \lVert P(\hat{x}) - C(\hat{x}) \rVert, \quad \hat{x} \in \mathbb{R}^{D \times Z} \tag{1}$$
$$L(\hat{x}, \hat{y}, y) = L_{box}(\hat{b}, b) + L_{cls}\left(U(\hat{x}), \hat{c}, c\right) \tag{2}$$
where $\hat{y}$ and $y$ denote the prediction and ground truth, with $\hat{y} = \{\hat{c}, \hat{b}\}$; $\hat{c}$ and $\hat{b}$ represent the category and bounding box, respectively; and $\hat{x}$ represents the encoder feature.
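A minimal PyTorch sketch of this selection step is shown below; the tensors enc_cls and enc_box_var and the simple uncertainty proxy are illustrative assumptions standing in for the quantity defined in Equation (1), not the exact implementation.
  import torch
  def select_uncertain_queries(enc_cls: torch.Tensor, enc_box_var: torch.Tensor,
                               num_queries: int = 300) -> torch.Tensor:
      # enc_cls:     (B, N, num_classes) classification logits from the encoder.
      # enc_box_var: (B, N, 4) predicted variance of the box coordinates.
      # Returns the indices of the num_queries most uncertain candidates per image.
      conf = enc_cls.sigmoid().max(dim=-1).values             # best-class confidence, (B, N)
      # Simple proxy for U in Equation (1): low confidence and high box variance
      # both raise the score of a candidate query.
      uncertainty = (1.0 - conf) + enc_box_var.mean(dim=-1)   # (B, N)
      return uncertainty.topk(num_queries, dim=-1).indices    # (B, num_queries)
  # Example: batch of 2 images with 1000 encoder candidates and a single "smoke" class.
  idx = select_uncertain_queries(torch.randn(2, 1000, 1), torch.rand(2, 1000, 4))
  print(idx.shape)  # torch.Size([2, 300])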
(4)
Auxiliary Prediction Heads
To further refine detection, our Transformer decoder applies auxiliary prediction heads at multiple decoding layers. Each layer iteratively updates the bounding-box coordinates and class predictions, improving accuracy step by step. This iterative design is especially advantageous in addressing shape distortions that occur as smoke evolves over time.

3.2. WShapeIoU Loss Function

3.2.1. Limitations of Existing IoU-Based Losses

Intersection over Union (IoU) is among the most widely adopted metrics in object detection for evaluating the overlap between predicted bounding boxes and ground-truth boxes. However, vanilla IoU and its derivatives often struggle in the following cases:
(1)
Aspect Ratio Mismatch: Smoke can exhibit extremely elongated or irregular forms, making bounding-box regression sensitive to slight inaccuracies. Traditional IoU measures may fail to guide the model effectively when aspect ratios diverge significantly from those of typical rectangular objects.
(2)
Displacement Sensitivity: Generalized IoU partially addresses edge cases by introducing the concept of the convex hull but can still yield high similarity scores for boxes that share comparable areas yet are spatially displaced.
(3)
Slow Convergence on Hard Examples: Focal mechanisms have been shown to improve classification by prioritizing difficult samples. However, existing IoU-based regression losses seldom incorporate similar strategies to address poorly localized bounding boxes.
Given the complex shape dynamics of smoke—semitransparency, blurred boundaries, and fluid deformations—these limitations become more pronounced, leading to unstable training and inaccurate bounding-box predictions.

3.2.2. ShapeIoU as a Baseline

To mitigate aspect ratio issues and incorporate shape information, we began with ShapeIoU [43], formulated to integrate both distance-based and shape-based components into the loss function. As illustrated in Figure 2:
$$GIoU = IoU - \frac{A_c - U}{A_c} \tag{3}$$
$$IoU = \frac{\lvert B^{gt} \cap B \rvert}{\lvert B^{gt} \cup B \rvert} \tag{4}$$
$$Loss_{GIoU} = 1 - GIoU \tag{5}$$
$$distance^{shape} = hh \times \frac{\left(x_c - x_c^{gt}\right)^2}{c^2} + ww \times \frac{\left(y_c - y_c^{gt}\right)^2}{c^2} \tag{6}$$
The horizontal and vertical weighting coefficients, $ww$ and $hh$, are defined as follows:
$$ww = \frac{2 \times \left(w^{gt}\right)^{scale}}{\left(w^{gt}\right)^{scale} + \left(h^{gt}\right)^{scale}} \tag{7}$$
$$hh = \frac{2 \times \left(h^{gt}\right)^{scale}}{\left(w^{gt}\right)^{scale} + \left(h^{gt}\right)^{scale}} \tag{8}$$
The shape-value loss $\Omega^{shape}$ is given by the following equations:
$$\Omega^{shape} = \sum_{t = w, h} \left(1 - e^{-\omega_t}\right)^{\theta}, \quad \theta = 4, \qquad \omega_w = hh \times \frac{\lvert w - w^{gt} \rvert}{\max\left(w, w^{gt}\right)}, \quad \omega_h = ww \times \frac{\lvert h - h^{gt} \rvert}{\max\left(h, h^{gt}\right)} \tag{9}$$
$$L_{ShapeIoU} = 1 - IoU + distance^{shape} + 0.5 \times \Omega^{shape} \tag{10}$$
where $L_{ShapeIoU}$ represents the overall loss, with $IoU$ calculated according to Equation (4); $distance^{shape}$ denotes the shape distance loss, as shown in Equation (6); and $\Omega^{shape}$ captures the shape-value loss, penalizing significant deviations in width or height. Here, $B$ and $B^{gt}$ are the predicted and ground-truth boxes with centers $(x_c, y_c)$ and $(x_c^{gt}, y_c^{gt})$ and sizes $(w, h)$ and $(w^{gt}, h^{gt})$; $A_c$ is the area of their smallest enclosing box, $U$ is their union area, and $c$ is the diagonal length of the smallest enclosing box.
By incorporating shape distances, ShapeIoU provides finer guidance to the model; this is particularly beneficial for objects exhibiting significant geometric diversity, such as smoke plumes.

3.2.3. Proposed WShapeIoU

Although ShapeIoU introduces shape awareness, it does not explicitly prioritize hard examples—a crucial aspect in smoke detection, where bounding boxes can be highly distorted or partially transparent. To address this shortcoming, we propose WShapeIoU, which integrates two novel components inspired by WIoU [44] and Focal-EIoU [45].
We define the following equation:
$$IoU^{focaler} = 1 - \mathrm{clamp}\!\left(\frac{IoU - d}{u - d},\, 0,\, 1\right) \tag{11}$$
where d and u represent the lower and upper bounds of the focal factor, with values set to 0.0 and 0.95, respectively. The clamp function ensures that the result stays between 0 and 1. This term amplifies loss contributions for bounding boxes with low IoU, focusing the model’s updates on difficult samples.
To stabilize training and accelerate convergence, we employed a monotonic scaling factor, as follows:
$$\beta = \left(\frac{L^{*}_{IoU^{focaler}}}{\overline{L}_{IoU^{focaler}}}\right)^{r} \tag{12}$$
Here, the gradient gain $r$ is set to 0.5, and $L^{*}_{IoU^{focaler}}$ is the monotonic focusing coefficient, which effectively reduces the impact of low-quality samples on the loss value, thereby improving classification performance. $\overline{L}_{IoU^{focaler}}$ is the moving-average value with momentum $m$, which helps address the slowed convergence caused by a decrease in $L_{IoU^{focaler}}$.
When these two terms are incorporated into the ShapeIoU structure, the overall WShapeIoU loss is as follows. Please refer to Algorithm A1 in Appendix A for the detailed execution process of the WShapeIoU loss function code.
$$L_{WShapeIoU} = \beta \times IoU^{focaler} + distance^{shape} + 0.5 \times \Omega^{shape} \tag{13}$$
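A compact PyTorch sketch of this combination, written from Equations (6)-(13) and the hyperparameters stated above (d = 0.0, u = 0.95, r = 0.5), is given below; it is an illustrative reimplementation rather than the released training code, and the running mean that is normally tracked with momentum m is passed in as a plain argument here.
  import torch
  def wshape_iou_loss(pred, target, scale=0.0, d=0.0, u=0.95, r=0.5,
                      iou_focaler_mean=0.5, eps=1e-7):
      # pred, target: (N, 4) boxes in (x1, y1, x2, y2) format.
      # iou_focaler_mean stands in for the running mean tracked with momentum m.
      w1, h1 = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
      w2, h2 = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
      cx1, cy1 = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
      cx2, cy2 = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
      # Plain IoU (Equation (4)).
      iw = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(0)
      ih = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(0)
      inter = iw * ih
      iou = inter / (w1 * h1 + w2 * h2 - inter + eps)
      # Shape weights (Equations (7)-(8)).
      ww = 2 * w2.pow(scale) / (w2.pow(scale) + h2.pow(scale))
      hh = 2 * h2.pow(scale) / (w2.pow(scale) + h2.pow(scale))
      # Shape distance over the diagonal of the smallest enclosing box (Equation (6)).
      cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
      ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
      c2 = cw.pow(2) + ch.pow(2) + eps
      dist_shape = hh * (cx1 - cx2).pow(2) / c2 + ww * (cy1 - cy2).pow(2) / c2
      # Shape-value term (Equation (9)).
      omega_w = hh * (w1 - w2).abs() / torch.max(w1, w2)
      omega_h = ww * (h1 - h2).abs() / torch.max(h1, h2)
      shape_cost = (1 - torch.exp(-omega_w)).pow(4) + (1 - torch.exp(-omega_h)).pow(4)
      # Focaler term and monotonic scaling factor beta (Equations (11)-(12)).
      iou_focaler = 1 - ((iou - d) / (u - d)).clamp(0, 1)
      beta = (iou_focaler.detach() / iou_focaler_mean).pow(r)
      # Final WShapeIoU loss (Equation (13)), averaged over the batch of boxes.
      return (beta * iou_focaler + dist_shape + 0.5 * shape_cost).mean()
  # Example with one predicted box and one ground-truth box.
  loss = wshape_iou_loss(torch.tensor([[10., 10., 60., 80.]]), torch.tensor([[12., 14., 58., 90.]]))
  print(loss)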
Key Benefits for Smoke Detection:
(1)
Prioritization of Difficult Bounding Boxes: The focal factor compels the model to learn from ambiguous smoke regions where bounding-box overlap remains low or uncertain.
(2)
Adaptation to Fluid Shapes: The distance and shape terms ensure that bounding boxes can adjust to irregular boundaries and aspect ratios.
(3)
Stable Convergence: β dynamically scales the loss, accelerating optimization on harder boxes while maintaining robust updates on easier ones.
In Section 4, we demonstrate that WShapeIoU significantly improves both bounding-box regression accuracy and model-convergence speed, particularly in scenarios featuring thin or diffused smoke plumes.
Figure 2. WShapeIoU calculation structure diagram.

3.3. CoordAtt Attention Mechanism

In deep learning, attention mechanisms have been extensively utilized to enhance feature representation by focusing on the most relevant spatial or channel components. Conventional modules such as CBAM or SENet provide channel or spatial attention but often overlook coordinate information, which can be critical when handling elongated or irregularly shaped smoke.
To better retain positional cues, we adopted Coordinate Attention (CoordAtt) [46], which decomposes global pooling into vertical and horizontal one-dimensional aggregations.
Algorithm A2 in Appendix A delineates the implementation of the CoordAtt attention mechanism. As illustrated in Figure 3, given an input feature map $x$, the CoordAtt mechanism performs 1D pooling operations along both the horizontal and vertical axes to generate feature vectors for each direction. For channel $c$ and height $h$, the output in the vertical direction is calculated as follows:
$$z_c^{h}(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i) \tag{14}$$
where $x_c(h, i)$ represents the value at coordinate point $(h, i)$ in channel $c$.
Similarly, in the horizontal direction, the output for channel c with width w is expressed as follows:
$$z_c^{w}(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w) \tag{15}$$
where $x_c(j, w)$ represents the value at coordinate point $(j, w)$ in channel $c$.
These transformations aggregate features along both spatial directions, creating a pair of direction-aware feature maps. Next, the feature maps from Equations (14) and (15) are concatenated along the spatial dimension and input into a shared 1 × 1 convolution transformation function F1, resulting in the following equation:
$$f = \sigma\left(F_1\left(\left[z^{h}, z^{w}\right]\right)\right) \tag{16}$$
where $[z^{h}, z^{w}]$ denotes concatenation along the spatial dimension, $\sigma$ is a non-linear activation function, and $f \in \mathbb{R}^{C/r \times (H+W)}$ is an intermediate feature map containing spatial information for both directions. Next, $f$ is split along the spatial dimension into two separate tensors, $f^{h} \in \mathbb{R}^{C/r \times H}$ and $f^{w} \in \mathbb{R}^{C/r \times W}$. The 1 × 1 convolution transformations $F_h$ and $F_w$ are applied to $f^{h}$ and $f^{w}$, respectively, converting them into tensors with the same channel number as the input $x$, resulting in the following equations:
$$g^{h} = \sigma\left(F_h\left(f^{h}\right)\right) \tag{17}$$
$$g^{w} = \sigma\left(F_w\left(f^{w}\right)\right) \tag{18}$$
The outputs $g^{h}$ and $g^{w}$ are then expanded and used as attention weights. Finally, the output of the CoordAtt mechanism, $y$, is expressed as follows:
$$y_c(i, j) = x_c(i, j) \times g_c^{h}(i) \times g_c^{w}(j) \tag{19}$$
By embedding both vertical and horizontal spatial information within the channel attention, CoordAtt provides finer localization and longer-range context than standard attention modules. This property is particularly beneficial for smoke edges, which may stretch across large regions of the input image in unpredictable patterns.
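A compact PyTorch version of this mechanism, written from Equations (14)-(19) and Algorithm A2, might look as follows; the reduction ratio and activation follow the original CoordAtt design [46], and the block is an illustrative sketch rather than the exact module inserted into our backbone.
  import torch
  import torch.nn as nn
  class CoordAtt(nn.Module):
      # Illustrative Coordinate Attention block after Hou et al. [46]; a sketch,
      # not the exact module used in the RT-DETR-Smoke backbone.
      def __init__(self, channels: int, reduction: int = 32):
          super().__init__()
          mid = max(8, channels // reduction)
          self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # average over width  -> (N, C, H, 1)
          self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # average over height -> (N, C, 1, W)
          self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
          self.bn1 = nn.BatchNorm2d(mid)
          self.act = nn.Hardswish()
          self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
          self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)
      def forward(self, x):
          n, c, h, w = x.shape
          x_h = self.pool_h(x)                           # z^h, Equation (14)
          x_w = self.pool_w(x).permute(0, 1, 3, 2)       # z^w, Equation (15), as (N, C, W, 1)
          y = self.act(self.bn1(self.conv1(torch.cat([x_h, x_w], dim=2))))  # Equation (16)
          y_h, y_w = torch.split(y, [h, w], dim=2)
          y_w = y_w.permute(0, 1, 3, 2)                  # back to (N, mid, 1, W)
          a_h = torch.sigmoid(self.conv_h(y_h))          # g^h, Equation (17)
          a_w = torch.sigmoid(self.conv_w(y_w))          # g^w, Equation (18)
          return x * a_h * a_w                           # Equation (19)
  # Example: reweighting a P4-sized feature map.
  att = CoordAtt(channels=256)
  print(att(torch.randn(2, 256, 40, 40)).shape)  # torch.Size([2, 256, 40, 40])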
We inserted CoordAtt blocks into the BasicBlock structures at the P4 and P5 stages of the RT-DETR-Smoke backbone. This placement was chosen to accomplish the following goals:
(1)
Preserve finer-scale details from earlier layers, ensuring smaller or thinner smoke structures are not lost during downsampling;
(2)
Aggregate more global context from deeper layers, a step beneficial for large smoke clusters or multidirectional drifting of smoke plumes.
Empirically, we observed that CoordAtt consistently boosts the discriminability of feature maps, improving both edge localization (critical for bounding-box regression) and contextual understanding (important for reducing false positives in cluttered fire scenes). Its advantages for smoke detection are as follows:
(1)
Enhanced Spatial Encoding: By splitting pooling into vertical and horizontal directions, CoordAtt better captures elongated or curved smoke shapes than conventional 2D pooling mechanisms do.
(2)
Low Overhead: The additional computational cost remains modest compared to the performance gains, preserving the real-time capacity of RT-DETR-Smoke.
(3)
Robust to Background Noise: CoordAtt’s selective emphasis on crucial spatial regions aids in suppressing unrelated textures (e.g., foliage, clouds, or urban structures) that can confound detection.

4. Experiments

4.1. Dataset and Experimental Setup

To validate the effectiveness of RT-DETR-Smoke, we conducted experiments on an Ubuntu 20.04 system using PyTorch 1.10.0 as our deep learning framework. The RT-DETR-R18 model served as our baseline. Table 1 summarizes the hardware and software configurations.
All experiments were trained with consistent hyperparameters, as detailed in Table 2. Specifically, we used 640 × 640 input images and the AdamW optimizer with a momentum of 0.9; these settings are suitable for large-scale smoke-detection tasks in dynamic forest scenes. The remaining hyperparameters include an initial learning rate of 0.01, a batch size of 32, 250 epochs, and a weight decay of 0.0005. These settings ensure the stability and effectiveness of the experiments, enabling efficient smoke detection in complex environments.
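For clarity, the settings above can be restated in code; the snippet below simply collects the values from Table 2 into a configuration dictionary and builds the corresponding AdamW optimizer (the key names and the placeholder model are illustrative, and the stated momentum is interpreted here as AdamW's first beta).
  import torch
  import torch.nn as nn
  # Hyperparameters restating Table 2 (key names are illustrative).
  train_cfg = {
      "input_size": (640, 640),
      "initial_lr": 0.01,
      "batch_size": 32,
      "epochs": 250,
      "weight_decay": 0.0005,
      "momentum": 0.9,  # interpreted here as AdamW's first beta
  }
  model = nn.Linear(8, 8)  # placeholder for the RT-DETR-Smoke network
  optimizer = torch.optim.AdamW(model.parameters(),
                                lr=train_cfg["initial_lr"],
                                betas=(train_cfg["momentum"], 0.999),
                                weight_decay=train_cfg["weight_decay"])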
To systematically assess performance, we adopted Precision, Recall, Average Precision (AP), and mean Average Precision (mAP). For a single class (i.e., “smoke”), AP is computed by integrating precision over recall (Equation (22)), while mAP averages AP across multiple classes if they are available (Equation (23)). These established detection metrics confirm how effectively the model identifies true smoke instances while minimizing both false positives and missed detections.
Precision measures the proportion of correctly identified positive instances among all instances classified as positive. It is defined as follows:
$$Precision = \frac{TP}{TP + FP} \tag{20}$$
where TP denotes the number of true positive detections (correctly detected smoke instances) and FP represents the number of false positives (incorrect detections where non-smoke is identified as smoke). A higher precision indicates fewer false alarms in the detection results.
Recall assesses the model’s ability to identify all actual positive instances. It is calculated as follows:
$$Recall = \frac{TP}{TP + FN} \tag{21}$$
where FN is the number of false negatives (instances where smoke is present but not detected by the model). A higher recall signifies that the model is effectively capturing most of the smoke instances in the dataset.
Average Precision (AP) summarizes the precision-recall trade-off for a single class by integrating the precision over all recall levels. It is computed using the interpolated precision values at different recall thresholds, as follows:
$$AP = \sum_{i=1}^{n-1} \left(r_{i+1} - r_i\right) P_{interp}\left(r_{i+1}\right) \tag{22}$$
In this equation, $r_i$ and $r_{i+1}$ are consecutive recall levels; $P_{interp}(r_{i+1})$ is the interpolated precision at recall $r_{i+1}$; and $n$ is the number of points on the precision-recall curve. The AP provides a single metric that reflects both the precision and recall of the model for a particular class.
Mean Average Precision (mAP) extends the AP metric to multiple classes by averaging the AP across all $k$ classes in the dataset, as follows:
$$mAP = \frac{1}{k} \sum_{i=1}^{k} AP_i \tag{23}$$
Here, $AP_i$ is the Average Precision for the $i$-th class. The mAP offers an overall assessment of the model's detection performance across all classes, making it a comprehensive metric for evaluating object-detection models.
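As a concrete illustration of Equations (20)-(23), the snippet below computes interpolated AP for a toy list of ranked detections; it is a didactic example rather than the evaluation code used for the reported results.
  import numpy as np
  def average_precision(scores, is_tp, num_gt):
      # Interpolated AP for one class from ranked detections (Equation (22)).
      order = np.argsort(-np.asarray(scores, dtype=float))
      tp = np.asarray(is_tp, dtype=float)[order]
      cum_tp, cum_fp = np.cumsum(tp), np.cumsum(1.0 - tp)
      recall = cum_tp / num_gt                   # Equation (21) along the ranked list
      precision = cum_tp / (cum_tp + cum_fp)     # Equation (20) along the ranked list
      # Make precision monotonically non-increasing, then integrate over recall.
      precision = np.maximum.accumulate(precision[::-1])[::-1]
      recall = np.concatenate(([0.0], recall))
      precision = np.concatenate(([precision[0]], precision))
      return float(np.sum((recall[1:] - recall[:-1]) * precision[1:]))
  # Toy example: 5 ranked detections, 4 ground-truth smoke instances.
  ap = average_precision(scores=[0.9, 0.8, 0.7, 0.6, 0.5], is_tp=[1, 1, 0, 1, 0], num_gt=4)
  print(round(ap, 3))  # single-class AP; mAP averages this over all k classes (Equation (23))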

4.2. Dataset Description

Our custom dataset consists of 16,316 images extracted from real-world outdoor surveillance footage, covering a variety of environmental conditions, including forest scenes, rural scenes, and more. Each image was manually annotated using the “LabelImg” tool to ensure that the bounding boxes accurately captured the subtle differences in small smoke plumes or diffuse smoke.
(1)
Scene Distribution
The images in the dataset are distributed across scenes as follows: approximately 70% forest scenes and 30% rural scenes. These scenes encompass a rich variety of natural and manmade environments, including dense forests, rural areas, and agricultural areas.
(2)
Smoke Size and Shape Distribution
Additionally, the dataset includes a range of smoke sizes and shapes, from faint smoke to dense smoke. Small smoke plumes (with smaller width and height) account for approximately 60% of the total images; medium smoke plumes account for about 30%; and large smoke plumes account for about 10%. These different scales of smoke plumes present distinct visual characteristics in the images, ranging from faint diffuse smoke to intense fire smoke, further enhancing the diversity of the dataset.
To thoroughly evaluate the model’s generalization ability, we split the dataset into training and validation sets in an 8:2 ratio. Figure 4 shows representative samples, illustrating images with various environmental conditions (such as dense forests, open fields, and residential areas) and different smoke sizes (ranging from faint smoke to dense smoke), further highlighting the diversity and coverage of the dataset.

4.3. Experimental Results

4.3.1. Comparison of Loss Functions

An integral component of RT-DETR-Smoke is our newly introduced WShapeIoU loss function. To highlight the innovative nature and significant impact of the proposed WShapeIoU loss function, we compared it to ten existing loss functions—CIoU, SIoU, GIoU, InnerIoU, ShapeIoU, MDPIoU, NWD, WIoU, PIoU, and PIoU2—using RT-DETR-R18 as the baseline. To ensure a fair comparison, all experiments were conducted using the same baseline model, as shown in Table 2. These hyperparameters were kept identical across all loss-function experiments, ensuring that the only variable being compared was the loss function itself. We report the results in terms of mAP@0.5 and mAP@0.5:0.95, which emphasize both detection precision and generalization across varying IoU thresholds. This setup allows for a rigorous and reproducible comparison of the loss functions.
Table 3 shows that WShapeIoU yields the highest mAP@0.5 (0.87224), surpassing GIoU by 1.242% and ShapeIoU by 0.845%. Particularly in forest environments characterized by complex backgrounds and small, low-contrast smoke plumes, WShapeIoU guided the model to focus on erroneous samples—encouraging more fine-grained bounding-box corrections. This is due to the focal factor and monotonic scaling, which prioritize bounding boxes with large deviations in shape or position.
WShapeIoU’s improvements underscore its ability to handle irregular and fluid smoke boundaries better than other IoU variants do. By downweighting well-predicted samples and penalizing large bounding-box discrepancies, it accelerates model convergence while preserving robust detection accuracy, especially in dynamic fire scenarios.

4.3.2. Comparison of Attention Mechanisms

We further reinforced RT-DETR-Smoke’s capability by integrating CoordAtt into the backbone network. To highlight CoordAtt’s distinct positional encoding advantages, we compared it with ten other mechanisms (ECA, MLCA, Deformable_LKA, etc.). All experiments used WShapeIoU to isolate the attention module’s effect.
As seen in Table 4, CoordAtt yields the greatest values of mAP@0.5 (0.8775) and mAP@0.5:0.95 (0.5233), with comparable or lower parameter counts and GFLOPs. Notably, CoordAtt improved mAP@0.5 by an average of 1.2% over other methods, reaching a maximum gain of 2.186% compared to Deformable_LKA, ECA, and MLCA.
The positional embedding inherent in CoordAtt allows the model to pinpoint crucial smoke edges and ignore irrelevant background clutter; these abilities are particularly beneficial in long-range or multitarget fire environments. The mechanism’s low overhead further preserves real-time inference—a key requirement for rapid fire detection.

4.3.3. Model-Comparison Experiments

To verify the real-time end-to-end object-detection capabilities of the RT-DETR-Smoke model, we measured FPS on an NVIDIA A100 GPU using TensorRT FP16 precision. For YOLOv5 through YOLOv9, Non-Maximum Suppression (NMS) was incorporated during the ONNX export before the models were converted to TensorRT for FPS testing; for the other models, NMS was not included.
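For reference, the sketch below shows one common way to time GPU inference with CUDA events in PyTorch; it is only an illustration of the measurement idea, whereas the FPS values reported here were obtained from TensorRT FP16 engines.
  import torch
  @torch.no_grad()
  def measure_fps(model, input_size=(1, 3, 640, 640), warmup=50, iters=300):
      # Rough GPU timing with CUDA events; the reported numbers come from
      # TensorRT FP16 engines, not from this PyTorch loop.
      device = torch.device("cuda")
      model = model.to(device).eval()
      x = torch.randn(*input_size, device=device)
      for _ in range(warmup):
          model(x)                                   # warm up kernels
      start = torch.cuda.Event(enable_timing=True)
      end = torch.cuda.Event(enable_timing=True)
      start.record()
      for _ in range(iters):
          model(x)
      end.record()
      torch.cuda.synchronize()
      ms_per_image = start.elapsed_time(end) / iters
      return 1000.0 / ms_per_image, ms_per_image     # (FPS, latency in ms)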
Table 5 details the exceptional performance of the RT-DETR-Smoke model across multiple metrics. The RT-DETR-Smoke model has 19,937,700 parameters and 56.9 GFLOPs, making it comparable to the RT-DETR-R18 model with nearly no increase in complexity. In terms of computational load, RT-DETR-Smoke has lower requirements compared to many other models, such as YOLOv6x and YOLOv7x, which gives it a clear advantage in inference speed.
RT-DETR-Smoke achieved an mAP50 of 0.8775 and an mAP50-95 of 0.5233, improvements of 1.77% and 1.15%, respectively, over RT-DETR-R18. In comparison with mainstream models like YOLOv5s, YOLOv7, and YOLOv8-DETR, RT-DETR-Smoke demonstrated superior accuracy, especially as measured by mAP50-95, showcasing its stronger adaptability in diverse scenarios.
RT-DETR-Smoke achieved excellent real-time detection performance, with an FPS of 445.50 and a GPU inference time of 2.24464 ms. Although its FPS is slightly lower than RT-DETR-R18’s 506.62 FPS, the difference is minimal, and RT-DETR-Smoke shows an improvement in accuracy. Compared to models like YOLOv5s and YOLOv6s, RT-DETR-Smoke also exhibits a better balance, ensuring high inference speed without significantly sacrificing detection accuracy.
Figure 5 visually illustrates the superior recognition performance of RT-DETR-Smoke in complex forest smoke scenarios, highlighting its ability to maintain fast detection speeds while improving accuracy. Overall, RT-DETR-Smoke not only demonstrates efficiency in computational load and inference time but also exhibits significant advantages in detection accuracy and speed, affirming its overall superiority for object-detection tasks.

4.3.4. Ablation Experiment

To assess the impact of our optimization modules, we conducted ablation experiments using a controlled-variable method. Training and testing were performed on the same dataset with identical training parameters. The results, presented in Table 6, indicate that incorporating the improved WShapeIoU loss function increased the model’s mAP@0.5 by 1.768% and also brought enhancements in other metrics. This suggests that WShapeIoU enables the model to focus more effectively on inaccurately predicted samples while reducing emphasis on well-predicted ones, leading to more precise bounding-box regression and improved accuracy in target localization and size estimation.
Furthermore, to enhance the model’s ability to capture positional information in complex background images or with multiple targets, we incorporated the CoordAtt module into the BasicBlock, generating the CoordAtt_BasicBlock. This module combines spatial coordinate information and decomposes channel and spatial attention, allowing the model to focus on different channels and spatial positions within the feature map simultaneously. The introduction of the CoordAtt_BasicBlock improved the model’s mAP@0.5 and mAP@0.5:0.95 by 0.526% and 2.459%, respectively. As shown in the results of the ablation experiment in Figure 6, the improved model not only achieved significant increases in both mAP metrics but also exhibited faster convergence.
Figure 7 provides a visual comparison of the model’s performance under different optimization settings, including the baseline model, the incorporation of the WShapeIoU loss function, and the integration of the CoordAtt module. In each image, detected smoke regions are indicated by red bounding boxes accompanied by the corresponding confidence scores. In Figure 7a, the baseline model detects smoke in a standard environment with a confidence score of 0.91; this score increases to 0.93 after the WShapeIoU loss function and CoordAtt module have been incorporated, demonstrating a significant improvement in detection accuracy. Figure 7b shows the detection results for specific smoke shapes. The introduction of the WShapeIoU loss function refines the bounding box of the smoke plume, increasing the confidence score to 0.85. In Figure 7c, which represents a low-illumination environment, the addition of the WShapeIoU loss function effectively reduces bounding-box uncertainty, further improving detection accuracy, and the subsequent integration of the CoordAtt module significantly enhances the model’s accuracy and robustness in complex environments, thereby fully validating the effectiveness of the proposed improvements.
To further validate RT-DETR-Smoke’s effectiveness in detecting smoke within complex forest environments, we conducted tests across various forest scenarios. Figure 8 includes a column of images annotated with ground-truth smoke markings, presented alongside the original images and the detection results from RT-DETR-R18 and RT-DETR-Smoke; this ground-truth column provides a clear reference for the actual smoke regions. In Figure 8a, RT-DETR-Smoke demonstrated significantly better performance in detecting smaller smoke plumes, with the mAP rising from 0.77 to 0.83, a 6% improvement over the RT-DETR-R18 baseline. For Figure 8b, involving large smoke plumes, RT-DETR-Smoke improved the mAP from 0.86 to 0.90, indicating superior detection performance on these large targets. In the multitarget scenario of Figure 8c, the RT-DETR-R18 model exhibited discrepancies in both localization and detection, while RT-DETR-Smoke achieved accurate localization and detection for all smoke instances, with all detection scores exceeding 0.8. Finally, in the long-distance-detection scenario of Figure 8d, the original RT-DETR-R18 model made several errors in localization and detection, whereas RT-DETR-Smoke performed much more reliably, with its detections closely aligning with the ground-truth smoke regions.
In summary, RT-DETR-Smoke not only leverages WShapeIoU to handle bounding-box ambiguities but also benefits from CoordAtt, which allows it to retain crucial positional information, thus reducing false alarms and missed detections. The synergy among these components enhances detection accuracy, robustness, and practical applicability, directly aligning with industrial requirements for efficient monitoring in large-scale, dynamic fire scenarios.

5. Conclusions

In this paper, we propose RT-DETR-Smoke, a specialized real-time smoke-detection algorithm designed to address the unique challenges posed by semitransparent, fluid, and often small-scale smoke regions in forest-fire scenarios and other complex environments. As a result of the integration of a Coordinate Attention (CoordAtt) mechanism into the backbone network, the model retains crucial positional cues while effectively suppressing background noise. Furthermore, we introduced the WShapeIoU loss function, which accelerates model convergence and enhances bounding-box regression for difficult or heavily distorted smoke targets. Extensive experiments on a custom smoke dataset underscore the robustness and efficiency of RT-DETR-Smoke. Achieving an mAP@0.5 of 87.75% and a processing speed of 445.50 FPS, our approach demonstrates state-of-the-art real-time performance without compromising detection accuracy. In particular, the synergy of the CoordAtt mechanism and WShapeIoU enables the model to focus on ambiguous or small-scale smoke with blurred edges, significantly reducing false alarms and missed detections compared to baseline methods.
Looking ahead, we intend to broaden our research by collecting more diverse real-world data under varying weather conditions and monitoring modalities (e.g., drones, infrared sensors) to enhance model robustness and generalization in increasingly complex environments. In addition, we will explore further architectural enhancements—for example, optimizing the backbone and expanding the hybrid encoder design—to strengthen the synergy between the backbone and our attention modules. Finally, by optimizing the network architecture and inference strategies, we seek to facilitate deployment on resource-limited platforms, thereby expanding the application of real-time smoke detection to remote or mobile monitoring scenarios.

Author Contributions

Data curation, X.Z.; Formal analysis, T.L.; Methodology, L.L., T.L. and P.S.; Software, L.L.; Supervision, P.S.; Validation, X.Z. and P.S.; Visualization, Z.W.; Writing—original draft, Z.W. and T.L.; Writing—review & editing, Z.W. and T.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 61976198; the Natural Science Research Key Project for Colleges and Universities of Anhui Province, grant numbers 2022AH052141, 2022AH052142, and 2023AH051302; and the Hefei Municipal Natural Science Foundation, grant number 202322.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to data privacy.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A

Algorithm A1: The pseudocode of the WShapeIoU loss calculation
Algorithm Wise-ShapeIoU_Loss(pred, target, scale, L_IoU):
  Convert pred and target from (x, y, w, h) to (x1, y1, x2, y2)
  Compute w1, h1 from predicted box; w2, h2 from ground truth box
  Compute:
    ww = 2 * (w2^scale)/((w2^scale) + (h2^scale))
    hh = 2 * (h2^scale)/((w2^scale) + (h2^scale))
  Calculate convex box dimensions:
    cw = max(b1_x2, b2_x2) − min(b1_x1, b2_x1)
    ch = max(b1_y2, b2_y2) − min(b1_y1, b2_y1)
    c2 = cw^2 + ch^2 + epsilon
  Compute center distances:
    Δx2 = ((b2_x1 + b2_x2 − b1_x1 − b1_x2)^2)/4
    Δy2 = ((b2_y1 + b2_y2 − b1_y1 − b1_y2)^2)/4
    D = (hh * Δx2 + ww * Δy2)/c2
  Compute shape differences:
    ω_w = hh * |w1 − w2|/max(w1, w2)
    ω_h = ww * |h1 − h2|/max(h1, h2)
    C_shape = (1 − exp(-ω_w))^4 + (1 − exp(-ω_h))^4
  Final Loss:
    L_WShapeIoU = L_IoU + D + 0.5 * C_shape
  Return L_WShapeIoU
Algorithm A2: The pseudocode of Coordinate Attention
Algorithm CoordAtt(x, reduction):
  Input: x with dimensions (N, C, H, W)
  Compute x_h = AdaptiveAvgPool(x) along width -> shape: (N, C, H, 1)
  Compute x_w = AdaptiveAvgPool(x) along height -> shape: (N, C, 1, W)
  Transpose x_w to shape (N, C, W, 1)
  Concatenate x_h and x_w along spatial dimension -> y with shape (N, C, (H + W), 1)
  Apply conv1: y′ = h_swish(BN(Conv1(y)))
  Split y′ into y′_h (first H rows) and y′_w (remaining W rows)
  Transpose y′_w back to shape (N, C, 1, W)
  Compute attention maps:
    a_h = sigmoid(Conv_h(y′_h)) --> shape: (N, C, H, 1)
    a_w = sigmoid(Conv_w(y′_w)) --> shape: (N, C, 1, W)
  Element-wise: out = x * a_h * a_w
  Return out

References

  1. Hirsch, E.; Koren, I. Record-breaking aerosol levels explained by smoke injection into the stratosphere. Science 2021, 371, 1269–1274. [Google Scholar] [CrossRef] [PubMed]
  2. Song, H.; Chen, Y. Video smoke detection method based on cell root–branch structure. Signal Image Video Process. 2024, 18, 4851–4859. [Google Scholar] [CrossRef]
  3. Chen, S.; Cao, Y.; Feng, X.; Lu, X. Global2Salient: Self-adaptive feature aggregation for remote sensing smoke detection. Neurocomputing 2021, 466, 202–220. [Google Scholar] [CrossRef]
  4. Jang, H.-Y.; Hwang, C.-H. Revision of the input parameters for the prediction models of smoke detectors based on the FDS. Fire Sci. Eng. 2017, 31, 44–51. [Google Scholar] [CrossRef]
  5. Jang, H.-Y.; Hwang, C.-H. Obscuration threshold database construction of smoke detectors for various combustibles. Sensors 2020, 20, 6272. [Google Scholar] [CrossRef]
  6. Wang, Y.; Piao, Y.; Wang, H.; Zhang, H.; Li, B. An Improved Forest Smoke Detection Model Based on YOLOv8. Forests 2024, 15, 409. [Google Scholar] [CrossRef]
  7. Wang, J.; Zhang, X.; Zhang, C. A lightweight smoke detection network incorporated with the edge cue. Expert Syst. Appl. 2024, 241, 122583. [Google Scholar] [CrossRef]
  8. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  9. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  10. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
  11. Cai, Z.; Vasconcelos, N. Cascade R-CNN: High quality object detection and instance segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1483–1498. [Google Scholar] [CrossRef]
  12. Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C. Sparse r-cnn: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14454–14463. [Google Scholar]
  13. Zhao, L.; Zhi, L.; Zhao, C.; Zheng, W. Fire-YOLO: A small target object detection method for fire inspection. Sustainability 2022, 14, 4930. [Google Scholar] [CrossRef]
  14. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single shot multibox detector. In Computer Vision—ECCV 2016, Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14, 2016; Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  15. Cheng, G.; Chen, X.; Wang, C.; Li, X.; Xian, B.; Yu, H. Visual fire detection using deep learning: A survey. Neurocomputing 2024, 596, 127975. [Google Scholar] [CrossRef]
  16. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  17. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Computer Vision—ECCV 2020, Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
  18. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  19. Roh, B.; Shin, J.; Shin, W.; Kim, S. Sparse detr: Efficient end-to-end object detection with learnable sparsity. arXiv 2021, arXiv:2111.14330.
  20. Li, F.; Zeng, A.; Liu, S.; Zhang, H.; Li, H.; Zhang, L.; Ni, L.M. Lite detr: An interleaved multi-scale encoder for efficient detr. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 18558–18567.
  21. Zheng, D.; Dong, W.; Hu, H.; Chen, X.; Wang, Y. Less is more: Focus attention for efficient detr. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 6674–6683.
  22. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974.
  23. Wu, S.; Zhang, L. Using popular object detection methods for real time forest fire detection. In Proceedings of the 2018 11th International Symposium on Computational Intelligence and Design (ISCID), Hangzhou, China, 8–9 December 2018; pp. 280–284.
  24. Guo, H.; Bai, H.; Zhou, Y.; Li, W. DF-SSD: A deep convolutional neural network-based embedded lightweight object detection framework for remote sensing imagery. J. Appl. Remote Sens. 2020, 14, 014521.
  25. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
  26. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. Yolov9: Learning what you want to learn using programmable gradient information. arXiv 2024, arXiv:2402.13616.
  27. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. Yolov10: Real-time end-to-end object detection. arXiv 2024, arXiv:2405.14458.
  28. Abdusalomov, A.; Baratov, N.; Kutlimuratov, A.; Whangbo, T.K. An improvement of the fire detection and classification method using YOLOv3 for surveillance systems. Sensors 2021, 21, 6519.
  29. Zheng, X.; Chen, F.; Lou, L.; Cheng, P.; Huang, Y. Real-time detection of full-scale forest fire smoke based on deep convolution neural network. Remote Sens. 2022, 14, 536.
  30. Wang, J.; Zhang, X.; Jing, K.; Zhang, C. Learning precise feature via self-attention and self-cooperation YOLOX for smoke detection. Expert Syst. Appl. 2023, 228, 120330.
  31. Al-Smadi, Y.; Alauthman, M.; Al-Qerem, A.; Aldweesh, A.; Quaddoura, R.; Aburub, F.; Mansour, K.; Alhmiedat, T. Early wildfire smoke detection using different yolo models. Machines 2023, 11, 246.
  32. Yang, Z.; Shao, Y.; Wei, Y.; Li, J. Precision-Boosted Forest Fire Target Detection via Enhanced YOLOv8 Model. Appl. Sci. 2024, 14, 2413.
  33. Barmpoutis, P.; Dimitropoulos, K.; Kaza, K.; Grammalidis, N. Fire detection from images using faster R-CNN and multidimensional texture analysis. In Proceedings of the ICASSP 2019–2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 8301–8305.
  34. Chaoxia, C.; Shang, W.; Zhang, F. Information-guided flame detection based on faster R-CNN. IEEE Access 2020, 8, 58923–58932.
  35. Duan, K.; Xie, L.; Qi, H.; Bai, S.; Huang, Q.; Tian, Q. Corner proposal network for anchor-free, two-stage object detection. In Computer Vision—ECCV 2020, Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 399–416.
  36. Huang, J.; Zhou, J.; Yang, H.; Liu, Y.; Liu, H. A small-target forest fire smoke detection model based on deformable transformer for end-to-end object detection. Forests 2023, 14, 162.
  37. Li, Y.; Zhang, W.; Liu, Y.; Jing, R.; Liu, C. An efficient fire and smoke detection algorithm based on an end-to-end structured network. Eng. Appl. Artif. Intell. 2022, 116, 105492.
  38. Liang, T.; Zeng, G. Fsh-detr: An efficient end-to-end fire smoke and human detection based on a deformable detection transformer (detr). Sensors 2024, 24, 4077.
  39. Sun, B.; Cheng, X. Smoke Detection Transformer: An Improved Real-Time Detection Transformer Smoke Detection Model for Early Fire Warning. Fire 2024, 7, 488.
  40. Ren, D.; Wang, Z.; Sun, H.; Liu, L.; Wang, W.; Zhang, J. Salience Feature Guided Decoupling Network for UAV Forests Flame Detection. Expert Syst. Appl. 2025, 270, 126414.
  41. Zhao, H.; Jin, J.; Liu, Y.; Guo, Y.; Shen, Y. FSDF: A high-performance fire detection framework. Expert Syst. Appl. 2024, 238, 121665.
  42. Ding, S.; Zhang, H.; Zhang, Y.; Huang, X.; Song, W. Hyper real-time flame detection: Dynamic insights from event cameras and FlaDE dataset. Expert Syst. Appl. 2025, 263, 125746.
  43. Zhang, H.; Zhang, S. Shape-iou: More accurate metric considering bounding box shape and scale. arXiv 2023, arXiv:2312.17663.
  44. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding box regression loss with dynamic focusing mechanism. arXiv 2023, arXiv:2301.10051.
  45. Zhang, H.; Zhang, S. Focaler-IoU: More Focused Intersection over Union Loss. arXiv 2024, arXiv:2401.10525.
  46. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722.
  47. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666.
  48. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. AAAI Conf. Artif. Intell. 2020, 34, 12993–13000.
  49. Gevorgyan, Z. SIoU loss: More powerful learning for bounding box regression. arXiv 2022, arXiv:2205.12740.
  50. Liu, C.; Wang, K.; Li, Q.; Zhao, F.; Zhao, K.; Ma, H. Powerful-IoU: More straightforward and faster bounding box regression loss with a nonmonotonic focusing mechanism. Neural Netw. 2024, 170, 276–284.
  51. Zhang, H.; Xu, C.; Zhang, S. Inner-IoU: More effective intersection over union loss with auxiliary bounding box. arXiv 2023, arXiv:2311.02877.
  52. Ma, S.; Xu, Y. MPDIoU: A loss for efficient and accurate bounding box regression. arXiv 2023, arXiv:2307.07662.
  53. Wang, J.; Xu, C.; Yang, W.; Yu, L. A normalized Gaussian Wasserstein distance for tiny object detection. arXiv 2021, arXiv:2110.13389.
  54. Azad, R.; Niggemeier, L.; Hüttemann, M.; Kazerouni, A.; Aghdam, E.K.; Velichko, Y.; Bagci, U.; Merhof, D. Beyond self-attention: Deformable large kernel attention for medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 1287–1297.
  55. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542.
  56. Wan, D.; Lu, R.; Shen, S.; Xu, T.; Lang, X.; Ren, Z. Mixed local channel attention for object detection. Eng. Appl. Artif. Intell. 2023, 123, 106442.
  57. Goyal, A.; Bochkovskiy, A.; Deng, J.; Koltun, V. Non-deep networks. Adv. Neural Inf. Process. Syst. 2022, 35, 6789–6801.
  58. Zhang, Q.-L.; Yang, Y.-B. Sa-net: Shuffle attention for deep convolutional neural networks. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 2235–2239.
  59. Liu, H.; Liu, F.; Fan, X.; Huang, D. Polarized self-attention: Towards high-quality pixel-wise regression. arXiv 2021, arXiv:2107.00782.
  60. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 510–519.
  61. Huang, H.; Chen, Z.; Zou, Y.; Lu, M.; Chen, C.; Song, Y.; Zhang, H.; Yan, F. Channel prior convolutional attention for medical image segmentation. Comput. Biol. Med. 2024, 178, 108784.
  62. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154.
  63. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6569–6578.
  64. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. arXiv 2019, arXiv:1904.01355.
  65. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976.
  66. Tian, Y.; Ye, Q.; Doermann, D. Yolov12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524.
Figure 1. The specific structure of RT-DETR-Smoke.
Figure 3. Structure of the CoordAtt (coordinate attention) mechanism.
Figure 4. Examples of annotated smoke images in our dataset, highlighting variations in scale, shape, and visibility.
Figure 5. Comparison of model performance via mAP@0.5, mAP@0.5-0.95, and FPS. Red represents our RT-DETR-Smoke model.
Figure 6. Performance comparison based on ablation experiments.
Figure 7. Ablation study: WShapeIoU Loss and CoordAtt Module. (a) Detection results for normal smoke. (b) Detection results for specific smoke shapes. (c) Detection results for smoke under low-light conditions.
Figure 8. Comparison of results of the ablation experiment. (a) Small-target smoke detection; (b) Large-target smoke detection; (c) Multitarget smoke detection; (d) Long-distance smoke detection.
Table 1. Experimental environment configuration.
Environmental Parameter | Value
Operating system | Ubuntu 20.04
Deep learning framework | PyTorch
Programming language | Python 3.8.12
CPU | AMD EPYC 7713 64-Core Processor
GPU | A100-SXM4-80 GB
RAM | 256 GB
Table 2. Training hyperparameters.
Hyperparameter | Value
Learning Rate | 0.01
Image Size | 640 × 640
Momentum | 0.9
Batch Size | 32
Epochs | 250
Weight Decay | 0.0005
Optimizer | AdamW
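For readers who want to set up a comparable training run, the sketch below shows one way the Table 2 hyperparameters could be wired into PyTorch. It is an illustrative assumption, not the authors' released training script: the one-layer model is a placeholder, and mapping the "Momentum" entry onto AdamW's beta1 is our own reading of the table.

import torch
import torch.nn as nn

# Placeholder network; RT-DETR-Smoke itself is not reproduced here.
model = nn.Conv2d(3, 16, kernel_size=3, padding=1)

# Optimizer settings taken from Table 2.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=0.01,              # Learning Rate
    betas=(0.9, 0.999),   # beta1 = 0.9 stands in for the "Momentum" entry (assumption)
    weight_decay=0.0005,  # Weight Decay
)

EPOCHS = 250             # Epochs
BATCH_SIZE = 32          # Batch Size
IMG_SIZE = (640, 640)    # Image Size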
Table 3. Comparison of loss functions.
Loss Function | mAP50 | mAP50-95
GIoU [47] | 0.85982 | 0.51181
CIoU [48] | 0.85954 | 0.50841
SIoU [49] | 0.86316 | 0.51389
WIoU [44] | 0.8646 | 0.51174
PIoU2 [50] | 0.86218 | 0.51005
InnerIoU [51] | 0.86691 | 0.51334
ShapeIoU [43] | 0.86379 | 0.50705
MPDIoU [52] | 0.86561 | 0.50603
NWD [53] | 0.86377 | 0.51201
PIoU [50] | 0.86002 | 0.51133
WShapeIoU | 0.87224 | 0.50842
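For context on what Table 3 compares: all of these losses build on the intersection over union between a predicted and a ground-truth box, with GIoU [47] adding a penalty based on the smallest enclosing box. The short sketch below computes plain IoU and GIoU for axis-aligned boxes; it is background only and does not reproduce the WShapeIoU formulation proposed in this paper.

def iou_and_giou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection area.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    # Union area.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union

    # Smallest enclosing box C; GIoU subtracts the "wasted" fraction of C.
    area_c = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    giou = iou - (area_c - union) / area_c
    return iou, giou

print(iou_and_giou((0, 0, 10, 10), (5, 5, 15, 15)))  # approx. (0.143, -0.079)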
Table 4. Comparison of attention mechanisms.
Model | Parameters | GFLOPs | mAP50 | mAP50-95
Deformable_LKA [54] | 30,150,052 | 74.0 | 0.86562 | 0.50457
ECA [55] | 20,083,048 | 58.3 | 0.86548 | 0.50842
MLCA [56] | 20,083,068 | 58.3 | 0.86785 | 0.51757
ParNetAttention [57] | 27,302,740 | 66.7 | 0.86317 | 0.50699
ShuffleAttention [58] | 20,083,604 | 58.3 | 0.86597 | 0.51018
SequentialPolarizedAttention [59] | 21,402,200 | 59.5 | 0.87093 | 0.522
SKAttention [60] | 20,739,948 | 58.3 | 0.86484 | 0.51625
CPCA [61] | 21,238,484 | 61.0 | 0.8588 | 0.50086
ParallelPolarizedAttention [59] | 21,402,200 | 59.5 | 0.86706 | 0.51407
DualAttention [62] | 20,694,172 | 58.0 | 0.85564 | 0.49782
CoordAtt [46] | 20,147,684 | 58.3 | 0.8775 | 0.5233
Table 5. Comparison of different models.
Model | Parameters | GFLOPs | mAP50 | mAP50-95 | FPS | GPU Time
CenterNet [63] | 32,665,432 | 70.2 | 0.76551 | 0.37556 | N/A | N/A
SSD [14] | 26,284,974 | 62.7 | 0.77465 | 0.41334 | N/A | N/A
FCOS [64] | 32,154,969 | 161.9 | 0.84500 | 0.45854 | N/A | N/A
YOLOv5n | 2,503,139 | 7.1 | 0.83077 | 0.4867 | 428.69 | 2.3327 ms
YOLOv5s | 9,111,923 | 23.8 | 0.85463 | 0.51421 | 382.29 | 2.6158 ms
YOLOv6n [65] | 4,238,243 | 11.9 | 0.81541 | 0.47587 | 426.57 | 2.34427 ms
YOLOv6s [65] | 116,297,619 | 44.0 | 0.85422 | 0.51601 | 392.52 | 2.54761 ms
YOLOv6m [65] | 51,978,931 | 161.1 | 0.80247 | 0.47384 | 285.50 | 3.50259 ms
YOLOv6l [65] | 110,864,083 | 391.2 | 0.76037 | 0.43131 | 213.89 | 4.67555 ms
YOLOv6x [65] | 172,983,795 | 610.2 | 0.7628 | 0.43215 | 159.22 | 6.28052 ms
YOLOv7x [16] | 70,780,150 | 188.0 | 0.845 | 0.471 | 227.29 | 4.39959 ms
YOLOv7-tiny [16] | 6,007,596 | 13.0 | 0.796 | 0.416 | 386.86 | 2.58493 ms
YOLOv7 [16] | 36,481,772 | 103.2 | 0.836 | 0.461 | 274.48 | 3.64323 ms
YOLOv8n | 3,005,843 | 8.1 | 0.84668 | 0.49937 | 415.49 | 2.40637 ms
YOLOv9t [26] | 2,005,603 | 7.8 | 0.8155 | 0.47887 | 234.35 | 4.2671 ms
YOLOv9s [26] | 7,287,795 | 27.4 | 0.85396 | 0.52243 | 272.67 | 3.66743 ms
YOLOv10x [27] | 31,656,806 | 171.0 | 0.83989 | 0.49463 | 300.50 | 3.32788 ms
YOLOv10l [27] | 25,766,870 | 127.2 | 0.80827 | 0.47293 | 401.45 | 2.49099 ms
YOLOv8-DETR | 6,091,124 | 11.7 | 0.83722 | 0.47444 | 433.21 | 2.30832 ms
YOLOv12n [66] | 2,508,539 | 5.8 | 0.81327 | 0.47284 | 281.31 | 3.55475 ms
RT-DETR-R101 [22] | 74,657,603 | 247.1 | 0.81541 | 0.47587 | 220.51 | 4.53381 ms
RT-DETR-R18 [22] | 19,873,044 | 56.9 | 0.85982 | 0.51181 | 506.62 | 1.97383 ms
RT-DETR-Smoke | 19,937,700 | 56.9 | 0.8775 | 0.5233 | 445.50 | 2.24464 ms
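As a quick sanity check on Table 5, the FPS column is the reciprocal of the per-image GPU time: 1000 divided by the latency in milliseconds. The snippet below illustrates this with the RT-DETR-Smoke row; the helper function name is ours, used only for illustration.

def fps_from_latency_ms(latency_ms: float) -> float:
    # Frames per second from a per-image GPU time in milliseconds.
    return 1000.0 / latency_ms

# RT-DETR-Smoke: 2.24464 ms per image -> about 445.5 FPS,
# matching the reported 445.50 FPS up to rounding.
print(round(fps_from_latency_ms(2.24464), 2))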
Table 6. Results of the ablation experiment.
Model | Baseline | WShapeIoU | CoordAtt | mAP50 | mAP50-95
1 | ✓ | – | – | 0.85954 | 0.50841
2 | ✓ | ✓ | – | 0.87224 | 0.50842
RT-DETR-Smoke | ✓ | ✓ | ✓ | 0.87750 | 0.5233
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

Wang, Z.; Lei, L.; Li, T.; Zu, X.; Shi, P. RT-DETR-Smoke: A Real-Time Transformer for Forest Smoke Detection. Fire 2025, 8, 170. https://doi.org/10.3390/fire8050170