Article

Dual-Path CSDETR: Cascade Stochastic Attention with Object-Centric Priors for High-Accuracy Fire Detection

Dongxing Yu, Bing Han, Xinyi Zhao and Weikai Ren
1 Key Laboratory of Fire Protection Technology for Industry and Public Building, Ministry of Emergency Management, Tianjin 300381, China
2 Tianjin Fire Science and Technology Research Institute of MEM, Tianjin 300381, China
3 Sino-European Institute of Aviation Engineering, Civil Aviation University of China, Tianjin 300300, China
4 Institute of Energy, Peking University, Beijing 100871, China
* Author to whom correspondence should be addressed.
Sensors 2025, 25(18), 5788; https://doi.org/10.3390/s25185788
Submission received: 15 August 2025 / Revised: 13 September 2025 / Accepted: 15 September 2025 / Published: 17 September 2025
(This article belongs to the Section Intelligent Sensors)

Abstract

Detecting dynamic and amorphous objects like fire and smoke poses significant challenges in object detection. To address this, we propose Dual-Path Cascade Stochastic DETR (Dual-Path CSDETR). Unlike Cascade DETR, our model introduces cascade stochastic attention (CSA) to model the irregular morphologies of fire and smoke through variational inference, combined with a dual-path architecture that enables bidirectional feature interaction for enhanced learning efficiency. By integrating object-centric priors from bounding boxes into each decoder layer, the model refines attention mechanisms to focus on critical regions. Experiments show that Dual-Path CSDETR achieves 94% AP50 on fire/smoke detection, surpassing deterministic baselines.

1. Introduction

Fire incidents pose severe threats to life and property, underscoring the urgent need for reliable detection systems. Despite advancements in deep learning [1,2] and transformer-based detection [3,4,5], accurately identifying dynamic, irregular fire/smoke patterns remains challenging due to their variability in appearance [6], diverse environmental contexts, and limited training data. Traditional methods struggle with three critical limitations: sensitivity to combustion materials and fire stages [6], poor generalization across scenarios (e.g., forests vs. indoor environments), and deterministic attention mechanisms’ inability to model complex variations without extensive data.
To address these challenges, we propose Dual-Path Cascade Stochastic DETR (Dual-Path CSDETR), which introduces a cascade stochastic attention (CSA) module leveraging variational inference to probabilistically model fire/smoke morphologies. Unlike conventional approaches, CSA integrates a dual-path architecture—combining a stochastic downward path for iterative attention refinement via bounding box priors and a deterministic upward path for feature stabilization—to enhance bidirectional information flow. By embedding object-centric gamma priors from predicted bounding boxes at each decoder layer, the model adaptively focuses on critical regions while maintaining computational efficiency for real-time deployment. Experiments demonstrate that our framework significantly outperforms existing methods, achieving 94% AP50 in fire/smoke detection.
The remainder of this paper is organized as follows: Section 2 reviews related work, Section 3 presents preliminaries, Section 4 details the Dual-Path CSDETR methodology, Section 5 presents experimental results, and Section 6 concludes the study.

2. Related Works

Recent advances in object detection have been driven by the integration of convolutional and transformer-based architectures. Convolutional Neural Networks (CNNs), including seminal works like Fast R-CNN [7] and YOLO [8], established robust frameworks for feature extraction and bounding box regression. The introduction of the Detection Transformer (DETR) [9] marked a paradigm shift by eliminating handcrafted components through transformer-based object queries and bipartite matching losses. While DETR achieves competitive accuracy, its limitations in small-object detection and occlusion handling prompted innovations like Deformable DETR [10], which accelerates convergence by 40% through deformable attention mechanisms, and Cascade DETR [11], which improves occluded object detection by 8.2% mAP via multi-stage refinement.
In fire and smoke detection, domain-specific adaptations of these frameworks have emerged. Decoder-free transformer models [12] enable early flame recognition through simplified architectures, while Deformable DETR-based detectors [13] achieve real-time performance in dynamic fire scenarios. Lightweight variants like YOLO-FM [14] reduce parameters by 30% while maintaining 45 FPS on edge devices and YOLOv5s hybrids [15] integrate channel attention modules to boost smoke detection F1-scores to 0.89. Despite these advancements, persistent challenges arise from dataset limitations (e.g., 12% annotation errors in public fire datasets [16]) and environmental variability, where deterministic attention mechanisms [17,18,19] falter under rapid illumination changes or dense smoke occlusion [20].
Stochastic attention mechanisms, which model attention weights as probabilistic distributions, offer a promising solution. Early hard-attention approaches [21] employed categorical distributions, but faced optimization barriers due to non-reparameterizability. Bayesian attention [22] addressed this through reparameterizable Dirichlet distributions, yet its application remains confined to shallow NLP models. Recent extensions like deep stochastic networks [23] demonstrate potential in machine translation, but lack adaptation to vision tasks requiring spatial uncertainty modeling—a critical gap for fire detection, where probabilistic attention could mitigate false negatives in occluded scenarios by 9% [22]. This underscores the need to integrate stochastic principles into transformer-based detectors, enabling adaptive feature focusing in high-variance fire environments.

3. Preliminaries

3.1. Bayesian Attention Modules

Bayesian Attention Modules (BAMs) [22] introduce stochasticity into attention mechanisms by modeling attention weights as random variables governed by probabilistic distributions. Unlike traditional deterministic attention, BAM parameterizes the unnormalized attention scores $\hat{W}$ with continuous distributions, such as the Weibull or Lognormal, which capture uncertainty in the attention process. These unnormalized scores are treated as latent variables inferred from the input features. To enable gradient-based optimization, reparameterization is applied, turning the sampling process deterministic and allowing efficient backpropagation. The variational objective $\mathcal{L}_{\mathrm{BAM}}$ combines the task-specific loss $\mathcal{L}_{\mathrm{task}}$ with a KL divergence term that regularizes the posterior $q(\hat{W} \mid X)$ against a data-dependent prior $p(\hat{W} \mid X)$:

$$\mathcal{L}_{\mathrm{BAM}} = \mathbb{E}_{q(\tilde{\alpha} \mid X)}\big[\mathcal{L}_{\mathrm{task}}\big(y, f(X, \tilde{\alpha})\big)\big] - D_{\mathrm{KL}}\big(q(\tilde{\alpha} \mid X) \,\|\, p(\tilde{\alpha} \mid X)\big) \tag{1}$$

where $\tilde{\alpha}$ denotes the set of Gamma shape parameters learned by the network and $D_{\mathrm{KL}}$ denotes the KL divergence operator. This formulation allows BAMs to incorporate uncertainty into attention, enhancing flexibility and generalization while ensuring efficient training via reparameterization.
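To make the objective concrete, the following minimal PyTorch sketch draws one reparameterized Weibull sample and estimates the KL term with a single Monte Carlo sample. All parameter values are placeholders standing in for the outputs of the inference networks, and the analytic KL used in practice appears in Section 4.3.

```python
import torch
from torch.distributions import Weibull, Gamma

# Toy shapes: 8 queries x 8 keys. All parameter values are placeholders.
k = torch.tensor(1.5)                                     # Weibull shape
lam = torch.nn.functional.softplus(torch.randn(8, 8))     # Weibull scale from f(K, Q)
alpha = torch.softmax(torch.randn(8, 8), dim=-1) + 1e-4   # Gamma shape from f(K)
beta = torch.tensor(1.0)                                  # fixed Gamma rate

q = Weibull(scale=lam, concentration=k)    # posterior q(W_hat | X)
p = Gamma(concentration=alpha, rate=beta)  # data-dependent prior p(W_hat | X)

w_hat = q.rsample()                                    # reparameterized sample
kl_mc = (q.log_prob(w_hat) - p.log_prob(w_hat)).sum()  # 1-sample KL estimate
# The full objective adds the task loss computed from a forward pass that
# uses w_hat as (unnormalized) attention weights:
# loss = task_loss(y, f(x, w_hat)) + kl_mc   # minimized
```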

3.2. Cascade-DETR

Cascade-DETR, introduced by Ye et al. [11], enhances the Detection Transformer (DETR) by improving localization accuracy and generalization across diverse domains. Its key innovation is the Cascade Attention Layer, which refines the attention mechanism in the detection decoder. In standard DETR models, the decoder applies cross-attention over the entire image feature map, which can lead to imprecise localization, especially in complex scenes. Cascade-DETR addresses this by iteratively refining attention in each decoding layer, focusing on regions predicted by the previous layer. Specifically, the attention mechanism at layer $l+1$ is confined to regions inside the predicted bounding box $B_l$ from layer $l$, ensuring that attention progressively narrows down to relevant areas. This refinement improves localization precision and enhances DETR's performance across datasets such as COCO and UDB10.
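The box-confined cross-attention can be sketched as a simple mask over flattened image features. This is an illustrative reconstruction, not the authors' implementation; `box_attention_mask` is a hypothetical helper.

```python
import torch

def box_attention_mask(boxes, H, W):
    """Allow each query to attend only inside its box B_l from the
    previous decoder layer. boxes: (Q, 4) as (x1, y1, x2, y2) in pixels;
    returns a (Q, H*W) boolean mask over flattened image features."""
    ys = torch.arange(H).view(H, 1).expand(H, W)
    xs = torch.arange(W).view(1, W).expand(H, W)
    x1, y1, x2, y2 = boxes.unbind(-1)
    inside = ((xs[None] >= x1[:, None, None]) & (xs[None] <= x2[:, None, None]) &
              (ys[None] >= y1[:, None, None]) & (ys[None] <= y2[:, None, None]))
    return inside.flatten(1)

mask = box_attention_mask(torch.tensor([[2.0, 3.0, 10.0, 12.0]]), H=16, W=16)
# scores.masked_fill_(~mask, float("-inf")) would then precede the softmax
```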

4. Dual-Path Cascade Stochastic DETR

4.1. Cascade Stochastic Attention Layers

In this section, we present the formulation for learning the distribution of stochastic attention weights. Note that unnormalized attention weights have been shown to outperform normalized weights in capturing distribution patterns [22]. Let the unnormalized attention weights be denoted as $\hat{W} = \{\hat{W}_l\}_{l=1:L}$, treated as data-dependent latent variables sampled from a distribution $q_\phi$. Following prior work [22,24], we employ amortized variational inference $q_\phi(\hat{W})$ to approximate the true posterior of the local attention weights under a Bayesian framework, conditioned on the training data $(x, y)$.
For each attention layer $l$, we approximate the posterior distribution of the unnormalized attention weights $p(\hat{W}_l \mid x, y)$ with a variational distribution $q_\phi(\hat{W}_l)$, that is, $q_\phi(\hat{W}_l) \approx p(\hat{W}_l \mid x, y)$. In particular, we parameterize $q_\phi(\hat{W}_l)$ as a Weibull distribution and the prior $p_\eta(\hat{W}_l)$ as a Gamma distribution:

$$q_\phi(\hat{W}_l) = \mathrm{Weibull}\big(k_l;\ \lambda_l = f_\phi(K_l, Q_l)\big) \tag{2}$$

$$p_\eta(\hat{W}_l) = \mathrm{Gamma}\big(\alpha_l = f_\eta(K_l);\ \beta\big) \tag{3}$$

where $K_l$ and $Q_l$ denote the key and query of the $l$-th attention layer, $k_l$ and $\lambda_l$ are the Weibull shape and scale parameters, $\alpha_l$ and $\beta$ denote the Gamma shape and rate parameters, and $f_\phi$ and $f_\eta$ are neural networks parameterized by $\phi$ and $\eta$, respectively. The hyperparameter $\beta$ is globally fixed, while $k_l$ is layer-specific.
The rationale for these distribution choices is twofold: (1) The Weibull distribution offers dynamic adaptability to scale variations, which is crucial for fire detection scenarios. In the early layers, a small $k_l$ mimics an exponential distribution to explore diffuse regions (e.g., smoke edges). As the layers deepen, increasing $k_l$ reduces variance, resembling a Gaussian-like distribution for precise fire-core localization. (2) The Gamma distribution serves as a natural prior for modeling positive-valued variables, such as attention weights, and enables efficient computation of the KL divergence. The combination of these distributions thus supports effective learning and model convergence in fire detection tasks.
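The adaptability claim can be checked numerically: at a fixed scale, the Weibull variance collapses as the shape $k$ grows, which is exactly the exploratory-to-concentrated behavior exploited across decoder layers. A short self-contained check:

```python
import math

def weibull_mean_var(k, lam=1.0):
    """Mean and variance of Weibull(k, lam); variance shrinks as k grows."""
    mean = lam * math.gamma(1.0 + 1.0 / k)
    var = lam**2 * (math.gamma(1.0 + 2.0 / k) - math.gamma(1.0 + 1.0 / k) ** 2)
    return mean, var

for k in (0.8, 1.0, 2.0, 5.0):  # k = 1 recovers the exponential distribution
    m, v = weibull_mean_var(k)
    print(f"k = {k}: mean = {m:.3f}, variance = {v:.3f}")
```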
The variational formulation extends naturally to multi-layer attention with $L$ layers. Both the approximate posterior $q_\phi(\hat{W}_{1:L})$ and the prior $p_\eta(\hat{W}_{1:L})$ are modeled as joint distributions, factorized with the chain rule to preserve layer-wise dependencies:

$$q_\phi(\hat{W}_{1:L}) = \prod_{l=1}^{L} q_\phi(\hat{W}_l \mid \hat{W}_{l-1}), \qquad p_\eta(\hat{W}_{1:L}) = \prod_{l=1}^{L} p_\eta(\hat{W}_l \mid \hat{W}_{l-1}) \tag{4}$$

where the prior employs factorized Gamma distributions for conjugacy. The semi-analytic KL divergence between the Weibull posterior and the Gamma prior ensures efficient gradient computation (Section 4.3).

4.2. Dual-Path CSDETR Architecture

In this section, we describe the architecture of Dual-Path CSDETR, which injects both a local object-centric bias and a dual-path structure into the CSA module of Section 4.1. The architecture is based on standard Cascade DETR [11], which consists of encoder and decoder layers. The encoder remains unchanged, with its features passed to the decoder. The decoder uses learnable object queries to localize and classify fire/smoke. Self-attention remains deterministic to model query relationships, while cross-attention employs stochastic attention layers to capture fire/smoke distributions. Two modifications integrate CSA with Cascade DETR, a bounding-box-based prior distribution and a dual-path architecture, enhancing information utilization under limited training data.
  • Dual-Path Architecture
As shown by Zhang et al. [23] and Ren et al. [25], using only the downward structure (i.e., the previous decoder layers) can lead to instability. We therefore employ a dual-path architecture that combines a downward (prior-driven) path and an upward (likelihood-driven) path to stabilize training. Hence, $\lambda_l$ in Equation (2) is obtained by

$$\lambda_l = \frac{\sigma \cdot \ln\!\big(1 + \exp(f_{\phi,1}^l(h_l))\big) + \exp(\Phi_l)}{\Gamma(1 + 1/k_l)} \ \text{(stochastic downward path)}, \quad h_l = \ln\!\big(1 + \exp(f_{\phi,2}^l(h_{l+1}))\big) \ \text{(deterministic upward path)} \tag{5}$$

where $\sigma \in [0, 1]$ is the importance weight, $f_{\phi,1}^l$ and $f_{\phi,2}^l$ are linear projections, and $\Phi_l$ is the scaled dot product between the keys $K_l$ and queries $Q_l$, i.e., $\Phi_l = \frac{Q_l K_l^\top}{\sqrt{d_k}}$. The upward path propagates deterministic hidden states $h_l$, initialized from $h_{L+1} = \mathrm{softmax}(\Phi_1)$, to refine layer-wise features (Figure 1).
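A minimal sketch of this update follows, assuming (for illustration only) that the hidden states $h_l$ share the queries-by-keys shape of the attention logits $\Phi_l$; the projection sizes, $\sigma$, and $k_l$ are placeholders:

```python
import math
import torch
import torch.nn.functional as F

n_keys = 49                                # flattened spatial positions (toy size)
f_phi_1 = torch.nn.Linear(n_keys, n_keys)  # downward-path projection f_{phi,1}
f_phi_2 = torch.nn.Linear(n_keys, n_keys)  # upward-path projection f_{phi,2}

def dual_path_lambda(h_next, phi_l, k_l=1.5, sigma=0.5):
    # deterministic upward path: h_l = ln(1 + exp(f_{phi,2}(h_{l+1})))
    h_l = F.softplus(f_phi_2(h_next))
    # downward path: blend the projected hidden state with the attention
    # logits, divided by Gamma(1 + 1/k) so the Weibull mean tracks the numerator
    lam = (sigma * F.softplus(f_phi_1(h_l)) + torch.exp(phi_l)) \
          / math.gamma(1.0 + 1.0 / k_l)
    return lam, h_l

# initialize h_{L+1} from the softmax of the logits, then walk down the layers
phi = torch.randn(8, n_keys) / math.sqrt(64.0)  # toy Phi_l for 8 queries
lam_l, h_l = dual_path_lambda(torch.softmax(phi, dim=-1), phi)
```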
  • Object-Centric Prior-Based Attention Refinement
As highlighted in Cascade-DETR, local information surrounding each object is crucial for accurate detection [11]. Building upon this insight, we integrate object-centric spatial priors to inform the learning of the parameters of $p_\eta(\hat{W}_l)$ and $q_\phi(\hat{W}_l)$, i.e., $\lambda_l$ and $\alpha_l$ in Equations (2) and (3), ensuring that the model's attention is focused on object-relevant regions. Mathematically, the refined key is given by $K_{\mathrm{box}}^l = M(B_{l-1})$, where $M$ is a mapping function that extracts feature maps aligned with the bounding box $B_{l-1}$ from the previous layer. Next, using the object-centric prior as a constraint, we rewrite $p_\eta(\hat{W}_l)$ in Equation (3) as

$$p_\eta(\hat{W}_l) = \mathrm{Gamma}\big(\alpha_l = f_\eta(K_{\mathrm{box}}^l);\ \beta\big) \tag{6}$$

A two-layer Multi-Layer Perceptron (MLP) is employed to process $K_{\mathrm{box}}^l$ and produce the predicted parameters $\alpha_l$, which shape the object's prior distribution [23]:

$$\alpha_l = \mathrm{softmax}\Big(f_{\eta,2}^l\big(\mathrm{ReLU}(f_{\eta,1}^l(K_{\mathrm{box}}^l))\big)\Big) \tag{7}$$

where $f_{\eta,1}^l$ and $f_{\eta,2}^l$ are two linear layers connected by a ReLU.
The keys within the bounding box are also refined to generate $q_\phi(\hat{W}_l)$, which models the uncertainty or variability of the features within the box. The attention distribution for regions outside the box is preserved from the previous layer. Specifically, $\Phi_l$ in Equation (5) is updated as $\Phi_l = \frac{Q_l (K_{\mathrm{box}}^l)^\top}{\sqrt{d_k}}$.
By refining the keys within the bounding box and incorporating them into the CSA, the model enhances its attention on regions with a higher likelihood of fire and smoke, as shown in Figure 2. The overall structure of the Dual-Path CSDETR decoder is illustrated in Figure 3.
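A sketch of the prior head in Equation (7), with hypothetical layer sizes; the mapping $M$ that extracts box-aligned keys is assumed to have been applied already:

```python
import torch
import torch.nn as nn

class ObjectCentricPrior(nn.Module):
    """Two-layer MLP mapping box-refined keys K_box to Gamma shape
    parameters alpha_l (Equation (7)); sizes are illustrative."""
    def __init__(self, d_model=256):
        super().__init__()
        self.f1 = nn.Linear(d_model, d_model)  # f_{eta,1}
        self.f2 = nn.Linear(d_model, 1)        # f_{eta,2}

    def forward(self, k_box):                  # (queries, keys_in_box, d_model)
        scores = self.f2(torch.relu(self.f1(k_box))).squeeze(-1)
        return torch.softmax(scores, dim=-1)   # one alpha per key inside the box

prior = ObjectCentricPrior()
alpha_l = prior(torch.randn(8, 49, 256))       # feeds Gamma(alpha_l, beta)
```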

4.3. Training and Inference

Our Cascade Stochastic DETR is trained end-to-end with a multi-task loss $\mathcal{L}_{\mathrm{Loss}}$:

$$\mathcal{L}_{\mathrm{Loss}} = \mathcal{L}_{\mathrm{Detect}} - \mathrm{ELBO} \tag{8}$$

where $\mathcal{L}_{\mathrm{Detect}}$ (inherited from Cascade-DETR) supervises bounding box regression and classification, while the Evidence Lower Bound (ELBO), in terms of the learned parameters $\hat{W}$, combines the reconstruction term and the layer-wise KL divergence:

$$\mathrm{ELBO} = \mathbb{E}_{q_\phi(\hat{W})}\big[\log p_\theta(y \mid x, \hat{W})\big] - \mathrm{KL}\big(q_\phi(\hat{W}) \,\|\, p_\eta(\hat{W})\big) \tag{9}$$
The KL divergence from the variational distribution to the prior can be computed in a semi-analytic way [22] as

$$\mathrm{KL}\big(q_\phi(\hat{W}) \,\|\, p_\eta(\hat{W})\big) = \sum_{l=1}^{L} \mathbb{E}_{q_\phi(\hat{W}_{1:l-1})}\Big[\underbrace{\mathrm{KL}\big(q_\phi(\hat{W}_l \mid \hat{W}_{1:l-1}) \,\|\, p_\eta(\hat{W}_l \mid \hat{W}_{1:l-1})\big)}_{\text{analytic}}\Big] \tag{10}$$

The semi-analytic KL term decomposes into layer-specific divergences between the Weibull posterior $q_\phi(\hat{W}_l \mid \hat{W}_{1:l-1})$ and the Gamma prior $p_\eta(\hat{W}_l \mid \hat{W}_{1:l-1})$:

$$\mathrm{KL}\big(q_\phi(\hat{W}_l; k_l, \lambda_l) \,\|\, p_\eta(\hat{W}_l; \alpha_l, \beta)\big) = \frac{\gamma \alpha_l}{k_l} - \alpha_l \log \lambda_l + \log k_l + \beta \lambda_l \Gamma\Big(1 + \frac{1}{k_l}\Big) - \gamma - 1 - \alpha_l \log \beta + \log \Gamma(\alpha_l) \tag{11}$$

where $\gamma$ is the Euler–Mascheroni constant.
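Equation (11) translates directly into a differentiable function; the following sketch uses `torch.lgamma` for the $\log\Gamma$ terms and assumes all arguments are positive tensors:

```python
import torch

EULER_GAMMA = 0.5772156649015329  # Euler–Mascheroni constant gamma

def kl_weibull_gamma(k, lam, alpha, beta):
    """Analytic KL(Weibull(k, lam) || Gamma(alpha, beta)), per Equation (11)."""
    return (EULER_GAMMA * alpha / k
            - alpha * torch.log(lam)
            + torch.log(k)
            + beta * lam * torch.exp(torch.lgamma(1.0 + 1.0 / k))
            - EULER_GAMMA - 1.0
            - alpha * torch.log(beta)
            + torch.lgamma(alpha))

kl = kl_weibull_gamma(torch.tensor(1.5), torch.tensor(0.8),
                      torch.tensor(0.5), torch.tensor(1.0))
```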
Using Rao-Blackwellization, we reduce the Monte Carlo variance by integrating out the analytic KL components. Reparameterization with uniform noise $\epsilon$ enables differentiable sampling:

$$\mathrm{ELBO} = \mathbb{E}_{\epsilon}\Big[\log p_\theta\big(y \mid x, r(\epsilon)\big) - \sum_{l=1}^{L} \mathrm{KL}\big(q_\phi(\hat{W}_l \mid r_\phi(\epsilon_{1:l-1})) \,\|\, p_\eta(\hat{W}_l \mid r_\phi(\epsilon_{1:l-1}))\big)\Big] \tag{12}$$

where $\hat{W} = r(\epsilon) = \lambda\big(-\ln(1-\epsilon)\big)^{1/k}$ is the reparameterized expression of a Weibull random variable and $\epsilon \sim \mathrm{Uniform}(0, 1)$.
During both learning and inference, we sample $\epsilon_l$ to generate unnormalized attention weights via $r_l(\epsilon_l)$ and then normalize them as described above. This reparameterization technique manages the inherent randomness in the attention mechanism while enabling gradient-based optimization, which is essential for large-scale training.
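The sampler itself is a one-liner; in this sketch the final sum-normalization of the sampled weights is an assumption about the normalization step referenced above:

```python
import torch

def sample_weibull(lam, k, eps_min=1e-6):
    """W_hat = lam * (-ln(1 - eps))**(1/k), eps ~ Uniform(0, 1); the sample
    stays differentiable with respect to lam (and k)."""
    eps = torch.rand_like(lam).clamp(eps_min, 1.0 - eps_min)
    return lam * (-torch.log1p(-eps)) ** (1.0 / k)

w_hat = sample_weibull(torch.full((8, 49), 0.5), k=1.5)
attn = w_hat / w_hat.sum(dim=-1, keepdim=True)  # assumed sum-normalization
```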

5. Experiments

5.1. Datasets and Evaluation Metrics

To evaluate the proposed method for detecting highly variable fire and smoke, four public datasets, ranging from 502 to 21,000 images and representing different scenarios, were selected. The details of the datasets are shown in Table 1. We evaluated Dual-Path CSDETR using mAP@[0.5, 0.95], AP50, and AP75 to quantify detection accuracy across IoU thresholds, where higher values indicate stronger fire localization and classification performance.

5.2. Comparison with State-of-the-Art

We evaluated Dual-Path Cascade Stochastic DETR against Fast R-CNN, YOLO-FM, Deformable DETR, and Cascade DETR on the FireNet, Smoke, D-Fire, and DFS-Fire-Smoke datasets. Using a ResNet-101 backbone (ImageNet-pretrained) with the AdamW optimizer (backbone/transformer learning rates: $1 \times 10^{-5}$ / $1 \times 10^{-4}$), Fast R-CNN, Deformable DETR, Cascade DETR, and our model were trained for 50 epochs on FireNet and Smoke, while YOLO-FM was trained for 100 epochs. On D-Fire and DFS-Fire-Smoke, Fast R-CNN, Deformable DETR, Cascade DETR, and our model were trained for 150 epochs, and YOLO-FM for 300 epochs. All simulations and tests were performed on a workstation with four NVIDIA RTX 3090 GPUs. The total batch size was set to 16, with 4 per GPU.
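The two-learning-rate setup corresponds to standard AdamW parameter groups; a minimal sketch with a stand-in model (the submodule names and the weight decay value are assumptions):

```python
import torch
import torch.nn as nn

class ToyDetector(nn.Module):          # stand-in exposing the two submodules
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(4, 4)
        self.transformer = nn.Linear(4, 4)

model = ToyDetector()
optimizer = torch.optim.AdamW(
    [{"params": model.backbone.parameters(),    "lr": 1e-5},   # ResNet-101
     {"params": model.transformer.parameters(), "lr": 1e-4}],  # encoder/decoder
    weight_decay=1e-4)  # the weight decay value is an assumption
```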
The performance results are shown in Table 2, with the best result for each evaluation metric highlighted in bold. As shown in Table 2, our model achieved state-of-the-art AP50 (0.94 on FireNet, 0.92 on DFS-Fire-Smoke) and superior AP75, demonstrating the precise localization critical for early fire detection. Learning curves in Figure 4 reveal consistent mAP@[0.5, 0.95] dominance across datasets, particularly for the challenging “other” classes, validating the CSA module's false-positive suppression. The object detection results of the proposed method are displayed in Figure 5, further illustrating its robustness and accuracy in detecting fire and smoke across various scenarios. A paired t-test was carried out to verify the significance of the improvement, with the null hypothesis (H0) of no significant improvement. Due to time constraints, we chose only Fast R-CNN, YOLO-FM, Deformable DETR, and Cascade DETR as baseline models. Each model was trained and tested on the same datasets five times, and the precision results were used to calculate the p-values. Paired t-tests (Figure 6) confirmed statistically significant improvements over the baselines (p < 0.05 on FireNet/DFS-Fire-Smoke) and marginal gains on Smoke/D-Fire (p = 0.05–0.1).
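The paired t-test on the five precision scores per model can be reproduced with SciPy; the numbers below are illustrative, not the paper's measurements:

```python
from scipy import stats

# five precision scores per model on the same dataset (illustrative numbers)
ours     = [0.93, 0.94, 0.92, 0.94, 0.93]
baseline = [0.90, 0.91, 0.89, 0.92, 0.90]

t_stat, p_value = stats.ttest_rel(ours, baseline)  # paired t-test
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")      # reject H0 if p < 0.05
```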

5.3. Ablation Study

In this section, we conduct experiments to evaluate the proposed Dual-Path CSDETR. First, we compare the performance of Bayesian attention with that of the proposed stochastic attention (SSA) across different decoder layers, using Cascade-DETR as the baseline and varying the number of stochastic decoder layers with Gamma priors from 1 to 6, to assess the impact of stochastic attention on model performance. Additionally, we study the efficiency of the dual-path structure by comparing it to a single stochastic downward path, highlighting the advantages of integrating stochastic and deterministic pathways for attention refinement and overall model robustness.
The results of the first experiment are presented in Table 3 and Table 4. Table 4 shows that when SSA is applied only to the first three layers while the remaining layers use deterministic attention, the model fails to achieve high AP values. This underperformance is likely due to the limited number of SSA layers, which are insufficient to learn a robust attention distribution, especially when mixed with deterministic attention in the upper layers; such limited application of SSA cannot capture the complex variability necessary for precise fire and smoke detection. However, when all decoder layers are replaced with SSA incorporating Gamma distributions, there is a marked improvement in performance, attributable to the model's ability to learn the joint distribution across multiple layers and thus better handle the uncertainty and variability crucial for accurate detection.
In contrast, Table 3 examines the impact of applying Bayesian attention to different decoder layers. While the results indicate that increasing the number of layers using Bayesian attention generally improves performance, the accuracy remains lower than with SSA. This is because Bayesian attention, which treats attention layers individually, is less suited to the DETR structure in object detection tasks. Applying SSA comprehensively across all layers enables the model to better capture the complex distributions necessary for fire and smoke detection.
The findings from this ablation study highlight the importance of utilizing a sufficient number of SSA layers to fully exploit the benefits of stochastic attention in Dual-Path CSDETR. By replacing all layers with SSA, the model more effectively captures the complex variability required for accurate fire and smoke detection, leading to higher AP values.
We also conducted another ablation study to examine the role of the deterministic upward path. In this study, we compared two variants of Dual-Path CSDETR: one with the deterministic upward path and one without it. The results are shown in Figure 7.
The observations from this experiment indicate that the deterministic upward path significantly enhances model stability. For all datasets, the inclusion of the deterministic upward path results in a more stable learning process and improved convergence. In contrast, without the upward path, the learning process becomes highly unstable and difficult to converge. This instability likely arises from the increased uncertainty and variance in the attention weights, which the deterministic upward path helps to mitigate.
These findings highlight the critical role of the deterministic upward path in ensuring the stability and convergence of Dual-Path CSDETR. By incorporating this path, the model achieves more reliable and consistent performance across different datasets.

6. Conclusions

We propose Dual-Path Cascade Stochastic DETR for fire/smoke detection, integrating cascade stochastic attention and dual-path architecture to address challenges of irregular, dynamic objects. By modeling attention via Weibull/Gamma distributions with object-centric bounding box priors, the framework achieves 94% AP50, outperforming deterministic baselines. Future work will prioritize real-time deployment via model pruning [30], false alert reduction through uncertainty quantification [25], temporal modeling for video analysis, and multi-sensor fusion using Dempster-Shafer theory [31].

Author Contributions

Conceptualization, D.Y.; methodology, D.Y.; software, D.Y.; validation, D.Y. and W.R.; formal analysis, D.Y., B.H. and X.Z.; investigation, D.Y. and X.Z.; resources, B.H.; data curation, W.R.; writing—original draft preparation, D.Y.; writing—review and editing, B.H.; supervision, B.H.; funding acquisition, D.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Key Laboratory of Fire Protection Technology for Industry and Public Building, Ministry of Emergency Management Open Project (2023KLIB01), the National Fire and Rescue Administration Science Project (2023XFCX37), the Natural Science Foundation of Inner Mongolia (2025YQ008), and the National Natural Science Foundation of China (52304034).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

These data were derived from the following resources available in the public domain: D-Fire dataset [28]: https://github.com/gaiasd/DFireDataset (accessed on 12 July 2024); DFS-Fire-Smoke dataset [29]: https://github.com/siyuanwu/DFS-FIRE-SMOKE-Dataset/tree/main (accessed on 16 September 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, X.; Wang, B.; Luo, P.; Wang, L.; Wu, Y. A Metric Learning-Based Improved Oriented R-CNN for Wildfire Detection in Power Transmission Corridors. Sensors 2025, 25, 3882.
  2. Desikan, J.; Singh, S.K.; Jayanthiladevi, A.; Bhushan, S.; Rishiwal, V.; Kumar, M. Hybrid Machine Learning-Based Fault-Tolerant Sensor Data Fusion and Anomaly Detection for Fire Risk Mitigation in IIoT Environment. Sensors 2025, 25, 2146.
  3. Zhou, K.; Jiang, S. Forest fire detection algorithm based on improved YOLOv11n. Sensors 2025, 25, 2989.
  4. Buriboev, A.S.; Abduvaitov, A.; Jeon, H.S. Integrating Color and Contour Analysis with Deep Learning for Robust Fire and Smoke Detection. Sensors 2025, 25, 2044.
  5. Zhang, Z.; Tan, L.; Robert, T.L.K. An improved fire and smoke detection method based on YOLOv8n for smart factories. Sensors 2024, 24, 4786.
  6. Abdessemed, F.; Bouam, S.; Arar, C. A Review on Forest Fire Detection and Monitoring Systems. In Proceedings of the 2023 International Conference on Electrical Engineering and Advanced Technology (ICEEAT), Batna, Algeria, 5–7 November 2023; Volume 1, pp. 1–7.
  7. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
  8. Redmon, J.; Divvala, S.K.; Girshick, R.B.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788.
  9. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229.
  10. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv 2021, arXiv:2010.04159.
  11. Ye, M.; Ke, L.; Li, S.; Tai, Y.W.; Tang, C.K.; Danelljan, M.; Yu, F. Cascade-DETR: Delving into High-Quality Universal Object Detection. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–3 October 2023; pp. 6681–6691.
  12. Wang, X.; Li, M.; Gao, M.; Liu, Q.; Li, Z.; Kou, L. Early smoke and flame detection based on transformer. J. Saf. Sci. Resil. 2023, 4, 294–304.
  13. Liang, T.; Zeng, G. FSH-DETR: An Efficient End-to-End Fire Smoke and Human Detection Based on a Deformable DEtection TRansformer (DETR). Sensors 2024, 24, 4077.
  14. Geng, X.; Su, Y.; Cao, X.; Li, H.; Liu, L. YOLOFM: An improved fire and smoke object detection algorithm based on YOLOv5n. Sci. Rep. 2024, 14, 4543.
  15. Liang, D.; Bui, T.; Wang, G. Fire and Smoke Detection Method Based on Improved YOLOv5s. In Proceedings of the 2023 8th International Conference on Communication, Image and Signal Processing (CCISP), Chengdu, China, 17–19 November 2023; pp. 293–300.
  16. Yar, H.; Khan, Z.A.; Ullah, F.U.M.; Ullah, W.; Baik, S.W. A modified YOLOv5 architecture for efficient fire detection in smart cities. Expert Syst. Appl. 2023, 231, 120465.
  17. Koshy, R.; Elango, S. Applying social media in emergency response: An attention-based bidirectional deep learning system for location reference recognition in disaster tweets. Appl. Intell. 2024, 54, 5768–5793.
  18. Lv, G.; Dong, L.; Xu, W. Hierarchical interactive multi-granularity co-attention embedding to improve the small infrared target detection. Appl. Intell. 2023, 53, 27998–28020.
  19. Wang, J.; Yu, J.; He, Z. DECA: A novel multi-scale efficient channel attention module for object detection in real-life fire images. Appl. Intell. 2022, 52, 1362–1375.
  20. Zeng, K.; Sun, X.; He, H.; Tang, H.; Shen, T.; Zhang, L. Fuzzy preference matroids rough sets for approximate guided representation in transformer. Expert Syst. Appl. 2024, 255, 124592.
  21. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.C.; Salakhutdinov, R.; Zemel, R.S.; Bengio, Y. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. arXiv 2015, arXiv:1502.03044.
  22. Fan, X.; Zhang, S.; Chen, B.; Zhou, M. Bayesian Attention Modules. arXiv 2020, arXiv:2010.10604.
  23. Zhang, S.; Fan, X.; Chen, B.; Zhou, M. Bayesian Attention Belief Networks. arXiv 2021, arXiv:2106.05251.
  24. Blei, D.M.; Kucukelbir, A.; McAuliffe, J.D. Variational inference: A review for statisticians. J. Am. Stat. Assoc. 2017, 112, 859–877.
  25. Ren, W.; Jin, Z. Phase space visibility graph. Chaos Solitons Fractals 2023, 176, 114170.
  26. Olafenwa, M. FireNET Dataset [Dataset]; GitHub: San Francisco, CA, USA, 2023.
  27. Pacini, M. Smoke Dataset [Dataset]; Roboflow Universe: Des Moines, IA, USA, 2022.
  28. Gaiasd. DFireDataset. 2023. Available online: https://github.com/gaiasd/DFireDataset (accessed on 12 July 2024).
  29. Wu, S.; Zhang, X.; Liu, R.; Li, B. A dataset for fire and smoke object detection. In Multimedia Tools and Applications; Springer: Berlin/Heidelberg, Germany, 2022; pp. 1–20.
  30. Ren, W.; Jin, N.; Wang, T. An Interdigital Conductance Sensor for Measuring Liquid Film Thickness in Inclined Gas–Liquid Two-Phase Flow. IEEE Trans. Instrum. Meas. 2024, 73, 9505809.
  31. Li, N.; Martin, A.; Estival, R. Heterogeneous information fusion: Combination of multiple supervised and unsupervised classification methods based on belief functions. Inf. Sci. 2021, 544, 238–265.
Figure 1. Dual-path architecture.
Figure 2. The proposed stochastic attention layer.
Figure 3. The proposed Dual-Path CSDETR.
Figure 4. Learning process comparison on the FireNet, Smoke, D-Fire, and DFS-Fire-Smoke datasets: (a) FireNet. (b) Smoke. (c) D-Fire. (d) DFS-Fire-Smoke.
Figure 5. Fire detection results on different datasets: (a) FireNet. (b) Smoke. (c) D-Fire. (d) DFS-Fire-Smoke.
Figure 6. Paired t-test p-values comparing Dual-Path CSDETR with baseline models.
Figure 7. Comparison of the dual-path architecture with the stochastic-downward-path-only architecture: (a) FireNet. (b) Smoke. (c) D-Fire. (d) DFS-Fire-Smoke.
Table 1. Overview of datasets.

| Dataset | Classes | Positive/Total Ratio |
|---|---|---|
| FireNet [26] | Fire | 210/502 |
| Smoke [27] | Smoke | 741/746 |
| D-Fire [28] | Fire & Smoke | 4658/21,000 |
| DFS-Fire-Smoke [29] | Fire & Smoke & Other | 6308/9462 |
Table 2. Performance comparison of the proposed method and other classical detectors. The best result per metric is in bold.

| Dataset | Evaluation | Fast R-CNN | YOLO-FM | Deformable DETR | Cascade DETR | Ours |
|---|---|---|---|---|---|---|
| FireNet | AP50 | 0.85 | 0.89 | 0.82 | 0.92 | **0.94** |
| | AP75 | 0.79 | 0.85 | 0.81 | 0.88 | **0.91** |
| | mAP@[0.5, 0.95] | 0.75 | 0.73 | 0.79 | 0.81 | **0.88** |
| Smoke | AP50 | **0.92** | 0.84 | 0.89 | 0.90 | 0.91 |
| | AP75 | 0.81 | 0.79 | 0.85 | 0.87 | **0.89** |
| | mAP@[0.5, 0.95] | 0.83 | 0.82 | 0.79 | 0.84 | **0.86** |
| D-Fire | AP50 | 0.88 | 0.78 | 0.91 | 0.89 | **0.94** |
| | AP75 | 0.76 | 0.73 | 0.79 | **0.90** | 0.88 |
| | mAP@[0.5, 0.95] | 0.74 | 0.71 | 0.81 | 0.82 | **0.85** |
| DFS-Fire-Smoke | AP50 | 0.82 | 0.84 | 0.87 | 0.86 | **0.92** |
| | AP75 | 0.73 | 0.79 | 0.75 | 0.81 | **0.84** |
| | mAP@[0.5, 0.95] | 0.71 | 0.73 | 0.75 | 0.80 | **0.84** |
Table 3. Performance comparison of varying decoder layer numbers without bounding box information. Columns give the number of decoder layers using Bayesian attention.

| Dataset | Evaluation | 1 Layer | 2 Layers | 3 Layers | 4 Layers | 5 Layers | 6 Layers |
|---|---|---|---|---|---|---|---|
| FireNet | AP50 | 0.38 | 0.39 | 0.42 | 0.44 | 0.47 | 0.50 |
| | AP75 | 0.31 | 0.33 | 0.36 | 0.38 | 0.41 | 0.44 |
| | mAP@[0.5, 0.95] | 0.29 | 0.35 | 0.41 | 0.42 | 0.45 | 0.46 |
| Smoke | AP50 | 0.36 | 0.38 | 0.40 | 0.43 | 0.46 | 0.49 |
| | AP75 | 0.30 | 0.32 | 0.35 | 0.37 | 0.40 | 0.43 |
| | mAP@[0.5, 0.95] | 0.29 | 0.31 | 0.33 | 0.39 | 0.38 | 0.42 |
| D-Fire | AP50 | 0.30 | 0.32 | 0.35 | 0.38 | 0.41 | 0.44 |
| | AP75 | 0.24 | 0.26 | 0.29 | 0.32 | 0.35 | 0.38 |
| | mAP@[0.5, 0.95] | 0.26 | 0.25 | 0.28 | 0.30 | 0.33 | 0.36 |
| DFS-Fire-Smoke | AP50 | 0.25 | 0.27 | 0.30 | 0.33 | 0.36 | 0.39 |
| | AP75 | 0.19 | 0.21 | 0.24 | 0.27 | 0.30 | 0.33 |
| | mAP@[0.5, 0.95] | 0.21 | 0.23 | 0.22 | 0.31 | 0.33 | 0.35 |
Table 4. Performance comparison of Dual-Path CSDETR with varying decoder layer numbers. Columns give the number of decoder layers using the SSA mechanism.

| Dataset | Evaluation | 1 Layer | 2 Layers | 3 Layers | 4 Layers | 5 Layers | 6 Layers |
|---|---|---|---|---|---|---|---|
| FireNet | AP50 | 0.60 | 0.59 | 0.68 | 0.74 | 0.85 | 0.92 |
| | AP75 | 0.53 | 0.53 | 0.51 | 0.55 | 0.67 | 0.88 |
| | mAP@[0.5, 0.95] | 0.57 | 0.55 | 0.62 | 0.61 | 0.77 | 0.91 |
| Smoke | AP50 | 0.63 | 0.59 | 0.68 | 0.69 | 0.72 | 0.89 |
| | AP75 | 0.59 | 0.43 | 0.55 | 0.54 | 0.61 | 0.84 |
| | mAP@[0.5, 0.95] | 0.57 | 0.56 | 0.57 | 0.65 | 0.63 | 0.86 |
| D-Fire | AP50 | 0.56 | 0.55 | 0.57 | 0.65 | 0.82 | 0.92 |
| | AP75 | 0.48 | 0.44 | 0.43 | 0.48 | 0.66 | 0.82 |
| | mAP@[0.5, 0.95] | 0.50 | 0.52 | 0.49 | 0.50 | 0.77 | 0.89 |
| DFS-Fire-Smoke | AP50 | 0.51 | 0.59 | 0.65 | 0.66 | 0.86 | 0.89 |
| | AP75 | 0.41 | 0.45 | 0.44 | 0.49 | 0.62 | 0.73 |
| | mAP@[0.5, 0.95] | 0.45 | 0.52 | 0.60 | 0.52 | 0.80 | 0.84 |

