Article

FasterGDSF-DETR: A Faster End-to-End Real-Time Fire Detection Model via the Gather-and-Distribute Mechanism

School of Cyber Science and Engineering, Zhengzhou University, Jinshui District, Zhengzhou 450001, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(7), 1472; https://doi.org/10.3390/electronics14071472
Submission received: 2 March 2025 / Revised: 2 April 2025 / Accepted: 4 April 2025 / Published: 6 April 2025
(This article belongs to the Special Issue Deep Learning-Based Object Detection/Classification)

Abstract

Fire detection using deep learning has become a widely adopted approach. However, YOLO-based models often face performance limitations due to NMS, while DETR-based models struggle to meet real-time processing requirements. To address these challenges, we propose FasterGDSF-DETR, a novel fire detection model built upon the RT-DETR framework, designed to enhance both detection accuracy and efficiency. Firstly, this model introduces the FasterDBBNet backbone, which efficiently captures and retains feature information, accelerating the model’s convergence speed. Secondly, we propose the AIFI-GDSF hybrid encoder to reduce information loss in intra-scale interactions and improve the capability of detecting varying morphological flames. Furthermore, to better adapt to complex fire scenarios, we expand the dataset based on the KMU Fire and Smoke database and incorporate WIoU as the loss function to improve model robustness. Experimental results demonstrate that our proposed model surpasses mainstream object detection models in both accuracy and computational efficiency. FasterGDSF-DETR achieves a mean Average Precision of 71.5% on the self-constructed dataset, outperforming the YOLOv9 model of the same scale by 2.4 percentage points. This study introduces a novel task-specific enhancement to the RT-DETR framework, offering valuable insights for future advancements in fire detection technology.

1. Introduction

Fire poses a significant threat to both human lives and infrastructure, often igniting unexpectedly and spreading rapidly with high intensity. In the early phases of fire detection technology, detection methodologies predominantly depended on optical imaging systems (including telescopes, infrared sensors, and hyperspectral imaging devices) as well as physical sensing units such as smoke alarms and thermal detectors. For instance, infrared-based image processing techniques demonstrate high efficiency in recognizing nearby fire sources; however, their reliability decreases when detecting flames at a distance or under adverse environmental conditions where factors like dense fog obscure thermal signals and distort detection accuracy [1]. Detection technologies that combined visual systems with machine learning algorithms also attracted considerable attention in this early period. In one case, fire was detected by assessing frame-to-frame changes in the characteristics of candidate fire regions, which were then fed into a Bayesian classification model. Although these methods utilized many physical and mathematical approaches to extract features, their feature representation capability remained limited by manually designed feature extractors. Moreover, these early detection methods had poor resistance to interference from external factors such as complex scene changes and varying lighting conditions, resulting in low robustness and weak generalization capabilities.
It is evident that the inherent limitations of traditional physical methods and machine learning techniques have persisted, and this has resulted in deep learning algorithms being applied in fire detection scenarios [2]. In addressing the long-standing issues of real-time performance, interference from complex scenes, and detection accuracy, the rapid development of deep learning is bringing novel solutions for overcoming these difficulties in fire detection. Researchers commonly leverage the outputs obtained from traditional sensors, coupled with deep learning methodologies, to more effectively address practical issues. For instance, Ning Li et al. [3] have contributed to the field of multifunctional optoelectronics by utilizing upconversion detectors and visualization techniques. Hongguang Wei et al. [4] have employed an innovative STFF-GA network to tackle challenges in the domain of remote sensing change detection, which serves as an exemplar of the application of remote sensing technology in conjunction with deep learning. Xiubao Sui et al. [5] have achieved remarkable detection accuracy by integrating deep learning techniques with infrared focal plane array (IRFPA) sensors.
There are two main types of traditional deep learning fire detection approaches: convolutional neural network-based methods [6] and Transformer-based methods [7]. At present, CNN-based algorithms play a dominant role in fire detection and are broadly divided into one-stage and two-stage approaches. Two-stage models, such as Faster R-CNN, first generate region proposals from the input image and subsequently refine these regions through object classification and bounding box regression [8]. Differing from approaches that rely on region proposals, one-stage methodologies such as YOLO [9] directly predict class labels and coordinate results from the input context in a single pass, bypassing a separate candidate generation phase.
With the application of Transformer methods in computer vision, the DETR [10] model emerged, achieving results on the COCO dataset that are comparable to YOLO and Faster R-CNN. DETR provides a complete end-to-end path for object detection, capable of directly predicting both the spatial coordinates of bounding boxes and the corresponding object classes using only raw images as input. It eliminates the need to manually create candidate boxes and removes the requirement for complex components such as NMS. Due to DETR’s inherent issues, many researchers have been dedicated to optimizing it, resulting in the development of variants such as Deformable DETR [11], DINO-DETR [12], and DN-DETR [13]. While these modified DETR versions offer advancements in particular areas, a performance gap persists concerning the accuracy and speed levels necessary for effective fire detection systems.
Recently, real-time object detectors based on Transformer methods have gradually outperformed YOLO, such as RT-DETR [14], which was introduced to further optimize real-time performance based on DETR. In terms of application, Zhu et al. [15] made improvements in UAV target detection and introduced the GCD-DETR model. Ge et al. [16] proposed the HPRT-DETR model, making contributions to industrial defect detection. Cheng et al. [17] developed the GSG-Adapter and LFO-Adapter, which allow the faster and more accurate identification of power line insulator defects. Nevertheless, there have been few developments applying RT-DETR to fire detection. Mainstream YOLO and DETR variants are predominantly optimized for structured, fixed-pattern tasks, as exemplified by COCO detection challenges [18]. In contrast, fire detection scenarios are often more chaotic and complex, with dynamic and uncontrollable scene variations. This demands that fire detection algorithms possess stronger interference resistance. Additionally, although RT-DETR offers excellent prediction accuracy, its computational cost and inference speed restrict its use on resource-constrained detection platforms.
To tackle the aforementioned challenges and bridge the gap in RT-DETR’s application for fire detection, this paper presents a new end-to-end model for multi-scenarios. The primary contributions of this paper can be outlined as follows:
  • To our knowledge, FasterGDSF-DETR is the first real-time detection model based on the RT-DETR that is designed for fire detection scenarios. It enables rapid and precise detection in complex fire environments, offering outstanding performance in terms of detection accuracy, model size, and computational efficiency.
  • To enhance detection accuracy while keeping the model lightweight, we developed a novel backbone network, FasterDBBNet. Moreover, comparative experiments and visualizations were conducted on it with the objective of enhancing the interpretability of FasterDBBNet.
  • The AIFI intra-scale interaction mechanism was retained and the fusion architecture was redesigned when introducing the AIFI-GDSF hybrid encoder. By incorporating the enhanced Gather-and-Distribute mechanism [19] and the SSFF [20] module, we further strengthened the model’s intra-scale and cross-scale feature interaction capabilities, preventing information loss during feature fusion.
  • Based on an open source fire dataset, we manually collected and annotated fire images, creating a more comprehensive dataset that covers a wide range of fire scenarios. Furthermore, WIoU [21] was introduced as a more appropriate loss function, given the characteristics of the dataset.

2. Related Work

The preceding section presented an initial survey regarding the advancements in applying object detection algorithms based on CNNs and the DETR architecture. These works are quite remarkable and significant, and it is hoped that they can provide new insights for researchers. Subsequently, we will delve into more traditional approaches, enumerating and analyzing the practical application cases of these algorithms.
In the field of fire detection, the majority of solutions are designed based on traditional object detection methods. However, there also exist other advanced deep learning approaches for accomplishing the detection task. For instance, Ashutosh Sharma et al. [22] introduced a joint learning approach that integrates RNNs and CNNs to process multi-modal fire data, aiming to improve both detection accuracy and data privacy. Pascal Vorwerk et al. [23] investigated the application of feature representation transfer and power transfer techniques to the problem of early fire detection, aiming to enable the transfer learning of small-scale training data to the target space for early fire conditions. It is important to acknowledge the role of these methodologies in enabling advancements in the domain of fire detection through deep learning.
We now turn to traditional object detection algorithms. The application of two-stage algorithms in fire detection is relatively mature. Barmpoutis et al. [24] augmented the Faster R-CNN framework with multi-scale features computed via higher-order linear dynamic methods, achieving considerable gains in fire detection performance and simultaneously decreasing false alarm occurrences. Chaoxia et al. [25] employed a color-guided anchor box method and connected the GIN in parallel with Faster R-CNN, achieving promising performance. Pan J et al. [26] combined weakly supervised fine segmentation with a lightweight Faster R-CNN to propose a novel framework called FireDGWF, which uses fuzzy logic to evaluate fire and smoke severity levels. However, two-stage methods exhibit poor accuracy in small object detection. Furthermore, two-stage anchor-based methods encounter issues with localizing objects of different shapes and maintaining fast response times for fire detection, a process that demands high real-time performance and accuracy [27].
In comparison to two-stage algorithms, one-stage algorithms provide faster training and deployment speeds along with lower hardware requirements, making them the preferred choice for most researchers. Zheng et al. [28] offered a real-time fire detection algorithm, aimed at addressing the slow runtime and low recognition accuracy of fire detection algorithms on embedded devices. Zheng et al. [29] presented a lightweight forest fire detection algorithm that integrates FireYOLO with Real-ESRGAN, where Real-ESRGAN enhances local flame features, followed by FireYOLO for accurate flame identification. Ren et al. [30] developed the FCLGYOLO model by incorporating the FICC structure and LGGM module, which greatly enhances feature discriminability. This approach specifically aims to tackle the difficulties of smoke occlusion and low resolution object detection in UAV-based forest fire detection.
One-stage object detection algorithms stand out among many others due to their low computational cost and fast inference. They can directly perform object detection tasks on images, addressing real-time concerns to a certain extent. However, most one-stage algorithms compromise on detection accuracy, particularly when dealing with small objects or complex environmental conditions. Conversely, detectors employing a two-stage methodology generally achieve higher and more reliable accuracy. Their sequential process, involving distinct phases for region proposal and classification, facilitates a more thorough analysis of object–context interactions. This characteristic renders them suitable for demanding fire detection scenarios where precision is paramount, even at the expense of inference speed. To overcome the limitations and achieve better real-time performance and detection accuracy, many efforts have been focused on developing DETR-based algorithms. The DETR series employs a self-attention model to establish the relationship between global image information and detection targets. Through continuous improvements, these models have progressively reduced training costs and enhanced detection performance. Zheng et al. [18] used diffusion models (DDPM) and image super-resolution algorithms (SR3) to enhance dataset quality, while proposing the FTA-DETR framework to provide robust support for fire alarm detection and fire prevention. Li et al. [31] introduced a normalized attention module and restructured the encoder–decoder layers, making DETR lighter and reducing the requirements for deployment on application devices. Liang et al. [32] presented a new end-to-end model called FSH-DETR, built on the Deformable DETR. It replaces the ResNet backbone with the ConvNeXt network and incorporates the SSFI and CCFM modules to strengthen multi-scale feature fusion in the encoder.
Although previous research endeavors have enhanced the efficiency of the DETR series algorithms, the potential of the DETR architecture has not been fully realized. Previous methods have solved many problems in different areas, but they also face challenges, such as convergence difficulties with DETR models and poor performance in complex scenes with considerable interference. More critically, optimizing for low deployment costs and high inference speed remains essential for real-world applications. The adoption of the real-time-efficient RT-DETR model presents new opportunities for enhancing fire detection performance.

3. Method

Driven by the objectives of lowering the costs associated with training and deploying fire detection systems and enhancing their performance, we introduce FasterGDSF-DETR, a model designed for real-time fire identification. This architecture is developed as an advancement over the RT-DETR baseline, retaining a comparable overall network structure. As shown in Figure 1, FasterGDSF-DETR employs our proposed FasterDBBNet backbone, with the neck formed by the new AIFI-GDSF hybrid encoder. We also replaced the default loss function with WIoU. These mechanisms allow the model to accurately extract contextual information related to the object position and category after capturing semantic information at various scales. In the following subsections, we explain the above improvements.

3.1. Backbone

3.1.1. DBBCSPELAN

RT-DETR employs ResNet [33] as the backbone for image feature extraction. Characterized by its low computational demands and parameter efficiency, ResNet utilizes a structure of stacked residual units with 3 × 3 convolutions. This design aims to capture intricate spatial details, semantic information, and the features of small objects. However, the inherent limitations of its feature extraction mechanism can potentially hinder the overall effectiveness of the model. ResNet uses a residual structure to mitigate gradient explosion or vanishing, but its simple connection stacking is suboptimal when it comes to multi-scale feature detection and cross-layer feature merging, which directly affects the feature fusion process in later encoder stages. To address the aforementioned issues, we construct the DBBCSPELAN module by enhancing the RepNCSPELAN4 structure with the DBB module, as described below.
The core concept behind the YOLO series models for real-time identification is to cast detection as a regression problem, using spatially separated bounding boxes with corresponding class distributions. We drew inspiration from the YOLOv9 [34] backbone, which introduced a new architecture called the GELAN network. This architecture combines the advantages of CSPNet [35] and ELAN [36]. CSPNet offers advantages in reducing parameter count and computational complexity. ELAN, the efficient layer aggregation network, provides a structured layer aggregation method that enhances GELAN's internal feature representation and gradient flow. Moreover, GELAN allows users to customize its architecture and replace network modules according to specific task requirements. This structure is insensitive to depth changes within a certain threshold, offering flexibility to users without compromising the stability of detection accuracy. Building on GELAN, replacing the internal modules with multiple RepNCSP blocks creates the RepNCSPELAN4 module.
RepNCSPELAN4 is a lightweight and efficient layer aggregation structure based on gradient path planning. It integrates various advanced network design concepts, such as CSPNet's cross-stage partial connections, as well as the GELAN and ELAN efficient layer aggregation networks, improving model performance from multiple aspects. RepNCSPELAN4 splits the input into two parts, converting the complex multi-branch convolution structure into an efficient one-path structure. This transformation preserves the high expressive capability during training while significantly reducing computational costs during inference. In practical applications, this means the model can process image data faster, enabling timely object detection. The feature input is split into two sections in CSPNet, one of which is handled by stacked Conv layers, while the other part is linked to subsequent layers, maintaining gradient flow through feature fusion at different stages. By employing cross-stage partial connections, CSPNet effectively reduces computational and memory costs. ELAN further strengthens RepNCSPELAN4's feature representation and gradient flow by stacking and aggregating feature maps from disparate outputs. It effectively enhances the capacity to understand and process complex visual information. Additionally, ELAN's design helps alleviate the common gradient vanishing issue in deep networks, increasing the model's training efficiency and stability.
Building on the RepNCSPELAN4 module, we further replaced RepConvN with the Diverse Branch Block (DBB) [37], as shown in Figure 2. Utilizing a multi-branch structure that integrates pathways of differing scales and computational complexities, DBB augments the feature learning capacity of a single convolution, thereby increasing the heterogeneity of the learned feature space. The convolution process and its corresponding output channels can be formulated as shown below, with ⊛ signifying the convolution operator:
$O = I \circledast F + \mathrm{REP}(b),$
Here, $O$, $I$, and $F$ represent the output, the input, and the effective convolution kernel after reparameterization, respectively. $\mathrm{REP}(b)$ denotes the bias term $b$, appropriately shaped to be added element-wise to the convolution result.
$O_{j,h,w} = \sum_{c=1}^{C} \sum_{u=1}^{K} \sum_{v=1}^{K} F_{j,c,u,v}\, X^{(c,h,w)}_{u,v} + b_j,$
Specifically, each element $O_{j,h,w}$ in the output feature map is computed by summing the element-wise product of the kernel weights $F_{j,c,u,v}$ and the corresponding input patch $X^{(c,h,w)}_{u,v}$ over all input channels $C$ and kernel spatial dimensions, followed by the addition of the bias term $b_j$ specific to the output channel $j$.
The homogeneity and additivity can be derived from the following equation:
$I \circledast (pF) = p\,(I \circledast F), \quad \forall p \in \mathbb{R},$
$I \circledast F^{(1)} + I \circledast F^{(2)} = I \circledast \left(F^{(1)} + F^{(2)}\right),$
where the convolution has $C$ input channels, $D$ output channels, and a kernel size of $K \times K$; $X^{(c,h,w)} \in \mathbb{R}^{K \times K}$ denotes an input patch; the convolution kernel is represented as $F \in \mathbb{R}^{D \times C \times K \times K}$, the bias parameter as $b \in \mathbb{R}^{D}$, its broadcast form as $\mathrm{REP}(b) \in \mathbb{R}^{D \times H \times W}$, and the input feature map as $I \in \mathbb{R}^{C \times H \times W}$.
These equations demonstrate the homogeneity and additivity properties inherent to the convolution operation with respect to the kernel $F$. Homogeneity implies that scaling the kernel by a factor $p$ scales the output by the same factor. Additivity implies that the convolution operation distributes over the addition of kernels. These linear properties are fundamental for the structural reparameterization technique employed in DBB, enabling the fusion of multiple parallel branches into a single, equivalent kernel $(F^{(1)} + F^{(2)})$ during the inference phase.
The DBB module features six transformation forms, all based on homogeneity and additivity, including Conv-BN, Branch Addition, Conv Sequential, Depth Concatenation, Average Pooling, and Multi-scale Conv. This multi-branch structural design significantly improves the feature extraction capability of DBB. Specifically, DBB introduces sequential convolution, forming a hierarchical feature extraction process, where each layer of convolution aims to extract increasingly complex and abstract features from the input data. Additionally, DBB uses multi-scale convolutional kernels. This strategy helps the model capture fine details in images and ensures an understanding of the global structure. During the training phase, the reparameterization strategy used in DBB allows it to maintain the flexibility of the multi-branch structure, making full use of the feature diversity provided by each branch. During inference, this complex structure can be optimized into an equivalent single convolution layer using specific transformation techniques, thus avoiding any additional inference costs. We replaced the RepConvN block in the RepNCSP block with the Diverse Branch Block, forming the DBBNCSP module and the DBBCSPELAN structure. The structural configuration of the DBBCSPELAN module is delineated in Figure 3.
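As a concrete illustration of this reparameterization, the following PyTorch sketch (our own minimal example under simplifying assumptions, not the authors' implementation) fuses a 3 × 3 Conv-BN branch and a parallel 1 × 1 Conv-BN branch into a single equivalent 3 × 3 convolution using the homogeneity and additivity properties above, and verifies that the multi-branch and fused forms produce the same output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d):
    """Fold BatchNorm into the preceding convolution (homogeneity: scaling the kernel scales the output)."""
    std = torch.sqrt(bn.running_var + bn.eps)
    scale = bn.weight / std                                  # per-channel scale factor
    kernel = conv.weight * scale.reshape(-1, 1, 1, 1)        # scale each output filter
    bias = bn.bias - bn.running_mean * scale
    if conv.bias is not None:
        bias = bias + conv.bias * scale
    return kernel, bias

@torch.no_grad()
def fuse_parallel_branches(conv3x3, bn3x3, conv1x1, bn1x1):
    """Additivity: the sum of two parallel convolutions equals one convolution with summed kernels."""
    k3, b3 = fuse_conv_bn(conv3x3, bn3x3)
    k1, b1 = fuse_conv_bn(conv1x1, bn1x1)
    k1 = F.pad(k1, [1, 1, 1, 1])                             # embed the 1x1 kernel at the center of a 3x3 kernel
    return k3 + k1, b3 + b1

# quick equivalence check on random weights and statistics
c_in, c_out = 8, 16
conv3, bn3 = nn.Conv2d(c_in, c_out, 3, padding=1, bias=False), nn.BatchNorm2d(c_out)
conv1, bn1 = nn.Conv2d(c_in, c_out, 1, bias=False), nn.BatchNorm2d(c_out)
for bn in (bn3, bn1):                                        # give BN non-trivial running statistics
    bn.running_mean.uniform_(-1, 1); bn.running_var.uniform_(0.5, 2)
conv3.eval(); bn3.eval(); conv1.eval(); bn1.eval()

x = torch.randn(2, c_in, 32, 32)
y_multi = bn3(conv3(x)) + bn1(conv1(x))                      # multi-branch (training-time) form
k, b = fuse_parallel_branches(conv3, bn3, conv1, bn1)
y_single = F.conv2d(x, k, b, padding=1)                      # single fused convolution (inference-time) form
print(torch.allclose(y_multi, y_single, atol=1e-5))          # True: the two forms are equivalent
```

The same two properties underpin the fusion of the remaining DBB branch types (sequential convolutions, average pooling, and concatenation) into one kernel at inference time.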

3.1.2. PAD

The accelerated progression of diverse computer vision tasks is driven by advances in artificial neural networks. However, the need to deploy models on end devices and the limitations of computational power have shifted network design towards prioritizing low latency and high throughput. While minimizing FLOPs has been a common objective in previous network designs, this strategy alone may not lead to more compact models. Recognizing this, Partial Convolution (PConv) [38] was developed; it targets inefficiencies beyond FLOPs, specifically aiming to reduce redundant computation and memory access costs while boosting spatial feature learning. The PConv structure is shown in Figure 4.
Partial convolution introduces a distinct mechanism for spatial feature learning. The model performs standard Conv in a limited number of input channels, selectively bypassing others. This method, which assumes the convolved subset captures representative features, differs fundamentally from grouped or depthwise separable convolutions, whose main goal is to diminish parameters and FLOPs through filter redundancy. To avoid degradation into standard convolution, the remaining unfiltered channels are retained, allowing feature information to flow through all channels.
In the domain of deep learning models, a prevalent technique is the application of downsampling, a process which involves the reduction of spatial dimensions in feature maps, thereby reducing computational load. The ADown module is an efficient downsampling module with minimal impact on model performance. The operation processes the input tensor through two parallel paths. One path employs average pooling, while the other utilizes max pooling. Conv operations are subsequently applied independently to the outputs of each pooling layer. The final result is obtained by concatenating the feature maps from these two paths. We combine PConv and ADown to form the PAD module, which serves to further reduce the model’s complexity. Positioned subsequent to the RepNCSPELAN4 module, the PAD module facilitates the capture of both detailed global and local features while maintaining parameter efficiency. A key characteristic of the PAD module is its use of partial convolution, which, unlike standard convolutions, adaptively defines the convolutional kernel’s operational scope based on the validity of the input data. This mechanism proves particularly effective for handling incomplete information, enabling the module to derive meaningful features even from partially obscured views common in fire detection scenarios. Consequently, this contributes to enhanced model accuracy and robustness, especially within specialized contexts or on resource-limited platforms.
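As a rough illustration of how these two ideas compose, the sketch below implements a partial convolution layer and a dual-path downsampling step per the description above, and chains them into a PAD-style block. The class names (PartialConv, DualPathDown, PADBlock), the channel ratio, and the kernel sizes are our own assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv(nn.Module):
    """Apply a 3x3 convolution to only the first `ratio` fraction of channels;
    the remaining channels pass through untouched so their information keeps flowing."""
    def __init__(self, channels: int, ratio: float = 0.25):
        super().__init__()
        self.conv_ch = max(1, int(channels * ratio))
        self.conv = nn.Conv2d(self.conv_ch, self.conv_ch, 3, padding=1, bias=False)

    def forward(self, x):
        x1, x2 = x[:, :self.conv_ch], x[:, self.conv_ch:]
        return torch.cat((self.conv(x1), x2), dim=1)

class DualPathDown(nn.Module):
    """Downsampling as described in the text: average pooling and max pooling in parallel,
    a convolution on each path (kernel sizes assumed), then concatenation."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        half = c_out // 2
        self.conv_avg = nn.Conv2d(c_in, half, 1, bias=False)
        self.conv_max = nn.Conv2d(c_in, half, 1, bias=False)

    def forward(self, x):
        a = F.avg_pool2d(x, kernel_size=2, stride=2)
        m = F.max_pool2d(x, kernel_size=2, stride=2)
        return torch.cat((self.conv_avg(a), self.conv_max(m)), dim=1)

class PADBlock(nn.Module):
    """PConv for lightweight spatial feature learning followed by dual-path downsampling."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.pconv = PartialConv(c_in)
        self.down = DualPathDown(c_in, c_out)

    def forward(self, x):
        return self.down(self.pconv(x))

x = torch.randn(1, 64, 80, 80)
print(PADBlock(64, 128)(x).shape)   # torch.Size([1, 128, 40, 40])
```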

3.2. AIFI-GDSF Hybrid Encoder

Handling diverse object sizes in detection relies on leveraging features from various network depths, encapsulating scale-specific positional information. The neck module, typically implementing Feature Pyramid Networks (FPNs) [39] or related designs, is responsible for fusing these multi-level representations. FPNs’ goal was to improve multi-scale detection by integrating information across levels. Yet, standard FPN implementations often struggle with efficient long-range feature propagation due to complex connection patterns, creating a bottleneck where fusion is most effective only between neighboring feature layers. This limitation hinders comprehensive cross-level information exchange without degradation, impacting overall fusion quality and processing speed. Even within advanced models like RT-DETR, which employs a hybrid neck encoder (AIFI for intra-scale, CCFM for cross-scale), the CCFM module’s underlying FPN-inspired structure for feature fusion presents opportunities for enhancement.
Drawing inspiration from the efficient hybrid encoder design of RT-DETR and Gold-YOLO’s Gather-and-Distribute (GD) mechanism [19], we initiated an encoder architectural reorganization. This involved linking the GD mechanism with the AIFI module from RT-DETR. Subsequently, building upon the GD framework, we utilized the DBB module to enhance both the Information Fusion Module (IFM) and the Information Injection Module. This process resulted in the creation of a novel, efficient, and low-loss hybrid encoder, termed AIFI-GD. To further refine feature fusion across scales, we incorporated the Scale Sequence Feature Fusion module [20], which employs techniques like 3D convolution to aggregate global semantic information from features at different resolutions. The culmination of these enhancements is our proposed AIFI-GDSF, whose structure is depicted in Figure 1 and will be elaborated upon in the subsequent sections.

3.2.1. Gather-and-Distribute Branch

To mitigate the information degradation often encountered during feature propagation in conventional FPN architectures, we redesigned the encoder structure by incorporating the Gather-and-Distribute (GD) mechanism. The fundamental principle is to centralize the collection and integration of information from all hierarchical levels within a unified module. Subsequently, this fused information is disseminated back across different levels, enhancing the model’s ability to effectively represent objects at various scales. The GD mechanism operates via two parallel pathways, as follows: a low-stage gather-and-distribute branch employing convolution-based blocks and a high-stage counterpart utilizing attention-based blocks for feature extraction and fusion. Each branch integrates both a Feature Alignment Module (FAM) and an Information Fusion Module (IFM). Furthermore, a computationally efficient injection module is introduced to facilitate local feature transfer, specifically between adjacent levels.
The expense of self-attention, when viewed in conjunction with the tangible performance advantages, suggests that this procedure should be applied to high-level features that carry richer semantic concepts. The purpose is to capture interconnections between object entities within the image. Therefore, the encoder applies intra-scale interactions using the self-attention mechanism only on the $S_5$ feature layer, and it fuses the post-interaction $S_6$ with $S_5$ in adjacent layers to minimize the loss of feature detail. The feature maps $S_2$, $S_3$, $S_4$, $S_5$, and $S_6$ are obtained after processing by the backbone network, where $S_i \in \mathbb{R}^{N \times C_{S_i} \times R_{S_i}}$, $N$ represents the batch size, and the feature size is denoted as $R = H \times W$. The structure is shown in Figure 5.
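For clarity, the following sketch shows the kind of intra-scale interaction AIFI performs on the deepest feature map: $S_5$ is flattened into a token sequence, passed through a standard self-attention encoder layer, and reshaped back. Positional encodings are omitted for brevity, and the class name AIFISketch and all hyperparameters are our own illustrative choices rather than the RT-DETR configuration.

```python
import torch
import torch.nn as nn

class AIFISketch(nn.Module):
    """Intra-scale interaction on the deepest feature map only: flatten S5 into a token
    sequence, run one self-attention encoder layer, and reshape back to a 2D map."""
    def __init__(self, channels: int = 256, heads: int = 8, ffn_dim: int = 1024):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=heads, dim_feedforward=ffn_dim, batch_first=True)

    def forward(self, s5: torch.Tensor) -> torch.Tensor:
        n, c, h, w = s5.shape
        tokens = s5.flatten(2).permute(0, 2, 1)      # (N, H*W, C): each spatial position becomes a token
        tokens = self.layer(tokens)                  # global interactions between high-level concepts
        return tokens.permute(0, 2, 1).reshape(n, c, h, w)

s5 = torch.randn(2, 256, 20, 20)                     # deepest backbone feature map
s6 = AIFISketch()(s5)                                # post-interaction map, fused with S5 downstream
print(s6.shape)                                      # torch.Size([2, 256, 20, 20])
```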
The lower-level gather-and-distribute branch amalgamates feature maps from stages $S_2$ to $S_5$, deriving high-resolution feature sets that encapsulate details critical for small object identification. Inside the Low-FAM, average pooling is employed to downsample input feature maps, ensuring uniform spatial dimensions. A key design choice is selecting the $S_4$ feature level as the target for this resolution harmonization, mediating between the preservation of granular information and the neck's computational budget to balance speed and accuracy. The Low-IFM subsequently accepts these resolution-matched features from the FAM and applies multiple Diverse Branch Block (DBB) layers for enhanced representation learning. The tensors output by the DBB processing are then partitioned along the channel axis into two groups. These maps are considered as global information and are injected into different layers of features, providing richer information to feature maps at various scales for effective interaction and fusion [15]. The formula is as follows:
$F_{\text{align}} = \mathrm{Low\_FAM}\left([S_2, S_3, S_4, S_5]\right),$
$F_{\text{fuse}} = \mathrm{DiverseBranchBlock}\left(F_{\text{align}}\right),$
$F_{I\_P3},\; F_{I\_P4} = \mathrm{Split}\left(F_{\text{fuse}}\right),$
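Before turning to the high-stage branch, the sketch below illustrates the gather step of the low-stage branch under simplifying assumptions: plain Conv-BN-SiLU layers stand in for the DBB layers, and the names LowGatherSketch, f_p3, and f_p4, along with all channel widths, are our own. The high-stage branch described next follows the same align, fuse, and split pattern with attention-based blocks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowGatherSketch(nn.Module):
    """Gather step of the low-stage branch: align S2-S5 to the S4 resolution (average
    pooling for larger maps, interpolation for smaller ones), fuse the stacked features,
    and split the result into per-level global descriptors."""
    def __init__(self, in_channels: list, fused: int = 256, splits=(128, 128)):
        super().__init__()
        self.fuse = nn.Sequential(                               # stand-in for the Low-IFM DBB layers
            nn.Conv2d(sum(in_channels), fused, 3, padding=1, bias=False),
            nn.BatchNorm2d(fused), nn.SiLU(),
            nn.Conv2d(fused, sum(splits), 1, bias=False))
        self.splits = splits

    def forward(self, feats, target_idx: int = 2):
        h, w = feats[target_idx].shape[2:]                       # S4 is the alignment target
        aligned = [F.adaptive_avg_pool2d(f, (h, w)) if f.shape[2] >= h
                   else F.interpolate(f, size=(h, w), mode="bilinear", align_corners=False)
                   for f in feats]
        fused = self.fuse(torch.cat(aligned, dim=1))
        return torch.split(fused, self.splits, dim=1)            # global info to inject at P3 / P4

s2, s3, s4, s5 = (torch.randn(1, c, s, s) for c, s in
                  [(64, 160), (128, 80), (256, 40), (512, 20)])
f_p3, f_p4 = LowGatherSketch([64, 128, 256, 512])((s2, s3, s4, s5))
print(f_p3.shape, f_p4.shape)   # torch.Size([1, 128, 40, 40]) for both
```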
On the other branch, the specially processed P6 feature map is small in size and rich in semantic information. Therefore, we chose P6 as the target for feature alignment. Subsequently, the matched features are fed into the High-IFM, which consists of multiple Transformer modules. These modules function collectively to enable the model to track remote dependencies at higher feature levels. Additionally, the High-IFM leverages a custom FFN block to maintain equilibrium between processing throughput and computational demands. It further incorporates a depthwise convolution layer flanked by two 1 × 1 convolutional layers. Feature maps originating from the Transformer are segmented channel-wise and subsequently amalgamated with lateral features from the identical processing stage. This strategic fusion ensures that information from different representational levels is effectively synthesized, ultimately strengthening the model's performance on objects of varying dimensions. The formula is as follows:
$F_{\text{align}} = \mathrm{High\_FAM}\left([P_3, P_4, P_6]\right),$
$F_{\text{fuse}} = \mathrm{Transformer}\left(F_{\text{align}}\right),$
$F_{I\_N4},\; F_{I\_N6} = \mathrm{Split}\left(\mathrm{Conv}_{1 \times 1}\left(F_{\text{fuse}}\right)\right),$
Utilizing average pooling, the FAM first reduces input feature dimensionality to a consistent size, a process crucial for aligning multi-level feature representations. Beyond alignment, this module enables computationally lean information aggregation, benefiting the efficiency of the downstream Transformer. Separately, the IFM aims to preserve information integrity, which bolsters the detection performance for objects of disparate sizes while maintaining low latency. Its mechanism involves fusing hierarchical features to capture global cues, followed by redistributing this contextual information across various feature scales to facilitate robust interaction and fusion.
As shown in Figure 6, in order to efficiently collect and utilize global information and integrate it at multiple scales, we used the Information Injection Module to fuse the information. The injection module receives information from the IFM and the current-level feature layer, denoted as $F_{\text{inj}}$ and $F_{\text{local}}$, respectively. Different convolution operations are then applied to the two inputs, yielding $F_{\text{global\_embed}}$, $F_{\text{local\_embed}}$, and $F_{\text{act}}$. Afterward, $F_{\text{global\_embed}}$ and $F_{\text{act}}$ are scaled using average pooling or bilinear interpolation, depending on the size of $F_{\text{inj}}$, to ensure proper alignment of the feature information. Lastly, we incorporate the DBB module after each attention-based information fusion to further extract and fuse the information. The Information Injection Module in High-GD is almost identical to that in Low-GD. The merged formula is as follows:
$F_{\text{global\_act\_}P_i} = \mathrm{Resize}\left(\mathrm{Sigmoid}\left(\mathrm{Conv}_{\text{act}}\left(F_{\text{inj\_}P_i}\right)\right)\right),$
$F_{\text{global\_embed\_}P_i} = \mathrm{Resize}\left(\mathrm{Conv}_{\text{global\_embed\_}P_i}\left(F_{\text{inj\_}P_i}\right)\right),$
$F_{\text{att\_fuse\_}P_i(N_i)} = \mathrm{Conv}_{\text{local\_embed\_}P_i(N_i)}\left(F_{\text{local}}\right) \odot F_{\text{global\_act\_}P_i(N_i)} + F_{\text{global\_embed\_}P_i(N_i)},$
$P_i(N_i) = \mathrm{DiverseBranchBlock}\left(F_{\text{att\_fuse\_}P_i(N_i)}\right),$
where $\mathrm{Resize}(\cdot)$ denotes the average pooling or bilinear interpolation step described above and $\odot$ denotes element-wise multiplication.
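A minimal sketch of the injection step described by these equations is given below. The class name InjectSketch and the layer sizes are our own assumptions, bilinear interpolation is used for the resize step, and a plain 3 × 3 convolution stands in for the final DiverseBranchBlock refinement.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InjectSketch(nn.Module):
    """Inject gathered global information into one pyramid level: the local embedding is
    gated by Sigmoid(Conv_act(F_inj)), the global embedding is added, then the result is refined."""
    def __init__(self, local_ch: int, global_ch: int, out_ch: int):
        super().__init__()
        self.conv_local_embed = nn.Conv2d(local_ch, out_ch, 1, bias=False)
        self.conv_global_embed = nn.Conv2d(global_ch, out_ch, 1, bias=False)
        self.conv_act = nn.Conv2d(global_ch, out_ch, 1, bias=False)
        self.refine = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)  # DBB stand-in

    def forward(self, f_local, f_inj):
        size = f_local.shape[2:]
        # resize the global tensors to the local resolution (bilinear here; the full design
        # would use average pooling when downscaling)
        act = F.interpolate(torch.sigmoid(self.conv_act(f_inj)), size=size,
                            mode="bilinear", align_corners=False)
        glb = F.interpolate(self.conv_global_embed(f_inj), size=size,
                            mode="bilinear", align_corners=False)
        fused = self.conv_local_embed(f_local) * act + glb        # gated local + global embedding
        return self.refine(fused)

p4_local = torch.randn(1, 256, 40, 40)     # current-level feature
f_inj = torch.randn(1, 128, 40, 40)        # global information from the IFM
print(InjectSketch(256, 128, 256)(p4_local, f_inj).shape)   # torch.Size([1, 256, 40, 40])
```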
Drawing inspiration from YOLOv6 [40], the GD mechanism introduces the DBBInjection-LAF module to optimize the accuracy–latency trade-off. This module integrates a lightweight LAF component preceding the main injection block. Specifically, LAF handles feature scaling by applying bilinear interpolation during upsampling and average pooling during downsampling. It also reconciles channel differences across feature maps via 1 × 1 convolutions. The strategic placement and function of LAF within the DBBInjection structure enrich the inter-level information exchange by multiplying the available communication paths. This enhanced connectivity leads to improved model performance without a substantial increase in processing time.

3.2.2. Scale Sequence Feature Fusion Module

To counteract information degradation during feature fusion, a known limitation of FPNs, we replaced the CCFM component with our enhanced Gather-and-Distribute (GD) mechanism. Furthermore, we augmented the resulting AIFI-GD hybrid encoder by integrating the Scale Sequence Feature Fusion module. This culminated in our proposed AIFI-GDSF architecture, which leverages the following three key, complementary components: (1) The AIFI module executes intra-scale interactions exclusively on the S5 high-level feature map using stacked self-attention, thereby reducing computational overhead while capturing high-level conceptual relationships within the image. (2) The GD mechanism optimizes the efficiency of cross-scale information fusion and propagation, circumventing the inherent information loss associated with traditional FPN pathways and significantly strengthening feature extraction and integration. (3) The SSFF module synthesizes global semantic information across multiple scales by normalizing, upsampling, and processing concatenated multi-scale features through 3D convolutions. The combined effect of these diverse processing structures within AIFI-GDSF enables the robust handling of objects with varying sizes and shapes, a crucial capability for addressing the complex and dynamic characteristics of real-world fire detection.
The SSFF module fuses feature maps from S3, S4, and processed S5, with each feature map performing feature extraction tasks at different spatial scales. This hierarchical design covers a wide range of flame sizes and shapes. Within the SSFF module, input feature maps are first processed using a series of Gaussian filters to generate multi-scale representations. These resulting feature maps are then aggregated by horizontal stacking. Subsequently, 3D convolutions are employed to extract features from this stacked, multi-scale volume. Recognizing that the Gaussian smoothing yields outputs with disparate resolutions and acknowledging the importance of the S3 level for small object detection, we designate the high-resolution S3 feature map as the target resolution. All other feature maps are then spatially aligned to match S3’s dimensions via nearest-neighbor interpolation. Following this alignment, 3D batch normalization and the SiLU activation function are applied, finalizing the scale sequence feature extraction process. Finally, these processed SSFF features are integrated with the output from the injection module, typically through element-wise addition followed by a convolution operation, to yield the final output. The precise mathematical formulation is given below:
$N_{3\_SF} = \mathrm{SSFF}\left([S_3, S_4, \mathrm{Conv}(S_5)]\right),$
$N_3 = \mathrm{Conv}\left(P_3 + N_{3\_SF}\right),$
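The following sketch illustrates the SSFF idea under simplifying assumptions: the inputs are assumed to have already been projected to a common channel width, a depthwise Gaussian blur builds the scale sequence, nearest-neighbor interpolation aligns everything to the S3 resolution, and a single 3D convolution fuses the stacked volume. The class name SSFFSketch and all hyperparameters are illustrative, not the authors' configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gaussian_blur(x, sigma: float, ksize: int = 5):
    """Depthwise Gaussian smoothing used to build the scale sequence."""
    half = ksize // 2
    coords = torch.arange(ksize, dtype=x.dtype, device=x.device) - half
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    g = g / g.sum()
    kernel = (g[:, None] * g[None, :]).expand(x.shape[1], 1, ksize, ksize).contiguous()
    return F.conv2d(x, kernel, padding=half, groups=x.shape[1])

class SSFFSketch(nn.Module):
    """Scale Sequence Feature Fusion: blur each level, upsample to the S3 resolution,
    stack along a new 'scale' axis, and fuse with a 3D convolution + BN + SiLU."""
    def __init__(self, channels: int = 128, levels: int = 3):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=(levels, 3, 3), padding=(0, 1, 1)),
            nn.BatchNorm3d(channels), nn.SiLU())

    def forward(self, feats):  # feats: [S3, S4, Conv(S5)], all projected to `channels` channels
        h, w = feats[0].shape[2:]
        seq = [F.interpolate(gaussian_blur(f, sigma=1.0 + i), size=(h, w), mode="nearest")
               for i, f in enumerate(feats)]
        volume = torch.stack(seq, dim=2)          # (N, C, levels, H, W)
        return self.fuse(volume).squeeze(2)       # back to (N, C, H, W)

s3, s4, s5 = torch.randn(1, 128, 80, 80), torch.randn(1, 128, 40, 40), torch.randn(1, 128, 20, 20)
print(SSFFSketch()([s3, s4, s5]).shape)   # torch.Size([1, 128, 80, 80])
```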

3.3. Loss Function

To quantify how well anchor boxes align with ground truth boxes in object detection, the Intersection over Union (IoU) metric is utilized, measuring their degree of overlap. Its calculation proceeds according to the following formula:
$L_{IoU} = 1 - IoU = 1 - \frac{W_i H_i}{S_u},$
where $L_{IoU}$ represents the IoU loss function; $W_i$ and $H_i$ denote the width and height of the overlapping area between the ground truth and predicted boxes; and $S_u$ represents the union area. The parameters are illustrated in Figure 7.
In the event that the predicted bounding box fails to intersect with the ground truth box, the network may encounter a phenomenon referred to as 'vanishing gradients' during backpropagation. This occurrence has the potential to impede the ability of the network to adjust the predicted box to increase the overlap as training progresses. Current solutions to this problem in designing bounding box loss functions often introduce additional penalty terms $R_i$ to mitigate the risk of vanishing gradients; such loss functions include CIoU [41] and GIoU [42].
The effectiveness of such loss functions generally relies on the assumption that the training dataset predominantly consists of high-quality annotations, thereby focusing efforts on refining the accuracy of bounding box regression. However, the reality is that even benchmark datasets like ImageNet and VOC2007 contain unavoidable instances of poorly labeled examples. Consequently, optimizing the regression loss based on these inaccurate samples can inadvertently degrade the overall detection capabilities of the trained model. The flame detection model training in this study inevitably encounters these issues during the labeling process for the collected dataset. Due to the irregular shape and high variability of flames, using rectangular boxes for labeling may result in the flame occupying a small portion of the bounding box. Moreover, in real-world scenarios, factors like color and brightness complicate the detection of objects, leading to severe occlusion. Therefore, it is necessary to differentiate anchor boxes of varying quality to effectively improve detection performance.
To address the issue of annotation quality affecting network performance, the present paper adopts Wise-IoU [21]. It introduces a novel approach that uses the distance between boxes as an attention metric. When the predicted and ground truth boxes show significant overlap within a defined range, weights are allocated based on the distance between the boxes, allowing for better generalization. The Wise-IoU calculation formula is as follows:
$L_{WIoU\,v1} = R_{WIoU}\, L_{IoU},$
$R_{WIoU} = \exp\!\left(\frac{(x - x_{gt})^2 + (y - y_{gt})^2}{\left(W_{gt}^2 + H_{gt}^2\right)^{*}}\right),$
This initial version defines the Wise-IoU loss as the IoU loss modulated by a weighting factor $R_{WIoU}$, which is computed from the squared Euclidean distance between the center coordinates of the predicted box and the ground truth box. Here, $L_{WIoU\,v1}$ represents the Wise-IoU loss; $R_{WIoU}$ represents the distance metric between the two boxes; $(x, y)$ and $(x_{gt}, y_{gt})$ denote the center coordinates of the predicted and ground truth boxes; $W_{gt}$ and $H_{gt}$ represent the width and height of the ground truth box; and the superscript $*$ indicates that the term is detached from the gradient computation so that it acts only as a scaling factor.
$L_{WIoU\,v3} = \frac{\beta}{\delta\, \alpha^{\beta - \delta}}\, R_{WIoU}\, L_{IoU},$
$\beta = \frac{L_{IoU}^{*}}{\overline{L_{IoU}}} \in [0, +\infty),$
WIoU v3 incorporates a non-monotonic focusing coefficient $\frac{\beta}{\delta\,\alpha^{\beta - \delta}}$, using the outlier degree $\beta$ along with hyperparameters $\alpha$ and $\delta$ to dynamically adjust the gradient contribution of each anchor box, where $\beta$ is computed as the ratio of the detached IoU loss $L_{IoU}^{*}$ to its running mean $\overline{L_{IoU}}$. A higher value of $\beta$ indicates that the anchor box has a high IoU loss compared to the average, signifying that it is of lower quality. Smaller gradient gains are then assigned to both very low- and very high-quality boxes, allowing the regression task to focus more on medium-quality boxes and effectively moderating the harmful gradients caused by poor-quality examples. These strategies facilitate the model's ability to handle samples of varying quality and manage their effects on the model.
The efficacy of this selection is substantiated by its capacity to amalgamate the numerous advantages inherent in contemporary loss functions, thereby aligning with the foundational principles of effective loss design. In addition, WIoUv3 adopts a dynamic non-monotonic algorithm to score anchor box performance, which enhances the suitability of the model to handle medium quality samples. By mitigating the detrimental gradients from low-quality samples, WIoUv3 strengthens model robustness. Moreover, it contributes to improved detection accuracy, particularly for smaller objects. These benefits collectively enable the model to handle samples of diverse quality more effectively, optimizing their influence during training.
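To make the loss concrete, the sketch below implements the WIoU v3 equations above for corner-format boxes. The hyperparameter values, the momentum-based running mean of the IoU loss, and the class name WIoUv3Sketch are our own illustrative choices rather than the paper's configuration.

```python
import torch

class WIoUv3Sketch:
    """Bounding-box loss following the WIoU v3 equations above. Boxes are (x1, y1, x2, y2);
    alpha, delta, and the running-mean momentum are illustrative values."""
    def __init__(self, alpha: float = 1.9, delta: float = 3.0, momentum: float = 0.01):
        self.alpha, self.delta, self.momentum = alpha, delta, momentum
        self.mean_iou_loss = 1.0                                   # running mean of L_IoU

    def __call__(self, pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
        # IoU loss: 1 - intersection / union
        lt = torch.maximum(pred[:, :2], gt[:, :2])
        rb = torch.minimum(pred[:, 2:], gt[:, 2:])
        inter = (rb - lt).clamp(min=0).prod(dim=1)
        area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
        area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
        l_iou = 1.0 - inter / (area_p + area_g - inter).clamp(min=1e-7)

        # distance-based weighting R_WIoU (denominator detached so it only rescales gradients)
        cxp, cyp = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
        cxg, cyg = (gt[:, 0] + gt[:, 2]) / 2, (gt[:, 1] + gt[:, 3]) / 2
        wg, hg = gt[:, 2] - gt[:, 0], gt[:, 3] - gt[:, 1]
        r_wiou = torch.exp(((cxp - cxg) ** 2 + (cyp - cyg) ** 2) / (wg ** 2 + hg ** 2).detach())

        # non-monotonic focusing: outlier degree beta = detached L_IoU / running mean of L_IoU
        beta = l_iou.detach() / self.mean_iou_loss
        r_focus = beta / (self.delta * self.alpha ** (beta - self.delta))
        self.mean_iou_loss = (1 - self.momentum) * self.mean_iou_loss + self.momentum * l_iou.mean().item()
        return (r_focus * r_wiou * l_iou).mean()

pred = torch.tensor([[10., 10., 50., 60.]], requires_grad=True)
gt = torch.tensor([[12., 8., 48., 62.]])
loss = WIoUv3Sketch()(pred, gt)
loss.backward()
print(float(loss), pred.grad.shape)
```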

4. Results and Discussion

4.1. Implementation Details

We conducted our experiments on a single NVIDIA GeForce RTX 4090 GPU (24 GB), with a software environment consisting of Python 3.9, CUDA 12.0, and PyTorch 2.0.0. The AdamW optimizer and a cosine learning rate scheduler were used, with a learning rate of 1 × 10−4.
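A minimal training-configuration sketch matching the stated optimizer, scheduler type, and learning rate is shown below; the model placeholder and epoch count are illustrative assumptions, not values reported in the paper.

```python
import torch

model = torch.nn.Conv2d(3, 16, 3)        # placeholder standing in for FasterGDSF-DETR
epochs = 100                             # illustrative value, not reported in the paper

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # ... forward pass, loss computation, optimizer.zero_grad(), loss.backward(), optimizer.step() ...
    scheduler.step()                     # cosine decay of the learning rate each epoch
```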

4.2. Dataset

Considering the limited availability, low resolution, and lack of environmental diversity in current open source fire datasets for public environments, our self-constructed dataset primarily comprises flame images sourced from the KMU Fire and Smoke database, along with additional fire images gathered from various online sources. To improve the detection of small flames, we collected and manually annotated images containing small flames from the internet. The dataset encompasses a variety of scenes, including daytime, nighttime, indoor, and outdoor settings, in addition to flame targets of varying sizes and shapes. Given that flame detection in stationary indoor scenes is neither excessively challenging nor highly generalizable, only 10% of the dataset comprises indoor scenes. The dataset predominantly comprises outdoor scenes, with a slightly higher proportion of images captured in dark environments compared to bright environments. To ensure a balanced representation of various types of fires, particularly those occurring in built-up areas as well as in natural environments such as forests and grasslands, the dataset was curated to contain equal proportions of building fire images and forest and grassland fire images. After a thorough data cleaning and filtering process, we shuffled the dataset and randomly selected 6080 images, with 4052 images in the training set, 1014 in the test set, and 1014 in the validation set. Training on this self-constructed dataset helps to validate the effectiveness and robustness of the model.
To enhance methodological rigor and empirical validity, we conducted comprehensive experimental evaluations using the open source M4SFWD [43] dataset. Developed using Unreal Engine 5, this synthetic dataset enables the photorealistic generation of diverse forest wildfire scenarios, unconstrained by temporal or spatial limitations. The dataset’s comprehensive coverage encompasses the following: eight distinct natural environments with varied terrain features, three temporal phases (dawn, daylight, and dusk), multiple meteorological conditions, and some negative samples to enhance model robustness and reduce false positives. Comprising 3974 high-resolution images with 9627 annotated flame instances, the dataset was randomly partitioned into training, testing, and validation subsets following an 8:1:1 ratio, ensuring statistically reliable results through stratified sampling across all environmental variables.
A sample of the partial dataset is shown in Figure 8. The fire dataset covers a broad spectrum of scenes, with evenly distributed target box center positions and a reasonable proportion of target box sizes relative to the images. However, factors like the shape of the flames have a significant impact on the quality of the annotation. Based on this characteristic, this study adopts the WIoU, which places greater emphasis on average-quality anchor boxes and serves to mitigate the effects of high-quality and low-quality anchor boxes on the regression task.

4.3. Ablation Experiments

The main algorithmic components in this study include the FasterRepNet backbone, composed of the RepNCSPELAN4 module; the FasterDBBNet backbone, composed of the DBBCSPELAN module; the AIFI-GDSF hybrid encoder; and the WIoU loss function. We validated the effectiveness of the proposed improvements on our custom dataset, with the experimental results presented in Table 1.
Evidence supporting the superiority of our proposed models over the original RT-DETR algorithm can be found by examining Experiments 1, 8, and 9, detailed in Table 1. The enhanced performance observed in our approach stems from a configuration utilizing the FasterRepNet backbone network, the AIFI-GD hybrid encoder for feature integration, and the substitution of the default CIoU loss function with WIoUv3. The resulting FasterGold-DETR model improved the mean average precision (mAP50) by 2.6% while reducing the parameters and GFLOPs. On this basis, the FasterGDSF-DETR model introduced DBB modules into both the backbone and encoder, and it integrated the SSFF module into the feature fusion layer, creating the new AIFI-GDSF encoder. Although the parameter count slightly increased, the computational complexity was significantly reduced, and the mean average precision saw a slight improvement.
The effectiveness of individual modules was also validated. Comparing Experiments 1 and 3 showed a 2.2% accuracy gain attributable to the new feature fusion network. Figure 9 illustrates the comparison of data between the models. We also found that the hybrid encoder significantly accelerated convergence, cutting the required training iterations by over 33%. The parameter dynamics during training are shown in Figure 10.
We also separately validated the performance of partial convolution. Comparing Experiment 1 with Experiment 2 showed that replacing the backbone network with the PConv module reduced both the parameter count and the computational complexity. Despite a minor loss in mean average precision, visual analysis and performance in complex scenarios showed that PConv provided the model with strong adaptability and robustness. Similarly, a comparison between Experiment 1 and Experiment 4 revealed that replacing the loss function with WIoU led to a 1.5% improvement in mAP50. Moreover, we found that WIoU is better suited for datasets with irregular flame shapes, offering dual improvements in both convergence speed and detection performance.
To specifically validate the distinct advantages of our proposed FasterDBBNet backbone, we performed a comparative analysis against seven alternative network architectures. The quantitative results, detailed in Table 2, strongly support the efficacy of the FasterDBBNet structure within our overall model. Among the alternatives, Unireplknet notably achieved competitive performance metrics. However, a closer examination of its inference behavior revealed a tendency to adopt a more conservative detection strategy, which could potentially impact the system’s overall operational consistency. Despite this observation, we recognize Unireplknet as a promising architecture deserving further investigation.
We can also briefly share some examples of unsuccessful experiments. WIoU is the predominant choice of loss function among research teams in the domain of fire detection. The present study attempted to build on WIoU by incorporating the principles of PIoU [44] and Inner IoU [45] into the existing loss function. However, the experimental results obtained thus far have been unsatisfactory. The suboptimal performance of Inner IoU is somewhat unexpected. One hypothesis is that Inner IoU's treatment of auxiliary borders overlaps with the idea behind WIoU, and thus it was unable to achieve the desired results.
As a related line of inquiry beyond this paper’s main results, we explored enhancing AIFI by separating features into high and low frequency information. The strategy involved adapting attention for the distinct processing of these components. An implementation based on the HiLo [46] architecture was tested experimentally. While this approach improved model recall, it simultaneously lowered the precision rate, generating numerous false detections and consequently hindering stable system performance.

4.4. Comparison with the State-of-the-Art Detectors

The performance of our proposed model was benchmarked against other contemporary object detectors, with the quantitative results presented in Table 3. Notably, our approach exhibits enhanced accuracy compared to the baseline models while utilizing a similar or reduced number of parameters. Furthermore, when evaluated against GoldYOLO-M, a significantly larger model (possessing over twice the parameters), our model maintains a competitive level of accuracy with minimal degradation. Compared to YOLOv8, our method significantly outperforms that model in all performance metrics. For YOLOv9, the similarly sized YOLOv9-S demonstrates an advantage in speed but suffers a relatively large loss in accuracy, and it performs poorly in detecting small objects; YOLOv9-M's accuracy is still 0.2% lower despite having double the number of parameters. In summary, our model outperforms the benchmark models in terms of parameters, GFLOPs, and accuracy. In particular, due to the introduction of the PAD module and WIoU loss function, our model is more adept at detecting occluded and small targets. Furthermore, we performed experimental comparisons on the M4SFWD [43] public dataset, with the results shown in Table 4. In comparison with the base models, the two improved models we proposed achieved better performance with respect to the number of parameters and precision.
In summary, both FasterGold-DETR and FasterGDSF-DETR outperform all mainstream baseline models in terms of overall performance, whether on the custom dataset or the M4SFWD public dataset. It is important to contextualize the FPS comparison between YOLO and Transformer-based object detectors. Typically, Transformer models exhibit lower FPS, largely attributable to their longer inference latency within standard frameworks like Pytorch. However, this gap can be significantly narrowed through the deployment of optimizations, such as utilizing the TensorRT framework, which markedly accelerates inference. Furthermore, even without such specific optimizations, the current inference speeds achievable on adequately resourced hardware can often satisfy the demands of real-time tasks.

4.5. Visualisation and Analysis

4.5.1. Effective Receptive Field (ERF) Visualization

In the context of Convolutional Neural Networks (CNNs), the term ’receptive field’ denotes the area in the input layer that exerts influence on the computation of a specific element in a given layer [47]. Analogous to human visual perception, the receptive field defines the scope within which the convolution kernel interacts with the input feature map. A limited receptive field restricts the network’s processing to only localized and fragmented details. Conversely, a more expansive receptive field facilitates the integration of a more exhaustive array of contextual features, thereby improving the model’s ability to accurately interpret features at different levels.
In this study, an Effective Receptive Field (ERF) visualization method is employed to analyze the backbone network, explaining the performance enhancements resulting from the replacement of the backbone network [48]. Figure 11 illustrates that, in ResNet-18, high-contribution pixels are concentrated around the central region, while those extending towards the periphery contribute minimally, with outermost pixels exhibiting almost no impact. This pattern suggests a constrained Effective Receptive Field (ERF). A similar trend is observed when employing PConv, where contributions generally decrease, indicating that partial convolutions alone do not expand the ERF without additional network depth. In contrast, Figure 11c,d demonstrate a higher uniformity in the distribution of high-contribution pixels, thus indicating that the proposed FasterRepNet and FasterDBBNet architectures allocate greater importance to peripheral regions. This phenomenon consequently gives rise to a more balanced contribution distribution and an expanded ERF.
Building on Figure 11, a statistical evaluation of the proportion of high-contribution areas for different models at different contribution levels is presented in Table 5. A pixel is considered highly contributive only if its contribution value to the final prediction exceeds the specified contribution score. The high-contribution area ratio refers to the proportion of the area occupied by all high-contribution pixels to the total area of the input image. From Table 5, it can be observed that replacing ResNet with PConv results in a reduction in the high-contribution area ratio by over 50% across a range of contribution scores. Conversely, adopting the proposed FasterDBBNet structure leads to an increase of more than double in the high-contribution area ratio across various contribution scores. It is noteworthy that for ResNet-18, over 99% of the contribution score is concentrated within a small area that accounts for just 25.15% of the total area, while FasterDBBNet achieves an area ratio of 89.95%. A larger ERF enables the network to gather a greater amount of contextual input, facilitating the comprehension of the global structure of features.
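For reference, the sketch below computes the high-contribution area ratio as defined above on a synthetic contribution map; the Gaussian-shaped map, its size, and the threshold values are illustrative stand-ins for a measured ERF rather than our experimental data.

```python
import numpy as np

def high_contribution_area_ratio(contribution: np.ndarray, score: float) -> float:
    """Fraction of input pixels whose normalized contribution to the central output
    unit exceeds the given contribution score, following the definition in the text."""
    c = contribution / contribution.max()          # normalize so the peak contribution is 1
    return float((c > score).mean())

# toy example: a Gaussian-shaped contribution map standing in for a measured ERF
h = w = 224
yy, xx = np.mgrid[0:h, 0:w]
erf = np.exp(-(((yy - h / 2) ** 2 + (xx - w / 2) ** 2) / (2 * 40.0 ** 2)))
for t in (0.2, 0.5, 0.8):
    print(t, round(high_contribution_area_ratio(erf, t), 4))
```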
We think that this result is due to the DBB module in FasterDBBNet, which employs multi-scale convolutional kernels. This complex structure optimizes multiple convolution layers into a single equivalent convolution layer using specific transformation techniques, thus introducing an equivalent large kernel. Furthermore, the DBB module not only exhibits the properties of an equivalent large kernel but also aligns with the feature of small kernel reparameterization. This means that we replaced ResNet’s small kernel deep convolutional layers with large kernel reparameterized convolution layers, resulting in a significantly larger Effective Receptive Field.
Since the input to the self-attention mechanism originates from the backbone network featuring a large ERF, the inherent global information complements the attention process. This enables the self-attention layer to concentrate its resources on extracting fine-grained local patterns and capturing relationships across distant regions, leading to a more adaptable feature fusion strategy.

4.5.2. Feature Visualization of the Backbone Network

For a long time, researchers’ efforts in optimization have mainly focused on developing more powerful architectures or auxiliary methods, often neglecting the potential information loss in input data during the feedforward process, a phenomenon referred to as the information bottleneck [34]. This information bottleneck leads to biased gradient flows during model updates, making it challenging for deep networks to link input with feature information, resulting in incorrect predictions from the trained model. As shown in Figure 12, different backbone networks exhibit varying degrees of feature information loss during forward propagation to later network layers. In deeper networks, ResNet begins to produce feedforward outputs that obscure object information. In contrast, the proposed FasterDBBNet network preserves feature information as completely as possible, providing the most reliable gradient information for computing the objective function.

4.5.3. Heatmap Visualization of the Feature Fusion Layer

To validate the newly proposed feature fusion method, we utilized Grad-CAM [50] to visualize heatmaps of multiple feature layers within the encoder's feature fusion section. Grad-CAM back-propagates the prediction score to a chosen layer, averages the resulting gradients over the spatial dimensions to obtain channel weights, and then forms a weighted combination of that layer's feature maps to generate a heatmap. In this way, we overlaid the heatmaps of the Inject layer and the fusion layer, producing the final heatmap shown in Figure 13.
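A minimal Grad-CAM sketch along these lines is shown below, assuming a PyTorch setup in which a forward hook retains the activation (and its gradient) of one target layer; torchvision's ResNet-18 serves as a stand-in network and the function name is hypothetical. Per-layer heatmaps produced this way can be resized to a common resolution and averaged to obtain an overlaid map of the kind shown in Figure 13.

import torch
import torch.nn.functional as F
from torchvision.models import resnet18

def grad_cam(model, layer, image: torch.Tensor, class_idx: int = None) -> torch.Tensor:
    """Return an (H, W) heatmap in [0, 1] for the given layer and target class."""
    acts = {}
    def hook(_module, _inputs, output):
        output.retain_grad()                   # keep the gradient of this activation
        acts["feat"] = output
    handle = layer.register_forward_hook(hook)
    logits = model(image)
    handle.remove()
    if class_idx is None:
        class_idx = int(logits.argmax(dim=1))
    model.zero_grad()
    logits[0, class_idx].backward()            # back-propagate the class score
    fmap, grad = acts["feat"], acts["feat"].grad
    weights = grad.mean(dim=(2, 3), keepdim=True)             # spatially averaged gradients
    cam = F.relu((weights * fmap).sum(dim=1, keepdim=True))   # weighted feature combination
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    cam = cam.squeeze().detach()
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

if __name__ == "__main__":
    model = resnet18(weights=None).eval()      # use a trained detector and a real fire image in practice
    image = torch.rand(1, 3, 224, 224)
    heatmap = grad_cam(model, model.layer4, image)
    print(heatmap.shape, float(heatmap.min()), float(heatmap.max()))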
Since we visualized the higher-level features of the model, the heatmap primarily reflects how the model integrates local texture information into higher-level semantic information. YOLOv8 and RT-DETR, which rely on traditional FPN feature fusion structures, are more conservative in selecting regions of interest and often fail to accurately localize the focal points of ground-truth objects in some images. Conversely, models utilizing the gather-and-distribute mechanism identify hotspot regions exceptionally well, and their small-object detection performance clearly surpasses that of YOLOv8 and RT-DETR. Compared to Gold-YOLO, our FasterGDSF-DETR model captures target shapes and clusters heatmap points more effectively, and it displays fewer instances of abnormal heatmap distribution, suggesting greater robustness to interference.

5. Conclusions

We developed a new backbone network, FasterDBBNet, which captures and preserves feature gradient information effectively while coping well with incomplete data, improving the model's resilience to interference in complex fire scenarios. Given the diverse and dynamic nature of fire scenarios, we also modified the encoder: the AIFI structure was preserved, while the Gather-and-Distribute mechanism and the Scale Sequence Feature Fusion module were incorporated, forming the AIFI-GDSF hybrid encoder. This design strengthens the neck's ability to integrate information and improves detection accuracy without introducing significant computational overhead. Experimental results confirm that the hybrid encoder improves both accuracy and robustness in challenging detection tasks while accelerating model convergence, thereby mitigating the high training cost associated with RT-DETR. Additionally, because the standard RT-DETR loss function is suboptimal for fire detection, we replaced it with WIoUv3 to better align with task-specific requirements. Future work will focus on model lightweighting and optimized hardware deployment strategies to further improve efficiency.

Author Contributions

Conceptualization, F.W., C.L. and L.S.; methodology, F.W. and C.L.; software, F.W.; validation, F.W., C.L. and L.S.; writing—review and editing, F.W., C.L. and L.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded, in part, by the National Key R&D Program of China (2020YFB1712401), the Key Project of Public Benefit in Henan Province of China (201300210500), the Key Project of Collaborative Innovation in Nanyang (22XTCX12001), and the Key Scientific and Technology Project in Henan Province of China (221100210100).

Data Availability Statement

We are willing to contribute to the community by making the code and experimental datasets publicly available. However, because the work is not yet finished and the project is subject to confidentiality requirements, we cannot guarantee to open source everything at this time. We offer the following: (1) emails stating the intention and purpose of a request will receive a positive response, and the code and data can be made available; (2) we will open source the complete code and dataset once all work is completed; (3) readers who need earlier access may contact us, and we will discuss with our project partners whether part of the work can be released now. The M4SFWD dataset is available at https://github.com/Philharmy-Wang/M4SFWD (accessed on 1 April 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ya’acob, N.; Najib, M.S.M.; Tajudin, N.; Yusof, A.L.; Kassim, M. Image Processing Based Forest Fire Detection using Infrared Camera. J. Phys. Conf. Ser. 2021, 1768, 012014. [Google Scholar] [CrossRef]
  2. Yang, S.; Huang, Q.; Yu, M. Advancements in remote sensing for active fire detection: A review of datasets and methods. Sci. Total Environ. 2024, 943, 173273. [Google Scholar] [CrossRef] [PubMed]
  3. Li, N.; Hu, X.; Lu, Y.; Li, Y.; Ren, M.; Luo, X.; Ji, Y.; Chen, Q.; Sui, X. Wavelength-Selective Near-Infrared Organic Upconversion Detectors for Miniaturized Light Detection and Visualization. Adv. Funct. Mater. 2024, 34, 2411626. [Google Scholar] [CrossRef]
  4. Wei, H.; Wang, N.; Liu, Y.; Ma, P.; Pang, D.; Sui, X.; Chen, Q. Spatio-Temporal Feature Fusion and Guide Aggregation Network for Remote Sensing Change Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–16. [Google Scholar] [CrossRef]
  5. Sui, X.; Chen, Q.; Bai, L. Detection algorithm of targets for infrared search system based on area infrared focal plane array under complicated background. Optik 2012, 123, 235–239. [Google Scholar] [CrossRef]
  6. Kattenborn, T.; Leitloff, J.; Schiefer, F.; Hinz, S. Review on Convolutional Neural Networks (CNN) in vegetation remote sensing. ISPRS J. Photogramm. Remote Sens. 2021, 173, 24–49. [Google Scholar] [CrossRef]
  7. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A Survey on Vision Transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 87–110. [Google Scholar] [CrossRef]
  8. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  9. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  10. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
  11. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  12. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.-Y. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection. arXiv 2022, arXiv:2203.03605. [Google Scholar]
  13. Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. DN-DETR: Accelerate DETR Training by Introducing Query DeNoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 13619–13627. [Google Scholar]
  14. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
  15. Zhu, M.; Kong, E. Multi-Scale Fusion Uncrewed Aerial Vehicle Detection Based on RT-DETR. Electronics 2024, 13, 1489. [Google Scholar] [CrossRef]
  16. Ge, Q.; Yuan, H.; Zhang, Q.; Hou, Y.; Zang, C.; Li, J.; Liang, B.; Jiang, X. Hyper-Progressive Real-Time Detection Transformer (HPRT-DETR) algorithm for defect detection on metal bipolar plates. Int. J. Hydrog. Energy 2024, 74, 49–55. [Google Scholar] [CrossRef]
  17. Cheng, Y.; Liu, D. AdIn-DETR: Adapting Detection Transformer for End-to-End Real-Time Power Line Insulator Defect Detection. IEEE Trans. Instrum. Meas. 2024, 73, 1–11. [Google Scholar] [CrossRef]
  18. Zheng, H.; Wang, G.; Xiao, D.; Liu, H.; Hu, X. FTA-DETR: An efficient and precise fire detection framework based on an end-to-end architecture applicable to embedded platforms. Expert Syst. Appl. 2024, 248, 123394. [Google Scholar] [CrossRef]
  19. Wang, C.; He, W.; Nie, Y.; Guo, J.; Liu, C.; Wang, Y.; Han, K. Gold-YOLO: Efficient Object Detector via Gather-and-Distribute Mechanism. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S., Eds.; Curran Associates, Inc.: Sydney, Australia, 2023; Volume 36, pp. 51094–51112. [Google Scholar]
  20. Kang, M.; Ting, C.M.; Ting, F.F.; Phan, R.C.W. ASF-YOLO: A novel YOLO model with attentional scale sequence fusion for cell instance segmentation. Image Vis. Comput. 2024, 147, 105057. [Google Scholar] [CrossRef]
  21. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding Box Regression Loss with Dynamic Focusing Mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
  22. Sharma, A.; Kumar, R.; Kansal, I.; Popli, R.; Khullar, V.; Verma, J.; Kumar, S. Fire Detection in Urban Areas Using Multimodal Data and Federated Learning. Fire 2024, 7, 104. [Google Scholar] [CrossRef]
  23. Vorwerk, P.; Kelleter, J.; Müller, S.; Krause, U. Classification in Early Fire Detection Using Multi-Sensor Nodes—A Transfer Learning Approach. Sensors 2024, 24, 1428. [Google Scholar] [CrossRef]
  24. Barmpoutis, P.; Dimitropoulos, K.; Kaza, K.; Grammalidis, N. Fire Detection from Images Using Faster R-CNN and Multidimensional Texture Analysis. In Proceedings of the ICASSP 2019—2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 8301–8305. [Google Scholar] [CrossRef]
  25. Chaoxia, C.; Shang, W.; Zhang, F. Information-Guided Flame Detection Based on Faster R-CNN. IEEE Access 2020, 8, 58923–58932. [Google Scholar] [CrossRef]
  26. Pan, J.; Ou, X.; Xu, L. A Collaborative Region Detection and Grading Framework for Forest Fire Smoke Using Weakly Supervised Fine Segmentation and Lightweight Faster-RCNN. Forests 2021, 12, 768. [Google Scholar] [CrossRef]
  27. Duan, K.; Xie, L.; Qi, H.; Bai, S.; Huang, Q.; Tian, Q. Corner Proposal Network for Anchor-Free, Two-Stage Object Detection. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer: Berlin/Heidelberg, Germany, 2020; pp. 399–416. [Google Scholar]
  28. Zheng, H.; Duan, J.; Dong, Y.; Liu, Y. Real-time fire detection algorithms running on small embedded devices based on MobileNetV3 and YOLOv4. Fire Ecol. 2023, 19, 31. [Google Scholar] [CrossRef]
  29. Zheng, H.; Dembélé, S.; Wu, Y.; Liu, Y.; Chen, H.; Zhang, Q. A lightweight algorithm capable of accurately identifying forest fires from UAV remote sensing imagery. Front. For. Glob. Chang. 2023, 6, 1134942. [Google Scholar] [CrossRef]
  30. Ren, D.; Zhang, Y.; Wang, L.; Sun, H.; Ren, S.; Gu, J. FCLGYOLO: Feature Constraint and Local Guided Global Feature for Fire Detection in Unmanned Aerial Vehicle Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 5864–5875. [Google Scholar] [CrossRef]
  31. Li, Y.; Zhang, W.; Liu, Y.; Jing, R.; Liu, C. An efficient fire and smoke detection algorithm based on an end-to-end structured network. Eng. Appl. Artif. Intell. 2022, 116, 105492. [Google Scholar] [CrossRef]
  32. Liang, T.; Zeng, G. FSH-DETR: An Efficient End-to-End Fire Smoke and Human Detection Based on a Deformable DEtection TRansformer (DETR). Sensors 2024, 24, 4077. [Google Scholar] [CrossRef]
  33. Targ, S.; Almeida, D.; Lyman, K. Resnet in Resnet: Generalizing Residual Architectures. arXiv 2016, arXiv:1603.08029. [Google Scholar] [CrossRef]
  34. Wang, C.Y.; Yeh, I.H.; Liao, H. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616. [Google Scholar]
  35. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A New Backbone That Can Enhance Learning Capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  36. Wang, C.Y.; Liao, H.; Yeh, I.H. Designing Network Design Strategies Through Gradient Path Analysis. arXiv 2022, arXiv:2211.04800. [Google Scholar]
  37. Ding, X.; Zhang, X.; Han, J.; Ding, G. Diverse Branch Block: Building a Convolution as an Inception-Like Unit. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 10886–10895. [Google Scholar]
  38. Chen, J.; Kao, S.H.; He, H.; Zhuo, W.; Wen, S.; Lee, C.H.; Chan, S.H.G. Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 12021–12031. [Google Scholar]
  39. Lin, T.Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  40. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  41. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing Geometric Factors in Model Learning and Inference for Object Detection and Instance Segmentation. IEEE Trans. Cybern. 2022, 52, 8574–8586. [Google Scholar] [CrossRef]
  42. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  43. Wang, G.; Li, H.; Li, P.; Lang, X.; Feng, Y.; Ding, Z.; Xie, S. M4SFWD: A Multi-Faceted synthetic dataset for remote sensing forest wildfires detection. Expert Syst. Appl. 2024, 248, 123489. [Google Scholar] [CrossRef]
  44. Chen, Z.; Chen, K.; Lin, W.; See, J.; Yu, H.; Ke, Y.; Yang, C. PIoU Loss: Towards Accurate Oriented Object Detection in Complex Environments. arXiv 2020, arXiv:2007.09584. [Google Scholar] [CrossRef]
  45. Zhang, H.; Xu, C.; Zhang, S. Inner-IoU: More Effective Intersection over Union Loss with Auxiliary Bounding Box. arXiv 2023, arXiv:2311.02877. [Google Scholar] [CrossRef]
  46. Pan, Z.; Cai, J.; Zhuang, B. Fast Vision Transformers with HiLo Attention. arXiv 2023, arXiv:2205.13213. [Google Scholar] [CrossRef]
  47. Luo, W.; Li, Y.; Urtasun, R.; Zemel, R. Understanding the Effective Receptive Field in Deep Convolutional Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: Sydney, Australia, 2016; Volume 29. [Google Scholar]
  48. Ding, X.; Zhang, X.; Han, J.; Ding, G. Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11963–11975. [Google Scholar]
  49. Zhang, J.; Li, X.; Li, J.; Liu, L.; Xue, Z.; Zhang, B.; Jiang, Z.; Huang, T.; Wang, Y.; Wang, C. Rethinking Mobile Block for Efficient Attention-based Models. arXiv 2023, arXiv:2301.01146. [Google Scholar] [CrossRef]
  50. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations From Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
Figure 1. Overall structure of the FasterGDSF-DETR: (a) FasterGDSF-DETR; (b) FasterDBBNet; (c) PAD Block; and (d) ADown Block.
Figure 2. Architecture of the Diverse Branch Block: (a) RepConvN; (b) DBB.
Figure 3. DBBCSPELAN is deconstructed into CSPNet, ELAN, GELAN, and DBB. (a) CSPNet. (b) ELAN. (c) GELAN. (d) RepNCSP and DBBCSP are two structures, and in the figure, brackets are used to indicate the use of different modules.
Figure 4. (a,b) Differences between depthwise separable convolution and partial convolution.
Figure 5. Architecture of the Gather-and-Distribute branch.
Figure 6. Architecture of the injection module: (a) DBBInjection; (b) DBBInjection module with LAF.
Figure 7. Schematic of the loss function parameters.
Figure 8. (a,b) Training images in the self-built dataset; (c,d) training images in the M4SFWD dataset.
Figure 9. Comparison of our model with other mainstream models based on key metrics, including detection precision (P), mean average precision (mAP@0.5), computational complexity (GFLOPs), and the number of parameters.
Figure 10. Comparison of the loss function and average accuracy for different modules.
Figure 11. Schematic representation of the ERF of ResNet, PConv, FasterRepNet, and FasterDBBNet.
Figure 12. Visual representation of the output feature maps from various network architectures, with randomly initialized weights.
Figure 13. Grad-CAM heatmap visualization of the encoder feature fusion layers for different models.
Table 1. Ablation experiments validating the effectiveness of the proposed components (where “-” indicates the structure is not used, and “✓” indicates the structure is used).
Method (PConv / FasterRepNet / FasterDBBNet / AIFI-GD / AIFI-GDSF / WIoUv3) | mAP@0.5 | Params (M) | GFLOPs
Single Structure Verification
1 (RT-DETR, Baseline) --- --- | 68.6 | 19.87 | 56.9
2 -- --- | 67.6 | 14.10 | 43.2
3 --- -- | 70.8 | 22.25 | 59.9
4 --- -- | 70.1 | 19.87 | 56.9
5 - --- | 68.4 | 8.56 | 23.7
6 --- | 68.7 | 10.68 | 19.5
Multi Structure Verification
7 - -- | 70.7 | 15.46 | 46.6
8 --- - | 71.4 | 22.25 | 59.9
9 - | 71.4 | 15.71 | 49.8
10 (FasterGold-DETR, Ours) - - | 71.2 | 15.46 | 46.6
11 (FasterGDSF-DETR, Ours) | 71.5 | 16.76 | 39.4
Table 2. Comparative experiments on different backbone networks.
Model | mAP@0.5 | mAP@0.5:0.95 | Params (M) | GFLOPs
ResNet-18 | 68.6 | 35.5 | 19.87 | 56.9
PConv | 67.6 | 35.2 | 14.10 | 43.2
Unireplknet | 68.5 | 35.6 | 12.71 | 33.4
EfficientViT | 68.2 | 35.3 | 10.70 | 27.2
MobileNet v3 | 64.7 | 32.8 | 9.54 | 23.6
RepNCSPELAN | 68.0 | 35.2 | 9.05 | 26.5
FasterRepNet (ours) | 68.4 | 35.4 | 8.56 | 23.7
FasterDBBNet (ours) | 68.7 | 35.7 | 10.68 | 19.5
Table 3. Performance comparison of FasterGold-DETR with state-of-the-art fire object detectors on our self-built dataset.
Model | mAP@0.5 | mAP@0.5:0.95 | Params (M) | GFLOPs | FPS
S and M Models of YOLO Detectors
YOLOv8-S | 67.1 | 34.4 | 11.12 | 28.4 | 150.2
YOLOv8-M | 68.2 | 35.7 | 25.84 | 78.7 | 102.7
GoldYOLO-S | 67.9 | 33.6 | 21.51 | 46.1 | 143.5
GoldYOLO-M | 71.4 | 37.8 | 41.28 | 87.3 | 112.3
YOLOv9-S | 69.1 | 37.2 | 9.59 | 38.7 | 170.5
YOLOv9-M | 71.3 | 38.1 | 32.55 | 130.7 | 130.8
RT-DETR Detectors
RT-DETR (R18) | 68.6 | 35.5 | 19.87 | 56.9 | 66.3
RT-DETR (R34) | 69.8 | 37.1 | 31.11 | 88.8 | 60.1
FasterGold-DETR (ours) | 71.2 | 37.9 | 15.46 | 46.6 | 75.5
FasterGDSF-DETR (ours) | 71.5 | 38.1 | 16.76 | 39.4 | 72.1
Table 4. Performance comparison of FasterGold-DETR with state-of-the-art fire object detectors on the M4SFWD [43] dataset.
Model | mAP@0.5 | mAP@0.5:0.95 | Params (M) | GFLOPs | FPS
S and M Models of YOLO Detectors
YOLOv8-S | 86.0 | 49.6 | 11.12 | 28.4 | 155.7
YOLOv8-M | 86.6 | 50.2 | 25.84 | 78.7 | 107.2
GoldYOLO-S | 85.1 | 45.9 | 21.51 | 46.1 | 149.6
GoldYOLO-M | 87.3 | 50.8 | 41.28 | 87.3 | 115.4
YOLOv9-S | 87.2 | 50.6 | 9.59 | 38.7 | 169.5
YOLOv9-M | 89.1 | 51.2 | 32.55 | 130.7 | 122.8
RT-DETR Detectors
RT-DETR (R18) | 87.5 | 50.5 | 19.87 | 56.9 | 69.5
RT-DETR (R34) | 88.3 | 51.0 | 31.11 | 88.8 | 62.9
FasterGold-DETR (ours) | 88.1 | 50.9 | 15.46 | 46.6 | 77.2
FasterGDSF-DETR (ours) | 89.3 | 51.4 | 16.76 | 39.4 | 73.4
Table 5. Measurement of ERF characteristics via the high-contribution area ratio r [48].
Model | t = 20% | t = 30% | t = 50% | t = 99%
ResNet-18 | 1.23% | 2.02% | 4.18% | 25.15%
PConv | 0.45% | 0.73% | 1.52% | 14.89%
iRMB [49] | 0.05% | 1.29% | 30.07% | 81.84%
RepNCSPELAN [34] | 1.23% | 2.20% | 5.42% | 47.05%
Unireplknet | 4.85% | 10.46% | 26.74% | 95.06%
FasterRepNet (ours) | 0.79% | 1.37% | 3.08% | 36.56%
FasterDBBNet (ours) | 2.39% | 4.58% | 11.71% | 89.95%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
