1. Introduction
Rapid global urbanization and population agglomeration have catalyzed economic prosperity while simultaneously undermining the resilience of urban public safety systems [1]. One of the most significant resulting challenges is the escalating incidence of urban fires [2]. The intrinsic complexity of urban environments (dense populations, high-rise building clusters, traffic congestion, and heterogeneous topography) renders fire prevention and emergency response exceptionally difficult [3]. Under such conditions, fires can result in catastrophic casualties and substantial property losses. Research indicates that fires are responsible for approximately 50,000 deaths and 170,000 injuries globally each year, with the cumulative fatalities and economic losses attributable to urban fires frequently exceeding those of forest fires [4]. From 2001 to 2020, more than 90 million fire incidents were recorded worldwide [5]. In China alone, a 2018 national fire report documented 237,000 urban fires, resulting in 1407 deaths, 798 injuries, and direct economic losses of 3.67 billion CNY [6]. These statistics highlight the profound threats posed to social stability and public safety, underscoring the urgent need for high-precision, real-time fire detection and perception technologies [7,8].
Addressing this fire safety challenge, traditional technologies such as fire alarm systems and smoke detectors have played a foundational role in urban fire prevention and suppression. However, their inherent technical limitations render them inadequate for meeting current detection and control demands [9,10]. Fire alarm systems are highly susceptible to environmental interference induced by high temperatures, elevated humidity, and equipment aging, causing frequent false alarms and significantly increasing operational costs [11]. Similarly, smoke detectors suffer from limited monitoring coverage and poor environmental adaptability [12]. Their passive triggering mechanisms, which rely on smoke accumulation, often result in delayed detection, and achieving comprehensive coverage necessitates the dense deployment of numerous devices, further escalating system complexity and deployment costs [13].
To overcome the inherent limitations of traditional point-based monitoring technologies, advances in remote sensing satellite technology offer a viable path for wide-area smoke detection [14]. With its broad coverage, real-time monitoring, and relatively low operational costs, satellite imagery has demonstrated significant potential in fire smoke monitoring [15,16]. However, elevated false alarm rates arising from complex background environments remain a critical bottleneck limiting its practical efficacy. Smoke manifests highly dynamic characteristics in terms of scale, intensity, and morphology, significantly compounding the difficulty of accurate identification [17]. Moreover, interfering phenomena such as clouds, fog, and dust exhibit strong similarities to smoke in terms of texture, color, and spectral characteristics, readily causing identification confusion and detection errors [18].
To address the aforementioned technical challenges of accurate fire-related object recognition in complex environments, the rapid advancement of deep learning has provided a robust methodological foundation for fire monitoring algorithms, facilitating the widespread application of object detection methods in fire identification [19,20,21]. These algorithms, trained on large-scale fire image datasets, automatically identify and classify fire-related features, thereby improving detection accuracy. For instance, Barmpoutis et al. [22] integrated Faster R-CNN with multidimensional texture analysis (Linear Dynamic Systems, LDS) for an enhanced fire detection framework. By analyzing the dynamic texture features of candidate regions, this method effectively distinguishes actual fire from spectrally similar background interferences, significantly reducing false positive rates in complex environments. Similarly, Guan et al. [23] employed an improved Mask R-CNN model to process aerial imagery, reconstructing the mask branch to refine the perception of irregular flame boundaries, achieving pixel-level segmentation of forest fires with high precision and providing reliable data support for fire spread estimation.
Although the aforementioned algorithms excel in detection accuracy, their considerable parameter counts and high computational latency often limit their applicability in real-time scenarios [24,25]. In contrast, one-stage object detection algorithms, owing to their superior inference speed, have gradually become the mainstream choice for real-time early-warning tasks [26,27]. Li and Zhao [28] evaluated multiple architectures, including Faster R-CNN and SSD, confirming the superiority of YOLOv3 in balancing detection speed and accuracy, thus establishing its foundational role in real-time warning tasks. To address the deployment challenges of Transformer-based models on low-power devices, Zheng et al. [29] proposed the FTA-DETR framework. By optimizing encoder efficiency and leveraging TensorRT acceleration, this model achieved an accuracy of 98.32% and an inference speed of 76 FPS, demonstrating that end-to-end detectors can simultaneously satisfy high-precision and real-time requirements in edge computing. Furthermore, P et al. [30] investigated false alarms induced by clouds and sunset afterglow by benchmarking YOLOv8, YOLOv11, and YOLOv12 architectures. Their results indicated that YOLOv12 effectively mitigated complex background interference, achieving an accuracy of 97.17%, whereas YOLOv11 demonstrated clear advantages in resource-constrained field deployments owing to its 15 MB lightweight architecture and millisecond-level inference speed.
Despite these advances in real-time performance and robustness to background interference, existing one-stage detectors still exhibit notable deficiencies in feature extraction for extremely small fire sources and morphologically variable smoke [31]. To address this challenge, Wang et al. [8] proposed incorporating dynamic snake convolutions, FSPPF modules, and dedicated small object detection layers to improve feature representation. Similarly, for small-target fire detection, Wang et al. [32] proposed DCSNet, which employs dynamic contextual aggregation and partial cross-stage feature fusion to enhance boundary perceptibility. However, such strategies yield limited performance gains in scenarios involving extremely weak fire sources or optically thin smoke. Moreover, the stacking of complex modules incurs substantial increases in FLOPs and parameter counts, compromising the lightweight advantage of these models. Consequently, mainstream models continue to face significant limitations in detecting small targets and smoke, underscoring an urgent need to improve their generalization capability.
To overcome the inherent trade-off between computational efficiency and detection accuracy in existing deep learning models, this study proposes YOLO-Fire, a high-precision, noise-robust framework specifically designed for fire detection in complex, safety-critical environments. Built upon the YOLOv11 architecture, the proposed method effectively addresses the detection bottlenecks associated with identifying incipient fire sources and diffuse smoke against dynamic background interference.
Figure 1 illustrates the urban fire detection pipeline proposed in this study, encompassing data processing, model architecture design, and multi-scenario early-warning applications. The main contributions of this study are summarized as follows.
- (1) The Hybrid Feature Fusion Module (HFFM) is proposed to address the frequency-domain discrepancy between flame and smoke. By integrating a Dual-Attention Grouped Block (DAGB) and a Parallel Depthwise Block (PDB) in parallel, this module implements a dual-branch feature decoupling strategy that separates the high-frequency boundary features of small fire sources from the low-frequency diffusive textures of smoke, improving feature representation at minimal additional parameter cost.
- (2) The C2f-DCD module is designed by incorporating a Dual-Scale Contextual Diffusion (DCD) mechanism. Leveraging the Split-Transform-Fuse strategy, it captures global environmental context via depthwise dilated convolutions and integrates this into local features through an additive fusion scheme. Unlike multiplicative gating, additive injection preserves the intensity of semi-transparent smoke signals while acting as a semantic regularizer to suppress complex background interference, thereby supporting robust feature discrimination.
- (3) The Gaussian Spatial Pyramid Pooling Fast (GSPPF) module is introduced to handle dynamic morphological variations in fire and smoke targets. By incorporating the GELU activation function for multi-scale feature fusion, this module enlarges the receptive field, enhancing the model’s capacity to capture fire targets across multiple scales, from sub-pixel incipient ignition points to large-scale dispersing smoke plumes.
- (4) Extensive experiments on a self-constructed cross-scenario fire dataset demonstrate that YOLO-Fire achieves state-of-the-art detection performance. The proposed framework integrates data processing, model architecture, and practical deployment, effectively addressing the detection of incipient fire sources and diffuse smoke. It provides high-precision technical support for urban safety monitoring and ecological disaster prevention.
2. Materials and Methods
2.1. Dataset
A high-fidelity and comprehensive dataset serves as the cornerstone for training robust data-driven safety systems. To ensure the generalization capability of YOLO-Fire across complex safety-critical environments, we constructed a large-scale, multi-source fire repository. This process encompasses images captured from diverse platforms including ground-level surveillance cameras, mobile terminals, Unmanned Aerial Vehicles (UAVs), and spaceborne remote sensing satellites, covering a wide spectrum of viewpoints from close-range ground observation to high-altitude remote sensing. Crucially, acknowledging the intensifying risks at the Wildland-Urban Interface (WUI) due to urban expansion, our dataset explicitly incorporates scenarios from these fringe zones. Fire incidents in WUI areas pose direct threats to urban infrastructure and ecological stability. The inclusion of these multi-scale samples ensures that the model possesses reliable detection capabilities not only within concrete building complexes but also in high-risk vegetation areas on the urban periphery, thereby bridging the gap between localized urban safety and broad-scale geospatial monitoring. While UAV and satellite imagery provides invaluable aerial and large-scale perspectives, the scene diversity obtainable solely from these sources is inherently limited; ground-level images captured by surveillance cameras and mobile terminals were therefore also incorporated to ensure comprehensive coverage of diverse urban fire scenarios.
The dataset was constructed through a hybrid collection strategy. A portion of the images was sourced from publicly available fire dataset platforms, including PaddlePaddle AI Studio and similar open-source repositories. However, given the limited scene diversity provided by these public sources, additional urban fire images were further collected from the internet to supplement the dataset, covering a broader range of scenarios including urban building fires, vehicle-induced fires, WUI fires, satellite remote sensing fires, and nighttime low-illumination fires. It should be noted that the UAV and satellite imagery in this dataset was obtained in RGB image format from publicly available platforms and the internet, and therefore does not contain standardized remote sensing metadata such as sensor specifications, spatial resolution, or geographic coordinate information. Although the dataset does not follow a conventional remote sensing data structure with georeferenced imagery, the inclusion of UAV and satellite RGB imagery ensures that the proposed method remains applicable to remote sensing fire monitoring scenarios, extending its applicability from ground-level surveillance to wide-area aerial and satellite observation. To guarantee annotation precision, all images were manually annotated using the Labelme (v5.6.1) tool to ensure consistent annotation quality and labeling standards across the entire dataset. Subsequently, data augmentation techniques were applied during the training phase to expand sample diversity and mitigate overfitting. The final dataset comprises a total of 12,272 finely annotated images, which were randomly partitioned into a training set of 9256 images and a test set of 3016 images. This rigorous dataset partitioning provides a solid foundation for evaluating the model’s performance in real-world applications.
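As a rough illustration of the random partition described above, the following sketch reproduces the reported split sizes (9256 training and 3016 test images out of 12,272, roughly a 75/25 split). The image identifiers, the helper name `split_dataset`, and the fixed seed are assumptions made for the example; the source does not specify how the random split was implemented.

```python
import random

def split_dataset(image_ids, n_test, seed=0):
    """Randomly partition image ids into (train, test) lists."""
    ids = list(image_ids)
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    rng.shuffle(ids)
    return ids[n_test:], ids[:n_test]

# Hypothetical ids standing in for the 12,272 annotated images.
all_ids = [f"img_{i:05d}" for i in range(12272)]
train_ids, test_ids = split_dataset(all_ids, n_test=3016)
print(len(train_ids), len(test_ids))
```

Fixing the seed keeps the partition stable across runs, which matters when several models are compared on the same test images.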
As illustrated in Figure 2, this dataset overcomes the limitations of single-scenario datasets by constructing a highly representative multi-source sample repository. In the spatial dimension, the data encompasses a multi-view distribution ranging from ground-level close-range surveillance to high-altitude satellite remote sensing. In the temporal and environmental dimensions, it incorporates all-weather characteristics, spanning diurnal illumination variations and extreme low-light scenarios. This composition, bridging diverse spatiotemporal scales from ground-level surveillance to wide-area satellite remote sensing, establishes the completeness of the dataset and supports the model’s applicability across both localized and large-scale fire monitoring scenarios. It effectively simulates complex background interferences found in real-world urban environments, thereby significantly enhancing the model’s generalization capability and robustness in practical disaster prevention and mitigation tasks.
2.2. YOLOv11 Model
Driven by continuous architectural evolution in backbone efficiency and feature integration strategies, the YOLO series has cemented its status as the dominant paradigm in real-time object detection. YOLOv11 [33], the latest generation released by Ultralytics, builds upon the efficient design of YOLOv8 [34], further optimizing parameter efficiency and detection accuracy through advanced feature extraction modules and attention mechanisms. It adheres to the classic Backbone-Neck-Head meta-architecture. The Backbone adopts a Cross Stage Partial (CSP) [35]-based design utilizing C3k2 (Faster Implementation of CSP Bottleneck with 2 convolutions) modules for hierarchical feature extraction. The Spatial Pyramid Pooling-Fast (SPPF) module, distinguished from the traditional Spatial Pyramid Pooling (SPP) [36] design, employs a serial cascade of three max-pooling layers to efficiently expand the spatial receptive field while maintaining fast inference. A C2PSA (CSP with Position-Sensitive Attention, for enhanced feature extraction and processing) module is introduced at the end of the backbone to capture deep global semantic information. The Neck retains the classic PAN-FPN (Path Aggregation Network with Feature Pyramid Network) [37] architecture, achieving bidirectional aggregation of multi-scale features through top-down upsampling and bottom-up convolutional pathways, with C3k2 modules applied at feature fusion nodes to ensure representational capability and gradient flow stability. The Head adopts the Decoupled Head [38] paradigm with an Anchor-free [39] strategy, directly predicting target centers and boundary distances, jointly optimized using DFL (Distribution Focal Loss) [40] and CIoU Loss [41] for robust and precise bounding box localization.
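To make the box regression term more concrete, the sketch below computes the Complete IoU (CIoU) between two boxes following its commonly used definition: IoU minus a normalized center-distance penalty minus an aspect-ratio consistency term. The `(x1, y1, x2, y2)` box format, the function names, and the `eps` stabilizer are assumptions made for illustration; this is not the Ultralytics implementation.

```python
import math

def ciou(box_a, box_b, eps=1e-9):
    """Complete IoU between two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection and union areas
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    wa, ha = ax2 - ax1, ay2 - ay1
    wb, hb = bx2 - bx1, by2 - by1
    iou = inter / (wa * ha + wb * hb - inter + eps)
    # Squared center distance over squared diagonal of the enclosing box
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2 + eps
    rho2 = ((ax1 + ax2 - bx1 - bx2) ** 2 + (ay1 + ay2 - by1 - by2) ** 2) / 4.0
    # Aspect-ratio consistency term and its trade-off weight
    v = (4.0 / math.pi ** 2) * (math.atan(wb / (hb + eps)) - math.atan(wa / (ha + eps))) ** 2
    alpha = v / (1.0 - iou + v + eps)
    return iou - rho2 / c2 - alpha * v

def ciou_loss(pred, target):
    """Loss is 1 - CIoU, so perfectly matching boxes give a loss near 0."""
    return 1.0 - ciou(pred, target)

print(ciou_loss((0, 0, 2, 2), (0, 0, 2, 2)))
print(ciou_loss((0, 0, 1, 1), (2, 2, 3, 3)))
```

Unlike plain IoU, the loss keeps growing as non-overlapping boxes move apart, which gives a useful gradient for very small targets that initially have no overlap with the ground truth.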
2.3. Overall Architecture
Although YOLOv11 performs exceptionally well in general object detection tasks, its standard architecture faces challenges in simultaneously addressing the extremely minute scale of early flames and the semi-transparent, blurry characteristics of diffused smoke in early fire warning scenarios. To overcome these challenges, we propose an enhanced detection architecture tailored to the specific characteristics of fire scenarios, as illustrated in Figure 3. While retaining the efficient inference capabilities of YOLOv11, this architecture performs a strategic restructuring of the critical feature extraction and fusion stages.
First, to resolve the feature discrepancy inherent in processing two distinct targets, “fire” and “smoke”, using traditional fusion methods, we propose the HFFM in the Neck stage. Moving beyond simple feature concatenation, HFFM adopts a parallel dual-stream paradigm for feature decoupling. Specifically, the PDB branch focuses on preserving high-frequency, point-like features, ensuring that early tiny flames are not smoothed out in deep layers. Conversely, the DAGB branch specializes in enhancing low-frequency, large-scale semi-transparent textures to capture the amorphous morphology of smoke. This fusion strategy ensures that the model balances the representation capabilities for both fine-grained fire and diffused smoke during the feature aggregation stage, improving detection robustness in complex fire scenes.
Secondly, addressing the challenge where smoke features are highly coupled with complex backgrounds (e.g., clouds and lighting interference) and tiny fire sources are prone to being overwhelmed by environmental noise (e.g., urban lights and strong reflections) due to pixel scarcity, we introduce the C2f-DCD module into the Neck, replacing the original C2f structure. Through an explicit contextual diffusion mechanism, C2f-DCD constrains local texture extractors to perform feature discrimination conditional on global environmental information. This design not only enhances the network’s capability to distinguish blurry smoke boundaries under low-contrast conditions but also provides critical “semantic verification” for tiny objects. By relying on background logic, it effectively suppresses false interferences resembling fire points, thereby significantly improving the recall rate for extremely small fire sources.
Finally, to adapt to the extreme scale variations ranging from local fire points to large-area dense smoke, a GSPPF module is deployed at the end of the backbone. Utilizing an optimized pooling strategy and the GELU [42] activation function, this module aggregates multi-scale global context with negligible computational overhead, expanding the effective receptive field of deep features.
It is worth noting that the three proposed modules are not only tailored for general urban fire detection but are also specifically designed to address the unique challenges inherent in remote sensing scenarios. In UAV and satellite imagery, fire targets often manifest as sub-pixel ignition points embedded in complex terrain backgrounds, while smoke plumes may span large spatial extents with highly variable morphology. The GSPPF module directly addresses the large-scale scene challenge by aggregating multi-scale receptive fields to capture both localized fire points and wide-area smoke diffusion patterns. The HFFM module tackles the small target detection challenge in remote sensing by preserving high-frequency boundary details of tiny ignition points through the PDB branch, preventing their erosion during deep feature extraction. The C2f-DCD module further enhances robustness in remote sensing scenarios by injecting global terrain context into local feature extraction, effectively suppressing complex background interference such as cloud formations and varied land cover textures that are characteristic of aerial and satellite imagery.
2.3.1. Hybrid Feature Fusion Module
Conventional feature aggregation methods typically couple global and local features within a single processing stream. This design limits the network’s ability to recalibrate blurry smoke features in uncontrolled environments and hinders its flexibility in responding to drastic changes in target scale [43]. To break this coupling limitation between global and local representations, we designed the Hybrid Feature Fusion Module (HFFM), as shown in Figure 4. It adopts a parallel dual-stream cooperative framework aimed at decoupling feature recalibration from contextual aggregation.
Given an input feature tensor $X$, HFFM diverts the information flow into two parallel branches: the Dual-Attention Grouped Block (DAGB) for texture refinement and the Parallel Depthwise Block (PDB) for multi-scale perception. These heterogeneous features are subsequently integrated through a fusion layer to produce the refined output $Y$. The overall formulation is expressed as:
$$Y = X + \mathrm{Conv}_{1\times 1}\big(\mathrm{Concat}\big(\mathcal{F}_{\mathrm{DAGB}}(X),\ \mathcal{F}_{\mathrm{PDB}}(X)\big)\big)$$
where $\mathrm{Conv}_{1\times 1}$ denotes a 1 × 1 convolution followed by Batch Normalization and the SiLU activation function, used to project the concatenated features back to the original channel dimension. $\mathcal{F}_{\mathrm{DAGB}}(\cdot)$ represents the operation function of the DAGB branch, and $\mathcal{F}_{\mathrm{PDB}}(\cdot)$ represents the operation function of the PDB branch. $X$ and $Y$ denote the input and output feature tensors, respectively. We employ a residual connection to facilitate gradient flow and prevent network degradation.
Dual-Attention Grouped Block (DAGB). Smoke detection in uncontrolled environments faces a unique challenge: smoke features are typically blurry and lack a rigid geometric structure, resulting in high inter-class similarity with the background and low intra-class variance. To address this, we drew on the design of CBAM [44] and proposed DAGB, which employs a “Coarse-to-Fine” recalibration strategy to progressively purify features before extraction.
The process begins with Channel-Level Recalibration. Since standard convolutions treat all channels equally, they struggle to distinguish informative smoke textures from background noise. We employ the “Squeeze-and-Excitation” (SE) mechanism [45] to explicitly model channel dependencies. By aggregating global spatial information into a channel descriptor vector $z$ via Global Average Pooling (GAP), we capture the global distribution of features. This descriptor is then mapped to a set of channel weights $s$ through a bottleneck Multi-Layer Perceptron (MLP), enabling the network to selectively enhance smoke-relevant channels while suppressing noise:
$$z = \mathrm{GAP}(X), \qquad s = \sigma\big(W_2\,\delta(W_1 z)\big), \qquad X_c = s \otimes X$$
where $\delta$ and $\sigma$ denote the SiLU and Sigmoid functions, respectively. $W_1$ and $W_2$ represent the learnable weight parameters for the dimensionality reduction and expansion layers, respectively; $z$ is the channel descriptor vector obtained via Global Average Pooling (GAP), and $s$ is the generated set of channel weights used to selectively enhance smoke-relevant channels. $\otimes$ represents element-wise multiplication (Hadamard product), used to reweight the input feature $X$ into the channel-recalibrated feature $X_c$.
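The channel-level recalibration can be sketched in plain Python on a toy tensor. The sketch follows the squeeze (GAP), excitation (bottleneck MLP with SiLU then Sigmoid), and reweighting steps described above; the tensor size, the reduction ratio of 2, and all weight values are illustrative assumptions rather than values from the paper.

```python
import math

def silu(x):
    return x / (1.0 + math.exp(-x))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(w, v):
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]

def channel_recalibrate(x, w1, w2):
    """x: C x H x W nested lists; w1: (C/r) x C; w2: C x (C/r)."""
    # Squeeze: global average pooling gives one descriptor per channel
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # Excitation: bottleneck MLP with SiLU, then Sigmoid to get weights in (0, 1)
    hidden = [silu(h) for h in matvec(w1, z)]
    s = [sigmoid(u) for u in matvec(w2, hidden)]
    # Reweight every channel by its scalar weight (broadcast over H and W)
    out = [[[s[c] * val for val in row] for row in ch] for c, ch in enumerate(x)]
    return out, s

# Toy input: 4 channels of 2 x 2, reduction ratio 2 (all values illustrative)
x = [[[float(c + 1)] * 2 for _ in range(2)] for c in range(4)]
w1 = [[0.1, -0.2, 0.3, 0.0], [0.0, 0.2, -0.1, 0.1]]
w2 = [[0.5, -0.5], [0.2, 0.1], [-0.3, 0.4], [0.0, 0.0]]
out, s = channel_recalibrate(x, w1, w2)
print([round(v, 3) for v in s])
```

Because each weight is a single scalar per channel, this step can emphasize or suppress whole feature maps but cannot localize a target; that is the role of the spatial stage that follows.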
Immediately following, we perform Spatial-Level Recalibration to localize targets. A lightweight 3 × 3 convolution compresses the channel dimension of the weighted tensor $X_c$ to generate a spatial probability map $M_s$. This map acts as a spatial filter, suppressing background clutter and focusing computational attention on salient regions:
$$M_s = \sigma\big(\mathrm{Conv}_{3\times 3}(X_c)\big), \qquad X_{cs} = M_s \otimes X_c$$
here, $M_s$ is the generated spatial attention mask, with $\sigma$ the Sigmoid function; $X_{cs}$ is the final feature map after dual recalibration, and $X_c$ is the feature after channel-level recalibration.
Finally, to efficiently encode these refined features, we utilized Grouped Convolution with the group number set to . Unlike standard convolution, grouped convolution isolates information flow within channel subsets. This design not only significantly reduces parameter redundancy but also imposes a regularization effect, forcing the network to learn diagonal correlations and preventing the destruction of the carefully recalibrated attention features.
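The parameter saving from grouped convolution follows directly from the weight-count formula, since each output channel only connects to the input channels of its own group. The sketch below compares a standard and a grouped 3 × 3 convolution; the channel width of 128 and the group count of 4 are purely illustrative assumptions and are not taken from the paper.

```python
def conv_params(c_in, c_out, k, groups=1, bias=False):
    """Weight count of a 2-D convolution with the given number of groups."""
    assert c_in % groups == 0 and c_out % groups == 0
    n = (c_in // groups) * c_out * k * k
    return n + (c_out if bias else 0)

standard = conv_params(128, 128, 3)           # dense channel mixing
grouped = conv_params(128, 128, 3, groups=4)  # mixing only inside each group
print(standard, grouped, standard // grouped)
```

With g groups the weight count shrinks by a factor of g, at the cost of no information exchange between groups inside that layer.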
Parallel Depthwise Block (PDB). A critical bottleneck in detecting distant smoke or small fire points is the dilemma of scale variation: standard downsampling operations often erode the details of small targets, while maintaining high resolution limits the perceptual range required to distinguish smoke from background noise. PDB solves this problem by adopting a multi-branch topology that processes local and global information simultaneously.
To ensure computational efficiency, the input is first projected into a low-dimensional space $X_r$ (with a reduced channel count). The feature flow is subsequently split into two parallel branches. The Detail Branch is designed to preserve high-frequency information. It utilizes a standard Depthwise Convolution. By performing spatial filtering independently on each channel, this branch excels at maintaining the structural integrity of edges and boundaries, which is crucial for delineating small targets. Simultaneously, the Wide-Field Branch is designed for contextual verification. It utilizes a 3 × 3 Depthwise Dilated Convolution with a dilation rate of d = 2. This operation inserts “holes” into the convolution kernel, effectively expanding the receptive field to 5 × 5 without increasing parameters or reducing resolution. This enables the network to integrate surrounding environmental context to verify the presence of smoke or targets:
$$F_d = \mathrm{DWConv}(X_r), \qquad F_w = \mathrm{DWConv}_{d=2}(X_r)$$
In this formula, $X_r$ represents the intermediate feature after dimensionality reduction; $F_d$ and $F_w$ respectively represent the captured local high-frequency detail features and the wide-field large receptive field features.
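The 5 × 5 figure quoted for the Wide-Field Branch follows from the standard effective-kernel formula k + (k − 1)(d − 1). The short sketch below checks it for a 3 × 3 kernel with dilation 2 and lists which offsets a 1-D slice of the kernel actually samples; the function names are illustrative.

```python
def effective_kernel(k, dilation):
    """Spatial extent covered by a k-tap kernel with the given dilation."""
    return k + (k - 1) * (dilation - 1)

def dilated_taps(k, dilation):
    """Offsets (relative to the center) sampled by a 1-D dilated kernel."""
    half = k // 2
    return [i * dilation for i in range(-half, half + 1)]

print(effective_kernel(3, 1), effective_kernel(3, 2))
print(dilated_taps(3, 2))
# A depthwise 3 x 3 kernel keeps 9 weights per channel for any dilation.
weights_per_channel = 3 * 3
```

The dilated kernel spans five positions but reads only three of them, so the wider view comes with gaps; pairing it with the dense Detail Branch covers the positions the dilated kernel skips.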
To synergize these complementary features, the outputs of the two branches are concatenated and fused via a 1 × 1 convolution. This fusion step dynamically aggregates local details and global context, ensuring that small targets are not lost in the background while possessing a sufficiently broad field of view for correct classification:
$$Y_{\mathrm{PDB}} = \mathrm{Conv}_{1\times 1}\big(\mathrm{Concat}(F_d, F_w)\big)$$
where $Y_{\mathrm{PDB}}$ denotes the final output feature of the PDB, $F_d$ and $F_w$ are the outputs of the Detail Branch and the Wide-Field Branch, and $\mathrm{Concat}(\cdot)$ represents the concatenation operation along the channel dimension.
2.3.2. Dual-Scale Contextual Diffusion
Traditional Convolutional Neural Networks (CNNs) often face an inherent contradiction between “Receptive Field Expansion” and “Local Detail Preservation” during feature extraction. Existing multi-scale fusion methods, such as Inception [46] modules or ASPP [47], typically employ parallel branch structures to extract features independently, performing simple aggregation only at the terminal stage. This “Late Fusion” strategy severs the interaction potential between features of different scales during the extraction process, leading to a lack of global semantic guidance for local feature extraction.
To address this issue, we propose the Dual-Scale Contextual Diffusion (DCD) module. Unlike traditional parallel designs, DCD introduces a novel Cascade-style Feature Diffusion Mechanism. By explicitly injecting long-range contextual information into the input of the local branch, we ensure that the extraction of local textures is constrained by the “condition” of global environmental knowledge. This design significantly improves the suppression of fire-like interferences (e.g., urban neon lights, strong reflections) and the disentanglement of semi-transparent smoke from complex backgrounds without adding significant computational burden. We replaced the Bottleneck part in the C2f structure with the DCD module to further enhance model performance.
As shown in Figure 5, the DCD module adheres to the lightweight “Split-Transform-Fuse” paradigm but reconstructs the branch interaction logic. Given an input feature map $X$, it is first projected to a hidden layer dimension via a 1 × 1 convolution and decoupled in the channel dimension into two independent data streams: the Context Branch ($X_{\mathrm{ctx}}$) and the Local Branch ($X_{\mathrm{loc}}$).
Context Perception. To capture environmental context with low parameter overhead, the Context Branch uses a Depthwise Dilated Convolution with a Dilation Rate of 2. Compared to standard convolution, this design physically expands the receptive field to a 5 × 5 range without increasing computational cost. The objective of this branch is to extract low-frequency, large-scale semantic features (such as amorphous diffusion trends of smoke or large-scale environmental textures like buildings), denoted as $F_{\mathrm{ctx}}$:
$$F_{\mathrm{ctx}} = \mathrm{BN}\big(\mathrm{DWConv}_{3\times 3}^{d=2}(X_{\mathrm{ctx}})\big)$$
where $F_{\mathrm{ctx}}$ represents the output feature of the Context Branch. $\mathrm{BN}$ denotes Batch Normalization. $\mathrm{DWConv}_{3\times 3}^{d=2}$ refers to the 3 × 3 depthwise dilated convolution with a dilation rate of 2. $X_{\mathrm{ctx}}$ is the input feature for the context perception branch.
Semantic Diffusion Mechanism. This is the core innovation of the DCD module. Distinct from isolated parallel extraction, we constructed a unidirectional information pathway from context to local. We treat the output of the Context Branch, $F_{\mathrm{ctx}}$, as a Spatial Prior, explicitly “diffusing” and injecting it into the input end of the Local Branch via Element-wise Addition:
$$\tilde{X}_{\mathrm{loc}} = X_{\mathrm{loc}} + F_{\mathrm{ctx}}$$
here, $\tilde{X}_{\mathrm{loc}}$ represents the local branch feature after the injection of semantic diffusion, and $X_{\mathrm{loc}}$ is the initial input feature of the local branch.
Mathematically, this “diffusion” operation is equivalent to introducing a dynamic Spatial Bias. It enables the local branch to possess the “base color” (global context) of the environment before processing. This design ensures that subsequent local feature extraction is no longer blind but performs a targeted search for object textures that contrast strongly with the background under known environmental context conditions. For tiny fire sources, this implies the network can filter out isolated noise based on environmental logic; for thin smoke, this aids in enhancing foreground feature contrast within turbid atmospheres.
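The "spatial bias" reading can be checked numerically: because convolution is linear, filtering the sum of the local input and the injected context equals the filtered local input plus a position-dependent offset contributed by the context. The 1-D sketch below is a simplification (single channel, no normalization or activation), and all signal and kernel values are made up for illustration.

```python
def conv1d_same(x, kernel):
    """Zero-padded 1-D correlation that keeps the input length."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(x) + [0.0] * half
    return [sum(k * padded[i + j] for j, k in enumerate(kernel)) for i in range(len(x))]

x_loc = [0.0, 0.1, 0.9, 0.1, 0.0, 0.0]   # local input with one sharp peak
f_ctx = [0.2, 0.2, 0.3, 0.3, 0.2, 0.2]   # smooth context prior
kernel = [-1.0, 2.0, -1.0]               # edge-sensitive local filter

# Filtering after additive injection ...
injected = conv1d_same([a + b for a, b in zip(x_loc, f_ctx)], kernel)
# ... equals the plain local response plus a context-dependent bias.
bias = conv1d_same(f_ctx, kernel)
separate = [a + b for a, b in zip(conv1d_same(x_loc, kernel), bias)]
print(all(abs(a - b) < 1e-12 for a, b in zip(injected, separate)))
```

The local peak is carried through unscaled, whereas a multiplicative gate would rescale it by the gate value, which is the property the additive design relies on for faint smoke signals.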
Local Refinement & Reconstruction. The context-modulated local feature $\tilde{X}_{\mathrm{loc}}$ is subsequently processed by a standard Depthwise Convolution, focusing on capturing the high-frequency edge details of early tiny flames, generating $F_{\mathrm{loc}}$. Finally, to achieve a unified representation of multi-scale information, we perform channel Concatenation on the outputs of both scales. To further enhance the model’s spatial modeling capability for irregular objects, we introduced the Funnel ReLU (FReLU) [48] activation function before the final linear projection:
$$Y = X + \mathrm{Conv}_{1\times 1}\big(\phi\big(\mathrm{Concat}(F_{\mathrm{ctx}}, F_{\mathrm{loc}})\big)\big)$$
where $F_{\mathrm{ctx}}$ is the feature output by the context branch and $F_{\mathrm{loc}}$ is the feature output by the local branch. $Y$ is the final output feature of the DCD module. $X$ represents the original input of the DCD module, used for the residual connection. $\phi$ represents the FReLU activation function. Unlike the scalar activation of ReLU [49], FReLU integrates a pixel-level Spatial Condition Window, which complements the dual-scale interaction features of DCD, maximizing the discriminability of fire targets.
2.3.3. Gaussian Spatial Pyramid Pooling Fast
To bolster the non-linear representational capacity of the YOLOv11 model during multi-scale feature fusion and to optimize the aggregation of deep features, we propose an improved spatial pyramid pooling module, Gaussian Spatial Pyramid Pooling Fast (GSPPF), as shown in Figure 3. While retaining the efficient receptive field expansion capabilities of the original SPPF, this module replaces the original SiLU activation function with the Gaussian Error Linear Unit (GELU), constructing a composite convolutional unit comprising Convolution, Batch Normalization, and GELU. Compared to ReLU and its variants, GELU introduces the concept of stochastic regularization. By multiplying the input by the Cumulative Distribution Function (CDF) of the standard normal distribution, it achieves activation with smoother gradient propagation characteristics. This property helps mitigate the gradient vanishing problem and accelerates model convergence.
In this study, we adopt the tanh-based approximation for the GELU implementation to balance computational precision and speed. Its mathematical expression is defined as follows:
$$\mathrm{GELU}(x) \approx 0.5\,x\left(1 + \tanh\left[\sqrt{2/\pi}\,\left(x + 0.044715\,x^{3}\right)\right]\right)$$
where $x$ represents the input feature value, and $\tanh$ is the hyperbolic tangent function. The constant $\sqrt{2/\pi}$ and the coefficient $0.044715$ are fitting parameters introduced to approximate the CDF of the standard normal distribution.
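As a quick sanity check on the approximation, the sketch below compares the tanh-based form against the exact GELU, x·Φ(x), computed with the error function. The sampling range of [−6, 6] and the step of 0.01 are arbitrary choices for the example.

```python
import math

def gelu_exact(x):
    """x times the standard normal CDF, via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh-based approximation with the constants sqrt(2/pi) and 0.044715."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

xs = [i / 100.0 for i in range(-600, 601)]
max_err = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f"max abs error on [-6, 6]: {max_err:.2e}")
```

The two curves stay very close over this range, which is why the cheaper tanh form is a common drop-in for the exact definition.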
The specific operational workflow of GSPPF begins by passing the input feature map through a 1 × 1 convolutional layer activated by GELU, which compresses the channel dimension to C/2 to reduce parameter redundancy. The dimensionality-reduced feature map then undergoes three consecutive Max Pooling operations with a 5 × 5 kernel and a stride of 1. To achieve comprehensive multi-scale perception, the original projected feature is concatenated channel-wise with the outputs of the three pooling layers, effectively fusing information from four distinct receptive field scales. Finally, the aggregated features are processed through a concluding 1 × 1 composite convolutional layer to restore the channel count to the target dimension. This design enables GSPPF to capture finer contextual features of targets while preserving computational efficiency, thereby bolstering the model’s detection performance in complex backgrounds.
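The claim that three cascaded 5 × 5 poolings yield several receptive field scales can be illustrated in one dimension: with stride 1, two stacked size-5 max-pools behave like a single size-9 pool, and three like a size-13 pool. The sketch below also tracks the channel bookkeeping (C/2 hidden channels, four branches concatenated). The helper names, the toy signal, and the example channel width of 512 are assumptions for illustration.

```python
def maxpool1d(x, k):
    """Stride-1 max pooling with 'same' padding (padding never wins the max)."""
    half = k // 2
    n = len(x)
    return [max(x[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]

signal = [0, 3, 1, 0, 0, 7, 0, 0, 2, 0, 0, 0, 5, 1, 0, 0]
p1 = maxpool1d(signal, 5)   # receptive field 5
p2 = maxpool1d(p1, 5)       # equivalent to a single pool of size 9
p3 = maxpool1d(p2, 5)       # equivalent to a single pool of size 13
print(p2 == maxpool1d(signal, 9), p3 == maxpool1d(signal, 13))

def gsppf_channels(c_in, c_out):
    """Channel counts along the path: reduce to C/2, concat 4 branches, project."""
    hidden = c_in // 2
    return {"hidden": hidden, "concat": 4 * hidden, "out": c_out}

print(gsppf_channels(512, 512))
```

Reusing the previous pooling output instead of running separate 5, 9, and 13 pools is what keeps the serial design cheap while still exposing four scales to the final 1 × 1 convolution.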
3. Experiments and Results
3.1. Experimental Setup
All experiments were conducted on a Linux system equipped with a single NVIDIA A800 GPU, while inference latency and FPS were evaluated on a CPU (AMD Ryzen 7 6800H) using ONNX Runtime with a single thread to simulate resource-constrained environments. For latency measurement, each model was first warmed up for 20 iterations, followed by 200 inference runs, and the reported latency was computed as the average over all runs. We utilized the Stochastic Gradient Descent (SGD) optimizer for all models, configured with a momentum of 0.937, a weight decay of 0.0005, and a uniform initial learning rate of 0.01. Given that the dataset contains a large number of small targets, including sub-pixel ignition points and diffused smoke with blurred boundaries, the input image resolution for all models was standardized to 1024 × 1024 pixels to preserve fine-grained target details and improve small object detection accuracy. This unified resolution setting was applied equally to all compared models to ensure strictly consistent experimental conditions and guarantee the fairness and reproducibility of the comparative evaluation. The training duration was set to 200 epochs with a batch size of 16. To bolster model generalization, a suite of data augmentation strategies—including Mosaic, random erasing, and multi-scale scaling—was employed during the training phase. Specifically, to facilitate the model’s ultimate convergence to the real-world data distribution, Mosaic augmentation was deactivated for the final 10 epochs. Conversely, to ensure the objectivity and reliability of the evaluation, no data augmentation was applied during the validation phase, thereby accurately measuring the model’s true generalization performance.
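The latency protocol (20 warm-up iterations followed by 200 timed runs, averaged) can be sketched as a small timing harness. Here `run_once` is a hypothetical callable standing in for one forward pass of an exported model; the exact measurement script used in the experiments is not given in the text, so this is only one plausible way to implement the described procedure.

```python
import time
from statistics import mean


def measure_latency_ms(run_once, warmup=20, runs=200):
    """Average wall-clock latency in milliseconds of `run_once`.

    The warm-up calls are executed but not timed, so one-off costs
    (memory allocation, cache population) do not skew the average.
    """
    for _ in range(warmup):
        run_once()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    return mean(samples)


def fps_from_latency(latency_ms):
    """Frames per second implied by a per-image latency in milliseconds."""
    return 1000.0 / latency_ms


if __name__ == "__main__":
    # Placeholder workload; in practice this would wrap a single
    # inference call on a fixed 1024 x 1024 input.
    dummy = lambda: sum(range(10_000))
    latency = measure_latency_ms(dummy)
    print(f"{latency:.3f} ms, {fps_from_latency(latency):.1f} FPS")
```

Under this definition, FPS is simply the reciprocal of the mean latency, so the two reported figures carry the same information in different units.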
To comprehensively evaluate the proposed method, we employed a total of five metrics categorized into performance and complexity domains. For detection performance, the Mean Average Precision at an IoU threshold of 0.5 (mAP50) and across the 0.5–0.95 range (mAP50-95), alongside the F1-score, were utilized to quantify detection accuracy and the equilibrium between precision and recall. Regarding model complexity, the number of Parameters and Giga Floating-point Operations (GFLOPs) were calculated to measure the model’s parameter scale and hardware computational overhead, respectively. Furthermore, to evaluate real-world deployment feasibility in resource-constrained environments, Average Inference Latency (ms) and Frames Per Second (FPS) were additionally measured on a single-core CPU using ONNX Runtime (v1.19.2) with a single thread. All comparative models were trained under these identical configurations to ensure a fair and rigorous comparison.
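For reference, the two building blocks behind these accuracy metrics are the Intersection over Union (IoU) between a predicted and a ground-truth box, and the F1-score as the harmonic mean of precision and recall. The sketch below assumes boxes in (x1, y1, x2, y2) corner format, which is an illustrative convention rather than one stated in the text.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, gt_box, threshold=0.5):
    """A prediction counts as a true positive at the given IoU threshold
    (0.5 for mAP50; 0.5 to 0.95 in steps for mAP50-95)."""
    return iou(pred_box, gt_box) >= threshold


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
    print(f1_score(0.5, 1.0))
```

Because mAP50-95 averages over progressively stricter IoU thresholds, it rewards tight box regression more strongly than mAP50 does.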
3.2. Detailed Performance Analysis of the Proposed Method
To rigorously validate the operational reliability of YOLO-Fire, we conducted comprehensive quantitative evaluations on the urban fire test set.
Table 1 corroborates that the proposed model exhibits robust detection capabilities across distinct fire categories. The model achieved an overall mAP50 of 75.7%, an mAP50-95 of 53.3%, and an F1-score of 73.7%. Notably, these metrics were attained with a parameter count of only 10.02 M, indicating that the model strikes an optimal balance between high-precision perception and lightweight structural efficiency, making it suitable for resource-constrained monitoring devices. Specifically, performance on the “Fire” category is particularly outstanding, yielding an mAP50 of 84.5% and an F1-score of 80.4%. This high precision validates the model’s ability to accurately lock onto incipient ignition points characterized by distinct visual signatures. Conversely, while smoke detection inherently presents greater physical challenges due to its semi-transparent and amorphous nature, the model still maintained a robust mAP50 of 67.0% and an F1-score of 66.9%. This demonstrates that the introduced improvement modules effectively enhance feature extraction capabilities for weak-boundary targets, significantly reducing the risk of missed detections in early fire stages.
In safety-critical urban environments, detection faces substantial challenges: sub-pixel fire spots often suffer from feature paucity, while tenuous smoke exhibits low contrast against chaotic backgrounds, easily triggering missed detections or false alarms. To intuitively demonstrate the model’s generalization capability across diverse operational scenarios, representative detection results are visualized in
Figure 6.
As illustrated in
Figure 6a,b, the model successfully identified small-scale ignition points within high-rise buildings and industrial zones. Despite these targets occupying minimal pixel areas, the model accurately localized the anomalies, thereby validating the effectiveness of the HFFM in handling multi-scale variations. Subsequently,
Figure 6c,d showcase the detection performance for thin, lingering smoke. Even when confronted with low-contrast interference from complex backgrounds (e.g., gray concrete structures), the model effectively overcame the semi-transparent characteristics of smoke to precisely localize smoke plumes. Furthermore,
Figure 6e,f confirm the applicability of the proposed method in satellite remote sensing. Confronted with complex terrain textures and macro-scale backgrounds, the model accurately identified fire spread and smoke diffusion from an overhead perspective, demonstrating its immense potential in wide-area disaster early warning tasks.
3.3. Quantitative Comparison with Mainstream Models
To comprehensively evaluate the superiority of the proposed method, we compared it against a variety of advanced object detection models under identical experimental conditions. The comparative baselines encompass SSD, RetinaNet, the classic two-stage detector Faster R-CNN, the Transformer-based RT-DETR, RF-DETR, the mainstream YOLO series models, as well as the state-of-the-art (SOTA) models for fire detection, YOLO-FireAD and FireSmoke-YOLO. For the YOLO series, with the exception of YOLOv3, we uniformly selected the lightweight “s” (small) variants for a fair comparison. The detailed experimental results are presented in
Table 2 and
Table 3.
Overall Performance Analysis. As demonstrated in the comprehensive performance evaluation in
Table 2, the proposed model comprehensively surpasses all comparative models across key metrics. Specifically, compared to FireSmoke-YOLO, which ranked second in overall performance, our method achieved all-around improvements: mAP50 increased by 2.4%, mAP50-95 by 3.9%, and the F1-score by 2.2%. Notably, our model is the only one among all comparative experiments to break the 50% threshold for mAP50-95. This consistent superiority across all indicators fully demonstrates that the improved modules not only elevate the upper bound of feature extraction but also significantly optimize the equilibrium between precision and recall.
Regarding fire detection performance shown in
Table 3, further improving performance is inherently challenging as the baseline metrics of most advanced models have approached performance saturation. Despite this bottleneck, our model still achieves the best performance on every metric except mAP50, where it reaches 84.5%, only 0.2% lower than YOLO-FireAD. In addition, it attains an mAP50-95 of 58.9% and an F1 score of 80.4%, outperforming the second-ranked YOLOv10 and FireSmoke-YOLO by margins of 1.2% and 0.6%, respectively. This indicates that even in high-precision scenarios, our method performs bounding box regression more precisely than existing models and effectively reduces false positives.
In the significantly more challenging task of smoke detection, our advantage is even more pronounced. The mAP50 and F1 score are improved by 1.3% and 2.5%, respectively, compared with the second-ranked RF-DETR, while the mAP50-95 is improved by 6.2% compared with FireSmoke-YOLO. Meanwhile, compared with the baseline model, the three metrics have improved by 4.8%, 7.3%, and 4.8% respectively. This substantial performance leap indicates that our model successfully overcomes the feature extraction difficulties caused by the semi-transparency and blurred edges of smoke.
Computational Complexity Analysis. In terms of computational complexity (as shown in
Table 2), our model maintains highly competitive efficiency while achieving the highest accuracy. Although the parameter count (10.02 M) is marginally higher than that of the lightweight YOLOv11 (9.41 M), this slight increase is well justified by the significant breakthroughs in accuracy. Regarding comparisons with specialized fire detection models, while our parameter count is slightly higher than that of YOLO-FireAD (9.07 M), our model achieves substantially lower computational overhead with FLOPs of 22.0 G, reducing computational cost by 10.4 G compared to YOLO-FireAD, and both metrics remain significantly below those of FireSmoke-YOLO (29.21 M parameters and 87.9 G FLOPs), demonstrating a more favorable efficiency profile. Furthermore, compared to the widely used YOLOv8 (11.13 M), RT-DETR (31.99 M) and RF-DETR (27.23 M), our model exhibits a superior Parameter-Accuracy Trade-off, making it more suitable for deployment in real-world scenarios with stringent performance requirements.
Regarding inference efficiency, all models were evaluated on a single-core CPU to simulate resource-constrained deployment environments. Faster R-CNN, due to its two-stage architecture involving region proposal generation and subsequent classification, is inherently unsuitable for real-time deployment and was therefore excluded from latency comparison. Among the remaining models, YOLO-Fire achieves an average latency of 57.86 ms and an inference speed of 17.28 FPS, maintaining competitive real-time performance comparable to mainstream lightweight models such as YOLOv11 (45.73 ms, 21.87 FPS) and YOLOv8 (48.23 ms, 20.73 FPS), while delivering substantially superior detection accuracy. Compared to specialized fire detection models, YOLO-Fire demonstrates a decisive advantage in inference efficiency, as YOLO-FireAD incurs a latency of 138.69 ms (7.21 FPS) and FireSmoke-YOLO reaches as high as 448.30 ms (2.23 FPS). Transformer-based models similarly exhibit prohibitively high latency, with RT-DETR at 209.52 ms and RF-DETR at 256.42 ms, rendering them impractical for resource-constrained scenarios. These results demonstrate that YOLO-Fire strikes a favorable balance between detection accuracy and inference efficiency, making it well-suited for practical real-time urban fire safety surveillance applications.
3.4. Qualitative Comparison of Detection Results
To further qualitatively validate the superiority of the proposed method, we conducted a comprehensive visual comparison against mainstream object detection models using identical test samples. As illustrated in
Figure 7, five representative and challenging scenarios were selected for this analysis.
Analysis of Early and Lightweight Detectors. The visualization results reveal that early or lightweight detectors (e.g., SSD, RetinaNet, Faster R-CNN, YOLOv3-tiny, and YOLOv5) exhibit significant limitations in feature recall. Constrained by weaker feature extraction capabilities, these models are prone to severe missed detections. Specifically, in
Figure 7a, all the above-mentioned models failed to detect the target, while in
Figure 7b, all models except Faster R-CNN were unable to fully detect the thin and semi-transparent smoke, demonstrating their difficulty in effectively perceiving targets under low-contrast backgrounds. Furthermore, in the complex industrial fire scene of
Figure 7e, SSD, RetinaNet, and Faster R-CNN struggled to capture ground-level scattered small-scale fire points, while YOLOv3-tiny and YOLOv5 displayed a lack of sensitivity to the dense black smoke generated by industrial fires, leading to the loss of critical targets. Meanwhile, in
Figure 7c, Faster R-CNN produces a serious false detection.
Although the two Transformer-based models, RT-DETR and RF-DETR, have relatively high detection sensitivity, they still show severe missed detections and false detections in the complex scenarios of
Figure 7. In
Figure 7a,d,e, RF-DETR fails to detect smoke and small fire points, while in
Figure 7c, RT-DETR is prone to misidentifying non-fire objects (such as fire-fighting equipment or complex background textures) as targets, indicating that both models have obvious limitations in background suppression and precise prediction.
The latest YOLO series models (from YOLOv8 to YOLO26) have effectively addressed the missed detections and false alarms common in the Transformer-based models; however, they still face challenges regarding bounding box regression consistency when dealing with non-rigid targets. Due to the amorphous nature and gradient edges of smoke, the bounding boxes generated by these models in
Figure 7a (ground-level smoke) and
Figure 7d (satellite remote sensing smoke) are often not sufficiently refined—manifesting as boxes that are too loose or fractured, failing to completely cover drifting smoke edges. Additionally, in
Figure 7e where multi-scale fire points coexist, they occasionally exhibit localization drift on tiny peripheral targets, suggesting room for improvement in regression stability under extreme scale variations.
Regarding the specialized fire detection SOTA models, YOLO-FireAD exhibits varying degrees of missed detections across all five test images. FireSmoke-YOLO performs comparatively well in
Figure 7a–c; however, in
Figure 7d, its smoke detection is incomplete, failing to capture the smoke regions on the left and upper portions of the scene, and in
Figure 7e, it misses small-scale fire points in the industrial scenario. These results indicate that despite being specifically designed for fire detection tasks, both models still exhibit notable limitations in generalizing across diverse and complex scenarios, particularly in detecting semi-transparent smoke and small-scale fire sources under challenging backgrounds.
In contrast, the YOLO-Fire method demonstrates the best detection performance across all comparative experiments. It not only maintains a high recall rate but also achieves a marked improvement in localization precision. As shown in
Figure 7b,d, the bounding boxes generated by our model are highly compact and well-fitted, accurately outlining the contours of irregular smoke without including excessive background. Simultaneously, in
Figure 7c, which is prone to false positives, our model exhibits excellent discriminative power with zero false alarms; in
Figure 7e, it precisely captures all dispersed, minute fire points. These results indicate that the proposed improvement modules endow the model with both strong robustness and tight localization accuracy in complex environments.
To further visually demonstrate the advantages of YOLO-Fire over the baseline and specialized fire detection SOTA models in challenging scenarios, we provide an additional qualitative comparison using satellite remote sensing imagery, as shown in
Figure 8. Across all five images, the competing models exhibit varying degrees of missed detections and false alarms. The baseline YOLOv11 shows limited robustness against cloud interference, misclassifying cloud regions as smoke in multiple images, while also missing smoke targets in certain scenes. YOLO-FireAD demonstrates poor generalization in satellite remote sensing scenarios, failing to detect any target in images 3–5. FireSmoke-YOLO persistently misidentifies cloud regions and complex background terrain as smoke or fire points across multiple images, generating substantial false alarms that would severely compromise the reliability of early warning systems in practice. In contrast, YOLO-Fire successfully suppresses cloud and terrain interference across all five images while accurately detecting all fire and smoke targets without false alarms, demonstrating significantly stronger robustness and generalization capability in satellite remote sensing fire monitoring tasks.
4. Discussion
4.1. Effectiveness Analysis of the Proposed Framework
4.1.1. Impact Analysis of Modules and Internal Mechanisms
Contribution Analysis of Core Components. To systematically evaluate the independent contributions and synergistic effects of HFFM, C2f-DCD, and GSPPF on model performance, we conducted a stepwise ablation study, with quantitative results summarized in
Table 4. From the perspective of stepwise module addition, starting from the baseline YOLOv11 (mAP50: 73.3%, mAP50-95: 48.8%, F1-score: 70.8%), the progressive introduction of each module yields consistent performance improvements. Upon introducing HFFM alone, mAP50 improves to 74.9%, mAP50-95 to 51.5%, and F1-score to 72.6%, demonstrating the significant contribution of frequency-domain structural decoupling to feature representation, particularly for the smoke category where mAP50 improves from 62.2% to 65.1%. Further incorporating C2f-DCD raises the overall mAP50 to 75.2% and mAP50-95 to 51.6%, confirming that the cascade-style contextual diffusion mechanism effectively complements HFFM by suppressing background interference and enhancing semi-transparent smoke features. Finally, the addition of GSPPF completes the full YOLO-Fire model, achieving the optimal overall performance with mAP50 of 75.7%, mAP50-95 of 53.3%, and F1-score of 73.7%, with the most pronounced gain observed in mAP50-95, confirming the necessity and cumulative effectiveness of all three modules operating in concert. From the perspective of individual module contributions, all single-module variants surpass the baseline, rigorously validating the independent effectiveness of each proposed module. Notably, although the GSPPF-only variant lags behind in overall metrics, it demonstrates superior sensitivity in capturing fire textures thanks to its expanded receptive field, achieving the highest single-category mAP50 of 85.1% for the Fire class. In the more challenging smoke detection task, the full model achieves the most substantial gains, with mAP50 and mAP50-95 improving by 4.8% and 7.3% respectively over the baseline, confirming that the synergistic operation of all three modules is particularly critical for detecting semi-transparent and amorphous targets.
Analysis of HFFM Internal Mechanism and Position. We further investigated the internal substructure design of the HFFM module and its optimal integration position within the network, with comparative data presented in
Table 5. The experiment primarily compared the module’s performance at different feature levels (P4 vs. P5) and analyzed the impact of the integrity of its internal components (DAGB and PDB). Results show that compared to single-branch structures, the complete design combining DAGB and PDB achieved the best performance, validating the necessity of simultaneously capturing attention and multi-scale receptive fields. Regarding position selection, deploying HFFM at the P5 high-level semantic feature layer yielded improvements of 1.1%, 2.4%, and 1.1% in mAP50, mAP50-95, and F1-score, respectively, compared to deployment at the P4 layer. This discrepancy suggests that DAGB and PDB are more efficient when processing high-level features rich in semantic information, thereby better guiding the model to utilize deep semantics to distinguish background interference from target features.
Analysis of DCD Fusion Strategy and Dilation Rate.
Table 6 details the impact of key design parameters in the DCD module on detection performance. Regarding the feature fusion mechanism, experiments revealed the clear superiority of the Element-wise Addition strategy. Its mAP50 reached 74.6%, which not only significantly outperforms the 73.3% achieved by multiplicative fusion but, notably, exceeds the 73.8% of the dual-branch configuration with no interaction. This result strongly suggests that complex multiplicative gating introduces an information bottleneck at early stages, suppressing the transmission of critical features. Conversely, the adopted addition strategy functions similarly to a non-linear residual injection: it integrates contextual information while maximally preserving the magnitude and gradient flow of the original features, thereby achieving lossless feature enhancement.
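The argument about magnitude and gradient preservation can be made concrete with a scalar toy example. The sketch below assumes a sigmoid-gated product as the multiplicative variant and a plain sum as the additive variant; the actual fusion operators inside DCD are not specified at this level of detail in the text, so both forms are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def multiplicative_fusion(x, context):
    """Assumed gating form: the context scales the local feature."""
    return x * sigmoid(context)


def additive_fusion(x, context):
    """Additive injection: the context is added to the local feature."""
    return x + context


def grad_wrt_x(fuse, x, context, eps=1e-6):
    """Central-difference estimate of d fuse / d x."""
    return (fuse(x + eps, context) - fuse(x - eps, context)) / (2.0 * eps)


if __name__ == "__main__":
    weak_signal, context = 0.05, -1.0  # e.g. thin smoke with an unfavorable gate
    print(multiplicative_fusion(weak_signal, context),
          grad_wrt_x(multiplicative_fusion, weak_signal, context))
    print(additive_fusion(weak_signal, context),
          grad_wrt_x(additive_fusion, weak_signal, context))
```

In this toy setting the gated product shrinks both the weak activation and its gradient by the gate value, whereas the sum leaves the gradient with respect to the local feature at one, which mirrors the residual-style behavior described above.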
Simultaneously, regarding receptive field configuration, we compared the adaptability of different convolution settings to multi-scale targets. For small Fire objects, the configuration with a dilation rate of d = 2 achieved an mAP50-95 of 57.2%, slightly outperforming the 56.3% of standard depthwise convolution. This corroborates that appropriate receptive field expansion does not negatively impact small object detection. More critically, depthwise dilated convolution exhibited stronger global modeling capabilities for large-scale environmental information, successfully overcoming detection difficulties caused by the amorphous nature and blurred boundaries of smoke targets. Compared to standard depthwise convolution, the use of d = 2 dilated convolution drove significant leaps of 2.3% and 3.0% in mAP50 and mAP50-95 for the “Smoke” category, respectively.
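The receptive field effect of the dilation rate follows from the standard relation k_eff = k + (k − 1)(d − 1) for a kernel of size k and dilation d. The short sketch below evaluates it for a 3 × 3 kernel, which is assumed here purely for illustration since the kernel size of the depthwise convolution is not restated in this section.

```python
def effective_kernel_size(k, d):
    """Spatial extent covered by a k-tap kernel with dilation rate d."""
    return k + (k - 1) * (d - 1)


def tap_offsets(k, d):
    """1-D sampling offsets of the kernel taps relative to the center."""
    return [(i - k // 2) * d for i in range(k)]


if __name__ == "__main__":
    for d in (1, 2):
        print(d, effective_kernel_size(3, d), tap_offsets(3, d))
```

With d = 2 the same number of weights spans a wider neighborhood than with d = 1, so the receptive field grows without adding parameters to the depthwise convolution.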
4.1.2. Visualization and Response Analysis
To provide a more intuitive evaluation of the comprehensive performance of various models across different confidence levels, we plotted the Precision-Recall (P-R) curves and F1-score curves generated from the test set in
Figure 9.
Superiority of the Overall Architecture: As illustrated in
Figure 9a,b, the red curve represents the proposed YOLO-Fire and the brown curve represents the YOLOv11 baseline. It can be clearly observed that the red P-R curve completely envelops the curves of the other variants, including the stepwise combinations. This implies that at the same Recall level, the complete model consistently maintains the highest Precision. Meanwhile, in the F1-score curve, the complete model exhibits the highest peak and the widest coverage range, while the baseline curve declines earliest, indicating that the progressive introduction of each module contributes incrementally to the model’s overall robustness.
Effectiveness of the HFFM Fusion Strategy:
Figure 9c,d compare the performance of the HFFM module against other feature fusion designs. The HFFM adopted in this paper demonstrates a larger Area Under the Curve (AUC) in the P-R space, and its F1-score shows a more gradual decline in the high-confidence interval. This indicates that, compared to other fusion mechanisms, HFFM efficiently aggregates multi-scale features, thereby reducing both missed detections and false positives.
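The area under a P-R curve and the peak of the corresponding F1 curve can be computed from a list of (recall, precision) operating points. The sketch below uses the common all-point interpolation with a monotone precision envelope; the text does not state which interpolation scheme was used for the reported curves, so this choice is an assumption.

```python
def average_precision(recalls, precisions):
    """Area under the P-R curve with all-point interpolation.

    `recalls` must be sorted in ascending order and aligned with
    `precisions`. Precision is first replaced by its monotone
    non-increasing envelope, then integrated over recall.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def peak_f1(recalls, precisions):
    """Highest F1 over the given operating points."""
    best = 0.0
    for rec, prec in zip(recalls, precisions):
        if rec + prec > 0:
            best = max(best, 2.0 * rec * prec / (rec + prec))
    return best


if __name__ == "__main__":
    rec = [0.5, 1.0]
    prec = [1.0, 0.5]
    print(average_precision(rec, prec), peak_f1(rec, prec))
```

A curve that lies above another at every recall level necessarily has the larger area under this definition, which is why an enveloping P-R curve corresponds to a higher average precision.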
Validation of C2f-DCD Internal Design: Regarding the internal parameter design of the DCD module,
Figure 9e,f provide compelling visual evidence. The scheme employing addition fusion significantly outperforms both multiplicative fusion and the parallel non-interaction approach, confirming the lossless nature of the addition operation during feature injection. Furthermore, compared to standard depthwise convolution with d = 1, our scheme introducing depthwise dilated convolution maintains higher precision at the tail end of the P-R curve. This revalidates the necessity of expanding the receptive field to improve the detection capability for complex samples.
To intuitively investigate the impact of different modules on feature extraction, we visualized the intermediate feature maps of critical network layers (as shown in
Figure 10). These visualizations reveal the internal response mechanisms of the model when handling complex textures and background interference.
Figure 10a presents a comparison of feature responses between the proposed YOLO-Fire model and its variants, including stepwise combinations and the baseline. The YOLO-Fire model, which integrates all core components, effectively identifies thin smoke and tiny fire points in both real-world scenes and satellite imagery, generating feature maps with a significantly higher Signal-to-Noise Ratio (SNR). In smoke and fire regions, feature responses exhibit clear structured textures, whereas background regions are effectively suppressed. The YOLOv11-HFFM+C2f_DCD variant demonstrates notably improved feature responses compared to single-module variants and the baseline, confirming the complementary effectiveness of these two modules in jointly suppressing background interference and enhancing target saliency. However, without GSPPF, its feature maps still exhibit slightly weaker responses in multi-scale target regions compared to the complete model. In contrast, the baseline YOLOv11 and single-module variants are often accompanied by substantial background noise or exhibit blurred target contours, with the baseline showing the weakest target discrimination capability. This indicates that the progressive introduction of each module incrementally enhances feature quality, and the synergistic operation of all three modules provides the highest-quality feature inputs for the subsequent detection head.
To further visually validate the effectiveness of the synergistic internal components of the HFFM module and the rationality of its deployment strategy, we analyzed intermediate feature maps under different configurations, as shown in
Figure 10b. Notably, only the proposed HFFM module successfully detected the tiny fire points in the image and avoided misclassifying clouds as smoke. By comparing the feature maps, it is clearly observable that when relying solely on a single sub-module, the model’s response to diffused smoke targets often exhibits distinct sparsity, failing to effectively cover the target’s complete morphology. Conversely, the feature maps generated by the HFFM with its dual-branch integrated design demonstrate significant superiority: activations in target regions present high continuity and structural integrity, while preserving rich texture details. Furthermore, a comparison of background regions reveals that deploying HFFM at the high-level semantic layer (P5) more efficiently suppresses environmental noise interference and significantly boosts the SNR of target regions. This visually verifies the necessity of leveraging deep semantics for feature reconstruction.
To intuitively validate the design rationality of DCD, we compared feature maps under different parameters, as shown in
Figure 10c. Our designed DCD module effectively distinguishes smoke from background in satellite imagery, whereas other comparative models consistently produced false detections. Regarding fusion strategies, feature maps in the Multiplication group appear significantly darker, confirming the suppressive effect of multiplicative gating on early signals. The Parallel group retains feature intensity but results in looser target texture structures due to the lack of information interaction between branches. In contrast, the Addition strategy effectively integrates the advantages of both branches, injecting context while maintaining high response levels, thus achieving lossless feature enhancement. Regarding receptive field, although the feature response of standard depthwise convolution (d = 1) captures smoke regions, the distribution is relatively scattered with excessive attention paid to surrounding non-smoke targets. However, the DCD successfully expands the effective receptive field, completely depicting the diffusive morphology of smoke. This provides a powerful visual explanation for the performance advantages of the proposed design in complex scenarios.
4.2. Discussion on Validity
The empirical dominance of YOLO-Fire underscores a critical insight for safety monitoring: aligning network architecture with the inherent physical properties of targets is far more effective than generic feature learning. While UFS-YOLO [
63] enhances detection via standard CBAM attention, it oversimplifies the spectral heterogeneity between fire and smoke, treating them as uniform targets. Unlike classical attention mechanisms such as SE and CBAM, HFFM achieves structural decoupling at the architectural level. Incipient flames present high-frequency, sharp-boundary point features, while diffused smoke manifests as low-frequency, amorphous diffusive textures; processing both within a unified stream inevitably leads to mutual feature interference. By contrast, the parallel dual-stream design of HFFM structurally isolates the extraction pathways of these two heterogeneous targets, ensuring that high-frequency flame boundaries and low-frequency smoke textures are processed independently. This design also offers distinct advantages over serial structures and multi-head attention mechanisms: serial structures suffer from cumulative feature contamination, where mixed representations in early layers constrain the expressiveness of subsequent layers; multi-head attention mechanisms, while introducing multiple perspectives, still operate on a shared input feature map, incurring inter-head competition and substantial computational overhead. HFFM’s two streams originate from the same input yet compute entirely independently, achieving genuine complementarity through final fusion—a structural isolation that is particularly critical where high-frequency flame features and low-frequency smoke features inherently coexist, while maintaining significantly lower computational cost for lightweight real-time deployment.
Interestingly, IFS-DETR [
64] also recognizes the significance of frequency-domain analysis, employing 2D-DCT for channel attention. However, HFFM achieves structural decoupling rather than merely re-weighting channels, offering superior interpretability. Furthermore, C2f-DCD functions as a semantic regularizer, fundamentally differing from late-fusion strategies such as Inception and ASPP, which extract features independently across parallel branches and aggregate them only at the terminal stage, severing inter-scale feature interaction. In contrast, C2f-DCD introduces a cascade-style feature diffusion mechanism that injects global contextual information into the local branch prior to local texture extraction, ensuring that local feature extraction is conditioned on global environmental knowledge. Unlike multiplicative gating mechanisms, which suppress feature transmission by scaling activations and risk losing weak signals such as thin smoke or sub-pixel fire points, additive injection preserves the magnitude and gradient flow of the original features, achieving lossless feature enhancement. This additive context injection facilitates a dual-enhancement strategy: by integrating dilated global context, it simultaneously prevents the fragmentation of semi-transparent smoke features and imposes an environmental consistency check to suppress isolated high-brightness noise, ensuring robust perceptibility for both amorphous smoke and sub-pixel fire sources.
In the realm of real-time safety surveillance, a dichotomy exists between two dominant paradigms: heavyweight Transformers and lightweight CNNs. The recent IFS-DETR and the improved DETR framework proposed by Li et al. [
65] represent the former, utilizing end-to-end architectures to capture global dependencies. While IFS-DETR attempts to optimize speed via a lightweight backbone (LeanNet), it relies heavily on specialized TensorRT quantization to reach high frame rates, creating a deployment barrier. In contrast, YOLO-Fire demonstrates that pure CNN architectures can achieve comparable global perception without the stringent hardware dependencies of Transformers. By simulating global attention through DCD and GSPPF, YOLO-Fire maintains 10.02 M parameters—significantly lighter than standard DETR variants—while running efficiently on standard hardware without complex pre-compilation. This native efficiency renders it more suitable for resource-constrained edge devices where computational resources and quantization support may be limited.
Despite its superior quantitative performance, we must critically acknowledge the boundaries of YOLO-Fire’s current capability, particularly in uncontrolled industrial or wilderness environments. Its reliance on the visible light spectrum creates an inherent spectral blindness that cannot be overcome by architectural optimization alone. Like other RGB-based architectures, it extracts features based on color and texture gradients; consequently, in nighttime scenarios or concealed smoldering events—such as underground peat fires or overheating process equipment—the model faces a definitive physical detection limit. Regarding environmental robustness, while the DCD mechanism effectively suppresses static urban noise, it remains sensitive to dynamic high-frequency perturbations. Under extreme conditions—such as dense fog, industrial steam causing Tyndall effects, or strong winds driving chaotic vegetation movement—the background may generate transient high-frequency patterns mimicking flames, potentially causing false positives. Additionally, sub-pixel tiny fire sources remain challenging in wide-area remote sensing; despite receptive field expansion by GSPPF, deep convolutional downsampling inevitably causes feature erosion for objects occupying minimal pixel areas. While this study primarily contributes to the field of deep learning-based digital image processing for fire detection, the incorporation of UAV and satellite imagery demonstrates its potential applicability to remote sensing scenarios. Furthermore, it should be acknowledged that the UAV and satellite imagery incorporated in the dataset was obtained in RGB image format from publicly available platforms and the internet, rather than through conventional remote sensing data acquisition workflows. 
As a result, standardized remote sensing metadata such as sensor specifications, spatial resolution, spectral resolution, and geographic coordinate information are not uniformly available across all imagery sources, and the dataset does not follow a conventional remote sensing data structure with georeferenced imagery. This limits the ability to present detection results within a defined geographic coordinate reference system, which represents a gap relative to the standards of conventional remote sensing research.
To address these limitations, future research will proceed along the following directions. First, to overcome spectral blindness, a visible-light and thermal infrared dual-mode fusion framework will be explored. By incorporating mid-wave or long-wave infrared thermal imaging alongside the existing RGB stream, the model will detect heat signatures invisible to RGB sensors—including smoldering underground fires, concealed overheating industrial equipment, and nighttime fire sources. Cross-modal feature alignment strategies and modality-specific pretraining schemes will be investigated to bridge the domain gap between visible and infrared representations. Second, to mitigate dynamic background interference, spatiotemporal attention mechanisms will be introduced. Dense optical flow fields computed from consecutive video frames will provide motion-level cues, enabling the model to distinguish periodic vegetation swaying from the irregular turbulent motion of real fire. Background dynamic modeling will also be explored to establish temporal consistency constraints, suppressing transient high-frequency patterns under extreme conditions that may trigger false positives in spatial-only frameworks. Third, to address sub-pixel tiny fire sources in wide-area remote sensing, a dedicated small-object detection branch operating on high-resolution shallow feature maps will be introduced within an enhanced feature pyramid. By preserving fine-grained spatial information at the P2 feature level, this head will directly target objects occupying minimal pixel areas, improving recall for extremely small ignition points in satellite and UAV imagery.
Future work will also consider fine-grained annotation schemes for smoke concentration levels and diffusion phases—including initial, diffusion, and dissipation stages—to enable more detailed stage-wise performance evaluation and identify scenarios requiring further improvement. Although YOLO-Fire has been preliminarily validated in resource-constrained environments through single-core CPU simulation, achieving 17.28 FPS, further optimization is required for edge devices and UAV platforms. Specifically, model compression through INT8 and FP16 quantization, structured channel pruning, and knowledge distillation will be investigated, targeting NVIDIA Jetson Orin and ARM Cortex-based processors, with the goal of achieving over 30 FPS while maintaining detection accuracy within an acceptable margin. Finally, future work will prioritize constructing a more rigorous remote sensing fire dataset incorporating georeferenced imagery with complete metadata from specific satellite missions, enabling detection results to be presented within a defined geographic coordinate reference system and enhancing the geospatial applicability of the proposed method.
5. Conclusions
Addressing the critical safety imperative of identifying incipient ignition points and localizing diffused smoke in complex urban environments, this paper proposes YOLO-Fire, a lightweight yet high-precision deep learning detection framework built upon YOLOv11. To overcome the persistent bottlenecks of existing models in simultaneously handling tiny high-frequency flame features and low-frequency semi-transparent smoke textures, three targeted modules are integrated into the architecture. The HFFM achieves structural frequency-domain decoupling between flame boundaries and smoke textures through a parallel dual-stream design, reducing the mutual feature interference that arises when heterogeneous targets are processed within a unified stream. The C2f-DCD module introduces a cascade-style contextual diffusion mechanism that injects global environmental context into local feature extraction, simultaneously suppressing fire-like background clutter and enhancing the contrast of semi-transparent smoke features. The GSPPF module optimizes multi-scale receptive field aggregation at the end of the backbone through GELU-activated pooling, expanding the model’s perceptual range across extreme scale variations from sub-pixel ignition points to large-area smoke plumes.
Extensive experiments on a self-constructed large-scale multi-source urban fire dataset comprising 12,272 annotated images demonstrate that YOLO-Fire achieves an overall mAP50 of 75.7%, mAP50-95 of 53.3%, and F1-score of 73.7% with only 10.02 M parameters. Specifically, the model attains an mAP50 of 84.5% and an F1-score of 80.4% for the fire category, while maintaining a competitive mAP50 of 67.0% and F1-score of 66.9% for the inherently more challenging smoke category. Compared to the YOLOv11 baseline, YOLO-Fire achieves improvements of 2.4%, 4.5%, and 2.9% in mAP50, mAP50-95, and F1-score, respectively. Quantitative comparisons against a broad range of mainstream detection models and specialized fire detection SOTA models further confirm that YOLO-Fire comprehensively outperforms all compared methods across key overall performance metrics while maintaining a favorable balance between detection accuracy and computational efficiency. Furthermore, inference evaluation on a single-core CPU achieves 17.28 FPS, validating the practical deployment feasibility of YOLO-Fire in resource-constrained environments.
The proposed method effectively addresses the core challenges of detecting small-scale ignition points and semi-transparent diffused smoke in complex urban environments, offering an efficient and lightweight solution for practical fire safety surveillance. At a higher level, this study proposes a modular design paradigm of first decoupling (HFFM), then suppressing (DCD), and finally aggregating (GSPPF), and experimentally demonstrates that this paradigm effectively enhances model performance in detecting targets with small scale, semi-transparent appearance and high background interference, offering valuable insights for analogous fine-grained visual perception tasks in safety-critical domains.