Article

Multi-Scale Construction Site Fire Detection Algorithm with Integrated Attention Mechanism

1 College of Information Science and Technology, Southwest Jiaotong University, Chengdu 610097, China
2 Yantai New Generation Information Technology Research Institute, Southwest Jiaotong University, Yantai 264000, China
* Author to whom correspondence should be addressed.
Fire 2025, 8(7), 257; https://doi.org/10.3390/fire8070257
Submission received: 16 May 2025 / Revised: 24 June 2025 / Accepted: 24 June 2025 / Published: 30 June 2025

Abstract

The occurrence of construction site fires is consistently accompanied by casualties and property damage. To address the issues of large target-scale variations and frequent false detections in construction site fire monitoring, we propose a fire detection algorithm based on an improved YOLOv8 model, achieving real-time and efficient detection of fires on construction sites. First, considering the wide range of scale variations in detected objects, an additional detection layer with a 64-times down-sampling rate is introduced to enhance the algorithm’s detection capability for multi-scale targets. Then, the MBConv module and the ESE attention block are integrated into the C2f structure to enhance feature extraction capabilities while reducing computational complexity. An iCBAM attention module is designed to suppress background noise interference and enhance the representation capability of the network. Finally, the WIoUv3 metric is adopted in the loss function for bounding box regression to mitigate harmful gradient issues. Comparative experiments demonstrate that, on a self-constructed construction site fire dataset, the improved algorithm achieves increases in precision and recall of 4.6% and 3.0%, respectively, compared to the original YOLOv8 model. Additionally, mAP50 and mAP50-95 are improved by 1.6% and 1.5%, respectively. This algorithm provides a more effective solution for fire monitoring in construction environments.

1. Introduction

With the rapid advancement of global modernization and infrastructure development, the scale of engineering projects continues to expand, significantly increasing the complexity of construction environments and posing severe challenges to construction site safety management. According to data from China’s National Fire and Rescue Administration, in the first half of 2023, a total of 550,000 fires were reported nationwide, resulting in 959 fatalities and direct property losses of 3.94 billion yuan, demonstrating a clear upward trend compared with the same period of the previous year. As losses caused by fires exhibit an exponential positive correlation with response time, timely detection and extinguishment of fire sources at their initial stage can greatly reduce damage [1]. Therefore, there is an urgent need for a real-time and accurate fire detection solution.
Traditional fire monitoring methods mainly rely on manual inspections and sensor networks [2]. Manual inspection suffers from poor real-time performance and high missed detection rates, whereas sensor networks are susceptible to environmental interference (such as dust and illumination variations) and entail high deployment costs. To overcome the limitations of traditional monitoring methods and enhance supervisory efficiency, object detection algorithms based on deep learning have provided new approaches for fire detection in recent years [3].
Two-stage object detection algorithms have attracted considerable attention in the field of fire detection due to their superior detection accuracy and robust feature representation capabilities. Choi et al. [4] integrated the Swin Transformer backbone into the Faster R-CNN framework, achieving excellent performance in multi-category wildfire detection tasks. This method notably enhanced the identification capability for small-scale flame and smoke targets and effectively reduced the false detection rate between smoke and non-fire smoke, attaining an mAP of 0.841 and demonstrating strong robustness. Cheknane et al. [5] proposed an improved Faster R-CNN model incorporating a hybrid feature extraction mechanism to address the challenges of flame and smoke detection in complex indoor and outdoor scenarios. Their method achieved outstanding performance, with an mAP of 90.1% and an accuracy of 96.5% on a private dataset. Through multi-level feature fusion, the model significantly improved its perception of diverse fire scenarios. However, two-stage object detection algorithms typically suffer from substantial model parameter sizes and relatively slow inference speeds, making it particularly challenging to meet the real-time requirements of fire warning systems. Consequently, these methods have not yet been widely implemented in practical engineering applications.
Single-stage algorithms have achieved numerous breakthroughs in the field of fire detection. Gragnaniello et al. [6] proposed a detection framework named “FLAME,” whose innovation lies in exploiting flame motion characteristics to filter out background interference and enhance detection accuracy. Experimental results indicated that, compared with the original YOLO model, incorporating the motion analysis module significantly reduced the false alarm rate, increasing the accuracy and F-score by 18% and 10%, respectively, with an average alarm delay of 9.17 s, thus satisfying real-time alarm requirements. Mukhiddinov et al. [7] proposed a lightweight, improved YOLOv5 algorithm aimed at early smoke detection in outdoor wildfire scenarios. By re-optimizing anchor box dimensions and introducing a fast spatial pyramid pooling module (SPP-Fast) to strengthen small-scale feature extraction, alongside employing a bidirectional feature pyramid network (BiFPN) to enhance multi-scale feature fusion efficiency, combined with model pruning and transfer learning, the adaptability of the model in resource-constrained environments was significantly improved. Experimental results on an unmanned aerial vehicle (UAV)-captured wildfire smoke dataset demonstrated that the proposed method achieved an average precision of 73.6%, exhibiting promising real-time detection performance. Liu et al. [8] proposed a detection network termed TFNet, integrating convolutional neural networks and Transformer architectures, to address the challenging problem of identifying small-scale fire spots and sparse smoke in the early detection of forest fires. The proposed method introduces a multi-branch scale-aware representation (S-R) module, effectively enhancing multi-scale feature representation capabilities. Additionally, a Context-Guided Multi-Scale Feature Fusion (CG-MSFF) encoder is developed to efficiently integrate local and global contextual information. Meanwhile, a Transformer-based decoding detection head combined with a Weighted IoU loss function is employed, significantly improving the model’s ability to identify and localize challenging samples. TFNet achieved superior detection performance compared to baseline models on two publicly available fire detection datasets. Li et al. [9] proposed an improved algorithm, YOLO11s-MSCA, by integrating a Multi-Scale Convolutional Attention (MSCA) module with the YOLO11 framework to address challenges associated with variations in fire target scales and complex environmental background interference. Experimental results demonstrated that this method achieved superior detection accuracy and generalization performance across multiple publicly available datasets, significantly improving the detection precision for small-scale fire and smoke, while maintaining the high efficiency and low complexity required for real-time detection.
Although deep learning algorithms have made notable breakthroughs in the field of fire detection, existing methods still exhibit deficiencies in the collaborative optimization of multi-scale detection efficiency and robustness against complex backgrounds. To address these challenges, this paper proposes an improved YOLOv8-based algorithm named YOLO-Fire, with key innovations summarized as follows:
(1) Self-Constructed Construction Site Fire Dataset:
Current mainstream object detection datasets lack annotation data specifically targeted at smoke and fire in fire scenarios, and there is currently no authoritative and publicly available specialized dataset for construction site fire detection. To bridge this gap, this paper constructs a self-developed construction site fire detection dataset (Fire-5000) through multi-channel collection, precise annotation, and data augmentation.
(2) Multi-scale Lightweight Collaborative Architecture:
An additional detection layer with a 64-times down-sampling rate is introduced to broaden the detection scale coverage. Meanwhile, a lightweight convolutional module, termed Efficient Mobile Bottleneck (EMB), is designed and integrated into the C2f unit, mitigating the computational overhead caused by increased network depth and achieving a balanced trade-off between detection efficiency and accuracy.
(3) Inverted Convolutional Block Attention Module (iCBAM):
By embedding the Convolutional Block Attention Module (CBAM) within the Inverted Residual Block (IRB), we guide the network to selectively focus on informative features across both channel and spatial dimensions, which leads to a more powerful feature representation.
(4) Dynamic Gradient Optimization Strategy:
The WIoU v3 loss function is introduced, which dynamically adjusts gradient weights through adaptive assessment of sample outliers. This effectively mitigates interference caused by extreme-quality samples during model training, thereby enhancing the stability and convergence efficiency of the training process.

2. Materials and Methods

2.1. YOLOv8 Algorithm

The YOLO series of algorithms was first introduced by Joseph Redmon et al. [10] in their 2015 paper, innovatively formulating the object detection task as a regression problem within a single convolutional neural network. This approach effectively achieves a remarkable balance between real-time performance and detection accuracy. In subsequent developments, the YOLO algorithm has undergone several significant iterations and optimizations, each yielding substantial improvements in detection accuracy, inference speed, and model lightweighting. Among these versions, the YOLOv5 [11] model developed by Ultralytics has gained widespread recognition and adoption in industry due to its efficiency, stability, and ease of deployment. YOLOv8 [12], inheriting the efficient architecture of YOLOv5, achieves further architectural optimization and technological innovations. Its network structure primarily consists of the following core components:
(1) Backbone: The backbone network is built upon an improved CSPDarkNet architecture, employing stacked convolutional layers and bottleneck structures to facilitate efficient feature extraction. Compared to YOLOv5, the key improvement in YOLOv8 is the replacement of the fundamental C3 module with the more efficient C2f module. Inspired by the ELAN [13] design, the new module introduces additional parallel gradient-flow branches, effectively reducing computational overhead while capturing richer gradient-flow information. Moreover, its basic convolutional unit adopts the CBS structure, consisting of a two-dimensional convolutional layer, batch normalization, and the SiLU [14] activation function (a minimal sketch of this unit is given after this list).
(2) Neck: The neck network adopts the FPN [15] (Feature Pyramid Network) combined with PAN [16] (Path Aggregation Network) structures, enabling efficient multi-scale feature fusion through a bidirectional feature pyramid approach.
(3) Detection Head: The detection layers perform detection tasks across multiple feature maps at different scales. The detection head employs a decoupled head design, separating classification and regression tasks to prevent interference between them, thereby enhancing the accuracy of object recognition.
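To make the CBS unit from item (1) concrete, the following is a minimal PyTorch sketch of a Conv2d + BatchNorm2d + SiLU block; the layer names and default arguments here are illustrative and do not reproduce Ultralytics’ exact implementation.

```python
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Conv2d -> BatchNorm2d -> SiLU, the basic convolutional unit described above."""
    def __init__(self, c_in, c_out, kernel_size=3, stride=1):
        super().__init__()
        padding = kernel_size // 2  # "same" padding for odd kernel sizes
        self.conv = nn.Conv2d(c_in, c_out, kernel_size, stride, padding, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

if __name__ == "__main__":
    x = torch.randn(1, 3, 640, 640)
    print(CBS(3, 64, kernel_size=3, stride=2)(x).shape)  # torch.Size([1, 64, 320, 320])
```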

2.2. Fire-5000 Dataset

Currently, in the field of construction site fire detection, there is no authoritative and publicly available dataset. Mainstream fire datasets typically include various fire scenarios without specifically distinguishing construction site images, and they lack scenes exhibiting the distinctive characteristics of such environments. To address practical engineering needs, this paper constructs a custom computer vision dataset for construction site fires. The dataset constructed in this study strictly follows a four-stage construction paradigm, namely: multi-source heterogeneous data acquisition and scene adaptation, content-consistency-based data cleaning, refined target labeling, and physically constrained data augmentation. The detailed implementation process is as follows:
(1) Data Collection and Screening: Fire-related datasets were collected from multiple publicly available dataset platforms, from which images specifically limited to construction site fire scenarios were selected. Relevant video materials were downloaded, and key frames were extracted using frame sampling techniques (one frame per second) to retain temporal feature images covering the entire developmental cycle of fires. The image collection contains construction site fire images from various fire development stages, lighting conditions, and resolutions, closely aligning with real application scenarios to enhance the specificity of feature extraction by the algorithm.
(2) Data Cleaning: During the integration of fire images, it was observed that a significant number of duplicate images related to construction site scenarios existed across different datasets. Using these repetitive samples during model training could cause severe problems, such as model overfitting and evaluation bias. To prevent these issues, a two-stage data cleaning method was adopted in this study (a minimal sketch of the procedure is given after this list). Firstly, a Python script was developed to sequentially read all images from the directory and generate a hash value for each image file using the MD5 algorithm. Images with identical hash values were identified as exact duplicates and deleted in batches. After the initial cleaning stage, images with identical visual content but differing file formats or resolutions remained within the directory. Since such images produce largely similar features during feature extraction, they should also be considered duplicates and removed accordingly. Subsequently, a secondary cleaning process was conducted using the Structural Similarity (SSIM) algorithm based on the OpenCV library. The SSIM [17] index measures image similarity across three dimensions: contrast, luminance, and structure. In this study, image pairs with SSIM ≥ 0.6 were treated as visual duplicates, and the redundant files were deleted.
(3) Data labeling: Images of fire and smoke within the dataset were labeled using the LabelMe software. As shown in Figure 1, the Minimum Bounding Rectangle strategy was employed: each fire region was enclosed by a bounding box tightly fitting the full flame contour. Each bounding box was labeled with the target category “Fire”, and smoke was annotated similarly. After annotation was completed, the labels were batch-exported and converted into the TXT format compatible with the YOLO model (a conversion sketch is also provided after this list).
(4) Data Augmentation: To increase the number of samples, 500 images were selected from the annotated dataset and randomly processed using data augmentation techniques. The augmented images were then integrated into the dataset.
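As a minimal sketch of the two-stage cleaning procedure in step (2), the snippet below removes exact duplicates by MD5 hash and then near-duplicates by SSIM with the θ ≥ 0.6 threshold. The paper uses an OpenCV-based SSIM; scikit-image’s implementation is substituted here for brevity, and the directory name and the 256 × 256 comparison size are illustrative assumptions.

```python
import hashlib
from itertools import combinations
from pathlib import Path

import cv2
from skimage.metrics import structural_similarity as ssim

def remove_exact_duplicates(image_dir: str) -> list[Path]:
    """Stage 1: delete files whose MD5 hash (i.e., byte content) has been seen before."""
    seen, kept = {}, []
    for p in sorted(Path(image_dir).glob("*.*")):
        h = hashlib.md5(p.read_bytes()).hexdigest()
        if h in seen:
            p.unlink()
        else:
            seen[h] = p
            kept.append(p)
    return kept

def remove_visual_duplicates(paths: list[Path], theta: float = 0.6) -> None:
    """Stage 2: compare grayscale, same-size versions of each pair; SSIM >= theta means duplicate."""
    gray = {p: cv2.resize(cv2.imread(str(p), cv2.IMREAD_GRAYSCALE), (256, 256)) for p in paths}
    removed = set()
    for a, b in combinations(paths, 2):
        if a in removed or b in removed:
            continue
        if ssim(gray[a], gray[b], data_range=255) >= theta:
            b.unlink()
            removed.add(b)

if __name__ == "__main__":
    survivors = remove_exact_duplicates("fire_images")  # hypothetical directory
    remove_visual_duplicates(survivors, theta=0.6)
```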
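The label conversion in step (3) can likewise be sketched as follows: each LabelMe rectangle is turned into a YOLO-format line of the form "class_id x_center y_center width height" with coordinates normalized to [0, 1]. The class list and file paths are illustrative; the actual export in this work was performed in batch.

```python
import json
from pathlib import Path

CLASSES = ["Fire", "Smoke"]  # assumed class order

def labelme_to_yolo(json_path: str, out_dir: str) -> None:
    """Convert one LabelMe JSON file with rectangle shapes into a YOLO TXT label file."""
    data = json.loads(Path(json_path).read_text())
    w, h = data["imageWidth"], data["imageHeight"]
    lines = []
    for shape in data["shapes"]:
        (x1, y1), (x2, y2) = shape["points"][:2]      # two opposite rectangle corners
        x_min, x_max = sorted((x1, x2))
        y_min, y_max = sorted((y1, y2))
        cx = (x_min + x_max) / 2 / w                  # normalized box centre
        cy = (y_min + y_max) / 2 / h
        bw = (x_max - x_min) / w                      # normalized box size
        bh = (y_max - y_min) / h
        cls = CLASSES.index(shape["label"])
        lines.append(f"{cls} {cx:.6f} {cy:.6f} {bw:.6f} {bh:.6f}")
    out = Path(out_dir) / (Path(json_path).stem + ".txt")
    out.write_text("\n".join(lines))
```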
According to statistical analysis, the self-constructed dataset comprises a total of 5004 fire images, with approximately 10,500 annotated instances. Among these, there are approximately 6800 fire instances and 3700 smoke instances. The number of fire instances is nearly twice that of smoke instances, exhibiting a clear class imbalance. In real fire scenarios, fires tend to spread to surrounding combustible objects, forming multiple fire targets. In contrast, smoke generally has lower density, rises due to heat, and often aggregates into a single smoke region. Thus, the observed phenomenon of abundant fire targets and fewer smoke targets in the actual image collection aligns with authentic fire conditions, validating the reliability of the self-constructed dataset. As shown in Figure 2, color intensity is used to indicate the quantitative distribution of targets that meet the specified criteria. In the dataset, the distribution of target center points is relatively dispersed, with higher density in the central region, indicating a clear center-oriented composition tendency during data collection. Meanwhile, peripheral regions also exhibit a relatively uniform distribution, consistent with the realistic scenarios of fire occurrences. Furthermore, the scale distribution of targets exhibits clear multi-scale characteristics. Given the explicitly demonstrated multi-scale characteristics of targets within this dataset, it becomes particularly essential to conduct research focused on multi-scale optimization of object detection algorithms.

3. Algorithm Optimization

To address the problems of large variations in target scale and susceptibility to environmental interference that causes false detections in construction site fire scenarios, this paper optimizes the algorithm from three aspects: multi-scale detection [18], attention mechanisms, and the loss function, aiming to improve the algorithm’s detection performance specifically in construction site contexts.

3.1. Multi-Scale Lightweight Collaborative Architecture

3.1.1. Multi-Scale Network Structure Optimization

The primary method for evaluating the performance of object detection algorithms is to compare their detection capabilities on comprehensive benchmark datasets, such as MS COCO. Such datasets typically encompass target objects of moderate size and exhibit relatively balanced scale distributions. Therefore, mainstream multi-scale detection algorithms, including the YOLO series, typically adopt a compromise strategy regarding structural parameters such as network depth, down-sampling rate, and feature fusion scales, aiming to achieve a balance in detection performance across targets of various scales. This strategy performs well in conventional object detection tasks; however, it exhibits notable limitations in specific application scenarios (such as the prevalence of large-scale targets in construction environments or small-object detection in remote sensing), making it challenging for the algorithm to effectively identify targets beyond the normal scale range [19]. As illustrated in Figure 3, the original YOLOv8 model exhibits missed detections for large-scale smoke targets and insufficient confidence (only 0.32) for small-scale fire targets during predictions on construction site fire images.
The official pretrained weights for the YOLOv8 algorithm were trained and optimized using the MS COCO [20] dataset, which covers 80 common object categories. Among these, approximately 72% of target instances have sizes concentrated between 32 × 32 and 256 × 256 pixels. Therefore, the original YOLOv8 network architecture incorporates only three scale levels with down-sampling rates of 32×, 16×, and 8×, corresponding to theoretical receptive fields of 32 × 32, 16 × 16, and 8 × 8 pixels, respectively. While this structural design demonstrates good detection performance for medium-scale targets, it exhibits clear limitations when applied to special scenarios such as construction site fire incidents. In this paper, the network architecture is redesigned to enhance multi-scale detection performance.
According to the principle of down-sampling, after down-sampling operations, the resolution of feature maps gradually decreases, resulting in the loss of detailed geometric information within the images. However, this simultaneously expands the receptive field of the model, enabling the network to capture richer semantic information. Therefore, increasing the down-sampling rate of the model can yield feature maps with larger receptive fields, thereby enhancing the model’s capability to detect large-scale targets. To address the insufficient semantic information expressed by the original YOLOv8 architecture at its deepest output feature map with a 32-fold down-sampling rate, this study introduces an additional cascaded down-sampling pathway at the end of the backbone network. This added pathway consists of convolutional layers and the C2f module, further down-sampling to a 64× scale and generating a 10 × 10 feature map. Considering the real-time requirements of construction site fire detection tasks, the network width was adjusted after introducing the new down-sampling layer. Specifically, the number of channels in the original 32× down-sampled feature map was reduced from 1024 to 768, while the newly introduced 64× down-sampled feature map was set to 1024 channels, thus balancing detection accuracy and computational efficiency of the network, as presented in Figure 4.
Meanwhile, to effectively utilize the newly introduced 64× down-sampling layer’s feature information, this paper further designs an additional set of feature fusion pathways based on the original feature pyramid structure proposed by Ultralytics. This includes additional up-sampling and down-sampling operations, along with the corresponding construction of a detection head adapted specifically to this scale. The specific improvements are highlighted in green in Figure 5.
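The added pathway can be sketched as follows, assuming the Ultralytics package layout (ultralytics.nn.modules exposes Conv and C2f); in the actual model it is wired in through the network configuration rather than as standalone modules, and the block count n = 3 is an illustrative choice.

```python
import torch
from ultralytics.nn.modules import Conv, C2f

# 32x feature map from the backbone (channels reduced to 768 as described above),
# assuming a 640 x 640 input image.
p5 = torch.randn(1, 768, 20, 20)

down = Conv(768, 1024, k=3, s=2)               # stride-2 convolution: 20x20 -> 10x10
c2f_p6 = C2f(1024, 1024, n=3, shortcut=True)   # C2f block on the new 64x scale

p6 = c2f_p6(down(p5))
print(p6.shape)                                 # torch.Size([1, 1024, 10, 10])
```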

3.1.2. Lightweight Convolution-Based Reconstruction of the C2f Module

The newly introduced 64× down-sampling detection layer described in the previous section significantly enhances the detection performance for large-scale targets. However, this addition increases the depth of the neural network, substantially elevating the computational complexity of the model, thus posing challenges to real-time construction site fire detection [21]. To address this issue, a lightweight collaborative convolutional block suitable for multi-scale detection is designed to reconstruct the C2f module, thereby balancing the efficiency and accuracy of the model.
Google’s research team proposed the EfficientNet [22] architecture, a lightweight convolutional neural network that achieves higher detection accuracy with lower computational costs and fewer model parameters by hierarchically stacking MBConv [23] modules. The module’s design is primarily inspired by depthwise separable convolutions and bottleneck structures, replacing conventional convolutions with a combination of pointwise convolutions and depthwise convolutions for feature extraction, significantly reducing computational complexity and the number of parameters.
For a convolution operation, suppose the input feature map has dimensions $W_i \times H_i$ with $C_i$ channels, the convolution kernel size is $K_w \times K_h$, also with $C_i$ channels, and the output feature map has dimensions $W_o \times H_o \times C_o$. In standard convolution, each kernel performs element-wise multiplication with every channel of the input and sums the results, producing a scalar value at each spatial location. The computational complexity is represented by Equation (1):
$\mathrm{FLOPs}_{\mathrm{Conv}} = W_o \times H_o \times C_o \times C_i \times K_w \times K_h$
In depthwise convolution, each input channel is independently convolved with its corresponding kernel channel, thereby avoiding cross-channel computations. The computational complexity is given by Equation (2):
$\mathrm{FLOPs}_{\mathrm{DW}} = W_o \times H_o \times C_i \times K_w \times K_h$
Pointwise convolution is essentially a standard convolution operation with a kernel size of 1 × 1. In the forward propagation of the MBConv module, two pointwise convolutions and one depthwise convolution are sequentially performed. The ratio of computational complexity between this module (MBConv) and standard convolution is given by Equation (3):
$\dfrac{\mathrm{FLOPs}_{\mathrm{MBConv}}}{\mathrm{FLOPs}_{\mathrm{Conv}}} = \dfrac{C_i \times C_o + C_i \times K_w \times K_h + C_i \times C_o}{C_o \times C_i \times K_w \times K_h} = \dfrac{K_w \times K_h + 2 C_o}{C_o \times K_w \times K_h}$
Assuming an output channel number of 256 and a convolution kernel size of 3 × 3, substituting these values into the ratio formula yields that the computational complexity of the convolution operations within the MBConv module is only 22.6% of that required by standard convolution.
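This figure can be verified with a quick numerical check of Equation (3):

```python
def mbconv_to_conv_flops_ratio(c_out: int, k_w: int, k_h: int) -> float:
    """Ratio from Equation (3): (K_w*K_h + 2*C_o) / (C_o * K_w * K_h)."""
    return (k_w * k_h + 2 * c_out) / (c_out * k_w * k_h)

print(f"{mbconv_to_conv_flops_ratio(256, 3, 3):.1%}")  # -> 22.6%
```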
The Effective Squeeze-and-Excitation (ESE) attention mechanism is a lightweight, improved module designed for channel-wise feature recalibration tasks. To address the channel dimensional collapse issue caused by the two fully connected layers (dimensionality reduction–expansion) paradigm employed by the classical SENet [24], the ESE module innovatively reconstructs the channel dependency modeling process. By directly learning global correlations among channels through a single fully connected layer and eliminating redundant dimensionality reduction operations, the ESE module maintains the integrity of original channel information while significantly reducing computational overhead. This design achieves accurate modeling of cross-channel interactions under the premise of reduced parameters. It provides an efficient computational foundation for decoupling fire features from complex backgrounds. In this paper, we propose a lightweight attention enhancement strategy, replacing the standard SE channel attention within the MBConv module with the ESE module, thereby constructing a composite convolutional unit termed EMB. The EMB module is integrated into the C2f architecture, replacing the original standard convolutional unit CBS (Conv, BN, SiLU), thus forming a novel C2f-EMB module.
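A minimal PyTorch sketch of the ESE block and of an EMB-style unit built around it is given below. The expansion ratio, kernel size, and residual rule are illustrative assumptions; the sketch is meant to show the single-layer channel attention and the pointwise–depthwise–pointwise layout rather than the exact EMB used in this paper.

```python
import torch
import torch.nn as nn

class ESE(nn.Module):
    """Effective Squeeze-and-Excitation: a single 1x1 conv over the pooled channel
    descriptor (no reduce-then-expand pair), followed by a sigmoid gate."""
    def __init__(self, channels: int):
        super().__init__()
        self.fc = nn.Conv2d(channels, channels, kernel_size=1)
        self.gate = nn.Sigmoid()

    def forward(self, x):
        w = self.gate(self.fc(x.mean(dim=(2, 3), keepdim=True)))
        return x * w

class EMB(nn.Module):
    """Sketch of an EMB-style unit: pointwise expand -> depthwise conv -> ESE -> pointwise project."""
    def __init__(self, c_in: int, c_out: int, expand: int = 2, k: int = 3):
        super().__init__()
        c_mid = c_in * expand
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1, bias=False), nn.BatchNorm2d(c_mid), nn.SiLU(),
            nn.Conv2d(c_mid, c_mid, k, padding=k // 2, groups=c_mid, bias=False),
            nn.BatchNorm2d(c_mid), nn.SiLU(),
            ESE(c_mid),
            nn.Conv2d(c_mid, c_out, 1, bias=False), nn.BatchNorm2d(c_out),
        )
        self.use_residual = c_in == c_out

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_residual else y

if __name__ == "__main__":
    print(EMB(128, 128)(torch.randn(1, 128, 40, 40)).shape)  # torch.Size([1, 128, 40, 40])
```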
In summary, the proposed multi-scale lightweight collaborative architecture effectively integrates the benefits of multi-scale network structure optimization and lightweight convolutional reconstruction of the C2f module. Specifically, by introducing a deeper 64× down-sampling pathway and complementary feature fusion operations, the redesigned network significantly enhances the detection capability for large-scale targets encountered in construction site fire scenarios. Concurrently, the innovative lightweight C2f-EMB module, which leverages MBConv with the optimized ESE attention mechanism, addresses the increased computational overhead resulting from deeper network structures. This integrated approach balances semantic representation, feature extraction efficiency, and computational resource management, ultimately achieving superior performance suitable for practical applications in complex construction site environments. In the subsequent sections of this paper, we refer to this architecture as the “MSCE” structure.

3.2. Inverted Residual Spatial–Channel Attention Module

The complexity and variability of construction site environments make it challenging for the backbone network to effectively distinguish foreground targets from background noise during feature extraction. This frequently leads the model to mistakenly identify irrelevant background features as important information, thus reducing the accuracy of object detection. Reducing interference from background noise during feature extraction has become a critical challenge for enhancing detection performance. To address this issue, this paper proposes a novel spatial–channel attention mechanism termed iCBAM. The network architecture of this module is illustrated in Figure 6.
This module integrates the channel–spatial collaborative modeling capability of the CBAM [25] attention mechanism with the efficient feature-propagation pathway of the inverted residual block [26] (IRB), significantly enhancing the network’s feature representation for flame and smoke regions while maintaining controlled computational overhead. The input feature map first passes through the CBAM module, where channel and spatial attention weights are computed sequentially. The computed attention weights are multiplied element-wise with the output of the pointwise convolution layer in the IRB module, achieving joint calibration across both channel and spatial dimensions to emphasize fire-related features. Subsequently, depthwise separable convolutions are applied to further extract feature information. Finally, a pointwise convolution adjusts the output channel count to match that of the input, and a residual connection merges the original input features to mitigate the vanishing-gradient issue.
In this work, the attention module is embedded within a lightweight inverted residual structure to fully leverage its efficient information flow and gradient propagation advantages. iCBAM not only enhances the network’s perception of critical regions while preserving the integrity of feature representations, but also effectively mitigates erroneous dependencies arising from excessive semantic focusing in deep-feature modeling by traditional attention mechanisms. This design suppresses the tendency of attention weights to overfit at high semantic levels, preventing the network from erroneously focusing on redundant regions unrelated to fire features, thereby enhancing the model’s discriminative robustness in complex scenarios.
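A hedged PyTorch sketch of one plausible wiring of iCBAM is shown below: CBAM’s channel and spatial attention recalibrate the input, and the result is passed through an inverted residual path (pointwise expand, depthwise convolution, pointwise project) with a skip connection. The exact arrangement in Figure 6, the expansion ratio, and the reduction factor may differ from this illustration.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CBAM channel attention: shared MLP over average- and max-pooled descriptors."""
    def __init__(self, c: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(c, c // reduction, 1, bias=False), nn.ReLU(),
            nn.Conv2d(c // reduction, c, 1, bias=False))

    def forward(self, x):
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    """CBAM spatial attention: 7x7 conv over channel-wise average and max maps."""
    def __init__(self, k: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2, bias=False)

    def forward(self, x):
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return torch.sigmoid(self.conv(s))

class iCBAM(nn.Module):
    """One plausible iCBAM wiring: CBAM-refined features feed an inverted residual path."""
    def __init__(self, c: int, expand: int = 2, k: int = 3):
        super().__init__()
        self.ca, self.sa = ChannelAttention(c), SpatialAttention()
        c_mid = c * expand
        self.pw1 = nn.Sequential(nn.Conv2d(c, c_mid, 1, bias=False),
                                 nn.BatchNorm2d(c_mid), nn.SiLU())
        self.dw = nn.Sequential(nn.Conv2d(c_mid, c_mid, k, padding=k // 2,
                                          groups=c_mid, bias=False),
                                nn.BatchNorm2d(c_mid), nn.SiLU())
        self.pw2 = nn.Sequential(nn.Conv2d(c_mid, c, 1, bias=False), nn.BatchNorm2d(c))

    def forward(self, x):
        a = x * self.ca(x)                          # channel recalibration
        a = a * self.sa(a)                          # spatial recalibration
        return x + self.pw2(self.dw(self.pw1(a)))   # inverted residual path + skip

if __name__ == "__main__":
    print(iCBAM(256)(torch.randn(1, 256, 20, 20)).shape)  # torch.Size([1, 256, 20, 20])
```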

3.3. Improvement of the Boundary Loss Function

In construction site fire detection tasks, background occlusions and overlapping targets in complex construction environments significantly interfere with the accurate localization of candidate bounding boxes. The YOLOv8 model employs a combination of Distribution Focal Loss (DFL) and Complete Intersection over Union [27] (CIoU) loss to jointly optimize bounding box regression. The core formulas for CIoU are defined as follows in Equations (4)–(6):
$CIoU = IoU - \dfrac{\rho^2(b, b^{gt})}{c^2} - \alpha v$
$\alpha = \dfrac{v}{(1 - IoU) + v}$
$v = \dfrac{4}{\pi^2}\left(\arctan\dfrac{w^{gt}}{h^{gt}} - \arctan\dfrac{w}{h}\right)^2$
In the equations above, $\rho(b, b^{gt})$ represents the Euclidean distance between the center points of the predicted and ground-truth bounding boxes; $\alpha$ denotes the aspect ratio weighting factor; $c$ is the diagonal length of the smallest enclosing box; and $v$ measures the aspect ratio distortion. Due to variations in data acquisition devices, shooting distances, and illumination conditions, the quality of the collected fire images varies considerably, which in turn leads to differences in the quality of annotated samples. The geometric characteristics of construction site fire images also differ significantly, and the original CIoU function neglects differences in learning difficulty among samples, making it prone to harmful gradient gains from low-quality samples (e.g., blurred flames or distorted smoke). To address these problems, this paper adopts the dynamic gradient optimization-based WIoUv3 [28] loss function, which balances the gradient contributions from samples of varying quality through an outlier degree mechanism. The formulas are defined as Equations (7)–(9):
$L_{WIoUv1} = R_{WIoU} \cdot L_{IoU}, \quad L_{IoU} = 1 - \dfrac{W_i H_i}{S_u}$
$R_{WIoU} = \exp\left(\dfrac{(x - x_{gt})^2 + (y - y_{gt})^2}{W_g^2 + H_g^2}\right)$
$L_{WIoUv3} = r \cdot L_{WIoUv1}, \quad r = \dfrac{\beta}{\delta \alpha^{\beta - \delta}}, \quad \beta = \dfrac{L_{IoU}^{*}}{\overline{L_{IoU}}}$
Equations (7) and (8) give the formulation of WIoU v1, which incorporates a distance attention mechanism to amplify the loss contribution of anchors of average quality while reducing the loss from high-quality anchors. WIoU v3 builds upon the distance attention mechanism of WIoU v1 by incorporating a non-monotonic focusing factor $r$ and an outlier degree $\beta$, where $\beta$ is defined as the ratio of the current IoU loss $L_{IoU}^{*}$ (detached from gradient propagation) to its running mean $\overline{L_{IoU}}$. This metric reflects the quality of the anchor box and enables the model to adaptively adjust the gradient contributions of different samples. WIoU v3 assigns smaller gradient gains to samples exhibiting extreme outlier degrees, enabling the loss function to focus primarily on anchors of moderate quality. This strategy effectively mitigates harmful gradients caused by low-quality samples, thereby improving the detection performance of the enhanced model.
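A hedged sketch of the WIoU v3 weighting, following Equations (7)–(9) for axis-aligned boxes in (x1, y1, x2, y2) form, is shown below. The focusing hyperparameters α and δ and the momentum of the running mean of the IoU loss are illustrative choices, not values reported in this paper.

```python
import torch

class WIoUv3Loss:
    """Sketch of WIoU v3: distance attention R_WIoU times a non-monotonic focusing factor r."""
    def __init__(self, alpha: float = 1.9, delta: float = 3.0, momentum: float = 0.01):
        self.alpha, self.delta, self.momentum = alpha, delta, momentum
        self.iou_loss_mean = 1.0  # running mean of the IoU loss, \bar{L}_IoU

    def __call__(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # IoU loss L_IoU = 1 - intersection / union.
        lt = torch.max(pred[:, :2], target[:, :2])
        rb = torch.min(pred[:, 2:], target[:, 2:])
        wh = (rb - lt).clamp(min=0)
        inter = wh[:, 0] * wh[:, 1]
        area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
        area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
        l_iou = 1.0 - inter / (area_p + area_t - inter)

        # Distance attention R_WIoU (Equation (8)): normalized centre distance within
        # the smallest enclosing box of size (W_g, H_g), detached from the gradient.
        cxp, cyp = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
        cxt, cyt = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
        wg = (torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])).detach()
        hg = (torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])).detach()
        r_wiou = torch.exp(((cxp - cxt) ** 2 + (cyp - cyt) ** 2) / (wg ** 2 + hg ** 2))

        # Outlier degree beta and non-monotonic focusing factor r (Equation (9)).
        beta = l_iou.detach() / self.iou_loss_mean
        r = beta / (self.delta * self.alpha ** (beta - self.delta))
        self.iou_loss_mean = ((1 - self.momentum) * self.iou_loss_mean
                              + self.momentum * l_iou.detach().mean().item())
        return (r * r_wiou * l_iou).mean()

if __name__ == "__main__":
    loss_fn = WIoUv3Loss()
    pred = torch.tensor([[10.0, 10.0, 50.0, 60.0]], requires_grad=True)
    gt = torch.tensor([[12.0, 8.0, 48.0, 58.0]])
    print(loss_fn(pred, gt))
```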

4. Experiments and Results

4.1. Evaluation Metrics

Evaluation metrics for fire detection systems primarily focus on accuracy and response speed. Therefore, this study employs Precision, Recall, mean Average Precision (mAP), and Giga Floating-Point Operations (GFLOPs) as the main evaluation metrics [29]. Firstly, it is essential to clarify several fundamental cases within detection results: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). TP refers to the number of positive samples correctly predicted as positive; TN denotes the number of negative samples correctly predicted as negative; FP denotes the number of negative samples incorrectly classified as positive; and FN represents the number of positive samples incorrectly classified as negative.
Precision represents the proportion of correctly predicted results among all predicted outcomes, reflecting the accuracy of the model’s predictions. It is calculated as Equation (10):
$\text{Precision} = \dfrac{TP}{TP + FP}$
Recall represents the proportion of correctly predicted results among all actual positive instances, reflecting the comprehensiveness of the model’s predictions. It is calculated as Equation (11):
$\text{Recall} = \dfrac{TP}{TP + FN}$
By setting precision as the Y-axis and recall as the X-axis, a rectangular coordinate system is established to plot the Precision-Recall (PR) curve. The area enclosed by the PR curve and the coordinate axes represents the Average Precision (AP). When the dataset contains multiple classes, the mean Average Precision (mAP) is the average of AP values across all categories, calculated as Equation (12):
$\text{mAP} = \dfrac{1}{N}\sum_{i=1}^{N}\text{AP}_i$
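As a small illustration of Equation (12), the snippet below approximates the AP of one class as the area under an example precision–recall curve (trapezoidal rule) and averages it with an assumed AP for the second class; the numbers are purely illustrative, not measured values.

```python
import numpy as np

# Illustrative precision-recall points for the "Fire" class (not measured values).
recall = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
precision = np.array([1.0, 0.95, 0.90, 0.80, 0.60, 0.40])

# AP = area under the PR curve, approximated here with the trapezoidal rule.
ap_fire = float(np.sum((recall[1:] - recall[:-1]) * (precision[1:] + precision[:-1]) / 2))
ap_smoke = 0.55  # assumed AP for the "Smoke" class

print(f"mAP = {(ap_fire + ap_smoke) / 2:.3f}")
```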
GFLOPs is a critical metric for measuring the computational complexity of a model, representing the number of floating-point operations (in billions) required for a single forward pass. It quantifies the computational overhead of a model during inference by counting the multiply–add operations within each computational module of the deep neural network and normalizing the total into billions of floating-point operations.
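One common way to estimate this figure, shown below as a sketch, is to count multiply–accumulate operations for a single forward pass with the thop package (assumed installed) and convert the count to billions of operations; a torchvision model stands in for the detector, and the reported value depends on the input resolution.

```python
import torch
from thop import profile
from torchvision.models import resnet18  # stand-in model for illustration

model = resnet18()
dummy = torch.randn(1, 3, 640, 640)       # one forward pass at 640 x 640
macs, params = profile(model, inputs=(dummy,))
print(f"GFLOPs: {2 * macs / 1e9:.2f}, Params: {params / 1e6:.2f} M")
```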

4.2. Experimental Environment and Hyperparameter Settings

To ensure the scientific rigor and fairness of the experiments, all comparative experiments were conducted on the same software and hardware platform, with training hyperparameters strictly controlled. The experimental environment configuration is presented in Table 1:
Among them, “Epochs” refers to the number of complete training iterations; “Patience” indicates the maximum number of consecutive epochs allowed without performance improvement before early stopping is triggered; “Batch” denotes the batch size, that is, the number of samples used for each parameter update; and “lr0” represents the initial learning rate set at the start of training.
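For reference, a training run with such hyperparameters might look as follows using the Ultralytics API; the concrete values and file names are placeholders rather than the settings of Table 1, which is not reproduced here.

```python
from ultralytics import YOLO

model = YOLO("yolov8s.yaml")          # baseline architecture; a modified YAML would define YOLO-Fire
model.train(
    data="fire5000.yaml",             # hypothetical dataset configuration file
    epochs=200,                       # "Epochs": complete passes over the training set
    patience=50,                      # early stopping after 50 epochs without improvement
    batch=16,                         # samples per parameter update
    lr0=0.01,                         # initial learning rate
    imgsz=640,
)
```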

4.3. Ablation Study

4.3.1. Results and Analysis

To validate the performance of the proposed algorithm for fire detection in complex construction site scenarios, systematic experiments were conducted on our self-built Fire-5000 dataset. The dataset was split into training and validation sets at a 4:1 ratio, ensuring a balanced class distribution. Experiments were conducted using YOLOv8s as the baseline model, ensuring a fair comparison under identical hyperparameter configurations.
To comprehensively assess the impact of each module on model performance, ablation experiments were conducted by incrementally introducing the proposed improvement strategies into the baseline YOLOv8s model. A multidimensional quantitative analysis of each strategy’s impact on model performance was performed, and the results are presented in Table 2.
By incorporating an additional detection branch with a 64× down-sampling ratio into the baseline model, we aimed to improve its perceptual capability to large-scale fire targets. Experiment (1) demonstrates that this architecture increases the model’s precision by 3.2% and raises mAP from 69.7% to 70.4%, thereby validating the effectiveness of the added scale branch for detecting large-scale targets. However, this modification also introduces additional convolutional pathways, increasing the parameter count to 21.15 million and reducing inference speed to 117 FPS, thereby significantly undermining the model’s real-time detection performance.
In Experiment (2), a lightweight convolution-reconstructed C2f-EMB module was introduced into the original YOLOv8 architecture to replace the conventional C2f module. This module integrates the advantages of depthwise separable convolutions and bottleneck architectures, markedly reducing computational redundancy while enhancing feature propagation efficiency and inter-channel representation capacity, thereby effectively improving the model’s discrimination performance for fire targets. Experimental results show that this improvement elevated the mAP to 71.2% and the recall to 66.1%, achieving the best performance among all single-module enhancement schemes and fully validating the practical value of the C2f-EMB structure in object-detection tasks under complex environments.
Given the stringent real-time performance requirements of fire-detection tasks, this study integrates the iCBAM attention mechanism at a single location above the SPPF module in the backbone network. Results from Experiment (3) demonstrate that this module significantly improves the model’s precision, yielding a 3.1% increase, but gains in other performance metrics are comparatively modest. Without compromising recall, iCBAM effectively enhances the model’s discrimination of positive-class targets and suppresses interference from background noise.
The improvement of the loss function primarily targets gradient allocation without altering the model’s feature extraction and representation capabilities; therefore, it is unlikely to yield a substantive performance leap on its own. Results show that incorporating the WIoU loss function yields a modest improvement in detection performance without increasing the model’s computational overhead.
By jointly integrating the multi-scale detection structure and the lightweight feature-enhancement module C2f-EMB, we construct the proposed “Multi-Scale Lightweight Collaborative Architecture.” Compared to the multi-scale structure, this architecture significantly enhances the model’s recall (improving it by 1.1%) and increases mAP by 0.4%, while maintaining precision and mAP50-95 performance. Furthermore, owing to the optimized design of the lightweight convolutional units, this scheme demonstrates superior computational efficiency, reducing the parameter count by 28% and the FLOPs by 17%, thereby achieving a favorable balance between detection performance and computational overhead. Experimental results indicate that, although the proposed architecture does not universally outperform every single optimization strategy on certain evaluation metrics, it achieves better comprehensive performance on key indicators such as precision, recall, mAP, and computational overhead. Finally, building upon the aforementioned optimizations, the iCBAM attention mechanism and the WIoU loss function are sequentially introduced to construct the final improved algorithm, YOLO-Fire, yielding a second performance leap. Compared to the baseline YOLOv8s model, YOLO-Fire achieves significant improvements across multiple core metrics: precision increases by 4.6%, recall by 3.0%, mAP by 1.6%, and mAP50-95 by 1.5%, thereby fully demonstrating the superiority of the proposed algorithm.

4.3.2. Visualization Results

In the following sections, this paper will comparatively analyze the performance of the different enhancement strategies in practical detection scenarios. This comparative experiment involves the baseline model, YOLOv8s, the enhanced models, YOLOv8 + MSCE and YOLOv8 + MSCE + iCBAM, and the final model, YOLO-Fire. The aforementioned models were applied to detect the same construction site fire image, and the visualized comparative experimental results are presented below.
Figure 7 indicates that because the baseline YOLOv8 lacks an effective detection mechanism for large-scale objects, it erroneously fragments a single large smoke target into two independent instances, each assigned a low confidence score. With the incorporation of the MSCE architecture, the addition of a dedicated detection layer for large-scale objects equips the model with enhanced perception of wide-ranging targets, enabling accurate identification of smoke instances; however, the confidence scores still leave room for improvement. Upon incorporation of the iCBAM attention mechanism, the model’s feature representation is strengthened, thereby facilitating the precise identification of smoke instances. Finally, by integrating multiple improvement strategies, the YOLO-Fire model is formed. This model combines the MSCE architecture’s capability for effective large-scale object capture with the iCBAM attention mechanism’s enhanced feature representation, thereby accurately identifying large-scale smoke targets while significantly improving the confidence scores of detected objects. The aforementioned visualization experiments validate the effectiveness of the proposed fusion algorithm in practical fire-detection scenarios.
To intuitively illustrate the optimization effects of the YOLO-Fire model, this section plots the curves of key performance metrics over training iterations. A comparative analysis with the baseline YOLOv8s model clearly demonstrates the enhancements in detection performance afforded by the proposed improvement strategies.
As shown in Figure 8, the YOLO-Fire model attains its peak performance at the 150th iteration. In the latter half of the training curve, all four evaluation metrics of YOLO-Fire significantly exceed those of the baseline model, indicating that the proposed algorithm is better suited to fire-detection tasks in construction site scenarios.

4.4. Comparative Experiments with Classical/SOTA Models

To further validate the effectiveness of the proposed algorithm, this section conducts a comprehensive comparative analysis of its performance against classical and state-of-the-art (SOTA) algorithms on our self-built construction site fire dataset, thereby more clearly demonstrating the advantages and characteristics of the proposed method. Since the compared models differ in size specifications and underlying principles, comparing computational complexity and real-time metrics such as FPS would lack rigor. The detection performance of the selected models on the self-built construction site fire dataset was therefore compared across four metrics: precision, recall, mAP, and mAP50-95.
Table 3. Performance comparison of various models.
Model | Precision (%) | Recall (%) | mAP (%) | mAP50-95 (%)
YOLOv8 | 68.3 | 64.3 | 69.7 | 35.7
YOLOv5 | 71.2 | 63.2 | 70.1 | 35.2
YOLOv9 [30] | 71.7 | 65.6 | 70.9 | 36.5
YOLOv10 | 69 | 36.6 | 69.3 | 35.6
YOLOv11 | 69.9 | 64.1 | 70.4 | 36.2
RT-DETR | 72.1 | 61.2 | 65.1 | 31.8
Hyper-YOLO [31] | 68.1 | 65.3 | 68.9 | 35.4
Mamba-YOLO [32] | 70.3 | 65.9 | 71.0 | 37.3
YOLO-Fire | 72.9 | 67.3 | 71.3 | 37.2
Analysis of the data in Table 3 reveals that the YOLO-Fire algorithm proposed in this paper exhibits significant performance advantages in construction site fire detection tasks. In terms of precision, the YOLO-Fire algorithm achieved 72.9%, which is not only markedly higher than the baseline YOLOv8s model (68.3%) but also exceeds the current state-of-the-art RT-DETR [33] model (72.1%). With respect to recall, YOLO-Fire likewise demonstrates outstanding performance, achieving 67.3% and comprehensively outperforming all other compared models. Furthermore, on the mAP metric reflecting overall detection performance, YOLO-Fire achieved the best result of 71.3%; on the more stringent mAP50-95 metric, YOLO-Fire reached 37.2%, closely approaching the top-performing Mamba-YOLO (37.3%). In summary, the YOLO-Fire algorithm demonstrates remarkable performance advantages and high stability in construction site fire detection tasks, thereby fully validating the effectiveness of the proposed MSCE architecture and attention mechanism optimization strategies.

4.5. Cross-Domain Generalization Evaluation

To further evaluate the generalization capability of the proposed YOLO-Fire algorithm across diverse fire scenarios and to validate its effectiveness and robustness in complex application environments, this subsection selects the publicly available D-Fire and FASDD [34] fire-detection datasets for cross-domain generalization comparison experiments. These two datasets differ markedly from the construction site fire dataset used in this study in terms of fire scenario types, background complexity, and object-scale distributions, thus providing an objective assessment of the model’s generalization and adaptability to unknown environments.

4.5.1. FASDD Dataset

In this study, we employ the FASDD dataset—a remote sensing–based flame and smoke target detection dataset for deep learning—publicly released by Wuhan University in 2024 via the Scientific Data Bank. The dataset comprises fire images captured at varying viewing distances, across different scenarios and illumination conditions, and acquired using a variety of visual sensors. It contains on the order of 10^5 images, with 113,154 flame object instances and 73,072 smoke object instances, making it the largest and most broadly representative flame-and-smoke detection dataset to date. The included scenarios are illustrated in Figure 9.
This experiment compared the detection performance of the baseline YOLOv8s model and the proposed YOLO-Fire on the FASDD dataset; the results are shown in Table 4.
Compared with the baseline model, the proposed YOLO-Fire algorithm achieved improvements of 1.3%, 0.3%, and 0.8% in precision, recall, and mAP on the FASDD dataset, respectively, thereby validating the effectiveness of the proposed enhancement strategies in cross-domain scenarios. However, the recall for the smoke category exhibited a slight decline of 0.5 percentage points. Analysis indicates that the primary reason is that, in the FASDD dataset, the average scale of smoke targets captured from UAV and satellite perspectives is smaller compared to those in our self-built dataset. This results in an excessively large receptive field in the large-object detection layer during feature fusion, which dilutes local fine-grained information and causes missed detections of small-scale smoke targets. Nevertheless, YOLO-Fire still outperforms the baseline model across all other key metrics, indicating that the multi-scale collaborative optimization architecture (MSCE) and the dynamic gradient allocation strategy proposed in this paper substantially enhance the model’s feature decoupling capability, thereby achieving stable generalization across diverse detection tasks.

4.5.2. D-Fire Dataset

The D-Fire [35] (Drone-based Fire) dataset is a UAV fire-imagery dataset specifically designed by the Brazilian GAIA research team for machine learning and object-detection algorithms, comprising over 21,000 images. The dataset’s images are categorized into four classes: 1164 images containing only flames, 5867 images containing only smoke, 4658 images containing both flames and smoke, and 9838 distractor images containing neither flames nor smoke. A total of 14,692 flame instances and 11,865 smoke instances were annotated [36]. The D-Fire dataset was collected primarily in outdoor wildfire environments such as forests and grasslands, which differ substantially from the construction site fire scenarios addressed in this study. Therefore, selecting the D-Fire dataset for cross-domain generalization experiments facilitates an objective validation of the proposed algorithm’s generalization capability and adaptability across different fire scenarios, thereby ensuring scientific rigor. The models compared in this experiment are YOLOv8, YOLOv10 [37], YOLOv11, and the proposed YOLO-Fire; the experimental results are presented in Table 5.
As shown by the comparative experiments on the D-Fire dataset (Table 5), the YOLO-Fire algorithm proposed in this paper demonstrates clear performance advantages. The YOLO-Fire algorithm outperforms all comparison models across precision, recall, and mAP metrics. The experimental results indicate that the proposed optimization strategies effectively enhance the model’s adaptability and generalization performance across diverse fire scenarios, enabling the YOLO-Fire algorithm to maintain stable and superior performance in complex and variable fire-detection environments, thus offering clear application advantages.
Figure 10 illustrates the visualized performance of YOLO-Fire and other one-stage object detection algorithms on the D-Fire dataset, highlighting the differences in detection performance under real-world fire scenarios.
Figure 10a presents the detection results of each algorithm under environmental interference conditions. In this instance, YOLOv8 missed a fire instance due to occlusion by a foreground vehicle, whereas YOLOv10 erroneously identified the worker’s red jacket on the right as a fire target. In contrast, both YOLOv11 [38] and the proposed YOLO-Fire algorithm successfully detected all fire targets, with YOLO-Fire achieving higher confidence scores.
Figure 10b presents the detection results in a multi-scale wildfire scenario. YOLOv8 failed to detect one small-scale flame instance, whereas the other algorithms successfully identified all targets, demonstrating superior multi-scale adaptability and detection precision.
Figure 10c illustrates detection performance for targets at the image periphery. Among all compared methods, only the proposed YOLO-Fire algorithm accurately detected the flame instance located at the left border of the image and achieved the highest confidence score.
In summary, both in theoretical metrics and practical detection performance, YOLO-Fire significantly outperforms comparable algorithms, indicating that the proposed method possesses strong cross-domain generalization capabilities.

5. Discussion

Although the YOLO-Fire algorithm has achieved significant improvements in detecting fires within construction site environments, several limitations remain that deserve further attention. Firstly, while the algorithm demonstrates adequate real-time performance under current hardware conditions (achieving 144 FPS), the computational complexity of the proposed model could still pose deployment challenges in scenarios with extremely limited computational resources or embedded devices [39]. Future work should explore additional lightweight optimization techniques to further reduce the computational overhead.
Secondly, despite robust performance in typical fire scenarios, the proposed model exhibits slight weaknesses when dealing with extremely small-scale smoke targets, as revealed in cross-domain experiments on the FASDD dataset. This indicates that the multi-scale structure could potentially dilute fine-grained local features during feature fusion, affecting the detection of very small targets. Subsequent studies should consider adaptive feature fusion mechanisms to better preserve and utilize fine-grained information.
Lastly, the current dataset, though comprehensive, still lacks sufficient representation of extremely rare fire scenarios, such as chemical or electrical fires specific to certain construction materials or methods. Expanding dataset coverage through targeted data collection could further enhance the robustness and practical applicability of the algorithm.

6. Conclusions

This study proposes an improved YOLOv8-based fire detection algorithm named YOLO-Fire, specifically optimized for construction site environments. Key innovations include a multi-scale lightweight collaborative architecture (MSCE) enhancing detection across diverse scales, the iCBAM attention module improving feature discrimination and suppressing background interference, and the dynamic gradient allocation strategy utilizing the WIoUv3 loss function. Experimental validation on the self-constructed Fire-5000 dataset demonstrates significant improvements: precision increased by 4.6%, recall by 3.0%, mAP by 1.6%, and mAP50-95 by 1.5% compared to baseline YOLOv8s. Furthermore, cross-domain tests on public datasets FASDD and D-Fire confirm the robustness and generalizability of YOLO-Fire. Overall, the proposed model effectively addresses the complexity of construction site fire detection tasks, achieving a balance between detection accuracy and computational efficiency.

Author Contributions

Conceptualization, H.S.; methodology, H.S.; software, H.S.; validation, H.S.; formal analysis, H.S.; investigation, H.S.; resources, H.S.; data curation, H.S.; writing—original draft preparation, H.S.; writing—review and editing, H.S. and T.Y.; visualization, H.S.; supervision, T.Y.; project administration, T.Y.; funding acquisition, T.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Innovation Capacity Promotion Project for Technology-based SMEs in Shandong Province: “AI-empowered Smart Construction Site Platform Project”, grant number 2023TSGC0877.

Institutional Review Board Statement

The study in this paper did not involve humans or animals.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Alkhatib, A.A. A review on forest fire detection techniques. Int. J. Distrib. Sens. Netw. 2014, 10, 597368.
2. Jin, C.; Wang, T.; Alhusaini, N.; Zhao, S.; Liu, H.; Xu, K.; Zhang, J. Video fire detection methods based on deep learning: Datasets, methods, and future directions. Fire 2023, 6, 315.
3. Vasconcelos, R.N.; Franca Rocha, W.J.; Costa, D.P.; Duverger, S.G.; Santana, M.M.D.; Cambui, E.C.; Ferreira-Ferreira, J.; Oliveira, M.; Barbosa, L.D.S.; Cordeiro, C.L. Fire Detection with Deep Learning: A Comprehensive Review. Land 2024, 13, 1696.
4. Choi, S.; Kim, S.; Jung, H. Optimized Faster R-CNN with Swin Transformer for Robust Multi-Class Wildfire Detection. Fire 2025, 8, 180.
5. Cheknane, M.; Bendouma, T.; Boudouh, S.S. Advancing fire detection: Two-stage deep learning with hybrid feature extraction using Faster R-CNN approach. Signal Image Video Process. 2024, 18, 5503–5510.
6. Gragnaniello, D.; Greco, A.; Sansone, C.; Vento, B. FLAME: Fire detection in videos combining a deep neural network with a model-based motion analysis. Neural Comput. Appl. 2025, 37, 6181–6197.
7. Mukhiddinov, M.; Abdusalomov, A.B.; Cho, J. A wildfire smoke detection system using unmanned aerial vehicle images based on the optimized YOLOv5. Sensors 2022, 22, 9384.
8. Liu, H.; Zhang, F.; Xu, Y.; Wang, J.; Lu, H.; Wei, W.; Zhu, J. TFNet: Transformer-based multi-scale feature fusion forest fire image detection network. Fire 2025, 8, 59.
9. Li, Y.; Nie, L.; Zhou, F.; Liu, Y.; Fu, H.; Chen, N.; Dai, Q.; Wang, L. Improving Fire and Smoke Detection with You Only Look Once 11 and Multi-Scale Convolutional Attention. Fire 2025, 8, 165.
10. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
11. Jocher, G.; Stoken, A.; Borovec, J.; Changyu, L.; Hogan, A.; Diaconu, L.; Poznanski, J.; Yu, L.; Rai, P.; Ferriday, R. ultralytics/yolov5, version 3.0; Zenodo: Geneva, Switzerland, 2020.
12. Sohan, M.; Sai Ram, T.; Rami Reddy, C.V. A review on YOLOv8 and its advancements. In Proceedings of the International Conference on Data Intelligence and Cognitive Informatics, Tirunelveli, India, 18–20 November 2024; pp. 529–545.
13. Wang, C.-Y.; Liao, H.-Y.M.; Yeh, I.-H. Designing network design strategies through gradient path analysis. arXiv 2022, arXiv:2211.04800.
14. Ramachandran, P.; Zoph, B.; Le, Q.V. Searching for activation functions. arXiv 2017, arXiv:1710.05941.
15. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
16. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
17. Bakurov, I.; Buzzelli, M.; Schettini, R.; Castelli, M.; Vanneschi, L. Structural similarity index (SSIM) revisited: A data-driven approach. Expert Syst. Appl. 2022, 189, 116087.
18. Chen, G.; Zhou, H.; Li, Z.; Gao, Y.; Bai, D.; Xu, R.; Lin, H. Multi-scale forest fire recognition model based on improved YOLOv5s. Forests 2023, 14, 315.
19. Rasheed, A.F.; Zarkoosh, M. Optimized YOLOv8 for multi-scale object detection. J. Real-Time Image Process. 2025, 22, 6.
20. Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
21. Zhou, N.; Gao, D.; Zhu, Z. YOLOv8n-SMMP: A Lightweight YOLO Forest Fire Detection Model. Fire 2025, 8, 183.
22. Koonce, B. EfficientNet. In Convolutional Neural Networks with Swift for TensorFlow: Image Recognition and Dataset Categorization; Springer: Berlin/Heidelberg, Germany, 2021; pp. 109–123.
23. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520.
24. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
25. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
26. Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114.
27. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 12993–13000.
28. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding box regression loss with dynamic focusing mechanism. arXiv 2023, arXiv:2301.10051.
29. Padilla, R.; Netto, S.L.; Da Silva, E.A. A survey on performance metrics for object-detection algorithms. In Proceedings of the 2020 International Conference on Systems, Signals and Image Processing (IWSSIP), Niterói, Brazil, 1–3 July 2020; pp. 237–242.
30. Wang, C.-Y.; Yeh, I.-H.; Mark Liao, H.-Y. YOLOv9: Learning what you want to learn using programmable gradient information. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 1–21.
31. Feng, Y.; Huang, J.; Du, S.; Ying, S.; Yong, J.-H.; Li, Y.; Ding, G.; Ji, R.; Gao, Y. Hyper-YOLO: When visual object detection meets hypergraph computation. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 47, 2388–2401.
32. Wang, Z.; Li, C.; Xu, H.; Zhu, X. Mamba YOLO: SSMs-based YOLO for object detection. arXiv 2024, arXiv:2406.05835.
33. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974.
34. Wang, M.; Jiang, L.; Yue, P.; Yu, D.; Tuo, T. FASDD: An open-access 100,000-level flame and smoke detection dataset for deep learning in fire detection. Earth Syst. Sci. Data Discuss. 2023, 2023, 1–26.
35. de Venâncio, P.V.A.; Campos, R.J.; Rezende, T.M.; Lisboa, A.C.; Barbosa, A.V. A hybrid method for fire detection based on spatial and temporal patterns. Neural Comput. Appl. 2023, 35, 9349–9361.
36. de Venâncio, P.V.A.; Lisboa, A.C.; Barbosa, A.V. An automatic fire detection system based on deep convolutional neural networks for low-power, resource-constrained devices. Neural Comput. Appl. 2022, 34, 15349–15368.
37. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. YOLOv10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011.
38. Khanam, R.; Hussain, M. YOLOv11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725.
39. Zhao, C.; Zhao, L.; Zhang, K.; Ren, Y.; Chen, H.; Sheng, Y. Smoke and Fire-You Only Look Once: A Lightweight Deep Learning Model for Video Smoke and Flame Detection in Natural Scenes. Fire 2025, 8, 104.
Figure 1. Annotating fire images using LabelMe software.
Figure 2. Scale distribution of dataset targets: in (a), the X and Y axes represent the relative horizontal and vertical positions; in (b), the X and Y axes represent the relative width and height.
Figure 3. Multi-scale performance of the YOLOv8 model in construction site fire scenarios.
Figure 4. Expansion of deep feature extraction layers.
Figure 5. Network structure after multi-scale optimization.
Figure 6. iCBAM attention mechanism network architecture.
Figure 7. Comparison of the practical effectiveness of the optimization algorithm.
Figure 8. Iteration curves of performance metrics.
Figure 9. Fire images from the FASDD dataset.
Figure 10. Analysis of the generalization performance of YOLO-Fire: (a) environmental interference scenarios; (b) small-scale fire target; (c) edge target detection.
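For readers relating Figure 1 to Figure 2: the LabelMe rectangle annotations shown in Figure 1 must be converted into YOLO-style normalized coordinates before training, and those normalized values are exactly what Figure 2 plots. The sketch below is a minimal illustration of that conversion, assuming standard LabelMe JSON output and a hypothetical class list and file name; it is not the authors' actual preprocessing script.

```python
import json

# Hypothetical class list; the dataset's real class names may differ.
CLASSES = ["fire", "smoke"]

def labelme_to_yolo(json_path):
    """Convert LabelMe rectangle annotations into YOLO label lines:
    class_id, x_center, y_center, width, height (all normalized to [0, 1])."""
    with open(json_path, "r", encoding="utf-8") as f:
        ann = json.load(f)
    img_w, img_h = ann["imageWidth"], ann["imageHeight"]
    lines = []
    for shape in ann["shapes"]:
        if shape.get("shape_type") != "rectangle":
            continue
        (x1, y1), (x2, y2) = shape["points"]
        # Normalized center position (Figure 2a) and box size (Figure 2b).
        xc = (x1 + x2) / 2.0 / img_w
        yc = (y1 + y2) / 2.0 / img_h
        w = abs(x2 - x1) / img_w
        h = abs(y2 - y1) / img_h
        cls = CLASSES.index(shape["label"])
        lines.append(f"{cls} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}")
    return lines

# Example usage with a hypothetical annotation file:
# print(labelme_to_yolo("fire_0001.json"))
```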
Table 1. Configuration of the experimental environment and hyperparameters.

Category | Configuration | Value
Hardware Platform | CPU | Intel Xeon E5-2609 v4 @ 1.70 GHz
Hardware Platform | GPU | Nvidia RTX 3090 (24 GB) × 2
Hardware Platform | Memory | Kingston DDR4 3200 MHz, 128 GB
Hardware Platform | Hard Disk | DELL SATA, 32 TB
Software Platform | OS | Ubuntu 22.04 LTS
Software Platform | Framework | PyTorch 2.2, CUDA 11.3
Software Platform | Container | Docker v5.1.3
Hyperparameters | Epochs | 300
Hyperparameters | Patience | 50
Hyperparameters | Batch | 16
Hyperparameters | Learning Rate | 0.01
Hyperparameters | Optimizer | SGD
Hyperparameters | Momentum | 0.937
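The hyperparameters in Table 1 map directly onto a standard Ultralytics training call. The following minimal sketch shows one way to launch training with these settings using the Ultralytics YOLOv8 Python API; the dataset YAML name is hypothetical, and the baseline weights stand in for the modified YOLO-Fire structure, so this is an illustration rather than the authors' exact script.

```python
from ultralytics import YOLO

# Baseline YOLOv8s weights; the improved YOLO-Fire architecture would be
# defined in a custom model YAML instead (not shown here).
model = YOLO("yolov8s.pt")

# Hyperparameters taken from Table 1; "site_fire.yaml" is a hypothetical
# dataset configuration file describing the construction site fire data.
model.train(
    data="site_fire.yaml",
    epochs=300,
    patience=50,
    batch=16,
    lr0=0.01,
    optimizer="SGD",
    momentum=0.937,
)
```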
Table 2. Results of the ablation study.

ID | A | B | C | D | Precision | Recall | mAP50 | mAP50-95 | Params | GFLOPs | FPS
0 |   |   |   |   | 68.3 | 64.3 | 69.7 | 35.7 | 11,126,358 | 28.4 | 230
1 | ✓ |   |   |   | 71.5 | 65.2 | 70.4 | 36.4 | 21,158,870 | 28.7 | 117
2 |   | ✓ |   |   | 70.8 | 66.1 | 71.2 | 36.3 | 12,444,534 | 23.4 | 153
3 |   |   | ✓ |   | 71.4 | 64.9 | 70.0 | 35.8 | 11,428,024 | 28.7 | 218
4 |   |   |   | ✓ | 68.7 | 65.1 | 70.1 | 35.6 | 11,126,358 | 28.4 | 229
5 | ✓ | ✓ |   |   | 71.4 | 66.3 | 70.8 | 36.3 | 15,217,976 | 23.7 | 145
6 | ✓ | ✓ | ✓ |   | 72.8 | 65.0 | 71.1 | 36.6 | 20,519,642 | 23.8 | 145
7 | ✓ | ✓ | ✓ | ✓ | 72.9 | 67.3 | 71.3 | 37.2 | 20,519,642 | 23.8 | 144

A refers to multi-scale detection; B refers to the C2f-EMB module; C denotes the iCBAM attention mechanism; and D refers to the WIoU loss function.
Table 4. Comparative experiments on the FASDD dataset.

Target Objects | YOLOv8s Precision | YOLOv8s Recall | YOLOv8s mAP | YOLO-Fire Precision | YOLO-Fire Recall | YOLO-Fire mAP
Fire | 71.5 | 64.7 | 72.6 | 71.7 | 65.7 | 73.4
Smoke | 81.9 | 72.7 | 81.6 | 84.5 | 72.2 | 82.4
Average | 76.7 | 68.7 | 77.1 | 78.1 | 69.0 | 77.9
Table 5. Performance on the D-Fire dataset.

Model | Precision | Recall | AP (Fire) | AP (Smoke) | mAP
YOLOv8 | 78.3 | 72.2 | 72.6 | 84.4 | 78.5
YOLOv10 | 79.9 | 72.4 | 72.8 | 85.5 | 79.2
YOLOv11 | 79.1 | 73.3 | 73.2 | 85.0 | 79.1
YOLO-Fire | 82.3 | 73.7 | 74.4 | 85.9 | 80.1
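The mAP column in Table 5 is the unweighted mean of the two per-class APs (fire and smoke); for example, YOLO-Fire's (74.4 + 85.9) / 2 = 80.15 agrees with the reported 80.1 to within rounding. The short sketch below reproduces the column from the per-class values as a sanity check.

```python
# Per-class APs taken from Table 5; mAP is their mean over the two
# classes, which matches the reported column up to rounding.
results = {
    "YOLOv8":    (72.6, 84.4),
    "YOLOv10":   (72.8, 85.5),
    "YOLOv11":   (73.2, 85.0),
    "YOLO-Fire": (74.4, 85.9),
}
for model, (ap_fire, ap_smoke) in results.items():
    print(f"{model}: mAP = {(ap_fire + ap_smoke) / 2:.2f}")
# Prints 78.50, 79.15, 79.10, 80.15 respectively.
```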
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
