Article

Mine Exogenous Fire Detection Algorithm Based on Improved YOLOv9

1 School of Coal Engineering, Shanxi Datong University, Datong 037000, China
2 School of Mining and Coal, Inner Mongolia University of Science & Technology, Baotou 014010, China
3 China Safety Science Journal Editorial Department, China Occupational Safety and Health Association, Beijing 100011, China
4 Emergency Science Research Institute, China Coal Research Institute Co., Ltd., Beijing 100013, China
* Authors to whom correspondence should be addressed.
Processes 2026, 14(1), 169; https://doi.org/10.3390/pr14010169
Submission received: 24 November 2025 / Revised: 23 December 2025 / Accepted: 30 December 2025 / Published: 4 January 2026
(This article belongs to the Section AI-Enabled Process Engineering)

Abstract

Exogenous fires in underground coal mines are characterized by low illumination, smoke occlusion, heavy dust loading and pseudo fire sources, which jointly degrade image quality and cause missed and false alarms in visual detection. To achieve accurate and real-time early warning under such conditions, this paper proposes a mine exogenous fire detection algorithm based on an improved YOLOv9m, termed PPL-YOLO-F-C. First, a lightweight PP-LCNet backbone is embedded into YOLOv9m to reduce the number of parameters and GFLOPs while maintaining multi-scale feature representation suitable for deployment on resource-constrained edge devices. Second, a Fully Connected Attention (FCAttention) module is introduced to perform fine-grained frequency–channel calibration, enhancing discriminative flame and smoke features and suppressing low-frequency background clutter and non-flame textures. Third, the original upsampling operators in the neck are replaced by the CARAFE content-aware dynamic upsampler to recover blurred flame contours and tenuous smoke edges and to strengthen small-object perception. In addition, an MPDIoU-based bounding-box regression loss is adopted to improve geometric sensitivity and localization accuracy for small fire spots. Experiments on a self-constructed mine fire image dataset comprising 3000 samples show that the proposed PPL-YOLO-F-C model achieves a precision of 97.36%, a recall of 84.91%, mAP@50 of 96.49% and mAP@50:95 of 76.6%, outperforming Faster R-CNN, YOLOv5m, YOLOv7 and YOLOv8m while using fewer parameters and lower computational cost. The results demonstrate that the proposed algorithm provides a robust and efficient solution for real-time exogenous fire detection and edge deployment in complex underground mine environments.

1. Introduction

Fire is one of the principal hazards to coal-mine safety. Coal-mine fires are typically latent in development, abrupt in onset, and rapid in propagation; failure to detect and accurately locate the ignition source in a timely manner can readily precipitate incident escalation, resulting in severe casualties and environmental damage. According to 2023 statistics released by China’s Ministry of Emergency Management, in 68.3% of coal-mine fire incidents, failure to rapidly identify the ignition source and to implement effective early-warning measures aggravated the consequences and markedly increased safety risks [1].
In coal mines, a “mine fire” denotes an uncontrolled, self-sustaining combustion of underground combustibles—coal and remnant coal, coal dust, conveyor belts and support materials, lubricants, cable sheaths, and the like—under an oxygen supply, accompanied by sustained heat release, oxygen depletion, temperature rise, and the spread of toxic and hazardous smoke [2]. By causation, mine fires are classified as exogenous and endogenous. Exogenous fires are ignited by external energy or open flames; typical scenarios include electrical short circuits and cable breakdown, welding/blasting sparks, belt slippage with overheated rollers, heat from mechanical friction, and ignition of leaked oils. They are characterized by relatively clear ignition points, controllable precursors, and abrupt onset, and are closely associated with operating conditions and equipment failures. Endogenous fires mainly refer to spontaneous combustion of coal, in which residual coal in goaf/gob areas and roadways, under low air-velocity, relatively oxygen-rich, and poorly dissipative conditions, undergoes oxidation heat accumulation → temperature rise → thermal runaway to ignition; these events have long incubation, strong concealment, and high risk of re-ignition. Monitoring precursors commonly include abnormal CO/CO2, elevated oxygen-consumption rates, and slow temperature increases [3]. Accordingly, endogenous-fire monitoring should fuse visual cues with gas/temperature multisensor modeling to cover the covert incubation stage, whereas exogenous-fire monitoring relies more on rapid visual detection of flames/smoke coupled with operational status awareness.
Underground exogenous fires differ from surface fires because of confined spaces, directed ventilation, and heavy dust, which lead to sparse data and noisy signals: ventilation distorts flame shapes, causing missed and false detections, while dust reduces contrast and shifts textures. Camera placement is further limited by supports and pipe/cable routings, yielding short fields of view and occlusion, which substantially increases real-time monitoring difficulty. Traditional exogenous-fire detection relies on manual patrols and gas/smoke sensors: patrols are subjective and untimely, while sensors suffer long response lags, high false-alarm rates, and a lack of real-time visual risk assessment, leaving protection blind spots and hindering effective early warning. Recently, deep-learning–based visual detection has advanced rapidly in industrial safety, offering a new pathway for intelligent early warning in mines; however, dust interference, uneven illumination, and spatial constraints complicate flame/smoke feature extraction, and limited edge-device compute capacity hampers the balance between high real-time performance and high accuracy, affecting safety control and emergency response efficiency [4,5]. Developing a real-time flame-detection system tailored to the specific underground environment is therefore of clear significance for overcoming bottlenecks in intelligent mine safety monitoring.
In recent years, deep-learning–based object detection has achieved remarkable progress [6]. For object detection, Ross Girshick [7] proposed Fast R-CNN (Fast Region-based Convolutional Neural Network), which uses deep convolutional networks to classify object proposals efficiently and trains roughly nine-fold and three-fold faster than R-CNN and SPPnet (Spatial Pyramid Pooling Network), respectively. Building on Fast R-CNN, Jifeng Dai et al. [8] introduced R-FCN (Region-based Fully Convolutional Network) with position-sensitive score maps to reconcile translation invariance in image classification with the translation sensitivity required in detection, thereby balancing speed and accuracy. Kaiming He et al. [9] addressed the deformation and localization-accuracy degradation caused by RoI Pooling (Region of Interest Pooling) by proposing Mask R-CNN, which augments Faster R-CNN with a mask prediction branch and, beyond enabling instance segmentation, enhances feature-learning capacity. Joseph Redmon et al. [10] proposed YOLO (You Only Look Once), which partitions an image into grids and directly predicts bounding boxes and class probabilities per grid, enabling end-to-end learning.
For mine-site target detection, Professor Fan Zhang of China University of Mining and Technology (Beijing) et al. [11] reviewed deep-learning–based techniques, highlighting their advantages and promising directions in coal-mine applications. Related studies have yielded notable results. Liu et al. [12] used gray relational analysis and multisensor fusion of multiple gas indicators to build a tiered early-warning system, achieving accurate early warning for the slow-oxidation stage of coal spontaneous combustion. Qin Botao et al. [13] realized high-precision, real-time underground monitoring of marker gases such as CO and CO2 using non-dispersive infrared (NDIR) and tunable diode laser absorption spectroscopy (TDLAS), overcoming the drawbacks of traditional chromatography—response lag and strong susceptibility to airflow. Hu Jinian et al. [14] proposed a deep-convolutional-network-based model for source localization of exogenous mine fires, mitigating the limitations of local monitoring (small coverage) and poor generalization of global localization systems, and improving the accuracy of identifying the affected roadway.
In the aspect of fire detection based on YOLO, Wang Weifeng et al. [15] improved YOLOv5 by integrating a K-means–enhanced dark-channel dehazing algorithm and dynamic-target extraction, alleviating feature loss under complex underground conditions and increasing fire-feature recognition accuracy. Zhang Wei and Wei Jingjing [16] incorporated DenseNet (Densely Connected Convolutional Network) blocks and dilated convolutions into YOLOv3, reducing parameters while mitigating gradient vanishing during feature extraction and improving small-object detection accuracy. Zhang Mingzhen et al. [17] proposed an improved YOLOv5 that replaces the standard feature pyramid and path aggregation with a weighted bidirectional FPN (Feature Pyramid Network) and substitutes traditional NMS (Non-Maximum Suppression) with DIoU-NMS (Distance-IoU–based NMS), thereby enhancing flame-detection accuracy. Li Xianguo et al. [18] developed FD-YOLO for fire detection, designing CFM_N, SPPFCSPC, and DownC modules in the backbone and neck to reduce computational burden and model complexity while maintaining detection precision. Gao Junyi et al. [19] augmented YOLOv10 with E-PSA, BiFPN, and C2f-Faster modules to strengthen spatial-semantic extraction and low-level feature fusion during fire recognition, improving detection precision and mAP@50 by 5.9% and 1.4%, respectively. Recent advancements in YOLO-based models have increasingly targeted fire and smoke detection in low-visibility environments, including underground mines. For instance, Zhao et al. [20] proposed an enhanced YOLOv8n framework for fire detection in underground coal mines, incorporating image preprocessing (denoising via KBNet, dehazing via DeHamer, and deblurring via ChaIR) alongside attention modules (CCC, BiFormer, Agent) and Projected GAN for data augmentation, achieving a mAP of 85.8% while addressing smoke, dust, and blur. Similarly, Li and Liu [21] introduced a fusion of entropy-enhanced processing with an improved YOLOv8m for smoke recognition in mine fires, using entropy-constrained feature focusing, HATHead, and PSA modules to tackle background-smoke integration and low entropy in dim conditions, yielding a mAP@50 of 95.0% and real-time FPS of 25. In parallel, Zhang et al. [22] developed UFS-YOLO based on YOLOv5s for small fire targets in underground facilities, integrating a modified CBAM (DCBAM) and SIoU loss to enhance discrimination in dim, humid spaces, improving accuracy by 5.4% over the baseline. Additionally, DSS-YOLO by Wang et al. [23] refines YOLOv8n with DynamicConv, SEAM attention, and SPPELAN for obscured small targets in fire scenarios, boosting mAP by 0.6% and reducing GFLOPs by 12.3%, demonstrating applicability to low-visibility conditions like smoke diffusion. Although considerable progress has been made in mine-fire detection, significant gaps remain in dynamic-feature modeling, cross-layer information fusion, and lightweight deployment [24].
In response to the above issues, this paper proposes an improved YOLOv9m for mine fire monitoring: (1) the lightweight PP-LCNet backbone is embedded to reduce parameters and computation; (2) the FCAttention module is added to enhance flame and smoke features; (3) the original upsampling operators are replaced with the CARAFE content-aware upsampler; and (4) the MPDIoU loss is adopted for bounding-box regression.

2. Methods

2.1. YOLOv9

The YOLO family is a class of real-time object detectors that perform localization and classification in a single forward pass. Unlike conventional two-stage detectors, YOLO partitions the input image into grid cells, with each cell directly predicting bounding boxes, objectness confidence, and class probabilities, thereby substantially increasing detection speed. YOLOv9 primarily mitigates the information loss that arises as data propagate through deep networks via two advances: first, the introduction of Programmable Gradient Information (PGI), which accommodates the variability inherent in multi-object detection and supplies the learning objective with complete task-relevant inputs to yield reliable gradients for updating network weights; second, the design of GELAN (Generalized Efficient Layer Aggregation Network), a lightweight architecture based on gradient path planning that offers notable advantages in parameter efficiency, computational complexity, accuracy, and inference speed [25].
The YOLOv9 architecture comprises three components: the backbone, neck, and head. The backbone adopts GELAN, which integrates cross-stage feature aggregation with scalable computation units and supports combinations of residual, CSP, and other modules. Its custom downsampling block combines average and max pooling to effectively preserve multi-scale features. The neck is constructed on PGI, preserving a complete gradient flow via a reversible auxiliary branch and introducing multi-level supervision to mitigate error accumulation in shallow layers. The detection head maintains an anchor-free design and employs a task-aligned assigner to dynamically optimize label matching.
The Silence module is a special block used solely for the raw 640 × 640 × 3 input image; it has no additional function and is followed downstream by a Conv block to extract local features. The RepNCSPELAN4 module combines the strengths of CSPNet and ELAN and incorporates re-parameterizable convolutions (RepConv) to accelerate inference. Within the PGI framework, the CBLinear module consists of a single convolutional layer and a tensor-splitting stage: it maps input features into a multi-scale space and performs a preset channel-wise split, supplying multi-stage inputs to the subsequent CBFuse module. The split features produced by CBLinear participate in multi-level supervision, effectively alleviating information loss in deep networks and ensuring reliable gradient propagation. SPPELAN, proposed in YOLOv9m as a new multi-scale fusion module, combines spatial pyramidization with efficient layer aggregation to overcome the receptive-field limitations and computational redundancy of the traditional SPPF, thereby markedly enhancing multi-scale feature extraction. CBFuse aligns and fuses feature maps from different levels and can be removed via an inverse operation at inference time, enhancing multi-scale detection while reducing computation and improving inference speed. Accordingly, YOLOv9 is suitable for fire detection in complex underground mine environments; the network structure is illustrated in Figure 1.

2.2. PP-LCNet Lightweight Module

Because the introduction of auxiliary branches in YOLOv9 substantially increases the parameter count and computational load, it is difficult to deploy on underground edge devices with limited resources. To address this, we modify the detector’s backbone using the depthwise separable convolution (DepthSepConv) modules from PP-LCNet, thereby reducing GFLOPs and model size to accommodate the constrained compute of mining edge platforms. Considering the visual characteristics of early-stage mine fires caused by external factors—smoke dispersion, low illumination, dust, and strong glare—the proposed module maintains real-time performance while providing high robustness, making it well-suited for early warning and rapid edge deployment in mines.
PP-LCNet is a lightweight convolutional neural network characterized by high efficiency and accuracy. The architecture is constructed by stacking multiple DepthSepConv base blocks. Using a depthwise-separable convolution scheme—two linear convolutional stages (depthwise and pointwise) coupled with the H-Swish nonlinearity—it effectively constrains activation ranges and reduces computational complexity. The DepthSepConvSE enhancement block introduces a Squeeze-and-Excitation (SE) channel-attention mechanism at key network stages to strengthen feature representation.
Conventional classifiers typically couple a global average pooling (GAP) layer directly to a fully connected classification layer. In lightweight designs, however, such direct coupling may cause loss of high-dimensional feature information and thus limit classification performance. To address this bottleneck, we append a 1 × 1 pointwise convolution with 1280 channels after the GAP layer, achieving nonlinear feature mapping and dimensional compression while, through parameter optimization, preserving computational efficiency [26]. The PP-LCNet network diagram is shown in Figure 2.
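For illustration, a minimal PyTorch sketch of the DepthSepConv building block and the GAP + 1280-channel pointwise head described above is given below; the layer arguments (e.g., the SE reduction ratio of 4 and the 512 input channels of the head) are assumptions for demonstration rather than the exact PP-LCNet configuration.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention (reduction ratio of 4 assumed)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Hardsigmoid(inplace=True),
        )

    def forward(self, x):
        return x * self.fc(self.pool(x))

class DepthSepConv(nn.Module):
    """Depthwise + pointwise convolution with H-Swish, optionally followed by SE."""
    def __init__(self, in_ch, out_ch, stride=1, kernel=3, use_se=False):
        super().__init__()
        self.dw = nn.Sequential(                       # depthwise (spatial) stage
            nn.Conv2d(in_ch, in_ch, kernel, stride, kernel // 2, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.Hardswish(inplace=True),
        )
        self.se = SEBlock(in_ch) if use_se else nn.Identity()
        self.pw = nn.Sequential(                       # pointwise (channel-mixing) stage
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.Hardswish(inplace=True),
        )

    def forward(self, x):
        return self.pw(self.se(self.dw(x)))

# GAP followed by a 1280-channel pointwise convolution, as described in the text.
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(512, 1280, 1), nn.Hardswish())
```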

2.3. FCAttention

FCAttention comprises four components—global feature extraction, multi-granularity interaction, adaptive fusion, and output weighting. By effectively integrating global and local cues and allocating weights appropriately, it enables the network to emphasize informative features while suppressing less useful ones [27]. In real-time detection of mine fires caused by external factors, the multi-granularity interaction in FCAttention enhances the representation of small objects and slender smoke plumes, suppresses low-frequency noise and non-flame texture interference, and thereby markedly reduces the false-alarm rate while concurrently lowering the miss rate.
First, the feature map F, which contains global spatial information, is subjected to global average pooling to convert it into a channel descriptor U for capturing global context. The equation is as follows:
U_n = \mathrm{GAP}(F_n) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_n(i, j)
where F \in \mathbb{R}^{C \times H \times W}; C, H and W denote the number of channels, height, and width, respectively; U \in \mathbb{R}^{C}; and GAP(·) denotes the global average pooling function.
Second, a diagonal matrix D is employed to capture the dependencies among all channels as global information; the procedure is as follows:
D = \mathrm{diag}(d_1, d_2, d_3, \ldots, d_C)
U_{gc} = \sum_{i=1}^{C} U d_i
where Ugc denotes the global information and C is the number of channels. Meanwhile, to capture local channel information with few learnable parameters, a banded matrix B is employed to perform local channel interactions; the procedure is as follows:
B = \left[ b_1, b_2, b_3, \ldots, b_k \right]
U_{lc} = \sum_{i=1}^{k} U b_i
where Ulc denotes the local information and k is the number of adjacent channels. To capture their correlations at different granularities, the global information Ugc and the local information Ulc are combined via a cross-correlation operation to obtain the correlation matrix M, as follows:
M = U_{gc} U_{lc}^{T}
Rows and columns are extracted from the correlation matrix and its transpose to serve as the weight vectors for the global and local information, respectively, as follows:
U_{gc}^{\omega} = \sum_{j}^{C} M_{i,j}, \quad i = 1, 2, 3, \ldots, C
U_{lc}^{\omega} = \sum_{j}^{C} (U_{lc} U_{gc}^{T})_{i,j} = \sum_{j}^{C} M_{i,j}^{T}, \quad i = 1, 2, 3, \ldots, C
The two types of weights are dynamically fused by a learnable factor σ(θ) as follows:
W = \sigma\big( \sigma(\theta) \times \sigma(U_{gc}^{\omega}) + (1 - \sigma(\theta)) \times \sigma(U_{lc}^{\omega}) \big)
where σ denotes the Sigmoid activation function and θ is a learnable scalar parameter. The obtained weights W are then multiplied by the input feature map to produce the final output feature map:
F^{*} = W \times F
where F denotes the input feature map and F^{*} denotes the output feature map; the overall network architecture is illustrated in Figure 3.
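To make the above formulation concrete, the following PyTorch sketch (our own minimal re-implementation, not the original FCAttention code) realizes the diagonal matrix D as a learnable per-channel vector and the banded matrix B as a one-dimensional convolution over the channel axis; the kernel size k = 3 and the scalar parameterization of θ are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FCAttention(nn.Module):
    """Sketch of the fully connected attention described by the equations above."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.d = nn.Parameter(torch.ones(channels))          # diagonal (global) weights
        self.band = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)  # banded (local) interaction
        self.theta = nn.Parameter(torch.zeros(1))            # learnable fusion factor
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                                    # x: (B, C, H, W)
        b, c, _, _ = x.shape
        u = x.mean(dim=(2, 3))                               # GAP -> channel descriptor (B, C)
        u_gc = u * self.d                                    # global channel interaction
        u_lc = self.band(u.unsqueeze(1)).squeeze(1)          # local channel interaction
        m = u_gc.unsqueeze(2) * u_lc.unsqueeze(1)            # correlation matrix M, (B, C, C)
        w_gc = m.sum(dim=2)                                  # row sums of M   -> global weights
        w_lc = m.transpose(1, 2).sum(dim=2)                  # row sums of M^T -> local weights
        t = self.sigmoid(self.theta)
        w = self.sigmoid(t * self.sigmoid(w_gc) + (1 - t) * self.sigmoid(w_lc))
        return x * w.view(b, c, 1, 1)                        # reweight the input feature map

# Example: calibrate the deepest backbone feature map (shape is preserved).
feat = torch.randn(2, 512, 20, 20)
out = FCAttention(512)(feat)                                 # -> (2, 512, 20, 20)
```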

2.4. CARAFE

CARAFE is a dynamic upsampling operator that generates position-adaptive convolution kernels in a content-aware manner, thereby overcoming the limitations of conventional interpolation-based methods. Its architecture consists of a kernel prediction module and a feature reassembly module: the former employs channel compression and convolution to generate the upsampling kernels, while the latter performs matrix multiplication between the kernel weights and local features to achieve feature reconstruction [28]. The detailed procedure is as follows.
First, the kernel prediction module generates a set of reassembly kernels at each spatial location based on the input low-resolution feature map X \in \mathbb{R}^{C \times H \times W}. Then, the content reassembly module applies these kernels to reweight and reorganize the original feature map, producing the upsampled high-resolution feature map Y \in \mathbb{R}^{C \times sH \times sW}, where s denotes the upsampling scale factor. Compared to fixed interpolation methods, CARAFE adaptively allocates reassembly weights based on contextual information, enabling better preservation of object contours and texture details.

2.4.1. Kernel Prediction Module

For the input feature map X, the kernel prediction module first applies a 1 × 1 convolution to map the channel dimension from C to C_mid, thereby reducing computational complexity:
F = \mathrm{Conv}_{1 \times 1}(X) \in \mathbb{R}^{C_{\mathrm{mid}} \times H \times W}
Subsequently, a k × k sliding window is applied to F through an unfold operation, extracting k² feature vectors around each spatial location. A subsequent 1 × 1 convolution maps the channel dimension to s² × k², thereby generating reassembly weights of size k² for each upsampling position:
K = \mathrm{Conv}_{1 \times 1}(\mathrm{Unfold}_k(F)) \in \mathbb{R}^{s^2 k^2 \times H \times W}
Next, a softmax function is applied along the channel dimension of the generated kernel weights to normalize them within each k²-sized window:
K'_{i,j} = \frac{\exp(K_{i,j})}{\sum_{u=1}^{k^2} \exp(K_{u,j})}, \quad j = 1, \ldots, H \times W
In the equation, K'_{i,j} denotes the i-th normalized weight corresponding to the j-th spatial position of the original feature map.

2.4.2. Feature Reassembly Module

The core idea of the content reassembly module is to reweight and sum the k × k features within the receptive field of the input feature map for each output position, using the predicted reassembly weights. For the position (u, v) in the upsampled feature map Y, the corresponding center position in the input feature map is (\lfloor u/s \rfloor, \lfloor v/s \rfloor). The k × k neighborhood of feature vectors centered at this position can be represented as:
\{ X_{c, i+a, j+b} \mid a, b \in \{0, \ldots, k-1\} \}
In the equation, i = \lfloor u/s \rfloor and j = \lfloor v/s \rfloor. The reassembly process can thus be expressed as:
Y_{c,u,v} = \sum_{a=0}^{k-1} \sum_{b=0}^{k-1} K'_{(a \cdot k + b), (i,j)} \cdot X_{c, i+a, j+b}
Since K is generated adaptively for each position, CARAFE is capable of assigning optimal reassembly weights according to different scenarios and object sizes.
The overall model framework is shown in Figure 4.
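For reference, a minimal PyTorch sketch of the two modules is given below; it is not the implementation used in our experiments, and the reassembly kernel size k = 5, the compressed channel number C_mid = 64, and the nearest-neighbor replication used to lift the unfolded neighborhoods to the output resolution are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CARAFE(nn.Module):
    """Content-aware reassembly of features: kernel prediction + feature reassembly."""
    def __init__(self, channels, c_mid=64, scale=2, k=5):
        super().__init__()
        self.scale, self.k = scale, k
        self.compress = nn.Conv2d(channels, c_mid, 1)                           # 1x1 channel compression
        self.predict = nn.Conv2d(c_mid, (scale ** 2) * (k ** 2), 3, padding=1)  # reassembly kernel prediction

    def forward(self, x):                                                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        s, k = self.scale, self.k
        # Kernel prediction: (B, s^2*k^2, H, W) -> (B, k^2, sH, sW), softmax-normalized per window
        kernels = F.pixel_shuffle(self.predict(self.compress(x)), s)
        kernels = F.softmax(kernels, dim=1)
        # Feature reassembly: gather k x k neighborhoods, replicate to the output grid, reweight and sum
        patches = F.unfold(x, k, padding=k // 2).view(b, c * k * k, h, w)
        patches = F.interpolate(patches, scale_factor=s, mode="nearest")
        patches = patches.view(b, c, k * k, s * h, s * w)
        return (patches * kernels.unsqueeze(1)).sum(dim=2)                      # (B, C, sH, sW)

# Example: upsample a neck feature map from 40 x 40 to 80 x 80.
y = CARAFE(512)(torch.randn(1, 512, 40, 40))                                   # -> (1, 512, 80, 80)
```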

2.5. MPDIoU

In underground coal mines, flames often appear as tiny fire spots under low-light and other complex visual conditions. However, the conventional Complete Intersection over Union (CIoU) loss mainly focuses on the aspect ratio discrepancy between bounding boxes and is prone to gradient vanishing and other defects in complex detection scenarios, making it difficult to accurately detect such mine fires. It is therefore necessary to enhance the geometric sensitivity of the model by redesigning the loss function [29].
Specifically, MPDIoU (Minimum Point Distance IoU) advances beyond traditional IoU by minimizing the Euclidean distances between the four corresponding corner points of the predicted and ground-truth bounding boxes, rather than relying solely on overlap ratios. In contrast to IoU, which degrades to zero for non-overlapping boxes and fails to provide gradient guidance in such cases, MPDIoU ensures continuous optimization even for disjoint boxes. Compared to GIoU (Generalized IoU), which introduces a penalty term based on the smallest enclosing rectangle but can still suffer from slow convergence in non-overlapping scenarios, MPDIoU offers faster regression by directly targeting point alignments. Relative to DIoU (Distance IoU), which incorporates center-point distance normalization but ignores aspect ratios, and CIoU (Complete IoU), which adds an aspect ratio penalty yet may over-penalize in axis-aligned mismatches, MPDIoU simplifies the formulation with lower computational complexity, achieving superior accuracy (e.g., up to 1–2% mAP gains in benchmarks) and convergence speed for small or irregularly shaped targets like mine fire spots, as it aligns more closely with geometric evaluation metrics without additional terms [30]. The detailed computation is given in Equations (12)–(15).
L_{\mathrm{MPDIoU}} = 1 - \mathrm{MPDIoU}
\mathrm{MPDIoU} = \mathrm{IoU} - \frac{d_1^2}{w^2 + h^2} - \frac{d_2^2}{w^2 + h^2}
d_1^2 = (x_1^{\mathrm{prd}} - x_1^{\mathrm{gt}})^2 + (y_1^{\mathrm{prd}} - y_1^{\mathrm{gt}})^2
d_2^2 = (x_2^{\mathrm{prd}} - x_2^{\mathrm{gt}})^2 + (y_2^{\mathrm{prd}} - y_2^{\mathrm{gt}})^2
where w and h denote the width and height of the input image; (x_1^{\mathrm{prd}}, y_1^{\mathrm{prd}}) and (x_2^{\mathrm{prd}}, y_2^{\mathrm{prd}}) are the coordinates of the top-left and bottom-right corners of the predicted bounding box, respectively; and (x_1^{\mathrm{gt}}, y_1^{\mathrm{gt}}) and (x_2^{\mathrm{gt}}, y_2^{\mathrm{gt}}) are the coordinates of the top-left and bottom-right corners of the ground-truth bounding box, respectively. The effect of MPDIoU is shown in Figure 5.
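A compact sketch of the loss defined by Equations (12)–(15) is given below; the function signature and the example boxes are illustrative, and in practice this computation is integrated into the YOLOv9m regression head rather than called standalone.

```python
import torch

def mpdiou_loss(pred, target, img_w, img_h, eps=1e-7):
    """MPDIoU loss; pred and target are (x1, y1, x2, y2) boxes of shape (N, 4)."""
    # Intersection over union
    inter_x1 = torch.max(pred[:, 0], target[:, 0])
    inter_y1 = torch.max(pred[:, 1], target[:, 1])
    inter_x2 = torch.min(pred[:, 2], target[:, 2])
    inter_y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (inter_x2 - inter_x1).clamp(0) * (inter_y2 - inter_y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # Squared distances between corresponding top-left and bottom-right corners
    d1 = (pred[:, 0] - target[:, 0]) ** 2 + (pred[:, 1] - target[:, 1]) ** 2
    d2 = (pred[:, 2] - target[:, 2]) ** 2 + (pred[:, 3] - target[:, 3]) ** 2
    norm = img_w ** 2 + img_h ** 2
    mpdiou = iou - d1 / norm - d2 / norm
    return 1.0 - mpdiou                                       # L_MPDIoU

# Example with a 640 x 640 input image.
p = torch.tensor([[100.0, 100.0, 150.0, 160.0]])
g = torch.tensor([[105.0, 98.0, 148.0, 158.0]])
loss = mpdiou_loss(p, g, 640, 640)
```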

2.6. PPL-YOLO-F-C

The PPL-YOLO-F-C model architecture integrates several enhancements into the YOLOv9m framework to optimize it for mine exogenous fire detection, balancing accuracy, real-time performance, and lightweight deployment on edge devices. The backbone embeds a lightweight PP-LCNet by replacing the original RepNCSPELAN4 modules with PPLRepNCSPELAN4 variants for efficient feature extraction, processing input tensors of shape (batch_size, 3, 640, 640) through initial Silence [3, 64, 3, 2] yielding (batch_size, 64, 320, 320), Conv [64, 128, 3, 2] to (batch_size, 128, 160, 160), PPLRepNCSPELAN4 [128, 256, 3, True, 1, 2] to (batch_size, 256, 160, 160), Conv [256, 512, 3, True, 1, 2] to (batch_size, 512, 80, 80), PPLRepNCSPELAN4 [256, 512, 3, True, 1, 4] to (batch_size, 512, 80, 80), Conv [512, 512, 5, True, 1, 4] to (batch_size, 512, 40, 40), PPLRepNCSPELAN4 [512, 512, 3, True, 1, 4] to (batch_size, 512, 40, 40), Conv [512, 512, 5, True, 1, 4] to (batch_size, 512, 20, 20), and PPLRepNCSPELAN4 [512, 512, 3, True, 1, 4] to (batch_size, 512, 20, 20); these stages incorporate depthwise-separable convolutions with H-Swish activation and Squeeze-and-Excitation (SE) attention in key blocks, producing multi-level outputs P3 (batch_size, 512, 80, 80), P4 (batch_size, 512, 40, 40), and P5 (batch_size, 512, 20, 20).
The FCAttention module is introduced at the end of the backbone to perform fine-grained frequency-channel calibration, taking input (batch_size, 512, 20, 20) and outputting the same shape after global average pooling, diagonal and banded matrix interactions (with channel count C matching the input and adaptive sigmoid-weighted fusion), and element-wise multiplication to enhance discriminative flame and smoke features while suppressing low-frequency background clutter and non-flame textures.
In the neck, the Path Aggregation Network (PAN) replaces standard upsampling operators with CARAFE content-aware dynamic upsamplers, configured with a 3 × 3 kernel, upsampling factor of 2, and channel compression ratio of 1/4 in the kernel prediction module; it begins with SPPELAN [512, 512, 256] at P5, followed by CARAFE upsampling to (batch_size, 512, 40, 40) before Concat [512] with P4 and RepNCSPELAN4 [1024, 512, 512, 12, 5, 1], then another CARAFE to (batch_size, 512, 80, 80) before Concat [512] with P3 and RepNCSPELAN4 [1024, 256, 256, 12, 8, 1], with the bottom-up path involving Conv downsampling, additional Concats, and RepNCSPELAN4 modules to fuse multi-scale features.
An auxiliary reversible branch is incorporated for training, featuring Conv, RepNCSPELAN4, CBLinear [512, 256, 512], and CBFuse modules to provide programmable gradient information across levels. The detection head retains YOLOv9m’s anchor-free design across three scales but adopts MPDIoU loss for bounding-box regression, improving geometric sensitivity and localization accuracy for small fire spots. This architecture, illustrated in Figure 6, reduces parameters and GFLOPs while maintaining robust multi-scale fusion suitable for low-illumination, smoke-occluded mine environments.

3. Experimental Design

3.1. Experimental Environment and Parameter Settings

In this study, the deep learning framework is implemented using PyTorch 1.8.1. The experimental environment consists of Visual Studio Code, Python 3.8, CUDA 11.1 and cuDNN 8.0.5, running on a Windows 11 operating system. The hardware configuration includes an Intel Xeon(R) Gold 6430 CPU, 32 GB of RAM, and an NVIDIA GeForce RTX 4090 GPU with 24 GB of memory. To ensure that all experiments are conducted under identical conditions, the same set of hyperparameters is adopted throughout. The SGD optimizer with a cosine annealing learning rate schedule is employed, and early stopping is triggered when the evaluation metrics show no improvement for 100 consecutive epochs. The detailed parameter settings are summarized in Table 1.
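The training loop below sketches the optimizer, cosine annealing schedule, and early-stopping logic described above using the values in Table 1; the placeholder model and validation call are illustrative stand-ins, not the actual training script.

```python
import torch

# Minimal sketch, assuming the hyperparameters of Table 1; warmup and other
# detector-specific details of the real training script are omitted.
model = torch.nn.Conv2d(3, 16, 3)            # placeholder for the PPL-YOLO-F-C network
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.937, weight_decay=0.005)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)

best_map, patience, bad_epochs = 0.0, 100, 0
for epoch in range(300):
    # ... one epoch of training and validation would run here ...
    val_map = 0.0                             # replace with the epoch's mAP@50
    scheduler.step()
    if val_map > best_map:
        best_map, bad_epochs = val_map, 0
    else:
        bad_epochs += 1
    if bad_epochs >= patience:                # early stopping after 100 stagnant epochs
        break
```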

3.2. Dataset Preparation

To ensure the reliability of the proposed model, a dataset was constructed using flame and smoke images extracted from surveillance videos in a coal mine district. All source videos were first uniformly pre-sampled at 1 fps to obtain candidate frames, after which frame differencing and histogram distance were jointly employed to retain motion-salient frames. To avoid near-duplicate samples, a sliding window of at least 1.5 s was imposed for each event, and at most 3–5 frames were preserved per event. Frames with strong light flashes, dust occlusion, or defocus blur were removed using Laplacian-variance and exposure-entropy thresholds. Stratified sampling with proportional allocation was then performed across different scenes (goaf, longwall working face, and transport roadway) and time/illumination conditions to ensure a balanced dataset. However, due to the limited number of field samples, laboratory-simulated scenes were additionally introduced to augment the dataset. The self-constructed dataset consists of 3000 images in total, with 600 real field samples extracted from surveillance videos captured in actual coal mine districts and 2400 laboratory-simulated images, ensuring balanced representation of complex fire characteristics such as smoke occlusion and pseudo-fire interference. The dataset covers typical high-risk fire areas such as goafs, longwall faces, and transport roadways, and includes complex conditions such as uneven underground illumination and dust interference. The train/val/test split is conducted in a grouped manner at the event/video level (i.e., frames from the same event/video are assigned to only one subset), which reduces the risk of temporal leakage when using sliding-window sampling. The dataset is split 7:2:1 into 2102, 598, and 300 images, respectively. All samples are annotated in YOLO format, and representative examples of the dataset are shown in Figure 7.
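The following OpenCV sketch illustrates the frame pre-sampling and quality-gating steps described above; the video filename, motion-difference threshold, Laplacian-variance threshold, and entropy threshold are illustrative assumptions rather than the exact values used during dataset construction.

```python
import cv2
import numpy as np

def keep_frame(frame, blur_thresh=100.0, entropy_thresh=4.0):
    """Quality gate: Laplacian variance against defocus blur, grayscale entropy
    against over/under-exposure (both thresholds are illustrative)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()
    hist = cv2.calcHist([gray], [0], None, [256], [0, 256]).ravel()
    p = hist / (hist.sum() + 1e-12)
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return sharpness >= blur_thresh and entropy >= entropy_thresh

cap = cv2.VideoCapture("mine_surveillance.mp4")      # hypothetical source video
fps = cap.get(cv2.CAP_PROP_FPS) or 25
kept, idx, prev = [], 0, None
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if idx % int(fps) == 0:                          # pre-sample at roughly 1 fps
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        moving = prev is None or cv2.absdiff(gray, prev).mean() > 5.0  # crude motion saliency
        if moving and keep_frame(frame):
            kept.append(frame)
        prev = gray
    idx += 1
cap.release()
```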

3.3. Data Augmentation Methods

When deploying the improved model to the task of detecting mine fires caused by external factors, a series of data augmentation strategies is designed in view of the complex underground environment to enhance the model’s generalization ability and robustness. On the basis of the original Mosaic augmentation in YOLOv9m, a stratified image compositing strategy is introduced, in which four raw images are composited in layers according to the illumination conditions and background characteristics of the mine site, thereby enriching the relative spatial distribution of targets while preserving the mine-specific low-illumination and high-contrast scene information. Second, instance-level MixUp augmentation is employed, where pixel-wise interpolation and fusion across samples are performed so that the model can better handle partially occluded flame and smoke regions and overlapping multiple targets. Third, random affine transformations (including rotations of ±15°, scaling by a factor of 0.8–1.2, and translations of ±10%), combined with horizontal flipping, are applied to simulate variations in camera installation angles and motion blur in roadways. In addition, perturbations in brightness (±30%), contrast (±20%), and hue (±10°) are imposed in the HSV color space, and Gaussian noise together with motion blur is added to reproduce the visual disturbances caused by unstable mine-lamp illumination and moving transportation equipment. All stochastic augmentations are applied to the training set only; validation and test sets use the same deterministic preprocessing (resizing/normalization) without augmentation, ensuring a clean and leakage-free evaluation.
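As an illustration, the Albumentations pipeline below mirrors the geometric and photometric perturbation ranges listed above; it is a simplified stand-in for the augmentation implemented in the training pipeline, and Mosaic/MixUp (handled inside the detector’s dataloader) are omitted.

```python
import albumentations as A

# Illustrative training-time pipeline approximating the ranges quoted above;
# the probabilities and noise/blur parameters are assumptions for demonstration.
train_tf = A.Compose(
    [
        A.Affine(rotate=(-15, 15), scale=(0.8, 1.2),
                 translate_percent=(-0.10, 0.10), p=0.5),
        A.HorizontalFlip(p=0.5),
        A.HueSaturationValue(hue_shift_limit=10, sat_shift_limit=20,
                             val_shift_limit=30, p=0.5),
        A.GaussNoise(p=0.3),
        A.MotionBlur(blur_limit=7, p=0.3),
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)
```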

3.4. Evaluation Metrics

The performance of the model is evaluated using precision (P), recall (R), mean average precision (mAP), GFLOPs, and FPS [31,32]; the first three are defined as follows:
P = \frac{TP}{TP + FP}
R = \frac{TP}{TP + FN}
\mathrm{mAP} = \frac{1}{n} \sum_{i=1}^{n} AP_i
where TP denotes the number of true positives (predicted as positive and actually positive), FP denotes the number of false positives (predicted as positive but actually negative), TN denotes the number of true negatives (predicted as negative and actually negative), FN denotes the number of false negatives (predicted as negative but actually positive), AP denotes the average precision for a given class, and n denotes the number of object categories.
Moreover, GFLOPs is calculated using the thop library in PyTorch. The input image resolution is 640 × 640 pixels, and the batch size is 1. FPS is measured on an NVIDIA GeForce RTX 4090 GPU. The average value is taken after 100 inferences on the test set with a batch size of 1, and half-precision (FP16) is enabled. This includes the complete end-to-end inference process, covering non-maximum suppression (NMS).
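The measurement procedure can be sketched as follows; the placeholder model stands in for PPL-YOLO-F-C, a CUDA-capable GPU is assumed, and NMS is omitted from the timing loop for brevity even though the reported FPS includes it.

```python
import time
import torch
from thop import profile

model = torch.nn.Conv2d(3, 16, 3, padding=1).cuda().eval()   # placeholder for the detector
dummy = torch.randn(1, 3, 640, 640).cuda()

# GFLOPs / parameters with thop at 640 x 640, batch size 1.
# thop returns multiply-accumulate counts, conventionally reported as (G)FLOPs.
flops, params = profile(model, inputs=(dummy,))
print(f"GFLOPs: {flops / 1e9:.2f}, Params (M): {params / 1e6:.2f}")

# FPS: average over 100 FP16 inferences after a short warm-up.
model = model.half()
dummy = dummy.half()
with torch.no_grad():
    for _ in range(10):                      # warm-up iterations
        model(dummy)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(100):
        model(dummy)
    torch.cuda.synchronize()
print(f"FPS: {100 / (time.time() - start):.1f}")
```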

4. Results and Analysis

4.1. Selection of Pre-Trained Models

As a medium-sized variant in the YOLOv9 family, YOLOv9m achieves a balance between detection speed and accuracy. According to the official YOLOv9 reports, YOLOv9m attains an mAP of 51.4% on the COCO dataset, which is 1.2% higher than that of the previous-generation model of the same scale, YOLOv8m. Meanwhile, its model size is only 67.2 MB, i.e., 18.2 MB smaller than YOLOv8m, making it more suitable for real-time deployment on edge devices. The performance comparison of different baseline YOLOv9 models is summarized in Table 2.

4.2. Experimental Results of the Improved Model

Based on a transfer learning strategy, the improved model is trained on the proposed dataset using pre-trained weights on the COCO dataset [33]. Figure 8 illustrates the evolution of the loss function during training, Figure 9 shows the precision–recall curves, and Figure 10 presents the performance curves of PPL-YOLO-F-C, where the horizontal axis denotes the number of epochs. The experimental results indicate that the evaluation metrics of the improved model increase rapidly within the first 25 epochs and gradually converge after about 100 epochs. Ultimately, the P, R, mAP@50 and mAP@50:95 of PPL-YOLO-F-C stabilize at approximately 95.7%, 94.5%, 95.5% and 76.6%, respectively. Figure 11 shows the flame detection results of the improved model.

4.3. Ablation Experiments

To verify the effectiveness of the proposed PPL-YOLO-F-C model and the individual contributions of each component to the overall performance, ablation experiments are conducted under the same experimental conditions using the same dataset. Taking YOLOv9m as the baseline model, four enhancement strategies—PP-LCNet module, FCAttention mechanism, CARAFE dynamic upsampling operator, and MPDIoU loss function—are examined by selectively adding or removing these modules. The ablation results are summarized in Table 3.
As shown in Table 3, compared with the baseline model, replacing the YOLOv9m backbone with PP-LCNet leads to a significant reduction in the number of parameters, but also to a noticeable degradation in detection accuracy. This performance drop mainly arises from the characteristics of the depthwise separable convolutions in the PP-LCNet architecture. Compared with standard convolutions, this structure decouples spatial filtering and channel mixing, which effectively reduces the total number of parameters and floating-point operations, but inevitably weakens the feature representation capacity to some extent.
After introducing the FCAttention mechanism, the model precision is markedly improved, whereas the detection speed drops by 26.7 frames·s−1 compared with PPL-YOLO. On this basis, incorporating the CARAFE upsampling module further boosts precision, recall, and mAP@50 by 3.12%, 0.82%, and 6.46%, respectively, although the number of parameters and GFLOPs increase by 3.7% and 7.7%. This can be attributed to the fact that CARAFE dynamically generates upsampling kernels tailored to the characteristics of the feature maps, thereby overcoming the limitation of traditional interpolation methods that rely only on fixed patterns for local information reconstruction and enabling a globally optimized reconfiguration of semantic information in feature space. However, such a content-aware adaptive mechanism, while enhancing representational power, unavoidably introduces additional parameters and computational cost. After adding the MPDIoU loss, P, R, and mAP@50 are further improved by 3.2%, 5.6%, and 1.1%, respectively.
The ablation results demonstrate that the PPL-YOLO-F-C model achieves the best overall performance. Its detection accuracy is significantly enhanced, reaching 96.49%; compared with the baseline, P and FPS increase by 5.2% and 23.2%, respectively, while R decreases slightly, and GFLOPs and parameters are reduced by 13.6% and 13.2%. These findings verify the reliability of the improved algorithm and its superiority in detection accuracy.

4.4. Comparative Experiments

To verify the superiority of the proposed PPL-YOLO-F-C model, a series of comparative experiments are conducted against Faster R-CNN, YOLOv5, YOLOv7, and YOLOv8. The experimental setup is described in Section 3.1, and all experiments were conducted under the same environment. The performance metrics of each model are summarized in Table 4, the corresponding bar chart is shown in Figure 12, and the P–R curve comparison is shown in Figure 13. As can be seen from Table 4 and Figure 12, compared with Faster R-CNN, YOLOv5m, YOLOv7, and YOLOv8, the proposed PPL-YOLO-F-C model achieves improvements in mAP@50 of 12.73%, 1.58%, 5.07%, and 2.00%, respectively, while the precision is increased by 15.74%, 3.29%, 4.88%, and 4.77%, respectively. The recall is slightly lower than that of YOLOv8, mainly because the introduction of the CARAFE upsampling operator amplifies background noise and blurred-region features during upsampling, and because the lightweight backbone (depthwise separable convolutions) trades off some feature representation capacity for extremely small or ambiguous targets. In combination with the results presented in Section 4.2, it can be concluded that PPL-YOLO-F-C achieves high accuracy in mine fire recognition.

4.5. Visual Comparison of Detection Performance

To further verify the reliability of the above experimental results, two challenging underground scenarios are simulated, namely smoke interference and small flame targets at the early stage of a fire, and the trained models are tested accordingly. The test results are shown in Figure 14. As depicted in Figure 14a, under severe smoke interference, YOLOv9m and YOLOv7 exhibit missed detections, while the detection accuracies of YOLOv5 and YOLOv8 are 0.39 and 0.51, respectively. By contrast, the improved YOLO model shows a pronounced response to flame edge textures, demonstrating a stronger capability to capture flame targets, with a detection accuracy of 0.71. As shown in Figure 14b, in small-object detection scenarios, the flame recognition accuracy of the improved model is increased by 10% compared with the original model. In summary, the detection results indicate that the improved model can significantly enhance flame recognition accuracy and small-object detection performance, effectively reducing the probability of missed and false detections in complex environments.

5. Conclusions

Targeting typical mine exogenous fire scenarios such as low illumination, smoke occlusion, dust and metallic reflections, small-scale incipient fires, airflow disturbance, and interference from pseudo fire sources, the improved model establishes a closed-loop adaptation from perception to localization through the coordinated mechanism of a lightweight backbone, frequency-domain attention, content-aware upsampling, and MPDIoU-based regression.
(1)
By introducing the lightweight PP-LCNet backbone into YOLOv9, the number of model parameters is reduced from 32.55 MB to 19.53 MB, and the GFLOPs are reduced from 247.76 to 126.78. In the mine flame detection experiments, the detection speed reaches 59.49 frames·s−1, indicating that under low-light and color-biased conditions, the model can enhance shallow texture and edge responses and capture rapid flame-boundary flickering at a higher frame rate, thereby mitigating motion blur and missed detections and satisfying the real-time early warning requirements of mine fire safety monitoring.
(2)
After integrating the FCAttention frequency-domain channel attention module into PPL-YOLOv9, the precision increases to 91.2%, and mAP improves by 5%, while the number of parameters increases by only 7.59 MB, thus achieving a favorable balance between multi-band channel calibration in the frequency domain and model compactness. On this basis, the introduction of the CARAFE module further boosts precision and mAP to 94.32% and 95.46%, respectively, with a negligible increase in model parameters and GFLOPs. This demonstrates that, under complex exogenous fire operating conditions in mines, frequency-domain attention and cross-layer feature reassembly help distinguish pseudo fire sources such as welding sparks, strong miner-lamp reflections, and metallic highlights from true flames, thereby reducing false alarms and missed alarms and improving detection accuracy. Finally, replacing the loss function of the regression head with MPDIoU increases the precision and mAP in flame detection to 97.36% and 96.49%, respectively, representing a marked improvement over the original model. Compared with Faster R-CNN, YOLOv5m, YOLOv7, and YOLOv8m, the mAP is improved by 12.73%, 1.58%, 5.07%, and 2.00%, respectively. In summary, the proposed mine fire detection algorithm ensures strong robustness and real-time performance, enhances the real-time detection accuracy of mine fires, and provides a practical technical solution for edge deployment in mine safety early warning systems.
(3)
Owing to the limited number of training samples in the mine-fire dataset, the overall detection accuracy is still constrained, and the dataset needs to be further enriched. In particular, the self-constructed dataset of 3000 samples may suffer from limitations such as insufficient diversity in real-world mine environments (e.g., variations in coal types, ventilation patterns, dust concentrations, and geographic mine settings across different regions), potential over-reliance on simulated or controlled fire scenarios rather than authentic underground incidents, and underrepresentation of edge cases like extreme low-visibility conditions or multi-source interferences. These factors could hinder the model’s generalization to unseen mines or dynamic operational changes, potentially leading to higher false positives or negatives in deployment. In future work, the proposed model can be combined with smoke-detection modules (e.g., a YOLOv9m-based smoke detection branch) to achieve more comprehensive early warning at the early stage of mine fires. More narrow research directions could include the following: (i) validating the model through targeted field trials in specific high-risk mine subsections (e.g., conveyor belt areas prone to friction-induced fires) to assess real-time performance under varying dust and humidity levels; (ii) optimizing the lightweight architecture for integration with low-power IoT edge devices commonly used in mines, focusing on hardware-specific constraints like ARM-based processors; and (iii) exploring fine-tuned adaptations for detecting exogenous fires from particular causes, such as electrical faults in cabling, by augmenting the dataset with cause-specific annotations to improve causal inference in early warnings.

Author Contributions

Conceptualization, methodology, software, X.Z.; validation, formal analysis, investigation, writing—original draft, R.Y.; writing—review and editing, visualization, Y.Q.; resources, C.B.; data curation, Q.L.; supervision, project administration, Q.Q. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

Author Qingjie Qi was employed by the China Coal Research Institute Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Kodur, V.; Kumar, P.; Rafi, M.M. Fire hazard in buildings: Review, assessment and strategies for improving fire safety. PSU Res. Rev. 2020, 4, 1–23.
  2. Wang, D.M.; Cheng, Y.P.; Zhou, F.B.; Ji, J. Experimental Research on Combustion Property of Mine Fire Source. J. China Univ. Min. Technol. 2002, 31, 33–36.
  3. Zhu, H.; Hu, C.; Zhang, Y.; Hu, L.; Yuan, X.; Wang, X. Research Status on Prevention and Control Technology of Coal Spontaneous Fire in China. Saf. Coal Mines 2020, 51, 88–92.
  4. Li, H.; Xiong, S.; Sun, P. S-FCN fire image detection method based on feature engineering. China Saf. Sci. J. 2024, 34, 191–201.
  5. Tang, W.; Zhang, W.; Yuan, H.; Xie, C.; Ren, J. Improvement of Infrared Smoldering Fire Detection Algorithm Based on YOLOv7. J. Combust. Sci. Technol. 2024, 30, 532–538.
  6. Di Nardo, M.; Gallo, M.; Murino, T.; Santillo, L.C. System dynamics simulation for fire and explosion risk analysis in home environment. Int. Rev. Model. Simul. 2017, 10, 43–54.
  7. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015.
  8. Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object Detection via Region-based Fully Convolutional Networks. In Advances in Neural Information Processing Systems; Curran Associates Inc.: New York, NY, USA, 2016.
  9. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.
  10. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
  11. Zhang, F.; Zhang, J.; Cheng, H. Research review on intelligent object detection technology for coal mines based on deep learning. Coal Sci. Technol. 2025, 53, 284–296.
  12. Liu, Y.; Wen, H.; Chen, C.; Guo, J.; Jin, Y.; Zheng, X.; Cheng, X.; Li, D. Research status and development trend of coal spontaneous combustion fire and prevention technology in China: A review. ACS Omega 2024, 9, 21727–21750.
  13. Qin, B.; Zhong, X.; Wang, D.; Xin, H.; Shi, Q. Research progress of coal spontaneous combustion process characteristics and prevention technology. Coal Sci. Technol. 2021, 49, 66–99.
  14. Hu, J.; Li, Y.; Li, J.; Zhang, W. Study on localization method of mine exogenous fire source based on CNN. J. Saf. Technol. 2024, 20, 134–140.
  15. Wang, W.; Zhang, B.; Wang, Z.; Zhang, F.; Ren, H.; Wang, J. Intelligent identification method of mine fire video images based on YOLOv5. Ind. Mine Autom. 2021, 47, 53–57.
  16. Zhang, W.; Wei, J. Improved YOLOv3 Fire Detection Algorithm Embedded in DenseNet Structure and Dilated Convolution Module. J. Tianjin Univ. (Sci. Technol.) 2020, 53, 976–983.
  17. Zhang, M.; Duan, J.; Liang, Z.; Guo, J.; Chai, D. Firework detection method based on improved YOLO-V5 algorithm. China Saf. Sci. J. 2024, 34, 155–161.
  18. Li, X.; Fan, Y.; Liu, Y.; Li, X.; Liu, Z. Industrial and Mining Fire Detection Algorithm Based on Improved YOLO. Fire Technol. 2025, 61, 709–728.
  19. Gao, J.; Zhang, W.; Li, Z. YOLO-BFEPS: Efficient Attention-Enhanced Cross-Scale YOLOv10 Fire Detection Model. Comput. Sci. 2025, 52, 424–432.
  20. Zhao, X.; Yu, M.; Xu, J.; Wu, P.; Yuan, H. An Enhanced YOLOv8n-Based Method for Fire Detection in Complex Scenarios. Sensors 2025, 25, 5528.
  21. Li, X.; Liu, Y. A Fusion of Entropy-Enhanced Image Processing and Improved YOLOv8 for Smoke Recognition in Mine Fires. Entropy 2025, 27, 791.
  22. Zhang, K.; Mao, B.; Liu, H.; Zhang, Y. UFS-YOLO: A real-time small fire target detection method incorporated hybrid attention in underground facilities. Measurement 2025, 117948.
  23. Wang, H.; Fu, X.; Yu, Z.; Zeng, Z. DSS-YOLO: An improved lightweight real-time fire detection model based on YOLOv8. Sci. Rep. 2025, 15, 8963.
  24. Diwan, T.; Anirudh, G.; Tembhurne, J.V. Object detection using YOLO: Challenges, architectural successors, datasets and applications. Multimed. Tools Appl. 2023, 82, 9243–9275.
  25. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning what you want to learn using programmable gradient information. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer Nature: Cham, Switzerland, 2024; pp. 1–21.
  26. Cui, C.; Gao, T.; Wei, S.; Du, Y.; Guo, R.; Dong, S.; Lu, B.; Zhou, Y.; Lv, X.; Liu, Q.; et al. PP-LCNet: A lightweight CPU convolutional neural network. arXiv 2021, arXiv:2109.15099.
  27. Sun, H.; Wen, Y.; Feng, H.; Zheng, Y.; Mei, Q.; Ren, D.; Yu, M. Unsupervised Bidirectional Contrastive Reconstruction and Adaptive Fine-Grained Channel Attention Networks for Image Dehazing. Neural Netw. 2024, 176, 106314.
  28. Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. CARAFE: Content-Aware ReAssembly of FEatures. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 3007–3016.
  29. Gao, J.; Chen, Y.; Wei, Y.; Li, J. Detection of specific building in remote sensing images using a novel YOLO-S-CIOU model. Case: Gas station identification. Sensors 2021, 21, 1375.
  30. Ma, S.; Xu, Y. MPDIoU: A loss for efficient and accurate bounding box regression. arXiv 2023, arXiv:2307.07662.
  31. Qi, Y.; Xue, K.; Wang, W.; Cui, X.; Wang, H.; Qi, Q. Assessment model of emergency response capability for coal and gas outburst accidents in mines. China Saf. Sci. J. 2024, 34, 225–230.
  32. Jiang, S.; Zou, X.; Yang, J.; Li, H.; Huang, X.; Li, R.; Zhang, T.; Liu, X.; Wang, D. Concrete bridge crack detection method based on improved YOLOv8s in complex backgrounds. J. Traffic Transp. Eng. 2024, 24, 135–147.
  33. Drzymała, A.J.; Korzeniewska, E. Application of the YOLO algorithm in fire and smoke detection systems for early detection of forest fires in real time. Przegląd Elektrotechniczny 2025, 101, 44–47.
Figure 1. YOLOv9 model architecture.
Figure 2. PP-LCNet module architecture.
Figure 3. FCAttention architecture.
Figure 4. CARAFE architecture.
Figure 5. The effect of MPDIoU.
Figure 6. PPL-YOLO-F-C model architecture.
Figure 7. A subset of the dataset.
Figure 8. Loss function of improved model. (a) Bounding box regression loss curve during model training; (b) Classification loss curve during model training; (c) Distribution focal loss during model training; (d) Bounding box regression loss curve during model validation; (e) Classification loss curve during model validation; (f) Distribution focal loss during model validation.
Figure 9. P–R curve of improved model.
Figure 10. Performance metrics curves of PPL-YOLO-F-C model.
Figure 11. Flame detection results of improved model. (a) Scenes with smoke interference; (b) Low-light scenes; (c) Multi-object scenarios; (d) Small object detection.
Figure 12. Comparison chart of performance indicators.
Figure 13. P–R curve comparison chart.
Figure 14. Visual detection results comparison of improved model. (a) Scenes with smoke interference; (b) Small object detection.
Table 1. Hyperparameter settings.
Parameters       Values
Image size       640 × 640 × 3
Learning rate    0.001
Momentum         0.937
Batch size       8
Epochs           300
Weight decay     0.005
Table 2. Performance comparison of different base YOLOv9 models.
Model       mAP [%]   mAP@50 [%]   mAP@75 [%]   Params [MB]   GFLOPs
YOLOv9-T    38.3      53.1         41.3         2.0           7.7
YOLOv9-S    46.8      63.4         50.7         7.1           26.4
YOLOv9-M    51.4      68.1         56.1         20.0          76.3
YOLOv9-C    53.0      70.2         57.8         25.3          102.1
YOLOv9-E    55.6      72.8         60.6         57.3          189.0
Table 3. Ablation experiment results.
PP-LCNet   FCAttention   CARAFE   MPDIoU   P [%]   R [%]   mAP@50 [%]   Params [MB]   GFLOPs   FPS [Frames·s−1]
−          −             −        −        92.59   87.36   93.49        32.55         130.72   29.50
√          −             −        −        83.79   76.04   84.72        19.53         83.90    159.49
√          √             −        −        91.20   79.61   89.00        27.12         104.90   27.79
√          √             √        −        94.32   80.43   95.46        28.12         112.97   35.35
√          √             √        √        97.36   84.91   96.49        28.12         112.97   36.33
√ indicates that the module has been added; − indicates that the module has not been added.
Table 4. Performance comparison between the improved model and mainstream models.
Model          P [%]   R [%]   mAP@50 [%]   Params [MB]   GFLOPs   FPS [Frames·s−1]
Faster R-CNN   81.62   80.86   83.76        128.55        247.76   15.60
YOLOv5         94.07   82.77   94.91        34.63         126.78   32.52
YOLOv7         92.48   81.34   91.42        35.79         136.67   31.62
YOLOv8         92.59   87.36   94.49        32.45         140.72   33.14
PPL-YOLO-F-C   97.36   84.91   96.49        28.12         112.97   36.33
