1. Introduction
Small targets account for a large proportion of visual targets in underground coal mines; taking helmet-wearing as an example, more than 25% of the visual targets are small targets under the COCO definition of small targets. The safety of workers in the underground production and operation environment has always been the primary concern. Because the underground environment is complex and variable, it harbors many potential dangers such as rib spalling and roof coal-rock falls, and wearing a helmet is a basic requirement for protecting workers' lives and safety. However, existing helmet detection methods for underground coal mine scenarios face many challenges in practical application: (1) Effective samples are few and the spatial bias is strong. Underground helmet detection must detect not only workers wearing helmets but also workers not wearing them, and the latter is the focus of coal mine safety concern. Because of safety regulations, workers rarely work without helmets underground, so effective samples are significantly scarce; moreover, because surveillance cameras are usually installed on the two ribs of the roadway, the distribution of helmet positions exhibits a strong spatial bias. (2) The underground environment has a complex background containing a variety of production and transportation equipment, and the visual data are susceptible to dust and water mist. (3) Helmet-wearing targets are usually small in the image [1], so detecting them is difficult. These problems make it hard for the accuracy and reliability of existing detection systems to meet practical needs.
In recent years, deep learning algorithms represented by convolutional neural networks [2] and Transformer networks [3] have developed rapidly in the field of object detection, accompanied by the release of general-scenario datasets such as COCO [4] and Pascal VOC [5]. Object detection research has been widely applied in practice, and in the field of real-time object detection the YOLO [6] series has produced increasingly mature models for target recognition, detection, and semantic segmentation. For underground coal mine helmet-wearing detection, these techniques can efficiently recognize the wearing condition of workers by analyzing underground surveillance video [1] and send timely alerts when workers are not wearing helmets. However, directly applying the above algorithms to helmet-wearing detection in coal mine scenes faces the challenges mentioned before: the complex conditions and data distribution of the underground environment require detection algorithms that resist background interference, generalize across scenes, effectively characterize small helmet-wearing targets, and maintain high-precision detection under few samples and complex backgrounds.
Safety problems in the underground coal mine environment are prominent, requiring real-time detection of whether workers are wearing helmets. This study therefore proposes a few-sample helmet-wearing detection method for the underground coal mine environment based on the recent YOLOv10 real-time object detection algorithm; by expanding the training samples and optimizing the network structure, it effectively improves detection accuracy and thus provides powerful support for the safety management of workers during underground coal mine operations.
The main task of object detection is to determine the location of an object in a given image and to label the category to which the object belongs. Convolutional neural networks pioneered the breakthrough in visual object classification [1], and researchers subsequently applied them to visual object detection tasks. Existing object detection algorithms can be broadly categorized into single-stage and two-stage approaches. In two-stage object detection, Girshick et al. proposed the R-CNN [7] algorithm based on regional convolutional neural networks in 2014, pioneering the deep-learning-based object detection paradigm. Ren et al. [8] then combined a region proposal network with R-CNN in the Faster R-CNN algorithm, achieving faster and more accurate detection. He et al. [9] proposed Mask R-CNN, which adds instance segmentation capability, followed by TridentNet [10], whose multi-scale fusion module achieves higher-accuracy object detection. Two-stage detectors usually excel in accuracy thanks to the fine-grained classification and regression performed after candidate region generation, but they are less computationally efficient, so single-stage detectors are more commonly used. Liu et al. [11] proposed the SSD algorithm, whose multi-scale convolutional detection can handle targets of different sizes. Lin et al. [12] proposed the focal loss, which focuses on categories with fewer samples to address positive–negative sample imbalance and thereby enhances detection performance. Zhu et al. [13] proposed the BiFPN network, which improves feature fusion efficiency while maintaining detection accuracy.
The YOLO (You Only Look Once) architecture family has fundamentally reshaped real-time object detection paradigms since its seminal introduction by Redmon et al. [
6], and has since been iterated through to YOLOv12 [14]. Subsequent iterations demonstrate progressive evolution: YOLOv2 [
15], YOLOv3 [
16], and YOLOv4 [
17] enhanced feature extraction through multi-scale predictions and anchor box optimization; YOLOv5 [
18], YOLOv6 [
19], and YOLOv7 [
20] introduced modular design patterns enabling architecture scaling; while YOLOv8 [
21], YOLOv9 [
22], and YOLOv10 [
23] achieved state-of-the-art speed–accuracy tradeoffs through reparametrized convolutions and spatial-channel decoupling. YOLOv10 removes time-consuming NMS in the inference stage and achieves high accuracy for object detection tasks. YOLOv11 [
24] enhances performance in complex vision tasks such as semantic segmentation through sophisticated spatial attention mechanisms. YOLOv12 achieves computationally efficient attention operations via integration of the hardware-aware Flash Attention computational framework. This continuous innovation trajectory—driven by computational efficiency demands across embedded systems, robotics, and industrial applications—positions YOLO as the de facto solution for latency-critical vision systems, culminating in the highly efficient YOLOv10 architecture that forms this study’s foundation. Recently, transformer-based approaches have also been introduced to the field of object detection. Inspired by the channel-wise attention mechanism in SENet [
25], the Convolutional Block Attention Module (CBAM) [
26] incorporates sequential channel and spatial attention operations to enhance feature extraction capabilities. RT-DETR [
27] achieves latency comparable to the YOLO series once NMS is taken into account. RF-DETR [28] pushes the latency bar even further with multi-scale self-attention.
In coal mine helmet object detection, a key challenge is the scarcity and low quality of effective samples. Li et al. [29] effectively improved the accuracy of underground small object detection by combining convolution and attention to enhance feature extraction. Yang et al. [30] proposed a local enhancement module that maps low-illumination pixels to normal states for image processing.
Another challenge of helmet detection in coal mine scenarios is the complex background: dust and water mist in the underground environment, as well as the distribution of underground illumination and its changes, can adversely affect helmet detection. Feng et al. [31] applied a recursive feature pyramid for more efficient multi-scale feature fusion, further improving helmet detection accuracy and reducing missed detections. Sun and Liu [32] introduced a parameter-free attention mechanism into YOLOv7, which effectively enhances helmet feature extraction and enriches the contextual information captured by the model, improving both the speed and the accuracy of helmet detection in underground environments.
Underground helmet detection also faces the difficulty of small targets. Hou et al. [33] improved the YOLOv5 backbone with Ghost convolution and combined it with BiFPN for feature fusion to detect small helmet targets in underground scenes. Cao et al. [34] used K-Means to optimize the initial anchor box estimates in YOLOv7, improving the accuracy of underground small object detection.
Helmet-wearing detection algorithms have achieved promising results in coal mine scenarios; however, existing methods still have shortcomings: (1) They focus mostly on detecting the helmet itself, overlooking the underground safety concern of whether workers are wearing one, and they lack an analysis of the distribution of underground helmet data. (2) The underground background is complex, and effective schemes for reducing background interference in small object detection are still lacking. (3) Helmets are small targets, and there remains considerable room to improve the perceptual ability of the detection head. For real-time detection tasks, the YOLO architecture series remains the predominant solution due to its exceptional performance–efficiency balance. While YOLOv11 and YOLOv12 represent more recent advances, they introduce either increased architectural complexity without commensurate gains in small-object detection accuracy or stringent hardware constraints during deployment. Consequently, we select YOLOv10 as the foundational architecture for our helmet detection algorithm, benefiting from its high real-time performance and detection accuracy in underground mining environments.
Based on the YOLOv10 architecture, this paper analyzes the distribution of underground visual data and the spatial bias of helmet positions, and applies data expansion and augmentation algorithms to effectively increase the number of samples; proposes a background-aware convolution module for the backbone network to suppress the interference of the complex background on helmet detection; integrates convolution and attention modules in the detection head, expanding the receptive field to improve the model's ability to perceive small helmet-wearing targets; and optimizes the loss function using the prior knowledge that helmet aspect ratios are concentrated, strengthening the constraint on helmet-wearing targets. Experiments show that the optimized algorithm effectively improves the detection accuracy for small helmet-wearing targets in underground environments, which benefits the safety management of underground workers.
2. Methodology
The underground environment of coal mines differs from the surface environment in several respects: (1) It has a complex background with more interference. (2) Underground coal mine areas are spatially restricted, so helmets tend to appear in particular regions of the visual data, causing spatial bias. (3) At the data level, underground helmet visual data are clearly imbalanced: because of safety management regulations, there are many samples of workers wearing helmets but few samples of workers not wearing them. (4) Helmet-wearing targets are small, further increasing the difficulty of detection.
To address these problems, this paper builds on YOLOv10 to optimize a helmet-wearing detection model for underground scenes, proposing a few-sample underground helmet-wearing detection method with fused dynamic background perception for helmet detection in complex underground environments. The structure of the model is shown in Figure 1, and the improved modules proposed in this paper are marked with red dashed boxes.
The backbone network is mainly used for feature extraction, the neck network generates multi-scale detection information, and the detection head computes the detection outputs at different scales and the training loss. To address the significant spatial bias of the underground environment, this paper analyzes the distribution of underground visual data and the spatial bias of helmet positions, and proposes an effective data expansion strategy that incorporates the open-source surface helmet-wearing detection dataset released by Baidu PaddlePaddle [35], yielding a more uniform data distribution that is easier to generalize from. To address false alarms caused by complex background interference underground, this paper proposes a Dynamic Background-aware Convolution module (DBConv) for background masking and applies it in the backbone network of YOLOv10: by introducing dynamically generated background mask regions, the module guides the network to learn the important areas and effectively reduces background interference, while depthwise separable convolution with a large kernel enlarges the receptive field and keeps the parameter count low. To cope with the small size of helmet targets, this paper applies a Global–Local Fusion Module (GLFM) in the detection head: a convolution pathway extracts local high-frequency features, while an attention pathway focuses on global low-frequency features [36]; fusing the two improves the perception of small targets and reduces scale sensitivity, enabling stable detection of small targets with a further reduced parameter count. Finally, the learning process is improved by optimizing the hyperparameters used during model training.
2.1. Helmet Data Enhancement
Underground helmet data have the following characteristics: (1) The existing open-source underground helmet dataset [37] mainly targets helmets themselves, whereas in practice underground safety management must detect whether operating workers are or are not wearing helmets, so the existing dataset is difficult to apply directly to underground helmet-wearing detection. (2) The data distribution of underground coal mine scenes is island-like: data from similar underground scenes cluster tightly, while data from different scenes lie far apart with few intersections, as in Figure 2. (3) Because of safety management regulations, samples without helmets are very rare in underground environments, and the vast majority of samples show helmets worn normally. (4) Visual acquisition cameras in underground scenes are usually installed at fixed points and rarely moved, so the locations where helmets appear exhibit significant spatial bias, as shown in Figure 3.
Data distribution. Because of the regionally confined character of the underground environment, the underground data exhibit a significant island effect, which hampers the deep learning network's ability to learn the features of different scenes; relying on these data alone, the distribution lacks diversity, making it difficult for the network to learn the continuous variation of the environment effectively.
Spatial bias. In the underground environment, camera installations are largely fixed, with most cameras mounted high on the two ribs of the roadway, so the collected images look diagonally downward. As a result, helmet-wearing (or non-wearing) targets never appear in large pixel regions of the underground visual data, and the neural network absorbs this spatial bias of the helmet-wearing pixel regions during learning, which weakens its generalization ability. The spatial bias of the underground dataset used in this paper is shown in Figure 3; to draw it, we smooth the helmet-wearing target box regions with a mollifier kernel function.
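Such a spatial-bias map can be produced by accumulating box centers on a grid and smoothing with a compactly supported mollifier kernel; a minimal sketch (the grid size and kernel radius are placeholder values, not the paper's settings):

```python
import numpy as np
from scipy.signal import convolve2d

def mollifier_kernel(radius: int) -> np.ndarray:
    """Compactly supported bump function exp(-1 / (1 - r^2)) on r < 1."""
    ax = np.linspace(-1, 1, 2 * radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx ** 2 + yy ** 2
    k = np.where(r2 < 1, np.exp(-1.0 / np.maximum(1 - r2, 1e-9)), 0.0)
    return k / k.sum()

def spatial_bias_map(boxes, h=360, w=640, radius=25):
    """Accumulate box centers on an h x w grid, then smooth with the mollifier.
    boxes: iterable of (cx, cy) centers normalized to [0, 1]."""
    grid = np.zeros((h, w))
    for cx, cy in boxes:
        grid[min(int(cy * h), h - 1), min(int(cx * w), w - 1)] += 1
    return convolve2d(grid, mollifier_kernel(radius), mode="same")

heat = spatial_bias_map([(0.3, 0.6), (0.32, 0.62), (0.7, 0.55)])
```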
2.2. Convolutional Module for Dynamic Background Awareness in Backbone Networks
Helmet data from underground coal mine scenes are affected by dust and water mist, and the underground environment suffers from complex backgrounds, illumination changes, and monotonous scene colors. These factors give underground helmet data low clarity and an overall dark tone, so helmet detection is vulnerable to background interference: the network may be misled into detecting a pure background region as a helmet, or even as a worker not wearing a helmet, producing false alarms for safety management. In addition, some background areas of mechanical equipment look visually similar to workers' helmets, which can also make helmet detection focus on the wrong area. In object detection, the backbone network is mainly responsible for representing the input image; under the adverse conditions above, the representation output by the backbone is susceptible to background interference, which makes it difficult for the subsequent detection head to focus on the helmet target region. In this paper, drawing on Dynamic Region Convolution (DRConv) [39], we propose the dynamic background-aware convolution module (DBConv) and apply it in the backbone network of YOLOv10. The module dynamically generates background-region masks and aggregates them with the convolution features via an element-wise product, which further enlarges the effective feature space of the module [40]; at the same time, the mask is generated directly by convolution, which avoids both the inconsistency between the forward and backward passes and the non-differentiability of the backward pass caused by the argmax operation in DRConv.
The original DRConv module uses argmax to obtain region perception and splits the input into different regions; a kernel generation module produces convolution kernels, and different kernels are applied in the different regions for feature extraction, yielding the high-dimensional feature representation of the module output. This design is effective in face recognition, where it can separate semantically distinct regions such as the eyes and nose, but when applied directly to helmet detection it is difficult to assign a specific meaning to each background region. Moreover, because argmax is not differentiable, the backward pass contains non-differentiable links that must be approximated with a complex computational structure.
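The problem can be illustrated in a few lines of PyTorch: gradients do not flow through a hard argmax-style mask, whereas a convolution-generated soft mask keeps the computation graph differentiable (a minimal illustration, not code from DRConv):

```python
import torch

x = torch.randn(1, 4, 8, 8, requires_grad=True)

# Hard region assignment: argmax yields an integer tensor and the one-hot
# mask comes from a comparison, so no gradient reaches x through this path.
region = x.argmax(dim=1, keepdim=True)             # integer tensor, no grad_fn
hard_mask = (x == x.max(dim=1, keepdim=True).values).float()
print(hard_mask.requires_grad)                      # False: the graph is cut

# A soft mask generated by a convolution stays differentiable end to end.
conv = torch.nn.Conv2d(4, 1, kernel_size=3, padding=1)
soft_mask = torch.sigmoid(conv(x))
soft_mask.sum().backward()
print(x.grad is not None)                           # True: gradients flow
```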
The structure of the DBConv module proposed in this paper is shown in Figure 4. Given an input feature representation $X \in \mathbb{R}^{B \times C \times H \times W}$, where $B$ denotes the batch size and $C$ the number of channels, the input passes through the mask-generating convolution module of the upper pathway to obtain a single-channel mask feature $M \in \mathbb{R}^{B \times 1 \times H \times W}$ for background sensing. To keep the mask size equal to the input size, the stride is set to $s = 1$ and the padding is tied to the kernel size by $p = (k - 1)/2$, where $p$ is the padding size and $k$ is the convolution kernel size; the output-size constraint is thus satisfied without upsampling or transposed convolution. The design of this module balances computational efficiency and solution accuracy. For efficiency, the convolutions are depthwise separable, reducing the parameter count and computational scale; for accuracy, an enlarged convolution kernel is combined with dilated convolution to further expand the receptive field of the feature computation. The module uses two convolution operations: the first mainly extracts the mask region, and the second expands the receptive field through spatial convolution to form a mask that perceives the general background region. After the mask-generating convolution, the mask features used for dynamic background-aware feature fusion are obtained.
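A minimal PyTorch sketch of this upper pathway (the kernel size and dilation rate are assumptions, as the exact hyperparameters are not restated here):

```python
import torch
import torch.nn as nn

class MaskGen(nn.Module):
    """Mask-generating pathway of DBConv (sketch): two depthwise-separable
    convolutions with a large, dilated kernel; stride 1 with matching padding
    keeps the spatial size, so no upsampling is needed."""
    def __init__(self, channels: int, k: int = 7, dilation: int = 2):
        super().__init__()
        # First convolution: coarse mask-region extraction.
        self.dw1 = nn.Conv2d(channels, channels, k, stride=1,
                             padding=(k - 1) // 2, groups=channels)
        self.pw1 = nn.Conv2d(channels, channels, 1)
        # Second convolution: dilated spatial conv enlarges the receptive
        # field to perceive the general background region.
        self.dw2 = nn.Conv2d(channels, channels, k, stride=1,
                             padding=dilation * (k - 1) // 2,
                             dilation=dilation, groups=channels)
        self.pw2 = nn.Conv2d(channels, 1, 1)   # collapse to a 1-channel mask

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        m = self.pw1(self.dw1(x))
        m = self.pw2(self.dw2(m))
        return torch.sigmoid(m)                # soft background mask in (0, 1)

mask = MaskGen(64)(torch.randn(2, 64, 80, 80))
print(mask.shape)  # torch.Size([2, 1, 80, 80])
```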
The lower pathway of DBConv is the convolutional kernel generation module. For a given input feature representation $X \in \mathbb{R}^{B \times C \times H \times W}$, the main role of this module is to generate $C_{\text{out}}$ convolution kernels, each of size $k \times k$, where $C_{\text{out}}$ is the number of output channels of the DBConv module and $k$ is the convolution kernel size. The formula for this module is as follows:
$$W = \mathrm{Chunk}\Big(\mathrm{PW}_{g}\big(\mathrm{DW}\big(\mathrm{PW}\big(\mathrm{AAP}(X)\big)\big)\big)\Big),$$
where $\mathrm{AAP}$ stands for the Adaptive Average Pooling operation, $\mathrm{PW}$ stands for the Point-Wise Convolution operation, $\mathrm{DW}$ stands for the Depth-Wise Convolution operation, the outermost $\mathrm{PW}_{g}$ is computed with the group mechanism to obtain a kernel tensor of size $B \times C_{\text{out}} \times k \times k$, and the Chunk operation splits this tensor into the final convolution kernels.
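A minimal PyTorch sketch of this pathway under the reconstruction above (the 3 × 3 depthwise kernel and the group count are assumptions; the final Chunk into per-channel kernels is implicit in the output shape):

```python
import torch
import torch.nn as nn

class KernelGen(nn.Module):
    """Kernel-generating pathway of DBConv (sketch): AAP pools the input to
    k x k, point-wise and depth-wise convolutions mix the pooled features,
    and a grouped point-wise convolution expands them into C_out kernels."""
    def __init__(self, c_in: int, c_out: int, k: int = 3, groups: int = 4):
        super().__init__()
        self.aap = nn.AdaptiveAvgPool2d(k)                      # B x C_in x k x k
        self.pw = nn.Conv2d(c_in, c_out, 1)                     # channel projection
        self.dw = nn.Conv2d(c_out, c_out, 3, padding=1,
                            groups=c_out)                       # spatial mixing
        self.pw_g = nn.Conv2d(c_out, c_out, 1, groups=groups)   # grouped PW

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Output: one k x k kernel per output channel, per sample.
        return self.pw_g(self.dw(self.pw(self.aap(x))))

w = KernelGen(64, 64)(torch.randn(2, 64, 80, 80))
print(w.shape)  # torch.Size([2, 64, 3, 3])
```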
Once the mask generation module and the convolution kernel generation module are constructed, the generated convolution kernels are applied to each channel of the input features; after channel aggregation, an element-wise product (the '*' operation) is computed with the mask features. Theoretical analysis shows that the element-wise product effectively increases the spatial dimension of the feature representation, making the features more distinguishable from one another. On the other hand, DBConv uses convolution rather than an attention mechanism for efficient feature extraction, aggregation, and representation. This is because feature extraction in the backbone network should preserve more high-frequency information, and theoretical analysis [36] suggests that the attention mechanism is better at capturing global features than local high-frequency features, i.e., attention computation can be regarded as a low-pass filter.
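Combining the two pathways, the DBConv forward pass can be sketched as follows (the depthwise application of per-sample kernels via the grouped-convolution trick and the 1 × 1 aggregation are our assumptions about the wiring):

```python
import torch
import torch.nn.functional as F

def dbconv_forward(x, kernels, mask, pw_aggregate):
    """Apply per-sample generated kernels depthwise, aggregate channels,
    then gate with the background mask (element-wise product).
    x:            B x C x H x W input features
    kernels:      B x C x k x k kernels from the kernel-generation pathway
    mask:         B x 1 x H x W mask from the mask-generation pathway
    pw_aggregate: 1x1 conv aggregating channels after the depthwise step
    """
    B, C, H, W = x.shape
    k = kernels.shape[-1]
    # Grouped-convolution trick: fold the batch into the channel axis so
    # each sample is convolved with its own generated kernels.
    y = F.conv2d(x.reshape(1, B * C, H, W),
                 kernels.reshape(B * C, 1, k, k),
                 padding=k // 2, groups=B * C)
    y = y.reshape(B, C, H, W)
    y = pw_aggregate(y)        # channel aggregation
    return y * mask            # dynamic background gating

x = torch.randn(2, 64, 80, 80)
kernels = torch.randn(2, 64, 3, 3)   # stand-in for the kernel pathway output
mask = torch.rand(2, 1, 80, 80)      # stand-in for the mask pathway output
out = dbconv_forward(x, kernels, mask, torch.nn.Conv2d(64, 64, 1))
print(out.shape)  # torch.Size([2, 64, 80, 80])
```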
2.3. Detection Head Convolution and Attention Fusion Module
The helmet target occupies a relatively small area in the image, so the detection head must be able to effectively extract and characterize helmet features. The area distribution and aspect ratio of small helmet targets in the dataset used in this paper are shown in Figure 5. According to the COCO definition, targets with a pixel area of less than $32 \times 32$ (the input resolution of the COCO dataset is 640 × 640) are small targets; 25.81% of the underground helmet targets fall into this category, so a substantial proportion of underground helmet detection is small object detection. Because few pixels of information are available for small targets, their effective information is easily lost in the feature extraction stage of the backbone network, and their shape and material information are difficult to characterize, which makes the network's detection performance for small targets drop sharply.
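For reference, the small-target proportion can be computed directly from COCO-format annotations (a minimal sketch; the annotation file name is hypothetical):

```python
import json

def small_object_fraction(annotation_file: str, thresh: float = 32 * 32) -> float:
    """Fraction of annotated boxes whose pixel area falls below the COCO
    small-object threshold of 32 x 32 = 1024 pixels."""
    with open(annotation_file) as f:
        anns = json.load(f)["annotations"]
    # COCO bbox format is [x, y, width, height].
    small = sum(1 for a in anns if a["bbox"][2] * a["bbox"][3] < thresh)
    return small / len(anns)

# e.g. small_object_fraction("helmet_train.json") -> 0.2581 for our data
```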
In this paper, we propose a joint optimization strategy combining feature fusion and an attention mechanism in the detection head for small-object helmet detection. The structure of this module is shown in Figure 6. To improve feature perception for small helmet targets, we propose a Global–Local Fusion Module (GLFM) that fuses high-frequency local sensing information with low-frequency global sensing information, overcoming the tendency of small targets' feature responses to vanish or be disturbed during detection. The GLFM contains two computational pathways: (1) a low-frequency global detection sensing module; (2) a high-frequency local detection sensing module. GLFM applies the attention mechanism for low-frequency global information extraction to counter the susceptibility of small helmet targets to global interference: combining attention with the feature data makes it easy for the network to learn large weights for important features and suppress the weights of unimportant ones, reducing global background interference. Convolution operations realize the local high-frequency information extraction, amplifying the feature response of the local region around the small target and avoiding the loss of the helmet target box caused by the excessive smoothing of pooling operations. The outputs of the two pathways are summed to form the detection head output.
The low-frequency global detection sensing module simulates the attention computation with depthwise separable convolutions: three independent parallel depthwise separable convolutions compute $Q$, $K$, and $V$, and $Q$ and $K$ are used to compute the required Attention Map. The Attention Map is multiplied with $V$ to obtain the attention output, to which a Point-Wise Convolution (PW) is applied to obtain the global output.
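A sketch of this pathway in PyTorch, assuming single-head attention over flattened spatial positions and 3 × 3 depthwise kernels:

```python
import torch
import torch.nn as nn

class GlobalBranch(nn.Module):
    """Low-frequency global pathway of GLFM (sketch): Q, K, V come from three
    parallel depthwise-separable convolutions; attention is computed over the
    flattened spatial positions, then a point-wise conv forms the output."""
    def __init__(self, c: int):
        super().__init__()
        def dw_sep():
            return nn.Sequential(nn.Conv2d(c, c, 3, padding=1, groups=c),
                                 nn.Conv2d(c, c, 1))
        self.q, self.k, self.v = dw_sep(), dw_sep(), dw_sep()
        self.pw = nn.Conv2d(c, c, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        q = self.q(x).flatten(2)        # B x C x HW
        k = self.k(x).flatten(2)
        v = self.v(x).flatten(2)
        attn = torch.softmax(q.transpose(1, 2) @ k / C ** 0.5, dim=-1)
        out = (v @ attn.transpose(1, 2)).reshape(B, C, H, W)
        return self.pw(out)             # point-wise conv -> global output

print(GlobalBranch(32)(torch.randn(1, 32, 20, 20)).shape)  # 1 x 32 x 20 x 20
```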
The high-frequency local detection sensing module first performs channel shuffling on the input features to introduce randomness and reduce the dependence of the subsequent convolution on particular channels; it then uses depthwise separable convolution to extract local high-frequency detection information, yielding the output of the local detection sensing pathway:
$$Y_{\text{local}} = \mathrm{PW}\big(\mathrm{DW}\big(\mathrm{Shuffle}(X)\big)\big).$$
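A matching sketch of the local pathway (the group count for the channel shuffle is an assumption):

```python
import torch
import torch.nn as nn

class LocalBranch(nn.Module):
    """High-frequency local pathway of GLFM (sketch): channel shuffle followed
    by a depthwise-separable convolution."""
    def __init__(self, c: int, groups: int = 4):
        super().__init__()
        self.shuffle = nn.ChannelShuffle(groups)
        self.dw = nn.Conv2d(c, c, 3, padding=1, groups=c)
        self.pw = nn.Conv2d(c, c, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pw(self.dw(self.shuffle(x)))

x = torch.randn(1, 32, 20, 20)
y_local = LocalBranch(32)(x)   # summed with the global pathway output in GLFM
```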
The GLFM is applied to the detection head of YOLOv10 to form the structure shown in Figure 7. Detection heads at multiple resolutions perform classification and regression for helmet targets, and the network parameters are optimized through backpropagation.
2.4. Loss Function Optimization
The aspect ratios of the ground-truth boxes in the helmet annotation files are fairly uniform; their distribution is shown in Figure 5. As can be seen from the figure, the helmet aspect ratios are concentrated in the Y~Z interval and are relatively uniform overall. YOLOv10, however, adopts the CIoU loss:
$$\mathcal{L}_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v,$$
where $IoU$ is the intersection over union of the predicted box and the ground-truth box, $\rho(b, b^{gt})$ is the distance between the centers of the predicted box and the ground-truth box, $c$ is the diagonal length of the smallest box enclosing the two boxes, $v$ is the error term accounting for the aspect ratios of the ground-truth and predicted boxes, and $\alpha$ is the coefficient balancing the different losses; $v$ and $\alpha$ are given by
$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2, \qquad \alpha = \frac{v}{(1 - IoU) + v}.$$
Although CIoU comprehensively considers the aspect ratio, the center distance between the predicted and ground-truth boxes, and other factors, it is typically applied to detection tasks where the aspect ratio varies widely. For helmets, whose aspect ratio varies little, dynamic weighting can be applied to penalize deviations from the ground-truth aspect ratio: the balancing coefficient $\alpha$ is corrected with a predetermined hyperparameter so that the aspect-ratio error term is amplified whenever the predicted aspect ratio differs significantly from the ground-truth aspect ratio.
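For illustration, the following sketch computes the CIoU loss with one possible form of the dynamic weighting, $\alpha' = \alpha(1 + \gamma v)$ with predetermined hyperparameter $\gamma$; this specific form is an assumption for the sketch, not necessarily the exact correction used in the paper:

```python
import math
import torch

def ciou_loss(pred, gt, gamma: float = 0.0):
    """CIoU loss for boxes in (x1, y1, x2, y2) format, with an optional
    dynamic aspect-ratio weighting alpha' = alpha * (1 + gamma * v);
    gamma=0 recovers the standard CIoU."""
    # Intersection over union.
    lt = torch.max(pred[..., :2], gt[..., :2])
    rb = torch.min(pred[..., 2:], gt[..., 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_g = (gt[..., 2] - gt[..., 0]) * (gt[..., 3] - gt[..., 1])
    iou = inter / (area_p + area_g - inter + 1e-7)

    # Center distance over the enclosing-box diagonal.
    c_lt = torch.min(pred[..., :2], gt[..., :2])
    c_rb = torch.max(pred[..., 2:], gt[..., 2:])
    diag = ((c_rb - c_lt) ** 2).sum(-1) + 1e-7
    center_p = (pred[..., :2] + pred[..., 2:]) / 2
    center_g = (gt[..., :2] + gt[..., 2:]) / 2
    rho2 = ((center_p - center_g) ** 2).sum(-1)

    # Aspect-ratio term v and balancing coefficient alpha.
    wp = pred[..., 2] - pred[..., 0]; hp = pred[..., 3] - pred[..., 1]
    wg = gt[..., 2] - gt[..., 0];     hg = gt[..., 3] - gt[..., 1]
    v = (4 / math.pi ** 2) * (torch.atan(wg / (hg + 1e-7))
                              - torch.atan(wp / (hp + 1e-7))) ** 2
    alpha = v / ((1 - iou) + v + 1e-7)
    alpha = alpha * (1 + gamma * v)   # assumed dynamic weighting form

    return 1 - iou + rho2 / diag + alpha * v

pred = torch.tensor([[10., 10., 50., 40.]])
gt = torch.tensor([[12., 12., 52., 44.]])
print(ciou_loss(pred, gt, gamma=0.5))
```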
For the optimization of the network learning rate, this paper uses OneCycleLR to adjust the learning rate dynamically: at the beginning of training, the learning rate increases gradually so that the parameters can be updated quickly; once learning has progressed sufficiently, the learning rate gradually decreases and stabilizes, making it easier for the network to converge to parameters with a smaller loss.
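In PyTorch this schedule is provided by torch.optim.lr_scheduler.OneCycleLR; a minimal usage sketch with placeholder values for max_lr, epochs, and steps per epoch:

```python
import torch

model = torch.nn.Linear(10, 2)                      # stands in for the detector
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=0.01, epochs=100, steps_per_epoch=50)

for epoch in range(100):
    for _ in range(50):
        optimizer.zero_grad()
        loss = model(torch.randn(4, 10)).sum()
        loss.backward()
        optimizer.step()
        scheduler.step()   # warm up, then anneal, once per optimizer step
```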
4. Conclusions
In this paper, we proposed a novel yet practical method for the underground helmet-wearing small object detection problem under heavy background interference. Our method comprises two parts: data processing and structure design. First, to deal with the spatial bias of helmet objects in the underground coal mine environment, we apply data augmentation steps that mitigate this strong bias. Second, to overcome the extreme imbalance of helmet-wearing data in real-world coal mines, we mix the underground samples with publicly available normal samples, which effectively enhances sample diversity. Because the helmet is a small target whose aspect ratio varies little, we design a backbone optimization module based on dynamic background perception that masks the background and reduces its interference; for small object detection, we optimize the detection head by fusing global and local information to improve the detection effect; and in the loss function design, we adjust the weights dynamically according to the aspect ratio to further optimize helmet-wearing detection performance.
Experiments show that the proposed algorithm effectively improves the accuracy of small-target helmet-wearing detection in underground coal mine scenes: its accuracy improves by 5.1% relative to the baseline model YOLOv10l, and its detection performance surpasses both current convolution-based methods (the YOLO series) and attention-based methods (e.g., RT-DETR). The proposed algorithm thus offers high detection accuracy and better detection performance for small helmet-wearing targets.
While the proposed methodology demonstrates compelling efficacy in our experimental evaluation, we acknowledge that emerging model architectures will inevitably offer alternative solutions. Crucially, however, the core design principles underlying our approach, particularly the data processing and structure design, remain transferable to future iterations of real-time detection systems. Furthermore, the data mixing strategy requires deliberate consideration of domain distribution characteristics: when significant domain discrepancies exist, it risks performance degradation and may adversely impact model convergence. Helmet detection scenarios, however, do not encounter such substantial domain gaps, owing to the inherently consistent feature distributions across mining environments.