Article

Enhancing Mine Safety with YOLOv8-DCDB: Real-Time PPE Detection for Miners

1 Big Data and Internet of Things Research Center, China University of Mining and Technology, Beijing 100083, China
2 Key Laboratory of Intelligent Mining and Robotics, Ministry of Emergency Management, Beijing 100083, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(14), 2788; https://doi.org/10.3390/electronics14142788
Submission received: 4 June 2025 / Revised: 27 June 2025 / Accepted: 29 June 2025 / Published: 11 July 2025
(This article belongs to the Special Issue Advances in Information Processing and Network Security)

Abstract

In the coal industry, miner safety is increasingly challenged by growing mining depths and complex environments. Failure to wear Personal Protective Equipment (PPE) is a frequent factor in accidents, threatening lives and reducing operational efficiency. Additionally, existing PPE datasets are inadequate for model training due to their small size, lack of diversity, and poor labeling. Current methods often struggle with the complexity of multi-scenario, multi-type PPE detection, especially under varying environmental conditions and with limited training data. In this paper, we propose a novel minersPPE dataset and an improved YOLOv8-based algorithm enhanced with a Dilated-CBAM (Dilated Convolutional Block Attention Module) and a DBB (Diverse Branch Block) detection head (YOLOv8-DCDB) to address these challenges. The minersPPE dataset includes 14 categories of protective equipment covering the various body parts of miners. To improve detection performance under complex lighting conditions and with varying PPE features, the algorithm incorporates the Dilated-CBAM module. Additionally, a multi-branch structured detection head is employed to effectively capture multi-scale features, especially enhancing the detection of small targets. To mitigate the class imbalance caused by the long-tail distribution of the dataset, we adopt a K-fold cross-validation strategy. Compared to the standard YOLOv8 baseline, experiments on the minersPPE dataset demonstrate an 18.9% improvement in detection precision, verifying the effectiveness of the proposed YOLOv8-DCDB model in multi-scenario, multi-type PPE detection tasks.

1. Introduction

Coal mining remains a vital energy source globally, with major producers like Australia, the United States, and China playing significant roles in both national and international markets [1]. Despite its importance, coal mining presents numerous safety hazards, including gas explosions, roof collapses, and water inrushes, often exacerbated by harsh environmental conditions. Miners are required to work in dark, confined, and potentially dangerous underground spaces, heightening the risk of accidents and making rescue operations extremely difficult.
In such a high-risk working environment, the proper use of Personal Protective Equipment (PPE) is essential to safeguard miners. However, the improper or non-use of PPE remains a common issue, with frequent failures to wear safety helmets, shoes, gloves, face shields, and other protective gear. These lapses in safety practices not only jeopardize the health of miners [2] but also contribute to substantial economic losses and adverse social impacts on mining enterprises. Studies have shown that the correct use of PPE can significantly reduce injury risks for miners at work [3]. The reasons for improper PPE usage are complex, often related to factors such as miners’ education levels, safety values, work attitudes, and the safety conditions of the environment.
Despite the critical importance of PPE, traditional methods [4,5,6] of monitoring compliance, such as manual inspections and visual observations, are inefficient, prone to human error, and incapable of real-time monitoring. To address these limitations, there is a growing need for automated systems that integrate AI algorithms with surveillance equipment to enable the real-time detection of PPE compliance [7,8,9].
Currently, existing detection methods include traditional feature extraction techniques and deep learning-based approaches. Traditional methods, such as edge detection, threshold segmentation, shape analysis, and template matching, manually extract image features for classification [9,10,11]. However, these methods suffer from inefficiency, error susceptibility, and a lack of real-time capabilities, particularly in complex and dynamic environments.
The complexity of underground mining environments, combined with the need for multi-category and multi-scene detection, poses a significant challenge to existing detection algorithms. Current methods typically focus on single PPE types or static environments, limiting their adaptability to the varied and dynamic conditions in coal mines.
To address these challenges, we propose an improved YOLOv8-DCDB algorithm. This model incorporates a Dilated-CBAM to enhance detection performance under varying lighting conditions and feature variations of PPE. Additionally, a multi-branch structured detection head is introduced to better capture multi-scale features, improving the accuracy of small target detection. To overcome the dataset class imbalance caused by long-tail distributions, a K-fold cross-validation strategy is implemented to optimize model performance. Our experiments demonstrate that the proposed method substantially improves PPE detection accuracy, providing a reliable and efficient solution for real-time monitoring in coal mining operations.

2. Related Work

Object detection is a core task in computer vision, aiming to accurately identify and locate the positions and categories of specific objects in images or videos. In recent years, with the rapid development of deep learning technologies, object detection methods based on Convolutional Neural Networks (CNNs) have made significant progress and have been widely applied in various fields, such as pedestrian tracking [12], pose recognition [13], autonomous driving, medical image analysis, and industrial inspection [14].
Before the rise of deep learning, object detection primarily relied on hand-crafted features and traditional machine learning algorithms. Representative works include the Adaboost detector based on Haar features [15] and the SVM classifier based on Histogram of Oriented Gradients (HOG) features [16], which achieved relatively good results in early object detection tasks. However, these methods suffered from complex feature extraction, limited generalization ability, and poor adaptability to complex backgrounds.
With the development of deep learning, object detection methods based on CNNs have gradually become mainstream. These methods can be broadly divided into two categories: two-stage detection methods and one-stage detection methods.
Two-stage detection methods are based on the Region Proposal Network (RPN), which generates candidate regions (Region Proposals) and then classifies and performs bounding box regression on each candidate region. For example, R-CNN [17] and its improved versions, such as Fast R-CNN [18] and Faster R-CNN [19], significantly improved detection efficiency and accuracy by introducing end-to-end training and shared convolutional features. Although these methods excel in accuracy, they have certain limitations in real-time performance.
One-stage detection methods directly predict the positions and categories of objects on the entire image without generating candidate regions. Methods such as the YOLO [20] series and SSD [21] have attracted widespread attention for their fast detection speed and high real-time performance.
The introduction of YOLO in industrial safety applications marks a significant leap in real-time monitoring. YOLO’s ability to process images quickly and accurately makes it suitable for safety-critical scenarios, such as detecting protective equipment compliance in mining or foreign object identification in railway systems. Recent studies have highlighted the effectiveness of YOLO in automated foreign object detection within railway catenary systems, where the model detects potential hazards like debris, preventing accidents [22]. Furthermore, YOLO’s application has extended to safety hazard evaluation in high-speed railway systems, utilizing unmanned aerial vehicle (UAV) images to assess risks from various angles [23].
Coal mines and metal mines, among other mining areas, are characterized by complex environments and numerous hazardous factors [24]. Traditional methods for detecting miners’ protective equipment mainly rely on manual inspections and observations, which are inefficient, prone to human error, and unable to provide real-time monitoring. With the rapid development of computer vision and deep learning technologies, methods for recognizing miners’ protective equipment based on image recognition and automated detection have gradually become a research hotspot. Modern mining management systems have begun to apply automated technologies to monitor miners in real time through cameras or sensors, determining whether they are wearing protective equipment that meets safety requirements. This has enhanced the safety and management efficiency of mining areas. Adjiski et al. [25] developed a system that integrates the Internet of Things (IoT) with personal protective equipment (PPE), utilizing sensors on standard PPE connected to smartphones and smartwatches for real-time monitoring to enhance safety in underground mining.
Nikulin et al. [26] discussed the use of smart PPE equipped with sensors in the Russian coal mining industry to enhance safety. Imam et al. [1] developed a system that verifies miners’ proper use of PPE through pose estimation, collecting datasets under the harsh environmental conditions of the Draa Sfar mine and improving detection efficiency by combining YOLO with RT-DETR. Wang et al. [27] proposed a lightweight backbone network architecture, combining Mobile Inverted Bottleneck Convolution modules and Ghost Bottleneck modules, to improve the detection of miners’ PPE. Du et al. [28] proposed the BLP-YOLOv10 model for efficient safety helmet detection under low-light mining conditions; the model optimizes feature extraction and image processing by adjusting backbone channel parameters, integrating sparse attention mechanisms, and incorporating low-frequency enhancement filters. Tan et al. [29] improved training speed based on EfficientNet, proposing a convolutional neural network with fewer parameters. Owing to the achievements of Transformers in natural language processing, an increasing number of scholars are also applying them to computer vision.

3. Methodology

This paper introduces an algorithm for detecting miners’ safety equipment across multiple scenes and categories. The Dilated-CBAM attention mechanism, which combines channel attention and dilated spatial attention, is applied in the three branches of the neck network. Dilated-CBAM exploits the fact that different types of equipment usually occupy specific positions in the image, achieving precise localization of targets at different scales. Even when the external appearance of protective equipment varies across conditions, it can effectively capture and utilize the key features of the equipment to improve model accuracy. A multi-branch efficient detection head structure addresses YOLOv8’s inability to fully capture multi-scale target features when detecting targets with large size variations in complex backgrounds. K-fold cross-validation mitigates the overfitting to head classes and underfitting to tail classes caused by the long-tail distribution of the dataset [30]. YOLOv8-DCDB thus provides a more efficient and accurate solution for checking whether miners are correctly wearing their equipment in harsh coal mine environments, meeting the mining industry’s requirements for safe production.

3.1. Network Architecture

The network architecture proposed in this study is primarily divided into three parts: the backbone (feature extractor), the neck (feature fusion), and the head (detection layer). This hierarchical structure effectively extracts and fuses features across multiple scales, enhancing the model’s detection accuracy and robustness. The network architecture of YOLOv8-DCDB is illustrated in Figure 1.
The input layer is responsible for receiving RGB images of size 640 × 640 pixels. It normalizes the pixel values to the range [0, 1] or [−1, 1], and during the training phase, applies data augmentation techniques such as random cropping and flipping to enhance the training efficiency, stability, and generalization ability of the model.
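As a minimal sketch of the preprocessing described above (the specific augmentation parameters and library are assumptions, not the authors’ training configuration), the pipeline might be written as follows in PyTorch; note that a real detection pipeline must transform the bounding-box annotations jointly with the image:

```python
import torchvision.transforms as T

# Illustrative preprocessing: resize to 640 x 640, random flip/crop
# augmentation during training, and scaling of pixel values into [0, 1]
# (ToTensor performs this scaling).
train_transform = T.Compose([
    T.Resize((640, 640)),
    T.RandomHorizontalFlip(p=0.5),   # assumed flip probability
    T.RandomCrop(640, padding=16),   # assumed crop padding
    T.ToTensor(),                    # HWC uint8 -> CHW float32 in [0, 1]
])
```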
The backbone network is responsible for extracting low-level features from the input images. In this study, the backbone consists of multiple convolutional layers (Conv) and feature fusion modules (C2f). Convolutional layers extract basic feature information such as edges and textures, while the feature fusion modules enhance the representational capability of these features. At the end of the backbone, a Spatial Pyramid Pooling Fast (SPPF) module [31] is introduced. This module employs multi-scale spatial pyramid pooling operations to effectively capture features at different scales, thereby improving the model’s robustness to variations in object size.
The neck module primarily performs the deep processing of features extracted by the backbone, including feature fusion and multi-scale feature generation. As shown in Figure 1, the neck module in this study comprises multiple upsampling layers (Upsample), feature concatenation operations (Concat), and C2f modules, with the Dilated-CBAM module also integrated. Upsampling enlarges low-resolution feature maps to higher resolutions to facilitate fusion with higher-level feature maps. Feature concatenation merges feature maps from different layers along the channel dimension, enabling multi-scale information integration. The Dilated-CBAM module captures contextual information at multiple scales and enhances the feature attention mechanism.
The head network in this study consists of three main parts: several convolutional layers, multiple Diverse Branch Block Detection (DBBDetect) modules, and the Distribution Focal Loss (DFL) module. The convolutional layers perform further feature extraction and transformation on the fused features. The DBBDetect modules are specifically designed to handle diverse feature characteristics and precise spatial locations of protective equipment under varying conditions. Through the convolutional layers and DBBDetect modules, the network outputs the target locations and categories, with the DFL module optimizing the bounding box regression task in object detection.
The Dilated-CBAM module is incorporated in both the neck and head networks. Dilated-CBAM is an improved convolutional block attention module that combines dilated convolutions with channel attention mechanisms to better capture contextual information and feature correlations, thereby enhancing detection accuracy. The DBBDetect module is a critical component of the network architecture, incorporating specific detection algorithms and parameters for the precise localization and classification of targets such as safety helmets.
The output layer converts the processed features from the head network into final detection results, generating bounding boxes, class labels, and confidence scores for each detected object. Bounding boxes are represented by four coordinate values indicating the object’s position. Class labels are integer indices corresponding to a predefined list of categories. Confidence scores are floating-point numbers between 0 and 1, reflecting the model’s confidence in the detection results.
By leveraging multi-scale feature fusion, the network demonstrates enhanced robustness to challenges such as illumination variations, occlusions, varying equipment sizes, and background noise in complex scenarios, thereby improving its applicability in real-world conditions.

3.2. Multi-Scale Feature Enhancement Network Based on Dilated-CBAM

In the detection network for miners’ protective equipment, the Channel Attention Mechanism (CAM) [32] and the Dilated Spatial Attention Mechanism (DSAM) are two important techniques that enhance the network’s perception capabilities. These mechanisms can automatically highlight useful features and suppress redundant or irrelevant features in the feature maps, thereby improving the accuracy in object detection tasks. This section will provide a detailed introduction to the principles, implementation methods, and applications of these two attention mechanisms in the detection of miners’ protective equipment.

3.2.1. Channel Attention Mechanism (CAM)

In the working scenarios of miners, the protective equipment worn by miners exhibits diverse characteristics under different environmental conditions. For instance, helmets may produce varying reflections or shadows due to changes in lighting, and the color and surface features of goggles may alter because of dust or fog. Channel Attention can automatically select key channels by using global average pooling and maximum pooling operations to assign a weight to each channel. This mechanism helps the network focus on channels closely related to the features of the protective equipment. By enhancing the weights of these channels, CAM enables the network to more accurately identify the miners’ equipment.
Channel Attention can also effectively reduce the interference of redundant information. In the working scenarios of miners, background information (such as mining equipment, tools, or clutter) can interfere with the detection of protective equipment. CAM helps the network avoid such interference by strengthening the features of useful channels while suppressing irrelevant ones, thereby improving the accuracy of protective equipment detection. Channel Attention is shown in Figure 2.
For a given feature map $X \in \mathbb{R}^{C \times H \times W}$, $C$ represents the number of channels (such as the texture and reflectivity feature channels of protective equipment) and $H \times W$ represents the spatial dimensions.
Global Average Pooling (GAP) calculates the spatial mean of each channel, suppressing noise interference and highlighting stable features within the channel. The calculation formula for $F_{\mathrm{avg}} \in \mathbb{R}^{C \times 1 \times 1}$ is as follows:
$$F_{\mathrm{avg}} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_c(i, j) \quad (1)$$
Global Max Pooling (GMP) captures the peak local responses of each channel and enhances high-frequency features. The calculation formula for Global Max Pooling is as follows:
$$F_{\mathrm{max}} = \max_{i, j} X_c(i, j) \quad (2)$$
where $X_c$ denotes the $c$-th channel of the input feature map and $F_{\mathrm{max}} \in \mathbb{R}^{C \times 1 \times 1}$ is the feature map after max pooling.
The results of both pooling operations are fed into a shared-weight Multilayer Perceptron (MLP) to learn channel importance through nonlinear transformations, balancing global and local information. The Channel Attention weights are computed as follows:
$$W_C = \sigma\big(\mathrm{MLP}(F_{\mathrm{max}}) + \mathrm{MLP}(F_{\mathrm{avg}})\big) \in \mathbb{R}^{C \times 1 \times 1} \quad (3)$$
where MLP denotes the shared fully connected layers and $\sigma$ denotes the sigmoid activation function.
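As a concrete illustration, a minimal PyTorch sketch of Equations (1)–(3) follows; the CBAM-style shared MLP with a reduction ratio of 16 is an assumption, since the paper does not state the ratio:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention per Eqs. (1)-(3): GAP and GMP feed a shared MLP,
    and the summed outputs pass through a sigmoid to give per-channel weights."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Shared-weight MLP implemented with 1x1 convolutions.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))  # F_avg branch
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))   # F_max branch
        w_c = torch.sigmoid(avg + mx)                            # C x 1 x 1 weights
        return x * w_c
```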

3.2.2. Dilated Spatial Attention

The spatial attention mechanism assigns different weights to different spatial locations, guiding the network to focus on key areas within an image. In the task of detecting miners’ protective equipment, various types of equipment typically occupy specific positions in the image. For instance, helmets are often located at the top of the image, goggles are centered on the face, and shoes appear at the bottom. The spatial attention mechanism can apply weighted processing to these key areas, enabling the network to accurately focus on the locations of the equipment even in the presence of complex backgrounds and significant noise. Even when the equipment is partially occluded, the spatial attention mechanism can help the network effectively identify the regions of the equipment, thereby enhancing the accuracy and reliability of detection.
The Spatial Attention Mechanism (SAM) uses single-scale convolution to capture local spatial features and weights the feature maps. This method can effectively enhance important regions in the feature maps and suppress less important ones with relatively low computational complexity. However, the receptive field of single-scale convolution is limited, making it difficult to adapt to the coexistence of large, medium, and small targets in the minersPPE dataset. This limitation can lead to performance degradation when dealing with objects of varying sizes and distributions.
The Dilated Spatial Attention Mechanism (DSAM) proposed in this paper introduces multi-scale dilated convolution to fuse features with different receptive fields and capture multi-scale spatial context information. The spatial weights are generated by summing the results of multiple convolution operations. DSAM addresses issues such as the limited receptive field, poor adaptability to targets of different sizes, limited feature extraction capabilities in complex scenes, poor robustness to class imbalance, and high computational complexity in spatial attention mechanisms. These improvements enable multi-scale spatial attention to perform better in handling complex scenes and multi-scale targets. Dilated Spatial Attention is illustrated in Figure 3.
DSAM performs channel-wise compression on the input feature map $X \in \mathbb{R}^{C \times H \times W}$, generating two different feature maps: one obtained through average pooling and the other through max pooling across the channel dimension. These two feature maps are then concatenated for the subsequent multi-scale dilated convolution operations.
Average pooling across the channel dimension is applied to the input feature map $X$, resulting in a feature map $F_{\mathrm{avg}}$ of size $\mathbb{R}^{1 \times H \times W}$:
$$F_{\mathrm{avg}} = \mathrm{AvgPool}(X) \in \mathbb{R}^{1 \times H \times W} \quad (4)$$
Max pooling across the channel dimension is applied to the input feature map $X$, resulting in a feature map $F_{\mathrm{max}}$ of the same size:
$$F_{\mathrm{max}} = \mathrm{MaxPool}(X) \in \mathbb{R}^{1 \times H \times W} \quad (5)$$
The feature maps $F_{\mathrm{avg}}$ and $F_{\mathrm{max}}$ are concatenated along the channel dimension to produce a new feature map $F_{\mathrm{cat}}$ of size $\mathbb{R}^{2 \times H \times W}$:
$$F_{\mathrm{cat}} = \mathrm{Concat}(F_{\mathrm{avg}}, F_{\mathrm{max}}) \in \mathbb{R}^{2 \times H \times W} \quad (6)$$
Subsequently, a convolutional group with multiple dilation rates is employed for spatial attention modeling. Each convolutional layer shares the same input and output channels (2→1) and is configured with an independent dilation rate r (a worked receptive-field computation follows this list):
r = 1: local detail perception (equivalent receptive field of 3 × 3), preserving fine-grained features at the original resolution and remaining sensitive to the edges of small targets.
r = 3: mid-level context capture (equivalent receptive field of 7 × 7), expanding the receptive field to capture the overall structure of medium-sized targets.
r = 5: global semantic perception (equivalent receptive field of 11 × 11), covering an 11 × 11 region to obtain the distribution context of large targets.
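For a 3 × 3 kernel with dilation rate r, the equivalent receptive field follows the standard dilated-convolution relation, which reproduces the three values listed above:

```latex
k_{\mathrm{eff}} = (k - 1)\,r + 1, \quad k = 3 \;\Rightarrow\;
\begin{cases}
r = 1: & k_{\mathrm{eff}} = 2 \cdot 1 + 1 = 3 \\
r = 3: & k_{\mathrm{eff}} = 2 \cdot 3 + 1 = 7 \\
r = 5: & k_{\mathrm{eff}} = 2 \cdot 5 + 1 = 11
\end{cases}
```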
Then, feature fusion of the multi-scale attention maps is achieved by element-wise summation, as given in Equation (7):
$$A_{\mathrm{fusion}} = \sum_{i=1}^{N} A_{r_i} \in \mathbb{R}^{1 \times H \times W} \quad (7)$$
The spatial attention weights are ultimately obtained through sigmoid activation, as shown in Equation (8):
$$W_S = \sigma(A_{\mathrm{fusion}}) \in \mathbb{R}^{1 \times H \times W} \quad (8)$$
Multiplying the channel weights $W_C$ by the spatial weights $W_S$ yields the output features of the DSAM.
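A minimal PyTorch sketch of DSAM under these definitions follows, with N = 3 branches at dilation rates 1, 3, and 5; layer details beyond what the text specifies are assumptions:

```python
import torch
import torch.nn as nn

class DilatedSpatialAttention(nn.Module):
    """DSAM sketch per Eqs. (4)-(8): channel-wise avg/max pooling, parallel
    3x3 convolutions (2 -> 1 channels) at dilation rates 1/3/5, element-wise
    summation, and sigmoid weighting of the input feature map."""
    def __init__(self, rates=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList([
            # padding = r keeps the H x W resolution for a 3x3 kernel.
            nn.Conv2d(2, 1, kernel_size=3, padding=r, dilation=r, bias=False)
            for r in rates
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f_avg = torch.mean(x, dim=1, keepdim=True)    # 1 x H x W  (Eq. 4)
        f_max, _ = torch.max(x, dim=1, keepdim=True)  # 1 x H x W  (Eq. 5)
        f_cat = torch.cat([f_avg, f_max], dim=1)      # 2 x H x W  (Eq. 6)
        a_fusion = sum(b(f_cat) for b in self.branches)  # Eq. 7
        w_s = torch.sigmoid(a_fusion)                    # Eq. 8
        return x * w_s
```

A full Dilated-CBAM block would apply the channel attention of Section 3.2.1 first and then this spatial weighting, consistent with the multiplication of $W_C$ and $W_S$ described above.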
Through DSAM, the model can dynamically adapt to the lighting changes in the mining environment using the dual-pooling channel attention mechanism, precisely enhancing the features of reflective materials. By combining multi-scale dilated convolutions to capture multi-level spatial contexts ranging from local details (3 × 3) to global semantics (11 × 11), it achieves the accurate localization of targets of different scales, such as helmets and goggles. Additionally, by inferring the contour distribution of occluded protective equipment through branches with large receptive fields, the detection robustness is improved in scenarios with dust interference, complex backgrounds, and partial occlusions, resulting in an overall recognition accuracy improvement of 4.3% for miners’ safety equipment.

3.3. Efficient Detection Head Based on Multi-Branch Structure

In object detection tasks, especially in complex scenarios, the design of the detection head plays a decisive role in the model’s performance and inference efficiency. The detection head of YOLOv8 employs a conventional convolutional structure, which performs well in terms of computational efficiency. However, it falls short in feature representation when dealing with complex scenes, particularly when there are significant variations in target sizes, complex backgrounds, or class imbalances. It struggles to fully capture multi-scale target features under these conditions.
To address inconsistent target sizes and class imbalance in mining scenarios, we propose an efficient detection head called Diverse Branch Block Detect (DBBDetect), based on the Diverse Branch Block (DBB) [33]. The structure of DBBDetect is shown in Figure 4. Compared with the traditional YOLOv8 detection head, DBBDetect introduces a multi-branch DBB module to enhance detection performance in complex scenes. After the output of the DBB module, multiple Conv2d layers further process the merged features.
By effectively fusing multi-scale features from different branches, more abstract high-level features are extracted to optimize bounding box regression and class prediction. Meanwhile, the convolutional layers capture the spatial information of objects, helping the model better understand their positions and shapes, thereby improving detection accuracy. The DFL (Distribution Focal Loss) optimizes the regression loss through a focal mechanism, which performs exceptionally well when dealing with small objects and class imbalance problems.

3.3.1. Diverse Branch Block Module

To address the sensitivity to scale variations in the task of detecting miners’ protective equipment, this study employs the Diverse Branch Block (DBB) to introduce a multi-branch structure during the training phase. This structure enriches the feature space by including branches such as average pooling and multi-scale convolutions. These branches extract information from the input feature map at different angles and scales. This enables the network to more comprehensively capture the details and diversity of features, thereby enhancing the model’s ability to detect complex targets.
During the inference phase, these multi-branch structures are equivalently transformed into a single standard convolutional layer. This transformation maintains the computational load and inference time while improving the model’s accuracy. The structure of DBB is shown in Figure 5, and the module comprises four parallel branches.
During the training phase, DBB extracts features through multiple branch structures. Each branch can be regarded as a convolutional operation with a kernel $K$ and a bias $B$. When the outputs of multiple branches are merged, the combined convolutional kernel and bias are obtained through summation, with the merged kernel given by Equation (9) and the merged bias by Equation (10):
$$K_{\mathrm{merged}} = K_{\mathrm{origin}} + K_{1 \times 1} + K_{1 \times 1\_k \times k} + K_{\mathrm{avg}} \quad (9)$$
$$B_{\mathrm{merged}} = B_{\mathrm{origin}} + B_{1 \times 1} + B_{1 \times 1\_k \times k} + B_{\mathrm{avg}} \quad (10)$$
$K_{\mathrm{merged}}$ represents the merged convolutional kernel obtained by combining the original kernel $K_{\mathrm{origin}}$ with the kernels of the additional branches. $B_{\mathrm{merged}}$ denotes the combined bias term, obtained by merging the original bias $B_{\mathrm{origin}}$ with the branch biases in the same manner.
For the branch containing a 1 × 1 convolution followed by a k × k convolution (i.e., $dbb\_1{\times}1\_k{\times}k$), the fusion operation is based on the combination of the two convolutional layers and can be expressed as:
$$K_{1 \times 1\_k \times k} = \mathrm{conv}_1(K_{1 \times 1}) + \mathrm{conv}_2(K_{3 \times 3}) \quad (11)$$
For the average pooling branch, the convolutional kernel is generated according to the average pooling size and the output channels of the convolutional layer:
$$K_{\mathrm{avg}} = \mathrm{transV\_avg}(C_{\mathrm{out}}, K_{\mathrm{size}}, G) \quad (12)$$
where $C_{\mathrm{out}}$ is the number of output channels, $K_{\mathrm{size}}$ is the size of the convolutional kernel, and $G$ is the number of groups in the grouped convolution.
The DBB module enriches the feature space through its multi-branch design: during training, it extracts different types of features through multiple branches; during inference, it compresses the computation into an equivalent single convolution, thereby enhancing model performance without increasing inference time.
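A minimal sketch of the parallel-branch merge of Equations (9) and (10) follows, assuming branches whose outputs are summed; the sequential 1 × 1–k × k transform and the batch-norm fusion of the full DBB are omitted for brevity:

```python
import torch
import torch.nn.functional as F

def merge_parallel_branches(k_origin, b_origin, k_1x1, b_1x1, k_avg, b_avg):
    """Eqs. (9)-(10) sketch: parallel conv branches whose outputs are added
    re-parameterize into one conv by summing kernels and biases. The 1x1
    kernel is zero-padded to the k x k spatial size before summation."""
    k = k_origin.shape[-1]
    pad = (k - 1) // 2
    k_1x1_padded = F.pad(k_1x1, [pad, pad, pad, pad])  # 1x1 -> k x k, centered
    k_merged = k_origin + k_1x1_padded + k_avg
    b_merged = b_origin + b_1x1 + b_avg
    return k_merged, b_merged

def avg_pool_as_conv(channels: int, k: int) -> torch.Tensor:
    """A k x k average pooling expressed as an equivalent conv kernel:
    each output channel averages its own input channel over the window."""
    kernel = torch.zeros(channels, channels, k, k)
    for c in range(channels):
        kernel[c, c] = 1.0 / (k * k)
    return kernel
```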

3.3.2. Distribution Focal Loss (DFL)

DFL is applied at the final stage of the network, receiving and refining the output of the preceding convolutional layer. Its primary objective is to refine the position regression of the bounding boxes, particularly their four coordinates, by calculating the location distribution together with a focal loss. The DFL loss function is given by Equation (13):
$$L_{\mathrm{DFL}} = \sum_{i} \alpha_i \cdot (1 - p_i)^{\gamma} \cdot L_{\mathrm{loc}}(p_i) \quad (13)$$
where:
$\alpha_i$ is the weight assigned to each bounding box, controlling the importance of different boxes;
$p_i$ is the predicted probability of the box location, representing the distance between the predicted value and the ground truth;
$\gamma$ is the focal factor, typically greater than 1, which adjusts the loss contribution of samples of varying difficulty;
$L_{\mathrm{loc}}(p_i)$ is the location regression loss, measuring the error between the predicted bounding box and the ground-truth box.
Through its focal mechanism, DFL enables the model to focus more on small objects and on bounding boxes that are difficult to predict. This mechanism helps address location imbalance in the regression process, thereby enhancing the accuracy of the regression task.
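A minimal sketch of Equation (13) as stated follows; note that DFL as implemented in common YOLOv8 code bases operates on a discretized distribution of box offsets, so this sketch follows the paper’s formulation rather than any particular library API:

```python
import torch

def focal_regression_loss(p: torch.Tensor,
                          loc_loss: torch.Tensor,
                          alpha: torch.Tensor,
                          gamma: float = 2.0) -> torch.Tensor:
    """Eq. (13): sum_i alpha_i * (1 - p_i)^gamma * L_loc(p_i).

    p        -- predicted location probabilities, one per box
    loc_loss -- per-box localization losses L_loc(p_i), precomputed
    alpha    -- per-box importance weights alpha_i
    gamma    -- focal factor (assumed default); larger values down-weight
                easy boxes more strongly
    """
    return (alpha * (1.0 - p).pow(gamma) * loc_loss).sum()
```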
DBBDetect enhances the model’s detection performance in complex scenes by incorporating the DBB module and the DFL loss function while maintaining efficient inference speed. This design provides an effective solution for addressing issues such as variations in target sizes, complex backgrounds, and class imbalance.

3.4. Miner’s Protective Equipment Detection Based on K-Fold Cross-Validation

K-fold cross-validation (K-fold CV) is a commonly used model evaluation method. It divides the dataset into k subsets, in turn using k − 1 subsets as the training set and the remaining subset as the validation set. This process is repeated k times, and the average result is taken to assess the model’s performance. This method provides a more accurate evaluation while alleviating the bias caused by imbalanced data.
The minersPPE dataset proposed in this paper exhibits a scarcity of PPE images specific to mining scenarios, leading to significant class imbalance and a pronounced long-tail distribution. The “person” class contains the most instances (14,506), followed by the “head” class with 11,985 and the “ear” class with 7730 instances. Other categories such as “ear-muffs,” “face,” “face-guard,” “face-mask,” “foot,” “glasses,” “gloves,” “helmet,” “hands,” “shoes,” and “safety-suit” have comparatively fewer samples, with “face-guard” and “safety-suit” having only 134 and 741 instances, respectively. This typical long-tail pattern, where a few dominant classes constitute the majority of samples while most classes have limited examples, may cause the model to overfit the dominant (head) classes and underfit the less represented (tail) classes during training. The long-tail distribution of the minersPPE dataset is illustrated in Figure 6.
K-fold cross-validation employs stratified sampling, ensuring that the proportion of each class in each fold remains consistent with the original dataset. This approach avoids the class imbalance issues that can arise from random partitioning. Through experimental comparisons of different k values (such as 3, 5, and 10) and their impact on model performance, the optimal k value is ultimately selected, which is k = 5 in this case. Figure 7 illustrates the principle of 5-Fold Cross-Validation.
The 5-Fold Cross-Validation procedure conducted in this paper is as follows:
Step 1: Divide the training dataset, minersPPE, into 5 equally sized subsets.
Step 2: Loop from i = 1 to i = 5.
Step 3: Use 4 subsets as the training set and the remaining 1 subset as the test set.
Step 4: Train the model YOLOv8-DCDB using cross-validation and calculate the accuracy.
Step 5: Evaluate the accuracy using the results from the 5 iterations of cross-validation.
The algorithm for 5-Fold Cross-Validation is shown in Algorithm 1.
Algorithm 1. 5-Fold Cross-Validation Approach
read(k)               (k is a vector of candidate fold counts)
read(D_train)         (the training dataset minersPPE)
read classifiers(cl)  (cl denotes the list of chosen classifiers)
for cl_i in cl        (loop through all classifiers)
    for k_i in k      (traverse all values of k)
        divide(D_train) into k_i folds
        train cl_i using CV and calculate the accuracy
        calculate the performance of cl_i over all k_i folds of CV
    end for
end for
result(cl_i)
By employing stratified sampling to keep the class distribution of each subset consistent with the original data, 5-Fold Cross-Validation reduces the impact of class skewness on model training. This approach effectively addresses the class imbalance issue in the dataset, ensuring a stable performance evaluation of the model across different subsets.
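A minimal sketch of this procedure with scikit-learn’s StratifiedKFold follows; the per-image stratification label and the training/evaluation helpers are assumptions for illustration, not the authors’ implementation:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Stratifying detection images requires one label per image; using the
# rarest class present in each image is a common heuristic and an
# assumption here, as are the hypothetical helpers below.
image_ids = np.arange(8585)                        # one entry per minersPPE image
strata = assign_rarest_class_per_image(image_ids)  # hypothetical helper

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for fold, (train_idx, val_idx) in enumerate(skf.split(image_ids, strata)):
    model = train_yolov8_dcdb(image_ids[train_idx])     # hypothetical trainer
    scores.append(evaluate(model, image_ids[val_idx]))  # hypothetical evaluator
    print(f"fold {fold + 1}: accuracy = {scores[-1]:.3f}")

print("mean accuracy over 5 folds:", np.mean(scores))
```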

4. Results

4.1. Experiment Introduction

This section first introduces the datasets used in the experimental methods, then describes the experimental environment and training strategies. Finally, the evaluation metrics related to the experimental results are presented.

4.1.1. Dataset

This paper presents the minersPPE dataset, which focuses on coal mining scenarios and consists of 8585 annotated images of miners. It is organized across 14 object categories related to safety equipment, including worker, ear, earmuff, face, mask, protective face shield, goggles, foot, safety shoe, head, safety helmet, hand, glove, and protective clothing, with a total of 72,946 annotated instances. Designed to monitor safety in coal mines, this dataset enables the detection of whether miners are properly equipped with protective gear. It captures diverse mining environments and working conditions, ensuring the comprehensive coverage of safety-related objects in terms of their types, quantities, and spatial distributions. Figure 8 provides a detailed overview of each category and its corresponding instance count.

4.1.2. Experimental Environment

The software and hardware environments used in this study are summarized in Table 1. The hardware setup includes an NVIDIA GeForce RTX 3080 GPU and 16 GB of RAM. The software environment is based on Ubuntu 22.04 LTS.
The NVIDIA GeForce RTX 3080 GPU is manufactured by NVIDIA Corporation and the Intel Core i9-13900KF high-performance processor by Intel Corporation, both headquartered in Santa Clara, CA, USA. Ubuntu 22.04 LTS is developed by Canonical Ltd., headquartered in London, UK.

4.1.3. Evaluation Metrics

To comprehensively and accurately characterize the performance of the proposed YOLOv8-DCDB model, several evaluation metrics are introduced, including Precision, mAP@0.5, and FPS. Precision reflects the proportion of correctly predicted positive samples among all predicted positive samples, as shown in Equation (14):
$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (14)$$
where TP denotes the number of true positive samples correctly predicted by the model, and FP denotes the number of false positive samples incorrectly predicted as positive.
Recall refers to the proportion of actual positive samples that are correctly predicted as positive by the model. The calculation is shown in Equation (15):
$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (15)$$
where FN (False Negative) represents the number of samples incorrectly predicted as the negative class by the model.
The mean Average Precision (mAP) is the weighted average of the Average Precision (AP) values across all classes, used to evaluate the overall performance of the model in multi-class scenarios. Its calculation formulas are given in Equations (16)–(18):
$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (16)$$
$$\mathrm{AP} = \int_{0}^{1} \mathrm{Precision}(\mathrm{Recall}) \, d\,\mathrm{Recall} \quad (17)$$
$$\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{AP}_i \quad (18)$$
Here, FN represents the number of false negative samples incorrectly predicted as negative. AP corresponds to the area under the Precision–Recall curve, $\mathrm{AP}_i$ denotes the Average Precision of the $i$-th class, and $N$ is the total number of classes in the training dataset. In addition, mAP@0.5 refers to the mean Average Precision calculated at an Intersection over Union (IoU) threshold of 0.5, while mAP@0.5:0.95 extends mAP over the broader IoU range from 0.5 to 0.95. IoU measures the overlap between the predicted bounding box and the ground-truth bounding box in object detection [34].
FPS, namely frames per second, is used to measure the speed and real-time performance of a model when processing video streams or sequences of images. The higher the FPS, the more image frames the model can process within a unit of time.
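For illustration, Equations (14)–(18) translate into a few lines of NumPy; the trapezoidal rule below is one common approximation of the integral in Equation (17), not necessarily the interpolation scheme used in the authors’ evaluation:

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    """Eqs. (14)-(16): precision and recall from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

def average_precision(precisions, recalls) -> float:
    """Eq. (17): AP as the area under the precision-recall curve,
    approximated by trapezoidal integration over sorted recall values."""
    order = np.argsort(recalls)
    return float(np.trapz(np.asarray(precisions)[order],
                          np.asarray(recalls)[order]))

def mean_average_precision(per_class_ap) -> float:
    """Eq. (18): mAP as the mean of the per-class AP values."""
    return float(np.mean(per_class_ap))
```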

4.2. Experimental Results

This study conducts an in-depth validation on the proposed minersPPE miners’ protective equipment dataset.

4.2.1. Ablation Experiments

To verify the effectiveness of the different modules, this section conducts ablation experiments on the Dilated-CBAM module, the multi-branch detection head, and K-fold cross-validation. The baseline model is YOLOv8, evaluated on the validation set of the minersPPE dataset. The ablation results are presented in Table 2.
As shown in Table 2, introducing the Dilated-CBAM module, the DBBDetect module, and K-fold cross-validation—both independently and in combination—improves the model’s performance. Specifically, the Dilated-CBAM module alone increases precision from 73.6% to 77.9% and mAP@0.5 from 0.661 to 0.668, demonstrating its effectiveness in enhancing feature extraction and contextual awareness in complex scenes. The DBBDetect module alone improves precision by 3 percentage points and mAP@0.5 by 1.8 points, showing that its multi-branch structure strengthens the model’s ability to localize and classify targets, particularly small or occluded objects. When Dilated-CBAM and DBBDetect are combined, the model achieves a precision of 79.8% and mAP@0.5 of 0.675, leveraging the strengths of both modules. Adding K-fold cross-validation yields the largest gain: with all three components—Dilated-CBAM, DBBDetect, and K-fold cross-validation—the model achieves its best performance, with a precision of 92.5%, mAP@0.5 of 0.836, and mAP@0.5:0.95 of 0.642, highlighting both the importance of cross-validation for generalization and the synergistic effect of the components in enhancing feature extraction and detection accuracy.
Comparing the validation-set results in Table 2 with the training-set results in Table 3, the model’s performance is similar across both datasets, with no significant discrepancy. This indicates that the model has not overfitted during training and can generalize the learned features effectively to new, unseen data.

4.2.2. Comparison Experiments

To verify the effectiveness of the proposed method, comparative experiments were conducted using several mainstream object detection models, including RetinaNet [35], YOLOv10 [36], Faster R-CNN, YOLOv8, and YOLOv12 [37], as well as the improved YOLOv8-DCDB model. The evaluation metric used was mean Average Precision (mAP), expressed as a percentage, which measures the average detection accuracy across different object categories.
As shown in Table 4, the YOLOv8-DCDB model achieves an mAP of 83.6%, significantly outperforming the other models. YOLOv10 and YOLOv8 obtained mAP scores of 67.7% and 66.1%, respectively, with comparable performance but inferior to YOLOv8-DCDB. Faster R-CNN and RetinaNet achieved relatively lower mAPs of 61.6% and 59.2%, respectively.
A detailed analysis of the performance differences reveals that YOLOv10 and YOLOv8, as newer versions in the YOLO series, offer fast detection speeds and high accuracy but still exhibit certain limitations in handling complex scenes and small objects. Faster R-CNN, as a classical object detection model, utilizes a Region Proposal Network (RPN) to generate candidate regions; although it delivers relatively high accuracy, it is slower and prone to missed and false detections in complex scenarios. RetinaNet employs Focal Loss to address class imbalance, providing advantages in dense object detection, but its overall accuracy requires further improvement.
In contrast, the YOLOv8-DCDB model, with the integration of the DCDB module, effectively enhances feature extraction and fusion capabilities while maintaining the fast detection advantages of the YOLO series. Consequently, it achieves a marked improvement in mAP. This indicates that the DCDB module better captures detailed target features and contextual information, improving detection accuracy for objects of varying scales and shapes. Notably, it excels in complex scenes and small object detection, validating the effectiveness and superiority of the proposed method. The comparative experiment results are shown in Table 4.

4.2.3. Qualitative Analysis

To validate the effectiveness of the improved model, this section analyzes the performance differences before and after model enhancement through the visualization of detection results and typical case comparisons, focusing on practical scenarios and comparing models used in the ablation and comparative experiments.
Ablation Study Qualitative Analysis
1. Visualization of Dilated-CBAM Integration Effects
In dark mining environments, the original YOLOv8 model does not perform well, producing false positives and missed detections (as shown in Figure 9a and Figure 10a). After integrating the Dilated-CBAM module, the detection accuracy of YOLOv8 improves and the number of false positives is reduced (as shown in Figure 9b and Figure 10b).
2. Performance of the Efficient Multi-Branch Detection Head Structure
In miner protective equipment detection, the large size variation of the equipment often causes models to miss small objects. YOLOv8 lacks sufficient feature-matching capability for extreme-scale targets, limiting its feature representation, as shown in Figure 11. YOLOv8-DCDB integrates the multi-branch DBB structure into its detection head, enhancing the detection accuracy of small targets, as shown in Figure 12.
3. Overall Performance of YOLOv8-DCDB
The improved model demonstrates excellent performance in complex multi-scene, multi-object, and multi-scale detection tasks, effectively addressing the challenges in miner protective equipment detection. The detection results are illustrated in Figure 13.
Comparative Analysis with Other Models
In this section, we compare our YOLOv8-DCDB model’s detection results with those of several other common models, including YOLOv8, Faster R-CNN, RetinaNet, YOLOv10, and YOLOv12. The detection outcomes are depicted in Figure 14. Our proposed YOLOv8-DCDB model outperforms the others in detecting both large and small objects, especially excelling at identifying small objects such as hands and gloves. In complex scenarios, it achieves significantly higher confidence levels than YOLOv8, Faster R-CNN, RetinaNet, YOLOv10, and YOLOv12. Overall, YOLOv8-DCDB demonstrates greater precision in multi-object detection tasks under complex environmental conditions.

5. Discussion and Contributions

5.1. Summary of Research Contributions

This study proposes an improved YOLOv8-DCDB algorithm to address the complex requirements of personal protective equipment (PPE) detection for coal miners. By integrating the Dilated-CBAM module, a multi-branch detection head structure, and a K-fold cross-validation strategy, the algorithm demonstrates outstanding performance in multi-scene, multi-class PPE detection tasks, significantly enhancing detection accuracy and robustness. Extensive experiments on the minersPPE dataset show that YOLOv8-DCDB outperforms other common methods in key metrics, achieving a Precision of 0.925, mAP@0.5 of 0.836, and mAP@0.5:0.95 of 0.642. In terms of efficiency, YOLOv8-DCDB achieves a frame rate of 102.3 FPS with a parameter count of 31.9 million; by maintaining a relatively low number of parameters while processing over a hundred frames per second, it is well suited for real-time deployment on resource-limited devices. Moreover, YOLOv8-DCDB not only performs well on standard test sets but also demonstrates good generalization in coal mine scenarios. In summary, YOLOv8-DCDB provides a new and efficient method for identifying whether miners are wearing protective gear in coal mine production environments, supporting miner safety under complex and variable working conditions and promoting the development of automated safety monitoring systems.

5.2. Future Work Outlook

While this study has demonstrated the effectiveness of YOLOv8-DCDB in PPE detection for coal miners, several challenges remain in ensuring its practical application in real-world scenarios. Future work will focus on two key aspects to further enhance the model’s performance and applicability:
Impact of Environmental Factors and Data Variability. The performance of YOLOv8-DCDB can be influenced by factors such as lighting variations, environmental clutter, and the presence of occlusions. Extreme lighting conditions, such as dimly lit or overly bright environments, as well as partial occlusions of protective gear, may reduce detection accuracy. Future research will explore advanced data augmentation techniques to simulate these challenging conditions, helping to make the model more resilient to a wide range of environments. Additionally, expanding the dataset to include more diverse scenarios, such as different weather conditions, varying distances, and occlusions, will improve the model’s robustness. A particular focus will be placed on annotating data with more detailed annotations, such as partial gear visibility and the various angles of observation, to support the model in recognizing PPE in diverse real-world environments.
Real-time Deployment and Computational Optimization. The practical deployment of YOLOv8-DCDB in autonomous monitoring systems for mining environments requires optimizing its computational efficiency. Currently, the model’s computational load and inference time could pose challenges for real-time applications, especially in resource-constrained environments. To address these challenges, future work will focus on optimizing the model’s architecture, exploring lightweight alternatives, and utilizing hardware acceleration options such as GPUs and FPGAs to enhance real-time performance. Furthermore, the model will be optimized to run efficiently on embedded systems, ensuring that it remains suitable for deployment on devices with limited processing power. Special attention will be given to reducing inference time while maintaining detection accuracy to meet the demands of real-time safety monitoring in mining operations.
In summary, while YOLOv8-DCDB demonstrates strong performance in PPE detection, further research is needed to improve its robustness in complex environmental conditions and optimize its computational efficiency for real-time applications. Future work will focus on improving the model’s ability to handle challenging lighting conditions, occlusions, and data variability, as well as optimizing its architecture for hardware deployment. Through these efforts, we aim to enhance the model’s practicality and reliability for use in real-world mining environments, contributing to the advancement of autonomous safety monitoring systems.

Author Contributions

Conceptualization, J.Y. and H.X.; methodology, J.Y. and H.X.; validation, H.X.; writing—original draft preparation, J.Y. and H.X.; writing—review and editing, H.X., X.Z., S.S. and J.C.; supervision, J.C. All authors have read and agreed to the published version of the manuscript.

Funding

This study is supported by the National Special Project of Science and Technology Basic Resources Survey (grant No. 2022FY101400) and the National Natural Science Foundation of China Innovation Group Project (grant No. 52121003).

Data Availability Statement

Restrictions apply to the datasets.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Imam, M.; Baïna, K.; Tabii, Y.; Mostafa Ressami, E.; Adlaoui, Y.; Benzakour, I.; Bourzeix, F.; Abdelwahed, H. Ensuring Miners’ Safety in Underground Mines Through Edge Computing: Real-Time PPE Compliance Analysis Based on Pose Estimation. IEEE Access 2024, 12, 145721–145739.
  2. Tian, S.; Wang, Y.; Ma, T.; Mao, J.; Ma, L. Analysis of the causes and safety countermeasures of coal mine accidents: A case study of coal mine accidents in China from 2018 to 2022. Process Saf. Environ. Prot. 2024, 187, 864–875.
  3. Kursunoglu, N.; Onder, S.; Onder, M. The evaluation of personal protective equipment usage habit of mining employees using structural equation modeling. Saf. Health Work 2022, 13, 180–186.
  4. Ayoo, B.A.; Moronge, J. Factors influencing compliance with occupational safety regulations and requirements among artisanal and small-scale miners in Central Sakwa Ward, Siaya County. J. Sustain. Environ. Peace 2019, 1, 1–5.
  5. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; pp. 886–893.
  6. Lienhart, R.; Maydt, J. An extended set of Haar-like features for rapid object detection. In Proceedings of the International Conference on Image Processing, Rochester, NY, USA, 22–25 September 2002; pp. 900–903.
  7. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
  8. Jaafar, M.H.; Arifin, K.; Aiyub, K.; Razman, M.R.; Ishak, M.I.S.; Samsurijan, M.S. Occupational safety and health management in the construction industry: A review. Int. J. Occup. Saf. Ergon. 2018, 24, 493–506.
  9. Chen, T.; Liu, J.; Li, H.; Wang, H. Study of influential factors and paths of the mine manager’s safety awareness. Min. Saf. Environ. Prot. 2022, 49, 109–113.
  10. Zhang, Z.; Yang, J.; Ding, L.; Zhao, Y. Estimation of coal particle size distribution by image segmentation. Int. J. Min. Sci. Technol. 2012, 22, 739–744.
  11. Paplinski, A.P. Directional filtering in edge detection. IEEE Trans. Image Process. 1998, 7, 611–615.
  12. Elaziz, M.A.; Ewees, A.A.; Oliva, D. Hyper-heuristic method for multilevel thresholding image segmentation. Expert Syst. Appl. 2020, 146, 113201.
  13. Shen, Y.; Xie, X.; Wu, J.; Chen, L.; Huang, F. EAFF-Net: Efficient attention feature fusion network for dual-modality pedestrian detection. Infrared Phys. Technol. 2025, 145, 105696.
  14. Yang, J.; Sun, S.; Chen, J.; Xie, H.; Wang, Y.; Yang, Z. 3D-STARNET: Spatial–Temporal Attention Residual Network for Robust Action Recognition. Appl. Sci. 2024, 14, 7154.
  15. Mohammadpour, L.; Ling, T.C.; Liew, C.S.; Aryanfar, A. A Survey of CNN-Based Network Intrusion Detection. Appl. Sci. 2022, 12, 8162.
  16. Freund, Y.; Schapire, R.E. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. J. Comput. Syst. Sci. 1997, 55, 119–139.
  17. Valkenborg, D.; Rousseau, A.-J.; Geubbelmans, M.; Burzykowski, T. Support vector machines. Am. J. Orthod. Dentofac. Orthop. 2023, 164, 754–757.
  18. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
  19. Li, J.; Liang, X.; Shen, S.; Xu, T.; Feng, J.; Yan, S. Scale-Aware Fast R-CNN for Pedestrian Detection. IEEE Trans. Multimed. 2018, 20, 985–996.
  20. Ding, X.; Li, Q.; Cheng, Y.; Wang, J.; Bian, W.; Jie, B. Local keypoint-based Faster R-CNN. Appl. Intell. 2020, 50, 3007–3022.
  21. Wang, J.; Wu, Q.M.J.; Zhang, N. You Only Look at Once for Real-Time and Generic Multi-Task. IEEE Trans. Veh. Technol. 2024, 73, 12625–12637.
  22. Wang, Y.; Niu, P.; Guo, X.; Yang, G.; Chen, J. Single Shot Multibox Detector With Deconvolutional Region Magnification Procedure. IEEE Access 2021, 9, 47767–47776.
  23. Chen, Z.; Yang, J.; Li, F.; Feng, Z.; Chen, L.; Jia, L.; Li, P. Foreign Object Detection Method for Railway Catenary Based on a Scarce Image Generation Model and Lightweight Perception Architecture. IEEE Trans. Circuits Syst. Video Technol. 2025.
  24. Wu, Y.; Zhao, Z.; Chen, P.; Guo, F.; Qin, Y.; Long, S.; Ai, L. Hybrid learning architecture for high-speed railroad scene parsing and potential safety hazard evaluation of UAV images. Measurement 2025, 239, 115504.
  25. Din, I.U.; Muhammad, S.; Faisal, S.; Rehman, I.u.; Ali, W. Heavy metal(loid)s contamination and ecotoxicological hazards in coal, dust, and soil adjacent to coal mining operations, Northwest Pakistan. J. Geochem. Explor. 2024, 256, 107332.
  26. Adjiski, V.; Despodov, Z.; Mirakovski, D.; Serafimovski, D. System architecture to bring smart personal protective equipment wearables and sensors to transform safety at work in the underground mining industry. Rud. Geološko-Naft. Zb. 2019, 34, 37–44.
  27. Nikulin, A.; Ikonnikov, D.; Dolzhikov, I. Smart personal protective equipment in the coal mining industry. Int. J. Civ. Eng. Technol. 2019, 10, 852–863.
  28. Wang, Z.; Zhu, Y.; Zhang, Y.; Liu, S. An effective deep learning approach enabling miners’ protective equipment detection and tracking using improved YOLOv7 architecture. Comput. Electr. Eng. 2025, 123, 110173.
  29. Du, Q.; Zhang, S.; Yang, S. BLP-YOLOv10: Efficient safety helmet detection for low-light mining. J. Real-Time Image Process. 2024, 22, 10.
  30. Tan, M.; Le, Q. EfficientNetV2: Smaller Models and Faster Training. In Proceedings of the 38th International Conference on Machine Learning, Virtual Event, 18–24 July 2021; pp. 10096–10106.
  31. Wieczorek, J.; Guerin, C.; McMahon, T. K-fold cross-validation for complex sample surveys. Stat 2022, 11, e454.
  32. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916.
  33. Wang, Z.; Wu, F.; Yang, Y. Air pollution measurement based on hybrid convolutional neural network with spatial-and-channel attention mechanism. Expert Syst. Appl. 2023, 233, 120921.
  34. Ding, X.; Zhang, X.; Han, J.; Ding, G. Diverse branch block: Building a convolution as an inception-like unit. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 10886–10895.
  35. Yang, L.; Zhang, K.; Liu, J.; Bi, C. Location IoU: A New Evaluation and Loss for Bounding Box Regression in Object Detection. In Proceedings of the 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 30 June–5 July 2024; pp. 1–8.
  36. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
  37. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. YOLOv10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011.
Figure 1. YOLOv8-DCDB network structure. This figure shows the overall architecture of the YOLOv8-DCDB model, with the key improvements highlighted. The Dilated-CBAM and DBBDetect modules, proposed in this study, are incorporated into the architecture to enhance feature extraction and multi-scale detection.
Figure 2. Channel attention mechanism.
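To connect Figure 2 to an implementation, the following is a minimal PyTorch sketch of a CBAM-style channel attention branch: average- and max-pooled channel descriptors pass through a shared bottleneck MLP, and their sum is squashed by a sigmoid into per-channel weights. The reduction ratio of 16 is an illustrative assumption, not necessarily the configuration used in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """CBAM-style channel attention sketch (reduction ratio is assumed)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Shared MLP applied to both the average- and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))   # (N, C, 1, 1)
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))    # (N, C, 1, 1)
        scale = torch.sigmoid(avg + mx)               # per-channel weights in (0, 1)
        return x * scale
```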
Figure 3. Dilated spatial attention mechanism.
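As a companion to Figure 3, here is a sketch of a dilated spatial attention branch, under the assumption that it follows the CBAM spatial attention recipe with the plain convolution replaced by a dilated one to enlarge the receptive field; the kernel size and dilation rate below are illustrative, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class DilatedSpatialAttention(nn.Module):
    """Spatial attention sketch with a dilated convolution (sizes are assumed)."""

    def __init__(self, kernel_size: int = 3, dilation: int = 2):
        super().__init__()
        # Padding chosen so the spatial resolution is preserved.
        padding = dilation * (kernel_size - 1) // 2
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=padding,
                              dilation=dilation, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg_map = x.mean(dim=1, keepdim=True)   # channel-wise average, (N, 1, H, W)
        max_map = x.amax(dim=1, keepdim=True)   # channel-wise maximum, (N, 1, H, W)
        attn = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * attn                          # reweight every spatial location
```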
Figure 4. DBBDetect structure.
Figure 5. Diverse branch block architecture.
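As a rough guide to Figure 5, the following is a compact PyTorch sketch of the training-time branch topology of a diverse branch block as described by Ding et al. [34]: parallel 3×3, 1×1, 1×1–3×3, and 1×1–average-pooling conv-BN branches whose outputs are summed. The structural re-parameterization that folds these branches into a single 3×3 convolution at inference time is omitted here for brevity.

```python
import torch
import torch.nn as nn

def conv_bn(in_ch: int, out_ch: int, k: int, padding: int = 0) -> nn.Sequential:
    """Convolution followed by batch normalization, as used in each DBB branch."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=padding, bias=False),
        nn.BatchNorm2d(out_ch),
    )

class DiverseBranchBlock(nn.Module):
    """Training-time DBB sketch; inference-time branch merging is not shown."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.branch_3x3 = conv_bn(in_ch, out_ch, 3, padding=1)
        self.branch_1x1 = conv_bn(in_ch, out_ch, 1)
        self.branch_1x1_3x3 = nn.Sequential(
            conv_bn(in_ch, out_ch, 1),
            conv_bn(out_ch, out_ch, 3, padding=1),
        )
        self.branch_1x1_avg = nn.Sequential(
            conv_bn(in_ch, out_ch, 1),
            nn.AvgPool2d(3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The four branches view the input at different effective scales;
        # their sum is what the single re-parameterized 3x3 conv reproduces.
        return (self.branch_3x3(x) + self.branch_1x1(x)
                + self.branch_1x1_3x3(x) + self.branch_1x1_avg(x))
```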
Figure 6. Long-tail distribution of minersPPE dataset.
Figure 7. Workflow of the 5-fold cross-validation approach.
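The splitting scheme in Figure 7 can be reproduced with scikit-learn's KFold, as in the minimal sketch below; the dataset size and the per-fold training step are placeholders rather than the paper's actual pipeline.

```python
from sklearn.model_selection import KFold
import numpy as np

# Placeholder index set standing in for the minersPPE image list.
image_ids = np.arange(1000)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)

for fold, (train_idx, val_idx) in enumerate(kfold.split(image_ids)):
    # Each fold uses 4/5 of the data for training and holds out 1/5 for validation,
    # so every image is validated exactly once across the five folds.
    print(f"fold {fold}: train={len(train_idx)} images, val={len(val_idx)} images")
    # A real run would train one detector per fold here and average the
    # resulting metrics, which smooths out the long-tail class imbalance
    # that a single fixed split can amplify.
```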
Figure 8. Categories of the minersPPE dataset and their corresponding quantities.
Figure 9. Comparison of YOLOv8 performance with and without the Dilated-CBAM module in dim environments. (a) Baseline YOLOv8: without Dilated-CBAM, detection confidence is lower. (b) YOLOv8 with the Dilated-CBAM module: detection accuracy improves, especially for “person” and “safety suit”.
Figure 10. Comparison of YOLOv8 performance with and without Dilated-CBAM in occlusion scenes. (a) Baseline YOLOv8: without Dilated-CBAM, “hands” are missed. (b) YOLOv8 with the Dilated-CBAM module: the previously missed detections are recovered.
Figure 11. Small-target detection results of the original model.
Figure 12. Small-target detection results with the efficient multi-branch detection head.
Figure 13. Overall performance of the YOLOv8-DCDB network.
Figure 14. Recognition results of each model on the minersPPE dataset. This comparison aims to evaluate the performance improvements in detecting protective equipment across different models, with YOLOv8-DCDB (f) demonstrating superior accuracy, particularly in detecting smaller objects like gloves and hands.
Table 1. The composition of the experimental algorithm configuration workstation.
Experimental Hardware and Software Information | Version and Model
CPU | Intel Core i9-13900KF
GPU | 2 × NVIDIA GeForce RTX 3080
PyTorch | 2.7
GPU Memory Size | 2048 MB
CUDA Version | 11.7
Operating System | Ubuntu 22.04 LTS
Table 2. Ablation experiments (val).
Dilated-CBAM | DBBDetect | K-Fold CV | Precision | mAP0.5 | mAP0.5:0.95 | FPS | Params (M)
– | – | – | 0.736 | 0.661 | 0.432 | 106.6 | 25.9
✓ | – | – | 0.779 | 0.668 | 0.439 | 110.3 | 25.9
– | ✓ | – | 0.766 | 0.679 | 0.442 | 116.7 | 31.4
✓ | ✓ | – | 0.798 | 0.675 | 0.444 | 100.1 | 31.5
✓ | ✓ | ✓ | 0.925 | 0.836 | 0.642 | 102.3 | 31.9
Table 3. Ablation experiments (train).
Dilated-CBAM | DBBDetect | K-Fold CV | Precision | mAP0.5 | mAP0.5:0.95 | FPS | Params (M)
– | – | – | 0.736 | 0.661 | 0.432 | 106.6 | 25.9
✓ | – | – | 0.753 | 0.669 | 0.438 | 110.3 | 25.9
– | ✓ | – | 0.773 | 0.678 | 0.443 | 116.7 | 31.4
✓ | ✓ | – | 0.777 | 0.680 | 0.440 | 100.1 | 31.5
✓ | ✓ | ✓ | 0.925 | 0.826 | 0.646 | 102.3 | 31.9
Table 4. Comparative experiments.
Model | Recall | mAP0.5 | mAP0.5:0.95 | FPS | Params
Faster R-CNN | 0.613 | 0.616 | 0.353 | 60.7 | 498 M
RetinaNet | 0.592 | 0.592 | 0.357 | 56 | 34 M
YOLOv8 | 0.623 | 0.661 | 0.432 | 106.6 | 25.9 M
YOLOv10 | 0.630 | 0.677 | 0.438 | 135.4 | 15.4 M
YOLOv12 | 0.671 | 0.722 | 0.497 | 112.9 | 20.2 M
YOLOv8-DCDB | 0.767 | 0.836 | 0.642 | 102.3 | 31.9 M