1. Introduction
Coal remains a primary source of energy for many nations and plays a pivotal role in energy supply and in the livelihoods of their populations. Consequently, it is regarded as one of the most reliable and secure strategic resources available today. In particular, for some energy-rich countries such as Russia and China, coal is also an important part of their economic potential. Hence, safe and sustainable coal mining practices are imperative [
1,
2].
In recent years, there has been a notable increase in the frequency of industrial accidents in the coal mining sector. These accidents have resulted in significant casualties and have prompted the adoption of digitalisation and artificial intelligence in the coal mining industry. Real-time miner detection systems have been shown to be an effective means of preventing dangerous accidents, thereby controlling risks and strengthening the safety management of coal mines and their personnel [
3,
4,
5].
The primary method of personnel monitoring in coal mines was initially a manual process, necessitating the allocation of a considerable number of personnel. The advent of monitoring equipment and network transmission technology has facilitated the real-time observation of video data collected by cameras situated within the mine by personnel in the monitoring room. This approach has the advantage of significantly reducing labour costs, yet it remains dependent on personnel for the detection process. The presence of human error has been demonstrated to have a significant impact on efficiency, thus indicating that this monitoring method is not without its inherent safety risks. The advent of cloud computing, machine vision, and artificial intelligence technology has led to the development of methodologies that facilitate the automated detection of miners through the analysis and detection of video images captured by cameras in mining operations. This technological advancement is instrumental in enhancing safety measures in the mining industry, thereby mitigating the risk of accidents [
6].
For instance, Mostafa et al. developed a lightweight YOLO-based C-Mask system for drone-borne face-mask surveillance, achieving 92.2% accuracy with real-time inference, further evidencing the suitability of YOLO variants for mobile safety monitoring.
Coal mines have complex operating environments, and common problems include uneven lighting, frequent occlusions, changing postures of personnel, and cluttered backgrounds. These factors cause targets to have blurred edges, small sizes, and similar categories, which can easily lead to false detections and missed detections. In addition, most existing algorithms are trained on urban or standard outdoor datasets and lack adaptability to complex working conditions in mining areas.
The advantages of YOLOv10 in mining are its lightweight design, high-precision detection of small targets (through CA and DyHead), and real-time performance, while YOLOv8, PP-YOLOE, and RT-DETR have limitations in these aspects, such as high computational complexity (RT-DETR), insufficient small-target detection (PP-YOLOE), and slightly lower accuracy (YOLOv8). The experimental data verify the significant improvement offered by YOLOv10 for the mine personnel detection task.
YOLOv10, released by Tsinghua University in 2024, is the latest version in the YOLO family of object detectors, known for its high speed, lightweight structure, and improved multi-scale target detection. It introduces architectural changes such as the SCDown module and PSA attention to optimise real-time performance while maintaining accuracy.
YOLOv10 contains six models, n, s, m, b, l, and x, to meet different application scenarios. Considering the communication conditions inside the mine and the need for real-time monitoring of personnel detection, this study uses YOLOv10n. YOLOv10n is the smallest model with the fastest detection speed and the smallest number of parameters, which meets the needs for real-time detection of mine personnel. The improved method proposed in this paper is also applicable to the other five versions.
The field of target detection can be broadly categorised into two distinct approaches: traditional methods and deep learning-based methods. The work [
7] utilises synchrotron optical micro-tomography technology and piezoelectric pulsed echo-ultrasound for real-time detection. The work [
8] enhances the precision of personnel fall detection through the integrity of high-dimensional digital sequences and the specificity of disparate measurements. Target detection algorithms based on deep learning analyse and process images using convolutional neural networks, thereby enabling real-time detection of personnel position. At present, these algorithms can be categorised into two-stage and one-stage detection algorithms. Two-stage detection algorithms include R-CNN [
9], Fast R-CNN [
10] and Faster R-CNN [
11]. In the initial phase, they employ heuristics or CNNs (RPN) [
12] to generate region proposals, subsequently undertaking classification and regression on each proposed region. The accuracy of these algorithms is notable; however, their efficiency is lower than that of alternative solutions [
13,
14,
15]. The utilisation of one-stage detection algorithms does not necessitate the generation of candidate regions; rather, these algorithms are capable of directly generating the category probability and the coordinate value of the target position. Following a one-stage detection process, the final detection outcome can be obtained directly, thereby contributing to the enhanced detection efficiency. One-stage detection algorithms are principally represented by the SSD algorithm [
16], and the YOLO algorithm [
17] is also extensively utilised.
In [
18], AKConv is combined with YOLOv8 to develop the C2f-BE module, with the objective of enhancing the algorithm’s feature-processing capability. In [
19], the standard convolution layer is replaced with a depthwise separable convolution, inverted residual, and linear bottleneck structure, with the aim of improving the feature extraction capability for small targets. However, the model structure is too complex. The research in [
20] enhances the precision of the YOLOv7 algorithm by introducing a parameter-free mechanism and restructuring the feature extraction module. However, it does not adequately control the number of model parameters. In [
21], multi-scale target detection is improved by adding a small-target detection layer and refining the feature extraction structure. However, this introduces a substantial delay. To address the difficult balance between model parameters and real-time performance, YOLOv8-based detection accuracy is improved by lightweight processing in [
22]. However, the dataset used there is predominantly composed of large targets, which may limit the rigour of the evaluation. In [
23], the false alarm rate of the model is reduced by combining the motion branch and the target branch. In [
24], the YOLOv4 model is enhanced by employing ResNet50 [
25] in lieu of the original CSPDarknet53 [
26], thereby optimising the accuracy and speed of personnel detection. In [
27], the Retinex image enhancement algorithm is integrated into the YOLOv3 network with the aim of enhancing the efficacy of detecting people in low-light conditions. As demonstrated in [
28], the incorporation of an attention mechanism within the Backbone and Neck of the network enhances its capacity for feature fusion.
Studies indicate that video-based target detection algorithms have found application in a variety of contexts. However, conditions in underground mining operations are harsh, and small targets may not be discernible in the captured images. Further research is therefore required to improve target detection accuracy and reduce false alarms. In parallel, Alexandrov et al. showed that Laplace–Beltrami spectral descriptors (HKS/WKS) boost 3-D object detection accuracy to 97%, implying that fusing geometric priors with attention-based YOLO frameworks could be a promising research direction.
In order to address this problem, we propose an improved miner detection architecture based on YOLOv10-N with targeted enhancements for low-visibility and asymmetric detection conditions:
      
- A Coordinate Attention mechanism is introduced into the Backbone to preserve position-aware context and suppress irrelevant background noise;
- A Dynamic Detection Head (DyHead) is incorporated to support multi-scale and task-aware detection [29, 30];
- An EIOU loss function is employed to accelerate training convergence and improve box regression precision;
- A coal miner image dataset under real-world conditions is created and used to validate the proposed method.
In summary, this work (i) introduces a CA attention mechanism into YOLOv10 to improve the perception of small targets in mine images; (ii) introduces a Dynamic Head module in the detection head to enhance scale and spatial adaptability; (iii) adopts an EIOU loss function to improve bounding box fitting and model convergence speed; and (iv) constructs and annotates a custom dataset of a complex mine environment and carries out comparative experiments to verify the effectiveness of the method.
  2. Materials and Methods
In 2024, Tsinghua University (China) released YOLOv10, which comprises six distinct models, n, s, m, b, l, and x, designed to cater to a range of application scenarios. The running speed and number of parameters of the differently sized models are summarised in 
Table 1. As demonstrated in the table, YOLOv10-N is the model with the fastest detection speed and the lowest number of parameters, which fulfils the requirements of real-time personnel detection in mines. Consequently, the present study has opted for YOLOv10-N as the foundational model [
31,
32].
As illustrated in 
Table 1, the configurations of YOLOv10-N to YOLOv10-X are presented, with the input size set uniformly at 640 pixels, which is the resolution of the image processed by the model. The number of parameters reflects the complexity of the model, with higher values signifying greater model capacity. The computational cost of a model is measured in giga floating-point operations (GFLOPs, abbreviated G), with higher values indicating greater computational intensity. The average accuracy of the model is expressed as a percentage and indicates how accurately the model detects the target. The speed of the model is given as latency in milliseconds (ms). In general, moving from YOLOv10-N to YOLOv10-X, the number of parameters and the computational cost increase gradually, accompanied by an improvement in average accuracy. However, this transition also results in a corresponding decline in processing speed.
YOLOv10-N has insufficient feature extraction capabilities due to its small number of parameters (2.3 M), and the missed detection rate of small targets in complex mine environments is significantly increased.
YOLOv10-S has a poor processing effect on occluded targets, and the false detection rate increases by 15–20% in dusty environments, affecting monitoring reliability.
YOLOv10-M has a computational cost of 59.1 GFLOPs, and its frame rate on edge devices is less than 15 FPS, which is insufficient to meet real-time monitoring needs.
YOLOv10-B requires 16 GB of video memory support, the energy efficiency ratio is unbalanced, and the cost of large-scale deployment underground is too high.
YOLOv10-L has too large a memory footprint (24.4 M parameters) and poor running stability on domestic mining AI equipment.
YOLOv10-X has a computational cost of up to 160.4 GFLOPs, but the accuracy improvement is limited, and it requires professional-level GPU support, so it is not practical underground. In light of the performance of the available hardware devices, YOLOv10-N is employed in this study.
Compared to previous versions of the YOLO model, YOLOv10 is characterised by higher speed and accuracy. The structure of the YOLOv10 network is shown in 
Figure 1.
The block diagram of YOLOv10 is shown above.
The architecture of the YOLOv10 network is segmented into three distinct components: the Backbone, the Neck, and the Head.
The Backbone is responsible for feature extraction; the Neck enhances the semantic information transmission of different levels through multi-scale fusion and may use convolution-free downsampling technology to reduce information loss; and the Head completes the final classification and positioning prediction. These three parts work together to achieve efficient and accurate end-to-end object detection.
In our improved architecture, a Coordinate Attention (CA) module is embedded into the Backbone, specifically after the PSA module, to enhance the model’s ability to capture spatially relevant and long-range features in complex mining environments.
As demonstrated in 
Figure 1, the Backbone layer is responsible for feature extraction and consists of six modules: CBS, C2f, SCDown, C2fCIB, PSA, and SPPF (Spatial Pyramid Pooling - Fast).
The CBS module performs the basic convolution operation; the abbreviation denotes a convolution layer followed by batch normalisation and a SiLU activation.
The C2f module is an inter-stage feature pooling module that enhances the ability of the model to extract multi-scale features. The convolution operation is performed by the CBS module, and the intermediate feature maps are then split into two parts by the split module. One of these parts is passed directly to the final Concat block, while the other is processed by several Bottleneck blocks. The Bottleneck block contains multiple convolutional layers, which transform the input feature map and extract high-level feature representations. The Concat block is primarily responsible for merging the feature map processed by the Bottleneck block with the part of the feature map transmitted directly in the channel dimension, thereby forming the merged feature map.
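As a rough illustration of the split-and-concatenate data flow just described, the following minimal Python sketch treats a feature map as a plain list of channels. The names `c2f_flow` and the toy bottleneck functions are hypothetical, the leading CBS convolution and the real convolutional Bottleneck layers are omitted, and the sketch follows only the description in the text rather than any reference implementation.

```python
def c2f_flow(channels, bottlenecks):
    """Sketch of the C2f data flow on a list of channels.

    channels    : list of per-channel feature maps (any objects)
    bottlenecks : list of callables standing in for Bottleneck blocks
                  (hypothetical placeholders, not real convolutions)
    """
    half = len(channels) // 2
    # Split: the first part goes straight to the final Concat block.
    shortcut, branch = channels[:half], channels[half:]
    # The second part is processed by several Bottleneck blocks in turn.
    for block in bottlenecks:
        branch = block(branch)
    # Concat: merge both parts along the channel dimension.
    return shortcut + branch


# Toy usage: two stand-in "bottlenecks" acting on scalar channels.
merged = c2f_flow(
    [1, 2, 3, 4],
    [lambda b: [v * 10 for v in b], lambda b: [v + 1 for v in b]],
)
```

The point of the design is that half of the channels bypass the Bottleneck blocks entirely, so the merged feature map keeps an unprocessed copy of the input alongside the transformed features.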
The SCDown block is a spatial-channel decoupled downsampling block. It is an efficient downsampling block that was introduced in YOLOv10. It replaces part of the traditional convolutional layer. The purpose of this replacement is to reduce the spatial resolution of feature maps while preserving rich semantic information and reducing computation.
The C2fCIB module (Cross-Scale Feature Fusion with Channel-Wise Information Bottleneck) first divides the input features into two parts by 1 × 1 convolution and then performs 3 × 3 multilayer convolution operations to extract features at different scales. Finally, the Concat module splices the results together, enhancing the capacity to combine features of different scales.
The Pyramid-Split Attention (PSA) module is an attention mechanism-based module that enhances the model’s ability to learn global representations. The model generates a pyramid-shaped feature map by splicing the results of convolution operations with convolution kernels of different sizes. It then applies the attention mechanism to this feature map to extract richer feature information.
The SPPF module is an efficient spatial pyramid fusion technique that enhances the model’s ability to detect targets of different sizes through multi-scale feature fusion while optimising computational efficiency.
To summarise, the above six modules form a progressive feature extraction hierarchy:
- (1)
- CBS captures low-level textures; 
- (2)
- C2f and SCDown reduce spatial resolution while widening the channel space; 
- (3)
- C2fCIB performs cross-scale fusion that enriches mid-level semantics; 
- (4)
- PSA injects multi-kernel global context; and 
- (5)
- SPPF aggregates multi-scale cues at a negligible cost. 
This coarse-to-fine hierarchy supplies a rich multi-level feature pyramid for the subsequent enhancement stages.
The Neck component integrates feature maps of varying dimensions derived from the Backbone, thereby fusing shallow features with deeper ones. It is composed of modules such as CBS, C2f, SCDown, C2fCIB, Concat, and Upsample. The Upsample module enlarges the input tensor through interpolation, thereby increasing the resolution of the feature map.
The Head uses a decoupled design to address classification, regression, and aggregation, with a dedicated head for each respective function [
33,
34,
35]. In this study, we replace the original detection head with a Dynamic Head (DyHead) module, which integrates size-aware, spatially aware, and task-aware attention mechanisms to improve the model’s adaptability to small and asymmetric objects. The Head of the network principally consists of the Detect module, which processes feature maps through multiple 1 × 1 convolutional layers in order to convert the combined feature maps into final detection results.
      
- 1.
- Coordinate attention mechanism. 
In the context of coal mines, where image resolution is often limited and objects are frequently diminutive, a Coordinate Attention (CA) mechanism [
36] is incorporated into the backbone network. This mechanism is employed to mitigate the impact of extraneous information, thereby enhancing the network’s ability to focus on the relevant aspects of the input data. Its structural configuration is delineated in 
Figure 2.
As illustrated in 
Figure 2, X Avg Pool and Y Avg Pool refer to the feature maps that are derived from global height and width decomposition, respectively. The Concat+Conv2d operation denotes the union of two feature maps, followed by a two-dimensional convolution operation. The term “Batch Norm + Non-linear” is used to denote a normalisation operation followed by a nonlinear activation function. Conv2d denotes the two-dimensional convolution operation, and sigmoid denotes the sigmoidal activation function.
In the context of feature enhancement, the Coordinate Attention (CA) block acts as the first refinement stage. By encoding height-wise and width-wise context, CA generates two 1-D attention maps that re-weight the intermediate tensor. This operation highlights miner-relevant regions—such as helmets, limbs, or reflective strips—while suppressing background noise introduced by uneven lighting or coal dust. The lightweight design (two 1 × 1 convolutions and element-wise multiplication) adds less than 0.02 M parameters, ensuring minimal computational overhead.
For each feature map X with input dimension C × H × W, each channel is encoded in the horizontal and vertical directions, respectively, to obtain feature maps in the height and width directions as shown in Formula (1):
$$z_c^h(h) = \frac{1}{W}\sum_{0 \le i < W} x_c(h, i), \qquad z_c^w(w) = \frac{1}{H}\sum_{0 \le j < H} x_c(j, w) \tag{1}$$
      where $W$ and $H$ are the width and height of the feature map; $x_c(h, i)$ and $x_c(j, w)$ are the corresponding feature values in channel $c$ of the feature map; and $z_c^h(h)$ and $z_c^w(w)$ represent the output of channel $c$ in the horizontal and vertical directions, respectively.
The term “Concat+Conv2d” refers to the concatenation of height- and width-pooled features, followed by a 1 × 1 convolution to combine directional information into a unified representation before further splitting into spatial branches.
Through this transformation, CA can capture long-range dependencies in one direction and retain accurate location information in the other direction, which helps the network to find important information. The result is then passed to the convolution transform function $F_1$, which encodes the spatial information in horizontal and vertical directions and yields an intermediate feature map $f$:
$$f = \delta\left(F_1\left([z^h, z^w]\right)\right)$$
      where $[\cdot, \cdot]$ denotes concatenation along the spatial dimension and $\delta$ is a non-linear activation function.
After obtaining $f$, $f$ is split into two independent tensors, $f^h$ and $f^w$, along the spatial dimension. $F_h$ and $F_w$ denote the two-dimensional convolution operations, respectively. $f^h$ and $f^w$ are transformed into tensors $g^h$ and $g^w$ with the same number of channels as the input signal X using a 1 × 1 convolution transform:
$$g^h = \sigma\left(F_h(f^h)\right), \qquad g^w = \sigma\left(F_w(f^w)\right)$$
      where $\sigma$ is the sigmoid function, and $g^h$ and $g^w$ are the coordinate attention weights for height and width, respectively.
Finally, the obtained $g^h$ and $g^w$ are expanded and used as attention weights, respectively, to obtain the output value of the coordinate attention block Y:
$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$$
      where $y_c(i, j)$ represents the feature values at the output of the coordinate attention layer, and $x_c(i, j)$ represents the feature values of the inputs of the coordinate attention layer.
As illustrated in 
Figure 3, the Backbone network is modified by the incorporation of a CA layer following the PSA layer of the Backbone network. This layer, termed the coordinate attention mechanism layer, has been previously delineated in the present text. It functions by decomposing and combining the outputs of the PSA layer, thereby enhancing the detection effect.
CA enhances position awareness through spatial-coordinate decomposition and attention-based recombination, operating in four stages:
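Coordinate Decomposition:
Decompose input feature $X \in \mathbb{R}^{C \times H \times W}$ along height/width axes:
$$z_c^h(h) = \frac{1}{W}\sum_{0 \le i < W} x_c(h, i), \qquad z_c^w(w) = \frac{1}{H}\sum_{0 \le j < H} x_c(j, w)$$
      yielding directional features $z^h \in \mathbb{R}^{C \times H \times 1}$, $z^w \in \mathbb{R}^{C \times 1 \times W}$.
Feature Fusion and Transformation:
Concatenate features and transform via 1 × 1 conv $F_1$:
$$f = \delta\left(F_1\left([z^h, z^w]\right)\right), \qquad f \in \mathbb{R}^{C/r \times (H + W)}$$
      where $r$ is the channel reduction ratio; $\delta$ denotes ReLU.
Attention Weight Generation:
Split features and apply sigmoid:
$$g^h = \sigma\left(F_h(f^h)\right), \qquad g^w = \sigma\left(F_w(f^w)\right)$$
Feature Recombination:
Output feature map via coordinate weighting:
$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$$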
To validate the placement of CA after the PSA module, we conducted a sensitivity analysis by inserting the CA module before PSA, after C2fCIB, and after PSA (
Table 2). The results show that placement after PSA yields the highest detection performance, improving mAP50–95 by 1.8% over pre-PSA placement, indicating optimal feature interaction with global attention maps.
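To make the four CA stages described above concrete, the following is a minimal pure-Python sketch operating on nested lists. It is an illustration under stated assumptions rather than the implementation used in this work: the function names and the weight matrices `w_reduce`, `w_h`, and `w_w` are hypothetical, each 1 × 1 convolution is written as a plain channel-mixing matrix product without bias, and the batch normalisation step is omitted for brevity.

```python
import math


def conv1x1(feat, weight):
    """1x1 convolution as channel mixing: feat is [C_in][N], weight is [C_out][C_in]."""
    n = len(feat[0])
    return [[sum(w_row[c] * feat[c][p] for c in range(len(feat))) for p in range(n)]
            for w_row in weight]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def coordinate_attention(x, w_reduce, w_h, w_w):
    """x: [C][H][W]; w_reduce: [C/r][C]; w_h, w_w: [C][C/r] (illustrative weights)."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    # Stage 1: coordinate decomposition (average over width, then over height).
    z_h = [[sum(x[c][h]) / W for h in range(H)] for c in range(C)]
    z_w = [[sum(x[c][h][w] for h in range(H)) / H for w in range(W)] for c in range(C)]
    # Stage 2: concatenate along the spatial axis, 1x1 conv, ReLU (batch norm omitted).
    cat = [z_h[c] + z_w[c] for c in range(C)]
    f = [[max(0.0, v) for v in row] for row in conv1x1(cat, w_reduce)]
    # Stage 3: split into height/width branches, 1x1 conv per branch, sigmoid.
    f_h = [row[:H] for row in f]
    f_w = [row[H:] for row in f]
    g_h = [[sigmoid(v) for v in row] for row in conv1x1(f_h, w_h)]
    g_w = [[sigmoid(v) for v in row] for row in conv1x1(f_w, w_w)]
    # Stage 4: recombination, y_c(i, j) = x_c(i, j) * g_h_c(i) * g_w_c(j).
    return [[[x[c][i][j] * g_h[c][i] * g_w[c][j] for j in range(W)]
             for i in range(H)] for c in range(C)]


# Toy usage: one channel, 2 x 2 map, zero branch weights give gates of 0.5 each.
x = [[[4.0, 8.0], [0.0, 12.0]]]
y = coordinate_attention(x, w_reduce=[[1.0]], w_h=[[0.0]], w_w=[[0.0]])
```

Because the two gates are one-dimensional (one value per row and one per column), the re-weighting keeps positional information along both axes at the cost of only a few 1 × 1 convolutions.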
      
- 2.
- Dynamic detection head (DyHead). 
In 2021, researchers at Microsoft proposed Dynamic Head (DyHead), an innovative detection head architecture. Dynamic Head integrates scale awareness, spatial awareness, and task awareness into a unified framework, thereby enabling the detection head to dynamically and adaptively adjust to input features by deploying an attention mechanism in the scale, spatial, and channel dimensions of features, respectively. Significant performance gains were achieved in COCO benchmarks [
37].
Each attention branch within DyHead not only performs detection but also refines feature representation: scale-aware attention dynamically adjusts the receptive field size so that very small or very large miner targets are better matched; spatially aware attention predicts learnable offsets, helping the network focus on key regions (e.g., helmets or limbs) even under partial occlusion; and task-aware attention redistributes channel importance between classification and regression, balancing localisation accuracy and class confidence. Taken together, these three branches constitute the second-stage feature enhancement pipeline, complementing the coordinate attention applied in the Backbone.
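The internal environment of the mine is relatively complex, and the target dimensions vary greatly. This paper therefore introduces a Dynamic Head (DyHead) in the Head network. DyHead combines size-aware, space-aware, and task-aware attention: it deploys a separate attention mechanism on each of the corresponding dimensions and finally nests them into a single attention function [
38,
39]:
$$W(F) = \pi_C\left(\pi_S\left(\pi_L(F) \cdot F\right) \cdot F\right) \cdot F$$
      where $W(\cdot)$ is the attention function, $F \in \mathbb{R}^{L \times S \times C}$ is the input three-dimensional tensor, $L$ is the number of levels of the feature map, $S = H \times W$ is the product of the width and height of the feature map, $C$ is the number of channels of the feature map, $\pi_L$ is the size-aware attention module, $\pi_S$ is the space-aware attention module, and $\pi_C$ is the task-aware attention module.
The size-aware attention module is calculated as follows:
$$\pi_L(F) \cdot F = \sigma\left(f\left(\frac{1}{SC}\sum_{S, C} F\right)\right) \cdot F$$
Here, $f(\cdot)$ is a linear function approximated by a 1 × 1 convolutional layer, and $\sigma(x) = \max\left(0, \min\left(1, \frac{x + 1}{2}\right)\right)$ is a hard sigmoid function.
The space-aware attention module is calculated as follows:
$$\pi_S(F) \cdot F = \frac{1}{L}\sum_{l=1}^{L}\sum_{k=1}^{K} w_{l,k} \cdot F(l;\, p_k + \Delta p_k;\, c) \cdot \Delta m_k$$
      where $K$ is the number of sparse sampling positions, $w_{l,k}$ are the learned sampling weights, $p_k + \Delta p_k$ is the sampling position shifted by the self-learned spatial offset $\Delta p_k$, and $\Delta m_k$ is the importance scalar obtained from self-learning.
DyHead realises the perception of different targets through a task-aware attention mechanism and supports different tasks by dynamically opening and closing channels:
$$\pi_C(F) \cdot F = \max\left(\alpha^1(F) \cdot F_c + \beta^1(F),\ \alpha^2(F) \cdot F_c + \beta^2(F)\right)$$
      where $F_c$ is the feature slice of the $c$-th channel, and $\alpha$ and $\beta$ are the trained parameters.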
The structural configuration of DyHead is illustrated in 
Figure 4. Stacking several DyHead blocks applies the size-aware, space-aware, and task-aware attention repeatedly, which enhances the model’s efficacy.
As illustrated in 
Figure 4, the DyHead structure comprises three distinct modules: size-aware attention, space-aware attention, and task-aware attention.
The size-aware attention module performs an average pooling (AVG Pool) operation on the input feature map to reduce its size, followed by convolution operations and activation functions.
The spatially aware attention module adjusts the position of each pixel in the feature map by means of learned offsets (the displacement module), with the goal of improving the accuracy of target pose modelling.
The task-aware attention module dynamically switches feature channels on or off by learning task-related weights to support different task requirements.
Taken together, the CA module guides the network to focus on the target area in the image by encoding spatial and channel information separately, while DyHead adjusts the feature response through its scale, space, and task perception modules to enhance target positioning and classification performance, especially for miners with large size changes.
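The size-aware and task-aware branches described above can be sketched in a few lines of pure Python. This is a simplified illustration under several assumptions: the names `hard_sigmoid`, `scale_aware`, and `task_aware` are hypothetical; the 1 × 1 convolution in the size-aware branch is reduced to a single scalar weight and bias; the task-aware coefficients are passed in as fixed per-channel lists, whereas in DyHead they are predicted from the features; and the spatially aware branch, which relies on deformable sampling, is not reproduced here.

```python
def hard_sigmoid(x):
    """Piecewise-linear gate clipped to [0, 1]."""
    return max(0.0, min(1.0, (x + 1.0) / 2.0))


def scale_aware(feat, weight=1.0, bias=0.0):
    """feat: [L][S][C]. Computes one gate per pyramid level and rescales that level."""
    out = []
    for level in feat:
        s, c = len(level), len(level[0])
        mean = sum(sum(row) for row in level) / (s * c)   # average over S and C
        gate = hard_sigmoid(weight * mean + bias)         # scalar stand-in for the 1x1 conv
        out.append([[v * gate for v in row] for row in level])
    return out


def task_aware(feat, a1, b1, a2, b2):
    """Per-channel dynamic activation: max(a1*F_c + b1, a2*F_c + b2)."""
    return [[[max(a1[c] * v + b1[c], a2[c] * v + b2[c]) for c, v in enumerate(row)]
             for row in level] for level in feat]


# Toy usage: one level, one spatial position, two channels.
feat = [[[2.0, -2.0]]]
scaled = scale_aware(feat)                      # level mean is 0, so the gate is 0.5
switched = task_aware(feat, [1.0, 1.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])
```

With the coefficients chosen in the usage example, the task-aware branch reduces to a ReLU, which shows how a channel can be switched off; other coefficient values yield other piecewise-linear responses per channel.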
      
- 3.
- Loss function optimisation. 
To further enhance bounding box regression accuracy and convergence speed, we replace the CIOU loss with the Efficient Intersection over Union (EIOU) loss.
The CIOU loss function exhibits deficiencies when confronted with variations in boundary dimensions and sampling imbalance. In this paper, the EIOU loss function [
40] is employed for optimisation and adjustment.
Since the CIOU loss function cannot distinguish between bounding boxes with the same centre and the same aspect ratio but different sizes, this paper introduces the EIOU loss function, which replaces the aspect-ratio term with a direct regression of the width and height values, i.e., the original CIOU = IOU loss + centre-point loss + aspect-ratio loss is changed to EIOU = IOU loss + centre-point loss + width loss + height loss. The formula for calculating the EIOU loss function is as follows:
$$L_{EIOU} = 1 - IOU + \frac{\rho^2(b, b^{gt})}{c^2} + \frac{\rho^2(w, w^{gt})}{C_w^2} + \frac{\rho^2(h, h^{gt})}{C_h^2}$$
Here, the prediction window is the box output by the model, and the real window is the annotated box. $IOU$ is the ratio of the intersection to the union of the prediction window and the real window, $\rho^2(b, b^{gt})$ is the square of the Euclidean distance between the centre point of the prediction window and the centre point of the real window, $c^2$ is the square of the length of the diagonal of the minimum enclosing box of the prediction and the real window, $\rho^2(w, w^{gt})$ is the square of the difference between the width of the prediction window and the width of the real window, $C_w^2$ is the normalising parameter of the width difference (the squared width of the minimum enclosing box), $\rho^2(h, h^{gt})$ is the square of the difference between the height of the prediction window and the height of the real window, and $C_h^2$ is the normalising parameter of the height difference (the squared height of the minimum enclosing box).
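The composition described above (an IOU term plus centre-point, width, and height penalties) can be checked numerically with a short pure-Python sketch. The function name `eiou_loss`, the `(x1, y1, x2, y2)` box format, and the small `eps` stabiliser are assumptions made for illustration; normalising by the diagonal, width, and height of the minimum enclosing box follows the usual EIOU formulation and is not taken from the training code of this work.

```python
def eiou_loss(pred, target, eps=1e-9):
    """EIOU loss for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1

    # IOU term.
    iw = max(0.0, min(px2, tx2) - max(px1, tx1))
    ih = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = iw * ih
    union = pw * ph + tw * th - inter
    iou = inter / (union + eps)

    # Width and height of the minimum enclosing box.
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)

    # Squared distance between the two centre points.
    centre = ((px1 + px2) / 2 - (tx1 + tx2) / 2) ** 2 \
           + ((py1 + py2) / 2 - (ty1 + ty2) / 2) ** 2

    return ((1.0 - iou)
            + centre / (cw ** 2 + ch ** 2 + eps)   # centre-point loss
            + (pw - tw) ** 2 / (cw ** 2 + eps)     # width loss
            + (ph - th) ** 2 / (ch ** 2 + eps))    # height loss


# Same centre and aspect ratio, different size: width/height terms are non-zero.
loss_nested = eiou_loss((1.0, 1.0, 3.0, 3.0), (0.0, 0.0, 4.0, 4.0))
loss_same = eiou_loss((0.0, 0.0, 4.0, 4.0), (0.0, 0.0, 4.0, 4.0))
```

In the nested-box example the IOU is 0.25 and the centre distance is zero, yet the width and height terms each add 0.25, so the size mismatch is penalised explicitly.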
      
- 4.
- General structural diagram. 
The structure of YOLOv10n after adding the coordinate attention mechanism, dynamic detection head, and EIOU loss function is shown in 
Figure 5.
As illustrated in 
Figure 5, this paper proposes inserting the CA (Coordinate Attention) module after the PSA (Pyramid-Split Attention) module of the Backbone, together with replacing the Detect module in the Head with the Dynamic Head (DyHead) module. This modification is intended to enhance the efficacy of the YOLOv10 model.
  3. Results and Discussion
The Internet Miner Detection Dataset is utilised in this study. The intelligent detection and control group for electromechanical equipment of coal mines at Xi’an University of Science and Technology, Shaanxi Province, China, was responsible for its design and development. The dataset under consideration encompasses a diverse range of targets and scenarios, including the detection of miners using helmets, the detection of miner behaviour, and the detection of hydraulic support protection [
41].
The present study investigates the YOLOv10 miner detection method and how it can be enhanced. This work improves the YOLOv10-N model by adding the Coordinate Attention (CA) mechanism, Dynamic Detection Head (DyHead), and EIOU loss function, which significantly improve the miner detection performance. The experiment was conducted on a Windows 11 system (RTX 2060 GPU) using 37,463 annotated images, and the training parameters included 10 epochs, a 640 × 640 input size, and a batch size of 16. The improved model achieved 92.69% accuracy and 87.53% recall on the test set, and mAP50 was 94.71%, which is better than YOLOv8 (89.91%), Faster-RCNN (74.23%), and SSD (69.11%). Visualisation results and ablation experiments verify the effectiveness of each improved module, proving that this method is particularly suitable for small-target detection in complex mine environments.
The objective of the present study is to facilitate the detection of coal miners. To this end, we created a new dataset that covers three types of targets: miners wearing helmets (22,357 images), miners without helmets (8,493 images), and hydraulic supports and obstacles (6,613 images), for a total of 37,463 images. All images are from real underground coal mines in Shaanxi Province and are annotated using LabelImg v1.8.6 software. To ensure accuracy, we perform double-person cross-annotation and manual review. Data augmentation strategies include random cropping, image flipping, brightness perturbation, and blurring. Training evaluation uses five indicators (Precision, Recall, IOU, mAP50, mAP50–95) and PR curve analysis to ensure the generalisation ability and practicality of the model.
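To illustrate how Precision and Recall relate to the IOU threshold that underlies mAP50, the following pure-Python sketch counts true and false positives for a single image and a single class. The function names, the `(x1, y1, x2, y2)` box format, and the greedy highest-score-first matching rule are assumptions made for this example; evaluation toolkits differ in their matching details, and this is not the evaluation code used in this work.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(preds, truths, iou_thr=0.5):
    """preds: list of (score, box); truths: list of boxes (one class, one image)."""
    matched = set()
    tp = fp = 0
    # Visit predictions from most to least confident.
    for _, box in sorted(preds, key=lambda p: p[0], reverse=True):
        best_j, best_iou = -1, 0.0
        for j, gt in enumerate(truths):
            if j in matched:
                continue                     # each ground truth is matched at most once
            v = iou(box, gt)
            if v > best_iou:
                best_j, best_iou = j, v
        if best_iou >= iou_thr:
            matched.add(best_j)
            tp += 1
        else:
            fp += 1
    fn = len(truths) - len(matched)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy usage: one correct detection, one spurious detection, one missed miner.
truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60))]
p, r = precision_recall(preds, truths)
```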
Images were captured in active underground coal mines across Shaanxi Province using fixed surveillance and mobile cameras. Lighting conditions range from well-lit tunnels to dimly lit workspaces. Viewpoints include frontal, side, and overhead perspectives. Backgrounds vary from clean passageways to cluttered machinery zones. Miners appear in multiple outfits, including reflective vests, helmets of different colours, and protective uniforms, ensuring diversity in visual features.
To ensure unbiased evaluation, we employed a random stratified split of 65% training, 15% validation, and 20% testing, ensuring no temporally or spatially adjacent frames were shared across sets. Furthermore, a 5-fold cross-validation was conducted, and the average results across all splits showed a variance of less than 0.5%, indicating the stability of our model and the robustness of the data split.
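A stratified 65/15/20 split of the kind described above can be sketched as follows. The function name `stratified_split`, the use of one class label per image, and the fixed random seed are assumptions made for illustration; the sketch does not reproduce the additional constraint that temporally or spatially adjacent frames stay in the same subset, which would require grouping frames by video sequence before splitting.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.65, 0.15, 0.20), seed=0):
    """Return (train, val, test) index lists with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)

    train, val, test = [], [], []
    for lab in sorted(by_class):
        idxs = by_class[lab]
        rng.shuffle(idxs)                       # randomise within each class
        n_train = round(len(idxs) * ratios[0])
        n_val = round(len(idxs) * ratios[1])
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]          # remainder goes to the test set
    return train, val, test


# Toy usage with hypothetical class names and small counts.
labels = ["helmet"] * 100 + ["no_helmet"] * 40 + ["support"] * 20
train, val, test = stratified_split(labels)
```

Splitting within each class keeps the three target types in the same proportions across the training, validation, and test subsets, which matters here because the classes are imbalanced.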
The data stream is illustrated in 
Figure 6.
As illustrated in 
Figure 6, the LabelImg program is utilised to incorporate bounding boxes and category labels for target detection in images from the specified dataset. The initial step in the process of labelling the miner dataset is the placement of the image files from the dataset into the “images” folder of the LabelImg program. The initiation of the LabelImg program, followed by the selection of the miner images folder through the utilisation of the “Open Dir” function, results in the automatic loading of all images. For each image, the user is able to draw a bounding box by dragging the mouse across the interface to mark the position of the miner. The utilisation of LabelImg for the labelling of the coal miner dataset ensures the provision of precise location and target category information, thereby facilitating enhanced model training accuracy and, consequently, improved model detection accuracy.
Figure 7 shows a selected sample from the dataset, depicting the working environment in a coal mine. Since miners are the primary subject of this study, the two miners in the figure are annotated with labels.
The experiments in this paper were conducted on the Windows 11 operating system with an RTX 2060 graphics card (16 GB video memory). The development environment was PyCharm Professional Edition, and the deep learning framework was PyTorch 2.8.0. The settings for the model training parameters are given in 
Table 3.
In accordance with the task requirements, the YOLOv10-N model is selected and its configuration file is loaded. The source file is then edited to introduce the coordinate attention (CA) layer and the dynamic head (DyHead) proposed in this paper, which completes the modification of the model.
The model was trained for 10 epochs, with each epoch representing one complete pass through the training dataset. The input image size is 640 × 640 pixels. Each training batch contains 16 images, and the model is trained from scratch without pre-trained weights. Training starts with a learning rate of 0.001, and the momentum parameter is set to 0.937 to accelerate the convergence of gradient descent and reduce oscillations. The weight decay is 0.005, which regularises the model and helps prevent overfitting.
The optimiser used in training is Stochastic Gradient Descent (SGD) with Nesterov momentum. A cosine annealing learning rate schedule is adopted, gradually reducing the learning rate to prevent overshooting and ensure convergence. The total training time per epoch is approximately 14 min.
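The cosine annealing schedule described above can be illustrated with a small sketch. It uses the reported initial learning rate (0.001) and epoch count (10); the final learning rate `lr_final = 0.0` and the per-epoch (rather than per-iteration) update are assumptions, since the paper does not state them.

```python
import math


def cosine_lr(epoch, total_epochs=10, lr_initial=0.001, lr_final=0.0):
    """Cosine-annealed learning rate at the start of a 0-based epoch.

    lr_final is an assumed value; the paper only reports the initial rate.
    """
    progress = epoch / total_epochs
    return lr_final + 0.5 * (lr_initial - lr_final) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start of each of the 10 training epochs.
schedule = [cosine_lr(epoch) for epoch in range(10)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.6f}")
```

The rate falls slowly at first, fastest around the midpoint, and slowly again near the end, which is what limits overshooting late in training.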
The training script is then launched. During training, the model iterates over the dataset repeatedly under the specified parameters, progressively refining its weights. A validation set is evaluated regularly during training to check that the model is not overfitting. After training, the validation set is used for a final evaluation to determine the model’s accuracy and other performance metrics.
In this paper, five metrics are selected to evaluate the performance of the model in identifying miners: Intersection over Union (IoU) [
42], Precision (P), Recall (R), Average Precision (AP), and mean AP over all classes (mAP).
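$$\mathrm{IoU} = \frac{A_{\mathrm{overlap}}}{A_{\mathrm{union}}}$$
      where $A_{\mathrm{overlap}}$ is the overlap area of the predicted and ground-truth boxes and $A_{\mathrm{union}}$ is the total area covered by the two boxes.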
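As a minimal sketch of the IoU definition, the following function computes the ratio of overlap area to total (union) area for two axis-aligned boxes. The `(x1, y1, x2, y2)` corner format and the name `box_iou` are assumptions made for illustration.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) corners."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the intersection rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    overlap = inter_w * inter_h

    # Total area covered by both boxes, counting the overlap once.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - overlap
    return overlap / union if union > 0 else 0.0


# Two 2 x 2 boxes offset by one unit: overlap 1, union 7.
iou_example = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
```

For the two boxes in the usage line, the overlap is 1 and the union is 4 + 4 - 1 = 7, so the IoU is 1/7.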
Precision (P) is the ratio of the number of samples that the model predicts as positive and that are truly positive to the total number of samples that the model predicts as positive. In this paper, it expresses the probability that miners are correctly labelled [
42].
$$P = \frac{TP}{TP + FP}$$
      where TP is the number of positive samples correctly predicted as positive, and FP is the number of negative samples incorrectly predicted as positive.
Recall (
R) is the proportion of actual positive samples correctly identified by the model. It is defined as follows:
$$R = \frac{TP}{TP + FN}$$
      where FN is the number of positive samples that are erroneously classified as negative.
Average precision (AP) is the area under the precision-recall curve, that is, the curve formed by the values of P at different values of R:
$$AP = \int_{0}^{1} P(R)\,\mathrm{d}R$$
The average AP value over all classes is as follows:
$$mAP = \frac{1}{C}\sum_{i=1}^{C} AP_i$$
      where C is the number of target classes, and $AP_i$ denotes the AP of the i-th target class.
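The metric definitions can be made concrete with a short sketch. The function names and the all-point interpolation of the precision-recall curve are assumptions; the paper does not state which interpolation its evaluation code uses. The AP here is computed at a single IoU threshold, so mAP50 would use matches at IoU ≥ 0.5, and mAP50–95 is conventionally the average over thresholds from 0.50 to 0.95 in steps of 0.05.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(scored_detections, num_ground_truth):
    """Area under the precision-recall curve for one class.

    scored_detections: list of (confidence, is_true_positive) pairs.
    num_ground_truth: number of ground-truth objects of this class (must be > 0).
    """
    ranked = sorted(scored_detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))

    # Precision envelope: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum rectangle areas between consecutive recall values.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """Mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# Toy example: three detections for a class with two ground-truth objects.
ap_example = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_ground_truth=2)
```

In the toy example the envelope precision is 1 up to recall 0.5 and 2/3 from recall 0.5 to 1.0, so the AP is 0.5 × 1 + 0.5 × 2/3 = 5/6.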
This experiment investigates the impact of the attention improvement.
To investigate the influence of different attention mechanisms on model performance, this paper compares the CA attention mechanism with the SE (Squeeze-and-Excitation) and BAM (Bottleneck Attention Module) attention mechanisms. The YOLOv10-N model is used as the baseline.
The YOLOv10n+SE model incorporates the SE (Squeeze-and-Excitation) module as a constituent element of the YOLOv10n framework. The SE module employs an adaptive learning process to determine the relative importance of each channel. This is achieved through the implementation of squeeze and excitation operations, which enable the adjustment of channel contributions to the feature map. The adjustment is based on the task requirements, ensuring that the channels contribute optimally to the feature map. This mechanism assists the network in focusing on the most significant feature channels, consequently enhancing the model’s performance.
The YOLOv10n+BAM model integrates the BAM (Bottleneck Attention Module) into the YOLOv10n architecture. BAM is a lightweight attention mechanism that improves the network’s attention to important features. By incorporating attention into the bottleneck structure, it enhances the model’s capacity to filter and utilise features. Its modular design allows BAM to be integrated into existing convolutional neural network architectures without extensive modifications to the original network.
The experimental results are summarised in 
Table 4. The following key findings highlight the advantages of the CA attention mechanism: The YOLOv10n+CA model achieved the highest average detection accuracy (94.79%), outperforming SE and BAM attention modules. Its detection efficiency (mAP50–95) reached 68.24%, indicating superior performance in complex detection tasks. The CA module effectively enhanced long-range dependency and spatial localisation, especially under occlusion or cluttered backgrounds.
Unlike CBAM and SimAM, CA is specifically designed to capture spatial-coordinate dependencies, making it more suitable for miner detection where positional cues (e.g., helmets, limbs) under occlusion are crucial. Moreover, DyHead offers task-specific adaptability not present in general-purpose detection heads, aligning better with the variable postures in mining environments.
An experiment was also conducted to determine the number of nested DyHead modules.
The number of nested DyHead modules directly affects the performance of the model. This study investigates the effect of nesting between one and three DyHead modules. The experimental results are summarised in 
Table 5. Each additional DyHead nesting increases the complexity of the model, adding approximately 0.5 MB to the model size. We set the number of nestings to 2. The performance across different nesting levels can be summarised as follows:
- One-layer nesting underperformed slightly (90.78%) while requiring less memory.
- Two-layer DyHead nesting yielded the best trade-off between accuracy (94.79%) and model size.
- Three-layer nesting offered no improvement over the two-layer design but significantly increased model size, indicating diminishing returns and potential overfitting.
The experiment was based on YOLOv10-N. The first optimisation is to incorporate the CA attention mechanism into the Backbone network. The second optimisation is to substitute the detection head with the DyHead detection head, based on the first optimisation. The third optimisation is to modify the loss function to EIOU, based on the second optimisation. It is noteworthy that all experiments utilise identical training parameters and are executed on an identical dataset. The results of the study are summarised in 
Table 6:
As illustrated in 
Table 6, the combined application of the CA attention mechanism, DyHead structure, and EIOU loss yielded notable performance improvements: accuracy improved from 90.04% to 92.69% and recall from 85.11% to 87.53%; mAP50 increased by 6.37 percentage points, and mAP50–95 improved by 4.97 points. The experimental results demonstrate that the improved method substantially raises detection performance while maintaining processing efficiency, and that it is particularly well suited to miner detection in mining environments.
In particular, compared to SE and BAM, CA demonstrates superior spatial feature extraction under occlusion; DyHead outperforms conventional heads like TaskAligned by adapting to multi-scale objects; EIOU resolves size regression inconsistencies more effectively than CIOU, resulting in higher precision and recall.
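To illustrate how EIOU penalises width and height errors separately, the following sketch implements the commonly published EIoU formulation: 1 − IoU, plus a normalised centre-distance term, plus separate width and height terms normalised by the smallest enclosing box. It is an assumption that the paper's implementation matches this form; the function name, the `(x1, y1, x2, y2)` box format, and `eps` are illustrative.

```python
def eiou_loss(pred, target, eps=1e-9):
    """EIoU loss for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1

    # Plain IoU term.
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h
    union = pw * ph + tw * th - inter
    iou = inter / (union + eps)

    # Width and height of the smallest box enclosing both boxes.
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)

    # Squared distance between box centres, normalised by the enclosing diagonal.
    dx = (px1 + px2) / 2 - (tx1 + tx2) / 2
    dy = (py1 + py2) / 2 - (ty1 + ty2) / 2
    centre_term = (dx * dx + dy * dy) / (cw * cw + ch * ch + eps)

    # Separate width and height penalties (the part that distinguishes EIoU from CIoU).
    width_term = (pw - tw) ** 2 / (cw * cw + eps)
    height_term = (ph - th) ** 2 / (ch * ch + eps)

    return 1.0 - iou + centre_term + width_term + height_term


# Two 2 x 2 boxes offset by one unit in x and y.
loss_example = eiou_loss((0, 0, 2, 2), (1, 1, 3, 3))
```

In the example the boxes have equal size, so the width and height terms vanish and the loss is 1 − 1/7 + 2/18 ≈ 0.968.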
The training efficiency, training time, inference latency, parameter count, and model size of the proposed CA-YOLOv10n model under different DyHead nesting configurations are presented in 
Table 7. The data clearly show that model complexity increases linearly with each additional DyHead module. Specifically, the model size increases from 0.77 MB to 1.61 MB, and the training time per epoch increases from approximately 13 to 19 min. The inference latency also rises modestly from 1.84 ms to 3.10 ms per image. These values demonstrate that although additional DyHead layers improve feature aggregation, excessive nesting results in diminishing returns in accuracy while significantly increasing computational cost. Therefore, a two-layer DyHead configuration was selected as the optimal compromise between performance and efficiency. Compared with conventional detectors such as Faster R-CNN and SSD, the proposed model exhibits significantly lower complexity while achieving higher accuracy, making it suitable for real-time, resource-constrained deployment in industrial environments.
As illustrated in 
Figure 8, the comprehensive training results are presented across ten distinct metrics.
The quantity train/box_om is the bounding box regression loss of the one-to-many head, which measures the discrepancy between the predicted bounding box and the ground-truth bounding box. As the number of training epochs increases, train/box_om gradually decreases from 1.7 to close to 1, indicating that the predicted bounding boxes gradually approach the ground-truth boxes.
The train/cls_om indicator is the classification loss of the one-to-many head, reflecting the model’s accuracy in predicting target categories. As the figure shows, train/cls_om gradually decreases from 1.4 to approximately 0.6 as training proceeds, signifying an improvement in the model’s classification performance.
The term “train/dfl_om” denotes the distribution focal loss (DFL) of the one-to-many head, which reflects the confidence and accuracy of the model’s predictions. As is evident in 
Figure 9, train/dfl_om gradually decreases from 1.6 to 1.1, indicating that the confidence and accuracy of the model’s predictions steadily increase.
Similarly, train/box_oo, train/cls_oo, and train/dfl_oo denote the bounding box regression loss, the classification loss, and the distribution focal loss of the one-to-one head, respectively. These losses also decrease, indicating that the model improves as the number of training epochs increases.
The mAP50 (mean average precision at 50% IoU) is the mean average precision when the Intersection over Union (IoU) threshold is set to 0.5 in the target detection task. Higher values are better; the value obtained in this paper is 89.9%.
Although the number of training epochs was limited to 10, we adopted several convergence-enhancing strategies, including cosine annealing learning rate scheduling, Nesterov momentum optimisation, and strong data augmentation. 
Figure 10 presents the loss curves during training, showing consistent decreases in bounding box loss, classification loss, and distribution focal loss. These trends indicate stable convergence, further confirmed by low validation variance (<0.5%) observed during cross-validation.
mAP50–95 (mean average precision from 50% to 95% IoU) is a key performance metric for target detection tasks, measuring the average precision of the model across IoU (Intersection over Union) thresholds from 0.5 to 0.95. Higher values are better. In this study, a value of 68.24% was obtained.
Recall and precision reached 87.53% and 92.69%, respectively.
Figure 10 shows the performance of the proposed model in the target detection task. The confusion matrix presents the model’s judgements in distinguishing between the “human” and “non-human” categories. The matrix shows that the model achieves high detection accuracy on the test set and correctly identifies most “human” and “unmanned” images: 140 instances of humans are correctly detected (TP) and 142 unmanned scenes are correctly identified (TN), with a small number of false detections (FP = 8) and missed detections (FN = 10). Overall, the model shows good classification ability, with both precision and recall above 90%, indicating that it captures most target objects while maintaining a low false alarm rate and offers reliable practical performance.
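The claim that both precision and recall exceed 90% can be checked directly from the reported confusion-matrix counts. The short sketch below does the arithmetic; overall accuracy and F1 are additional derived values that the paper does not report.

```python
# Confusion-matrix counts reported for the human / non-human test.
tp, tn, fp, fn = 140, 142, 8, 10

precision = tp / (tp + fp)                   # 140 / 148
recall = tp / (tp + fn)                      # 140 / 150
accuracy = (tp + tn) / (tp + tn + fp + fn)   # 282 / 300 (derived, not reported in the paper)
f1 = 2 * precision * recall / (precision + recall)  # derived, not reported in the paper

print(f"precision = {precision:.2%}")  # 94.59%
print(f"recall    = {recall:.2%}")     # 93.33%
print(f"accuracy  = {accuracy:.2%}")   # 94.00%
print(f"F1        = {f1:.2%}")         # 93.96%
```

These values apply to this binary confusion matrix only and differ from the multi-class precision (92.69%) and recall (87.53%) reported for the full detection task.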
To visualise the results of the trained model, eight images selected from the test set are shown in 
Figure 11. This selection demonstrates the model’s capacity to accurately identify miners within the mine environment.
As demonstrated in 
Figure 12, the model proposed in this paper accurately identifies images of miners working in a coal mine with a high degree of confidence, thereby fulfilling the practical requirements.
To further validate the robustness of our model in real-world mining environments, we added supplementary visualisations showcasing detection under challenging conditions such as dim lighting, occlusion, and diverse viewpoints. As shown in 
Figure 13, the model successfully identifies miners with high confidence, even when parts of the body are obscured by objects or when miners appear from uncommon angles. These results highlight the adaptability and resilience of our enhanced YOLOv10n-based architecture.
While initial visualisations emphasised the “person” class for clarity, our model is designed to detect distinct categories: miners with helmets, miners without helmets, and hydraulic supports or obstacles. 
Figure 14 provides representative examples for each class, confirming the model’s capability to handle multi-class detection in complex mining scenes.
In order to demonstrate the efficacy of the CA-YOLOv10n model investigated in this paper, a comparison is made with Faster-RCNN, SSD, and YOLOv8. The results of the study are summarised in 
Table 7.
The proposed algorithm exhibits superior detection accuracy in comparison with contemporary target detection algorithms, including Faster-RCNN and SSD. Compared with other versions of the YOLO algorithm, the CA-YOLOv10-N model improves the mean target detection accuracy. A comparison with the other three algorithms shows its superiority in both accuracy and efficiency: our model surpasses YOLOv8 by +4.05% in mAP50–95 and +4.8% in average detection accuracy, and it outperforms Faster R-CNN by over +20% in mAP50 with significantly reduced latency. Compared to SSD, it achieves a +25.6% accuracy gain, with higher reliability under low-light conditions. The model remains lightweight and suitable for edge device deployment in industrial environments (
Table 8).
Nevertheless, the proposed method has several limitations that should be acknowledged:
      
- (1) The model was evaluated only on a custom dataset collected under specific mining conditions. Its generalisation ability to other mines, especially those with different layouts, lighting, or equipment, may be limited. Although our current dataset is from a single geographic region, we incorporate significant differences in conditions (e.g., lighting, occlusion, device type) and use a strong cross-validation strategy. The low variance of mAP between validation sets (<0.5%) supports generalisation capabilities within the field. In the future, we hope that this application can be validated on multiple mine datasets.
- (2) While performance was improved, the inclusion of DyHead and CA modules increased computational complexity. In particular, DyHead introduces multiple attention branches, and CA performs dual-directional pooling and fusion, which may impact inference speed on real-time or low-power embedded devices.
- (3) No evaluation has yet been conducted on publicly available mining datasets or cross-domain benchmarks (e.g., MineScape, SafetyHelmet), which limits the current evidence for broad applicability.
- (4) The current system operates in a frame-by-frame manner without leveraging temporal information or object tracking mechanisms. This may reduce detection robustness under fast motion, overlapping miners, or temporary occlusions.
These limitations will be addressed in future work through dataset expansion, lightweight module optimisation, and the integration of spatiotemporal modelling for video-based detection. Future efforts will also draw on edge-deployment strategies proven by C-Mask and explore spectral-descriptor fusion techniques inspired by to improve robustness for small or occluded miner targets.