3.1. Overall Network Architecture
The YOLO series [26,27,28,29,30,31,32,33,34] has emerged as a leading framework for real-time object detection. YOLOv12 [29], released in February 2025, is the latest version of the series. However, YOLOv11 [30,31] was selected as the baseline for RSLH-YOLO. As shown in Table 1, YOLOv11 achieves the highest mAP and precision while maintaining manageable GFLOPs and FPS among the YOLO series, an essential consideration given the stringent hardware constraints in mining environments. Furthermore, the smallest variant, YOLOv11n, is adopted for the same reason: as shown in Table 2, it balances accuracy and efficiency under the limited computational resources available on site.
As shown in Figure 1a, the architecture of YOLOv11 consists of three main components: Backbone, Neck, and Head. The Backbone focuses on feature extraction, employing C3k2 (Cross Stage Partial block with kernel size 2), SPPF (Spatial Pyramid Pooling Fast), and C2PSA (Convolutional Block with Parallel Spatial Attention) components. These components transform raw image data into multi-scale feature maps. These feature maps, which contain rich local and global information, are subsequently passed to the Neck for advanced integration. The Neck serves as an intermediate processing stage, employing specialized layers to aggregate and enhance feature representations across different scales. The enhanced feature maps generated by the Neck are then routed to the Head for the final detection tasks. The Head serves as the prediction mechanism, generating the final outputs for object localization and classification based on the refined feature maps. Using a multi-branch design, the Head processes the multi-scale features provided by the Neck; each branch is optimized to detect objects of a specific size range, including small, medium, and large targets. The final detection outputs, encompassing object categories, bounding box coordinates, and confidence scores, are produced by the Head.
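For readers who prefer code, the three-stage layout described above can be summarized by the following schematic sketch. It is intentionally abstract: the concrete YOLOv11 blocks (C3k2, SPPF, C2PSA, the PAFPN neck, and the decoupled head) are hidden behind generic modules, and all names are placeholders rather than the framework's actual classes.

```python
import torch.nn as nn

class YOLOStyleDetector(nn.Module):
    """Schematic Backbone -> Neck -> multi-branch Head pipeline (illustrative only)."""

    def __init__(self, backbone: nn.Module, neck: nn.Module, heads: nn.ModuleList):
        super().__init__()
        self.backbone = backbone   # image -> multi-scale feature maps
        self.neck = neck           # aggregates and enhances features across scales
        self.heads = heads         # one prediction branch per target size

    def forward(self, images):
        p3, p4, p5 = self.backbone(images)        # small / medium / large-stride features
        n3, n4, n5 = self.neck(p3, p4, p5)        # fused multi-scale features
        # Each branch outputs class scores, box coordinates, and confidences at its scale.
        return [head(n) for head, n in zip(self.heads, (n3, n4, n5))]
```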
The architecture of the improved RSLH-YOLO network is illustrated in Figure 1b. As shown in Table 3, this paper improves the YOLOv11n model primarily in three aspects: the backbone, the neck, and the head. The specific enhancements are as follows. First, the backbone is optimized by replacing the Conv, C3K2, and SPPF layers with RFAConv, C3K2_RFAConv, and FPSConv, respectively. These replacements expand the receptive field, enhance contextual information capture, reduce information loss, and enable multi-scale feature extraction. Subsequently, a ReCa-PAFPN integrating SBA and C3K2_RFAConv is introduced to replace the original neck of YOLOv11, improving feature fusion and addressing large variations in object scale. To better detect small objects, a dedicated small-target detection layer is added to the head. Finally, to keep the model lightweight and better handle occlusion, a lightweight EGNDH detection head replaces the original detection head.
3.2. Improvement of Backbone
3.2.1. C3K2_RFAConv Module
The C3K2 module contains multiple Bottleneck structures, where the serial connection of these residual modules enables feature extraction and fusion across different scales. However, the limitations of the Bottleneck residual modules are evident in mining environments. Mining scenarios present unique challenges that severely test the capabilities of standard object detection architectures. For instance, in open pit mining operations, targets such as personnel, machinery, and vehicles often blend with complex backgrounds. These backgrounds are characterized by rock faces, shadows, and similar-colored equipment. This challenging condition requires detection models to maintain precise localization capabilities while effectively distinguishing targets from intricate backgrounds.
The limitations of the standard Bottleneck modules in C3K2 become particularly problematic under mining-specific conditions, where targets blend with complex backgrounds characterized by rock faces, shadows, and similar-colored equipment. These modules cause the network to focus excessively on high-frequency positional information within localized regions. Because the standard Bottleneck structure uses fixed convolution operations without adaptive attention, the effective receptive field remains limited to the immediate neighborhood of each position. This limited receptive field neglects the surrounding contextual information at each position, potentially losing details that are crucial for distinguishing targets from complex backgrounds. Moreover, the fixed convolutions treat all positions within the receptive field equally and lack an adaptive attention mechanism to prioritize informative spatial locations. As a result, when targets are partially blended with backgrounds or when contextual cues are critical for accurate localization, the network cannot effectively focus on the most relevant information. Consequently, the model struggles to capture the subtle distinctions between targets and complex mining backgrounds, leading to degraded localization accuracy and increased false positives.
To enhance the detection accuracy of targets in complex mining backgrounds, the RFAConv [35] module is introduced. This module improves both the backbone network and the C3K2 module in the neck network, as illustrated in Figure 1b.
Unlike SE (Squeeze-and-Excitation), which operates at the channel level, or CBAM, which sequentially applies channel and spatial attention as separate modules, RFAConv integrates attention directly into the convolution process. Specifically, RFAConv learns position-specific attention weights within each k×k receptive field window before feature aggregation. This enables fine-grained spatial attention—attention that operates at the pixel level within each receptive field window rather than at the channel or global spatial level—that adaptively weights different positions based on their relevance to the target.
This design provides several key advantages for mining detection scenarios. First, the fine-grained spatial attention mechanism allows the model to focus on critical contextual cues at each position rather than treating all positions equally, which is essential when targets blend with rock faces, shadows, and dust. Second, by expanding the effective receptive field while maintaining computational efficiency comparable to standard convolution, RFAConv enables better capture of surrounding contextual information. This information is crucial for distinguishing targets from complex backgrounds. Third, the receptive-field-aware design—which explicitly considers the spatial structure of the receptive field when computing attention weights—provides superior localization capability compared to standard convolution and sequential attention mechanisms. This makes it particularly effective for scenarios where targets are partially blended with backgrounds or where precise boundary localization is critical. These are common challenges in mining environments.
The integration of C3K2_RFAConv into the YOLOv11n architecture is implemented systematically across both the backbone and neck networks, as illustrated in Figure 1. As Figure 2b shows, the CBS blocks of the standard Bottleneck structures within the C3K2 modules are replaced with RFAConv in both the backbone and the neck. This integration strategy ensures that the enhanced contextual feature extraction capability of RFAConv is consistently applied across both the feature extraction (backbone) and feature fusion (neck) stages, strengthening the model's overall ability to distinguish targets from intricate mining backgrounds while maintaining the lightweight characteristics suitable for resource-constrained mining environments.
The structure of RFAConv is shown in Figure 3. RFAConv is calculated as follows:

F = Softmax(g_{i×i}(AvgPool(X))) × ReLU(Norm(g_{k×k}(X))) = A_rf × F_rf

where g_{i×i} denotes a grouped convolution with a kernel size of i × i, k represents the convolution kernel size, Norm indicates normalization, X denotes the input feature map, and F is obtained by multiplying the attention map A_rf with the transformed receptive-field spatial features F_rf.
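To make the formula concrete, the following is a minimal PyTorch-style sketch of a receptive-field attention convolution with the same structure: a grouped convolution branch for the receptive-field features F_rf, a pooled grouped branch with a softmax over each k × k window for the attention map A_rf, element-wise multiplication, and a final stride-k convolution over the rearranged windows. The layer names, the use of average pooling and BatchNorm, and the other hyper-parameters are assumptions for illustration, not the exact implementation of [35].

```python
import torch
import torch.nn as nn

class RFAConvSketch(nn.Module):
    """Receptive-field attention convolution, following
    F = Softmax(g_{ixi}(AvgPool(X))) * ReLU(Norm(g_{kxk}(X))) = A_rf * F_rf."""

    def __init__(self, in_ch, out_ch, k=3, stride=1):
        super().__init__()
        self.k = k
        # Attention branch: pooling + grouped 1x1 conv -> one weight per position
        # of each k x k receptive field.
        self.get_weight = nn.Sequential(
            nn.AvgPool2d(kernel_size=k, padding=k // 2, stride=stride),
            nn.Conv2d(in_ch, in_ch * k * k, kernel_size=1, groups=in_ch, bias=False),
        )
        # Feature branch: grouped k x k conv expands each position into k*k
        # receptive-field features, followed by Norm + ReLU.
        self.generate_feature = nn.Sequential(
            nn.Conv2d(in_ch, in_ch * k * k, kernel_size=k, padding=k // 2,
                      stride=stride, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch * k * k),
            nn.ReLU(inplace=True),
        )
        # Final convolution applied to the attention-weighted receptive fields.
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=k)

    def forward(self, x):
        b, c, _, _ = x.shape
        weight = self.get_weight(x)
        h, w = weight.shape[2:]
        a_rf = weight.view(b, c, self.k * self.k, h, w).softmax(dim=2)     # A_rf
        f_rf = self.generate_feature(x).view(b, c, self.k * self.k, h, w)  # F_rf
        # F = A_rf * F_rf, rearranged so every k x k window becomes a spatial block.
        f = (a_rf * f_rf).view(b, c, self.k, self.k, h, w)
        f = f.permute(0, 1, 4, 2, 5, 3).reshape(b, c, h * self.k, w * self.k)
        return self.conv(f)
```

Within C3K2_RFAConv, a block of this kind takes the place of the CBS convolution inside each Bottleneck, as shown in Figure 2b.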
3.2.2. FPSConv Module
SPPF plays a significant role in YOLOv11 by repeatedly using the same max-pooling layer for multiple pooling operations. This reduces computational complexity and accelerates the processing speed while enabling multi-scale feature extraction. However, mining scenarios present unique challenges that expose the fundamental limitations of fixed-window pooling approaches. In open pit mining operations, the detection system must simultaneously handle extreme scale variations. These variations occur between small distant workers (often appearing as only a few pixels) and large engineering vehicles (trucks, excavators) that occupy significant portions of the image. These challenging conditions demand multi-scale feature extraction mechanisms that can adaptively capture both local details and global context while preserving critical spatial information across different scales.
The limitations of SPPF become particularly problematic under these mining-specific conditions. As shown in Figure 4a, the fixed pooling window size and repetition count in SPPF may lead to information loss at certain scales. This is particularly problematic when handling very large or extremely small targets, as SPPF cannot adequately adapt to all types of input data or application scenarios. Specifically, SPPF's reliance on fixed-window max pooling creates several critical drawbacks. (1) The fixed window size cannot adapt to the diverse scale distribution of mining targets, causing information loss for targets that do not match the predefined pooling scales. (2) Max pooling inherently discards detailed spatial information by selecting only the maximum value within each window, which is particularly detrimental for small targets whose fine-grained features are essential for accurate detection. (3) The non-learnable nature of pooling means that SPPF cannot adapt its feature extraction strategy to the input content and therefore cannot prioritize critical regions or adjust to varying target characteristics. Together, these limitations significantly constrain the algorithm's performance and practicality when dealing with complex variations and challenging ill-posed regions in mining scenarios, leading to degraded detection accuracy for both small workers and large vehicles.
To effectively address the aforementioned challenges, we propose the FPSConv module, as illustrated in Figure 4b. FPSConv addresses the limitations of SPPF through several key advantages designed for mining scenarios. First, by employing shared-kernel dilated convolution cascades, in which the same convolutional weights are reused with different dilation rates (d = 1, 3, 5), FPSConv captures finer-grained features while preserving detailed spatial information; in contrast, the pooling operations in SPPF may lose detail that is critical for distinguishing small workers from complex backgrounds or for accurately localizing vehicle boundaries. Convolutional operations also offer greater flexibility and representational capacity during feature extraction, enabling better capture of image details and complex patterns that are essential for mining safety monitoring. Second, through convolutional layers with different dilation rates, the module extracts multi-scale features that adaptively capture information at varying spatial scales: low dilation rates (d = 1) capture the local details crucial for small-target detection, while high dilation rates (d = 5) capture the global context necessary for large vehicle recognition. This adaptive multi-scale capability enables FPSConv to handle the extreme scale variations characteristic of mining environments. Third, the shared convolutional layers significantly reduce the number of trainable parameters compared with independent convolutional layers for each dilation rate; this shared-kernel design reduces redundancy, lowers storage and computational costs, and improves efficiency. Such parameter efficiency is particularly important for resource-constrained mining equipment, where computational resources are limited but high accuracy is required for safety-critical applications. Fourth, unlike SPPF's fixed, non-learnable pooling operations, FPSConv's convolutional operations are learnable, allowing the network to adaptively adjust its feature extraction strategy to the input content and learn optimal representations for mining-specific scenarios.
As shown in Figure 1b, the integration of FPSConv into the YOLOv11n architecture is implemented at a critical location in the backbone network: the SPPF module located after Stage 4 of the backbone, following the final C3K2 module, is replaced with the FPSConv module. Placing FPSConv at this point ensures that it operates on high-level semantic features that have already been processed through the four-stage feature extraction pipeline, enabling efficient multi-scale refinement of rich semantic representations. The replacement occurs at the final stage of the backbone, where the feature maps have been downsampled to a resolution suitable for multi-scale feature aggregation. At this location, FPSConv receives feature maps from the C3K2_RFAConv modules, which have already enhanced contextual information through receptive-field-aware attention, and further refines them through adaptive multi-scale convolution. The output of FPSConv then feeds into the neck network (ReCa-PAFPN), where the enhanced multi-scale features are fused with features from other stages through the SBA module. This integration strategy ensures that the improved multi-scale feature extraction of FPSConv complements the receptive-field enhancement provided by C3K2_RFAConv, creating a synergistic effect that strengthens the model's capability to handle complex mining backgrounds, extreme scale variations, and challenging detection scenarios while maintaining computational efficiency suitable for real-time deployment in mining safety monitoring systems. The FPSConv workflow is as follows:
Firstly, the input tensor x undergoes transformation through a convolutional layer, generating a feature map y with halved channel count. This stage reduces the computational complexity for subsequent operations and prepares for multi-scale feature extraction.
In the multi-scale feature extraction phase, shared convolutional weights are used to perform dilated convolutions with different dilation rates on these feature maps. Each dilated convolution output is added element-wise to the previous sum and appended to the results collection. This incremental addition strategy enables the model to capture information across multiple scales, enhancing its understanding of multi-scale targets.
During the feature fusion stage, all feature maps from different scales are concatenated along the channel dimension. This step integrates information from various scales, forming a more comprehensive and enriched feature representation. Subsequently, the concatenated feature maps pass through another convolutional layer to adjust to the desired output channel count, ensuring the output format aligns with requirements.
Finally, the feature map obtained after undergoing the aforementioned processing steps constitutes the final output of this process. The entire workflow is designed to optimize feature extraction and fusion, thereby enhancing the model’s capability in target detection and recognition across various scales.
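The workflow above can be expressed as a short sketch. It is one possible reading, assuming a 3 × 3 shared kernel, SiLU activations, dilated convolutions applied to the running sum, and the stem output included in the final concatenation; these details are not fixed by the description, so the real FPSConv may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPSConvSketch(nn.Module):
    """FPSConv-style block: channel-halving stem, one shared 3x3 kernel reused with
    dilation rates d = 1, 3, 5, incremental accumulation, concat, 1x1 projection."""

    def __init__(self, in_ch, out_ch, dilations=(1, 3, 5)):
        super().__init__()
        hidden = in_ch // 2
        self.dilations = dilations
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden), nn.SiLU(inplace=True))
        # A single shared 3x3 weight, reused for every dilation rate.
        self.shared = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, bias=False)
        self.proj = nn.Sequential(
            nn.Conv2d(hidden * (len(dilations) + 1), out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.SiLU(inplace=True))

    def forward(self, x):
        y = self.stem(x)                      # halve the channel count
        outs, acc = [y], y
        for d in self.dilations:
            # Reuse the shared kernel with a different dilation rate.
            z = F.conv2d(acc, self.shared.weight, padding=d, dilation=d)
            acc = acc + z                     # incremental element-wise accumulation
            outs.append(acc)
        return self.proj(torch.cat(outs, dim=1))  # fuse the scales, restore channels
```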
3.3. Improvement of Neck
The shallow layers contain less semantic information but are rich in details, exhibiting more distinct boundaries and less distortion, whereas the deeper layers encapsulate abundant semantic information. Within the framework of object detection algorithms, the neck network constructs a Path Aggregation Network that incorporates rich semantic features through feature fusion and multi-scale detection, aiming to minimize the loss of semantic feature information. However, mining scenarios present unique challenges that severely test traditional neck network architectures. In open pit mining operations, the neck network must effectively fuse features across extreme scale variations: small distant workers (often appearing as only a few pixels) coexist with large engineering vehicles (trucks, excavators) that occupy significant portions of the image. Additionally, complex backgrounds characterized by rock faces, shadows, and similar-colored equipment create ambiguous boundaries between targets and backgrounds, requiring the neck network to preserve both fine-grained spatial details from shallow layers and rich semantic information from deep layers. These challenging conditions require neck networks to achieve effective multi-scale feature fusion while maintaining computational efficiency suitable for real-time safety monitoring applications.
However, traditional networks that directly fuse low-resolution and high-resolution features may lead to redundancy and inconsistency, thereby compromising the fusion effectiveness. The limitations of the original YOLOv11n neck network become particularly problematic under these mining-specific conditions. In the task of detecting personnel and vehicles in complex mining environments, the performance of the neck network is often constrained by multiple factors. These factors include interference from intricate backgrounds, diverse object categories, significant scale variations, and challenges with small targets. Specifically, the original YOLOv11n neck network employs a standard PAFPN structure that uses simple Upsample, Conv, and Concat operations for feature fusion. This design creates several critical drawbacks. (1) The direct concatenation or addition of features from different resolutions without adaptive weighting mechanisms leads to redundant information propagation and inconsistent feature representations across scales. This is particularly problematic when fusing shallow high-resolution features (rich in details but less semantic) with deep low-resolution features (rich in semantics but less detailed). (2) The fixed fusion strategy treats all spatial locations and channels equally, without considering the varying importance of different features for different target types (small workers vs. large vehicles). This makes it unable to adaptively emphasize critical information while suppressing background noise. (3) The lack of explicit compensation mechanisms for missing boundary information in high-level features and missing semantic information in low-level features results in incomplete feature representations. This degrades detection accuracy, especially for small targets where both fine-grained boundaries and semantic context are essential. These limitations significantly constrain the algorithm’s performance when dealing with complex mining scenarios characterized by extreme scale variations, ambiguous boundaries, and frequent occlusions.
To enhance the detection performance of RSLH-YOLO in mining scenarios, we optimized and refined the neck of the original YOLOv11 and propose a ReCa-PAFPN structure, as illustrated in Figure 1. The ReCa-PAFPN structure addresses the aforementioned limitations through several key advantages designed for mining scenarios. First, within the ReCa-PAFPN framework, the Spatial Branch Attention (SBA) module [36] is introduced as a neural network component designed to handle high- and low-resolution feature maps and to improve feature fusion through attention mechanisms. Unlike traditional direct concatenation or addition, SBA learns two complementary gates for the shallow and deep features and uses them to recalibrate what information should be exchanged in each direction, enabling adaptive weighting of features based on their relevance to the detection task. This adaptive fusion allows the network to emphasize critical information while suppressing redundant or noisy features, which is particularly effective in mining scenarios where targets must be distinguished from complex backgrounds. Second, the SBA module implements a bidirectional fusion mechanism between high-resolution and low-resolution features that ensures a more thorough information exchange. The RAU (Recalibration and Aggregation Unit) block within SBA explicitly compensates for the missing boundary information in high-level features and the missing semantic information in low-level features before fusion, a compensation rarely considered in standard pyramid designs. This mechanism ensures that both fine-grained spatial details and rich semantic information are preserved and effectively integrated, making SBA particularly effective for mining scenes where small workers and large engineering vehicles coexist and where boundaries between targets and background slopes are often ambiguous. Third, the adaptive attention mechanism dynamically adjusts feature weights based on the varying resolutions and content of the feature maps, enabling better capture of multi-scale target characteristics and assisting the model in capturing long-range contextual dependencies, which is particularly critical for detecting small objects and handling occlusions in complex mining environments. Furthermore, ReCa-PAFPN incorporates an additional small-object detection layer (P2 layer) to maximize the detection of small targets and enhance the model's overall detection accuracy. This additional detection layer operates on high-resolution feature maps, providing dedicated capacity for small-target detection that is essential for worker safety monitoring in mining environments.
As shown in Figure 1b, the integration of ReCa-PAFPN into the YOLOv11n architecture is implemented as a comprehensive replacement of the original neck network. Specifically, the standard PAFPN neck, which uses Upsample, Conv, Concat, and standard C3K2 modules, is entirely replaced with the ReCa-PAFPN structure. In the top–down (FPN-like) path, the original Upsample and Concat operations are replaced with SBA modules, which perform adaptive bidirectional fusion between high-level semantic features (from deeper backbone stages) and low-level detailed features (from shallower backbone stages). Each SBA module receives features from two different scales and outputs a refined feature map that combines the strengths of both inputs through learned attention gates. Following each SBA module, the standard C3K2 modules are replaced with C3K2_RFAConv modules, which further enhance the fused features through receptive-field-aware attention. In the bottom–up (PAN-like) path, the same replacement strategy is applied: SBA modules replace the original Conv and Concat operations for downsampling and feature fusion, while C3K2_RFAConv modules replace the standard C3K2 modules. This comprehensive replacement ensures that all feature fusion operations in the neck benefit from adaptive attention and enhanced contextual feature extraction. Additionally, ReCa-PAFPN extends the original three-scale detection (P3, P4, P5) to four scales by incorporating an additional P2 detection layer, which operates on the highest-resolution feature maps (160 × 160) to enhance small-target detection. The output feature maps from ReCa-PAFPN at the four scales are then fed into the EGNDH detection head for the final predictions. This integration strategy ensures that the improved multi-scale feature fusion of ReCa-PAFPN works synergistically with the enhanced backbone features (from C3K2_RFAConv and FPSConv) and the lightweight detection head (EGNDH), creating a coherent pipeline that effectively addresses large-scale variations, wide spatial distribution, and severe occlusions in mining safety monitoring scenarios while maintaining real-time inference capability.
As one of the most important parts of the ReCa-PAFPN, the SBA module is illustrated in Figure 5. Its core is the RAU block, which adaptively selects common representations from its two inputs (F_s, F_b) before fusion. F_b and F_s, which denote the shallow-level and deep-level information, respectively, are fed into two RAU blocks through distinct pathways. This compensates for the missing spatial boundary information in high-level semantic features and the absent semantic information in low-level features. Finally, the outputs of the two RAU blocks are concatenated and passed through a 3 × 3 convolution. This aggregation strategy achieves a robust combination of different features and refines the coarse features. The RAU block function RAU(·,·) can be expressed as:
RAU(T_1, T_2) = T_1 + T_1' ⊙ T_1 + ⊖(T_1') ⊙ (T_2' ⊙ T_2)

Here, T_1 and T_2 are the input features. Two linear transformations followed by sigmoid functions σ(·) are applied to the input features to reduce the channel dimension to a specific channel count C, yielding the gating maps T_1' = σ(W_1 T_1) and T_2' = σ(W_2 T_2). The symbol ⊙ represents point-wise multiplication, and ⊖(T_1') = 1 − T_1' represents an inverse operation performed by subtracting the feature T_1' from 1, refining imprecise and coarse estimates into accurate and complete predictions [36]. We employ 1 × 1 convolutions as the linear transformations. Therefore, the SBA process can be formulated as:

Z = C_{3×3}(Concat(RAU(F_s, F_b), RAU(F_b, F_s)))

where C_{3×3}(·) is a 3 × 3 convolution with batch normalization and ReLU activation, F_s contains deep semantic information following the third and fourth layers of the fusion encoder, F_b represents boundary-rich details extracted from the first layer of the backbone network, Concat(·,·) denotes the concatenation operation along the channel dimension, and Z is the output of the SBA module. The code of the SBA module as described above can be found in [36].
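A minimal sketch of the SBA fusion consistent with the description above is given below. It assumes that both inputs have already been projected to the same channel count; the residual term, the bilinear resampling, and the exact gating layout are modeled loosely on the public code of [36] and should be treated as illustrative rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RAU(nn.Module):
    """Recalibration unit: T1 is re-weighted by its own gate, and the information it
    lacks is borrowed from the gated partner T2 through the inverted gate (1 - gate)."""

    def __init__(self, ch):
        super().__init__()
        self.gate1 = nn.Conv2d(ch, ch, kernel_size=1, bias=False)  # linear transform W1
        self.gate2 = nn.Conv2d(ch, ch, kernel_size=1, bias=False)  # linear transform W2

    def forward(self, t1, t2):
        g1 = torch.sigmoid(self.gate1(t1))   # T1'
        g2 = torch.sigmoid(self.gate2(t2))   # T2'
        partner = F.interpolate(g2 * t2, size=t1.shape[2:], mode="bilinear",
                                align_corners=False)
        return t1 + g1 * t1 + (1.0 - g1) * partner

class SBA(nn.Module):
    """Two RAU branches (deep->shallow and shallow->deep), channel concatenation,
    then a 3x3 Conv-BN-ReLU, i.e. Z = C3x3(Concat(RAU(Fs, Fb), RAU(Fb, Fs)))."""

    def __init__(self, ch):
        super().__init__()
        self.rau_s = RAU(ch)   # recalibrates the deep / semantic branch F_s
        self.rau_b = RAU(ch)   # recalibrates the shallow / boundary branch F_b
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * ch, ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True))

    def forward(self, f_s, f_b):
        z_s = self.rau_s(f_s, f_b)
        z_b = self.rau_b(f_b, f_s)
        z_s = F.interpolate(z_s, size=z_b.shape[2:], mode="bilinear",
                            align_corners=False)   # bring both branches to one resolution
        return self.fuse(torch.cat([z_s, z_b], dim=1))
```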
3.4. Improvement of Head
The detection head serves as the final stage of object detection networks, responsible for converting multi-scale feature maps from the backbone and neck networks into precise bounding box coordinates and class predictions. However, mining scenarios present unique challenges that severely test the capabilities of traditional detection head architectures. In open pit mining operations, frequent occlusions among workers, vehicles, and equipment demand robust detection mechanisms. These mechanisms can accurately predict bounding box coordinates and classifications even when partial target information is missing. This challenging condition requires detection heads to achieve high accuracy while maintaining computational efficiency suitable for real-time safety monitoring applications in resource-constrained mining environments. GPU memory limitations and batch size constraints are common in these environments.
As shown in Figure 6a, the original detection head of the YOLOv11 network consists of three convolutional layers and a loss computation component, and each convolutional layer is trained independently with its own parameters. This structural design demands significant computational resources. Moreover, in object detection scenarios, input images typically have large dimensions, and GPU memory limitations prevent the use of batches with many images; batches with few images lead to unstable estimates of the mean and variance, which adversely affects both the model's lightweight design and its accuracy. Additionally, the three convolutional layers operate independently without information sharing, hindering effective acquisition of contextual information; this reduces the model's ability to handle occlusion, leading to increased false positives and missed detections. Specifically, the original YOLOv11n detection head exhibits several critical limitations that become particularly problematic under mining-specific conditions. (1) The independent training of three separate convolutional layers with distinct parameters results in significant parameter redundancy and increased computational overhead, which is especially problematic in resource-constrained mining monitoring systems where computational efficiency is crucial. (2) The reliance on Batch Normalization (BN) creates a fundamental dependency on batch statistics (mean and variance), which becomes unstable when batch sizes are small due to GPU memory constraints. In mining detection scenarios, where high-resolution images are necessary to capture small workers and large vehicles simultaneously, batch sizes are often limited to eight or fewer images, leading to inaccurate batch statistics that degrade both localization and classification performance. (3) The lack of parameter sharing between the three convolutional layers prevents effective information exchange and contextual understanding across the detection scales (P2, P3, P4, P5), making it difficult for the network to leverage complementary information from different feature resolutions; this is particularly severe in mining scenarios where targets at different scales (small workers vs. large vehicles) require coordinated detection strategies. (4) The independent operation of the convolutional layers, without shared contextual information extraction, hinders the model's ability to capture long-range dependencies and global scene understanding, which are essential for handling complex occlusions and distinguishing targets from cluttered backgrounds; this results in increased false positives (detecting background elements as targets) and missed detections (failing to detect partially occluded targets). These limitations severely compromise detection efficiency and accuracy in resource-constrained settings such as mine personnel and vehicle detection, where occlusions are common and computational resources are limited.
To address these issues, this paper introduces EGNDH (Efficient Group Normalization Detection Head), which primarily consists of convolutional layers, GN (Group Normalization) components, two serially shared convolutional layers, Conv2d modules, and scale modules. GN enhances the model's localization and classification performance while reducing the dependency on batch size, and sharing convolutional kernel parameters across layers makes the model both more accurate and more lightweight. As shown in Figure 6b, the EGNDH structure addresses the aforementioned limitations through several key advantages designed for mining scenarios. First, Group Normalization (GN) replaces Batch Normalization (BN) as the normalization mechanism, fundamentally eliminating the dependency on batch statistics. Unlike BN, which computes normalization statistics across the batch dimension and becomes unstable with small batch sizes, GN divides the channels into groups and computes statistics within each group independently of the batch size. This ensures stable and accurate normalization even for batch sizes as small as 1, making EGNDH particularly suitable for mining detection scenarios where GPU memory constraints limit the batch size. The stable normalization provided by GN significantly enhances both localization accuracy (precise bounding box prediction) and classification accuracy (correct class prediction), which is especially critical for detecting small workers and distinguishing them from similar background elements. Second, the parameter sharing mechanism implemented through two serially shared convolutional layers dramatically reduces the total parameter count while improving detection performance. Instead of maintaining independent sets of convolutional parameters for each detection scale, EGNDH shares convolutional kernels across all detection scales (P2, P3, P4, P5), enabling the network to learn common feature representations that benefit all scales. This parameter sharing not only reduces the memory footprint and computational cost, making the model more lightweight and suitable for resource-constrained mining monitoring systems, but also facilitates information exchange and knowledge transfer across scales, allowing the network to leverage complementary information from high-resolution features (for small targets) and low-resolution features (for large targets). Third, the serial architecture of the shared convolutional layers creates a progressive feature refinement pipeline in which features from different scales are processed through the same learned transformations, ensuring consistent feature representations across scales. This consistency is particularly important in mining scenarios where the same target types (workers, vehicles) appear at multiple scales, enabling the network to maintain coherent detection strategies. Fourth, the integration of GN with shared convolutions creates a synergistic effect: GN provides the stable normalization that enables effective parameter sharing, while the shared parameters allow GN to learn more generalizable normalization statistics that benefit all detection scales. This synergy improves detection accuracy, especially for challenging cases such as small targets, occluded objects, and targets with ambiguous boundaries, which are common in mining environments. Furthermore, the lightweight design of EGNDH, achieved through parameter sharing and efficient GN operations, maintains real-time inference capability while significantly improving detection performance, making it ideal for real-time safety monitoring in mining operations. The structure of the EGNDH module is illustrated in Figure 6.
The integration of EGNDH into the YOLOv11n architecture is implemented as a comprehensive replacement of the original detection head. Specifically, the standard YOLOv11n detection head, which consists of three independent convolutional layers (each with its own BN and activation functions) operating separately on features from different scales, is entirely replaced with the EGNDH structure. In the EGNDH architecture, the feature maps extracted from the Backbone and Neck components (labeled P2, P3, P4, and P5) are first processed by scale-specific GN-convolution layers, which perform scale-specific feature extraction and normalization using Group Normalization to ensure stable normalization regardless of batch size. The processed features from all scales are then fed into two serially connected shared convolutional layers, which apply the same learned transformations to features from every scale; this shared processing enables parameter sharing across scales, dramatically reducing the total parameter count while facilitating information exchange between different feature resolutions. The outputs of the shared convolutional layers are then split into two parallel branches, the bounding box prediction branch and the classification prediction branch, whose results are concatenated along the channel dimension. The same architecture is used for inference, optimized for real-time performance. The integration of EGNDH ensures that the improved normalization stability (through GN) and parameter efficiency (through shared convolutions) work synergistically with the enhanced backbone features (from C3K2_RFAConv and FPSConv) and the improved neck network (ReCa-PAFPN), creating a coherent pipeline that effectively addresses large-scale variations, wide spatial distribution, severe occlusions, and resource constraints in mining safety monitoring while maintaining real-time inference capability. The lightweight design of EGNDH, combined with its stable normalization and parameter sharing mechanisms, makes it particularly suitable for deployment in resource-constrained mining monitoring systems where computational efficiency and detection accuracy must be balanced.
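The layout described above can be sketched as follows: per-scale Conv-GN-SiLU stems, two shared 3 × 3 Conv-GN-SiLU blocks applied to every scale, shared 1 × 1 regression and classification heads, and a learnable per-scale Scale factor on the regression logits. The channel counts, reg_max, class count, and group number are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class EGNDHSketch(nn.Module):
    """GN-based detection head with shared convolutions across scales (illustrative)."""

    def __init__(self, in_channels=(64, 128, 256, 512), hidden=64,
                 reg_max=16, num_classes=3, groups=16):
        super().__init__()
        # Scale-specific stems: 1x1 Conv + GroupNorm + SiLU (batch-size independent).
        self.stems = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c, hidden, 1, bias=False),
                          nn.GroupNorm(groups, hidden), nn.SiLU(inplace=True))
            for c in in_channels])
        # Two serially shared blocks: one set of weights reused by every scale.
        self.shared = nn.Sequential(
            nn.Conv2d(hidden, hidden, 3, padding=1, bias=False),
            nn.GroupNorm(groups, hidden), nn.SiLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1, bias=False),
            nn.GroupNorm(groups, hidden), nn.SiLU(inplace=True))
        self.reg_head = nn.Conv2d(hidden, 4 * reg_max, 1)   # box-distribution logits
        self.cls_head = nn.Conv2d(hidden, num_classes, 1)   # class logits
        self.scales = nn.Parameter(torch.ones(len(in_channels)))  # per-scale Scale_s

    def forward(self, feats):
        outs = []
        for i, x in enumerate(feats):            # feats = [P2, P3, P4, P5]
            y = self.shared(self.stems[i](x))    # scale-specific stem, shared refinement
            box = self.scales[i] * self.reg_head(y)
            cls = self.cls_head(y)
            outs.append(torch.cat([box, cls], dim=1))
        return outs
```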
The theoretical computation of EGNDH, implemented as a lightweight shared detail-enhanced convolutional detection head, can be decomposed into the following steps.
Assume the neck outputs four feature maps F_s ∈ R^(C_s×H_s×W_s) for s ∈ {2, 3, 4, 5}, corresponding to P2–P5. All scales are first lifted to a unified hidden dimension C by per-scale stem convolutions and then processed by shared detail-enhanced convolutions and prediction heads.
For each scale s, the stem block uses a 1 × 1 convolution, followed by Group Normalization with a fixed number of groups (G = 16) and a SiLU activation:

Y_s = SiLU(GN_16(Conv_{1×1}(F_s)))

where GN_16 denotes GroupNorm with 16 groups. This operation costs approximately H_s·W_s·C_s·C FLOPs per scale.
The intermediate features Y_s from all scales are then passed through a shared sequence of two detail-enhanced convolution blocks, each consisting of a 3 × 3 transposed convolution (stride 1), followed by GroupNorm (G = 16) and a nonlinear activation. For each scale s:

Z_s = SiLU(GN_16(DConv_{3×3}^{(2)}(SiLU(GN_16(DConv_{3×3}^{(1)}(Y_s))))))

where the spatial resolution is preserved and the channel dimension remains C. Each shared 3 × 3 block contributes roughly 9·H_s·W_s·C² FLOPs, but the weights are shared across all four scales, so the parameter count does not grow with the number of levels.
For each scale s, the refined feature Z_s is fed into two 1 × 1 convolutional heads that are shared across spatial locations but separate for regression and classification:

B_s = Conv_{1×1}^{reg}(Z_s), P_s = Conv_{1×1}^{cls}(Z_s)

where reg_max is the number of bins for distributional box regression and n_c is the number of classes. Thus B_s ∈ R^(4·reg_max×H_s×W_s) and P_s ∈ R^(n_c×H_s×W_s). The Scale_s layer is a learnable scalar parameter applied element-wise to the regression logits at each scale.
At each scale, the regression and classification logits are concatenated along the channel dimension:

O_s = Concat(Scale_s·B_s, P_s) ∈ R^((4·reg_max+n_c)×H_s×W_s)

During decoding, a Distribution Focal Loss (DFL) layer, which models each bounding box coordinate as a probability distribution over discrete bins, converts the regression logits into four continuous offsets, and a distance-to-bbox mapping uses the anchors and strides to obtain the final bounding boxes. The classification logits are passed through a sigmoid function to yield class probabilities.
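As a concrete illustration of this decoding step, the sketch below shows the standard DFL expectation decoding and the distance-to-box mapping used by YOLOv8/v11-style heads; the tensor layouts are simplified (anchors are assumed to be given per prediction point, and all quantities are in stride units).

```python
import torch

def dfl_decode(reg_logits: torch.Tensor, reg_max: int = 16) -> torch.Tensor:
    """reg_logits: (N, 4 * reg_max) logits for N prediction points.
    Each of the 4 box sides is a distribution over reg_max bins; the continuous
    offset is the expectation of that distribution."""
    n = reg_logits.shape[0]
    probs = reg_logits.view(n, 4, reg_max).softmax(dim=-1)   # per-side bin distribution
    bins = torch.arange(reg_max, dtype=probs.dtype)          # bin indices 0..reg_max-1
    return (probs * bins).sum(dim=-1)                        # (N, 4) l/t/r/b distances

def dist2bbox(distances: torch.Tensor, anchor_points: torch.Tensor) -> torch.Tensor:
    """Convert left/top/right/bottom distances around anchor points into xyxy boxes."""
    lt, rb = distances.chunk(2, dim=-1)
    return torch.cat([anchor_points - lt, anchor_points + rb], dim=-1)
```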
Complexity Summary: Across the four scales, the total computational cost of EGNDH can be approximated as

FLOPs ≈ Σ_{s=2}^{5} H_s·W_s·[C_s·C + 18·C² + C·(4·reg_max + n_c)]

where the first term corresponds to the per-scale stem Conv + GN, the second to the two shared detail-enhanced 3 × 3 convolutions, and the third to the lightweight 1 × 1 prediction heads.
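As a quick sanity check of this approximation, the short script below evaluates it for a hypothetical 640 × 640 input; the per-scale channel counts, hidden dimension, and class count are illustrative values, not the paper's configuration.

```python
def egndh_flops(scales, hidden_c, reg_max=16, num_classes=3):
    """Evaluate the EGNDH FLOPs approximation: per-scale stem, two shared 3x3 blocks,
    and the 1x1 prediction heads. `scales` is a list of (C_s, H_s, W_s) tuples."""
    total = 0
    for c_s, h, w in scales:
        stem = h * w * c_s * hidden_c                             # 1x1 stem conv
        shared = h * w * 18 * hidden_c ** 2                       # two shared 3x3 blocks
        heads = h * w * hidden_c * (4 * reg_max + num_classes)    # 1x1 prediction heads
        total += stem + shared + heads
    return total

# Hypothetical P2..P5 shapes for a 640 x 640 input and a nano-scale model.
shapes = [(64, 160, 160), (128, 80, 80), (256, 40, 40), (512, 20, 20)]
print(egndh_flops(shapes, hidden_c=64) / 1e9, "GFLOPs")
```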