Article

Vision-Based Detection of Large Coal Fragments in Fully Mechanized Mining Faces Using Adaptive Weighted Attention and Transfer Learning

1 School of Mechanical Engineering, Xi’an University of Science and Technology, Xi’an 710054, China
2 Shaanxi Key Laboratory of Mine Electromechanical Equipment Intelligent Detection and Control, Xi’an 710054, China
3 Hunan Huanan Optoelectronics Co., Ltd., Changsha 415007, China
* Author to whom correspondence should be addressed.
Sensors 2026, 26(4), 1167; https://doi.org/10.3390/s26041167
Submission received: 8 January 2026 / Revised: 28 January 2026 / Accepted: 9 February 2026 / Published: 11 February 2026
(This article belongs to the Special Issue New Trends in Robot Vision Sensors and System)

Abstract

The unloading port of a scraper conveyor is a critical component in fully mechanized mining operations and is prone to blockages caused by large coal fragments. Such blockages persist largely because the visual perception methods used by crushing robots to identify large coal pieces in complex mining environments have limited accuracy and insufficient real-time performance. To address this issue, this paper proposes a visual inspection method for coal mine crushing robots based on transfer learning and an adaptive weighted attention mechanism, termed LCDet. First, a lightweight backbone network incorporating grouped convolution is designed to enhance feature representation while significantly reducing model complexity, thereby meeting deployment requirements. Second, an adaptive weighted attention mechanism is introduced to suppress background interference and emphasize regions containing large coal fragments, particularly enhancing blurred edge textures. In addition, a transfer learning-based training strategy is adopted to improve generalization performance and reduce dependence on large-scale training data. Experimental results on the public DsLMF+ dataset demonstrate that LCDet achieves precision, recall, mAP50, and mAP50–95 values of 79.3%, 75.1%, 84.5%, and 56.2%, respectively, achieving a favorable balance between detection accuracy and model complexity. On a self-constructed large coal dataset, LCDet attains precision, recall, mAP50, and mAP50–95 of 90.4%, 91.3%, 96.5%, and 69.3%, respectively, outperforming the baseline YOLOv8n model. Compared with other detection methods, LCDet exhibits superior performance while maintaining a relatively low parameter count. These results indicate that LCDet enables lightweight and accurate detection of large coal fragments, supporting real-time deployment on crushing robots in fully mechanized mining environments.

1. Introduction

Coal remains a fundamental component of the global energy system and industrial production, playing an indispensable role in sustaining economic development and ensuring energy supply stability. According to the Global Energy Review 2025 released by the International Energy Agency (IEA), coal continues to account for a significant share of global primary energy consumption, particularly in regions with strong industrial demand and developing energy infrastructures [1]. Similarly, the BP Energy Outlook 2024 indicates that, despite the accelerated transition toward low-carbon energy sources, coal is expected to remain an important part of the global energy mix over the coming decades under multiple future scenarios [2]. In this global context, improving the safety, efficiency, and sustainability of coal mining operations remains a critical challenge worldwide.
Driven by these long-term energy trends, the coal industry is undergoing a profound transformation toward intelligent, automated, and unmanned mining systems. Recent international studies have emphasized that automation, robotics, and intelligent perception technologies are becoming key enablers for improving productivity, operational safety, and environmental performance in modern mining operations [3]. Within intelligent coal mining systems, the stable and continuous transportation of coal is of paramount importance. Scraper conveyors serve as essential components in fully mechanized mining faces, and their operational reliability directly affects overall production efficiency. However, oversized coal fragments are one of the primary causes of conveyor blockages, which can lead to unplanned shutdowns, reduced productivity, and increased safety risks. To mitigate these issues, crushing robots have been introduced to automatically fragment large coal pieces. The effectiveness of such robots, however, is highly dependent on the performance of their perception systems, which must accurately and reliably detect large coal fragments under complex underground conditions.
From a robotics perspective, crushing robots can be categorized as multi-degree-of-freedom mechanical arms. Their operation must form a complete closed loop of “perception-planning-execution-verification.” The initial stage of this loop, “perception,” remains the critical link in the crushing arm’s task of breaking large pieces of coal. Yu et al. [4] proposed a multi-scale residual convolutional autoencoder (MR-CAE) to enhance the non-maximum suppression algorithm for blockage detection in transfer stations. However, this approach could only identify the blockage area and failed to distinguish individual coal blocks; merely locating the blocked area is insufficient for guiding the crushing robot in executing precise crushing, so further identification and differentiation of individual abnormal coal blocks remain essential. Recent international research has proposed deep learning-based object detection frameworks for conveyor belt inspection in industrial environments. For instance, several studies have developed improved deep learning models to detect foreign objects on coal mine conveyor belts under complex underground conditions, showing enhanced detection accuracy and efficiency [5]. Other work has applied YOLOv5-based methods to identify abnormalities such as belt damage, highlighting both opportunities and limitations of current perception systems in industrial settings [6]. Furthermore, enhanced YOLOv8 models incorporating attention and feature fusion mechanisms have been explored to improve object localization and robustness during conveyor monitoring [7]. To mitigate attention failure caused by the low contrast between coal and the conveyor background, Wu et al. [8] developed an improved YOLOv5 model, termed Gghost Squeeze-BiFPN YOLOv5, which incorporates lightweight and attention mechanisms specifically designed for the real-time detection of large coal pieces on conveyor belts in coal mines.
Nonetheless, the BiFPN architecture contains numerous skip connections and incurs significant inference latency on edge devices, which can hinder the robot’s ability to break up coal in time. Consequently, there is a need for a high-precision, low-latency detection algorithm for large coal pieces that can be integrated with a crushing robot, thereby creating an intelligent crushing system that achieves a complete closed loop of “perception-planning-execution-verification.” However, complex models typically demand substantial computational resources, making direct deployment on underground equipment challenging.
Researchers have increasingly concentrated on lightweight methodologies suitable for constrained hardware environments. In contexts characterized by limited computing resources, they have achieved model lightweighting through techniques such as model structure optimization [9], channel pruning [10], knowledge distillation [11], and the implementation of lightweight backbone networks. Within this framework, Fan et al. [12] introduced CM-YOLOv8, which reduces complexity through pruning. Hao et al. [13] utilized knowledge distillation to enhance the performance of student networks. Jiang et al. [14] adopted MobileNetv3 to lower computational costs. Lyu et al. [15] proposed grouped convolution to improve the receptive field. Although pruning, distillation, and lightweight backbones each offer unique advantages, grouped convolution has emerged as a pivotal operator upon which various technical approaches rely due to its outstanding performance. The studies mentioned indicate that lightweight networks employing grouped convolution can effectively balance performance with computational demands.
The detection system is vulnerable to positioning deviations and identification errors arising from interference caused by low light levels and heavy dust accumulation underground; these factors in turn heighten the risk of blockage. To quantitatively evaluate the root causes of these errors, Fan et al. identified the primary challenges that the harsh underground environment poses to the visual system: low illuminance and high dust levels reduce the signal-to-noise ratio of the active visual system, and the high similarity between coal lumps and the texture of coal walls renders traditional visual tracking methods, which rely on geometric features, ineffective [16,17,18]. In response, researchers have begun to focus on enhancing the model’s capacity to discern key features within complex backgrounds.
Given the challenge that large coal pieces in underground environments exhibit high similarity and are difficult to differentiate from background features, an attention mechanism is introduced to enhance the extraction of key features. Chen et al. [19] integrated a high-frequency perception module with a spatially dependent perception mechanism through the High-Frequency Spatial Feature Pyramid Network (HSFPN). This approach filters low-frequency background interference while capturing pixel-level spatial relationships, thereby enhancing multi-scale feature fusion and improving coal detection accuracy. Sui et al. [20] implemented a coordinate attention mechanism to tackle issues of low model accuracy and target identification difficulties caused by low light and high dust conditions in coal mine operational faces, thereby enhancing the model’s feature expression capabilities. Zhang et al. [21] proposed a convolutional network incorporating an attention mechanism for foreign object detection in coal. These methods effectively improve the model’s perceptual abilities without significantly increasing complexity. Building on this foundation, this paper employs an adaptive weighted attention mechanism, which offers enhanced global modeling and positioning capabilities beyond both channel and spatial dimensions.
Although the attention mechanism has yielded remarkable advancements in enhancing model perception, its efficacy remains heavily reliant on the availability of sufficient, high-quality labeled data. In practical underground environments, however, acquiring large-scale labeled datasets is not only costly but also constrained by the complexity and security of the scenarios. Traditional supervised learning methods encounter challenges related to limited generalization capabilities. Transfer learning has been widely adopted in robotics and industrial vision systems to reduce annotation costs and improve learning efficiency in data-scarce environments [22]. To alleviate the limited-data problem and improve model robustness, this study adopts a transfer learning-based training strategy. Instead of training the detector from scratch, the model is initialized using weights pre-trained on publicly available object detection datasets with similar visual characteristics. This strategy enables the network to inherit general feature representations learned from related tasks and facilitates faster convergence on the target coal detection task.
By fine-tuning the pre-trained backbone on the underground large coal dataset, the proposed method effectively improves generalization performance while reducing the dependence on extensive manual annotation. This transfer learning strategy serves as a practical and efficient training approach for deploying detection models in resource-constrained industrial environments. This method requires only a minimal amount of real annotations to attain full supervision accuracy, thereby fulfilling the rapid deployment requirements for crushing robots.
Given the complex working conditions in the underground environment of fully mechanized mining faces—characterized by low illumination, high dust levels, strong vibrations, and the high similarity between coal blocks and background textures—this paper addresses the closed-loop operational requirements of “perception-planning-execution-verification” for crushing robots. It proposes a visual inspection method for large coal crushing robots in fully mechanized mining faces, utilizing an adaptive weighted attention mechanism and transfer learning. Unlike typical object detection studies that primarily focus on detection accuracy, this paper systematically optimizes the structural design, feature enhancement mechanism, and training strategy of the detection model, taking into account the constraints of computing resources, real-time control, and the execution requirements of robots. The aim is to achieve a balance among detection accuracy, model complexity, and deployment feasibility.
To address this issue, this paper presents an efficient feature encoding unit based on grouped convolution within the YOLOv8 framework. This approach significantly reduces the computational overhead of the model while maintaining its feature representation capabilities. To tackle the challenges of blurred edges in underground coal blocks and the low contrast between the foreground and background, we designed an adaptive weighted attention mechanism. This mechanism enhances the key features associated with large coal blocks in both channel and spatial dimensions. Additionally, by integrating a transfer learning strategy and utilizing datasets from similar tasks to acquire prior knowledge, we can effectively mitigate the insufficient generalization ability of the model, which arises from the difficulty in obtaining underground samples.
This paper presents a collaborative operation architecture in which the detection model and the crushing robot are deployed independently. It successfully completes the closed-loop verification of the “detection-positioning-crushing” process within a laboratory setting, thereby demonstrating the feasibility and effectiveness of the proposed method in real-world crushing operations.
The primary contributions of this work are summarized as follows: (1) A lightweight large coal detection framework (LCDet) is proposed by jointly optimizing network structure, attention-based feature enhancement, and training strategy under strict underground deployment constraints, rather than focusing solely on detection accuracy. (2) An adaptive weighted attention mechanism is designed to enhance blurred edge and spatial features of large coal fragments in low-illumination and high-dust environments, addressing the high visual similarity between coal blocks and background textures. (3) A transfer learning-based training strategy tailored for small-sample underground scenarios is introduced, which effectively balances convergence efficiency and generalization performance under limited annotation conditions. (4) A complete perception–decision–execution closed-loop deployment scheme is implemented and validated on a crushing robot platform, demonstrating the engineering feasibility of the proposed method beyond offline benchmark evaluation.

2. Methods

The research plan outlined in this paper is illustrated in Figure 1. The research process encompasses dataset construction, detector design, result visualization, and the operation of a crushing robot. First, a dataset comprising large pieces of coal from underground sources was constructed. Second, a detection model specifically for identifying large underground coal pieces was developed. Subsequently, the experimental results were visualized to enhance the credibility of the model’s predictions. Finally, based on the test outcomes, the large pieces of coal were positioned for the crushing robot to facilitate the crushing process.

2.1. Overall Framework of the Large Lump Coal Detector

YOLOv8 represents a significant iteration within the YOLO series and has found extensive applications in various domains, including autonomous driving and surface defect recognition [23]. Consequently, this paper adopts YOLOv8 as the benchmark framework for developing an accurate and real-time Large Coal Detector (LCDet). The architecture of the network model is illustrated in Figure 2.
Each feature extraction stage comprises a feature encoding unit and a downsampling encoding unit. The downsampling encoding unit employs a standard convolution operation. Within the single feature extraction stage, the downsampling encoding unit reshapes the feature map size and adjusts the channel count to enhance the information flow’s transmission efficiency. The output from the preceding feature encoding unit serves as the input for the subsequent feature encoding unit. Furthermore, this study implements a design strategy involving incremental convolution kernels, which progressively increases the size of the grouped convolution kernels in the backbone network to accommodate varying resolution requirements, thereby facilitating the gradual acquisition of multi-scale semantic information. During the feature fusion stage, this paper integrates an adaptive weighted attention mechanism to extract significant features associated with large coal chunks in the image. Finally, in the detection stage, the model utilizes decoupled heads to predict features of different scales following aggregation. The non-maximum suppression algorithm filters out bounding boxes with low confidence scores, selecting the bounding box with the highest confidence as the final detection result.
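As background for the non-maximum suppression step described above, the following is a minimal pure-Python sketch of greedy NMS; the score and IoU thresholds are illustrative defaults, not values taken from the paper.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, score_thr=0.25, iou_thr=0.5):
    """Greedy NMS: drop low-confidence boxes, then keep each remaining box
    only if it does not overlap an already-kept box above iou_thr."""
    order = sorted((i for i, s in enumerate(scores) if s >= score_thr),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thr for j in keep):
            keep.append(i)
    return keep
```

In practice the deployed detector would use the framework's built-in NMS; this sketch only illustrates the filtering logic the text describes.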

2.2. Feature Coding Unit

The feature encoding unit that employs grouped convolution can decrease both the number of parameters and the computational burden of the model while preserving essential spatial information. The operational differences between grouped convolution and standard convolution are illustrated in Figure 3. Grouped convolution significantly reduces model parameters by partitioning the input feature map into several subgroups and performing convolution independently on each subgroup, thereby sustaining the model’s capacity to express and learn features. The input feature map has width w, height h, and c channels, and g denotes the number of groups. The computational parameter counts for standard convolution and grouped convolution are given in Equations (1) and (2), respectively. It is clear that grouped convolution results in a reduced computational cost.
V1 = h × w × c²  (1)
V2 = h × w × c² × (1/g)  (2)
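As a quick numerical check of the parameter counts in Equations (1) and (2), the sketch below compares the trainable weights of a standard and a grouped convolution layer; the helper function is illustrative, assumes square kernels, and omits bias terms.

```python
def conv_params(k, c_in, c_out, groups=1):
    """Trainable weights of a 2D convolution layer (bias omitted):
    each of the `groups` groups maps c_in/groups channels to c_out/groups."""
    assert c_in % groups == 0 and c_out % groups == 0
    return k * k * (c_in // groups) * (c_out // groups) * groups

# Example: a 3x3 convolution with 64 input and 64 output channels.
standard = conv_params(3, 64, 64)            # proportional to k^2 * c^2
grouped = conv_params(3, 64, 64, groups=4)   # reduced by a factor of g
```

With g = 4 (the group number adopted in the LCDet backbone, per Table 1), the grouped layer holds exactly one quarter of the standard layer's weights.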
To optimize the effectiveness of grouped convolution, this paper introduces a robust gradient propagation pathway. Figure 4 illustrates three distinct feature encoding units. As depicted in Figure 4a, the feature encoding units of the Cross Stage Partial Network mitigate excessive gradient information reuse through cross-stage local operations [24]. The feature encoding unit of the Efficient Layer Aggregation Network effectively consolidates features across various levels, thereby enhancing the model’s capacity to capture multi-scale information [25]. Drawing inspiration from these approaches, this paper presents the Efficient Feature Encoding Unit (EFEU), illustrated in Figure 4c. Initially, the input feature map undergoes 1 × 1 convolution and separation operations to produce two sets of feature maps. One set is directly subjected to concatenation, while the other, after processing through multiple bottleneck modules, is concatenated with the output of an alternative branch along the channel dimension. This design employs cross-stage local operations to reduce gradient information reuse, while the fusion of features from different branches in the channel dimension enhances the model’s feature representation capability, thus improving the network’s ability to model complex data.
To investigate the influence of the group number in grouped convolution, comparative experiments were conducted with different group settings (2, 4, and 8). As shown in Table 1, reducing the group number increases inter-channel feature interaction but leads to higher computational cost, whereas excessively increasing the group number weakens feature coupling and degrades detection accuracy. Overall, a group number of 4 provides the best balance between computational efficiency and detection performance, and is therefore adopted in the LCDet backbone.

2.3. Adaptive Weighted Attention Mechanism

The image will contain numerous features unrelated to target detection, including background information. When humans process visual information, they instinctively direct their attention to areas pertinent to the current task, relegating irrelevant information from other regions to a secondary status. This capability is referred to as the visual attention mechanism.
This study introduces an adaptive weighted attention mechanism designed to extract key features associated with large coal chunks from images. It enhances the blurred edge texture features of these coal chunks while concurrently diminishing the impact of complex backgrounds. The attention mechanism primarily comprises the channel attention module [26] and the spatial attention module [27].
The channel attention module assigns weights to features across different channels, while the spatial attention module accurately identifies the positional features of areas containing large coal fragments. In the operation of the channel attention module, the input feature map (E) is initially compressed into a one-dimensional feature vector along the spatial dimension using the Global Average Pooling (GAP) layer. Subsequently, one-dimensional convolution operations are performed on these feature vectors to extract features, resulting in new feature vectors (Z). Following this, an activation function is applied to normalize the data, yielding the attention weights (X) for the various channels. Finally, the original input feature map (E) is multiplied by the attention weights (X) to produce the enhanced feature map (F). The specific calculation process is as follows:
Z = Conv1d(GAP(E))  (3)
X = Sigmoid(Z)  (4)
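The channel attention computation in Equations (3) and (4) can be sketched in NumPy as follows; the fixed box-filter kernel stands in for the learned Conv1d weights and is an assumption for illustration only.

```python
import numpy as np

def channel_attention(E, k=3):
    """Sketch of Eqs. (3)-(4): E has shape (C, H, W).
    GAP over space -> 1D conv across channels -> sigmoid -> reweight."""
    gap = E.mean(axis=(1, 2))             # (C,) global average pooling
    w = np.ones(k) / k                    # illustrative stand-in for learned Conv1d
    Z = np.convolve(gap, w, mode="same")  # local cross-channel interaction
    X = 1.0 / (1.0 + np.exp(-Z))          # sigmoid -> per-channel weights
    return E * X[:, None, None]           # enhanced feature map F
```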
In the operation of the spatial attention module, global average pooling and global max pooling layers are initially employed to compress the input feature map (F) along the channel dimension. Following this, a two-dimensional convolutional layer scans the compressed features, producing an intermediate map (K) through interactions within local spatial regions. An activation function is then applied to (K) for normalization, resulting in the spatial attention feature (L). Finally, the spatial attention feature (L) is utilized to weight the input features, thereby producing the final output feature map (M), as demonstrated in Equations (5) and (6). The structure of the adaptive weighted attention mechanism is depicted in Figure 5.
K = Conv2d(Concat(AvgPool(F), MaxPool(F)))  (5)
L = Sigmoid(K)  (6)
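The spatial attention path of Equations (5) and (6) can be sketched in the same style; summing the average- and max-pooled maps and applying a k × k box filter stands in for the learned 2D convolution over their concatenation, a deliberate simplification for illustration.

```python
import numpy as np

def spatial_attention(F, k=3):
    """Sketch of Eqs. (5)-(6): F has shape (C, H, W).
    Channel-wise avg/max pooling -> k x k box filter (stand-in for Conv2d)
    -> sigmoid -> spatial reweighting."""
    avg = F.mean(axis=0)                  # (H, W)
    mx = F.max(axis=0)                    # (H, W)
    fused = avg + mx                      # simplification of conv over concat
    pad = k // 2
    padded = np.pad(fused, pad)
    K = np.zeros_like(fused)
    H, W = fused.shape
    for i in range(H):
        for j in range(W):
            K[i, j] = padded[i:i + k, j:j + k].mean()  # k x k box filter
    L = 1.0 / (1.0 + np.exp(-K))          # sigmoid -> spatial weights
    return F * L[None, :, :]              # final output feature map M
```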

2.4. Training Strategies Based on Transfer Learning

In conventional supervised learning frameworks, model parameters are typically initialized randomly, which often requires large-scale labeled datasets to achieve stable convergence. However, in underground coal mining environments, the acquisition and annotation of high-quality image data are constrained by complex working conditions, safety requirements, and high labor costs. As a result, training deep detection models from scratch may lead to slow convergence and overfitting when only limited annotated samples are available.
Transfer learning has been widely studied as an effective strategy to improve model generalization and training efficiency under limited data conditions [28]. By reusing feature representations learned from related source tasks, transfer learning enables models to reduce their dependence on extensive manual annotation and to achieve more stable optimization behavior. This strategy has been successfully applied in various industrial vision and robotics applications where data collection is costly or difficult.
In this study, a parameter transfer-based training strategy is adopted to enhance the robustness and practicality of the proposed LCDet model. Specifically, the detector is first pre-trained on a publicly available object detection dataset to learn generic visual features, and the resulting parameters are then used to initialize the backbone network of LCDet. Considering the consistency of task attributes, the Pascal Visual Object Classes (VOC) dataset is selected as the source dataset, as both the source task and the target task belong to the category of object detection problems [29].
During the transfer process, the backbone network inherits the pre-trained parameters as feature priors, while the detection head is fine-tuned using the target large coal dataset to adapt to the specific visual characteristics of underground mining environments. This training strategy accelerates model convergence, reduces the risk of overfitting under small-sample conditions, and improves the stability and generalization capability of the detector in complex underground scenarios. The overall workflow of the parameter transfer-based training strategy is illustrated in Figure 6.
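The parameter-transfer step described above can be illustrated with plain dictionaries standing in for model state; the `backbone.` name prefix and the flat name-to-weight representation are assumed conventions for this sketch, not details from the paper.

```python
def transfer_parameters(pretrained, target, backbone_prefix="backbone."):
    """Copy backbone weights from a model pre-trained on the source dataset
    (e.g. Pascal VOC) into the target detector, leaving detection-head
    parameters at their fresh initialization for fine-tuning."""
    transferred = dict(target)
    for name, weight in pretrained.items():
        if name.startswith(backbone_prefix) and name in transferred:
            transferred[name] = weight  # inherit feature priors
    return transferred
```

With a real framework, the same idea is typically realized by loading a pre-trained checkpoint and then fine-tuning on the target large coal dataset.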

2.5. Overall Design of the Crushing Robot

In the context of unmanned and environmentally sustainable mining within the coal industry, the crushing robot serves as a crucial component for facilitating efficient coal transportation at the intelligent fully mechanized mining face. This robot is tasked with the automatic fragmentation of large coal pieces on the scraper conveyor. Its design and control logic significantly influence the closed-loop efficiency of the “perception-planning-execution-verification” process. The robot receives directives from the LCDet detection system and is equipped to address challenges such as low illumination, high dust levels, texture similarity, and vibration-induced blurring in underground environments while crushing large coal fragments. Figure 7 illustrates the overall composition of the crushing robot, which consists of a mobile base, a multi-degree-of-freedom mechanical arm, and a crushing impact head.
The robot features a modular design, with a mobile base powered by a double track chain. Four rolling wheels are integrated into the track grooves to constrain five degrees of freedom. The YN100-250 drive motor (250 W) (manufactured by Pfeider, Wuxi, China) is paired with a 1:10 planetary reducer to provide the driving force required for the 3.2 m track. An incremental rotary encoder is mounted on the reducer shaft. To prevent positioning deviation, reset origin and limit points are established at both ends of the track. The multi-degree-of-freedom robotic arm employs a three-joint configuration, incorporating two cantilever motors equipped with brakes to ensure a stable posture, thereby enabling pixel-level positioning and millimeter-level control. The selected crushing impact head, the Raya G3901, is affixed to the end of the robotic arm and can respond to predictions from the LCDet system to facilitate coal block trajectory tracking and crushing. For control purposes, the system implements a multi-module collaborative architecture centered around the Siemens controller, enabling the separate deployment of the “model-robot.” The LCDet system transmits the coordinates and confidence level of the coal block to the motion controller. This controller drives the three-axis servo driver to adjust the angle of the mechanical arm. The moving base utilizes a Fuji frequency converter to operate the traction motor. To mitigate safety risks, the track limit is equipped with Omron dual-channel dual-limit sensors.
The control process employs a closed-loop logic of “perception-judgment-execution-reset” and is intricately coordinated with the LCDet system. The execution sequence is as follows: After the camera captures an image of the coal flow, the image is transmitted to the LCDet module, where the system algorithm determines in real time whether the coal blocks require crushing. If the coal flow is deemed normal, the system will return directly to standby mode. Conversely, if an abnormally large coal piece is detected, the base drives the robot to the target area, while the mechanical arm simultaneously performs the pre-alignment operation. Following the crushing operation, the system automatically resets to standby. When the working position of the coal shearer significantly overlaps with that of the crushing robot, both the detection system and the robot immediately suspend operations to prevent interference with the coal mining process. This robot, in conjunction with the LCDet system, establishes an automated closed loop of “detection-positioning-crushing.” Its lightweight design is well-suited for underground spaces, thereby facilitating intelligent coal mining.
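The “perception-judgment-execution-reset” loop above can be sketched as a simple decision function; the state names, detection tuple format, and confidence threshold are illustrative assumptions, not values from the paper.

```python
def control_step(detections, conf_thr=0.5, shearer_overlap=False):
    """One cycle of the crushing robot's control loop.
    `detections` are LCDet outputs as (x, y, confidence) tuples."""
    if shearer_overlap:
        return ("suspend", None)          # avoid interfering with the shearer
    targets = [d for d in detections if d[2] >= conf_thr]
    if not targets:
        return ("standby", None)          # normal coal flow, no action
    x, y, _ = max(targets, key=lambda d: d[2])
    return ("crush", (x, y))              # move to highest-confidence target
```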

3. Experimental Results and Analysis

3.1. Data Set Construction

3.1.1. Public Data Sets

To thoroughly assess the effectiveness and accuracy of the model proposed in this study for practical applications, this paper utilized a substantial dataset of large coal images derived from the abnormal state image dataset of fully mechanized mining faces (DsLMF+) [30]. The images in this dataset originate from original underground surveillance video, as illustrated in Figure 8. The image data were extracted frame by frame, with subsequent filtering applied to eliminate noise and irrelevant information, thereby enhancing image quality. Following these preprocessing steps, a total of 21,017 high-quality large coal images were obtained. These images were then divided into training and validation sets in an 80:20 ratio. By conducting experiments on this challenging dataset, we can comprehensively evaluate the model’s generalization capability and accuracy in detecting large pieces of coal.

3.1.2. Self-Built Data Set

The coal block image data were collected from two sources: the 52,604 fully mechanized mining face in a specific mining area in Shendong and the fully mechanized mining laboratory at Xi’an University of Science and Technology. The former provides authentic underground working conditions, while the latter supplies supplementary samples under controlled experimental conditions. This combination enhances the dataset’s comprehensiveness and representativeness regarding environmental diversity, particle size distribution, and background complexity. The collection scenario for the large lump coal dataset is illustrated in Figure 9, with the collection device positioned above the end of the scraper conveyor.
The large lump coal detection device, illustrated in Figure 10, primarily consists of an industrial control computer and a binocular camera. The camera transmits video to the industrial control computer via a USB interface. The LCDet detection algorithm employed by the industrial control computer assesses the presence of large coal pieces and relays their location information to the crushing robot through a network cable. The camera parameters are presented in Table 2.
The collected dataset is presented in Figure 11. From the videos recorded by the acquisition equipment over one week, 5000 high-quality frames containing large coal targets were screened and extracted, each at a resolution of 1280 × 960 pixels. To construct the dataset for training and evaluation, 3000 of the 5000 images were designated as the training set, 1000 as the validation set, and the remaining 1000 as the test set.
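For reproducibility, a fixed 3000/1000/1000 partition can be produced with a seeded shuffle. A minimal sketch (the function and file names are illustrative, not the authors' actual tooling):

```python
import random

def split_dataset(image_paths, n_train=3000, n_val=1000, n_test=1000, seed=42):
    """Deterministically shuffle and partition image paths into
    train/validation/test subsets (3000/1000/1000 as in the paper)."""
    assert len(image_paths) >= n_train + n_val + n_test
    paths = sorted(image_paths)          # fixed order before shuffling
    random.Random(seed).shuffle(paths)   # reproducible shuffle
    train = paths[:n_train]
    val = paths[n_train:n_train + n_val]
    test = paths[n_train + n_val:n_train + n_val + n_test]
    return train, val, test

# Hypothetical frame file names
frames = [f"frame_{i:05d}.jpg" for i in range(5000)]
train, val, test = split_dataset(frames)
```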

3.2. Experimental Platform

The experimental platform used in this paper is presented in Table 3. The hardware consists of an NVIDIA GeForce RTX 4080 GPU and an Intel(R) Core(TM) i7-13700KF CPU. PyTorch 1.7 was installed on a Windows 10 operating system. The training hyperparameters include the learning rate, optimizer, batch size, and number of training epochs: the initial learning rate is 0.01, the optimizer is stochastic gradient descent, the batch size is 16, and the total number of training epochs is 400.
Indicators for assessing model performance include precision (P), recall (R), mean average precision (mAP), the number of parameters, and GFLOPs. The formulas for recall, precision, and mean average precision are as follows:
$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{AP} = \int_{0}^{1} \text{Precision}(\text{Recall}) \, d(\text{Recall})$$

$$\text{mAP} = \frac{1}{n} \sum_{i=1}^{n} \text{AP}_i$$
True positive (TP) refers to the number of samples correctly classified as positive by the algorithm, while false positive (FP) denotes the number of negative samples incorrectly predicted as positive. False negative (FN) indicates the number of positive samples misclassified as negative. Average precision (AP) is the area under the precision-recall (P-R) curve.
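These definitions can be sketched in a few lines; the toy example below uses a single class and a coarse trapezoidal approximation of the P-R area, not the exact interpolation protocol behind the mAP50 and mAP50–95 metrics reported in the paper:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(pr_points):
    """Approximate AP as the area under the P-R curve via the trapezoidal
    rule; pr_points is a list of (recall, precision) pairs sorted by recall."""
    ap = 0.0
    for (r0, p0), (r1, p1) in zip(pr_points, pr_points[1:]):
        ap += (r1 - r0) * (p0 + p1) / 2.0
    return ap

def mean_average_precision(aps):
    """mAP is the mean of the per-class AP values."""
    return sum(aps) / len(aps)

# Toy single-class numbers
p, r = precision_recall(tp=80, fp=20, fn=20)              # → (0.8, 0.8)
ap = average_precision([(0.0, 1.0), (0.5, 0.9), (1.0, 0.6)])  # → 0.85
```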
The term “number of parameters” denotes the total count of trainable parameters, serving as an indicator of the algorithm’s space complexity. GFLOPs represents the total number of floating-point operations executed by the model during inference or training, thereby reflecting the algorithm’s time complexity.

3.3. Experimental Results

3.3.1. Ablation Experiment Results

The LCDet proposed in this paper offers several advantages: (1) During the feature extraction stage, it facilitates frequent gradient information flow while minimizing the consumption of computing resources. (2) It employs AWAM to extract key features associated with large coal pieces in the image, effectively eliminating irrelevant information. (3) The model utilizes the acquired prior knowledge as initial weights, enhancing its adaptability to complex scenarios. This paper conducts ablation experiments to evaluate the contributions of various components to model performance. To ensure experimental fairness, all parameters were consistently set.
The ablation experiment is conducted in three steps:
Step 1: Adjust the number of coding unit bottlenecks at each feature extraction stage to achieve an improved balance between model complexity and detection accuracy.
Step 2: Once the number of coding units is established, apply AWAM to identify key features associated with large coal pieces in underground images.
Step 3: Employ a training strategy grounded in transfer learning to ensure optimal training outcomes.
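The parameter transfer in Step 3 can be sketched as copying source-domain weights into the target model wherever layer names and shapes agree, leaving mismatched layers at their target initialization. A framework-agnostic toy version (nested lists stand in for tensors; real code would operate on PyTorch state dicts, where load_state_dict(strict=False) behaves similarly):

```python
def shape_of(w):
    # Stand-in for tensor.shape; here weights are (nested) lists.
    return (len(w), len(w[0])) if isinstance(w[0], list) else (len(w),)

def transfer_matching_weights(source_state, target_state):
    """Copy source-domain parameters into the target model wherever the
    layer name and tensor shape match; keep the target initialization
    elsewhere, so architectural differences do not block the transfer."""
    transferred = {}
    for name, weight in target_state.items():
        src = source_state.get(name)
        if src is not None and shape_of(src) == shape_of(weight):
            transferred[name] = src        # reuse prior knowledge
        else:
            transferred[name] = weight     # keep target initialization
    return transferred

# Hypothetical layer names: the backbone matches, the head does not
source = {"backbone.conv1": [[1.0, 2.0]], "head.cls": [0.5]}
target = {"backbone.conv1": [[0.0, 0.0]], "head.cls": [0.0, 0.0]}
merged = transfer_matching_weights(source, target)
```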
Table 4 shows that each proposed module contributes positively to the overall detection performance. In the table, × indicates the module is not used and √ indicates the module is used. Specifically, the introduction of the Efficient Feature Encoding Unit (EFEU) improves feature representation under limited computational budgets, leading to consistent gains in precision and recall. The Adaptive Weighted Attention Mechanism (AWAM) further enhances the discrimination of large coal regions by emphasizing blurred edge features and suppressing background interference, which is particularly beneficial for recall performance. In addition, the parameter transfer (PK) strategy improves model stability and generalization in data-limited scenarios. Compared with the baseline model, the final LCDet achieves improvements of 0.7% in Precision, 1.9% in Recall, 0.6% in mAP50, and 0.8% in mAP50–95, demonstrating the effectiveness of the integrated design.
To further verify the effectiveness of the adaptive weighted attention mechanism under extreme underground conditions, an additional ablation study was conducted on the DsLMF+ dataset. DsLMF+ is collected from real fully mechanized mining faces and inherently contains challenging samples acquired under harsh environments, including low illumination, heavy dust, vibration-induced motion blur, and strong background interference. As shown in Table 5, the introduction of AWAM leads to consistent improvements in recall and mAP50 compared with the model without AWAM. This indicates that the adaptive weighted attention mechanism effectively suppresses background interference and enhances edge-related features, thereby improving detection robustness in visually degraded underground scenarios.

3.3.2. Comparison of Experimental Results on the Public Dataset

To objectively assess the detection performance of the proposed model, five detection models are selected for comparison. The YOLO series has demonstrated excellent performance across a wide range of object detection tasks, while RT-DETR employs a Transformer-based encoder-decoder architecture that enables efficient training and inference. The source code for these models is obtained from their official repositories, and all models are evaluated on the dataset using the established metrics.
The quantitative comparison results on the public dataset are summarized in Table 6. To comprehensively evaluate performance in coal mine-specific application scenarios, lightweight detectors designed for underground environments, including CM-YOLOv8 and UCM-Net, are incorporated as comparison baselines, with all results reported on the DsLMF+ dataset. Overall, LCDet demonstrates a favorable balance between detection accuracy and model efficiency, achieving mAP50 and mAP50–95 values of 84.5% and 56.2%, respectively, which are comparable to or higher than those of most general-purpose and coal mine-specific detectors. Meanwhile, LCDet operates with a significantly smaller parameter scale (2.93 M) and lower computational cost (7.5 GFLOPs) than the majority of comparison models, including CM-YOLOv8 and UCM-Net, while maintaining stable Precision and Recall of 79.3% and 75.1% under complex underground mining conditions. Among all evaluated methods, only YOLOv11n exhibits a smaller model footprint than LCDet, but at the expense of reduced detection accuracy. These results confirm that LCDet provides a more advantageous trade-off between accuracy and computational efficiency, making it particularly suitable for real-time deployment on crushing robots in fully mechanized mining environments.

3.3.3. Comparison of Experimental Results on the Self-Built Dataset

The quantitative comparison results on the self-constructed dataset are reported in Table 7. LCDet achieves competitive detection performance, particularly in Recall and mAP50–95, while maintaining a compact model size and low computational cost. Compared with larger models such as RT-DETR, LCDet substantially reduces the number of parameters and the model size, requiring only 5.99 MB, approximately 9% of the size of RT-DETR. These results indicate that LCDet offers a favorable balance between detection accuracy and efficiency, making it suitable for deployment in resource-constrained underground mining environments.
To convey the effectiveness of the proposed model more intuitively, all evaluation indicators are visualized as histograms for each model. Figure 12a illustrates the precision, recall, and average precision of each model on the test set, while Figure 12b depicts the number of parameters, model size, and computational load of each model. As shown in Figure 12, in terms of accuracy indicators, LCDet’s precision is only marginally lower than that of YOLOv8s, whereas its other indicators surpass those of the competing models. In terms of model complexity, LCDet exhibits significant advantages, highlighting its overall superiority.

3.3.4. Visualization Results

To enhance trust in the predictive outcomes of the proposed model, this paper employs gradient-weighted class activation mapping (Grad-CAM) to visualize the image regions that drive the model’s predictions [32]. Ten representative images were chosen, each exhibiting significant variations in the size and edge texture of coal blocks, ensuring that the visualization results comprehensively demonstrate the model’s ability to capture the essential features of coal blocks. The Grad-CAM visualization results are presented in Figure 13 and Figure 14. The color gradients indicate the model’s attention levels across different regions of the image, with red denoting high intensity and blue denoting low intensity. As illustrated in Figure 13, the model’s area of interest is concentrated on regions containing large coal pieces, aligning with human focus during the detection of such pieces. Figure 14 further reveals that when the coal flow is congested, the model’s attention shifts to the congested area; conversely, when the coal flow is smooth, the model’s focus returns to the large coal area. This behavior underscores the reliability of the model’s predictions. In conclusion, LCDet accurately detects large coal pieces within the scraper conveyor coal flow, providing a foundation for the crushing robot to execute its crushing tasks.
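The heat maps in Figures 13 and 14 come from the gradients and activations of a chosen layer; the core Grad-CAM computation reduces to a gradient-weighted, ReLU-rectified sum of activation channels (Selvaraju et al. [32]). A toy sketch on nested lists (real implementations hook a convolutional layer of the trained network):

```python
def grad_cam(activations, gradients):
    """Core Grad-CAM step on toy K×H×W nested lists: each channel weight
    is the global average of that channel's gradients, and the class
    activation map is the ReLU of the weighted sum of activation channels."""
    K, H, W = len(activations), len(activations[0]), len(activations[0][0])
    weights = [sum(v for row in g for v in row) / (H * W) for g in gradients]
    cam = [[max(0.0, sum(weights[k] * activations[k][i][j] for k in range(K)))
            for j in range(W)] for i in range(H)]
    return cam

# Toy activations/gradients: channel 0 supports the class, channel 1 opposes it
acts = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
cam = grad_cam(acts, grads)  # highlights (0,0); ReLU zeroes out (1,1)
```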

3.4. Laboratory Deployment

The detection model presented in this paper is intended for large-scale coal crushing operations, where overall performance is influenced not only by the accuracy of the model itself but also by the effectiveness of the “perception-decision-execution” framework. To assess the viability of the “separate deployment of the network model and crushing robot” architecture, a deployment scheme was executed in the fully mechanized mining face laboratory at Xi’an University of Science and Technology, simulating the operational environment of the underground scraper conveyor. Figure 15 illustrates the laboratory deployment.

4. Discussion

4.1. Mechanism Analysis of LCDet Under Extreme Underground Conditions

LCDet shows stable performance in extreme underground environments characterized by low illumination, high dust concentration, strong vibration, and weak texture contrast. This robustness does not stem from a single architectural modification, but from the coordinated design of a lightweight backbone, an edge-aware attention mechanism, and a similar-domain transfer learning strategy, which together address the key challenges of underground visual perception.
The Efficient Feature Encoding Unit (EFEU) reconstructs the gradient propagation path through grouped convolution. Compared with simply deepening the network, this design reduces model parameters and computational cost while improving feature utilization under weak-texture conditions. By optimizing gradient flow, the network is able to extract more informative representations from limited visual cues, which is particularly important in complex coal wall scenes where texture information is severely degraded.
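The parameter saving from grouped convolution follows directly from its definition: splitting the input and output channels into g independent groups divides the weight count by g. A quick arithmetic check (bias terms omitted for simplicity):

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k×k convolution; grouped convolution splits the
    channels into `groups` independent paths, cutting parameters by that
    factor: (c_in/g) * (c_out/g) * k * k per group, times g groups."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * (c_out // groups) * k * k * groups

standard = conv_params(128, 128, 3)            # 147,456 weights
grouped = conv_params(128, 128, 3, groups=4)   # 36,864 weights, 4x fewer
```

The same factor applies to the multiply-accumulate count, which is why grouped convolution reduces both the parameter total and GFLOPs.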
The Adaptive Weighted Attention Mechanism (AWAM) further enhances detection robustness by strengthening high-frequency edge information in both channel and spatial dimensions. This mechanism effectively reduces misdetections caused by grayscale similarity and adhesion between coal blocks and background surfaces. As a result, the model maintains reliable discrimination even when foreground–background contrast is minimal. Ablation experiments conducted on the DsLMF+ dataset indicate that AWAM contributes to improved recall and overall stability under harsh underground conditions.
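As a rough illustration of this weighting principle (not the exact AWAM design, which combines GAP, GMP, and learned layers as shown in Figure 5), a simplified channel-then-spatial attention step on a toy C×H×W feature map might look like:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def awam_like_attention(fmap):
    """Simplified attention sketch on a C×H×W feature map (nested lists):
    channel weights from global average pooling, spatial weights from a
    per-pixel max over channels, both squashed by a sigmoid. The real
    AWAM additionally uses global max pooling and learned layers."""
    C, H, W = len(fmap), len(fmap[0]), len(fmap[0][0])
    # Channel attention: sigmoid of each channel's global average
    ch_w = [sigmoid(sum(v for row in ch for v in row) / (H * W)) for ch in fmap]
    scaled = [[[v * ch_w[c] for v in row] for row in fmap[c]] for c in range(C)]
    # Spatial attention: sigmoid of the per-pixel max across channels
    out = [[[scaled[c][i][j] * sigmoid(max(scaled[k][i][j] for k in range(C)))
             for j in range(W)] for i in range(H)] for c in range(C)]
    return out

fmap = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 2.0], [2.0, 0.0]]]
out = awam_like_attention(fmap)
```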
In addition, the transfer learning strategy adopts a selective similar-domain migration scheme. Instead of relying on large-scale generic source datasets, pre-training is performed on domains with higher scene similarity to underground mining environments. This approach helps alleviate negative transfer under limited target-domain samples and accelerates convergence during training. Together, these components form a tightly coupled mechanism that improves detection reliability in extreme downhole scenarios.
From a broader methodological perspective, recent advances in industrial fault diagnosis have shown that robustness under strong noise and domain shift can be effectively addressed through metric learning, evolutionary reconstruction, and multi-source adversarial transfer strategies. While these methods primarily focus on signal-level fault recognition, LCDet extends the underlying principles of robustness and domain adaptation to vision-based perception tasks in underground mining. By embedding similar-domain transfer and edge-aware feature modeling into a lightweight detection framework, LCDet demonstrates how such ideas can be adapted to real-time visual detection under severe environmental and computational constraints. Unlike existing lightweight detectors that emphasize isolated structural optimizations, the key contribution of LCDet lies in its system-level design, where lightweight feature encoding, edge-aware attention, and domain-adaptive learning are jointly considered for underground applications.

4.2. Engineering Significance and Industrial Real-Time Performance

In practical mining applications, detection accuracy alone is insufficient. Crushing robots operating in fully mechanized mining faces are subject to strict constraints on computational resources, power consumption, and real-time responsiveness. Excessive inference latency may directly affect control stability and operational safety.
LCDet is designed with these engineering constraints in mind. The integration of the lightweight backbone with the motion significance branch supports a closed-loop workflow covering perception, decision-making, and execution. From an operational perspective, this design contributes to reducing congestion events on the scraper conveyor and facilitates the transition from offline high-precision control to end-to-end real-time control. At the same time, LCDet maintains competitive detection accuracy while keeping computational requirements low, which is essential for deployment on industrial embedded platforms.
This design philosophy is consistent with recent developments in industrial intelligent systems, such as lightweight fault diagnosis methods for aircraft engine bearings and train transmission systems, where efficient inference under limited hardware resources is emphasized. Rather than pursuing accuracy through increasingly deeper or heavier architectures, this work focuses on system-level optimization driven by practical engineering demands, including weak-texture perception, limited computational resources, and real-time robotic execution. In this context, LCDet provides a practical reference for intelligent perception tasks in harsh industrial environments with limited data availability.

4.3. Limitations and Future Work

Despite its promising performance, LCDet still has several limitations. First, although the proposed transfer learning strategy improves convergence speed and detection accuracy under limited data conditions, its effectiveness depends on the similarity between the source and target domains. When the target dataset is extremely small or the domain gap becomes large, negative transfer may occur, leading to suboptimal feature adaptation.
Second, analysis of false detection and missed detection samples reveals several recurring failure cases. Extremely small coal blocks occupying only a few pixels are prone to being missed due to insufficient spatial resolution and limited feature granularity in lightweight models. In addition, severe adhesion between coal blocks and background structures with similar grayscale or texture patterns can lead to both false positives and missed detections. Serious occlusion caused by machinery components, loose debris, or shadows further reduces the visible area of targets, increasing detection uncertainty. These cases indicate that when visual cues are extremely limited or heavily degraded, the current model may struggle to fully distinguish target boundaries.
Third, under extremely low-visibility conditions such as heavy dust, dense smoke, or water mist that completely obscures coal edges, detection performance—particularly recall—may fluctuate. While AWAM enhances robustness by emphasizing edge-related and spatial features, it cannot fully recover target contours when visual information is severely degraded. Future work will explore multi-scale feature fusion for small-target representation, temporal consistency modeling across consecutive frames to handle occlusion, and the integration of additional sensing modalities such as polarization or depth information to compensate for visual information loss.
Finally, with respect to generalization across different mining sites, variations in illumination conditions, coal texture characteristics, conveyor structures, and camera installation angles may affect detection performance. Although LCDet demonstrates stable results on both public and self-constructed datasets, future studies will focus on multi-site data collection and domain adaptation strategies to further improve cross-scene generalization and deployment robustness.

5. Conclusions

An LCDet detector was developed to address the challenge of detecting large coal fragments underground. Comprehensive research and experimental validation were conducted from three perspectives: model enhancement, attention mechanisms, and prior knowledge integration. First, a lightweight backbone network utilizing grouped convolution was designed, which improves the model’s feature representation capability while significantly reducing its complexity. Second, an adaptive weighted attention mechanism was introduced to direct the model’s focus toward areas containing large coal fragments, accentuating their blurred edge textures while minimizing interference from irrelevant backgrounds. Furthermore, to enhance the model’s generalization ability and improve training outcomes, a training strategy informed by prior knowledge was implemented, optimizing model performance through a knowledge-guided approach. Finally, visualization of the attention areas provides an intuitive representation of the regions the model prioritizes during prediction, increasing confidence in its results. The experimental results indicate that on the self-constructed large coal dataset, LCDet achieved precision, recall, mAP50, and mAP50–95 scores of 90.4%, 91.3%, 96.5%, and 69.3%, respectively, outperforming the other models in these key metrics while maintaining a low parameter count and complexity. On the public dataset, LCDet recorded precision, recall, mAP50, and mAP50–95 values of 79.3%, 75.1%, 84.5%, and 56.2%, respectively. The model demonstrates a favorable balance between detection accuracy and complexity, making it suitable for deployment on crushing robot equipment with constrained computing resources.

Author Contributions

Conceptualization, Y.W. and J.L.; methodology, S.Z.; software, J.L.; validation, L.L., Z.L. and L.X.; formal analysis, Y.W.; resources, Y.W.; data curation, J.L.; writing—original draft preparation, L.L.; writing—review and editing, Y.W.; visualization, J.L.; supervision, Y.W.; project administration, Y.W.; funding acquisition, Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Shaanxi Province, grant number 2024JC-YBMS-366, and the Xi’an Science and Technology Plan Project, grant number 23GXFW0047. The APC was funded by the same projects.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon reasonable request from the corresponding author.

Acknowledgments

The authors would like to thank the Fully Mechanized Mining Laboratory of Xi’an University of Science and Technology for providing experimental conditions and technical support.

Conflicts of Interest

Leping Li was employed by Hunan Huanan Optoelectronics Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. International Energy Agency (IEA). Global Energy Review 2025; IEA: Paris, France, 2025; Available online: https://www.iea.org/reports/global-energy-review-2025 (accessed on 20 January 2026).
  2. BP. BP Energy Outlook 2024; BP Plc: London, UK, 2024; Available online: https://www.bp.com/content/dam/bp/business-sites/en/global/corporate/pdfs/energy-economics/energy-outlook/bp-energy-outlook-2024.pdf (accessed on 20 January 2026).
  3. Obosu, M.; Frimpong, S. Advances in Automation and Robotics: The State of the Emerging Future Mining Industry. J. Saf. Sustain. 2025, 2, 181–194. [Google Scholar] [CrossRef]
  4. Yu, Y.; Zhou, H.; Sun, K. Coal blocking detection method for underground transfer point conveyors based on MR-CAE. Evol. Syst. 2025, 16, 121. [Google Scholar] [CrossRef]
  5. Ling, J.; Fu, Z.; Yuan, X.; Ling, R. Development of a Deep Learning-Based Foreign Object Detection Algorithm for Coal Mine Conveyor Belts. Sci. Rep. 2025, 15, 42291. [Google Scholar] [CrossRef] [PubMed]
  6. Liu, W.; Tao, Q.; Wang, N.; Xiao, W.; Pan, C. YOLO-STOD: An industrial conveyor belt tear detection model based on Yolov5 algorithm. Sci. Rep. 2025, 15, 1659. [Google Scholar] [CrossRef]
  7. Ni, Y.; Cheng, H.; Hou, Y.; Guo, P. Study of Conveyor Belt Deviation Detection Based on an Improved YOLOv8 Algorithm. Sci. Rep. 2024, 14, 26876. [Google Scholar] [CrossRef] [PubMed]
  8. Wu, L.; Chen, L.; Zhang, L.; Shi, J. A lump coal detection method fusion of lightweight and attention mechanism. IEEE J. Emerg. Sel. Top. Ind. Electron. 2024, 5, 763–771. [Google Scholar] [CrossRef]
  9. Ma, H.; Xue, X.; Mao, Q.; Qi, A.; Wang, P.; Nie, Z.; Zhang, X.; Cao, X.; Zhao, Y.; Guo, Y. On the academic ideology of “coal mining is data mining”. Coal Sci. Technol. 2025, 53, 272–283. [Google Scholar]
  10. Yan, P.; Sun, Q.; Yin, N.; Hua, L.; Shang, S.; Zhang, C. Detection of coal and gangue based on improved YOLOv5.1 which embedded with the scSE module. Measurement 2022, 188, 110530. [Google Scholar] [CrossRef]
  11. Lv, Z.; Wang, W.; Xu, Z.; Zhang, K.; Fan, Y.; Song, Y. Fine-grained object detection method using attention mechanism and its application in coal–gangue detection. Appl. Soft Comput. 2021, 113, 107891. [Google Scholar] [CrossRef]
  12. Fan, Y.; Mao, S.; Li, M.; Wu, Z.; Kang, J. CM-YOLOv8: Lightweight YOLO for Coal Mine Fully Mechanized Mining Face. Sensors 2024, 24, 1866. [Google Scholar] [CrossRef]
  13. Hao, S.; He, T.; Ma, X.; Zhang, X.; Wu, Y.; Wang, H. KDBiDet: A bi-branch collaborative training algorithm based on knowledge distillation for photovoltaic hot-spot detection systems. IEEE Trans. Instrum. Meas. 2023, 73, 1–15. [Google Scholar] [CrossRef]
  14. Jiang, J.; Yang, Z.; Wu, C.; Guo, Y.; Yang, M.; Feng, W. A compatible detector based on improved YOLOv5 for hydropower device detection in AR inspection system. Expert Syst. Appl. 2023, 225, 120065. [Google Scholar] [CrossRef]
  15. Lyu, C.; Zhang, W.; Huang, H.; Zhou, Y.; Wang, Y.; Liu, Y.; Zhang, S.; Chen, K. RTMDet: An empirical study of designing real-time object detectors. arXiv 2022, arXiv:2212.07784. [Google Scholar] [CrossRef]
  16. Fan, Q.; Cui, G.; Zhao, Z.; Shen, J. Obstacle avoidance for microrobots in simulated vascular environments based on combined path planning. IEEE Robot. Autom. Lett. 2022, 7, 9794–9801. [Google Scholar] [CrossRef]
  17. Fan, L.; Liu, J.; Zhang, W.; Xu, P. Robot navigation in complex workspaces using conformal navigation transformations. IEEE Robot. Autom. Lett. 2023, 8, 192–199. [Google Scholar] [CrossRef]
  18. Zhu, H.; Alonso-Mora, J. Chance-constrained collision avoidance for MAVs in dynamic environments. IEEE Robot. Autom. Lett. 2019, 4, 776–783. [Google Scholar] [CrossRef]
  19. Chen, L.; Wu, L.; Ren, Q. A multimodal data fusion-based intelligent detection method for lump coal on underground conveyor belts in smart manufacturing. J. Ind. Inf. Integr. 2025, 48, 100997. [Google Scholar] [CrossRef]
  20. Sui, Y.; Zhang, L.; Sun, Z.; Yi, W.; Wang, M. Research on coal and gangue recognition based on the improved YOLOv7-Tiny target detection algorithm. Sensors 2024, 24, 456. [Google Scholar] [CrossRef] [PubMed]
  21. Zhang, K.; Wang, W.; Lv, Z.; Fan, Y.; Song, Y. Computer vision detection of foreign objects in coal processing using attention CNN. Eng. Appl. Artif. Intell. 2021, 102, 104242. [Google Scholar] [CrossRef]
  22. Rösmann, C.; Hoffmann, F.; Bertram, T. Integrated online trajectory planning and optimization in distinctive topologies. Robot. Auton. Syst. 2017, 88, 142–153. [Google Scholar] [CrossRef]
  23. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLO. Available online: https://github.com/ultralytics/ultralytics (accessed on 15 March 2025).
  24. Wang, C.-Y.; Liao, H.-Y.M.; Wu, Y.-H.; Chen, P.-Y.; Hsieh, J.-W.; Yeh, I.-H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 1571–1580. [Google Scholar]
  25. Wang, C.-Y.; Liao, H.Y.M.; Yeh, I.-H. Designing network design strategies through gradient path analysis. arXiv 2022, arXiv:2211.04800. [Google Scholar] [CrossRef]
  26. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
  27. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  28. Zhuang, F.; Qi, Z.; Duan, K.; Xi, D.; Zhu, Y.; Zhu, H.; Xiong, H.; He, Q. A comprehensive survey on transfer learning. Proc. IEEE 2021, 109, 43–76. [Google Scholar] [CrossRef]
  29. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The PASCAL visual object classes (VOC) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  30. Yang, W.; Zhang, X.; Ma, B.; Wang, Y.; Wu, Y.; Yan, J.; Liu, Y.; Zhang, C.; Wan, J.; Wang, Y.; et al. An open dataset for intelligent recognition and classification of abnormal conditions in longwall mining. Sci. Data 2023, 10, 416. [Google Scholar] [CrossRef] [PubMed]
  31. Zhou, F.; Zou, J.; Xue, R.; Yu, M.; Wang, X.; Xue, W.; Yao, S. Enhancing Object Detection in Underground Mines: UCM-Net and Self-Supervised Pre-Training. Sensors 2025, 25, 2103. [Google Scholar] [CrossRef] [PubMed]
  32. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
Figure 1. Overall research scheme.
Figure 1. Overall research scheme.
Sensors 26 01167 g001
Figure 2. Overall architecture of the proposed LCDet model. (a) Feature extraction stage, where multi-scale spatial and semantic features are extracted from the input image; (b) Feature fusion stage with the adaptive weighted attention mechanism (AWAM) to enhance discriminative coal features; (c) Prediction stage with decoupled heads for class scores and bounding box regression.
Figure 2. Overall architecture of the proposed LCDet model. (a) Feature extraction stage, where multi-scale spatial and semantic features are extracted from the input image; (b) Feature fusion stage with the adaptive weighted attention mechanism (AWAM) to enhance discriminative coal features; (c) Prediction stage with decoupled heads for class scores and bounding box regression.
Sensors 26 01167 g002
Figure 3. Demonstrates the comparison between standard convolution and grouped convolution. (a) Standard Convolution; (b) Grouped Convolution. The dashed lines in the cuboids are used to illustrate channel-wise grouping along the channel dimension.
Figure 3. Demonstrates the comparison between standard convolution and grouped convolution. (a) Standard Convolution; (b) Grouped Convolution. The dashed lines in the cuboids are used to illustrate channel-wise grouping along the channel dimension.
Sensors 26 01167 g003
Figure 4. Illustrates three different convolutional neural network module architectures. (a) Panel describes a residual block structure where input features go through two convolutional layers, then undergo point-wise addition, followed by channel concatenation with the original input features, and finally pass through a convolutional layer. (b) Panel shows a module with multiple convolutional layers where input features are processed through a series of convolutional layers, concatenated channel-wise, and then output through a convolutional layer. (c) Panel depicts a module with a bottleneck structure where input features first pass through a convolutional layer, then split into two paths: one goes directly through a convolutional layer, and the other through two bottleneck convolutional layers, with the outputs of both paths concatenated channel-wise and finally output through a convolutional layer.
Figure 4. Illustrates three different convolutional neural network module architectures. (a) Panel describes a residual block structure where input features go through two convolutional layers, then undergo point-wise addition, followed by channel concatenation with the original input features, and finally pass through a convolutional layer. (b) Panel shows a module with multiple convolutional layers where input features are processed through a series of convolutional layers, concatenated channel-wise, and then output through a convolutional layer. (c) Panel depicts a module with a bottleneck structure where input features first pass through a convolutional layer, then split into two paths: one goes directly through a convolutional layer, and the other through two bottleneck convolutional layers, with the outputs of both paths concatenated channel-wise and finally output through a convolutional layer.
Sensors 26 01167 g004
Figure 5. Structure of the adaptive weighted attention mechanism. The color coding distinguishes the operational components and data representations: yellow denotes global average pooling (GAP), blue global max pooling (GMP), light green the channel attention vector generated from GAP, light purple the intermediate feature map produced by channel-wise multiplication, and red the sigmoid activation function.
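The components named in the caption (GAP, GMP, a sigmoid-gated channel attention vector, channel-wise multiplication) can be combined into a small sketch. The exact fusion used by the paper's adaptive weighted attention mechanism is not reproduced here; the scalar `alpha` below is a hypothetical stand-in for the learned adaptive weight.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def awam(x, alpha=0.5):
    # x: feature map of shape (C, H, W)
    gap = x.mean(axis=(1, 2))    # global average pooling -> (C,)
    gmp = x.max(axis=(1, 2))     # global max pooling -> (C,)
    # adaptively weight the two pooled descriptors, then gate with a sigmoid
    attn = sigmoid(alpha * gap + (1.0 - alpha) * gmp)  # channel attention in (0, 1)
    # channel-wise multiplication rescales each channel of the input
    return x * attn[:, None, None]

x = np.random.default_rng(1).standard_normal((4, 6, 6))
y = awam(x)
print(y.shape)  # (4, 6, 6)
```

Because the attention values lie strictly in (0, 1), each channel is attenuated rather than amplified, which is how background channels can be suppressed relative to channels responding to large coal fragments.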
Figure 6. Illustration of the parameter transfer-based training strategy. Circles denote knowledge or data domains (e.g., source domain, target domain), while diamonds represent systems, modules, or interaction mechanisms (e.g., learning system, prior knowledge, mutual correlation).
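A common parameter-transfer recipe, and plausibly the spirit of the strategy in Figure 6, is to initialize the target-domain model with every source-domain parameter whose name and shape match, leaving mismatched layers (e.g., a re-dimensioned head) at their fresh initialization. The sketch below uses plain dicts of lists in place of framework tensors; the layer names are hypothetical.

```python
def transfer_parameters(source_state, target_state):
    """Copy source-domain parameters into the target model where both the
    parameter name and its shape (here, list length) match; everything
    else keeps its target-side initialization."""
    transferred = {
        name: weights
        for name, weights in source_state.items()
        if name in target_state and len(weights) == len(target_state[name])
    }
    target_state.update(transferred)
    return target_state, sorted(transferred)

source = {"backbone.conv1": [0.1, 0.2], "head.cls": [0.3]}
target = {"backbone.conv1": [0.0, 0.0], "head.cls": [0.0, 0.0, 0.0]}  # head re-sized
updated, moved = transfer_parameters(source, target)
print(moved)  # ['backbone.conv1']
```

Only the backbone weights transfer; the shape-mismatched head stays untouched, which is why such strategies reduce dependence on large target-domain datasets without constraining the task-specific layers.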
Figure 7. Composition of the crushing robot.
Figure 8. Example images from the public dataset.
Figure 9. Schematic diagram of the scene.
Figure 10. Large coal detection device.
Figure 11. Example images from the dataset.
Figure 12. Comparison of different object detection models. (a) Detection performance: precision, recall, and mAP@0.5. (b) Computational efficiency: model size (MB), parameters (M), and inference time (ms). Models compared: RT-DETR, YOLOv8n, YOLOv10n, YOLOv11n, and LCDet.
Figure 13. Heat-map visualization results on the self-built dataset.
Figure 14. Heat-map visualization results on the public dataset.
Figure 15. Laboratory deployment.
Table 1. Effect of different group numbers in grouped convolution.

| Group Number | Parameters (M) | GFLOPs | mAP50 (%) |
|---|---|---|---|
| 2 | 3.21 | 9.4 | 95.9 |
| 4 | 2.93 | 7.5 | 96.5 |
| 8 | 2.61 | 6.2 | 94.8 |
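The parameter reduction in Table 1 follows directly from how grouped convolution partitions the channels: each of the `groups` groups connects only `C_in/groups` input channels to `C_out/groups` output channels, shrinking the weight count by roughly the group number. The channel sizes below (128→128, 3×3) are illustrative, not the actual layer dimensions of LCDet.

```python
def conv_params(c_in, c_out, k, groups=1, bias=True):
    # Weight tensor is (c_out, c_in // groups, k, k); bias adds c_out terms
    weights = (c_in // groups) * k * k * c_out
    return weights + (c_out if bias else 0)

standard = conv_params(128, 128, 3)            # groups=1: full channel mixing
grouped4 = conv_params(128, 128, 3, groups=4)  # ~4x fewer weights
print(standard, grouped4)  # 147584 36992
```

The trade-off visible in the table is that larger group numbers also restrict cross-channel information flow, which explains the mAP drop at 8 groups despite the smaller model.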
Table 2. Camera parameters.

| Parameter Type | Parameter Value |
|---|---|
| Image Sensor | 1/2.7-inch HM2131 |
| Image Resolution | 2 MP, 1080P |
| Image Format | YUV/JPG |
| Camera Lens | F 3.0 mm, f/no. 2.4 |
| Operating Current | <200 mA |
| Sleep Current | <10 mA |
| Operating Temperature | −20 °C~50 °C |
Table 3. Experiment platform.

| Platform | Name | Manufacturer | Country |
|---|---|---|---|
| CPU | Intel(R) Core(TM) i7-13700KF | Intel | USA |
| GPU | NVIDIA GeForce RTX4080 | NVIDIA | USA |
| System | Windows 10 | Microsoft | USA |
| Framework | Pytorch 1.7 | Facebook AI Research | USA |
Table 4. The results of the ablation experiment.

| Model | EFEU | AWAM | PK | Precision (%) | Recall (%) | mAP50 (%) | mAP50–95 (%) |
|---|---|---|---|---|---|---|---|
| 1 | × | × | × | 89.7 | 89.4 | 95.9 | 68.5 |
| 2 | ✓ | × | × | 91.5 | 90.3 | 96.3 | 69.2 |
| 3 | ✓ | ✓ | × | 90.8 | 91.0 | 96.5 | 67.8 |
| 4 | ✓ | ✓ | ✓ | 90.4 | 91.3 | 96.5 | 69.3 |
Table 5. Ablation results of AWAM on the DsLMF+ dataset.

| Model | Precision (%) | Recall (%) | mAP50 (%) | mAP50–95 (%) |
|---|---|---|---|---|
| w/o AWAM | 79.1 | 73.8 | 83.6 | 55.7 |
| Full | 79.3 | 75.1 | 84.5 | 56.2 |
Table 6. Quantitative comparison results on the public dataset.

| Model | Precision (%) | Recall (%) | mAP50 (%) | mAP50–95 (%) | Parameters (M) | FLOPs (G) | Model Size (MB) |
|---|---|---|---|---|---|---|---|
| RT-DETR | 73.7 | 73.6 | 80.4 | 51.0 | 31.99 | 103.4 | 64.6 |
| UCM-Net [31] | 79.2 | 74.9 | 84.0 | 55.3 | 8.96 | 24.8 | 18.37 |
| CM-YOLOv8 [12] | 78.4 | 76.6 | 84.1 | 55.4 | 15.6 | 43.7 | 31.8 |
| YOLOv8s | 79.7 | 76.4 | 85.2 | 57.3 | 11.13 | 28.4 | 22.0 |
| YOLOv10s | 79.2 | 75.8 | 84.8 | 57.4 | 8.04 | 24.4 | 16.15 |
| YOLOv11n | 78.9 | 74.4 | 83.9 | 55.7 | 2.58 | 6.3 | 5.35 |
| LCDet | 79.3 | 75.1 | 84.5 | 56.2 | 2.93 | 7.5 | 5.99 |
Table 7. Quantitative comparison results on the self-built dataset.

| Model | Precision (%) | Recall (%) | mAP50 (%) | mAP50–95 (%) | Parameters (M) | FLOPs (G) | Model Size (MB) |
|---|---|---|---|---|---|---|---|
| RT-DETR | 88.4 | 91.4 | 94.7 | 65.1 | 31.99 | 103.4 | 64.6 |
| YOLOv8s | 92.2 | 86.3 | 95.7 | 68.0 | 11.13 | 28.4 | 22.0 |
| YOLOv10s | 89.7 | 87.5 | 94.0 | 67.1 | 8.04 | 24.4 | 16.15 |
| YOLOv11n | 87.6 | 90.2 | 94.9 | 66.0 | 2.58 | 6.3 | 5.35 |
| LCDet | 90.4 | 91.3 | 96.5 | 69.3 | 2.93 | 7.5 | 5.99 |

Share and Cite

MDPI and ACS Style

Wang, Y.; Lei, J.; Li, L.; Lu, Z.; Xu, L.; Zhao, S. Vision-Based Detection of Large Coal Fragments in Fully Mechanized Mining Faces Using Adaptive Weighted Attention and Transfer Learning. Sensors 2026, 26, 1167. https://doi.org/10.3390/s26041167



