DEL_YOLO: A Lightweight Coal-Gangue Detection Model for Limited Equipment

Qiuyue Zhang; Shuguang Miao; Sen Fan; Mengxu Guo; Xiang Liu

doi:10.3390/sym17050745

¹ School of Physics and Electronic Information, Huaibei Normal University, Huaibei 235000, China

² Anhui Province Key Laboratory of Intelligent Computing and Applications, Huaibei 235000, China

* Author to whom correspondence should be addressed.

Symmetry 2025, 17(5), 745; https://doi.org/10.3390/sym17050745

This article belongs to the Section Computer

Abstract

Gangue mixed into raw coal differs only slightly from coal in its visual features, and existing gangue recognition methods are generally slow and difficult to deploy on edge devices. To address these problems, a lightweight coal-gangue target detection algorithm is proposed. First, the lightweight EfficientViT module is adopted as the backbone of the network. Second, the DRBNCSPELAN4 module is introduced to better capture target information at different scales. Finally, a lightweight shared convolutional detection head, Detect_LSCD, is constructed to further reduce the model size and improve the detection speed for coal and gangue. The experimental results indicate that, compared with the original algorithm, the proposed model improves mAP@50–95 by 1.2% and reduces the model weight size, the number of parameters, and the floating-point operations by 52.34%, 55.35%, and 50.35%, respectively, while inference on a Raspberry Pi 4B device is accelerated by 20.87%. For coal gangue sorting, the algorithm therefore combines high-precision, real-time detection with a substantially lighter model, making it more suitable for deployment on edge equipment that controls a robotic arm for gangue sorting.

Keywords: coal gangue identification; real-time; lightweight; edge device deployment

1. Introduction

China is the world’s leading producer of coal, and the country’s development is significantly affected by the advancement of green mining practices and the efficient utilization of coal resources. During coal mining, the raw coal is mixed with gangue that accompanies the coal seam. Gangue is a solid waste with low calorific value, high water content, and high impurity content [1]. Its relatively high content of sulfur dioxide and other harmful substances can pollute rivers, the atmosphere, and soil. Because the distribution of coal seams and gangue varies greatly between mining areas, and because the working face changes constantly, it is difficult to obtain the gangue content and its distribution accurately. To support the country’s comprehensive conservation plan, protect national resource security, steadily advance carbon neutrality, and accelerate the country’s green transformation, research into coal gangue sorting is of particular importance to the coal mining field [2].

Presently, coal gangue separation in China relies on traditional methods, including manual picking, mechanical separation, mechanical dry separation, and mechanical wet separation [3]. Manual picking suffers from low sorting efficiency, a poor working environment, high labor intensity, and high cost. Traditional mechanical separation methods generally have low identification accuracy, a large footprint, high investment costs, and serious environmental pollution, among other shortcomings [4]. Mechanical dry sorting methods, such as spectroscopy and X-ray diffraction, pose a risk of radiation exposure to the human body, while mechanical wet sorting methods waste water, cause secondary pollution, and do not meet the needs of green manufacturing. Computer vision therefore offers an effective alternative for coal gangue sorting: the collected images of coal and gangue are identified and classified automatically, which facilitates the intelligent construction of coal mines. There are two main approaches to detecting coal and gangue with computer vision: traditional image recognition methods and deep learning recognition methods [5]. Traditional image recognition methods require the image features of coal and gangue to be extracted manually and suffer from low recognition accuracy, sensitivity to image changes, and poor generalization ability [6], whereas deep learning recognition methods offer high accuracy, real-time performance, and robustness.

Deep learning is a branch of machine learning that focuses on learning representations of data. Deep learning architectures include the convolutional neural network (CNN), the deep belief network, the recursive neural network, and others [7]. With the continuous progress of artificial intelligence, deep learning algorithms have been increasingly used for coal and gangue identification. Xu et al. [8] optimized a CNN-based gangue recognition model by pruning, thereby reducing the number of model parameters and the computational requirements. However, their work did not take the detection speed of the gangue recognition process into account. Cao Xiangang et al. [9] proposed a deep learning-based gangue recognition method that uses an RPN network structure to extract candidate gangue regions and then predict their categories. However, the accuracy of this method on mixed samples of coal and gangue was only 90.17%. Shi Yikun et al. [10] employed the YOLOv5s model as the baseline network and introduced the content-aware reassembly of features (CARAFE) module into the backbone network to strengthen feature extraction, but they have not yet deployed the model on real-world devices with limited computational resources.

Most of the aforementioned research concentrates solely on identifying coal and gangue; nevertheless, knowing the center position of each piece of coal and gangue is also essential for the subsequent separation processes that involve robotic arms, blowing mechanisms, and other components [11]. Moreover, to be applied to practical intelligent sorting of coal gangue, a detection model must also be small and fast. In the area of target detection, YOLOv5 is widely used in embedded scenarios, but despite its advantages it struggles to meet real-time requirements on devices with limited computing resources. YOLOv8 is a newer model that improves on YOLOv5 and increases the detection accuracy while reducing the number of parameters, which makes it better suited to the lightweight and real-time requirements of intelligent coal gangue sorting. Therefore, a lightweight, real-time, and efficient DEL-YOLOv8s model for intelligent coal gangue sorting is proposed. It combines the EfficientViT module, the DRBNCSPELAN4 module, and the Detect_LSCD detection head to reduce the number of parameters and floating-point operations, which improves the recognition speed of the model and makes it easier to deploy on edge devices such as the Raspberry Pi.

2. Materials and Methods

2.1. Baseline Model

On 10 January 2023, Ultralytics released YOLOv8, a single-stage detection algorithm that can detect objects in real time with high accuracy [12]. It is faster and more accurate than the original You Only Look Once (YOLO) algorithm and can run on both CPUs and GPUs. It consists of four main parts: input, backbone network, neck network, and detection head [13]. The backbone network is based on the CSP idea and is essentially analogous to that of YOLOv5, except that the C3 module has been replaced with the C2f module, which provides richer gradient-flow information while keeping the network lightweight. The head replaces the anchor-based mechanism with an anchor-free mechanism [14] and adopts a decoupled detection head (DDH) design that separates classification prediction from bounding box regression. The loss function design incorporates TOOD’s Task-aligned Assigner, which guides the matching of positive and negative samples [15]: positive samples are selected according to a weighted combination of the classification and regression scores. The model network structure is shown in Figure 1.

Figure 1. YOLOv8 model network structure diagram.
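The task-aligned selection of positive samples mentioned in Section 2.1 can be illustrated with a minimal sketch. It assumes the TOOD-style alignment metric t = s^α · u^β, where s is the predicted classification score and u is the IoU between the predicted and ground-truth boxes. The exponents (α = 0.5, β = 6.0), the top-k value, and the candidate scores below are illustrative assumptions rather than settings reported in this article.

```python
def alignment_metric(cls_score, iou, alpha=0.5, beta=6.0):
    """Task-alignment score t = s**alpha * u**beta (alpha and beta are assumed values)."""
    return (cls_score ** alpha) * (iou ** beta)


def select_positives(candidates, top_k=2):
    """Return the indices of the top_k candidates ranked by alignment score.

    candidates: list of (cls_score, iou) pairs for one ground-truth object.
    """
    scores = [alignment_metric(s, u) for s, u in candidates]
    ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


# Hypothetical predictions for a single gangue object: (class score, IoU).
candidates = [(0.9, 0.3), (0.6, 0.8), (0.4, 0.9), (0.95, 0.5)]
print(select_positives(candidates))
```

With a large β, a confident but poorly localized prediction (such as the first candidate) ranks below a less confident but well-localized one, which is the behavior the weighting is meant to encourage.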

2.2. DEL-YOLOv8s Target Detection Algorithm

The baseline model is available in five variants, designated n, s, m, l, and x, which differ in network depth, feature map width, and required computational resources [16]. During sorting, coal and gangue are usually mixed, so the images collected from the conveyor belt contain both; these images are labeled and preprocessed before being used as input data for the model. To improve performance in the gangue sorting scenario, the objective is to minimize the number of parameters and the computational complexity of the model so that inference is faster on edge computing devices, while taking into account parameter count, model size, processing speed, real-time performance, and complexity. Therefore, YOLOv8s [17] is chosen as the baseline framework. The EfficientViT module replaces the backbone network of the original algorithm, and the DRBNCSPELAN4 (RepNCSPELAN4) module replaces the C2f module in the neck network, combining shallow and deep feature information to further reduce the number of parameters and the complexity of the network while maintaining the model’s efficiency. Finally, the self-developed lightweight detection head, Detect_LSCD, is employed; its shared convolutions significantly reduce the number of parameters. The resulting model is lighter and better suited to deployment on resource-constrained devices. Figure 2 displays the network structure.

Figure 2. DEL-YOLOv8s model network structure diagram.

2.2.1. EfficientViT Backbone Network

The original target detection algorithm’s backbone network comprises a series of convolutional layers and residual modules which, when stacked together, result in an oversized model. This limits both training and inference speed and makes the model unsuitable for deployment on edge devices [18]. Furthermore, the traditional backbone structure has limitations in handling cross-scale information, which makes it difficult to capture the global features of coal and gangue accurately. To achieve a lightweight network model and extract image features efficiently, the efficient vision transformer (EfficientViT) proposed by Cai et al. [19] is used as the backbone network for feature extraction. It combines the advantages of the convolutional neural network (CNN) and the vision transformer (ViT), with a new multi-scale linear attention module at its core, which provides a global receptive field and multi-scale learning through lightweight and hardware-efficient operations. The EfficientViT backbone captures local detail in coal and gangue images more effectively by introducing overlap patch embedding, which divides the input image into overlapping patches [20]. In contrast to conventional CNN backbone networks, EfficientViT incorporates a mezzanine layout module, namely the EfficientViT Block, to make feature extraction more efficient. The macro-architecture is shown in Figure 3a.

The EfficientViT Block consists of a lightweight multi-scale linear attention (MSA) module and a feed-forward network (FFN) with depthwise convolution (DWConv) [21]. The MSA module captures contextual information and establishes global dependencies between different regions of the image, while the combination of FFN and DWConv extracts local information effectively. The FFN uses pointwise nonlinear transformations to enhance the feature representation. DWConv is inserted into the FFN layer to further improve local feature capture; it decomposes a standard convolution into a depthwise convolution and a pointwise convolution, which significantly reduces the number of parameters and the computational overhead and thus keeps the model lightweight and efficient. The two modules reinforce each other and together strengthen the overall feature representation. This fusion can be seen as a symmetrical balance between local and global feature processing. The EfficientViT building block is shown in Figure 3b. The MSA module uses a lightweight ReLU linear attention mechanism instead of the traditional softmax attention, which retains a global receptive field while lowering the computational cost. The MSA module structure is shown in Figure 3c.
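The parameter saving obtained by decomposing a standard convolution into a depthwise and a pointwise convolution can be checked with a short calculation. The sketch below counts weights only (bias and normalization parameters are ignored), and the layer sizes are hypothetical values chosen for illustration, not taken from the DEL-YOLOv8s configuration.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights of a depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 128 input channels, 128 output channels, 3 x 3 kernel.
c_in, c_out, k = 128, 128, 3
std = standard_conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)
print(f"standard: {std}, depthwise separable: {sep}, ratio: {sep / std:.3f}")
```

For a k × k kernel the ratio is 1/c_out + 1/k², so the saving grows with the number of output channels and the kernel size.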

Figure 3. EfficientViT network architecture: (a) macro-architecture; (b) EfficientViT building block; and (c) multi-scale linear attention mechanism.

Given an input feature x \in R^{N \times f}, the generalized form of softmax attention can be written as follows:

O_{i} = \sum_{j=1}^{N} \frac{Sim(Q_{i}, K_{j})}{\sum_{j=1}^{N} Sim(Q_{i}, K_{j})} V_{j}

(1)

where Q = x W_{Q}, K = x W_{K}, and V = x W_{V}; W_{Q}, W_{K}, W_{V} \in R^{f \times d} are linear projection matrices; O_{i} and Q_{i} are the ith rows of the output matrix O and the query matrix Q, respectively; and Sim(Q, K) is the similarity function. Softmax attention corresponds to Sim(Q, K) = \exp(Q K^{T} / \sqrt{d}), whereas ReLU linear attention uses the following similarity function:

Sim(Q, K) = ReLU(Q) \cdot ReLU(K)^{T}

With this similarity function, the associative property of matrix multiplication reduces the computational complexity and memory requirements from quadratic to linear in N. During the computation, it is sufficient to compute

\sum_{j=1}^{N} [ReLU(K_{j})^{T} V_{j}]

and

\sum_{j=1}^{N} [ReLU(K_{j})^{T}]

only once and reuse them for every query. The whole computation therefore requires only O(N) computation and memory, avoids the softmax operation, and markedly improves hardware efficiency.

Ultimately, the mathematical expression for the ReLU linear attention is as follows:

O_{i} = \frac{ReLU(Q_{i}) (\sum_{j=1}^{N} ReLU(K_{j})^{T} V_{j})}{ReLU(Q_{i}) (\sum_{j=1}^{N} ReLU(K_{j})^{T})}

(2)
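The equivalence between the pairwise form of attention and the reordered ReLU linear attention can be verified numerically. The following is a minimal pure-Python sketch for small matrices stored as nested lists. The function names and the small constant EPS added to the denominators are assumptions made for illustration; this is not the EfficientViT implementation.

```python
EPS = 1e-12  # assumed guard against division by zero


def relu(vec):
    return [max(0.0, a) for a in vec]


def quadratic_relu_attention(Q, K, V):
    """Pairwise form: every query is compared with every key, O(N^2) similarities."""
    out = []
    for q in Q:
        rq = relu(q)
        sims = [sum(a * b for a, b in zip(rq, relu(k))) for k in K]
        denom = sum(sims) + EPS
        out.append([sum(s * v[c] for s, v in zip(sims, V)) / denom
                    for c in range(len(V[0]))])
    return out


def linear_relu_attention(Q, K, V):
    """Reordered form: the two key/value sums are computed once and reused per query."""
    d, d_v = len(K[0]), len(V[0])
    kv = [[0.0] * d_v for _ in range(d)]  # sum_j ReLU(K_j)^T V_j, shape d x d_v
    k_sum = [0.0] * d                     # sum_j ReLU(K_j)^T, shape d
    for k, v in zip(K, V):
        rk = relu(k)
        for a in range(d):
            k_sum[a] += rk[a]
            for c in range(d_v):
                kv[a][c] += rk[a] * v[c]
    out = []
    for q in Q:
        rq = relu(q)
        denom = sum(a * b for a, b in zip(rq, k_sum)) + EPS
        out.append([sum(rq[a] * kv[a][c] for a in range(d)) / denom
                    for c in range(d_v)])
    return out


# Hypothetical inputs: N = 3 tokens, d = 2, value dimension 3.
Q = [[1.0, -2.0], [0.5, 3.0], [2.0, 1.0]]
K = [[0.3, 1.0], [-1.0, 2.0], [1.5, -0.5]]
V = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [3.0, 1.0, 0.0]]
print(linear_relu_attention(Q, K, V))
```

Both functions return the same values up to floating-point error, but the second touches each key and value only once, which is where the linear cost in N comes from.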

In addition, to improve the multi-scale learning capacity of ReLU linear attention, the MSA module first merges all DWConvs into a single DWConv and introduces group convolution. Next, the global attention mechanism is used to extract global features across a range of scales. Finally, the MSA module concatenates the features from each scale along the head dimension and passes them to the final linear projection layer for feature fusion.

2.2.2. DRBNCSPELAN4 Module

Coal and gangue share certain physical characteristics; for example, under some conditions their colors are so close that they are difficult to distinguish with the naked eye. Moreover, raw coal and gangue pass through a series of processes, including collection, transportation, and transshipment, which can leave them with irregular physical properties by the time they enter the gangue selection procedure. This further complicates identification. The C2f module introduced in YOLOv8 is a feature fusion module capable of fusing feature maps at different scales. Although it increases the accuracy and robustness of detection by improving the capture of detail and semantic information, its feature fusion operation increases computational complexity, which makes it difficult to balance performance against computational resources. It also produces redundant features that can easily trigger overfitting, and it involves a considerable amount of standard convolution and frequent memory accesses, which slows detection and makes deployment difficult. The DRBNCSPELAN4 module, otherwise known as the Rep-Net with cross-stage partial CSP and ELAN (RepNCSPELAN4) module [22], is therefore employed in place of the C2f module. The RepNCSPELAN4 module combines the strengths of the cross-stage partial network (CSPNet) and the efficient layer aggregation network (ELAN) to improve feature representation. It splits and merges feature matrices to reduce unnecessary computational overhead and uses layer aggregation to further strengthen feature extraction. The overall network structure of the RepNCSPELAN4 module is shown in Figure 4a. The RepNCSPELAN4 module extracts and transforms the input features through convolutional layers. It raises the spatial resolution of the feature maps through up-sampling and connects them with the feature maps of the preceding layer, thereby integrating multi-scale information effectively.

Figure 4. RepNCSPELAN4 network structure diagram: (a) general structure of RepNCSPELAN4; (b) internal structure of RepNCSP; and (c) RepNBottleneck structure.

RepNCSPELAN4 mainly consists of Conv and RepNCSP modules, where RepNCSP is composed of CBS modules and a varying number of RepNBottleneck modules [23], as illustrated in Figure 4b. The CBS modules pass feature data efficiently between the layers of the network, which helps to avoid vanishing gradients during training. RepNCSPELAN4 has a modular design: the number of RepNBottleneck modules can be increased or decreased to adjust the width and depth of the model to suit different datasets and computational resources. RepNBottleneck is a basic module with a residual structure, and its internal network structure is shown in Figure 4c. The figure also shows the internal components of CBS and CB, which comprise Conv2d, BatchNorm2d, and the SiLU activation function. These components help to alleviate vanishing and exploding gradients in deep neural networks and thus improve the stability and accuracy of model training [24].

2.2.3. Detect_LSCD Detection Head

The original network has three detection heads, which follow an anchor-free, decoupled-head design. Each detection head is divided into two branches, regression and classification. Each branch consists of two Conv modules with 3 × 3 kernels and one Conv2d module with a 1 × 1 kernel, which are used to extract image features [25]. However, because each detection head applies its own two 3 × 3 convolutions and one 1 × 1 convolution independently, the number of parameters and the amount of computation increase [26]. Accordingly, a lightweight shared convolutional detection head (LSCD) is proposed, with the structure depicted in Figure 5. The core idea of the Detect_LSCD detection head is to integrate the two branches of the original detection head into a single shared branch for feature extraction, expressed as follows:

h_{out} = f_{w}(h_{in})

(3)

where h_{out} represents the output feature; h_{in} signifies the input feature; and f_{w}(\cdot) is the feature extraction operation.

Figure 5. Detect_LSCD network architecture.

To avoid the loss of accuracy that the lightweight design might otherwise cause, the Share_Conv module replaces the convolution module of the original detection head [27]. In addition, the Share_Conv module replaces the batch normalization (BN) layer in the CBS convolution module with a group normalization (GN) layer [28]. While the BN layer is effective against internal covariate shift and thus mitigates gradient saturation, its performance depends on the batch size. In contrast, the GN layer divides the feature channels into groups and computes the mean and variance independently for each group, normalizing the channels within that group. Its statistics depend only on the input channels C and are independent of the batch size N, which reduces the computational cost of the lightweight model. The GN layer also maintains stable accuracy under different batch sizes, which improves the adaptability and generalization ability of the model [29].
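The batch-size independence of group normalization can be seen in a small pure-Python sketch. It normalizes without learnable scale and shift parameters, which is a simplification; the tensor layout (a list of channels, each a flat list of spatial values) and the sample values are hypothetical.

```python
import math


def _normalize(values, eps=1e-5):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def group_norm(sample, num_groups):
    """Normalize one sample; statistics come from channel groups of that sample only."""
    size = len(sample) // num_groups  # channels per group
    n = len(sample[0])                # spatial values per channel
    out = []
    for g in range(num_groups):
        group = sample[g * size:(g + 1) * size]
        normed = _normalize([v for ch in group for v in ch])
        out.extend(normed[i * n:(i + 1) * n] for i in range(size))
    return out


def batch_norm(batch):
    """Normalize a batch; statistics for each channel are pooled over all samples."""
    c, n = len(batch[0]), len(batch[0][0])
    out = [[None] * c for _ in batch]
    for ch in range(c):
        normed = _normalize([v for s in batch for v in s[ch]])
        for b in range(len(batch)):
            out[b][ch] = normed[b * n:(b + 1) * n]
    return out


# Hypothetical sample with 4 channels and 2 spatial values per channel.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
other_a = [[0.0, 0.0]] * 4
other_b = [[10.0, 20.0]] * 4

# The group-normalized x does not depend on which samples share its batch,
# whereas the batch-normalized x changes with its batch companions.
gn_with_a = [group_norm(s, 2) for s in (x, other_a)][0]
gn_with_b = [group_norm(s, 2) for s in (x, other_b)][0]
print(gn_with_a == gn_with_b)
print(batch_norm([x, other_a])[0] == batch_norm([x, other_b])[0])
```

This is why GN tends to be preferred when the batch size is small or varies, as is common on memory-limited hardware.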

After the detection layer receives the feature map, a shared convolution module (Conv_GN) with a 1 × 1 kernel first adjusts the feature map dimensions and increases information exchange across channels. Second, two Conv_GN modules with 3 × 3 kernels are connected in series for further feature extraction, which reduces redundancy by aggregating information and enhances mutual learning between neighboring features. The processing flow is then divided into a localization branch (Conv_Reg, 3 × 3) and a classification branch (Conv_Cls, 3 × 3). Each branch shares parameters across scales, taking the features extracted by the Conv_GN modules and feeding them to the regression and classification heads. Finally, to account for the target scale, a scale layer rescales the outputs by a different factor for each detection level [30], which helps the model retain multi-scale features and thereby improves its performance in multi-scale detection tasks.

The Detect_LSCD detection head applies a shared convolution strategy to reduce the number of parameters and the computational complexity of the model, thereby reducing resource consumption and facilitating deployment in resource-constrained environments or on edge computing devices [31]. LSCD also incorporates a dynamic adjustment mechanism that adapts the anchor points and strides to the dimensions of the input image and the resulting feature maps, which makes the head compatible with inputs of different sizes. In addition, the classification and regression branches are simplified symmetrically, which reduces redundant computation.
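The effect of sharing convolutions across the three detection scales can be estimated with a rough weight count. The sketch below compares a decoupled head, in which each scale has its own 3 × 3, 3 × 3, and 1 × 1 convolutions per branch, with a shared head laid out as described above. All channel widths and output sizes are hypothetical assumptions chosen for illustration, and normalization and bias parameters are ignored, so the totals do not reproduce the figures reported in Table 2.

```python
def conv_params(c_in, c_out, k):
    """Weights of a k x k convolution (bias and normalization ignored)."""
    return k * k * c_in * c_out


def decoupled_head_params(channels, reg_hidden, cls_hidden, reg_out, cls_out):
    """Each scale owns a regression and a classification branch: 3x3 -> 3x3 -> 1x1."""
    total = 0
    for c in channels:
        total += (conv_params(c, reg_hidden, 3)
                  + conv_params(reg_hidden, reg_hidden, 3)
                  + conv_params(reg_hidden, reg_out, 1))
        total += (conv_params(c, cls_hidden, 3)
                  + conv_params(cls_hidden, cls_hidden, 3)
                  + conv_params(cls_hidden, cls_out, 1))
    return total


def shared_head_params(channels, hidden, reg_out, cls_out):
    """Per-scale 1x1 projection, then convolutions shared by all scales."""
    total = sum(conv_params(c, hidden, 1) for c in channels)  # one 1x1 per scale
    total += 2 * conv_params(hidden, hidden, 3)               # two shared 3x3 convolutions
    total += conv_params(hidden, reg_out, 3)                  # shared regression convolution
    total += conv_params(hidden, cls_out, 3)                  # shared classification convolution
    return total


# Hypothetical configuration: three scales, two classes (coal and gangue).
channels = (128, 256, 512)
decoupled = decoupled_head_params(channels, reg_hidden=64, cls_hidden=128,
                                  reg_out=64, cls_out=2)
shared = shared_head_params(channels, hidden=128, reg_out=64, cls_out=2)
print(f"decoupled: {decoupled}, shared: {shared}, ratio: {shared / decoupled:.2f}")
```

Under these assumptions most of the saving comes from the 3 × 3 convolutions, which are instantiated once instead of once per scale and per branch.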

3. Experimental Results and Analysis

3.1. Experimental Environment

The experiments use PyTorch 2.1.1 as the framework with CUDA 12.0 and Python 3.8, on hardware comprising NVIDIA GeForce RTX 3090 GPUs and Intel Xeon(R) Silver 4210R CPUs. In addition, a Linux-based Raspberry Pi 4B with a display is used to verify the performance of the model.

In the training phase, the input image size is 320 × 320 pixels, training runs for 200 epochs, and the model is optimized using a stochastic gradient descent (SGD) optimizer with an initial learning rate of 0.01 and a momentum factor of 0.937.
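For readers who wish to reproduce a comparable setup, the training settings above can be written with the argument names used by the Ultralytics training interface. This is only a sketch: the model and dataset configuration file names are hypothetical placeholders, and the article does not state that training was launched in this way.

```python
# Hyperparameters from Section 3.1, keyed by Ultralytics-style argument names.
train_args = {
    "imgsz": 320,        # input image size (320 x 320 pixels)
    "epochs": 200,       # number of training epochs
    "optimizer": "SGD",  # stochastic gradient descent
    "lr0": 0.01,         # initial learning rate
    "momentum": 0.937,   # momentum factor
}


def launch_training(model_cfg="del-yolov8s.yaml", data_cfg="coal_gangue.yaml"):
    """Start training; requires the `ultralytics` package.

    Both file names are hypothetical placeholders for a model definition
    and a dataset description, neither of which is provided in the article.
    """
    from ultralytics import YOLO  # imported lazily so the dict is usable without it

    model = YOLO(model_cfg)
    return model.train(data=data_cfg, **train_args)
```

Any setting not listed here would fall back to the library defaults, which may differ from those used by the authors.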

3.2. Datasets

The dataset used in the experiments combines an open coal gangue dataset with a self-built dataset of coal and gangue samples from the Zhuzhuang coal mine in Huaibei City, Anhui Province, for a total of 3264 images. During image acquisition, industrial environmental conditions such as temperature, light, and dirt affect the sample images. If the temperature is too high, the image sensor can introduce noise, color shift, and other problems into the images of coal and gangue. If the lighting is too strong, the images are overexposed, the appearance of coal and gangue becomes indistinct, and important detail is lost; if the lighting is insufficient, the images are dim and coal is difficult to distinguish from gangue. In addition, if the lens is dirty, the images contain blurred areas that interfere with feature extraction. The samples were placed horizontally on a conveyor belt at random. Two types of equipment were used to capture the images: an OpenMV programmable camera and an industrial camera. Each image contains four to eight pieces of coal and gangue. The dataset is shown in Figure 6. All captured images were then manually labeled with the coal and gangue categories and location information using the “Labelimg” labeling tool to generate XML label files, which were converted to TXT label files with a format-conversion script written in PyCharm. Finally, the images in the dataset were shuffled and divided into training, validation, and testing sets at a ratio of 7:2:1.

Figure 6. Example diagram of a dataset.
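The label conversion and the 7:2:1 split described in Section 3.2 can be sketched as follows. The sketch assumes that the XML files store corner coordinates (xmin, ymin, xmax, ymax) in pixels and that the TXT files use the normalized "class cx cy w h" layout; the class order, the random seed, and the rounding rule used for the split are assumptions, since the article does not give the conversion code.

```python
import random

CLASSES = ["coal", "gangue"]  # assumed class order


def voc_box_to_yolo(xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert pixel corner coordinates to normalized center, width, and height."""
    cx = (xmin + xmax) / 2.0 / img_w
    cy = (ymin + ymax) / 2.0 / img_h
    w = (xmax - xmin) / img_w
    h = (ymax - ymin) / img_h
    return cx, cy, w, h


def yolo_label_line(class_name, box, img_w, img_h):
    """Format one object as a line of a TXT label file."""
    cx, cy, w, h = voc_box_to_yolo(*box, img_w, img_h)
    return f"{CLASSES.index(class_name)} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


def split_dataset(items, ratios=(7, 2, 1), seed=0):
    """Shuffle the items and split them into training, validation, and testing sets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])


# Hypothetical gangue box in a 320 x 320 image.
print(yolo_label_line("gangue", (80, 40, 240, 200), 320, 320))

# Splitting 3264 image indices at 7:2:1 with floor rounding.
train, val, test = split_dataset(range(3264))
print(len(train), len(val), len(test))
```

With floor rounding, 3264 images yield 2284 training, 652 validation, and 328 testing images; a different rounding rule would shift these counts by one or two.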

3.3. Evaluation Indicators

To evaluate the performance and efficiency of the model in coal and gangue recognition and classification comprehensively, several key indicators were selected: model size, number of parameters, floating-point operations (FLOPs), precision (P), recall (R), and mean average precision (mAP@50–95). FLOPs and the number of parameters are used to assess the computational complexity and the size of the model, respectively. Lower values indicate a more lightweight model.

Gangue sorting is a binary classification problem, and the confusion matrix is given in Table 1. TP is the number of targets correctly identified as coal or gangue. FP is the number of objects that are not coal or gangue but have been misidentified as coal or gangue. FN is the number of coal or gangue targets that were not identified. TN is the number of objects that are not coal or gangue and were correctly left unidentified.

Table 1. Confusion matrix.

Precision (P) is the proportion of instances predicted as positive that are actually positive. Recall (R) is the proportion of actual positive instances that the model correctly recognizes as positive. The F1 score is the harmonic mean of precision and recall [32] and summarizes the balance between the two. The mean average precision (mAP) is a widely used evaluation metric in target detection that measures the average accuracy of a model across categories and intersection over union (IoU) thresholds. The mAP@50–95 is the mean of the average precision (AP) over all categories, averaged over IoU thresholds from 0.5 to 0.95 at intervals of 0.05, for a total of 10 thresholds. Because it accounts for performance at varying IoU levels, it provides a more comprehensive assessment of a detection model [33]. The metrics are calculated as follows:

P = \frac{TP}{TP + FP}

(4)

R = \frac{TP}{TP + FN}

(5)

F_{1} = 2 \frac{P R}{P + R}

(6)

AP = \int_{0}^{1} P(R) dR

(7)

mAP = \frac{1}{N} \sum_{i=1}^{N} AP_{i}

(8)

mAP@50-95 = \frac{1}{10} \sum_{IoU=0.5}^{0.95} mAP(IoU)

(9)

where AP is the average precision, i.e., the precision averaged over all recall levels (the area under the precision–recall curve), and mAP is obtained by averaging the AP values over all N categories.
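Equations (4)–(9) can be written as a few short functions. This sketch approximates the integral in Equation (7) with the trapezoidal rule over sampled points of the precision–recall curve, which is a simplification; evaluation toolkits typically apply interpolation to the curve before integrating. The function names and the counts used in the example are hypothetical.

```python
def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    return 2 * p * r / (p + r) if (p + r) else 0.0


def average_precision(recalls, precisions):
    """Trapezoidal approximation of the area under P(R); recalls must be ascending."""
    ap = 0.0
    for i in range(1, len(recalls)):
        ap += (recalls[i] - recalls[i - 1]) * (precisions[i] + precisions[i - 1]) / 2.0
    return ap


def mean_average_precision(ap_per_class):
    return sum(ap_per_class) / len(ap_per_class)


def iou_thresholds():
    """The 10 IoU thresholds 0.50, 0.55, ..., 0.95."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def map_50_95(map_at_threshold):
    """Average the mAP values over the 10 IoU thresholds (dict keyed by threshold)."""
    thresholds = iou_thresholds()
    return sum(map_at_threshold[t] for t in thresholds) / len(thresholds)


# Hypothetical counts: 90 true positives, 10 false positives, 30 false negatives.
p, r = precision(90, 10), recall(90, 30)
print(p, r, round(f1_score(p, r), 4))
```

For example, a precision of 0.9 and a recall of 0.75 give an F1 score of about 0.818, which lies closer to the smaller of the two values, as expected for a harmonic mean.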

The inference time of the model was recorded during validation on the Raspberry Pi to assess its real-time performance on the edge device.

3.4. Model Training Performance

To demonstrate the effectiveness of the training, the model was trained for coal and gangue target detection with YOLOv8 as the baseline, and its performance was analyzed from multiple dimensions. Figure 7 illustrates the F1-confidence curve for the final model. As the confidence level gradually increases, the F1 value first increases and then decreases. At a confidence level of 0.537, the F1 value for all categories reaches 0.95, indicating that the model achieves a good balance between precision and recall at this confidence level.

Figure 7. F1-Confidence Curve.

The loss function curves reflect the convergence and stability of the model during training. Figure 8 shows the box_loss, cls_loss, and dfl_loss curves during training and validation, as well as the metric curves for precision, recall, mAP50, and mAP50–95.

At the beginning of training, train/box_loss is high, at about 4, which indicates that the model localizes the target poorly in the initial phase and that the prediction bias is large. As the number of training rounds increases, the parameters are continually optimized and the positional features of the target are gradually learned. The train/box_loss decreases rapidly and then stabilizes, indicating that the predicted bounding boxes become more accurate. The initial value of train/cls_loss is around 3.5, reflecting the difficulty the model has in extracting coal and gangue features accurately in the early stages. As training progresses, train/cls_loss decreases significantly and eventually stabilizes at around 0.6, indicating that the model discriminates more accurately between coal and gangue. The train/dfl_loss is about 4 at the beginning of training. As the model refines its bounding box regression, the loss decreases steadily and eventually stabilizes at about 1.2, indicating good fine localization of the target boxes. In the initial phase, the val/cls_loss curve shows a break, because the parameters are still relatively random and the predicted sample categories deviate strongly. The model then learns the coal and gangue characteristics efficiently, and the loss decreases rapidly after about 20 rounds. As the number of training rounds increases, val/cls_loss decreases gradually and then stabilizes at approximately 0.3. The overall trend of the loss curves on the validation set is similar to that on the training set: both decrease and then stabilize.

In addition, the curves for the four assessment metrics, precision, recall, mAP50, and mAP50–95, increase steadily and then converge. The model therefore shows no sign of overfitting, which indicates good generalization ability and good expected performance on new data. The metrics/precision (B) curve shows that the precision is close to 0 at the beginning of training and then increases substantially, approaching 1.0 by the 200th round, indicating that an increasingly large proportion of the samples predicted as positive are truly positive. The metrics/recall (B) curve shows that, because of random initialization, the model recognizes targets poorly at the beginning of training. The recall is therefore unstable and the curve shows an interruption, in the form of a rapid increase followed by an immediate decrease. As the number of training rounds increases, the model quickly learns the coal and gangue features, and the recall rises rapidly, reaching around 0.8 near the 50th round. The curve then rises more gradually as fewer new features remain to be learned, and it eventually stabilizes, indicating that the model correctly detects most of the actual positive samples. The metrics/mAP50 (B) curve eventually approaches 1.0, indicating a high average detection accuracy for all target types at an IoU threshold of 0.5. The final value of the metrics/mAP50–95 (B) curve is about 0.8, indicating that the model still detects well over the stricter IoU threshold range (0.5–0.95).

Figure 8. Model training and validation loss curves.

3.5. Ablation Experiments

Ablation experiments were conducted to test the contribution of each improved module to coal and gangue identification. Experiment 1 is the original model, denoted Base. Experiment 2 replaces the backbone network with the EfficientViT network (abbreviated EViT) and is denoted Base_EViT. Experiment 3 replaces the C2f module with the DRBNCSPELAN4 module and is denoted Base_DRBNCSPELAN. Experiment 4 replaces the detection head with the lightweight shared convolutional detection head Detect_LSCD and is denoted Base_LSCD. Experiment 5 combines Experiments 2 and 4 and is denoted Base_EViT+LSCD. Experiment 6 combines Experiments 3 and 4 and is denoted Base_DRBNCSPELAN+LSCD. Experiment 7 combines all three modules and is the final model, DEL-YOLOv8s. Table 2 presents the results.

Table 2. Ablation experiments (“√” represents the addition of the module, and bold indicates optimal performance).

As shown in Table 2, the model size, number of parameters, and FLOPs of the base model are 21.4 MB, 11.13 M, and 28.4 G, respectively. When each of the three improved modules is added to the base model on its own, the model size, number of parameters, and FLOPs all decrease while accuracy remains similar, indicating that each module contributes to a lighter model to a different degree and is well suited to coal gangue sorting scenarios. Combining the modules in pairs reduces model size, number of parameters, and FLOPs substantially, with the largest change in the number of parameters. With all three modules combined, the DEL-YOLOv8s model size falls to 10.2 MB, the number of parameters to 4.97 M, and the FLOPs to 14.1 G. Compared with the base model, these are reductions of 11.2 MB, 6.16 M, and 14.3 G (52.34%, 55.35%, and 50.35%), respectively. In addition, the inference test on a Raspberry Pi 4B device gave 937.7 ms/frame, which is 20.87% faster than the base model (1185 ms/frame). The improved model therefore reduces the complexity and computational requirements of the original model while maintaining its detection performance, and its shorter inference time helps improve operational efficiency in practical coal gangue sorting.
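
The reported percentage reductions follow directly from the first and last rows of Table 2. The short check below is only an arithmetic sketch; the dictionary keys and the helper name `relative_reduction` are illustrative and not taken from the authors' code.

```python
# Row 1 (base YOLOv8s) and row 7 (DEL-YOLOv8s) of Table 2.
base = {"size_mb": 21.4, "params_m": 11.13, "flops_g": 28.4, "ms_per_frame": 1185.0}
final = {"size_mb": 10.2, "params_m": 4.97, "flops_g": 14.1, "ms_per_frame": 937.7}


def relative_reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before


reductions = {
    key: round(relative_reduction(base[key], final[key]), 2) for key in base
}

for key, value in reductions.items():
    print(f"{key}: {value:.2f}% lower than the base model")
```

The four values reproduce the 52.34%, 55.35%, 50.35%, and 20.87% figures quoted for model size, parameters, FLOPs, and per-frame inference time.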

3.6. Comparative Experiments

To further verify the effectiveness of the improved model for coal gangue recognition, it is compared with YOLOv3s, YOLOv5s, and YOLOv6s, together with the YOLOv8s baseline, using the same dataset and experimental environment described above. The results are presented in Table 3.

Table 3. Comparison experiment.

As Table 3 shows, the final model is substantially smaller than the compared models in model size, number of parameters, and FLOPs, while its mAP@50–95 of 83% is second only to YOLOv6s, trailing it by 0.2 percentage points. On the Raspberry Pi 4B device, its inference speed is lower than that of YOLOv3s, but it is better than YOLOv3s on all other indicators. Taken together, DEL-YOLOv8s offers a lighter model with balanced detection accuracy, which makes it more suitable for deployment on a Raspberry Pi in practical coal gangue sorting applications.
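
The ranking described above can be read off Table 3 mechanically. The sketch below copies the relevant columns of Table 3 and picks the best model per metric; the names `TABLE3`, `best`, and `frames_per_second` are illustrative helpers, and the frames-per-second figure is simply 1000 divided by the reported ms/frame.

```python
# Columns copied from Table 3: size (MB), mAP@50-95 (%), parameters (M),
# FLOPs (G), and Raspberry Pi 4B latency (ms/frame).
TABLE3 = {
    "YOLOv3s":     {"size": 23.24, "map": 80.5, "params": 12.13, "flops": 18.9, "ms": 741.6},
    "YOLOv5s":     {"size": 17.64, "map": 82.0, "params": 9.12,  "flops": 23.8, "ms": 1051.9},
    "YOLOv6s":     {"size": 31.3,  "map": 83.2, "params": 16.3,  "flops": 44.0, "ms": 1504.6},
    "YOLOv8s":     {"size": 21.4,  "map": 82.0, "params": 11.13, "flops": 28.4, "ms": 1185.0},
    "DEL-YOLOv8s": {"size": 10.2,  "map": 83.0, "params": 4.97,  "flops": 14.1, "ms": 937.7},
}


def best(metric, lower_is_better=True):
    """Name of the model with the best value for one metric."""
    pick = min if lower_is_better else max
    return pick(TABLE3, key=lambda name: TABLE3[name][metric])


def frames_per_second(ms_per_frame):
    """Convert a per-frame latency in milliseconds to frames per second."""
    return 1000.0 / ms_per_frame


winners = {
    "size": best("size"),
    "params": best("params"),
    "flops": best("flops"),
    "ms": best("ms"),
    "map": best("map", lower_is_better=False),
}

for name, row in TABLE3.items():
    print(f"{name:12s} {frames_per_second(row['ms']):.2f} frames/s")
print(winners)
```

DEL-YOLOv8s comes out best on size, parameters, and FLOPs, YOLOv3s on latency, and YOLOv6s on mAP@50–95, matching the discussion of Table 3.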

3.7. Visualization Analysis

To demonstrate the improvement of the revised algorithm for coal gangue sorting more clearly, the weight files of the baseline model and the improved model are used to run predictions on a sample of coal and gangue images. The images cover normal, low-light, strong-light, and dusty industrial environments, as well as the homemade laboratory dataset. The prediction results are shown in Figure 9, where a yellow dashed box indicates a missed or duplicate detection and a red dashed box indicates a category error. Although the baseline model can predict coal and gangue under normal conditions, it produces misclassifications and duplicate detections. The baseline model also produces duplicate detections in low-light environments, and under strong lighting it misidentifies areas that are not gangue as gangue. For dusty environments and the homemade dataset, the difference between the two models is relatively small. Overall, Figure 9 shows that the improved model detects coal and gangue better. As Figure 10 shows, the mAP@50–95 curve of the improved model during training is broadly in line with that of the baseline model, but the accuracy of DEL-YOLOv8s is generally higher than that of the baseline from round 175 onwards.

Figure 9. Visualization of model prediction results.

Figure 10. Comparison of mAP@50–95 curves.

Figure 11 plots the number of parameters against the model size (generated with MATLAB R2023b software); as the number of parameters decreases, so does the model size. Experiment 7 has both the lowest number of parameters and the smallest model size of all the experiments. The final model therefore combines a small model size with a low parameter count, which is especially advantageous for edge devices and other resource-constrained environments.
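
The near-linear relationship between parameter count and model size can be checked from Table 2 by dividing the two columns. The sketch below does only that; the name `ABLATION` is illustrative, and treating MB per million parameters as bytes per parameter assumes 1 MB = 10^6 bytes, which the source does not state.

```python
# (experiment number, parameters in millions, model size in MB) from Table 2.
ABLATION = [
    (1, 11.13, 21.4),
    (2, 8.39, 16.6),
    (3, 7.67, 15.1),
    (4, 9.43, 18.1),
    (5, 6.69, 13.4),
    (6, 5.97, 11.9),
    (7, 4.97, 10.2),
]

# MB per million parameters, i.e. roughly bytes per parameter.
ratios = {exp: size / params for exp, params, size in ABLATION}

# Experiment with the fewest parameters.
smallest = min(ABLATION, key=lambda row: row[1])[0]

for exp, ratio in ratios.items():
    print(f"Experiment {exp}: {ratio:.2f} MB per million parameters")
print(f"Fewest parameters: experiment {smallest}")
```

Every experiment lands close to 2 MB per million parameters. That would be consistent with weights stored at 16-bit precision, although the storage format is not stated in the source and this reading is an assumption.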

Figure 11. A scatter plot of the number of parameters and model size.

Figure 12 shows the physical setup used to validate the model on the Raspberry Pi.

Figure 12. Model validation on the Raspberry Pi.

Furthermore, to show the inference performance of the different models on the Raspberry Pi more clearly, scatter plots of FLOPs versus inference time per frame were generated for each model using MATLAB R2023b software. The results are shown in Figure 13, which relates these two indicators for the seven models of the ablation study. Figure 13 shows that the inference time per frame tends to decrease as the FLOPs decrease. Models 2, 3, and 4 show a small decrease in inference time with similar or slightly lower FLOPs. Models 5 and 6 achieve relatively fast inference while reducing FLOPs further. Combining all modules yields the final model, which has the lowest FLOPs and therefore consumes significantly fewer computational resources; its inference speed in the real-time coal gangue sorting task is better than that of the other six models, which supports the intelligent transformation of coal mines.
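
The trend visible in the scatter plot can also be summarized numerically. The sketch below computes a Pearson correlation coefficient between the FLOPs and ms/frame columns of Table 2; it is an illustrative check on seven data points, not an analysis performed by the authors, and the function name `pearson` is hypothetical.

```python
import math

# FLOPs (G) and Raspberry Pi 4B latency (ms/frame) for experiments 1 to 7 in Table 2.
flops = [28.4, 20.4, 19.6, 25.8, 17.7, 16.9, 14.1]
latency_ms = [1185.0, 1140.0, 1136.9, 1087.1, 1053.3, 1037.5, 937.7]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r = pearson(flops, latency_ms)
print(f"Pearson r between FLOPs and ms/frame: {r:.2f}")
```

A clearly positive coefficient agrees with the observation that lower FLOPs go with shorter inference time. The relationship is not perfectly linear: experiment 4, for example, has higher FLOPs than experiments 2 and 3 but a lower latency.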

Figure 13. A scatter plot of FLOPs and inference speed.

The visualization analysis above shows that the model performs well on the coal gangue detection task for the existing dataset. However, because this study has not yet obtained coal and gangue samples from other mining environments for testing and validation, the generalization ability of the model may vary on data from other mining areas.

4. Conclusions

To identify coal and gangue in real time and with high precision under varying light, shape, and other complex conditions, and thereby support coal gangue sorting, the DEL-YOLOv8s model is designed to address the misdetection and missed detection of coal and gangue. Firstly, the EfficientViT module replaces the backbone network of YOLOv8s in order to extract multi-layered, multi-scale image feature information. Secondly, the C2f module in the neck network is replaced with the DRBNCSPELAN4 module to enhance the model’s feature extraction and fusion capabilities while reducing computational complexity. Finally, Detect_LSCD, a self-developed lightweight detection head, is used to further improve detection accuracy. The resulting mAP@50–95, model weight size, number of parameters, and floating-point operations for coal and gangue detection are 83%, 10.2 MB, 4.97 M, and 14.1 G, respectively. Compared with the YOLOv8s model, DEL-YOLOv8s reduces the model size by 11.2 MB (52.34%), the number of parameters by 55.35%, and the floating-point operations by 50.35%, and improves mAP@50–95 by 1.2% (from 82% to 83%). In addition, the processing of input coal and gangue images is 20.87% faster after deployment on a Raspberry Pi 4B edge computing device. In summary, the model maintains high accuracy and real-time identification of coal and gangue while requiring less computation and less inference time. The model can also provide accurate location information for gangue, so it can be deployed on edge computing devices for real-time gangue sorting in conjunction with robotic arms controlled by an STM32 microcontroller.
The model also has some limitations. Although the dataset integrates a public dataset and a homemade dataset, the variety of coal and gangue samples it currently covers is limited, and the characteristics of coal and gangue differ between mines, which may lead to insufficient generalization when the model handles data from other mines. In future work, we will continue to optimize the network structure to make it more lightweight, acquire further types of coal and gangue samples from multiple locations to extend the existing dataset, and improve the adaptability of the model so that it can be better applied to actual coal and gangue sorting scenarios.

Author Contributions

Conceptualization, Q.Z. and S.M.; methodology, Q.Z. and S.M.; software, Q.Z.; validation, Q.Z., S.F. and M.G.; formal analysis, S.M.; investigation, S.F., M.G. and X.L.; data curation, Q.Z. and M.G.; writing—original draft preparation, Q.Z. and S.M.; writing—review and editing, Q.Z., S.M., X.L. and S.F.; visualization, S.M.; funding acquisition, S.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by The Key Natural Science Research Project for Colleges and Universities of Anhui Province (grant number 2023AH050343), Anhui Provincial Department of Education Quality Engineering Project under Grant 2022jyxm1405, Anhui Province Graduate Education Quality Project (2024jyjxggyjY204), and Huaibei Normal University Bit and Graduate Education Quality Project (2024jgxm003).

Data Availability Statement

Data are contained within this article.

Conflicts of Interest

The authors declare no conflicts of interest.

Table 1. Confusion matrix.

Real Value	Predicted Positive	Predicted Negative
Positive	TP	FN
Negative	FP	TN
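
The cells of Table 1 map onto precision, recall, and F1 in the standard way. The following is a minimal sketch using the usual definitions, with hypothetical counts chosen only for illustration; it is not the evaluation code used in this study.

```python
def precision(tp, fp):
    """Share of predicted positives that are real positives: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of real positives that were detected: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts for one class, not taken from the experiments.
tp, fp, fn = 90, 10, 30
p = precision(tp, fp)
r = recall(tp, fn)
f1 = f1_score(p, r)
print(f"precision = {p:.3f}, recall = {r:.3f}, F1 = {f1:.3f}")
```

TN does not enter any of the three metrics, which is why they remain informative even when most candidate regions in an image are background.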

Table 2. Ablation experiments (“√” represents the addition of the module, and bold indicates optimal performance).

No.	EViT	DRBNCSPELAN	LSCD	Model Size (MB)	P (%)	R (%)	mAP@50–95 (%)	Parameters (M)	FLOPs (G)	Inference Speed (ms/frame)
1				21.4	96.8	98.3	82	11.13	28.4	1185
2	√			16.6	96.4	98.9	80.1	8.39	20.4	1140
3		√		15.1	96.6	97.1	81	7.67	19.6	1136.9
4			√	18.1	97.1	97.7	82.4	9.43	25.8	1087.1
5	√		√	13.4	97	98.1	83.3	6.69	17.7	1053.3
6		√	√	11.9	97.3	96.4	81	5.97	16.9	1037.5
7	√	√	√	10.2	97.3	97.9	83	4.97	14.1	937.7

Table 3. Comparison experiment.

No.	Model	Model Size (MB)	P (%)	R (%)	F1 (%)	mAP@50–95 (%)	Parameters (M)	FLOPs (G)	Inference Speed (ms/frame)
1	YOLOv3s	23.24	96.9	96.6	96.7	80.5	12.13	18.9	741.6
2	YOLOv5s	17.64	96.8	97.4	97.1	82	9.12	23.8	1051.9
3	YOLOv6s	31.3	97.9	97.6	97.4	83.2	16.3	44	1504.6
4	YOLOv8s	21.4	96.8	98.3	97.5	82	11.13	28.4	1185
5	DEL-YOLOv8s	10.2	97.3	97.9	97.6	83	4.97	14.1	937.7

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
