A “Hardware-Friendly” Foreign Object Identification Method for Belt Conveyors Based on Improved YOLOv8

As a crucial element in coal transportation, conveyor belts play a vital role, and monitoring their health is essential for the safe and efficient operation of the coal mine transportation system. This paper introduces a new “hardware-friendly” method for monitoring belt conveyor damage, aiming to address the large parameter counts and computational requirements of existing deep learning-based foreign object detection methods and the resulting difficulty of deploying them on edge devices with limited computing power. The method is tailored to edge computing and aims to reduce the parameters and computational load of foreign object recognition networks deployed on edge computing devices. It improves the YOLOv8 object detection network and redesigns a lightweight ShuffleNetV2 network as the backbone, making the network more refined in recognizing foreign object features while reducing redundant parameters. Additionally, a simple parameter-free attention mechanism called SimAM is introduced to further enhance recognition efficiency without imposing additional computational burden. Experimental results demonstrate that the improved foreign object recognition method achieves a detection accuracy of 95.6% with only 1.6 M parameters and 4.7 G model computational load (FLOPs). Compared to the baseline YOLOv8n, the detection accuracy has improved by 3.3 percentage points, while the number of parameters and model computational load have been reduced by 48.4% and 42.0%, respectively. These improvements make the method better suited to edge computing devices, which favor “hardware-friendly” algorithms. The improved algorithm can reduce latency in the data transmission process, enabling the accurate and timely detection of non-coal foreign objects on the conveyor belt. This provides assurance for the subsequent host computer system to promptly identify and address foreign objects, thereby ensuring the safety and efficiency of the belt conveyor.


Introduction
As the essential energy source for human society's development, the coal industry holds a dominant position in national production [1]. Under the goals of carbon peaking and carbon neutrality, the urgent task in current energy development is to construct a diversified energy supply system and accelerate the energy transition [2]. To address the adverse environmental effects associated with coal mining, production, and utilization processes, the advancement of intelligent coal mining equipment for achieving eco-friendly, efficient utilization of coal resources has emerged as a central focus within the contemporary coal industry [3].
The transportation of coal, serving as a crucial stage in the coal mining process, greatly impacts both the overall energy consumption and production costs associated with coal [4]. As major transportation equipment for coal transportation, mining belt conveyors have been evolving towards large-scale development driven by rapid technological advancements and increasing demands in the coal industry. Additionally, they are progressively aligning with the development requirements proposed by Industry 4.0, transitioning towards intelligent and energy-efficient directions [5,6].
The safe operation of the belt conveyor system relies on the normal and healthy functioning of its equipment. Abnormal working conditions can not only cause unnecessary damage or wear to the equipment but also increase the system's energy consumption, leading to additional safety risks and economic pressure for coal mining enterprises, thereby severely impacting the green and sustainable development of the enterprise [7].
In practical production processes, coal often needs to be transported over long distances. Conveyor belts efficiently facilitate the movement of large quantities of coal from one location to another, ensuring the unobstructed flow of the coal supply chain. Additionally, conveyor belt systems, meticulously engineered, enable the automated and continuous transportation of coal, eliminating the need for extensive manual labor. This not only enhances production efficiency but also reduces labor costs while minimizing the potential for human errors. Given the intricate working environment of belt conveyors, issues such as belt misalignment, belt tearing, and overloading [8] frequently manifest, with belt damage being the most prevalent among these malfunctions. Investigation has revealed that longitudinal tearing occurs when the belt encounters external sharp objects, such as anchor rods, large chunks of rock, angle irons, and iron plates [9][10][11]. Failure to promptly detect and address non-coal foreign objects on the conveyor belt may result in abnormal stoppages and conveyor belt malfunctions, introducing a gamut of potential risks and losses. These may include diminished production efficiency, the loss of substantial production capacity, delays in the transportation of coal and other commodities, and disruptions in the logistics of the supply chain, affecting delivery schedules and production plans. In more severe cases, it could lead to coal accumulation or blockage on the conveyor belt, potentially causing material stack collapses, posing risks to worker safety, and resulting in equipment damage. Therefore, the rapid identification and monitoring of non-coal foreign objects on conveyor belts are imperative.
As condition monitoring technology advances, the monitoring methods for belt damage in conveyor belts have gradually evolved from manual inspection to traditional image recognition algorithms and deep learning-based object detection methods. Manual inspection methods are inefficient and costly. In contrast to traditional image recognition algorithms, deep learning-based detection methods do not require manual feature extractor design and possess stronger feature extraction capabilities, meeting the demands of efficient and precise data processing in the age of big data [12].
Currently, deep learning object detection methods can be categorized into single-stage and two-stage algorithms. Classic two-stage algorithms include Region-based Convolutional Neural Network (R-CNN) [13], Fast Region-based Convolutional Neural Network (Fast R-CNN) [14], and Faster Region-based Convolutional Neural Network (Faster R-CNN) [15]. However, two-stage methods require generating candidate boxes from input images before feature extraction and detection, resulting in a relatively slow detection speed, which fails to meet the high demands of real-time performance and detection speed in coal mining applications. Classic single-stage algorithms encompass the You Only Look Once (YOLO) series and the Single Shot MultiBox Detector (SSD) algorithm [16]. Unlike the two-stage methods, single-stage algorithms directly extract category, coordinate, and other feature information while generating candidate boxes in a single step to obtain detection results. The YOLO series algorithms, known for their accuracy and fast speed, are widely used in coal mine foreign object detection due to their applicability to embedded mobile platforms.
Hu Jinghao et al. [17] proposed an improved YOLOv3-based method for foreign object detection in belt conveyor systems. This approach employs the Focal Loss function as its loss function and fine-tunes the optimal hyperparameters, including weight parameter α and focus parameter γ, to address the sample imbalance problem. The highest recognition accuracy on the proposed non-coal foreign object dataset was 94.0%. Zhang Mengchao et al. [18] proposed an improved YOLOv4-based approach using depthwise separable convolutions to construct a series of lightweight networks, achieving a detection accuracy of 93.7% on the proposed dataset. Zhang Lei et al. [19] proposed a coal gangue object detection method for belt conveyors based on YOLOv5s-SDE. By adding a Squeeze-and-Excitation (SE) attention mechanism to the backbone network and optimizing the loss function, they improved the model's convergence speed and prediction accuracy. The results showed a maximum detection accuracy of 92.5% and a recognition speed of 30 frames per second. Mao Qinghua et al. [20] introduced a foreign object recognition approach for coal mine belt conveyors based on an improved version of YOLOv7. By replacing ordinary convolutions in the network model with depthwise separable convolutions, the foreign object recognition speed was improved, with a recognition accuracy of 92.8% and a recognition speed of 25.64 frames/s.
Although deep learning-based foreign object detection methods have become mainstream in current detection, the video surveillance systems serving the belt conveyor field rely on network cameras or inspection robots for data acquisition, and the data are then uploaded to central servers for centralized processing. This "cloud computing" processing approach exhibits relatively high network latency, which has a substantial impact on the real-time nature and accuracy of system alerts. Furthermore, the simultaneous transmission of multiple data streams imposes demanding bandwidth and computing power requirements on the "cloud processors". Compared to the "cloud computing" processing approach, edge computing distributes the data processing tasks to the data acquisition endpoints, reducing the latency during data transmission and alleviating the burden on the "cloud processors". Therefore, it is better suited for real-time data analysis and intelligent processing. However, edge computing devices typically have limited computational power and may struggle to run complex deep neural networks. To enable real-time processing capabilities on edge computing devices, it is necessary to compress and optimize the corresponding algorithms, reducing network model complexity and minimizing computational load.
Furthermore, as high-speed conveyor belts continue to evolve, they demand greater frame capture rates from cameras. Cameras with higher frame rates will transmit high-quality images to edge computing devices. If the network's foreign object recognition and detection speed is not increased accordingly, the image input and output signals can fall out of sync, delaying information to the upper computer system and cloud operation platform. This delay can impact subsequent decision-making and operations, thereby increasing the risk of conveyor belt damage.
More precisely, the term "hardware-friendly" algorithm pertains to network models characterized by minimal parameter counts and reduced computational demands. Employing these network models on edge devices within coal mines, which often have constrained computational capabilities, not only aligns with resource constraints in such scenarios, including limited computing power and storage space, but also aligns with the requirements of advancing high-speed conveyor belt technologies.
Based on the above situation, this article proposes a "hardware-friendly" coal mine conveyor belt foreign object identification and detection method, which uses an improved YOLOv8 network to "slim down" the conveyor belt foreign object detection network. The method compresses the parameter count and storage footprint of the network model, reduces its computational load, and improves inference speed without losing detection accuracy, so as to enhance its potential for utilization in edge computing devices.
The remainder of this paper is structured as follows: Section 2 outlines the data preparation procedure, Section 3 introduces the enhancements made to the algorithm, Section 4 evaluates the experimental outcomes and discusses associated observations, and finally, Section 5 summarizes the research outcomes and considers future work.

Data Preparation
Due to the particularity of the detection targets, there is presently no publicly accessible dataset for foreign object detection in underground coal mines. Therefore, the dataset utilized in the experiment was acquired from video images captured by intelligent inspection robots in coal mines during the belt conveyor operation in the Mining Fluid Control Engineering Laboratory of Shanxi Province. The experimental environment and hardware setup are illustrated in Figure 1. The equipment and image parameters are as follows:
• The conveyor belt operates at a speed of 4 m/s.
• The mining inspection robot captures frames at a rate of 40 frames/s.
• The image resolution is 1920 × 1080.
However, transmitting data at this resolution and under these testing conditions demands a substantial amount of memory and bandwidth. High hardware computing capability is needed when implementing this in an industrial setting, especially in edge computing. To reduce the computational costs and enhance network performance, this study resizes the images to 224 × 224 using Python (version: 3.7.0) batch processing. The images are annotated using Labelme software (version: 5.1.1) and stored in the VOC2007 format as the dataset.
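To make the timing constraint implied by these parameters concrete, the following minimal sketch derives the per-frame time budget and the belt travel between consecutive frames from the stated belt speed (4 m/s) and capture rate (40 frames/s). The function name `frame_budget` is illustrative and is not part of the original experimental code.

```python
from typing import Tuple


def frame_budget(belt_speed_m_s: float, frames_per_s: float) -> Tuple[float, float]:
    """Return (time budget per frame in ms, belt travel per frame in m)."""
    if frames_per_s <= 0:
        raise ValueError("frames_per_s must be positive")
    budget_ms = 1000.0 / frames_per_s          # time available to process one frame
    travel_m = belt_speed_m_s / frames_per_s   # distance the belt moves between frames
    return budget_ms, travel_m


budget_ms, travel_m = frame_budget(4.0, 40.0)
print(f"{budget_ms:.1f} ms per frame, {travel_m:.2f} m of belt travel per frame")
```

Under these figures, a detector that needs more than 25 ms per image cannot keep pace with the camera, which is the practical motivation for reducing model computation.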
Considering the adverse underground environment in coal mines and the impact of vibrations caused by the movement of inspection robots on the network detection performance, all images in the dataset are subjected to processing techniques such as motion blur, dust and fog effects, and reduced brightness. After data augmentation, the dataset comprises 17,483 samples of foreign object images with 44,480 corresponding data labels. Throughout the training procedure, the entire collection of image samples was divided into a training set (comprising 12,238 images), a validation set (comprising 3496 images), and a test set (comprising 1749 images) in a ratio of 7:2:1. The dataset includes various types of foreign objects, such as anchor rods, angle irons, trays, gangue, nuts, and screws. Some sample images are illustrated in Figure 2.
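The 7:2:1 split can be sketched as follows. This is a minimal illustration, assuming a seeded random shuffle and floor rounding with the test set taking the remainder; the paper does not state how the split was actually drawn, and `split_dataset` is a hypothetical helper.

```python
import random


def split_dataset(items, ratios=(0.7, 0.2), seed=0):
    """Shuffle items and split into train/val/test; the test set takes the remainder."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility (assumption)
    n = len(shuffled)
    n_train = int(n * ratios[0])            # floor rounding (assumption)
    n_val = int(n * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


train, val, test = split_dataset(range(17483))
print(len(train), len(val), len(test))
```

With 17,483 samples, this rounding convention yields subsets of 12,238, 3496, and 1749 images, matching the sizes reported above.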

YOLOv8 Network Model
YOLOv8 is the latest version of the object detection and image segmentation model developed by Ultralytics in 2023. Building upon the successful foundation of YOLOv5, YOLOv8 introduces new functionalities and improvements aimed at further enhancing performance and flexibility. The YOLOv8 algorithm comes in five distinct models, denoted as YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x, each with varying sub-module depths and widths. Detection accuracy and model size both increase in that order. Considering the limited hardware resources and the high real-time requirements in actual coal mine underground environments, strict limitations on the model size are necessary. Therefore, in this study, we selected YOLOv8n, which has the fewest parameters and the lowest computation, as the experimental model for coal mine belt conveyors. Its architecture is illustrated in Figure 3.
The YOLOv8 network comprises input, backbone, neck, and head components. The input stage receives image data and passes it to the next layer of the network; its primary operations include mosaic augmentation and adaptive anchor box calculation. Typically, the YOLOv8 network divides the image into fixed-sized grids and utilizes the information from each grid for object detection.
The backbone serves as the network's central component, responsible for transforming input images into feature maps that contain both object positions and feature information. Typically, the backbone network consists of convolutional layers, pooling layers, and other deep learning layers, enabling it to capture information from different abstraction levels in the image. In YOLOv8, the Context to Focus (C2F) module has been introduced to replace the Cross Stage (C3) module, incorporating additional skip connections to enhance the model's gradient flow and strengthen the network's feature representation capacity.
The neck layer is employed to further extract and integrate features from the backbone network, typically consisting of a series of convolutional and pooling layers. Its role is to enhance feature representation and aid the network in better understanding contextual information about objects. In YOLOv8, the neck layer adopts the Path Aggregation Network (PANet) structure, which strengthens the network's ability to aggregate object features at different scales.
The head layer serves as the network's output component and is responsible for generating object detection results. Its primary function is to map the feature maps to object detection results to determine the position and category of detected objects. Typically, the head includes convolutional and detection layers. In YOLOv8, the classification and regression tasks are predicted separately, with the classification task still employing Binary Cross-Entropy Loss (BCE Loss), while the regression task utilizes Distribution Focal Loss (DFL Loss) and Complete Intersection over Union Loss (CIOU Loss) functions. These loss functions enable the network to rapidly focus on the distribution of positions in the vicinity of the target location, resulting in integer coordinate weight values.
Floating Point Operations (FLOPs), Memory Access Cost (MAC), parallelism, and computing platforms are important metrics for assessing network models' computational speed and complexity. FLOPs quantify the amount of floating-point computations, i.e., the computational workload of the network model. Under a given MAC, parallelism, and computing platform, the larger the FLOPs, the greater the computational workload and complexity of the model. Although the YOLOv8n network exhibits good performance in object recognition accuracy and detection speed, the complex underground environment in coal mines (characterized by humidity, high coal dust levels, insufficient lighting, and overall darkness) leads to poor image quality and weak distinguishability of targets. Consequently, this imposes a significant computational burden on the deployment of detection devices. In addition, YOLOv8n backbone feature extraction comprises multiple densely connected standard convolutions. The excessive use of ordinary convolution to extract image features results in feature redundancy. The greater the number of layers, the greater the impact on FLOPs, thus affecting the speed of foreign object detection in coal mines. Therefore, it is necessary to "slim down" the YOLOv8n network model.

Selection of
Then, $N$ sets of 1 × 1 convolution kernels are applied, resulting in a final output size of $D_H \times D_W \times N$ for the feature map.
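The calculation amount of ordinary convolution ($Q_c$) is:
$$Q_c = D_K \times D_K \times M \times N \times D_H \times D_W \quad (1)$$
The calculation amount of DWConv ($Q_D$) is:
$$Q_D = D_K \times D_K \times M \times D_H \times D_W + M \times N \times D_H \times D_W \quad (2)$$
The calculation ratio of DWConv to ordinary convolution is:
$$\frac{Q_D}{Q_c} = \frac{1}{N} + \frac{1}{D_K^2} \quad (3)$$
where $D_K$ is the convolution kernel size, $M$ is the number of input channels, and $N$ is the number of output channels. It can be seen from Formula (3) that by introducing DWConv, the calculation amount and parameters of the original network can be reduced so that the detection speed can be significantly improved.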
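As a numerical illustration of this comparison, the sketch below counts multiply-accumulate operations for a standard convolution and for a depthwise convolution followed by a 1 × 1 pointwise convolution, using the usual cost model (kernel size k, M input channels, N output channels, H × W output map). The layer sizes are arbitrary example values, not figures from the experiments.

```python
def standard_conv_cost(k: int, m: int, n: int, h: int, w: int) -> int:
    """Multiply-accumulates of a k x k standard convolution, M -> N channels."""
    return k * k * m * n * h * w


def dw_separable_cost(k: int, m: int, n: int, h: int, w: int) -> int:
    """Cost of a k x k depthwise convolution plus a 1 x 1 pointwise convolution."""
    depthwise = k * k * m * h * w   # one k x k filter per input channel
    pointwise = m * n * h * w       # N sets of 1 x 1 kernels mixing the M channels
    return depthwise + pointwise


# Example layer (illustrative values): 3 x 3 kernel, 24 -> 48 channels, 28 x 28 output.
k, m, n, h, w = 3, 24, 48, 28, 28
ratio = dw_separable_cost(k, m, n, h, w) / standard_conv_cost(k, m, n, h, w)
print(f"cost ratio = {ratio:.3f}")
```

For this example the ratio equals 1/N + 1/k², so the separable form needs roughly 13% of the operations of the standard convolution.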
While DWConv can efficiently decrease FLOPs, it cannot directly substitute for standard convolution, as this could result in a considerable drop in accuracy. In practice, when the DWConv network width is expanded from $c$ to $c_1$ (where $c < c_1$) to compensate for the accuracy loss, the memory access cost also increases, thereby slowing overall computation. For hardware devices in deployment, the number of memory accesses rises to [21]:
$$h \times w \times 2c_1 + k^2 \times c_1 \approx h \times w \times 2c_1 \quad (4)$$
which is higher than that of normal convolution, i.e.,
$$h \times w \times 2c + k^2 \times c^2 \approx h \times w \times 2c \quad (5)$$
where $h$ and $w$ denote the height and width of the input feature map, $c$ signifies the network's width, and $k$ denotes the convolution kernel size.
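The memory-access argument can be checked numerically. The sketch below assumes the access model used in [21] as understood here: a layer reads and writes one h × w feature map per channel and reads its weights once, giving h·w·2c + k²·c² accesses for a standard convolution and h·w·2c₁ + k²·c₁ for a depthwise convolution widened to c₁ channels. The sixfold widening is an arbitrary illustrative choice.

```python
def conv_memory_access(h: int, w: int, c: int, k: int) -> int:
    """Memory accesses of a standard k x k convolution with c input and c output channels."""
    return h * w * 2 * c + k * k * c * c


def dwconv_memory_access(h: int, w: int, c1: int, k: int) -> int:
    """Memory accesses of a k x k depthwise convolution over c1 channels."""
    return h * w * 2 * c1 + k * k * c1


# Illustrative values: 28 x 28 feature map, width 48 widened sixfold to 288.
regular = conv_memory_access(28, 28, 48, 3)
widened = dwconv_memory_access(28, 28, 288, 3)
print(regular, widened)
```

Even though the widened depthwise layer has far fewer FLOPs per channel, its feature-map traffic dominates, which is why lower FLOPs do not automatically translate into lower latency on memory-bound edge hardware.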

ShuffleNet Network
Research on image classification methods based on lightweight deep neural networks has made considerable progress in recent years. Among them, the ShuffleNet series algorithms proposed by Megvii Technology have been widely applied in object detection for edge computing due to their ability to achieve high model accuracy with limited computational resources. The ShuffleNet network employs two core operations [22]: group convolution and channel shuffle. These operations substantially decrease the model's computational complexity while preserving accuracy. However, group convolution is limited in that each group operates independently without feature fusion between groups. As shown in Figure 4a, when the convolution kernel is divided into three groups, the resulting feature maps are also divided into three groups, with each group only exchanging information internally, lacking any information fusion between groups. Therefore, ShuffleNetv1 proposed the concept of channel shuffle, which divides the channels in the feature map into several groups according to certain rules and then rearranges the channels across groups. Doing so enhances the information interaction and fusion between channels, increasing the model's non-linear expression capability without compromising network accuracy. The channel shuffle operation is illustrated in Figure 4b. However, ShuffleNetv1 utilizes many grouped and pointwise convolutions, which can slow down the model. To address this issue, Ma et al. proposed four design principles for effective and compact networks based on the ShuffleNetv1 model and introduced the ShuffleNetv2 network [23]. A new channel split operation is proposed in ShuffleNetv2, where the network mainly comprises a basic unit and a downsample unit.
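The channel shuffle operation described above can be sketched on a plain list of channel indices. This is a minimal illustration of the usual reshape, transpose, and flatten formulation, not the implementation used in the experiments; `channel_shuffle` is an illustrative name.

```python
def channel_shuffle(channels, groups):
    """Interleave channels so that each output group mixes channels from every input group."""
    n = len(channels)
    if n % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = n // groups
    # View the list as a (groups, per_group) matrix, transpose it to
    # (per_group, groups), and flatten it back into a single list.
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


print(channel_shuffle(list(range(6)), groups=3))  # [0, 2, 4, 1, 3, 5]
```

After the shuffle, a subsequent group convolution with three groups would see channels originating from all three previous groups, which restores information flow between groups at no parameter cost.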
Figure 5a shows the basic unit feature extraction operation with Stride = 1. The input features are evenly divided by the channel split operation; that is, the number of feature channels in each branch accounts for 1/2 of the original number of channels. The left pathway then performs an identity mapping without any processing, and the right pathway performs two ordinary convolutions (1 × 1) and one DWConv (3 × 3, stride 1) operation. Finally, the left and right branches output feature maps through feature concatenation and channel shuffle operations. Figure 5b shows the downsampling operation with Stride = 2. Unlike the basic unit, the downsampling unit does not use channel split. Instead, it directly increases the network's channel count and overall width, further enhancing the network's ability to extract features. First, the feature map is fed into two branches, each of which undergoes ordinary convolution (1 × 1) and DWConv (3 × 3, stride 2). After the two branches are concatenated along the channel dimension, the number of output channels increases to twice the original, and channel shuffle is performed on the merged feature map.
Within the network architecture of ShuffleNetv2, the number of channels is doubled every time a downsampling operation is performed. With the doubling of the number of channels, the network does not pay attention to the feature channels that significantly impact the classification results. Important and unimportant feature channels have the same weight, resulting in excessive retention of interference information from the mine, which can easily affect the classification of non-coal foreign objects. Therefore, this paper chooses the lightweight ShuffleNetv2 network as the backbone network and optimizes it for these problems.
The dataset of foreign objects contains interference from non-important features such as complex environmental backgrounds, which introduces a large amount of interference information during the recognition process. Redundant information is transmitted during the learning process of the network model, and as the number of network layers increases, the weight of interference information in the feature map also increases, ultimately having a negative impact on the model. The attention mechanism is a frequently employed method in deep learning that allows the model to focus more on key information in the input, thereby improving the accuracy and efficiency of the model. However, existing attention mechanisms suffer from two issues. First, they can enhance features only in either the channel dimension or the spatial dimension, lacking the flexibility to adapt to both dimensions simultaneously. Second, their structures are based on a series of complex operations, which can increase the model parameter size and are ill-suited to coal mine edge computing devices.
The Simple and Parameter-Free Attention Mechanism (SimAM) is an energy-based attention mechanism that can derive 3D attention weights without requiring additional parameters [24]. Compared to other attention mechanisms, SimAM's operations are more concise and clear, effectively avoiding the increase in model parameters caused by structural adjustments. Therefore, this paper introduces the attention mechanism SimAM into the ShuffleNetV2 model, which allows the network to focus more on extracting essential feature information from non-coal foreign object images, effectively suppressing the interference of redundant information on the network. This leads to more efficient feature extraction and reconstruction, improving network recognition accuracy and reducing network complexity.
Its computational formulas are shown in Equations (6)-(8). SimAM evaluates the importance of each neuron in the network by defining an energy function based on linear separability. Here, $t$ denotes the target neuron, $x_i$ represents neighboring neurons, and $\lambda$ is a hyperparameter. $\mu$ and $\sigma^2$ represent the average and variance of all neurons within the channel, excluding $t$. The lower the energy, the higher the differentiation between the neuron and adjacent neurons, and the higher the importance of the neuron.
Subsequently, each neuron is assigned a unique weight based on the manifestation of attention regulation in mammalian brains, as shown in Equation (9) [25]. In that expression, $x$ is the input feature tensor, $X$ is the enhanced feature tensor, $E(x)$ groups all $e_t^*$ across the channel and spatial dimensions, and $\odot$ denotes element-wise multiplication. A sigmoid function is added to limit excessively large values of $E(x)$; because the sigmoid is monotonic, it does not influence the comparative importance of each neuron.
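The expressions referred to as Equations (6)-(9) are not reproduced in this text. As a hedged reconstruction based on the SimAM formulation in [24], the energy function, its closed-form minimum, and the feature refinement step can be written as follows. The mapping of these expressions to this paper's equation numbers, the symbol $M$ for the number of neurons in a channel, and the use of $\hat{\mu}$ and $\hat{\sigma}^2$ for the channel mean and variance are assumptions.

```latex
e_t(w_t, b_t, \mathbf{y}, x_i)
  = \frac{1}{M-1}\sum_{i=1}^{M-1}\bigl(-1 - (w_t x_i + b_t)\bigr)^2
  + \bigl(1 - (w_t t + b_t)\bigr)^2 + \lambda w_t^2

e_t^{*} = \frac{4(\hat{\sigma}^2 + \lambda)}{(t - \hat{\mu})^2 + 2\hat{\sigma}^2 + 2\lambda}

X = \operatorname{sigmoid}\!\left(\frac{1}{E(x)}\right) \odot x
```

Here $w_t$ and $b_t$ are the weight and bias of the linear transform applied to each neuron. Because $1/e_t^{*}$ grows with $(t - \hat{\mu})^2$, neurons that deviate strongly from their channel mean receive larger attention weights.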
After the input feature map passes through the SimAM, the weight is normalized by the sigmoid function. Then, the weight of the target neuron is multiplied by the features of the original feature map to derive the final output feature map. The weight distribution of SimAM is shown in Figure 6.
Figure 6. Schematic diagram of the SimAM weight assignment.
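The weighting procedure can be sketched in plain Python for a single channel. This is a minimal illustration that follows the closed-form inverse energy of the SimAM formulation in [24], not the implementation used in the experiments; the function name and the default value of `lam` (the hyperparameter λ) are assumptions.

```python
import math


def simam_channel(channel, lam=1e-4):
    """Apply SimAM weighting to one channel given as a 2D list of floats."""
    values = [v for row in channel for v in row]
    n = len(values) - 1                       # neurons in the channel excluding the target
    if n < 1:
        raise ValueError("channel needs at least two neurons")
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / n
    out = []
    for row in channel:
        new_row = []
        for v in row:
            # Inverse of the minimal energy: larger for more distinctive neurons.
            inv_energy = (v - mean) ** 2 / (4.0 * (var + lam)) + 0.5
            weight = 1.0 / (1.0 + math.exp(-inv_energy))   # sigmoid normalization
            new_row.append(v * weight)                     # element-wise reweighting
        out.append(new_row)
    return out


channel = [[1.0, 1.0, 1.0], [1.0, 5.0, 1.0], [1.0, 1.0, 1.0]]
refined = simam_channel(channel)
print(refined[1][1] / 5.0, refined[0][0] / 1.0)  # weight of the outlier vs. a background neuron
```

The outlier in the center receives a larger weight than the uniform background, and no learnable parameters are involved, which is the property that makes the mechanism attractive for edge deployment.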

Improved ShuffleNetv2 Network Model
Considering the speed requirement for non-coal foreign object detection by the mining inspection robot, ShuffleNetv2 0.5X is chosen as the backbone network, and improvements are made to this model. The enhanced network architecture is depicted in Figure 7. The network architecture includes Conv (standard convolution), MaxPool (max pooling), ShuffleNetV2(1,3) (downsampling module stacked one time, basic unit module stacked three times), GlobalPool (global pooling), FC (fully connected layer), ReLU (activation function), BN (batch normalization), DWConv (depthwise separable convolution), SimAM (simple and parameter-free attention mechanism), Concat (channel concatenation), Channel shuffle (channel shuffle module), and Channel split (channel splitting module). The specific operations are as follows:
(1) SimAM attention modules are inserted before the feature concatenation of the basic and downsample units in the ShuffleNetv2 network model. This is because the SimAM attention mechanism can effectively exploit the neurons in the downsample layer and basic unit of the ShuffleNetv2 network using an energy function and balance the weight allocation based on the importance of neurons. It assigns greater weight to important feature channels and smaller weight to less important ones, thereby enhancing the attention of the downsample unit and basic unit. In addition, including the attention mechanism SimAM after the convolutional layers is intended to prevent the network from losing significant non-coal foreign object information due to the preceding convolutional operations. It enables the network to focus more on critical features related to non-coal foreign objects, thereby enhancing the model's discriminative ability for different types of foreign objects.
(2) Certain classes of non-coal foreign objects among coal blocks are highly similar, such as nuts and bolts, which are both dark brown, or large coal pieces and coal gangue, which differ only subtly in shape, so it becomes challenging to distinguish them. By increasing the stacking times of the basic units, the network becomes more refined, enabling it to better capture the finer shape features that differentiate foreign objects. This allows the network to thoroughly learn and utilize more detailed information about non-coal foreign objects. The downsampling unit of ShuffleNetv2 is denoted as shuffle-b, and the basic unit is denoted as shuffle-a. As shown in Table 1, while keeping the model lightweight, the stacking times of the downsampling unit in
The specific process of the improved YOLOv8 for non-coal foreign object image classification in a belt conveyor is as follows: First, the non-coal foreign object images of the belt conveyor are subjected to data augmentation and other preprocessing operations at the input end, and the images are transformed into 224 × 224 × 3 and input into the improved ShuffleNetv2 network in the backbone. Then, the images go through convolution and max-pooling operations to obtain feature maps. Through three ShuffleNetv2 network modules containing downsampling units and basic units, which further extract non-coal foreign object features with attention information, a 7 × 7 × 192 feature map is obtained. Specifically, the downsampling units and basic units in Stage 2 and Stage 4 are stacked once and thrice, respectively, while in Stage 3, the downsampling units and basic units are stacked once and nine times. The feature extraction results are then sequentially processed through convolution operations, a global pooling layer, and fully connected layer operations before undergoing feature fusion with the neck layer. Finally, the recognition results of non-coal foreign object images in belt conveyors are obtained through three prediction heads of different sizes in the head.
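The progression from a 224 × 224 × 3 input to the 7 × 7 × 192 feature map can be traced with a small shape calculator. The sketch below is illustrative only: the stage widths (24, 48, 96, 192) are the standard ShuffleNetV2 0.5× widths and the 3 × 3, stride-2, padding-1 downsampling layers are assumptions, while the stacking counts follow the description above.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_backbone(size=224, stem_channels=24,
                   stage_channels=(48, 96, 192), basic_repeats=(3, 9, 3)):
    """Return a list of (layer name, height, width, channels) through the backbone."""
    shapes = []
    size = conv_out(size, 3, 2, 1)
    shapes.append(("Conv1", size, size, stem_channels))
    size = conv_out(size, 3, 2, 1)
    shapes.append(("MaxPool", size, size, stem_channels))
    for idx, (ch, reps) in enumerate(zip(stage_channels, basic_repeats), start=2):
        # One downsampling unit (stride 2) halves the resolution and sets the width;
        # the stacked basic units keep both resolution and channel count unchanged.
        size = conv_out(size, 3, 2, 1)
        shapes.append((f"Stage{idx} (1 down + {reps} basic)", size, size, ch))
    return shapes


for name, h, w, c in trace_backbone():
    print(f"{name}: {h} x {w} x {c}")
```

Under these assumptions the trace ends at 7 × 7 × 192 after Stage 4, consistent with the feature map size stated above.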

Experimental Environment and Parameter Settings
The parameters of the model training equipment used in the experiment: operating system, Windows 11; CPU, 11th Gen Intel(R) Core(TM) i7-11800H 2.30 GHz; GPU, NVIDIA GeForce RTX 3080; deep learning framework, Torch 1.9.0 + CUDA 11.3.
During the training process, a Stochastic Gradient Descent (SGD) optimizer is used for parameter updates to ensure the scientific reliability of the experimental conclusions. The number of iterations is set to 300, and the batch size is 16. The initial learning rate is set to 0.01 with a weight decay coefficient of 0.0005 to prevent the network from overfitting during training. A momentum factor of 0.937 is employed to prevent the model from becoming trapped in local optima or bypassing the global optimum.
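To show how these hyperparameters interact, the following sketch applies one SGD update with momentum and weight decay to a list of scalar weights. It assumes the common convention in which weight decay is added to the gradient before the momentum buffer is updated; the paper does not state which variant was used, and `sgd_step` is an illustrative helper rather than the training code.

```python
def sgd_step(weights, grads, velocity, lr=0.01, momentum=0.937, weight_decay=0.0005):
    """Return updated (weights, velocity) after one SGD step with momentum and L2 decay."""
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g = g + weight_decay * w   # L2 weight decay folded into the gradient
        v = momentum * v + g       # momentum buffer accumulates past gradients
        new_velocity.append(v)
        new_weights.append(w - lr * v)
    return new_weights, new_velocity


weights, velocity = sgd_step([1.0], [0.5], [0.0])
print(weights, velocity)
```

With the stated settings, a weight of 1.0 and a gradient of 0.5 move the weight by about 0.005 in the first step; in later steps the momentum term carries most of the previous update forward.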

Evaluation Indicators
Precision, recall, mean average precision (mAP), model size, parameter count, and FLOPs are used as evaluation indicators for the overall performance of the algorithm. The expressions for precision, recall, and mAP are as follows:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
$$mAP = \frac{1}{n}\sum_{i=1}^{n} AP_i, \quad AP_i = \int_0^1 P_i(R)\,dR$$
where TP denotes the count of correctly identified positive samples, FP denotes the count of negative samples incorrectly identified as positive, FN denotes the count of positive samples incorrectly identified as negative, and $n$ is the number of classes. mAP is typically obtained by calculating the Precision-Recall curve (PR curve) for each class and then averaging the Area Under the Curve (AUC) of all classes.
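These indicators can be sketched in a few lines. The example below is a simplified illustration: it integrates the PR curve with the trapezoidal rule, whereas detection toolkits typically use an interpolated precision envelope, and all counts and curve points are made-up example values rather than experimental results.

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision and recall from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(recalls, precisions):
    """Area under a PR curve by the trapezoidal rule; recalls must be sorted ascending."""
    area = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        area += width * (precisions[i] + precisions[i - 1]) / 2.0
    return area


def mean_average_precision(ap_per_class):
    """Average the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


p, r = precision_recall(tp=90, fp=10, fn=30)
ap = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.5])
print(p, r, ap, mean_average_precision([ap, 0.625]))
```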

Experimental Findings and Analysis of the Enhanced Network Model

Analysis of Experimental Results Introducing SimAM Attention Mechanism
Three groups of comparative experiments were conducted in this study to investigate the effectiveness of integrating the attention mechanism into the ShuffleNetv2 network model for non-coal foreign object recognition. The ShuffleNetv2 model with the SimAM attention mechanism introduced is compared with the baseline YOLOv8n model and with YOLOv8n after replacing its backbone network with the ShuffleNetv2 network (abbreviated as YOLOv8n-SM, YOLOv8n, and YOLOv8n-shuffle, respectively). Then, the effects of embedding the SimAM attention module at positions 1, 2, and 3 in Figure 7 were discussed separately. Through the comparison of results in Table 2, it was observed that by using the lightweight convolutional network ShuffleNetv2 as the backbone of YOLOv8, the parameter count and FLOPs were reduced by 48.4% and 42.0%, respectively, while maintaining recognition accuracy. Subsequently, by introducing the SimAM attention mechanism after the 1 × 1 convolutional modules of the two ShuffleNetv2 basic units, there was no increase in parameters or computational complexity. However, the precision, recall, and mAP were improved by 0.5%, 2%, and 0.2%, respectively. Based on Figure 8a,b, and in combination with Table 2, it can be seen that the YOLOv8n-shuffle model and YOLOv8n-SM model show better convergence performance compared to the baseline YOLOv8n network model. They both reach a stable state after approximately 60 iterations. The YOLOv8n-SM model achieves a precision of 95.3% (an improvement of 1.5%), a recall of 90.0% (an improvement of 2.2%), and the highest recognition accuracy of 93.8% (an improvement of 1.5%).
In the figure, the first row represents images of coal gangue, and the second row shows images of nuts.
Figure 9 illustrates that the YOLOv8n model places greater emphasis on other object features, such as conveyor belts and rollers, which increases inference time and overall computational complexity during feature extraction. However, when the lightweight ShuffleNetv2 network is used as the backbone, the model starts to pay attention to the feature regions where materials are conveyed on the conveyor belt. With SimAM integrated into the ShuffleNetv2 model, the attention is concentrated on the relevant feature regions of non-coal foreign objects, enhancing the model's ability to discern the key features of the foreign objects. Figure 9 also demonstrates that the model with the SimAM module extracts the feature information of non-coal foreign objects more accurately while effectively avoiding interference from non-essential features, such as the background environment.

To further validate the effectiveness of embedding the SimAM module in the ShuffleNetv2 network model, we conducted comparative experiments with several mainstream attention mechanisms; the results are presented in Table 3. Simply embedding a single attention mechanism into the network may disrupt the stability of the original network structure: although these mechanisms have little impact on the number of parameters and the amount of computation, they all affect the detection accuracy to varying degrees. With the SimAM attention mechanism, which is parameter-free, the network instead allocates more attention to important neurons, enabling more detailed feature extraction and thus improving the accuracy of foreign object detection.

Analysis of the Experimental Results of the Improved ShuffleNetv2 Network Model

We carried out three rounds of comparative experiments to investigate whether the reconstruction of the two basic units in the ShuffleNetv2 network is effective for non-coal foreign object recognition. The reconstructed ShuffleNetv2 network model (YOLOv8n-shuffle+) was compared with the baseline YOLOv8n network model and the YOLOv8n-shuffle model. As shown in Table 4 and Figure 10, changing the stacking of the basic units makes the reconstructed ShuffleNetv2 network model finer-grained, and the depth-wise separable convolution modules within the basic units better capture the finer shape features that distinguish different non-coal foreign objects. The model learns more informative features while keeping the parameter count and network computation relatively stable, which improves detection accuracy.

Analysis of Ablation Experiment Results
To further substantiate the efficacy of the optimization techniques employed in this study and of the integrated network model for recognizing non-coal foreign objects on conveyor belts, we compared the enhanced network model with the baseline YOLOv8n model. The results are presented in Table 5. According to Table 5, both the improved ShuffleNetv2 network model and the integration of the SimAM module positively impact the model's recognition accuracy. In particular, compared to the original YOLOv8n model, the YOLOv8n model with the ShuffleNetv2 network as the backbone achieves a precision of 94.8%, an increase of 1 percentage point; a recall of 88.0%, an increase of 0.2 percentage points; and an mAP of 93.5%, an increase of 1.2 percentage points. Its parameter count is 1.6 M, a reduction of 46.9%, and its computational load is 4.7 GFLOPs, a reduction of 42.0%. After the ShuffleNetv2 network is reconstructed, the precision reaches 96.9%, the recall 92.7%, and the mAP 95.6%.
In terms of model performance, the improved network model does not increase the parameter count and maintains a computational complexity of 4.7 GFLOPs. This demonstrates that integrating the SimAM attention mechanism and the reconstructed ShuffleNetv2 network model has not negatively affected the YOLOv8n network. On the contrary, it improves the model's recognition accuracy and detection speed.

Analysis of Experimental Results of Different Data Sets
Alongside the mAP and model parameter count, the neural network model's capacity for generalization serves as one of the metrics for assessing model quality. We expect a model trained on the dataset to deliver a sensible output when presented with new samples not included in that dataset. Because of the specific nature of the application, there is currently no openly accessible dataset for detecting foreign objects on belt conveyor systems, and manually introducing foreign objects into coal mine belt conveyor systems would contravene safety regulations. For safety reasons, validation was therefore conducted at Boshitong Limited in Taiyuan, Shanxi Province, China (a dataset containing 500 foreign object images, referred to as DataI). The assessment outcomes are displayed in Table 6, and some detection results are shown in Figure 11. Both the original and improved network models achieve relatively accurate detection results for large pieces of coal gangue and other foreign objects. However, the improved model also accurately detects buried foreign objects and small foreign objects in the image corners while avoiding redundant detection boxes.
To further assess the model's capacity for generalization, we applied various image-processing techniques to some images in the DataI dataset, simulating real underground mining conditions such as fog, dust, low lighting, and blurriness caused by robot motion during inspection. The detection outcomes are shown in Figure 12 (the leftmost column shows mild processing, the central column moderate processing, and the rightmost column severe processing). The results show that the object detection network benefits from the data augmentation techniques used during model training, which enable it to adapt well to environmental changes. Specifically, the detection results remain largely unaffected in the presence of dust and fog, and even under low lighting conditions the model can accurately identify buried objects. Although we considered motion blur and applied the corresponding preprocessing to the data before training, severe image blurriness can still lead to some missed detections.
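The paper does not state which image operations were used to simulate these conditions, so the following is only a hedged sketch of one plausible approach on a grayscale image stored as a list of rows with values in 0 to 255. The function names, the blending formula for fog, the brightness scaling for low light, the horizontal box filter for motion blur, and the three severity presets are all assumptions for illustration.

```python
def add_fog(img, strength):
    """Blend every pixel toward white; strength is in [0, 1]."""
    return [[round((1 - strength) * p + strength * 255) for p in row]
            for row in img]


def lower_light(img, factor):
    """Scale brightness by factor in [0, 1] to imitate low illumination."""
    return [[round(p * factor) for p in row] for row in img]


def motion_blur(img, kernel):
    """Horizontal box blur of odd width `kernel`, with clamped edges."""
    half = kernel // 2
    out = []
    for row in img:
        width = len(row)
        new_row = []
        for x in range(width):
            window = [row[min(max(x + dx, 0), width - 1)]
                      for dx in range(-half, half + 1)]
            new_row.append(round(sum(window) / len(window)))
        out.append(new_row)
    return out


# Hypothetical severity presets (mild, moderate, severe); not from the paper.
SEVERITY = {
    "mild": {"fog": 0.2, "light": 0.8, "blur": 3},
    "moderate": {"fog": 0.4, "light": 0.5, "blur": 7},
    "severe": {"fog": 0.6, "light": 0.3, "blur": 15},
}


def degrade(img, level):
    """Return the fog, low-light, and motion-blur variants for one level."""
    cfg = SEVERITY[level]
    return {
        "fog": add_fog(img, cfg["fog"]),
        "low_light": lower_light(img, cfg["light"]),
        "motion_blur": motion_blur(img, cfg["blur"]),
    }
```

Running a detector on the outputs of `degrade` for each of the three levels would reproduce the mild, moderate, and severe columns described above, under these assumed operations.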
Furthermore, this research assessed the new model's generalization capability using data from Reference [18] (a dataset containing 10,448 images with six types of non-coal foreign objects, denoted as DataII). The assessment outcomes are presented in Table 7, and some recognition results are shown in Figure 13. The visualized results show that both the original and improved networks recognize and detect large foreign objects well. However, for partially buried objects and objects with features similar to the background, the improved network demonstrates higher detection accuracy, as indicated by the higher confidence scores of its detection bounding boxes. Table 7 further shows that the improved network has fewer parameters and lower model complexity.

Likewise, we applied the same image adjustments to the DataII dataset as to DataI, and the detection results are shown in Figure 14 (the leftmost column shows mild processing, the central column moderate processing, and the rightmost column severe processing). The improved network model generalizes well to the new dataset, including in adverse environments such as dusty haze, low illumination, and mild blurring. However, it still exhibits relatively weak resistance to severe motion blur. In our view this is to be expected, since human vision behaves similarly: as motion blur increases, human judgment tends to decline.

Analysis of Experimental Results of Different Models
To further validate the effectiveness of the optimized model, this study employed transfer learning to train popular object detection algorithms, for example, YOLOv3 [26], YOLOv5, and YOLOv7 [27], on the dataset. Additionally, a comparison was made with several representative lightweight convolutional network models, including MobileNetV3 [28], EfficientNet [29], MobileNext [30], PP-LCNet [31], GhostNetV2 [32], and FasterNet [33]. We did not opt for two-stage object detection methods from the R-CNN series, primarily because field deployment in coal mines demands real-time detection while maintaining a specific level of accuracy, which is difficult to achieve with our current hardware infrastructure. For reliability, all experiments were conducted in triplicate using distinct random seeds, and the average values were recorded. The outcomes are displayed in Table 8 and Figure 15.
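The triplicate protocol described above amounts to averaging each metric over three seeded runs. A minimal sketch of that bookkeeping follows; the function name and the three mAP50 values are hypothetical placeholders, not results from this paper, and reporting the sample standard deviation alongside the mean is an added suggestion rather than something the paper states.

```python
import statistics


def summarize_runs(values):
    """Return (mean, sample standard deviation) over repeated runs."""
    return statistics.mean(values), statistics.stdev(values)


# Hypothetical mAP50 values (%) keyed by random seed; not from the paper.
runs = {0: 95.4, 1: 95.6, 2: 95.8}
mean_map, std_map = summarize_runs(list(runs.values()))
print(f"mAP50 = {mean_map:.1f} +/- {std_map:.1f} (n={len(runs)})")
```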
The comparison shows that, among the YOLO series, YOLOv5x achieves the highest recognition accuracy. However, its 86.7 M parameters and 205.7 GFLOPs pose deployment challenges and place high demands on hardware, making it unsuitable for real-time detection tasks in underground coal mines. The improved YOLOv8n model, on the other hand, maintains a high detection precision (mAP50: 95.6%) while further compressing the parameter count and computational complexity to 1.6 M and 4.7 GFLOPs, respectively. Additionally, as depicted in Figure 15, the improved YOLOv8 network still exhibits significantly higher detection accuracy, a lower parameter count, and lower computational complexity than the various lightweight convolutional networks.
Finally, we cross-compared the recognition and detection results of the foreign object detection models under different hardware conditions, as shown in Table 9, in three groups of comparative experiments.

In the first group, we evaluated several state-of-the-art detection networks from the recent literature. Regardless of the hardware conditions, each network's parameter size remained unaffected, while recognition speed (frames per second) increased with GPU upgrades.

In the second group, we horizontally compared lightweight detection networks commonly used in computer vision. Under the same hardware conditions, the parameter size generally remained within 4 MB. However, inference speed was influenced by factors such as computing power and bandwidth, and these networks did not achieve an optimal balance between recognition speed and accuracy.

In the third group, we vertically compared the improved network's detection results under various hardware conditions. The results indicate that the improved method does not impose strict hardware requirements for detection and inference speed. In other words, even on devices with relatively weak computing capabilities, our method can still achieve relatively good detection results. This opens up a direction for applications on edge devices with limited computational resources.
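Frames-per-second figures such as those compared above are usually obtained by timing repeated inference calls after a short warm-up. The paper does not describe its timing procedure, so the sketch below is only an assumed harness: `measure_fps`, the `warmup` count, and the dummy workload standing in for a detector are all hypothetical.

```python
import time


def measure_fps(infer, frames, warmup=2, clock=time.perf_counter):
    """Return frames per second achieved by `infer` over `frames`.

    `infer` is any callable taking one frame (a stand-in for a detector).
    The first `warmup` frames are run once beforehand and not timed.
    """
    for frame in frames[:warmup]:
        infer(frame)
    start = clock()
    for frame in frames:
        infer(frame)
    elapsed = clock() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    # Dummy workload: summing a list stands in for model inference.
    dummy_frames = [[0] * 1000 for _ in range(200)]
    fps = measure_fps(lambda frame: sum(frame), dummy_frames)
    print(f"{fps:.1f} FPS")
```

Passing the clock as a parameter keeps the harness testable with a fake timer and makes it easy to swap in a device-synchronized timer on GPU hardware.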

Discussion
The foreign object identification and detection technique for conveyor belts presented in this paper, based on the improved YOLOv8, has yielded notable detection outcomes on both the evaluated datasets and in laboratory settings. Although its recognition speed and accuracy are superior to those of most current classic algorithms, some areas can still be improved.
The detection results in Figures 12g-i and 14g-i show that, although motion-blur preprocessing was applied to the non-coal foreign object data before training, the improved network model's resistance to motion blur is not ideal. In other words, the detection results are still affected by image clarity, and as the degree of image blur increases, the detection accuracy decreases. It is therefore important to fine-tune certain parameters of the acquisition apparatus, for example by decreasing the exposure time and by adjusting the installation angle, focal length, and height of the apparatus to widen the field of view, particularly along the conveyor belt's length. This can maximize the completeness and clarity of the acquired image data.
In addition, noise interference is another important factor affecting image detection results. Excessive noise can interfere with the network model's judgments during recognition. As shown in Figure 16, we tested the model's resistance to noise interference on three different non-coal foreign object datasets. As the noise level increases, the model's recognition accuracy degrades to a certain degree. In industrial settings, erroneous or delayed judgments can pose safety risks to the work site. It is therefore necessary to apply appropriate image-denoising techniques to the raw data during on-site debugging.
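The paper does not specify the noise model or the denoising technique, so the following is a hedged sketch of one common pairing: additive Gaussian noise to probe robustness and a 3 × 3 median filter as a simple denoiser. The grayscale list-of-rows image format, the function names, and the fixed seed are assumptions for illustration.

```python
import random
import statistics


def add_gaussian_noise(img, sigma, seed=0):
    """Add zero-mean Gaussian noise and clamp the result to [0, 255]."""
    rng = random.Random(seed)  # fixed seed so repeated tests are comparable
    return [[min(255, max(0, round(p + rng.gauss(0, sigma)))) for p in row]
            for row in img]


def median_filter3(img):
    """3 x 3 median filter with clamped borders; removes impulse-like noise."""
    height, width = len(img), len(img[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                img[min(max(y + dy, 0), height - 1)][min(max(x + dx, 0), width - 1)]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            row.append(statistics.median(window))
        out.append(row)
    return out
```

Increasing `sigma` step by step and re-running detection on the noisy images, with and without `median_filter3`, is one way to quantify how much a denoising stage recovers.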

Conclusions
This paper introduces a foreign object identification method, based on an improved version of YOLOv8, that is designed to be compatible with hardware constraints. The experiments support the following conclusions. Compared to the baseline YOLOv8n, the improved model reduces the number of model parameters by 48.4% and the computational cost by 42.0% while achieving a peak detection accuracy of 95.6%. The method also performs well on new datasets and in comparison with various object detection approaches. We hope that the approach outlined in this paper will assist the broader community of developers and researchers engaged in foreign object identification and detection on edge computing devices.
More importantly, this approach offers a cost-effective, efficient, and precise solution for foreign object identification and detection, even in challenging scenarios involving intricate backgrounds, low-light conditions, and the demand for real-time decision-making. Implementing this method on edge devices can significantly reduce detection delays, give the host computer system more time to respond, minimize conveyor belt downtime, and ultimately enhance the overall operational efficiency of coal mining facilities.

Future Work
In future research, we intend to enlarge and enhance the dataset to address sample imbalance, which should in turn allow detection speed to be improved more efficiently. Given that noise disturbances in underground coal mines

Figure 1. Experimental environment. (a) Belt conveyor used in the experiment; (b) mine inspection robot.

Figure 2. Partial foreign object data images. (a) Original images; (b) preprocessed images.

Figure 8. The result curves of the three network models. (a) mAP result curve; (b) validation set classification loss curve.

The activation heatmap provides clear visual evidence for the model's classification results, where deeper colors indicate that the model has focused more on the relevant regions, resulting in more accurate detection of the foreign object targets. This article extracts the heatmaps of the YOLOv8n, YOLOv8n-shuffle, and YOLOv8n-SM networks at the same layer (the sixth layer of the YOLOv8n network, and the sixth layer of the YOLOv8n-shuffle and YOLOv8n-SM networks, which is the Stage3 layer).

Figure 12. Foreign object detection results in different environments. (a-c) Mist and dust recognition results; (d-f) low light recognition results; (g-i) motion blur recognition results.

Figure 14. Foreign object detection results in different environments. (a-c) Mist and dust recognition results; (d-f) low light recognition results; (g-i) motion blur recognition results.

Figure 15. Comparison of mainstream convolutional networks. (a) Parameter quantity results; (b) mAP50 and model calculation quantity.
Lightweight Convolutional Network

Depthwise Separable Convolution

Depthwise Separable Convolution (DWConv) is an efficient feature extraction module composed of depthwise and pointwise convolution. It has significantly fewer parameters and a lower computational cost than ordinary convolution while being able to capture more informative feature representations. Compared to the ordinary convolution with h

Table 2. The results after introducing SimAM.

Table 3. Comparison of experimental results with different attention mechanisms.

Table 4. Improved ShuffleNetv2 network model results comparison.

Table 5. Assessment of results from ablation experiments.

Table 6. Comparison of DataI before and after improvement.

Table 7. Comparison of DataII before and after improvement.

Table 8. Comparison of outcomes from various network detection models. The recognition accuracy of the six foreign objects is given in %, the parameter count in M, and FLOPs in G.

Table 9. Inference speed and hardware condition results for different models.