Application of Lightweight Convolutional Neural Network for Damage Detection of Conveyor Belt

Abstract: To address the problem that mining conveyor belts are easily damaged under severe working conditions, this paper proposes a deep learning-based conveyor belt damage detection method. To further explore the applicability of lightweight CNNs to conveyor belt damage detection, the paper integrates MobileNet with the Yolov4 network to obtain a lightweight Yolov4, and tests it on an existing conveyor belt damage dataset containing 3000 images. The test results show that the lightweight network can effectively detect conveyor belt damage, with a fastest test speed of 70.26 FPS and a highest test accuracy of 93.22%. Compared with the original Yolov4, the accuracy is increased by 3.5% and the speed by 188%. Comparison with other existing detection methods verifies the strong generalization ability of the model, which provides technical support and an empirical reference for the visual monitoring and intelligent development of belt conveyors.


Introduction
The belt conveyor is one of the most important pieces of equipment for bulk material transportation and is widely used in coal mines, docks, ports, the chemical industry, and other fields. At present, it is developing towards long-distance, high-speed, small-radius space turning, and intelligent operation [? ]. The intelligentization of belt conveyors refers to the self-perception and self-adjustment of their operating status through modern sensing technology and artificial intelligence, together with the autonomous and unattended operation of the equipment [? ? ]. The intelligent transportation system is a safe, efficient, intelligent, and unmanned transportation system that integrates advanced technologies such as intelligent driving, intelligent control, intelligent operation and maintenance, and unmanned driving. Its core lies in intelligent mining transportation equipment [? ]. Current research on the intelligent development of belt conveyors focuses on: energy-efficient equipment or energy-saving technology for belt conveyors, especially load-based energy-saving speed regulation systems [? ? ? ? ? ? ? ]; expert-based fault diagnosis systems based on noise and vibration monitoring [? ? ]; and running state detection technology based on vision and image processing, including deviation monitoring [? ], belt speed monitoring [? ], material flow detection [? ], foreign body identification [? ], tear detection, roller temperature monitoring [? ], etc. This article focuses on the visual monitoring of mining conveyor belt damage.
The conveyor belt, as an important component of the belt conveyor, carries the material and provides traction, and its cost accounts for 30-50% of the total price of the belt conveyor [? ]. The health of the conveyor belt seriously affects the normal operation of the belt conveyor, which in turn affects the safe and efficient production of the entire enterprise. In actual production, conveyor belts often suffer from falling material impact, chute jamming, piercing by foreign bodies, etc., which can easily cause abnormal damage, thereby shortening the service life of the conveyor belt and increasing production costs. Furthermore, if the damage cannot be detected and treated in time, it may eventually lead to a conveyor belt tearing accident, which causes greater harm. The current detection methods for conveyor belt tearing include: the weak magnetic test [? ], the built-in sensor chip method [? ], the machine vision method [? ? ? ? ], the line laser-assisted method [? ], the infrared camera-assisted method [? ? ? ? ], and the audio-visual fusion method [? ]. Most of this research focuses only on the detection of conveyor belt tearing and does not cover other types of damage, which is a limitation.
Relatively little research has been done on the detection of multiple forms of conveyor belt damage. A conveyor belt damage detection method based on ADCN (Adaptive Deep Convolutional Network), which is essentially a variant of SPP-Yolov3, was proposed in Ref. [? ], enabling the detection of both Scratch and Tearing damage states. A deep learning-based detection method was then proposed in Ref. [? ]: by constructing a conveyor belt damage dataset and classifying the belt damage types into four categories, namely surface wear, surface damage, breakdown, and tear, an Efficientnet-Yolov3-based target detection network was built to classify and locate the damage, achieving a highest prediction accuracy of 97.26% and a fastest prediction speed of 42 FPS on the dataset.
This paper mainly focuses on improving the speed of deep learning-based conveyor belt damage detection, which is realized by making the target detection network lightweight. As the conveyor belt moves faster, cameras with a higher frame rate are needed to capture clear and stable images of the conveyor belt surface; otherwise, missed detections or motion blur may occur. At the same time, if the processing or prediction speed does not keep up, the image input and signal output fall out of sync, which easily causes delay and lag and also affects the detection results. Therefore, this paper discusses the application and performance of lightweight convolutional neural networks in conveyor belt damage detection, which suit scenarios with limited storage space and computing capacity and also meet the needs of high-speed belt conveyor development.
There are various ways to make a neural network lightweight. In terms of network structure, a target detection network can be divided into two parts: the backbone feature extraction network and the prediction network. The quality of the features extracted by the backbone directly affects the prediction quality of the prediction network. Similarly, the number of parameters and calculations of the backbone directly affects the detection speed of the target detection network. Generally, the number of parameters is positively correlated with the detection accuracy and negatively correlated with the detection speed. At present, representative lightweight networks that reduce the number of parameters include SqueezeNet, MobileNet, and ShuffleNet. Among them, SqueezeNet adopts a well-designed compression and expansion structure, MobileNet uses the more efficient depthwise separable convolution, and ShuffleNet proposes a channel shuffling operation, which further reduces the computational complexity of the model. The lightweight measures taken in this article are based on MobileNet. By replacing the CSPDarknet53 backbone of Yolov4 with MobileNet, depthwise separable convolution is introduced directly and a lightweight Yolov4 model is obtained, which is then applied to the detection of conveyor belt damage. Theoretically, the detection speed should be faster and can meet the needs of high-speed conveyor development.
Section ?? explains the structure of the lightweight neural network and its implementation, together with the operating environment and related parameter settings; results are presented in Section ??, and conclusions are given in Section ??.

Principles and Methodology
With the rapid development of computer technology, deep convolutional neural networks have become the mainstream research method in the field of target detection, thanks to their better performance and the complete automation of feature engineering. They have even replaced the traditional target detection pipeline of region filtering, feature extraction, and feature classification, eliminating the need to design feature extractors manually. Deep convolutional neural networks are now widely used in areas such as handwritten text transcription, image search, autonomous driving, pose estimation, and instance segmentation.
Supervised learning-based target detection methods at the current stage can be divided into two categories: the first is based on the anchor mechanism, and the second is based on the anchor-free or key-point mechanism. Anchor-based target detection algorithms can in turn be roughly divided into two types: one-stage and two-stage. Two-stage algorithms, such as the R-CNN series [? ? ], are based on candidate regions: they first generate candidate boxes through a region proposal network (RPN), then use convolutional neural networks for classification, and finally apply non-maximum suppression (NMS), which removes duplicated detections of the same instance by computing the Intersection over Union (IoU). This process is more accurate but slower, and it is difficult to meet real-time requirements because of the large number of candidate boxes. One-stage algorithms are based on regression and are represented by SSD (Single Shot MultiBox Detector) and YOLO (You Only Look Once) [? ? ]; they slide a complex arrangement of possible bounding boxes, called anchors, over the image and classify them directly without specifying the box content. Algorithms based on the anchor-free or key-point mechanism detect targets directly by learning the key features of the input image instead of generating a series of anchor boxes, omitting the RPN and NMS steps, which makes the prediction process more direct and faster; representative algorithms include CornerNet, FCOS, ExtremeNet, and CenterNet.
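To make the role of IoU and NMS in the paragraph above concrete, the following is a minimal pure-Python sketch of greedy NMS. It is an illustration only, not the implementation used in this paper; the corner-format boxes, the function names, and the 0.5 threshold are assumptions chosen for the example.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Retain only boxes whose overlap with the kept box is at or below the threshold.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


if __name__ == "__main__":
    demo_boxes = [(0.0, 0.0, 10.0, 10.0), (1.0, 1.0, 11.0, 11.0), (20.0, 20.0, 30.0, 30.0)]
    demo_scores = [0.9, 0.8, 0.7]
    print(nms(demo_boxes, demo_scores))  # indices of the boxes that survive
```

In the demo, the second box overlaps the first with an IoU of 81/119 (about 0.68), so it is treated as a duplicate detection and suppressed, while the distant third box is kept.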
The research in this paper is based on Yolov4 [? ]. The MobileNet backbone feature extraction network is combined with the Yolov4 network to simplify Yolov4, thereby reducing the number of parameters and improving the detection speed.

Network Structure and Improvement Methods
MobileNet is a lightweight deep neural network proposed by Google, with three versions: V1 [? ], V2 [? ], and V3 [? ]. MobileNet V1 uses depthwise separable convolution instead of standard convolution for feature extraction, which greatly reduces the number of parameters and calculations, making its computation $\frac{1}{N} + \frac{1}{D_k^2}$ times that of standard convolution. When the input is an RGB image and the size of the convolution kernel is 3 × 3, the calculation amount can be reduced to about 1/9 of that of the standard convolution, because the $\frac{1}{N}$ term is small when the number of kernels $N$ is large. The principle is illustrated in Figure ?? and Equations (??)-(??). At the same time, the channel number scaling factor $\alpha$ and the input image resolution factor $\rho$ were introduced to adjust the number of channels in each layer of the network and the input image resolution, respectively, to further reduce the computational effort; the parameter amount or calculation amount of the model is positively correlated with $\alpha^2$ and $\rho^2$. The parameters of depthwise separable convolution and standard convolution can be calculated with Equations (??)-(??), and the two can be compared according to Equation (??).
In the formulas, $N_{S\text{-}params}$ and $N_{S\text{-}cal}$ denote the number of parameters and the amount of calculation of the standard convolution, and $N_{D\text{-}params}$ and $N_{D\text{-}cal}$ denote the number of parameters and the amount of calculation of the depthwise separable convolution. $D_k \times D_k \times M$ is the kernel size, $N$ is the number of kernels, and $D_F \times D_F \times M$ is the input size.
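The referenced equations are not reproduced in this text. A reconstruction under the standard MobileNet formulation is sketched below; it assumes a stride of 1 with padding that preserves the spatial size, and it uses $D_F$ for the spatial width and height of the input feature map, which is an assumed symbol. The last line recovers the ratio $\frac{1}{N} + \frac{1}{D_k^2}$ quoted above.

```latex
\begin{align}
N_{S\text{-}params} &= D_k \times D_k \times M \times N \\
N_{S\text{-}cal}    &= D_k \times D_k \times M \times N \times D_F \times D_F \\
N_{D\text{-}params} &= D_k \times D_k \times M + M \times N \\
N_{D\text{-}cal}    &= D_k \times D_k \times M \times D_F \times D_F + M \times N \times D_F \times D_F \\
\frac{N_{D\text{-}cal}}{N_{S\text{-}cal}}
  &= \frac{N_{D\text{-}params}}{N_{S\text{-}params}}
   = \frac{1}{N} + \frac{1}{D_k^2}
\end{align}
```

For a 3 × 3 kernel the second term is 1/9, which dominates once $N$ reaches the hundreds, hence the "about 1/9" figure.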
MobilenetV2 continues to use the depthwise separable convolution of MobilenetV1 and adds an inverted residual connection similar to that of the residual network, as shown in Figure ??. Considering that little feature information can be extracted by applying a convolutional layer to low-dimensional tensors, MobilenetV2 uses an expansion convolution layer to obtain a larger tensor, uses depthwise convolution to filter the data, and then uses a projection layer to reduce the tensor again [? ]. By adjusting the low-dimensional tensor in this way, the parameter amount of MobilenetV2 is reduced to about 80% of that of V1, and the speed is increased by about 33%.

MobilenetV3 was designed mainly through a combination of complementary search techniques: hardware-aware network architecture search (NAS) complemented by the NetAdapt algorithm [? ]. The former (NAS) is used to search the various modules of the network under limited calculation and parameter budgets, also called block-wise search, and the latter (NetAdapt) is used to fine-tune the network layers after each module is determined. MobilenetV3 also retains the depthwise separable convolution of MobilenetV1 and the bottleneck with residual structure of MobilenetV2. On this basis, a lightweight attention module based on the squeeze-and-excitation structure of SENet was added to adjust the number of channels, as shown in Figure ??. In addition, h-swish was used as the activation function instead of swish to reduce the amount of calculation and improve performance.

As an improved version of Yolov3, Yolov4 introduces many improvements. The network structure is shown in Figure ??. CSPDarknet53 is used instead of Darknet53 as the backbone feature extraction network. The use of CSPnet enables the fusion of high-level and low-level semantic information and reduces the loss. SPPnet and PAnet are used to expand the receptive field and repeatedly extract image features, which greatly improves the feature extraction capability. As in Yolov3, Yolov4 uses the extracted feature information to make predictions through YoloHead.
It can be seen from Figure ?? that in YoloV4 three feature layers are extracted from the backbone feature extraction network, enhanced by SPPnet and PAnet, and then passed to YoloHead for prediction. To obtain a lightweight Yolov4 network structure, this paper replaces the backbone feature extraction network CSPDarknet53 of YoloV4 with MobileNet, keeps the feature fusion and feature enhancement strategy of the original YoloV4, and makes predictions through YoloHead. The structure of MobilenetV1-YoloV4 is used here to explain the replacement operation; the structure and replacement principle of MobilenetV2-YoloV4 and MobilenetV3-YoloV4 are the same. In YoloV4, the feature layers fed to SPPnet or PAnet are the original image downsampled 3, 4, and 5 times (i.e., by factors of 8, 16, and 32). Similarly, in the MobilenetV1 network, the feature layers that have been downsampled 3, 4, and 5 times are passed into the subsequent feature enhancement, as shown in Figure ??. At the same time, the number of channels of the layers is changed by adjusting the value of α to achieve different degrees of lightness. In addition, the standard convolutions in PAnet are replaced with depthwise separable convolutions to further reduce the number of parameters.
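The effect of the two parameter-saving measures described above (depthwise separable convolution and the channel scaling factor α) can be checked with a small counting sketch. The layer widths 256 and 512, the truncating rounding in `scale_channels`, and the omission of bias and batch-normalization parameters are assumptions made for illustration; they are not taken from the networks in this paper.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights of a standard convolution (bias omitted): D_k * D_k * M * N."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel: int, c_in: int, c_out: int) -> int:
    """Depthwise (D_k * D_k * M) plus 1x1 pointwise (M * N) weights, bias omitted."""
    return kernel * kernel * c_in + c_in * c_out


def scale_channels(channels: int, alpha: float) -> int:
    """Apply the width multiplier alpha; plain truncation is an assumption here."""
    return max(1, int(channels * alpha))


if __name__ == "__main__":
    c_in, c_out = 256, 512  # hypothetical layer widths, not taken from the paper
    for alpha in (0.25, 0.5, 0.75, 1.0):
        m, n = scale_channels(c_in, alpha), scale_channels(c_out, alpha)
        std = standard_conv_params(3, m, n)
        dws = depthwise_separable_params(3, m, n)
        print(f"alpha={alpha}: standard={std}, separable={dws}, ratio={dws / std:.4f}")
```

For the hypothetical 256-to-512 layer with a 3 × 3 kernel, the separable variant needs 133,376 weights instead of 1,179,648, a ratio of 1/512 + 1/9. Halving α divides the standard-convolution count by exactly four, which is the α² dependence discussed in the text.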

Calculation of Loss-Function
The loss function during network training consists of three parts: the regression loss $L_{CIoU}$, the classification loss $L_{class}$, and the confidence loss $L_{conf}$. The regression loss $L_{CIoU}$ measures the error between the position and size of the prediction box and those of the true label; the classification loss $L_{class}$ measures the error between the predicted class and the real class; and the confidence loss $L_{conf}$ relates to the confidence score predicted for each bounding box. The calculation of the loss function is shown in Equations (??)-(??). The regression loss of the prediction box is calculated with the CIoU function, which extends IoU by also considering the overlap, center distance, and aspect ratio of the boxes and thus better ensures the stability of the training process; the confidence loss and classification loss are calculated with the cross-entropy function.
$$Loss = L_{CIoU} + L_{conf} + L_{class} \qquad (11)$$

In the equations, $IoU(A, B)$ is the intersection over union of the predicted box and the real labeled box; $\rho^2(A_{ctr}, B_{ctr})$ is the squared Euclidean distance between the center points of the predicted box and the real labeled box; $m$ is the diagonal length of the smallest enclosing area that contains both the predicted box and the real labeled box; $\alpha$ is the weight function; $\nu$ is the aspect ratio similarity coefficient; $w^{gt}$, $h^{gt}$ are the width and height of the real labeled box; $w$, $h$ are the width and height of the prediction box; $S^2$ is the number of grid cells; $B$ is the number of prediction boxes per grid cell; $I_{ij}^{obj}$ indicates that the target is included in the prediction box; $I_{ij}^{noobj}$ indicates that the target is not included in the prediction box; $C_i^j$ is the prediction confidence; $\bar{C}_i^j$ is the true confidence; $\lambda_{noobj}$ is a user-defined weighting coefficient; $c$ is the target class; $P_i^j(c)$ is the true probability that the target in the box belongs to a certain category; and $\bar{P}_i^j(c)$ is the predicted probability that the target in the box belongs to a certain category.
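The individual loss terms are not reproduced in this text. The commonly used CIoU and Yolo-style cross-entropy forms, written with the symbols defined above, are sketched below as an assumption about what the missing equations contain; the exact summation limits and weighting in the original may differ.

```latex
\begin{align}
L_{CIoU} &= 1 - IoU(A, B) + \frac{\rho^2(A_{ctr}, B_{ctr})}{m^2} + \alpha \nu \\
\alpha &= \frac{\nu}{1 - IoU(A, B) + \nu} \\
\nu &= \frac{4}{\pi^2} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^2 \\
L_{conf} &= - \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{obj}
            \left[ \bar{C}_i^j \log C_i^j + (1 - \bar{C}_i^j) \log (1 - C_i^j) \right] \nonumber \\
         &\quad - \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{noobj}
            \left[ \bar{C}_i^j \log C_i^j + (1 - \bar{C}_i^j) \log (1 - C_i^j) \right] \\
L_{class} &= - \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{obj} \sum_{c \in classes}
            \left[ P_i^j(c) \log \bar{P}_i^j(c) + (1 - P_i^j(c)) \log (1 - \bar{P}_i^j(c)) \right]
\end{align}
```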
When there is no target in the prediction box, only the confidence loss $L_{conf}$ is calculated. If there is a target in the prediction box, all three losses are calculated according to Equations (??)-(??).
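As an illustration of the regression term, the sketch below computes the standard CIoU loss for a single pair of boxes in pure Python. The center-format boxes `(cx, cy, w, h)`, the function name, and the guard for identical boxes are assumptions of this example; a training implementation would operate on batched tensors.

```python
import math
from typing import Tuple

# Assumed box format: (center_x, center_y, width, height), with width, height > 0
CBox = Tuple[float, float, float, float]


def ciou_loss(pred: CBox, gt: CBox) -> float:
    """Standard CIoU loss: 1 - IoU + rho^2 / m^2 + alpha * v."""
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt

    # Convert to corner coordinates.
    p_x1, p_y1, p_x2, p_y2 = px - pw / 2, py - ph / 2, px + pw / 2, py + ph / 2
    g_x1, g_y1, g_x2, g_y2 = gx - gw / 2, gy - gh / 2, gx + gw / 2, gy + gh / 2

    # IoU of the two boxes.
    inter_w = max(0.0, min(p_x2, g_x2) - max(p_x1, g_x1))
    inter_h = max(0.0, min(p_y2, g_y2) - max(p_y1, g_y1))
    inter = inter_w * inter_h
    union = pw * ph + gw * gh - inter
    iou = inter / union

    # Squared distance between the box centers.
    rho2 = (px - gx) ** 2 + (py - gy) ** 2

    # Squared diagonal of the smallest box enclosing both boxes.
    enc_w = max(p_x2, g_x2) - min(p_x1, g_x1)
    enc_h = max(p_y2, g_y2) - min(p_y1, g_y1)
    m2 = enc_w ** 2 + enc_h ** 2

    # Aspect ratio consistency term and its weight.
    v = (4.0 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(pw / ph)) ** 2
    alpha = v / (1.0 - iou + v) if v > 0 else 0.0

    return 1.0 - iou + rho2 / m2 + alpha * v


if __name__ == "__main__":
    print(ciou_loss((5.0, 5.0, 10.0, 10.0), (5.0, 5.0, 10.0, 10.0)))
    print(ciou_loss((0.0, 0.0, 2.0, 2.0), (10.0, 0.0, 2.0, 2.0)))
```

Unlike plain IoU, the loss keeps growing with the center distance even when the boxes do not overlap, which is why it gives a useful gradient early in training.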

Operating Environment and Parameter Settings
The rapid development of neural networks builds on advances in computing and mathematics. The computing power of modern computers makes object detection based on deep learning possible. With limited computing resources, the width and depth of the network and the resolution of the input image all affect the number of parameters of the network, and thereby the amount of calculation and the prediction speed [? ]. This is why this paper explores the application of lightweight neural networks to conveyor belt damage detection. The aim is to improve the speed of the algorithm through a lightweight model under limited computing resources, while preserving the accuracy of the model as far as possible, in order to meet the needs of belt conveyors with high belt speed.
The running and testing environment of the algorithm in this paper is shown in Table ??.

Data Preparation
The conveyor belt damage dataset used in this paper is provided in Ref. [? ] and contains 3000 images. The damage types are divided into four categories: surface wear, surface damage, tear, and breakdown, each accounting for one quarter of the images. The dataset was labeled manually and stored in the VOC2007 format.
During training, the Mosaic and CutMix data augmentation strategies were used to increase the variability of the input images, enrich the image background information, and improve the robustness and generalization ability of the model. The network was trained for 100 epochs with a stepwise decreasing learning rate: the initial learning rate was $1 \times 10^{-3}$ and it dropped to 1/10 of its previous value at epoch 50 and again at epoch 80. The batch size was set to 16.
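The learning rate schedule described above can be written as a short step function. This is a sketch of the stated schedule only; the 0-based epoch indexing and the function and parameter names are assumptions, and the actual training code may express the same schedule through a framework scheduler.

```python
from typing import Sequence


def learning_rate(epoch: int,
                  base_lr: float = 1e-3,
                  milestones: Sequence[int] = (50, 80),
                  gamma: float = 0.1) -> float:
    """Step decay: multiply the rate by gamma at every milestone already reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


if __name__ == "__main__":
    # Print the rate at the start, around each drop, and at the final epoch.
    for epoch in (0, 49, 50, 79, 80, 99):
        print(epoch, learning_rate(epoch))
```

Under these assumptions the rate is 1e-3 for epochs 0-49, 1e-4 for epochs 50-79, and 1e-5 for epochs 80-99.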

Detection Results of Unscaled Networks
In engineering practice, the quality of a detection algorithm is usually measured by the mean Average Precision (mAP) and the test speed in frames per second (FPS). mAP is the average of the AP values of all object classes evaluated on the validation set, and measures the overall detection accuracy of the algorithm. FPS is the number of frames that can be processed per second, and measures the processing speed of the algorithm. Figure ?? shows the detection results of MobilenetV3-YoloV4-1.0 (1.0 means α = 1.0, that is, the channel number scaling factor is 1.0) on the dataset of this article. Figure ??a-d corresponds to the damage types tear, breakdown, surface damage, and surface wear, in that order; it can be seen that the algorithm detects multiple damage types well.

Table ?? gives the quantitative results of multiple models, including the prediction accuracy (AP) for each damage type, the mean Average Precision (mAP), and the prediction speed (FPS) of each algorithm. The data in Table ?? can be divided into three parts. The first part contains the results of 7 current mainstream target detection algorithms, including the two-stage algorithm Faster R-CNN, the one-stage algorithms SSD and YOLO, and the key-point-based algorithm Centernet. Among them, the Resnet50-based Centernet reached the highest average prediction accuracy of 95.05% with a fastest detection speed of 32.4 FPS. The second part contains the detection results obtained in Ref. [? ]. Among them, EfficientNet-B0-Yolov3 achieved the fastest detection speed of 41.91 FPS, and EfficientNet-B4-Yolov3 achieved the highest detection accuracy of 97.26%. Compared with the original Yolov3 algorithm, the accuracy is increased by 10.4% and the speed by 45.9%.
The third part contains the detection results of the 3 lightweight neural networks proposed in this paper, which combine Mobilenet with the Yolov4 network. Among them, MobilenetV1-YoloV4-1.0 achieved the fastest prediction speed of 51.12 FPS, and MobilenetV3-YoloV4-1.0 achieved the highest prediction accuracy of 93.08%. The original Yolov4 network reached a detection speed of 24.39 FPS with an accuracy of 90%. The lightweight algorithms in Table ?? therefore improve on the original Yolov4 by 3.4% in the highest prediction accuracy and by 109% in prediction speed. It should be noted, however, that the fastest prediction speed and the highest prediction accuracy are not achieved by the same lightweight network.
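The relative improvements quoted above follow from simple arithmetic on the reported figures, as the sketch below shows. The helper name is an assumption of this example; the Yolov4 accuracy is taken as the rounded 90% quoted in the text, so the accuracy gain is only approximate.

```python
def relative_gain_percent(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


if __name__ == "__main__":
    yolov4_fps = 24.39   # original Yolov4 speed reported in the text
    yolov4_map = 90.0    # original Yolov4 accuracy, rounded value quoted in the text

    # Unscaled networks (alpha = 1.0)
    print(relative_gain_percent(51.12, yolov4_fps))   # MobilenetV1-YoloV4-1.0 speed gain
    print(relative_gain_percent(93.08, yolov4_map))   # MobilenetV3-YoloV4-1.0 accuracy gain

    # Fastest scaled network reported in the abstract (70.26 FPS)
    print(relative_gain_percent(70.26, yolov4_fps))
```

The speed gains come out at roughly 109.6% and 188.1%, matching the 109% and 188% figures, and the accuracy gain at roughly 3.4%.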

Detection Results of Scaled Networks
As mentioned previously, in MobileNet the number of channels in each layer of the backbone feature extraction network is adjusted through the channel number scaling factor α in order to adjust the number of parameters. Different scaling factors α are selected for the different backbone feature extraction networks: for MobilenetV1, α = 0.25, 0.5, 0.75, 1.0; for MobilenetV2, α = 0.5, 0.75, 1.0, 1.3; for MobilenetV3, α = 0.75, 1.0. Only the test results of the models with scaling factor α = 1 are shown in Table ??; the remaining results are shown in Figure ??.
In general, the number of channels is positively correlated with the feature extraction capability of the backbone, and also with the amount of computation and the number of parameters. The higher the number of channels, the more feature information the network extracts, which in turn increases the detection accuracy, but the accompanying increase in the number of parameters results in a certain loss of speed. The effect of different channel number scaling factors on detection accuracy and speed is shown in Figure ?? and Table ??.
As can be seen from Table ??, the improved algorithms using network scaling achieved both a faster detection speed and a higher detection accuracy than the unscaled ones, with MobilenetV1-Yolov4-0.25 achieving the fastest detection speed of 70.26 FPS among all algorithms, and MobilenetV2-Yolov4-1.3 achieving the highest detection accuracy of 93.22% among the lightweight networks.
The effect of different channel number scaling factors on the test results can be found in Figure ??a. When the number of channels of MobilenetV1 is adjusted, the average prediction accuracy generally increases as α increases, while the detection speed decreases continuously, as shown in Figure ??a. For MobilenetV2, the prediction accuracy first increases, then decreases, and then increases again as α increases, which may be due to the small size of the dataset or an inappropriate batch size. For MobilenetV3, the pattern is basically the same as for MobilenetV1: as α increases, the number of channels increases and the detection accuracy increases, but there is a small loss in detection speed. Compared with the results obtained by Efficientnet-Yolov3 in Ref. [? ], the lightweight networks proposed in this paper have a great advantage in detection speed, except for MobilenetV2-Yolov4-1.3, and the fastest detection speed achieved in this paper, about 70 FPS, is approximately 1.7 times the fastest detection speed achieved by Efficientnet-Yolov3. However, because the number of channels is compressed and adjusted, the improved algorithms proposed in this paper are relatively weak in extracting image feature information and do not reach the detection accuracy reported in Ref. [? ].

The parameters of the target detection models under the various scaling factors are compared in Figure ??b. Comparing how the parameter amount of the improved algorithms grows under the different factors confirms that the parameter amount is proportional to $\alpha^2$. It can also be clearly seen that combining Mobilenet with Yolov4 effectively reduces the number of parameters of Yolov4. Taken together with the detection speed results, this indicates that reducing the number of parameters by compressing the channels of the network is an effective way to improve detection speed, but is not conducive to maintaining detection accuracy.

Verification of Generalization Proficiency
In addition to testing the algorithm on the dataset, we selected conveyor belt damage images from Refs. [? ? ? ? ? ] to validate the generalization ability of the model; the results are shown in Figure ??. Generalization ability refers to the ability of a neural network model to adapt to fresh samples: a model trained on the dataset is expected to still give reasonable output when faced with data from outside the dataset. Besides the mean Average Precision (mAP) and the prediction speed (FPS), the generalization ability is the third common measure of the quality of a neural network model.
In Figure ??, the first row shows the original images of the conveyor belt damage, the second row shows the detection results obtained with the method in this paper, and the third row shows the detection results given in Refs. [? ? ? ? ? ], respectively. Among them, (c) shows the tear detection method based on image processing; (f) shows the detection method based on infrared imaging; (i) shows the method based on infrared spectral analysis, which exploits the local temperature increase caused by sliding friction during the tearing of the conveyor belt; and (l) shows the method assisted by a line laser, which uses a line laser generator to transform the tear detection problem into the detection of corner points in a continuous smooth curve. Panels (m,n) show the damage forms of the conveyor belt proposed in Ref. [? ]: (m) shows a scratch and (n) shows a tear. The method in this paper recognizes both as tears, because the annotations in our dataset differ from those in that reference. As shown in Figure ??, the algorithm in this paper achieves good detection results on these fresh samples, which demonstrates the strong generalization capability of the model.

Conclusions and Future Work
To address the problem of conveyor belt damage detection, this paper proposed a detection method based on a lightweight neural network. The method aims to increase the detection speed to meet the development needs of high-speed belt conveyors and to keep pace with high frame rate cameras, bringing the signal processing closer to real time.
In this paper, the Mobilenet network and the Yolov4 target detection network are combined to simplify the Yolov4 network. A series of lightweight models is obtained by adjusting the number of channels, and these models achieve a good detection effect on the conveyor belt damage dataset, with a highest detection accuracy of 93.22% and a fastest detection speed of 70.26 FPS. Compared with Yolov4, the accuracy is increased by 3.5% and the speed by 188%.
The contributions of this paper can be summarized as follows: (1) A lightweight Yolov4 network is realized through the effective combination of MobileNet and the Yolov4 network. In future research, in addition to further expanding the dataset and improving the detection accuracy, we will also pay careful attention to the potential effects of image collection conditions, such as dust and lighting.

Data Availability Statement:
The data in this study are available on request from the corresponding author.