1. Introduction
The fourth industrial revolution will lead to disruptive changes in industry [1]. The integration and development of information technology and control theory have made robots an essential part of modern industrial production [2]. With the rapid development of machine vision, robotic arms can automatically recognize different kinds of objects [3,4]. Tracking and grasping them has greatly improved production efficiency and driven rapid technological development. In addition, many other industries are gradually replacing manpower with robots; in the catering industry, for example, the introduction of robots has made food service more hygienic. In this paper, research on the automatic recognition and grabbing of target objects is carried out.
There are generally two ways to achieve robotic arm grasping. One is grasping by teaching [5]: on the basis of a preset path, the robotic arm repeats a fixed movement to realize automated production. However, this method is easily disturbed by the environment, and a change in any link of the process can bring all production to a standstill. The other is to combine machine learning or deep learning to achieve a more intelligent and reliable grasp. Accordingly, a combination of identification and tracking [6,7] is required. People can recognize the type and location of objects at first glance and easily understand how they interact [8], but it is hard for machine vision to achieve such an effect. Therefore, precise recognition of the target object is fundamental to a successful grasp. There are two approaches to identification: traditional detection algorithms and detection algorithms combined with deep learning. Yonghao Zhao et al. [9] adopted Canny edge detection with binocular stereo vision to obtain the grab point of the robotic arm, and a positional PID (proportional-integral-derivative) algorithm was used to track the object dynamically. Honglei Wei et al. [10] used the Canny operator on a video image sequence to detect the edge of the target and the mean-shift algorithm to realize target localization and tracking. Detection methods based on the Canny operator are easy to use and can mark the edges of the target object as completely as possible, but existing image noise may also be marked as edges. Moreover, the light intensity of the environment [11] and changes in target features will interfere with the position of the target object obtained from the image. By comparison, object detection algorithms based on deep learning show better robustness. "You only look once" (YOLO) recognizes what and where the target object is in a single pass: it casts object detection as regression over spatially separated bounding boxes and associated class probabilities, so that in one evaluation a single neural network directly predicts bounding boxes and class probabilities from the complete image. Furthermore, the detection speed of YOLO is very fast [12], which meets the visual requirements of robotic arm recognition and grasping. Compared with detection algorithms commonly used in industry, such as SSD [13], Mask R-CNN [14], and Cascade R-CNN [15], studies demonstrate that YOLO achieves high detection accuracy while maintaining a high FPS (frames per second). Shehan P. Rajendran et al. [16] showed that YOLOv3 achieves higher real-time performance and accuracy than Faster R-CNN when recognizing road signs. Jinmin Peng et al. [17] introduced dilated (atrous) convolution into YOLOv3 so that the average detection accuracy of workpieces reached 92.98%. The latest research on YOLOv4 shows even better detection results [18]. As for YOLOv5, its model framework is basically the same as that of YOLOv4, but YOLOv4 is more customizable; with more custom configuration options available, YOLOv4 remains the best object detection framework for our purposes. Consequently, this paper discusses how to apply deep learning to identify the target and illustrates the significance and innovation of using YOLOv4 for this task.
In addition to object recognition, tracking is also indispensable for the robotic arm to make a precise grasp [6]. In order to provide the robotic arm with the position at the next moment, the current measurement, the observation model, and the system model of the dynamic system must be known [19]. Tracking based on a target motion model can be roughly divided into three categories: tracking based on PID, on kernel methods, and on filtering theory. A PID-based tracking algorithm is presented by Yonghao Zhao et al. [9]; its advantages are simplicity and speed, but the PID gains must be constantly adjusted in an environment full of random disturbances, otherwise performance degrades. The main kernel-based tracking method is mean-shift. As shown by Honglei Wei et al. [10], it requires little computation and has high real-time performance, but it often fails when tracking small or fast-moving targets, which cannot be ignored in a stable grasping system. Tracking methods based on filtering theory include the Kalman filter (KF) and the particle filter (PF). Hitesh A. Patel et al. [20] used the KF to track a single moving object, and Shiuh-Ku Weng et al. [21] proposed an adaptive KF with better tracking performance. However, these methods are mostly suited to linear systems; if the environment changes unpredictably, that is, if the system exhibits nonlinear motion, their performance is unsatisfactory. In contrast, in nonlinear and non-Gaussian systems, the PF performs better in tracking and prediction. M. Sanjeev Arulampalam et al. [22] analyzed and experimentally compared the KF, the extended KF, and the PF, confirming the performance of the PF in nonlinear and non-Gaussian environments. Junlan Yang et al. [23] used the PF to estimate and track target objects in video sequences and concluded that increasing the number of particles reduces the estimation variance of the time series and makes the estimates smoother and more accurate. Therefore, in a nonlinear environment, the PF algorithm is more robust than the KF algorithm [24]. The second aspect of significance and innovation, using the PF method to track and predict the target while reducing environmental interference, is illustrated as well.
On this basis, given the complicated and changeable catering environment, we propose to combine YOLOv4 with the PF algorithm to dynamically grab moving targets in a nonlinear and non-Gaussian environment.
The main contributions of this paper can be summarized as follows:
- (1)
In this paper, the widely used contemporary YOLO algorithm is analyzed theoretically and compared with other algorithms. In addition, detection at three distances and the FPS are tested with the latest YOLOv4 algorithm as well as with commonly used detection algorithms, verifying the efficient real-time detection of YOLOv4. The average recognition accuracy reaches 98.5% and the frame rate is maintained at around 22 FPS.
- (2)
In this paper, different filters with filtering and prediction functions are tested in the laboratory under simulated conditions. In the presence of environmental interference, the PF algorithm exhibits excellent filtering and prediction performance: the MSE of its filtering error is 7.1537 × 10⁻⁶ and the MSE of its prediction error is 1.356 × 10⁻⁵, indicating that it can effectively reduce the impact of the environment and predict the position of the target object at the next moment more accurately.
- (3)
In this paper, we use a grabbing system that combines the YOLOv4 and PF algorithms to conduct a large number of comparative experiments. YOLOv4 maintains an accuracy of nearly 99.50% when recognizing the object from a proper distance, and its recognition speed meets real-time requirements. The ability of the PF to adjust to sudden disturbances is also significantly higher than that of other filtering algorithms such as the KF. Therefore, the grabbing system achieves a grasping success rate of nearly 88% even at higher movement speeds.
Our work is structured as follows. In Section 2, the robotic arm grabbing system is preliminarily presented. In Section 3, the YOLOv4 algorithm is used to identify the target object and obtain its position in the image. In Section 4, the PF is used to obtain the transition matrix and observation matrix of the target object's motion and to predict its motion state at the next moment. In Section 5, a large number of comparative simulations and tests in a real environment confirm the excellent tracking and prediction performance of the PF in a nonlinear and non-Gaussian environment. In Section 6, the research of the whole paper is summarized.
2. Framework of Moving Target Tracking and Grabbing Strategy
Moving target tracking, prediction, and grabbing in real-life settings have the following characteristics. Firstly, the image of the target object collected by the RGB-D sensor may be deformed, which can cause the center point of the target to drift. Secondly, the conveyor belt may be disturbed by internal or external influences, causing objects on the belt to move nonlinearly and grasps to fail. Thirdly, the image sensor sampling rate and the image processing rate are much lower than the system control cycle, which leads to a delay in tracking.
Considering the above characteristics, we propose a robotic grabbing system that can be used in a nonlinear, non-Gaussian environment by combining the YOLOv4 object detection algorithm with the PF tracking algorithm. The framework of the robotic arm grabbing system is shown in Figure 1. Before detecting the target object, images of it are collected in different environments and used to train YOLOv4. Then, by calculating the relative pose of the RGB-D sensor and the robotic arm from different positions of the marker board, we obtain the eye-to-hand calibration parameters. When the target object is detected, its real coordinates can be determined. By passing these coordinates through the PF, the most likely position of the target object at the next moment is obtained. Finally, following the planned trajectory, the robotic arm reaches the grasping position and grasps the target.
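To make the data flow concrete, the following minimal Python sketch mirrors this loop; every name in it (the detector, camera, arm, particle filter, and calibration mapping) is a hypothetical placeholder for the corresponding component in Figure 1, not an actual API.

```python
def grab_loop(camera, arm, pf, detector, to_base_frame):
    """Detect the target, filter and predict its position, then command a grasp."""
    while True:
        rgb, depth = camera.read()                        # one RGB-D frame
        det = detector(rgb)                               # YOLOv4: pixel center of the target, or None
        if det is None:
            continue                                      # keep sampling until the target is detected
        u, v = det
        obs = to_base_frame(u, v, depth[int(v), int(u)])  # eye-to-hand calibrated 3-D position
        pf.update(obs)                                    # correct the particle set with the observation
        goal = pf.predict()                               # most likely position at the next moment
        arm.move_to(goal)                                 # follow the planned trajectory to the grasp pose
```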
3. Target Recognition Based on YOLOv4 Algorithm
Object detection is one of the important components of deep learning in the field of computer vision. Its task is to find all objects of interest in an image, comprising two subtasks, object localization and object classification, and to determine the category and location of each object at the same time. Object detection algorithms can be divided into two categories. One is the region-proposal-based R-CNN series (R-CNN, Fast R-CNN, Faster R-CNN, etc.); these are two-stage algorithms that first generate object candidate boxes, which are then classified and regressed. The other category comprises the YOLO and SSD series, one-stage algorithms that use a convolutional neural network to predict the target category and position simultaneously. Two-stage algorithms have higher recognition accuracy but poor real-time performance, while one-stage algorithms are faster but slightly less accurate [25]. However, starting from YOLOv3, the YOLO series achieves a better trade-off between recognition accuracy and speed, and YOLOv4 uses the CIoU loss as the bounding box loss, which converges faster and performs better [18]. Given the requirements of real-time performance and accuracy, we propose to use the one-stage algorithm YOLOv4 to identify the bowl on the conveyor belt.
3.1. YOLOv4 Algorithm Structure
YOLOv3 is an end-to-end object detection algorithm whose model structure mainly consists of the Darknet-53 backbone network and a multi-scale feature fusion network (FPN) [26,27]. The backbone network Darknet-53 is primarily used to extract image features. It is mainly composed of five residual blocks, each containing a set of repeated residual units. Each convolutional (Conv2d) layer is followed by a Batch Normalization (BN) layer and a Leaky ReLU activation function. Based on the original YOLO object detection architecture, YOLOv4 retains the head of YOLOv3 and uses a more powerful backbone network, CSPDarknet53. Additionally, it uses the idea of spatial pyramid pooling (SPP) to expand the receptive field and chooses PANet as the Neck for feature fusion, as shown in Figure 2. Meanwhile, it is further improved and optimized through the use of the Mish activation function, Mosaic data augmentation, and DropBlock regularization.
By integrating CSP into each large residual block of Darknet-53, CSPDarknet53 enhances the learning ability of the CNN and maintains accuracy while reducing the model weight, computational bottlenecks, and memory cost. The input of each large residual block (Resblock) is divided into two parts: one is fed into the stacked residual units, and the other is directly convolved. The results of the two parts are then concatenated and finally output through a convolution.
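A minimal PyTorch sketch of this split-and-merge structure is shown below; the layer widths and the Conv-BN-Mish composition are illustrative assumptions rather than the exact CSPDarknet53 configuration.

```python
import torch
import torch.nn as nn

def conv_bn_mish(c_in, c_out, k=1):
    """Convolution followed by Batch Normalization and the Mish activation."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, 1, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.Mish(),
    )

class ResUnit(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(conv_bn_mish(c, c, 1), conv_bn_mish(c, c, 3))

    def forward(self, x):
        return x + self.body(x)  # residual connection

class CSPBlock(nn.Module):
    def __init__(self, c, n_units):
        super().__init__()
        self.split_main = conv_bn_mish(c, c // 2)      # path through the stacked residual units
        self.split_shortcut = conv_bn_mish(c, c // 2)  # directly convolved path
        self.units = nn.Sequential(*[ResUnit(c // 2) for _ in range(n_units)])
        self.fuse = conv_bn_mish(c, c)                 # convolution after concatenation

    def forward(self, x):
        a = self.units(self.split_main(x))
        b = self.split_shortcut(x)
        return self.fuse(torch.cat([a, b], dim=1))     # concatenate the two parts, then fuse
```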
To expand the receptive field of the network, an SPP module is added after the backbone. First, the output of the feature extraction network undergoes three convolution operations and is then max-pooled with kernels of 1 × 1, 5 × 5, 9 × 9, and 13 × 13. The four pooled outputs are concatenated into one feature map, and finally a convolution is applied for dimensionality reduction.
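A minimal PyTorch sketch of this module follows; channel counts are illustrative, and the identity branch stands in for the 1 × 1 pooling.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    def __init__(self, c, kernels=(5, 9, 13)):
        super().__init__()
        # stride-1 max pooling with padding k//2 preserves the feature map size
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernels
        )
        # 1x1 convolution reduces the concatenated channels back to c
        self.reduce = nn.Conv2d((len(kernels) + 1) * c, c, kernel_size=1)

    def forward(self, x):
        # the identity branch plays the role of the 1x1 pooling
        feats = [x] + [pool(x) for pool in self.pools]
        return self.reduce(torch.cat(feats, dim=1))
```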
The Neck of an object detector is mainly used to fuse the feature information of feature maps of different sizes. YOLOv4 uses the feature fusion method of PANet: on top of the top-down feature fusion of FPN, bottom-up feature fusion is added, which shortens the information propagation path and exploits the accurate localization information of low-level features. Additionally, with adaptive feature pooling, each proposal can use the features of all layers of the pyramid. After feature fusion, the network outputs feature maps at three sizes, 19 × 19 (Yolo_out1), 38 × 38 (Yolo_out2), and 76 × 76 (Yolo_out3), corresponding to the prediction of large, medium, and small objects, respectively.
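The two fusion passes can be sketched compactly in PyTorch as follows; channel counts are assumed equal across levels, and the extra convolutions that YOLOv4 interleaves between stages are omitted for clarity.

```python
import torch.nn as nn
import torch.nn.functional as F

class PANetLite(nn.Module):
    """Top-down (FPN) pass followed by a bottom-up (PAN) pass."""
    def __init__(self, c):
        super().__init__()
        self.down3 = nn.Conv2d(c, c, 3, stride=2, padding=1)  # 76x76 -> 38x38
        self.down4 = nn.Conv2d(c, c, 3, stride=2, padding=1)  # 38x38 -> 19x19

    def forward(self, c3, c4, c5):
        # top-down: upsample deeper maps and fuse them into shallower ones
        p4 = c4 + F.interpolate(c5, scale_factor=2, mode="nearest")
        p3 = c3 + F.interpolate(p4, scale_factor=2, mode="nearest")
        # bottom-up: push the fused low-level detail back up the pyramid
        n4 = p4 + self.down3(p3)
        n5 = c5 + self.down4(n4)
        return p3, n4, n5  # heads at 76x76, 38x38, and 19x19
```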
3.2. The Loss of YOLOv4
YOLOv3 computes the bounding box loss with MSE (mean square error) directly on the coordinates of the center points of the prediction box and the real box, together with their width and height. However, the MSE loss function treats these quantities as independent variables and fails to reflect the relationship between them. To improve on this, the IoU loss was proposed, which considers the areas of the predicted bounding box (BBOX) and the ground truth bounding box [28,29]. YOLOv4 uses the CIoU loss instead of the MSE loss; it accounts for the shape and orientation of the object by considering the overlap area, the distance between the center points, and the aspect ratio, whose terms are defined as follows.
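In its standard form, which matches the symbol definitions below, the CIoU loss can be written as:

$$
\mathcal{L}_{CIoU} = 1 - \mathrm{IoU}(A, B) + \frac{\rho^{2}(A_{ctr}, B_{ctr})}{c^{2}} + \alpha\lambda,
$$

$$
\lambda = \frac{4}{\pi^{2}} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^{2}, \qquad
\alpha = \frac{\lambda}{\big(1 - \mathrm{IoU}(A, B)\big) + \lambda}.
$$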
Here, A represents the prediction frame and B represents the real frame. ρ²(A_ctr, B_ctr)/c² represents the penalty for the center point distance, where A_ctr and B_ctr are the center point coordinates of the prediction frame and the real frame, ρ(·) represents the Euclidean distance, and c represents the diagonal length of the smallest box enclosing A and B, as shown in Figure 3. α·λ represents the penalty for the aspect ratio: α is a positive trade-off coefficient, and λ measures the consistency of the aspect ratio. The superscript gt means ground truth, so w^gt and h^gt are the width and height of the real frame, while w and h are the width and height of the prediction frame. If the width and height of the real frame and the prediction frame are similar, λ tends to 0 and the penalty term vanishes. Intuitively, the penalty term drives the width and height of the prediction frame toward those of the real frame as quickly as possible.
In the loss function, λ_ciou plays a balancing role: it increases the weight of the bounding box position loss and suppresses the confidence of bounding boxes for undetected objects. L_SCE denotes the sigmoid cross-entropy and L_BCE the binary cross-entropy; y denotes the real value and p the predicted value. If the center point of the real frame falls inside the prediction frame, the prediction frame is assumed to contain a target, so Pr(object) = 1; otherwise, Pr(object) = 0. IgnoreMask means that, when there is no target, the IoU between the prediction frame and each real frame is calculated and the largest one is selected as the IoU between the predicted and real values; an IoU threshold is set, and when this maximum IoU is below the threshold, the box is added to the loss calculation as in Equation (7). L_ciou denotes the location loss, L_prob the class loss, and L_conf the confidence loss; the total loss of YOLOv4 is then obtained through Equation (9).
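In schematic form, the total loss of Equation (9) combines the three terms, and the confidence term follows the target/no-target split described above (the summation over grid cells and anchors, left implicit here, is an assumption following the usual YOLO convention):

$$
Loss_{YOLOv4} = \lambda_{ciou}\, L_{ciou} + L_{conf} + L_{prob},
$$

$$
L_{conf} = \sum \Big[ \Pr(\mathrm{object})\, L_{BCE}(1,\, p) + \big(1 - \Pr(\mathrm{object})\big)\,\mathrm{IgnoreMask}\cdot L_{BCE}(0,\, p) \Big].
$$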
3.3. Target Position Coordinate Calculation
Once the target object is detected, we can extract its position in the image, as shown in Figure 4. The dotted box in Figure 4 is the anchor. In anchor-based object detection algorithms, nine anchors with different sizes and aspect ratios are generally designed manually. One disadvantage of manually designed anchors is that they are not guaranteed to fit the dataset well: if the size of an anchor differs significantly from that of the target, the detection performance of the model suffers. In YOLO, instead of a manual design, the k-means clustering algorithm is applied to the bounding boxes of the training set to generate anchors that fit the dataset and improve the detection performance of the network, as sketched below.
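A minimal NumPy sketch of this clustering, using the customary 1 − IoU distance between (width, height) pairs; initialization and convergence details are simplified assumptions.

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between (w, h) pairs, treating all boxes as sharing one corner."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0]) *
             np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    union = ((boxes[:, 0] * boxes[:, 1])[:, None] +
             anchors[None, :, 0] * anchors[None, :, 1] - inter)
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    """Cluster training-set box sizes (N x 2 array of w, h) into k anchors."""
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)  # closest = highest IoU
        anchors = np.array([boxes[assign == i].mean(axis=0)
                            if np.any(assign == i) else anchors[i]  # keep empty clusters
                            for i in range(k)])
    return anchors[np.argsort(anchors.prod(axis=1))]        # sorted by area
```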
cx and cy represent the coordinates of the upper left corner of the grid cell where the center point is located. pw and ph represent the width and height of the anchor, respectively. σ(tx) and σ(ty) represent the offsets of the center point of the prediction box from the upper left corner, and σ represents the sigmoid function, which limits the offset to the current grid cell and is conducive to model convergence.
The actual position and size of the target object in YOLOv3 are calculated by Equation (10):
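In its standard form, consistent with the definitions above, Equation (10) reads:

$$
b_{x\_YOLOv3} = \sigma(t_{x}) + c_{x}, \qquad b_{y\_YOLOv3} = \sigma(t_{y}) + c_{y},
$$

$$
b_{w\_YOLOv3} = p_{w}\, e^{t_{w}}, \qquad b_{h\_YOLOv3} = p_{h}\, e^{t_{h}}.
$$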
tw and th represent the predicted width and height offsets, which adjust the width and height of the anchor. bx_YOLOv3 and by_YOLOv3 give the position of the target object, while bw_YOLOv3 and bh_YOLOv3 give its size.
The center point coordinates of the prediction box in YOLOv3 are calculated as bx_YOLOv3 = σ(tx) + cx, where σ(·) maps to the open interval (0, 1). Since σ(tx) only approaches 0 or 1 asymptotically, it is difficult to obtain bx_YOLOv3 = cx or cx + 1; in practice, the center of the prediction box cannot fall on the grid boundary. The solution in YOLOv4 is to multiply σ(tx) by a coefficient β exceeding 1.0, as shown in Equation (11) and Figure 5 (cx = 0).
Thus, the calculation equation for the actual position and size of the target object in YOLOv4 is:
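A reconstruction of this decoding, following the common scale-factor formulation of the YOLOv4 grid-sensitivity fix (the symmetric offset −(β − 1)/2 is our assumption; it keeps σ(tx) = 0.5 mapped to the cell center while letting the output reach the grid boundaries):

$$
b_{x\_YOLOv4} = \beta\,\sigma(t_{x}) - \frac{\beta - 1}{2} + c_{x}, \qquad
b_{y\_YOLOv4} = \beta\,\sigma(t_{y}) - \frac{\beta - 1}{2} + c_{y},
$$

$$
b_{w\_YOLOv4} = p_{w}\, e^{t_{w}}, \qquad b_{h\_YOLOv4} = p_{h}\, e^{t_{h}}.
$$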
According to bx_YOLOv4, by_YOLOv4, bw_YOLOv4, and bh_YOLOv4, the coordinates of the target in the pixel coordinate system can be obtained and then converted, using the camera intrinsic parameters, into coordinates in the camera coordinate system.
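As an illustration, assuming a pinhole camera model with focal lengths (fx, fy), principal point (u0, v0), and the metric depth read from the RGB-D frame at the detected center (the variable names here are ours, not the paper's):

```python
import numpy as np

def pixel_to_camera(bx, by, z, fx, fy, u0, v0):
    """Back-project the detected center (bx, by) at depth z into camera coordinates."""
    x = (bx - u0) * z / fx   # inverts the pinhole projection u = fx * X / Z + u0
    y = (by - v0) * z / fy   # inverts v = fy * Y / Z + v0
    return np.array([x, y, z])
```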
6. Conclusions
The robotic arm grabbing system based on YOLOv4 and the PF in a nonlinear and non-Gaussian environment is studied in depth in this paper. Specifically, moving targets are identified with the YOLOv4 algorithm, and the target position is tracked and predicted with the PF algorithm. On this basis, detection experiments on the YOLOv4 algorithm and a large number of comparative simulations between filters have been carried out, which demonstrate the high accuracy and real-time performance of the YOLOv4 algorithm, as well as the ability of the PF to adjust rapidly under interference. The robotic arm then cooperates with the path planning function of MoveIt! to realize the rapid grasp of the target object. The results show that the implementation of this system is highly effective and, consequently, extremely conducive to improving intelligence in the robot industry.
In this paper, we design a robotic arm grabbing system that can be applied to grasp objects with nonlinear motion. The method not only ensures accurate recognition of target objects, but also provides the ability to predict their motion trends. With the future development of new disciplines such as artificial intelligence and image processing, and the growing need for tracking and grasping by flexible robots and dual-arm robots, a reliable and stable grabbing system will play a very important role. In the future, we will try to apply this system to flexible robots or dual-arm robots and, in accordance with their respective characteristics, further improve the robotic arm system so as to accomplish specific operational tasks.