1. Introduction
In today’s automated industrial manufacturing, the demand for sorting many different types of workpieces keeps growing, and sorting tasks are becoming increasingly complex. Traditional industrial robots are programmed for grasping by manually demonstrating and recording motions; this approach can only grasp objects of a single type at fixed positions, which is inefficient and cannot keep pace with modern manufacturing, where products are diverse and updated rapidly. To address this, integrating machine vision with traditional robotic arms and giving them perception capabilities through object detection algorithms has become a hot research topic, which makes the study of more efficient and higher-performing object detection algorithms highly significant [
1].
Building on the scale-invariant feature transform (SIFT) and speeded-up robust features (SURF), An Weisheng et al. [2] drew on the Gaussian image pyramid to propose a novel image registration algorithm; their experiments demonstrated significant improvements in both matching accuracy and speed. Wang J’s [
3] team developed a new object recognition and positioning method by combining SIFT and moment invariants, and their experimental results showed that this algorithm exhibited strong robustness.
However, the above classical object recognition methods rely on predefined image features, making them vulnerable to environmental disturbances such as lighting variations and occlusions, leading to unsatisfactory recognition results. With the rapid development of modern computing technologies and the advent of affordable sensors, deep learning-based object recognition algorithms have gradually come into focus. Compared to traditional feature extraction methods, deep learning can automatically learn and extract high-level feature representations from data. Its multi-level and hierarchical structures give it superior flexibility, while extensive data training offers better generalization capabilities [
4]. Du Xuedan et al. [
5] used Faster R-CNN as the object detection model to accomplish object recognition and localization tasks, though it suffers from slow detection speed. Additionally, as a two-stage method, the training process is relatively complex. Gang Liu [
6] proposed a lightweight object detection algorithm for robotic grasping based on an improved YOLOv5, introducing C3Ghost and GhostConv modules into the YOLOv5 backbone to reduce the computational load required for feature extraction and enhance detection speed. Object detection algorithms represented by the YOLO series [
7,
8] have gained widespread application in industry due to their advantages in detection accuracy and speed.
The YOLOv5 algorithm achieves an ideal balance between detection accuracy and speed. However, its accuracy in detecting small objects remains somewhat insufficient, and false detections often occur when the target occupies a small portion of the image or has indistinct features. To address this, this paper uses the smallest model in the YOLOv5 family, YOLOv5s, as the baseline network. While maintaining the speed advantage of the YOLOv5 model, we integrate the Convolutional Block Attention Module (CBAM) to enhance the network’s feature perception capabilities, which helps in detecting small objects and overcoming challenges in feature extraction.
Additionally, CBAM is combined with the EIoU (efficient intersection over union) loss function. Compared with other loss functions, EIoU handles cases with little or no overlap between bounding boxes more effectively, reduces the offsets between them, yields more accurate bounding box regression, and improves the precision of model training.
Finally, we created a dataset for training and validation and built a robotic system to conduct grasping experiments based on the improved algorithm. A host computer control system was designed, resulting in a complete six-axis robot hand–eye coordination grasping system.
2. YOLOv5 Object Detection Algorithm
The YOLO algorithm divides an image into fixed-size grids, treating each grid cell as a basic unit for detection [
9]. Each grid cell is responsible for predicting bounding boxes and categories of potential objects within that cell, along with confidence scores for each predicted bounding box. Once the bounding boxes are predicted, the algorithm calculates the discrepancies between these predicted boxes and the ground truth object positions. Subsequently, through non-maximum suppression, it computes the intersection over union to eliminate highly overlapping bounding boxes, retaining those most likely to contain objects. Finally, based on a set confidence threshold, the algorithm filters out detections with high confidence scores, effectively improving detection accuracy.
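As a concrete illustration of the confidence filtering and non-maximum suppression described above, the following is a minimal NumPy sketch, assuming axis-aligned boxes in (x1, y1, x2, y2) format and illustrative threshold values; it is a sketch of the general procedure, not the YOLOv5 implementation itself.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all given as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def filter_detections(boxes, scores, conf_thr=0.25, iou_thr=0.45):
    """Drop low-confidence boxes, then suppress boxes that overlap a kept box too much."""
    keep = scores >= conf_thr
    boxes, scores = boxes[keep], scores[keep]
    order = scores.argsort()[::-1]          # process highest-confidence boxes first
    kept = []
    while order.size > 0:
        i = order[0]
        kept.append(i)
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= iou_thr]
    return boxes[kept], scores[kept]
```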
YOLOv5 is one of the top-performing object detection algorithms currently available and is widely used in industrial applications [
10]. The model is primarily composed of four parts: the input module, backbone network, feature fusion module, and prediction layer. The overall architecture of the YOLOv5 network is illustrated in
Figure 1.
In practical manufacturing scenarios, the targets to be grasped include both large- and medium-sized items such as valves, three-way valves, and stainless steel elbows, as well as smaller objects like screws, small hex wrenches, and other small components. While YOLOv5 detects large and medium-sized prominent objects effectively, its performance may not be ideal for small targets with few samples and complex backgrounds. To enhance feature representation and improve detection of small target objects, improvements are needed in both feature information extraction and loss function design, as shown in
Figure 2.
3. Attention Mechanism and Loss Function-Based YOLOv5 Algorithm Improvement
3.1. Introducing CBAM Attention Mechanism
By incorporating the CBAM (Convolutional Block Attention Module) attention mechanism into the YOLOv5 backbone structure [
11], the network is able to focus more on useful information while ignoring irrelevant details, thereby extracting more fine-grained features. This improvement in feature extraction boosts the network’s efficiency. CBAM is a lightweight convolutional attention network that combines both channel attention and spatial attention mechanisms [
12]. Given an intermediate feature map $F \in \mathbb{R}^{C \times H \times W}$, CBAM applies the channel attention mechanism by performing both global average pooling (GAP) and global max pooling (GMP) on the input features, resulting in two $1 \times 1 \times C$ tensors, where C is the number of channels. These tensors are then passed through a shared fully connected network and a sigmoid activation function to generate new vectors, which are multiplied with the original feature map to obtain the channel attention features, as shown in Equation (1):

$$M_c(F) = \sigma\big(W_1\,\delta(W_0 F^{c}_{avg}) + W_1\,\delta(W_0 F^{c}_{max})\big), \qquad F' = M_c(F) \otimes F \tag{1}$$

In this context, $F^{c}_{avg}$ is the descriptor vector of channel C after global average pooling, while $F^{c}_{max}$ is the descriptor vector of channel C after global max pooling. $W_0$ is the weight matrix of the first fully connected layer, and $W_1$ is the weight matrix of the second fully connected layer. $\delta$ denotes the ReLU activation function, and $\sigma$ represents the sigmoid activation function, which maps the channel attention values to the [0, 1] range. This process reduces the spatial dimensions of the input feature map while preserving its most significant features.
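For readability, the following is a minimal PyTorch sketch of the channel attention branch in Equation (1), assuming the standard CBAM formulation with a shared two-layer MLP; the reduction ratio of 16 is an illustrative assumption rather than a setting stated in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Channel attention: GAP and GMP descriptors pass through a shared two-layer MLP,
    are summed, and gated by a sigmoid, as in Equation (1)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W0
            nn.ReLU(inplace=True),                       # delta (ReLU)
            nn.Linear(channels // reduction, channels),  # W1
        )

    def forward(self, x):                                         # x: (B, C, H, W)
        b, c, _, _ = x.shape
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1).view(b, c))    # GAP descriptor -> shared MLP
        mx = self.mlp(F.adaptive_max_pool2d(x, 1).view(b, c))     # GMP descriptor -> shared MLP
        w = torch.sigmoid(avg + mx).view(b, c, 1, 1)              # channel weights in [0, 1]
        return x * w                                              # re-weight the input feature map
```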
The spatial attention mechanism helps the model focus on the important spatial regions of the feature layer, i.e., it determines which spatial locations should be prioritized, and it compresses the channel dimension. Similar to the channel attention mechanism, it applies global average pooling (GAP) and global max pooling (GMP) on the input feature map along the channel dimension, producing two feature descriptors. These two descriptors are concatenated along the channel dimension to form a 2-channel feature map F. Then, a convolution with a $7 \times 7$ kernel is applied to enlarge the receptive field, followed by the sigmoid activation function. Finally, the result is element-wise multiplied with the original feature map, yielding the spatial attention features, as shown in Equation (2):

$$M_s(F) = \sigma\big(f^{7 \times 7}([F^{s}_{avg}; F^{s}_{max}])\big), \qquad F'' = M_s(F') \otimes F' \tag{2}$$
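Likewise, a minimal PyTorch sketch of the spatial attention branch in Equation (2), assuming the 7×7 convolution of the original CBAM design; in the full CBAM block the channel branch is applied first and the spatial branch second.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention: channel-wise average and max maps are concatenated,
    convolved with a 7x7 kernel, and gated by a sigmoid, as in Equation (2)."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):                                   # x: (B, C, H, W)
        avg = x.mean(dim=1, keepdim=True)                   # (B, 1, H, W) channel-wise average
        mx, _ = x.max(dim=1, keepdim=True)                  # (B, 1, H, W) channel-wise max
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w                                        # re-weight each spatial location
```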
Through dynamic adjustments with channel attention and spatial attention mechanisms, CBAM adapts the response of the feature map, allowing the integrated YOLOv5 model to focus more on important regions and features within the image. This helps the network reduce its focus on background information and concentrate more on the features of small targets, thereby improving the accuracy of small target detection. Moreover, CBAM is designed to be relatively lightweight, minimizing its impact on the performance of YOLOv5. As illustrated in
Figure 3, a CBAM module is embedded after each C3_F module in the neck layer, resulting in the creation of a new model named YOLOv5s-I.
3.2. Loss Function Improvement
The loss function in YOLOv5 serves as a method to quantify the accuracy of the model’s predictions, where its value represents the disparity between the predicted values and the ground truth. By evaluating the degree of overlap between the predicted bounding boxes and the ground truth boxes, minimizing the loss function allows the model to converge.
Traditional object detection loss functions rely on aggregating metrics related to bounding box regression, such as the distance between predicted and ground truth boxes, overlap area, and aspect ratios. The most commonly used metric in bounding box regression loss calculation is intersection over union (IoU). The principle of IoU is depicted in
Figure 4, where IoU measures the overlap ratio between the candidate bound and the ground truth bound. Specifically, IoU is calculated as the area of intersection between two rectangles divided by the area of their union.
The CIoU (complete intersection over union) loss function used by YOLOv5 helps the model predict the position and size of the bounding box more accurately, thereby improving the precision of object detection. The CIoU loss is calculated as follows:

$$L_{CIoU} = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v, \qquad v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}, \qquad \alpha = \frac{v}{(1 - IoU) + v} \tag{3}$$

In the formula, IoU represents the basic intersection over union metric. $\rho(b, b^{gt})$ denotes the Euclidean distance between the center points of the predicted bounding box and the ground truth box, while c is the diagonal length of the smallest enclosing box that contains both boxes. $\alpha v$ is the penalty term for the aspect ratio difference, where $\alpha$ is a balancing coefficient and v measures the difference in aspect ratio between the predicted box and the ground truth box.
CIoU takes into account the overlap area, the distance between the center points, and the aspect ratio, providing an accurate measure of the relative position between the predicted and ground truth boxes, and it addresses optimization in both the horizontal and vertical directions. However, the aspect ratio term in CIoU reflects only the relative difference in aspect ratio rather than the separate differences in width and height, which can slow convergence [
13]. To address this issue, EIoU, an improved version of CIoU that further enhances performance, is chosen as the loss function in this paper in place of CIoU. The EIoU loss is given in Equation (4):

$$L_{EIoU} = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \frac{\rho^{2}(w, w^{gt})}{C_{w}^{2}} + \frac{\rho^{2}(h, h^{gt})}{C_{h}^{2}} \tag{4}$$

where $C_{w}$ and $C_{h}$ are the width and height of the smallest enclosing box covering the two boxes.
This loss function consists of three components: overlap loss, center distance loss, and width–height loss. The first two follow the approach used in CIoU, while the third term in EIoU handles the width and height differences independently. This allows the network to simultaneously optimize the center point and the size of the predicted box. By adjusting the width and height separately, the model can more smoothly adapt the shape of the predicted box to the variations among small metal objects, resulting in more accurate bounding box predictions.
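The sketch below illustrates how the EIoU loss of Equation (4) decomposes into the overlap, center distance, and width–height terms described above; it follows the published EIoU definition for boxes in (x1, y1, x2, y2) format and is not necessarily the exact implementation used in this work.

```python
import torch

def eiou_loss(pred, target, eps=1e-7):
    """EIoU loss for boxes given as (x1, y1, x2, y2) tensors of shape (N, 4)."""
    # Overlap term (IoU)
    ix1 = torch.max(pred[:, 0], target[:, 0])
    iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2])
    iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Smallest enclosing box
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])

    # Center distance term, normalized by the enclosing-box diagonal
    pcx, pcy = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    tcx, tcy = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    dist = ((pcx - tcx) ** 2 + (pcy - tcy) ** 2) / (cw ** 2 + ch ** 2 + eps)

    # Width and height handled separately (the EIoU-specific part)
    pw, ph = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    tw, th = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    loss_w = (pw - tw) ** 2 / (cw ** 2 + eps)
    loss_h = (ph - th) ** 2 / (ch ** 2 + eps)

    return 1 - iou + dist + loss_w + loss_h
```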
4. Hardware Experiment Setup
4.1. Dataset Creation
The experimental setup described in
Table 1 is employed in this study. The YOLOv5s neural network model is trained both before and after the improvement. The training parameters utilize the Adam optimization algorithm, with a batch size of 8, a momentum factor of 0.92, and a weight decay coefficient of 0.0005.
The quality of the sample dataset directly influences the detection performance of deep learning models. The research method proposed in this paper is primarily applied to the recognition and detection of industrial parts. However, there is currently no publicly available dataset for industrial parts. Considering the complexity of real-world scenarios, to further evaluate the model’s capabilities, this study selects six types of parts with varying sizes, complex features, diverse shapes and materials, and significant color differences, as shown in
Figure 5.
In the experiment, a custom dataset was created using samples of six part classes: screwdrivers, valves, tee valves, stainless steel elbows, six-port valves, and flange couplings. Images were captured with an Intel RealSense D435 camera from various orientations. To enhance the model’s generalization ability and workpiece detection accuracy, diverse data were generated by altering the number of instances of each class, varying the poses and positions of the objects, changing backgrounds, and introducing unrelated objects to clutter the scenes. Additionally, image processing functions [
14] from OpenCV were utilized for data augmentation to achieve optimal training results.
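As an illustration of the OpenCV-based augmentation, the sketch below applies a few typical perturbations (horizontal flip, brightness/contrast jitter, small rotation); the specific transforms and parameter ranges are assumptions rather than the exact pipeline used in this work, and any geometric transform must of course be applied consistently to the bounding-box annotations.

```python
import random
import cv2

def augment(image):
    """Apply a few simple geometric and photometric perturbations to one image."""
    # Random horizontal flip
    if random.random() < 0.5:
        image = cv2.flip(image, 1)
    # Random brightness/contrast jitter
    alpha = random.uniform(0.8, 1.2)   # contrast gain
    beta = random.uniform(-20, 20)     # brightness offset
    image = cv2.convertScaleAbs(image, alpha=alpha, beta=beta)
    # Small random rotation about the image centre
    h, w = image.shape[:2]
    angle = random.uniform(-10, 10)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    image = cv2.warpAffine(image, M, (w, h), borderMode=cv2.BORDER_REPLICATE)
    return image
```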
The annotation of the part images was performed manually using the LabelImg tool, with bounding boxes drawn, named, and categorized accordingly. The dataset was then randomly divided into training, testing, and validation sets in a 4:1:1 ratio, resulting in 2712 images for training, 678 for testing, and 678 for validation, for a total of 4068 images.
Figure 6 and
Figure 7 depict the distribution of dataset sizes and center positions, respectively.
4.2. Evaluation Index
To evaluate the effectiveness of the model improvements, four key metrics were used to assess the algorithm’s performance [
15]: precision (P), mAP@0.5, mAP@0.5:0.95, and detection speed (FPS). Precision (P) measures the algorithm’s accuracy in predicting true positives; higher precision indicates stronger prediction capability for positive samples. mAP (mean average precision) is used to evaluate the overall performance of an object detection algorithm. mAP@0.5 is the mean average precision at an IoU threshold of 0.5, and mAP@0.5:0.95 is the mean average precision averaged over IoU thresholds from 0.5 to 0.95; both are crucial indicators of an algorithm’s accuracy in object detection tasks. A high mAP suggests that the algorithm achieves high detection precision and recall across different categories. FPS (frames per second) is the number of images the object detection network can process per second and measures the model’s detection speed.
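A minimal sketch of how the precision and FPS figures can be computed under the usual definitions; the matching of predictions to ground truth at a given IoU threshold is assumed to have been done beforehand, and the names are illustrative.

```python
import time

def precision(tp, fp):
    """Precision = TP / (TP + FP): the fraction of predicted boxes that are correct."""
    return tp / (tp + fp + 1e-9)

def measure_fps(model, images):
    """Average number of frames processed per second over a list of pre-loaded images."""
    start = time.time()
    for img in images:
        model(img)                    # one forward pass per image
    return len(images) / (time.time() - start)
```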
4.3. Hand–Eye Calibration
To ensure the robot accurately grasps the workpiece, pixel coordinates must be converted, which requires hand–eye calibration of the robot [
16]. In the integration of robot systems, such as with cameras and robots, two different camera extrinsic calibration methods can be used: eye-in-hand and eye-to-hand. The eye-in-hand method involves attaching the camera to the robot’s end effector, which reduces the absolute measurement error from the camera and improves accuracy during grasping. However, this approach requires more complex calculations, involves complicated calibration processes, and faces difficulties when the camera cannot easily approach the target or when interference occurs. The eye-to-hand method, on the other hand, separates the camera and robot, allowing only the robot to move. Calibration is performed by capturing images of the calibration board from different positions and orientations, while the camera remains fixed. This method requires only a single external calibration, and the results can be used for an extended period without frequent adjustments. It simplifies the calibration and computation process, improving system stability and efficiency. Tasks in industrial production, such as automated sorting and quality inspection, often rely on a static, global perspective that requires a fixed camera field of view. Therefore, in this paper, the eye-to-hand method was chosen to construct the hand–eye system. The calibration diagrams are shown in
Figure 8 and
Figure 9.
During calibration, the calibration marker is first fixed at the end effector of the robot. The positions of the calibration board at different poses are recorded in the camera’s field of view, and the corresponding positions of the robot end effector are recorded through the robot’s upper computer. Because the relative pose between the robot end effector and the calibration board is fixed, the pixel coordinates of these known feature points and the coordinates of the robot end effector are paired, and the relative transformation matrix between the robot end effector and the camera is calculated using the corresponding mathematical model. ArUco marker [
17] is used as the calibration marker, which is a type of 2D barcode based on OpenCV. It is attached to the robot end effector, and the robot is controlled to move so that the camera sees the calibration board fixed at the robot end effector.
Multiple sets of robot end effector poses are recorded by the upper computer during the shooting process. During shooting, it is necessary to ensure that the ArUco marker is fully captured by the camera and that the distance and pose between the camera and the marker vary between shots. The pose of the calibration board in the camera frame is published on the
/aruco_signal/pose topic [
18] and read out via rostopic on the robot’s upper computer.
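A minimal rospy sketch of recording the board poses from the topic above; the topic name is taken from the text, and the message type (geometry_msgs/PoseStamped, as published by common ArUco ROS nodes) is an assumption.

```python
import rospy
from geometry_msgs.msg import PoseStamped

recorded_poses = []  # board poses in the camera frame, one entry per calibration shot

def pose_callback(msg):
    """Store the received board pose (position + orientation quaternion)."""
    p, q = msg.pose.position, msg.pose.orientation
    recorded_poses.append((p.x, p.y, p.z, q.x, q.y, q.z, q.w))

rospy.init_node("record_board_pose")
rospy.Subscriber("/aruco_signal/pose", PoseStamped, pose_callback)
rospy.spin()
```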
A total of 12 sets of poses of the calibration board in the camera and the poses of the robot arm in the upper computer were recorded. The commonly used calibration algorithms Tsai–Lenz [
19], Park [
20], Horaud, and Daniilidis [
21] were applied to calculate the calibration results. The results obtained by the four methods were basically consistent, and eventually, the Tsai–Lenz algorithm was adopted. This algorithm converts the hand–eye calibration problem into a system of linear equations and uses the least squares method to solve the system of linear equations to obtain the relative position and orientation between the robot end effector and the camera. It has a fast calculation speed and simple steps. Based on the 12 sets of data, the relative transformation matrix between the robot end effector and the camera was calculated, thereby establishing the coordinate transformation relationship between the robot and the camera [
22].
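The Tsai–Lenz solution can be obtained with OpenCV’s calibrateHandEye routine. The sketch below assumes an eye-to-hand setup, so the recorded gripper-to-base transforms are inverted before being passed in and the result is interpreted as the camera pose in the robot base frame; variable names are illustrative, not the authors’ code.

```python
import cv2

def invert(R, t):
    """Invert a rigid transform given as rotation matrix R (3x3) and translation t (3x1)."""
    R_inv = R.T
    return R_inv, -R_inv @ t

def eye_to_hand_calibration(R_gripper2base, t_gripper2base, R_target2cam, t_target2cam):
    """Solve the eye-to-hand calibration with the Tsai-Lenz method.

    R_gripper2base[i], t_gripper2base[i]: robot end-effector pose i from the upper computer.
    R_target2cam[i],  t_target2cam[i]:   ArUco board pose i reported in the camera frame.
    Returns the camera pose expressed in the robot base frame (cam -> base).
    """
    R_base2gripper, t_base2gripper = [], []
    for R, t in zip(R_gripper2base, t_gripper2base):
        Ri, ti = invert(R, t)          # eye-to-hand: feed base->gripper transforms
        R_base2gripper.append(Ri)
        t_base2gripper.append(ti)
    R_cam2base, t_cam2base = cv2.calibrateHandEye(
        R_base2gripper, t_base2gripper,
        R_target2cam, t_target2cam,
        method=cv2.CALIB_HAND_EYE_TSAI)
    return R_cam2base, t_cam2base
```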
When the robot’s workspace remains at the same horizontal height, the plane-based nine-point calibration method is used for calibration [
23], which is characterized by fast speed and high accuracy. To establish the corresponding relationship, the robot gripper end effector is sequentially directed to each sampling point, and the coordinates of the nine centers in the robot coordinate system are obtained from the robot upper computer software. It is important to ensure that the calibration board remains horizontal during shooting, and the nine points on the calibration board need to be clearly photographed and identified.
By using OpenCV to obtain the pixel coordinates of the nine centers of the circles and obtaining the coordinates of the nine robot end effector points through the robot upper computer, the transformation relationship between the robot base coordinate system and the pixel coordinate system can be obtained using singular value decomposition (SVD) [
24].
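A sketch of the SVD-based least-squares fit (Umeyama-style) that maps the nine pixel circle centres to the corresponding robot coordinates through a scale, rotation, and translation; this is one common way to realize the SVD solution described above, under the assumption that the workspace plane lies at a fixed height so a 2D mapping suffices.

```python
import numpy as np

def fit_similarity_2d(pixel_pts, robot_pts):
    """Estimate s, R, t such that robot ≈ s * R @ pixel + t (SVD / Umeyama least squares)."""
    pixel_pts = np.asarray(pixel_pts, dtype=float)   # (9, 2) circle centres in pixels
    robot_pts = np.asarray(robot_pts, dtype=float)   # (9, 2) same centres in the robot frame (mm)
    mu_p, mu_r = pixel_pts.mean(axis=0), robot_pts.mean(axis=0)
    P, Q = pixel_pts - mu_p, robot_pts - mu_r        # centred point sets
    H = P.T @ Q / len(pixel_pts)                     # cross-covariance matrix
    U, S, Vt = np.linalg.svd(H)
    D = np.diag([1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # guard against reflection
    R = Vt.T @ D @ U.T                               # best-fit rotation
    s = np.trace(np.diag(S) @ D) / P.var(axis=0).sum()      # pixel-to-millimetre scale
    t = mu_r - s * R @ mu_p                          # translation
    return s, R, t

# Usage: map a detected centre (u, v) in pixels to robot coordinates.
# s, R, t = fit_similarity_2d(pixel_centres, robot_centres)
# xy_robot = s * R @ np.array([u, v]) + t
```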
4.4. Experimental Verification of Coordinate Conversion Module
The above content provides a theoretical analysis from the perspective of the camera imaging model, detailing the entire process of coordinate calculation. This section validates the theoretical algorithms described above. Firstly, sampling operations are performed on points in the images to measure the pixel coordinates at different positions. Using the plane-based nine-point calibration method, a hand–eye relationship transformation is conducted, and the transformed coordinates are recorded. Secondly, by operating the robot’s upper computer, the robot end effector is moved to the same position, and the coordinates of the robot base coordinate system at this time are recorded. According to the requirements of object recognition, multiple experiments are conducted on the objects, and the average error of six experiments is calculated. The experimental results are shown in
Table 2.
From
Table 2, it can be observed that, based on the measurements from the RealSense D435 camera and the results from the robot’s upper computer software, the positioning errors in the x and y directions are less than 4 mm, with an average Euclidean distance error of 0.237 cm. The translation error is thus controlled within the millimeter level, meeting the requirements for robot grasping. The experiment therefore validates the effectiveness of using the RealSense D435 sensor for target localization and confirms that solving the coordinate transformation from camera calibration and the imaging model is effective.
4.5. System Workflow
The main functionality of the overall system control program is implemented through Python 3.7.3 programming, with a graphical user interface (GUI) for human–machine interaction in workpiece detection. The system is divided into three components: target detection, coordinate conversion, and robot motion control.
After running the system, the trained and improved YOLOv5 workpiece detection model is first loaded. Once the detection model is loaded, the RealSense camera captures real-time workpiece data, and the improved YOLOv5 object recognition algorithm detects the categories and pixel coordinates of the center points of all workpieces on the workstation. The system then uses the coordinate transformation matrix, obtained through internal and external camera calibration, to convert these coordinates into the robot arm’s coordinate system.
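A condensed sketch of this detection-and-conversion step, assuming the publicly available YOLOv5 PyTorch Hub interface, a hypothetical weight file best.pt for the improved model, and the scale/rotation/translation (s, R, t) produced by the nine-point calibration sketch above; it is an illustration of the data flow, not the authors’ exact program.

```python
import numpy as np
import torch

# best.pt is a hypothetical path to the trained YOLOv5s-I weights.
model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")

def detect_and_convert(frame, s, R, t, conf_thr=0.5):
    """Run detection on one RGB frame and return (class_name, robot_xy) per workpiece."""
    results = model(frame)
    targets = []
    for *xyxy, conf, cls in results.xyxy[0].tolist():   # x1, y1, x2, y2, confidence, class
        if conf < conf_thr:
            continue
        u = (xyxy[0] + xyxy[2]) / 2                      # pixel centre of the bounding box
        v = (xyxy[1] + xyxy[3]) / 2
        xy_robot = s * R @ np.array([u, v]) + t          # nine-point calibration mapping
        targets.append((model.names[int(cls)], xy_robot))
    return targets
```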
The coordinates output by the conversion module are processed through the robot arm’s inverse kinematics model, translating them into joint angles for the six axes, which are then formatted into the corresponding command instructions. The serial communication module acts as a bridge within the robotic system, connecting the vision module to the motion module. Qt 5 and later provide the cross-platform serial class QSerialPort, which simplifies serial port configuration across different platforms.
Next, instructions are sent via serial communication to the robot arm’s control cabinet. Upon receiving the instructions, the robot controller synchronizes the robot’s real-time movements according to the commands, enabling precise remote control of the robotic arm in real time.
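A minimal PyQt5 sketch of sending one command string to the control cabinet over the serial link using QSerialPort; the port name, baud rate, and command format are illustrative assumptions.

```python
from PyQt5.QtSerialPort import QSerialPort

def send_joint_command(port_name, command):
    """Open the serial port to the robot control cabinet and send one command string."""
    port = QSerialPort()
    port.setPortName(port_name)                  # e.g. "COM3" or "/dev/ttyUSB0" (assumed)
    port.setBaudRate(QSerialPort.Baud115200)     # baud rate is an illustrative assumption
    if not port.open(QSerialPort.WriteOnly):
        raise IOError("could not open serial port " + port_name)
    port.write(command.encode("ascii"))          # e.g. a comma-separated list of joint angles
    port.waitForBytesWritten(1000)               # block up to 1 s until the data is flushed
    port.close()
```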
Figure 8 illustrates the real-time grasping platform environment for the robotic arm.
6. Conclusions
This paper addresses the issues encountered by robots when grasping small workpieces and proposes a YOLOv5-based six-axis robotic hand–eye coordination control system, utilizing an improved object detection algorithm. The goal was to solve the challenges posed by varying workpiece sizes and poor detection performance for small targets. Leveraging the YOLOv5 target detection algorithm as its foundation, the system integrates the CBAM attention mechanism and optimized loss functions. Training and validation are conducted on a custom-made workpiece dataset, and a dedicated hardware and software platform is designed for real grasping experiments. Experimental data indicate that the improved algorithm achieves a detection accuracy of 99.59% under real-time detection conditions, representing a 4.57% enhancement. This improvement reduces the occurrences of missed and false detections for small workpieces, thereby enabling the robot to effectively recognize and grasp most small-target workpieces automatically. Future research can focus on optimizing the robot’s grasping strategy. For example, after detecting a target object, further investigation could explore how to automatically generate key grasping points using visual information, enabling the robotic arm to grasp the object at these critical points, thereby improving the success rate of grasping and further enhancing the automatic sorting system of the six-axis robot.