Research on Six-Degree-of-Freedom Refueling Robotic Arm Positioning and Docking Based on RGB-D Visual Guidance

Yang, Mingbo; Liu, Jiapeng

doi:10.3390/app14114904


by Mingbo Yang * and Jiapeng Liu

School of Mechanical and Material Engineering, North China University of Technology, Beijing 100144, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(11), 4904; https://doi.org/10.3390/app14114904

Submission received: 18 January 2024 / Revised: 23 February 2024 / Accepted: 28 February 2024 / Published: 5 June 2024



Featured Application

This research delves into the cutting-edge domain of robotic automation, specifically focusing on the development and application of a six-degree-of-freedom refueling robotic arm guided by RGB-D (Red, Green, Blue—Depth) visual technology. The study explores the intricate processes involved in the accurate positioning and docking of a robotic arm in refueling tasks, leveraging the advanced capabilities of RGB-D sensors for enhanced spatial awareness and precise maneuvering. The current application areas of this technology predominantly reside in industrial automation, particularly in sectors requiring precise and repetitive tasks such as automotive manufacturing, aerospace, and logistics. The potential applications extend further into fields like unmanned service stations, military logistics, and remote operation environments where human intervention is limited or hazardous. This research contributes to the evolving landscape of robotic automation, offering insights into more efficient, more accurate, and safer automated refueling processes, potentially revolutionizing how these tasks are approached in various industrial and commercial sectors.

Abstract

The main contribution of this paper is a positioning and docking technology for a six-degree-of-freedom (6-DoF) refueling robotic arm guided by an RGB-D camera, together with in-depth research and experimental validation of the technology. We integrate the YOLOv8 algorithm with the Perspective-n-Point (PnP) algorithm to achieve precise detection and pose estimation of the target refueling interface. The focus is on resolving the challenges the 6-DoF robotic arm faces in recognizing and positioning a specialized refueling interface during automated refueling. To capture the unique characteristics of the refueling interface, we developed a dedicated dataset for the specialized refueling connectors, ensuring that the YOLO algorithm accurately identifies the target interfaces. The detected interface information is then converted into precise 6-DoF pose data using the PnP algorithm. These data are used to determine the desired end-effector pose of the robotic arm. The robotic arm’s movements are controlled through a trajectory planning algorithm to complete the refueling gun docking process. An experimental setup was established in the laboratory to validate the accuracy of the visual recognition and the applicability of the robotic arm’s docking posture. The experimental results demonstrate that under general lighting conditions, the recognition accuracy of this method meets the docking requirements. Compared to traditional vision-guided methods based on OpenCV, this visual guidance algorithm exhibits better adaptability and effectively provides pose information for the robotic arm.

Keywords:

robotic arm; object detection; pose estimation; recognition method

1. Introduction

With the continual rise in labor costs, there is an increasing demand for automation and intelligence across various industries. This has led to the widespread application of robots, especially in tasks requiring mechanical arm docking operations. Currently, unmanned refueling robots have become a hot research topic for many companies. While research on the actuators of refueling robots is already quite comprehensive and the technology relatively mature and stable, studies on the visual guidance and positioning of unmanned refueling robots are notably lacking.

The visual guidance and positioning of refueling robots primarily involve two key technologies: object detection and pose estimation. Traditional object detection methods extract features from images using techniques such as Haar [1], SIFT [2], SURF [3], and HOG [4], and then use classifiers such as SVM [5] and Adaboost [6] to determine whether these features belong to the target object. With the development of deep learning, however, deep-learning-based detectors have become mainstream, including Faster R-CNN [7], Mask R-CNN [8], DETR [9], Swin Transformer [10], and the YOLO series [11]. The precision of pose estimation directly impacts docking with the target, and the most critical algorithm is the PnP (Perspective-n-Point) algorithm. Solution methods include P3P [12], DLT [13], and Bundle Adjustment [14]. Moreno-Noguer et al. later proposed the EPnP (Efficient Perspective-n-Point) [15] algorithm based on the PnP pose estimation method, which offers high computational efficiency and reduced time complexity. Subsequent developments such as OPnP [16], UPnP [17], and RPnP [18] further improved the accuracy of pose estimation. With deep learning becoming a mainstay in image processing, an increasing number of deep learning methods are being used to address pose estimation. These include direct estimation methods such as PoseNet [19], SSD-6D [20], Deep-6DPose [21], and PoseCNN [22], and keypoint methods such as BB8 [23], YOLO-6D [24], and PVNet [25].

Liang Mingyu et al. of South China University of Technology [26] used traditional object detection methods based on CCD binocular cameras to design a visual positioning system that can locate a fuel tank cap in space. Ma Zhi et al. from Jilin University [27] were among the first to use deep learning methods to detect vehicle fuel tank openings, thereby enhancing the accuracy of object detection. Wang Xufeng et al. proposed a vision-assisted connector cone-style autonomous aerial refueling scheme for drones [28], aiming to accurately acquire the relative pose information between the refueling plug and cone during autonomous aerial refueling of drones. Gregory P. Scott et al. designed a robotic refueling system for refueling special equipment such as naval ships at sea [29]. This refueling robot consists of two parts: a visual guidance part for precise positioning of the target fuel tank, and an execution part, a soft pneumatic arm with a magnetic end effector designed for transferring fuel, which provides compliant and safe docking with unmanned surface vehicles.

This paper targets a special fuel tank with non-standard connectors, focusing on specific operational scenarios during the automatic refueling of a particular vehicle. In real refueling operations, the parking position and posture of the vehicle to be refueled vary, so the fuel tank opening of the special vehicle must be precisely identified and positioned before each refueling docking task. The robotic arm must ensure that the end effector’s orientation aligns with the normal vector of the fuel inlet plane. Based on this, the paper proposes a six-degree-of-freedom refueling robotic arm docking method guided by RGB-D vision. This method uses an RGB-D camera to perform object detection and pose estimation of the target connector, providing target points for the robotic arm. The arm is then driven to complete the docking operation, ensuring smooth docking with the fuel inlet of the specific vehicle. Experiments validate the accuracy of the visual guidance and the applicability of the robotic arm’s docking posture obtained with this method.

2. Target Recognition and Pose Estimation of the Fuel Tank Inlet

2.1. Identifying the Target

This paper focuses on a special type of fuel tank used in the automatic refueling operations of a certain specialized vehicle, as shown in Figure 1a. During the refueling operation of the target fuel tank, the end posture of the robotic arm needs to align with the posture direction of the refueling inlet to successfully complete the docking action. To simulate a real refueling scenario, a semi-physical simulation is adopted for the receiving end. A simulated fuel tank is used to replace the real fuel tank, and the refueling process is as shown in Figure 1b. The fuel tank docking interface used in this study for a certain specialized vehicle is a dry-type quick connector, as shown in Figure 1c. The connector is divided into male and female parts, as illustrated in Figure 1d, with the female part mounted on the simulated fuel tank. The workpiece is cylindrical in shape, with a diameter of 60 mm and a height of 90 mm. The female part features three 5 mm-wide slots to assist in the docking of the connector.

2.2. Recognition Method and Results Based on RGB-D Camera

To position the female refueling connector, this study first performs object detection on the female connector. The center of the detected bounding box is then used as the docking target point for the female end of the refueling connector, and the corresponding position and posture information is provided to the robotic arm. In the reference configuration, the image of the female connector is located at the exact center of the camera’s field of view, and the end effector of the robotic arm is parallel to the plane of the female connector. This position is used as the origin point for the robotic arm. To simulate a real refueling scenario, a six-degree-of-freedom simulation motion platform is utilized. By adjusting the various degrees of freedom of the motion platform, data can be obtained in different positions and postures. Since the camera’s field of view is limited, excessive movement may cause the target connector to move out of the camera’s view. To ensure the accuracy and controllability of the experiment, the range of motion of the six-degree-of-freedom platform is restricted so that the movement does not exceed 10 cm and the tilt angle does not exceed 5 degrees. These limits keep the target connector within the camera’s field of view, avoiding data loss or inaccuracies due to the target moving out of sight. A partial visualization of the dataset and the distribution of the center coordinates and normalized widths and heights are shown in Figure 2. The left panel of Figure 2 visualizes six typical orientations of the target connector and displays their recognition results. The right panel shows the statistics of the widths, heights, and center points of the detected bounding boxes, illustrating the distribution and sizes of the recognition boxes as well as the positions of the targets in the images.

In this study, the YOLOv8 [5] algorithm is employed for object detection. YOLOv8 abandons the C3 module used in the previous YOLOv5 [6] and instead utilizes the more advanced C2f module for feature extraction. The incorporation of more branches in the C2f module allows for a richer gradient feedback to the sub-branches, enhancing the extraction of features from images. It is noteworthy that the center point of the annotated bounding box is taken as the central point of the circular connector. This approach effectively addresses the challenge of determining the docking point for circular connectors in a tilted position.
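As a minimal sketch of how a bounding-box center can serve as the docking point, the snippet below converts a YOLO-format label (normalized center, width, and height) into pixel coordinates. The function name, the 640 × 480 image size, and the label values are assumptions chosen for illustration; they are not taken from the experiments.

```python
def yolo_label_to_pixels(xc, yc, w, h, img_w, img_h):
    """Convert a normalized YOLO box to a pixel center and a corner box."""
    cx, cy = xc * img_w, yc * img_h      # box center in pixels
    bw, bh = w * img_w, h * img_h        # box width and height in pixels
    box = (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2)
    return (cx, cy), box

# Hypothetical label for a connector near the image center.
center, box = yolo_label_to_pixels(0.5, 0.5, 0.2, 0.25, img_w=640, img_h=480)
print("docking point (pixels):", center)
print("box (x1, y1, x2, y2):", box)
```

The center returned here plays the role of the circular connector's central point; it stays well defined even when the connector appears as an ellipse in a tilted view.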

In the process of automatic docking, obtaining the pose information of the fuel inlet is essential to ensure its alignment. Perspective-n-Point (PnP) is a method for estimating pose from correspondences between 3D points and their 2D projections. It describes how to estimate the pose of a camera when the positions of n 3D space points and their image projections are known, as illustrated in Figure 3.

After calibration, the intrinsic matrix K of the camera is known. A 3D point [X_{w} Y_{w} Z_{w}]^T in space can be represented in homogeneous coordinates as [X_{w} Y_{w} Z_{w} 1]^T, and its projection point [u, v]^T can be expressed in homogeneous coordinates as [u, v, 1]^T. The aim is to solve for the rotation matrix R and the translation vector T. The formula is as follows:

z_{c} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} R & T \end{bmatrix} \begin{bmatrix} X_{w} \\ Y_{w} \\ Z_{w} \\ 1 \end{bmatrix}

(1)

z_{c} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_{11} & f_{12} & f_{13} & f_{14} \\ f_{21} & f_{22} & f_{23} & f_{24} \\ f_{31} & f_{32} & f_{33} & f_{34} \end{bmatrix} \begin{bmatrix} X_{w} \\ Y_{w} \\ Z_{w} \\ 1 \end{bmatrix}

(2)

Substituting z_{c} = f_{31} X_{w} + f_{32} Y_{w} + f_{33} Z_{w} + f_{34} from the third row into the first two rows, this can be expanded and organized as follows:

f_{11} X_{w} + f_{12} Y_{w} + f_{13} Z_{w} + f_{14} - f_{31} X_{w} u - f_{32} Y_{w} u - f_{33} Z_{w} u - f_{34} u = 0

(3)

f_{21} X_{w} + f_{22} Y_{w} + f_{23} Z_{w} + f_{24} - f_{31} X_{w} v - f_{32} Y_{w} v - f_{33} Z_{w} v - f_{34} v = 0

(4)

Each pair of 2D–3D matched points corresponds to two equations. There are a total of 12 unknowns, and at least six sets of matching points are required to solve for the [R|T] matrix. The above equation can be expressed in matrix form AF = 0 as follows:

A = \begin{bmatrix} X_{1} & Y_{1} & Z_{1} & 1 & 0 & 0 & 0 & 0 & -u_{1} X_{1} & -u_{1} Y_{1} & -u_{1} Z_{1} & -u_{1} \\ 0 & 0 & 0 & 0 & X_{1} & Y_{1} & Z_{1} & 1 & -v_{1} X_{1} & -v_{1} Y_{1} & -v_{1} Z_{1} & -v_{1} \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ X_{n} & Y_{n} & Z_{n} & 1 & 0 & 0 & 0 & 0 & -u_{n} X_{n} & -u_{n} Y_{n} & -u_{n} Z_{n} & -u_{n} \\ 0 & 0 & 0 & 0 & X_{n} & Y_{n} & Z_{n} & 1 & -v_{n} X_{n} & -v_{n} Y_{n} & -v_{n} Z_{n} & -v_{n} \end{bmatrix}

(5)

F = \begin{bmatrix} f_{11} & f_{12} & f_{13} & f_{14} & f_{21} & f_{22} & f_{23} & f_{24} & f_{31} & f_{32} & f_{33} & f_{34} \end{bmatrix}^{T}

(6)

When n = 6, the equations can be solved directly. When n > 6, the least squares solution is obtained by minimizing ||AF||² subject to ||F|| = 1. Performing Singular Value Decomposition (SVD) on the A matrix yields A = UΣV^T, where both U and V are orthogonal and Σ is the diagonal matrix of singular values of A; the solution F is the column of V associated with the smallest singular value. In the experiment, the following R and T were obtained:

R = \begin{bmatrix} 0.99998854 & 0.00216014 & 0.00427195 \\ -0.00221074 & 0.99992702 & 0.01187697 \\ -0.00424598 & -0.011886270 & 0.99992034 \end{bmatrix}, \quad T = \begin{bmatrix} -0.0501434 & 0 & 0 \end{bmatrix}

(7)
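The linear system AF = 0 can be illustrated with a small synthetic check. The sketch below builds the two rows of A contributed by each 2D-3D correspondence and verifies that they vanish for the true projection matrix F = K[R|T]. The intrinsics, pose, and model points are hypothetical values chosen only for illustration, and the sketch checks the constraint rather than solving for F.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Hypothetical intrinsics K (focal lengths and principal point in pixels).
K = [[600.0, 0.0, 320.0],
     [0.0, 600.0, 240.0],
     [0.0, 0.0, 1.0]]

# Hypothetical extrinsics [R|T]: 10 degree rotation about Y plus a translation.
theta = math.radians(10.0)
RT = [[math.cos(theta), 0.0, math.sin(theta), 0.02],
      [0.0, 1.0, 0.0, -0.01],
      [-math.sin(theta), 0.0, math.cos(theta), 0.50]]

P = matmul(K, RT)                    # 3x4 projection matrix
F = [f for row in P for f in row]    # f11 ... f34 in row-major order

def project(P, Xw):
    """Project a 3D point to pixel coordinates (u, v)."""
    X, Y, Z = Xw
    h = [r[0] * X + r[1] * Y + r[2] * Z + r[3] for r in P]
    return h[0] / h[2], h[1] / h[2]  # divide by z_c

def dlt_rows(Xw, uv):
    """Two rows of A contributed by one 2D-3D correspondence."""
    X, Y, Z = Xw
    u, v = uv
    return [[X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u],
            [0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v]]

# Six hypothetical model points (metres) on a 60 mm wide, 90 mm tall cylinder.
points = [(0.03, 0.0, 0.0), (0.0, 0.03, 0.0), (-0.03, 0.0, 0.0),
          (0.0, -0.03, 0.0), (0.03, 0.0, 0.09), (0.0, 0.03, 0.09)]

A = []
for Xw in points:
    A.extend(dlt_rows(Xw, project(P, Xw)))

max_residual = max(abs(sum(a * f for a, f in zip(row, F))) for row in A)
print(len(A), "rows, max |A F| =", max_residual)
```

Six correspondences give twelve rows, matching the twelve unknowns in F. In practice F is unknown and would be recovered from A, for example through the SVD step described above.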

After solving for the transformation matrix from the pixel coordinate system to the camera coordinate system, it is also necessary to unify the camera coordinate system with the robotic arm’s coordinate system. Considering the relative installation method between the camera and the robotic arm, this paper adopts the eye-in-hand calibration method to calculate the transformation matrix between the robotic arm and the camera, as shown in Figure 4:

We designate the robot’s base coordinate system as the world coordinate system (base) and define the end effector coordinate system at the end of the robotic arm (end), the camera’s coordinate system (cam), and the fuel tank’s coordinate system (box). Based on these definitions, we can establish the following mathematical model:

{}_{box}^{base}T = {}_{end}^{base}T \cdot {}_{cam}^{end}T \cdot {}_{box}^{cam}T,

where {}_{box}^{base}T represents the transformation relationship between the robot’s base coordinate system and the fuel tank’s coordinate system; {}_{end}^{base}T represents the transformation relationship between the robot’s base coordinate system and the end effector coordinate system of the robotic arm; {}_{cam}^{end}T represents the transformation relationship between the end effector coordinate system and the camera’s coordinate system; and {}_{box}^{cam}T represents the transformation relationship between the camera’s coordinate system and the fuel tank’s coordinate system. In the diagram, the transformation matrix A between the fuel tank’s coordinate system and the camera’s coordinate system (shown in blue) is known. The transformation matrix C between the robot’s base coordinate system and the end effector coordinate system (shown in orange) is known. The transformation matrix H between the camera’s coordinate system and the end effector coordinate system (shown in red) is to be determined. Multiplying these three matrices yields the transformation matrix B from the fuel tank’s coordinate system to the robot’s base coordinate system, as shown in the orange part of the above diagram.
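The chain of transformations can be sketched with 4 × 4 homogeneous matrices. In the example below, the helper names and all numeric values (rotations about Z only, translations in metres) are hypothetical and serve only to show how the three matrices compose.

```python
import math

def matmul4(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def make_transform(rz_deg, tx, ty, tz):
    """4x4 homogeneous transform: rotation about Z by rz_deg plus a translation."""
    c, s = math.cos(math.radians(rz_deg)), math.sin(math.radians(rz_deg))
    return [[c, -s, 0.0, tx],
            [s, c, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]

# Hypothetical values standing in for the three transforms in the model.
base_T_end = make_transform(0.0, 0.50, 0.00, 0.40)   # robot kinematics
end_T_cam = make_transform(0.0, 0.00, 0.05, 0.10)    # hand-eye calibration
cam_T_box = make_transform(90.0, 0.00, 0.00, 0.30)   # pose estimation

# base_T_box = base_T_end * end_T_cam * cam_T_box
base_T_box = matmul4(matmul4(base_T_end, end_T_cam), cam_T_box)
box_origin_in_base = [row[3] for row in base_T_box[:3]]
print("fuel tank origin in base frame:", box_origin_in_base)
```

The order of multiplication matters: each matrix maps coordinates from the frame in its subscript to the frame in its superscript, so adjacent frames must match along the chain.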

Considering the accuracy of calibration, in the experiment we use the TCP (Tool Center Point) touch method for hand-eye calibration, given the installation of the target connector and camera, and use ArUco markers to assist in the joint calibration of the robotic arm and camera. We fix the ArUco marker to the tip of an electromagnet so that the tip passes through the exact center of the marker, and have the camera recognize the marker and record its current coordinate values. After removing the marker, we move the center of the robotic arm’s end effector to the tip of the electromagnet and record the current position coordinates of the robotic arm. We then return the robotic arm to the origin and repeat the process 10 times. We average the ten points obtained in the camera’s coordinate system and the ten points obtained in the robot’s base coordinate system, subtract the corresponding average from each point, and write the results into two 3 × 10 matrices. The matrix formed by the ten points in the camera’s coordinate system is transposed and multiplied with the 3 × 10 matrix formed in the robot’s base coordinate system to obtain a 3 × 3 matrix. This matrix is subjected to singular value decomposition to solve for the rotation matrix R. The calculated rotation matrix is then used to back-calculate the translation vector T. The resulting [R|T] matrix is the transformation matrix H_new:

H_{new} = \begin{bmatrix} 0.06568738 & 2.20065276 & -1.82542348 & -0.36943992 \\ -2.29523883 & -0.01276419 & -5.29706490 & 0.86376219 \\ 1.64837009 & 5.52861258 & 0.16579219 & 0.49468693 \\ 0 & 0 & 0 & 1 \end{bmatrix}

(8)
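The point-pair calibration described above resembles the standard SVD-based rigid registration of two point sets. The following is a minimal sketch of that standard technique, assuming NumPy is available; it is not the authors' implementation, and the function name, the synthetic points, and the ground-truth transform are hypothetical.

```python
import numpy as np

def fit_rigid_transform(cam_pts, base_pts):
    """Least-squares R (3x3) and T (3,) with base approximately R @ cam + T."""
    cam = np.asarray(cam_pts, dtype=float)     # shape (N, 3)
    base = np.asarray(base_pts, dtype=float)   # shape (N, 3)
    cam_mean, base_mean = cam.mean(axis=0), base.mean(axis=0)
    # 3x3 cross-covariance of the mean-centred point sets.
    H = (cam - cam_mean).T @ (base - base_mean)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    T = base_mean - R @ cam_mean
    return R, T

# Synthetic check: ten hypothetical camera-frame points and a known transform.
rng = np.random.default_rng(0)
cam_pts = rng.uniform(-0.1, 0.1, size=(10, 3))
angle = np.radians(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle), np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
T_true = np.array([0.40, -0.20, 0.30])
base_pts = cam_pts @ R_true.T + T_true

R_est, T_est = fit_rigid_transform(cam_pts, base_pts)

# Assemble the 4x4 homogeneous matrix [R|T; 0 0 0 1].
H_cam_to_base = np.eye(4)
H_cam_to_base[:3, :3] = R_est
H_cam_to_base[:3, 3] = T_est
print(np.allclose(R_est, R_true), np.allclose(T_est, T_true))
```

With noise-free synthetic points the known transform is recovered to numerical precision; with measured points the result is a least-squares fit, which is why averaging over repeated touches helps.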

3. Experimental Design

3.1. Experimental Platform Setup

To verify the accuracy of the docking interface recognition and positioning, as well as the applicability of visually guided robotic arm docking posture planning in such scenarios, this study established an experimental platform for robotic arm docking operations, as shown in Figure 5. The experimental design includes measuring the time required by different visual methods, assessing their application range, and determining whether the system can successfully complete the docking task. Based on the experimental results, the performance of the different visual guidance methods under specific conditions was evaluated systematically.

In this study, the RealSense D435i camera was used as the visual sensor, with a stereo baseline of 50 mm and a detection depth range of 0.2~10 m. The robotic arm selected was the Aubo-i10 collaborative robotic arm, capable of carrying a load of 10 kg. The wrist joint of the arm has three mutually perpendicular axes and is equipped with a specially designed end effector for gripping the docking device and mounting the camera. The collaborative robot has a working radius of 1350 mm and a repeat positioning accuracy of ±0.03 mm. The simulated fuel tank is placed on a simulated target platform, which utilizes a domestic six-degree-of-freedom simulation platform with a load capacity of 700 kg, linear motion error of ±0.05 mm, and rotational error of ±0.03 deg.

3.2. Experimental Procedure Design

The experiment begins by capturing the position of the refueling port with the camera in real time. Target detection methods based on OpenCV and on YOLOv8 are both employed to identify the refueling port. The time required for recognition by each algorithm and the pixel coordinates of the center point of the circle are recorded. Once the target object is recognized, pose estimation is performed on the target’s central position to recover the position and orientation of the target center point in the robot’s base coordinate system. This pose information is then provided to the robotic arm to plan the overall posture of the arm. According to the planned trajectory, the robotic arm controls the end effector to move to the specified posture, completing the docking operation between the refueling and receiving ends. The experimental procedure is illustrated in Figure 6.

3.3. Experimental Procedure

The robotic arm is maneuvered so that its end effector is parallel to the fuel tank’s receiving port, with the receiving interface positioned at the center of the image. This point is assumed to be the origin of the robotic arm for the experiment. The process from capturing the target position with the camera to the end effector reaching the target point constitutes one complete docking operation. The target recognition and docking process are shown in Figure 7. The following experiments are conducted:

a1: Target detection is performed using the OpenCV (version 4.5.2.54) method. The success of the target recognition is judged. If successful, proceed to a2; if the target is not recognized, the experiment is considered a failure.
a2: If a1 successfully recognizes the target, pose estimation is performed on the target’s center point. The robotic arm moves the end effector to the target position for docking. The experiment is considered successful if docking is completed smoothly; otherwise, it is a failure.
a3: The position and posture of the simulated fuel tank are changed. Under the condition of successful a1, experiment a2 is conducted until docking fails, determining the successful docking range.
a4: The position and posture of the simulated fuel tank are changed, and experiment a1 is repeated until failure, establishing the applicable range for target recognition.
a5: The above experiments based on the OpenCV method are replaced with the YOLOv8-based method, and experiments a1 to a4 are conducted accordingly.

3.4. Experimental Results and Discussion

Initially, the recognition performance of the target detection methods based on OpenCV and YOLOv8 is tested at different positions with the same posture. Within the range of successful target recognition, the coordinate prediction values are calculated and their docking success rates are tested, as shown in Appendix A in Table A1.

The error between the predicted point and the actual point is represented by the Euclidean distance. From Figure 8 and Figure 9, it can be observed that when the fuel tank opening moves in a single direction, the prediction error is relatively small. However, when moving in multiple directions, the error increases significantly. The largest error in prediction occurs when the target object moves along the y-axis, indicating that movement in the y-axis direction is the most significant factor causing errors. Furthermore, it is evident from the figures that compared to using OpenCV for circular (elliptical) detection to predict the center point, the method based on YOLOv8, which uses the midpoint of the target box for equivalent replacement, helps to reduce the error in position estimation.
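As a worked example of this error measure, the sketch below computes the Euclidean distance for row 3 of Table A1, where the target is displaced along the y-axis. Only the position components (x, y, z) are used; the units are those of the table, which the source does not state.

```python
import math

# Row 3 of Table A1: actual position and the two predicted positions (x, y, z).
actual = (-0.998406, -0.461761, 0.129670)
pred_opencv = (-0.997705, -0.460317, 0.129021)
pred_yolov8 = (-0.998281, -0.460671, 0.129232)

# Euclidean distance between the predicted point and the actual point.
err_opencv = math.dist(actual, pred_opencv)
err_yolov8 = math.dist(actual, pred_yolov8)
print(f"OpenCV error: {err_opencv:.6f}, YOLOv8 error: {err_yolov8:.6f}")
```

For this row the YOLOv8-based prediction lies closer to the actual point than the OpenCV-based one, in line with the trend described above.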

Subsequently, the recognition performance under different poses and the predicted values of target points are tested. Within the range of successful target recognition, the docking success rate is tested, as shown in Appendix A in Table A2.

The errors are analyzed from both positional and attitudinal perspectives. As can be seen in Figure 10, when there is a change in attitude, the prediction errors of the target center point based on OpenCV are significantly higher than those obtained using the YOLOv8 method. When the attitude changes about a single axis, the positional error does not vary significantly. However, when the attitudes of multiple axes change, the error in predicting the center point by circular (elliptical) detection using OpenCV is significantly higher than that obtained by substituting the midpoint of the target box based on YOLOv8. It is also evident from the figure that when the error in attitude prediction is large, the positional error increases correspondingly. This indicates that the accuracy of attitude estimation greatly influences the determination of the target’s position. Additionally, origin localization using OpenCV is not robust, and the stability of docking operations is poor when there are significant changes in the object’s pose. In contrast, the YOLOv8-based pose detection method demonstrates better robustness.

Finally, within the range where both methods can recognize the target, ten sets of points are randomly set up to test their recognition success rate and record the time taken for docking by each, as shown in Appendix A in Table A3.

In Figure 11, the red section represents the time required for unsuccessful docking attempts. As the graph shows, the positioning method based on OpenCV requires more time and has a significantly lower success rate than the method based on YOLOv8. Further analysis of the graph reveals that the average time taken for docking operations using the traditional OpenCV-based object detection method is 10.61 s. In contrast, the docking operation using the YOLOv8-based object detection method averages 9.93 s, a reduction of about 6.4% relative to the traditional method. Additionally, the YOLOv8-based method is capable of successfully completing docking operations over a broader range.

The experimental results show that the target recognition method based on OpenCV can detect the target refueling port only within a translational range of 10 cm and a rotational range of 3°. It can successfully dock the refueling port only within a translational range of 10 cm and a rotational range of 2°. In contrast, the target recognition method based on YOLOv8 can accurately identify the target connector within a translational range of 10 cm and a rotational range of 5°. Failure in docking the refueling port only occurs when there are significant translations of about 10 cm in the x, y, and z axes, and substantial rotations in all three axes (rx, ry, rz). Compared to the traditional OpenCV-based target detection method, the YOLOv8-based recognition method has a broader recognition range and higher accuracy. The pose estimation derived from the same method is more precise, leading to a higher success rate in target docking. The results also indicate that the average time taken for docking operations using the traditional OpenCV-based target detection method is 10.575 s, while it is 9.89 s with the YOLOv8-based method, reducing the docking time by 6.5% and increasing the successful docking range. The results demonstrate that, in this experimental environment, the YOLOv8-based target detection method has better applicability in subsequent pose estimation processes. It not only provides more precise pose information for the robotic arm’s docking but also reduces the time required for the docking process, showing better overall applicability.
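The reported time saving can be checked with a short calculation. The sketch below computes the reduction relative to the OpenCV-based baseline from the two average docking times quoted above; the function name is an illustrative choice.

```python
def relative_reduction(baseline, improved):
    """Percentage reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline * 100.0

t_opencv, t_yolov8 = 10.575, 9.89   # average docking times in seconds
reduction = relative_reduction(t_opencv, t_yolov8)
print(f"Docking time reduction: {reduction:.1f}%")
```

Dividing by the baseline (OpenCV) time rather than by the improved time is what makes the figure a reduction relative to the traditional method.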

4. Conclusions and Future Prospects

This paper addresses robotic arm docking scenarios in certain specialized operations, designing corresponding visual guidance methods based on the characteristics of the docking interfaces. It combines the YOLOv8 target detection algorithm with the PnP pose estimation method to directly recover the position and attitude of the target from images. An experimental platform was set up in the laboratory to compare the target positioning methods based on OpenCV and YOLOv8. The experimental data show that when the target object undergoes only positional changes without attitudinal changes, both methods can meet the docking requirements within the given range, but the YOLOv8-based target positioning method has higher accuracy. When both position and attitude change simultaneously, the OpenCV-based recognition method has larger positioning errors: 3 of the 10 sets of experiments could not complete the docking successfully. The YOLOv8-based target positioning method had smaller positioning errors, and all 10 sets of experiments successfully completed the docking. Additionally, the average time for the robotic arm to complete docking operations using the YOLOv8-based method was 9.89 s, compared to 10.575 s using the OpenCV-based method, an average reduction of 6.5% in docking operation time.

The results indicate that in such environments, using a combination of the YOLOv8 and PnP algorithms as the visual guidance method for the robotic arm is highly applicable and can effectively reduce the time taken for the robotic arm to complete docking operations. Future research could further expand the field of robotic arm docking by exploring a wider range of application scenarios and more complex operational environments. Firstly, there is potential to extend visual guidance methods to encompass various types of robotic arms and docking scenarios, including industrial production lines, aerospace applications, and medical robotics. Secondly, there is an opportunity for in-depth investigation and optimization of deep learning algorithms for robotic vision guidance, aiming to enhance the recognition and localization capabilities for targets, thereby achieving more precise and stable docking operations. Additionally, the exploration of machine learning and autonomous control techniques in the docking process could enable robotic arms to make flexible decisions and motion planning based on real-time environmental information and task requirements, thereby enhancing the autonomy and intelligence level of docking operations. Through these endeavors, novel breakthroughs and advancements in robotic arm docking technology and its applications are anticipated.

Author Contributions

Conceptualization, M.Y. and J.L.; methodology, J.L. and M.Y.; software, J.L.; validation, J.L.; resources, M.Y. and J.L.; data curation, J.L.; writing—original draft preparation, J.L. and M.Y.; writing—review and editing, M.Y.; supervision, M.Y.; project administration, J.L. and M.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 61973300 and by the Undergraduate Education Reform project of North China University of Technology.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Comparison of positional changes with two recognition methods.

No.	Docking Interface Pose (x,y,z,rx,ry,rz)	Recognition Based on OpenCV (x,y,z,rx,ry,rz)	Successful Docking (OpenCV)	Recognition Based on YOLOv8 (x,y,z,rx,ry,rz)	Successful Docking (YOLOv8)
1	(−0.998406,−0.411761,0.129670) (151.82193,36.066860,−44.923798)	(−0.998553,−0.411444,0.129181) (151.82192,36.066859,−44.923798)	Yes	(−0.998373,−0.411554,0.129454) (151.82192,36.066859,−44.923798)	Yes
2	(−0.948406,−0.411761,0.129670) (151.82193,36.066860,−44.923798)	(−0.948693,−0.411482,0.129118) (151.82192,36.066859,−44.923798)	Yes	(−0.948590,−0.411581,0.129318) (151.82192,36.066859,−44.923798)	Yes
3	(−0.998406,−0.461761,0.129670) (151.82193,36.066860,−44.923798)	(−0.997705,−0.460317,0.129021) (151.82192,36.066859,−44.923798)	Yes	(−0.998281,−0.460671,0.129232) (151.82192,36.066859,−44.923798)	Yes
4	(−0.998406,−0.411761,0.134670) (151.82193,36.066860,−44.923798)	(−0.998493,−0.411566,0.134391) (151.82192,36.066859,−44.923798)	Yes	(−0.998451,−0.411463,0.134775) (151.82192,36.066859,−44.923798)	Yes
5	(−0.948406,−0.461761,0.129670) (151.82193,36.066860,−44.923798)	(−0.948016,−0.460288,0.128731) (151.82192,36.066859,−44.923798)	Yes	(−0.948287,−0.462751,0.129043) (151.82192,36.066859,−44.923798)	Yes
6	(−0.948406,−0.411761,0.134670) (151.82193,36.066860,−44.923798)	(−0.948021,−0.411056,0.133613) (151.82193,36.066860,−44.923798)	Yes	(−0.948512,−0.411664,0.134539) (151.82193,36.066860,−44.923798)	Yes
7	(−0.998406,−0.461761,0.134670) (151.82193,36.066860,−44.923798)	(−0.997156,−0.460437,0.133226) (151.82193,36.066860,−44.923798)	Yes	(−0.998032,−0.461582,0.133492) (151.82193,36.066860,−44.923798)	Yes
8	(−1.048406,−0.366761,0.124670) (151.82193,36.066860,−44.923798)	(−1.048839,−0.364972,0.124701) (151.82193,36.066860,−44.923798)	Yes	(−1.047969,−0.366241,0.124848) (151.82193,36.066860,−44.923798)	Yes
9	(−1.048406,−0.366761,0.134670) (151.82193,36.066860,−44.923798)	(−1.048952,−0.365831,0.133089) (151.82193,36.066860,−44.923798)	Yes	(−1.048359,−0.366527,0.133977) (151.82193,36.066860,−44.923798)	Yes
10	(−0.948406,−0.461761,0.134670) (151.82193,36.066860,−44.923798)	(−0.948199,−0.458838,0.133817) (151.82193,36.066860,−44.923798)	Yes	(−0.948177,−0.463963,0.134895) (151.82193,36.066860,−44.923798)	Yes
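
The position columns of Table A1 can be compared numerically by taking the Euclidean distance between each estimate and the set interface position. The sketch below is illustrative only: it copies the x, y, z values of rows 1, 5 and 10 from the table, and the names `ROWS`, `position_error` and `errors` are hypothetical helpers, not part of the authors' software. Distances are in the same length unit as the table.

```python
import math

# (set position, OpenCV estimate, YOLOv8 estimate) copied from
# rows 1, 5 and 10 of Table A1; only the x, y, z part of each pose is used.
ROWS = {
    1: ((-0.998406, -0.411761, 0.129670),
        (-0.998553, -0.411444, 0.129181),
        (-0.998373, -0.411554, 0.129454)),
    5: ((-0.948406, -0.461761, 0.129670),
        (-0.948016, -0.460288, 0.128731),
        (-0.948287, -0.462751, 0.129043)),
    10: ((-0.948406, -0.461761, 0.134670),
         (-0.948199, -0.458838, 0.133817),
         (-0.948177, -0.463963, 0.134895)),
}


def position_error(truth, estimate):
    """Euclidean distance between two (x, y, z) positions."""
    return math.dist(truth, estimate)


# Map each row number to (OpenCV error, YOLOv8 error).
errors = {
    row: (position_error(truth, opencv), position_error(truth, yolo))
    for row, (truth, opencv, yolo) in ROWS.items()
}

for row, (err_opencv, err_yolo) in errors.items():
    print(f"row {row}: OpenCV {err_opencv:.6f}, YOLOv8 {err_yolo:.6f}")
```

For these three rows the YOLOv8-based estimate lies closer to the set position than the OpenCV-based one; the remaining rows can be added to `ROWS` in the same way.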

Table A2. Comparison of the two recognition methods regarding positional and postural changes.

No.	Docking Interface Pose (x,y,z,rx,ry,rz)	Recognition Based on OpenCV (x,y,z,rx,ry,rz)	Successful Docking (OpenCV)	Recognition Based on YOLOv8 (x,y,z,rx,ry,rz)	Successful Docking (YOLOv8)
1	(−0.998406,−0.411761,0.129670) (153.82193,36.066860,−44.923798)	(−0.996278,−0.410937,0.127972) (153.34790,36.163685,−44.73578)	Yes	(−0.996978,−0.412937,0.128079) (153.34790,36.163685,−44.73578)	Yes
2	(−0.998406,−0.411761,0.129670) (151.82193,34.066860,−44.923798)	(−0.996889,−0.409235,0.127667) (152.03432,34.321348,−45.178932)	Yes	(−0.997782,−0.410883,0.128631) (152.03432,34.321348,−45.178932)	Yes
3	(−0.998406,−0.411761,0.129670) (151.82193,36.066860,−42.923798)	(−0.995703,−0.408701,0.126892) (152.27036,36.410983,−43.376418)	Yes	(−0.999132,−0.416915,0.127105) (152.27036,36.410983,−43.376418)	Yes
4	(−0.998406,−0.411761,0.129670) (153.82193,34.066860,−44.923798)	(−0.996872,−0.413578,0.127962) (154.08652,34.903678,−44.555324)	Yes	(−0.997451,−0.411463,0.128024) (154.08652,34.903678,−44.555324)	Yes
5	(−0.998406,−0.411761,0.129670) (153.82193,36.066860,−42.923798)	(−0.995091,−0.414992,0.128013) (153.25681,36.693251,−42.379042)	Yes	(−0.997891,−0.413732,0.1266031) (153.25681,36.693251,−42.379042)	Yes
6	(−0.998406,−0.411761,0.129670) (151.82193,38.066860,−42.923798)	(−0.997351,−0.413742,0.128531) (152.72146,37.73592,−42.823921)	Yes	(−0.997235,−0.411321,0.127745) (152.72146,37.73592,−42.823921)	Yes
7	(−0.998406,−0.411761,0.129670) (149.82193,34.066860,−46.923798)	(−0.999492,−0.408388,0.127431) (148.89427,34.672967,−47.678159)	Yes	(−0.998164,−0.415803,0.130572) (148.89427,34.672967,−47.678159)	Yes
8	(−0.948406,−0.361761,0.134670) (154.32193,38.066860,−42.923798)	(−0.942015,−0.352419,0.131144) (153.69912,38.472419,−42.603981)	No	(−0.946068,−0.359634,0.133223) (153.69912,38.472419,−42.603981)	Yes
9	(−1.048406,−0.461761,0.124670) (149.32193,34.066860,−46.923798)	(−1.0315847,−0.467249,0.128247) (148.97931,33.938642,−47.244901)	No	(−1.049984,−0.463752,0.124639) (148.97931,33.938642,−47.244901)	Yes
10	(−0.948406,−0.361761,0.134670) (149.32193,36.066860,−46.923798)	(−0.950131,−0.376181,0.137169) (149.86974,35.971524,−46.760962)	No	(−0.947794,−0.363371,0.134149) (149.86974,35.071524,−46.760962)	Yes
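
A rough way to read the rotation columns of Table A2 is to look at the largest per-axis difference between the set and the estimated rotation. The sketch below makes two assumptions that this appendix does not state: that rx, ry, rz are angles in degrees, and that comparing them component by component is meaningful (it is not a true rotation distance). The names `wrap_deg`, `ROTATIONS`, `max_axis_deviation` and `deviations` are hypothetical; the values are copied from rows 1, 7 and 8.

```python
def wrap_deg(angle):
    """Wrap an angle difference into the interval [-180, 180) degrees."""
    return (angle + 180.0) % 360.0 - 180.0


# (set rotation, estimated rotation) copied from rows 1, 7 and 8 of Table A2.
# Assumption: the components are angles in degrees.
ROTATIONS = {
    1: ((153.82193, 36.066860, -44.923798), (153.34790, 36.163685, -44.73578)),
    7: ((149.82193, 34.066860, -46.923798), (148.89427, 34.672967, -47.678159)),
    8: ((154.32193, 38.066860, -42.923798), (153.69912, 38.472419, -42.603981)),
}


def max_axis_deviation(truth, estimate):
    """Largest absolute per-axis difference between two rotation triples."""
    return max(abs(wrap_deg(e - t)) for t, e in zip(truth, estimate))


deviations = {row: max_axis_deviation(t, e) for row, (t, e) in ROTATIONS.items()}

for row, dev in deviations.items():
    print(f"row {row}: max per-axis deviation {dev:.5f}")
```

Under these assumptions, the estimated rotation stays within one unit of the set value on every axis for the three rows shown.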

Table A3. Time consumption and success rate comparison of two recognition methods.

No.	Docking Interface Pose (x,y,z,rx,ry,rz)	Successful Docking (OpenCV)	Time Taken, OpenCV-Based Recognition (s)	Successful Docking (YOLOv8)	Time Taken, YOLOv8-Based Recognition (s)
1	(−0.998406,−0.411761,0.129670) (151.82193,36.066860,−44.923798)	Yes	10.3	Yes	9.6
2	(−0.978406,−0.431761,0.131670) (152.82193,35.066860,−43.923798)	Yes	10.5	Yes	9.9
3	(−0.999406,−0.391761,0.127670) (151.32193,35.566860,−45.923798)	Yes	10.7	Yes	9.8
4	(−1.048406,−0.431761,0.128670) (152.82193,35.066860,−44.423798)	Yes	10.6	Yes	9.9
5	(−1.048406,−0.381761,0.130670) (149.82193,35.566860,−43.423798)	Yes	10.7	Yes	9.8
6	(−0.948406,−0.381761,0.128670) (150.82193,36.566860,−45.423798)	Yes	10.4	Yes	9.7
7	(−1.018406,−0.431761,0.127670) (151.32193,36.566860,−45.023798)	Yes	10.6	Yes	9.9
8	(−0.948406,−0.361761,0.134670) (152.82193,37.066860,−43.923798)	Yes	10.8	Yes	10.0
9	(−0.988406,−0.451761,0.127670) (150.82193,34.066860,−45.923798)	No	—	Yes	10.2
10	(−0.948406,−0.381761,0.133670) (153.82193,37.066860,−43.923798)	No	—	Yes	10.1
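
The summary figures implied by Table A3 can be reproduced with a few lines of arithmetic. The sketch below copies the time columns from the table and represents a failed docking (shown as a dash in the table) as `None`; the names `opencv_times`, `yolo_times`, `success_rate` and `mean_time` are hypothetical and chosen for illustration.

```python
from statistics import mean

# Docking times in seconds copied from Table A3, trials 1 to 10.
# None marks a failed docking, for which the table reports no time.
opencv_times = [10.3, 10.5, 10.7, 10.6, 10.7, 10.4, 10.6, 10.8, None, None]
yolo_times = [9.6, 9.9, 9.8, 9.9, 9.8, 9.7, 9.9, 10.0, 10.2, 10.1]


def success_rate(times):
    """Fraction of trials that produced a docking time."""
    return sum(t is not None for t in times) / len(times)


def mean_time(times):
    """Mean docking time over the successful trials only."""
    return mean(t for t in times if t is not None)


opencv_rate, yolo_rate = success_rate(opencv_times), success_rate(yolo_times)
opencv_mean, yolo_mean = mean_time(opencv_times), mean_time(yolo_times)

# Compare only the trials in which both methods succeeded.
paired = [(o, y) for o, y in zip(opencv_times, yolo_times) if o is not None]
paired_saving = mean(o - y for o, y in paired)

print(f"OpenCV: success {opencv_rate:.0%}, mean time {opencv_mean:.3f} s")
print(f"YOLOv8: success {yolo_rate:.0%}, mean time {yolo_mean:.3f} s")
print(f"Mean saving on shared successes: {paired_saving:.2f} s")
```

With the values in the table, the OpenCV-based method succeeds in 8 of 10 trials with a mean time of 10.575 s, while the YOLOv8-based method succeeds in all 10 with a mean of 9.89 s; over the eight trials where both succeed, the YOLOv8-based method is 0.75 s faster on average.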

References

Harris, C.; Stephens, M. A combined corner and edge detector. In Proceedings of the Alvey Vision Conference, Manchester, UK, 31 August–2 September 1988; pp. 147–151.
Lowe, D.G. Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision (ICCV), Kerkyra, Greece, 20–27 September 1999; IEEE: Piscataway, NJ, USA, 1999; Volume 2, pp. 1150–1157.
Bay, H.; Ess, A.; Tuytelaars, T.; Van Gool, L. Speeded-up robust features (SURF). Comput. Vis. Image Underst. 2008, 110, 346–359.
Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision & Pattern Recognition, San Diego, CA, USA, 20–25 June 2005; IEEE: Piscataway, NJ, USA, 2005.
Freund, Y.; Schapire, R.E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 1997, 55, 119–139.
Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. arXiv 2020, arXiv:2005.12872.
Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin Transformer V2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 12009–12019.
Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Seattle, WA, USA, 13–19 June 2020.
Haralick, R.M.; Lee, C.-N.; Ottenberg, K.; Noelle, M. Analysis and solution of the three point perspective pose estimation problem. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Maui, HI, USA, 3–6 June 1991; pp. 592–598.
Hartley, R.; Zisserman, A. Multiple View Geometry in Computer Vision; Cambridge University Press: Cambridge, UK, 2003.
Jie, S. A Brief History and Overview of Bundle Adjustment. J. Wuhan Univ. Inf. Sci. Ed. 2018, 43, 1797–1810.
Moreno-Noguer, F.; Lepetit, V.; Fua, P. EPnP: An Accurate O(n) Solution to the PnP Problem. Int. J. Comput. Vis. 2009, 81, 155–166.
Zheng, Y.; Kuang, Y.; Sugimoto, S.; Astrom, K.; Okutomi, M. Revisiting the PnP Problem: A Fast, General and Optimal Solution. In Proceedings of the 2013 IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; IEEE: Piscataway, NJ, USA, 2014.
Kneip, L.; Li, H.; Seo, Y. UPnP: An Optimal O(n) Solution to the Absolute Pose Problem with Universal Applicability. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Cham, Switzerland, 2014.
Li, S.; Xu, C.; Xie, M. A robust O(n) solution to the perspective-n-point problem. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 1444–1450.
Kendall, A.; Grimes, M.; Cipolla, R. PoseNet: A convolutional network for real-time 6-DOF camera relocalization. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2938–2946.
Kehl, W.; Manhardt, F.; Tombari, F.; Ilic, S.; Navab, N. SSD-6D: Making RGB-based 3D detection and 6D pose estimation great again. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: New York, NY, USA, 2017.
Do, T.T.; Cai, M.; Pham, T.; Reid, I. Deep-6DPose: Recovering 6D Object Pose from a Single RGB Image. arXiv 2018, arXiv:1802.10367.
Xiang, Y.; Schmidt, T.; Narayanan, V.; Fox, D. PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes. In Proceedings of the Robotics: Science and Systems 2018, Pittsburgh, PA, USA, 26–30 June 2018.
Rad, M.; Lepetit, V. BB8: A Scalable, Accurate, Robust to Partial Occlusion Method for Predicting the 3D Poses of Challenging Objects without Using Depth. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 3848–3856.
Tekin, B.; Sinha, S.N.; Fua, P. Real-time seamless single shot 6D object pose prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 292–301.
Peng, S.; Liu, Y.; Huang, Q.; Zhou, X.; Bao, H. PVNet: Pixel-wise voting network for 6DoF pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4561–4570.
Liang, M. Study of a Binocular Vision System for Automotive Fueling Robots. Master’s Thesis, South China University of Technology, Guangzhou, China, 2021.
Ma, Z. Research on Fuel Tank Cap Recognition and Detection Technology Based on Binocular Vision. Master’s Thesis, Jilin University, Changchun, China, 2022.
Wang, X.; Dong, X.; Kong, X.; Zhi, J.; Wang, L. MS-KF Fusion Algorithm for Cone Sleeve Tracking. Appl. Opt. 2013, 34, 951–956.
Scott, G.P.; Henshaw, C.G.; Walker, I.D.; Willimon, B. Autonomous robotic refueling of an unmanned surface vehicle in varying sea states. In Proceedings of the 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, 28 September–2 October 2015; pp. 1664–1671.

Figure 1. Fuel tank and connector. (a) Schematic diagram of the special fuel tank; (b) simulated fuel tank; (c) dry-type self-sealing connector; (d) male and female ends of the connector.

Figure 2. Visualization of part of the dataset.

Figure 3. Schematic diagram of the PnP algorithm.

Figure 4. Relationship of transformation between coordinate systems.

Figure 5. Visual guidance docking platform.

Figure 6. Experimental procedure diagram.

Figure 7. Target recognition process and docking process.

Figure 8. Spatial distribution of predicted and actual values for two algorithms.

Figure 9. Error comparison of predicted and actual values for two algorithms.

Figure 10. Pose estimation error and the difference between predicted and actual values under different poses for two algorithms.

Figure 11. Time comparison between two methods.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
