Author Contributions
Conceptualization, I.P.-E., S.L.-B., R.M.-P. and P.J.S.; methodology, I.P.-E., S.L.-B., R.M.-P. and P.J.S.; software, I.P.-E. and S.L.-B.; validation, I.P.-E. and S.L.-B.; formal analysis, I.P.-E. and S.L.-B.; investigation, I.P.-E., S.L.-B., R.M.-P. and P.J.S.; resources, R.M.-P. and P.J.S.; data curation, I.P.-E. and S.L.-B.; writing—original draft preparation, I.P.-E. and S.L.-B.; writing—review and editing, I.P.-E., S.L.-B., R.M.-P. and P.J.S.; visualization, I.P.-E., S.L.-B., R.M.-P. and P.J.S.; supervision, S.L.-B., R.M.-P. and P.J.S.; project administration, R.M.-P. and P.J.S.; funding acquisition, R.M.-P. and P.J.S. All authors have read and agreed to the published version of the manuscript.
Figure 1.
Kinematic frames of the hybrid-controlled ROV.
Figure 2.
Overview of the global object perception module.
Figure 3.
Overview of the local grasp perception module. The numbered points indicate: (1) segment centroid, (2,3) grasping points, and (4) visual reference point. The final metrics are the alignment errors and the distance d between the grasping points.
Figure 4.
Architecture of the natural-language-to-ROS translation agent for robot control.
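As a rough illustration of the translation step sketched in Figure 4, the snippet below maps a recognized natural-language intent to a ROS command. The topic name, message type, and intent table are hypothetical placeholders, not the interface actually used by the agent.

```python
# Hypothetical sketch of the translation step in Figure 4: mapping a recognized
# natural-language intent to a ROS command. Topic name, message type, and
# intent table are placeholders, not the paper's actual interface.
import rospy
from std_msgs.msg import Float64

rospy.init_node("nl_agent_bridge")
gripper_pub = rospy.Publisher("/gripper/command", Float64, queue_size=1)

# Placeholder mapping from recognized utterances to gripper setpoints.
INTENTS = {
    "open the gripper": 1.0,   # fully open
    "close the gripper": 0.0,  # fully closed
}

def execute(utterance):
    """Publish the ROS command matching a recognized utterance."""
    value = INTENTS.get(utterance.strip().lower())
    if value is None:
        rospy.logwarn("Unrecognized command: %s", utterance)
        return
    gripper_pub.publish(Float64(value))

execute("open the gripper")
```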
Figure 5.
Block diagram of the proposed architecture integrating a ROS agent with ArduSub for autonomous underwater manipulation.
Figure 6.
(1) BlueROV2 Heavy, (2) umbilical tether, (3) frontal camera, (4) gripper camera, (5) Newton Subsea Gripper.
Figure 7.
Top row: external views of the simulation. Bottom left: gripper camera view. Bottom right: RViz visualization.
Figure 8.
Data flow between ROS, ArduSub SITL, and the Stonefish simulator.
Figure 9.
Experimental setup (left) and the target object (right).
Figure 10.
Underwater image before and after visual preprocessing and pose estimation in a real marine environment.
Figure 11.
Comparison between the pose estimated by the PnP method and the ground-truth pose from the simulation. The plots show the linear components x, y, z, the angular components roll–pitch–yaw, and the errors, all expressed in the camera frame and representing the position and orientation of the box with respect to the camera.
Figure 12.
Comparison between the pose estimated by the PnP method and the ground-truth pose obtained from ArUco markers. The plots show the linear components x, y, z, the angular components roll–pitch–yaw, and the errors, all expressed in the camera frame and representing the position and orientation of the box with respect to the camera.
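For reference, the pose recovery evaluated in Figures 11 and 12 can be sketched with OpenCV's solvePnP as below; the box dimensions, 2D detections, and camera intrinsics are illustrative placeholders, not the values used in the experiments.

```python
# Minimal sketch: recovering the camera-to-box pose with OpenCV's PnP solver.
# All numeric values below are placeholders for illustration only.
import cv2
import numpy as np

# 3D corners of the box in its own frame [m] (hypothetical dimensions).
object_points = np.array([
    [0.0, 0.0, 0.0], [0.3, 0.0, 0.0],
    [0.3, 0.2, 0.0], [0.0, 0.2, 0.0],
], dtype=np.float64)

# Matching 2D detections in the image [px] (in practice supplied by the
# perception module).
image_points = np.array([
    [410.0, 260.0], [620.0, 255.0],
    [625.0, 400.0], [405.0, 405.0],
], dtype=np.float64)

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])  # pinhole intrinsics (placeholder)
dist = np.zeros(5)               # assume undistorted input

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist)
if ok:
    R, _ = cv2.Rodrigues(rvec)   # rotation matrix of the box w.r.t. the camera
    print("t (x, y, z) [m]:", tvec.ravel())
```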
Figure 13.
Comparison between the pose estimated by the PnP method with real-data fine-tuning and the ground-truth pose obtained from ArUco markers. The plots show the linear components x, y, z and the angular components roll–pitch–yaw, all expressed in the camera frame and representing the position and orientation of the box with respect to the camera.
Figure 14.
Comparison between the pose estimated by the PnP method with fine-tuning and the ground-truth pose obtained from ArUco markers in harbor conditions. The plots show the linear components x, y, z and the angular components roll–pitch–yaw, all expressed in the camera frame and representing the position and orientation of the box with respect to the camera.
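The ArUco-based ground truth referenced in Figures 12–14 can be obtained along the following lines; the dictionary choice, marker side length, intrinsics, and image path are assumptions, and the legacy cv2.aruco calls shown require opencv-contrib (recent OpenCV releases expose an equivalent ArucoDetector class).

```python
# Minimal sketch: ArUco-based ground-truth pose. Dictionary, marker size,
# intrinsics, and the image path are illustrative placeholders.
import cv2
import numpy as np

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])  # camera intrinsics (placeholder)
dist = np.zeros(5)

gray = cv2.cvtColor(cv2.imread("frame.png"), cv2.COLOR_BGR2GRAY)
aruco_dict = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
corners, ids, _ = cv2.aruco.detectMarkers(gray, aruco_dict)

if ids is not None:
    # 0.10 m marker side length is a placeholder value.
    rvecs, tvecs, _ = cv2.aruco.estimatePoseSingleMarkers(corners, 0.10, K, dist)
    print("ground-truth t [m]:", tvecs[0].ravel())
```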
Figure 15.
Evolution of the camera-to-blackbox relative pose during the approach phase, showing the PnP-estimated position (x, y, z) and yaw angle of the camera with respect to the blackbox, along with the target reference values used for alignment.
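A hypothetical proportional alignment law consistent with the behavior shown in Figure 15 (not the controller actually used on the vehicle) could look like this; the gain and the (x, y, z, yaw) parameterization are illustrative assumptions.

```python
# Hypothetical proportional alignment law for the approach phase in Figure 15.
# Gain and pose parameterization are assumptions, not the paper's controller.
import numpy as np

def approach_command(pose_est, pose_ref, kp=0.5):
    """Velocity command proportional to the camera-to-blackbox pose error.

    pose_est, pose_ref: (x, y, z, yaw) of the camera w.r.t. the blackbox.
    """
    error = np.asarray(pose_ref, dtype=float) - np.asarray(pose_est, dtype=float)
    # Wrap the yaw error to [-pi, pi] so the vehicle turns the short way.
    error[3] = (error[3] + np.pi) % (2 * np.pi) - np.pi
    return kp * error

# Example: target is 1 m in front of the box, centered, facing it.
print(approach_command(pose_est=(1.8, 0.1, -0.2, 0.3),
                       pose_ref=(1.0, 0.0, 0.0, 0.0)))
```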
Figure 16.
Sequence of images during the approach phase. The top row shows the robot’s external view, and the bottom row shows the corresponding onboard camera images.
Figure 17.
Distance between the grasping points (left), and alignment errors with respect to the target point in the x and y directions (right).
Figure 18.
Sequence of images during the grasping phase. The top row shows the robot’s external view, and the bottom row shows the corresponding onboard camera images.
Figure 19.
Overview of the natural language agent interface and its interaction workflow.
Figure 20.
Examples of natural language prompts and the corresponding system responses.
Table 1.
Reference frames used in the hybrid-controlled ROV.
| # | Frame Name | Description |
|---|---|---|
| 1 | NED | North-East-Down reference frame |
| 2 | ROV_base_link | Main body frame of the ROV |
| 3 | Front_camera | Front-facing camera on the ROV |
| 4 | Gripper_base_link | Base frame of the Newton Gripper mounted on the ROV |
| 5 | Gripper_jaws_link | Jaws frame of the Newton Gripper |
| 6 | Gripper_camera | Camera mounted on the gripper |
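To illustrate how the frames in Table 1 might be linked in the ROS tf tree, here is a minimal static-transform sketch (ROS 1 / rospy); the numeric offsets are placeholders, not the vehicle's calibrated extrinsics.

```python
#!/usr/bin/env python
# Minimal sketch: broadcasting a static transform between two frames from
# Table 1. Translation and rotation values are placeholders, not the
# vehicle's actual calibration.
import rospy
import tf2_ros
from geometry_msgs.msg import TransformStamped

rospy.init_node("static_frame_broadcaster")
broadcaster = tf2_ros.StaticTransformBroadcaster()

t = TransformStamped()
t.header.stamp = rospy.Time.now()
t.header.frame_id = "ROV_base_link"  # parent frame (Table 1, #2)
t.child_frame_id = "Front_camera"    # child frame (Table 1, #3)
t.transform.translation.x = 0.20     # placeholder offset [m]
t.transform.translation.y = 0.0
t.transform.translation.z = 0.05
t.transform.rotation.w = 1.0         # identity rotation (placeholder)

broadcaster.sendTransform(t)
rospy.spin()
```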
Table 2.
Translation error in x [m].
| Metric | Sim. | Real + Sim. | Real FT | Sea | Sea (Filtered) |
|---|---|---|---|---|---|
| Avg | −0.0071 | −0.1041 | −0.0567 | −0.1060 | −0.0588 |
| Std | 0.0551 | 0.0833 | 0.0515 | 0.0184 | 0.0762 |
| RMSE | 0.0556 | 0.1334 | 0.0766 | 0.1076 | 0.0962 |
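The Avg, Std, and RMSE statistics reported in Tables 2–7 can be reproduced from a per-frame error signal (estimate minus ground truth) as in the sketch below; the sample values are placeholders.

```python
# Minimal sketch of the statistics in Tables 2-7, computed from a per-frame
# error signal (estimate minus ground truth). Sample values are placeholders.
import numpy as np

errors = np.array([-0.012, 0.034, -0.051, 0.008, -0.027])  # placeholder [m]

avg = errors.mean()                   # Avg: mean (signed) error
std = errors.std()                    # Std: standard deviation of the error
rmse = np.sqrt(np.mean(errors ** 2))  # RMSE: root-mean-square error
print(avg, std, rmse)
```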
Table 3.
Translation error in y [m].
| Metric | Sim. | Real + Sim. | Real FT | Sea | Sea (Filtered) |
|---|---|---|---|---|---|
| Avg | −0.0151 | −0.1082 | −0.1102 | −0.0981 | −0.1157 |
| Std | 0.0372 | 0.0628 | 0.0463 | 0.0264 | 0.0546 |
| RMSE | 0.0402 | 0.1251 | 0.1195 | 0.1015 | 0.1279 |
Table 4.
Translation error in z [m].
| Metric | Sim. | Real + Sim. | Real FT | Sea | Sea (Filtered) |
|---|---|---|---|---|---|
| Avg | −0.0044 | 0.0704 | 0.1667 | 0.0622 | 0.0867 |
| Std | 0.0351 | 0.1191 | 0.0735 | 0.0189 | 0.1143 |
| RMSE | 0.0354 | 0.1383 | 0.1822 | 0.0651 | 0.1435 |
Table 5.
Rotation error in roll [rad].
| Metric | Sim. | Real + Sim. | Real FT | Sea | Sea (Filtered) |
|---|---|---|---|---|---|
| Avg | −0.0245 | −0.1802 | 0.0393 | −1.3421 | 0.1556 |
| Std | 0.0803 | 0.6938 | 0.2054 | 0.8094 | 0.1315 |
| RMSE | 0.0839 | 0.7168 | 0.2091 | 1.5673 | 0.2037 |
Table 6.
Rotation error in pitch [rad].
| Metric | Sim. | Real + Sim. | Real FT | Sea | Sea (Filtered) |
|---|---|---|---|---|---|
| Avg | −0.0221 | −0.8502 | −0.2285 | −0.2020 | −0.0195 |
| Std | 0.0615 | 0.7257 | 0.3043 | 0.9473 | 0.4311 |
| RMSE | 0.0654 | 1.1178 | 0.3806 | 0.9686 | 0.4316 |
Table 7.
Rotation error in yaw [rad].
| Metric | Sim. | Real + Sim. | Real FT | Sea | Sea (Filtered) |
|---|---|---|---|---|---|
| Avg | −0.0232 | 0.2581 | −0.0296 | 1.5686 | −0.1419 |
| Std | 0.0741 | 0.7297 | 0.0952 | 0.9154 | 0.1663 |
| RMSE | 0.0777 | 0.7740 | 0.0997 | 1.8161 | 0.2186 |
Table 8.
Grasping success rate in simulation and real-world experiments.
| Metric | Simulation | Real |
|---|---|---|
| Success rate | 8/10 | 7/10 |