Research on a Fusion Technique of YOLOv8-URE-Based 2D Vision and Point Cloud for Robotic Grasping in Stacked Scenarios
Abstract
1. Introduction
2. Methodology
2.1. YOLOv8-URE
2.1.1. Overview
2.1.2. Backbone Module
2.1.3. Improved Neck Module
2.2. Point Cloud Preprocessing
2.2.1. Point Cloud Filtering
2.2.2. Point Cloud Image Segmentation
2.3. Fusion Processing
2.3.1. 3D Coordinate Transformation
2.3.2. KD-Tree Neighbor Search
2.4. Point Cloud Registration
2.5. Grasp Pose Estimation
3. Experimental Results and Analysis
3.1. Experimental Setup for Object Detection
3.2. Dataset Description
3.2.1. Dataset of Reducing Tee Pipes
3.2.2. WiderPerson Dataset
3.3. Description of the Indicator Parameters
3.4. Ablation Experiment
3.5. Analysis of Object Detection Results
3.5.1. Comparison of Model Experimental Results
3.5.2. Generalization Performance Comparison
3.6. Grasping Algorithm Experiments
3.6.1. Registration Algorithm Experiment
3.6.2. Workpiece Grasping Experiment
- In Poses 2 and 4, occlusions in the point cloud left too few feature points, which reduced pose estimation accuracy.
- Under these postures, the gripper was more likely to collide with the object, which caused grasp failures.
4. Conclusions and Future Work
4.1. Conclusions
4.2. Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Liu, C.; Yang, T. Application and industrial development of machine vision in intelligent manufacturing. Mach. Tool Hydraul. 2021, 49, 172–178. (In Chinese)
- Li, C.; Wei, X.; Zhou, Y.; Li, H. Research on control of intelligent logistics sorting robot based on laser vision guidance. Laser J. 2022, 43, 217–222. (In Chinese)
- Xue, L.; Zhou, J. Visual servo control of agricultural robot parallel picking arm. Sens. Microsyst. 2017, 36, 123–126. (In Chinese)
- Lu, Z. Research on stacking object grasping method based on deep learning. Master’s Thesis, Guangdong University of Technology, Guangzhou, China, 2020. (In Chinese)
- Li, X.; Li, J.; Zhang, X.; Peng, X. Optimal grasp posture detection method for robots based on deep learning. Chin. J. Sci. Instrum. 2020, 41, 108–117. (In Chinese)
- Guo, D.; Kong, T.; Sun, F.; Liu, H. Object discovery and grasp detection with a shared convolutional neural network. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation, Stockholm, Sweden, 16–21 May 2016; IEEE: New York, NY, USA, 2016; pp. 1234–1239.
- Geng, Z.; Chen, G. A novel real-time grasping method combined with YOLO and GDFCN. In Proceedings of the 2022 IEEE 10th Joint International Information Technology and Artificial Intelligence Conference (ITAIC), Beijing, China, 27–29 October 2022; IEEE: New York, NY, USA, 2022; pp. 500–505.
- Xiao, X.; Zheng, Y.; Sai, Q.; Fu, D. Lightweight model-based 6D pose estimation of drones. Control Eng. 2025, 24, 1–10. (In Chinese)
- Zhang, Y.; Yi, J.; Chen, Y.; Dai, Z.; Han, F.; Cao, S. Pose estimation for workpieces in complex stacking industrial scene based on RGB images. Appl. Intell. 2022, 1, 1–3.
- Guan, Q.; Sheng, Z.; Xue, S. HRPose: Real-time high-resolution 6D pose estimation network using knowledge distillation. Chin. J. Electron. 2023, 32, 189–198.
- Wang, G.; Manhardt, F.; Tombari, F.; Ji, X. GDR-Net: Geometry-guided direct regression network for monocular 6D object pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; IEEE: Los Alamitos, CA, USA, 2021; pp. 16611–16621.
- Xu, W.; Wan, Y. ELA: Efficient local attention for deep convolutional neural networks. arXiv 2024, arXiv:2403.01123.
- Zhang, H.; Zhang, S. Shape-IoU: More accurate metric considering bounding box shape and scale. arXiv 2023, arXiv:2312.17663.
- Ding, X.; Zhang, Y.; Ge, Y.; Zhao, S.; Song, L.; Yue, X.; Shan, Y. UniRepLKNet: A universal perception large-kernel ConvNet for audio, video, point cloud, time-series, and image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; IEEE: Los Alamitos, CA, USA, 2024; pp. 5513–5524.
- Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. RepVGG: Making VGG-style ConvNets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; IEEE: Los Alamitos, CA, USA, 2021; pp. 13733–13742.
- Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015; PMLR: Cambridge, MA, USA, 2015; pp. 448–456.
- Ding, X.; Zhang, X.; Han, J.; Ding, G. Scaling up your kernels to 31×31: Revisiting large kernel design in CNNs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; IEEE: Los Alamitos, CA, USA, 2022; pp. 11963–11975.
- Draelos, R.L.; Carin, L. Use HiResCAM instead of Grad-CAM for faithful explanations of convolutional neural networks. arXiv 2020, arXiv:2011.08891.
- Xu, X.; Jiang, Y.; Chen, W.; Huang, Y.; Zhang, Y.; Sun, X. DAMO-YOLO: A report on real-time object detection design. arXiv 2022, arXiv:2211.15444.
- Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to upsample by learning to sample. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–7 October 2023; IEEE: Los Alamitos, CA, USA, 2023; pp. 6027–6037.
- Baranchuk, D.; Babenko, A.; Malkov, Y. Revisiting the inverted indices for billion-scale approximate nearest neighbors. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; pp. 202–216.
Comparison Parameter | Desk Point Cloud | Workpiece Point Cloud |
---|---|---|
Number of Original Points | 460,400 | 7197 |
Number of Filtered Points | 41,049 | 2126 |
Original Point Cloud Load Time (s) | 0.562537 | 0.0231155
Filtered Point Cloud Load Time (s) | 0.052796 | 0.0105605 |
Method | Original Points | Points After Filtering | Filtering Time | Total Time |
---|---|---|---|---|
No Filtering | 626,910 | - | - | 6152 ms |
After Filtering | 626,910 | 233,783 | 1681 ms | 3677 ms |
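The tables above summarize the effect of point cloud filtering (Section 2.2.1). Since the exact filter chain is not spelled out in this excerpt, the sketch below only illustrates one plausible pipeline in Open3D, voxel-grid downsampling followed by statistical outlier removal; the file path, voxel size, and outlier thresholds are placeholder assumptions rather than the paper's settings.

```python
# Illustrative point cloud filtering sketch (assumed pipeline, not the paper's exact one):
# voxel-grid downsampling + statistical outlier removal with Open3D.
import open3d as o3d

def filter_cloud(path, voxel_size=0.005, nb_neighbors=20, std_ratio=2.0):
    pcd = o3d.io.read_point_cloud(path)                  # load the raw scene cloud
    print(f"original points: {len(pcd.points)}")
    down = pcd.voxel_down_sample(voxel_size=voxel_size)  # keep one point per voxel
    clean, _ = down.remove_statistical_outlier(          # drop sparse, noisy points
        nb_neighbors=nb_neighbors, std_ratio=std_ratio)
    print(f"filtered points: {len(clean.points)}")
    return clean

# Usage (the .pcd path is a placeholder):
# desk = filter_cloud("desk_scene.pcd")
```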
Number of Points | Search Radius (m) | KD-Tree Radius Search Time (ms) | Linear Search Time (ms) |
---|---|---|---|
1000 | 0.02 | 74 | 47 |
10,000 | 0.02 | 184 | 501 |
50,000 | 0.02 | 774 | 2387 |
100,000 | 0.02 | 1752 | 5141 |
500,000 | 0.02 | 7383 | 19,378 |
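To make the KD-tree versus linear radius search comparison above concrete, the following benchmark sketch repeats it on synthetic data with SciPy's cKDTree. Only the point counts and the 0.02 m radius come from the table; the random clouds, query count, and timing harness are illustrative assumptions. As the table suggests, a brute-force scan can win on very small clouds because building the KD-tree has its own cost, while the tree dominates as the cloud grows.

```python
# Minimal KD-tree vs. linear radius-search timing sketch (synthetic data, illustrative only).
import time
import numpy as np
from scipy.spatial import cKDTree

def radius_search_benchmark(n_points, radius=0.02, n_queries=100):
    rng = np.random.default_rng(0)
    pts = rng.random((n_points, 3))          # synthetic cloud inside a 1 m cube
    queries = rng.random((n_queries, 3))

    t0 = time.perf_counter()
    tree = cKDTree(pts)                      # tree construction is counted in the KD-tree time
    kd = [tree.query_ball_point(q, radius) for q in queries]
    t_kd = time.perf_counter() - t0

    t0 = time.perf_counter()
    lin = [np.where(np.linalg.norm(pts - q, axis=1) <= radius)[0] for q in queries]
    t_lin = time.perf_counter() - t0

    assert all(sorted(a) == sorted(b.tolist()) for a, b in zip(kd, lin))  # same neighbours found
    return t_kd, t_lin

for n in (1_000, 10_000, 50_000, 100_000):
    t_kd, t_lin = radius_search_benchmark(n)
    print(f"{n:>7} pts  KD-tree {t_kd * 1e3:8.1f} ms   linear {t_lin * 1e3:8.1f} ms")
```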
Training Parameter | Value |
---|---|
Input Image Size | 640 |
Number of Epochs | 100 |
Batch Size | 16
Optimizer | SGD |
Momentum of Optimizer | 0.937 |
Optimizer Weight Decay Factor | 0.0005 |
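For readers who want to reproduce the training setup above, a minimal sketch using the Ultralytics YOLO API is given below. The dataset YAML name is a placeholder, and the stock yolov8n.yaml stands in for the paper's modified YOLOv8-URE definition, whose custom modules (UniRepLKNet blocks, ELA attention, the improved neck, Shape-IoU loss) are not part of the stock package and would have to be registered separately.

```python
# Hedged training sketch matching the parameter table above (Ultralytics API).
from ultralytics import YOLO

# "yolov8n.yaml" is a stand-in; the actual YOLOv8-URE model YAML is not reproduced here.
model = YOLO("yolov8n.yaml")

model.train(
    data="tee_pipes.yaml",   # hypothetical dataset config for the reducing-tee dataset
    imgsz=640,               # input image size
    epochs=100,
    batch=16,
    optimizer="SGD",
    momentum=0.937,
    weight_decay=0.0005,
)
```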
Algorithm | Uni | ELA | Improved Neck | Shape-IoU | P (%) | R (%) | mAP@0.5 (%) | Parameters (M) | GFLOPs | FPS |
---|---|---|---|---|---|---|---|---|---|---|
Base | × | × | × | × | 96.5 | 90.3 | 96.5 | 3.15 | 8.9 | 614 |
Base-1 | √ | × | × | × | 94.8 | 92.0 | 96.5 | 2.16 | 5.9 | 660 |
Base-2 | × | √ | × | × | 96.8 | 94.0 | 95.9 | 3.07 | 8.2 | 640 |
Base-3 | × | × | √ | × | 95.8 | 95.0 | 97.3 | 3.26 | 8.2 | 605 |
Base-4 | × | × | × | √ | 94.9 | 94.2 | 96.3 | 3.0 | 8.0 | 650 |
Base-5 | √ | √ | × | × | 91.5 | 94.8 | 96.7 | 2.09 | 5.9 | 662 |
Base-6 | √ | √ | √ | × | 95.9 | 95.0 | 96.7 | 2.28 | 6.1 | 632 |
Base-7 | √ | √ | √ | √ | 97.6 | 95.4 | 98.3 | 2.29 | 6.1 | 639 |
Number of Targets | Missed Detections | False Detections | Correct Detections | Detection Accuracy |
---|---|---|---|---|
15 | 0 ± 0/0 ± 0 | 0 ± 0/0 ± 0 | 15 ± 0/15 ± 0 | 100% ± 0.00%/100% ± 0.00% |
20 | 0 ± 0/0.4 ± 0.49 | 0 ± 0/0.2 ± 0.4 | 20 ± 0/19.4 ± 0.49 | 100% ± 0.00%/97% ± 2.50%
25 | 0.2 ± 0.4/3.2 ± 0.75 | 0.4 ± 0.49/0.6 ± 0.49 | 24.4 ± 0.49/21.2 ± 0.75 | 97.6% ± 1.96%/84.8% ± 2.99%
30 | 2 ± 0.63/4 ± 0.89 | 0.8 ± 0.75/1.8 ± 0.4 | 27.2 ± 0.75/24.2 ± 0.75 | 90.67% ± 2.49%/80.67% ± 2.49%
35 | 5.4 ± 0.49/8 ± 0.63 | 2.4 ± 0.5/3.2 ± 0.75 | 27.2 ± 0.8/23.8 ± 0.98 | 77.71% ± 2.14%/68% ± 2.8%
40 | 8 ± 1.17/11.8 ± 0.75 | 3.8 ± 0.75/4.4 ± 0.8 | 27.4 ± 1.02/23.8 ± 0.4 | 68.5% ± 2.25%/59.5% ± 1.00%
Number of Targets | Missed Detections | False Detections | Correct Detections | Detection Accuracy |
---|---|---|---|---|
15 | 0 ± 0/0 ± 0 | 0 ± 0/2 ± 0.5 | 15 ± 0/15 ± 0 | 100% ± 0.00%/100% ± 0.00% |
20 | 0.4 ± 0.49/2.6 ± 0.8 | 0.2 ± 0.4/1.6 ± 0.49 | 19.4 ± 0.49/15.8 ± 0.75 | 97% ± 2.45%/79% ± 3.70% |
25 | 1.4 ± 0.8/4 ± 0.89 | 2.4 ± 1.02/2.6 ± 0.49 | 21.2 ± 1.33/18.4 ± 1.02 | 84.8% ± 5.31%/73.6% ± 4.08% |
30 | 1.8 ± 0.75/5.2 ± 0.75 | 3.2 ± 0.75/2.8 ± 0.75 | 25 ± 1.1/22 ± 0.63 | 83.33% ± 3.65%/73.33% ± 2.11% |
35 | 5.2 ± 0.75/8 ± 0.63 | 3.4 ± 1.02/3.6 ± 0.49 | 26.4 ± 0.6/23.4 ± 0.8 | 75.43% ± 2.91%/66.86% ± 2.29% |
40 | 9 ± 0.75/13 ± 0.63 | 3.2 ± 0.75/4.2 ± 0.75 | 27 ± 0.63/22.8 ± 1.17 | 67.50% ± 1.5%/57% ± 2.92% |
Algorithm | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Size (MB) |
---|---|---|---|---|
YOLOv8n | 79.7 ± 0.8 | 88.4 ± 0.6 | 62.3 ± 1.0 | 6.5 |
YOLOv7-tiny | 80.2 ± 0.7 | 86.9 ± 0.8 | 54.3 ± 1.2 | 11.7 |
YOLOv4 | 75.2 ± 0.9 | 84.9 ± 0.7 | 51.9 ± 1.3 | 17.76 |
YOLOv3 | 70.3 ± 1.1 | 82.0 ± 1.0 | 47.6 ± 1.5 | 117.7 |
YOLOv5s | 74.1 ± 0.6 | 86.3 ± 0.7 | 55.2 ± 1.1 | 13.78 |
Faster R-CNN | 71.3 ± 0.8 | 87.2 ± 0.6 | 56.9 ± 1.3 | 108 |
SSD | 64.8 ± 1.2 | 75.9 ± 1.5 | 47.8 ± 1.4 | 92.6 |
YOLOv8-URE | 80.4 ± 0.5 | 88.3 ± 0.6 | 62.4 ± 0.8 | 4.46 |
Metric | Control Algorithm | Proposed Algorithm |
---|---|---|
Running Time (s) | 10.5949 | 4.18396 |
RMSE | 0.317391 | 0.052965 |
MAE | 0.267475 | 0.051005 |
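The comparison above reports runtime, RMSE, and MAE, but this excerpt does not spell out the coarse and fine registration stages. The sketch below therefore shows a generic coarse-to-fine pipeline of the same shape in Open3D (FPFH features with RANSAC for the initial transform, point-to-point ICP for refinement); the voxel size and distance thresholds are assumptions, not the paper's values, and the RANSAC call follows the signature of recent Open3D releases.

```python
# Generic coarse-to-fine registration sketch (assumed pipeline, Open3D >= 0.13).
import open3d as o3d

def register(source, target, voxel=0.005):
    def preprocess(pcd):
        down = pcd.voxel_down_sample(voxel)
        down.estimate_normals(
            o3d.geometry.KDTreeSearchParamHybrid(radius=voxel * 2, max_nn=30))
        fpfh = o3d.pipelines.registration.compute_fpfh_feature(
            down, o3d.geometry.KDTreeSearchParamHybrid(radius=voxel * 5, max_nn=100))
        return down, fpfh

    src_d, src_f = preprocess(source)
    tgt_d, tgt_f = preprocess(target)

    # Coarse alignment: RANSAC over FPFH feature correspondences.
    coarse = o3d.pipelines.registration.registration_ransac_based_on_feature_matching(
        src_d, tgt_d, src_f, tgt_f, True, voxel * 1.5,
        o3d.pipelines.registration.TransformationEstimationPointToPoint(False),
        3, [], o3d.pipelines.registration.RANSACConvergenceCriteria(100000, 0.999))

    # Fine alignment: point-to-point ICP seeded with the coarse transform.
    fine = o3d.pipelines.registration.registration_icp(
        source, target, voxel * 0.4, coarse.transformation,
        o3d.pipelines.registration.TransformationEstimationPointToPoint())
    return fine.transformation, fine.inlier_rmse

# Usage: T, rmse = register(o3d.io.read_point_cloud("scene.pcd"),
#                           o3d.io.read_point_cloud("model.pcd"))
```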
Object Pose | Grasp Attempts | Failures | Success Rate |
---|---|---|---|
Pose 1 | 25 | 3 | 88%
Pose 2 | 25 | 4 | 84%
Pose 3 | 25 | 2 | 92%
Pose 4 | 25 | 5 | 80%