GPTArm: An Autonomous Task Planning Manipulator Grasping System Based on Vision–Language Models
Abstract
1. Introduction
- Development of the RTPF: We propose a novel framework integrating GPT-4V with robotic arm control systems, leveraging environmental data from vision sensors and advanced reasoning capabilities. The RTPF implements a hierarchical decomposition paradigm that accomplishes the following:
- Autonomously plans task execution based on user-defined objectives, generating context-aware subtasks in real time.
- Selects optimal motion primitives from a predefined library to ensure precise and efficient operations.
- Validates and corrects execution outcomes through real-time closed-loop feedback, ensuring robust performance in dynamic environments.
- Environment-Aware Adaptive Planning: We introduce a hybrid natural language processing pipeline that autonomously extracts semantic objectives from unstructured instructions and generates environment-adaptive strategies, supported by multimodal voice–text interaction.
- Cross-Model Performance Benchmarking: A comprehensive evaluation of state-of-the-art VLMs (Fuyu-8B, Qwen-VL-Plus, GPT-4V) reveals that GPT-4V excels in task decomposition and execution accuracy under dynamic conditions. Experimental results highlight the framework’s exceptional versatility, achieving consistent performance across diverse tasks—from simple object grasping to complex, multi-step assembly operations.
2. Materials and Methods
2.1. RTPF: A Hierarchical Framework for Environment-Aware Task Planning
- Detection Layer: Implements real-time sensor fusion and environmental mapping.
- Strategy Layer: Employs GPT-4V for contextual task decomposition and constraint satisfaction.
- Execution Layer: Converts abstract plans into hardware-specific control signals.
- Evaluation Layer: Provides continuous performance monitoring and error compensation.
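To make the interplay of these four layers concrete, the sketch below composes them into a single closed loop. The class and method names (DetectionLayer, StrategyLayer, ExecutionLayer, EvaluationLayer, rtpf_round) are illustrative assumptions for this article only; the authors’ implementation is not reproduced here.

```python
# Minimal sketch of one RTPF closed-loop round trip; all class/method names
# and placeholder return values are assumptions for illustration.

class DetectionLayer:
    def observe(self) -> dict:
        """Capture a frame, denoise, grid-partition, and return a scene description."""
        return {"objects": [], "image": None}          # placeholder scene

class StrategyLayer:
    def plan(self, scene: dict, instruction: str) -> list[str]:
        """Query GPT-4V for a context-aware sub-task sequence."""
        return ["move_to(target)", "grasp()", "release()"]   # placeholder plan

class ExecutionLayer:
    def run(self, plan: list[str]) -> None:
        """Map each sub-task to a predefined motion primitive and execute it."""
        for step in plan:
            pass                                        # hardware commands would go here

class EvaluationLayer:
    def check(self, scene_after: dict, plan: list[str]) -> bool:
        """Verify the post-execution scene against the plan's goal conditions."""
        return True                                     # placeholder verdict

def rtpf_round(instruction: str, max_replans: int = 3) -> bool:
    detect, strategy, execute, evaluate = (DetectionLayer(), StrategyLayer(),
                                           ExecutionLayer(), EvaluationLayer())
    for _ in range(max_replans):
        scene = detect.observe()
        plan = strategy.plan(scene, instruction)
        execute.run(plan)
        if evaluate.check(detect.observe(), plan):
            return True                                 # all goal conditions met
    return False                                        # exhausted re-planning budget
```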
2.1.1. Detection Layer
2.1.2. Strategy Layer
- (1) Task Decomposition.
- (2) Strategy Planning.
- (3) Secondary Verification.
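As an illustration of how the strategy layer’s GPT-4V query could be issued, the snippet below sends the captured frame and the user instruction to the OpenAI chat API and requests a numbered sub-task list. The prompt wording, the model identifier, and the primitive names are assumptions, not the authors’ exact pre-instructions.

```python
import base64
from openai import OpenAI  # assumes the official openai Python package (v1+)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def plan_subtasks(image_path: str, instruction: str) -> str:
    """Ask GPT-4V to decompose a user instruction into ordered sub-tasks."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed model identifier
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Instruction: {instruction}\n"
                         "Decompose this into numbered sub-tasks, each mapped to one "
                         "motion primitive from: move_to, grasp, release, stack."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Example: plan_subtasks("workspace.jpg", "Put the nut on top of the red block")
```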
2.1.3. Execution Layer
2.1.4. Evaluation Layer
- N denotes the total number of task steps;
- α represents the evaluation factor;
- m is the number of steps designated for correction.
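The evaluation function E(k, N, α) used in Algorithm 1 is not reproduced in this excerpt. One plausible formulation, offered only as an assumption consistent with the variable definitions above and with the α settings examined in Section 3.2.2, designates roughly m = ⌈αN⌉ evenly spaced checkpoints and always checks the final step:

$$
m = \lceil \alpha N \rceil, \qquad
E(k, N, \alpha) =
\begin{cases}
\text{True}, & k = N \ \text{or} \ k \in \mathcal{K}_m,\\
\text{False}, & \text{otherwise},
\end{cases}
$$

where \(\mathcal{K}_m\) denotes a set of m step indices spread evenly over \(\{1, \dots, N\}\).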
Algorithm 1: RTPF
1:  Initialize camera C, task M, natural language instruction I, evaluation factor α
2:  while task not completed do
3:      Capture raw image x_t = C(t)
4:      Preprocess x_t  // noise reduction, grid partitioning
5:      p = GPT-4V(x_t, I)  // generate initial plan
6:      if validate_strategy(p) == False then
7:          p_final = error_correct(p)  // secondary verification
8:      else
9:          p_final = p
10:     end if
11:     code = extract_motion_code(p_final, M)  // map to motion functions
12:     Execute code, update gripper pose
13:     x_new = C(t_new)
14:     for k = 1 to N do  // N: total steps in p_final
15:         if E(k, N, α) == True then
16:             result = GPT-4V(x_new, p_final)  // apply evaluation feedback
17:             if result == False then
18:                 Trigger feedback loop: send x_new to strategy layer for re-planning
19:             end if
20:         end if
21:     end for
22: end while
23: return Success
2.1.5. Pre-Instructions of RTPF Layers
2.2. Intelligent Human–Robot Interaction via Integrated Voice and Text Command
- Strict Dependence on Precision in Instructions: Conventional systems can only respond to explicit, precise operational commands and are incapable of handling ambiguous or vague natural language expressions.
- Limited Recognition Capability: Traditional robotic arms rely on pre-trained image recognition models and can only recognize objects included in their training datasets, rendering them ineffective at identifying new or unseen objects.
3. Results
3.1. Experiment Setup and Evaluation
3.1.1. System Setup
3.1.2. Evaluation Metrics
- Success Rate (SR). The proportion of tasks in which all intended goal conditions are met by the end of the robotic arm’s operation. For tasks comprising multiple sub-steps, a single failure in any crucial step renders the entire task unsuccessful.
- Executability (EXEC). A measure of how many planned action steps can be physically executed without error, i.e., the fraction of executable steps among all planned steps (a hedged formulation is given after this list).
- Goal Condition Recall (GCR). An indicator of how many final goal conditions are satisfied relative to the total conditions specified (likewise sketched after this list).
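The EXEC and GCR formulas themselves are not reproduced in this excerpt. Using the TGC (total goal condition) and UGC (unmet goal condition) abbreviations defined for this paper, a formulation consistent with the definitions above would be:

$$
\mathrm{EXEC} = \frac{N_{\mathrm{exec}}}{N_{\mathrm{plan}}}, \qquad
\mathrm{GCR} = \frac{\mathrm{TGC} - \mathrm{UGC}}{\mathrm{TGC}},
$$

where \(N_{\mathrm{exec}}\) is the number of planned steps the arm can physically execute, \(N_{\mathrm{plan}}\) is the total number of planned steps, and TGC and UGC are counted at the end of the task.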
3.2. RTPF Validation: Task-Specific Performance Analysis
- B (basic) tasks entail direct single-object grasps under relatively clear conditions.
- I (intermediate) tasks incorporate either two-step sequential operations or target objects located in less favorable viewing conditions.
- A (advanced) tasks involve multi-step planning or partial re-planning upon receiving updated instructions.
- HA (highly advanced) tasks increase the spatial or sequential complexity further, often requiring stacking operations and dynamic corrections.
3.2.1. Strategy Layer Performance Evaluation
3.2.2. Evaluation Layer Performance Evaluation
- α = 0: checks only when the entire task finishes;
- α = 0.5: checks approximately half of the steps;
- α = 1: checks every step in real time (a sketch of one possible step-selection rule follows this list).
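The three settings can be made concrete with the small sketch below, which spells out one possible step-selection rule. It is an assumption consistent with the descriptions above (and with the formulation sketched in Section 2.1.4), not the paper’s released code.

```python
import math

def checked_steps(n_steps: int, alpha: float) -> list[int]:
    """Step indices (1-based) at which the evaluation layer runs.

    Assumed rule for illustration: about alpha * n_steps evenly spaced
    checkpoints, always including the final step (alpha = 0 -> final check only).
    """
    m = math.ceil(alpha * n_steps)
    if m <= 1:
        return [n_steps]                        # alpha = 0: check only at task completion
    stride = n_steps / m
    steps = sorted({round(i * stride) for i in range(1, m + 1)})
    steps[-1] = n_steps                         # guarantee the final step is checked
    return steps

# For a 6-step task:
# checked_steps(6, 0)   -> [6]                  check only when the task finishes
# checked_steps(6, 0.5) -> [2, 4, 6]            roughly half of the steps
# checked_steps(6, 1)   -> [1, 2, 3, 4, 5, 6]   every step in real time
```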
3.2.3. Comparison of the RTPF with State-of-the-Art Frameworks and Different VLMs
3.3. The Performance Evaluation of GPTArm Across Multiple Tasks
3.3.1. Overview of Completed Task Set
- Seen (Tasks 1, 2, 4, 5, and 7): The target objects for these tasks were present in the YOLOv10 training set, so the system had prior visual knowledge of each item.
- Unseen (Tasks 3, 6, 8, 9, and 10): The target objects for these tasks never appeared in the YOLOv10 training data, posing a more substantial challenge that requires GPTArm to adapt to novel shapes, colors, or configurations.
3.3.2. Illustrative Example of Autonomous Task Planning
3.3.3. Non-Customized Interaction and Dynamic Re-Planning
4. Discussion
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
VLMs | vision–language models |
LLMs | large language models |
RTPF | robotic task processing framework |
SR | success rate |
EXEC | executability |
GCR | goal condition recall |
TGC | total goal condition |
UGC | unmet goal condition |
DoF | degrees of freedom |
IK | inverse kinematics |
Appendix A
Appendix A.1. YOLOv10 Performance Analysis
Appendix A.2. Robotic Arm Grasping Algorithm
Appendix A.2.1. Overview of the Robotic Arm Grasping Task
Appendix A.2.2. Target Detection and Grasping Strategy
Appendix A.2.3. Offset Calculation
Conversion from Pixel Coordinates to Camera Coordinates
Conversion from Camera Coordinates to World Coordinates
Calculate the Target’s Offset Relative to the Robotic Arm
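The three conversions named above follow the standard pinhole-camera pipeline. The sketch below shows one way to chain them; the intrinsic matrix, extrinsic pose, arm-base position, and depth value are placeholder assumptions standing in for the calibrated quantities used by the real system.

```python
import numpy as np

# Placeholder intrinsics/extrinsics for illustration only; real values come
# from camera calibration and the arm-camera mounting geometry.
K = np.array([[600.0,   0.0, 320.0],
              [  0.0, 600.0, 240.0],
              [  0.0,   0.0,   1.0]])           # camera intrinsic matrix
R = np.eye(3)                                    # camera-to-world rotation
t = np.array([0.0, 0.0, 0.50])                   # camera position in world frame (m)
ARM_BASE = np.array([0.10, 0.00, 0.00])          # arm base in world frame (m)

def pixel_to_camera(u: float, v: float, depth: float) -> np.ndarray:
    """Back-project a pixel (u, v) with known depth (m) into camera coordinates."""
    uv1 = np.array([u, v, 1.0])
    return depth * np.linalg.inv(K) @ uv1

def camera_to_world(p_cam: np.ndarray) -> np.ndarray:
    """Apply the camera extrinsics to obtain world coordinates."""
    return R @ p_cam + t

def offset_from_arm(u: float, v: float, depth: float) -> np.ndarray:
    """Target offset relative to the robotic arm base, in world coordinates."""
    return camera_to_world(pixel_to_camera(u, v, depth)) - ARM_BASE

# e.g. offset_from_arm(400, 260, 0.35) -> XYZ offset passed to the grasp planner
```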
Appendix A.3. Robotic Arm Motion Analysis
Appendix A.3.1. Robotic Arm IK Calculation
Appendix A.3.2. PID Control for Adjusting Robotic Arm Motion
- Yaw axis control: Based on the horizontal offset, the PID algorithm computes a speed adjustment applied to the sixth motor, which is responsible for horizontal rotation, to regulate the robotic arm’s horizontal angle.
- Pitch axis control: Based on the vertical offset, the PID algorithm calculates a speed adjustment primarily for the third motor, the main vertical rotation axis motor of the robotic arm. When the third motor reaches its limit, the adjustment is transferred to the fifth auxiliary axis motor; otherwise, the fifth motor remains stationary.
- ei(k) is the difference between the target angle and the current angle.
- ei(k − 1) is the difference between the target angle and the angle from the previous step.
- ∑ei(j) represents the cumulative sum of the angular differences over all previous steps.
- KP is the proportional gain, which controls the response speed of the robotic arm.
- KI is the integral gain, which reduces the steady-state error.
- KD is the derivative gain, which minimizes overshoot and improves system stability.
- ui(k) is the output PWM signal value.
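Taken together, these terms describe the standard discrete positional PID law, ui(k) = KP·ei(k) + KI·∑ei(j) + KD·[ei(k) − ei(k−1)]. The sketch below implements that law for one axis; the gains and PWM limit are illustrative assumptions rather than the paper’s tuned values.

```python
class PIDAxis:
    """Positional PID for one axis (yaw or pitch), matching the terms above.

    Gains and output limit are illustrative assumptions, not the paper's values.
    """
    def __init__(self, kp: float, ki: float, kd: float, out_limit: float = 1000.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.out_limit = out_limit
        self.prev_error = 0.0
        self.error_sum = 0.0

    def update(self, target_angle: float, current_angle: float) -> float:
        error = target_angle - current_angle           # e_i(k)
        self.error_sum += error                        # cumulative sum of errors
        derivative = error - self.prev_error           # e_i(k) - e_i(k-1)
        self.prev_error = error
        u = self.kp * error + self.ki * self.error_sum + self.kd * derivative
        return max(-self.out_limit, min(self.out_limit, u))   # clamped PWM command u_i(k)

# One controller per axis, e.g.:
# yaw_pid   = PIDAxis(kp=2.0, ki=0.05, kd=0.5)   # drives the sixth (horizontal) motor
# pitch_pid = PIDAxis(kp=2.0, ki=0.05, kd=0.5)   # drives the third (or fifth) motor
# pwm = yaw_pid.update(target_yaw, measured_yaw)
```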
References
- Yang, Y.; Yu, H.; Lou, X.; Liu, Y.; Choi, C. Attribute-Based Robotic Grasping with Data-Efficient Adaptation. IEEE Trans. Robot. 2024, 40, 1566–1579. [Google Scholar] [CrossRef]
- Yang, X.; Zhou, Z.; Sørensen, J.H.; Christensen, C.B.; Ünalan, M.; Zhang, X. Automation of SME production with a Cobot system powered by learning-based vision. Robot. Comput.-Integr. Manuf. 2023, 83, 102564. [Google Scholar] [CrossRef]
- Ge, Y.; Zhang, S.; Cai, Y.; Lu, T.; Wang, H.; Hui, X.; Wang, S. Ontology based autonomous robot task processing framework. Front. Neurorobot. 2024, 18, 1401075. [Google Scholar] [CrossRef] [PubMed]
- Shanthi, M.D.; Hermans, T. Pick and Place Planning is Better than Pick Planning then Place Planning. IEEE Robot. Autom. Lett. 2024, 9, 2790–2797. [Google Scholar] [CrossRef]
- Hao, Z.; Chen, G.; Huang, Z.; Jia, Q.; Liu, Y.; Yao, Z. Coordinated Transportation of Dual-arm Robot Based on Deep Reinforcement Learning. In Proceedings of the 19th IEEE Conference on Industrial Electronics and Applications (ICIEA 2024), Kristiansand, Norway, 5–8 August 2024. [Google Scholar]
- Reddy, A.B.; Mahesh, K.M.; Prabha, M.; Selvan, R.S. Design and implementation of A Bio-Inspired Robot Arm: Machine learning, Robot vision. In Proceedings of the International Conference on New Frontiers in Communication, Automation, Management and Security (ICCAMS 2023), Bangalore, India, 27–28 October 2023. [Google Scholar]
- Farag, M.; Abd Ghafar, A.N.; Alsibai, M.H. Real-time robotic grasping and localization using deep learning-based object detection technique. In Proceedings of the IEEE International Conference on Automatic Control and Intelligent Systems (I2CACIS 2019), Selangor, Malaysia, 29 June 2019. [Google Scholar]
- Ban, S.; Lee, Y.J.; Yu, K.J.; Chang, J.W.; Kim, J.-H.; Yeo, W.-H. Persistent human–machine interfaces for robotic arm control via gaze and eye direction tracking. Adv. Intell. Syst. 2023, 5, 2200408. [Google Scholar] [CrossRef]
- Li, X.; Liu, L.; Zhang, Z.; Guo, X.; Cui, J. Autonomous Discovery of Robot Structure and Motion Control Through Large Vision Models. In Proceedings of the IEEE International Conference on Cybernetics and Intelligent Systems (CIS) and IEEE International Conference on Robotics, Automation and Mechatronics (RAM 2024), Hangzhou, China, 8–11 August 2024. [Google Scholar]
- Shi, B.; Cai, H.; Gao, H.; Ou, Y.; Wang, D. The Robot’s Understanding of Classification Concepts Based on Large Language Model. In Proceedings of the IEEE International Conference on Advanced Robotics and Its Social Impacts (ARSO 2024), Hong Kong, China, 20–22 May 2024. [Google Scholar]
- He, H.; Li, Y.; Chen, J.; Guo, Y.; Bi, X.; Dong, E. A Human-Robot Interaction Dual-Arm Robot System for Power Distribution Network. In Proceedings of the China Automation Congress (CAC 2023), Chongqing, China, 17–19 November 2023. [Google Scholar]
- Cho, J.; Choi, D.; Park, J.H. Sensorless variable admittance control for human-robot interaction of a dual-arm social robot. IEEE Access 2023, 11, 69366–69377. [Google Scholar] [CrossRef]
- Dimitropoulos, N.; Papalexis, P.; Michalos, G.; Makris, S. Advancing Human-Robot Interaction Using AI—A Large Language Model (LLM) Approach. In Proceedings of the European Symposium on Artificial Intelligence in Manufacturing (ESAIM 2023), Kaiserslautern, Germany, 19 September 2023. [Google Scholar]
- Tziafas, G.; Kasaei, H. Towards open-world grasping with large vision-language models. arXiv 2024, arXiv:2406.18722. [Google Scholar]
- Mirjalili, R.; Krawez, M.; Silenzi, S.; Blei, Y.; Burgard, W. Lan-grasp: Using large language models for semantic object grasping. arXiv 2023, arXiv:2310.05239. [Google Scholar]
- Luo, H.; Guo, Z.; Wu, Z.; Teng, F.; Li, T. Transformer-based vision-language alignment for robot navigation and question answering. Inf. Fusion 2024, 108, 102351. [Google Scholar] [CrossRef]
- Que, H.; Pan, W.; Xu, J.; Luo, H.; Wang, P.; Zhang, L. “Pass the butter”: A study on desktop-classic multitasking robotic arm based on advanced YOLOv7 and BERT. arXiv 2024, arXiv:2405.17250. [Google Scholar]
- Chen, X.; Yang, J.; He, Z.; Yang, H.; Zhao, Q.; Shi, Y. QwenGrasp: A Usage of Large Vision Language Model for Target-oriented Grasping. arXiv 2023, arXiv:2309.16426. [Google Scholar]
- Wang, R.; Yang, Z.; Zhao, Z.; Tong, X.; Hong, Z.; Qian, K. LLM-based Robot Task Planning with Exceptional Handling for General Purpose Service Robots. arXiv 2024, arXiv:2405.15646. [Google Scholar]
- Vemprala, S.H.; Bonatti, R.; Bucker, A.; Kapoor, A. Chatgpt for robotics: Design principles and model abilities. IEEE Access 2024, 12, 55682–55696. [Google Scholar]
- Mao, J.W. A Framework for LLM-Based Lifelong Learning in Robot Manipulation. Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, February 2024. [Google Scholar]
- Wang, B.; Zhang, J.; Dong, S.; Fang, I.; Feng, C. Vlm see, robot do: Human demo video to robot action plan via vision language model. arXiv 2024, arXiv:2410.08792. [Google Scholar]
- Zhang, Y.; Xin, D.; Yang, M.; Xu, S.; Wang, C. Research on Dual Robotic Arm Path Planning Based on Steering Wheel Sewing Device. In Proceedings of the 6th International Symposium on Autonomous Systems (ISAS 2023), Nanjing, China, 23–25 June 2023. [Google Scholar]
- Fu, K.; Dang, X. Light-Weight Convolutional Neural Networks for Generative Robotic Grasping. IEEE Trans. Ind. Inform. 2024, 20, 6696–6707. [Google Scholar]
- Chi, M.; Chang, S.; Guo, Z.; Huang, S.; Li, Z.; Li, J.; Xia, Z.; Zheng, Z.; Ren, Q. Research on Target Recognition and Grasping of Dual-arm Cooperative Mobile Robot Based on Vision. In Proceedings of the International Symposium on Intelligent Robotics and Systems (ISoIRS 2024), Changsha, China, 14–16 June 2024. [Google Scholar]
- Ko, D.-K.; Lee, K.-W.; Lee, D.H.; Lim, S.-C. Vision-based interaction force estimation for robot grip motion without tactile/force sensor. Expert Syst. Appl. 2023, 211, 118441. [Google Scholar] [CrossRef]
- Bhat, V.; Kaypak, A.U.; Krishnamurthy, P.; Karri, R.; Khorrami, F. Grounding LLMs For Robot Task Planning Using Closed-loop State Feedback. arXiv 2024, arXiv:2402.08546. [Google Scholar]
- Jin, Y.; Li, D.; Yong, A.; Shi, J.; Hao, P.; Sun, F.; Zhang, J.; Fang, B. Robotgpt: Robot manipulation learning from chatgpt. IEEE Robot. Autom. Lett. 2024, 9, 2543–2550. [Google Scholar]
- Brohan, A.; Brown, N.; Carbajal, J.; Chebotar, Y.; Chen, X.; Choromanski, K.; Ding, T.; Driess, D.; Dubey, A.; Finn, C. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv 2023, arXiv:2307.15818. [Google Scholar]
- Li, B.; Wu, P.; Abbeel, P.; Malik, J. Interactive task planning with language models. arXiv 2023, arXiv:2310.10645. [Google Scholar]
- Mei, A.; Zhu, G.-N.; Zhang, H.; Gan, Z. ReplanVLM: Replanning robotic tasks with visual language models. IEEE Robot. Autom. Lett. 2024, 9, 10201–10208. [Google Scholar]
- Bernardo, R.; Sousa, J.M.; Gonçalves, P. Ontological framework for high-level task replanning for autonomous robotic systems. Robot. Auton. Syst. 2025, 184, 104861. [Google Scholar]
- Osada, M.; Garcia Ricardez, G.A.; Suzuki, Y.; Taniguchi, T. Reflectance estimation for proximity sensing by vision-language models: Utilizing distributional semantics for low-level cognition in robotics. Adv. Robot. 2024, 38, 1287–1306. [Google Scholar]
- Han, Y.; Yu, K.; Batra, R.; Boyd, N.; Mehta, C.; Zhao, T.; She, Y.; Hutchinson, S.; Zhao, Y. Learning generalizable vision-tactile robotic grasping strategy for deformable objects via transformer. IEEE/ASME Trans. Mechatron. 2024, 30, 554–566. [Google Scholar]
- Hofer, M.; Sferrazza, C.; D’Andrea, R. A vision-based sensing approach for a spherical soft robotic arm. Front. Robot. AI 2021, 8, 630935. [Google Scholar]
- Shi, W.; Wang, K.; Zhao, C.; Tian, M. Compliant control of dual-arm robot in an unknown environment. In Proceedings of the 7th International Conference on Control and Robotics Engineering (ICCRE 2022), Beijing, China, 15–17 April 2022. [Google Scholar]
- Suphalak, K.; Klanpet, N.; Sikaressakul, N.; Prongnuch, S. Robot Arm Control System via Ethernet with Kinect V2 Camera for use in Hazardous Areas. In Proceedings of the 1st International Conference on Robotics, Engineering, Science, and Technology (RESTCON 2024), Pattaya, Thailand, 16–18 February 2024. [Google Scholar]
- Wei, J.; Li, J.; Huang, J.; Pang, Z.; Zhang, K. Visual Obstacle Avoidance Trajectory Control of Intelligent Loading and Unloading Robot Arm Based on Hybrid Interpolation Spline. In Proceedings of the 9th International Symposium on Computer and Information Processing Technology (ISCIPT 2024), Xi’an, China, 24–26 May 2024. [Google Scholar]
- Zheng, J.; Chen, L.; Li, Y.; Khan, Y.A.; Lyu, H.; Wu, X. An intelligent robot sorting system by deep learning on RGB-D image. In Proceedings of the 22nd International Symposium INFOTEH-JAHORINA (INFOTEH 2023), East Sarajevo, Bosnia and Herzegovina, 15–17 March 2023. [Google Scholar]
- Park, Y.; Son, H.I. Visual Scene Understanding for Efficient Cooperative Control of Agricultural Dual-Arm Robots. In Proceedings of the 24th International Conference on Control, Automation and Systems (ICCAS 2024), Jeju, Republic of Korea, 29 October–1 November 2024. [Google Scholar]
- Wu, K.; Chen, L.; Wang, K.; Wu, M.; Pedrycz, W.; Hirota, K. Robotic arm trajectory generation based on emotion and kinematic feature. In Proceedings of the International Power Electronics Conference (IPEC-Himeji 2022-ECCE Asia 2022), Himeji, Japan, 15–19 May 2022. [Google Scholar]
- Wake, N.; Kanehira, A.; Sasabuchi, K.; Takamatsu, J.; Ikeuchi, K. Gpt-4v (ision) for robotics: Multimodal task planning from human demonstration. IEEE Robot. Autom. Lett. 2024, 9, 10567–10574. [Google Scholar]
- Liu, H.; Zhu, Y.; Kato, K.; Tsukahara, A.; Kondo, I.; Aoyama, T.; Hasegawa, Y. Enhancing the LLM-Based Robot Manipulation Through Human-Robot Collaboration. arXiv 2024, arXiv:2406.14097. [Google Scholar]
- Zhao, W.; Chen, J.; Meng, Z.; Mao, D.; Song, R.; Zhang, W. Vlmpc: Vision-language model predictive control for robotic manipulation. arXiv 2024, arXiv:2407.09829. [Google Scholar]
- Muslim, M.A.; Urfin, S.N. Design of geometric-based inverse kinematics for a low cost robotic arm. In Proceedings of the 2014 Electrical Power, Electronics, Communications, Control and Informatics Seminar (EECCIS 2014), Malang, Indonesia, 27–28 August 2014. [Google Scholar]
- Kariuki, S.; Wanjau, E.; Muchiri, I.; Muguro, J.; Njeri, W.; Sasaki, M. Pick and Place Control of a 3-DOF Robot Manipulator Based on Image and Pattern Recognition. Machines 2024, 12, 665. [Google Scholar] [CrossRef]
- Liu, W.; Wang, S.; Gao, X.; Yang, H. A Tomato Recognition and Rapid Sorting System Based on Improved YOLOv10. Machines 2024, 12, 689. [Google Scholar] [CrossRef]
- Sun, H.; Yao, G.; Zhu, S.; Zhang, L.; Xu, H.; Kong, J. SOD-YOLOv10: Small Object Detection in Remote Sensing Images Based on YOLOv10. IEEE Geosci. Remote Sens. Lett. 2025, 22, 8000705. [Google Scholar] [CrossRef]
- Chu, Y.; Xu, J.; Yang, Q.; Wei, H.; Wei, X.; Guo, Z.; Leng, Y.; Lv, Y.; He, J.; Lin, J. Qwen2-audio technical report. arXiv 2024, arXiv:2407.10759. [Google Scholar]
- Chu, Y.; Xu, J.; Zhou, X.; Yang, Q.; Zhang, S.; Yan, Z.; Zhou, C.; Zhou, J. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv 2023, arXiv:2311.07919. [Google Scholar]
- Wang, Z.; Zhou, Z.; Song, J.; Huang, Y.; Shu, Z.; Ma, L. Towards testing and evaluating vision-language-action models for robotic manipulation: An empirical study. arXiv 2024, arXiv:2409.12894. [Google Scholar]
- Hiba, S.; Smail, T.; Rachid, S.; Abdellah, C. Vision-Based Robotic Arm Control Algorithm Using Deep Reinforcement Learning for Autonomous Objects Grasping. Appl. Sci. 2021, 11, 7917. [Google Scholar] [CrossRef]
Reference (Year) | Advantage | Disadvantage |
---|---|---|
[14]-(2024) | Integrates segmentation and grasp synthesis to eliminate action model reliance. | Segmentation/synthesis errors may cascade. |
[28]-(2024) | Proposes a Robot-GPT interaction framework with code generation and error correction. | Visual feedback absence restricts error comprehension. |
[31]-(2024) | Combines internal/external error correction mechanisms. | Only grasps trained objects. |
[43]-(2024) | Hierarchical planning decomposes long-term tasks. | Ineffective control of errors arising during execution. |
[44]-(2024) | Conditional action sampling generates VLM-based targeted actions. | Predicted video–action mismatch degrades evaluation. |
No | Task | Difficulty | Description |
---|---|---|---|
Task 1 | Medicine Bottle Grasping | B | Pick up a single bottle from the workspace. |
Task 2 | Nut Grasping | B | Pick up a small nut from the workspace.
Task 3 | Block Grasping | B | Pick up a block from the workspace.
Task 4 | Circular Grasping | I | Circularly grasp all the small nuts from the workspace, where there may be slight occlusion or a complex environment.
Task 5 | Edge-of-View Object Grasping | I | Requires careful localization when objects are only partially visible.
Task 6 | T-Shaped Block Arrangement | A | Requires placing multiple blocks in a specific geometric configuration; demands accurate positioning and orientation handling.
Task 7 | Command-Driven Regrasping Following Misinterpretation | A | Simulates an instruction correction: pick up one object (e.g., a bottle) then replace it upon revised instructions (e.g., “move the charger instead”). Tests dynamic re-planning.
Task 8 | Partial T-Shaped Rearrangement | A | Involves moving or removing some blocks from a T-configuration and re-stacking them according to new instructions.
Task 9 | T-Shaped Construction with Additional Block Stacking | HA | Combines arrangement (forming the T) with vertical stacking of new blocks. Requires stable, multi-step sequencing with minimal positional error.
Task 10 | Multiple Block Stacking | HA | Requires sequentially stacking three or more blocks with distinct target positions, potentially re-planning if partial occlusions or collisions occur.
Task | Initial Strategy SR | Strategy Revisions Initiated | Successful Revisions | Final Strategy SR |
---|---|---|---|---|
Task 1 | 0.748 | 50 | 44 | 0.957 |
Task 4 | 0.710 | 58 | 44 | 0.919 |
Task 6 | 0.652 | 69 | 49 | 0.886 |
Task 10 | 0.657 | 62 | 47 | 0.881 |
Task | α | Error Step Count | Corrected Step Count | Final SR | Avg. Time (s) |
---|---|---|---|---|---|
Task 1 | 0 | 1 | 1 | 0.886 | 34
Task 1 | 0.5 | 2 | 1 | 0.871 | 35
Task 1 | 1 | 1 | 0 | 0.914 | 33
Task 4 | 0 | 3 | 1 | 0.771 | 35
Task 4 | 0.5 | 6 | 2 | 0.857 | 56
Task 4 | 1 | 8 | 3 | 0.843 | 53
Task 6 | 0 | 4 | 2 | 0.714 | 59
Task 6 | 0.5 | 9 | 2 | 0.757 | 82
Task 6 | 1 | 12 | 5 | 0.800 | 110
Task 10 | 0 | 5 | 1 | 0.686 | 69
Task 10 | 0.5 | 11 | 4 | 0.714 | 88
Task 10 | 1 | 10 | 4 | 0.771 | 106
Reference (Year) | Model | SR | Generalization | Instruction Input |
---|---|---|---|---|
GPTArm (This Work) | GPT-4V | 0.900 | Unseen | Image, Speech, and Text |
[28]-(2024) | Robot-GPT | 0.800 | Unseen | Text |
[44]-(2024) | Qwen-vl | 0.767 | Unseen | Image and Text |
[29]-(2023) | PaLM-E | 0.740 | Unseen | Image and Text |
[43]-(2024) | GPT-4 | 0.913 | Seen | Text |
[31]-(2024) | GPT-4V | 0.900 | Seen | Image and Text |
[14]-(2024) | GPT-4V | 0.833 | Seen | Image and Text |
Task | Number of Successes/Total Attempts | SR ± Uncertainty |
---|---|---|
Task 1 | 64/70 | 91.4 ± 3.35% |
Task 2 | 61/70 | 87.1 ± 4.01% |
Task 3 | 63/70 | 90.0 ± 3.58% |
Task 4 | 59/70 | 84.3 ± 4.35% |
Task 5 | 62/70 | 88.6 ± 3.80% |
Task 6 | 56/70 | 80.0 ± 4.78% |
Task 7 | 59/70 | 84.3 ± 4.35% |
Task 8 | 52/70 | 74.3 ± 5.22% |
Task 9 | 55/70 | 78.6 ± 4.90% |
Task 10 | 54/70 | 77.1 ± 5.02% |