Streamlining Human–Robot Interaction: Integrating LLM-Based Planning into Modular Robotic Frameworks
Abstract
1. Introduction
2. Related Works
2.1. Embodied AI
2.2. Navigation Systems
2.3. Object Manipulation Systems
3. Materials and Methods
3.1. Scanning
3.2. Navigation
3.3. Manipulation
3.4. Interaction
4. Experiments
4.1. Real-World Environment
4.2. Experimental Setup and Hardware
4.3. Task Flow and Module Processing
5. Results
5.1. Execution Time Comparison
5.2. Voice Command Processing
6. Discussion
7. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| AI | Artificial Intelligence |
| LiDAR | Light Detection and Ranging |
| LLMs | Large language models |
| VLMs | Vision-language models |
| ROS | Robot operating system |
| SAM | Segment Anything Model |
Appendix A
Appendix A.1
| Listing A1. Prompts used with the LLM to generate datasets for validating the performance of the speech recognition module. These prompts were specifically designed to create requests related to moving household objects to different locations within the home environment, simulating realistic task scenarios for evaluation. |
| messages = [ { “role”: “system”, “content”: [ { “type”: “text”, “text”: “““““ You are an assistant skilled in extracting key information about objects and locations from natural language instructions. You will be provided with a sentence that tells you to pick up an object from a specific place and put it somewhere else. Your task is to extract three things: 1. The **object** (including typical color and material). 2. The **place** where the object is located. 3. The **destination** where the object should be placed. Format your response as: object with typical color and material, place, destination. Important: - If the object is uncommon or uses a brand name, replace it with a more common name. - Focus on the general type of material and typical color for common objects. Examples: Input: “Pick up the blue Samsung phone from the kitchen table and place it on the office desk.” Output: blue Samsung phone, kitchen table, office desk Input: “Grab the leather wallet from the living room couch and move it to the bedroom dresser.” Output: leather wallet, living room couch, bedroom dresser Input: “Take the red Nike sneakers from the hallway closet and put them in the shoe rack by the front door.” Output: red Nike sneakers, hallway closet, shoe rack by the front door “““ } ] } ] |
Appendix A.2
| ☐ | Path A | Path B | Total Time (s) | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| N | P1 | N | P2 | N | P1 | N | P2 | ||||
| A | Baseline | A.T. | 33.80 | 141.60 | 63.60 | 102.00 | 59.20 | 131.80 | 85.40 | 100.00 | 717 |
| S.D. | 1.72 | 3.14 | 2.87 | 2.45 | 2.93 | 4.96 | 2.65 | 3.74 | |||
| Ours | A.T. | 34.80 | 138.00 | 60.00 | 97.00 | 54.60 | 127.20 | 83.40 | 98.40 | 693 | |
| S.D. | 2.56 | 1.90 | 2.37 | 2.45 | 1.72 | 1.83 | 3.14 | 2.50 | |||
| B | Baseline | A.T. | 66.40 | 156.80 | 18.00 | 106.00 | 66.80 | 135.80 | 41.00 | 107.00 | 698 |
| S.D. | 3.98 | 2.23 | 1.41 | 2.61 | 3.49 | 2.04 | 1.26 | 2.45 | |||
| Ours | A.T. | 65.60 | 153.80 | 16.00 | 101.00 | 61.60 | 130.60 | 38.00 | 103.00 | 670 | |
| S.D. | 3.32 | 2.14 | 1.10 | 2.00 | 3.01 | 2.50 | 1.41 | 2.10 | |||
| C | Baseline | A.T. | 103.20 | 175.60 | 104.00 | 135.80 | 140.80 | 136.40 | 104.00 | 112.20 | 1012 |
| S.D. | 2.64 | 5.08 | 4.05 | 4.07 | 3.87 | 3.93 | 3.22 | 3.97 | |||
| Ours | A.T. | 98.60 | 167.60 | 96.80 | 128.80 | 139.20 | 132.60 | 100.60 | 108.40 | 973 | |
| S.D. | 1.36 | 4.88 | 2.93 | 4.83 | 2.14 | 2.94 | 3.72 | 2.24 | |||
| D | Baseline | A.T. | 35.20 | 136.20 | 74.80 | 185.20 | 64.80 | 250.60 | 38.60 | 100.40 | 886 |
| S.D. | 2.48 | 1.72 | 2.99 | 2.48 | 2.64 | 3.26 | 2.06 | 3.38 | |||
| Ours | A.T. | 35.60 | 129.40 | 65.00 | 176.40 | 58.40 | 247.60 | 38.00 | 95.80 | 846 | |
| S.D. | 2.65 | 3.26 | 2.00 | 4.08 | 2.24 | 2.33 | 2.04 | 1.72 | |||
Appendix A.3
| Listing A2. List of 100 common household objects used as examples in the prompt design to generate requests for moving items within a home environment. |
| { “Objects”: [“remote control”, “book”, “phone”, “tablet”, “laptop”, “pillow”, “blanket”, “cup”, “glass”, “mug”, “plate”, “fork”, “spoon”, “knife”, “bottle “, “water bottle”, “vase”, “flower pot”, “toy”, “pen”, “pencil”, “notebook”, “scissors”, “hairbrush”, “comb”, “toothbrush”, “razor”, “soap”, “shampoo bottle”, “towel”, “key”, “wallet”, “bag”, “shoe”, “sock”, “hat”, “glove”, “ remote controller”, “controller”, “fan”, “vacuum cleaner”, “broom”, “dustpan “, “sponge”, “dish”, “pan”, “pot”, “spatula”, “whisk”, “knife sharpener”, “ cutting board”, “measuring cup”, “magazine”, “newspaper”, “candle”, “lighter “, “screwdriver”, “hammer”, “paintbrush”, “ruler”, “tape measure”, “stapler”, “eraser”, “charger”, “USB cable”, “headphones”, “earbuds”, “game console”, “ camera”, “thermostat remote”, “heater”, “humidifier”, “vacuum cleaner attachment”, “curtain rod”, “puzzle piece”, “picture frame”, “calendar”, “ coaster”, “tray”, “table cloth”, “napkin”, “paper towel roll”, “trash can lid “, “shopping bag”, “plastic container”, “lunchbox”, “tupperware”, “toolbox”, “clothes hanger”, “shoe box”, “umbrella”, “mirror”, “toilet paper roll”, “ plunger”, “hair dryer”, “toaster”, “kettle”, “iron”, “alarm clock”, “blue phone”] } |
| Listing A3. List of 100 flat surfaces commonly found in household environments and used as potential placement locations in prompt designs. |
| { “Places”: [“kitchen counter”, “dining table”, “coffee table”, “bookshelf”, “nightstand”, “bedside table”, “cabinet”, “drawer”, “closet shelf”, “bathroom sink”, “kitchen sink”, “refrigerator shelf”, “freezer shelf”, “pantry”, “shoe rack”, “wardrobe”, “dresser top”, “desk”, “TV stand”, “fireplace mantel”, “windowsill”, “kitchen island”, “microwave oven”, “oven rack”, “stove top”, “dish rack”, “laundry basket”, “laundry room shelf”, “washing machine”, “drying rack”, “bathtub edge”, “shower shelf”, “toilet tank”, “medicine cabinet”, “mirror shelf”, “toilet paper holder”, “towel rack”, “hallway table”, “entryway bench”, “mudroom shelf”, “garage shelf”, “toolbox”, “work bench”, “storage bin”, “attic floor”, “basement shelf”, “under the bed”, “under the couch”, “sofa armrest”, “patio table”, “balcony shelf”, “garden shed”, “porch bench”, “poolside table”, “sideboard”, “hutch”, “china cabinet”, “wine rack”, “bar cart”, “cupboard”, “medicine drawer”, “fruit bowl”, “coat rack”, “hat stand”, “umbrella stand”, “shoe cabinet”, “recycling bin”, “trash can”, “compost bin”, “pet bed”, “pet food bowl”, “aquarium stand”, “fish tank”, “birdcage”, “dog crate”, “cat tree”, “window seat”, “bookshelf cubby”, “file cabinet”, “office desk drawer”, “printer stand”, “paper tray”, “pencil holder”, “mail organizer”, “kitchen towel rack”, “cooking utensil holder”, “cutlery drawer”, “spice rack”, “pantry shelf”, “refrigerator door”, “freezer drawer”, “under sink cabinet”, “cleaning supply shelf”, “vacuum cleaner storage”, “ironing board”, “coat closet”, “hat shelf”, “linen closet”, “towel closet”, “kids toy box”, “playroom shelf”, “craft room table”, “sewing machine stand”] } |
| Listing A4. Example of a JSON file generated using LLM-based data creation. The file includes fields such as request (a generated command for moving an object), object (the item to be moved), from (the current location of the object), destination (the target location of the object), and index (the data index). All fields, except for request, were used to verify whether our model accurately extracted and interpreted the relevant information, ensuring the robustness of the speech recognition module evaluation. |
| { “Requests”: [ { “request”: “Please transfer the clothes hanger from the stove top to the entryway bench.”, “object”: “clothes hanger”, “from”: “stove top”, “destination”: “entryway bench”, “index”: 0 }, { “request”: “Could you put the blanket on the wine rack after taking it from the entryway bench?”, “object”: “blanket”, “from”: “entryway bench”, “destination”: “wine rack”, “index”: 1 }, { “request”: “Please place the pen on the sofa armrest from the birdcage.”, “object”: “pen”, “ from “: “birdcage”, “destination”: “sofa armrest”, “index”: 2 }, … { “request”: “Could you put the ironing board on the wine rack after taking it from the pet bed?”, “object”: “ironing board”, “from”: “pet bed“, “destination”: “wine rack”, “index”: 1899 } ] } |
References
- Kawaharazuka, K.; Matsushima, T.; Gambardella, A.; Guo, J.; Paxton, C.; Zeng, A. Real-world robot applications of foundation models: A review. Adv. Robot. 2024, 38, 1232–1254. [Google Scholar] [CrossRef]
- Tulli, S.K.C. Artificial intelligence, machine learning and deep learning in advanced robotics, a review. Int. J. Acta Inform. 2024, 3, 35–58. [Google Scholar]
- Matsushima, T.; Noguchi, Y.; Arima, J.; Aoki, T.; Okita, Y.; Ikeda, Y.; Ishimoto, K.; Taniguchi, S.; Yamashita, Y.; Seto, S.; et al. World robot challenge 2020—Partner robot: A data-driven approach for room tidying with mobile manipulator. Adv. Robot. 2022, 36, 850–869. [Google Scholar] [CrossRef]
- Lin, M.-Y.; Lee, O.-W.; Lu, C.-Y. Embodied AI with large language models: A survey and new HRI framework. In Proceedings of the International Conference on Advanced Robotics and Mechatronics (ICARM), Tokyo, Japan, 8–10 July 2024; IEEE: New York, NY, USA, 2024; pp. 978–983. [Google Scholar] [CrossRef]
- Duan, J.; Yu, S.; Tan, H.L.; Zhu, H.; Tan, C. A survey of embodied AI: From simulators to research tasks. IEEE Trans. Emerg. Top. Comput. Intell. 2022, 6, 230–244. [Google Scholar] [CrossRef]
- Na, K.-I.; Park, B. Real-time 3D multi-pedestrian detection and tracking using 3D LIDAR point cloud for mobile robot. ETRI J. 2023, 45, 836–846. [Google Scholar] [CrossRef]
- Park, K.; Oh, C.; Dong, S. KMSAV: Korean multi-speaker spontaneous audiovisual dataset. ETRI J. 2024, 46, 71–81. [Google Scholar] [CrossRef]
- Seo, S.; Jung, H. A robust collision prediction and detection method based on neural network for autonomous delivery robots. ETRI J. 2023, 45, 329–337. [Google Scholar] [CrossRef]
- Zhang, M.; Chen, J.; Wei, X.; Zhang, D. Work chain-based inverse kinematics of robot to imitate human motion with Kinect. ETRI J. 2018, 40, 511–521. [Google Scholar] [CrossRef]
- Canovas, B.; Nègre, A.; Rombaut, M. Onboard dynamic RGB-D simultaneous localization and mapping for mobile robot navigation. ETRI J. 2021, 43, 617–629. [Google Scholar] [CrossRef]
- OpenAI. GPT-4, 2023. OpenAI. Available online: https://openai.com/gpt-4 (accessed on 12 September 2023).
- Liang, J.; Huang, W.; Xia, F.; Xu, P.; Hausman, K.; Ichter, B.; Florence, P.; Zeng, A. Code as policies: Language model programs for embodied control. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; IEEE: New York, NY, USA, 2023; pp. 9493–9500. [Google Scholar] [CrossRef]
- Vemprala, S.H.; Bonatti, R.; Bucker, A.; Kapoor, A. ChatGPT for robotics: Design principles and model abilities. IEEE Access 2024, 12, 55682–55696. [Google Scholar] [CrossRef]
- Li, M.; Zhao, S.; Wang, Q.; Wang, K.; Zhou, Y.; Srivastava, S.; Gokmen, C.; Lee, T.; Li, L.E.; Zhang, R.; et al. Embodied agent interface: Benchmarking LLMs for embodied decision making. Adv. Neural Inf. Process. Syst. 2024, 37, 100428–100534. [Google Scholar] [CrossRef]
- Dorbala, V.S.; Mullen, J.F., Jr.; Manocha, D. Can an embodied agent find your Cat-Shaped Mug? LLM-guided exploration for zero-shot object navigation. arXiv 2023, arXiv:2303.03480. [Google Scholar] [CrossRef]
- Kwon, T.; Di Palo, N.D.; Johns, E. Language models as zero-shot trajectory generators. IEEE Robot. Autom. Lett. 2024, 9, 6728–6735. [Google Scholar] [CrossRef]
- Yoshida, T.; Masumori, A.; Ikegami, T. From text to motion: Grounding GPT-4 in a humanoid robot “alter3”. arXiv 2023, arXiv:2312.06571. [Google Scholar] [CrossRef]
- Dai, Y.; Peng, R.; Li, S.; Chai, J. Think, act, and ask: Open-world interactive personalized robot navigation. arXiv 2024, arXiv:2310.07968. [Google Scholar] [CrossRef]
- Ma, Y.; Song, Z.; Zhuang, Y.; Hao, J.; King, I. A survey on vision-language-action models for embodied AI. arXiv 2024, arXiv:2405.14093. [Google Scholar] [CrossRef]
- Xu, J.; Sun, Q.; Han, Q.-L.; Tang, Y. When embodied AI meets Industry 5.0: Human-centered smart manufacturing. IEEE/CAA J. Autom. Sin. 2025, 12, 485–501. [Google Scholar] [CrossRef]
- Daza, I.G.; Izquierdo, R.; Martínez, L.M.; Benderius, O.; Llorca, D.F. Sim-to-real transfer and reality gap modeling in model predictive control for autonomous driving. Appl. Intell. 2023, 53, 12719–12735. [Google Scholar] [CrossRef]
- Huang, P.; Zhang, X.; Cao, Z.; Liu, S.; Xu, M.; Ding, W.; Francis, J.; Chen, B.; Zhao, D. What went wrong? Closing the sim-to-real gap via differentiable causal discovery. Proc. Mach. Learn. Res. 2023, 229, 734–760. [Google Scholar]
- Liu, V.; Adeniji, A.; Zhan, H.; Haldar, S.; Bhirangi, R.; Abbeel, P.; Pinto, L. EgoZero: Robot learning from smart glasses. arXiv 2025, arXiv:2505.20290. [Google Scholar] [CrossRef]
- Kwon, W.; Baek, S.; Baek, J.; Shin, W.; Gwak, M.; Park, P.; Lee, S. Reinforced intelligence through active interaction in real world: A survey on embodied AI. Int. J. Control Autom. Syst. 2025, 23, 1597–1612. [Google Scholar] [CrossRef]
- Jeon, S.; Lee, J.; Yeo, D.; Lee, Y.-J.; Kim, S. Multimodal audiovisual speech recognition architecture using a three-feature multi-fusion method for noise-robust systems. ETRI J. 2024, 46, 22–34. [Google Scholar] [CrossRef]
- Mu, Y.; Zhang, Q.; Hu, M.; Wang, W.; Ding, M.; Jin, J.; Wang, B.; Dai, J.; Qiao, Y.; Luo, P. EmbodiedGPT: Vision-language pre-training via embodied chain of thought. Adv. Neural Inf. Process. Syst. 2023, 36, 25081–25094. [Google Scholar]
- Zhang, H.; Du, W.; Shan, J.; Zhou, Q.; Du, Y.; Tenenbaum, J.B.; Shu, T.; Gan, C. Building cooperative embodied agents modularly with large language models. arXiv 2023, arXiv:2307.02485. [Google Scholar] [CrossRef]
- Soori, M.; Arezoo, B.; Dastres, R. Artificial intelligence, machine learning and deep learning in advanced robotics, a review. Cogn. Robot. 2023, 3, 54–70. [Google Scholar] [CrossRef]
- Brunke, L.; Greeff, M.; Hall, A.W.; Yuan, Z.; Zhou, S.; Panerati, J.; Schoellig, A.P. Safe learning in robotics: From learning-based control to safe reinforcement learning. Annu. Rev. Control Robot. Auton. Syst. 2022, 5, 411–444. [Google Scholar] [CrossRef]
- Elendu, C.; Amaechi, D.C.; Elendu, T.C.; Jingwa, K.A.; Okoye, O.K.; John Okah, M.; Ladele, J.A.; Farah, A.H.; Alimi, H.A. Ethical implications of AI and robotics in healthcare: A review. Medicine 2023, 102, e36671. [Google Scholar] [CrossRef]
- Pasham, S.D. A review of the literature on the subject of ethical and risk considerations in the context of fast AI development. Int. J. Mod. Comput. 2022, 5, 24–43. [Google Scholar]
- Zheng, D.; Huang, S.; Zhao, L.; Zhong, Y.; Wang, L. Towards learning a generalist model for embodied navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2024; pp. 13624–13634. [Google Scholar] [CrossRef]
- Xiao, X.; Liu, B.; Warnell, G.; Stone, P. Motion planning and control for mobile robot navigation using machine learning: A survey. Auton. Robot. 2022, 46, 569–597. [Google Scholar] [CrossRef]
- Kalashnikov, D.; Irpan, A.; Pastor, P.; Ibarz, J.; Herzog, A.; Jang, E.; Quillen, D.; Holly, E.; Kalakrishnan, M.; Vanhoucke, V.; et al. Scalable deep reinforcement learning for vision-based robotic manipulation. In Proceedings of the 2nd Conference on Robot Learning; PMLR: Cambridge, MA, USA, 2018; pp. 651–673. [Google Scholar]
- Shafiullah, N.M.M.; Paxton, C.; Pinto, L.; Chintala, S.; Szlam, A. CLIP-Fields: Weakly supervised semantic fields for robotic memory. arXiv 2023, arXiv:2210.05663. [Google Scholar] [CrossRef]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. arXiv 2021, arXiv:2103.00020. [Google Scholar] [CrossRef]
- Wang, L.; Yang, N.; Huang, X.; Yang, L.; Majumder, R.; Wei, F. Multilingual E5 Text Embeddings: A Technical Report. arXiv 2024, arXiv:2402.05672. [Google Scholar] [CrossRef]
- Bhirangi, R.; Pattabiraman, V.; Erciyes, E.; Cao, Y.; Hellebrekers, T.; Pinto, L. AnySkin: Plug-and-Play Skin Sensing for Robotic Touch. arXiv 2024, arXiv:2409.08276. [Google Scholar] [CrossRef]
- Liu, P.; Orru, Y.; Vakil, J.; Paxton, C.; Shafiullah, N.M.M.; Pinto, L. OK-Robot: What really matters in integrating open-knowledge models for robotics. arXiv 2024, arXiv:2401.12202. [Google Scholar] [CrossRef]
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2023; pp. 3992–4003. [Google Scholar] [CrossRef]
- Fang, H.-S.; Wang, C.; Fang, H.; Gou, M.; Liu, J.; Yan, H.; Liu, W.; Xie, Y.; Lu, C. AnyGrasp: Robust and efficient grasp perception in spatial and temporal domains. IEEE Trans. Robot. 2023, 39, 3929–3945. [Google Scholar] [CrossRef]
- Minderer, M.; Gritsenko, A.; Stone, A.; Neumann, M.; Weissenborn, D.; Dosovitskiy, A.; Mahendran, A.; Arnab, A.; Dehghani, M.; Shen, Z.; et al. Simple open-vocabulary object detection with vision transformers. arXiv 2022, arXiv:2205.06230. [Google Scholar] [CrossRef]
- Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust speech recognition via large-scale weak supervision. arXiv 2022, arXiv:2212.04356. [Google Scholar] [CrossRef]
- Zhu, Q.-S.; Zhou, L.; Zhang, J.; Liu, S.-J.; Hu, Y.-C.; Dai, L.-R. Robust data2vec: Noise-robust speech representation learning for ASR by combining regression and improved contrastive learning. arXiv 2022, arXiv:2210.15324. [Google Scholar] [CrossRef]
- Bartlett, M.E.; Edmunds, C.E.R.; Belpaeme, T.; Thill, S. Have I Got the Power? Analysing and Reporting Statistical Power in HRI. J. Hum.-Robot Interact. 2022, 11, 1–16. [Google Scholar] [CrossRef]








Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Kim, M.; Park, J.; Park, K.; Lee, Y.-J.; Jeon, S. Streamlining Human–Robot Interaction: Integrating LLM-Based Planning into Modular Robotic Frameworks. Sensors 2026, 26, 1978. https://doi.org/10.3390/s26061978
Kim M, Park J, Park K, Lee Y-J, Jeon S. Streamlining Human–Robot Interaction: Integrating LLM-Based Planning into Modular Robotic Frameworks. Sensors. 2026; 26(6):1978. https://doi.org/10.3390/s26061978
Chicago/Turabian StyleKim, MinHyuk, JooHee Park, Kwanyong Park, Yong-Ju Lee, and Sanghun Jeon. 2026. "Streamlining Human–Robot Interaction: Integrating LLM-Based Planning into Modular Robotic Frameworks" Sensors 26, no. 6: 1978. https://doi.org/10.3390/s26061978
APA StyleKim, M., Park, J., Park, K., Lee, Y.-J., & Jeon, S. (2026). Streamlining Human–Robot Interaction: Integrating LLM-Based Planning into Modular Robotic Frameworks. Sensors, 26(6), 1978. https://doi.org/10.3390/s26061978

