1. Introduction
Healthcare professionals face high stress and burnout due to staffing shortages, long shifts, demanding responsibilities, and ongoing exposure to hazardous substances [1]. Nurses, who represent nearly half of the global healthcare workforce, are among the most affected [2]. In addition to frequent overtime, their duties include administering medications, assessing patients, organizing patient rooms, gathering supplies, and documenting information, all of which require specialized skills [3]. Prior to the pandemic, 74% of nurses reported stress and overwork, with 24% reporting fatigue [2]. These pressures intensified during the pandemic, with 42% reporting higher stress, 38% greater anxiety, and 29% more depression [4]. Moreover, post-pandemic studies found that 68% of nurses experienced workplace stress, leading more than 50% to report thoughts of leaving their jobs [5]. Those who stayed reported burnout, anxiety, depression, and exhaustion [6]. While nursing vacancy rates rose by 30% between 2019 and 2020 [7], more recent data from 2025 show that the shortage persists, with a projected deficit of 64,000 nurses by 2030. This ongoing shortage causes financial strain because it now costs over $61,000 to replace a single bedside nurse [8]. Such severe strain harms the quality of healthcare and patient outcomes: it can lead to delays in treatment, less time spent with patients, and a greater chance of medical errors. There is a clear need for new solutions that can reduce daily pressures and support healthcare workers.
To address this need, recent advancements in artificial intelligence (AI) and robotics offer promising ways to automate many of the tasks that people currently perform. A study by the McKinsey Global Institute estimated that more than one-third of healthcare tasks in the United States could be automated using AI and robotics technology that was already available in 2017 [9]. This estimate is likely conservative, given the progress since then, including Large Language Models (LLMs), multimodal AI, and robot control systems that learn from experience. Over the past decades, assistant robots have been developed in both research and industry for applications such as patient lifting and transportation [10,11], robotic wheelchairs, medical supply delivery [12,13], cleaning and sterilization [14,15], and even emotional care services [16,17], reflecting the growing interest in augmenting healthcare with robots. However, it is crucial to recognize that nurses manage multiple responsibilities simultaneously in rapidly changing, unpredictable situations that cannot be fully planned or scripted. To effectively support healthcare workers, a robotic assistant should be able to perform diverse tasks, including physical tasks (e.g., bringing objects, restocking supplies), communicational tasks (e.g., reminding patients about schedules, engaging in conversation), and informational tasks (e.g., educating patients about medications), while communicating naturally with patients, visitors, and staff. Because medical environments are highly dynamic and unique, robots must also interact seamlessly with their surroundings and adapt to evolving situations and workflows, either autonomously or, at minimum, under the guidance of healthcare users.
From this perspective, current systems remain insufficient to meaningfully reduce healthcare worker burdens. Most are restricted to single-purpose functions or rigid multi-task modules, lacking a framework to progressively expand their capabilities in response to evolving needs of medical environments and healthcare staff. As a result, their behaviors are typically hard-coded, providing healthcare staff with no means to adapt task execution to context or guide robot actions using workflow knowledge in natural language. This limits both adaptability and clinician involvement. They also remain isolated from the broader medical environment, limited by onboard perception and disconnected from facility systems and medical devices, preventing ward-level awareness and coordinated workflows. Together, these limitations call for a fundamentally new framework that integrates humans, multifunctional robots, and healthcare facilities into a unified model of care, where system behaviors can be adapted through natural language and clinical guidance.
This study presents a multifunctional Robotic Health Attendant (RHA) enabled by an LLM-orchestrated framework that integrates robot actions and environment interactions. Structured to support physical, communicational, and informational functions, it enables new tasks to be added or existing ones modified through natural language, allowing end-users to flexibly adapt the system in practice without the need for extensive redevelopment. The primary objective of this study is to create and validate the proposed framework. To this end, we developed a simulation environment in Isaac Sim and implemented the Holland RHA on the Tiago Pro robotic platform. Here, we use n8n as a control pipeline that coordinates multiple LLM-based modules and interactions between users, robots, and smart environment components into a single natural language–driven system, supporting physical, communicational, and informational tasks. Within this pipeline, task specifications are embedded, enabling users to design and refine tasks via natural language and directly link them to robot execution.
2. Literature Review
2.1. Existing Service Robots in Medical Environments
Early systems, such as the Robotic Nursing Assistant (RoNA) [10], along with related developments including chest-holder-based transfer robots [18] and dual-arm systems such as RIBA [19] and RIBA-II [20], focused on patient lifting and transfer through sensor-based human guidance. Additional approaches, including sensor-assisted hoists [21], direct-touch interaction systems [11], and bed transport robots [22], further explore mechanisms to reduce physical strain while maintaining safe human–robot interaction.
Beyond lifting, intelligent wheelchairs (IWs) have been developed to provide autonomous mobility and assistive interaction. These systems incorporate multimodal control interfaces and varying levels of autonomy, including connected and driverless navigation [12], robotic arm integration [13], cost-effective navigation strategies [23], semi-autonomous object manipulation [24], and mobility-on-demand services in hospital environments [25].
Delivery and logistics robots have been widely explored to automate in-hospital transport tasks. Early systems such as HelpMate [26] established autonomous navigation, followed by platforms incorporating SLAM-based localization [27] and perception-driven navigation [28]. Subsequent developments include systems for meal delivery [29] and modular contact-free logistics [30]. Commercial platforms such as TUG [31], HOSPI [32], Relay [33], Medbot [34], and Moxi [35] demonstrate real-world deployment for transporting medical supplies and samples. In parallel, disinfection robots have been developed using UV-based sterilization [15], flexible robotic arms [36], and low-cost remote systems [14], with commercial systems such as UVD Robotics [37], LightStrike [38], and related platforms [39] reporting high pathogen removal rates.
Personal care robots focus on supporting independent living and human–robot interaction. Systems such as Hobbit [16], Lio [40], and Pepper [41] provide assistance with daily activities, companionship, and social interaction, while platforms such as Alter-Ego [42] incorporate advanced actuation for manipulation and mobility in complex environments.
Finally, mobile manipulators such as the Tiago platform [43], including its deployment in hospital environments [44], represent integrated systems combining mobility, manipulation, and perception for healthcare applications. These systems have been used for delivery, disinfection, and interaction tasks, with frameworks such as OpenDR [45] enhancing sensing and autonomy capabilities.
2.2. Utilization of LLM for Robot Task Planning
This section reviews prior approaches to robot task planning, including conventional model-based methods and interaction-driven strategies, and then focuses on recent advances using LLMs. The discussion highlights the limitations of traditional approaches and motivates the use of LLM-based planning for adaptive and context-aware task execution.
Conventional robotic task planners often struggle in changing, unstructured settings due to their reliance on models that presuppose a static environment [46]. Bernardo et al. [47] combined an ontological framework with Behavior Trees (BTs) to self-correct after task failures, such as failures in navigation and path planning. This system uses a PDDL planner to create a task sequence and converts it into a BT. If the robot fails, the system finds the cause and a potential solution in the ontology and resumes with corrective actions. However, hardcoding the ontology knowledge base can hinder adaptation across different environments. Additionally, SysSelf [48] combines ontology-based reasoning with category theory to help mining robots map alternative functional components when hardware fails. It formally models how the robot’s components and goals relate; if a LIDAR sensor fails, the robot determines how to reconfigure itself to use a camera instead. Moreover, Yu et al. [17] highlight that robots cannot move properly in homes designed solely for humans. They therefore proposed a system that simulates daily household chores and generates heatmaps to identify where robots get stuck or slow down. Based on this map, the framework recommends repositioning furniture, shortening partition walls, and clearing pathways, which decreases the robot’s travel steps by 19.8%. However, these approaches depend on predefined knowledge, labor-intensive ontologies, and complex mathematical logic. Robots cannot adapt to situations beyond their encoded models, limiting performance in dynamic environments. Overcoming this rigidity is important for healthcare robots because they must interpret signals, adapt plans, and update internal models as protocols and patient conditions change.
Researchers have also addressed system-level coordination and environmental integration. Valner et al. [49] improved the coordination of multiple robots and objects, such as hospital doors. A central server, the Robotics Middleware Framework (RMF), acts as a traffic controller, directing the robots to deliver urgent blood samples and calculating safe paths around people. However, this framework schedules routes on a fixed map and lacks high-level reasoning to handle unexpected, high-priority scenarios.
In parallel, another line of research has explored interaction-based knowledge expansion. Perera et al. [50] developed a dialog-based learning mechanism that helps robots learn tasks through user conversations. Similarly, Amiri et al. [51] developed a system in which the robot’s language skills improve through conversations. In contrast, Doğan et al. [52] used visual heatmaps and object detectors to resolve ambiguous commands. The robot uses cameras to scan the room, then asks targeted questions (e.g., “Is the vegetable next to the knife?”) to quickly identify objects, which makes the robot more capable and easier to use. Additionally, Rosenthal et al. [53] introduced a Verbalization algorithm that converts navigation logs and digital maps into plain English to help people understand what robots are doing, seeing, or planning. This method lets the robot narrate its route, making it easier for humans to trust it. Nevertheless, this system relies on manually written templates and human labeling of maps. Home robots also typically forget old objects when they are taught new ones, making it hard for them to adapt continuously to a user’s home. To address this, Interactive Continual Learning (ICL) [54] equips the robot with short-term and long-term memory, enabling human users to teach it new objects in real time without erasing its past knowledge. However, this learning process is manually guided by humans. Overall, these human-in-the-loop strategies reduce autonomy because robots rely on human involvement.
Recent developments in Large Language Models (LLMs) have driven efforts to leverage their commonsense reasoning and knowledge for automated planning of robotic tasks. Huang et al. [55] created Inner Monologue, a framework with closed-loop feedback between the robot, human operator, and textual scene descriptors. Here, human inputs and environmental observations are integrated into LLM prompts, enabling the language model to generate contextually suitable task plans from multiple viewpoints. Typically, creating a social robot involves developing separate rule systems for tasks such as talking, moving, and approaching people. Sucal et al. [56] replaced those rules with a single LLM-based agentic workflow in which the robot converts sights and sounds into text. The LLM then analyzes this text to understand the situation and instructs the robot on what to do.
Researchers at Google introduced Socratic Models (SMs) [57], which integrate multiple pretrained foundation models, APIs, and databases using natural-language prompts. Their method employs language as a universal interface to perform cross-modal reasoning. Applications demonstrated include answering first-person video questions, supporting multimodal assistive dialogs, and enhancing robotic perception and task planning. Singh et al. [58] introduced ProgPrompt, a structured prompting method that leverages LLMs’ training on programming tutorials. This method enabled GPT-3 to generate task plans by completing code snippets, and the programmatic prompt with embedded code improved task consistency. In contrast, Chatcliport [59] improves robotic manipulation tasks, like object assembly and table cleaning, by using correction-based prompting. Instead of blindly executing commands, the system breaks down instructions into small steps, monitors success, and rewrites actions if an error occurs, enabling instant self-correction.
Researchers from Scaled Foundation and Microsoft proposed principles for a ChatGPT 3.5-driven robotics pipeline [60]. Their pipeline allows non-technical users to define robot tasks via natural language. It involves developing APIs, creating prompts, testing, and refining in simulation with user feedback, then deploying plans to real robots. Similarly, Liang et al. [61] used LLMs’ code-generation capabilities to create runtime perception–action feedback loops. They created Language Model Programs (LMPs) using logical operators and third-party libraries for geometry, interpolation, and spatial reasoning. These programs can handle complex instructions such as “Arrange blocks vertically 20 cm long and 10 cm below the blue bowl,” outperforming traditional step-by-step methods. Ding et al. [46] introduced Common Sense-Based Open-World Planning (COWP), which combines PDDL-grounded expert knowledge with LLM-generated extensions. This hybrid framework allows the system to perform domain-constrained reasoning while also learning new action schemas in PDDL format from GPT-3 whenever novel scenarios emerge.
LLMs excel at high-level reasoning with textual input but struggle to process raw visual information to understand messy, disorganized 3D spaces. Relying only on text creates a disconnect between the robot’s plans and actual actions [62]. Vision–Language Models (VLMs) enable robots to process visual and textual data, helping them recognize unseen objects and execute actions in real time [63]. Zhi et al. [64] introduced COME-robot, which used GPT-4V to generate a sequence of actions using images and text prompts. These prompts included a manual of robot commands, guidelines for formatting its answers, and helpful tips, such as thinking step by step and learning from mistakes. The robot could explore the room, take pictures, move to new locations, grasp objects, and place them down. It reads and observes to produce two-part outputs: a reason explaining its logic and movement commands. Recent advancements have introduced Vision-Language-Action (VLA) models that combine visual understanding, language, and motor control. For example, OpenVLA [65] maps visual inputs and text commands to physical trajectories. However, a common challenge is balancing high-level reasoning and precise actions; improving one can reduce the effectiveness of the other. To address this problem, GenieReasoner [66] was introduced, which combines a VLM with a VLA. The VLA generates actions, such as precise movements of arms and mobile bases, to execute individual steps within the plan. To train this model, internet knowledge, physical spaces, and real robotic action data were used as input. When the robot receives a text command, it assesses its surroundings and plans using simple digital thoughts called “tokens.” However, these coarse tokens do not enable fluid movement, so a translator tool was used to convert them into precise physical actions. It links context-aware reasoning with physical movements to perform multi-step chores like folding laundry.
2.3. Point of Departure
Despite these developments, current robots remain insufficient to meaningfully reduce healthcare worker burdens due to three key limitations. (1) Most systems are limited to single-purpose functions or loosely connected modules without a unifying framework, preventing scalable and adaptive multifunctional operation. For example, HoLLiECare [67] demonstrated multiple care tasks but relied on separate modules rather than an integrated system. (2) Robots operate largely in isolation from the built environment, treating facility systems as passive data sources rather than enabling coordinated perception and action. Although efforts such as digital twin-based integration [68] exist, co-perception and co-execution between robots and facility systems remain limited. (3) Existing systems rely on hard-coded behaviors, limiting the ability of healthcare staff to adapt task execution or guide robot behavior in context. While LLMs provide a potential pathway for natural-language-driven task specification, their integration into healthcare robotics remains limited. Prior work [69] demonstrated only basic feasibility without supporting integrated data flows, flexible task orchestration, or interactive system adaptation. As a result, the broader challenge of enabling adaptive, multifunctional robotic assistance in clinical environments remains unresolved.
3. Objective and Scope
To address these limitations, we propose the Robotic Health Attendant (RHA)—a multifunctional nursing assistant robot designed to support a wide range of physical, communicational, and informational tasks. Rather than being treated as a single-purpose device, the RHA is conceptualized as part of an LLM-orchestrated framework that enables coordination between robot actions and building systems. The strength of this approach is that robots and building systems can assist humans in a more context-aware manner—working alongside nurses and other clinicians, communicating through natural language, and acting with contextual understanding. At the core of the framework is a workflow architecture that integrates multiple LLMs with other processing nodes to combine diverse information streams—including ward-level edge sensing (e.g., audio detection, people tracking, fall-event recognition), facility data from building information models (BIM), patient information, hospital guidelines, and clinician-issued commands. Within this workflow, tasks are specified in natural language and represented as step-by-step instructions, which are routed across three functional branches: (1) a physical branch for embodied actions such as navigation, object manipulation, and resource fetching; (2) a communicational branch for context-aware dialog with patients and clinicians; and (3) an informational branch that leverages a retrieval-augmented generation (RAG) system for accessing and summarizing medical knowledge. This design allows clinicians to create and refine workflows in natural language by flexibly combining existing robot skills, enabling the system to adapt to evolving care needs.
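The routing of step-by-step instructions across the three functional branches can be illustrated with a minimal Python sketch. It assumes each step has already been labeled with its branch; in the framework described here that labeling would be produced by the LLM workflow. The names `Branch`, `Step`, `run_task`, and the placeholder handlers are hypothetical and are not taken from the implementation.

```python
from dataclasses import dataclass
from enum import Enum


class Branch(Enum):
    PHYSICAL = "physical"
    COMMUNICATIONAL = "communicational"
    INFORMATIONAL = "informational"


@dataclass
class Step:
    """One step of a natural-language task specification (branch pre-labeled)."""
    branch: Branch
    instruction: str


def handle_physical(instruction: str) -> str:
    # Placeholder for navigation, manipulation, or fetching skills.
    return f"[robot] {instruction}"


def handle_communicational(instruction: str) -> str:
    # Placeholder for context-aware dialog with patients and clinicians.
    return f"[dialog] {instruction}"


def handle_informational(instruction: str) -> str:
    # Placeholder for a retrieval-augmented generation (RAG) query.
    return f"[rag] {instruction}"


HANDLERS = {
    Branch.PHYSICAL: handle_physical,
    Branch.COMMUNICATIONAL: handle_communicational,
    Branch.INFORMATIONAL: handle_informational,
}


def run_task(steps: list[Step]) -> list[str]:
    """Dispatch each step to the handler of its branch, preserving step order."""
    return [HANDLERS[step.branch](step.instruction) for step in steps]
```

Because every branch sits behind the same dispatch table, a workflow can mix physical, communicational, and informational steps in a single ordered list.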
The focus of this study is therefore on the development and demonstration of the proposed framework, including high-level workflow orchestration and its implementation, and on demonstrating a multifunctional and extendable robotic assistant—rather than on low-level motor control such as pick-and-place motions. By showing how multimodal information can be fused with natural language task design, this study highlights a pathway for developing multifunctional robotic assistants that can operate as supportive components within clinical workflows. Unlike prior systems—often limited to single-purpose functions, hard-coded behaviors, or rigid designs that could not adapt to workflows or interact naturally with humans—this framework enables robots to flexibly coordinate physical, communicational, and informational tasks within the same architecture. In doing so, it provides a potential approach to addressing the limitations and demonstrates feasibility in a simulated environment.
4. LLM-Orchestrated Framework for the Multifunctional Robotic Health Attendant (RHA)
4.1. Traditional Development Lifecycle of a Robot Task Executor
To situate the proposed framework, we first provide a conceptual comparison with the traditional approach to RHA development. This comparison illustrates why the traditional model is ill-suited to the RHA, which must support multifunctionality and adaptability in its interactions with facilities and users. In describing traditional solution development, Olszewska et al. [70] outline sequential phases of specification, conceptualization, formalization, and implementation, supported by auxiliary activities such as knowledge acquisition and evaluation. Likewise, Rahman [71] notes that the classic software development life cycle (SDLC) typically proceeds through requirements gathering, system design, implementation, testing, deployment, and maintenance. Building on these foundations, Figure 1 adapts this traditional process to the development of the RHA, where the final packaged software artifact is hereafter referred to as the “task executor”. This lifecycle proceeds in a largely sequential manner. Users participate in the initial design phase by providing requirements and feedback (Tasks 1–4), developers then take charge of implementation and deployment (Tasks 5–8), and users are involved again during operation and evaluation, where they use the system and adjust robot behaviors when needed (Tasks 9–10). The detailed steps of this lifecycle are summarized below.
- Task 1. Develop Robot Workflow Narrative: Define a human-language description of clinical needs based on stakeholder input (e.g., medication delivery with safety constraints).
- Task 2. Determine List of Tasks for the Robot: Translate the narrative into a structured catalog of robot tasks (e.g., fetch medication, deliver instructions).
- Task 3. Define Ordered Robot Steps: Decompose each task into an ordered sequence of actions (e.g., navigate, pick, deliver, speak).
- Task 4. Describe Execution Details: Specify execution parameters for each action (e.g., speed, lighting, grasp type), forming the task specification.
- Task 5. Implement Robot Skill Library: Develop modular robot skills (e.g., navigate, pick, place, speak) as executable software components.
- Task 6. Prepare Input Data: Structure facility and environmental data (e.g., room coordinates) for use in execution.
- Task 7. Map Actions, Skills, and Inputs: Link actions to skills and parameters to form a task execution model.
- Task 8. Package and Deploy the Task Executor: Integrate components into a deployable system with a user interface for task triggering.
- Task 9. Issue Command: Users initiate tasks through UI or voice commands.
- Task 10. Evaluate and Modify Robot Performance: Evaluate outcomes and update the system through developer intervention.
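To make the outcome of Tasks 3 to 8 concrete, the following minimal Python sketch shows what a hard-coded task executor might look like. The room coordinates, the task name `deliver_medication`, and the skill stubs are hypothetical and stand in for real robot software; the point is only that every task, step order, and parameter is fixed in code.

```python
# Task 6: facility data prepared ahead of time (hypothetical coordinates).
ROOM_COORDINATES = {"pharmacy": (2.0, 7.5), "room_101": (12.0, 3.0)}


# Task 5: skill library, stubbed here as functions that return a log line.
def navigate(target: str) -> str:
    return f"navigate to {ROOM_COORDINATES[target]}"


def pick(obj: str) -> str:
    return f"pick {obj}"


def place(location: str) -> str:
    return f"place at {location}"


def speak(text: str) -> str:
    return f"say '{text}'"


SKILLS = {"navigate": navigate, "pick": pick, "place": place, "speak": speak}

# Tasks 3, 4, and 7: ordered steps mapped to skills and parameters in code.
TASKS = {
    "deliver_medication": [
        ("navigate", "pharmacy"),
        ("pick", "medication"),
        ("navigate", "room_101"),
        ("speak", "Your medication has arrived."),
    ],
}


def execute(task_name: str) -> list[str]:
    """Task 9: run a predefined task; unknown tasks raise KeyError."""
    return [SKILLS[skill](arg) for skill, arg in TASKS[task_name]]
```

Any task that is not already listed in `TASKS`, or any change to step order or parameters, requires editing and redeploying the code, which is the developer intervention described in Task 10.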
This traditional lifecycle is not well-suited for multifunctional healthcare robots operating in dynamic environments. (1) Knowledge conversion is indirect, as clinical narratives are reduced to predefined sequences and parameters, losing contextual reasoning. (2) Adaptability is limited because behaviors are pre-specified and cannot easily adjust to changing user needs or environmental conditions without developer intervention. (3) The facility is treated as a passive data source, limiting coordinated interaction between robot actions and environmental systems. (4) Task updates are cumbersome, as modifications require revisiting earlier development stages rather than being directly specified by clinicians.
4.2. LLM-Orchestrated Task Execution Framework for RHA
The proposed lifecycle moves away from rigid software development pipelines by embedding human-language narratives and structured prompts directly into an LLM-orchestrated task executor. Unlike the traditional lifecycle, which proceeds as a long linear sequence of tasks and manual conversions, the proposed process shortens this path through parallel steps and requires earlier collaboration between developers and users. Intermediate outputs such as workflow narratives, task catalogs, and task specifications are used directly as prompts for the “task executor”: the LLM interprets and transforms them into executable actions, so they no longer need to be manually translated into code. This enables clinicians and developers to collaboratively design and adapt tasks in natural language while also triggering structured skills available through both the robot (e.g., navigation, pick-and-place, speech) and a facility skill library (e.g., lights, alarms, sensors, door control, elevator operation). The overall process is illustrated in Figure 2, and the detailed steps are summarized below. This executor integrates robot actions and environment interactions within a unified architecture by positioning the robot, the medical infrastructure, and human users as interacting components of a single system. Human knowledge and needs are directly reflected in the prompts, ensuring that clinical practices and user requirements shape task execution without indirect translation. The robot performs a wide range of nursing tasks as part of the care team, coordinating its actions not only with humans but also with the surrounding environment. The medical infrastructure interacts with the robot through perception and execution, with the executor jointly orchestrating both sensing and actions. Together, these three elements operate as a unified system centered on a language-driven architecture. Because knowledge and commands are represented in prompts, the entire system can be flexibly adjusted through language rather than re-coding, enabling continuous adaptation to clinical contexts as long as users operate within the capabilities provided by the existing skill libraries. The broader implications of this lifecycle for developers, clinicians, and patients are discussed in the following section.
Task 1. Develop Robot Workflow Narrative: The outcome is the “Robot-Integrated Workflow Narrative” (human-language narrative). Clinical needs are gathered through surveys, interviews, and user reflections. The narrative combines explicit needs, tacit practices, and envisioned robot roles. Unlike in the traditional lifecycle, this narrative is not manually converted into code but is fed directly into the executor prompt.
Task 2–1. Determine List of Tasks for the Robot: The outcome is the “Task Catalog with Descriptions” (human-language narrative). Insights from the workflow narrative are organized into a catalog of tasks, such as fetching medication, providing discharge instructions, or guiding patients. Each task includes a short natural-language description of context and purpose. This catalog is used directly in executor prompts, allowing the LLM to link user commands with tasks.
Task 2–2. Implement Robot and Facility Skill Library: The outcome is the “Skill Library with Signatures” (software modules). Developers implement callable robot and facility skills, each with required inputs and parameters. Robot skills include Navigate (target), Pick (object), Place (location), Speak (text), while facility skills cover functions such as controlling lights, generating alerts, or querying sensors. This component makes the facility an active agent, unlike in traditional models.
Task 3–1. Write Task Specification—Robot Steps: The outcome is the “Task Specification—Robot Steps” (human-language narrative). Tasks are expressed as ordered natural-language action steps, which are refined into an Execution Blueprint. This blueprint is both human-readable and directly usable by the executor, bridging user intent and robot execution.
Task 3–2. Write Task Specification—Execution Details: The outcome is the “Task Specification—Execution Details” (human-language narrative + LLM-generated examples). Action steps are enriched with conditions such as navigation speed, lighting, grasp force, or handover versus placement. Although written in natural language, they are validated with LLM-generated output examples, such as JSON mappings of “navigate to patient room” into BIM-derived coordinates. These details enable the executor, mediated by the LLM, to generate skill calls dynamically at runtime.
Task 4. Develop LLM-Orchestrated Task Executor: The outcome is the “LLM-Orchestrated Task Executor” (software artifact). Narratives, catalogs, skill signatures, and task specifications are integrated into one executor. This system combines human knowledge, robot actions, and environment interactions within a unified architecture.
Task 5. Issue Command: The outcome is the “User Command” (human interaction). Clinicians issue commands in natural language. The executor interprets these commands in context with narratives, catalogs, and specifications, and generates structured calls to robot and facility skills in real time.
Task 6. Evaluate and Modify: The outcome is “Evaluation and Modification” (human feedback + narrative/specification revision). Robot performance is observed in practice, and prompts, narratives, and specifications can be revised and reintegrated. Unlike traditional lifecycles, clinicians and developers can continuously modify system behavior through language and prompt updates, without rebuilding executors. This enables responsive adaptation to evolving clinical needs.
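The steps above can be sketched in Python as follows: the artifacts from Tasks 1 to 3 are assembled into a single executor prompt, and the skill call returned by the LLM is checked against the skill signatures before execution. This is only an illustration under stated assumptions. The narrative, catalog, and specification texts, the `SetLights` facility skill, and the JSON shape `{"skill": ..., "args": {...}}` are hypothetical, and the LLM call itself is omitted.

```python
import json

# Hypothetical artifacts from Tasks 1 to 3; real ones are authored with clinicians.
WORKFLOW_NARRATIVE = "Nurses want help fetching supplies during medication rounds."
TASK_CATALOG = {"fetch_medication": "Bring a prescribed medication to a patient room."}
TASK_SPECS = {
    "fetch_medication": [
        "Navigate to the pharmacy",
        "Pick the medication",
        "Navigate to the patient room",
        "Speak a handover message",
    ],
}

# Robot skill signatures named in the text, plus one assumed facility skill.
SKILL_SIGNATURES = {
    "Navigate": {"target"},
    "Pick": {"object"},
    "Place": {"location"},
    "Speak": {"text"},
    "SetLights": {"room", "level"},  # hypothetical facility skill
}


def build_executor_prompt(command: str) -> str:
    """Assemble narrative, catalog, skills, and specifications into one prompt."""
    tasks = "\n".join(f"- {name}: {desc}" for name, desc in TASK_CATALOG.items())
    skills = "\n".join(
        f"- {name}({', '.join(sorted(params))})"
        for name, params in SKILL_SIGNATURES.items()
    )
    specs = "\n".join(
        f"- {name}: " + "; ".join(steps) for name, steps in TASK_SPECS.items()
    )
    return (
        f"Workflow narrative:\n{WORKFLOW_NARRATIVE}\n\n"
        f"Task catalog:\n{tasks}\n\n"
        f"Available skills:\n{skills}\n\n"
        f"Task specifications:\n{specs}\n\n"
        f"User command: {command}\n"
        'Reply with one JSON object: {"skill": ..., "args": {...}}'
    )


def validate_skill_call(raw: str) -> dict:
    """Parse an LLM reply and reject calls that do not match a skill signature."""
    call = json.loads(raw)
    if not isinstance(call, dict):
        raise ValueError("skill call must be a JSON object")
    name = call.get("skill")
    if name not in SKILL_SIGNATURES:
        raise ValueError(f"unknown skill: {name!r}")
    args = call.get("args", {})
    if not isinstance(args, dict) or set(args) != SKILL_SIGNATURES[name]:
        raise ValueError(f"arguments for {name} must be {sorted(SKILL_SIGNATURES[name])}")
    return call
```

In this sketch, revising the narrative, catalog, or specification strings changes the executor's behavior without touching the skill code, while the signature check keeps generated calls within the capabilities of the existing skill library.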
4.3. Developmental Considerations of the Proposed Framework
The development of the framework requires sustained collaboration between developers and clinicians across all stages of the process. (1) Documentation must be written for direct LLM use, so that clinical narratives, task catalogs, and early specifications function not only as records of user needs but also as working inputs for the executor. (2) Early development requires that task design and skill definition advance together, since task specifications rely on available skills, while skill definitions in turn must be guided by clinical needs. In practice, Task 2–1 (creating the “Task Catalog with Descriptions”) and Task 2–2 (implementing the “Skill Library with Signatures”) must therefore progress in parallel. The catalog identifies required functions while developers define preliminary skill signatures that later support task specifications. In the case of a medication-fetching task, both “goToLocation” and “goToObject” may be defined as callable skills, and the specification must indicate which applies in context so that the executor can translate natural-language prompts into concrete actions. (3) Task specifications should then be authored jointly by developers and clinicians, serving as evolving templates rather than static blueprints, with the act of writing and refining prompts guiding how skills are structured. When “goToObject(object_name)” is used, the executor must connect to mechanisms that continually update coordinates and status from the facility or robot.
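The difference between the two navigation skills can be illustrated with a short Python sketch. It assumes that fixed locations come from a static table and that movable objects are tracked by a component that receives position updates from the facility or robot. The `ObjectTracker` class, the location table, and the snake_case function names are hypothetical stand-ins for “goToLocation” and “goToObject”.

```python
from dataclasses import dataclass, field

Position = tuple[float, float]

# Hypothetical fixed locations, for example derived once from facility data.
LOCATIONS: dict[str, Position] = {"medication_room": (4.0, 9.0)}


@dataclass
class ObjectTracker:
    """Keeps the latest reported position of each movable object."""
    positions: dict[str, Position] = field(default_factory=dict)

    def update(self, object_name: str, position: Position) -> None:
        # Called whenever the facility or robot reports a new position.
        self.positions[object_name] = position

    def locate(self, object_name: str) -> Position:
        if object_name not in self.positions:
            raise LookupError(f"no position reported for {object_name!r}")
        return self.positions[object_name]


def go_to_location(location_name: str) -> Position:
    """Resolve a static, named location to its goal coordinates."""
    return LOCATIONS[location_name]


def go_to_object(object_name: str, tracker: ObjectTracker) -> Position:
    """Resolve a movable object to its most recently reported coordinates."""
    return tracker.locate(object_name)
```

A specification that names a cart or a person would call `go_to_object`, so the goal follows the latest update, whereas a specification that names a room would call `go_to_location`.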
Because LLMs mediate between natural-language specifications and skill execution, their outputs must be validated not only for accuracy in isolation but also for consistency when interacting with other LLM modules. Developers therefore need to organize the “Task Executor” so that related actions form coherent pipelines across domains: physical tasks should rely on efficient locomotion and manipulation with minimal extra context; communicational tasks must incorporate rich contextual inputs about users and environments while adapting outputs to speech or text channels; and informational tasks may depend on multiple vector databases with careful validation before delivery in clinical practice. Finally, (4) real-time adaptation of robot behavior requires that both clinicians and developers maintain awareness of available skills and how these are referenced in task specifications. Minor modifications, such as adding a light signal, can be integrated easily, whereas entirely new tasks require a deeper understanding of the robot’s catalog and facility functions. In this way, the proposed framework enables continuous refinement through natural language, but its success depends on shared responsibility: clinicians must articulate meaningful prompts, and developers must provide robust, well-structured skill libraries and validation mechanisms to ensure reliable execution in evolving clinical contexts.
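To make the distinction between the two navigation skills concrete, the following Python fragment gives a minimal sketch of a skill library with signatures. Only the skill names goToLocation and goToObject come from the text; the `Skill` record, the pose tables, the coordinates, and `call_skill` are hypothetical stand-ins for whatever skill library and facility data feed an actual implementation would use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Pose = Tuple[float, float]  # (x, y) in facility coordinates; assumed representation

# Hypothetical data: fixed locations versus objects whose poses change at runtime.
LOCATION_POSES: Dict[str, Pose] = {"medication room 1": (4.0, 2.0)}
OBJECT_POSES: Dict[str, Pose] = {"medication 1": (4.3, 2.6)}


@dataclass(frozen=True)
class Skill:
    name: str
    params: Tuple[str, ...]  # the signature exposed to task specifications
    run: Callable[..., Pose]


def go_to_location(location_name: str) -> Pose:
    return LOCATION_POSES[location_name]


def go_to_object(object_name: str) -> Pose:
    # Reads the latest pose at call time, so facility updates are picked up.
    return OBJECT_POSES[object_name]


SKILLS: Dict[str, Skill] = {
    "goToLocation": Skill("goToLocation", ("location_name",), go_to_location),
    "goToObject": Skill("goToObject", ("object_name",), go_to_object),
}


def call_skill(name: str, **kwargs: str) -> Pose:
    """Check the arguments against the declared signature, then run the skill."""
    skill = SKILLS[name]
    if set(kwargs) != set(skill.params):
        raise ValueError(f"{name} expects {skill.params}, got {tuple(kwargs)}")
    return skill.run(**kwargs)
```

In this sketch, a specification that names goToObject resolves the pose at call time, which is one simple way to reflect the requirement that object coordinates be continually updated.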
5. System Development and Implementation in Simulation
5.1. LLM-Orchestrated Task Executor
Handling the diverse nursing tasks required of the RHA with a single LLM call would require excessively large prompts encompassing the entire knowledge base, all task specifications, coordinate information, patient data, and complex interaction logic among human, robot, and facility systems (for example, managing access control, data restrictions, or conditional task execution). To address this complexity, the system modularizes both prompts and reasoning logic and adopts an agent-based orchestration approach that distributes information and functions to the right nodes at the right time. Similar orchestration principles have been applied in earlier studies on task decomposition for service robots (Liu et al. [72]), dual-arm coordination (Yang et al. [73]), and high-level reasoning integrated with low-level control (Munawar et al. [74]). Here, these ideas are extended to healthcare robotics, enabling structured collaboration between humans, robots, and building systems.
At the core of this framework is the LLM-Orchestrated Task Executor, which converts natural-language commands received through a frontend webhook into machine-readable instructions executed across three functional branches—physical, communicational, and informational (see
Figure 3). For implementation, n8n is used as the orchestration layer to coordinate workflow execution across modular components. Similar LLM-based orchestration frameworks, such as AutoGen, CrewAI, and LangGraph, support capabilities including multi-agent task decomposition, stateful workflow management, inter-agent communication, tool invocation, and dynamic control flow for coordinating LLM-driven reasoning and execution processes. The system knowledge base supports this process by providing both narrative and structured information for LLM reasoning. It contains summaries of available tasks and their access levels, detailed task specifications, skill definitions for both robot and facility actions, information on object locations and states, patient information, and medical education materials used for retrieval-augmented generation (RAG).
When a command is received, the orchestrator first performs access-level validation using the task summary. The LLM determines which task the robot should execute and whether the user has sufficient permission. If permission is denied, a denial message is generated and returned to the user. If access is granted, the workflow proceeds through one of the three branches.
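The gating step can be illustrated with a short sketch. In the described system the LLM itself makes the access decision; the deterministic check below is an illustrative complement rather than the study's implementation. The task summary contents (other than the level 2 requirement for Patient Education shown in the later example output) and the rule that a larger number means broader permission are assumptions.

```python
from typing import Any, Dict

# Hypothetical task summary; the actual one is Supplementary Material S1.
TASK_SUMMARY: Dict[str, Dict[str, Any]] = {
    "Fetch Medication": {"required_access_level": 3, "task_type": "Physical"},
    "Patient Education": {"required_access_level": 2, "task_type": "Informational"},
}


def check_access(task: str, user_access_level: int) -> Dict[str, Any]:
    """Return a routing decision: a branch name on success, a denial message otherwise."""
    entry = TASK_SUMMARY.get(task)
    if entry is None:
        return {"has_access": False, "branch": None,
                "message": f"Unknown task: {task}"}
    required = int(entry["required_access_level"])
    # Assumption: a larger number means broader permission.
    if user_access_level < required:
        return {"has_access": False, "branch": None,
                "message": f"Access denied: '{task}' requires level {required}."}
    return {"has_access": True, "branch": entry["task_type"], "message": ""}
```

A caller would return the `message` to the user when `has_access` is false and otherwise continue into the branch named by `branch`.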
Each branch follows a modular process in which the orchestrator retrieves the relevant task specification, constructs a preliminary machine-readable JSON structure through the LLM, completes it with detailed skill inputs, provides a short confirmation to the user, and delivers the finalized JSON to the execution modules. Throughout this process, the narratives and structured data from the knowledge base are dynamically merged within the LLM prompt to ensure contextual accuracy. Later case studies illustrate these steps in detail.
The physical branch handles embodied actions such as navigation, picking, placing, and speaking, supporting tasks like medication delivery, meal delivery, and meal assistance. Because these tasks depend on precise spatial reasoning, the LLM is invoked twice—first to generate the structural form of the task and then to complete it with coordinate data retrieved from the facility database. The communicational branch manages dialog with patients and clinicians, generating context-aware responses that can also trigger physical actions if needed. The informational branch focuses on knowledge-intensive operations, retrieving and summarizing professional medical content through an RAG system; in this study, it was applied to patient education on complex heart-transplant care.
The three branches can operate independently or converge depending on workflow design. Each produces two parallel outputs: a human-readable message for communication with users and a machine-readable JSON for system execution. The JSON outputs directly trigger robot and facility skills (such as opening doors, adjusting lighting, or navigating) and dynamically adjust their operational parameters. A screenshot of the orchestration workflow in n8n is shown in Figure 4, illustrating how these processes are sequentially executed within the modular framework.
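As a rough illustration of the branch structure and its two parallel outputs, consider the following sketch. It is not the n8n workflow itself: the branch functions are placeholders that return fixed messages and single-action plans, and the names `BranchOutput` and `run_branch` are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class BranchOutput:
    message: str    # human-readable text for the chat interface
    task_json: str  # machine-readable JSON for the execution modules


def physical_branch(command: str) -> BranchOutput:
    # Placeholder plan; a real branch would build this from the task specification.
    plan = {"command": command, "actions": [{"skill": "robot/navigate"}]}
    return BranchOutput("Starting the requested delivery.", json.dumps(plan))


def communicational_branch(command: str) -> BranchOutput:
    plan = {"command": command, "actions": [{"skill": "robot/speak"}]}
    return BranchOutput("Here is what I know about this facility.", json.dumps(plan))


def informational_branch(command: str) -> BranchOutput:
    plan = {"command": command, "actions": [{"skill": "robot/speak"}]}
    return BranchOutput("Here is a summary of the requested material.", json.dumps(plan))


BRANCHES: Dict[str, Callable[[str], BranchOutput]] = {
    "Physical": physical_branch,
    "Communicational": communicational_branch,
    "Informational": informational_branch,
}


def run_branch(task_type: str, command: str) -> BranchOutput:
    """Route a command to the branch matching its task type."""
    return BRANCHES[task_type](command)
```

Keeping the message and the JSON in one return value makes it harder for the confirmation shown to the user to drift from the plan that is actually executed.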
5.2. Robot and Facility Skill Implementation
The robotic and facility skills of the RHA system were implemented and tested in the Isaac Sim environment using the Tiago Pro robot model and the digital twin of the UNMC Innovation Design Unit (IDU) patient room. Robot skills represent the core physical capabilities of the RHA, enabling it to interact with people and objects in a realistic care environment. As shown in
Figure 5, these skills include fundamental operations such as robot/speak, robot/navigate, robot/pick, and robot/place, which together form the foundation for executing complex patient care tasks. Each skill follows a defined sequence of steps—for example, navigation involves reading the target pose, planning a path, and aligning orientation upon arrival; picking and placing rely on precise manipulation sequences that move the end-effector to pre-defined poses, attach or release objects, and ensure safe retraction. The robot/speak skill enables verbal interaction through text-to-speech, supporting basic patient and clinician communication. Additional parameters, such as robot/visual_alert (off/dim/on) and movement speed, provide flexibility for adjusting behaviors to context-sensitive conditions like night mode or emergency operation. These modular robot skills are designed to be composed dynamically according to task specifications generated by the LLM-Orchestrated Task Executor.
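The parameter handling and step sequence described above might be sketched as follows. The skill name robot/navigate, the three navigation stages, and the off/dim/on alert modes come from the text; the default values, the positive-speed rule, and the function names are assumptions, and the navigation function returns only a textual trace rather than driving a simulator.

```python
from typing import Any, Dict, List

VISUAL_ALERT_MODES = ("off", "dim", "on")  # modes named in the text


def normalize_robot_params(params: Dict[str, Any]) -> Dict[str, Any]:
    """Validate robot_params; the defaults used here are assumptions."""
    alert = str(params.get("visual_alert", "on")).lower()
    if alert not in VISUAL_ALERT_MODES:
        raise ValueError(f"visual_alert must be one of {VISUAL_ALERT_MODES}")
    speed = float(params.get("speed", 1))
    if speed <= 0:
        raise ValueError("speed must be positive")
    return {"visual_alert": alert, "speed": speed}


def navigate_steps(target: str) -> List[str]:
    """The three navigation stages described above, as an ordered trace."""
    return [f"read target pose: {target}", "plan path", "align orientation on arrival"]


ROBOT_SKILLS = {"robot/navigate": navigate_steps}


def run_robot_skill(skill: str, target: str, params: Dict[str, Any]) -> Dict[str, Any]:
    """Look up a skill, validate its parameters, and return the step trace."""
    if skill not in ROBOT_SKILLS:
        raise KeyError(f"unknown robot skill: {skill}")
    return {"params": normalize_robot_params(params),
            "trace": ROBOT_SKILLS[skill](target)}
```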
Facility skills were developed to enable the RHA to interact with the built environment. As shown in
Figure 6, these skills allow the robot to manipulate and coordinate with environmental systems such as lighting, doors, and monitoring devices within the simulated UNMC IDU patient room. The facility/light skill controls illumination intensity based on task requirements, supporting energy efficiency and patient comfort by toggling between on, off, and dim modes. The facility/door skill interfaces with animated door elements, allowing the robot or system to autonomously open or close doors as part of navigation or delivery workflows. The facility/monitor skill integrates sensing and perception functions within the environment—capturing visual data, running object and human segmentation, detecting patient posture, and publishing processed frames to the interface for situational awareness. Together, these facility skills extend robotic operation beyond isolated manipulation to a coordinated, infrastructure-aware system where the environment components are integrated into task execution workflows.
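A minimal state model for the lighting and door skills could look like the sketch below. The skill names facility/light and facility/door and the on/off/dim modes follow the text; the action schema with `skill`, `room`, and `value` keys, the door state names, and the `FacilityState` class are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict

LIGHT_MODES = ("on", "off", "dim")  # modes named in the text
DOOR_STATES = ("open", "closed")    # assumed door states


@dataclass
class FacilityState:
    lights: Dict[str, str] = field(default_factory=dict)
    doors: Dict[str, str] = field(default_factory=dict)


def apply_facility_action(state: FacilityState, action: Dict[str, str]) -> FacilityState:
    """Apply one action of the form {"skill", "room", "value"} to the facility state."""
    skill, room, value = action["skill"], action["room"], action["value"]
    if skill == "facility/light":
        if value not in LIGHT_MODES:
            raise ValueError(f"light mode must be one of {LIGHT_MODES}")
        state.lights[room] = value
    elif skill == "facility/door":
        if value not in DOOR_STATES:
            raise ValueError(f"door state must be one of {DOOR_STATES}")
        state.doors[room] = value
    else:
        raise KeyError(f"unknown facility skill: {skill}")
    return state
```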
5.3. Frontend UI and System Integration
The RHA system is accessed through a web-based frontend consisting of a chat interface and a real-time monitoring dashboard (see
Figure 7). Users sign in with institutional credentials; upon login, the backend stores the session together with user profile fields (role, unit, and access level). When a user submits a command in chat, the frontend posts it—along with user identity and access level—to the n8n-based task executor via webhook. The executor validates access against the Task Specification Summary, selects the appropriate task, and runs the orchestration pipeline: it loads the relevant human-language specification and skill signatures, composes a machine-readable JSON plan, and issues HTTP task instructions to the simulation. Isaac Sim executes the robot and facility skills (e.g., navigate, pick/place, open door, control lights) and streams state back to the backend. Two feedback paths close the loop: (1) a concise, human-readable response is returned to the chat interface so the user sees confirmations or errors in context; and (2) telemetry, scene snapshots, and derived metrics (e.g., positions, detections, alerts) are pushed to the monitoring dashboard for live visualization. This wiring allows authenticated, role-aware user commands to flow from the browser to the orchestrator, drive embodied actions in simulation, and surface both explanations and live results back to the user in real time.
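The first hop of this flow, from the frontend to the webhook, can be sketched with the Python standard library. The payload fields mirror those named in the case study (message, username, full name, access level), but the exact key names and the endpoint URL are assumptions. The function only builds the request and does not send it.

```python
import json
import urllib.request

# Hypothetical endpoint; the actual webhook path is not given in the text.
WEBHOOK_URL = "http://localhost:5678/webhook/rha-task"


def build_command_request(message: str, username: str, full_name: str,
                          access_level: int, url: str = WEBHOOK_URL) -> urllib.request.Request:
    """Package a chat command and user context as a JSON POST request."""
    payload = {
        "message": message,
        "username": username,
        "full_name": full_name,
        "access_level": access_level,
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending the request would then be a call such as `urllib.request.urlopen(req, timeout=10)`, with the response body carrying the human-readable reply for the chat interface.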
6. Case Study
To demonstrate the feasibility of the proposed framework, a case study was conducted using a simulated hospital environment modeled after the Nebraska Medicine IDU (
Figure 8). The architectural model from Revit was imported into Isaac Sim to recreate a realistic inpatient environment with patient rooms, hallways, and nursing stations. Human models representing patients and clinicians were added to simulate real hospital interactions, and a Robotic Health Attendant operated within the same environment, executing physical, communicational, and informational tasks.
The purpose of this case study is to demonstrate the feasibility of the framework in translating clinician-written task specifications—expressed in natural language—into machine-executable commands that coordinate both robotic and facility behaviors toward specific care goals. It further illustrates how the orchestration pipeline can integrate multimodal information, including sensor-based patient monitoring, patient records, and structured and unstructured knowledge-base data, into contextually relevant prompts that lead to purposeful actions. In this study, the entire task selection and orchestration process was implemented using GPT-4.1 nano, which interprets clinician inputs, identifies corresponding task specifications, verifies access permissions, and generates structured execution plans for the robot and facility systems.
When a user issues a command through the web-based frontend, the message is transmitted to the backend as structured data that includes the user’s message, username, full name, and access level. For example, the command “I am about to have a heart transplant. Tell me about the procedure” with username, full name, and access level is received by the system and passed to the n8n-based task executor. The executor integrates this input into a structured prompt and passes it to the language model for processing.
The model compares the user’s query against all entries in the Task Specification Summary (
Supplementary Material S1), which lists available tasks, their descriptions, required access levels, task types, and corresponding specification files. The exact prompt used in the workflow is shown below.
| Prompt: Task Selection |
This user command: {user_command} This user’s access level: {access_level} These are all available task specifications: {task_specification_summary} From the list of task specifications, select one task and determine if the user has the access based on the user query and task specifications. For task, use the task names in task specifications with no modification. For access, the value can be either True or False. For task type, use the task type in task specifications with no modification. Return JSON like: { “task”: “task_name”, “user_access_level”: “1”, “required_access_level”: “1”, “has_access”: “True”, “task_type”: “Physical”, “task_specification_file”: “XXX.txt” } |
Using this prompt, GPT-4.1 nano interprets the natural-language command, selects the matching task, identifies the task type, retrieves the name of the task specification file containing full details for the selected task, and determines whether the user has sufficient access privileges, producing the structured output shown below.
| Output: Task Selection |
{ “task”: “Patient Education”, “user_access_level”: “2”, “required_access_level”: “2”, “has_access”: “True”, “task_type”: “Informational”, “task_specification_file”: “patient_education.txt” } |
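Because the prompt asks for every value as a string, including the access levels and the has_access flag, downstream code has to convert and cross-check them. The sketch below shows one way this might be done. The key names follow the example output above; the rule that access is granted when the user's level is at least the required level is an assumption consistent with that example, and `parse_task_selection` is a hypothetical helper.

```python
import json
from typing import Any, Dict

REQUIRED_KEYS = {"task", "user_access_level", "required_access_level",
                 "has_access", "task_type", "task_specification_file"}
TASK_TYPES = {"Physical", "Communicational", "Informational"}


def parse_task_selection(raw: str) -> Dict[str, Any]:
    """Parse the task-selection reply and convert string fields to typed values."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if data["task_type"] not in TASK_TYPES:
        raise ValueError(f"unexpected task type: {data['task_type']}")
    user_level = int(data["user_access_level"])
    required_level = int(data["required_access_level"])
    claimed = str(data["has_access"]).strip().lower() == "true"
    # Assumption: access is granted when the user's level is at least the required level.
    computed = user_level >= required_level
    if claimed != computed:
        raise ValueError("has_access disagrees with the reported access levels")
    return {**data, "user_access_level": user_level,
            "required_access_level": required_level, "has_access": computed}
```

Recomputing the flag from the two levels guards against a reply in which the model reports consistent levels but an inconsistent decision.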
At this stage of the pipeline, the executor generates a machine-readable JSON structure that defines the sequence of actions for the selected task. For example, when the user command is “bring medication 1 to John Smith” and the selected task is Fetch Medication, the system combines several inputs (namely, the user command, task name, user name, and task specification; see Supplementary Material S2) into a structured prompt that guides the generation of the initial task JSON structure. The prompt used to construct the task structure is written as follows.
| Prompt: Task JSON Structure Generation |
Command given is: {user_command} Selected task is: {task_name} User name is: {user_name} Task specification includes task signature and example output: {task_specification} Generate JSON output in this format: {
“task”: “use selected task name as it is”,
“task_id”: “randomly generate short id using user, task, time”,
“command”: “use command given as it is”,
“user”: “user name here”,
“actions”: [
{
“action”: “action1”,
“skill”: “skill1”,
“robot_params”: {
“speed”: 1,
“visual_alert”: “On”
}
},
{
“action”: “action2”,
“skill”: “skill2”,
“robot_params”: {
“speed”: 2,
“visual_alert”: “Dim”
}
}
]
}
Instructions
Each action description follows the details in the task specification, while keeping the skill names and their order identical to the task signature. Parameters are assigned by first applying values mentioned in the user command; if absent, default values from the human-readable specification are used. If multiple objects or targets are mentioned, the complete action sequence is repeated for each. Parameter keys must strictly match those defined in the command or specification. |
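The instructions in this prompt imply checks that can also be applied to the model's reply. The following sketch tests a generated structure for the required keys and for a skill order that matches the task signature, allowing the whole sequence to repeat when several targets are mentioned. The key names follow the template above; the example signature used in the test and the function name are hypothetical, since the actual signature is given in Supplementary Material S2.

```python
from typing import Any, Dict, List

TOP_LEVEL_KEYS = {"task", "task_id", "command", "user", "actions"}
ACTION_KEYS = {"action", "skill", "robot_params"}


def check_task_structure(task_json: Dict[str, Any], signature: List[str]) -> List[str]:
    """Return a list of problems; an empty list means these checks passed."""
    problems: List[str] = []
    missing = TOP_LEVEL_KEYS - task_json.keys()
    if missing:
        problems.append(f"missing top-level keys: {sorted(missing)}")
    actions = task_json.get("actions", [])
    for index, action in enumerate(actions):
        absent = ACTION_KEYS - action.keys()
        if absent:
            problems.append(f"action {index} is missing {sorted(absent)}")
    skills = [action.get("skill") for action in actions]
    n = len(signature)
    # Multiple targets repeat the complete sequence, so whole repeats are accepted.
    order_ok = (n > 0 and len(skills) > 0 and len(skills) % n == 0
                and skills == signature * (len(skills) // n))
    if not order_ok:
        problems.append("skill order differs from the task signature")
    return problems
```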
Next, the system converts the structured plan into an executable task by filling in all input arguments. Using the prompt in
Supplementary Material S3, the executor takes the user command, skill signatures, facility data with object coordinates, and the JSON structure generated in the previous step, and completes all required parameters for execution. The resulting task JSON (
Supplementary Material S4) is sent via HTTP request to Isaac Sim, where the robot and facility actions are executed in sequence.
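In the described system this completion step is carried out by the LLM using the prompt in Supplementary Material S3. For intuition only, the sketch below shows a deterministic version of the same idea: symbolic targets in the structure are resolved against facility data to obtain rooms and coordinates. The facility table, its coordinates, and the `target` and `inputs` keys are all hypothetical.

```python
import copy
from typing import Any, Dict, List, Tuple

# Hypothetical facility data; names follow the first experiment, coordinates are invented.
FACILITY: Dict[str, Dict[str, Any]] = {
    "patients": {"John Smith": "patient room 1"},
    "objects": {"medication 1": {"room": "medication room 1", "pose": [4.3, 2.6, 0.9]}},
    "rooms": {"medication room 1": {"center": [4.0, 2.0]},
              "patient room 1": {"center": [10.0, 6.0]}},
}


def resolve_target(name: str, facility: Dict[str, Dict[str, Any]]) -> Tuple[str, List[float]]:
    """Map an object, patient, or room name to (room, pose)."""
    if name in facility["objects"]:
        obj = facility["objects"][name]
        return obj["room"], obj["pose"]
    room = facility["patients"].get(name, name)  # a patient name resolves to a room
    if room not in facility["rooms"]:
        raise KeyError(f"cannot locate: {name}")
    return room, facility["rooms"][room]["center"]


def complete_task_json(structure: Dict[str, Any],
                       facility: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Return a copy of the structure with concrete inputs filled in for each target."""
    task = copy.deepcopy(structure)
    for action in task["actions"]:
        name = action.get("target")
        if name is not None:
            room, pose = resolve_target(name, facility)
            action["inputs"] = {"target": name, "room": room, "pose": pose}
    return task
```

A lookup of this kind could also serve as an independent check on the coordinates that the LLM writes into the final task JSON.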
The first experiment demonstrates the execution of the command “Bring medication 1 to John Smith.” When this JSON command is received, John Smith is identified as being in patient room 1, and medication 1 is located in medication room 1. As shown in
Figure 9, the RHA turns the visual alert on, the facility opens the medication room 1 door and turns the light on, and the RHA navigates to medication room 1. The RHA then picks up medication 1 in medication room 1. After that, the facility opens the patient room 1 door and turns its light on. Finally, the RHA navigates to the center of patient room 1 and places the medication at the bedside, completing the task successfully. In this process, the actions and parameters required for the task were explicitly defined and successfully executed to complete the fetch medication task. Additionally, even though the command did not specify “patient room 1,” the RHA matched the correct patient room by identifying John Smith’s location using facility data.
Figure 10 shows the execution process for the user command “Bring medication 1 to patient room 1. This time, put the medication on the supply cabinet and make sure your visual alert is dim all the time since patients are sleeping. Keep the patient room light off for the patient there.” This command was designed as a complex scenario to verify whether the system could coordinate facility control and robot behavior under multiple contextual conditions. The system first receives the JSON command describing the detailed task with constraints on lighting and visual alert levels. The facility then turns off the medication room light and opens the medication room door, while the RHA changes its visual alert to dim, navigates to the medication room, and picks up medication 1. Next, the facility turns off the patient room light and opens the patient room door as the RHA keeps its visual alert dim and navigates toward the patient room. Finally, the RHA arrives at the center of patient room 1 and places the medication on the cabinet, completing the task. This experiment illustrates that the proposed framework can accurately interpret complex natural-language instructions, manage robot parameters and facility states simultaneously, and execute sequential actions that align with environmental and contextual constraints.
Figure 11 presents two examples demonstrating how the system interprets and executes communicational and informational tasks through natural-language interaction in the simulated hospital environment. In the first case, a clinician asked about the facility (“It is my first day. Tell me about your knowledge of this facility.”). The system recognized the query as a Communicational task, retrieved the corresponding contextual description from the facility knowledge base, and provided a concise, human-readable response through the chat interface, while simultaneously displaying the same content in the 3D simulation environment. In the second case, a patient asked, “I am about to have a heart transplant. Tell me about the procedure.” The system identified this as a patient education task belonging to the informational category.
The task executor then retrieved relevant clinical information from the RAG knowledge base and formatted the explanation according to the patient education task specification before delivering it to both the chat interface and the simulation environment. These examples illustrate how the framework can accurately distinguish between task types, access the appropriate information source, and produce coherent, context-specific responses for both clinical staff and patients.
To further evaluate the repeatability and robustness of the proposed framework, repeated trials were conducted across the representative physical, communicational, and informational task scenarios presented in this study, using multiple variations in user commands with different phrasings and sentence structures. For the physical-task workflow, command variations included “Bring medication 1 to John Smith” and “Bring medication 1 to patient room 1 with dim visual alerts and lights off.” For the communicational workflow, command variations included “It is my first day. Tell me about your knowledge of this facility.” and “I am new here. Explain your understanding of this facility.” For the informational workflow, command variations included “I am about to have a heart transplant. Tell me about the procedure.” and “Explain heart transplant recovery and care instructions for a patient.” Each representative scenario was repeated 20 times using these command variations. Across the communicational and informational workflows, all repeated trials successfully generated the intended task classifications and responses. For the physical-task planning workflow, which involves longer structured JSON outputs and coordinated robot/facility actions, malformed JSON output was observed in 1 of 20 trials. In that trial, the high-level task reasoning and action sequencing were correct, but the output was not well-formed despite the structured JSON generation prompts and parameter constraints described in the framework.
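The single malformed output (19 of 20 physical-task trials, or 95%, produced well-formed JSON) suggests adding a parse-and-retry guard around the generation step. The sketch below is one possible form of such a guard and is not part of the reported system. Here `generate` stands for any callable that returns the model's raw text, and the limit of three attempts is an arbitrary assumption.

```python
import json
from typing import Any, Callable, Dict, Optional


def extract_json(text: str) -> Dict[str, Any]:
    """Parse the outermost {...} span in a reply, tolerating surrounding text or fences."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found")
    return json.loads(text[start:end + 1])  # JSONDecodeError is a ValueError


def generate_with_retry(generate: Callable[[], str], max_attempts: int = 3) -> Dict[str, Any]:
    """Call the generator until its reply parses as JSON or attempts run out."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return extract_json(generate())
        except ValueError as err:
            last_error = err
    raise RuntimeError(f"no valid JSON after {max_attempts} attempts") from last_error
```

A guard of this kind addresses only formatting failures; it does not check whether a well-formed plan is correct.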
7. Conclusions and Discussion
This study presents an LLM-orchestrated framework for enabling a multifunctional Robotic Health Attendant (RHA) in healthcare environments. The framework illustrates a unified control and communication model in which robot actions and environment interactions can be coordinated. Within this model, the RHA functions as a supportive system within clinical workflows, coordinating with clinicians and the facility to perform physical, communicational, and informational tasks.
The system integrates clinician-written task specifications with structured and unstructured data sources, enabling coordinated orchestration among multiple agents through modular prompt design and reasoning pipelines. This work illustrates that human-written instructions, combined with environmental and patient data, can guide coordinated behaviors across robots and facilities. Through simulation, the proposed framework demonstrates the feasibility of coordinated task execution using natural language as the primary interface.
The key contribution of this study is the establishment of a practical framework for LLM-based task orchestration. The simulation results illustrate that the RHA can execute multi-step tasks based on natural-language specifications, demonstrating feasibility in a controlled simulation environment.
There are, however, clear limitations. The current implementation focuses on high-level reasoning and system orchestration rather than low-level motion control. Apart from the small set of repeated trials reported in Section 6, this study does not include systematic quantitative evaluation such as success rate analysis across tasks, latency measurement, or baseline comparison. Detailed robot manipulation, sensing integration, and ROS-based control will be developed in future work. The reasoning capability of the framework will also be expanded in four stages: (1) human-language rule-based reasoning, (2) multimodal and multi-agent learning, (3) causal reasoning, and (4) collaborative human-AI decision-making.
Future research will extend the framework to assess its practical and perceptual impacts, including reductions in nurse workload and improvements in workflow safety and efficiency. The same framework can be applied to long-term care facilities, nursing homes, and other types of infrastructure, where coordination among humans, robots, and systems is equally essential.
In summary, this study provides a proof-of-concept demonstration of a framework that enables coordinated interaction between robot actions and environment components, showing in simulation that natural language can bridge human intent with robotic and environmental actions in care operations.
8. Safety, Ethical, Regulatory, and Deployment Considerations
The proposed framework is developed as a proof-of-concept in a simulated environment and is not intended for direct clinical deployment. The current implementation does not include safety-critical validation mechanisms for real-world operation. Potential failure modes include incorrect task interpretation, generation of inappropriate action sequences, parameter misassignment, and inconsistencies in LLM outputs. In clinical settings, such errors could lead to safety risks, including incorrect medication handling or unintended interactions with patients and facility systems. To mitigate these risks, future implementations must incorporate human-in-the-loop verification, task-level validation, and access control mechanisms before execution. In particular, the system should require user approval of LLM-generated plans prior to execution, presenting a human-readable summary of the planned actions and parameters for confirmation. Safety-critical rules should also be enforced through hard-coded constraints (e.g., patient–medication matching, access-level restrictions, and restricted action filters) that can restrict unsafe LLM-generated outputs. Strict safeguards are likewise required for handling patient information, and any real-world deployment must comply with healthcare data protection regulations such as HIPAA. Furthermore, clinical deployment of robotic systems performing patient-facing tasks would require regulatory approval (e.g., FDA pathways for medical devices), as well as consideration of informed consent and liability frameworks. These aspects are beyond the scope of this study and remain critical directions for future work.
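The mitigations listed above (hard-coded constraints followed by user approval of a readable plan summary) might be combined as in the following sketch. It is a hypothetical outline of future work, not an existing component of the system. The prescription table, the restricted skill name, the `patient` and `medication` plan fields, and the numeric access rule are all invented for illustration.

```python
from typing import Any, Dict, List

# Hypothetical rule data for illustration only.
PRESCRIPTIONS = {"John Smith": {"medication 1"}}
RESTRICTED_SKILLS = {"facility/unlock_pharmacy"}


def check_plan(plan: Dict[str, Any], user_level: int, required_level: int) -> List[str]:
    """Apply hard-coded rules to an LLM-generated plan and list any violations."""
    violations: List[str] = []
    if user_level < required_level:  # assumed: larger number means broader permission
        violations.append("insufficient access level")
    for action in plan["actions"]:
        if action["skill"] in RESTRICTED_SKILLS:
            violations.append(f"restricted skill: {action['skill']}")
    patient, medication = plan.get("patient"), plan.get("medication")
    if patient and medication and medication not in PRESCRIPTIONS.get(patient, set()):
        violations.append(f"{medication} is not prescribed for {patient}")
    return violations


def summarize_plan(plan: Dict[str, Any]) -> str:
    """Build the human-readable summary shown to the user before execution."""
    lines = [f"Task: {plan['task']}"]
    for number, action in enumerate(plan["actions"], start=1):
        lines.append(f"{number}. {action['skill']} {action.get('robot_params', {})}")
    return "\n".join(lines)


def may_execute(plan: Dict[str, Any], user_level: int, required_level: int,
                approved: bool) -> bool:
    """Release a plan only if it passes every rule and the user has approved it."""
    return not check_plan(plan, user_level, required_level) and approved
```

The ordering matters in this design: the rules run regardless of approval, so a user cannot approve a plan that violates a hard constraint.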
The current implementation relies on a commercial LLM API, which introduces additional practical considerations for real-world deployment, including latency, service availability, cost scalability, and model versioning. These factors may affect system responsiveness and consistency in time-sensitive environments. Addressing these challenges will require system-level strategies such as local model deployment, caching, and robustness mechanisms, which are beyond the scope of this study.