Embodied AI with Foundation Models for Mobile Service Robots: A Systematic Review

Lisondra, Matthew; Benhabib, Beno; Nejat, Goldie

doi:10.3390/robotics15030055


by Matthew Lisondra ¹,*, Beno Benhabib ¹ and Goldie Nejat ¹,²,*

¹ Autonomous Systems and Biomechatronics Laboratory (ASBLab), Department of Mechanical and Industrial Engineering, University of Toronto, Toronto, ON M5S 3G8, Canada

² KITE, Toronto Rehabilitation Institute, University Health Network (UHN), Toronto, ON M5G 2A2, Canada

* Authors to whom correspondence should be addressed.

Robotics 2026, 15(3), 55; https://doi.org/10.3390/robotics15030055

Submission received: 17 January 2026 / Revised: 23 February 2026 / Accepted: 28 February 2026 / Published: 4 March 2026

(This article belongs to the Special Issue Embodied Intelligence: Physical Human–Robot Interaction)


Abstract

Rapid advancements in foundation models, including Large Language Models, Vision-Language Models, Multimodal Large Language Models, and Vision-Language-Action models, have opened new avenues for embodied AI in mobile service robotics. By combining foundation models with the principles of embodied AI, where intelligent systems perceive, reason, and act through physical interaction, mobile service robots can achieve more flexible understanding, adaptive behavior, and robust task execution in dynamic real-world environments. Despite this progress, embodied AI for mobile service robots continues to face fundamental challenges related to the translation of natural language instructions into executable robot actions, multimodal perception in human-centered environments, uncertainty estimation for safe decision-making, and computational constraints for real-time onboard deployment. In this paper, we present the first systematic review of foundation models in mobile service robotics, following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. Using an OpenAlex literature search, we considered 7506 papers spanning the years 1968–2025. Our detailed analysis identified these four main challenges and examined how recent advances in foundation models have addressed them. We further examine real-world applications in domestic assistance, healthcare, and service automation, highlighting how foundation models enable context-aware, socially responsive, and generalizable robot behaviors. Beyond technical considerations, we discuss ethical, societal, human-interaction, and physical design and ergonomic implications associated with deploying foundation-model-enabled service robots in human environments. Finally, we outline future research directions emphasizing reliability and lifelong adaptation, privacy-aware and resource-constrained deployment, as well as the governance and human-in-the-loop frameworks required for safe, scalable, and trustworthy mobile service robotics.

Keywords: large language models; mobile service robots; foundation models; AI-enabled robotics

1. Introduction

In recent years, there have been rapid advancements in foundation models, including Large Language Models (LLMs) [1,2,3,4,5,6], Vision-Language Models (VLMs) [7,8,9], Multimodal Large Language Models (MLLMs) [10,11], and Vision-Language-Action models (VLAs) [12,13,14]. These models are significantly changing the overall field of robotics and its numerous applications. In particular, there have been notable improvements in personalized task management and context-aware smart assistant capabilities for domestic assistance [15,16,17], healthcare [18,19,20,21], and service automation [22,23]. Foundation models are trained on vast datasets to understand and generate human-like text and speech, as well as interpret visual information and reason. For example, language-guided mobile service robots utilizing foundation models can carry out complex instructions [24,25,26], perform fetch-and-carry tasks [27,28,29,30], and/or socially interact [31,32,33] with human users.

As robots are deployed in real-world environments to increasingly interact with humans, the integration of language-based intelligence has become crucial for intuitive and effective communications. In particular, mobile service robots can provide adaptable, context-aware assistance in dynamic and unpredictable environments to achieve various goals, from navigating person-centered environments to object or person detection and identification. Their ability to autonomously interpret human instructions, reason about tasks, and execute actions has enabled them to be deployed in numerous environments, from warehouses [34,35] to hospitals [12,36] and personal homes [26,37,38].

By leveraging the advancements in foundation models, service robots can bridge the gap between AI-driven perception and human-like decision-making, leading to natural human–robot interactions (HRIs) and collaborations in everyday scenarios. For mobile service robots, success also depends on dexterous, compliant physical interactions with people and objects, adherence to HRI constraints (e.g., personal space, idiosyncrasies, behavior predictability), and long-term autonomy in human-centered environments. However, to fully integrate the promising capabilities of foundation models, the open challenges in the development and deployment of embodied AI mobile service robots must be addressed. These open challenges include:

(1): Translation of natural language instructions into executable robot actions: Mobile service robots need to handle ambiguous, incomplete, or colloquial instructions (“bring me that from the other room”) while grounding these instructions in physical navigation and manipulation tasks [39,40,41,42,43]. This can be challenging in domestic and healthcare settings, where instructions may be highly contextualized by the environment or a user’s state (e.g., a patient who is resting, in pain, or under medication may give shortened or unclear commands that require contextual interpretation by the robot). They also require long-horizon, context-aware planning across multiple rooms, dynamic layouts, and human activities [44,45,46].
(2): Multimodal perception: Mobile service robots must integrate multimodal inputs such as vision, speech, and touch while operating in human-centered environments such as homes and hospitals. As they navigate from region to region, they need to be able to adapt to: (1) lighting variations, (2) crowds, (3) occlusions by dynamic people, furniture, and/or medical equipment, and (4) noisy environments with background conversations or sounds including alarms [47,48,49,50].
(3): Uncertainty estimation: As mobile service robots operate in safety-critical, human-facing environments, decisions must be made under partial observability, unpredictable human behaviors, and socially acceptable norms [51,52,53]. Namely, they need to estimate both aleatoric (sensor/environmental) and epistemic (model) uncertainty while ensuring their actions remain socially acceptable. However, current embodied AI methods struggle with uncertainty estimation, leading to overconfident predictions that can risk user safety. For example, a hospital delivery robot might proceed through a crowded hallway despite incomplete LiDAR or vision data, misjudging the proximity of patients or equipment and causing potential collisions [54,55,56,57].
(4): Computational capabilities: Mobile service robots are constrained by onboard hardware, as continuous reliance on remote servers or cloud computation is often infeasible due to latency, connectivity instability, and data privacy concerns in hospitals and homes [58,59,60]. Sensing, perception, and decision-making need to be executed locally on embedded GPUs, NPUs, or other onboard accelerators close to the data source. The challenge lies in meeting the tight compute, energy, and latency budgets of embedded platforms while ensuring safe operation around humans.

In general, foundation models can address the above challenges. They can bridge the gap between high-level human instruction, language understanding and low-level robotic control to enable more intelligent, adaptable, and interactive robots for diverse real-world applications, as shown in Figure 1.

To date, existing surveys have primarily examined either broad, non-task-specific applications in general-purpose robotics [61,62] or language-conditioned manipulation tasks for stationary robotic arms [63]. These surveys have not yet investigated the role of mobility in enabling robots to assist with human-centered tasks. In this paper, we present the first systematic review and analysis on the integration of foundation models in mobile service robotics, examining their role in advancing embodied AI. We introduce a unified architecture view in Figure 2 that illustrates how foundation models can integrate perception, planning, and control within mobile robots operating in human environments. Building on this integrated perspective, we identify research challenges and discuss how foundation models can enable task generalization, dynamic scene understanding, and social compliance. In particular, we focus on the promising applications of domestic assistance, healthcare and service automation for mobile service robots.

2. Methods Used for Identifying the Research Challenges

To quantify how research effort in mobile service robotics has been distributed to date, we followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [64]. Namely, we conducted a literature review analysis using OpenAlex [65], a large-scale scientific indexing platform. We queried and filtered papers related to mobile service robotics using title- and abstract-level keyword searches. The inclusion criteria comprised: (1) papers published from 1968 to 2025, to capture the historical development of embodied robotic intelligence; (2) English-language papers; and (3) papers addressing embodied robotic systems intended for deployment in mobile or mobile-manipulation contexts. The exclusion criteria covered papers focused primarily on (1) autonomous driving and vehicular systems, (2) aerial robotics (e.g., UAVs and drones), (3) surgical robots, (4) underwater robotics, or (5) biological systems, as well as (6) papers lacking embodied robotics. A final total of 7506 papers were included in our systematic review. The identified literature was then grouped by recurring technical themes that consistently emerged across decades of research. A comprehensive description of the PRISMA [64] checklist mapping is provided in Appendix A.

2.1. Search Strategy

The search strategy was designed to systematically capture research relevant to embodied and embedded artificial intelligence in mobile service robotics. The process consisted of three stages: (1) keywords to define the robotics scope, (2) domain-level filtering to remove unrelated robots and application areas, and (3) embodied robot intelligence screening. Each stage involved title- and abstract-level keyword filtering to progressively refine the corpus toward embodied mobile service robotics.

(1): Keyword query: An initial keyword query targeting general autonomous robots was performed over the period 1968 to 2025, bounding the search space from the beginning of robotic autonomy to the present. This query returned 23,210 papers.
(2): Domain-level filtering: Papers were filtered to remove research domains categorically outside the scope of mobile service robotics using the exclusion criteria. This stage ensured the corpus was restricted to ground-based mobile robotic systems. A total of 4530 papers were removed, leaving 18,680 papers.
(3): Embodied robot intelligence screening: The remaining papers were then examined to ensure that they substantively addressed embodied robotic intelligence within the previously defined mobile robotics scope. At this stage, we retained papers that engaged core embodied AI functions in physical robotic systems (e.g., perception-action coupling, language grounding, planning, control, or computational deployment considerations). This refinement step removed 11,174 papers, resulting in a final corpus of 7506 papers included in this systematic mapping review.
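The three screening stages above can be checked with a few lines of arithmetic. The following is a minimal sketch that only uses the counts reported in the text; the variable names are illustrative and not part of the original study.

```python
# Screening counts as reported in the three-stage search strategy.
initial_query = 23_210            # stage 1: papers returned by the keyword query
removed_domain_filter = 4_530     # stage 2: papers removed by domain-level filtering
removed_embodied_screen = 11_174  # stage 3: papers removed by embodied intelligence screening

# Papers remaining after each filtering stage.
after_domain_filter = initial_query - removed_domain_filter
final_corpus = after_domain_filter - removed_embodied_screen

# Fraction of the initial query results that survive to the final corpus.
retention_rate = final_corpus / initial_query

print(f"After domain-level filtering: {after_domain_filter}")
print(f"Final corpus: {final_corpus}")
print(f"Retained from initial query: {retention_rate:.1%}")
```

The intermediate and final counts reproduce the 18,680 and 7506 papers stated above, so roughly one third of the initially retrieved papers remain in the final corpus.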

The resulting corpus represents the primary high-level research challenges in embodied mobile service robotics, spanning perception, language interaction, planning, control, and computational intelligence. These areas reflect broad technical concentrations observed in the existing research.

Please note that for this study, Nano Banana Pro AI Image Generator was used for the sole purpose of generating illustrative conceptual images in selected figures presented, as explicitly disclosed in the respective figure captions and in the Acknowledgments section.

3. Open Challenges of Embodied AI for Mobile Service Robots

We analyzed the final corpus to identify recurring open challenges in mobile service robot research. Based on the systematic mapping described in Section 2, we identified four recurring research challenges in embodied AI for mobile service robots. Table 1 ranks these challenges according to their share of papers in the 7506-paper corpus. Language-to-action mapping ranked the highest (29.18%), reflecting an emphasis on grounding instructions into executable behavior through planning, navigation, and manipulation. Multimodal perception ranked second (28.83%), highlighting the challenge of integrating heterogeneous sensory inputs across varying environments. Uncertainty estimation ranked third (26.71%), driven by concerns related to robustness, safety, and decision-making under ambiguity in human-centered settings. Computational capabilities, while ranked fourth (15.28%), remain a limitation for real-time execution and embedded deployment. Below, we discuss each of the four challenges in more detail and define specific subchallenges for each.
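As a rough consistency check on these proportions, the sketch below converts each reported percentage into an approximate paper count. It assumes that each paper is assigned to exactly one challenge, which the text does not state explicitly; the resulting counts are rounded estimates and may differ slightly from the exact values in Table 1.

```python
CORPUS_SIZE = 7506  # final corpus size reported in Section 2

# Percentage of the corpus per challenge, as reported in the text.
shares_percent = {
    "language-to-action mapping": 29.18,
    "multimodal perception": 28.83,
    "uncertainty estimation": 26.71,
    "computational capabilities": 15.28,
}

# The shares should sum to 100% if the four categories partition the corpus.
total_percent = round(sum(shares_percent.values()), 2)

# Approximate paper counts (an estimate derived here, not taken from the paper).
approx_counts = {
    name: round(CORPUS_SIZE * pct / 100) for name, pct in shares_percent.items()
}

for name, count in approx_counts.items():
    print(f"{name}: ~{count} papers")
print(f"Total share: {total_percent}%")
```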

3.1. Challenge #1: Translation of Natural Language Instructions into Executable Robot Actions

Enabling language-guided mobile service robots to follow high-level natural language commands provided by humans remains a challenge. Key limitations include:

(1): Symbolic-to-embodied mapping and instruction ambiguity: Natural language instructions are typically provided by non-expert human users. These commands are often expressed at a high level (“clean the room,” “get my walking cane”) and may include colloquial, ambiguous, or underspecified phrasing, lacking precise object identifiers, spatial context, or temporal constraints. While classical planners such as STRIPS [39] and SHOP2 [40,41], or logic-based systems (e.g., PDDL) [42,43] attempt to decompose commands into discrete subgoals, they rely on hand-engineered operators that are domain-specific and brittle [66,67,68,69]. This challenge is particularly critical for mobile service robots engaging in HRI with people in different roles who may provide incomplete requests (“bring me that”) or might use gestures and speech simultaneously, requiring robots to adapt ambiguous instructions into executable actions.
(2): Lack of embodied commonsense and awareness of physical constraints: Mobile service robots must not only interpret commands but also reason about constraints in human-centered environments. Many symbolic planners, grounding models, and PDDL-based models lack awareness of robot embodiment and environmental variability [70,71]. While language-only systems or semantic parsers trained on demonstrations can capture basic kinematics or affordances, they are not able to generalize across diverse service contexts [44,45,46]. For example, a patient may ask a robot to “place the food tray on the bed,” but without understanding that the bed is already occupied, the robot could attempt an unsafe or socially inappropriate action. Similarly, commands such as “open the door for my grandson” may be physically infeasible if the robot lacks the dexterity to open a door. In these cases, failing to account for physical constraints can result not only in task failure but also in loss of user trust.
(3): Failure in long-horizon task planning: Many mobile service tasks, such as medication delivery, multi-room navigation, or guided patient assistance, unfold over extended time horizons, requiring robots to reason across multiple steps, maintain working memory, and adapt to evolving conditions [72]. Classical planners such as Simple Task Network (STN) Planning [73] or Fast Forward (FF) [74] perform well in static domains, but they are not able to handle dynamic, human-centered environments, in which unexpected subgoals may emerge due to users’ needs or environmental changes [75,76]. Learning-based models such as Neural Task Graphs (NTGs) [77] and Option-Critic [78] offer flexibility in sequencing actions, yet they can suffer from policy drift, compounding errors, and limited memory retention when deployed in real-world service contexts [79,80,81]. As a result, mobile service robots often cannot maintain task continuity, recover from errors, or adapt their plans in response to user interruptions or environmental changes, leading to cascading task failures during long-horizon service execution.
(4): Inability to handle mid-task requirement changes: Mobile service robots operating in human-centered environments need to be able to respond to mid-execution requirement changes, such as users: (1) changing goals (e.g., “bring me the red cup instead”), (2) reprioritizing actions (e.g., “stop cleaning and answer the door first”), or (3) interrupting tasks (e.g., “leave it and come here now”). These change requests have been addressed in classical STRIPS [39]- or PDDL [42,43]-based planners using execution monitoring frameworks that trigger either full replanning or plan repair within deliberative-reactive control architectures [82]. Classical three-layer deliberative-reactive architectures (e.g., planning, sequencing, and feedback control) allow robots to pause action execution, update the symbolic goal state and task constraints in the planning model, and recompute an updated action sequence. However, they rely heavily on hand-engineered domain operators and brittle state estimators [83]. Plan repair methods were introduced to avoid recomputing an entire plan from scratch when goals or constraints change, instead modifying only the affected portion of the remaining plan while preserving valid steps, thereby reducing computational cost and maintaining behavioral continuity [84]. Incremental and anytime planners (e.g., D*-style algorithms [85]) used in navigation and delivery tasks (be it within homes, hospitals, or public spaces) support responsiveness under changing goals or obstacle configurations by reusing previous search trees and updating only impacted graph regions; however, they are limited primarily to spatial objectives and do not generalize to high-level task restructuring [86]. Task-and-motion planning (TAMP) [87] frameworks for mobile manipulation provide joint symbolic–geometric replanning under revised task constraints, yet can incur computational overhead for real-time onboard deployment [87]. 
Reactive controllers and behavior-tree-based methods enable low-latency interruption handling and fallback behaviors but can only accommodate requirement changes expressible through pre-coded branches, limiting flexibility to designer-specified contingencies [82]. Probabilistic interaction models such as Partially Observable Markov Decision Processes (POMDPs) [80] for dialogue and intent tracking allow belief state updates and clarification under noisy or changing user intent but can scale poorly with state and action complexity and are often simplified for practical use [88]. As a result, these classical approaches struggle to robustly handle frequent, open-ended goal supersession and dynamic intent revision in unconstrained domestic service settings, often requiring conservative task structures or manual intervention.
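To make the plan-repair idea discussed above more concrete, the following is a toy sketch of how a mid-task goal change ("bring me the red cup instead") could be handled by rewriting only the affected remaining steps. The plan encoding, the `repair_plan` function, and the step names are hypothetical and chosen for illustration; a real plan-repair method would also re-check action preconditions and effects, which is omitted here.

```python
from typing import List, Tuple

# A plan step is (action, argument); this flat encoding is an assumption.
Step = Tuple[str, str]


def repair_plan(
    plan: List[Step], next_index: int, old_target: str, new_target: str
) -> Tuple[List[Step], int]:
    """Return a repaired plan and the number of steps that were changed.

    Steps before next_index are already executed and are left untouched.
    Remaining steps are preserved unless they refer to old_target.
    """
    repaired = list(plan[:next_index])
    changed = 0
    for action, argument in plan[next_index:]:
        if argument == old_target:
            # Only steps that depend on the superseded goal are rewritten.
            repaired.append((action, new_target))
            changed += 1
        else:
            repaired.append((action, argument))
    return repaired, changed


plan = [
    ("navigate_to", "kitchen"),
    ("locate", "blue_cup"),
    ("grasp", "blue_cup"),
    ("navigate_to", "living_room"),
    ("hand_over", "blue_cup"),
]

# The user changes the goal after the robot has reached the kitchen.
repaired, changed = repair_plan(
    plan, next_index=1, old_target="blue_cup", new_target="red_cup"
)
print(repaired)
print(f"Steps changed: {changed} of {len(plan)}")
```

In this sketch the two navigation steps are preserved, which mirrors the stated benefit of plan repair: valid steps are reused rather than recomputed from scratch.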

3.2. Challenge #2: Multimodal Perception

Multimodal perception is essential for mobile service robots to interpret complex environments. These modalities often differ in spatial resolution, sampling frequency, and noise characteristics, making real-time sensor fusion and alignment difficult in human-centered environments. We discuss four core limitations below.

(1): Cross-modal representation: Sensory data from various robotic sensors need to be fused into a shared latent space to support real-time reasoning in mobile service robots. Fusion architectures are commonly classified into [89]: (i) early fusion, where raw inputs are concatenated or merged at the pixel level; (ii) late fusion, which aggregates decisions from different modalities; and/or (iii) intermediate fusion, where features from each modality are encoded and combined using attention mechanisms, joint embeddings, or cross-modal transformers. However, for mobile service robots, each fusion type presents unique challenges. Early fusion is sensitive to cluttered environments, background conversations, and variations in lighting [90], since it combines raw, low-level sensory inputs (e.g., pixel intensities or waveform amplitudes) before feature extraction. This means that any noise or occlusion in one modality, such as poor lighting in images or overlapping human speech in audio, can potentially corrupt the entire fused representation, leading to degraded perception accuracy and unstable behavior in mobile service robots. Late fusion can present delayed or inconsistent results in the fused perception output and downstream decision-making when different modalities provide conflicting cues, such as audio indicating a person’s presence while vision temporarily loses track of the person due to occlusions in crowded human spaces, leading to uncertainty for situational awareness [91,92]. Intermediate fusion, which combines encoded feature representations from each modality using attention or joint embeddings, often leads to temporal gaps when aligning robot data streams with different sampling rates, resolutions, and noise characteristics. This can cause failures, for example, in gesture–speech association during HRI [48,49].

The aforementioned challenges are compounded by spatial misalignment and temporal desynchronization when robots (i) follow or guide users across different rooms or (ii) interact with multiple people simultaneously. When a caregiver both speaks (“pass me that medication”) and simultaneously points toward a medicine cart, the robot must associate the speech command with the pointing gesture. If the audio (speech) and visual (gesture) signals are not synchronized, the robot may misinterpret which object the caregiver is referring to, resulting in ineffective assistance. In practice, fusion inconsistencies can cause object association errors, scene parsing failures, and HRI breakdowns. Such failures can directly undermine fine-grained context interpretation, and a service robot’s ability to disambiguate human intent and environmental cues to provide task assistance.

(2): Latency issues: Mobile service robots often rely on heterogeneous sensor streams that operate at different sampling rates, leading to temporal desynchronization [93,94,95]. These robots need to maintain situational awareness in dynamic, human-centered environments where people can have unpredictable motion, and tasks require real-time social responsiveness. Unlike fusion errors, latency stems from delays in updating perception and control loops. For example, a domestic assistant robot may take an unnecessarily long detour, as obstacle information was updated too late during navigation [93]. These delays reduce interaction responsiveness and fluidity, making robots appear slow or hesitant in environments where people expect robots to adapt in real time.
(3): Uncertainty propagation across modalities: Sensor quality may degrade due to lighting changes, occlusions by people, cluttered rooms, and environmental noise [96,97,98]. Uncertainty propagation occurs when sensor noise or degradation in one modality affects the reliability of fused estimates across multiple modalities. In joint estimation pipelines, this problem is magnified for service robots, as it can cause cascading perception errors that directly affect safe and effective HRI. Yet, many fusion methods assume stationary sensor noise models and fixed modality trust, which can result in overconfident predictions or action outputs [99,100]. Overconfidence can lead to unsafe behaviors for mobile service robots. For example, a robot might continue to navigate confidently through a dim hallway despite losing its visual information. Adaptive confidence weighting, which assigns modality-specific weights based on online reliability metrics [101,102], remains underdeveloped for real-world deployment of mobile service robots due to key limitations such as: (i) high computational overhead incompatible with embedded edge hardware, (ii) assumptions of static environments, and (iii) a lack of robustness to abrupt sensor failures and scene changes.
(4): Domain adaptation and transferability issues: Early, late, and intermediate fusion models are often trained on static, well-lit indoor datasets and degrade when deployed in the cluttered households, dim hospital corridors, or dynamic outdoor and public venues in which mobile service robots operate [103,104]. This domain gap makes it difficult for models to generalize once deployed in cluttered, dynamic environments with multiple surface types (e.g., hospitals with glass doors and walls). Domain adaptation methods such as domain randomization [105], self-supervised online learning [106], and adversarial style transfer [107] face limitations in mobile service robot deployments. Transferability is further hindered when models are overfit to specific training conditions. Mobile service robots cannot retrain models in real time during operation, making them vulnerable to perceptual drift. Thus, robots can lose accuracy when exposed to novel environment setups, such as different layouts in long-term care homes, outdoor seasonal or weather-related lighting variations, or different human movement patterns.
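The idea of adaptive confidence weighting mentioned in limitation (3) can be illustrated with inverse-variance weighting, a standard fusion rule that is used here only as a stand-in and is not claimed to be the method of refs. [101,102]. The sketch assumes that each modality already reports a variance for its estimate (the "online reliability metric"); the function name, the sensor values, and the variances are all hypothetical.

```python
from typing import List, Sequence, Tuple


def fuse_inverse_variance(
    estimates: Sequence[float], variances: Sequence[float]
) -> Tuple[float, float, List[float]]:
    """Fuse scalar estimates, weighting each one by 1 / variance.

    Returns the fused estimate, its variance, and the normalized weights.
    """
    if len(estimates) != len(variances) or not estimates:
        raise ValueError("estimates and variances must be non-empty and aligned")
    if any(v <= 0.0 for v in variances):
        raise ValueError("variances must be strictly positive")
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    weights = [p / total_precision for p in precisions]
    fused = sum(w * x for w, x in zip(weights, estimates))
    return fused, 1.0 / total_precision, weights


# Hypothetical obstacle distance (metres) from a camera and a LiDAR.
camera_m, lidar_m = 2.0, 2.4

# Well-lit hallway: both modalities are assumed equally reliable.
fused_lit, var_lit, weights_lit = fuse_inverse_variance(
    [camera_m, lidar_m], [0.04, 0.04]
)

# Dim hallway: the camera variance is assumed to grow, so its weight drops.
fused_dim, var_dim, weights_dim = fuse_inverse_variance(
    [camera_m, lidar_m], [0.36, 0.04]
)

print(f"Well-lit: fused={fused_lit:.2f} m, weights={weights_lit}")
print(f"Dim:      fused={fused_dim:.2f} m, weights={weights_dim}")
```

With fixed modality trust, the camera would keep half of the weight in the dim hallway; with variance-based weights, the fused estimate shifts toward the LiDAR reading. The sketch still presumes that the reported variances are trustworthy, which is exactly what breaks down under abrupt sensor failures.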

3.3. Challenge #3: Uncertainty Estimation

Uncertainty is inherent in language-guided mobile service robotics, arising from sensor noise, human ambiguity, partial observability, and/or dynamic environments [54,57,108,109]. Uncertainty estimation in mobile service robots has three core limitations: lack of explicit uncertainty quantification, failure in long-horizon estimation, and uncertainty in HRI.

(1): Lack of explicit uncertainty quantification: Mobile service robots often lack the ability to estimate confidence in their decisions. While probabilistic methods explicitly model confidence [54,108], they are computationally intensive for real-time robotic applications. As a result, many service-oriented models rely on rule-based pipelines [55] or standard deep learning architectures [56,57] that do not provide calibrated measures of uncertainty [110]. Without calibrated confidence, a robot cannot tell when a decision is unreliable, which can cause failures during deployment.
(2): Failure in long-horizon uncertainty estimation: Reasoning about how uncertainty accumulates over extended time horizons is critical for service tasks. Traditional state estimation and planning methods, such as Kalman Filters [111,112], belief-space planners, and sampling-based approaches [113], do not model how uncertainty accumulates [114,115], which may lead to underestimating uncertainty as prediction horizons increase [116,117]. Belief-space motion planners [118] and sampling-based approaches such as Partially Observable Sparse Samplers [113] can become computationally intractable as the planning horizon increases [79,80,81]. For mobile service robots, this limitation directly impacts their ability to anticipate cascading errors, recover from uncertain states, or adapt task plans over time. In practice, robots may overcommit to long navigation paths that are not valid in dynamic or crowded environments.
(3): Uncertainty in HRI: A central challenge for service robots is managing uncertainty in user-facing decisions, where ambiguous verbal commands, nonverbal gestures, or multimodal cues must be interpreted in real time. To date, the majority of robots engaged in HRI lack the ability to communicate uncertainty or proactively request clarifications, which often results in social errors and violations of social etiquette rules [119,120,121,122]. These failures typically stem from misaligned intent interpretation, where the robot’s confidence is overestimated despite incomplete or noisy human input. Traditional approaches, including plan-based dialog management frameworks [51], behavior-tree-based HRI controllers [52], and POMDP-based dialog managers [53], often assume fixed interaction protocols and lack the ability to model, quantify, or express uncertainty in real time [123]. Recent studies stress that effective HRI requires not only technical uncertainty quantification but also the ability to express uncertainty in socially meaningful ways, for example, pausing, seeking clarification, or adapting conservatively when unsure [124,125].
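One simple way to picture "seeking clarification when unsure" is to threshold the entropy of the robot's belief over candidate user intents. The sketch below is an illustration under stated assumptions and not a method from the reviewed literature: the intent labels, the probabilities, and the 0.6 threshold are hypothetical, and the probabilities are assumed to be reasonably calibrated, which, as noted above, is often not the case in practice.

```python
import math
from typing import Dict


def normalized_entropy(probabilities: Dict[str, float]) -> float:
    """Shannon entropy of a distribution, scaled to the range [0, 1]."""
    n = len(probabilities)
    if n < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probabilities.values() if p > 0.0)
    return entropy / math.log(n)


def choose_response(intent_probabilities: Dict[str, float], threshold: float = 0.6) -> str:
    """Ask for clarification when the belief over intents is too spread out."""
    if normalized_entropy(intent_probabilities) > threshold:
        return "ask_clarification"
    # Otherwise act on the most probable intent.
    return max(intent_probabilities, key=intent_probabilities.get)


# "Bring me that": the belief is spread over several candidate objects.
ambiguous = {"fetch_cup": 0.40, "fetch_remote": 0.35, "fetch_book": 0.25}

# "Bring me the cup on the table": the belief is concentrated on one intent.
clear = {"fetch_cup": 0.90, "fetch_remote": 0.05, "fetch_book": 0.05}

print(choose_response(ambiguous))
print(choose_response(clear))
```

The threshold trades off interruptions against errors: a lower value makes the robot ask more often, while a higher value makes it act on weaker evidence.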

3.4. Challenge #4: Computational Capabilities

Mobile service robots have limited onboard computational power, making real-time inference with large-scale symbolic and deep learning models challenging. Key limitations include:

(1): Perception and planning computational overhead: Large-scale deep learning models often require significant memory [126,127] and computational resources [128]. For example, models with convolutional backbones [126] or 3D segmentation networks [127] can exceed the real-time processing capacity of edge computing units commonly used in mobile service robots. Classical planning frameworks [129] and the Dynamic Window Approach (DWA) planner [130] were designed for lightweight CPU-based inference with simplified 2D cost maps. While efficient, they are not able to scale to the rich semantic representations needed in human-centered environments, such as multimodal feature maps or RGB-D affordance graphs [131,132]. Addressing this overhead requires combining efficient model compression and on-chip physical AI with selective edge-cloud collaboration, while maintaining the responsiveness necessary for safe, trustworthy HRI.
(2): Lack of adaptive resource allocation for real-time inference: Most classical mobile service robot frameworks such as traditional navigation stacks [133] and behavior-based control architectures [134] process sensor data at fixed input resolutions and allocate static CPU or GPU budgets, regardless of changing scene complexity or service robot task urgency [135,136]. This can leave onboard resources underused during routine tasks, while risking resource overload when service robots face interaction-heavy applications or cluttered environments [137]. Adaptive techniques have been explored in visual pipelines [138,139] and mapping frameworks such as ElasticFusion [140] to dynamically adjust computation based on current sensory load or scene complexity. However, these approaches are heuristic and do not generalize across the full perception-planning-control stack required for continuous operation in dynamic human-centered environments. They typically optimize a single subsystem (e.g., perception), without coordinating resource allocation across downstream planning and control modules that must respond to the same real-time constraints [141,142,143]. This results not only in technical inefficiency but also in reduced service quality. For example, in cluttered domestic environments, increased perception load from dense visual scenes can monopolize computational resources, leaving insufficient capacity for robot motion planning or safety monitoring, leading to hesitations or unsafe navigation behaviors.
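The coordination problem described in limitation (2) can be sketched as a per-cycle budget split in which safety monitoring and a minimum planning slice are reserved first, and perception then takes the most expensive setting that still fits. This is a toy heuristic for illustration only: the function name, the millisecond costs, and the linear scaling of perception cost with a scene-complexity factor are all assumptions, not measurements or a method from the reviewed work.

```python
from typing import Dict, List, Tuple, Union


def allocate_cycle_budget(
    cycle_budget_ms: float,
    safety_ms: float,
    planning_min_ms: float,
    perception_options: List[Tuple[str, float]],
    complexity_factor: float = 1.0,
) -> Dict[str, Union[str, float]]:
    """Pick a perception setting that leaves room for planning and safety.

    perception_options holds (label, base_cost_ms) pairs. Costs are scaled by
    complexity_factor to mimic heavier processing in cluttered scenes.
    """
    available_ms = cycle_budget_ms - safety_ms - planning_min_ms
    scaled = [(label, cost * complexity_factor) for label, cost in perception_options]
    feasible = [option for option in scaled if option[1] <= available_ms]
    if not feasible:
        raise ValueError("no perception setting fits within the cycle budget")
    # Use the most expensive (assumed highest-quality) setting that still fits.
    label, perception_ms = max(feasible, key=lambda option: option[1])
    return {
        "perception_setting": label,
        "perception_ms": perception_ms,
        "planning_ms": cycle_budget_ms - safety_ms - perception_ms,
        "safety_ms": safety_ms,
    }


options = [("640x480", 20.0), ("1280x720", 45.0), ("1920x1080", 90.0)]

# Routine scene: a mid-resolution setting fits within the 100 ms cycle.
routine = allocate_cycle_budget(100.0, 10.0, 30.0, options, complexity_factor=1.0)

# Cluttered scene: perception cost doubles, so the allocator drops resolution
# instead of letting perception starve planning and safety monitoring.
cluttered = allocate_cycle_budget(100.0, 10.0, 30.0, options, complexity_factor=2.0)

print(routine)
print(cluttered)
```

The design choice here is that the safety slice is never traded away: perception quality is the quantity that degrades under load, which is the opposite of the failure case described above.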

Figure 3 situates these four challenges within the broader evolution of embodied AI for mobile service robots. The left portion of the timeline (“Pre-Foundation Model Era”) highlights how classical pipelines, ranging from STRIPS [39] and PDDL [42,43] planners to DWA [130], LSD-SLAM [98], and Move Base [133], addressed multimodal perception, language-to-action translation, uncertainty estimation, and computational constraints with task-specific and brittle solutions. The middle segment (“Rise of Foundation Models”) represents the emergence of large pre-trained models such as GPT [144], BERT [145], CLIP [146], and early robotics integrations (e.g., DeiT [147], SayCan [12], OpenVINS [94]), where scaling laws and transfer learning began to influence service robotics. The right segment (“Foundation Model Era”) shows recent LLMs, VLMs, MLLMs, and VLAs (e.g., GPT-4 [148], DeepSeek-R1 [149], Magma [150], OpenVLA [151], Aether [152], π₀ [153,154]) that directly target the four challenges, as indicated by the color-coded arrows for challenges #1–#4. This progression motivates Section 4, where we discuss and analyze how foundation models systematically address the challenges and their limitations in more detail.

4. Opportunities in Mobile Service Robots for Foundation Models

This section outlines how foundation models can address the four core challenges identified in Section 3.

To contextualize how foundation models have contributed to embodied intelligence, Figure 4 organizes existing mobile service robot capabilities into three functional domains: (1) language communication, (2) vision-language navigation, and (3) manipulation and organization. Language communication capabilities (green box) enable robots to interact with users through language grounding and interactive question answering, supporting clarification, intent inference, and context-aware dialogue. Vision-language navigation capabilities (red box) support multimodal spatial reasoning for both autonomous visual goal reaching and language-guided navigation in complex environments. Manipulation and rearrangement capabilities (orange box) capture object-centric physical interaction, including object and scene manipulation (e.g., grasping, tool use, placing), as well as spatial rearrangement of objects within an environment (e.g., tidying, organizing, or reconfiguring layouts according to task goals). These capabilities go beyond isolated grasps to include sequencing, spatial reasoning, and task-level organization of objects, which are essential for everyday service tasks such as fetching, cleaning, and household reorganization. Together, these three domains provide a practical behavioral decomposition of how foundation models enable mobile service robots to perform meaningful, real-world assistance.

4.1. Addressing Challenge #1: Translation of Natural Language Instructions into Executable Robot Actions

Understanding and executing natural language instructions remains a core challenge for mobile service robots. In response, foundation models leverage language-conditioned policies [12], vision-language alignment [155,156], and multimodal reasoning [157,158] to bridge the gap between symbolic language and embodied robotic control. In particular, recent foundation models have advanced instruction-following in mobile service robots by addressing the symbolic-to-embodied gap [146,148,157,158,159,160,161,162,163,164,165], incorporating physical commonsense [166,167], and supporting long-horizon planning and adaptation [168,169,170]. We discuss these key advances below.

(1): Symbolic-to-embodied mapping and instruction ambiguity: Non-expert users can provide vague or underspecified requests that mobile service robots must translate into precise, embodied actions. Unlike classical planners that depend on brittle rule sets [171], foundation models learn shared semantic spaces where language, perception, and control are jointly represented. This enables robots to resolve ambiguities in context-sensitive ways, for instance, interpreting a parent’s request to “put this somewhere safe” in a home by linking it to spatial cues like a shelf or cabinet. LIMO [159], a reasoning-focused language model, extends this ambiguity-resolving capability by retaining semantic context across steps, allowing robots to adapt dynamically when users clarify or modify instructions mid-task. VLMs and MLLMs further support disambiguation by grounding colloquial or multi-modal inputs in real-world layouts, ensuring that actions remain compliant with HRI norms such as safe distances, legibility of motion, and responsiveness to feedback during task implementation.
(2): Lack of embodied commonsense and awareness of physical constraints: In general, mobile service robots can use foundation models to address the absence of physical commonsense by incorporating (1) physics-based priors [172], (2) affordance reasoning [166], and (3) constraint-aware planning [173] into language-grounded policy architectures. These capabilities address the embodied AI gap where robots often lack real-world intuition about object properties, support stability, or task feasibility, which is critical in mobile service contexts. For example, Genesis [172], a generative physics simulation foundation model, encodes differentiable physics into object-centric representations, allowing robots to mentally simulate stacking, collision, or deformation outcomes before acting. This prevents unsafe task errors, such as placing medical supplies on unstable surfaces. By combining affordance reasoning with constraint-aware planning, foundation models enable mobile service robots to anticipate whether actions such as “reposition the IV pole beside the bed without tipping it” or “push the luggage cart through a narrow gate” are physically realizable given their embodiment. This awareness is especially critical in caregiving and domestic contexts, where failure to respect physical and social constraints can endanger others.
(3): Failure in long-horizon task planning: Long-horizon service tasks require mobile service robots to maintain memory, adapt to interruptions, and replan under uncertainty. Foundation models address these challenges by: (1) decomposing high-level language into structured subtasks [168,169], (2) retaining goal-conditioned histories to preserve context [169,170], and (3) revising plans in response to dynamic sensory input. For instance, Code-as-Policies [168], a code-writing LLM, can translate robot assistive commands into modular control routines that allow task branching, while LLM-Planner [169], an LLM-based planner, adapts ongoing navigation when new verbal clarifications arise. Similarly, s1 [170], a reasoning-optimized LLM, dynamically adapts reasoning depth to prevent cascading errors in extended sequences. As shown in Table 2, these advances allow mobile service robots to sustain coherent, user-aligned task progression in complex environments, whether completing multi-room domestic routines (e.g., fetch-and-deliver sequences that span kitchen → living room → bedroom) or replanning mid-shift in hospitals during time-sensitive medical delivery under corridor congestion and changing staff priorities.

(4): Inability to handle mid-task requirement changes: With the emergence of LLM-, VLM-, VLA-, and MLLM-based systems, mid-task requirement changes are increasingly modeled as conversational constraint updates that trigger closed-loop replanning rather than full task restarts. Instead of assuming a fixed goal specification, recent language-guided robot frameworks maintain an explicit dialogue-conditioned task state, allowing revised user intent to supersede earlier constraints during task execution [178,179,180]. Skill-based architectures such as SayCan [12] dynamically rescore or reselect low-level skills when user priorities change, enabling robots to adapt action sequences online without invalidating the full plan. Code-synthesis approaches (e.g., CLIP-based Code-as-Policies (CLIP-CAPs) [168]) further support mid-task revisions by regenerating reactive policy code that embeds perception, feedback, and control logic, allowing updated requirements to be recompiled into executable programs during execution. Furthermore, feedback-driven planners such as Inner Monologue-style reasoning frameworks [25] extend their reasoning capability by incorporating execution outcomes, scene descriptions, and user corrections into iterative reasoning loops, enabling continuous plan refinement under evolving constraints [180]. Building upon this closed-loop paradigm, recent corrective planning frameworks (e.g., CoPAL [181]) categorize failures into logical, semantic, or grounding errors and apply targeted repair strategies during execution, while VLM-grounded planners (e.g., RePlanVLM [182]) re-evaluate the environment state to replan long-horizon tasks when goals or conditions change. These advances allow mobile service robots to maintain task continuity, safely revise goals during execution, and remain aligned with the user’s latest intent without requiring full task restarts, thereby improving robustness in dynamic domestic, healthcare, and service automation environments.
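The skill rescoring described in item (4) can be sketched as follows, loosely in the spirit of the behavior attributed to SayCan [12] above. This is an illustration under stated assumptions, not the published implementation: `language_score` is a hypothetical word-overlap stand-in for a language model's relevance estimate, and the affordance values are invented placeholders for a learned estimate of whether a skill can succeed in the current state.

```python
def language_score(skill: str, instruction: str) -> float:
    """Stand-in for an LLM relevance estimate (an assumption, not a real model):
    the fraction of the skill's words that appear in the instruction."""
    skill_words = skill.split()
    instr_words = set(instruction.lower().split())
    return sum(w in instr_words for w in skill_words) / len(skill_words)


def select_skill(instruction: str, affordance: dict[str, float]) -> str:
    """Pick the skill with the highest (language score x affordance) product."""
    return max(
        affordance,
        key=lambda s: language_score(s, instruction) * affordance[s],
    )


# Hypothetical affordance values: how likely each skill is to succeed right now.
affordance = {"pick up cup": 0.9, "pick up bowl": 0.8, "go to kitchen": 0.6}

first = select_skill("please pick up the cup", affordance)
# The user revises the request mid-task; only the next choice is rescored,
# so steps that were already completed do not need to be replanned.
revised = select_skill("actually pick up the bowl instead", affordance)
```

The revised instruction changes which skill scores highest without restarting the task, which is the closed-loop behavior the paragraph describes.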

Prior approaches to language-guided robot planning such as STRIPS [39], PDDL [42,43], STN [73], and FF [74] relied predominantly on hand-engineered symbolic planners, task-graph formulations, and classical heuristic planning. While effective in static and well-specified domains, these methods were brittle to linguistic ambiguity, underspecified user instructions, and dynamic task variation [39]. Learning-based sequencing methods such as NTGs [77] and option-based policies [78] introduced greater flexibility, but still suffered from limited memory, compounding errors, and weak grounding when deployed in long-horizon mobile service tasks. Recent foundation-model approaches have begun to bridge the gap between symbolic reasoning and embodied execution, as shown in Table 2. In particular, CLIP-CAP [168] explicitly links vision-language representations to executable control programs, allowing natural language commands to be translated into modular, interpretable robot code that directly invokes perception, navigation, and manipulation primitives. Through this design, CLIP-CAP [168] achieves the highest reported real-robot manipulation success (71%) among the evaluated models, while also supporting task branching and online replanning during execution. This capability marks a shift beyond earlier brittle planning pipelines towards more adaptive, language-driven mobile service robot behaviors.

To provide interpretability for deployment, Table 2 includes a ‘Suitable Scenarios’ column that maps each foundation model to the service environments where its architectural trade-offs are most appropriate (e.g., domestic assistance, healthcare, and/or service automation).

4.2. Addressing Challenge #2: Multimodal Perception

Robust multimodal perception is essential for interpreting complex environments. Particularly, multimodal VLMs and MLLMs can address this challenge by providing pre-trained multimodal representations of visual, textual and spatial features that enable the integration of heterogeneous sensory data through unified representation learning, temporal synchronization, uncertainty-aware fusion, and domain generalization [152,183]. We discuss below the key advances of leveraging foundation models for this challenge.

(1): Cross-modal representation gap: Foundation models mitigate the fragmentation of sensory streams in mobile service robots by aligning multimodal data into unified latent spaces using (1) cross-modal tokenization [146,150], (2) spatial encoding [152], and/or (3) attention-based fusion [184] to create representations that not only synchronize perception but also remain socially meaningful. Unlike robots that operate in structured factories or warehouses, mobile service robots must integrate heterogeneous inputs (vision, speech, LiDAR, tactile, and physiological signals) in real time, while interacting with humans in cluttered or crowded environments. Magma [150] is a transformer-based multimodal foundation model designed to bridge perceptual understanding and embodied actions through Set-of-Mark (SoM) and Trace-of-Mark (ToM) prompting, which convert visual scenes into actionable spatial tokens and predict future motion trajectories. By integrating images, language commands, and robot traces into a joint token space, Magma enables unified cross-modal grounding and planning. This allows a service robot deployed in a hospital to correctly associate a nurse’s spoken instruction with a patient’s movement, while in retail environments the same model can interpret gestures such as pointing and ground expressions like “that one over there” on crowded shelves. By unifying perception and interaction modes, foundation models can reduce representational misalignment that can result in failures in object association, gesture-speech coordination, and/or scene parsing during real-time HRI.
(2): Latency issues: Strategies for reducing latency in the service robot context include: (1) learning time-aware representations [185], (2) aligning asynchronous inputs [186], and (3) dynamically adjusting modality-specific update rates [187]. For learning time-aware representations, video foundation models such as VideoJAM [185] and InternVideo2 [186] jointly model visual appearance and optical-flow motion across time, enabling coherent spatiotemporal grounding of asynchronous frames. In a healthcare environment, for example, such models allow a robot to detect in real time when a patient begins to stand up from a chair, triggering immediate stabilization assistance or verbal support, before a fall risk escalates. In office environments, object-centric physical modeling frameworks further help synchronize sensor inputs during long hallway robot navigation when groups of employees are sharing the space while having unpredictable movements, enabling timely monitoring and smooth flow management. These advances allow mobile service robots to have real-time responsiveness in environments where even small delays can undermine safety, trust, and quality of assistance.
(3): Uncertainty propagation across modalities: Mobile service robots must assess the reliability of multimodal inputs in real time, since sensor degradation in one channel can otherwise cascade into unsafe actions. Foundation models enable uncertainty-aware fusion, where each modality’s contribution is dynamically weighted by estimated confidence [188]. For example, DeepSeek-R1 [149], a reinforcement-trained LLM, and the Segment Anything Model (SAM) [164] and SAM-2 [174], vision foundation models, incorporate confidence-driven suppression of low-quality visual regions and reinforcement-based reasoning to avoid acting on uncertain observations. These mechanisms can be directly leveraged in service environments: in restaurant service contexts, confidence-aware fusion can prevent unsafe misinterpretation of overlapping table orders and simultaneous calls. In service automation, such as busy airports or malls, adaptive reweighting allows robots to filter noisy crowd signals while sustaining situational awareness. These strategies enable mobile service robots to maintain robust perception and socially safe interaction even when modalities degrade.
(4): Domain adaptation and transferability: Service robots need to generalize across high variability, from clutter and lighting variations to reflective areas, reconfigurable layouts, and dense crowds. Unlike factory or lab-trained models, foundation model-based perception systems deployed in service robots cannot be continuously retrained during operation due to privacy, compute, and real-time performance constraints. Therefore, strong domain transfer capabilities are essential. Foundation models address this need through (1) self-supervised pretraining [189,190], which allows perception systems to adapt without large labeled datasets; (2) cross-domain token alignment [183,191], which stabilizes multimodal features when sensor inputs vary across environments; and (3) few-shot generalization [173,192], which dynamically adjusts tokenization for domain-flexible perception. These capabilities help mobile service robots obtain reliable recognition and reasoning under changing environments. This reduces perceptual drift and ensures continuity of safe, trustworthy interaction over long-term deployments.
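The confidence-weighted fusion described in item (3) can be illustrated with a minimal sketch. It assumes each modality already reports a scalar estimate (here, distance to a person in meters) together with a confidence in [0, 1]. The modality names, readings, and the `min_conf` threshold are hypothetical, and the cited models learn such weightings rather than applying a fixed rule like this one.

```python
from typing import Optional


def fuse(
    estimates: dict[str, tuple[float, float]],
    min_conf: float = 0.2,
) -> tuple[Optional[float], list[str]]:
    """Confidence-weighted average of per-modality (value, confidence) pairs.

    Modalities whose confidence falls below min_conf are suppressed entirely.
    Returns the fused value and the names of the modalities that were used.
    """
    kept = {name: vc for name, vc in estimates.items() if vc[1] >= min_conf}
    if not kept:
        # Nothing trustworthy: the caller should fall back to a safe behavior.
        return None, []
    total = sum(conf for _, conf in kept.values())
    value = sum(val * conf for val, conf in kept.values()) / total
    return value, sorted(kept)


# Hypothetical readings: the camera is degraded (e.g., glare), so it is dropped.
readings = {"camera": (2.0, 0.1), "lidar": (1.5, 0.9), "ultrasound": (1.7, 0.3)}
distance, used = fuse(readings)
```

Returning `None` when every modality is unreliable makes the degraded case explicit, so a downstream planner can slow down or ask for help instead of acting on a poor estimate.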

Prior to foundation models, mobile service robots struggled with cross-modal alignment, noisy sensing, and limited generalization across objects and environments [47,48,49]. The perception results shown in Table 2 demonstrate that foundation models can address these concerns. Namely, SAM-2 [174] achieves the highest visual perception accuracy with real-time performance, while Magma [150] improves semantic grounding through unified image-language embeddings that maintain competitive sensing at low compute cost.

4.3. Addressing Challenge #3: Uncertainty Estimation

Robust uncertainty estimation enables mobile service robots to work in unpredictable environments, where sensor noise, ambiguous input, and unexpected events can compromise decision-making. Foundation models address this challenge by supporting explicit confidence modeling, long-horizon risk prediction, and uncertainty-aware HRI [144,175,193,194,195,196,197,198,199,200,201,202,203]. We discuss these below.

(1): Lack of explicit uncertainty quantification: In human-facing environments, robots must be able to avoid unsafe or socially awkward actions. Foundation models support this through (1) reinforcement-guided value modeling [149], (2) adaptive attention-based reweighting [144,175,193], and (3) temporal uncertainty tracking [194]. The LLM DeepSeek-R1 [149] calibrates confidence using reinforcement signals, while the MLLM GPT-4o [144] models token-level reliability. These capabilities allow mobile service robots to cautiously perform actions in everyday contexts. Benchmarks such as Humanity’s Last Exam (HLE) [204] and MANNERS-DB [205] further highlight the importance of expressing uncertainty in socially acceptable ways, such as apologizing before asking for clarification or signaling doubt through tone and timing, which is essential for maintaining trust in domestic and public service interactions.
(2): Failure in long-horizon uncertainty estimation: Planning over extended time horizons presents a major challenge for mobile service robots, as perception or reasoning errors can accumulate into significant risks during prolonged operation. Foundation models mitigate this through latent world modeling and temporal abstraction [195,196,197], which simulate outcomes in compressed latent space to anticipate risks such as localization drift, occlusion, or sensor degradation before unsafe actions are executed. Vision-based self-supervised models such as MAE [198] and Swin Transformer [199] further enhance robustness by reconstructing corrupted spatiotemporal patches, and improving confidence tracking under partial observability. In service applications, this enables long-horizon replanning for robotic tasks such as guiding visitors through multi-floor buildings where the availability of elevators or access restrictions can vary dynamically. By modeling uncertainty accumulation explicitly, these approaches support continuity of assistance and maintain reliability during extended, socially embedded tasks.
(3): Uncertainty in human–robot interactions (HRIs): Effective HRI requires mobile service robots to detect uncertain or ambiguous human behaviors such as vague speech, inconsistent gestures, or shifting gaze, and to communicate their own uncertainty clearly during task execution. Foundation models support this by integrating behavior forecasting, social intent grounding, and natural language grounding. For example, Lumiere [200], a diffusion-based generative video model, generates motion-consistent forecasts of human activity, enabling robots to anticipate possible human actions and avoid socially disruptive responses. In a domestic setting, this may involve the robot briefly pausing to signal uncertainty and then proactively asking the user a clarifying question, for example, “Did you mean the unwashed bowls on the table or the ones by the sink?” thereby preventing incorrect actions while maintaining smooth interaction flow.

Traditional mobile service robots often did not account for uncertainty, leading to unsafe behaviors and brittle recovery when unexpected human interactions occurred [54,55]. In contrast, recent foundation models demonstrate measurable progress toward calibrated reasoning under uncertainty. For instance, DeepSeek-R1 [149] reports the lowest calibration error (~81.8%) on the HLE benchmark [204], indicating closer alignment between predicted confidence and actual correctness, whereas GPT-class [144] and LLaMA-class [177] models generally exhibit higher calibration error values (~92%), demonstrating overconfidence. As summarized in Table 2, this improvement suggests that certain foundation models are beginning to quantify and communicate uncertainty more reliably, an ability that is essential for safe, trust-preserving service robot deployment in real-world environments.
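To make the notion of calibration error concrete, the following minimal sketch computes a standard binned expected calibration error (ECE): the gap between stated confidence and observed accuracy, averaged over confidence bins and weighted by bin size. The benchmark's exact metric definition may differ from this common variant, and the confidence and correctness values are invented for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |average confidence - accuracy| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += len(b) / n * abs(avg_conf - accuracy)
    return ece


outcomes = [1, 0, 0, 0]  # only one of four answers is correct
overconfident = expected_calibration_error([0.95] * 4, outcomes)
calibrated = expected_calibration_error([0.25] * 4, outcomes)
```

Both hypothetical models have the same accuracy here. Only the first claims near-certainty, and that mismatch is what a high calibration error captures.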

4.4. Addressing Challenge #4: Computational Capabilities

Mobile service robots often operate under constrained budgets, making real-time perception and planning with large deep learning and symbolic models difficult. Foundation models address this challenge by enabling scalable model compression [206,207], efficient attention distillation [179,208], adaptive token processing [209,210], and on-device execution [211,212]. These advances allow mobile robots to maintain real-time responsiveness, energy efficiency, and task throughput even under strict power, memory, and latency constraints. We discuss these key advances below.

(1): Perception and planning computational overhead: Real-time perception and planning can often exceed the computational limits of embedded mobile service robot systems. This creates latency bottlenecks that impair decision-making during navigation, interaction, or manipulation. Foundation models address computation limits through (1) architectural streamlining [151,213], (2) model compression [206,207], and (3) efficient attention mechanisms tailored for edge devices [179,208]. The LLM EdgeFormer [213] applies dynamic token control to sustain responsiveness during dialogue-driven task execution. The VLM DeiT [147,206], a distilled vision transformer, compresses large transformer architectures into compact models suitable for identifying small objects in cluttered environments. The MLLM VisionLLaMA [207] adapts LLaMA-style transformers for lightweight multimodal reasoning. These advances reduce inference latency while preserving reliability, allowing mobile service robots to operate efficiently under power and memory constraints while maintaining safe, socially responsive interactions.
(2): Lack of adaptive resource allocation for real-time inference: Efficient task execution under fluctuating computational loads is essential for mobile service robots, which must remain responsive despite strict latency and energy limits. Foundation models address this challenge through adaptive computation strategies that allocate resources in proportion to service robot task complexity. For example, the LLMs Switch Transformer [209] and GShard [210] use Mixture-of-Experts (MoE) architectures to selectively activate the modules required for the current context, conserving energy during routine operations while scaling up capacity during socially or perceptually demanding interactions. This dynamic allocation directly enhances service quality: a domestic assistant robot can transition from monitoring to full-capacity multimodal parsing of requests. By dynamically rebalancing inference loads, foundation models enable mobile service robots to maintain timely, safe, and socially fluent interactions.
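The selective activation described in item (2) can be sketched with top-1 expert routing, the general idea behind Mixture-of-Experts layers such as those in Switch Transformer [209]. This is a toy illustration under stated assumptions: real routers are learned layers operating on token embeddings, and the two "experts" and router logits below are hypothetical placeholders.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def route_top1(router_logits, experts, x):
    """Run only the highest-probability expert and scale by its gate value."""
    gates = softmax(router_logits)
    best = max(range(len(experts)), key=gates.__getitem__)
    return best, gates[best] * experts[best](x)


# Hypothetical experts: a cheap routine path and a costlier path.
calls = {"routine": 0, "heavy": 0}


def routine_expert(x):
    calls["routine"] += 1
    return x


def heavy_expert(x):
    calls["heavy"] += 1
    return 2 * x


experts = [routine_expert, heavy_expert]
# Router logits favoring the routine expert (illustrative values).
idx, out = route_top1([3.0, 0.0], experts, 1.0)
```

Only the selected expert is executed, so the unused expert costs no compute for this input. This is the property that lets capacity scale without a proportional increase in per-input cost.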

Previously, onboard processors and battery constraints restricted mobile service robots to simplified models with reduced perception and reasoning abilities. Meaningful improvements have been achieved with foundation models. Namely, Perceiver-Actor [176] offers the lowest compute cost (~9–10 GFLOPs) while still achieving multi-task visuomotor control, whereas SAM-2 [174] and Magma [150] deliver strong perception with higher computation budgets.

4.5. Lessons Learned and Failure Patterns Across Foundation Model Architectures

While Table 2 summarizes the performance trade-offs of major foundation models across language-to-action, perception, uncertainty, and computation, these results also reveal recurring failure patterns that highlight why certain foundation model architectures can underperform in specific service environments. Namely, these patterns identify architectural mismatches between model assumptions and the operational constraints of domestic assistance, healthcare, and service automation settings. We discuss these key limitations in more detail below.

(1): High-level reasoning and executable control implementation gaps: General-purpose LLM/MLLM planners can generate coherent task decompositions and recovery strategies under ambiguous instructions; however, they can often fail when required to convert language into precise, physically feasible actions. This failure is evident in domestic manipulation tasks in cluttered household environments, where underspecified commands (“put this away”) require fine-grained object disambiguation, grasp feasibility checks, and contact-aware execution. For example, high-level reasoning-oriented models such as GPT [144], LLaMA [177] and DeepSeek-R1 [149] demonstrate strong abstract reasoning and planning performance. However, as these architectures do not natively integrate perception outputs with low-level motor control or physically validated action primitives (as reflected in Table 2 by the absence of real-robot grounding metrics), they risk producing plausible plans that omit embodied constraints. This leads to unsafe or infeasible steps, such as selecting an occluded or already-grasped object, attempting a grasp outside the robot’s reachable workspace, placing objects on unstable or non-supportive surfaces, colliding with nearby obstacles during arm motion, or issuing navigation commands that ignore local clearance and kinematic limits. In contrast, architectures that explicitly bind language inputs to perception outputs and predefined robot action primitives, such as CLIP-CAP [168] and Perceiver-Actor [176], reduce this type of failure by converting natural language instructions into executable control sequences that directly reference detected objects, grasp points, and motion policies. 
For example, CLIP-CAP [168] translates a command such as “pick up the red shirt and place it in the laundry basket” into structured program steps that: (1) call a vision module to localize the object, (2) compute grasp parameters based on object pose, and (3) execute validated manipulation actions using predefined robot control primitives. This explicit perception-to-action linkage is reflected in its higher real-robot manipulation success rate (71% across seven real-world tasks, Table 2). Similarly, Perceiver-Actor [176] directly maps multimodal scene encodings to low-level visuomotor policies (70% success across 18 tasks, Table 2), demonstrating stronger embodied integration compared to purely reasoning-oriented models.
(2): Overconfidence and miscalibration under uncertainty: A dominant failure pattern for foundation models in healthcare and public environments is miscalibrated confidence estimation. In safety-critical settings, failure often occurs when foundation models provide confident but incorrect outputs, resulting in downstream planners providing unsafe decisions (e.g., misidentifying objects, misunderstanding priorities, or failing to ask for clarification). Table 2 highlights the comparatively stronger calibrated reasoning under uncertainty of DeepSeek-R1 [149], while the GPT-class [144] and the LLaMA-class [177] models exhibited higher calibration errors, indicating overconfidence. In practice, such failure can be amplified in hospitals, where ambiguity is common (multiple staff instructions, object/tool similarities, changing workflows) and where uncertainty should trigger conservative behaviors (verification, asking, or deferring), rather than decisive incorrect actions.
(3): Latency and computation bottlenecks: In mobile service robotics, many failures can manifest as delayed responses rather than incorrect predictions. High-capacity models may succeed on static benchmarks but become unusable in real deployments due to timing constraints imposed by HRI, navigation, and manipulation in dynamic settings. This is particularly critical in healthcare assistance (e.g., bedside monitoring, corridor navigation) and service automation (e.g., guidance and wayfinding) where responsiveness directly affects safety and trust. For example, GPT-style multimodal reasoning models [144] require substantial computation per token and operate at low inference speeds, making them unsuitable for onboard closed-loop control in time-sensitive environments. Similarly, CLIP-CAP [168] achieves strong real-robot manipulation performance but operates at low frame rates, which can limit responsiveness during continuous perception-action cycles. In contrast, Perceiver-Actor [176] demonstrates lower computational cost suitable for embedded deployment; however, this efficiency may come at the expense of richer reasoning capacity or explicit uncertainty modeling. This illustrates the broader trade-off between reasoning depth and real-time responsiveness on resource-constrained mobile platforms.
(4): Failure amplification across domains: Across domestic and healthcare assistance, and service automation, the same architectural limitations can be amplified by different operational constraints. Namely, domestic settings exacerbate language-to-perception and perception-to-action mapping errors, where a model can incorrectly associate a linguistic reference (e.g., “that bowl”) with the wrong object instance or can select an action that does not account for occlusion, reachability, or grasp feasibility. Hospitals can amplify uncertainty and safety failures due to high-stakes decisions, vulnerable populations, and multi-user conflicts. Furthermore, public service scenarios amplify compute and latency failures due to scale, crowds, and continuous interaction demands. Perception backbones, including SAM-2 [174], can generalize broadly across different domains, while reasoning-heavy models such as GPT [144] and DeepSeek-R1 [149] are best suited for supervisory or cloud-assisted components. The latter models can potentially be paired with efficient control pipelines that directly translate model outputs into validated perception modules, motion planners, and low-level motor commands with collision checking and feasibility verification. Examples of frameworks that implement this reasoning-to-control grounding include code-generation and skill-selection frameworks such as CLIP-CAP [168], which compiles language instructions into executable robot programs, and SayCan [12], which grounds high-level language reasoning into scored low-level skill primitives prior to execution.
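The explicit perception-to-action linkage discussed in item (1) can be sketched as a small program of the kind a code-generation framework might emit for “pick up the red shirt and place it in the laundry basket”. Everything here is hypothetical: the scene dictionary, the reach limit, and the `localize` and `reachable` primitives are illustrative stubs, not the API of any reviewed system.

```python
from typing import Optional

# Hypothetical scene: object name -> (x, y, z) position in the robot base frame.
SCENE = {"red shirt": (0.5, 0.1, 0.3), "laundry basket": (0.6, -0.4, 0.0)}
MAX_REACH_M = 0.9  # assumed arm reach, in meters


def localize(name: str) -> Optional[tuple]:
    """Stub for a vision module; returns None when the object is not detected."""
    return SCENE.get(name)


def reachable(position: tuple) -> bool:
    """Feasibility check: is the point within the assumed reach radius?"""
    return sum(c * c for c in position) ** 0.5 <= MAX_REACH_M


def pick_and_place(obj: str, target: str) -> list[str]:
    """Return the planned steps, or a single clarification or recovery step."""
    obj_pos, target_pos = localize(obj), localize(target)
    if obj_pos is None or target_pos is None:
        unseen = obj if obj_pos is None else target
        return [f"ask_user: cannot find '{unseen}'"]
    if not (reachable(obj_pos) and reachable(target_pos)):
        return ["navigate_closer"]
    return [f"grasp {obj} at {obj_pos}", f"place in {target} at {target_pos}"]


steps = pick_and_place("red shirt", "laundry basket")
missing = pick_and_place("blue sock", "laundry basket")
```

Because every action references a detected object and passes a feasibility check first, an undetected or out-of-reach target produces a clarification or repositioning step rather than a plausible but infeasible plan.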

By addressing the core challenges of multimodal perception, language grounding, task generalization, uncertainty estimation, and computational efficiency, foundation models offer scalable solutions for mobile service robotics, while exposing new design trade-offs that must be managed across deployment domains.

5. Real-World Applications for Mobile Service Robots with Embedded Foundation Models

The global service robotics market is projected to experience rapid growth, approximately doubling from $47.10 billion USD in 2024 to $98.65 billion USD by 2029, with an average compound annual growth rate of 15.9% [214]. This surge reflects broader trends toward integrating intelligent service robots into homes, healthcare facilities, and public spaces [215]. As foundation models are increasingly being embedded into mobile service robots, we examine how these models address real-world needs and enable safer, more adaptive, and socially intelligent robot behaviors across these domains.
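
As a consistency check on the figures quoted above, the growth rate can be recomputed from the two market values with the standard compound annual growth rate formula, assuming five compounding years between 2024 and 2029:

```latex
\[
\mathrm{CAGR} = \left(\frac{98.65}{47.10}\right)^{1/5} - 1 \approx 0.159,
\qquad
47.10 \times (1.159)^{5} \approx 98.5 .
\]
```

The small difference between 98.5 and the quoted 98.65 comes from rounding the rate to 15.9%.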

We categorize applications into three primary domains, domestic assistance, healthcare assistance, and service automation, with each decomposed into representative sub-tasks (Figure 5).

Domestic assistance consists of fetch-and-retrieval, cleaning, childcare, and cooking activities; healthcare assistance includes medical delivery, bedside monitoring, assistive mobility, and hygiene maintenance; service automation encompasses customer assistance, guidance and wayfinding, as well as service setup and maintenance tasks such as venue preparation, layout arrangement, equipment positioning, and post-event cleanup in public spaces. Representative frameworks and robot platforms in each application domain are presented in Table 3, linking existing robotic systems (e.g., Stretch RE-1, Fetch, Everyday Robot, Jackal, and mobile manipulation platforms) to the tasks they perform. Together, Figure 5 and Table 3 provide a mapping from application domains and sub-tasks to specific foundation-model-enabled mobile service robot implementations that we examine in the following subsections.

5.1. Domestic Assistance

Domestic assistance refers to everyday tasks to support comfort, safety, and quality of life [233,234]. These tasks include the fetch and retrieval of household objects [217,218,235,236], cleaning [37,237], childcare assistance [153,238], and cooking activities [239,240,241].

(1): Fetch and retrieval tasks: These tasks, often triggered by natural language, gestures, or contextual cues, involve multimodal perception, ambiguous instruction grounding, uncertainty management, and efficient planning. Fetch and retrieval tasks decompose into sub-tasks including: (1) object localization, (2) grasp planning and action refinement, (3) navigation and multi-step delivery, and (4) handover and placement reasoning. Using foundation models such as CLIP [146] and CLIP-Fields [242] for spatial-language grounding, Octo [217] for visuomotor control, GPT-4V [163] and CogNav [30] for high-level reasoning and route sequencing, and LLaMA3.1-8B [177] or CodeLLaMA-70B [243] for symbolic task decomposition, mobile service robots can achieve end-to-end perception and manipulation across these sub-tasks. The VLM CLIP-Fields [242] supports object localization by constructing a continuous 3D semantic field from RGB-D images and odometry. Each point in the environment is encoded using CLIP’s text-vision embedding space, allowing a robot to search for objects using natural language descriptions (e.g., “the yellow sponge near the sink, go get it and bring it to me so I can clean these dishes”) without predefined labels. This method has been deployed on the Stretch RE-1 [216], a commercial mobile manipulator equipped with an RGB-D sensor, telescoping mast, and a compliant gripper. The Stretch uses the semantic field to resolve which surfaces and object clusters have the highest likelihood of matching the user’s query, enabling reliable localization in typical household scenes where objects may be partially occluded, moved, or surrounded by visually similar items. Grasp execution and refinement are produced by the VLA model Octo [217], which blends a T5 language encoder with a diffusion-based visuomotor policy. 
On a Franka Emika Panda robot arm, Octo generates stable grasp trajectories when object poses are uncertain or cluttered, refining motions based on both visual input and the language-specified goal. High-level instruction reasoning and route sequencing are modeled through GPT-4 [148], GPT-4V [163], and segmentation backbones such as SAM [164]. Integrated on the Fetch mobile manipulator, these models interpret user requests, segment relevant regions of the scene, and adjust navigation plans as environmental obstacles or room layouts change. The CogNav system [30] further extends spatial reasoning by using GPT-4V [163] to decide when to explore, when to navigate, and how to map new rooms during longer fetch-and-deliver routines. Context-sensitive handover and placement actions are enabled through the PARTNR framework [218], which applies LLaMA3.1-8B [177] and CodeLLaMA-70B [243] to infer appropriate approach angles, placement surfaces, and human-aware delivery strategies. PARTNR demonstrated these behaviors on both a Franka Panda and a Boston Dynamics Spot robot equipped with a 7-degrees-of-freedom (DoF) arm, enabling the retrieval and socially compliant delivery of household items during domestic assistance tasks.
(2): Cleaning: Cleaning tasks involve a robot autonomously organizing, decluttering, and sorting household items into designated locations [244]. Unlike object-specific fetch tasks, cleaning often requires generalizing abstract commands such as “put everything away” into sequential, context-aware behaviors across many objects and categories. Cleaning tasks can be decomposed into the following sub-tasks: (1) user preference inference and object categorization, (2) object localization and sorting, and (3) task execution with multimodal grounding and action control. Foundation models such as GPT-3.5 [173] or LLaMA-based LLMs [177,243,245] can be used for preference reasoning, the VLM CLIP [146] for visual-semantic association, and embodied VLA models such as RT-2 for physical execution. These models collectively support the sub-tasks in real-world cleaning robots. User preference inference and category reasoning are demonstrated in TidyBot [37], which integrates LLM GPT-3.5 [173] with a holonomic mobile base and a Kinova Gen3 arm. GPT-3.5 [173] is prompted with a small set of examples (e.g., “put light clothes in the drawer,” “place toys in the bin”) and infers general placement rules that align with household organization habits. The model predicts appropriate storage locations for previously unseen objects, enabling the robot to follow human-specific conventions rather than relying on rigid predefined templates. Object localization and sorting are enabled through the combination of GPT-3.5 and CLIP [146] on the same TidyBot robot, which together match visual observations to semantic categories. CLIP provides an open-vocabulary visual embedding space, allowing the TidyBot to recognize a wide range of everyday objects, even those not present in its training set, and associate them with the inferred categories generated by the LLM. This supports reliable sorting in typical home scenes where objects vary in appearance, lighting, or placement.

Task execution and multimodal control can be supported by VLA models such as RT-2 [14], which unify vision, language, and motor control. RT-2 policies map high-level cleaning commands (e.g., “clear the table,” “place everything neatly on the shelf”) into low-level robot actions while maintaining awareness of object poses, workspace boundaries, and user-defined organization rules. This method has been deployed on the Everyday Robot [14], a mobile manipulator with a single 6-DoF arm. The VLA model enables the Everyday Robot to carry out multi-step cleaning routines with consistency and adaptability.

(3): Childcare: Childcare tasks require a mobile service robot to assist with daily routines, object retrievals, and monitoring of young children [246]. Example tasks include responding to prompts such as “bring me my blanket,” guiding hygiene routines like teeth brushing after meals, and detecting safety risks such as toddlers approaching stairs or handling small objects. Childcare can be decomposed into the following sub-tasks: (1) behavioral intent grounding, (2) dynamic instruction decomposition, and (3) real-time multimodal control and safety adaptation. Using foundation models such as GPT-3.5/3.5-turbo-instruct [173] for intent interpretation and multi-step reasoning, and VLA safety-adapter frameworks such as SAFE [219] for embodied risk detection, mobile service robots can integrate perceptual cues, language inputs, and feedback signals to support safe and context-aware childcare routines. LLM GPT-3.5-turbo-instruct [173] supports language-grounded behavioral reasoning by learning probabilistic mappings between behavioral cues and task goals, allowing a robot to interpret vague requests such as “help me” in context-specific ways, for example, by retrieving a dropped toy, handing over a comfort item, or alerting a caregiver. This has been demonstrated on the Smart Help system [28], which uses a custom mobile manipulator robot with an onboard camera and a Kinova Gen3 arm. Smart Help uses GPT-3.5-turbo-instruct [173] to infer user intent from short or underspecified prompts. The LLM supports dynamic instruction decomposition, where, for example, longer caregiver commands (e.g., “check if the baby is asleep and then turn off the light”) can be translated into ordered subgoals that coordinate perception, navigation, and verification steps, allowing a robot to maintain situational context and align timing with child behaviors. 
Real-time multimodal control and safety adaptation are provided by SAFE [219], which integrates a multimodal safety classifier atop pretrained LLM- and VLM-based policies. SAFE has been used on a Franka Emika Panda robot to continuously analyze vision, proprioception, and trajectory predictions to halt unsafe grasps and prevent collisions with fragile objects. SAFE gives a timely alert such that the robot can stop, backtrack, or ask for help, enabling early intervention before failures cause harm to nearby humans, both adults and children.
(4): Cooking: Cooking tasks involve mobile service robots executing multi-step procedures to assist elderly individuals, busy families, or users with disabilities. Cooking includes sub-tasks such as: (1) ingredient and utensil localization, (2) sequential cooking task decomposition and planning, and (3) bimanual manipulation and coordinated action execution. Foundation models such as LLM PaLM 540B [6] for visual-language grounding, LLMs BERT [145] and DistilBERT [247] for instruction parsing, and VLM CLIP [146] for visual-semantic correspondence can map natural language cooking requests to perception-driven subgoals and physically executable motion plans in the home. In particular, the LLM PaLM 540B supports ingredient and utensil localization by visually identifying objects such as spoons, milk, eggs, or bread from natural language cues. This allows a robot to ground ingredient or utensil references directly to camera observations and infer their spatial context within the kitchen. This was demonstrated on the Everyday Robots mobile manipulator [220], which integrated the PaLM 540B LLM. Sequential task decomposition and planning have been demonstrated using the fixed-base 6-DoF Universal Robots UR5e arm system [221]. The system used BERT and DistilBERT as well as CLIP to transform high-level goals such as “in kitchen, make a sandwich” into grounded action sequences like “pick bread from cabinet above” or “retrieve cooked meat from microwave.” These models decompose abstract cooking goals into structured sub-tasks linked to corresponding visual targets. Coordinated bimanual manipulation is enabled through the combination of GPT-3 [2] and a graph-based control network to synchronize grasping, tool use, ingredient placement, and force-sensitive interactions. This method has been deployed on a dual-arm Universal Robots UR5e configuration [222] utilizing GPT-3 as the high-level planner, taking language instructions and RGB camera data as inputs. 
This integration allows the robot to perform LLM-guided actions such as stabilizing a bowl with one arm while stirring or transferring ingredients with the other, all while maintaining real-time perceptual grounding and safety constraints.
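
The open-vocabulary object search described above for CLIP-Fields-style semantic fields can be sketched as a nearest-neighbor lookup in a shared embedding space. The following is a simplified illustration, not the CLIP-Fields implementation: it assumes that each stored 3D point already carries an embedding and that the text query has been embedded by the same model. The tiny three-dimensional vectors are made-up stand-ins for real CLIP embeddings, and `cosine` and `query_field` are hypothetical helper names.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def query_field(points: List[Point],
                embeddings: List[Sequence[float]],
                text_embedding: Sequence[float],
                top_k: int = 1) -> List[Tuple[float, Point]]:
    """Rank stored 3D points by similarity to the text query embedding."""
    scored = [(cosine(e, text_embedding), p) for p, e in zip(points, embeddings)]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# Toy semantic field: three map points with made-up 3-D embeddings.
points = [(0.5, 1.0, 0.9), (2.0, 0.2, 0.8), (3.1, 1.5, 0.4)]
embeddings = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.2, 0.9]]
# Stand-in for the embedding of a query such as "the yellow sponge".
query = [0.8, 0.2, 0.1]

best_score, best_point = query_field(points, embeddings, query)[0]
print(best_point)
```

The returned point can then serve as a navigation or grasp target; a real system would query a learned continuous field rather than a short list of points.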

5.2. Healthcare Assistance

Healthcare assistance tasks involve supporting clinical workflows, assisting people, and infection control in dynamic hospital environments. These tasks include the delivery of medical supplies [223,224], bedside patient monitoring, assistive mobility [225,248], and hygiene maintenance and infection control [226,227,249].

(1): Delivery of medical supplies: Delivery of medical supplies involves mobile service robots autonomously transporting medications, IV fluids, lab samples, and surgical instruments throughout hospitals and clinics. These deliveries are essential to maintaining smooth clinical workflows, reducing staff workload, and minimizing human error in time-sensitive medical tasks. Robotic delivery tasks can be decomposed into the following sub-tasks: (1) symbolic-to-action mapping and long-horizon planning, (2) spatial grounding of objects and target locations, (3) feasibility evaluation and adaptive plan selection, and (4) social navigation in dynamic clinical settings. LLMs such as Codex [250], GPT-3 [2], and GPT-4 [148] have been used for symbolic program synthesis and task decomposition, VLMs such as OpenScene [251] for spatial grounding, and GPT-4V [163] combined with YOLO [252] for human-aware navigation. These models allow robots to translate requests into structured action sequences, infer 3D target locations, and adapt their motion to human activity in real time. Codex [250] and GPT-3 [2] support symbolic-to-action reasoning by converting instructions such as “take these surgical tools to the OR room” into structured control sequences (e.g., functionally decomposed as grab(kit) → goto(prep_room) → handover). This allows a robot to autonomously generate interpretable, executable, long-horizon delivery routines using an LLM API and predefined spatial primitives. This is demonstrated on a custom mobile manipulator base equipped with a Franka Emika Panda arm [223]. Spatial grounding was enabled by a combination of models such as GPT-4 [148] and OpenScene [251]. 
Demonstrated on the same custom mobile manipulator base in [224], GPT-4 and OpenScene together resolve spatial expressions such as “on the left tray by the door” into 3D coordinates suitable for pick-and-place execution, enabling accurate supply localization and placement even when trays, carts, or equipment partially occlude the scene. GPT-4V [163] and YOLO [252] support social navigation in dynamic environments, as shown with VLM-Social-Nav [201] on a Clearpath Jackal mobile robot equipped with RGB cameras and wheel odometry. VLM-Social-Nav merged GPT-4V and YOLO to interpret real-time human gestures and motion cues to infer social intent, dynamically adjusting its navigation policy to yield, reroute, or proceed safely through crowded corridors.
(2): Bedside patient monitoring: Bedside patient monitoring focuses on understanding a patient’s state, intent, and safety within the immediate vicinity of the bed, under strict clinical constraints on proximity, etiquette, and timely response. These tasks decompose into: (1) patient intent and state interpretation, and (2) bedside assistive decision-making. LLMs such as GPT-4 [148] and GPT-3.5-turbo-1106 [173] support these capabilities by reasoning over verbal cues, contextual room information, and clinical routines. This functionality is demonstrated in MoMa-LLM [248] on the Fetch robot, where GPT-4 [148] performs high-level reasoning over requests and situational context, while GPT-3.5-turbo-1106 [173] provides real-time room and object context classification. Together, these models enable robot bedside monitoring behaviors such as recognizing requests for assistance, detecting unsafe situations (e.g., a patient attempting to reach beyond safe range), and deciding when to reposition objects, alert staff, or initiate assistive actions, all while remaining localized to the bedside environment.
(3): Assistive mobility: Assistive mobility focuses on safely navigating dynamic hospital environments while anticipating human motion and proactively adapting behavior. These tasks decompose into: (1) safe co-navigation and obstacle adaptation and (2) multi-agent forecasting and proactive intervention. Vision-language models such as GPT-4V [163] and SC-CLIP [253] enable socially compliant navigation by grounding human motion cues, spatial layouts, and interaction affordances in real time. This approach is demonstrated in OLiVia-Nav [202] on a mobile Jackal robot, which combines GPT-4V (offline) [163] for social-context captioning with SC-CLIP (real-time) [253] for responsive perception, allowing the robot to maneuver around people while maintaining appropriate spacing. For longer-horizon anticipation, GPT-4o [163] supports multi-agent behavior forecasting by analyzing pedestrian formations and predicting future motion patterns. This capability is demonstrated on a custom differential-drive robot equipped with RGB cameras and 2D LiDAR [225], where GPT-4o [163] enables proactive actions such as yielding to people and avoiding emerging congestion.
(4): Hygiene maintenance and infection control: In hospitals, hygiene maintenance tasks involve disinfecting high-touch surfaces, restocking personal protective equipment (PPE), managing biomedical waste, and monitoring air quality in areas such as ICUs, isolation wards, and operating rooms. Hygiene maintenance and infection control robot tasks decompose into the following sub-tasks: (1) spatial perception and contamination zone identification, (2) hygiene task grounding and low-level action planning, and (3) real-time PPE tool handling and proactive environmental monitoring. LLM GPT-4 [148], MLLMs such as AutoRT [226], and VLAs such as SayPlan [227] enable robots to detect contamination risks, interpret abstract cleaning commands, and execute safe, context-aware disinfection procedures in clinical environments. AutoRT [226] supports spatial perception and contamination zone identification by interpreting environment descriptions and linking them to actionable cleaning targets. AutoRT was integrated into the Everyday Robots mobile manipulator. AutoRT’s MLLM is able to parse high-level natural language queries (e.g., “disinfect all handles near patient beds”) under a robot-constitution safety framework and convert them into targeted manipulation instructions. This enables autonomous exploration of clinical rooms and prioritization of high-touch contamination zones. LLMs such as GPT-4 [148] support hygiene task grounding and low-level action planning by performing semantic reasoning over structured scene representations and generating long-horizon disinfection plans. This capability was demonstrated in SayPlan [227] on a Franka Emika Panda arm mounted on an Omron LD-60 mobile base, where GPT-4 [148] performed hierarchical semantic search over 3D scene graphs to refine feasible disinfection steps across multi-room ICU layouts. This allows the robot to convert abstract instructions such as “sanitize the entire ICU wing” into sequential, spatially grounded actions. 
Real-time tool handling and proactive environmental monitoring are provided by VLA systems such as PaLM-E [13], which jointly coordinate visual perception, language reasoning, and manipulation. PaLM-E [13] is integrated with mobile manipulators to restock PPE and manage tools and, with additional sensors, can monitor indicators such as air quality, UV sanitation cycles, or filter status while avoiding contaminated regions during operation. PaLM-E’s multimodal grounding allows a robot to adapt actions when new hazards emerge, such as when staff flow alters the environment.
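
The structured control sequences described above for medical supply delivery, where an LLM emits a chain of predefined primitives, lend themselves to validation before execution. The following is a minimal sketch of such a check, not the method of [223]: it assumes the plan arrives as text with steps separated by `->`, and that the robot exposes a fixed whitelist of primitives and known location or object symbols. `ALLOWED_PRIMITIVES`, `KNOWN_SYMBOLS`, and `parse_plan` are hypothetical names introduced for illustration.

```python
import re
from typing import List, Tuple

# Hypothetical whitelist: primitive name -> number of arguments it accepts.
ALLOWED_PRIMITIVES = {"grab": 1, "goto": 1, "handover": 0}
# Hypothetical symbols the robot can ground to objects or places.
KNOWN_SYMBOLS = {"kit", "prep_room"}

# Matches "name", "name()", "name(arg)", or "name (arg)".
STEP_PATTERN = re.compile(r"^(\w+)\s*(?:\(\s*(\w*)\s*\))?$")


def parse_plan(plan_text: str) -> List[Tuple[str, Tuple[str, ...]]]:
    """Parse and validate an LLM-produced plan; raise ValueError if invalid."""
    steps = []
    for raw in plan_text.split("->"):
        match = STEP_PATTERN.match(raw.strip())
        if match is None:
            raise ValueError(f"unparseable step: {raw.strip()!r}")
        name, arg = match.group(1), match.group(2)
        args = (arg,) if arg else ()
        if name not in ALLOWED_PRIMITIVES:
            raise ValueError(f"unknown primitive: {name}")
        if len(args) != ALLOWED_PRIMITIVES[name]:
            raise ValueError(f"wrong argument count for {name}")
        if any(a not in KNOWN_SYMBOLS for a in args):
            raise ValueError(f"unknown symbol in step: {raw.strip()!r}")
        steps.append((name, args))
    return steps


plan = parse_plan("grab(kit) -> goto(prep_room) -> handover")
print(plan)
```

Rejecting any step outside the whitelist keeps hallucinated primitives or locations from reaching the motion planner, at the cost of requiring the primitive set to be maintained by hand.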

5.3. Service Automation

Service automation tasks involve helping customers, managing events, and navigating dynamic public venues such as malls, airports, and museums. These tasks include customer assistance, guidance, and wayfinding [228,229,230,254,255] as well as service setup and maintenance [231,232].

(1): Customer assistance, guidance and wayfinding: Mobile service robots in public venues such as malls, airports, and museums assist customers by interpreting requests, navigating large and dynamic spaces, and supporting multi-stage guidance under social constraints. These tasks decompose into sub-tasks: (1) language-to-goal spatial grounding, (2) multi-stage waypoint sequencing, and (3) symbolic-spatial planning for real-time urgency-based redirection. LLMs such as GPT-3 [2], GPT-3.5 [173], LLaMA [245], and VLMs such as CLIP [146] and ViNG [230] enable robots to interpret open-ended user queries, associate them with real-world spatial layouts, and generate adaptive navigation plans that remain interpretable and socially aware in complex public environments. Language-to-goal grounding is achieved using GPT-3.5 [173] or LLaMA [245], which convert natural language queries into spatial targets and semantic constraints. This approach was used in OVSG [228] on the Ackermann-steered mobile platform ROSMASTER R2 equipped with RGB-D cameras. OVSG fuses GPT-3.5 [173]/LLaMA [245] with Detic-CLIP embeddings to build open-vocabulary 3D scene graphs encoding object labels, relations, and spatial geometry. As a result, the robot can interpret queries such as “find the nearest pharmacy across from the information desk” and resolve them into precise navigation goals in complex public spaces. LLMs such as GPT-3 [2] combined with VLMs CLIP and ViNG support multi-stage waypoint sequencing by breaking down language instructions into landmark-based routes and grounding those landmarks visually. This was shown in LM-Nav [229] on a Clearpath Jackal robot. GPT-3 [2] parsed instructions into ordered landmark sequences, CLIP matched these landmarks to visual observations, and ViNG predicted traversability between them via a topological connectivity graph. 
This enables robust execution of multi-stop instructions such as “first go to security, then continue to Gate A12,” even when crowds, occlusions, or temporary barriers appear. Real-time symbolic-spatial redirection is provided by models such as GPT-3.5 [173], which can re-evaluate priorities and update navigation plans as conditions change. This functionality was demonstrated in MLLM-Search [256] and HAM-Nav [257] on a Jackal, where GPT-3.5 reprioritizes goals based on new constraints, such as flight gate changes, service desk relocations, or temporary crowding. This integration of symbolic reasoning and live spatial cues allows the robot to dynamically reroute while preserving safety and interpretability in busy public environments.
(2): Service setup and maintenance: Service setup and maintenance tasks involve preparing venues, arranging equipment, adjusting layouts, and cleaning up at conferences, trade shows, or public spaces. Robots must interpret vague and evolving instructions such as “set up chairs near the left projector,” “tidy this booth area,” or “build this structure based on the manual.” These tasks decompose into the following sub-tasks: (1) adaptive motion skill chaining, (2) generalized layout planning and setup, and (3) spatial safety validation and manipulation. Using foundation models such as PaliGemma-3B [258] for multimodal grounding, $π_{0}$ [153] and $π_{0.5}$ [154] for language-conditioned visuomotor control, and PG-InstructBLIP [259] for physical attribute reasoning, mobile service robots perform adaptive skill sequencing, layout planning, and safe manipulation across these service setup and maintenance sub-tasks. Together, LLM PaliGemma-3B [258] and VLA $π_{0}$ [153] support adaptive motion skill chaining by translating vague service-setup instructions into structured sequences of low-level actions. This capability was demonstrated in Hi-Robot [231] using the dual-arm mobile manipulator Mobile ALOHA robot [241]. PaliGemma-3B interprets vague commands such as “tidy this booth area” or “organize the display materials,” while $π_{0}$ sequences grasping, placing, folding, or lining motions. Human-in-the-loop feedback allows the model to correct or refine behaviors online, which enables robots to adapt skill chains as service conditions change. Generalized layout planning and setup is achieved through the VLA $π_{0.5}$ [154], which extends the LLM PaliGemma-3B [258] by co-training across heterogeneous tasks and spatial rearrangement demonstrations. This generalization is also shown on the Mobile ALOHA platform, where $π_{0.5}$ can learn to interpret commands such as “set up chairs near the left projector” or “assemble the booth according to this layout” even in unfamiliar rooms. 
By grounding natural language against perceived furniture, landmarks, and geometry, the robot can infer feasible arrangements and execute multi-step venue setup procedures. Spatial safety validation and manipulation are enabled by PG-InstructBLIP [259], a fine-tuned VLM capable of inferring physical attributes such as fragility, density, or structural integrity from images. PhysObjects [232] used a Franka Emika Panda robot, where PG-InstructBLIP conditioned the robot’s motion policy on object properties, for example, slowing movement near glass displays, adapting grip force for heavy metal stands, or selecting compliant strategies when handling sensitive electronics. This allows a robot to manipulate equipment safely and reliably.
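
The multi-stage waypoint sequencing described above for LM-Nav-style guidance, in which an LLM orders landmarks, a VLM matches them to observations, and a topological graph connects them, can be illustrated with a simplified sketch. This is not the LM-Nav algorithm: it assumes the landmark order and the landmark-to-node similarity scores are already available, picks the best-matching node for each landmark greedily, and connects consecutive targets by breadth-first search. The graph, node names, scores, and the helpers `shortest_path` and `plan_route` are all hypothetical.

```python
from collections import deque
from typing import Dict, List


def shortest_path(graph: Dict[str, List[str]], start: str, goal: str) -> List[str]:
    """Breadth-first search over an unweighted topological graph."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    raise ValueError(f"no route from {start} to {goal}")


def plan_route(graph: Dict[str, List[str]],
               similarity: Dict[str, Dict[str, float]],
               landmarks: List[str],
               start: str) -> List[str]:
    """Visit the best-matching node for each landmark, in the given order."""
    route, current = [start], start
    for landmark in landmarks:
        scores = similarity[landmark]
        target = max(scores, key=scores.get)  # greedy landmark-to-node match
        route.extend(shortest_path(graph, current, target)[1:])
        current = target
    return route


# Toy venue graph and made-up landmark-to-node similarity scores.
graph = {
    "entrance": ["hall"],
    "hall": ["entrance", "security", "shops"],
    "security": ["hall", "gates"],
    "shops": ["hall"],
    "gates": ["security"],
}
similarity = {
    "security": {"entrance": 0.1, "hall": 0.2, "security": 0.9, "shops": 0.3, "gates": 0.2},
    "gate A12": {"entrance": 0.1, "hall": 0.1, "security": 0.2, "shops": 0.2, "gates": 0.8},
}
route = plan_route(graph, similarity, ["security", "gate A12"], "entrance")
print(route)
```

The greedy per-landmark match is a simplification chosen for brevity; a joint search over landmark assignments and path cost would be more robust when several nodes score similarly.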

An overview of the three-level framework is presented in Figure 6, which connects application domains to embodied AI challenges and shared foundation model capabilities. At Level 1, the framework spans the three aforementioned domains of domestic assistance, healthcare assistance, and service automation. Level 2 organizes specific systems according to their perception, reasoning, and action components, highlighting how different frameworks (e.g., CLIP-Fields [242], Octo [217], MoMa-LLM [248], OVSG [228], Hi-Robot [231]) combine vision, language, and control. Level 3 captures shared model capabilities across domains, such as CLIP-style VLMs for perception, GPT-family LLMs for reasoning, and VLA models such as RT-2 [14] and SayCan [12] for action. This shows that many foundation models serve as reusable building blocks rather than domain-specific one-offs. This taxonomy clarifies how embedded foundation models generalize across diverse service settings while still supporting domain-specialized behaviors.

6. Ethical, Societal, Human-Interaction, and Physical Design Implications for Mobile Service Robots with Embedded Foundation Models

Foundation models help close key technical gaps for mobile service robots (Section 3), but their use also creates deployment-time risks and responsibilities. The implications below should be read as requirements for safe adoption, not as arguments against foundation models. Each subsection pairs a risk with concrete mitigation hooks that align with the solutions discussed in Section 4 and Section 5. Figure 7 provides an overview of these four interconnected domains of responsibility: ethical, societal, human–robot interaction, and physical design and ergonomic implications. The circular structure emphasizes that risks in one domain often propagate into the others, motivating integrated mitigation strategies throughout Section 6.1, Section 6.2, Section 6.3 and Section 6.4.

6.1. Ethical Implications

Ethical implications concern the responsible use of foundation models in mobile service robots, focusing on protecting sensitive user data, ensuring accountability for autonomous decisions, and aligning robot behavior with human values.

(1): Privacy and data governance: Mobile service robots deployed in homes or hospitals collect sensitive multimodal data (visual, audio, and physiological) [260]. When coupled with foundation models trained on web-scale corpora, risks of leakage [261,262], re-identification, or misuse increase [263]. Ethical deployment requires privacy-preserving architectures (on-device inference; selective edge-cloud collaboration), data minimization and purpose limitation, and auditable data-handling protocols aligned with the target domain.
(2): Accountability and value alignment: When foundation models generate autonomous action plans, it becomes unclear who bears responsibility for harmful outcomes: the developer, operator, or model provider [264,265]. Ethical frameworks should mandate action provenance (logging how language was grounded into plans), human-over-the-loop approval for high-impact tasks, and verifiable policy constraints that encode domain values (e.g., clinical safety norms).
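
The two mitigation hooks named above, action provenance and human approval for high-impact tasks, can be combined in a small gate placed between the planner and the executor. The following is a hedged sketch rather than a prescribed design: it assumes plans are lists of named actions and that a deployment defines its own set of high-impact actions. `HIGH_IMPACT_ACTIONS`, `PlanRecord`, `authorize`, and `to_log_line` are hypothetical names, and the instruction, actions, and model name are made up.

```python
import json
from dataclasses import asdict, dataclass
from typing import List

# Hypothetical set of actions that must never run without human sign-off.
HIGH_IMPACT_ACTIONS = {"administer_medication", "move_patient"}


@dataclass
class PlanRecord:
    """Provenance for one plan: what was asked, what was planned, by which model."""
    instruction: str
    plan: List[str]
    model_name: str
    approved_by: str = ""


def requires_approval(plan: List[str]) -> bool:
    """A plan needs sign-off if any step is in the high-impact set."""
    return any(step in HIGH_IMPACT_ACTIONS for step in plan)


def authorize(record: PlanRecord, approver: str = "") -> bool:
    """Allow execution only if the plan is low-impact or a human approved it."""
    if requires_approval(record.plan) and not approver:
        return False
    record.approved_by = approver
    return True


def to_log_line(record: PlanRecord) -> str:
    """Serialize the record as one JSON line for an audit log."""
    return json.dumps(asdict(record), sort_keys=True)


record = PlanRecord(
    instruction="give the patient in bed 4 the evening dose",
    plan=["goto_bed", "administer_medication"],
    model_name="example-llm",
)
blocked = not authorize(record)             # no approver yet: execution is refused
allowed = authorize(record, "nurse_on_duty")
print(to_log_line(record))
```

Logging the instruction, plan, model, and approver together gives an auditor the chain from language input to authorized action in a single record.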

6.2. Societal Implications

Societal implications address how foundation model-powered service robots shape access, labor, and public trust, highlighting equity of deployment, transformation of human work roles, and the cultural acceptance of robots in daily life.

(1): Accessibility and equity of deployment: If foundation model-powered service robots remain expensive, access will skew toward wealthier institutions [266], leaving vulnerable populations (elderly people and rural communities) behind. Equitable deployment favors cost-efficient, compressed models (Section 4.4), and procurement or reimbursement mechanisms that extend access to underserved homes and care facilities [267].
(2): Labor and workspace transformation: As foundation-model-enabled robots become capable of handling service automation (hospital deliveries, retail stocking, and cleaning), societal concerns arise about job displacement [268,269]. At the same time, they may augment human workers by offloading repetitive or physically demanding tasks. Policy and design should prioritize augmentation (robotic workflows and role redesign) rather than substitution, with training pathways for clinicians and staff to oversee foundation-model-enabled service robots.
(3): Trust and public acceptance: Societal trust in robots depends on transparency, reliability, and cultural sensitivity. Public acceptance will require mobile service robots which adapt to cultural norms [270,271] and institutional standards, transparent behavior rationales (“why this action and is this action a socially non-discriminatory one?”), calibrated self-expression, and certification regimes appropriate to domestic and healthcare contexts [272].

6.3. Implications for Human–Robot Interaction

Human–robot interaction implications capture how foundation model-equipped service robots reshape user expectations, emotions, and trust, requiring careful management of emotional and psychological impact, persuasion, and multi-user conflict resolution.

(1): Emotional and cognitive impact on users: Prolonged interaction with foundation model-equipped robots can influence user psychology [273]. While service robots may enhance companionship and reduce isolation in eldercare, they can also induce over-reliance, anxiety, or emotional confusion if their social behaviors mimic humans too closely without reliability [274]. Designers should tune anthropomorphism and disclosure (e.g., system status and capabilities) to reduce over-reliance and confusion, especially in eldercare.
(2): Ethical persuasion and influence: LLM-based dialogue in mobile service robots enables persuasive interactions, such as reminding an elder to take medication or guiding a child to safe play. However, persuasion risks manipulation [275,276] if not carefully bounded. Research is needed on ethical persuasion policies that define limits for motivational dialogue in health, safety, or childcare contexts.
(3): Multi-user interaction and conflict resolution: Unlike lab settings, service robots often face competing instructions from multiple people (e.g., family members, nurses). Foundation models excel at multi-turn reasoning but lack arbitration mechanisms [277]. Research should explore multi-user dialogue management and conflict-resolution strategies for embodied agents.

6.4. Physical Design and Ergonomic Implications

Physical design and ergonomics consider the embodiment of mobile service robots to promote safety and effectively support foundation-model-enabled perception, reasoning, and interactions in real-world environments. While foundation models enhance high-level planning, multimodal understanding, and dialogue capabilities, their deployment in domestic and healthcare settings requires adaptive hardware configurations, morphology, and easy-to-use interfaces to meet user needs. Therefore, robot design plays a practical role in ensuring that advanced AI capabilities translate into safe, interpretable, and socially appropriate physical behaviors. Below, we discuss four main areas with respect to robot physical design and ergonomic implications.

(1): Embodiment legibility and intent communication: Foundation models enable robots to generate multi-step, context-sensitive action plans. However, without clear physical signaling, such as visible pre-motion orientation changes (e.g., rotating the base or turning the head toward the target direction before moving), projected path indicators on the floor, or auditory intent announcements (e.g., “moving to the kitchen”), such actions may appear abrupt, unpredictable, or unsafe to potential users. Robot design should incorporate legibility mechanisms, including gradual acceleration and deceleration profiles, directional pre-motion cues (e.g., torso or head orientation before navigation), light-based or display-based status indicators, and deliberate motion segmentation between task phases [278,279]. These physical cues allow users to anticipate robot movements and recognize a robot’s intent before behavior execution. This can improve perceived reliability during close-proximity operations.
(2): Safety-oriented morphology and physical constraints: The increased autonomy enabled by foundation models requires hardware-level safeguards that reduce the severity and likelihood of physical harm. These include compliant actuation (e.g., series elastic elements or torque-controlled joints); rounded external geometries to reduce injury risk upon accidental contact; conservative force and speed limits tuned for shared human environments; and passive mechanical limits that restrict joint ranges in high-risk zones [280]. Workspace-aware kinematic design, such as constraining the reach of a manipulator near a user’s head, or incorporating soft end-effectors for object transfer, can further reduce risk during assistive tasks. These design strategies ensure that even if high-level plans are imperfect or misinterpreted, the robot’s physical behavior remains bounded within safe operating parameters.
(3): Ergonomics for multi-user and accessibility contexts: Mobile service robots frequently operate in environments with diverse users, including older adults, children, and individuals with varying demographic characteristics. Thus, ergonomic design should accommodate varying physical abilities and interaction styles. This includes height-adjustable or angled displays for seated and standing users, multimodal input interfaces (touch, voice, gesture), clear visual feedback visible from multiple viewing angles, and intuitive emergency-stop accessibility [281]. In healthcare environments, physical interface placement should avoid interfering with clinical workflows or obstructing equipment. By optimizing for accessibility and shared-space usability, hardware design supports inclusive interaction across heterogeneous user groups.
(4): Hardware—foundation model co-design constraints: Foundation models impose computational, sensing, and thermal requirements that influence physical architecture. Sensor placements must support reliable multimodal perception (e.g., camera height aligned with seated and standing users), while onboard computation modules require appropriate thermal management without compromising form factor [282]. Power budgeting affects battery placement and weight distribution, influencing stability during robot navigation and manipulation. Co-design between hardware and model capabilities ensures that perception, language grounding, and uncertainty handling can operate reliably within the physical constraints of mobile platforms deployed in human-centered environments.
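
The "gradual acceleration and deceleration profiles" and "conservative force and speed limits" described above can be illustrated with a minimal Python sketch. It is not taken from any cited system: the function names, the linear scaling rule, and all numeric values (stop distance, slow-down distance, maximum speed) are hypothetical and chosen only for illustration.

```python
def proximity_speed_limit(distance_to_human_m, v_max=1.0,
                          stop_dist=0.3, slow_dist=1.5):
    """Return an allowed speed (m/s) given the distance to the nearest human.

    Hypothetical rule: full stop inside stop_dist, full speed beyond
    slow_dist, and a linear ramp in between.
    """
    if distance_to_human_m <= stop_dist:
        return 0.0
    if distance_to_human_m >= slow_dist:
        return v_max
    return v_max * (distance_to_human_m - stop_dist) / (slow_dist - stop_dist)


def ramp_velocity(v_current, v_target, a_max, dt):
    """Step v_current toward v_target without exceeding acceleration a_max."""
    max_step = a_max * dt
    delta = v_target - v_current
    # Clamp the change so that speed changes stay gradual and legible.
    delta = max(-max_step, min(max_step, delta))
    return v_current + delta


# Example: a robot at rest approaches a person standing 0.9 m away.
v = 0.0
for _ in range(5):
    v = ramp_velocity(v, proximity_speed_limit(0.9), a_max=0.5, dt=0.1)
```

In this sketch the commanded speed is always bounded by the proximity limit, regardless of what a high-level planner requests, which mirrors the idea that physical behavior should remain bounded even when plans are imperfect.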

The same properties that make foundation models powerful for perception, language-to-action mapping, uncertainty handling, and efficient execution also necessitate explicit governance, human–robot interaction design, and embodiment-aware physical constraints, so that the benefits detailed in Section 4 are realized without incurring the harms outlined here.

7. Future Research Directions

Building on the challenges identified in Section 3, Section 4, Section 5 and Section 6, we outline four forward-looking research directions that define a research agenda for advancing foundation-model-enabled mobile service robots: (1) reliability and lifelong adaptation; (2) privacy-aware and resource-constrained inference; (3) governance, standards, and human-in-the-loop frameworks; and (4) multi-robot coordination and fleet-level reasoning. Figure 8 presents these research directions in a multi-stage capability-oriented milestone framework. The vertical axis represents increasing research maturity and verification depth, progressing from controlled research validation (Stage 1) to field-validated autonomy (Stage 2) and ultimately to trustworthy and governed system-level robustness (Stage 3). At each stage, we define explicit capability milestones that correspond to verifiable system-level properties, such as closed-loop hallucination rejection in real environments, resistance to catastrophic forgetting across long-term service tasks, formal latency and safety-response guarantees under live human–robot interaction, privacy-certified federated adaptation without raw data transmission, auditable decision traces suitable for regulatory review, and scalable fleet-level coordination without central bottlenecks.

7.1. Reliability and Lifelong Adaptation

Reliability and lifelong adaptation address the need for mobile service robots with embedded foundation models to operate safely over time, avoiding hallucinations and preserving knowledge as environments and tasks evolve.

(1): Mitigating hallucinations in safety-critical domains: LLMs and MLLMs are prone to generating plausible but incorrect outputs. In mobile service robots, hallucinations can cause serious risks [283], such as falsely identifying medical equipment or proposing unsafe navigation paths. Future research must develop hallucination-detection pipelines that ground foundation model outputs against real-time sensory consistency checks, ensuring that every proposed action is validated before execution [284].
(2): Addressing catastrophic forgetting during continual learning: Mobile service robots deployed in hospitals or public spaces must adapt to evolving environments (e.g., service space rearrangements, new medical routines). Yet, online fine-tuning of foundation models risks catastrophic forgetting of previously learned tasks. Future work should explore continual adaptation methods [285,286] with domain-specific rehearsal buffers, elastic weight consolidation at the edge, or modular adapters that preserve core reasoning while integrating local updates.
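
As a minimal sketch of the sensory consistency check described in item (1), the following Python example accepts a plan step only if its target object is currently supported by a sufficiently confident detection. The `Detection` and `PlanStep` types, the `validate_plan` function, and the 0.6 confidence threshold are hypothetical; a real pipeline would need richer grounding than label matching.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    label: str         # object class reported by the perception stack
    confidence: float  # detector confidence in [0, 1]


@dataclass
class PlanStep:
    action: str  # e.g., "pick", "fetch"
    target: str  # object the foundation model refers to


def validate_plan(plan, detections, min_confidence=0.6):
    """Split plan steps into (grounded, rejected) lists.

    A step is grounded only if its target matches a detection whose
    confidence meets the (hypothetical) threshold.
    """
    visible = {d.label for d in detections if d.confidence >= min_confidence}
    grounded, rejected = [], []
    for step in plan:
        if step.target in visible:
            grounded.append(step)
        else:
            rejected.append(step)
    return grounded, rejected


plan = [PlanStep("pick", "cup"), PlanStep("fetch", "defibrillator")]
detections = [Detection("cup", 0.9), Detection("defibrillator", 0.3)]
grounded, rejected = validate_plan(plan, detections)
```

Here the step referring to the weakly detected object is rejected rather than executed, so it can be re-queried or escalated instead of acted upon.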

7.2. Privacy-Aware and Further Resource-Constrained Research and Inference

Privacy-aware and resource-constrained research and inference highlight the challenge of adapting foundation models to sensitive domains such as homes and hospitals, where strict data governance and limited onboard compute demand new approaches for secure, efficient, and context-specific execution.

(1): Data scarcity and privacy in sensitive domains: Unlike web-scale training, robots in homes and hospitals face data scarcity and strict privacy constraints. Foundation models cannot rely on large-scale collection of in-situ data [287]. Future directions include federated foundation model adaptation, synthetic scene generation grounded in domestic and clinical layouts, and on-device fine-tuning under limited compute and memory budgets.
(2): Efficient scaling for embedded execution: Although Section 4.4 showed advances in compressed and efficient architectures, mobile service robot research still faces energy, latency, and throughput ceilings [288]. Further research is required on task-aware scaling laws, where model complexity adapts dynamically to robot workload (e.g., low-power monitoring vs. high-bandwidth multi-person dialogue), ensuring safe operation without exhausting onboard resources. Reducing computational load therefore remains an open optimization problem.
(3): Sustainability and energy consumption: Beyond latency and throughput, long-horizon operations of mobile service robots are fundamentally constrained by energy autonomy. Foundation models are energy-intensive, and sustained perception, reasoning, and dialogue can rapidly deplete onboard power, making frequent recharging impractical. Future research should investigate model-aware energy management mechanisms [289] that explicitly couple reasoning depth, perception frequency, and planning fidelity to the robot’s battery state and mission horizon [290]. Such mechanisms may include adaptive inference modes and energy-aware task scheduling. Under low-power conditions, hierarchical activation policies that prioritize safety-critical functions [291] can further ensure context-appropriate and reliable operation over extended deployments.
(4): Performance trade-offs under onboard computational constraints: Adapting foundation models for onboard execution in mobile service robots introduces critical trade-offs between computational load and key performance properties, including reasoning depth, response latency, perceptual fidelity, and recovery robustness. Techniques such as model compression [292], reduced-precision inference [293], or early-exit reasoning [294] can lower latency and energy consumption; however, they may degrade interaction-level performance, particularly during mid-task corrections or recovery from execution errors [295]. When computational load is aggressively constrained, models may truncate reasoning chains, skip verification passes, or operate with lower-resolution sensory inputs to meet latency budgets. This directly affects safety-relevant behaviors: removal of verification stages prevents cross-checking of predicted actions; reduced reasoning depth weakens long-horizon task consistency; and lower perceptual fidelity increases the likelihood of misidentifying objects or human intent [296]. In healthcare and assisted living settings, such compromises can lead to executing incorrect or unsafe actions, even when average task success metrics remain acceptable. Future research must define explicit deployment constraints (e.g., maximum allowable response latency and minimum required verification depth) beyond which models should not be compressed or simplified [297].
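
Items (3) and (4) can be combined in a small illustrative sketch: an inference mode is chosen according to battery state, but only among modes that satisfy explicit deployment constraints on latency and verification depth. The mode table, its energy and latency figures, and the low-battery threshold are all hypothetical values, not measurements from the reviewed literature.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceMode:
    name: str
    energy_cost: float         # relative energy per query (hypothetical)
    latency_s: float           # expected response latency in seconds
    verification_passes: int   # number of cross-checks before acting


# Hypothetical modes, ordered from most to least capable.
MODES = [
    InferenceMode("full", 1.0, 0.8, 2),
    InferenceMode("balanced", 0.5, 0.4, 1),
    InferenceMode("minimal", 0.2, 0.2, 0),
]


def select_mode(battery_fraction, max_latency_s, min_verification_passes,
                low_battery=0.3):
    """Return an admissible mode, or None if no mode meets the constraints.

    Constraints are applied first, so a low battery can never push the
    robot below the required verification depth.
    """
    admissible = [
        m for m in MODES
        if m.latency_s <= max_latency_s
        and m.verification_passes >= min_verification_passes
    ]
    if not admissible:
        return None  # caller should defer the task or escalate to a human
    if battery_fraction < low_battery:
        return min(admissible, key=lambda m: m.energy_cost)
    return max(admissible, key=lambda m: m.verification_passes)
```

For example, `select_mode(0.2, 1.0, 1)` returns the "balanced" mode, whereas `select_mode(0.2, 0.3, 1)` returns `None` because no mode is both fast enough and sufficiently verified. Filtering before optimizing is the design choice of interest here: energy savings are sought only inside the safe region.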

7.3. Governance, Standards, and Human-in-the-Loop Frameworks

Governance, standards, and human-in-the-loop frameworks emphasize the need for systematic evaluation environments, regulatory guidance, and supervisory protocols to ensure that foundation-model-enabled service robots operate safely, transparently, and under accountable oversight in human-centered domains.

(1): Standardized benchmarks for mobile service robots: Existing foundation model benchmarks such as Habitat provide strong environments for testing navigation and embodied perception, but they remain limited in capturing the complexity of human-centered service tasks [31]. Current frameworks often evaluate vision, language, or manipulation in isolation, without integrating multimodal grounding, safety norms, or real-time human interaction. Future research should focus on developing multi-room, human-in-the-loop benchmarks that embed realistic human agents, unpredictable activities, and task-level evaluation metrics relevant to homes, hospitals, and service venues. Such benchmarks would enable systematic testing of foundation model-powered service robots under conditions that better approximate real-world domestic assistance, patient care, and service automation.
(2): Human-over-the-loop interaction protocols: Even with strong autonomy, mobile service robots must allow human supervisors to override or confirm foundation model-generated plans in safety-critical domains [298,299]. Research should explore interaction protocols for seamless fallback control, combining foundation model reasoning with predictable human-in-the-loop interfaces.
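
One possible shape for such a fallback protocol is a confirmation gate placed between plan generation and execution. The sketch below is an assumption-laden illustration: the `risk_score` and `confidence` inputs, both thresholds, and the `"confirm"`/`"override"` response strings are hypothetical placeholders for whatever a deployed system would provide.

```python
from enum import Enum


class Decision(Enum):
    EXECUTE = "execute"
    ASK_HUMAN = "ask_human"
    ABORT = "abort"


def gate_plan(risk_score, confidence, supervisor_response=None,
              risk_threshold=0.5, confidence_threshold=0.8):
    """Decide whether a foundation-model-generated plan may run.

    supervisor_response is None (no input yet), "confirm", or "override".
    """
    # A human override always takes precedence over autonomous execution.
    if supervisor_response == "override":
        return Decision.ABORT
    # Low-risk, high-confidence plans run without interrupting the human.
    if risk_score < risk_threshold and confidence >= confidence_threshold:
        return Decision.EXECUTE
    # Otherwise the plan needs explicit confirmation before it runs.
    if supervisor_response == "confirm":
        return Decision.EXECUTE
    return Decision.ASK_HUMAN
```

Under these assumptions, `gate_plan(0.9, 0.9)` yields `Decision.ASK_HUMAN`, and the same plan runs only after the supervisor responds with `"confirm"`.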

7.4. Multi-Robot Coordination and Fleet-Level Reasoning

Multi-robot coordination addresses the challenge of scaling foundation-model-enabled service robots beyond single robots to collaborative teams operating in shared human environments, such as hospitals, airports, and large public facilities. To date, the majority of embedded foundation model research has focused on single-robot perception and decision-making [300]. However, real-world large service applications require groups of robots coordinating tasks with shared spatial and semantic understanding, while resolving conflicts [301].

(1): Conflict-aware multi-agent planning: When multiple mobile service robots operate concurrently, uncoordinated foundation model reasoning may lead to spatial conflicts, redundant actions, or unsafe congestion around humans. Future research should explore multi-agent extensions of vision-language-action reasoning that incorporate shared world models, explicit conflict prediction, and negotiation mechanisms between agents [302]. This includes grounding high-level language goals (e.g., “assist patients on this floor”) into coordinated task allocations that account for robot proximity, workload balance, and human presence.
(2): Communication and knowledge sharing across robot fleets: Foundation models enable rich semantic reasoning; however, fleet-level performance depends on how knowledge is shared across robots. Future work should investigate lightweight inter-robot communication protocols that allow robots to exchange task states, uncertainty estimates, and environment updates without excessive bandwidth or privacy leakage [303]. Techniques such as shared semantic maps, event-based updates, or selectively synchronized foundation model adapters could enable collective situational awareness while preserving onboard autonomy.
(3): Hierarchical and supervisory fleet control: Large-scale service environments often require a hierarchical structure, where local robot autonomy is balanced with fleet-level supervision. Research is still needed on hierarchical orchestration frameworks in which foundation models operate at different abstraction levels: at the local level, individual robots perform perception, grounding, and task execution; at the global level, supervisory planners coordinate task allocation, prioritization, traffic regulation, and escalation to human operators when conflicts arise [304]. By explicitly separating local embodied reasoning from fleet-level orchestration, such hierarchical architectures support scalable coordination while maintaining human oversight in safety-critical domains.
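
To make the notion of task allocation that accounts for "robot proximity" and "workload balance" concrete, the following Python sketch implements a simple greedy supervisory allocator. It is only an illustration under stated assumptions: robots and tasks are reduced to 2D positions, the cost is Euclidean distance plus a hypothetical workload penalty, and human presence, negotiation, and conflict prediction are not modeled.

```python
import math


def allocate_tasks(robots, tasks, workload_weight=1.0):
    """Greedily assign each task to the robot with the lowest cost.

    robots: dict mapping robot name -> (x, y) position
    tasks:  dict mapping task name  -> (x, y) position
    Cost = distance to task + workload_weight * tasks already assigned.
    Returns a dict mapping task name -> robot name.
    """
    workload = {name: 0 for name in robots}
    assignment = {}
    # Tasks are processed in name order to keep the result deterministic.
    for task in sorted(tasks):
        tx, ty = tasks[task]

        def cost(name):
            rx, ry = robots[name]
            return math.hypot(tx - rx, ty - ry) + workload_weight * workload[name]

        best = min(sorted(robots), key=cost)
        assignment[task] = best
        workload[best] += 1
    return assignment


robots = {"r1": (0.0, 0.0), "r2": (10.0, 0.0)}
tasks = {"t1": (1.0, 0.0), "t2": (2.0, 0.0), "t3": (9.0, 0.0)}
near_first = allocate_tasks(robots, tasks, workload_weight=5.0)
balanced = allocate_tasks(robots, tasks, workload_weight=10.0)
```

With the smaller penalty, the nearby robot takes both close tasks; with the larger penalty, the second task is handed to the farther but idle robot. The weight therefore expresses the trade-off between proximity and workload balance that a fleet-level supervisor would have to tune.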

8. Conclusions

This systematic review examined the integration of foundation models into mobile service robotics, focusing on how they address four core challenges: multimodal perception, translation of natural language into executable actions, uncertainty estimation across perception and planning, and computational efficiency for real-time operation. We analyzed how recent advances in LLMs, VLMs, MLLMs, and VLAs provide technical solutions to these challenges, such as cross-modal fusion, instruction grounding, confidence-weighted reasoning, and lightweight scalable architectures. We also discussed real-world applications in domestic assistance, healthcare, and service automation, highlighting how foundation models enable context-aware, socially responsive, and generalizable robot behaviors for service robots. While significant progress has been made in bridging semantic reasoning and embodied robot actions in real-world environments, critical limitations remain for achieving reliable and scalable deployment, specifically in model reliability and lifelong adaptation, privacy-aware and resource-constrained deployment, and the development of regulatory guidance and supervisory protocols for accountable operation in human-centered domains. Research in these areas will allow mobile service robots to safely, adaptively, and collaboratively interact in dynamic human environments to become proactive assistants in everyday life.

Author Contributions

Conceptualization: M.L. and G.N.; literature search and analysis: M.L. and G.N.; writing—original draft preparation: M.L.; writing—review and editing: M.L., B.B., and G.N.; funding acquisition: B.B. and G.N.; supervision: B.B. and G.N. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC), the NSERC Alliance program, and the NSERC CREATE ADVENTOR fellowship.

Data Availability Statement

No new data were created or analyzed in this study.

Acknowledgments

During the preparation of this manuscript, the authors used Nano Banana Pro AI Image Generator only for the purpose of generating illustrative conceptual images in Figure 1, Figure 2, Figure 4 and Figure 8. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Systematic Review Methodology

Appendix A.1. PRISMA Checklist

This systematic review follows the PRISMA reporting guidelines [64]. Items applicable primarily to quantitative meta-analysis, clinical outcome synthesis, or statistical effect estimation (e.g., risk-of-bias assessment, effect measures, heterogeneity analysis, and certainty estimation) are marked as N/A, as this work constitutes a systematic review and research area analysis rather than an outcome-driven meta-analysis [305,306,307]. Table A1 maps each relevant PRISMA [64] checklist item to its corresponding section in the manuscript.

Table A1. PRISMA checklist mapping.

PRISMA Item	Description	Location in Manuscript
1	Identified as systematic review in title	Title
2	Structured abstract	Abstract
3	Rationale	Section 1
4	Objectives	Section 1
5	Eligibility criteria	Section 2
6	Information sources	Section 2.1 + Appendix A.2, Appendix A.3 and Appendix A.4 + Figure A1 + Table A2
7	Full search strategy	Section 2.1 + Appendix A.2 + Figure A1 + Table A2
8	Selection process	Appendix A.2 + Figure A1
9	Data collection	Appendix A.3
10a	Outcomes	Section 3 + Table 1, Table 2 and Table 3
10b	Variables	Section 2 + Appendix A + Figure A1 + Table A2 (used OpenAlex [65])
11	Risk of bias	N/A (scoping/systematic mapping review; the closest related material is in Appendix A.2 and Appendix A.3)
12	Effect measures	N/A
13a–f	Synthesis methods	Section 3 + Appendix A.3
14	Reporting bias	N/A
15	Certainty assessment	N/A
16a	Study selection	Appendix A.2 and Appendix A.3 + Figure A1 + Table A2
16b	Excluded papers	Appendix A.2 and Appendix A.3 + Figure A1 + Table A2
17	Study characteristics	Section 3 + Appendix A.3
18–22	Meta-analysis items	N/A
23a–d	Discussion	Section 3, Section 4, Section 5, Section 6 and Section 7 + Appendix A.5
24	Registration	Not registered
25	Funding	Funding Section
26	Conflicts	Conflicts of Interest Section
27	Data availability	Data Availability Section

Appendix A.2. Selection Process

The selection for this review was conducted using a multi-stage screening process aligned with the PRISMA [64] recommendations (Figure A1).

Figure A1. PRISMA flow diagram of selection procedure.

OpenAlex [65] was used as the bibliographic retrieval and filtering platform. Paper identification, based on keyword querying, domain-level exclusion, and embodied robot intelligence criteria, was performed through OpenAlex [65]; the explicit Boolean inclusion and exclusion parameters at each stage are provided in Appendix A.4. No independent machine-learning classifiers or crowdsourced screening systems were used in addition to OpenAlex [65] indexing and metadata retrieval.

Following completion of the three-stage filtering, 7506 papers were sought for retrieval. Manual validation of a random sample of 100 papers was performed by the first corresponding author. It was found that the content of the sample aligned with the predefined inclusion and exclusion criteria and the intended scope of embodied mobile service robotics.

Appendix A.3. Data Extraction and Categorization Procedure

Research challenge rankings were obtained from OpenAlex [65]. Of the 7506 papers, the OpenAlex query outputs associated 2190 papers with language-to-action mapping, 2164 with multimodal perception, 2005 with uncertainty estimation, and 1147 with computational capabilities. These rankings are presented in Section 3 and in Table 1.
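
The reported counts can be checked with a few lines of Python. The numbers are taken directly from the paragraph above; the percentage shares are derived here and are not reported in the source. The fact that the four counts sum exactly to the corpus size is consistent with each paper being assigned to a single challenge, although the source does not state this explicitly.

```python
# Category counts as reported for the 7506-paper corpus.
counts = {
    "language-to-action mapping": 2190,
    "multimodal perception": 2164,
    "uncertainty estimation": 2005,
    "computational capabilities": 1147,
}

total = sum(counts.values())  # 7506, matching the reported corpus size

# Share of the corpus per challenge, in percent (derived, not from the source).
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

# Ranking from most to least represented challenge.
ranking = sorted(counts, key=counts.get, reverse=True)
```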

To assess the reliability of the OpenAlex-based categorization and ranking, a sample-based manual validation procedure was conducted. For each of the four identified research challenges, 100 papers were randomly sampled from the OpenAlex-filtered subset (total validation sample n = 400 for the four open challenges). The first corresponding author reviewed the sampled titles and abstracts. All sampled papers were found to be consistent with their OpenAlex [65]-based categorization.

Appendix A.4. Full Boolean Search Strings and Operators

The Boolean search strings used in the OpenAlex [65] search to obtain the 7506 papers are presented in Table A2.

Table A2. Full Boolean search strings used.

Stage	Boolean Search Strings Used
(1) Keyword Query	“(“autonomous robot” OR “embedded AI” OR “embodied AI”)”
(2) Domain Filtering	“(“autonomous robot” OR “embedded AI” OR “embodied AI”) NOT (driving OR vehicle OR UAV OR drone OR surgery OR surgical OR underwater OR biology)”
(3) Embodied Robot Intelligence	“(“autonomous robot” OR “embedded AI” OR “embodied AI”) AND (multimodal OR “multi-modal” OR perception OR “sensor fusion” OR “RGB-D” OR “visual-inertial” OR “multimodal learning” OR “scene understanding” OR language OR instruction OR “language-conditioned” OR “instruction following” OR “language-to-action” OR “vision-language-action” OR “LLM-powered” OR uncertainty OR estimate OR estimation OR “computationally efficient” OR compute) NOT (driving OR vehicle OR UAV OR drone OR surgery OR surgical OR underwater OR biology)”
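
The three stages in Table A2 share the same building blocks, which the following Python sketch makes explicit by assembling the strings from term lists. The helper names (`quote`, `any_of`) are hypothetical, and the code uses straight quotation marks where the table uses typographic ones. How the strings were submitted to OpenAlex is not specified in the source, so the sketch only constructs the query text.

```python
CORE = ["autonomous robot", "embedded AI", "embodied AI"]

EXCLUDE = ["driving", "vehicle", "UAV", "drone",
           "surgery", "surgical", "underwater", "biology"]

INTELLIGENCE = [
    "multimodal", "multi-modal", "perception", "sensor fusion", "RGB-D",
    "visual-inertial", "multimodal learning", "scene understanding",
    "language", "instruction", "language-conditioned",
    "instruction following", "language-to-action", "vision-language-action",
    "LLM-powered", "uncertainty", "estimate", "estimation",
    "computationally efficient", "compute",
]


def quote(term):
    """Quote multi-word or hyphenated terms, as done in Table A2."""
    return f'"{term}"' if (" " in term or "-" in term) else term


def any_of(terms):
    """Join terms into a parenthesized OR group."""
    return "(" + " OR ".join(quote(t) for t in terms) + ")"


stage1 = any_of(CORE)                                   # keyword query
stage2 = f"{stage1} NOT {any_of(EXCLUDE)}"              # domain filtering
stage3 = f"{stage1} AND {any_of(INTELLIGENCE)} NOT {any_of(EXCLUDE)}"
```

Written this way, it is easy to see that stage 3 differs from stage 2 only by the added AND group covering the four research challenges.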

Appendix A.5. Limitations of the Review Methodology

The following limitations should be considered when interpreting the findings presented:

(1): Single-database literature search: The literature search was conducted using OpenAlex [65]. While OpenAlex [65] provides broad cross-disciplinary coverage and transparent indexing, reliance on a single database may have omitted papers indexed only in other repositories (e.g., IEEE Xplore, Scopus, Web of Science). OpenAlex [65] was selected to ensure reproducibility, consistent query structure, and longitudinal coverage across multiple decades.
(2): Absence of formal risk-of-bias assessment: No formal risk-of-bias or methodological quality assessment was conducted. This is consistent with qualitative systematic reviews and thematic analyses that aim to characterize research areas rather than compare empirical effect sizes or benchmark performance [308,309]. Similar approaches have been adopted in scoping and mapping reviews in robotics and artificial intelligence, where the objective is to identify thematic concentrations and research gaps rather than evaluate statistical validity of outcomes [310]. Accordingly, the PRISMA [64] items related to quantitative bias assessment, heterogeneity, and effect measures are not applicable.
(3): Breadth-depth trade-off: This review emphasizes breadth to capture the long-term evolution of embodied AI research toward foundation-model-enabled mobile service robotics. Papers were not exhaustively analyzed at the individual level. This design choice is appropriate for reviews aiming to identify research areas and open challenges at a field level [305,306,307], rather than conduct comparative experimental benchmarking.

Despite these limitations, the adopted methodology provides a transparent, reproducible, and comprehensive mapping of dominant research challenges in embodied mobile service robotics.

References

Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training; Technical Report; OpenAI: San Francisco, CA, USA, 2018; Available online: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf (accessed on 5 December 2025).
Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models Are Few-Shot Learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, BC, Canada, 6–12 December 2020; Curran Associates, Inc.: Red Hook, NY, USA, 2020; pp. 1877–1901. Available online: https://dl.acm.org/doi/10.5555/3495724.3495883 (accessed on 5 December 2025).
Thirunavukarasu, A.J.; Ting, D.S.J.; Elangovan, K.; Gutierrez, L.; Tan, T.F.; Ting, D.S.W. Large Language Models in Medicine. Nat. Med. 2023, 29, 1930–1940.
Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. LLaMA 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288.
Taori, R.; Gulrajani, I.; Zhang, T.; Dubois, Y.; Li, X.; Guestrin, C.; Liang, P.; Hashimoto, T.B. Alpaca: A Strong, Replicable Instruction-Following Model; Stanford Center for Research on Foundation Models: Stanford, CA, USA, 2023; Available online: https://crfm.stanford.edu/2023/03/13/alpaca.html (accessed on 5 December 2025).
Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. PaLM: Scaling Language Modeling with Pathways. J. Mach. Learn. Res. 2023, 24, 11324–11436. Available online: https://dl.acm.org/doi/10.5555/3648699.3648939 (accessed on 5 December 2025).
Zhang, J.; Huang, J.; Jin, S.; Lu, S. Vision-Language Models for Vision Tasks: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5625–5644.
Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 18–22 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 6077–6086.
Li, L.H.; Yatskar, M.; Yin, D.; Hsieh, C.-J.; Chang, K.-W. VisualBERT: A Simple and Performant Baseline for Vision and Language. arXiv 2019, arXiv:1908.03557.
Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv 2022, arXiv:2204.06125.
Bao, H.; Dong, L.; Piao, S.; Wei, F. BEiT: BERT Pre-Training of Image Transformers. arXiv 2021, arXiv:2106.08254.
Ichter, B.; Brohan, A.; Chebotar, Y.; Finn, C.; Hausman, K.; Herzog, A.; Ho, D.; Ibarz, J.; Irpan, A.; Jang, E.; et al. Do as I Can, Not as I Say: Grounding Language in Robotic Affordances. In Proceedings of the 6th Conference on Robot Learning (CoRL 2022), Auckland, New Zealand, 14–18 December 2022; PMLR: Red Hook, NY, USA, 2023; pp. 287–318. Available online: https://proceedings.mlr.press/v205/ichter23a.html (accessed on 5 December 2025).
Driess, D.; Xia, F.; Sajjadi, M.S.M.; Lynch, C.; Chowdhery, A.; Ichter, B.; Wahid, A.; Tompson, J.; Vuong, Q.; Yu, T.; et al. PaLM-E: An Embodied Multimodal Language Model. In Proceedings of the 40th International Conference on Machine Learning (ICML 2023), Honolulu, HI, USA, 23–29 July 2023; JMLR.org: Red Hook, NY, USA, 2023; 20p, Available online: https://dl.acm.org/doi/10.5555/3618408.3618748 (accessed on 5 December 2025).
Zitkovich, B.; Yu, T.; Xu, S.; Xu, P.; Xiao, T.; Xia, F.; Wu, J.; Wohlhart, P.; Welker, S.; Wahid, A.; et al. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. In Proceedings of the 7th Conference on Robot Learning (CoRL 2023), Atlanta, GA, USA, 6–9 November 2023; PMLR: Red Hook, NY, USA, 2023; Volume 229, pp. 2165–2183. Available online: https://proceedings.mlr.press/v229/zitkovich23a.html (accessed on 5 December 2025).
Rivkin, D.; Hogan, F.; Feriani, A.; Konar, A.; Sigal, A.; Liu, X.; Dudek, G. AIoT Smart Home via Autonomous LLM Agents. IEEE Internet Things J. 2024, 12, 2458–2472.
Giudici, M.; Padalino, L.; Paolino, G.; Paratici, I.; Pascu, A.I.; Garzotto, F. Designing Home Automation Routines Using An LLM-Based Chatbot. Designs 2024, 8, 43.
Li, Y.; Wen, H.; Wang, W.; Li, X.; Yuan, Y.; Liu, G.; Liu, J.; Xu, W.; Wang, X.; Sun, Y.; et al. Personal LLM Agents: Insights and Survey About the Capability, Efficiency and Security. arXiv 2024, arXiv:2401.05459.
Pandya, A. ChatGPT-Enabled daVinci Surgical Robot Prototype: Advancements and Limitations. Robotics 2023, 12, 97.
Salichs, M.A.; Castro-González, Á.; Salichs, E.; Fernández-Rodicio, E.; Maroto-Gómez, M.; Gamboa-Montero, J.J.; Marques-Villarroya, S.; Castillo, J.C.; Alonso-Martín, F.; Malfaz, M. Mini: A New Social Robot for the Elderly. Int. J. Soc. Robot. 2020, 12, 1231–1249.
Paiva, A.; Leite, I.; Boukricha, H.; Wachsmuth, I. Empathy in Virtual Agents and Robots: A Survey. ACM Trans. Interact. Intell. Syst. 2017, 7, 11.
Zhao, X.; Li, M.; Weber, C.; Hafez, M.B.; Wermter, S. Chat with the Environment: Interactive Multimodal Perception using Large Language Models. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2023), Detroit, MI, USA, 1–5 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 3590–3596.
Xia, Y.; Zhang, J.; Jazdi, N.; Weyrich, M. Incorporating Large Language Models into Production Systems for Enhanced Task Automation and Flexibility. arXiv 2024, arXiv:2407.08550.
Wang, Z.; Qin, H. Intelligent Industrial Production Process Automatic Regulation System Based on LLM Agents. In Proceedings of the 5th International Conference on Artificial Intelligence and Electromechanical Automation (AIEA 2024), 14–16 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 133–137.
Huang, W.; Abbeel, P.; Pathak, D.; Mordatch, I. Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents. In Proceedings of the 39th International Conference on Machine Learning (ICML 2022), Baltimore, MD, USA, 17–23 July 2022; PMLR: Red Hook, NY, USA, 2022; pp. 9118–9147. Available online: https://proceedings.mlr.press/v162/huang22a.html (accessed on 5 December 2025).
Huang, W.; Xia, F.; Xiao, T.; Chan, H.; Liang, J.; Florence, P.; Zeng, A.; Tompson, J.; Mordatch, I.; Chebotar, Y.; et al. Inner Monologue: Embodied Reasoning through Planning with Language Models. In Proceedings of the 6th Conference on Robot Learning (CoRL 2022), Auckland, New Zealand, 14–18 December 2022; PMLR: Red Hook, NY, USA, 2023; Volume 205, pp. 1769–1782. Available online: https://proceedings.mlr.press/v205/huang23c.html (accessed on 5 December 2025).
Shridhar, M.; Manuelli, L.; Fox, D. CLIPort: What and Where Pathways for Robotic Manipulation. In Proceedings of the 5th Conference on Robot Learning (CoRL 2021), 8–11 November 2021; PMLR: Red Hook, NY, USA, 2022; Volume 164, pp. 894–906. Available online: https://proceedings.mlr.press/v164/shridhar22a.html (accessed on 5 December 2025).
Narcomey, A.; Tsoi, N.; Desai, R.; Vázquez, M. Learning human preferences over robot behavior as soft planning constraints. arXiv 2024, arXiv:2403.19795. [Google Scholar] [CrossRef]
Cao, Z.; Wang, Z.; Xie, S.; Liu, A.; Fan, L. Smart Help: Strategic Opponent Modeling for Proactive and Adaptive Robot Assistance in Households. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 17–21 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 18091–18101. [Google Scholar] [CrossRef]
Xiao, A.; Janaka, N.; Hu, T.; Gupta, A.; Li, K.; Yu, C. Robi Butler: Multimodal Remote Interaction with a Household Robot Assistant. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA 2025), Atlanta, GA, USA, 19–23 May 2025; IEEE: Piscataway, NJ, USA, 2025. [Google Scholar] [CrossRef]
Cao, Y.; Zhang, J.; Yu, Z.; Liu, S.; Qin, Z.; Zou, Q.; Du, B.; Xu, K. CogNav: Cognitive Process Modeling for Object Goal Navigation with LLMs. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2025), Honolulu, HI, USA, 19–23 October 2025; IEEE/CVF: Piscataway, NJ, USA, 2025; pp. 9550–9560. Available online: https://openaccess.thecvf.com/content/ICCV2025/papers/Cao_CogNav_Cognitive_Process_Modeling_for_Object_Goal_Navigation_with_LLMs_ICCV_2025_paper.pdf (accessed on 5 December 2025).
Puig, X.; Undersander, E.; Szot, A.; Dallaire Cote, M.; Yang, T.-Y.; Partsey, R.; Desai, R.; Clegg, A.W.; Hlavac, M.; Min, S.Y.; et al. Habitat 3.0: A Co-Habitat for Humans, Avatars and Robots. arXiv 2023, arXiv:2310.13724. [Google Scholar] [CrossRef]
Eftekhar, A.; Weihs, L.; Hendrix, R.; Caglar, E.; Salvador, J.; Herrasti, A.; Han, W.; VanderBilt, E.; Kembhavi, A.; Farhadi, A.; et al. The One RING: A Robotic Indoor Navigation Generalist. arXiv 2024, arXiv:2412.14401. [Google Scholar] [CrossRef]
Hu, J.; Hendrix, R.; Farhadi, A.; Kembhavi, A.; Martín-Martín, R.; Stone, P. FLaRe: Achieving Masterful and Adaptive Robot Policies with Large-Scale Reinforcement Learning Fine-Tuning. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA 2025), Atlanta, GA, USA, 19–23 May 2025; IEEE: Piscataway, NJ, USA, 2025. [Google Scholar] [CrossRef]
Fan, H.; Liu, X.; Fuh, J.Y.H.; Lu, W.F.; Li, B. Embodied Intelligence in Manufacturing: Leveraging Large Language Models for Autonomous Industrial Robotics. J. Intell. Manuf. 2025, 36, 1141–1157. [Google Scholar] [CrossRef]
Salierno, G.; Leonardi, L.; Cabri, G. Generative AI and large language models in Industry 5.0: Shaping Smarter Sustainable Cities. Encyclopedia 2025, 5, 30. [Google Scholar] [CrossRef]
Li, S.; Wang, J.; Dai, R.; Ma, W.; Ng, W.Y.; Hu, Y. RoboNurse-VLA: Robotic Scrub Nurse System Based on Vision-Language-Action Model. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2025), Hangzhou, China, 19–25 October 2025; IEEE: Piscataway, NJ, USA, 2025. [Google Scholar] [CrossRef]
Wu, J.; Antonova, R.; Kan, A.; Lepert, M.; Zeng, A.; Song, S.; Bohg, J.; Rusinkiewicz, S.; Funkhouser, T. Tidybot: Personalized Robot Assistance with Large Language Models. Auton. Robot. 2023, 47, 1087–1102. [Google Scholar] [CrossRef]
Zhao, Z.; Tang, H.; Yan, Y. Audio-Visual Navigation with Anti-Backtracking. In Pattern Recognition; Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, C.-L., Bhattacharya, S., Pal, U., Eds.; Lecture Notes in Computer Science; Springer Nature Switzerland: Cham, Switzerland, 2025; Volume 15318, pp. 358–372. [Google Scholar] [CrossRef]
Fikes, R.E.; Nilsson, N.J. STRIPS: A New Approach to the Application of Theorem Proving to Problem Solving. Artif. Intell. 1971, 2, 189–208. [Google Scholar] [CrossRef]
Nau, D.S.; Au, T.-C.; Ilghami, O.; Kuter, U.; Murdock, J.W.; Wu, D.; Yaman, F. SHOP2: An HTN Planning System. J. Artif. Intell. Res. 2003, 20, 379–404. [Google Scholar] [CrossRef]
Nau, D.; Au, T.-C.; Ilghami, O.; Kuter, U.; Wu, D.; Yaman, F.; Munoz-Avila, H.; Murdock, J.W. Applications of SHOP and SHOP2. IEEE Intell. Syst. 2005, 20, 34–41. [Google Scholar] [CrossRef]
Fox, M.; Long, D. PDDL2.1: An Extension to PDDL for Expressing Temporal Planning Domains. J. Artif. Intell. Res. 2003, 20, 61–124. Available online: https://dl.acm.org/doi/10.5555/1622452.1622454 (accessed on 5 December 2025). [CrossRef]
Ghallab, M.; Nau, D.; Traverso, P. Automated Planning: Theory and Practice; Morgan Kaufmann; Elsevier: San Francisco, CA, USA, 2004; ISBN 978-1-55860-856-6. [Google Scholar] [CrossRef]
Misra, D.K.; Sung, J.; Lee, K.; Saxena, A. Tell me Dave: Context-Sensitive Grounding of Natural Language to Manipulation Instructions. Int. J. Robot. Res. 2016, 35, 281–300. [Google Scholar] [CrossRef]
MacGlashan, J.; Babes-Vroman, M.; desJardins, M.; Littman, M.; Muresan, S.; Squire, S.; Tellex, S.; Arumugam, D.; Yang, L. Grounding English commands to reward functions. In Proceedings of the Robotics: Science and Systems XI (RSS 2015), Rome, Italy, 13–17 July 2015; Robotics: Science and Systems Foundation: Ann Arbor, MI, USA, 2015. [Google Scholar] [CrossRef]
Chen, D.; Mooney, R. Learning to Interpret Natural Language Navigation Instructions from Observations. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI 2011), San Francisco, CA, USA, 7–11 August 2011; pp. 859–865. [Google Scholar] [CrossRef]
Hazirbas, C.; Ma, L.; Domokos, C.; Cremers, D. FuseNet: Incorporating Depth into Semantic Segmentation Via Fusion-Based CNN Architecture. In Computer Vision—ACCV 2016; Lai, S.-H., Lepetit, V., Nishino, K., Sato, Y., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2017; Volume 10111, pp. 213–228. [Google Scholar] [CrossRef]
Hong, D.; Yokoya, N.; Xia, G.-S.; Chanussot, J.; Zhu, X.X. X-ModalNet: A Semi-Supervised Deep Cross-Modal Network for Classification of Remote Sensing Data. ISPRS J. Photogramm. Remote Sens. 2020, 167, 12–23. [Google Scholar] [CrossRef]
Zuñiga-Noël, D.; Ruiz-Sarmiento, J.-R.; Gomez-Ojeda, R.; Gonzalez-Jimenez, J. Automatic Multi-Sensor Extrinsic Calibration for Mobile Robots. IEEE Robot. Autom. Lett. 2019, 4, 2862–2869. [Google Scholar] [CrossRef]
Xia, B.; Zhou, J.; Kong, F.; You, Y.; Yang, J.; Lin, L. Enhancing 3D Object Detection Through Multi-Modal Fusion for Cooperative Perception. Alex. Eng. J. 2024, 104, 46–55. [Google Scholar] [CrossRef]
Bohus, D.; Rudnicky, A.I. The RavenClaw Dialog Management Framework: Architecture and Systems. Comput. Speech Lang. 2009, 23, 332–361. [Google Scholar] [CrossRef]
Colledanchise, M.; Ögren, P. How Behavior Trees Modularize Hybrid Control Systems and Generalize Sequential Behavior Compositions, the Subsumption Architecture, and Decision Trees. IEEE Trans. Robot. 2017, 33, 372–389. [Google Scholar] [CrossRef]
Young, S.; Gašić, M.; Thomson, B.; Williams, J.D. POMDP-Based Statistical Spoken Dialog Systems: A Review. Proc. IEEE 2013, 101, 1160–1179. [Google Scholar] [CrossRef]
Gal, Y.; Ghahramani, Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), New York, NY, USA, 19–24 June 2016; PMLR: Red Hook, NY, USA, 2016; Volume 48, pp. 1050–1059. Available online: https://dl.acm.org/doi/10.5555/3045390.3045502 (accessed on 5 December 2025).
Beetz, M.; Tenorth, M.; Jain, D.; Bandouch, J. Towards Automated Models of Activities of Daily Life. Technol. Disabil. 2010, 22, 27–40. [Google Scholar] [CrossRef]
Levine, S.; Finn, C.; Darrell, T.; Abbeel, P. End-to-End Training of Deep Visuomotor Policies. J. Mach. Learn. Res. 2016, 17, 1334–1373. Available online: https://dl.acm.org/doi/10.5555/2946645.2946684 (accessed on 5 December 2025).
Duan, Y.; Chen, X.; Houthooft, R.; Schulman, J.; Abbeel, P. Benchmarking Deep Reinforcement Learning for Continuous Control. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), New York, NY, USA, 20–22 June 2016; JMLR.org: Red Hook, NY, USA, 2016; pp. 1329–1338. Available online: https://dl.acm.org/doi/10.5555/3045390.3045531 (accessed on 5 December 2025).
Kehoe, B.; Patil, S.; Abbeel, P.; Goldberg, K. A Survey of Research on Cloud Robotics and Automation. IEEE Trans. Autom. Sci. Eng. 2015, 12, 398–409. [Google Scholar] [CrossRef]
Wang, X.; Tang, Z.; Guo, J.; Meng, T.; Wang, C.; Wang, T.; Jia, W. Empowering Edge Intelligence: A Comprehensive Survey on On-Device AI Models. ACM Comput. Surv. 2025, 57, 228. [Google Scholar] [CrossRef]
Savva, M.; Kadian, A.; Maksymets, O.; Zhao, Y.; Wijmans, E.; Jain, B.; Straub, J.; Liu, J.; Koltun, V.; Malik, J.; et al. Habitat: A Platform for Embodied AI Research. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2019), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 9338–9346. [Google Scholar] [CrossRef]
Hu, Y.; Xie, Q.; Jain, V.; Francis, J.; Patrikar, J.; Keetha, N.; Kim, S.; Xie, Y.; Zhang, T.; Fang, H.-S.; et al. Toward General-Purpose Robots Via Foundation Models: A Survey and Meta-Analysis. arXiv 2024, arXiv:2312.08782. [Google Scholar] [CrossRef]
Zhou, H.; Yao, X.; Mees, O.; Meng, Y.; Xiao, T.; Bisk, Y.; Oh, J.; Johns, E.; Shridhar, M.; Shah, D.; et al. Bridging Language and Action: A Survey of Language-Conditioned Robot Manipulation. arXiv 2024, arXiv:2312.10807. [Google Scholar] [CrossRef]
Firoozi, R.; Tucker, J.; Tian, S.; Majumdar, A.; Sun, J.; Liu, W.; Zhu, Y.; Song, S.; Kapoor, A.; Hausman, K.; et al. Foundation Models in Robotics: Applications, Challenges, and the Future. Int. J. Robot. Res. 2024, 44, 701–739. [Google Scholar] [CrossRef]
Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An Updated Guideline for Reporting Systematic Reviews. BMJ 2021, 372, n71. [Google Scholar] [CrossRef]
Priem, J.; Piwowar, H.; Orr, R. OpenAlex: A Fully-Open Index of Scholarly Works, Authors, Venues, Institutions, and Concepts. arXiv 2022, arXiv:2205.01833. [Google Scholar] [CrossRef]
Görner, M.; Haschke, R.; Ritter, H.; Zhang, J. MoveIt! Task Constructor for Task-Level Motion Planning. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA 2019), Montreal, QC, Canada, 20–24 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 190–196. [Google Scholar] [CrossRef]
Tellex, S.; Kollar, T.; Dickerson, S.; Walter, M.R.; Banerjee, A.G.; Teller, S.; Roy, N. Approaching the Symbol Grounding Problem with Probabilistic Graphical Models. AI Mag. 2011, 32, 64–76. [Google Scholar] [CrossRef]
Williams, T.; Cantrell, R.; Briggs, G.; Schermerhorn, P.; Scheutz, M. Grounding Natural Language References to Unvisited and Hypothetical Locations. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Bellevue, WA, USA, 14–18 July 2013; pp. 947–953. [Google Scholar] [CrossRef]
Kollar, T.; Tellex, S.; Roy, D.; Roy, N. Toward Understanding Natural Language Directions. In Proceedings of the 5th ACM/IEEE International Conference on Human-Robot Interaction (HRI), Osaka, Japan, 2–5 March 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 259–266. [Google Scholar] [CrossRef]
Walter, M.R.; Patki, S.; Daniele, A.F.; Fahnestock, E.; Duvallet, F.; Hemachandra, S.; Oh, J.; Stentz, A.; Roy, N.; Howard, T.M. Language Understanding for Field and Service Robots in A Priori Unknown Environments. Field Robot. 2022, 2, 1191–1231. [Google Scholar] [CrossRef]
Paul, R.; Arkin, J.; Roy, N.; Howard, T.M. Grounding Abstract Spatial Concepts for Language Interaction with Robots. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI 2017), Melbourne, Australia, 19–25 August 2017; IJCAI: California, CA, USA, 2017; pp. 4929–4933. [Google Scholar] [CrossRef]
Kaelbling, L.P.; Lozano-Pérez, T. Hierarchical Task and Motion Planning in the Now. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA 2011), Shanghai, China, 9–13 May 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 1470–1477. [Google Scholar] [CrossRef]
Mohr, F.; Lettmann, T.; Hüllermeier, E. Planning with Independent Task Networks. In KI 2017: Advances in Artificial Intelligence; Kern-Isberner, G., Fürnkranz, J., Thimm, M., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2017; Volume 10505, pp. 193–206. [Google Scholar] [CrossRef]
Hoffmann, J. FF: The Fast-Forward Planning System. AI Mag. 2001, 22, 57–62. [Google Scholar] [CrossRef]
Helmert, M. The Fast Downward Planning System. J. Artif. Intell. Res. 2006, 26, 191–246. [Google Scholar] [CrossRef]
Dornhege, C.; Eyerich, P.; Keller, T.; Trüg, S.; Brenner, M.; Nebel, B. Semantic Attachments for Domain-Independent Planning Systems. In Towards Service Robots for Everyday Environments; Prassler, E., Zöllner, M., Bischoff, R., Burgard, W., Haschke, R., Hägele, M., Lawitzky, G., Nebel, B., Plöger, P., Reiser, U., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; pp. 99–115. [Google Scholar] [CrossRef]
Mao, W.; Desai, R.; Iuzzolino, M.L.; Kamra, N. Action Dynamics Task Graphs for Learning Plannable Representations of Procedural Tasks. arXiv 2023, arXiv:2302.05330. [Google Scholar] [CrossRef]
Bacon, P.-L.; Harb, J.; Precup, D. The Option-Critic Architecture. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI 2017), San Francisco, CA, USA, 4–9 February 2017; AAAI Press: Palo Alto, CA, USA, 2017; pp. 1726–1734. Available online: https://dl.acm.org/doi/10.5555/3298483.3298491 (accessed on 5 December 2025).
Kurniawati, H.; Du, Y.; Hsu, D.; Lee, W.S. Motion planning under uncertainty for robotic tasks with long time horizons. Int. J. Robot. Res. 2011, 30, 308–323. [Google Scholar] [CrossRef]
Lauri, M.; Hsu, D.; Pajarinen, J. Partially Observable Markov Decision Processes in Robotics: A Survey. IEEE Trans. Robot. 2023, 39, 21–40. [Google Scholar] [CrossRef]
Silver, D.; Veness, J. Monte-Carlo Planning in Large POMDPs. In Proceedings of the 24th International Conference on Neural Information Processing Systems (NIPS 2010), Vancouver, BC, Canada, 6–9 December 2010; Curran Associates, Inc.: Red Hook, NY, USA, 2010; pp. 2164–2172. Available online: https://dl.acm.org/doi/10.5555/2997046.2997137 (accessed on 5 December 2025).
Beetz, M. Structured Reactive Controllers: Controlling Robots That Perform Everyday Activity. In Proceedings of the 3rd Annual Conference on Autonomous Agents (AGENTS ’99), Seattle, WA, USA, 1–5 May 1999; Association for Computing Machinery: New York, NY, USA, 1999; pp. 228–235. [Google Scholar] [CrossRef]
Gat, E. Three-Layer Architectures. In Artificial Intelligence and Mobile Robots: Case Studies of Successful Robot Systems; Kortenkamp, D., Bonasso, R.P., Murphy, R., Eds.; MIT Press: Cambridge, MA, USA, 1998; pp. 195–210. Available online: https://dl.acm.org/doi/10.5555/292092.292130 (accessed on 5 December 2025).
Cashmore, M.; Fox, M.; Long, D.; Magazzeni, D.; Ridder, B.; Carrera, A.; Palomeras, N.; Hurtos, N.; Carreras, M. ROSPlan: Planning in the Robot Operating System. In Proceedings of the 25th International Conference on Automated Planning and Scheduling (ICAPS 2015), Jerusalem, Israel, 7–11 June 2015; AAAI Press: Palo Alto, CA, USA, 2015; pp. 333–341. [Google Scholar] [CrossRef]
Stentz, A. Optimal and Efficient Path Planning for Partially-Known Environments. In Proceedings of the 1994 IEEE International Conference on Robotics and Automation (ICRA), San Diego, CA, USA, 8–13 May 1994; pp. 3310–3317. [Google Scholar] [CrossRef]
Koenig, S.; Likhachev, M. D* Lite. In Proceedings of the 18th National Conference on Artificial Intelligence (AAAI 2002), Edmonton, AB, Canada, 28 July–1 August 2002; AAAI Press: Palo Alto, CA, USA, 2002; pp. 476–483. Available online: https://cdn.aaai.org/AAAI/2002/AAAI02-072.pdf (accessed on 5 December 2025).
Kaelbling, L.P.; Lozano-Pérez, T. Integrated Task and Motion Planning in Belief Space. Int. J. Robot. Res. 2013, 32, 1194–1227. [Google Scholar] [CrossRef]
Kaelbling, L.P.; Littman, M.L.; Cassandra, A.R. Planning and Acting in Partially Observable Stochastic Domains. Artif. Intell. 1998, 101, 99–134. [Google Scholar] [CrossRef]
Ramachandram, D.; Taylor, G.W. Deep Multimodal Learning: A Survey on Recent Advances and Trends. IEEE Signal Process. Mag. 2017, 34, 96–108. [Google Scholar] [CrossRef]
Li, S.; Tang, H. Multimodal Alignment and Fusion: A Survey. arXiv 2024, arXiv:2411.17040. [Google Scholar] [CrossRef]
Wang, T.; Zheng, P.; Li, S.; Wang, L. Multimodal Human–Robot Interaction for Human-Centric Smart Manufacturing: A Survey. Adv. Intell. Syst. 2024, 6, 2300359. [Google Scholar] [CrossRef]
Mora, A.; Prados, A.; Mendez, A.; Barber, R.; Garrido, S. Sensor Fusion for Social Navigation on a Mobile Robot Based on Fast Marching Square and Gaussian Mixture Model. Sensors 2022, 22, 8728. [Google Scholar] [CrossRef] [PubMed]
Zuo, X.; Geneva, P.; Lee, W.; Liu, Y.; Huang, G. LIC-Fusion: Lidar-Inertial-Camera Odometry. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2019), Macau, China, 3–8 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 5848–5854. [Google Scholar] [CrossRef]
Geneva, P.; Eckenhoff, K.; Lee, W.; Yang, Y.; Huang, G. OpenVINS: A research platform for visual-inertial estimation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA 2020), Paris, France, 31 May–31 August 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 4666–4672. [Google Scholar] [CrossRef]
Rehder, J.; Siegwart, R.; Furgale, P. A General Approach to Spatiotemporal Calibration in Multisensor Systems. IEEE Trans. Robot. 2016, 32, 383–398. [Google Scholar] [CrossRef]
Cadena, C.; Carlone, L.; Carrillo, H.; Latif, Y.; Scaramuzza, D.; Neira, J.; Reid, I.; Leonard, J.J. Past, Present, and Future of Simultaneous Localization and Mapping: Toward the Robust-Perception Age. IEEE Trans. Robot. 2016, 32, 1309–1332. [Google Scholar] [CrossRef]
Mur-Artal, R.; Tardós, J.D. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras. IEEE Trans. Robot. 2017, 33, 1255–1262. [Google Scholar] [CrossRef]
Engel, J.; Schöps, T.; Cremers, D. LSD-SLAM: Large-scale direct monocular SLAM. In Computer Vision—ECCV 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer International Publishing: Cham, Switzerland, 2014; pp. 834–849. [Google Scholar] [CrossRef]
Forster, C.; Carlone, L.; Dellaert, F.; Scaramuzza, D. On-Manifold Preintegration for Real-Time Visual–Inertial Odometry. IEEE Trans. Robot. 2017, 33, 1–21. [Google Scholar] [CrossRef]
Lupton, T.; Sukkarieh, S. Visual-Inertial-Aided Navigation for High-Dynamic Motion in Built Environments Without Initial Conditions. IEEE Trans. Robot. 2012, 28, 61–76. [Google Scholar] [CrossRef]
Gawlikowski, J.; Tassi, C.R.N.; Ali, M.; Lee, J.; Humt, M.; Feng, J.; Kruspe, A.; Triebel, R.; Jung, P.; Roscher, R.; et al. A Survey of Uncertainty in Deep Neural Networks. Artif. Intell. Rev. 2023, 56, 1513–1589. [Google Scholar] [CrossRef]
Jung, J.H.; Choe, Y.; Park, C.G. Photometric Visual-Inertial Navigation with Uncertainty-Aware Ensembles. IEEE Trans. Robot. 2022, 38, 2039–2052. [Google Scholar] [CrossRef]
Sünderhauf, N.; Brock, O.; Scheirer, W.; Hadsell, R.; Fox, D.; Leitner, J.; Upcroft, B.; Abbeel, P.; Burgard, W.; Milford, M.; et al. The Limits and Potentials of Deep Learning for Robotics. Int. J. Robot. Res. 2018, 37, 405–420. [Google Scholar] [CrossRef]
Maddern, W.; Pascoe, G.; Linegar, C.; Newman, P. 1 Year, 1000 Km: The Oxford RobotCar Dataset. Int. J. Robot. Res. 2017, 36, 3–15. [Google Scholar] [CrossRef]
Tremblay, J.; Prakash, A.; Acuna, D.; Brophy, M.; Jampani, V.; Anil, C.; To, T.; Cameracci, E.; Boochoon, S.; Birchfield, S. Training Deep Networks with Synthetic Data: Bridging the Reality Gap by Domain Randomization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2018), Salt Lake City, UT, USA, 18–22 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1082–10828. [Google Scholar] [CrossRef]
Sofman, B.; Lin, E.; Bagnell, J.A.; Cole, J.; Vandapel, N.; Stentz, A. Improving robot navigation through self-supervised online learning. J. Field Robot. 2006, 23, 1059–1075. [Google Scholar] [CrossRef]
Porav, H.; Maddern, W.; Newman, P. Adversarial Training for Adverse Conditions: Robust Metric Localization Using Appearance Transfer. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA 2018), Brisbane, QLD, Australia, 21–25 May 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1011–1018. [Google Scholar] [CrossRef]
Lakshminarayanan, B.; Pritzel, A.; Blundell, C. Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 6405–6416. Available online: https://dl.acm.org/doi/10.5555/3295222.3295387 (accessed on 5 December 2025).
Mohanan, M.G.; Salgoankar, A. A Survey of Robotic Motion Planning in Dynamic Environments. Robot. Auton. Syst. 2018, 100, 171–185. [Google Scholar] [CrossRef]
Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), Sydney, NSW, Australia, 6–11 August 2017; PMLR: Red Hook, NY, USA, 2017; Volume 70, pp. 1321–1330. Available online: https://dl.acm.org/doi/10.5555/3305381.3305518 (accessed on 5 December 2025).
Simon, D. Kalman Filtering with State Constraints: A Survey of Linear and Nonlinear Algorithms. IET Control Theory Appl. 2010, 4, 1303–1318. [Google Scholar] [CrossRef]
Khodarahmi, M.; Maihami, V. A Review on Kalman Filter Models. Arch. Comput. Methods Eng. 2023, 30, 727–747. [Google Scholar] [CrossRef]
Kurniawati, H.; Hsu, D.; Lee, W.S. SARSOP: Efficient Point-Based POMDP Planning by Approximating Optimally Reachable Belief Spaces. In Proceedings of the Robotics: Science and Systems IV (RSS 2008), Zurich, Switzerland, 25–28 June 2008. [Google Scholar] [CrossRef]
Wan, E.A.; Van Der Merwe, R. The Unscented Kalman Filter for Nonlinear Estimation. In Proceedings of the IEEE Adaptive Systems for Signal Processing, Communications, and Control Symposium, Lake Louise, AB, Canada, 1–4 October 2000; IEEE: Piscataway, NJ, USA, 2000; pp. 153–158. [Google Scholar] [CrossRef]
Van Der Merwe, R.; Wan, E.A. The Square-Root Unscented Kalman Filter for State and Parameter Estimation. In Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2001), Salt Lake City, UT, USA, 7–11 May 2001; IEEE: Piscataway, NJ, USA, 2001; Volume 6, pp. 3461–3464. [Google Scholar] [CrossRef]
Brouk, J.D.; DeMars, K.J. Kalman Filtering with Uncertain and Asynchronous Measurement Epochs. Navig. J. Inst. Navig. 2024, 71, navi.652. [Google Scholar] [CrossRef]
Naab, C.; Zheng, Z. Application of the Unscented Kalman Filter in Position Estimation a Case Study on a Robot for Precise Positioning. Robot. Auton. Syst. 2022, 147, 103904. [Google Scholar] [CrossRef]
Platt, R.; Tedrake, R.; Kaelbling, L.P.; Lozano-Pérez, T. Belief space planning assuming maximum likelihood observations. In Proceedings of the Robotics: Science and Systems (RSS), Zaragoza, Spain, 27–30 June 2010. [Google Scholar] [CrossRef]
Leusmann, J.; Wang, C.; Gienger, M.; Schmidt, A.; Mayer, S. Understanding the Uncertainty Loop of Human–Robot Interaction. arXiv 2023, arXiv:2303.07889. [Google Scholar] [CrossRef]
Cumbal, R.; Lopes, J.; Engwall, O. Uncertainty in Robot Assisted Second Language Conversation Practice. In Proceedings of the Companion of the 2020 ACM/IEEE International Conference on Human-Robot Interaction (HRI ’20), Cambridge, UK, 23–26 March 2020; ACM: New York, NY, USA, 2020; pp. 171–173. [Google Scholar] [CrossRef]
Hough, J.; Schlangen, D. It’s Not What You Do, It’s How You Do It: Grounding Uncertainty for a Simple Robot. In Proceedings of the 12th ACM/IEEE International Conference on Human-Robot Interaction (HRI 2017), Vienna, Austria, 6–9 March 2017; IEEE: Piscataway, NJ, USA; ACM: New York, NY, USA, 2017; pp. 274–282. [Google Scholar] [CrossRef]
Trick, S.; Koert, D.; Peters, J.; Rothkopf, C.A. Multimodal Uncertainty Reduction for Intention Recognition in Human–Robot Interaction. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 7009–7016. [Google Scholar] [CrossRef]
Leidner, J.L.; Lieberman, M.D. Detecting Geographical References in the Form of Place Names and Associated Spatial Natural Language. SIGSPATIAL Spec. 2011, 3, 5–11. [Google Scholar] [CrossRef]
Tellex, S.; Knepper, R.; Li, A.; Rus, D.; Roy, N. Asking for Help Using Inverse Semantics. In Proceedings of the Robotics: Science and Systems (RSS 2014), Berkeley, CA, USA, 12–16 July 2014. [Google Scholar] [CrossRef]
Dragan, A.D.; Lee, K.C.T.; Srinivasa, S.S. Legibility and Predictability of Robot Motion. In Proceedings of the 8th ACM/IEEE International Conference on Human–Robot Interaction (HRI 2013), Tokyo, Japan, 3–6 March 2013; pp. 301–308. [Google Scholar] [CrossRef]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 770–778. [Google Scholar] [CrossRef]
Qi, X.; Liao, R.; Jia, J.; Fidler, S.; Urtasun, R. 3D Graph Neural Networks For RGB-D Semantic Segmentation. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 5209–5218. [Google Scholar] [CrossRef]
Zhou, T.; Porikli, F.; Crandall, D.J.; Van Gool, L.; Wang, W. A Survey on Deep Learning Techniques for Video Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 7099–7122. [Google Scholar] [CrossRef]
Quinlan, S.; Khatib, O. Elastic bands: Connecting path planning and control. In Proceedings of the 1993 IEEE International Conference on Robotics and Automation (ICRA), Atlanta, GA, USA, 2–6 May 1993; IEEE: Piscataway, NJ, USA, 1993; Volume 2, pp. 802–807. [Google Scholar] [CrossRef]
Fox, D.; Burgard, W.; Thrun, S. The Dynamic Window Approach to Collision Avoidance. IEEE Robot. Autom. Mag. 1997, 4, 23–33. [Google Scholar] [CrossRef]
Zhu, Y.; Gordon, D.; Kolve, E.; Fox, D.; Fei-Fei, L.; Gupta, A.; Mottaghi, R.; Farhadi, A. Visual Semantic Planning Using Deep Successor Representations. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 483–492. [Google Scholar] [CrossRef]
Zhang, H.-B.; Zhang, Y.-X.; Zhong, B.; Lei, Q.; Yang, L.; Du, J.-X.; Chen, D.-S. A Comprehensive Survey of Vision-Based Human Action Recognition Methods. Sensors 2019, 19, 1005. [Google Scholar] [CrossRef]
Pütz, S.; Simón, J.S.; Hertzberg, J. Move Base Flex: A Highly Flexible Navigation Framework for Mobile Robots. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 3416–3421. [Google Scholar] [CrossRef]
Arkin, R.C. Behavior-Based Robotics; MIT Press: Cambridge, MA, USA, 1998; ISBN 978-0-262-01165-5. Available online: https://books.google.ca/books/about/Behavior_Based_Robotics.html?id=mRWT6alZt9oC&redir_esc=y (accessed on 5 December 2025).
Noroozi, F.; Daneshmand, M.; Fiorini, P. Conventional, Heuristic and Learning-Based Robot Motion Planning: Reviewing Frameworks of Current Practical Significance. Machines 2023, 11, 722. [Google Scholar] [CrossRef]
Pandey, A.; Pandey, S.; Parhi, D.R. Mobile Robot Navigation and Obstacle Avoidance Techniques: A Review. Int. Robot. Autom. J. 2017, 2, 96–105. [Google Scholar] [CrossRef]
Lewis, F.L.; Ge, S.S. Autonomous Mobile Robots: Sensing, Control, Decision Making and Applications; CRC Press: Boca Raton, FL, USA, 2018; ISBN 978-1-351-83711-8. [Google Scholar] [CrossRef]
Guo, X.; Lyu, M.; Xia, B.; Zhang, K.; Zhang, L. An Improved Visual SLAM Method with Adaptive Feature Extraction. Appl. Sci. 2023, 13, 10038. [Google Scholar] [CrossRef]
Kurz, G.; Holoch, M.; Biber, P. Geometry-Based Graph Pruning for Lifelong SLAM. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 3313–3320. [Google Scholar] [CrossRef]
Whelan, T.; Salas-Moreno, R.F.; Glocker, B.; Davison, A.J.; Leutenegger, S. ElasticFusion: Real-Time Dense SLAM and Light Source Estimation. Int. J. Robot. Res. 2016, 35, 1697–1716. [Google Scholar] [CrossRef]
Murali, A.; Liu, W.; Marino, K.; Chernova, S.; Gupta, A. Same Object, Different Grasps: Data and Semantic Knowledge for Task-Oriented Grasping. In Proceedings of the 2020 Conference on Robot Learning (CoRL), PMLR, Virtual Event, 16–18 November 2020; pp. 1540–1557. Available online: https://proceedings.mlr.press/v155/murali21a.html (accessed on 5 December 2025).
Mayya, S.; Ramachandran, R.K.; Zhou, L.; Senthil, V.; Thakur, D.; Sukhatme, G.S.; Kumar, V. Adaptive and Risk-Aware Target Tracking for Robot Teams with Heterogeneous Sensors. IEEE Robot. Autom. Lett. 2022, 7, 5615–5622. [Google Scholar] [CrossRef]
Wang, J.; Lin, S.; Liu, A. Bioinspired Perception and Navigation of Service Robots in Indoor Environments: A Review. Biomimetics 2023, 8, 350. [Google Scholar] [CrossRef]
OpenAI; Hurst, A.; Lerer, A.; Goucher, A.P.; Perelman, A.; Ramesh, A.; Clark, A.; Ostrow, A.J.; Welihinda, A.; Hayes, A.; et al. GPT-4o System Card. arXiv 2024, arXiv:2410.21276. [Google Scholar] [CrossRef]
Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Minneapolis, MN, USA, 2–7 June 2019; Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models from Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtually, 1 July 2021; PMLR: Red Hook, NY, USA, 2021; pp. 8748–8763. Available online: https://proceedings.mlr.press/v139/radford21a.html (accessed on 5 December 2025).
Touvron, H.; Cord, M.; Jégou, H. DeiT III: Revenge of the ViT. In Computer Vision—ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part XXIV; Springer: Berlin/Heidelberg, Germany, 2022; pp. 516–533. [Google Scholar] [CrossRef]
OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar] [CrossRef]
Guo, D.; Yang, D.; Zhang, H.; Song, J.; Wang, P.; Zhu, Q.; Xu, R.; Zhang, R.; Ma, S.; Bi, X.; et al. DeepSeek-R1 Incentivizes Reasoning in LLMs through Reinforcement Learning. Nature 2025, 645, 633–638. [Google Scholar] [CrossRef] [PubMed]
Yang, J.; Tan, R.; Wu, Q.; Zheng, R.; Peng, B.; Liang, Y. Magma: A Foundation Model for Multimodal AI Agents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA, 10–17 June 2025; IEEE: Piscataway, NJ, USA, 2025. [Google Scholar] [CrossRef]
Kim, M.J.; Pertsch, K.; Karamcheti, S.; Xiao, T.; Balakrishna, A.; Nair, S.; Rafailov, R.; Foster, E.P.; Sanketi, P.R.; Vuong, Q.; et al. OpenVLA: An Open-Source Vision-Language-Action Model. In Proceedings of the 8th Conference on Robot Learning (CoRL 2024), Munich, Germany, 6–9 November 2024; Proceedings of Machine Learning Research; PMLR: Cambridge, MA, USA, 2025; Volume 270, pp. 2679–2713. Available online: https://proceedings.mlr.press/v270/kim25c.html (accessed on 5 December 2025).
Zhu, H.; Wang, Y.; Zhou, J.; Chang, W.; Zhou, Y.; Li, Z.; Chen, J.; Shen, C.; Pang, J.; He, T. Aether: Geometric-Aware Unified World Modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2025), Honolulu, HI, USA, 19–23 October 2025; pp. 8535–8546. Available online: https://openaccess.thecvf.com/content/ICCV2025/papers/Zhu_Aether_Geometric-Aware_Unified_World_Modeling_ICCV_2025_paper.pdf (accessed on 5 December 2025).
Black, K.; Brown, N.; Driess, D.; Esmail, A.; Equi, M.R.; Finn, C.; Fusai, N.; Groom, L.; Hausman, K.; Ichter, B.; et al. π0: A Vision-Language-Action Flow Model for General Robot Control. In Proceedings of the Robotics: Science and Systems (RSS 2025), Los Angeles, CA, USA, 21–25 June 2025. [Google Scholar] [CrossRef]
Black, K.; Brown, N.; Darpinian, J.; Dhabalia, K.; Driess, D.; Esmail, A.; Equi, M.R.; Finn, C.; Fusai, N.; Galliker, M.Y.; et al. π0.5: A Vision–Language–Action Model with Open-World Generalization. In Proceedings of the 9th Conference on Robot Learning (CoRL 2025), Seoul, Republic of Korea, 27–30 September 2025; Proceedings of Machine Learning Research; Lim, J., Song, S., Park, H.-W., Eds.; PMLR: Red Hook, NY, USA, 2025; Volume 305, pp. 17–40. Available online: https://proceedings.mlr.press/v305/black25a.html (accessed on 5 December 2025).
Szot, A.; Clegg, A.; Undersander, E.; Wijmans, E.; Zhao, Y.; Turner, J.; Maestre, N.; Mukadam, M.; Chaplot, D.S.; Maksymets, O.; et al. Habitat 2.0: Training Home Assistants to Rearrange Their Habitat. In Proceedings of the 35th International Conference on Neural Information Processing Systems (NeurIPS 2021), Online, 6–14 December 2021; Curran Associates, Inc.: Red Hook, NY, USA, 2021; pp. 251–266. Available online: https://dl.acm.org/doi/10.5555/3540261.3540281 (accessed on 5 December 2025).
Shu, D.; Zhao, H.; Hu, J.; Liu, W.; Payani, A.; Cheng, L.; Du, M. Large Vision–Language Model Alignment and Misalignment: A Survey through the Lens of Explainability. In Findings of the Association for Computational Linguistics: EMNLP 2025; Christodoulopoulos, C., Chakraborty, T., Rose, C., Peng, V., Eds.; Association for Computational Linguistics: Suzhou, China, 2025; pp. 1713–1735. [Google Scholar] [CrossRef]
Chen, Z.; Wang, W.; Tian, H.; Ye, S.; Gao, Z.; Cui, E.; Tong, W.; Hu, K.; Luo, J.; Ma, Z.; et al. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites. Sci. China Inf. Sci. 2024, 67, 220101. [Google Scholar] [CrossRef]
NVIDIA; Agarwal, N.; Ali, A.; Bala, M.; Balaji, Y.; Barker, E.; Cai, T.; Chattopadhyay, P.; Chen, Y.; Cui, Y.; et al. Cosmos World Foundation Model Platform for Physical AI. arXiv 2025, arXiv:2501.03575. [Google Scholar] [CrossRef]
Ye, Y.; Huang, Z.; Xiao, Y.; Chern, E.; Xia, S.; Liu, P. LIMO: Less Is More for Reasoning. arXiv 2025, arXiv:2502.03387. [Google Scholar] [CrossRef]
Dong, X.; Zhang, P.; Zang, Y.; Cao, Y.; Wang, B.; Ouyang, L.; Zhang, S.; Duan, H.; Zhang, W.; Li, Y.; et al. InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD. In Proceedings of the 38th International Conference on Neural Information Processing Systems (NeurIPS 2024), Vancouver, BC, Canada, 10–15 December 2024; Curran Associates, Inc.: Red Hook, NY, USA, 2024; pp. 42566–42592. Available online: https://dl.acm.org/doi/10.5555/3737916.3739264 (accessed on 5 December 2025).
Gu, Q.; Kuwajerwala, A.; Morin, S.; Jatavallabhula, K.M.; Sen, B.; Agarwal, A.; Rivera, C.; Paul, W.; Ellis, K.; Chellappa, R. ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 5021–5028. [Google Scholar] [CrossRef]
Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1–67. Available online: https://dl.acm.org/doi/10.5555/3455716.3455856 (accessed on 5 December 2025).
Yang, Z.; Li, L.; Lin, K.; Wang, J.; Lin, C.-C.; Liu, Z.; Wang, L. The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision). arXiv 2023, arXiv:2309.17421. [Google Scholar] [CrossRef]
Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y.; et al. Segment Anything. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 4–6 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 3992–4003. [Google Scholar] [CrossRef]
Défossez, A.; Mazaré, L.; Orsini, M.; Royer, A.; Pérez, P.; Jégou, H.; Grave, E.; Zeghidour, N. Moshi: A Speech-Text Foundation Model for Real-Time Dialogue. arXiv 2024, arXiv:2410.00037. [Google Scholar] [CrossRef]
Zuo, G.; Tong, J.; Liu, H.; Chen, W.; Li, J. Graph-Based Visual Manipulation Relationship Reasoning in Object-Stacking Scenes. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–8. [Google Scholar] [CrossRef]
Chen, X.; Djolonga, J.; Padlewski, P.; Mustafa, B.; Changpinyo, S.; Wu, J. On Scaling Up a Multilingual Vision and Language Model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024), Seattle, WA, USA, 16–22 June 2024. [Google Scholar] [CrossRef]
Liang, J.; Huang, W.; Xia, F.; Xu, P.; Hausman, K.; Ichter, B.; Florence, P.; Zeng, A. Code as Policies: Language Model Programs for Embodied Control. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 9493–9500. [Google Scholar] [CrossRef]
Song, C.H.; Sadler, B.M.; Wu, J.; Chao, W.-L.; Washington, C.; Su, Y. LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 4–6 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 2986–2997. [Google Scholar] [CrossRef]
Muennighoff, N.; Yang, Z.; Shi, W.; Li, X.L.; Li, F.-F.; Hajishirzi, H.; Zettlemoyer, L.; Liang, P.; Candes, E.; Hashimoto, T. s1: Simple Test-Time Scaling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025), Suzhou, China, 4–9 November 2025; pp. 20275–20321. [Google Scholar] [CrossRef]
Cambon, S.; Alami, R.; Gravot, F. A Hybrid Approach to Intricate Motion, Manipulation and Task Planning. Int. J. Robot. Res. 2009, 28, 104–126. [Google Scholar] [CrossRef]
Zhou, X.; Gandhi, S.; Fan, L.; Lin, Z.; Du, Y.; Abbeel, P.; Wu, J.; Xia, F. GENESIS: A Generative and Universal Physics Engine for Robotics and Beyond. Available online: https://genesis-embodied-ai.github.io/ (accessed on 5 December 2025).
Ye, J.; Chen, X.; Xu, N.; Zu, C.; Shao, Z.; Liu, S.; Cui, Y.; Zhou, Z.; Gong, C.; Shen, Y.; et al. A Comprehensive Capability Analysis of GPT-3 and GPT-3.5 Series Models. arXiv 2023, arXiv:2303.10420. [Google Scholar] [CrossRef]
Ravi, N.; Gabeur, V.; Hu, Y.-T.; Hu, R.; Ryali, C.; Ma, T.; Khedr, H.; Rädle, R.; Rolland, C.; Gustafson, L.; et al. SAM 2: Segment Anything in Images and Videos. In Proceedings of the International Conference on Learning Representations (ICLR 2025), Singapore, 24–28 April 2025; Yue, Y., Garg, A., Peng, N., Sha, F., Yu, R., Eds.; ICLR: Appleton, WI, USA, 2025; pp. 28085–28128. Available online: https://proceedings.iclr.cc/paper_files/paper/2025/file/45c1f6a8cbf2da59ebf2c802b4f742cd-Paper-Conference.pdf (accessed on 5 December 2025).
Jaegle, A.; Borgeaud, S.; Alayrac, J.-B.; Doersch, C.; Ionescu, C.; Ding, D.; Koppula, S.; Zoran, D.; Brock, A.; Shelhamer, E.; et al. Perceiver IO: A General Architecture for Structured Inputs and Outputs. arXiv 2021, arXiv:2107.14795. [Google Scholar] [CrossRef]
Shridhar, M.; Manuelli, L.; Fox, D. Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation. In Proceedings of the 6th Conference on Robot Learning (CoRL 2022), Auckland, New Zealand, 14–18 December 2022; PMLR: Red Hook, NY, USA, 2023; Volume 205, pp. 785–799. Available online: https://proceedings.mlr.press/v205/shridhar23a.html (accessed on 5 December 2025).
Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. The Llama 3 Herd of Models. arXiv 2024, arXiv:2407.21783. [Google Scholar] [CrossRef]
Song, X.; Chen, W.; Liu, Y.; Chen, W.; Li, G.; Lin, L. Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025), Nashville, TN, USA, 10–17 June 2025. [Google Scholar] [CrossRef]
Abdin, M.; Jacobs, S.A.; Awan, A.A.; Aneja, J.; Awadallah, A.; Awadalla, H.; Bach, N.; Bahree, A.; Bakhtiari, A.; Behl, H.; et al. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv 2024, arXiv:2404.14219. [Google Scholar] [CrossRef]
Dai, Y.; Peng, R.; Li, S.; Chai, J. Think, Act, and Ask: Open-World Interactive Personalized Robot Navigation. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA 2024), Yokohama, Japan, 13–17 May 2024; IEEE: Piscataway, NJ, USA, 2024. [Google Scholar] [CrossRef]
Joublin, F.; Ceravola, A.; Smirnov, P.; Ocker, F.; Deigmoeller, J.; Belardinelli, A.; Wang, C.; Hasler, S.; Tanneberg, D.; Gienger, M. CoPAL: Corrective Planning of Robot Actions with Large Language Models. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA 2024), Yokohama, Japan, 13–17 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 8664–8670. [Google Scholar] [CrossRef]
Mei, A.; Zhu, G.-N.; Zhang, H.; Gan, Z. ReplanVLM: Replanning Robotic Tasks with Visual Language Models. IEEE Robot. Autom. Lett. 2024, 9, 10201–10208. [Google Scholar] [CrossRef]
Li, J.; Wu, J.; Zhao, W.; Bai, S.; Bai, X. PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects. In Computer Vision—ECCV 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2024; Volume 15133, pp. 1–17. [Google Scholar] [CrossRef]
Diao, H.; Cui, Y.; Li, X.; Wang, Y.; Lu, H.; Wang, X. Unveiling Encoder-Free Vision-Language Models. In Proceedings of the 38th International Conference on Neural Information Processing Systems (NeurIPS 2024), Vancouver, BC, Canada, 10–15 December 2024; Curran Associates, Inc.: Red Hook, NY, USA, 2024; 23p, Available online: https://dl.acm.org/doi/10.5555/3737916.3739581 (accessed on 5 December 2025).
Chefer, H.; Singer, U.; Zohar, A.; Kirstain, Y.; Polyak, A.; Taigman, Y.; Wolf, L.; Sheynin, S. VideoJAM: Joint Appearance–Motion Representations for Enhanced Motion Generation in Video Models. In Proceedings of the 42nd International Conference on Machine Learning (ICML 2025), Vancouver, BC, Canada, 13–19 July 2025; PMLR: Red Hook, NY, USA, 2025; Volume 267, pp. 7595–7616. Available online: https://proceedings.mlr.press/v267/chefer25a.html (accessed on 5 December 2025).
Wang, Y.; Li, K.; Li, X.; Yu, J.; He, Y.; Wang, C.; Chen, G.; Pei, B.; Yan, Z.; Zheng, R.; et al. InternVideo2: Scaling Foundation Models for Multimodal Video Understanding. In Computer Vision—ECCV 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2024; Volume 15143. [Google Scholar] [CrossRef]
Chu, X.; Qiao, L.; Zhang, X.; Xu, S.; Wei, F.; Yang, Y.; Sun, X.; Hu, Y.; Lin, X.; Zhang, B.; et al. MobileVLM V2: Faster and Stronger Baseline for Vision-Language Models. arXiv 2024, arXiv:2402.03766. [Google Scholar] [CrossRef]
Cho, M.; Cao, Y.; Sun, J.; Zhang, Q.; Pavone, M.; Park, J.J.; Yang, H.; Mao, Z.M. Cocoon: Robust Multi-Modal Perception with Uncertainty-Aware Sensor Fusion. arXiv 2024, arXiv:2410.12592. [Google Scholar] [CrossRef]
Ren, T.; Chen, Y.; Jiang, Q.; Zeng, Z.; Xiong, Y.; Liu, W.; Ma, Z.; Shen, J.; Gao, Y.; Jiang, X.; et al. DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding. arXiv 2024, arXiv:2411.14347. [Google Scholar] [CrossRef]
Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning Robust Visual Features without Supervision. arXiv 2023, arXiv:2304.07193. [Google Scholar] [CrossRef]
Liu, H.; Li, C.; Li, Y.; Lee, Y.J. Improved Baselines with Visual Instruction Tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; IEEE: Piscataway, NJ, USA, 2024. [Google Scholar] [CrossRef]
Tong, S.; Brown, E.L., II; Wu, P.; Woo, S.; Iyer, A.J.; Akula, S.C.; Yang, S.; Yang, J.; Middepogu, M.; Wang, Z.; et al. Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs. In Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024), Vancouver, BC, Canada, 10–15 December 2024; Available online: https://dl.acm.org/doi/10.5555/3737916.3740687 (accessed on 5 December 2025).
Li, H.; Zhu, J.; Jiang, X.; Zhu, X.; Li, H.; Yuan, C.; Wang, X.; Qiao, Y.; Wang, X.; Wang, W.; et al. Uni-Perceiver V2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 2691–2700. [Google Scholar] [CrossRef]
Awadalla, A.; Gao, I.; Gardner, J.; Hessel, J.; Hanafy, Y.; Zhu, W.; Marathe, K.; Bitton, Y.; Gadre, S.; Sagawa, S.; et al. OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models. arXiv 2023, arXiv:2308.01390. [Google Scholar] [CrossRef]
Hafner, D.; Lillicrap, T.; Ba, J.; Norouzi, M. Dream to Control: Learning Behaviors by Latent Imagination. arXiv 2019, arXiv:1912.01603. [Google Scholar] [CrossRef]
Hafner, D.; Lillicrap, T.; Norouzi, M.; Ba, J. Mastering Atari with Discrete World Models. In Proceedings of the 9th International Conference on Learning Representations (ICLR 2021), Virtual Event, Austria, 3–7 May 2021; OpenReview Foundation: Amherst, MA, USA, 2021; Available online: https://openreview.net/forum?id=0oabwyZbOu (accessed on 5 December 2025).
Li, Q.; Lin, Y.; Luo, Q.; Yu, L. DreamerV3 for Traffic Signal Control: Hyperparameter Tuning and Performance. arXiv 2025, arXiv:2503.02279. [Google Scholar] [CrossRef]
Feichtenhofer, C.; Fan, H.; Li, Y.; He, K. Masked Autoencoders as Spatiotemporal Learners. In Proceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS 2022), New Orleans, LA, USA, 28 November–9 December 2022; Curran Associates, Inc.: Red Hook, NY, USA, 2022; pp. 2605–2617. Available online: https://dl.acm.org/doi/10.5555/3600270.3602875 (accessed on 5 December 2025).
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 9992–10002. [Google Scholar] [CrossRef]
Bar-Tal, O.; Chefer, H.; Tov, O.; Herrmann, C.; Paiss, R.; Zada, S.; Ephrat, A.; Hur, J.; Liu, G.; Raj, A.; et al. Lumiere: A Space-Time Diffusion Model for Video Generation. In SIGGRAPH Asia 2024 Conference Papers; Association for Computing Machinery: New York, NY, USA, 2024; pp. 1–11. [Google Scholar] [CrossRef]
Song, D.; Liang, J.; Payandeh, A.; Raj, A.H.; Xiao, X.; Manocha, D. VLM-Social-Nav: Socially Aware Robot Navigation Through Scoring Using Vision-Language Models. IEEE Robot. Autom. Lett. 2025, 10, 508–515. [Google Scholar] [CrossRef]
Narasimhan, S.; Tan, A.H.; Choi, D.; Nejat, G. OLiVia-Nav: An Online Lifelong Vision-Language Approach for Mobile Robot Social Navigation. In Proceedings of the 2025 IEEE International Conference on Robotics and Automation (ICRA 2025), Atlanta, GA, USA, 19–23 May 2025; IEEE: Piscataway, NJ, USA, 2025. [Google Scholar] [CrossRef]
Huang, Y.; Zhang, Q.; Yu, P.S.; Sun, L. TrustGPT: A Benchmark for Trustworthy and Responsible Large Language Models. arXiv 2023, arXiv:2306.11507. [Google Scholar] [CrossRef]
Phan, L.; Gatti, A.; Han, Z.; Li, N.; Hu, J.; Zhang, H.; Zhang, C.B.C.; Shaaban, M.; Ling, J.; Shi, S.; et al. Humanity’s Last Exam. arXiv 2025, arXiv:2501.14249. [Google Scholar] [CrossRef]
Tjomsland, J.; Kalkan, S.; Gunes, H. Mind Your Manners! A Dataset and a Continual Learning Approach for Assessing Social Appropriateness of Robot Actions. Front. Robot. AI 2022, 9, 669420. [Google Scholar] [CrossRef] [PubMed]
Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training Data-Efficient Image Transformers and Distillation through Attention. In Proceedings of the 38th International Conference on Machine Learning (ICML 2021), Virtual, 18–24 July 2021; PMLR: Red Hook, NY, USA, 2021; Volume 139, pp. 10347–10357. Available online: https://proceedings.mlr.press/v139/touvron21a.html (accessed on 5 December 2025).
Chu, X.; Su, J.; Zhang, B.; Shen, C. VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks. In Computer Vision—ECCV 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2024; Volume 15148, pp. 1–18. [Google Scholar] [CrossRef]
MiniMax; Li, A.; Gong, B.; Yang, B.; Shan, B.; Liu, C.; Zhu, C.; Zhang, C.; Guo, C.; Chen, D.; et al. MiniMax-01: Scaling Foundation Models with Lightning Attention. arXiv 2025, arXiv:2501.08313. [Google Scholar] [CrossRef]
Fedus, W.; Zoph, B.; Shazeer, N. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. J. Mach. Learn. Res. 2022, 23, 120. Available online: https://dl.acm.org/doi/abs/10.5555/3586589.3586709 (accessed on 5 December 2025).
Lepikhin, D.; Lee, H.; Xu, Y.; Chen, D.; Firat, O.; Huang, Y.; Krikun, M.; Shazeer, N.; Chen, Z. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. arXiv 2020, arXiv:2006.16668. [Google Scholar] [CrossRef]
Chen, W.; Huang, W.; Du, X.; Song, X.; Wang, Z.; Zhou, D. Auto-Scaling Vision Transformers without Training. arXiv 2022, arXiv:2202.11921. [Google Scholar] [CrossRef]
Qu, G.; Chen, Q.; Wei, W.; Lin, Z.; Chen, X.; Huang, K. Mobile Edge Intelligence for Large Language Models: A Contemporary Survey. IEEE Commun. Surv. Tutor. 2025, 27, 3820–3860. [Google Scholar] [CrossRef]
Ge, T.; Chen, S.-Q.; Wei, F. EdgeFormer: A Parameter-Efficient Transformer for On-Device Seq2seq Generation. arXiv 2022, arXiv:2202.07959. [Google Scholar] [CrossRef]
MarketsandMarkets. Service Robotics Market Size, Share and Trends. 2024. Available online: https://www.marketsandmarkets.com/Market-Reports/service-robotics-market-681.html (accessed on 5 December 2025).
Chiang, A.-H.; Trimi, S. Impacts of Service Robots on Service Quality. Serv. Bus. 2020, 14, 439–459. [Google Scholar] [CrossRef]
Liu, P.; Orru, Y.; Vakil, J.; Paxton, C.; Shafiullah, N.M.M.; Pinto, L. Demonstrating OK-Robot: What Really Matters in Integrating Open-Knowledge Models for Robotics. In Proceedings of the Robotics: Science and Systems (RSS 2024), Delft, The Netherlands, 15–19 July 2024. [Google Scholar] [CrossRef]
Ghosh, D.; Walke, H.R.; Pertsch, K.; Black, K.; Mees, O.; Dasari, S.; Hejna, J.; Kreiman, T.; Xu, C.; Luo, J.; et al. Octo: An Open-Source Generalist Robot Policy. In Proceedings of the Robotics: Science and Systems (RSS 2024), Delft, The Netherlands, 15–19 July 2024. [Google Scholar] [CrossRef]
Chang, M.; Chhablani, G.; Clegg, A.; Dallaire Cote, M.; Desai, R.; Hlavac, M.; Karashchuk, V.; Krantz, J.; Mottaghi, R.; Parashar, P.; et al. PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-Agent Tasks. arXiv 2024, arXiv:2411.00081. [Google Scholar] [CrossRef]
Gu, Q.; Ju, Y.; Sun, S.; Gilitschenski, I.; Nishimura, H.; Itkina, M.; Shkurti, F. SAFE: Multitask Failure Detection for Vision-Language-Action Models. arXiv 2025, arXiv:2506.09937. [Google Scholar] [CrossRef]
Chen, B.; Xia, F.; Ichter, B.; Rao, K.; Gopalakrishnan, K.; Ryoo, M.S. Open-Vocabulary Queryable Scene Representations for Real World Planning. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA 2023), London, UK, 29 May–2 June 2023; IEEE: Piscataway, NJ, USA, 2023. [Google Scholar] [CrossRef]
Shin, J.; Han, J.; Kim, S.; Oh, Y.; Kim, E. Task Planning for Long-Horizon Cooking Tasks Based on Large Language Models. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 14–18 October 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 13613–13619. [Google Scholar] [CrossRef]
Takebayashi, R.; Isume, V.H.; Kiyokawa, T.; Wan, W.; Harada, K. Cooking Task Planning Using LLM and Verified by Graph Network. In Proceedings of the 21st IEEE International Conference on Automation Science and Engineering (CASE 2025), Los Angeles, CA, USA, 17–21 August 2025; IEEE: Piscataway, NJ, USA, 2025. [Google Scholar] [CrossRef]
Singh, I.; Blukis, V.; Mousavian, A.; Goyal, A.; Xu, D.; Tremblay, J.; Fox, D.; Thomason, J.; Garg, A. ProgPrompt: Generating Situated Robot Task Plans Using Large Language Models. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 11523–11530. [Google Scholar] [CrossRef]
Yang, J.; Chen, X.; Qian, S.; Madaan, N.; Iyengar, M.; Fouhey, D.F.; Chai, J. LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 7694–7701. [Google Scholar] [CrossRef]
Luo, S.; Sun, P.; Zhu, J.; Deng, Y.; Yu, C.; Xiao, A. GSON: A Group-Based Social Navigation Framework with Large Multimodal Model. IEEE Robot. Autom. Lett. 2025, 10, 9646–9653. [Google Scholar] [CrossRef]
Ahn, M.; Dwibedi, D.; Finn, C.; Arenas, M.G.; Gopalakrishnan, K.; Hausman, K.; Ichter, B.; Irpan, A.; Joshi, N.; Julian, R.; et al. AutoRT: Embodied Foundation Models for Large-Scale Orchestration of Robotic Agents. arXiv 2024, arXiv:2401.12963. [Google Scholar] [CrossRef]
Rana, K.; Haviland, J.; Garg, S.; Abou-Chakra, J.; Reid, I.; Sünderhauf, N. SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning. In Proceedings of the 7th Conference on Robot Learning (CoRL 2023), Atlanta, GA, USA, 6–9 November 2023; PMLR: Red Hook, NY, USA, 2023; Volume 229, pp. 23–72. Available online: https://proceedings.mlr.press/v229/rana23a.html (accessed on 5 December 2025).
Chang, H.; Boyalakuntla, K.; Lu, S.; Cai, S.; Jing, E.P.; Keskar, S.; Geng, S.; Abbas, A.; Zhou, L.; Bekris, K.; et al. Context-Aware Entity Grounding with Open-Vocabulary 3D Scene Graphs. In Proceedings of the 7th Conference on Robot Learning (CoRL 2023), Atlanta, GA, USA, 6–9 November 2023; PMLR: Red Hook, NY, USA, 2023; Volume 229, pp. 1950–1974. Available online: https://proceedings.mlr.press/v229/chang23b.html (accessed on 5 December 2025).
Shah, D.; Osiński, B.; Ichter, B.; Levine, S. LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action. In Proceedings of the 6th Conference on Robot Learning (CoRL), Auckland, New Zealand, 14–18 December 2022; PMLR: Red Hook, NY, USA, 2023; pp. 492–504. Available online: https://proceedings.mlr.press/v205/shah23b.html (accessed on 5 December 2025).
Shah, D.; Eysenbach, B.; Kahn, G.; Rhinehart, N.; Levine, S. ViNG: Learning Open-World Navigation with Visual Goals. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 13215–13222. [Google Scholar] [CrossRef]
Shi, L.X.; Ichter, B.; Equi, M.; Ke, L.; Pertsch, K.; Vuong, Q.; Tanner, J.; Walling, A.; Wang, H.; Fusai, N.; et al. Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models. arXiv 2025, arXiv:2502.19417. [Google Scholar] [CrossRef]
Gao, J.; Sarkar, B.; Xia, F.; Xiao, T.; Wu, J.; Ichter, B.; Majumdar, A.; Sadigh, D. Physically Grounded Vision-Language Models for Robotic Manipulation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 12462–12469. [Google Scholar] [CrossRef]
Webster, C.; Ivanov, S. Robots, Artificial Intelligence and Service Automation in Tourism and Quality of Life. In Handbook of Tourism and Quality-of-Life Research II; Uysal, M., Sirgy, M.J., Eds.; Springer: Cham, Switzerland, 2023. [Google Scholar] [CrossRef]
Choi, Y.; Choi, M.; Oh, M.; Kim, S. Service Robots in Hotels: Understanding the Service Quality Perceptions of Human–Robot Interaction. J. Hosp. Mark. Manag. 2020, 29, 613–635. [Google Scholar] [CrossRef]
Yuan, W.; Duan, J.; Blukis, V.; Pumacay, W.; Krishna, R.; Murali, A.; Mousavian, A.; Fox, D. RoboPoint: A Vision-Language Model for Spatial Affordance Prediction in Robotics. In Proceedings of the 8th Conference on Robot Learning (CoRL 2024), Munich, Germany, 6–9 November 2024; Agrawal, P., Kroemer, O., Burgard, W., Eds.; PMLR: Cambridge, MA, USA, 2025; Volume 270, pp. 4005–4020. Available online: https://proceedings.mlr.press/v270/yuan25c.html (accessed on 5 December 2025).
Zheng, R.; Liang, Y.; Huang, S.; Gao, J.; Daumé, H., III; Kolobov, A.; Huang, F.; Yang, J. TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies. arXiv 2024, arXiv:2412.10345. [Google Scholar] [CrossRef]
Jiang, Y.; Gupta, A.; Zhang, Z.; Wang, G.; Dou, Y.; Chen, Y.; Li, F.-F.; Anandkumar, A.; Zhu, Y.; Fan, L. VIMA: Robot Manipulation with Multimodal Prompts. In Proceedings of the 40th International Conference on Machine Learning (ICML 2023), Honolulu, HI, USA, 23–29 July 2023; Available online: https://dl.acm.org/doi/10.5555/3618408.3619019 (accessed on 5 December 2025).
Min, S.Y.; Puig, X.; Chaplot, D.S.; Yang, T.-Y.; Rai, A.; Parashar, P.; Salakhutdinov, R.; Bisk, Y.; Mottaghi, R. Situated Instruction Following. In Computer Vision—ECCV 2024, Proceedings of the 18th European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 202–228. [Google Scholar] [CrossRef]
Dalal, M.; Chiruvolu, T.; Chaplot, D.; Salakhutdinov, R. Plan-Seq-Learn: Language Model Guided RL for Solving Long-Horizon Robotics Tasks. In Proceedings of the International Conference on Learning Representations (ICLR 2024), Vienna, Austria, 7–11 May 2024; Available online: https://proceedings.iclr.cc/paper_files/paper/2024/file/2e9f9cde1b709281a06dd14f679e4c51-Paper-Conference.pdf (accessed on 5 December 2025).
Zhao, T.Z.; Kumar, V.; Levine, S.; Finn, C. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. In Proceedings of the Robotics: Science and Systems (RSS 2023), Daegu, Republic of Korea, 10–14 July 2023. [Google Scholar] [CrossRef]
Fu, Z.; Zhao, T.Z.; Finn, C. Mobile ALOHA: Learning Bimanual Mobile Manipulation Using Low-Cost Whole-Body Teleoperation. In Proceedings of the 8th Conference on Robot Learning (CoRL 2024), Munich, Germany, 6–9 November 2024; PMLR: Cambridge, MA, USA, 2025; Volume 270, pp. 4066–4083. Available online: https://proceedings.mlr.press/v270/fu25b.html (accessed on 5 December 2025).
Shafiullah, N.M.; Paxton, C.; Pinto, L.; Chintala, S.; Szlam, A. CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory. In Proceedings of the Robotics: Science and Systems (RSS 2023), Daegu, Republic of Korea, 10–14 July 2023. [Google Scholar] [CrossRef]
Rozière, B.; Gehring, J.; Gloeckle, F.; Sootla, S.; Gat, I.; Tan, X.E.; Adi, Y.; Liu, J.; Sauvestre, R.; Remez, T.; et al. Code Llama: Open Foundation Models for Code. arXiv 2024, arXiv:2308.12950. [Google Scholar] [CrossRef]
Kim, J.; Mishra, A.K.; Limosani, R.; Scafuro, M.; Cauli, N.; Santos-Victor, J.; Mazzolai, B.; Cavallo, F. Control Strategies for Cleaning Robots in Domestic Applications: A Comprehensive Review. Int. J. Adv. Robot. Syst. 2019, 16, 1729881419857432. [Google Scholar] [CrossRef]
Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
Lym, H.J.; Son, H.I.; Kim, D.-Y.; Kim, J.; Kim, M.-G.; Chung, J.H. Child-Centered Home Service Design for a Family Robot Companion. Front. Robot. AI 2024, 11, 1346257. [Google Scholar] [CrossRef]
Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter. arXiv 2020, arXiv:1910.01108. [Google Scholar] [CrossRef]
Honerkamp, D.; Büchner, M.; Despinoy, F.; Welschehold, T.; Valada, A. Language-Grounded Dynamic Scene Graphs for Interactive Object Search with Mobile Manipulation. IEEE Robot. Autom. Lett. 2024, 9, 8298–8305. [Google Scholar] [CrossRef]
Wang, Y.; Xian, Z.; Chen, F.; Wang, T.-H.; Wang, Y.; Fragkiadaki, K.; Erickson, Z.; Held, D.; Gan, C. RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation. In Proceedings of the 41st International Conference on Machine Learning (ICML 2024), Vienna, Austria, 21–27 July 2024; JMLR.org: Red Hook, NY, USA, 2024; Article 2127. Available online: https://dl.acm.org/doi/10.5555/3692070.3694197 (accessed on 5 December 2025).
Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; de Oliveira Pinto, H.P.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. Evaluating Large Language Models Trained on Code. arXiv 2021, arXiv:2107.03374. [Google Scholar] [CrossRef]
Peng, S.; Genova, K.; Jiang, C.; Tagliasacchi, A.; Pollefeys, M.; Funkhouser, T. OpenScene: 3D Scene Understanding with Open Vocabularies. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 815–824. [Google Scholar] [CrossRef]
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016. [Google Scholar] [CrossRef]
Bai, S.; Liu, Y.; Han, Y.; Zhang, H.; Tang, Y.; Zhou, J. Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation. IEEE Trans. Image Process. 2025, 34, 8271–8284. [Google Scholar] [CrossRef] [PubMed]
Shah, D.; Sridhar, A.; Dashora, N.; Stachowicz, K.; Black, K.; Hirose, N.; Levine, S. ViNT: A Foundation Model for Visual Navigation. In Proceedings of the 7th Conference on Robot Learning (CoRL 2023), Atlanta, GA, USA, 6–9 November 2023; PMLR: Red Hook, NY, USA, 2023; Volume 229, pp. 711–733. Available online: https://proceedings.mlr.press/v229/shah23a.html (accessed on 5 December 2025).
Narasimhan, S.; Lisondra, M.; Wang, H.; Nejat, G. SplatSearch: Instance Image Goal Navigation for Mobile Robots Using 3D Gaussian Splatting and Diffusion Models. arXiv 2025, arXiv:2511.12972. [Google Scholar] [CrossRef]
Fung, A.; Tan, A.H.; Wang, H.; Benhabib, B.; Nejat, G. MLLM-Search: A Zero-Shot Approach to Finding People Using Multimodal Large Language Models. Robotics 2025, 14, 102. [Google Scholar] [CrossRef]
Tan, A.H.; Fung, A.; Wang, H.; Nejat, G. Mobile Robot Navigation Using Hand-Drawn Maps: A Vision Language Model Approach. IEEE Robot. Autom. Lett. 2025, 10, 7667–7674. [Google Scholar] [CrossRef]
Beyer, L.; Steiner, A.; Pinto, A.S.; Kolesnikov, A.; Wang, X.; Salz, D.; Neumann, M.; Alabdulmohsin, I.; Tschannen, M.; Bugliarello, E.; et al. PaliGemma: A Versatile 3B Vision-Language Model for Transfer. arXiv 2024, arXiv:2407.07726. [Google Scholar] [CrossRef]
Dai, W.; Li, J.; Li, D.; Tiong, A.M.H.; Zhao, J.; Wang, W.; Li, B.; Fung, P.; Hoi, S. InstructBLIP: Towards General-Purpose Vision–Language Models with Instruction Tuning. In Proceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS 2023), New Orleans, LA, USA, 10–16 December 2023; Curran Associates, Inc.: Red Hook, NY, USA, 2023; Article No. 2142; Available online: https://dl.acm.org/doi/10.5555/3666122.3668264 (accessed on 5 December 2025).
Yao, Y.; Duan, J.; Xu, K.; Cai, Y.; Sun, Z.; Zhang, Y. A Survey on Large Language Model (LLM) Security and Privacy: The Good, The Bad, and The Ugly. High-Confid. Comput. 2024, 4, 100211. [Google Scholar] [CrossRef]
Spreitzer, R.; Moonsamy, V.; Korak, T.; Mangard, S. Systematic Classification of Side-Channel Attacks: A Case Study for Mobile Devices. IEEE Commun. Surv. Tutor. 2018, 20, 465–488. [Google Scholar] [CrossRef]
Hettwer, B.; Gehrer, S.; Güneysu, T. Applications of Machine Learning Techniques in Side-Channel Attacks: A Survey. J. Cryptogr. Eng. 2020, 10, 135–162. [Google Scholar] [CrossRef]
Pa Pa, Y.M.; Tanizaki, S.; Kou, T.; van Eeten, M.; Yoshioka, K.; Matsumoto, T. An Attacker’s Dream? Exploring the Capabilities of ChatGPT for Developing Malware. In Proceedings of the 16th Cyber Security Experimentation and Test Workshop (CSET 2023), Marina del Rey, CA, USA, 7–8 August 2023; Association for Computing Machinery: New York, NY, USA, 2023; pp. 10–18. [Google Scholar] [CrossRef]
Alami, A.; Jensen, V.V.; Ernst, N.A. Accountability in Code Review: The Role of Intrinsic Drivers and the Impact of LLMs. ACM Trans. Softw. Eng. Methodol. 2025, 34, 1–44. [Google Scholar] [CrossRef]
Raptis, E.K.; Kapoutsis, A.C.; Kosmatopoulos, E.B. Agentic LLM-Based Robotic Systems for Real-World Applications: A Review on Their Agenticness and Ethics. Front. Robot. AI 2025, 12, 1605405. [Google Scholar] [CrossRef]
Sathish, V.; Lin, H.; Kamath, A.K.; Nyayachavadi, A. LLeMpower: Understanding Disparities in the Control and Access of Large Language Models. arXiv 2024, arXiv:2404.09356. [Google Scholar] [CrossRef]
Nagarajan, R.; Kondo, M.; Salas, F.; Sezgin, E.; Yao, Y.; Klotzman, V.; Godambe, S.A.; Khan, N.; Limon, A.; Stephenson, G.; et al. Economics and Equity of Large Language Models: Health Care Perspective. J. Med. Internet Res. 2024, 26, e64226. [Google Scholar] [CrossRef]
Frank, M.R.; Ahn, Y.-Y.; Moro, E. AI Exposure Predicts Unemployment Risk: A New Approach to Technology-Driven Job Loss. PNAS Nexus 2025, 4, pgaf107. [Google Scholar] [CrossRef]
Eloundou, T.; Manning, S.; Mishkin, P.; Rock, D. GPTs Are GPTs: Labor Market Impact Potential of LLMs. Science 2024, 384, 1306–1308. [Google Scholar] [CrossRef]
Adilazuarda, M.F.; Mukherjee, S.; Lavania, P.; Singh, S.S.; Aji, A.F.; O’Neill, J.; Modi, A.; Choudhury, M. Towards Measuring and Modeling “Culture” in LLMs: A Survey. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024), Miami, FL, USA, 12–16 November 2024; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 15763–15784. [Google Scholar] [CrossRef]
Li, C.; Chen, M.; Wang, J.; Sitaram, S.; Xie, X. CultureLLM: Incorporating Cultural Differences into Large Language Models. In Proceedings of the 38th International Conference on Neural Information Processing Systems (NeurIPS 2024), Vancouver, BC, Canada, 10–15 December 2024; Curran Associates Inc.: Red Hook, NY, USA, 2024; Article No. 2693; Available online: https://dl.acm.org/doi/10.5555/3737916.3740609 (accessed on 5 December 2025).
Pal, A.; Wangmo, T.; Bharadia, T.; Ahmed-Richards, M.; Bhanderi, M.B.; Kachhadiya, R.; Allemann, S.S.; Elger, B.S. Generative AI/LLMs for Plain Language Medical Information for Patients, Caregivers and General Public: Opportunities, Risks and Ethics. Patient Prefer. Adherence 2025, 19, 2227–2249. [Google Scholar] [CrossRef] [PubMed]
Alessandro, G.; Dimitri, O.; Cristina, B.; Anna, M. The Emotional Impact of Generative AI: Negative Emotions and Perception of Threat. Behav. Inf. Technol. 2025, 44, 676–693. [Google Scholar] [CrossRef]
Wester, J.; de Jong, S.; Pohl, H.; van Berkel, N. Exploring People’s Perceptions of LLM-Generated Advice. Comput. Hum. Behav. Artif. Hum. 2024, 2, 100072. [Google Scholar] [CrossRef]
Huang, J.-T.; Lam, M.H.; Li, E.J.; Ren, S.; Wang, W.; Jiao, W.; Tu, Z.; Lyu, M.R. Apathetic or Empathetic? Evaluating LLMs’ Emotional Alignments with Humans. In Advances in Neural Information Processing Systems, NeurIPS 2024; Curran Associates Inc.: Red Hook, NY, USA, 2024. [Google Scholar] [CrossRef]
Li, Y.; Huang, Y.; Wang, H.; Cheng, Y.; Zhang, X.; Zou, J.; Sun, L. Evaluating Large Language Models with Psychometrics. arXiv 2024, arXiv:2406.17675. [Google Scholar] [CrossRef]
Moreira, J.I.; Zhang, J. ChatGPT as a Fourth Arbitrator? The Ethics and Risks of Using Large Language Models in Arbitration. Arbitr. Int. 2025, 41, 71–84. [Google Scholar] [CrossRef]
Mahadevan, K.; Chien, J.; Brown, N.; Xu, Z.; Parada, C.; Xia, F.; Zeng, A.; Takayama, L.; Sadigh, D. Generative expressive robot behaviors using large language models. In Proceedings of the 2024 ACM/IEEE International Conference on Human-Robot Interaction (HRI 2024), Boulder, CO, USA, 11–14 March 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 482–491. [Google Scholar] [CrossRef]
Hu, Y.; Huang, P.; Sivapurapu, M.; Zhang, J. ELEGNT: Expressive and Functional Movement Design for Non-Anthropomorphic Robot. arXiv 2025, arXiv:2501.12493. [Google Scholar] [CrossRef]
Bommasani, R.; Hudson, D.A.; Adeli, E.; Altman, R.; Arora, S.; von Arx, S.; Bernstein, M.S.; Bohg, J.; Bosselut, A.; Brunskill, E.; et al. On the Opportunities and Risks of Foundation Models. arXiv 2021, arXiv:2108.07258. [Google Scholar] [CrossRef]
Wasi, A.T.; Islam, M.R. CogErgLLM: Exploring Large Language Model Systems Design Perspective Using Cognitive Ergonomics. In Proceedings of the 1st Workshop on NLP for Science (NLP4Science 2024), Miami, FL, USA, 12–16 November 2024; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 249–258. [Google Scholar] [CrossRef]
Sang, H.; Zhang, L.; Chen, T.; Guo, W.; Zhang, Z. Onboard Deployment of Remote Sensing Foundation Models: A Comprehensive Review of Architecture, Optimization, and Hardware. Remote Sens. 2026, 18, 298. [Google Scholar] [CrossRef]
Ranasinghe, N.; Mohammed, W.M.; Stefanidis, K.; Martinez Lastra, J.L. Large Language Models in Human-Robot Collaboration with Cognitive Validation Against Context-Induced Hallucinations. IEEE Access 2025, 13, 77418–77430. [Google Scholar] [CrossRef]
Hamid, O.H. Beyond probabilities: Unveiling the Delicate Dance of Large Language Models (LLMs) and AI Hallucination. In Proceedings of the 2024 IEEE Conference on Cognitive and Computational Aspects of Situation Management (CogSIMA), Montreal, QC, Canada, 7–10 May 2024; IEEE: New York, NY, USA, 2024; pp. 85–90. [Google Scholar] [CrossRef]
Dong, Q.; Zeng, P.; He, Y.; Wan, G.; Dong, X. Mitigating Catastrophic Forgetting in Robot Continual Learning: A Guided Policy Search Approach Enhanced with Memory-Aware Synapses. IEEE Robot. Autom. Lett. 2024, 9, 11242–11249. [Google Scholar] [CrossRef]
Taie, W.; ElGeneidy, K.; Al-Yacoub, A.; Sun, R. Addressing Catastrophic Forgetting in Payload Parameter Identification Using Incremental Ensemble Learning. Front. Robot. AI 2024, 11, 1470163. [Google Scholar] [CrossRef]
O’Neill, A.; Rehman, A.; Maddukuri, A.; Gupta, A.; Padalkar, A.; Lee, A.; Pooley, A.; Gupta, A.; Mandlekar, A.; Jain, A.; et al. Open X-Embodiment: Robotic Learning Datasets And RT-X Models. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 6892–6903. [Google Scholar] [CrossRef]
Xu, M.; Cai, D.; Yin, W.; Wang, S.; Jin, X.; Liu, X. Resource-Efficient Algorithms and Systems of Foundation Models: A Survey. ACM Comput. Surv. 2025, 57, 1–39. [Google Scholar] [CrossRef]
Xue, Y.; Tan, C.K.; Wong, W.P. Energy-Aware Multi-Robot Exploration and Coverage in Fragmented Unknown Environments Using Collaborative Reinforcement Learning. Results Eng. 2026, 29, 109171. [Google Scholar] [CrossRef]
Maresca, F.; Romero, A.; Delgado, C.; Sciancalepore, V.; Paradells, J.; Costa-Pérez, X. REACT: Multi-Robot Energy-Aware Orchestrator for Indoor Search and Rescue Critical Tasks. In Proceedings of the 2025 IEEE International Conference on Robotics and Automation (ICRA 2025), Atlanta, GA, USA, 19–23 May 2025; IEEE: Piscataway, NJ, USA, 2025. [Google Scholar] [CrossRef]
Li, P.; Toprak, O.S.; Narayanan, A.; Topcu, U.; Chinchali, S. Online Foundation Model Selection in Robotics. arXiv 2024, arXiv:2402.08570. [Google Scholar] [CrossRef]
Zhu, X.; Li, J.; Liu, Y.; Ma, C.; Wang, W. A survey on model compression for large language models. Trans. Assoc. Comput. Linguist. 2024, 12, 1556–1577. [Google Scholar] [CrossRef]
Park, Y.; Hyun, J.; Cho, S.; Sim, B.; Lee, J.W. Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs. arXiv 2024, arXiv:2402.10517. [Google Scholar] [CrossRef]
Yang, C.; Si, Q.; Duan, Y.; Zhu, Z.; Zhu, C.; Li, Q.; Chen, M.; Lin, Z.; Wang, W. Dynamic Early Exit in Reasoning Models. arXiv 2025, arXiv:2504.15895. [Google Scholar] [CrossRef]
Liu, L.; Zhang, S.; Jiang, Y.; Guo, J.; Zhao, W. Task Decomposition and Self-Evaluation Mechanisms for Home Healthcare Robots Using Large Language Models. IEEE Access 2025, 13, 65726–65736. [Google Scholar] [CrossRef]
Cohen, V.; Liu, J.X.; Mooney, R.; Tellex, S.; Watkins, D. A Survey of Robotic Language Grounding: Tradeoffs between Symbols and Embeddings. arXiv 2024, arXiv:2405.13245. [Google Scholar] [CrossRef]
Khan, M.T.; Waheed, A. Foundation Model Driven Robotics: A Comprehensive Review. arXiv 2025, arXiv:2507.10087. [Google Scholar] [CrossRef]
Merlo, E.; Lagomarsino, M.; Ajoudani, A. A Human-in-The-Loop Approach to Robot Action Replanning Through LLM Common-Sense Reasoning. IEEE Robot. Autom. Lett. 2025, 10, 10767–10774. [Google Scholar] [CrossRef]
Wang, H.; Tan, A.H.; Fung, A.; Nejat, G. X-Nav: Learning End-to-End Cross-Embodiment Navigation for Mobile Robots. IEEE Robot. Autom. Lett. 2026, 11, 698–705. [Google Scholar] [CrossRef]
Kawaharazuka, K.; Matsushima, T.; Gambardella, A.; Guo, J.; Paxton, C.; Zeng, A. Real-World Robot Applications of Foundation Models: A Review. Adv. Robot. 2024, 38, 1232–1254. [Google Scholar] [CrossRef]
Chen, J.; Yu, C.; Zhou, X.; Xu, T.; Mu, Y.; Hu, M.; Shao, W.; Wang, Y.; Li, G.; Shao, L. EMOS: Embodiment-Aware Heterogeneous Multi-Robot Operating System with LLM Agents. arXiv 2024, arXiv:2410.22662. [Google Scholar] [CrossRef]
Wang, W.; Obi, I.; Min, B.-C. Multi-Agent LLM Actor-Critic Framework for Social Robot Navigation. arXiv 2025, arXiv:2503.09758. [Google Scholar] [CrossRef]
Lin, X.; Alam, N.; Shuvo, M.I.R.; Fime, A.A.; Kim, J.H. MechLMM: A Collaborative Knowledge Framework for Enhanced Data Fusion in Multi-Robot Systems Using Large Multimodal Models. In Proceedings of the 2024 IEEE 8th International Conference on Information and Communication Technology (CICT), Prayagraj, India, 6–8 December 2024; IEEE: Piscataway, NJ, USA, 2024. [Google Scholar] [CrossRef]
Mandi, Z.; Jain, S.; Song, S. RoCo: Dialectic Multi-Robot Collaboration with Large Language Models. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; IEEE: Piscataway, NJ, USA, 2024. [Google Scholar] [CrossRef]
El-Mohamed, J.; Ahmadipour, M.; Warrier, S.; Costello, T.; Hii, M.; Mohan, H. Robotic Surgical Curriculum for Medical Students: A Scoping Review. J. Robot. Surg. 2025, 19, 496. [Google Scholar] [CrossRef] [PubMed]
Abdi, J.; Al-Hindawi, A.; Ng, T.; Vizcaychipi, M.P. Scoping Review on the use of Socially Assistive Robot Technology in Elderly Care. BMJ Open 2018, 8, e018815. [Google Scholar] [CrossRef] [PubMed]
Al Bayrakdar, A.; Dragone, M.; Wojcik, G.; McConnell, A.; King, M.; Paterson, R. Robotics Use in the Care and Management of People Living with Diabetes Mellitus: A Scoping Review. J. Diabetes Sci. Technol. 2025; ahead of print. [Google Scholar] [CrossRef]
Shneiderman, B. Human-Centered Artificial Intelligence: Reliable, Safe & Trustworthy. Int. J. Hum.-Comput. Interact. 2020, 36, 495–504. [Google Scholar] [CrossRef]
Tricco, A.C.; Lillie, E.; Zarin, W.; O’Brien, K.K.; Colquhoun, H.; Levac, D.; Moher, D.; Peters, M.D.J.; Horsley, T.; Weeks, L.; et al. PRISMA Extension for Scoping Reviews (PRISMA-ScR): Checklist and Explanation. Ann. Intern. Med. 2018, 169, 467–473. [Google Scholar] [CrossRef]
Cabibihan, J.-J.; Javed, H.; Ang, M.; Aljunied, S.M. Why Robots? A Survey on the Roles and Benefits of Social Robots in the Therapy of Children with Autism. Int. J. Soc. Robot. 2013, 5, 593–618. [Google Scholar] [CrossRef]

Figure 1. A VLM-MLLM pipeline enables language-guided medicine retrieval in a home. Images were generated using the Nano Banana Pro AI Image Generator.

Figure 2. Foundation models fuse multimodal inputs to drive perception, planning, and control. The images for the top-row robots and the RGB-D image pair were generated using the Nano Banana Pro AI Image Generator. All other images are from the authors.

Figure 3. Timeline of embodied AI advancements for mobile service robots before and after the rise of foundation models.

Figure 4. Foundation models support communication, navigation, and physical interaction in service robots. Images were generated using the Nano Banana Pro AI Image Generator.

Figure 5. Mobile service robots operate across domestic, healthcare, and service automation domains. Images were generated using the Nano Banana Pro AI Image Generator.

Figure 6. Three-level framework maps foundation models to perception, reasoning, and action across domains.

Figure 7. Responsible robotics spans ethical, societal, human–robot interaction, and physical-ergonomic domains.

Figure 8. Roadmap showing short-, mid-, and long-term research goals of foundation-model service robots. The images of the robot and human were generated using the Nano Banana Pro AI Image Generator.

Table 1. Ranking of the four core challenges in mobile service robotics.

Rank	Challenge	Paper Count N	Percent of Papers (N/7506 × 100%)	Conclusions
#1	Challenge #1 Language-to-Action Mapping	2190	29%	Largest research focus: semantic grounding, task sequencing, navigation & manipulation logic
#2	Challenge #2 Multimodal Perception	2164	29%	Strong support for cross-modal sensing and perception alignment
#3	Challenge #3 Uncertainty Estimation	2005	27%	Strong emphasis on safety, robustness, and decision-making under uncertainty
#4	Challenge #4 Computational Capabilities	1147	15%	Focus on real-time execution, embedded efficiency, and resource constraints
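The percentage column in Table 1 follows from the paper counts via N/7506 × 100%, rounded to the nearest whole percent. As a minimal sketch of that arithmetic (the variable and function names below are illustrative and do not come from the review's own tooling), the shares and the ranking can be reproduced as follows:

```python
# Paper counts per challenge, copied from Table 1.
counts = {
    "Language-to-Action Mapping": 2190,
    "Multimodal Perception": 2164,
    "Uncertainty Estimation": 2005,
    "Computational Capabilities": 1147,
}

# Total number of papers considered; the four counts sum to 7506.
total = sum(counts.values())


def percent_share(n: int, total: int) -> int:
    """Share of papers as a whole-number percentage (N / total x 100%)."""
    return round(100 * n / total)


# Hypothetical helper structures: share per challenge and rank by paper count.
shares = {name: percent_share(n, total) for name, n in counts.items()}
ranking = sorted(counts, key=counts.get, reverse=True)

for rank, name in enumerate(ranking, start=1):
    print(f"#{rank} {name}: {counts[name]} papers ({shares[name]}%)")
```

Because the first two challenges differ by only 26 papers, both round to 29%; the ranking is decided by the raw counts rather than by the rounded percentages.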

Table 2. Comparison of major foundation models with respect to the core challenges for mobile service robots.

Model	Language-to-Action (Success Rate)	Perception (Accuracy/Latency)	Uncertainty (Calibration Error)	Computation (GFLOPs/FPS)	Notes (Strengths (+)/Limitations (−))	Best-Fit Scenarios
Magma [150]	52.3% (Google Robot)	Vision QA: 88.6% ~220–300 ms	—	≈45 GFLOPs/~30 FPS	+ Strong multimodal perception − Limited robotics action success; partial spatial reasoning	Healthcare: Vision-Language Navigation (Section 4.2): Multimodal gesture-speech grounding for mobile service robots operating in crowded hospital wards, enabling robust intent disambiguation under visual occlusion and acoustic interference.
DeepSeek-R1 [149]	—	MMLU: 84% ~300–400 ms	81.8%	≈90 GFLOPs/~18 FPS	+ Best uncertainty-aware reasoning − No action grounding or real-robot evaluation	Healthcare: Language Communication with Uncertainty Awareness (Section 4.3): Dialogue-driven task execution under ambiguous goal specification, with calibrated confidence estimation to ensure safe and trustworthy human–robot interaction.
SAM 2 [174]	—	Segmentation mIoU: 86% ~50 ms CPU	—	260–310 GFLOPs/25–44 FPS	+ Highest visual perception accuracy − Cannot produce or ground action commands	Healthcare: Vision-Language Navigation (Section 4.2): Real-time scene segmentation to support safe mobile service robot navigation through hospital corridors with dense human motion and dynamic obstacles.
CLIP-CAP [168]	71% across 7 tasks (Everyday Robot)	Goal recognition: 83% ~200 ms	—	175 GFLOPs/3–5 FPS	+ Best physical manipulation success − Low FPS hinders onboard deployment	Domestic Assistance: Manipulation and Organization (Section 4.1): Language-conditioned mobile manipulation in cluttered kitchens, translating natural language commands into executable control programs for object retrieval and rearrangement.
Perceiver IO [175] for Perceiver-Actor [176]	70% across 18 tasks (Franka Panda Robot)	Perceiver-IO scene encoding: 84.5% ~250–350 ms	—	9–10 GFLOPs/2–4 FPS	+ Lowest computational cost with multi-task robustness − Uncertainty not explicitly modeled	Domestic Assistance: Computationally Efficient Manipulation (Section 4.4): Edge-constrained multi-task visuomotor control for home service robots performing navigation and object interaction under strict onboard compute and latency limitations.
GPT [144]	85–92% structured question and answering (Franka Panda Robot)	MMLU: 90% ~700 ms multimodal	92.0%	≈800 GFLOPs/token → 1–3 FPS	+ Strongest high-level reasoning under ambiguity − Computationally prohibitive for onboard guidance	Service Automation: Long-Horizon Language Communication and Planning (Section 4.1 and Section 4.4): High-level delivery instruction decomposition for multi-room navigation and task sequencing with remote inference support under non-real-time operational constraints.
LLaMA [177]	~91% symbolic planning (Google Robot)	MMLU: 83.6% 250–350 ms	92.1%	$\approx 8 \times 10^{3}$ GFLOPs/token → ~0.7 Hz	+ Effective symbolic planner; strong long-context utility − Slow inference; lacks action grounding	Service Automation: High-Level Semantic Planning (Section 4.1): Pre-execution semantic plan synthesis within hybrid symbolic-LLM pipelines for navigation and manipulation tasks prior to embodied grounding.
Top Performer Per Challenge	CLIP-CAP (best success rate in manipulation tasks)	SAM 2 (highest perception accuracy & real-time latency)	DeepSeek-R1 (lowest reasoning error on HLE text-only tasks)	Perceiver-Actor (lowest GFLOPs with viable FPS on-board)	Demonstrates clear trade-offs; no single model excels across all four core challenges

Note: “—” means not available.

Table 3. Foundation model applications across several robot domains.

Domain	Task	Representative Frameworks	Robots
Domestic Assistance	Fetch and Carry Tasks	Ok-Robot [216]	Stretch RE-1 Robot
		Octo [217]	Franka Emika Panda Robot
		CogNav [30]	Fetch mobile manipulator
		PARTNR [218]	Franka Emika Panda Robot; Spot + 7-DoF arm
	Cleaning	TidyBot [37]	Holonomic mobile base and a Kinova Gen3 arm
	Cleaning	RT-2 [14]	Everyday mobile manipulator robots
	Childcare	Smart Help [28]	Custom mobile manipulator robot with Kinova Gen3 arm
	Childcare	SAFE [219]	Franka Emika Panda robot
	Cooking	OVQSRFRW [220]	Everyday mobile manipulator robots
		TPLC [221]	6-DoF Universal Robots UR5e arm
		CTGN [222]	Dual-arm Universal Robots UR5e
Healthcare Assistance	Delivery of Medical Supplies	ProgPrompt [223]	Franka Emika Panda robot
		LLM-Grounder [224]	Franka Emika Panda robot
		VLM-Social-Nav [201]	Clearpath Jackal robot
	Bedside Patient Monitoring	MoMa-LLM [201]	Fetch mobile manipulator
	Assistive Mobility	OLiVia-Nav [202]	Clearpath Jackal robot
	Assistive Mobility	GSON [225]	Custom differential-drive robot
	Hygiene Maintenance and Infection Control	AutoRT [226]	Everyday mobile manipulator robots
	Hygiene Maintenance and Infection Control	SayPlan [227]	Franka Emika Panda robot
Service Automation	Customer Assistance, Guidance and Wayfinding	OVSG [228]	Ackermann Steering robot
		LM-Nav [229]	Clearpath Jackal robot
		ViNG [230]	Vizbot, Unitree Go1, Clearpath Jackal, and LoCoBot
	Service Setup and Maintenance	Hi Robot [231]	Dual-arm mobile manipulator Mobile ALOHA robot
		$π_{0.5}$ [154]	Dual-arm mobile manipulator Mobile ALOHA robot
		PhysObjects [232]	Franka Emika Panda robot

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

