1. Introduction
The integration of mobile robots into indoor environments marks a significant paradigm shift in robotics—from isolated industrial applications to direct collaboration with humans in shared spaces [
1]. Indoor mobile robots are increasingly deployed across diverse environments, including hospitals [
2], laboratories [
3], offices, and residential settings [
4]. This shift demands sophisticated Human–Robot Interaction (HRI) capabilities that enable natural, safe, and efficient collaboration between humans and robotic systems.
Several converging factors motivate the development of effective HRI in indoor mobile robotics.
Aging population: An aging global population creates a critical need for assistance. According to the World Health Organization, the world’s population aged 60 years and older will double to 2.1 billion by 2050, and 1 in 6 people worldwide will be over 65 by then, a significant increase from 1 in 11 in 2019 [
5]. This aging population and growing healthcare needs require the deployment of robots in hospitals, care facilities, and home care. Robots can relieve caregivers by performing routine tasks, thereby increasing the efficiency of patient care [
6]. The use of mobile robots enhances and enables independent living for the elderly and those with disabilities. The use of smart wheelchairs with Human–Robot Interaction via artificial intelligence increases usability, learnability, efficiency, satisfaction, and a sense of independence and dignity among elderly and mobility-impaired individuals [
7,
8]. Matthias and Markus proposed an intelligent wheelchair that can navigate in indoor environments and accompany any person. In addition, it allows social interaction while walking to relieve relatives or nursing staff, who otherwise need to push the wheelchair [
9]. Shannon Vallor in her paper “Carebots and Caregivers” introduced the concept of “carebots” and argued that the ethical evaluation of these systems must extend beyond the impact on patients to consider the moral value of caregiving practices for the caregivers themselves, examining whether robots sustain or deprive them of the internal goods of caring [
10]. The TORNADO cloud-integrated robotics platform, featuring people-aware navigation and dexterous manipulation, includes a validation scenario specifically focused on patient support in a hospital palliative ward [
11]. This aging population requires robots that can assist, provide care, and interact socially; achieving that goal demands the human-centric design of Human–Robot Interaction.
Labor shortage: Another major motivation for developing robots with intelligent and user-accepted Human–Robot Interaction is the labor shortage. To cooperate and coordinate with humans, a robot needs to understand human norms and proxemics. Cao and Tam proposed a search and fetch operation for mobile manipulation robots with multimodal Human–Robot Interaction via gesture, voice, and face recognition. This can address the persistent labor shortage in roles that require complex, repetitive tasks in indoor environments like manufacturing factories [
12]. Beyond simple repetition, the economic viability of automation is increasingly driven by the modernization of existing “brownfield” facilities. Unlike “greenfield” projects built from scratch, brownfield environments require Autonomous Mobile Robots (AMRs) that can navigate without fixed infrastructure like rails, allowing companies to automate gradually without expensive shutdowns or remodeling [
13]. In labor-intensive industries like textile composite production, where full automation is cost-prohibitive, collaborative robots (cobots) offer a middle ground. They reduce production time and costs while preserving the tacit knowledge of skilled human experts who are becoming scarcer due to aging workforces [
14]. This trend extends to agriculture, where robotic planters are being designed to remedy future farmer shortages by optimizing energy usage and draft force [
15]. What these users have in common is a lack of technical skills and training to operate and interact with robots; hence, the human-centric design of interaction becomes paramount. In the service and healthcare sectors, the integration of Generative AI (GenAI), which includes large language models (LLMs), large behavioral models (LBMs), and agentic AI, is transforming the economic landscape by enabling “citizen developers”—frontline employees who can train and fine-tune robots without coding skills, leading to cost-effective service excellence [
16]. Real-world applications demonstrate significant returns on investment; for instance, the deployment of Moxi robots has saved clinical staff over 575,000 h and 1.5 billion steps, directly addressing clinician burnout and operational inefficiency [
17]. Furthermore, humanoid robots are being explored as a solution to hospital labor shortages, capable of performing teleoperated clinical tasks with human-like dexterity [
18]. The deployment of mobile robots in healthcare settings has demonstrated promising results in improving operational efficiency and reducing the workload on medical staff [
19,
20]. Service robots have evolved from simple delivery systems to sophisticated platforms capable of complex Human–Robot Interaction [
21,
22].
Pandemic Response and Biosecurity: The COVID-19 pandemic acted as a catalyst for robotic adoption, highlighting the advantage of systems with intrinsic immunity to pathogens [
23]. Robots became essential for maintaining “social distancing” in medical settings. For example, the “Dr. Spot” quadruped robot was developed to measure vital signs (skin temperature, heart rate, SpO2) without direct contact, preserving Personal Protective Equipment (PPE) and protecting healthcare workers [
24]. Mobile robots are assessed for deployment in isolation-room hospital settings to execute tasks like remote supply delivery and medication distribution, aiming to minimize the risk of cross-contamination and reduce staff workload, particularly during infectious disease outbreaks like the COVID-19 pandemic [
25,
26]. In extreme cases, such as in Wuhan, China, entire wards were temporarily run by robots to deliver food and medicine to quarantine patients, minimizing human exposure [
27]. Telepresence robots also gained traction, allowing isolated patients to interact with families and doctors while avoiding contagion risks [
28]. Beyond direct care, robots have been utilized for disinfection, logistics, and waste handling, effectively breaking the chain of virus transmission [
26]. This protection extends to surgical oncology, where robotic systems allow for precise interventions while adhering to strict safety protocols to prevent viral aerosolization [
29]. Therefore, whenever a human is involved, which is very likely in the indoor environments of hospitals and laboratories, the robot must have a Human–Robot Interaction model covering basics such as human-aware navigation, intuitive interaction, and a likable presence.
Occupational and Psychological Safety: Beyond biological hazards, HRI addresses physical and mental safety in industry. In the construction sector, which faces high accident rates, machine learning models are being used to predict human trust in robots based on physiological data (such as skin temperature), ensuring that human–robot collaboration (HRC) remains safe and productive [
30]. However, safety is also psychological. The isolation caused by pandemics or varying abilities affects mental health. Social robots with “hybrid face” designs have been deployed to support the mental health of older adults and those with dementia, providing companionship and “human-like” conversation to mitigate isolation [
31]. Research in aged care suggests that while robots can relieve the physical workload, their acceptance depends heavily on their reliability and the emotional reactions of the staff, emphasizing that psychological safety is a prerequisite for successful implementation [
32]. Ultimately, robots are tasked with the “dull, dirty, and dangerous” jobs—from welding to patient lifting—enhancing the overall safety profile of modern work environments [
33].
The challenge for Human–Robot Interaction in indoor environments is twofold. One is the technical challenge, and the other is the human challenge (
Figure 1).
Complexity of environments: Indoor environments such as hospitals, laboratories, and offices are dynamic, tightly structured, and populated with many moving obstacles. Robots must not only navigate but also consider human activities and interact safely. Effective HRI supports adaptive, context-sensitive, flexible collaboration [
34].
Extending robot functionality: Natural interaction methods such as speech, gestures, touch, or gaze enable robots to perform more complex tasks that would be difficult to manage without HRI. These capabilities make robots more intuitive and “human-like” to operate, lower the learning barrier for users, and improve task efficiency.
User acceptance and trust: The success of robotic systems depends heavily on user trust and acceptance. Robots must behave in a comprehensible, reliable, and socially appropriate manner to ensure that users can rely on them and use them effectively in daily life [
35]. Trust is a prerequisite for user acceptance [
35,
36,
37]. It depends heavily on crucial human-centric factors, such as whether users are willing to accept the technology and trust it to perform its tasks reliably [
38]. Only then can we achieve the long-term adoption of mobile robotics in our personal spaces.
Safety and ethical considerations: In human-centered environments, safety is paramount. HRI contributes to predictable, transparent, and ethically sound interactions, for example, through clear communication, privacy protection, and the respectful handling of human autonomy.
HRI can be classified along multiple axes; Goodrich and Schultz survey HRI in terms of robot autonomy, interaction roles, and team organization [
1]. Parasuraman et al. formalize levels of automation for information acquisition, analysis, decision, and action—a framework widely adopted in HRI design and evaluation [
39]. Steinfeld and Fong proposed common metrics for Human–Robot Interaction [
40]. Bartneck et al. developed measurement instruments for the anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety of robots [
41].
Figure 2 represents the different dimensions of HRI classification.
This review provides a comprehensive and integrative synthesis of Human–Robot Interaction (HRI) in indoor mobile robotics. Unlike prior surveys that focus on isolated subdomains, we systematically combine three perspectives that are often treated separately: (i) the technical aspects of HRI, including modalities such as speech, gesture, touch, visual, and emerging LLM-enabled interfaces; (ii) the human aspects of HRI, encompassing usability, trust formation, social acceptance, and long-term user experience; and (iii) practical and regulatory considerations, including safety engineering, privacy, deployment constraints, and compliance frameworks.
By bridging fragmented literature across robotics, human–computer interaction, healthcare technology, and safety engineering, this review addresses a critical gap: the lack of a unified 2025 perspective that connects technical implementation with socio-technical integration in real deployment contexts. We analyze representative case studies from operational systems to ground theoretical insights in practice and to derive transferable design principles.
In addition to synthesizing the state of the art, we identify key open research challenges, including the need for longitudinal studies on trust dynamics, the cross-cultural validation of HRI models, scalable economic deployment frameworks, and standardized interoperability architectures for heterogeneous robot fleets. Addressing these challenges is essential for advancing indoor mobile robotics toward natural, safe, economically sustainable, and socially accepted human–robot collaboration.
Finally, we examine current approaches alongside representative case studies and highlight future research directions aimed at advancing the field toward more natural, safe, and effective human–robot collaboration in indoor environments.
3. Applications of Indoor Mobile Robots: Cross-Domains and Domain-Specific Challenges
Indoor mobile robots are autonomous or semi-autonomous robotic systems that operate across diverse indoor environments, such as hospitals, laboratories, offices, industrial facilities, and private homes. These environments differ in structure, predictability, and user characteristics. As summarized in
Table 1, indoor operational contexts can be broadly classified as structured and predictable or unstructured and dynamic. This distinction provides a useful framework for understanding both cross-domain and domain-specific requirements.
Indoor mobile robots perform tasks ranging from transportation and assistance to monitoring and social interaction while interacting safely and effectively with humans and dynamic surroundings. Compared to outdoor and industrial robots, they operate in fundamentally different contexts, which demand specialized capabilities and design considerations [
51]. These environments present unique constraints that directly influence interaction design and user experience. Indoor mobile robots operate in constrained, dynamic environments where they must navigate around humans, devices, furniture, and other obstacles while maintaining social appropriateness [
52]. The interaction design must accommodate diverse user groups with varying technical expertise, from healthcare professionals to elderly residents [
53] as depicted in
Table 2.
3.1. Cross-Domain HRI Challenges
Regardless of the environment, several challenges are shared across domains:
Safe Navigation in Human-Populated Spaces: Robots must operate in constrained, dynamic environments with humans, furniture, and moving obstacles. They need precise collision avoidance while maintaining efficiency and must adapt to unpredictable human behavior [
34,
54]. Structured environments, like laboratories, emphasize exact path following, whereas homes or healthcare settings require dynamic adaptation to moving people and clutter [
55,
56].
Social Appropriateness and Intent Communication: Across domains, robots must behave in socially intelligible ways. Mechanisms include visual signaling, auditory cues, gesture-based communication, and projected indicators of movement or intent [
57,
58]. Such communication increases the predictability and legibility of robot motion, thereby enhancing human–robot trust, whether in office corridors, hospital hallways, or shared home spaces [
59,
60].
Heterogeneous User Groups: Indoor robots interact with highly trained professionals, administrative staff, elderly residents, children, or casual visitors [
6,
61]. Each group has different expectations, technical literacy, and cognitive loads [
62]. The interaction design must accommodate these differences, providing clear, context-appropriate feedback without overloading users, accounting for varying levels of psychosocial functioning [
63].
Infrastructure Integration: Robots must interface with existing buildings and workflow infrastructure, including elevators, automatic doors, IT networks, and laboratory management systems [
64,
65]. Reliable integration ensures smooth operation and minimal disruption, often requiring standardized “plug and play” frameworks for seamless deployment [
19,
66].
Trust, Privacy, and Ethical Considerations: Especially in healthcare and residential settings, robots often process sensitive data [
67]. The cross-domain HRI design must safeguard privacy, implement consent mechanisms, and maintain transparency to ensure user trust and avoid the ethical pitfalls related to surveillance or autonomy [
35,
68,
69].
3.2. Domain-Specific Requirements
While indoor mobile robots share certain cross-domain challenges such as safe navigation, clear communication of intent, infrastructure integration, and accommodation of heterogeneous user groups, the relative importance and technical implementation of these challenges vary significantly by domain. Each environment imposes unique operational constraints, interaction priorities, and user expectations, which shape both the design of the robot and the nature of Human–Robot Interaction.
3.2.1. Healthcare and Elderly Care
Education and healthcare are particularly fertile application areas for domain-specific Human–Robot Interaction [
61,
70]. Healthcare facilities represent one of the most demanding indoor environments due to sterility requirements, strict patient privacy constraints, time-critical workflows, and the emotional sensitivity of users. Robots such as Moxi (Diligent Robotics) assist nursing staff with routine tasks, including delivering supplies or transporting medications [
71]. These assistance tasks involve more direct support to users and often require an understanding of individual user preferences, routines, and intentions. Such applications rely on advanced interaction capabilities, including speech recognition, gesture interpretation, contextual understanding, and long-term user modeling to personalize the robot’s behavior and responses over time.
Common applications in healthcare settings include medication reminders for elderly patients, health monitoring and data collection, and educational activities in classrooms and laboratories [
72]. Robots in these environments must navigate precisely around medical equipment, patient beds, and healthcare staff, while strictly maintaining sterility and safety protocols [
73]. Platforms such as “Marvin”, an omni-directional assistant for domestic environments, are designed to support elderly monitoring and remote presence [
74]. Additionally, robots must recognize and respond appropriately to human presence, avoid disrupting workflows, and always ensure patient privacy.
In residential care, robots increasingly provide emotional support and assistance with activities of daily living for elderly individuals [
22]. The deployment of heterogeneous mobile robot fleets in hospitals, such as Tartu University Hospital, demonstrates the practical impact of these systems: robots perform time-critical object transportation tasks, moving samples from intensive care units (ICUs) to hospital laboratories through crowded and narrow hallways [
75]. The development of usable autonomous mobile robots requires careful consideration of user needs, environmental constraints, and task-specific requirements [
21]. Social robots in hospitals play roles ranging from patient companionship to healthcare delivery. They generally yield high user satisfaction when equipped with multimodal communication and personalized behaviors that mitigate occasional user fear or frustration [
76,
77].
3.2.2. Laboratory Environments
Laboratory environments introduce considerable complexity due to the presence of expensive and sensitive equipment, hazardous chemicals, and strict procedural requirements [
3]. Robots operating in laboratories must handle delicate instruments accurately, maintain contamination control, navigate safely around staff and equipment, and adapt to dynamic experimental setups [
64]. Integration with laboratory management systems, adherence to precise spatial paths, and responsiveness to dynamic workspace changes are essential to ensure both safety and workflow efficiency.
Multi-floor labware transportation systems such as the MOLAR Automated Guided Vehicles (AGVs) developed at the Center for Life Science Automation (CELISCA) execute complex workflows, moving labware and materials between workbenches located on different floors [
78,
79]. These robots are integrated with laboratory infrastructure, including automatic doors and elevator operation [
64,
78,
79]. Research also addresses methods to correct unexpected localization errors to maintain operational safety [
80]. Similarly, the H2O robot at CELISCA uses StarGazer sensors to navigate via ceiling landmarks and employs hybrid elevator-controlling strategies that combine robot arm manipulation and wireless control [
65,
79]. For both systems, social navigation and interaction with humans are critical HRI components.
The mobile robot Kevin was specifically designed to handle the transportation of labware, relieving highly trained laboratory staff from logistical duties [
81]. With a non-intimidating height (100–160 cm) and organic shapes, Kevin uses a multimodal interface comprising lights, speakers, and a tablet to communicate its status and movement intentions to non-technical personnel. User studies indicate that a “medium” communication level that provides targeted, concise feedback is preferred over continuous signaling to avoid information overload [
81].
Clinical specimen delivery systems such as the Proxie robot, a mobile collaborative robot (cobot) piloted at Mayo Clinic Laboratories, autonomously move existing carts containing laboratory specimens between pathology stations. This significantly reduces staff effort while maintaining trust through a stable architecture and subtle visual cues, such as expressive “eyes,” which help users feel confident working alongside it. Its “Scout Sense” captures the environment from a human-like eye level to ensure situational awareness, while “Glide 360” mobility allows it to move naturally and intuitively around people. The robot prioritizes safe interaction and uses adaptive AI to learn and harmonize with human workflows, aiming to assist rather than replace staff [
82].
Frameworks such as LAPP (Laboratory Automation Plug & Play) further extend flexibility in pharmaceutical and general laboratory automation by enabling mobile manipulators with vision systems to “learn” device poses. This approach aims to simplify the integration of devices from different vendors into end-to-end automation systems [
66].
In addition, sophisticated HRI and safety mechanisms are implemented in shared spaces. Mobile systems employ multi-layer smart collision avoidance using Kinect sensors for the real-time recognition of dynamic human face orientation (classified using LVQ neural networks) or specific hand gesture commands (classified using SVM) to receive direct navigational guidance from personnel in narrow zones [
65]. Methods to communicate robot intent include projected visual signals, LEDs, speakers, and wearable haptic devices, which improve both predictability and user trust [
59].
Research Platforms like the Pioneer 3-DX and TIAGo mobile robots are widely used to study human safety perception and trust [
38]. TIAGo, with a semi-humanoid form, an adjustable height, movable head, and multimodal interaction capabilities, supports natural communication via speech, gesture recognition, touch, and emotion detection, allowing safe collaboration in research and healthcare [
83].
Overall, indoor mobile robots must balance operational precision, procedural compliance, infrastructure integration, and safe social interaction. Each application, from labware transport to specimen delivery, imposes unique technical and social requirements. Successful deployment depends on integrating perception, socially intelligent navigation, and natural Human–Robot Interaction to meet both safety and user acceptance standards.
3.2.3. Office Environments
Office environments are socially structured but less safety-critical than healthcare or laboratory settings. Robots in offices must navigate shared workspaces while respecting professional etiquette, meeting norms, hierarchical structures, and team dynamics. Typical tasks include document delivery, providing information services, visitor guidance, and telepresence. The interaction design must minimize disruption and maintain low-friction communication, allowing users to focus on work rather than managing the robot [
84].
To assist individuals in office spaces, several research efforts have demonstrated effective HRI solutions. Iida and Abdulali proposed a telepresence robot implementing the DEtect, TRAck and FOllow (DETRAFO) algorithm [
28] to enable the intuitive tracking and following of users. Ngo and Nguyen developed a cost-effective, on-device natural language command navigation system for mobile robots in indoor environments such as offices, enabling users to communicate goals efficiently and effectively via natural commands [
85]. Additionally, Balcı and Poncelet introduced movable robotic furniture in shared office spaces to modulate human–human interaction, avoid distraction, and make spontaneous interaction more meaningful, thereby improving overall workflow efficiency [
86,
87].
3.2.4. Industrial Settings
Industrial indoor environments focus on productivity enhancement, repetitive task execution, and safe human–robot collaboration. Cao and Tam proposed a search and fetch operation for mobile manipulation robots that leverages multimodal Human–Robot Interaction via gesture, voice, and face recognition. This can address the persistent labor shortage in roles that require complex, repetitive tasks in indoor environments such as manufacturing factories [
12]. Colceriu and Theis examined human-centric GUI designs for mobile cobots to increase their potential for assembly work in industry settings [
88]. Huy and Vietcheslav further developed a novel interface framework for Human–Robot Interaction in industry, employing a laser-writer in combination with a see-through head-mounted display using augmented reality and spatial augmented reality to securely exchange information. They also introduced a novel handheld device enabling multiple input modalities, allowing users to interact with mobile robots efficiently [
89].
These approaches aim to enhance productivity and safety in harsh or hazardous environments where robots can take over risky jobs [
90]. HRI thus plays a central role in improving operational efficiency while maintaining safety in complex indoor construction sites. In such settings, robots work alongside human teams, supporting physically demanding operations, facilitating task communication, and adhering to strict safety protocols [
91]. Compared to healthcare or laboratory environments, emotional intelligence and social or companion functions are largely unnecessary. The focus is instead on task performance, reliability, and predictable interaction.
3.2.5. Residential Homes
Residential environments introduce substantial social and behavioral challenges. Robots must navigate shared spaces occupied by multiple individuals with varying routines, preferences, and technical skills [
92]. They are required to interpret informal social cues, avoid interfering with family interactions, and respect privacy, including restricted access to certain rooms and protection of personal data. In addition to assistance tasks, residential robots frequently perform companion roles, providing emotional support, engagement through conversation or play, and help with daily routines, particularly for elderly or disabled residents [
93].
Monitoring and surveillance tasks are also common in these settings, including autonomous patrolling, security enforcement, and environmental monitoring [
94]. While these applications are socially interactive and assistive [
57,
95,
96], they raise critical privacy and ethical considerations, as robots often collect sensitive data about people and spaces. Designers must carefully implement robust data protection, consent mechanisms, and transparent reporting to ensure that the robot’s presence does not infringe on individual rights or create distrust. Social and companion roles are increasingly relevant in residential care, home environments, and facilities supporting vulnerable populations [93]. These applications demand the most sophisticated HRI capabilities, including emotion recognition, adaptive personalization, and the ability to build long-term relationships with users. Robots must interpret subtle social cues, respond appropriately to changing moods, and foster trust over repeated interactions, creating a sense of presence and companionship that extends beyond functional task performance. Examples of residential robots include the Astro platform from Amazon, which supports convenience, security monitoring, and remote care for relatives at home [97]. Other robots, such as ZERITH H1 and Loki (Loki Robotics), are used for housekeeping and toilet cleaning in homes, hotels, and offices [98,99]. Unlike laboratory or industrial robots, residential robots prioritize social acceptance, privacy preservation, adaptability to unstructured environments, and long-term relational interaction.
3.3. Synthesis
Although indoor mobile robots share foundational capabilities such as navigation, perception, and infrastructure integration, the nature and complexity of Human–Robot Interaction (HRI) differ substantially across domains. These differences arise from variations in environmental structure, user expectations, task criticality, and social context.
In industrial settings, HRI is primarily task-oriented and performance-driven. Interaction focuses on clear command input, predictable system responses, and compliance with safety protocols. Multimodal interfaces (e.g., gesture, voice, GUI, augmented reality) are designed to increase efficiency and reduce cognitive load during repetitive or hazardous operations. Social expressiveness and emotional intelligence are largely unnecessary; instead, transparency, reliability, and unambiguous intent communication are central. The human operator remains goal-directed, and interaction serves operational optimization.
Similarly, laboratory environments require highly structured and controlled interaction. Here, HRI must support precision, traceability, and procedural compliance. Communication is typically concise and purpose-specific, avoiding distraction in cognitively demanding research settings. Robots must signal movement intentions clearly to ensure safety in confined spaces, but excessive social signaling can reduce usability. Trust is established through accuracy, predictability, and seamless integration into laboratory workflows rather than through expressive or companion-like behavior. Thus, HRI in laboratories is functional, minimally intrusive, and tightly coupled to workflow reliability.
In healthcare environments, HRI becomes significantly more complex. Robots interact not only with trained professionals but also with patients, elderly individuals, and visitors. Consequently, interaction must combine clinical precision with social sensitivity. Speech recognition, gesture interpretation, contextual awareness, and user modeling are required to personalize assistance and accommodate varying cognitive and physical abilities. Emotional sensitivity, privacy protection, and trust-building are critical. Unlike in laboratories or industry, interaction failures may directly affect well-being or patient confidence. HRI must therefore be adaptive, transparent, and ethically grounded.
Office environments occupy an intermediate position. While less safety-critical, they are socially structured and norm-sensitive. HRI must be low-friction, socially compliant, and minimally disruptive. Robots should respect proxemics, meeting etiquette, and hierarchical dynamics. Interaction is often informational (e.g., navigation guidance, telepresence, task delivery) and must integrate seamlessly into daily routines. Compared to healthcare, emotional engagement is less central; compared to laboratories, social appropriateness carries greater weight than strict procedural precision.
The most demanding HRI requirements emerge in residential environments. Homes are socially dynamic, informal, and privacy-sensitive spaces with heterogeneous users, including children, elderly individuals, and guests. Robots must interpret subtle social cues, adapt to changing routines, and avoid interfering with family interactions. In companion or assistive roles, HRI must support emotion recognition, adaptive personalization, and long-term relationship building. Trust formation, consent management, and data transparency are not peripheral concerns but central design constraints. Unlike industrial or laboratory systems, residential robots are evaluated as much on relational quality and perceived presence as on task performance.
Across domains, several cross-cutting HRI dimensions can be identified:
Intent Communication: Required everywhere, but ranging from purely functional signaling (industry, labs) to socially expressive behavior (homes, healthcare).
Adaptivity: Minimal in highly structured environments; essential in healthcare and residential settings.
User Modeling: Optional in industrial contexts; critical in long-term residential or elderly care scenarios.
Emotional Intelligence: Marginal in productivity-driven domains; central in companion-oriented applications.
Trust Formation: Performance-based trust dominates in laboratories and industry, while relational and privacy-based trust becomes decisive in homes and healthcare.
In summary, indoor mobile robotics does not present a uniform HRI problem. Instead, each domain shifts the balance between efficiency, safety, social intelligence, emotional responsiveness, and ethical safeguards. Successful HRI design therefore requires domain-specific prioritization layered upon a shared technical foundation (
Table 3 and
Table 4).
4. The Technical Aspect of HRI
Mobile robots in indoor environments typically perform tasks through three major steps: perception, navigation, and interaction. Among these, the Human–Robot Interaction (HRI) component is central when a robot executes tasks on behalf of or in collaboration with humans. Effective HRI ensures safety, comfort, and efficient task execution, forming the core of human-centered robot behavior (
Figure 4).
Before explicit interaction begins, implicit cues during navigation already constitute a form of interaction. For instance, human-aware navigation communicates the robot’s intent, enhancing coordination in shared spaces and providing humans with a sense of safety [
34,
54,
60,
100]. Similarly, interaction modalities such as speech, gesture, touch, and visual cues shape how humans understand and guide the robot’s actions.
While many studies have catalogued individual methods and interaction techniques, a structured, analytical synthesis of these approaches is still lacking. Current research often focuses on isolated methods without systematically comparing their robustness, computational requirements, user cognitive load, or adaptability to real-world indoor environments. This gap motivates the present chapter, which aims not only to summarize existing work but also to critically evaluate methods, highlight emerging trends, and identify open challenges and research gaps.
In the following sections, the chapter is organized to support this analytical perspective:
Human-aware navigation (
Section 4.1)—methods for safe, legible, and socially compliant navigation, including a comparison of classical and learning-based approaches.
Interaction modalities (
Section 4.2)—detailed treatment of speech, gesture, touch, visual/gaze, and multimodal systems, with structured comparison tables, evaluation of performance trade-offs, and integration of recent advances such as LLMs and VLMs.
Cross-cutting synthesis (
Section 4.3)—an overview of trends, methodological patterns, and critical limitations across modalities.
By explicitly combining descriptive and analytical perspectives, this chapter aims to move beyond a simple catalog of HRI techniques and provide a critical, structured review of the current state of the art in indoor mobile robotics.
4.1. Human-Aware Navigation
Before explicit interaction begins, implicit interaction already occurs through the navigational behavior of the robot. Human-aware navigation aims to give humans a perception of safety, communicate the robot’s intent, and facilitate coordination in shared workspaces [
34,
53,
60,
100]. Design principles include expressive movement, legibility, comfort, and adherence to social norms in human-populated environments [
34].
Table 5 represents four approaches to social navigation.
The human-centric nature of perception and navigation in indoor mobile robots can be seen in
Figure 5.
While all four approaches enable human-aware navigation, they differ in robustness, adaptability, computational requirements, and real-world applicability. Reactive approaches are simple and fast but limited in dynamic environments, whereas predictive and model-based approaches improve coordination and social compliance at the cost of higher computational effort. Learning-based methods provide adaptive and socially compliant behaviors but require extensive training and sensor integration. Understanding these trade-offs is crucial for selecting navigation strategies tailored to specific indoor environments and user populations.
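To make the model-based category concrete, the sketch below shows how a planner can penalize grid cells near detected people with an asymmetric Gaussian personal-space cost, a common proxemics-inspired formulation; all parameter values and data structures here are illustrative assumptions rather than settings from any cited system.
```python
import numpy as np

def social_cost(px, py, person, sigma_front=1.2, sigma_side=0.6, sigma_back=0.8):
    """Asymmetric Gaussian proxemic cost around a detected person.
    `person` holds a position (x, y) and heading theta in radians; the cost
    decays more slowly in front of the person, discouraging the robot from
    cutting across their path. All sigmas (metres) are illustrative."""
    dx, dy = px - person["x"], py - person["y"]
    c, s = np.cos(person["theta"]), np.sin(person["theta"])
    fx = c * dx + s * dy           # longitudinal offset in the person's frame
    fy = -s * dx + c * dy          # lateral offset
    sigma_x = sigma_front if fx >= 0.0 else sigma_back
    return float(np.exp(-(fx**2 / (2 * sigma_x**2) + fy**2 / (2 * sigma_side**2))))

def cell_cost(px, py, people, base_cost=0.0, weight=100.0):
    """Total planner cost for one grid cell: static cost plus social penalties."""
    return base_cost + weight * sum(social_cost(px, py, p) for p in people)
```
A planner that minimizes this cost naturally passes behind people rather than in front of them, which is one way legibility and comfort constraints enter the navigation stack.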
Lasota et al. survey safety strategies for close-proximity collaboration [
54], and these principles are integrated into collision avoidance and navigation systems. One of the primary limitations is space: indoor environments often feature narrow corridors, cluttered rooms, and dynamic obstacles, requiring compact, agile robots with highly precise navigation that preserve the safety and trust of their human counterparts without becoming obstacles in their path [
55]. Robots must maneuver around furniture, equipment, and humans safely and without disruption. Power management represents another critical challenge [
101]. Many robots, especially in healthcare or laboratory settings, must operate continuously for extended periods, often 8–12 h, without frequent charging [
102]. This necessitates efficient energy consumption, optimized motion planning, and, in some cases, the ability to autonomously dock and recharge. Advanced power systems, lightweight materials, and energy-efficient components are therefore essential design considerations. Human tracking and following is another key capability of assistive robots. Multisensor-based human detection and tracking systems combine data from cameras, depth sensors, and other modalities to achieve robust performance in crowded environments [
103]. Intelligent mobility-assistance robots leverage multimodal sensory processing to provide safe and effective support for users with mobility impairments [
104]. RGB-D sensor-based real-time people detection and mapping systems enable mobile robots to maintain awareness of human positions and movements [
105], helping overcome the limitations of individual modalities and providing redundancy in case of sensor failures. A typical human tracking and following system is presented in
Figure 6.
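As a complement to Figure 6, the following minimal controller sketch shows the last stage of such a pipeline: once the tracker supplies the person’s distance and bearing, simple proportional terms with saturation keep the robot at a comfortable following distance. Gains, limits, and the safety radius are illustrative assumptions, not values from a cited system.
```python
def follow_command(distance_m, bearing_rad, target_dist=1.2,
                   k_lin=0.6, k_ang=1.5, v_max=0.7, w_max=1.0):
    """Proportional person-following: close the range error while turning to
    keep the person centred. Returns (linear m/s, angular rad/s) commands."""
    v = max(-v_max, min(v_max, k_lin * (distance_m - target_dist)))
    w = max(-w_max, min(w_max, k_ang * bearing_rad))
    if distance_m < 0.6:       # never advance inside the personal-space radius
        v = min(v, 0.0)
    return v, w
```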
Simultaneous Localization and Mapping (SLAM) is essential for indoor navigation. Robots must accurately map unknown or changing environments while simultaneously tracking their position in real-time. SLAM systems must remain robust in dynamic spaces with moving obstacles, variable lighting, and crowded conditions [
73,
106]. High-precision mapping supports both task execution and safe interaction with humans. Modern indoor robots employ a variety of sensors, including LiDAR, RGB-D cameras, ultrasonic sensors, and inertial measurement units (IMUs), which must operate reliably across diverse indoor conditions [
107]. Collaborative mapping approaches enable multiple robots to work together in constructing environmental representations, improving coverage and efficiency [
Advanced scan matching techniques exploit dominant directional features to improve localization accuracy [
109], while integrating Ultra-Wideband (UWB) technology with traditional SLAM systems further enhances indoor positioning accuracy and human avoidance [
110].
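The idea behind UWB-aided localization can be illustrated with a deliberately simplified blend of drifting odometry and absolute UWB fixes; production systems would use an EKF or pose-graph optimization instead, and the blending weight below is an arbitrary assumption.
```python
def fuse_position(odom_xy, uwb_xy, alpha=0.9):
    """Complementary blend: trust smooth odometry short-term while letting
    absolute UWB fixes slowly cancel accumulated drift."""
    return tuple(alpha * o + (1.0 - alpha) * u for o, u in zip(odom_xy, uwb_xy))

# e.g. fuse_position((4.02, 1.37), (4.20, 1.30)) -> (4.038, 1.363)
```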
The evolution from reactive to learning-based navigation reflects a broader trend toward adaptive, socially aware, and context-sensitive navigation. Multisensor fusion, predictive path planning, and collaborative mapping are increasingly standard. However, open challenges remain:
Robust operation in extremely dense or highly dynamic indoor spaces.
Trade-offs among computational cost, real-time performance, and social compliance.
Long-duration operation under constrained energy resources.
Seamless integration with multimodal interaction modalities to ensure cohesive HRI.
Adaptability to diverse user populations with varying mobility and cognitive abilities.
Addressing these gaps is critical for the next generation of indoor mobile robots capable of safe, efficient, and socially compliant operation in real-world human environments.
4.2. Interaction Modalities
The effectiveness of indoor mobile robots depends critically on the design and implementation of appropriate interaction modalities, as these determine how intuitively, efficiently, and safely humans can communicate with and control robotic systems. Unlike traditional human–computer interfaces, which often rely on static screens or input devices, HRI in mobile robotics must account for the dynamic and content-rich nature of human-centered environments. This includes managing spatial relationships, supporting natural and multimodal communication, adapting to human mobility and movement patterns, and responding to real-time environmental changes that affect both robot behavior and human expectations [
111].
Figure 7 depicts the explicit and implicit channels of interaction typically used in indoor mobile robotics. These channels span speech, gesture, touch, visual/gaze, and multimodal approaches, forming the basis for safe, effective, and socially aware interactions.
Domain-specific studies provide valuable insights into user expectations and interaction requirements. Reviews of social robots in classrooms highlight evidence for effective engagement and learning outcomes [
61], while research in autism-related therapy demonstrates the potential of robots to support specialized interventions [
70]. Similarly, studies of service robots in home environments reveal how everyday adaptations, user preferences, and environmental constraints shape interaction design [
4]. These findings emphasize that no single interaction modality suffices across all contexts. Instead, interaction systems must be selected and adapted based on the task, user population, and environmental characteristics. The following
Section 4.2.1,
Section 4.2.2,
Section 4.2.3,
Section 4.2.4 and
Section 4.2.5 provide a detailed examination of each major modality—speech, gesture, touch, visual/gaze, and multimodal systems—highlighting their technical characteristics, comparative strengths, current trends, and remaining limitations. By systematically evaluating each modality, we aim to provide a critical and structured synthesis rather than a simple catalog of existing methods.
4.2.1. Speech-Based Interaction
Speech-based interaction represents one of the most natural communication modalities for humans and has been extensively studied in indoor mobile robotics [
112]. Voice interfaces enable hands-free operation, which is particularly valuable in healthcare and laboratory environments where users’ hands may be occupied with other tasks [
113].
Automatic Speech Recognition (ASR) technology has matured to a level suitable for practical deployment, with modern systems achieving high accuracy even in noisy environments [
114]. Cloud-based ASR systems can further improve performance but raise privacy concerns in sensitive settings, leading to increased interest in on-device processing [
115].
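As an illustration of such on-device processing, the sketch below streams microphone audio into an offline recognizer (here the open-source Vosk library) so that no audio leaves the robot; the model path is a placeholder for whichever local model is installed.
```python
import json
import queue
import sounddevice as sd                   # microphone capture
from vosk import Model, KaldiRecognizer    # offline ASR: audio stays on-device

audio_q: "queue.Queue[bytes]" = queue.Queue()

def _capture(indata, frames, time_info, status):
    audio_q.put(bytes(indata))             # raw 16 kHz, 16-bit PCM chunks

model = Model("vosk-model-small-en-us")    # placeholder path to a local model
recognizer = KaldiRecognizer(model, 16000)

with sd.RawInputStream(samplerate=16000, blocksize=4000,
                       dtype="int16", channels=1, callback=_capture):
    while True:
        if recognizer.AcceptWaveform(audio_q.get()):   # utterance finished
            text = json.loads(recognizer.Result()).get("text", "")
            if text:
                print("heard:", text)      # hand off to the NLP/intent stage
```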
Figure 8 illustrates the process of a typical speech-based interaction with distinct components such as ASR, NLP, and TTS.
Natural Language Processing (NLP) allows robots to understand complex commands and engage in contextual conversations [
116]. Large Language Models (LLMs) are increasingly integrated into robotic systems, enabling sophisticated dialogue management and intent understanding [
117]. Cognitive instruction interfaces leverage natural language understanding to provide intuitive robot navigation commands [
118], while recent advances in grounding implicit goal descriptions allow robots to interpret ambiguous spatial references through recursive belief updates [
119]. Overall, LLM integration is emerging as a powerful approach to enhance the comprehension of complex verbal instructions [
120].
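A minimal sketch of LLM-based intent grounding is given below: the utterance is embedded in a constrained prompt, and the reply is validated against a whitelist of actions before anything is executed. The `llm_complete` callable is a placeholder for any local or cloud completion endpoint, not a specific vendor API.
```python
import json

INTENT_PROMPT = """You are the command interpreter of an indoor mobile robot.
Map the user's utterance to JSON of the form {"action": ..., "target": ...}.
Allowed actions: goto, follow, fetch, stop. Respond with JSON only.
Utterance: "{utterance}" """

ALLOWED = {"goto", "follow", "fetch", "stop"}

def parse_command(utterance: str, llm_complete) -> dict:
    """`llm_complete` is any text-completion callable (placeholder)."""
    raw = llm_complete(INTENT_PROMPT.replace("{utterance}", utterance))
    try:
        intent = json.loads(raw)
    except json.JSONDecodeError:
        return {"action": "clarify"}       # unparseable reply: ask the user
    if intent.get("action") not in ALLOWED:
        return {"action": "clarify"}       # never execute unvalidated actions
    return intent
```
Constraining the output format and validating it before execution is one simple way to keep a probabilistic language model from issuing commands the platform cannot safely perform.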
The third component of speech interaction is text-to-speech (TTS). Recent developments in speech synthesis allow robots to provide natural-sounding feedback and convey emotional states [
121]. Paralinguistic features such as tone, pace, and volume can communicate robot intentions and emotions, enhancing the overall user experience [
122]. For secure interactions, voice biometric authentication ensures that only recognized personnel can control the robot, even when multiple users are present (
Figure 9).
Despite these advances, speech-based interaction in indoor environments faces several technical and practical challenges. Ambient noise, multiple speakers, and acoustic reverberation can degrade recognition performance [
123]. Cultural and linguistic diversity requires multilingual support and accent adaptation [
124]. Privacy concerns are particularly critical when voice data is processed or stored in healthcare settings [
125].
Table 6 summarizes the main approaches to speech-based interaction, including classical keyword/grammar-based methods and their key performance characteristics, evaluation environments, metrics, and identified limitations. This structured comparison highlights the trade-offs and gaps in existing methods, motivating the integration of LLM and VLM approaches discussed below.
Recent advances in Large Language Models (LLMs) and vision-language models (VLMs) are fundamentally transforming Human–Robot Interaction. Unlike traditional command-based speech recognition, LLMs enable context-aware reasoning, natural language understanding, and adaptive dialogue, allowing robots to interpret complex user intents and interact in more flexible, human-like ways. Compared to classical interaction modalities such as command-based speech, gesture, or gaze, LLM-driven approaches provide enhanced adaptability, a richer semantic understanding, and the ability to integrate multimodal input streams. This paradigm shift allows indoor mobile robots to operate more autonomously in dynamic environments while improving the naturalness of human–robot collaboration.
Despite these capabilities, deploying LLMs on mobile robots introduces several practical challenges. Onboard inference can be limited by hardware constraints, whereas cloud-based inference introduces network latency and potential reliability concerns. Safety-critical tasks require a careful evaluation of decision timing, error propagation, and fallback mechanisms to ensure robust interaction in real-world environments.
To systematically analyze these methods, we evaluate LLM- and VLM-based approaches using the same criteria applied to classical modalities, including computational cost, robustness to environmental noise, user cognitive load, and hardware requirements. This comparative perspective highlights the trade-offs among performance, deployability, and safety and informs the selection of appropriate speech-based interaction strategies in indoor mobile robotics.
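One way to realize the fallback mechanisms discussed above is sketched below: safety words are handled by an on-device grammar before any network call, the cloud model is given a fixed latency budget, and errors or timeouts degrade to local parsing. `cloud_llm` and `local_grammar` are placeholder interfaces, not components of any cited system.
```python
import concurrent.futures as cf

_pool = cf.ThreadPoolExecutor(max_workers=1)

def interpret(utterance, cloud_llm, local_grammar, timeout_s=1.5):
    """Cloud-first intent parsing with an on-device fallback."""
    if local_grammar.matches_safety_word(utterance):   # e.g. "stop", "halt"
        return {"action": "stop"}                      # never wait on the network
    future = _pool.submit(cloud_llm, utterance)
    try:
        return future.result(timeout=timeout_s)        # stay within latency budget
    except Exception:                                  # timeout, network, API error
        future.cancel()
        return local_grammar.parse(utterance)          # degraded but reliable
```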
Table 7 summarizes the main LLM- and VLM-based approaches to speech HRI, highlighting their evaluation settings, metrics, strengths, and limitations. Compared to classical methods (
Table 6), these approaches offer a richer context understanding, multimodal integration, and greater flexibility, but require the careful consideration of computational cost, latency, and safety-critical constraints.
For a quick comparative overview,
Table 8 summarizes key differences across computational costs, robustness, user load, hardware requirements, evaluation metrics, strengths, and limitations, highlighting the evolution from traditional to modern methods.
This comparative overview highlights the trade-offs between classical and modern LLM/VLM-based speech interaction methods, emphasizing where traditional approaches fall short and motivating the integration of context-aware, multimodal models to improve robustness, flexibility, and user experience in indoor mobile HRI.
4.2.2. Gesture-Based Interaction
Gesture-based interaction provides an intuitive and natural communication channel that can complement or substitute for speech in various scenarios [
133]. Hand and arm gestures can convey spatial information, directional commands, and social signals that are difficult to express verbally [
134].
Vision-based gesture recognition systems primarily rely on RGB cameras, depth sensors, or combinations thereof to capture and interpret human movements [
135]. These systems must operate in real-time while remaining robust to variations in lighting, occlusions, user differences, and environmental clutter [
136]. A typical gesture-recognition pipeline combines spatial understanding (e.g., MediaPipe Holistic [
137]) with temporal modeling via recurrent networks such as LSTMs to interpret dynamic gestures (
Figure 10).
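A minimal version of this pipeline is sketched below: MediaPipe Holistic extracts hand landmarks per frame, and a small LSTM classifies a sliding window of frames into gesture classes. The feature layout, window length, and class count are illustrative assumptions, and the network would need to be trained on a labeled gesture dataset.
```python
import numpy as np
import torch
import torch.nn as nn
import mediapipe as mp

mp_holistic = mp.solutions.holistic.Holistic(min_detection_confidence=0.5)

def frame_features(rgb_frame):
    """Flatten the 21 right-hand landmarks (x, y, z) from MediaPipe Holistic;
    returns zeros when no hand is visible in the frame."""
    result = mp_holistic.process(rgb_frame)       # expects an RGB numpy image
    if result.right_hand_landmarks is None:
        return np.zeros(63, dtype=np.float32)
    pts = result.right_hand_landmarks.landmark
    return np.array([[p.x, p.y, p.z] for p in pts], dtype=np.float32).flatten()

class GestureLSTM(nn.Module):
    """Temporal classifier over a sliding window of landmark frames."""
    def __init__(self, n_gestures=5, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=63, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_gestures)

    def forward(self, x):                  # x: (batch, frames, 63)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])       # gesture logits from the last step

# At runtime: stack e.g. 30 consecutive frame_features into a (1, 30, 63)
# tensor and take the argmax of GestureLSTM()(window) after training.
```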
Simple gesture vocabularies have proven effective for basic robot control, including pointing gestures for navigation, hand signals for stop/start commands, and waving to attract attention [
138]. More complex gestures can convey emotional states, social intentions, and task-specific instructions [
139].
Spatial gesture recognition allows users to indicate locations, directions, and objects within the environment [
140]. Over time, gesture-based HRI methods have evolved from template- or rule-based systems toward machine learning and deep learning approaches. Machine learning-based systems enable robots to recognize subtle gestural cues and adapt to user-specific behaviors, supporting tasks such as mobility assistance [
141]. Deep learning methods, including LSTM or Transformer models, further enhance recognition accuracy and temporal understanding, enabling real-time gesture control in dynamic environments [
142]. Collaborative object handling between robots and human operators also benefits from sophisticated gesture recognition and motion prediction [
143], which is particularly valuable for mobile robots navigating to specific locations or interacting with environmental objects. Overall, the field shows a clear trend from fixed vocabularies to adaptive, learning-based systems, capable of interpreting complex gestures, social cues, and emotional states.
Table 9 provides a summary of gesture-based HRI methods, including suitable evaluation metrics and limitations.
Despite these advances, gesture-based interaction faces several challenges. Gesture ambiguity and cultural differences can lead to misinterpretation [
144], and environmental factors, such as lighting, background clutter, and camera positioning, strongly influence recognition reliability. Many systems require user training or adaptation to function robustly. Furthermore, computational cost and latency for real-time deep learning models can constrain deployment on mobile robots. Finally, integrating gesture recognition with other interaction modalities, such as speech or gaze, remains an open area of research to improve robustness, flexibility, and overall user experience.
4.2.3. Touch and Physical Interaction
Touch and physical interaction provide direct, immediate feedback channels that are particularly valuable for users with visual or auditory impairments [
145]. In contrast to speech or vision-based modalities, tactile interaction enables unambiguous, proximal control and often reduces the cognitive load, especially in safety-critical scenarios. Tactile interfaces on mobile robots include touchscreens, physical buttons, force-sensitive surfaces, and haptic feedback mechanisms.
Touchscreen interfaces offer familiar interaction paradigms from smartphones and tablets, enabling complex command input and information display [
146]. However, robot mobility introduces challenges. Screen visibility can be compromised during movement, accessibility may vary depending on the robot’s height and orientation, and hardware must withstand diverse indoor environments. Thus, while touchscreens support rich interaction, their usability is dependent on context.
The physical manipulation of robot components, such as guiding arm movement or adjusting robot positioning, enables direct spatial instruction [
147]. This modality is particularly useful for demonstrating desired behaviors, correcting robot actions, or guiding motion in shared workspaces. Compared to indirect modalities such as speech, physical interaction offers high precision and immediate feedback but requires close proximity and careful safety control.
Haptic feedback through vibration, force, or texture variation can provide the confirmation of user inputs and convey robot status information [
145]. Such feedback is especially important for users with visual impairments or in environments where visual attention must be directed elsewhere. In this sense, haptic feedback not only supports accessibility but also enhances situational awareness.
Safety considerations are paramount in the design of physical interactions. Force limitation, emergency stop mechanisms, and collision avoidance strategies must be implemented to prevent injury and ensure safe operation [
148]. The robot’s physical design must carefully balance interaction capabilities with user safety requirements. Authentication mechanisms, such as fingerprint identification, ensure that only authorized personnel can access certain touch-based functionalities.
Figure 11 represents a typical fingerprint-identification process.
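The force-limitation and emergency-stop logic mentioned above can be reduced to a watchdog pattern, sketched below with placeholder callables for the platform’s force sensor, e-stop switch, and motion interface; the threshold and loop rate are illustrative, not values taken from a safety standard.
```python
import time

FORCE_LIMIT_N = 25.0   # illustrative contact-force threshold, not a certified value

def safety_guard(read_force_n, estop_pressed, stop_motion, period_s=0.01):
    """Watchdog loop: command an immediate stop whenever the measured contact
    force exceeds the limit or the hardware e-stop is pressed. The three
    callables are placeholders for the platform's sensor and motion interfaces."""
    while True:
        if estop_pressed() or read_force_n() > FORCE_LIMIT_N:
            stop_motion()              # zero velocity / engage brakes
        time.sleep(period_s)           # ~100 Hz check rate, illustrative
```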
A clear trend can be observed from simple mechanical interfaces toward sensor-rich, force-aware systems that enable compliant and safe physical interaction. Modern systems increasingly integrate force-torque sensors, tactile skins, and adaptive control strategies to support shared control and collaborative behaviors. However, this evolution introduces trade-offs among the hardware complexity, safety certification requirements, and deployment cost.
Despite their intuitive nature, touch and physical interaction modalities face several limitations:
Requirement of physical proximity, limiting scalability and remote use.
Safety risks in case of control failure or excessive force.
Hardware wear and durability issues in high-frequency use scenarios.
Hygiene and contamination concerns in healthcare environments.
Limited expressiveness compared to speech or multimodal interaction.
Furthermore, the integration of physical interaction with other modalities (speech, gaze, gesture) remains an important research direction to enable seamless transitions between direct and indirect control paradigms.
Table 10 provides a summary of touch and physical interaction HRI methods, including evaluation metrics and limitations.
4.2.4. Visual and Gaze-Based Interaction
Visual interaction modalities leverage human visual perception and attention mechanisms to establish effective communication channels between humans and robots [
149]. Eye gaze patterns, facial expressions, and body language provide rich information about user intentions, emotional states, and situational context. These modalities complement speech and touch, enabling more nuanced and socially aware interaction.
Gaze-based interaction allows users to direct robot attention, indicate objects of interest, and provide spatial references [
150]. Eye-tracking technology integrated into mobile robots can detect gaze direction and duration, enabling attention-aware behaviors and improving task coordination in shared spaces.
Figure 12 depicts a typical gaze-based interaction pipeline.
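As one concrete step of such a pipeline, the sketch below resolves which known object a user is looking at by comparing the estimated gaze ray against object directions; the angular threshold is an illustrative assumption, and a deployed system would additionally require a dwell time before acting on the selection.
```python
import numpy as np

def gazed_object(gaze_origin, gaze_dir, objects, max_angle_deg=10.0):
    """Return the known object whose direction best matches the gaze ray,
    or None if nothing lies within the angular threshold. `objects` maps
    names to 3-D positions in the same frame as the gaze estimate."""
    g = np.asarray(gaze_dir, dtype=float)
    g /= np.linalg.norm(g)
    best, best_angle = None, max_angle_deg
    for name, pos in objects.items():
        v = np.asarray(pos, dtype=float) - np.asarray(gaze_origin, dtype=float)
        cos_a = np.clip(np.dot(g, v / np.linalg.norm(v)), -1.0, 1.0)
        angle = np.degrees(np.arccos(cos_a))
        if angle < best_angle:
            best, best_angle = name, angle
    return best

# e.g. gazed_object((0, 0, 1.6), (0.0, 1.0, -0.2),
#                   {"cup": (0.1, 2.0, 1.2), "door": (3.0, 4.0, 1.0)})
```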
Facial expression recognition enables robots to assess user emotional states and adapt their behavior accordingly [
151], which is particularly valuable in healthcare, social robotics, and assistive environments. Face recognition additionally provides a secure mechanism for user identification, as illustrated in
Figure 13.
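The front end of both expression recognition and face identification typically begins with face detection. The sketch below uses OpenCV's bundled Haar-cascade detector for this step; the expression classifier is deliberately left as a placeholder for whatever trained model a given system employs.

```python
import cv2

# OpenCV ships pre-trained Haar cascades; this one detects frontal faces.
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def classify_expression(face_crop):
    """Placeholder: a real system would run a trained classifier here and
    return a label such as 'neutral', 'happy', or 'distressed'."""
    return "neutral"

def detect_expressions(frame_bgr):
    """Return [(bounding_box, expression_label), ...] for each detected face."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    results = []
    for (x, y, w, h) in faces:
        label = classify_expression(frame_bgr[y:y + h, x:x + w])
        results.append(((x, y, w, h), label))
    return results

# Usage, e.g.: detect_expressions(cv2.imread("frame.png"))
```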
Visual displays on robots, including LED patterns, screens, and expressive body language, can convey information, intentions, and emotional states to users [
58]. Moreover, robot positioning and orientation communicate social intentions and respect personal space, helping to maintain appropriate distances according to proxemic norms [
152]. Effective visual interaction thus integrates perception, expressive signaling, and socially aware navigation.
Despite these advantages, visual and gaze-based modalities face several challenges. Variations in lighting, occlusion, and user mobility can degrade recognition performance. Cultural differences in gestural or gaze meaning require adaptive interpretation. Privacy concerns are also significant when capturing or processing visual data, particularly facial information used for recognition.
Visual and gaze-based HRI methods can be grouped into several categories:
Gaze Tracking Systems—Eye trackers and camera-based systems for attention-aware behavior and spatial referencing [
150].
Facial Expression Recognition—Classifiers (traditional or deep learning-based) to infer user affective states and guide robot responses [
151].
Face Recognition and Biometric Access—Security and user identification using facial data [
58].
Visual Feedback Displays—LEDs, expressive screens, or robot motion to communicate information and robot intent [
58,
152].
A clear trend emerges toward deep learning-based and multimodal visual interpretation, combining gaze, facial expressions, and robot positioning to support context-aware behavior. Integrating these systems with other modalities (speech, gesture, touch) enhances robustness, adaptivity, and user experience.
Table 11 provides a comprehensive summary of visual and gaze-based HRI methods including their approaches, metrics, and limitations.
4.2.5. Multimodal and Adaptive Systems
Multimodal interaction systems combine multiple input and output modalities to create more robust and flexible communication channels [
153]. These systems can adapt to user preferences, environmental conditions, and task requirements by dynamically selecting appropriate interaction modalities. Multimodal human–robot interfaces that combine speech with other modalities have proven effective for remote robot operation [
154]. Sensor fusion techniques combine data from speech, vision, touch, and other sensors to improve recognition accuracy, robustness, and overall system reliability [
155].
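A simple and widely used instance of such fusion is confidence-weighted late fusion, in which each modality's recognizer votes for an intent and the votes are combined with per-modality reliability weights. The sketch below illustrates the idea; the weights and example hypotheses are assumptions, and in practice the weights are tuned or learned from validation data.

```python
from collections import defaultdict

# Static reliability priors per modality (illustrative values).
MODALITY_WEIGHTS = {"speech": 0.5, "gesture": 0.3, "gaze": 0.2}

def fuse_intents(hypotheses):
    """hypotheses: list of (modality, intent_label, confidence in [0, 1]).
    Returns the intent with the highest weighted confidence mass."""
    scores = defaultdict(float)
    for modality, intent, conf in hypotheses:
        scores[intent] += MODALITY_WEIGHTS.get(modality, 0.0) * conf
    return max(scores, key=scores.get) if scores else None

# Speech is noisy here, but gesture and gaze agree -> "fetch_cup" wins.
print(fuse_intents([
    ("speech", "stop", 0.40),
    ("gesture", "fetch_cup", 0.85),
    ("gaze", "fetch_cup", 0.90),
]))
```

Here, agreement between two moderately reliable modalities outweighs a single noisy speech hypothesis, which is exactly the robustness benefit fusion is intended to provide.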
Adaptive interaction systems learn user preferences and adjust their behavior over time [
87]. Machine learning algorithms optimize interaction strategies based on user feedback, task success rates, and environmental conditions. Context-aware systems consider environmental factors, user states, and task requirements when selecting interaction modalities [
156]. For example, a robot may switch from speech to visual display in noisy environments or use gesture recognition when users’ hands are free.
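Such context-aware switching can be expressed, in its simplest form, as a rule-based policy over a small context state. The sketch below is a minimal illustration of that pattern; the noise threshold and context fields are assumptions, and learned selection policies are an increasingly common alternative.

```python
from dataclasses import dataclass

@dataclass
class InteractionContext:
    ambient_noise_db: float
    user_hands_free: bool
    user_in_view: bool

def select_output_modality(ctx: InteractionContext) -> str:
    """Pick how the robot communicates, given the current context."""
    if ctx.ambient_noise_db > 70.0 and ctx.user_in_view:
        return "visual_display"      # speech would be drowned out
    return "speech"

def select_input_modality(ctx: InteractionContext) -> str:
    """Pick how the robot listens for commands."""
    if ctx.ambient_noise_db > 70.0 and ctx.user_hands_free:
        return "gesture"             # ASR unreliable, but the user's hands are available
    return "speech"

ctx = InteractionContext(ambient_noise_db=78.0, user_hands_free=True, user_in_view=True)
print(select_output_modality(ctx), select_input_modality(ctx))  # visual_display gesture
```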
The integration of multiple modalities requires sophisticated fusion algorithms and decision-making frameworks [
157]. Temporal synchronization, conflict resolution, and priority management are critical considerations in designing robust multimodal HRI systems. Despite these advances, challenges remain in latency, computational cost, sensor calibration, and safety-critical decision making, especially in real-world, dynamic environments.
Table 12 provides a comprehensive comparison of interaction modalities across key performance dimensions, highlighting the complementary nature of different approaches and the importance of multimodal integration for optimal user experience.
Recent advances in vision-language models (VLMs) have enabled multimodal systems to integrate visual perception with natural language understanding, allowing robots to interpret user commands and environmental cues simultaneously. For example, a robot can identify objects in its environment using VLM-based perception such as Contrastive Language-Image Pretraining (CLIP) [
132] while understanding instructions provided in natural language through LLM reasoning.
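As a concrete illustration of CLIP-style open-vocabulary perception, the sketch below ranks candidate object labels against an image crop using the publicly available CLIP checkpoint via the Hugging Face transformers library. The label list and prompt template are illustrative choices rather than part of any cited system.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def identify_object(image: Image.Image, candidate_labels):
    """Zero-shot recognition: rank free-form text labels against an image crop."""
    prompts = [f"a photo of a {label}" for label in candidate_labels]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image   # shape: (1, num_labels)
    probs = logits.softmax(dim=-1)[0]
    return candidate_labels[int(probs.argmax())], float(probs.max())

# Usage, e.g.: identify_object(Image.open("crop.png"),
#                              ["coffee mug", "water bottle", "remote control"])
```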
This integration represents a major step toward “embodied AI,” shifting mobile service robots from simple command executors to autonomous, high-level agents capable of intuitive communication and context-aware assistance [
158]. By leveraging LLMs and VLMs, mobile robots can perform language-conditioned control, allowing them to navigate via complex semantic instructions (e.g., systems like NavGPT [
159]) and execute adaptive mobile manipulation tasks (e.g., BUMBLE) [
76]. To ensure continuous and seamless Human–Robot Interaction (HRI), novel frameworks utilizing Perception-Action Loops (PALoop) have been introduced, enabling robots to combine logical reasoning with pre-trained databases for long-horizon planning [
160].
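Although the cited systems differ considerably, they share a common language-conditioned control pattern: an LLM translates a free-form instruction into an action drawn from a constrained, robot-known vocabulary, and the result is validated before execution. The sketch below illustrates only that pattern; llm_complete is a placeholder for any text-completion client, and the action and waypoint vocabularies are assumptions.

```python
import json

ALLOWED_ACTIONS = {"go_to", "follow_person", "wait"}
KNOWN_WAYPOINTS = {"kitchen", "charging_dock", "meeting_room"}   # from the robot's map

PROMPT = """You control an indoor mobile robot. Map the user's instruction to JSON:
{{"action": one of {actions}, "target": one of {targets} or null}}
Instruction: "{instruction}"
JSON:"""

SAFE_FALLBACK = {"action": "wait", "target": None}

def plan_from_instruction(instruction, llm_complete):
    """llm_complete is a placeholder for any text-completion client."""
    raw = llm_complete(PROMPT.format(actions=sorted(ALLOWED_ACTIONS),
                                     targets=sorted(KNOWN_WAYPOINTS),
                                     instruction=instruction))
    try:
        plan = json.loads(raw)
    except (ValueError, TypeError):
        return SAFE_FALLBACK
    # Never execute an out-of-vocabulary (possibly hallucinated) plan.
    if plan.get("action") not in ALLOWED_ACTIONS:
        return SAFE_FALLBACK
    if plan.get("target") is not None and plan["target"] not in KNOWN_WAYPOINTS:
        return SAFE_FALLBACK
    return plan
```

The explicit validation step foreshadows the hallucination problem discussed next: constraining outputs to a known vocabulary is one common safeguard against acting on ungrounded model output.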
A critical challenge in applying these foundation models to physical environments is bridging the gap between abstract text and the 3D world. To address this, frameworks like the TAsk Planning Agent (TaPA) align LLMs with open-vocabulary visual object detectors to generate grounded, executable plans for complex indoor household tasks [
161]. Similarly, 3D vision-language-action (VLA) models, such as LEO, allow mobile agents to perceive, reason, and act directly within 3D spaces, responding accurately to multi-round user instructions [
162]. Because “hallucination” in LLMs poses a safety risk in physical HRI, recent prompting mechanisms like Contextual Set-of-Mark (ConSoM) have been developed to significantly improve visual grounding and precision in indoor robotic scenarios [
163].
Furthermore, modern VLM integration is expanding beyond simple RGB vision to true multisensory perception. Advanced models like MultiPLY integrate 3D visual data with tactile, auditory, and thermal inputs [
164]. This multisensory approach is pivotal for robust embodied interaction; for instance, incorporating tactile sensing enables a mobile manipulator to gather vital safety feedback to avoid applying excessive force, while auditory processing allows it to isolate voice commands [
165]. Finally, to ensure safety and precision during task execution, researchers are implementing closed-loop feedback systems. By integrating Small Language Models (SLMs) alongside VLMs, indoor mobile robots can iteratively evaluate scene changes and refine their control commands in real-time, adapting instantly to dynamic human environments [
166].
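Stripped of model specifics, this closed-loop pattern reduces to an iterative perceive-evaluate-refine cycle. The sketch below shows the control skeleton only; every callable is a placeholder for the corresponding perception, language, or actuation component, and the step limit is an illustrative safety bound.

```python
def closed_loop_execute(goal, perceive, evaluate, refine_command, act, max_steps=10):
    """Generic perceive-evaluate-refine loop (all callables are placeholders):
      perceive()            -> current scene description
      evaluate(goal, scene) -> (done: bool, feedback: str), e.g. from a small LM
      refine_command(...)   -> next low-level command given the feedback
      act(command)          -> execute the command on the robot
    """
    command = None
    for _ in range(max_steps):
        scene = perceive()
        done, feedback = evaluate(goal, scene)
        if done:
            return True
        command = refine_command(goal, scene, feedback, command)
        act(command)
    return False  # stop safely after max_steps rather than acting on stale plans
```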
Table 13 provides a comparative overview of multimodal and VLM-based interaction methods. Each approach is evaluated across key dimensions including computational cost, robustness to noise, user cognitive load, and hardware requirements. This table highlights the advantages of combining multiple modalities with LLM/VLM reasoning, such as increased adaptability, improved semantic understanding, and enhanced task performance. At the same time, it draws attention to limitations such as latency, onboard resource demands, and safety-critical constraints, guiding design decisions for real-world deployments.
Integrating LLMs and VLMs into multimodal interfaces represents a paradigm shift in HRI, moving from fixed-rule fusion strategies toward adaptive, context-aware reasoning systems. This enables more natural, flexible, and socially compliant human–robot collaboration. Design decisions for selecting modalities or combinations thereof should be guided by the task context, user population, and environmental constraints, with explicit consideration of computational limitations and safety-critical requirements.
4.3. Cross-Cutting Synthesis
Human–Robot Interaction in indoor mobile robotics encompasses multiple modalities, each offering distinct advantages, constraints, and suitability depending on the context. While previous sections have discussed speech, gesture, touch, visual/gaze, and multimodal approaches separately, synthesizing these insights highlights overarching trends, comparative strengths, and persistent challenges.
Table 14 provides a comprehensive overview of interaction modalities, highlighting advantages, limitations, computational demands, and typical applications. This allows for a side-by-side comparison of classical and learning-based approaches, including the role of LLM and VLM reasoning in modern adaptive systems.
A key development in HRI is the clear shift from classical, rule-based systems toward learning-based, adaptive approaches. Early systems for speech and gesture often relied on predefined templates and command sets. Modern deep learning architectures—particularly LSTM and Transformer models—allow the recognition of complex, context-dependent commands. When combined with LLMs and VLMs, robots can now interpret multimodal inputs and respond to subtle, situational user intentions.
Another significant trend is context- and environment-aware adaptation. Multimodal systems integrating speech, gestures, visual, and tactile signals can dynamically select the most appropriate interaction modality. For instance, robots may rely on visual displays or gestures in noisy environments or leverage gesture recognition when users’ hands are free, ensuring both efficiency and clarity of interaction.
Human-centered design is increasingly emphasized. Effective systems feature legible motion, socially compliant behavior, and adaptive responses. Multimodal integration enhances not only efficiency but also accessibility: limitations of individual modalities, such as reduced gesture recognizability in poor lighting or unreliable speech recognition in noisy environments, can be mitigated through sensor fusion.
Finally, cross-cutting challenges span the entire field. These include dependence on environmental conditions, variability of human users (e.g., cognitive load, cultural differences), and high computational and hardware demands for LLM/VLM-based systems. Moreover, the lack of standardized datasets, evaluation metrics, and protocols complicates the comparative assessment and reproducibility of approaches.
In summary, Human–Robot Interaction in indoor mobile robotics is shaped by evolving methodologies, context-specific applications, and enduring challenges. Understanding these cross-cutting factors is critical for developing adaptive, robust, and user-centered systems and provides a foundation for advancing future HRI research and deployment.
8. Conclusions
This comprehensive review of Human–Robot Interaction in indoor mobile robotics reveals a rapidly evolving field with significant potential for transforming how humans and robots collaborate in shared environments. The analysis of interaction modalities demonstrates that effective HRI requires multimodal approaches that combine speech, gesture, touch, and visual communication channels. Each modality offers unique advantages while facing distinct challenges related to environmental conditions, user diversity, and technical limitations.
The examination of user experience and acceptance factors highlights the critical importance of human-centered design in achieving successful robot deployment. Trust, reliability, and social integration emerge as fundamental requirements that must be addressed through careful attention to robot behavior, transparency, and cultural sensitivity.
The case studies of Moxi, Temi, and Astro illustrate different approaches to addressing these challenges, with varying degrees of success depending on the application domain and user requirements.
Safety and privacy considerations present ongoing challenges that require continued attention from researchers, manufacturers, and policymakers. The development of appropriate regulatory frameworks, technical standards, and ethical guidelines will be essential for the widespread adoption of indoor mobile robots. Privacy protection mechanisms and transparent data practices must be integrated into system design rather than treated as afterthoughts.
The technical challenges identified in this review, including navigation reliability, interaction robustness, and computational constraints, represent active areas of research with promising solutions emerging from advances in AI, sensor technology, and computing infrastructure. The integration of Large Language Models and foundation models offers promise for enhancing interaction capabilities while raising new questions about safety and reliability.
Future research should focus on developing more robust and adaptable interaction systems that can operate effectively across diverse indoor environments and user populations. Long-term studies of human–robot relationships, social integration effects, and trust development will be crucial for understanding the full implications of robot deployment in human-centered environments.
The success of indoor mobile robotics will ultimately depend on achieving the right balance among technical capability, user acceptance, safety, and economic viability. This requires continued collaboration among technologists, social scientists, ethicists, and end users to ensure that robotic systems truly serve human needs and values. As the field matures, the focus must shift from demonstrating technical feasibility to creating sustainable, beneficial, and trustworthy partnerships between humans and robots in our most personal and important spaces.