Article

Learning to Navigate in Mixed Human–Robot Crowds via an Attention-Driven Deep Reinforcement Learning Framework

by Ibrahim K. Kabir 1,2, Muhammad F. Mysorewala 1,2,3,*, Yahya I. Osais 4,5,6 and Ali Nasir 1,3
1 Control and Instrumentation Department, King Fahd University of Petroleum & Minerals (KFUPM), Dhahran 31261, Saudi Arabia
2 Interdisciplinary Research Center for Smart Mobility and Logistics, King Fahd University of Petroleum & Minerals (KFUPM), Dhahran 31261, Saudi Arabia
3 Interdisciplinary Research Center for Intelligent Manufacturing and Robotics, King Fahd University of Petroleum & Minerals (KFUPM), Dhahran 31261, Saudi Arabia
4 Information & Computer Science Department, King Fahd University of Petroleum & Minerals (KFUPM), Dhahran 31261, Saudi Arabia
5 Interdisciplinary Research Center for Intelligent Secure Systems, King Fahd University of Petroleum & Minerals (KFUPM), Dhahran 31261, Saudi Arabia
6 Interdisciplinary Research Center for Communication Systems and Sensing, King Fahd University of Petroleum & Minerals (KFUPM), Dhahran 31261, Saudi Arabia
* Author to whom correspondence should be addressed.
Mach. Learn. Knowl. Extr. 2025, 7(4), 145; https://doi.org/10.3390/make7040145
Submission received: 23 September 2025 / Revised: 7 November 2025 / Accepted: 11 November 2025 / Published: 13 November 2025
(This article belongs to the Section Learning)

Abstract

The rapid growth of technology has introduced robots into daily life, necessitating navigation frameworks that enable safe, human-friendly movement while accounting for social aspects. Such methods must also scale to situations with multiple humans and robots moving simultaneously. Recent advances in Deep Reinforcement Learning (DRL) have enabled policies that incorporate these norms into navigation. This work presents a socially aware navigation framework for mobile robots operating in environments shared with humans and other robots. The approach, based on single-agent DRL, models all interaction types between the ego robot, humans, and other robots. Training uses a reward function balancing task completion, collision avoidance, and maintaining comfortable distances from humans. An attention mechanism enables the framework to extract knowledge about the relative importance of surrounding agents, guiding safer and more efficient navigation. Our approach is tested in both dynamic and static obstacle environments. To improve training efficiency and promote socially appropriate behaviors, Imitation Learning is employed. Comparative evaluations with state-of-the-art methods highlight the advantages of our approach, especially in enhancing safety by reducing collisions and preserving comfort distances. Results confirm the effectiveness of our learned policy and its ability to extract socially relevant knowledge in human–robot environments where social compliance is essential for deployment.

1. Introduction

The field of robotics has advanced rapidly in recent decades, and robots have been increasingly deployed in human environments such as factory floors, restaurants, airports, and hospitals [1]. In shared spaces with humans, robots must avoid collisions and move in a manner that does not cause discomfort or harm to nearby individuals. Human navigation in shared spaces is governed by social conventions or norms, which help people understand intentions and move in ways that ensure mutual comfort [2]. Robot navigation in the presence of humans, termed socially aware navigation or crowd navigation, applies these rules as well as proxemics [3], which defines how space is regulated when interacting and navigating. As a requirement, for navigation to be considered socially aware, it must take into account comfort, naturalness, and sociability [4].
Existing work on Socially Aware Navigation (SAN) has generally been categorized into three areas: reaction-based methods, proactive methods, and learning-based methods [5]. Reaction-based methods involve robots reacting with other mobile agents using one-step interaction approaches [6]. They can be considered purely rule-based, with no prediction or learning involved. The Velocity Obstacle (VO) [7], Optimal Reciprocal Collision Avoidance (ORCA) [8], and Social Force Model (SFM) [9] are such methods. However, they are heavily dependent on mobility models and capture only simple interactions. Moreover, planning is short-sighted and often unnatural. Proactive methods, often referred to as trajectory-based methods, predict human behavior through their trajectories and plan a path to follow. These methods exhibit smoother interactions within human social norms [10,11,12]. Unfortunately, the computational cost of predicting all possible paths in these methods becomes very high, especially when the state space expands with increasing crowd size.
Learning-based methods have shown promise in recent years, specifically Reinforcement Learning (RL) and Deep Reinforcement Learning (DRL) approaches. Collision Avoidance with DRL (CADRL) [13] was one of the earliest methods to utilize DRL for collision avoidance in robot navigation, replacing hand-crafted techniques with DRL. The authors then extended it in [14] to consider social norms and learn socially compliant behavior. However, the over-reliance on the reciprocal assumption led to overly conservative behavior, resulting in severely reduced effectiveness in complex scenarios. In [15], LSTM-RL was developed using Long Short-Term Memory (LSTM) to encode other agents and model the collective impact of the crowd; it sequentially processes the current states of each neighbor in reverse order of proximity to the robot. In large-scale crowds, it is necessary to identify the most critical individuals for navigation and learn accurate representations of the crowd.
Various studies have proposed different approaches, such as Graph Convolutional Networks (GCNs) [16], Graph Attention Network (GAT) [17], Self-Attention Mechanisms [18], and Transformers [19]. Some methods employ Imitation Learning (IL) to enhance sample and training efficiency by initializing training with a set of expert navigation demonstrations, similar to supervised learning [18,20,21]. This is then used to obtain a navigation model based on the training data.
Multiple mobile robots can navigate an environment cooperatively, sharing the same goal, or with different goals. Some of these robots may be designed to follow social norms and exhibit social awareness. It is therefore essential to design navigation frameworks that enable mobile robots to navigate through crowded environments that include both people and robots. Such frameworks must accurately model the different interaction types in environments with multiple humans and robots. When multiple robots share the same environment, the robots must be able to differentiate between themselves and humans. Most methods that explore multirobot SAN treat each robot as a separate RL agent, which casts the framework as Multi-Agent Reinforcement Learning (MARL) focused on cooperative navigation [22,23,24]. This approach, however, faces serious challenges: (1) non-stationarity resulting from the continually changing policies of the agents during learning, (2) dimensional explosion as the number of agents grows, and (3) the difficulty of learning optimal policies in the MARL system [25]. Additionally, such methods often fail to account for the uniqueness of robot–robot interactions, frequently modeling them homogeneously alongside human–robot interactions. This is a crucial point, given that robot–robot encounters do not require the same social proxemic considerations as human–robot encounters. Furthermore, existing methods do not model asymmetric social awareness scenarios, in which the ego robot is trained to be fully socially compliant while the other robots in the environment operate only with basic collision avoidance policies. This gap can lead to inefficient social navigation, as the ego robot may misinterpret or fail to leverage the predictable, non-social behavior of other robots.
Given all these drawbacks and challenges, we therefore propose a Socially Aware Navigation framework based on single-agent Deep Reinforcement Learning with Imitation-Learning-based pretraining and initialization for navigating multiple robots in the presence of humans. The main contributions of this work are as follows:
  • A navigation framework based on single-agent deep reinforcement learning, which allows the ego robot to move according to social norms among humans and other robots that follow a predefined collision avoidance behavior.
  • The use of imitation learning for socially aware single-agent navigation in environments shared by other robots and humans.
  • A model of the full social awareness for the ego robot and partial social awareness for other robots to reflect realistic interactions, enabling the extraction of socially relevant knowledge from human–robot interactions.
The rest of the paper is organized as follows. Section 2 discusses the related work, Section 3 presents and discusses the methodology, Section 4 presents the obtained results along with discussions, and Section 5 concludes this research.

2. Related Work

The earliest works in this field all consider a single robot navigating in the presence of humans in a socially aware manner. Robots primarily rely on their perception and reasoning capabilities to interpret human behavior and predict future human actions, yet lack feedback from the crowd. These methods often treat robot navigation in social environments as a one-way human–robot interaction problem, which could pose challenges as the size of the crowd increases. To accurately model robot navigation in human environments as a two-way human–robot interaction problem, ref. [18] developed the Social Attention Reinforcement Learning (SARL) approach, which explicitly models interactions between the robot and the crowd, specifically human–human and human–robot interactions. It uses a self-attention mechanism to find the aggregate impact of the crowd and prioritizes the most relevant interactions. It also incorporates imitation learning for initialization of training using ORCA [8]. Building upon this work, ref. [26] proposed Social Obstacle Avoidance using Deep Reinforcement Learning (SOADRL), which processes static and dynamic obstacle information independently, thereby effectively addressing the difficulty of navigating with sensors only in a restricted field of view within crowds. In [27], the most critical individuals with respect to navigation are highlighted within the crowd. The work in [28] suggested simulating danger zones for the robot to enable real-time reaction to human actions. The definition of these zones is based on tracking human behavior in real-time and then storing all the potential actions that individuals may do at any given moment. The robot learns to navigate away from these danger zones to achieve dependable and safe navigation.
Multirobot approaches in SAN have primarily focused on multi-agent cooperative navigation within Multi-Agent RL frameworks. These works have essentially developed frameworks with Centralized Training and Decentralized Execution (CTDE) of the agents and trained policies, respectively [24,29,30,31]. These methods treat each robot as an RL agent with its own separate actions, observations, and rewards. In [24], a multirobot Multi-agent RL approach was developed. It employed separate hybrid spatio-temporal transformers to model both human–robot and robot–robot interactions, then used a multi-modal transformer to align and fuse these features, creating a comprehensive state representation that includes both human–robot and robot–robot interactions. It also used an attention mechanism to determine the importance of each agent relative to others, enhancing the robot’s ability to make socially-aware decisions. The approach in [29] employs an Actor-Critic RL method, incorporating cooperative planning, and utilizes a Temporal-Spatial graph-based encoder to determine the significance of social interactions. It also incorporates a K-step look-ahead reward mechanism to enable robots to consider future consequences, thereby avoiding aggressive, short-sighted decisions. Robot–robot interactions are captured using a centralized value network that shares the necessary information about each robot, enabling accurate decision-making. While [23] also employs the CTDE paradigm, it models interactions in a different manner. It utilizes a Graph Neural Network (GNN) to model interactions between robots and humans, combining their positions within the field of view to obtain an accurate representation of the environment. Then, an edge selector mechanism is used to identify the most critical interactions between agents by creating a sparse graph and applying attention to the nodes.
The authors in [32] proposed Heterogeneous Relational Deep Reinforcement Learning (HeR-DRL), a single-agent multirobot crowd navigation framework. It uses a two-layer heterogeneous GNN to transfer and aggregate information among the robots and humans navigating in the environment, with separate subgraphs modeling the relationships between the different types of robots and humans. However, it modeled robot–robot interactions in the same way as human–robot interactions, and only the controlled robot (the RL agent) was given the capability of social awareness, while all other robots in the environment operated without social awareness. While their approach demonstrates the importance of heterogeneous modeling using a GNN, our work provides an alternative approach by employing an attention mechanism for pooling interaction features, allowing the model to weigh the importance of different agents dynamically. We also utilize Imitation Learning from ORCA demonstrations to provide a strong initial policy.

3. Methodology

3.1. Assumptions

This work is based upon some fundamental assumptions, which are presented below:
  • This work is based on a single-agent deep reinforcement learning framework, where an agent learns in an environment that includes other robots and humans as part of the environment.
  • The robot that learns the optimal policy is called the ego robot, while the other robots in the environment are referred to as the other robots.
  • The ego robot has a full view of the environment, while the other robots have a partial view of the environment.
  • All the robots are modeled as holonomic, i.e., they can move in any direction instantly without rotation constraints.
  • Humans and other robots in the environment follow the ORCA policy.
  • There is no explicit communication of navigation intent between the ego robot, other robots, and humans; they can only observe the states of each other.
  • The navigation is modeled as point-to-point navigation scenarios in a two-dimensional (2D) plane.

3.2. Problem Formulation

Many real-world applications require robot navigation in only two dimensions. Their navigation objective is to determine the most efficient and safe route from the initial position to the target location. The problem of a robot navigating in a 2D Euclidean space environment with multiple humans and multiple robots with social considerations can be formulated as a Markov Decision Process (MDP) within an RL framework. The MDP is defined by the tuple $\langle S, A, P, R, \gamma \rangle$, where $S$ is the state space, $A$ is the action space, $P$ is the transition probability function, $R$ is the reward function, and $\gamma$ is the discount factor. Together, these elements form the foundation for the agent's ability to make decisions and learn within the environment. By selecting an optimal action in a given state, the agent maximizes the cumulative reward, and the algorithm steadily improves its performance.

3.2.1. State Space

The state space $S$ comprises all the environment information that the agent can observe. This includes the states of humans, other robots, and the ego robot. The state of the ego robot comprises its position $p = [p_x, p_y]$, velocity $v = [v_x, v_y]$, and radius, along with the heading angle, preferred velocity, and goal position $p_g$. The state of the ego robot at time $t$ is given by

$s_t^{er} = [d_g, v_{pref}, \theta, r, v_x, v_y]$

where $d_g = \lVert p - p_g \rVert_2$ is the distance of the ego robot to the goal, $v_{pref}$ is the preferred velocity of the ego robot, $\theta$ is the heading angle of the ego robot, $r$ is the radius of the robot, and $v_x$ and $v_y$ are the velocity components of the ego robot. The state of the i-th human at time $t$ is denoted as

$s_t^{h_i} = [p_x^{h_i}, p_y^{h_i}, v_x^{h_i}, v_y^{h_i}, r^{h_i}, d_a^{h_i}, r^{h_i} + r, c^{h_i}]$

where $p_x^{h_i}$ and $p_y^{h_i}$ are the components of the position of the human, $v_x^{h_i}$ and $v_y^{h_i}$ are the velocity components, $r^{h_i}$ is the radius of the human, and $d_a^{h_i}$ is the distance between the ego robot and the human. Similarly, the state of the j-th other robot at time $t$ is given by

$s_t^{or_j} = [p_x^{or_j}, p_y^{or_j}, v_x^{or_j}, v_y^{or_j}, r^{or_j}, d_a^{or_j}, r^{or_j} + r, c^{or_j}]$

where $p_x^{or_j}$ and $p_y^{or_j}$ are the components of the position of the other robot, $v_x^{or_j}$ and $v_y^{or_j}$ are its velocity components, $r^{or_j}$ is the radius of the other robot, and $d_a^{or_j}$ is the distance between the other robot and the ego robot. As in [32], we also define a category flag $c_*$, which is 1 if the agent is a human and 0 if it is an other robot, to help identify the type of agent in the joint state and hence the type of interaction.
We utilize the above state information to construct the joint state, $s_t^{jn}$, which is a combination of the states of the ego robot, the humans, and the other robots. To simplify computation and reduce redundancy, the ego robot is positioned at the origin of a polar coordinate system, and the x-axis is directed toward the goal position, which is always relative to the ego robot's current position.
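As a concrete illustration of this robot-centric parameterization, the sketch below (a minimal Python example assuming dictionary-based agent states rather than the authors' data structures) rotates world-frame quantities into a frame whose x-axis points from the ego robot toward its goal:

```python
import numpy as np

def to_ego_frame(ego, agent):
    """Rotate world-frame states into an ego-centric frame whose x-axis points
    from the ego robot toward its goal (field names are illustrative)."""
    rot = np.arctan2(ego["gy"] - ego["py"], ego["gx"] - ego["px"])
    cos_r, sin_r = np.cos(rot), np.sin(rot)

    def rotate(x, y):
        # Rotate a world-frame vector by -rot into the goal-aligned frame
        return x * cos_r + y * sin_r, -x * sin_r + y * cos_r

    # Ego-robot part of the joint state: [d_g, v_pref, theta, r, v_x, v_y]
    dg = np.hypot(ego["gx"] - ego["px"], ego["gy"] - ego["py"])
    vx, vy = rotate(ego["vx"], ego["vy"])
    ego_state = [dg, ego["v_pref"], ego["theta"] - rot, ego["radius"], vx, vy]

    # Agent part: relative position, rotated velocity, radius, distance to the
    # ego robot, sum of radii, and category flag (1 = human, 0 = other robot)
    px, py = rotate(agent["px"] - ego["px"], agent["py"] - ego["py"])
    avx, avy = rotate(agent["vx"], agent["vy"])
    da = np.hypot(agent["px"] - ego["px"], agent["py"] - ego["py"])
    agent_state = [px, py, avx, avy, agent["radius"],
                   da, agent["radius"] + ego["radius"], agent["category"]]
    return ego_state, agent_state
```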

3.2.2. Action Space

The action space encompasses the full range of actions the agent can perform. In this case, the action space is discrete and allows the ego robot to select any combination of speed and direction for smooth and unrestricted movement. We use holonomic kinematics for the robots and assume that the ego robot's velocity is reached immediately after receiving the action command. We utilize a large action space to enable more flexible action selection while appropriately limiting it to prevent excessive computational burden. We use a two-dimensional action space $[v_t, \phi_t]$, where $v_t \in [0, v_{pref}]$ is the speed command at time $t$, taking five discrete values uniformly distributed between 0 and $v_{pref}$, and $\phi_t \in [0, 2\pi]$ is the heading angle command at time $t$, taking 16 discrete values uniformly distributed between 0 and $2\pi$. The combination of these yields a total of 80 discrete actions.
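A minimal sketch of how this discrete action set can be enumerated is shown below; whether the speed grid includes zero, and the exact spacing, are assumptions since the text only states that the values are uniformly distributed:

```python
import numpy as np

def build_action_space(v_pref, n_speeds=5, n_headings=16):
    """Enumerate the discrete actions: n_speeds speeds uniformly spaced over
    (0, v_pref] crossed with n_headings headings uniformly spaced over
    [0, 2*pi), giving 5 x 16 = 80 (v_t, phi_t) pairs."""
    speeds = np.linspace(v_pref / n_speeds, v_pref, n_speeds)
    headings = np.linspace(0.0, 2.0 * np.pi, n_headings, endpoint=False)
    return [(v, phi) for v in speeds for phi in headings]

actions = build_action_space(v_pref=1.0)
assert len(actions) == 80
```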

3.2.3. Reward Function

The reward function assigns a numerical score (reward or penalty) to the agent’s actions based on how desirable those actions are in a given state. We selected a reward function that will help in achieving social awareness in the navigation framework. Our reward framework includes an arrival reward, a collision penalty, an uncomfortable distance penalty, and a time-based penalty (set as zero).
The reward function used is given below and was adapted from [18]. Let $H$ be the set of humans and $OR$ be the set of other robots. The cumulative or total reward $R_T$ is expressed as

$R_T = R_g + R_c + R_d$

where $R_g$ is the reward for navigating to the goal, $R_c$ is the penalty for collisions, and $R_d$ is the discomfort penalty, which only applies to interactions with humans. These are individually defined as

$R_g = \begin{cases} 1, & \text{if } p_t = p_g \\ 0, & \text{otherwise} \end{cases}$

$R_c = \begin{cases} -0.25, & \text{if } d_a^i < r + r_i \text{ for any } i \in H \cup OR \\ 0, & \text{otherwise} \end{cases}$

$R_d = \sum_{h \in H} f(d_{min}^h, d_{th}^H)$

$f(d_{min}, d_{th}) = \begin{cases} d_{min} - d_{th}, & \text{if } d_{min} < d_{th} \\ 0, & \text{otherwise} \end{cases}$

where $p_t$ is the position of the ego robot at time $t$, $d_g$ is the ego robot's distance to the goal, and $d_a^i$ is the ego robot's distance from another agent (human or other robot) $i$ at timestep $t$. Because the comfort term applies only to humans, $d_{th}^H = 0.2$ m is the threshold distance below which humans feel discomfort, and $d_{min}^h$ is the minimum separation distance between the ego robot and the h-th human at timestep $t$. For other robots, $d_{th}^{OR} = 0$ by design, so they contribute nothing to $R_d$.
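The per-step reward can be computed as in the sketch below; the field names (`d_a`, `combined_radius`, `is_human`) and the use of the surface-to-surface gap for $d_{min}$ are assumptions made for illustration, not the authors' implementation:

```python
def step_reward(reached_goal, agents, d_th_human=0.2, collision_penalty=-0.25):
    """Per-step reward R_T = R_g + R_c + R_d from the definitions above.
    `agents` is a list of dicts with center-to-center distance `d_a` to the
    ego robot, `combined_radius` (r + r_i), and a boolean `is_human`."""
    # Goal-reaching reward R_g
    r_goal = 1.0 if reached_goal else 0.0

    # Collision penalty R_c: triggered by any agent, human or other robot
    collided = any(a["d_a"] < a["combined_radius"] for a in agents)
    r_col = collision_penalty if collided else 0.0

    # Discomfort penalty R_d: humans only (d_th^H = 0.2 m); other robots
    # have d_th^OR = 0 and therefore never contribute
    r_disc = 0.0
    for a in agents:
        if a["is_human"]:
            d_min = a["d_a"] - a["combined_radius"]  # surface gap (assumption)
            if d_min < d_th_human:
                r_disc += d_min - d_th_human
    return r_goal + r_col + r_disc
```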

3.2.4. Optimal Policy and Value Function

The value function $V(s)$ represents the expected total reward an agent can accumulate by being in a particular state and following a certain policy afterward. In this work, the goal is to find the optimal policy and the optimal value function $V^*(s_t^{jn})$, which is given by

$V^*(s_t^{jn}) = \max_{a_t} \mathbb{E}_{s_{t+1}^{jn}} \left[ r(s_t^{jn}, a_t) + \gamma V^*(s_{t+1}^{jn}) \right]$

where $\gamma$ is a value between 0 and 1 (but closer to 1) termed the discount factor, and $r(s_t^{jn}, a_t)$ is the reward for taking action $a_t$ in the joint state. The optimal policy $\pi^*$, which maps the joint state $s_t^{jn}$ to an action $a_t$, is given by

$\pi^*(s_t^{jn}) = \arg\max_{a_t} Q^*(s_t^{jn}, a_t)$

where $s_t^{jn}$ is the current joint state and $a_t$ is the chosen action.
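With a learned value network and the discrete action set defined earlier, this policy can be realized by a one-step lookahead over candidate actions. The sketch below assumes hypothetical helpers `propagate_fn` (advancing the joint state under a candidate action, e.g., with a constant-velocity model for the other agents) and `reward_fn`; it illustrates the greedy selection rule rather than the authors' exact planner:

```python
import torch

def select_action(value_net, joint_state, actions, reward_fn, propagate_fn,
                  gamma=0.9, epsilon=0.0):
    """Epsilon-greedy action selection: score each discrete action by the
    one-step lookahead r(s, a) + gamma * V(s') and return the best one."""
    if epsilon > 0 and torch.rand(1).item() < epsilon:
        return actions[torch.randint(len(actions), (1,)).item()]
    best_value, best_action = -float("inf"), None
    with torch.no_grad():
        for a in actions:
            next_state = propagate_fn(joint_state, a)   # estimate of s_{t+1}^{jn}
            q = reward_fn(joint_state, a) + gamma * value_net(next_state).item()
            if q > best_value:
                best_value, best_action = q, a
    return best_action
```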

3.3. Interaction Modeling

Five distinct interaction types exist in the environment: human–human, human–ego robot, ego robot–other robot, human–other robot, and other robot–other robot. These are illustrated in Figure 1, and their relationships are summarized in the agent interaction matrix in Table 1.

3.3.1. Human–Human Interaction

Human agents are modeled using the ORCA policy, which enables reciprocal collision avoidance based on local sensing and velocity adjustments. Each human responds reactively to the motion of nearby humans, maintaining safe distances and navigating toward individual goals while avoiding collisions. This interaction captures the decentralized and socially compliant nature of human navigation in shared spaces.

3.3.2. Human–Ego Robot Interaction

The ego robot employs a reinforcement learning policy trained to exhibit socially aware behavior. It perceives and interprets the state of surrounding humans, including position, velocity, and relative orientation, to make navigation decisions that align with human social norms. The policy accounts for proxemic constraints and collision risk, allowing the robot to adjust its trajectory appropriately in the presence of humans.

3.3.3. Ego Robot–Other Robot Interaction

While the ego robot is the only agent trained via reinforcement learning, other robots in the environment are modeled using the ORCA policy. The ego robot treats these other robots as dynamic obstacles: it performs collision avoidance with them but does not maintain a personal space or discomfort zone around them, since robots, unlike humans, are not social beings and do not experience discomfort.

3.3.4. Human–Other Robot Interaction

Interactions between humans and other robots are governed by the same reciprocal collision avoidance principles as human–human interactions. This interaction involves no social awareness, only collision avoidance; however, other robots are designed to maintain slightly larger safety margins when navigating around humans, so that their presence in the environment reflects realistic robot behavior.

3.3.5. Other Robot–Other Robot Interaction

Interactions among the other robots follow standard ORCA dynamics. Each robot independently computes its velocity based on the positions and velocities of its neighbors, including other robots. These interactions are symmetric and reciprocal, enabling decentralized collision avoidance without explicit communication. Although these robots are not trained, their behavior is sufficient to support socially aware motion of other agents in the environment.

3.4. System Architecture

Our framework consists of three modules and is designed to facilitate socially aware navigation in multi-robot scenarios. It comprises an Interaction Module, a Pooling Module, and a Planning Module as shown in the overall architecture in Figure 2. While the modular principle is conceptually related to [18], our formulation is specifically adapted to capture both human–robot and robot–robot interactions through unique input representations, modified attention computations, and a training strategy that supports navigation of multiple robots within a single-agent DRL formulation. These modules are discussed in detail below.

3.4.1. Interaction Module

All entities in the environment (humans, the ego robot, and other robots) have an impact on each other, and it would be too computationally expensive to model all these interactions individually. For this purpose, local maps were created to obtain representations of the human–human, human–other robot, and other robot–other robot interactions, and a pairwise interaction module was used for the human–ego robot and ego robot–other robot interactions. To capture the presence and velocities of neighbors in a neighborhood of size L, Figure 3 shows an $L \times L \times 4$ map tensor $M_*^i$ centered at each agent (with $*$ representing $or$ for other robots and $h$ for humans). This map is referred to as the local map and is given by

$M_*^i(a, b, :) = \sum_{j \in \mathcal{N}_i} \delta_{ab}[x_j - x_i,\, y_j - y_i]\, \xi_j$

where $\xi_j = (v_{xj}, v_{yj}, 1)$ is a local state vector for agent $j$, and $\delta_{ab}[x_j - x_i, y_j - y_i]$ is an indicator function that equals 1 if the relative position $(x_j - x_i, y_j - y_i)$ falls into grid cell $(a, b)$ and 0 otherwise.
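A minimal construction of such a local map is sketched below. Note that the text specifies an $L \times L \times 4$ tensor while $\xi_j$ has three components; the three-channel version here, and the grid cell size, are therefore assumptions:

```python
import numpy as np

def local_map(center_agent, neighbors, L=4, cell_size=1.0):
    """Build an L x L occupancy/velocity map centered on one agent: each cell
    accumulates (v_x, v_y, 1) over the neighbors that fall into it, mirroring
    the indicator function delta_ab[x_j - x_i, y_j - y_i] in Equation (11)."""
    M = np.zeros((L, L, 3))
    for nb in neighbors:
        dx = nb["px"] - center_agent["px"]
        dy = nb["py"] - center_agent["py"]
        a = int(np.floor(dx / cell_size + L / 2))
        b = int(np.floor(dy / cell_size + L / 2))
        if 0 <= a < L and 0 <= b < L:   # neighbor lies inside the map
            M[a, b, :] += np.array([nb["vx"], nb["vy"], 1.0])
    return M
```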
The purpose of the Interaction Module is to create a feature vector for each agent in the environment that represents its interaction with the ego robot. As shown in Figure 2, this module takes three distinct inputs for each agent:
  • The agent's own state ($s^{h_i}$ for a human or $s^{or_j}$ for another robot).
  • The agent's local map ($M^{h_i}$ or $M^{or_j}$), as defined in Equation (11).
  • The ego robot's own state ($s^{er}$), which is shared across all agents.
Using a Multi-Layer Perceptron (MLP), we encode the state of the ego robot, the state of each human or other robot, and the corresponding map tensor $M^{or_i}$ or $M^{h_i}$ into a fixed-length embedding vector, $e_{or}^i$ for the interaction with an other robot or $e_h^i$ for the interaction with a human, respectively. They are given as

$e_{or}^i = \phi_e(s^{er}, s^{or_i}, M^{or_i}; W_e)$

$e_h^i = \phi_e(s^{er}, s^{h_i}, M^{h_i}; W_e)$

where $W_e$ denotes the embedding weights and $\phi_e(\cdot)$ is an MLP with ReLU activations that embeds the states and the map tensor. This embedding contains the joint information of the ego robot, the specific agent, and that agent's local neighborhood. The final interaction feature between the ego robot and a human or other robot is then obtained by feeding the embedding vectors $e_{or}^i$ and $e_h^i$ to subsequent MLPs:

$h_{or}^i = \psi_h(e_{or}^i; W_h)$

$h_h^i = \psi_h(e_h^i; W_h)$

where $\psi_h(\cdot)$ is another MLP with ReLU activations, and $W_h$ denotes its weights.
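A compact PyTorch sketch of these two MLPs is given below; the layer widths are placeholders rather than the exact hidden units listed in Table 2:

```python
import torch
import torch.nn as nn

class InteractionModule(nn.Module):
    """Pairwise interaction encoder following Equations (12)-(15): an MLP
    phi_e embeds [ego state, agent state, flattened local map], and a second
    MLP psi_h maps the embedding to the interaction feature h."""
    def __init__(self, ego_dim, agent_dim, map_dim, embed_dim=100, feat_dim=50):
        super().__init__()
        in_dim = ego_dim + agent_dim + map_dim
        self.phi_e = nn.Sequential(nn.Linear(in_dim, 150), nn.ReLU(),
                                   nn.Linear(150, embed_dim), nn.ReLU())
        self.psi_h = nn.Sequential(nn.Linear(embed_dim, feat_dim), nn.ReLU())

    def forward(self, s_er, s_agent, local_map):
        # e_i = phi_e(s_er, s_i, M_i; W_e),  h_i = psi_h(e_i; W_h)
        x = torch.cat([s_er, s_agent, local_map.flatten(start_dim=1)], dim=1)
        e = self.phi_e(x)
        return e, self.psi_h(e)
```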

3.4.2. Pooling Module

The set of all interaction vectors and embedding vectors from the Interaction Module is then passed to the Pooling Module. The goal of this module is to aggregate variable-sized crowd information into a single, fixed-length context vector, c, utilizing an attention mechanism.
As part of this module, we incorporate a self-attention mechanism to model the importance of each agent in the environment in relation to the ego robot. This module enhances social awareness because the nearest other robot or human is not necessarily the most important in terms of navigation, especially if its direction and speed are not in conflict with those of the ego robot. The obtained interaction vectors are then pooled and operated on to obtain the attention scores $\alpha_h^i$ and $\alpha_{or}^i$ for the humans and other robots, respectively. They are obtained as follows:

$e_m = \frac{1}{n} \sum_{k=1}^{n} e_{or}^k + \frac{1}{n} \sum_{k=1}^{n} e_h^k$

$\alpha_{or}^i = \psi_\alpha(e_{or}^i, e_m; W_\alpha)$

$\alpha_h^i = \psi_\alpha(e_h^i, e_m; W_\alpha)$

where $W_\alpha$ represents the weights, $\psi_\alpha(\cdot)$ is a Multi-Layer Perceptron with Rectified Linear Unit activations, and $e_m$ is a fixed-length vector obtained by taking the mean of all pooled interaction embeddings.
First, the mean of all agent embeddings, $e_m$, is calculated (Equation (16)); this provides a general summary of the crowd. This mean embedding is then concatenated with each individual agent's embedding ($e_{or}^i$ or $e_h^i$) and fed into the attention MLP $\psi_\alpha(\cdot)$ (Equations (17) and (18)). The output is a scalar attention score ($\alpha_{or}^i$ or $\alpha_h^i$) for each agent, representing its "importance" to the ego robot's current decision. Finally, as shown in the diagram, these scores are normalized via a softmax function and used to compute a weighted sum of the interaction feature vectors ($h_{or}^i$, $h_h^i$). This operation, shown in Equation (19), produces the final crowd representation vector, c, which gives more weight to the agents deemed more important by the attention mechanism.
In the final stage of crowd representation, we employ the previously obtained pairwise interaction vectors and their accompanying attention scores to obtain a linear weighted sum of all the pairs, which makes up the crowd representation:

$c = \sum_{i=1}^{n} \mathrm{softmax}(\alpha_{or}^i)\, h_{or}^i + \sum_{i=1}^{n} \mathrm{softmax}(\alpha_h^i)\, h_h^i$
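The pooling step can be sketched in PyTorch as below; whether the softmax normalization is applied jointly over all agents (as here) or separately per agent type, and the hidden sizes, are assumptions:

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Attention pooling following Equations (16)-(19): the mean embedding e_m
    summarizes the crowd, each agent's embedding concatenated with e_m is
    scored by psi_alpha, and softmax-normalized scores weight the interaction
    features h to produce the fixed-length crowd vector c."""
    def __init__(self, embed_dim=100, hidden=100):
        super().__init__()
        self.psi_alpha = nn.Sequential(nn.Linear(2 * embed_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, 1))

    def forward(self, e, h):
        # e: (N, embed_dim) embeddings of all humans and other robots
        # h: (N, feat_dim) corresponding interaction features
        e_m = e.mean(dim=0, keepdim=True).expand_as(e)      # crowd summary
        alpha = self.psi_alpha(torch.cat([e, e_m], dim=1))  # (N, 1) scores
        weights = torch.softmax(alpha, dim=0)
        c = (weights * h).sum(dim=0)                        # weighted sum -> c
        return c, weights.squeeze(-1)
```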

3.4.3. Planning Module

This module estimates the value of the current state by computing the value of the planning state, $v$. It builds on the previously established representation of the environment and crowd dynamics and is given by

$v = f_v(s^{er}, c; W_v)$

where $f_v(\cdot)$ is a Multilayer Perceptron with ReLU activations, with weights denoted by $W_v$. In the final Planning Module, the fixed-length crowd vector $c$ (from the Pooling Module) is concatenated with the ego robot's state $s^{er}$. This combined vector, which encapsulates the ego robot's state and its full understanding of the surrounding crowd, is fed into the value network $f_v(\cdot)$ (Equation (20)). The module's single output, $v$, is the estimated value of being in the current state, which is then used by the DRL algorithm to select the optimal action.
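A corresponding value-head sketch is shown below; the hidden sizes again stand in for those reported in Table 2:

```python
import torch
import torch.nn as nn

class PlanningModule(nn.Module):
    """Value head following Equation (20): the ego-robot state s_er is
    concatenated with the crowd vector c and passed through an MLP f_v with
    ReLU activations to produce the scalar state value v."""
    def __init__(self, ego_dim, crowd_dim=50, hidden=(150, 100, 100)):
        super().__init__()
        dims = [ego_dim + crowd_dim, *hidden]
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.ReLU()]
        layers.append(nn.Linear(dims[-1], 1))
        self.f_v = nn.Sequential(*layers)

    def forward(self, s_er, c):
        return self.f_v(torch.cat([s_er, c], dim=-1))  # estimated V(s)
```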

4. Results

4.1. Training Setup

We initialized training of our framework with Imitation Learning (IL) as a gateway to deep reinforcement learning (DRL), enabling stable policy learning from the outset. We first generated a dataset of expert demonstrations by simulating 2000 navigation episodes in which an expert agent following the ORCA policy navigates the environment. These 2000 episodes were simulated using the “square crossing” environment configuration with 5 humans and 2 other robots, the same configuration used for the subsequent DRL training phase. Each episode consisted of a complete run from the starting position to the goal position, and we collected state-action pairs from all successful trajectories. The generated trajectories served as expert data for a supervised learning phase, with the training objective being to reduce cross-entropy loss between the policy’s predicted actions and the expert’s (ORCA’s) actions across all states. Using this method, the ego robot’s initial policy learned to mimic the ORCA expert’s conservative, reciprocal actions.
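A minimal version of this supervised pretraining loop is sketched below; `policy_net` (producing logits over the 80 discrete actions) and `demo_loader` (batches of joint states paired with ORCA action indices) are assumed interfaces rather than the released implementation:

```python
import torch
import torch.nn as nn

def pretrain_with_imitation(policy_net, demo_loader, epochs=50, lr=1e-3):
    """Supervised IL phase: minimize the cross-entropy between the policy's
    predicted action distribution and the ORCA expert's action index."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(policy_net.parameters(), lr=lr)
    for epoch in range(epochs):
        total = 0.0
        for joint_state, expert_action in demo_loader:
            logits = policy_net(joint_state)           # (batch, 80)
            loss = criterion(logits, expert_action)    # expert = ORCA action id
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item()
        print(f"IL epoch {epoch}: mean loss {total / len(demo_loader):.4f}")
```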
In the second phase of DRL training, the trajectories were sampled in batches, and the rewards were calculated using the reward function described in Section 3. Policy updates were performed with the Adam optimizer, and the discount factor was selected to balance short-term collision avoidance against long-term goal-reaching. An exploration factor that decayed from 0.5 to 0.1 was introduced to encourage gradual convergence to stable, socially aware behavior.
The hyperparameters summarized in Table 2 were selected based on a survey of values from the literature and fine-tuning to identify those with the best performance. The learning rate of 0.001 provided gradual convergence, and the batch size of 100 balanced the trade-off between computational cost and the stability of gradient estimates. We chose a discount factor of 0.9 to promote long-term planning and the valuing of future rewards. The hidden units listed in Table 2 define the architecture of the multilayer perceptrons used in the different modules of the framework; these sizes were selected to balance representational capacity and training stability. The number of training episodes for the reinforcement learning phase was chosen to obtain stable training in the fewest episodes; hence, our stopping condition during training was reaching a maximum of 6000 episodes, which guarantees convergence. The hardware used was an AMD Ryzen CPU with 64 GB RAM, and training took approximately 17 h. In the training phases of the framework, we used a setup of 5 humans and 2 other robots. We tested the same configuration, as well as others, in the various scenarios applied in this work.
The details of the implementation of this work, including the training parameters, configuration parameters, and environment parameters, are presented in Table 2. The DRL algorithm was implemented in Python using the PyTorch library [33] and OpenAI Gym [34] and trained using the Adam optimizer [35]. We use a square-crossing simulation setup for both training and testing the navigation algorithm. In this scenario, all agents except the ego robot are randomly initialized in either the left or right half of the environment, and their goal positions are randomly selected on the opposite side of the start position. For the ego robot, its start and goal positions are predefined. To ensure clarity, Table 3 provides a unified training protocol table that brings together the entire training setup, detailing the hyperparameters, imitation learning configuration, and hardware specifications. The IL-initialized and RL-only models were trained under the same conditions.

4.2. Qualitative Results

To test the effectiveness of the developed algorithm and its generalization capabilities, we evaluated it in multiple simulation scenarios, which include a varying number of humans and other robots and an environment with static obstacles.

4.2.1. Open Space Scenario

In this scenario, we simulate an open environment with no obstacles. The scenario involves the ego robot and a combination of humans and other robots, each with its own goal. The ego robot is initialized at position (0, −4), and its target is set at (0, 4), which it must navigate to in a socially aware manner in the presence of the humans and other robots. We test this scenario across a diverse range of human and other-robot setups to show the generalization ability of the developed framework. We present this in the following subsections.
(a) 5 Humans and 2 Other Robots
We tested the trained algorithm in a scenario of 5 Humans and 2 other robots as shown in Figure 4. The ego robot is initialized with its own self state vector which encodes its position, preferred velocity, and heading angle. The action that is selected as the outcome of the algorithm is a velocity and heading command. A scalar reward value is obtained based on the reward function, and the next state is observed after the action has been taken to continue the loop.
From the onset (t = 0 s), the ego robot identifies a feasible trajectory that avoids early collisions with humans or other robots. As the simulation progresses (t = 3.25 s to t = 4.75 s), the ego robot ventures further into the crowded area, slows down, and allows humans 4 and 1 to pass at t = 5.25 s and t = 7 s, respectively. The ego robot maintains a safe space and utilizes the attention scores over the agents to prioritize the most relevant ones. In the final timestep (t = 9 s), the ego robot completes its path to the goal by efficiently passing between two other robots while preserving smooth and collision-free navigation. As seen in Figure 4, the attention score of each agent is computed at each timestep to inform the ego robot of the most relevant agent to it in navigational terms. At t = 4.75 s, that agent is Human 3, and at t = 5.25 s, that agent is Human 4.
(b) 10 Humans and 3 Other Robots
Figure 5 shows the scenario of 10 humans with 3 other robots. This scenario is a more densely populated and complex environment. The density of both humans and other robots imposes tighter constraints on the available free space.
As the ego robot begins to advance ( t = 2.5 s), it dynamically adjusts its trajectory, and by t = 3.75 s to t = 5.75 s, it can be seen socially avoiding humans 3, 4, and 8, as well as responding to the trajectory of another robot approaching it head-on. Between t = 5.75 s to t = 8.5 s, the robot has moved between agents to reach its goal. Figure 6 illustrates the spatial dynamics and temporal evolution of all agents throughout this navigation scenario, involving 10 humans and 3 other robots.

4.2.2. Static Obstacles Scenario

We test the effectiveness of the developed navigation framework in an environment with static obstacles, which more realistically depicts natural human environments. In this static obstacle scenario, shown in Figure 7, the effectiveness of the ego robot's socially aware navigation strategy is evaluated in a setting with 5 humans, 2 other robots, and several static obstacles (grey squares). This navigation case illustrates the ego robot's decision-making process as it traverses a complex environment while maintaining safe and socially acceptable distances from both moving agents and static objects.
At the onset, the ego robot begins reasoning about obstacles to avoid and starts anticipating the future positions of nearby humans and other robots. As the scenario develops ( t = 3.75 s to t = 4.75 s), the ego robot actively moves towards the denser part of the environment while taking precautions to venture inwards only at the right time. The successful social navigation of the ego robot is evident as it respects comfort zones while moving through narrow passages between moving humans and other robots ( t = 6.25 s to t = 7 s). By the final shown time step ( t = 8.5 s), the ego robot has approached the goal position successfully, with a trajectory that reflects a balance between comfort and efficiency.

4.3. Quantitative Results and Analysis

In this section, we present, analyze, and discuss the quantitative results obtained in this work. We first present our primary quantitative benchmark for training, which was conducted with the same training configuration (5 humans and 2 other robots) as the compared baseline methods. Following this, we present qualitative analyses from more complex, unseen scenarios (a 10-human, 3-robot scenario and a static obstacle scenario) to visually demonstrate the generalization capabilities and robustness of our trained policy. Table 4 presents this comparative evaluation of the developed navigation framework against HeR-DRL [32], which encodes and differentiates between different interaction types, Homogeneous Relational DRL (HoR-DRL) [32], which encodes all interactions homogeneously regardless of agent type, SARL [18], LSTM-RL [15], and ST2 [19]. We evaluated our work using the following essential metrics:
  • Success rate: the percentage of episodes in which the ego robot arrives at the designated target location within the allotted time frame without collisions.
  • Collision rate: the percentage of episodes that ended in a collision with the other agents in the environment.
  • Navigation time: the average time for successful navigation cases.
  • Discomfort rate: the percentage of episodes in which the ego robot enters the discomfort region of humans, defined by the 0.2 m threshold.
  • Average minimum separation distance: the average of the minimum separation distance to a human recorded across all successful test episodes in which the ego robot entered the discomfort zone of a human.
Our developed method achieves a success rate of 89%, which, although slightly lower than that of most of the other frameworks, is acceptable considering that our approach is more conservative in order to prioritize safety and collision avoidance. A key difference is that our proposed approach achieves a near-zero collision rate, highlighting its superior safety behavior; in contrast, all the other methods exhibit significantly higher collision rates. Regarding navigation time, this work achieves an average of 12.13 s, slightly longer than the other compared methods. This increase reflects how conservative the ego robot is in avoiding both humans and other robots. However, the trade-off results in a higher minimum separation distance of 0.16 m, surpassing the values of all previous methods. This indicates enhanced social compliance and a lower likelihood of discomfort among surrounding agents. Additionally, this work achieves a lower discomfort rate, which indicates that the ego robot enters the discomfort zones of humans less frequently than other methods. Another significant finding is the computational efficiency of our framework compared to previous works: while other methods required approximately 10,000 training episodes to converge to an optimal policy, our framework converged at around 6000 training episodes.
The superior discomfort rate and low collision rate are directly attributable to the joint impact of our reward function design and the use of the attention mechanism. Unlike HeR-DRL [32], which does not use an attention mechanism, and LSTM-RL [15], which processes agents sequentially, often sorted by distance, our attention mechanism learns to assign importance scores to all agents in parallel, as shown in Figure 4, so that the ego robot focuses on the most relevant agents and achieves better performance. An inherent advantage of our model structure and interaction modeling is that it is more robust and ego-centric. We only model human–human interactions via local maps and do not attempt to model them with the same high-fidelity GNN structure as in other methods, such as HeR-DRL. Thus, our attention mechanism is optimized to prioritize agents most relevant to the ego robot. This approach is a key factor in our superior performance in terms of the previously mentioned metrics. Overall, while the proposed method has slight trade-offs in success rate and navigation time, it significantly improves safety and social compliance. This makes it particularly suitable for real-world scenarios where collisions are unacceptable and socially aware behavior is required in the presence of multiple dynamic agents. Our framework provides two distinct levels of social awareness. The first is at a categorical level, where the model learns to treat agent types differently: our reward function applies a discomfort penalty only to humans, so the ego robot is socially aware only around humans while being only collision-aware around other robots. The second is at a dynamic level, powered by the attention mechanism, which distinguishes the agents to be prioritized from among the rest.
From Figure 8, we can see that the success rate improves from approximately 30% at the beginning to around 90% by the end of training, indicating that the agent is learning. The occasional dips reflect exploration or unstable policy updates, which are common during RL training. For example, the dip in success rate just before the 4000-episode mark reflects the inherent exploration/exploitation trade-off: as the exploration rate (which decays over the first 4000 episodes) encourages the agent to test new, potentially suboptimal strategies, a temporary drop in performance occurs before the policy corrects itself and converges on a more robust behavior. Correspondingly, the collision rate decreases sharply from an initial value of over 60% to nearly 0%. This directly indicates that the policy is learning to avoid humans and other robots in its path. The steep decline before episode 3000 implies early gains in safety behavior, likely guided by imitation learning and reward shaping. A downward trend in the time taken to reach the goal shows that the ego robot is learning to avoid collisions and doing so with increasing efficiency. The time drops from around 23 s to under 15 s, which suggests improved path planning and better interaction modeling. The steadily increasing cumulative reward confirms that the learning objective is being met: it is initially negative due to frequent collisions and penalties, then transitions to positive values as the agent learns to maximize returns by reaching the goal quickly and safely. The saturation of the reward curve near the end indicates policy convergence. To reduce short-term variance, we plot each training curve as a 200-episode running mean. The shaded area within each curve depicts pointwise 95% confidence intervals around that running mean, computed from the same window as the mean ± 1.96 times the standard error. These intervals show the variability within the run (across episodes within each window) for our training seed.
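For reference, the smoothing and confidence band described here can be reproduced with a short routine like the one below (a sketch of the plotting computation, not the authors' script):

```python
import numpy as np

def running_mean_ci(values, window=200, z=1.96):
    """Smooth a per-episode metric with a trailing running mean and return the
    pointwise 95% band: mean +/- z * SEM computed over the same window."""
    values = np.asarray(values, dtype=float)
    means, lows, highs = [], [], []
    for i in range(len(values)):
        w = values[max(0, i - window + 1): i + 1]
        m = w.mean()
        sem = w.std(ddof=1) / np.sqrt(len(w)) if len(w) > 1 else 0.0
        means.append(m)
        lows.append(m - z * sem)
        highs.append(m + z * sem)
    return np.array(means), np.array(lows), np.array(highs)
```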
Table 5 summarizes our framework's performance over periodic validation checkpoints. This table highlights that training is stable and that the improvements in success, safety, and efficiency are consistent rather than coincidental. The training performance metrics are computed over validation rollouts at each checkpoint and then aggregated to report the mean and the corresponding 95% confidence interval, as shown in Table 6. The 95% CI bounds (indicated as Lower 95% CI and Upper 95% CI) indicate the extent to which these metrics vary across checkpoints within this training run, thereby providing a measure of the policy's stability as learning progresses.
We performed a comparative experiment between two models of our framework: one trained exclusively through Deep Reinforcement Learning (DRL), and another initialized using imitation learning with ORCA-generated demonstrations before DRL, thus quantitatively evaluating the contribution of imitation learning (IL) to the training process. This was done under the same environments, reward systems, and training duration (6000 episodes). Figure 9 shows the training reward over episodes for both settings. With a steeper reward gradient and a higher final reward, the policy initialized with IL surpasses the DRL-only policy throughout all training episodes. The graphs are shown with a ±1 standard deviation band. The IL-based approach reduces the trial-and-error burden of RL, hence improving sample efficiency, stability, and performance. The first phase of the training, which was conducted using imitation learning only, achieved a success rate of 99% and a navigation time of 7.75 s. This high performance is not surprising given that imitation learning replicates expert demonstrations, in this case the ORCA policy. This is particularly important in environments with multiple agents, where unsafe behavior could lead to collisions or even harm. These findings demonstrate the impact of imitation learning in steering the training of the agent towards safer and more optimal behavior.
An important aspect, highlighted in part in Table 4, is the computational complexity of the framework. Our framework's architecture, which processes each agent through an interaction module and then aggregates them using an attention-based pooling module, scales linearly, O(N), with the number of agents N, a key feature for real-world scalability. This O(N) complexity is comparable to methods like SARL [18,32] and offers a significant advantage over models that require full self-attention across all agents (e.g., ST2 [19]), which scale quadratically, O(N²). Our approach also demonstrates good training stability. The learning curves in Figure 8 show a smooth and stable increase in cumulative reward and success rate, with a corresponding stable decrease in collision rate and navigation time. Our framework's stability is further enhanced by Imitation Learning, which, as shown in Figure 9, provides a stable reward profile from the very beginning of training and ensures that the DRL training phase starts from a stable, non-random state. This combination of architectural efficiency and IL-based initialization with the non-socially-aware ORCA policy enables our model to converge to an optimal policy in 6000 episodes, significantly fewer than the 10,000–15,000 episodes reported in previous works.

5. Conclusions

In this work, we demonstrate that safe and efficient socially aware robot navigation can be achieved using a single-agent DRL framework augmented with Imitation Learning for initialization. We have explicitly modeled the interactions between the ego robot, other robots, and humans in the environment. This enables the framework to extract socially relevant knowledge that supports different levels of social awareness, resulting in a conservative policy with good generalization capability. The framework achieves enhanced socially aware metrics, such as the average minimum separation distance within a discomfort region and a reduced discomfort frequency, when compared to similar works in the literature, in addition to a lower collision rate. The trade-off in our approach is that the conservativeness of the policy and the prioritization of safety result in a lower success rate and a longer average navigation time.
We have also demonstrated the impact and benefits of using Imitation Learning for the initialization of the agent's training. The developed framework's performance was benchmarked quantitatively in our training configuration (5 humans and 2 robots) and evaluated qualitatively in more complex scenarios, including a denser crowd (10 humans, 3 robots) and an environment with static obstacles, showcasing its adaptability to scenarios of higher complexity than it was trained on. While this work focused on validating the proposed architecture under standardized, fully observable, holonomic simulation settings to ensure fair comparison with prior works, we recognize that real-world navigation involves sensing noise, occlusion, latency, and nonholonomic motion constraints. Extending the framework to evaluate performance degradation under these uncertainties and conducting detailed causal ablation analyses of the attention mechanism are critical next steps toward robust, explainable, and deployable socially aware navigation. We acknowledge that the 2D simulation is a simplification and that a full empirical demonstration of scalability to very dense, complex real-world crowds remains an open area of research. Therefore, as a critical next step, we recommend deploying the policy on real physical robots to verify its feasibility and bridge the 'sim-to-real' gap in 3D scenarios, while also extending the framework to multi-agent RL to test cooperative and adversarial multirobot SAN scenarios.
The potential challenges that may arise in real-world deployments can be solved by developing robust perception pipelines that accurately perform state estimation from noisy data, employing domain randomization to enhance robustness to system latency, and training the policy with non-holonomic kinematics. Likewise, employing a Centralized Training with Decentralized Execution (CTDE) MARL paradigm would enable interaction between multiple robots that are simultaneously learning different policies. Additionally, future work will include a comprehensive quantitative benchmark (including multiple seeds analysis, Confidence Intervals, and significance tests) in mixed static and dynamic environments while proving explainability through causal ablation tests, building on the qualitative generalization test presented in this paper, and also investigating the application of different human behavior models and robots with varying policies.

Author Contributions

Conceptualization, I.K.K. and M.F.M.; methodology, I.K.K., M.F.M. and Y.I.O.; software, I.K.K.; formal analysis, I.K.K., M.F.M. and A.N.; investigation, I.K.K., M.F.M., Y.I.O. and A.N.; resources, M.F.M.; writing—original draft preparation, I.K.K.; writing—review and editing, I.K.K., M.F.M. and Y.I.O.; visualization, I.K.K.; supervision, M.F.M.; funding acquisition, M.F.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Interdisciplinary Research Center for Smart Mobility and Logistics, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Acknowledgments

The authors gratefully acknowledge the support of King Fahd University of Petroleum & Minerals (KFUPM), Saudi Arabia, through the Interdisciplinary Research Center for Smart Mobility and Logistics.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ivanov, S.; Gretzel, U.; Berezina, K.; Sigala, M.; Webster, C. Progress on robotics in hospitality and tourism: A review of the literature. J. Hosp. Tour. Technol. 2019, 10, 489–521. [Google Scholar] [CrossRef]
  2. Daza, M.; Barrios-Aranibar, D.; Diaz-Amado, J.; Cardinale, Y.; Vilasboas, J. An Approach of Social Navigation Based on Proxemics for Crowded Environments of Humans and Robots. Micromachines 2021, 12, 193. [Google Scholar] [CrossRef] [PubMed]
  3. Samarakoon, S.M.B.P.; Muthugala, M.A.V.J.; Jayasekara, A.G.B.P. A Review on Human–Robot Proxemics. Electronics 2022, 11, 2490. [Google Scholar] [CrossRef]
  4. Kruse, T.; Pandey, A.K.; Alami, R.; Kirsch, A. Human-aware robot navigation: A survey. Robot. Auton. Syst. 2013, 61, 1726–1743. [Google Scholar] [CrossRef]
  5. Guillén-Ruiz, S.; Bandera, J.P.; Hidalgo-Paniagua, A.; Bandera, A. Evolution of socially-aware robot navigation. Electronics 2023, 12, 1570. [Google Scholar] [CrossRef]
  6. Chen, Y.; Zhao, F.; Lou, Y. Interactive model predictive control for robot navigation in dense crowds. IEEE Trans. Syst. Man, Cybern. Syst. 2021, 52, 2289–2301. [Google Scholar] [CrossRef]
  7. Fiorini, P.; Shiller, Z. Motion planning in dynamic environments using velocity obstacles. Int. J. Robot. Res. 1998, 17, 760–772. [Google Scholar] [CrossRef]
  8. Van Den Berg, J.; Guy, S.J.; Lin, M.; Manocha, D. Reciprocal n-body collision avoidance. In Robotics Research: The 14th International Symposium ISRR; Springer: Berlin/Heidelberg, Germany, 2011; pp. 3–19. [Google Scholar]
  9. Helbing, D.; Molnar, P. Social force model for pedestrian dynamics. Phys. Rev. E 1995, 51, 4282. [Google Scholar] [CrossRef]
  10. Aoude, G.S.; Luders, B.D.; Joseph, J.M.; Roy, N.; How, J.P. Probabilistically safe motion planning to avoid dynamic obstacles with uncertain motion patterns. Auton. Robot. 2013, 35, 51–76. [Google Scholar] [CrossRef]
  11. Svenstrup, M.; Bak, T.; Andersen, H.J. Trajectory planning for robots in dynamic human environments. In Proceedings of the 2010 IEEE/RSJ International Conference on Intelligent Robots and Systems, Taipei, Taiwan, 18–22 October 2010; pp. 4293–4298. [Google Scholar]
  12. Fulgenzi, C.; Tay, C.; Spalanzani, A.; Laugier, C. Probabilistic navigation in dynamic environment using rapidly-exploring random trees and gaussian processes. In Proceedings of the 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems, Nice, France, 22–26 September 2008; pp. 1056–1062. [Google Scholar]
  13. Chen, Y.F.; Liu, M.; Everett, M.; How, J.P. Decentralized non-communicating multiagent collision avoidance with deep reinforcement learning. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 285–292. [Google Scholar]
  14. Chen, Y.F.; Everett, M.; Liu, M.; How, J.P. Socially aware motion planning with deep reinforcement learning. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 1343–1350. [Google Scholar]
  15. Everett, M.; Chen, Y.F.; How, J.P. Motion planning among dynamic, decision-making agents with deep reinforcement learning. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 3052–3059. [Google Scholar]
  16. Chen, C.; Hu, S.; Nikdel, P.; Mori, G.; Savva, M. Relational graph learning for crowd navigation. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 10007–10013. [Google Scholar]
  17. Zhou, Z.; Zhu, P.; Zeng, Z.; Xiao, J.; Lu, H.; Zhou, Z. Robot navigation in a crowd by integrating deep reinforcement learning and online planning. Appl. Intell. 2022, 52, 15600–15616. [Google Scholar] [CrossRef]
  18. Chen, C.; Liu, Y.; Kreiss, S.; Alahi, A. Crowd-robot interaction: Crowd-aware robot navigation with attention-based deep reinforcement learning. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 6015–6022. [Google Scholar]
  19. Yang, Y.; Jiang, J.; Zhang, J.; Huang, J.; Gao, M. ST2: Spatial-temporal state transformer for crowd-aware autonomous navigation. IEEE Robot. Autom. Lett. 2023, 8, 912–919. [Google Scholar] [CrossRef]
  20. Tai, L.; Zhang, J.; Liu, M.; Burgard, W. Socially compliant navigation through raw depth inputs with generative adversarial imitation learning. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 1111–1117. [Google Scholar]
  21. Pfeiffer, M.; Schaeuble, M.; Nieto, J.; Siegwart, R.; Cadena, C. From perception to decision: A data-driven approach to end-to-end motion planning for autonomous ground robots. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 1527–1533. [Google Scholar]
  22. Chandra, R.; Zinage, V.; Bakolas, E.; Biswas, J.; Stone, P. Decentralized multi-robot social navigation in constrained environments via game-theoretic control barrier functions. arXiv 2023, arXiv:2308.10966. [Google Scholar]
  23. Escudie, E.; Matignon, L.; Saraydaryan, J. Attention graph for multi-robot social navigation with deep reinforcement learning. arXiv 2024, arXiv:2401.17914. [Google Scholar] [CrossRef]
  24. Wang, W.; Mao, L.; Wang, R.; Min, B.C. Multi-robot cooperative socially-aware navigation using multi-agent reinforcement learning. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 12353–12360. [Google Scholar]
  25. Albrecht, S.V.; Christianos, F.; Schäfer, L. Multi-Agent Reinforcement Learning: Foundations and Modern Approaches; MIT Press: Cambridge, MA, USA, 2024. [Google Scholar]
  26. Liu, L.; Dugas, D.; Cesari, G.; Siegwart, R.; Dubé, R. Robot navigation in crowded environments using deep reinforcement learning. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 5671–5677. [Google Scholar]
  27. Chen, Y.; Liu, C.; Shi, B.E.; Liu, M. Robot navigation in crowds by graph convolutional networks with attention learned from human gaze. IEEE Robot. Autom. Lett. 2020, 5, 2754–2761. [Google Scholar] [CrossRef]
  28. Samsani, S.S.; Muhammad, M.S. Socially compliant robot navigation in crowded environment by human behavior resemblance using deep reinforcement learning. IEEE Robot. Autom. Lett. 2021, 6, 5223–5230. [Google Scholar] [CrossRef]
  29. Dong, L.; He, Z.; Song, C.; Yuan, X.; Zhang, H. Multi-robot social-aware cooperative planning in pedestrian environments using attention-based actor-critic. Artif. Intell. Rev. 2024, 57, 108. [Google Scholar] [CrossRef]
  30. Wang, W.; Bera, A.; Min, B.C. Hyper-SAMARL: Hypergraph-based Coordinated Task Allocation and Socially-aware Navigation for Multi-Robot Systems. arXiv 2024, arXiv:2409.11561. [Google Scholar]
  31. Song, C.; He, Z.; Dong, L. A local-and-global attention reinforcement learning algorithm for multiagent cooperative navigation. IEEE Trans. Neural Netw. Learn. Syst. 2022, 35, 7767–7777. [Google Scholar] [CrossRef]
  32. Zhou, X.; Piao, S.; Chi, W.; Chen, L.; Li, W. HeR-DRL: Heterogeneous Relational Deep Reinforcement Learning for Single-Robot and Multi-Robot Crowd Navigation. IEEE Robot. Autom. Lett. 2025, 10, 4524–4531. [Google Scholar] [CrossRef]
  33. Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; Lerer, A. Automatic differentiation in PyTorch. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  34. Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. OpenAI Gym. arXiv 2016, arXiv:1606.01540. [Google Scholar] [CrossRef]
  35. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
Figure 1. Categories of agents and their interaction relationship.
Figure 2. System architecture of the ego robot learning framework.
Figure 3. Creation of the Local Map Tensor.
Figure 4. Simulation scenario with 5 humans and 2 other robots at time steps of 0.25 s, 3.25 s, 4.75 s, 5.25 s, 7 s, and 9 s.
Figure 5. Simulation scenario with 10 humans and 3 other robots at time steps of 0 s, 2.5 s, 3.75 s, 4.5 s, 5.75 s, and 8.5 s.
Figure 6. Trajectory plot of the 10 humans and 3 other robots scenario.
Figure 7. Simulation scenario with static obstacles, 5 humans, and 2 other robots at time steps of 0.5 s, 3.75 s, 4.75 s, 6.25 s, 7 s, and 8.5 s.
Figure 8. Performance metrics of the navigation framework over training episodes in an open-space environment with 5 humans and 2 other robots: (a) success rate, (b) collision rate, (c) ego robot’s navigation time to goal, and (d) cumulative discounted reward.
Figure 9. Reward for the DRL-only and IL-with-DRL models, with ±1 standard deviation.
Table 1. Agent interaction behavior matrix.
Agent | In Presence of Human | In Presence of Ego Robot | In Presence of Other Robot
Human | Reciprocal collision avoidance | Reciprocal collision avoidance | Reciprocal collision avoidance
Ego Robot | Implements learned DRL policy with social awareness | - | Reciprocal collision avoidance (no social awareness)
Other Robot | Reciprocal collision avoidance with larger safety margin | Reciprocal collision avoidance (no social awareness) | Reciprocal collision avoidance
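The matrix in Table 1 can be read as a lookup from (acting agent, neighbor type) to a behavior. The snippet below is a literal transcription of the table into Python, intended only as a reading aid: the enum, dictionary, and string labels are illustrative, and in simulation the reciprocal-collision-avoidance entries would typically be realized by an ORCA-style planner [8] rather than the placeholder strings used here.

```python
from enum import Enum, auto

class Agent(Enum):
    HUMAN = auto()
    EGO_ROBOT = auto()
    OTHER_ROBOT = auto()

# Direct transcription of Table 1: BEHAVIOR[acting agent][neighbor type].
RCA = "reciprocal collision avoidance"
BEHAVIOR = {
    Agent.HUMAN: {Agent.HUMAN: RCA, Agent.EGO_ROBOT: RCA, Agent.OTHER_ROBOT: RCA},
    Agent.EGO_ROBOT: {
        Agent.HUMAN: "learned DRL policy with social awareness",
        Agent.EGO_ROBOT: None,  # single ego robot; no self-interaction
        Agent.OTHER_ROBOT: RCA + " (no social awareness)",
    },
    Agent.OTHER_ROBOT: {
        Agent.HUMAN: RCA + " with larger safety margin",
        Agent.EGO_ROBOT: RCA + " (no social awareness)",
        Agent.OTHER_ROBOT: RCA,
    },
}

print(BEHAVIOR[Agent.OTHER_ROBOT][Agent.HUMAN])
```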
Table 2. Training and configuration parameters.
Parameter | Value
Preferred velocity | 1.0 m/s
Radius of all agents | 0.3 m
Discomfort distance for humans | 0.2 m
Hidden units of ϕ_e(·) | 150, 100
Hidden units of ψ_h(·) | 100, 50
Hidden units of ψ_α(·) | 100, 100
Hidden units of f_v(·) | 150, 100, 100
IL training episodes | 2000
IL epochs | 50
IL learning rate | 0.01
RL learning rate | 0.001
Discount factor | 0.9
Training batch size | 100
RL training episodes | 6000
Evaluation episodes | 1000
Exploration rate in first 4000 episodes | 0.5 to 0.1
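The hidden-unit sizes in Table 2 correspond to the embedding, interaction, attention, and value modules of the attention-based network. The PyTorch sketch below is a minimal, hypothetical reconstruction assuming a SARL-style wiring [18]: ϕ_e embeds each ego–agent pair, ψ_h extracts pairwise interaction features, ψ_α scores each pair against the mean embedding to produce attention weights, and f_v maps the ego state plus the attention-weighted crowd feature to a scalar value. The input dimensions (self_dim, agent_dim) and the trailing size-1 output layers appended to ψ_α and f_v are assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

def mlp(in_dim, dims, last_relu=False):
    """Stack Linear layers of the given sizes, with ReLU between them."""
    layers, sizes = [], [in_dim] + list(dims)
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2 or last_relu:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

class AttentionValueNetwork(nn.Module):
    """SARL-style attention value network using the hidden sizes of Table 2.

    self_dim / agent_dim are illustrative placeholders for the ego-state and
    per-agent observable-state dimensions.
    """
    def __init__(self, self_dim=6, agent_dim=7):
        super().__init__()
        self.self_dim = self_dim
        joint_dim = self_dim + agent_dim
        self.phi_e = mlp(joint_dim, [150, 100], last_relu=True)  # pairwise embedding
        self.psi_h = mlp(100, [100, 50])                         # pairwise interaction feature
        self.psi_alpha = mlp(100 * 2, [100, 100, 1])             # attention score per agent
        self.f_v = mlp(self_dim + 50, [150, 100, 100, 1])        # scalar state value

    def forward(self, state):
        # state: (batch, n_agents, self_dim + agent_dim); the ego portion is
        # repeated in every row so each ego–agent pair is embedded independently.
        b, n, d = state.shape
        self_state = state[:, 0, :self.self_dim]
        e = self.phi_e(state.reshape(b * n, d)).reshape(b, n, -1)
        h = self.psi_h(e.reshape(b * n, -1)).reshape(b, n, -1)
        e_mean = e.mean(dim=1, keepdim=True).expand_as(e)        # global mean embedding
        scores = self.psi_alpha(torch.cat([e, e_mean], dim=2).reshape(b * n, -1)).reshape(b, n)
        alpha = torch.softmax(scores, dim=1)                     # attention weights over agents
        crowd = (alpha.unsqueeze(2) * h).sum(dim=1)              # weighted crowd feature
        return self.f_v(torch.cat([self_state, crowd], dim=1))   # (batch, 1) state value

# Example: a batch of 8 joint states with 7 surrounding agents (5 humans + 2 robots).
net = AttentionValueNetwork()
value = net(torch.randn(8, 7, 13))
print(value.shape)  # torch.Size([8, 1])
```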
Table 3. Unified training-protocol table.
Component | Description
Imitation Learning (IL) | 2000 expert trajectories using ORCA-generated demonstrations (5 humans + 2 robots, square-crossing setup); trained for 50 epochs, learning rate 0.01, batch size 100.
Reinforcement Learning (RL) | 6000 training episodes, learning rate 0.001, discount factor 0.9, batch size 100, Adam optimizer.
Stopping rule | Fixed at 6000 episodes; convergence determined via success rate.
Random seeds | Single seed (consistent across IL and DRL-only runs).
Hardware | AMD Ryzen 9 5950X CPU, 64 GB RAM; total training time of 17 h including IL pretraining.
Environment | PyTorch + OpenAI Gym; 2D square-crossing scenario.
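Tables 2 and 3 together pin down most of the training protocol. The snippet below collects those numbers in one place and sketches the exploration schedule (0.5 → 0.1 over the first 4000 RL episodes) and the two Adam optimizers. The linear form of the decay and the stand-in model are assumptions for illustration; in practice the value network would be the AttentionValueNetwork sketched after Table 2.

```python
import torch
import torch.nn as nn

# Hyperparameters transcribed from Tables 2 and 3; everything else is illustrative.
IL_EPOCHS, IL_LR = 50, 0.01
RL_EPISODES, RL_LR, GAMMA, BATCH_SIZE = 6000, 0.001, 0.9, 100
EPS_START, EPS_END, EPS_DECAY_EPISODES = 0.5, 0.1, 4000

def epsilon(episode: int) -> float:
    """Exploration rate: decays from 0.5 to 0.1 over the first 4000 RL episodes,
    then stays at 0.1. The linear form of the decay is an assumption."""
    frac = min(episode / EPS_DECAY_EPISODES, 1.0)
    return EPS_START + (EPS_END - EPS_START) * frac

# Stand-in module so the snippet runs on its own; replace with the value network.
model = nn.Linear(4, 1)
il_optimizer = torch.optim.Adam(model.parameters(), lr=IL_LR)   # IL pretraining
rl_optimizer = torch.optim.Adam(model.parameters(), lr=RL_LR)   # RL fine-tuning

print([round(epsilon(e), 3) for e in (0, 2000, 4000, 6000)])  # [0.5, 0.3, 0.1, 0.1]
```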
Table 4. Performance comparison across key metrics.
Work | Success Rate (%) | Collision Rate (%) | Navigation Time (s) | Discomfort Rate (%) | Avg. Min. Separation Distance (m) | No. of Training Episodes
This work | 89.0 | 0.15 | 12.13 | 6 | 0.16 | 6000
HeR-DRL [32] | 96.54 | 3.14 | 10.88 | 6.9 | 0.154 | 15,000
HoR-DRL [32] | 96.05 | 3.06 | 10.91 | 7.8 | 0.15 | 10,000
LSTM-RL [15] | 85.52 | 5.49 | 11.52 | * | 0.147 | *
SARL [18] | 93.14 | 4.34 | 10.83 | * | 0.154 | 10,000
ST2 [19] | 96.46 | 2.99 | 11.08 | * | 0.149 | 1000
* Not reported.
Table 5. Validation checkpoints.
Episode | Success Rate (%) | Collision Rate (%) | Navigation Time (s) | Discomfort Rate (%) | Avg. Min. Separation Distance (m)
0 | 88 | 5 | 11.37 | 15 | 0.09
1000 | 79 | 2 | 12.28 | 10 | 0.12
2000 | 72 | 0 | 14.10 | 2 | 0.14
3000 | 83 | 1 | 13.48 | 3 | 0.15
4000 | 86 | 0 | 12.92 | 3 | 0.16
5000 | 87 | 2 | 12.01 | 4 | 0.15
Table 6. Summary of validation metrics with means and 95% confidence intervals.
Metric | Mean | Lower 95% CI | Upper 95% CI
Success Rate (%) | 82.50 | 77.63 | 87.37
Collision Rate (%) | 1.67 | 0.18 | 3.16
Navigation Time (s) | 12.693 | 11.890 | 13.497
Discomfort Rate (%) | 6.17 | 2.01 | 10.32
Avg. Min. Separation Distance (m) | 0.1350 | 0.1143 | 0.1557
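The intervals in Table 6 appear consistent with a normal-approximation 95% confidence interval (mean ± 1.96 × standard error) computed over the six checkpoints of Table 5; whether the authors used exactly this estimator is an assumption. A minimal reproduction for the success-rate row is shown below.

```python
import math

# Success rates at the six validation checkpoints of Table 5, in percent.
success = [88, 79, 72, 83, 86, 87]

def normal_ci(samples, z=1.96):
    """Mean and normal-approximation 95% CI: mean +/- z * s / sqrt(n),
    with s the unbiased sample standard deviation."""
    n = len(samples)
    mean = sum(samples) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in samples) / (n - 1))
    half = z * s / math.sqrt(n)
    return round(mean, 2), round(mean - half, 2), round(mean + half, 2)

print(normal_ci(success))  # (82.5, 77.63, 87.37), matching the first row of Table 6
```

Applying the same helper to the remaining Table 5 columns reproduces the other rows of Table 6 to the reported precision.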