5.1. Environment Setting
To evaluate the effectiveness of the proposed methods, we conducted a series of scenario simulations and systematically compared the results with established algorithms such as A*, RRT, and RL methods (MDP (Markov Decision Process), DQN (Deep Q-network), DDQN (Double Deep Q-network)). Note that we only perform simulations at the current stage; real-world experiments have not yet been carried out. The simulations were performed on a high-performance computing system equipped with an Intel Core i9-10900 CPU running at 2.80 GHz and 32 GB of memory, ensuring sufficient computational resources for reliable assessment.
For all methods under evaluation, we adopted a consistent reward structure to maintain comparability in path-quality evaluation. Specifically, the reward parameters were set as follows:
a goal achievement reward, granted when the agent reaches its goal;
a step penalty, which incentivizes shorter paths;
a conflict penalty, imposed when the agent collides with a human.
The robot receives the goal achievement reward when it reaches the goal within the given time-step budget T. Throughout the navigation process, each step incurs the step penalty, discouraging unnecessary movements and promoting efficiency. Additionally, to simulate the environment shared with humans, any collision with a human results in the conflict penalty.
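For illustration only, a reward function with this structure could look like the following sketch; the numeric values are placeholders rather than the parameter settings used in the paper.

```python
# Placeholder reward parameters; the paper's actual values are not reproduced here.
R_GOAL = 1.0        # goal achievement reward
R_STEP = -0.01      # per-step penalty, incentivizing shorter paths
R_CONFLICT = -1.0   # penalty for colliding with a human

def step_reward(reached_goal: bool, collided_with_human: bool) -> float:
    """Reward received by the robot for a single time step."""
    reward = R_STEP                      # every step incurs the step penalty
    if collided_with_human:
        reward += R_CONFLICT             # collision with a human is penalized
    if reached_goal:
        reward += R_GOAL                 # reaching the goal within budget T is rewarded
    return reward
```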
The parameters of the multi-policy are unknown a priori. Table 1 presents the investigation results for several combinations of these parameters. We find that the performance differences among the combinations are minor, and it is difficult to determine the best parameter set. Therefore, a random number generator (RNG) is adopted to randomly assign the parameter values, subject to the constraints on their admissible ranges.
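As a sketch of this step, the parameters could be drawn by rejection sampling as below; `satisfies_constraints`, the number of parameters, and their ranges are illustrative placeholders, since the actual constraints are not reproduced here.

```python
import random

def satisfies_constraints(params):
    # Placeholder: stand-in for the actual constraints, which are not reproduced here.
    return all(0.0 <= p <= 1.0 for p in params)

def sample_multi_policy_params(rng: random.Random, n_params: int = 3):
    """Randomly assign multi-policy parameter values via rejection sampling."""
    while True:
        params = tuple(rng.random() for _ in range(n_params))
        if satisfies_constraints(params):
            return params

rng = random.Random(0)   # fixed seed for reproducibility
print(sample_multi_policy_params(rng))
```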
Table 1. Performances of different parameter values.
| Parameters | Average Conflict Number | Task Success Rate |
|---|---|---|
| , , | 0.04 ± 0.20 | 0.96 ± 0.20 |
| , , | 0.05 ± 0.26 | 0.96 ± 0.20 |
| , , | 0.04 ± 0.20 | 0.96 ± 0.20 |
| , , | 0.05 ± 0.22 | 0.95 ± 0.22 |
| , , | 0.04 ± 0.20 | 0.96 ± 0.20 |
| , , | 0.06 ± 0.24 | 0.94 ± 0.24 |
| , , | 0.04 ± 0.20 | 0.96 ± 0.20 |
| , , | 0.06 ± 0.24 | 0.94 ± 0.24 |
| randomly generated | 0.04 ± 0.20 | 0.96 ± 0.20 |
We verify the proposed methods in three scenarios:
Scenario 1: verification on a small size map;
Scenario 2: verification with different size maps;
Scenario 3: verification on increasing human numbers.
Figure 5 shows the configurations of the three maps, which are warehouse grid environments of increasing size. We run repeated simulations to estimate the human risks. The diversity threshold is set to 0.25.
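One way such human risks could be estimated from repeated simulations is sketched below; it assumes (our reading, not the paper's exact formulation) that the risk of a cell at a time step is the empirical frequency with which a simulated human occupies it, and `simulate_human` is a placeholder for the stochastic human model.

```python
import numpy as np

def estimate_human_risk(simulate_human, grid_size, horizon, n_runs=1000, seed=0):
    """Estimate per-cell, per-step human occupancy risk by repeated simulation.

    `simulate_human(rng)` is assumed to return a list of (x, y) cells,
    one per time step, for a single stochastic human trajectory.
    """
    rng = np.random.default_rng(seed)
    counts = np.zeros((horizon, grid_size, grid_size))
    for _ in range(n_runs):
        traj = simulate_human(rng)
        for t, (x, y) in enumerate(traj[:horizon]):
            counts[t, x, y] += 1
    return counts / n_runs   # empirical occupancy frequency used as the risk map
```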
The evaluation metrics of different algorithms vary depending on the specific problem and scenario being addressed. For the shortest path planning problem [24,25], the primary metric is typically the path length. In the case of multi-agent path planning [54,55], commonly used metrics include the makespan or the sum of costs of all paths. Moreover, the task success rate and path reward are two widely adopted metrics in the literature for assessing overall performance and path quality.
In this paper, we focus more on the safety of the path. Therefore, the performance of the proposed method is evaluated using the following metrics: the average conflict number, conflict distribution, task success rate, and path reward. To ensure statistical reliability, each scenario was simulated 100 times with a fixed set of random seeds, and the mean values of these metrics were calculated.
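The sketch below shows how these metrics could be aggregated over the 100 seeded simulations; `run_episode` and the metric names are placeholders for the evaluation pipeline described above.

```python
import statistics

def evaluate(run_episode, method, scenario, seeds=range(100)):
    """Average metrics over a fixed seed set.

    `run_episode(method, scenario, seed)` is assumed to return a dict with
    keys 'conflicts', 'success' (0 or 1), and 'reward' for one simulation.
    """
    results = [run_episode(method, scenario, seed) for seed in seeds]
    summary = {}
    for key in ("conflicts", "success", "reward"):
        values = [r[key] for r in results]
        summary[key] = (statistics.mean(values), statistics.pstdev(values))
    return summary   # e.g. {'conflicts': (0.04, 0.20), ...}
```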
5.2. Verification on Scenario 1: Human Number K = 1 and Small Size Map
This experiment was conducted in a 10 × 10 grid world (Figure 5a) to evaluate the performance of the proposed method against five baseline methods: A*, MDP, RRT, DQN, and DDQN. The agent starts from position (0, 0), and its goal position is (9, 9). A human moves stochastically from position (9, 0) to his/her goal position (0, 9). The agent must reach its goal within the given time budget T. The results are summarized in Figure 6 and Table 2.
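A minimal configuration sketch for this scenario is given below; the container and field names are ours, the grid size is assumed from the coordinates above, and the time budget value is a placeholder.

```python
from dataclasses import dataclass
from typing import List, Tuple

Cell = Tuple[int, int]

@dataclass
class Scenario:
    """Illustrative container for one benchmark scenario (field names are ours)."""
    grid_size: int
    agent_start: Cell
    agent_goal: Cell
    human_starts: List[Cell]
    human_goals: List[Cell]
    time_budget: int             # time-step budget T

# Scenario 1: one human on the small map (Figure 5a).
scenario_1 = Scenario(
    grid_size=10,                # assumed from the (0, 0) and (9, 9) coordinates
    agent_start=(0, 0),
    agent_goal=(9, 9),
    human_starts=[(9, 0)],
    human_goals=[(0, 9)],
    time_budget=20,              # placeholder; the paper's T is not reproduced here
)
```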
(1) Average conflict number.
Figure 6a shows the results for the average number of conflicts. With an average of only 0.04 ± 0.20 conflicts, our proposed method significantly outperforms all baselines: A* (0.22 ± 0.67), MDP (0.16 ± 0.48), RRT (0.25 ± 0.59), DQN (0.07 ± 0.26), and DDQN (0.05 ± 0.22). The low standard deviation further indicates that this performance is consistent and reliable. On this small map, our method is thus substantially better than A*, MDP, and RRT, and slightly better than DQN and DDQN.
(2) Conflict distribution.
Figure 6b illustrates the time-step distribution of conflicts that occurred during the simulations. Our proposed method, MP-RRT, demonstrates significantly superior performance: its first conflict occurs far later than those of the other methods, and its distribution is concentrated near zero conflicts. In contrast, the baseline methods (A*, MDP, RRT, DQN, and DDQN) show a much wider spread and a higher frequency of conflicts, which indicates that their strategies are less effective at predicting and avoiding the stochastically moving human. This shows a great advantage of the proposed method in proactively resolving future conflicts. As the time step increases, conflict resolution becomes more challenging due to the inherent uncertainty and dynamic nature of future human motion.
(3) Average task success rate.
Figure 6c presents the average task success rate, which measures the agent’s ability to reach its goal within the allotted time budget T. The proposed method achieves the highest success rate of 96% (standard deviation ± 0.20), surpassing the performance of A* (87%), MDP (88%), RRT (80%), DQN (93%), and DDQN (94%). This indicates that our method’s strategy for avoiding conflicts does not come at the cost of failing its primary objective; instead, it more effectively navigates the dynamic environment to reach the goal reliably.
(4) Average reward.
Figure 6d depicts the average path reward. Our method achieves the highest average reward of −1.98 ± 0.39, which is notably better than the rewards of A* (−2.34 ± 1.34), MDP (−2.23 ± 0.97), RRT (−2.40 ± 1.18), DQN (−2.04 ± 0.51), and DDQN (−2.00 ± 0.44). The average path reward thus shows that the path quality of our method is better than that of the other methods.
The combination of the highest success rate, the lowest conflict number, and the highest reward confirms that the proposed method offers a more robust and effective solution for safe path planning in a human-shared space.
5.3. Verification on Scenario 2: Different Size Maps
In Scenario 2, we evaluate the performance of the proposed algorithm on the two larger grid maps illustrated in Figure 5b,c. Only one human is considered in this scenario. A separate time budget is allocated to the agent for each map. The simulation results are summarized in Table 2.
(1) Results on the medium-size map (Figure 5b). As shown in Figure 7 (left) and Table 2, the proposed method achieves perfect performance in both task success rate (1.00 ± 0.00) and conflict avoidance (0.00 ± 0.00), significantly outperforming all baseline methods across all three metrics. The obtained reward of −3.91 ± 0.04 is also the highest among all approaches, with statistically significant differences observed in every comparative case. This indicates that the method is highly effective in medium-scale environments.
Figure 8 illustrates a set of diverse paths generated by our MP-RRT algorithm. As can be seen from the figure, every location on the map has been explored.
(2) Results on the 40 × 40 map (Figure 5c). In contrast, on this larger map (Figure 7 (right) and Table 2), the proposed method still attains the highest overall reward (−7.98 ± 0.39) and a strong task success rate (0.96 ± 0.20). In terms of conflict number, although the proposed method (0.04 ± 0.20) is slightly higher than DDQN (0.00 ± 0.00), the task success rate of DDQN drops to a low level (0.27 ± 0.44). In terms of task success rate, our method performs comparably to MDP and RRT, with no significant differences detected. These results suggest that, as environmental complexity increases, our method achieves the highest task success rate even though its conflict number is not always the minimum.
Table 2 indicates that our method achieves better performance on the medium-size map than on the 40 × 40 map. To verify this observation, we changed the human’s start and goal positions and conducted a new simulation, and compared the resulting average task success rate, average conflict number, and average reward with the results in Table 2. From this comparison, it is difficult to conclude that our algorithm performs better on one map than on the other. Nevertheless, it can be confirmed that the proposed method performs consistently well in both map environments.
Table 2 also shows that DQN and DDQN do not perform well as the map size increases. With larger grid maps, the performance of DQN and DDQN degrades due to the enlarged state space, sparse reward signals, and insufficient exploration, which hinder the convergence of value estimation. In contrast, model-based approaches such as MDP maintain stable performance owing to their explicit transition modeling and global optimization nature.
From Figure 6 and Figure 7, it can be observed that increasing the map size has little impact on our method, which consistently maintains a low number of conflicts and a high task success rate. This robustness arises from the use of the MP-RRT diverse path generator, which computes multiple diverse paths across the entire environment. As a result, our method is able to identify safe paths by considering the global environment rather than being restricted to local regions.
5.4. Verification on Scenario 3: Increasing Human Number K
In this setting, we evaluate the proposed method in a more challenging environment by increasing the number of humans K from 2 up to 10. The simulation uses a 40 × 40 grid world (Figure 5c), where the agent must travel from its start position to its goal position within the given time budget T. Human start and goal locations are randomly initialized; a representative instance uses start positions (30, 39), (39, 0), (4, 2), (1, 37), (29, 31), (2, 33), (14, 4), (21, 8), (30, 10), and (33, 12), with corresponding goal positions (15, 2), (0, 39), (35, 35), (32, 6), (6, 12), (30, 2), (38, 29), (5, 27), (6, 25), and (17, 35).
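Using the illustrative `Scenario` container sketched in Section 5.2, the representative K = 10 instance above can be written down as follows; the human coordinates are taken from the text, while the agent start/goal and the time budget value are placeholders, since they are not reproduced in this section.

```python
# Representative K = 10 instance on the 40 x 40 map (Figure 5c);
# reuses the illustrative Scenario container from Section 5.2.
scenario_3 = Scenario(
    grid_size=40,
    agent_start=(0, 0),          # placeholder; the paper's start is not reproduced here
    agent_goal=(39, 39),         # placeholder; the paper's goal is not reproduced here
    human_starts=[(30, 39), (39, 0), (4, 2), (1, 37), (29, 31),
                  (2, 33), (14, 4), (21, 8), (30, 10), (33, 12)],
    human_goals=[(15, 2), (0, 39), (35, 35), (32, 6), (6, 12),
                 (30, 2), (38, 29), (5, 27), (6, 25), (17, 35)],
    time_budget=100,             # placeholder; the paper's T is not reproduced here
)
```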
Figure 9 illustrates the simulation results across three key metrics, while Table 3 provides a detailed summary.
(1) Average conflict number. As shown in Figure 9a, all baseline methods exhibit a steady increase in collisions as the human number grows, with RRT suffering the most. In contrast, our method maintains consistently low conflict numbers across all scenarios.
(2) Average task success rate. For the task success rate (Figure 9b), the baselines experience a sharp decline as the environment becomes more crowded; the success rates of A*, MDP, and RRT all drop below 0.5 as K grows. By comparison, the proposed method sustains a high success rate above 0.8 even in the most challenging case with 10 humans.
(3) Average reward.
Figure 9c shows the results for the average reward. Our method consistently outperforms the baselines. Although rewards decrease slightly with larger K, the reduction is marginal compared with the steep declines observed for A*, MDP, and RRT. This indicates that our approach not only achieves higher success rates but also balances efficiency and safety better than the alternatives.
In Table 3, when the number of humans is K = 2, both the MDP method and the proposed method achieve the same average conflict number. This mainly results from two factors: (1) the 40 × 40 map is relatively sparse with only two humans, and (2) both methods adopt similar conflict resolution mechanisms based on soft constraints. The RL-based method learns an optimal path by penalizing conflicts in the reward function, while our method optimizes the risk level of each path candidate using Equation (14) under simulated risk constraints. With such a sparse human distribution and similar mechanisms, both methods yield comparable results. As K increases and human-induced risks become stronger, their performance differences become more distinct. As shown in Table 3, the performances of DQN and DDQN remain poor for the same reason discussed previously, which is consistent with the results in Table 2.
Figure 10 illustrates the conflict distribution for human numbers K ranging from 2 to 10 on the 40 × 40 grid map. The proposed method consistently demonstrates robust performance as K increases, maintaining a notably low conflict frequency across all scenarios. For smaller human numbers (e.g., K = 2 and K = 4), conflicts are exceptionally rare, with the majority of trials resulting in zero conflicts. As K increases to 6 and 8, there is a marginal increase in the occurrence of conflicts; that said, the distribution still exhibits a strong skew toward lower counts, an observation that underscores the effectiveness of conflict avoidance even as environmental complexity intensifies. Even at K = 10, where the environment is most crowded and dynamic, the proposed method reduces conflicts by 70.2%, 72.8%, and 73.8% compared with A*, MDP, and RRT, respectively. Furthermore, it achieves a significant improvement in the average task success rate, outperforming A* by 66.0%, MDP by 95.0%, and RRT by 85.7%. This contrast underscores the superiority of the proposed approach in handling scalability and uncertainty in multi-human environments. The results confirm that our proposed method effectively balances path efficiency and safety.
In highly dynamic environments, such as scenarios involving stochastically moving humans, depth-first search (DFS) methods struggle to handle the pathfinding problem efficiently. In contrast, a key advantage of our approach is the use of a breadth-first search (BFS) scheme to generate multiple diverse path candidates. The proposed multi-policy sampling strategy ensures that these candidates are distributed across the entire map, which classic RRT, A*, and RL methods cannot achieve, thereby enhancing the ability to cope with dynamic changes and making the algorithm more robust in dynamic environments. The safety of the generated path is ensured by evaluating the human risks with the stochastic evaluation module, so that we can classify dangerous areas and find an optimal path. Note that a safe path may not exist, since the humans can move anywhere; in such a case, our algorithm returns the path with the lowest risk, which is why a conflict still occasionally occurs in the results.
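To make this selection step concrete, the sketch below assumes (our assumption, not the paper's Equation (14)) that a path's risk level is the sum of the per-step occupancy risks estimated earlier; `candidates` stands for the diverse MP-RRT path candidates and `risk_map` for the array returned by the earlier `estimate_human_risk` sketch.

```python
def path_risk(path, risk_map):
    """Illustrative risk level: sum of per-step human-occupancy risks along the path.

    Assumes len(path) <= risk_map.shape[0] (the simulated time horizon).
    """
    return sum(risk_map[t, x, y] for t, (x, y) in enumerate(path))

def select_path(candidates, risk_map):
    """Return the candidate with the lowest risk level.

    If no strictly safe path exists (humans may move anywhere), this still
    returns the lowest-risk candidate, matching the fallback described above.
    """
    return min(candidates, key=lambda path: path_risk(path, risk_map))
```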
It should be noted that, out of safety considerations for both the robot and humans, we conducted only simulation experiments for method validation at the current stage; the proposed approach has not yet been tested in real-world environments. The simulation results demonstrate the efficiency of our algorithm in resolving conflicts at the planning level. However, planning only once still cannot guarantee that no conflict occurs. The conflict distribution reveals an interesting observation: in our algorithm, conflicts, when they occur, tend to arise much later than with traditional approaches. Our algorithm cannot theoretically guarantee the safety of the entire path: we only use soft constraints and calculate the risk level of each path candidate, and an optimal path is obtained by minimizing the risk level with Equation (14). Guaranteeing 100% safety for the entire path remains difficult. To improve our algorithm, we will adopt hard constraints to provide a safety guarantee in future work. Another solution is to divide the path planning problem into several time-horizon sub-problems and ensure safety within each time horizon, which is much easier; this decomposition may be a compromise approach to balance safety and efficiency.
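As a conceptual sketch of that future direction (not the current algorithm), a receding-horizon decomposition might look as follows; `plan_segment`, the horizon length, and the step limit are illustrative only.

```python
def plan_with_horizons(state, goal, plan_segment, horizon=5, max_steps=200):
    """Illustrative time-horizon decomposition: solve a short sub-problem,
    execute it, then replan from the new state.

    `plan_segment(state, goal, horizon)` is assumed to return a short,
    conflict-free segment of at most `horizon` steps, or None if none exists.
    """
    path = [state]
    for _ in range(0, max_steps, horizon):
        segment = plan_segment(state, goal, horizon)
        if segment is None:
            break                    # no safe segment found for this horizon
        path.extend(segment)
        state = segment[-1]
        if state == goal:
            break
    return path
```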
A limitation of this work is that the computational cost has not been taken into consideration. At the current stage, our primary focus is on ensuring the safety of the generated path in human-shared environments; once safety can be reliably guaranteed, our next objective will be to accelerate the search process and reduce the computational cost, which will be addressed in future work.