Article

A Reinforcement Learning-Based Double Layer Controller for Mobile Robot in Human-Shared Environments

1 Department of Transport Engineering, College of Architecture Science and Engineering, Yangzhou University, Yangzhou 225127, China
2 Key Laboratory of the Ministry of Education for Modern Measurement & Control Technology, School of Electromechanical Engineering, Beijing Information Science & Technology University, Beijing 102206, China
3 College of Information Science & Technology, Beijing University of Chemical Technology, Beijing 100029, China
4 Smart Lab of Innovation, Sinotrans Innovation & Technology Co., Ltd., Beijing 100029, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(14), 7812; https://doi.org/10.3390/app15147812
Submission received: 4 June 2025 / Revised: 29 June 2025 / Accepted: 8 July 2025 / Published: 11 July 2025

Abstract

Various approaches have been explored to address the path planning problem for mobile robots. However, it remains a significant challenge, particularly in environments where a multi-tasking mobile robot operates alongside stochastically moving humans. This paper focuses on path planning for a mobile robot executing multiple pickup and delivery tasks in an environment shared with humans. To plan a safe path and achieve a high task success rate, a Reinforcement Learning (RL)-based double-layer controller is proposed, built around a double-layer learning algorithm. The high-level layer integrates a Finite-State Automaton (FSA) with RL to perform global strategy learning and task-level decision-making. The low-level layer handles local path planning by incorporating a Markov Decision Process (MDP) that accounts for environmental uncertainties. We verify the proposed double-layer algorithm under different configurations and evaluate its performance based on several metrics, including task success rate, reward, etc. The proposed method outperforms conventional RL in terms of reward (+63.1%) and task success rate (+113.0%). The simulation results demonstrate the effectiveness of the proposed algorithm in solving the path planning problem under stochastic human uncertainties.

1. Introduction

In recent years, autonomous mobile robots have been widely employed in automatic factories [1], logistic centers [2], warehouses [3], and manufacturing systems [4], playing an increasingly important role in transportation, pickup, and delivery systems. Today, the demand for autonomous mobile robots that can collaborate with humans is rapidly increasing. Consequently, planning an optimal safe path for a mobile robot performing multiple pickup and delivery operations in human-shared environments has become a critical and urgent challenge.
Path planning is widely recognized as one of the fundamental problems in robotic control. For a mobile robot performing multiple tasks, the control problem can generally be divided into two subproblems: the task scheduling problem, which determines the order in which tasks are executed, and the local path planning problem, which computes an optimal path to accomplish each individual task. The task scheduling problem determines the execution order of tasks, aiming to find an optimal sequence for visiting multiple goal locations. This is commonly formulated as the Vehicle Routing Problem (VRP) [5] or the Traveling Salesman Problem (TSP) [6], which have been extensively studied in static environments where the path cost between locations is fixed and known. However, the path cost in dynamic environments is uncertain and often time-varying. Local path planning focuses on computing a conflict-free path from a start location to a goal location while satisfying specific performance criteria such as minimizing travel distance, commonly referred to as the Shortest Path Planning (SPP) problem [7,8]. These two subproblems are deeply interdependent: the outcome of task scheduling directly influences local planning decisions; in turn, local path feasibility can affect task scheduling results. Moreover, both subproblems are significantly affected by the presence of stochastically moving humans in human-shared environments, which introduces uncertainty and increases the complexity of planning.
As is well known, path planning in static environments has already been studied in considerable depth and various methods have been developed, including D* [9] and A* [10]. However, the path planning problem for a mobile robot designed with multiple pickup and delivery tasks is NP-hard and remains a significant challenge, particularly in dynamic environments. The primary difficulties arise from (1) controlling multiple tasks, and (2) dealing with environmental uncertainties, including dynamic obstacles and stochastic human behavior.
To date, various methods have been developed to generate a global path for a mobile robot performing multiple tasks. To manage task sequences, syntactically co-safe Linear Temporal Logic (scLTL)—a fragment of Linear Temporal Logic (LTL) [11]—has been widely applied for specifying and controlling multiple tasks formulated as scLTL expressions [12]. For example, Mi et al. [13] employed scLTL to represent task specifications and combined it with the A* pathfinding algorithm. It has been proven that scLTL is efficient for task description and control. In terms of path planning, real-time algorithms [14,15] are commonly used to ensure safe navigation by avoiding moving obstacles in dynamic environments. These methods typically adopt a hierarchical path planning framework in which a global planner provides guidance toward the goal, while a local planner is responsible for obstacle avoidance. Unfortunately, real-time path planning methods incur high computational costs due to continuous re-planning and are unable to explicitly address environmental uncertainties at the planning level. Although real-time re-planning at each time step is widely regarded as the safest strategy for handling environmental uncertainties [16], it is often impractical in real-world scenarios such as warehouses, logistics centers, and automated factories. This is primarily because mobile robots in these environments are required to complete tasks within strict time constraints, making constant re-planning computationally infeasible.
Reinforcement Learning (RL) [17,18] has been widely adopted for modeling complex environmental dynamics and learning optimal control policies through interaction. Numerous Deep Reinforcement Learning (DRL) methods [19,20] have been developed for planning in dynamic environments. Chai et al. [21] proposed a hierarchical Deep Learning (DL)-based control framework to plan optimal maneuver trajectories and guide the robot under uncertainties. However, their motion planning model was trained offline and the learning cost was high. In our previous work [22], we proposed an FSA-MDP controller by integrating scLTL with a Markov Decision Process (MDP) to plan a global path for the mobile robot considering the uncertainties of jamming attackers. However, this approach still relies on real-time re-planning mechanisms when environmental changes occur.
Recent studies have demonstrated that addressing uncertainties at the planning level is not only effective but also more practical for real-world deployment [13]. On the one hand, computational costs are significantly reduced, as continuous re-planning is no longer required. On the other hand, it helps to prevent the robot from becoming trapped in local optima, improving the overall task success rate. In human-shared environments, ensuring the safe operation of mobile robots is both fundamental and critical. At the same time, achieving a high task success rate is essential for improving the efficiency of systems such as warehouses and automated factories. Motivated by these considerations and inspired by the principle of addressing uncertainties at the planning level, we propose a mobile robot controller that solves the path planning and task control problems in human-shared environments. An RL-based double-layer algorithm is proposed to plan a global safe path for a robot performing multiple pickup and delivery tasks in an environment with uncertainties. The key contributions of this study are summarized as follows:
1.
We estimate environmental uncertainties through stochastic simulations and propose a hierarchical mobile robot controller that integrates scLTL and RL. Unlike conventional approaches that rely on real-time obstacle avoidance, our method explicitly handles environmental uncertainties at the planning level.
2.
A novel RL-based double-layer algorithm is developed. Unlike traditional product- or encoder-type algorithms that use scLTL and RL, the proposed algorithm separates the learning problem into high-level and low-level layers. This decomposition enables the robot to compute an optimal path in human-shared environments with significantly reduced learning costs.
3.
At the higher level, an FSA-RL learner is proposed on the basis of FSA transitions to find an optimal global strategy for multiple tasks. Notably, we extend the deterministic state transitions of the FSA to a stochastic formulation, allowing the controller to better adapt to dynamic environments.
4.
At the lower level, we construct an MDP learner that improves upon the classic MDP by redefining the reward function to explicitly account for environmental uncertainties at the planning level. This enhancement enables the generation of an optimal local path under dynamic and uncertain conditions.
The rest of this paper is organized as follows: Section 2 presents a review of related works; Section 3 briefly reviews the scLTL specifications; Section 4 introduces the problem formulation; the proposed safe planner is presented in Section 5, followed by the simulation results and discussions in Section 6; finally, a brief conclusion is provided in Section 7.

2. Related Work

In path planning, obstacle avoidance has traditionally been the primary requirement for mobile robots. Numerous efficient algorithms have been proposed for path planning and task assignment, including A* [10,23], D* [9,24], Reinforcement Learning (RL) [17,25], and Monte Carlo Tree Search (MCTS) [26,27]. However, although the A* algorithm is efficient for finding the shortest path in static environments, it struggles to address uncertainties in dynamic environments, and D* faces the same limitation. While the RL and MCTS methods are effective in both static and dynamic settings, these approaches typically require extensive learning to derive an optimal policy through repeated trials. Moreover, even in static environments they cannot provide any guarantees for the learned policy.
These classic algorithms do not perform well when addressing uncertain factors, especially unpredictable obstacles such as human operators. When a human suddenly appears in front of a mobile robot, the robot has to stop proactively to ensure human safety. Thus, more robust methods [28,29,30,31] have been presented for environments requiring human–robot coexistence. Cheng et al. [28] proposed a safe RL controller with control barrier functions and utilized Gaussian Processes (GPs) to model the system dynamics and uncertainties. However, Gaussian functions encounter difficulties when modeling uncertainties with multiple peaks. Mohamed et al. [29] proposed a framework that uses DRL as an enabling technology to enhance the intelligence and safety of robots in human–robot collaboration scenarios. Liu et al. [30] presented a DRL approach to realize real-time collision-free motion planning of an industrial robot for human–robot collaboration. Shao et al. [31] proposed a Reachability-based Trajectory Safeguard (RTS) controller which leverages reachability analysis to ensure safety during training and operation. While these methods are efficient in human–robot collaboration, they depend on online learning and cannot resolve collisions at the planning level. A novel approach to collision avoidance based on human–robot interaction was proposed in [32], where the robot communicates its presence to nearby humans through voice messages. Upon approaching a certain distance, the human is required to provide feedback in order to complete the interaction process. Zeng et al. [33] presented a virtual force field-based mobile robot navigation algorithm in which both active and critical regions are used to deal with the uncertainty of human motion. The region sizes for the robot's navigation are calculated based on worst-case avoidance conditions. Although the above-mentioned methods have shown promising results, they all consider humans as fixed subjects in the physical workspace.
Hybrid methods combining Global Path Planning (GPP) and Local Path Planning (LPP) [34] are widely applied in navigation, where LPP is used for real-time obstacle avoidance. For instance, this scheme is widely utilized in the field of autonomous driving [35]. Strictly speaking, this is a real-time planning system in which the re-planning mechanism is invoked when an unexpected obstacle is encountered. The main advantage is that it provides high safety assurance. However, this approach does not consider future risks or uncertainties at the planning level. In an automated factory scenario where mobile robots work together with humans, a mobile robot usually needs to accomplish its tasks within a given time budget. Thus, such real-time planning systems incur high costs and easily become trapped in continuous re-planning, leading to low task success rates.
Recently, numerous path planning methods that consider uncertain human actions have been proposed to handle human–robot interactions. Chan et al. [36] presented a conflict-free speed alteration strategy for human safety. Although the superiority of this method was shown by comparing it with two existing methods, this method plans one-dimensional moving paths, which is unsuitable for an autonomous system. Similar works have made progress in the probabilistic prediction of human motion [37,38] and robust sequential trajectory planning for multi-robot systems [39,40]. However, fusing these techniques into a real-time planning system remains a challenge because of the difficulty encountered in joint planning and prediction for multiple robots and humans. Bajcsy’s research [41] introduced a scalable framework for robot navigation that accounts for high-order system dynamics and maintains safety in the presence of external disturbances, other robots, and nondeterministic intentional agents. However, this framework decomposes multi-agent pathfinding into a sequence of instances in which it is necessary to re-plan the paths at every time step for all agents [42,43]. While such a framework maintains human safety, it does not account for the time budget of the robot or its task success rate. Consequently, this framework is difficult to realize in real-world applications.
Table 1 summarizes and compares the related path planning methods discussed above. As described previously, shortest path planning methods such as A*, D*, and RRT* are not suitable for dynamic environments in which humans move randomly. Methods such as RL, FSA-MDP, and MCTS adopt learning schemes to obtain an optimal control policy, and are efficient in both static and dynamic environments. Moreover, these learning schemes make it possible to address future risk in the learning phase through techniques such as stochastic simulations. Overall, RL is a suitable scheme, while MCTS suffers from expensive computational costs. Real-time solutions, which combine global path planning with real-time obstacle avoidance, are widely regarded as the safest approach to path planning. The real-time method is widely applied in navigation; however, it lacks risk foresight and cannot resolve conflicts at the planning level. In summary, addressing risks at the planning level rather than relying on real-time methods is crucial for ensuring the safety of mobile robots working alongside humans while achieving a high task success rate.
It has been proven that syntactically co-safe Linear Temporal Logic (scLTL), which is one fragment of linear temporal logic (LTL) [11], is efficient in terms of both task descriptions and control. Multiple tasks can be formulated using an scLTL formula [44,45], offering efficient methodologies for managing multi-task scenarios and seamlessly integrating with path planning methods. For example, in [22] the authors investigated an RL approach merging MDP with scLTL and proposed an FSA-MDP algorithm to generate optimal paths against jamming attacks. Building upon this foundation, our primary aim is to generate safe and efficient plans that satisfy scLTL tasks for environments such as warehouses and automated factories where humans and robots must coexist or cooperate.
Table 2 shows a comparison of the different methods. Ushio et al. [46] proposed a product MDP learner based on LTL for robots performing multiple tasks. However, their method only considers uncertainties in state transitions and does not address environmental uncertainties. The FSA-encoder-based MDP (Feb-MDP) [22] integrates scLTL and reinforcement learning, handling environmental uncertainties through real-time re-planning. However, this approach was designed for jamming attack scenarios, which differ significantly from human–robot interaction settings. DRL [47] and Plan-Seq-Learn [48] primarily focus on task sequencing for robotic manipulators, but lack mechanisms for handling environmental uncertainties. Furthermore, path planning for mobile robots in human-shared environments is fundamentally different from motion planning for manipulators. Teja et al. [49] proposed a human-aware navigation planner called Cooperative Human-Aware Navigation (Co-HAN), which effectively handles various human-robot interaction contexts; however, their approach does not consider the task sequencing problem. As summarized in Table 2, our goal is to develop a path planner for mobile robots that meets the following two requirements: (1) addressing environmental uncertainties explicitly at the planning level, and (2) solving the task sequencing problem and generating an optimal global path for performing multiple tasks. In contrast to traditional approaches, the proposed algorithm resolves potential conflicts during the planning phase, enabling the robot to complete multiple tasks in human-shared environments through one-time planning without the need for continuous re-planning.

3. Preliminaries

scLTL is one fragment of LTL. An scLTL formula commonly consists of a set of atomic propositions, Boolean operators, and temporal operators. Let P r P be the set of atomic propositions. An scLTL formula over the atomic propositions P r P is recursively defined as
φ ::= ⊤ | prp | ¬prp | φ_1 ∧ φ_2 | φ_1 ∨ φ_2 | F φ | X φ | φ_1 U φ_2.  (1)
Note that:
  • ⊤ = ¬⊥ is an scLTL formula, where ⊥ denotes False,
  • F φ = ⊤ U φ is an scLTL formula, and
  • given an scLTL formula φ, it is written in positive normal form, where ¬ appears only in front of an atomic proposition prp.
In addition, ¬ (negation), ∧ (conjunction), and ∨ (disjunction) are Boolean operators, while X (next), F (in the future), and U (until) are temporal operators. By combining the Boolean and temporal operators, complex and rich mission specifications can be described. An scLTL formula is interpreted over an infinite word w ∈ (2^PrP)^ω, where w = w[0]w[1]w[2]…. Let w(i) = w[i]w[i+1]…; then, the semantics of formula φ over word w are defined recursively in Equation (2) as follows:
w(i) ⊨ prp ⟺ prp ∈ w[i],
w(i) ⊨ φ_1 ∧ φ_2 ⟺ w(i) ⊨ φ_1 ∧ w(i) ⊨ φ_2,
w(i) ⊨ φ_1 ∨ φ_2 ⟺ w(i) ⊨ φ_1 ∨ w(i) ⊨ φ_2,
w(i) ⊨ X φ ⟺ w(i+1) ⊨ φ,
w(i) ⊨ φ_1 U φ_2 ⟺ ∃k ≥ i, ∀j (i ≤ j < k): w(k) ⊨ φ_2 ∧ w(j) ⊨ φ_1.  (2)
For any scLTL formula φ, there exists an FSA A_φ = <Q, PrP, δ, q_0, F> that accepts all infinite words satisfying the formula φ, where Q is the finite set of states, PrP is the set of atomic propositions, δ: Q × 2^PrP → Q is the state transition function, q_0 ∈ Q is the initial state, and F ⊆ Q is the set of accepting states. Several off-the-shelf tools for the computation of A_φ have been developed [50,51]. (For the tool used in this paper, see https://spot.lre.epita.fr/app/; accessed on 7 July 2025.)
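For illustration, the following minimal sketch, assuming the Spot Python bindings from the tool above are installed, translates a small co-safe formula into an automaton and prints its structure; the formula and variable names are our own examples, not taken from the paper's implementation.

```python
# Minimal sketch, assuming the Spot Python bindings (https://spot.lre.epita.fr) are installed.
# Translates an example co-safe formula into an automaton whose states, transitions,
# initial state, and accepting states correspond to Q, delta, q_0, and F above.
import spot

# Example scLTL task: eventually pick up and deliver at workstation 1,
# never delivering before picking up.
phi = "F pick1 & F deliver1 & (!deliver1 U pick1)"

aut = spot.translate(phi, "BA", "deterministic")  # deterministic Buchi automaton
print("number of states:", aut.num_states())
print("initial state:", aut.get_init_state_number())
print(aut.to_str("hoa"))  # HOA text form from which the FSA components can be read off
```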

4. Problem Formulation

We consider the online pickup and delivery problem of an agent under environmental uncertainties. An agent is designed to collect workpieces from I (I ∈ ℕ+) specific workstations and transport them to different destinations. Let o_i be the position of workstation i (1 ≤ i ≤ I) and g_i the destination of the workpieces from workstation i.
For each workstation i, the agent:
  • Visits o_i and collects workpieces, which is called the collection task
  • Delivers the collected workpieces to its destination g_i, named the delivery task
  • T_{i1}, T_{i2} are the budgets for the collection and delivery tasks, respectively.
Therefore, I workstations define 2I tasks. We denote the task state at time step t as τ_t = {τ_1, …, τ_i, …, τ_I, τ_{I+1}, …, τ_{I+i}, …, τ_{2I}}, where τ_i and τ_{I+i} represent the states of the collection and delivery tasks of workstation i, respectively. Here, τ_i, τ_{I+i} ∈ {False, True}, where:
  • False shows a task that is not achieved
  • True represents a task that is achieved.
Thus, the initial task state is τ_0 = {τ_1, …, τ_i, …, τ_{2I}} = {False, …, False, …, False}. After τ_i is achieved at time step t, τ_t = {False, …, True, …, False}. After all tasks are completed, the task state becomes {True, …, True, …, True}.
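As a concrete illustration of this bookkeeping, the short sketch below (with helper names of our own, not the paper's code) represents τ_t as a list of 2I Booleans and updates it as tasks are achieved.

```python
# Illustrative sketch of the task state tau_t as a list of 2*I Booleans:
# indices 0..I-1 hold collection tasks, indices I..2I-1 hold delivery tasks.
I = 3  # number of workstations in this example

def initial_task_state(I):
    return [False] * (2 * I)          # tau_0 = {False, ..., False}

def achieve_collection(tau, i):
    """Mark the collection task of workstation i (1-indexed) as achieved."""
    tau[i - 1] = True
    return tau

def achieve_delivery(tau, i):
    """Mark the delivery task of workstation i as achieved (collection must come first)."""
    assert tau[i - 1], "delivery must not precede collection"
    tau[I + i - 1] = True
    return tau

tau = initial_task_state(I)
tau = achieve_collection(tau, 2)      # e.g. tau_t = {False, True, False, False, False, False}
tau = achieve_delivery(tau, 2)
all_done = all(tau)                   # True once the state is {True, ..., True}
```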
Definition 1.
If a task is achieved within its given budget, we say that it is online; otherwise, it is offline.
Definition 2.
If all tasks are completed online, we say that the run is a success; otherwise, we consider it a failure.
Environmental uncertainties arise from the states of the doors. There are N (N ∈ ℕ+) doors in the environment that the agent has to pass through. Each door n (1 ≤ n ≤ N) can be in one of two states, open or closed. The state of each door is determined by the door control system. Let d_n^t denote the state of door n at time step t; then:
d_n^t = 1, if the door can open; 0, otherwise.  (3)
When the agent needs to pass through a door, it sends an open request to the door control system. However, the request may be denied if the door is not available because the area beyond it is dangerous, such as when other vehicles are passing through.
Based on the 2I tasks of the agent, we define a set of atomic propositions PrP = ∪_{i=1}^{I} PrP_i and PrP_i = {null_i, pick_i, deliver_i} (1 ≤ i ≤ I), where:
  • null_i: The agent does not accomplish the collection task at workstation o_i
  • pick_i: The agent successfully completes the online collection task at workstation o_i
  • deliver_i: The agent successfully completes the online delivery task of workpieces from workstation o_i.
All tasks can be formulated into an scLTL formula φ , as follows:
φ = ⋀_{i=1}^{I} ( F pick_i ∧ F deliver_i ∧ (¬deliver_i U pick_i) )  (4)
where:
  • F pick_i: The agent will achieve the online collection task τ_i in the future
  • F deliver_i: The agent will achieve the online delivery task τ_{I+i} in the future
  • ¬deliver_i U pick_i: The delivery task τ_{I+i} will not be achieved before the collection task τ_i is completed.
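The specification in Equation (4) can be assembled mechanically for any number of workstations; the short sketch below builds the formula string in Spot-compatible syntax (the function name is our own, not from the paper).

```python
# Sketch: build the scLTL specification of Equation (4) for I workstations as a formula
# string, using Spot-style syntax for the Boolean and temporal operators.
def pickup_delivery_spec(I):
    clauses = [
        f"(F pick{i} & F deliver{i} & (!deliver{i} U pick{i}))"
        for i in range(1, I + 1)
    ]
    return " & ".join(clauses)

print(pickup_delivery_spec(3))
# (F pick1 & F deliver1 & (!deliver1 U pick1)) & (F pick2 & ...) & (F pick3 & ...)
```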
In this paper, we aim to generate an optimal solution for the online pickup and delivery problem under environmental uncertainties.

5. Proposals

The architecture of the proposed planner is illustrated in Figure 1. We first model the environmental uncertainties based on the observations. Then, an RL-based double-layer learning algorithm is introduced to learn an optimal policy while accounting for these uncertainties. The lower-level layer consists of an MDP learner which takes a starting position and a goal, then searches for a local optimal safe path under uncertainties. The higher-level layer employs an FSA-RL learner to achieve global optimization across all tasks. The details of the proposed planner are presented below.

5.1. Model Environment Uncertainties

The uncertainties are primarily caused by human-driven vehicles such as forklifts. Let Z_t = {z_t^1, …, z_t^K} be the observations of all forklifts at time step t by the door control system, and let z_t^k = <s_t^k, dire^k> be the state of forklift k, where s_t^k is the observed position and dire^k ∈ {←, →} is the observed direction. Each forklift can take two actions, forward and wait. The direction of a forklift remains fixed throughout its operation, and the wait action is reserved for emergency situations only. The probabilistic model is illustrated in Figure 2. Each forklift is human-operated, and the driver controls its speed within a specified range. Suppose that a forklift is observed at position v at time step t, as shown in Figure 2, where s_t^k = v and dire^k is ←. It moves to s_{t+1}^k with the forward action with probability p_f. The state transition is written as (z_t^k, forward, z_{t+1}^k). Note that the transition is not deterministic: due to speed variations, the forklift may move forward by one or two steps, i.e., s_{t+1}^k = v − 1 or v − 2.
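The stochastic motion described above can be written as a small sampling routine; the probabilities below are illustrative assumptions, since the paper only states that a forklift advances one or two cells per forward step.

```python
# Hedged sketch of the forklift motion model: moving "forward" advances one or two cells
# along the corridor, depending on the driver's speed; "wait" is reserved for emergencies.
import random

def step_forklift(pos, direction, p_forward=1.0, p_two_cells=0.5):
    """One time step of a forklift.
    pos: integer corridor cell; direction: -1 for left, +1 for right.
    p_forward, p_two_cells: illustrative probabilities (not values from the paper)."""
    if random.random() >= p_forward:      # emergency wait
        return pos
    cells = 2 if random.random() < p_two_cells else 1
    return pos + direction * cells
```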
Each door n is associated with a designated barrier area B_n, and the door control system evaluates the risk level of B_n to determine whether to grant or deny an open request. As illustrated in Figure 3, when the agent attempts to pass through a corridor via door n_1 to reach another workshop, it sends an open request to the door controller upon approaching the door. The door controller evaluates the risk level of B_{n_1}; if it confirms safety, it responds with a confirmation message “Y” and opens door n_1. In contrast, if a forklift is detected approaching the corridor near door n_2, then the area is considered unsafe. As a result, the controller denies the request and responds with a refusal message “N”, keeping the door closed.
In the following, we show the details of the risk evaluation for the barrier area B_n of each door n. Here, B_n = {v_1^n, v_2^n, …} is the set of vertices around door n. Suppose that a forklift k enters the corridor at t_0 and that the observation is z_{t_0}^k = <s_{t_0}^k, dire^k>. We simulate its path until it leaves the corridor and calculate the conditional probability of each vertex v ∈ B_n being occupied by forklifts:
p_{k,t}(v | Z_{t_0}) = (visited times of v) / (simulation times)  (5)
where t_0 < t < T_k, v ∈ B_n, and T_k is the time step at which forklift k leaves the corridor. The risk level of B_n at time step t is evaluated by
r(B_n, t) = Σ_{v ∈ B_n} Σ_{k=1}^{K} p_{k,t}(v | Z_{t_0}).  (6)
Thus, we obtain the risk distribution Risk = {{r(B_1, t_0+1), …, r(B_1, T*)}, …, {r(B_N, t_0+1), …, r(B_N, T*)}}, where T* is the time step at which the last forklift leaves the corridor.
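The risk distribution Risk can be approximated with a simple Monte Carlo loop; the sketch below builds on the motion sketch above, with illustrative cell indices and helper names, and counts how often simulated forklifts occupy the barrier cells of a door in the spirit of Equations (5) and (6).

```python
# Sketch of the Monte Carlo risk estimation of Equations (5)-(6): simulate each observed
# forklift over the horizon and count how often the barrier cells of door n are occupied.
from collections import defaultdict

def estimate_door_risk(observations, barrier_cells, horizon, n_sims=2000):
    """observations: list of (position, direction) pairs Z_t0 for the forklifts.
    barrier_cells: set of corridor cells forming B_n.
    Returns {t: r(B_n, t)} for t = 1..horizon (time steps after t0)."""
    occupied = defaultdict(int)
    for _ in range(n_sims):
        for pos, direction in observations:
            p = pos
            for t in range(1, horizon + 1):
                p = step_forklift(p, direction)     # motion sketch above
                if p in barrier_cells:
                    occupied[t] += 1
    # Dividing by the number of simulations gives the occupancy probability summed
    # over all forklifts and barrier cells, i.e. the risk level r(B_n, t).
    return {t: occupied[t] / n_sims for t in range(1, horizon + 1)}
```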

5.2. High-Level: FSA-RL Learner

The high level is a global learner based on FSA. All tasks are first written in formula φ, as shown in Equation (4). Afterwards, an FSA A_φ is transformed from formula φ using off-the-shelf tools, where A_φ = <Q, PrP, δ, q_0, F>. Based on A_φ, we construct the high-level learner M_h = <S_h, A_h, P_h, δ_h, R_h, q_0, q_F, PrP, L>, where:
  • S_h = Q is the state set
  • A_h = {move, stay} is the action set
  • P_h: S_h × A_h × S_h → [0, 1] is the state transition probability
  • δ_h: S_h × PrP × L(S_l) → S_h is the state transition function (with S_l being the low-level state set)
  • L: S_l → 2^PrP is the label function
  • PrP is the atomic proposition set
  • R_h is the reward function, which is determined by the low-level MDP learner
  • q_0 is the initial state and q_F ∈ F is the accepting state.
The agent chooses an action a_h ∈ A_h for the state transition from s_h to s_h′ with probability P(s_h′ | s_h, a_h). Note that in the case of a_h = move, the agent moves to the next state s_h′ (s_h′ ≠ s_h) with probability p(s_h′ | s_h, a_h) and may remain at s_h with probability 1 − p(s_h′ | s_h, a_h). An example is shown in Figure 4 (left), where the agent chooses to move to q′ from q. However, it may stay at q with probability 1 − p(q′ | q, a_h). The agent stays at q when a_h = stay, such that p(q | q, a_h = stay) = 1, as shown in Figure 4 (right).
In detail, the state transition is defined by δ h ( · ) as in Equation (7):
s_h′ = δ_h(s_h, prp, L(s_l))  (7)
where prp ∈ PrP_i is an atomic proposition representing a task and L(s_l) is the label. Note that the action set A_h is redefined as
A_h = {pick_i, deliver_i, null_i},  (8)
where the move action is replaced by {pick_i, deliver_i} and stay is replaced by null_i. Here, δ_h(·) differs from the deterministic state transition δ(·) in A_φ. Given a state s_h = q ∈ Q, let neighbours be the set of all possible states to which q may transit. For each q′ ∈ neighbours, there is a corresponding atomic proposition prp that leads the agent to move. As described previously, each prp represents an online collection task τ_i or delivery task τ_{I+i}. Let f(·) define the mapping from prp to the position of the corresponding task, defined as follows:
s_l = f(prp)  (9)
where s_l ∈ {o_1, …, o_I, g_1, …, g_I}. After the agent chooses a neighbor to visit, it invokes the lower level to obtain a label L(s_l), where:
L(s_l) = True, if the task is completed online; False, otherwise.  (10)
If task τ_i or τ_{I+i} is achieved online, that is, if the agent visits s_l within the given budget at the lower level, then L(s_l) returns True. The agent then transits from s_h = q to s_h′ = q′, written as δ_h(q, prp, L(s_l) = True) → q′. Otherwise, L(s_l) = False and δ_h(q, prp, L(s_l) = False) → q. Figure 5 shows an example in which an agent chooses to move to q_1 with prp_i = pick_i from state q. Here, q̃ is the last state and the last transition is δ_h(q̃, p̃rp, L(s̃_l) = True) → q, where p̃rp is the atomic proposition chosen at q̃ and s̃_l = f(p̃rp). Note that there is a special case when q = q_0. As shown in Figure 5 (right), we define its last state transition as δ_h(q_0, null_0, L(s̃_l) = True) → q_0. At this point, it should be noted that s_l = s̃_l if prp = null_i, formulated as
s_l = f(prp) = f(p̃rp), if prp = null_i.  (11)
In the transition δ_h(q_0, null_0, L(s̃_l) = True) → q_0, we define s_0 = f(null_0).
The high level invokes the low-level learner to learn a local control policy. The low-level learner returns a label L(s_l) indicating the local outcome. Meanwhile, if the corresponding task is accomplished online, a local path x_{q,q′} is also returned.
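The interplay between the chosen proposition, the mapping f(·), and the low-level label can be summarized as in the sketch below; the helper names are illustrative, and the full procedure is given in Algorithm 1.

```python
# Sketch of one high-level decision step: map the chosen proposition to a goal position,
# invoke the low-level learner, and fire the FSA transition only if the task is online.
def high_level_step(q, prp, fsa_delta, f, run_low_level):
    """q: current FSA state; prp: chosen atomic proposition (e.g. 'pick1').
    fsa_delta(q, prp): next FSA state; f(prp): goal position s_l of the task;
    run_low_level(goal): returns (label, path), with label=True iff the goal is
    reached within its budget (Equation (10))."""
    goal = f(prp)
    label, path = run_low_level(goal)
    if label:                              # delta_h(q, prp, True) -> q'
        return fsa_delta(q, prp), path
    return q, None                         # delta_h(q, prp, False) -> q
```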
The scLTL formula φ defines the global reward condition. The agent receives a global reward if and only if φ is satisfied, where
R_h(s_h, a_h, s_h′) = E, if s_h′ = q_F.  (12)
In addition, the immediate reward function is also designed to accelerate the learning process, defined by
R_h(s_h, a_h, s_h′) = r_E + |x_{s_h, s_h′}| · step_cost, if s_h′ ≠ s_h; 0, otherwise,  (13)
where |x_{s_h, s_h′}| is the length of the path from s_h to s_h′ returned by the lower level and step_cost is the cost for each movement.
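In code, the reward of Equations (12) and (13) could look like the following sketch; the constants stand in for the values in Table 3, and step_cost is assumed negative so that longer local paths are penalized.

```python
# Sketch of the high-level reward (Equations (12)-(13)); constants are placeholders for
# Table 3, with step_cost assumed negative so that longer local paths reduce the reward.
E, r_E, step_cost = 100.0, 1.0, -0.1

def high_level_reward(s_h, s_h_next, q_F, path):
    if s_h_next == q_F:                     # formula phi satisfied: global reward
        return E
    if s_h_next != s_h:                     # task achieved online: immediate shaped reward
        return r_E + len(path) * step_cost
    return 0.0                              # no transition, no reward
```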

5.3. Low-Level: MDP Learner

The lower level consists of a local MDP learner for learning an optimal control policy and generating an optimal local path. We model the environment as M_l, defined by M_l = <S_l, A_l, P_l, R_l, s_l^0, s_l^g, Risk>, where:
  • S_l = S is the state set
  • A_l = {forward, backward, right, left, wait} is the action set
  • P_l: S_l × A_l × S_l → [0, 1] is the state transition probability
  • R_l is the reward function with R_l(s_l, a_l, s_l′) ∈ ℝ, where s_l, s_l′ ∈ S_l and a_l ∈ A_l
  • s_l^0 is the initial state and s_l^g is the goal state, both of which are provided by the higher level
  • Risk is the risk set of the environment.
The environment of the lower level is a 2D grid world. The state set S_l contains all vertices of the environment. Starting from the given initial state s_l^0, the agent chooses an action a_l ∈ A_l and explores the environment to gain experience in the form of a Q-table. The reward function R_l is designed as follows:
R_l(s_l, a, s_l′) = goal_reward, if s_l′ = s_l^g; step_cost, otherwise.  (14)
To enhance the efficiency of learning with uncertainties, we modify the reward function R l (Equation (14)) as follows:
R_l(s_l, a, s_l′) = goal_reward, if s_l′ = s_l^g; step_cost + u(s_l′, t), otherwise,  (15)
where t is the time step at which the agent transits to s_l′ and u(·) calculates the reward from environmental uncertainties. When an agent approaches a door, it sends a request and the door control system may return a message “Y” to open the door or “N” to refuse. However, whether or not a door will be available in the future is unknown. Thus, we consider the future uncertainties Risk to predict whether or not a door could open. For each door n,
d_n^t = 1, if r(B_n, t) < θ; 0, otherwise,  (16)
where θ is a risk threshold. As shown in Equation (16), when the risk of door n is less than θ, we consider that the door can be passed through with high probability. When r(B_n, t) ≥ θ, passing through the door would pose a significant collision risk, which is unacceptable; in this case, the door refuses to open, and we define u(·) as
u(s_l′, t) = Penalty, if d_n^t = 0; r(B_n, t) · c_cost, otherwise,  (17)
where Penalty is a negative constant value and c_cost is the collision penalty. Here, u(s_l′, t) captures the influence of environmental uncertainties. Specifically, the robot receives a penalty if it is refused passage through a door, as the refusal indicates that proceeding may pose a collision risk with a forklift. It is important to note that a collision may still occur even if the door system grants access (i.e., opens the door), as the door-opening decision is based on the threshold condition r(B_n, t) < θ. To reflect this residual risk, the robot receives a reward proportional to the predicted risk value, given by r(B_n, t) · c_cost. This design enables the local learner to utilize the predicted risk distribution Risk to learn a safer path.
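The risk-aware reward of Equations (15)-(17) can be sketched as below; the constants again stand in for Table 3, door_at gives the door (if any) guarding the next cell, and risk holds the precomputed values r(B_n, t) from Section 5.1. All names are illustrative.

```python
# Sketch of the risk-aware low-level reward (Equations (15)-(17)). Constants are
# placeholders for Table 3; risk maps (door, time step) to the precomputed r(B_n, t).
goal_reward, step_cost, Penalty, c_cost, theta = 10.0, -0.1, -5.0, -1.0, 0.1

def low_level_reward(s_next, goal, t, door_at=None, risk=None):
    if s_next == goal:
        return goal_reward
    r = step_cost
    if door_at is not None:
        r_bn_t = (risk or {}).get((door_at, t), 0.0)
        if r_bn_t >= theta:        # door predicted to refuse (d_n^t = 0): flat penalty
            r += Penalty
        else:                      # door opens, but the residual risk is still penalized
            r += r_bn_t * c_cost
    return r
```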
Both the high- and low-level learners adopt a Q-learning mechanism to learn the optimal policy. The learning structures of these learners are shown in Algorithms 1 and 2, respectively. Figure 6 illustrates the interaction between the high-level and low-level learners in learning the optimal global path for executing multiple tasks.
Algorithm 1: FSA-RL learner
Algorithm 2: Low-level MDP learner
Figure 6. High- and low-level interaction for learning the optimal global path.
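To make the interaction in Figure 6 concrete, the sketch below runs one high-level Q-learning episode over the FSA states, calling the step and reward helpers sketched earlier; all names and the ε-greedy details are illustrative, and Algorithms 1 and 2 give the complete procedures.

```python
# Sketch of one high-level FSA-RL training episode (epsilon-greedy Q-learning over FSA
# states). enabled[q] lists the atomic propositions available at state q; step(q, prp)
# wraps high_level_step with the FSA, f(.), and the low-level learner bound in.
import random

def fsa_rl_episode(Q, q0, q_F, enabled, step, reward,
                   alpha=0.5, gamma=0.9, eps=0.2, max_steps=50):
    q = q0
    for _ in range(max_steps):
        actions = enabled[q]
        if random.random() < eps:                       # explore
            prp = random.choice(actions)
        else:                                           # exploit
            prp = max(actions, key=lambda a: Q.get((q, a), 0.0))
        q_next, path = step(q, prp)                     # invokes the low-level MDP learner
        r = reward(q, q_next, q_F, path)
        best_next = max((Q.get((q_next, a), 0.0) for a in enabled.get(q_next, [])),
                        default=0.0)
        old = Q.get((q, prp), 0.0)
        Q[(q, prp)] = old + alpha * (r + gamma * best_next - old)   # Q-learning update
        q = q_next
        if q == q_F:                                    # accepting state: phi satisfied
            break
    return Q
```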
As described above, the high-level learner is responsible for searching the global task strategy, while the low-level learner focuses on learning local paths. This hierarchical decomposition effectively transforms a complex global learning problem into several simpler local learning problems. For a single-layer RL learner, the size of the search space is |S_l| × 2^{2I}. In contrast, our hierarchical approach reduces the search space to |Q| for the high-level learner and |S_l| for the low-level learner, where |Q| denotes the number of automaton states derived from scLTL. As is well known, learning an optimal policy in a small search space is much easier than in a large one. By decomposing the complex global learning problem into two levels, a high level for task sequencing and a low level for local path planning, the learning cost is reduced. Moreover, the proposed method reduces redundant learning cost, as it eliminates the need to re-learn a local path from s_l^0 to s_l^g after it has already been successfully learned. In contrast, a single-layer RL learner would need to continue learning until finding an optimal global path, resulting in higher learning costs and slower convergence.

6. Results and Discussion

6.1. Reward Design and Environment Setting

The details of the rewards for the high- and low-level learners are designed as shown in Table 3. The other parameters are shown in Table 4. The agent should finish all tasks within the given budget of 100 time steps. The learning parameters in Q-learning are set as α = 0.5, γ = 0.9, and ϵ = 0.2. Figure 7 shows how α, γ, and ϵ affect the learning process. As shown in Figure 7a, the learning rate α influences the convergence speed but does not affect the final convergence reward. In this paper, we set α = 0.5 as a tradeoff between learning stability and speed. The discount factor γ has a significant impact on the final convergence reward, as shown in Figure 7b; here, we set γ = 0.9, as it achieves the highest reward. The exploration rate ϵ in the ϵ-greedy policy is typically set to 0.2; Figure 7c confirms that this value provides a favorable balance between exploration and exploitation. The risk threshold is set to θ = 0.1.
We verify the proposed FSA-RL algorithm in dynamic environments, as shown by the different configurations in Figure 8. In Figure 8a, three workstations, six tasks, and two forklifts are considered. Six doors are set in the environment. To increase the environmental uncertainties, we then increase the number of tasks to twelve, with six workstations and four forklifts. To model the environmental uncertainties, 2000 simulations are performed and the risk distribution Risk is calculated. During the simulation, p_f = 1.0 and p_w = 0.0 for the motion model of the forklifts. Note that the uncertainties are caused by a forklift moving one or two steps ahead, which we randomly simulate. The start time of the agent is set as t_0 = 0, while the forklifts start moving from time step t = 30.
We conduct 100 simulations and evaluate the proposals by the average conflict number, average task success rate, average reward, and average length of generated paths. Moreover, we compare the results with the RL method.

6.2. Results on Scenario 1: Workstation Number I = 3 and Forklift Number K = 2

Scenario 1 verifies the proposed algorithm under the configuration as shown in Figure 8a. We first construct FSA A φ from the scLTL formula φ (Equation (4)) with I = 3 . Figure 9 demonstrates the state transitions of A φ . Based on FSA A φ , we construct the state space of the high-level learner. As shown in Figure 9, it contains 27 nodes, where q 0 = 1 is the initial state and q F = 0 is the accept state.
From q_0 = 1, the agent randomly chooses an atomic proposition from {pick_1, pick_2, pick_3} to explore the state space of the high-level learner. Here, we take pick_1 as an example. The algorithm maps pick_1 to its task position with Equation (9), where s_l = o_1 = f(pick_1) and s̃_l = s_0 = f(null_0). Then, the agent sends (s_0, o_1) to the low-level learner and invokes it to obtain L(s_l) and the corresponding local path x. If L(s_l) = True, that is, when x satisfies the requirement, the agent moves to state 6, written as δ_h(s_h = 1, pick_1, L(s_l) = True) → s_h′ = 6. Otherwise, if L(s_l) = False, the generated path x does not satisfy the requirement and δ_h(s_h = 1, pick_1, L(s_l) = False) → s_h = 1.
The reward of the learning process is shown in Figure 10. The blue curve shows the learning curve of the classic RL method, which is characterized by substantial oscillatory behavior. The green curve shows the learning curve of the proposed FSA-RL; its reward converges at around −1.0, much higher than that of the classic RL method, which converges at around −5.0.
Figure 11 shows the details of the simulation results from the perspective of refused times, average task success rate, average reward, and average length of the generated paths. Note that “refused times” refers to the number of times the robot fails to pass through a door along the generated path because the door rejects the open request. As shown in Figure 11a, when using the classic RL method in 100 simulation runs, the agent was refused 37 times by the door system and failed to pass the door, achieving an average task success rate of 0.63 in Figure 11b. Meanwhile, the proposed method enabled the agent to pass through the door and successfully deliver all workpieces in all 100 simulation runs; thus, the number of refusals in Figure 11a is 0 and the average task success rate in Figure 11b is 1.0.
Figure 11c,d shows the average reward and average length of the generated paths. There is little difference in the average length between the RL and proposed FSA-RL methods, with an average length of 79 for RL and 80 for the proposed method. The difference lies in the average reward; as Figure 11c shows, the average reward of the proposed FSA-RL method is −0.92, much higher than the −1.94 achieved by the RL method. The reason for this is that the door system refuses to allow the agent to pass through when the agent follows the path generated using RL.
Figure 12 shows an example of the generated paths. Figure 12a shows the path generated with RL. The agent visits workstations o_1 → o_2 → o_3 first, then passes through doors #3 and #6 before proceeding to the other workshop for the delivery tasks, in the order g_2 → g_3 → g_1. Unfortunately, when the agent requests to pass through door #3 at time step t = 36, the door system tends to refuse due to the conflict risk from forklifts in the corridor. The risk values of door #3 at time steps 37 and 38 are r(B_3, 37) = 0.5 and r(B_3, 38) = 0.15, respectively (Table 5). Clearly, choosing door #3 is not a safe policy.
The proposed FSA-RL method generates a different policy for moving through the corridor, as shown in Figure 12b. The pickup policy is the same as with the RL method, visiting o_1 → o_2 → o_3 sequentially. The agent then chooses to pass through doors #1 and #4. Moreover, the delivery policy is also different. In the path generated with FSA-RL, the agent performs the delivery tasks in the order g_1 → g_2 → g_3. The agent requests to open door #1 at time step t = 44 with r(B_1, 45) = 0.0 and r(B_1, 46) = 0.0. Clearly, door #1 is safe for the agent to pass through.

6.3. Results on Scenario 2: Increase Forklift Number to K = 4

In this scenario, we increase the number of forklifts to K = 4 and reset the start position of the agent in order to further investigate performance under a complex environment. The four forklifts begin moving at t = 30 . The results are shown in Figure 13 and Figure 14.
As shown in Figure 13, the number of refusals with the RL method is 54, while the number with the proposed FSA-RL method is 2. Compared with the RL method, the proposed method reduces the refusals by about 96.3%. Compared with the results in Figure 11, it can be seen that the number of refusals increases. Consequently, the average task success rate of RL is 0.46 and that of FSA-RL is 0.98 (Figure 13b), for an improvement of about 113.0%.
As the number of forklifts increases, the refusals increase and the task success rate decreases. For the generated paths, the average reward decreases for both the RL and FSA-RL methods. The average reward of FSA-RL is improved by 63.1% compared to that of the RL method. Figure 13c shows that the average reward for the RL method drops to −2.60, while that of the proposed FSA-RL decreases only slightly to −0.96 in comparison to Figure 11c. Meanwhile, the average length also increases, as shown in Figure 13d.
Figure 14 shows examples of the generated paths. Table 6 shows the risk level of each door from time step 36 to 46. In Figure 14a, the agent visits workstations o_2 → o_1 → o_3 to perform the collection tasks. Then, it passes through door #2 and visits g_4 → g_2 → g_1 for the delivery tasks. In detail, the agent requests to open door #2 at time step t = 37 and enters the corridor at time step t = 38. The risk values of door #2 are r(B_2, 38) = 0.147 and r(B_2, 39) = 0.111. Thus, the agent encounters a greater risk of being refused. The refusal results in Figure 13a also verify this point, with the agent being refused by the door 54 times. Figure 14b shows the path generated by the proposed FSA-RL method. The visiting order for the collection tasks is the same as with the RL method, o_2 → o_1 → o_3. After finishing the collection tasks, the agent moves to the other workshop through door #1. The agent enters the corridor at t = 44, where the risk value is r(B_1, 44) = 0.006. Thus, the agent is only refused twice in 100 simulation runs.
The simulation results show that the proposed RL-based double-layer controller is efficient in finding an optimal path for a mobile robot working in human-shared environments. The number of refusals is greatly reduced compared to the classic RL method, and the quality of the generated path is improved from the viewpoint of both task success rate and path reward. The simulation results demonstrate that the proposed method outperforms traditional RL approaches. One advantage comes from our algorithm's ability to decompose the complex path planning problem into several smaller subproblems with reduced search spaces, making them easier to solve. In traditional RL approaches, the size of the search space is |S| × 2^{2I} due to the need to handle all tasks and states. In contrast, our hierarchical method significantly reduces the complexity; the high-level planner operates over a smaller abstract state space of size |Q|, while the low-level planner operates over |S|. For example, the size of the search space for the RL algorithm is |S| × 2^{2I} = 20 × 22 × 2^6 = 28,160. In our proposed algorithm, the size of the high-level search space is |Q| = 26, while the size of the low-level search space is |S| = 22 × 20 = 440. Each subproblem needs to be learned only once, eliminating the need for repeated learning. In contrast, traditional RL must continuously learn optimal policies across multiple tasks. This leads to unavoidable redundant learning, as the learning process is not terminated until all tasks are achieved. The decomposition of the complex path planning problem also enhances the flexibility of our algorithm. Another advantage of our algorithm is its risk foresight ability, thanks to the modeling of environmental uncertainties. As Table 5 and Table 6 show, the risk distribution Risk over a certain time horizon can be obtained using Monte Carlo simulations. Based on Risk, we design the reward function in Equation (17) for the low-level learner in order to handle environmental uncertainties.
In the proposed double-layer learning scheme, the high-level learner learns the global policy for multiple tasks, while the low-level learner performs local path planning while considering environmental risks. Thus, the proposed double-layer algorithm is more efficient at handling environmental uncertainties for a mobile robot designed to perform multiple tasks. The shortcomings of the proposed method are the same as those of other reinforcement learning algorithms, namely, that it cannot provide guarantees and that the search space increases with the number of tasks.

7. Conclusions

This paper addresses the path planning problem of a mobile robot performing multiple pickup and delivery tasks in human-shared environments. We propose a reinforcement learning-based double-layer control algorithm to learn an optimal path in automated factory settings while considering the stochastic movements of human-operated forklifts. The proposed algorithm decomposes the complex global path planning problem into several smaller subproblems that are easier to solve. The high-level layer focuses on global strategy optimization using scLTL and reinforcement learning, while the low-level layer handles local path planning under environmental uncertainties. Compared with conventional RL methods, our approach reduces the number of refusals by approximately 96.3%, improves the average reward by 63.1%, and increases the task success rate by 113.0%. In the future, we plan to: (1) investigate Lyapunov-based stability analysis to provide safety guarantees, (2) develop more robust path planning strategies under uncertainty, and (3) apply the proposed algorithm to real-world robotic systems.

Author Contributions

Conceptualization, J.M.; methodology, J.M.; validation, J.L. and Y.X.; formal analysis, Z.L.; investigation, J.W.; data curation, W.X.; writing—original draft preparation, J.L. and Y.X.; writing—review and editing, J.M. and T.J.; funding acquisition, J.M., J.W. and T.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China (62403411), the Natural Science Foundation of Jiangsu Province (BK20230600), the Natural Science Foundation of Yangzhou (YZ2023164), the Project of Yangzhou Science and Technology Bureau (YZ2023201), the Local Innovation Talent Project of Yangzhou (2022YXBS122), the National Natural Science Foundation of China (62303040), the Beijing Natural Science Foundation (L233029), and the Project of Cultivation for Young Top-Notch Talents of Beijing Municipal Institutions (BPHR202203232).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data generated by the proposed algorithm are shown in Section 6; no additional data are used.

Conflicts of Interest

Author Wei Xu was employed by the company Smart Lab of Innovation, Sinotrans Innovation & Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
RL	Reinforcement Learning
MDP	Markov Decision Process
scLTL	Syntactically Co-Safe Linear Temporal Logic
FSA	Finite-State Automaton
FSA-RL	Finite-State Automaton Reinforcement Learning

References

  1. Belda, K.; Jirsa, J. Control Principles of Autonomous Mobile Robots Used in Cyber-Physical Factories. In Proceedings of the 2021 23rd International Conference on Process Control (PC), Strbske Pleso, Slovakia, 1–4 June 2021; pp. 96–101. [Google Scholar] [CrossRef]
  2. Ramdani, N.; Panayides, A.; Karamousadakis, M.; Mellado, M.; Lopez, R.; Christophorou, C.; Rebiai, M.; Blouin, M.; Vellidou, E.; Koutsouris, D. A Safe, Efficient and Integrated Indoor Robotic Fleet for Logistic Applications in Healthcare and Commercial Spaces: The ENDORSE Concept. In Proceedings of the 2019 20th IEEE International Conference on Mobile Data Management (MDM), Hong Kong, China, 10–13 June 2019; pp. 425–430. [Google Scholar] [CrossRef]
  3. Özbaran, C.; Dilibal, S.; Sungur, G. Mechatronic System Design of A Smart Mobile Warehouse Robot for Automated Storage/Retrieval Systems. In Proceedings of the 2020 Innovations in Intelligent Systems and Applications Conference (ASYU), Istanbul, Turkey, 15–17 October 2020; pp. 1–6. [Google Scholar] [CrossRef]
  4. Latsou, C.; Farsi, M.; Erkoyuncu, J.A.; Morris, G. Digital twin integration in multi-agent cyber physical manufacturing systems. IFAC-PapersOnLine 2021, 54, 811–816. [Google Scholar] [CrossRef]
  5. Pan, L.; Hu, B.; Sun, Z.; Xu, L.; Gong, G.; Xie, X.; Sun, Z. A Review of the Evolutionary Algorithm Based VRP Problem. In Proceedings of the 2023 42nd Chinese Control Conference (CCC), Tianjin, China, 24–26 July 2023; pp. 1939–1944. [Google Scholar] [CrossRef]
  6. Zhang, C.; Sun, P. Heuristic Methods for Solving the Traveling Salesman Problem (TSP): A Comparative Study. In Proceedings of the 2023 IEEE 34th Annual International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), Toronto, ON, Canada, 5–8 September 2023; pp. 1–6. [Google Scholar] [CrossRef]
  7. Di, L.; Sun, D.; Qi, Y.; Xiao, Z. Research on Shortest Path Planning and Smoothing Without Obstacle Collision Based on Moving Carrier. Int. J. Aerosp. Eng. 2024, 2024, 5235125. [Google Scholar] [CrossRef]
  8. Voloch, N.; Zadok, Y.; Voloch-Bloch, N.; Hajaj, M.M. Using Combined Knapsack and Shortest Path Problems for Planning Optimal Navigation Paths for Robotic Deliveries. In Proceedings of the 2024 10th International Conference on Automation, Robotics and Applications (ICARA), Athens, Greece, 22–24 February 2024; pp. 139–143. [Google Scholar] [CrossRef]
  9. Ferguson, D.; Stentz, A. Using interpolation to improve path planning: The Field D* algorithm. J. Field Robot. 2006, 23, 79–101. [Google Scholar] [CrossRef]
  10. Warren, C. Fast path planning using modified A* method. In Proceedings of the IEEE International Conference on Robotics and Automation, Atlanta, GA, USA, 2–6 May 1993; Volume 2, pp. 662–667. [Google Scholar] [CrossRef]
  11. Calin Belta, B.Y.; Gol, E.A. Formal Methods for Discrete-Time Dynamical Systems; Springer: Berlin/Heidelberg, Germany, 2017. [Google Scholar] [CrossRef]
  12. Chen, Y.; Ding, X.C.; Stefanescu, A.; Belta, C. Formal approach to the deployment of distributed robotic teams. IEEE Trans. Robot. 2012, 28, 158–171. [Google Scholar] [CrossRef]
  13. Mi, J.; Zhang, X.; Long, Z.; Wang, J.; Xu, W.; Xu, Y.; Deng, S. A mobile robot safe planner for multiple tasks in human-shared environments. PLoS ONE 2025, 20, e0324534. [Google Scholar] [CrossRef] [PubMed]
  14. Huang, Y.; Wang, Y.; Zatarain, O. Dynamic Path Optimization for Robot Route Planning. In Proceedings of the 2019 IEEE 18th International Conference on Cognitive Informatics & Cognitive Computing (ICCI-CC), Milan, Italy, 23–25 July 2019; pp. 47–53. [Google Scholar] [CrossRef]
  15. Wang, B.; Liu, Z.; Li, Q.; Prorok, A. Mobile Robot Path Planning in Dynamic Environments Through Globally Guided Reinforcement Learning. IEEE Robot. Autom. Lett. 2020, 5, 6932–6939. [Google Scholar] [CrossRef]
  16. Jiaoyang, L.; Andrew, T.; Scott, K.; Joseph W., D.; Satish, T.K.K.; Sven, K. Lifelong Multi-Agent Path Finding in Large-Scale Warehouses. In Proceedings of the The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21), Auckland, New Zealand, 9–13 May 2021; pp. 11272–11281. [Google Scholar]
  17. Ullah, Z.; Xu, Z.; Zhang, L.; Zhang, L.; Ullah, W. RL and ANN Based Modular Path Planning Controller for Resource-Constrained Robots in the Indoor Complex Dynamic Environment. IEEE Access 2018, 6, 74557–74568. [Google Scholar] [CrossRef]
  18. Hazem, B. Study of Q-learning and deep Q-network learning control for a rotary inverted pendulum system. Discov. Appl. Sci. 2024, 6, 49. [Google Scholar] [CrossRef]
  19. Jiang, J.; Yang, L.; Zhang, L. DQN-based on-line Path Planning Method for Automatic Navigation of Miniature Robots. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 5407–5413. [Google Scholar] [CrossRef]
  20. Carvalho, J.P.; Aguiar, A.P. A Reinforcement Learning Based Online Coverage Path Planning Algorithm. In Proceedings of the 2023 IEEE International Conference on Autonomous Robot Systems and Competitions (ICARSC), Tomar, Portugal, 26–27 April 2023; pp. 81–86. [Google Scholar] [CrossRef]
  21. Chai, R.; Niu, H.; Carrasco, J.; Arvin, F.; Yin, H.; Lennox, B. Design and Experimental Validation of Deep Reinforcement Learning-Based Fast Trajectory Planning and Control for Mobile Robot in Unknown Environment. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 5778–5792. [Google Scholar] [CrossRef]
  22. Mi, J.; Kuze, N.; Ushio, T. A mobile robot controller using reinforcement learning under scLTL specifications with uncertainties. Asian J. Control 2022, 24, 2916–2930. [Google Scholar] [CrossRef]
  23. Guruji, A.K.; Agarwal, H.; Parsediya, D. Time-efficient A* Algorithm for Robot Path Planning. Procedia Technol. 2016, 23, 144–149. [Google Scholar] [CrossRef]
  24. Dakulović, M.; Petrović, I. Two-way D* algorithm for path planning and replanning. Robot. Auton. Syst. 2011, 59, 329–342. [Google Scholar] [CrossRef]
  25. Blekas, K.; Vlachos, K. RL-based path planning for an over-actuated floating vehicle under disturbances. Robot. Auton. Syst. 2018, 101, 93–102. [Google Scholar] [CrossRef]
  26. Dam, T.; Chalvatzaki, G.; Peters, J.; Pajarinen, J. Monte-Carlo robot path planning. IEEE Robot. Autom. Lett. 2022, 7, 11213–11220. [Google Scholar] [CrossRef]
  27. Li, W.; Liu, Y.; Ma, Y.; Xu, K.; Qiu, J.; Gan, Z. A self-learning Monte Carlo tree search algorithm for robot path planning. Front. Neurorobotics 2023, 17, 1039644. [Google Scholar] [CrossRef] [PubMed]
  28. Cheng, R.; Orosz, G.; Murray, R.M.; Burdick, J.W. End-to-End Safe Reinforcement Learning through Barrier Functions for Safety-Critical Continuous Control Tasks. Proc. AAAI Conf. Artif. Intell. 2019, 33, 3387–3395. [Google Scholar] [CrossRef]
  29. El-Shamouty, M.; Wu, X.; Yang, S.; Albus, M.; Huber, M.F. Towards Safe Human-Robot Collaboration Using Deep Reinforcement Learning. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 4899–4905. [Google Scholar] [CrossRef]
  30. Liu, Q.; Liu, Z.; Xiong, B.; Xu, W.; Liu, Y. Deep reinforcement learning-based safe interaction for industrial human-robot collaboration using intrinsic reward function. Adv. Eng. Inform. 2021, 49, 101360. [Google Scholar] [CrossRef]
  31. Shao, Y.S.; Chen, C.; Kousik, S.; Vasudevan, R. Reachability-Based Trajectory Safeguard (RTS): A Safe and Fast Reinforcement Learning Safety Layer for Continuous Control. IEEE Robot. Autom. Lett. 2021, 6, 3663–3670. [Google Scholar] [CrossRef]
  32. Ghandour, M.; Liu, H.; Stoll, N.; Thurow, K. A hybrid collision avoidance system for indoor mobile robots based on human-robot interaction. In Proceedings of the 2016 17th International Conference on Mechatronics–Mechatronika (ME), Prague, Czech Republic, 7–9 December 2016; pp. 1–7. [Google Scholar]
  33. Zeng, L.; Bone, G.M. Mobile Robot Collision Avoidance in Human Environments. Int. J. Adv. Robot. Syst. 2013, 10, 41. [Google Scholar] [CrossRef]
  34. Yu, Q.; Zhou, J. A Review of Global and Local Path Planning Algorithms for Mobile Robots. In Proceedings of the 2024 8th International Conference on Robotics, Control and Automation (ICRCA), Shanghai, China, 12–14 January 2024; pp. 84–90. [Google Scholar] [CrossRef]
  35. De Araujo, P.R.M.; Mounier, E.; Dawson, E.; Noureldin, A. Smart Mobility: Leveraging Perception Sensors for Map-Based Navigation in Autonomous Vehicles. In Proceedings of the 2024 IEEE International Conference on Smart Mobility (SM), Niagara Falls, ON, Canada, 16–18 September 2024; pp. 281–286. [Google Scholar] [CrossRef]
  36. Chan, C.C.; Tsai, C.C. Collision-Free Speed Alteration Strategy for Human Safety in Human-Robot Coexistence Environments. IEEE Access 2020, 8, 80120–80133. [Google Scholar] [CrossRef]
  37. Ziebart, B.D.; Ratliff, N.; Gallagher, G.; Mertz, C.; Peterson, K.; Bagnell, J.A.; Hebert, M.; Dey, A.K.; Srinivasa, S. Planning-based prediction for pedestrians. In Proceedings of the 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, St. Louis, MO, USA, 10–15 October 2009; pp. 3931–3936. [Google Scholar] [CrossRef]
  38. Fisac, J.F.; Bajcsy, A.; Herbert, S.L.; Fridovich-Keil, D.; Wang, S.; Tomlin, C.J.; Dragan, A.D. Probabilistically safe robot planning with confidence-based human predictions. arXiv 2018, arXiv:1806.00109. [Google Scholar] [CrossRef]
  39. Chen, M.; Shih, J.C.; Tomlin, C.J. Multi-vehicle collision avoidance via hamilton-jacobi reachability and mixed integer programming. In Proceedings of the 2016 IEEE 55th Conference on Decision and Control (CDC), Las Vegas, NV, USA, 12–14 December 2016; pp. 1695–1700. [Google Scholar] [CrossRef]
  40. Bansal, S.; Chen, M.; Fisac, J.F.; Tomlin, C.J. Safe Sequential Path Planning of Multi-Vehicle Systems Under Disturbances and Imperfect Information. arXiv 2016, arXiv:1603.05208. [Google Scholar] [CrossRef]
  41. Bajcsy, A.; Herbert, S.L.; Fridovich-Keil, D.; Fisac, J.F.; Deglurkar, S.; Dragan, A.D.; Tomlin, C.J. A Scalable Framework for Real-Time Multi-Robot, Multi-Human Collision Avoidance. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 936–943. [Google Scholar] [CrossRef]
  42. Grenouilleau, F.; Van Hoeve, W.J.; Hooker, J.N. A multi-label A* algorithm for multi-agent pathfinding. In Proceedings of the International Conference on Automated Planning and Scheduling, Berkeley, CA, USA, 11–15 July 2019; Volume 29, pp. 181–185. [Google Scholar] [CrossRef]
  43. Lifelong Multi-Agent Path Finding in A Dynamic Environment. In Proceedings of the 2018 15th International Conference on Control, Automation, Robotics and Vision (ICARCV), Singapore, 18–21 November 2018; pp. 875–882. [CrossRef]
  44. Sadigh, D.; Kim, E.S.; Coogan, S.; Sastry, S.S.; Seshia, S.A. A learning based approach to control synthesis of Markov decision processes for linear temporal logic specifications. In Proceedings of the 53rd IEEE Conference on Decision and Control, Los Angeles, CA, USA, 15–17 December 2014; pp. 1091–1096. [Google Scholar] [CrossRef]
  45. Cho, K.; Suh, J.; Tomlin, C.J.; Oh, S. Cost-aware path planning under co-safe temporal logic specifications. IEEE Robot. Autom. Lett. 2017, 2, 2308–2315. [Google Scholar] [CrossRef]
  46. Hiromoto, M.; Ushio, T. Learning an Optimal Control Policy for a Markov Decision Process Under Linear Temporal Logic Specifications. In Proceedings of the 2015 IEEE Symposium Series on Computational Intelligence, Cape Town, South Africa, 7–10 December 2015; pp. 548–555. [Google Scholar] [CrossRef]
  47. Dong, X.; Wan, G.; Zeng, P.; Song, C.; Cui, S. Optimizing Robotic Task Sequencing and Trajectory Planning on the Basis of Deep Reinforcement Learning. Biomimetics 2024, 9, 10. [Google Scholar] [CrossRef] [PubMed]
  48. Dalal, M.; Chiruvolu, T.; Chaplot, D.; Salakhutdinov, R. Plan-Seq-Learn: Language Model Guided RL for Solving Long Horizon Robotics Tasks. arXiv 2024, arXiv:2405.01534. [Google Scholar]
  49. Teja Singamaneni, P.; Favier, A.; Alami, R. Human-Aware Navigation Planner for Diverse Human-Robot Interaction Contexts. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 5817–5824. [Google Scholar] [CrossRef]
  50. Kupferman, O.; Vardi, M.Y. Model Checking of Safety Properties. Form. Methods Syst. Des. 2001, 19, 291–314. [Google Scholar] [CrossRef]
  51. Latvala, T. Efficient Model Checking of Safety Properties. In Proceedings of the 10th International SPIN Workshop on Model Checking of Software, Portland, OR, USA, 9–10 May 2003; pp. 74–88. [Google Scholar] [CrossRef]
Figure 1. Architecture of the proposed planner. Red circles are the workstations where the agent has to pick up workpieces. The purple triangles are the destinations of the workpieces. #1, #2, … are the doors through which the agent must pass to deliver the collected workpieces.
Figure 2. Motion model and state transition of a forklift.
Figure 3. Communication between the agent and the door control system.
Figure 4. An example of state transition probability: (left) s_h = q, a_h = move; (right) a_h = stay.
Figure 5. State transitions.
Figure 7. Parameter investigation, showing the configuration with one task and two forklifts: (a) learning rate α, (b) discount factor γ, and (c) threshold ϵ in the ϵ-greedy policy.
Figure 8. Environment Setting. (a) The colored circles show I = 3 workstations, o_1, o_2, o_3. There are six doors, denoted #1, …, #6. The colored triangles show the destinations, g_1, g_2, g_3. There are K = 2 forklifts in the environment; the red arrows show their directions. The gray squares are obstacles, while the pink squares show the barrier area of each door. (b) The number of forklifts is increased to K = 4 and the robot's start position is reset.
Figure 9. State transitions of A_φ converted from the scLTL formula φ with I = 3. Note that the state transition q ∈ δ(q, null_i) exists for all q ∈ Q (not drawn in the figure).
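The automaton structure sketched in Figure 9 can be encoded compactly in code. The following minimal Python sketch illustrates an FSA with null_i self-loops of the kind described in the caption; the state names, event labels, and transition table are hypothetical placeholders for illustration and are not the automaton A_φ used in the paper.

```python
# Minimal sketch of a finite-state automaton (FSA) with null_i self-loops,
# in the spirit of Figure 9. States, events, and transitions are placeholders.

class FSA:
    def __init__(self, states, init_state, accepting, delta):
        self.states = states
        self.state = init_state
        self.accepting = accepting
        self.delta = delta  # dict: (state, event) -> next state

    def step(self, event):
        # null_i events leave the automaton state unchanged, mirroring the
        # self-transitions q ∈ δ(q, null_i) noted in the Figure 9 caption.
        if event.startswith("null"):
            return self.state
        self.state = self.delta.get((self.state, event), self.state)
        return self.state

    def is_accepting(self):
        return self.state in self.accepting


# Hypothetical simplified example with I = 3 tasks completed in sequence.
states = {"q0", "q1", "q2", "q_acc"}
delta = {
    ("q0", "task1_done"): "q1",
    ("q1", "task2_done"): "q2",
    ("q2", "task3_done"): "q_acc",
}
automaton = FSA(states, "q0", {"q_acc"}, delta)
for e in ["null1", "task1_done", "task2_done", "null2", "task3_done"]:
    automaton.step(e)
print(automaton.is_accepting())  # True
```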
Figure 10. Convergence curves of the learning process.
Figure 11. Simulation results of scenario 1: (a) refusals, (b) average task success rate, (c) average reward of generated paths, and (d) average length of generated paths.
Figure 12. Examples of generated paths in scenario 1: (a) classic RL method, path length 77; (b) proposed FSA-RL method, path length 79.
Figure 13. Simulation results of scenario 2 (with the number of forklifts increased to K = 4): (a) refusals, (b) average task success rate, (c) average reward of generated paths, and (d) average length of generated paths.
Figure 14. Examples of generated paths in scenario 2: (a) classic RL method, path length 84; (b) proposed FSA-RL method, path length 78.
Table 1. Comparison of different path planning methods.
Methods compared: A*, D*, RRT*, RL, FSA-MDP, MCTS.
Comparison criteria: environment (static/dynamic), risk foresight, and real-time solution.
Table 2. Comparison of different methods for task sequencing.
Comparison criteria: handling of environmental uncertainty, human interaction, and task sequencing.
Methods compared: Product MDP [46]; Feb-MDP [22] (real-time way); DRL [47]; Plan-Seq-Learn [48]; G2RL [15] (real-time way); Co-HAN [49] (real-time way); Ours (at the planning level).
Table 3. Reward design in the high- and low-level learners.
Types of Reward | Value | Learner
E | 1.0 | high-level
r_E | 0.0 | high-level
step_cost | 0.1 | high-/low-level
goal_reward | 1.0 | low-level
Penalty | 1.0 | low-level
c_cost | 1.0 | low-level
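The reward terms in Table 3 can be collected into simple lookups, as in the hedged Python sketch below. The event names, the function signature, and the sign convention (treating step_cost, Penalty, and c_cost as quantities subtracted from the return) are illustrative assumptions, not the authors' exact reward function.

```python
# Sketch of a reward lookup based on Table 3. Names and the sign convention
# for the cost terms are assumptions made for illustration.

HIGH_LEVEL_REWARDS = {
    "accepting_transition": 1.0,   # "E" in Table 3
    "non_accepting": 0.0,          # "r_E" in Table 3
    "step_cost": 0.1,
}

LOW_LEVEL_REWARDS = {
    "goal_reward": 1.0,
    "penalty": 1.0,    # assumed: e.g., entering a barrier area
    "c_cost": 1.0,     # assumed: e.g., collision-related cost
    "step_cost": 0.1,
}

def low_level_reward(reached_goal: bool, violated: bool, collided: bool) -> float:
    """Hypothetical per-step reward for the low-level learner."""
    r = -LOW_LEVEL_REWARDS["step_cost"]
    if reached_goal:
        r += LOW_LEVEL_REWARDS["goal_reward"]
    if violated:
        r -= LOW_LEVEL_REWARDS["penalty"]
    if collided:
        r -= LOW_LEVEL_REWARDS["c_cost"]
    return r

print(low_level_reward(reached_goal=True, violated=False, collided=False))  # 0.9
```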
Table 4. Parameter settings.
Parameter | Value
budget | 100 (Section 4)
p_f | 1.0 (randomly move one step or two steps) (Section 5.1)
p_w | 0.0
simulation times | 2000
α | 0.5 (high-/low-level)
γ | 0.9
ϵ | 0.2
computation_budget | 1000
θ | 0.1 (low-level)
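For context, the learning parameters in Table 4 (α = 0.5, γ = 0.9, ϵ = 0.2) enter the standard ϵ-greedy tabular Q-learning update whose sensitivity is examined in Figure 7. The sketch below shows that update in generic form; the state and action representations are placeholders and do not reproduce the paper's double-layer controller.

```python
import random
from collections import defaultdict

# Hedged sketch of epsilon-greedy tabular Q-learning using the parameter
# values listed in Table 4 (alpha = 0.5, gamma = 0.9, epsilon = 0.2).
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2
ACTIONS = ["up", "down", "left", "right", "stay"]  # hypothetical action set

Q = defaultdict(float)  # Q[(state, action)] -> estimated value

def select_action(state):
    """Epsilon-greedy action selection: explore with probability EPSILON."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    """One-step Q-learning backup with learning rate ALPHA and discount GAMMA."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

# Example usage with placeholder states:
a = select_action("s0")
update("s0", a, -0.1, "s1")
```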
Table 5. Scenario 1: Risk level of each door when the agent passes through.
Time-Step t | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46
door #1 | 0.010 | 0.150 | 0.480 | 0.650 | 0.460 | 0.190 | 0.050 | 0.010 | 0.000 | 0.000
door #2 | 0.500 | 0.150 | 0.110 | 0.050 | 0.010 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
door #3 | 0.500 | 0.150 | 0.060 | 0.020 | 0.010 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
Table 6. Scenario 2: Risk level of each door when the agent passes through.
Time-Step t | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46
door #1 | 0.336 | 0.714 | 0.730 | 0.727 | 0.739 | 0.500 | 0.197 | 0.048 | 0.006 | 0.000 | 0.000
door #2 | 0.807 | 0.500 | 0.147 | 0.111 | 0.020 | 0.010 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
door #3 | 1.178 | 0.728 | 0.208 | 0.024 | 0.020 | 0.010 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
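Tables 5 and 6 can be read as per-door, per-time-step risk lookups. The sketch below stores the Scenario 1 values from Table 5 and queries the lowest-risk door at a given time-step; the data structure and helper name are illustrative and not part of the paper's implementation.

```python
# Scenario 1 door risk levels from Table 5, indexed by time-step t = 37..46.
SCENARIO1_RISK = {
    "#1": [0.010, 0.150, 0.480, 0.650, 0.460, 0.190, 0.050, 0.010, 0.000, 0.000],
    "#2": [0.500, 0.150, 0.110, 0.050, 0.010, 0.000, 0.000, 0.000, 0.000, 0.000],
    "#3": [0.500, 0.150, 0.060, 0.020, 0.010, 0.000, 0.000, 0.000, 0.000, 0.000],
}
T0 = 37  # first time-step covered by the table

def lowest_risk_door(t: int) -> str:
    """Return the door with the lowest tabulated risk at time-step t (hypothetical helper)."""
    idx = t - T0
    return min(SCENARIO1_RISK, key=lambda door: SCENARIO1_RISK[door][idx])

print(lowest_risk_door(39))  # "#3" (risk 0.060 vs. 0.110 and 0.480 at t = 39)
```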
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
