Article

A Reinforcement Learning-Based Double Layer Controller for Mobile Robot in Human-Shared Environments

1 Department of Transport Engineering, College of Architecture Science and Engineering, Yangzhou University, Yangzhou 225127, China
2 Key Laboratory of the Ministry of Education for Modern Measurement & Control Technology, School of Electromechanical Engineering, Beijing Information Science & Technology University, Beijing 102206, China
3 College of Information Science & Technology, Beijing University of Chemical Technology, Beijing 100029, China
4 Smart Lab of Innovation, Sinotrans Innovation & Technology Co., Ltd., Beijing 100029, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(14), 7812; https://doi.org/10.3390/app15147812
Submission received: 4 June 2025 / Revised: 29 June 2025 / Accepted: 8 July 2025 / Published: 11 July 2025

Abstract

Various approaches have been explored to address the path planning problem for mobile robots. However, it remains a significant challenge, particularly in environments where a multi-tasking mobile robot operates alongside stochastically moving humans. This paper focuses on path planning for a mobile robot executing multiple pickup and delivery tasks in an environment shared with humans. To plan a safe path and achieve a high task success rate, a Reinforcement Learning (RL)-based double-layer controller is proposed, built around a double-layer learning algorithm. The high-level layer integrates a Finite-State Automaton (FSA) with RL to perform global strategy learning and task-level decision-making. The low-level layer handles local path planning by incorporating a Markov Decision Process (MDP) that accounts for environmental uncertainties. We verify the proposed double-layer algorithm under different configurations and evaluate its performance based on several metrics, including task success rate, reward, etc. The proposed method outperforms conventional RL in terms of reward (+63.1%) and task success rate (+113.0%). The simulation results demonstrate the effectiveness of the proposed algorithm in solving the path planning problem under stochastic human uncertainties.

1. Introduction

In recent years, autonomous mobile robots have been widely employed in automatic factories [1], logistic centers [2], warehouses [3], and manufacturing systems [4], playing an increasingly important role in transportation, pickup, and delivery systems. Today, the demand for autonomous mobile robots that can collaborate with humans is rapidly increasing. Consequently, planning an optimal safe path for a mobile robot performing multiple pickup and delivery operations in human-shared environments has become a critical and urgent challenge.
Path planning is widely recognized as one of the fundamental problems in robotic control. For a mobile robot performing multiple tasks, the control problem can generally be divided into two subproblems: the task scheduling problem, which determines the order in which tasks are executed, and the local path planning problem, which computes an optimal path to accomplish each individual task. The task scheduling problem determines the execution order of tasks, aiming to find an optimal sequence for visiting multiple goal locations. This is commonly formulated as the Vehicle Routing Problem (VRP) [5] or the Traveling Salesman Problem (TSP) [6], which have been extensively studied in static environments where the path cost between locations is fixed and known. However, the path cost in dynamic environments is uncertain and often time-varying. Local path planning focuses on computing a conflict-free path from a start location to a goal location while satisfying specific performance criteria such as minimizing travel distance, commonly referred to as the Shortest Path Planning (SPP) problem [7,8]. These two subproblems are deeply interdependent: the outcome of task scheduling directly influences local planning decisions; in turn, local path feasibility can affect task scheduling results. Moreover, both subproblems are significantly affected by the presence of stochastically moving humans in human-shared environments, which introduces uncertainty and increases the complexity of planning.
As is well known, path planning in static environments has already been studied in considerable depth and various methods have been developed, including D* [9] and A* [10]. However, the path planning problem for a mobile robot designed with multiple pickup and delivery tasks is NP-hard and remains a significant challenge, particularly in dynamic environments. The primary difficulties arise from (1) controlling multiple tasks, and (2) dealing with environmental uncertainties, including dynamic obstacles and stochastic human behavior.
To date, various methods have been developed to generate a global path for a mobile robot performing multiple tasks. To manage task sequences, syntactically co-safe Linear Temporal Logic (scLTL)—a fragment of Linear Temporal Logic (LTL) [11]—has been widely applied for specifying and controlling multiple tasks formulated as scLTL expressions [12]. For example, Mi et al. [13] employed scLTL to represent task specifications and combined it with the A* pathfinding algorithm. It has been proven that scLTL is efficient for task description and control. In terms of path planning, real-time algorithms [14,15] are commonly used to ensure safe navigation by avoiding moving obstacles in dynamic environments. These methods typically adopt a hierarchical path planning framework in which a global planner provides guidance toward the goal, while a local planner is responsible for obstacle avoidance. Unfortunately, real-time path planning methods incur high computational costs due to continuous re-planning and are unable to explicitly address environmental uncertainties at the planning level. Although real-time re-planning at each time step is widely regarded as the safest strategy for handling environmental uncertainties [16], it is often impractical in real-world scenarios such as warehouses, logistics centers, and automated factories. This is primarily because mobile robots in these environments are required to complete tasks within strict time constraints, making constant re-planning computationally infeasible.
Reinforcement Learning (RL) [17,18] has been widely adopted for modeling complex environmental dynamics and learning optimal control policies through interaction. Numerous Deep Reinforcement Learning (DRL) methods [19,20] have been developed for planning in dynamic environments. Chai et al. [21] proposed a hierarchical Deep Learning (DL)-based control framework to plan optimal maneuver trajectories and guide the robot under uncertainties. However, their motion planning model was trained offline and the learning cost was high. In our previous work [22], we proposed an FSA-MDP controller by integrating scLTL with a Markov Decision Process (MDP) to plan a global path for the mobile robot considering the uncertainties of jamming attackers. However, this approach still relies on real-time re-planning mechanisms when environmental changes occur.
Recent studies have demonstrated that addressing uncertainties at the planning level is not only effective but also more practical for real-world deployment [13]. On the one hand, computational costs are significantly reduced, as continuous re-planning is no longer required. On the other hand, it helps to prevent the robot from becoming trapped in local optima, improving the overall task success rate. In human-shared environments, ensuring the safe operation of mobile robots is both fundamental and critical. At the same time, achieving a high task success rate is essential for improving the efficiency of systems such as warehouses and automated factories. Motivated by these considerations and inspired by the principle of addressing uncertainties at the planning level, we propose a mobile robot controller that solves the path planning and task control problems in human-shared environments. An RL-based double-layer algorithm is proposed to plan a global safe path for a robot performing multiple pickup and delivery tasks in an environment with uncertainties. The key contributions of this study are summarized as follows:
1.
We estimate environmental uncertainties through stochastic simulations and propose a hierarchical mobile robot controller that integrates scLTL and RL. Unlike conventional approaches that rely on real-time obstacle avoidance, our method explicitly handles environmental uncertainties at the planning level.
2.
A novel RL-based double-layer algorithm is developed. Unlike traditional product- or encoder-type algorithms that use scLTL and RL, the proposed algorithm separates the learning problem into high-level and low-level layers. This decomposition enables the robot to compute an optimal path in human-shared environments with significantly reduced learning costs.
3.
At the higher level, an FSA-RL learner is proposed on the basis of FSA transitions to find an optimal global strategy for multiple tasks. Notably, we extend the deterministic state transitions of the FSA to a stochastic formulation, allowing the controller to better adapt to dynamic environments.
4.
At the lower level, we construct an MDP learner that improves upon the classic MDP by redefining the reward function to explicitly account for environmental uncertainties at the planning level. This enhancement enables the generation of an optimal local path under dynamic and uncertain conditions.
The rest of this paper is organized as follows: Section 2 presents a review of related works; Section 3 briefly reviews the scLTL specifications; Section 4 introduces the problem formulation; the proposed safe planner is presented in Section 5, followed by the simulation results and discussions in Section 6; finally, a brief conclusion is provided in Section 7.

2. Related Work

In path planning, obstacle avoidance has traditionally been the primary requirement for mobile robots. Numerous efficient algorithms have been proposed for path planning and task assignment, including A* [10,23], D* [9,24], Reinforcement Learning (RL) [17,25], and Monte Carlo Tree Search (MCTS) [26,27]. However, although the A* algorithm is efficient for finding the shortest path in static environments, it struggles to address uncertainties in dynamic environments, and D* faces the same limitation. While the RL and MCTS methods are effective in both static and dynamic settings, these approaches typically require extensive learning to derive an optimal policy through repeated trials. Moreover, even in static environments they cannot provide any guarantees for the learned policy.
These classic algorithms do not perform well when addressing uncertain factors, especially unpredictable obstacles such as human operators. When a human suddenly appears in front of a mobile robot, the robot has to stop proactively to ensure human safety. Thus, more robust methods [28,29,30,31] have been presented for environments requiring human–robot coexistence. Cheng et al. [28] proposed a safe RL controller with control barrier functions and utilized Gaussian Processes (GPs) to model the system dynamics and uncertainties. However, Gaussian functions encounter difficulties when modeling uncertainties with multiple peaks. Mohamed et al. [29] proposed a framework that uses DRL as an enabling technology to enhance the intelligence and safety of robots in human–robot collaboration scenarios. Liu et al. [30] presented a DRL approach to realize real-time collision-free motion planning of an industrial robot for human–robot collaboration. Shao et al. [31] proposed a Reachability-based Trajectory Safeguard (RTS) controller which leverages reachability analysis to ensure safety during training and operation. While these methods are efficient in human–robot collaboration, they depend on online learning and cannot resolve collisions at the planning level. A novel approach to collision avoidance based on human–robot interaction was proposed in [32], where the robot communicates its presence to nearby humans through voice messages. Upon approaching a certain distance, the human is required to provide feedback in order to complete the interaction process. Zeng et al. [33] presented a virtual force field-based mobile robot navigation algorithm in which both active and critical regions are used to deal with the uncertainty of human motion. The region sizes for the robot's navigation are calculated based on worst-case avoidance conditions. Although the above-mentioned methods have shown promising results, they all consider humans as fixed subjects in the physical workspace.
Hybrid methods combining Global Path Planning (GPP) and Local Path Planning (LPP) [34] are widely applied in navigation, where LPP is used for real-time obstacle avoidance. For instance, this scheme is widely utilized in the field of autonomous driving [35]. Strictly speaking, this is a real-time planning system in which the re-planning mechanism is invoked when an unexpected obstacle is encountered. The main advantage is that it provides high safety assurance. However, this approach does not consider future risks or uncertainties at the planning level. In an automated factory scenario where mobile robots work together with humans, a mobile robot usually needs to accomplish its tasks within a given time budget. Thus, such real-time planning systems incur high costs and easily become trapped in continuous re-planning, leading to low task success rates.
Recently, numerous path planning methods that consider uncertain human actions have been proposed to handle human–robot interactions. Chan et al. [36] presented a conflict-free speed alteration strategy for human safety. Although the superiority of this method was shown by comparing it with two existing methods, this method plans one-dimensional moving paths, which is unsuitable for an autonomous system. Similar works have made progress in the probabilistic prediction of human motion [37,38] and robust sequential trajectory planning for multi-robot systems [39,40]. However, fusing these techniques into a real-time planning system remains a challenge because of the difficulty encountered in joint planning and prediction for multiple robots and humans. Bajcsy’s research [41] introduced a scalable framework for robot navigation that accounts for high-order system dynamics and maintains safety in the presence of external disturbances, other robots, and nondeterministic intentional agents. However, this framework decomposes multi-agent pathfinding into a sequence of instances in which it is necessary to re-plan the paths at every time step for all agents [42,43]. While such a framework maintains human safety, it does not account for the time budget of the robot or its task success rate. Consequently, this framework is difficult to realize in real-world applications.
Table 1 summarizes and compares the related path planning methods discussed above. As described previously, shortest path planning methods such as A*, D*, and RRT* are not suitable for dynamic environments in which humans move randomly. Methods such as RL, FSA-MDP, and MCTS adopt learning schemes to obtain an optimal control policy, and are efficient in both static and dynamic environments. Moreover, these learning schemes make it possible to address future risk in the learning phase through techniques such as stochastic simulations. Overall, RL is a suitable scheme, while MCTS suffers from expensive computational costs. Real-time solutions, which combine global path planning with real-time obstacle avoidance, are widely regarded as the safest approach to path planning. The real-time method is widely applied in navigation; however, it lacks risk foresight and cannot resolve conflicts at the planning level. In summary, addressing risks at the planning level rather than relying on real-time methods is crucial for ensuring the safety of mobile robots working alongside humans while achieving a high task success rate.
It has been proven that syntactically co-safe Linear Temporal Logic (scLTL), which is one fragment of linear temporal logic (LTL) [11], is efficient in terms of both task descriptions and control. Multiple tasks can be formulated using an scLTL formula [44,45], offering efficient methodologies for managing multi-task scenarios and seamlessly integrating with path planning methods. For example, in [22] the authors investigated an RL approach merging MDP with scLTL and proposed an FSA-MDP algorithm to generate optimal paths against jamming attacks. Building upon this foundation, our primary aim is to generate safe and efficient plans that satisfy scLTL tasks for environments such as warehouses and automated factories where humans and robots must coexist or cooperate.
Table 2 shows a comparison of the different methods. Ushio et al. [46] proposed a product MDP learner based on LTL for robots performing multiple tasks. However, their method only considers uncertainties in state transitions and does not address environmental uncertainties. The FSA-encoder-based MDP (Feb-MDP) [22] integrates scLTL and reinforcement learning, handling environmental uncertainties through real-time re-planning. However, this approach was designed for jamming attack scenarios, which differ significantly from human–robot interaction settings. DRL [47] and Plan-Seq-Learn [48] primarily focus on task sequencing for robotic manipulators, but lack mechanisms for handling environmental uncertainties. Furthermore, path planning for mobile robots in human-shared environments is fundamentally different from motion planning for manipulators. Teja et al. [49] proposed a human-aware navigation planner called Cooperative Human-Aware Navigation (Co-HAN), which effectively handles various human-robot interaction contexts; however, their approach does not consider the task sequencing problem. As summarized in Table 2, our goal is to develop a path planner for mobile robots that meets the following two requirements: (1) addressing environmental uncertainties explicitly at the planning level, and (2) solving the task sequencing problem and generating an optimal global path for performing multiple tasks. In contrast to traditional approaches, the proposed algorithm resolves potential conflicts during the planning phase, enabling the robot to complete multiple tasks in human-shared environments through one-time planning without the need for continuous re-planning.

3. Preliminaries

scLTL is one fragment of LTL. An scLTL formula commonly consists of a set of atomic propositions, Boolean operators, and temporal operators. Let P r P be the set of atomic propositions. An scLTL formula over the atomic propositions P r P is recursively defined as
φ ::= ⊤ | prp | ¬prp | φ_1 ∧ φ_2 | φ_1 ∨ φ_2 | F φ | X φ | φ_1 U φ_2.  (1)
Note that:
  • ⊤ = ¬⊥ is an scLTL formula, where ⊥ denotes False,
  • F φ = ⊤ U φ is an scLTL formula, and
  • given an scLTL formula φ, it is written in positive normal form, where ¬ appears only in front of an atomic proposition prp.
In addition, ¬ (negation), ∧ (conjunction), and ∨ (disjunction) are Boolean operators, while X (next), F (in the future), and U (until) are temporal operators. By combining the Boolean and temporal operators, complex and rich mission specifications can be described. An scLTL formula is interpreted over an infinite word w ∈ (2^PrP)^ω, where w = w[0]w[1]w[2]…. Let w(i) = w[i]w[i+1]…; then, the semantics of formula φ over word w are defined recursively in Equation (2) as follows:
w(i) ⊨ prp ⟺ prp ∈ w[i],
w(i) ⊨ φ_1 ∧ φ_2 ⟺ w(i) ⊨ φ_1 ∧ w(i) ⊨ φ_2,
w(i) ⊨ φ_1 ∨ φ_2 ⟺ w(i) ⊨ φ_1 ∨ w(i) ⊨ φ_2,
w(i) ⊨ X φ ⟺ w(i+1) ⊨ φ,
w(i) ⊨ φ_1 U φ_2 ⟺ ∃k ≥ i, ∀j (i ≤ j < k): w(k) ⊨ φ_2 ∧ w(j) ⊨ φ_1.  (2)
For any scLTL formula φ, there exists an FSA A_φ = <Q, PrP, δ, q_0, F> that accepts all infinite words satisfying the formula φ, where Q is the finite set of states, PrP is the set of atomic propositions, δ: Q × 2^PrP → Q is the state transition function, q_0 ∈ Q is the initial state, and F ⊆ Q is the set of accepting states. Several off-the-shelf tools for the computation of A_φ have been developed [50,51]. (For the tool used in this paper, see https://spot.lre.epita.fr/app/; accessed on 7 July 2025.)
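For illustration, the following minimal sketch, assuming the Spot Python bindings from the tool above are installed, translates a small co-safe formula into an automaton and prints its structure; the formula and variable names are our own examples, not taken from the paper's implementation.

```python
# Minimal sketch, assuming the Spot Python bindings (https://spot.lre.epita.fr) are installed.
# Translates an example co-safe formula into an automaton whose states, transitions,
# initial state, and accepting states correspond to Q, delta, q_0, and F above.
import spot

# Example scLTL task: eventually pick up and deliver at workstation 1,
# never delivering before picking up.
phi = "F pick1 & F deliver1 & (!deliver1 U pick1)"

aut = spot.translate(phi, "BA", "deterministic")  # deterministic Buchi automaton
print("number of states:", aut.num_states())
print("initial state:", aut.get_init_state_number())
print(aut.to_str("hoa"))  # HOA text form from which the FSA components can be read off
```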

4. Problem Formulation

We consider the online pickup and delivery problem of an agent under environmental uncertainties. An agent is designed to collect workpieces from I (I ∈ ℕ+) specific workstations and transport them to different destinations. Let o_i be the position of workstation i (1 ≤ i ≤ I) and g_i the destination of the workpieces from workstation i.
For each workstation i, the agent:
  • Visits o_i and collects workpieces, which is called the collection task
  • Delivers the collected workpieces to its destination g_i, named the delivery task
  • T_{i1}, T_{i2} are the budgets for the collection and delivery tasks, respectively.
Therefore, I workstations define 2I tasks. We denote the task state at time step t as τ_t = {τ_1, …, τ_i, …, τ_I, τ_{I+1}, …, τ_{I+i}, …, τ_{2I}}, where τ_i and τ_{I+i} represent the states of the collection and delivery tasks of workstation i, respectively. Here, τ_i, τ_{I+i} ∈ {False, True}, where:
  • False shows a task that is not achieved
  • True represents a task that is achieved.
Thus, the initial task state is τ_0 = {τ_1, …, τ_i, …, τ_{2I}} = {False, …, False, …, False}. After τ_i is achieved at time step t, τ_t = {False, …, True, …, False}. After all tasks are completed, the task state becomes {True, …, True, …, True}.
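As a concrete illustration of this bookkeeping, the short sketch below (with helper names of our own, not the paper's code) represents τ_t as a list of 2I Booleans and updates it as tasks are achieved.

```python
# Illustrative sketch of the task state tau_t as a list of 2*I Booleans:
# indices 0..I-1 hold collection tasks, indices I..2I-1 hold delivery tasks.
I = 3  # number of workstations in this example

def initial_task_state(I):
    return [False] * (2 * I)          # tau_0 = {False, ..., False}

def achieve_collection(tau, i):
    """Mark the collection task of workstation i (1-indexed) as achieved."""
    tau[i - 1] = True
    return tau

def achieve_delivery(tau, i):
    """Mark the delivery task of workstation i as achieved (collection must come first)."""
    assert tau[i - 1], "delivery must not precede collection"
    tau[I + i - 1] = True
    return tau

tau = initial_task_state(I)
tau = achieve_collection(tau, 2)      # e.g. tau_t = {False, True, False, False, False, False}
tau = achieve_delivery(tau, 2)
all_done = all(tau)                   # True once the state is {True, ..., True}
```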
Definition 1.
If a task is achieved within its given budget, we say that it is online; otherwise, it is offline.
Definition 2.
If all tasks are completed online, we say that the run is a success; otherwise, we consider it a failure.
Environmental uncertainties arise from the states of the doors. There are N (N ∈ ℕ+) doors in the environment that the agent has to pass through. Each door n (1 ≤ n ≤ N) can be in one of two states, open or closed. The state of each door is determined by the door control system. Let d_n^t denote the state of door n at time step t; then:
d_n^t = 1, if the door can open; 0, otherwise.  (3)
When the agent needs to pass through a door, it sends an open request to the door control system. However, the request may be denied if the door is not available because the area beyond it is dangerous, such as when other vehicles are passing through.
Based on the 2I tasks of the agent, we define a set of atomic propositions PrP = ∪_{i=1}^{I} PrP_i and PrP_i = {null_i, pick_i, deliver_i} (1 ≤ i ≤ I), where:
  • null_i: The agent does not accomplish the collection task at workstation o_i
  • pick_i: The agent successfully completes the online collection task at workstation o_i
  • deliver_i: The agent successfully completes the online delivery task of workpieces from workstation o_i.
All tasks can be formulated into an scLTL formula φ , as follows:
φ = ⋀_{i=1}^{I} ( F pick_i ∧ F deliver_i ∧ (¬deliver_i U pick_i) )  (4)
where:
  • F pick_i: The agent will achieve the online collection task τ_i in the future
  • F deliver_i: The agent will achieve the online delivery task τ_{I+i} in the future
  • ¬deliver_i U pick_i: The delivery task τ_{I+i} will not be achieved before the collection task τ_i is completed.
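The specification in Equation (4) can be assembled mechanically for any number of workstations; the short sketch below builds the formula string in Spot-compatible syntax (the function name is our own, not from the paper).

```python
# Sketch: build the scLTL specification of Equation (4) for I workstations as a formula
# string, using Spot-style syntax for the Boolean and temporal operators.
def pickup_delivery_spec(I):
    clauses = [
        f"(F pick{i} & F deliver{i} & (!deliver{i} U pick{i}))"
        for i in range(1, I + 1)
    ]
    return " & ".join(clauses)

print(pickup_delivery_spec(3))
# (F pick1 & F deliver1 & (!deliver1 U pick1)) & (F pick2 & ...) & (F pick3 & ...)
```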
In this paper, we aim to generate an optimal solution for the online pickup and delivery problem under environmental uncertainties.

5. Proposals

The architecture of the proposed planner is illustrated in Figure 1. We first model the environmental uncertainties based on the observations. Then, an RL-based double-layer learning algorithm is introduced to learn an optimal policy while accounting for these uncertainties. The lower-level layer consists of an MDP learner which takes a starting position and a goal, then searches for a local optimal safe path under uncertainties. The higher-level layer employs an FSA-RL learner to achieve global optimization across all tasks. The details of the proposed planner are presented below.

5.1. Model Environment Uncertainties

The uncertainties are primarily caused by human-driven vehicles such as forklifts. Let Z_t = {z_t^1, …, z_t^K} be the observations of all forklifts at time step t by the door control system, and let z_t^k = <s_t^k, dire^k> be the state of forklift k, where s_t^k is the observed position and dire^k ∈ {←, →} is the observed direction. Each forklift can take two actions, forward and wait. The direction of a forklift remains fixed throughout its operation, and the wait action is reserved for emergency situations only. The probabilistic model is illustrated in Figure 2. Each forklift is human-operated, and the driver controls its speed within a specified range. Suppose that a forklift is observed at position v at time step t, as shown in Figure 2, where s_t^k = v and dire^k is ←. It moves to s_{t+1}^k with the forward action with probability p_f. The state transition is written as (z_t^k, forward, z_{t+1}^k). Note that the transition is not deterministic: due to speed variations, the forklift may move forward by one or two steps, i.e., s_{t+1}^k = v − 1 or v − 2.
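The stochastic motion described above can be written as a small sampling routine; the probabilities below are illustrative assumptions, since the paper only states that a forklift advances one or two cells per forward step.

```python
# Hedged sketch of the forklift motion model: moving "forward" advances one or two cells
# along the corridor, depending on the driver's speed; "wait" is reserved for emergencies.
import random

def step_forklift(pos, direction, p_forward=1.0, p_two_cells=0.5):
    """One time step of a forklift.
    pos: integer corridor cell; direction: -1 for left, +1 for right.
    p_forward, p_two_cells: illustrative probabilities (not values from the paper)."""
    if random.random() >= p_forward:      # emergency wait
        return pos
    cells = 2 if random.random() < p_two_cells else 1
    return pos + direction * cells
```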
Each door n is associated with a designated barrier area B_n, and the door control system evaluates the risk level of B_n to determine whether to grant or deny an open request. As illustrated in Figure 3, when the agent attempts to pass through a corridor via door n_1 to reach another workshop, it sends an open request to the door controller upon approaching the door. The door controller evaluates the risk level of B_{n_1}; if it confirms safety, it responds with a confirmation message “Y” and opens door n_1. In contrast, if a forklift is detected approaching the corridor near door n_2, then the area is considered unsafe. As a result, the controller denies the request and responds with a refusal message “N”, keeping the door closed.
In the following, we show the details of the risk evaluation for the barrier area B_n of each door n. Here, B_n = {v_1^n, v_2^n, …} is the set of vertices around door n. Suppose that a forklift k enters the corridor at t_0 and that the observation is z_{t_0}^k = <s_{t_0}^k, dire^k>. We simulate its path until it leaves the corridor and calculate the conditional probability of each vertex v ∈ B_n being occupied by forklifts:
p_{k,t}(v | Z_{t_0}) = (visited times of v) / (simulation times)  (5)
where t_0 < t < T_k, v ∈ B_n, and T_k is the time step at which forklift k leaves the corridor. The risk level of B_n at time step t is evaluated by
r(B_n, t) = Σ_{v ∈ B_n} Σ_{k=1}^{K} p_{k,t}(v | Z_{t_0}).  (6)
Thus, we obtain the risk distribution Risk = {{r(B_1, t_0+1), …, r(B_1, T*)}, …, {r(B_N, t_0+1), …, r(B_N, T*)}}, where T* is the time step at which the last forklift leaves the corridor.
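The risk distribution Risk can be approximated with a simple Monte Carlo loop; the sketch below builds on the motion sketch above, with illustrative cell indices and helper names, and counts how often simulated forklifts occupy the barrier cells of a door in the spirit of Equations (5) and (6).

```python
# Sketch of the Monte Carlo risk estimation of Equations (5)-(6): simulate each observed
# forklift over the horizon and count how often the barrier cells of door n are occupied.
from collections import defaultdict

def estimate_door_risk(observations, barrier_cells, horizon, n_sims=2000):
    """observations: list of (position, direction) pairs Z_t0 for the forklifts.
    barrier_cells: set of corridor cells forming B_n.
    Returns {t: r(B_n, t)} for t = 1..horizon (time steps after t0)."""
    occupied = defaultdict(int)
    for _ in range(n_sims):
        for pos, direction in observations:
            p = pos
            for t in range(1, horizon + 1):
                p = step_forklift(p, direction)     # motion sketch above
                if p in barrier_cells:
                    occupied[t] += 1
    # Dividing by the number of simulations gives the occupancy probability summed
    # over all forklifts and barrier cells, i.e. the risk level r(B_n, t).
    return {t: occupied[t] / n_sims for t in range(1, horizon + 1)}
```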

5.2. High-Level: FSA-RL Learner

The high level is a global learner based on FSA. All tasks are first written in formula φ, as shown in Equation (4). Afterwards, an FSA A_φ is transformed from formula φ using off-the-shelf tools, where A_φ = <Q, PrP, δ, q_0, F>. Based on A_φ, we construct the high-level learner M_h = <S_h, A_h, P_h, δ_h, R_h, q_0, q_F, PrP, L>, where:
  • S_h = Q is the state set
  • A_h = {move, stay} is the action set
  • P_h: S_h × A_h × S_h → [0, 1] is the state transition probability
  • δ_h: S_h × PrP × L(S_l) → S_h is the state transition function (with S_l being the low-level state set)
  • L: S_l → 2^PrP is the label function
  • PrP is the atomic proposition set
  • R_h is the reward function, which is determined by the low-level MDP learner
  • q_0 is the initial state and q_F ∈ F is the accepting state.
The agent chooses an action a_h ∈ A_h for the state transition from s_h to s_h′ with probability P(s_h′ | s_h, a_h). Note that in the case of a_h = move, the agent moves to the next state s_h′ (s_h′ ≠ s_h) with probability p(s_h′ | s_h, a_h) and may remain at s_h with probability 1 − p(s_h′ | s_h, a_h). An example is shown in Figure 4 (left), where the agent chooses to move to q′ from q. However, it may stay at q with probability 1 − p(q′ | q, a_h). The agent stays at q when a_h = stay, such that p(q | q, a_h = stay) = 1, as shown in Figure 4 (right).
In detail, the state transition is defined by δ h ( · ) as in Equation (7):
s_h′ = δ_h(s_h, prp, L(s_l))  (7)
where prp ∈ PrP_i is an atomic proposition representing a task and L(s_l) is the label. Note that the action set A_h is redefined as
A_h = {pick_i, deliver_i, null_i},  (8)
where the move action is replaced by {pick_i, deliver_i} and stay is replaced by null_i. Here, δ_h(·) differs from the deterministic state transition δ(·) in A_φ. Given a state s_h = q ∈ Q, let neighbours be the set of all possible states to which q may transit. For each q′ ∈ neighbours, there is a corresponding atomic proposition prp that leads the agent to move. As described previously, each prp represents an online collection task τ_i or delivery task τ_{I+i}. Let f(·) define the mapping from prp to the position of the corresponding task, defined as follows:
s_l = f(prp)  (9)
where s_l ∈ {o_1, …, o_I, g_1, …, g_I}. After the agent chooses a neighbor to visit, it invokes the lower level to obtain a label L(s_l), where:
L(s_l) = True, if the task is completed online; False, otherwise.  (10)
If task τ_i or τ_{I+i} is achieved online, that is, if the agent visits s_l within the given budget at the lower level, then L(s_l) returns True. The agent then transits from s_h = q to s_h′ = q′, written as δ_h(q, prp, L(s_l) = True) → q′. Otherwise, L(s_l) = False and δ_h(q, prp, L(s_l) = False) → q. Figure 5 shows an example in which an agent chooses to move to q_1 with prp_i = pick_i from state q. Here, q̃ is the last state and the last transition is δ_h(q̃, p̃rp, L(s̃_l) = True) → q, where p̃rp is the atomic proposition chosen at q̃ and s̃_l = f(p̃rp). Note that there is a special case when q = q_0. As shown in Figure 5 (right), we define its last state transition as δ_h(q_0, null_0, L(s̃_l) = True) → q_0. At this point, it should be noted that s_l = s̃_l if prp = null_i, formulated as
s_l = f(prp) = f(p̃rp), if prp = null_i.  (11)
In the transition δ_h(q_0, null_0, L(s̃_l) = True) → q_0, we define s_0 = f(null_0).
The high level invokes the low-level learner to learn a local control policy. The low-level learner returns a label L(s_l) indicating the local outcome. Meanwhile, if the corresponding task is accomplished online, a local path x_{q,q′} is also returned.
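The interplay between the chosen proposition, the mapping f(·), and the low-level label can be summarized as in the sketch below; the helper names are illustrative, and the full procedure is given in Algorithm 1.

```python
# Sketch of one high-level decision step: map the chosen proposition to a goal position,
# invoke the low-level learner, and fire the FSA transition only if the task is online.
def high_level_step(q, prp, fsa_delta, f, run_low_level):
    """q: current FSA state; prp: chosen atomic proposition (e.g. 'pick1').
    fsa_delta(q, prp): next FSA state; f(prp): goal position s_l of the task;
    run_low_level(goal): returns (label, path), with label=True iff the goal is
    reached within its budget (Equation (10))."""
    goal = f(prp)
    label, path = run_low_level(goal)
    if label:                              # delta_h(q, prp, True) -> q'
        return fsa_delta(q, prp), path
    return q, None                         # delta_h(q, prp, False) -> q
```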
The scLTL formula φ defines the global reward condition. The agent receives a global reward if and only if φ is satisfied, where
R_h(s_h, a_h, s_h′) = E, if s_h′ = q_F.  (12)
In addition, the immediate reward function is also designed to accelerate the learning process, defined by
R_h(s_h, a_h, s_h′) = r_E + |x_{s_h, s_h′}| · step_cost, if s_h′ ≠ s_h; 0, otherwise,  (13)
where |x_{s_h, s_h′}| is the length of the path from s_h to s_h′ returned by the lower level and step_cost is the cost for each movement.
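In code, the reward of Equations (12) and (13) could look like the following sketch; the constants stand in for the values in Table 3, and step_cost is assumed negative so that longer local paths are penalized.

```python
# Sketch of the high-level reward (Equations (12)-(13)); constants are placeholders for
# Table 3, with step_cost assumed negative so that longer local paths reduce the reward.
E, r_E, step_cost = 100.0, 1.0, -0.1

def high_level_reward(s_h, s_h_next, q_F, path):
    if s_h_next == q_F:                     # formula phi satisfied: global reward
        return E
    if s_h_next != s_h:                     # task achieved online: immediate shaped reward
        return r_E + len(path) * step_cost
    return 0.0                              # no transition, no reward
```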

5.3. Low-Level: MDP Learner

The lower level consists of a local MDP learner for learning an optimal control policy and generating an optimal local path. We model the environment as M_l, defined by M_l = <S_l, A_l, P_l, R_l, s_l^0, s_l^g, Risk>, where:
  • S_l = S is the state set
  • A_l = {forward, backward, right, left, wait} is the action set
  • P_l: S_l × A_l × S_l → [0, 1] is the state transition probability
  • R_l is the reward function with R_l(s_l, a_l, s_l′) ∈ ℝ, where s_l, s_l′ ∈ S_l and a_l ∈ A_l
  • s_l^0 is the initial state and s_l^g is the goal state, both of which are provided by the higher level
  • Risk is the risk set of the environment.
The environment of the lower level is a 2D grid world. The state set S_l contains all vertices of the environment. Starting from the given initial state s_l^0, the agent chooses an action a_l ∈ A_l and explores the environment to gain experience in the form of a Q-table. The reward function R_l is designed as follows:
R_l(s_l, a, s_l′) = goal_reward, if s_l′ = s_l^g; step_cost, otherwise.  (14)
To enhance the efficiency of learning with uncertainties, we modify the reward function R l (Equation (14)) as follows:
R_l(s_l, a, s_l′) = goal_reward, if s_l′ = s_l^g; step_cost + u(s_l′, t), otherwise,  (15)
where t is the time step at which the agent transits to s_l′ and u(·) calculates the reward from environmental uncertainties. When an agent approaches a door, it sends a request and the door control system may return a message “Y” to open the door or “N” to refuse. However, whether or not a door will be available in the future is unknown. Thus, we consider the future uncertainties Risk to predict whether or not a door could open. For each door n,
d_n^t = 1, if r(B_n, t) < θ; 0, otherwise,  (16)
where θ is a risk threshold. As shown in Equation (16), when the risk of door n is less than θ, we consider that the door can be passed through with high probability. When r(B_n, t) ≥ θ, passing through the door would pose a significant collision risk, which is unacceptable; in this case, the door refuses to open, and we define u(·) as
u(s_l′, t) = Penalty, if d_n^t = 0; r(B_n, t) · c_cost, otherwise,  (17)
where Penalty is a negative constant value and c_cost is the collision penalty. Here, u(s_l′, t) captures the influence of environmental uncertainties. Specifically, the robot receives a penalty if it is refused passage through a door, as the refusal indicates that proceeding may pose a collision risk with a forklift. It is important to note that a collision may still occur even if the door system grants access (i.e., opens the door), as the door-opening decision is based on the threshold condition r(B_n, t) < θ. To reflect this residual risk, the robot receives a reward proportional to the predicted risk value, given by r(B_n, t) · c_cost. This design enables the local learner to utilize the predicted risk distribution Risk to learn a safer path.
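The risk-aware reward of Equations (15)-(17) can be sketched as below; the constants again stand in for Table 3, door_at gives the door (if any) guarding the next cell, and risk holds the precomputed values r(B_n, t) from Section 5.1. All names are illustrative.

```python
# Sketch of the risk-aware low-level reward (Equations (15)-(17)). Constants are
# placeholders for Table 3; risk maps (door, time step) to the precomputed r(B_n, t).
goal_reward, step_cost, Penalty, c_cost, theta = 10.0, -0.1, -5.0, -1.0, 0.1

def low_level_reward(s_next, goal, t, door_at=None, risk=None):
    if s_next == goal:
        return goal_reward
    r = step_cost
    if door_at is not None:
        r_bn_t = (risk or {}).get((door_at, t), 0.0)
        if r_bn_t >= theta:        # door predicted to refuse (d_n^t = 0): flat penalty
            r += Penalty
        else:                      # door opens, but the residual risk is still penalized
            r += r_bn_t * c_cost
    return r
```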
Both the high- and low-level learners adopt a Q-learning mechanism to learn the optimal policy. The learning structures of these learners are shown in Algorithms 1 and 2, respectively. Figure 6 illustrates the interaction between the high-level and low-level learners in learning the optimal global path for executing multiple tasks.
Algorithm 1: FSA-RL learner
Algorithm 2: Low-level MDP learner
Figure 6. High- and low-level interaction for learning the optimal global path.
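To make the interaction in Figure 6 concrete, the sketch below runs one high-level Q-learning episode over the FSA states, calling the step and reward helpers sketched earlier; all names and the ε-greedy details are illustrative, and Algorithms 1 and 2 give the complete procedures.

```python
# Sketch of one high-level FSA-RL training episode (epsilon-greedy Q-learning over FSA
# states). enabled[q] lists the atomic propositions available at state q; step(q, prp)
# wraps high_level_step with the FSA, f(.), and the low-level learner bound in.
import random

def fsa_rl_episode(Q, q0, q_F, enabled, step, reward,
                   alpha=0.5, gamma=0.9, eps=0.2, max_steps=50):
    q = q0
    for _ in range(max_steps):
        actions = enabled[q]
        if random.random() < eps:                       # explore
            prp = random.choice(actions)
        else:                                           # exploit
            prp = max(actions, key=lambda a: Q.get((q, a), 0.0))
        q_next, path = step(q, prp)                     # invokes the low-level MDP learner
        r = reward(q, q_next, q_F, path)
        best_next = max((Q.get((q_next, a), 0.0) for a in enabled.get(q_next, [])),
                        default=0.0)
        old = Q.get((q, prp), 0.0)
        Q[(q, prp)] = old + alpha * (r + gamma * best_next - old)   # Q-learning update
        q = q_next
        if q == q_F:                                    # accepting state: phi satisfied
            break
    return Q
```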
As described above, the high-level learner is responsible for searching the global task strategy, while the low-level learner focuses on learning local paths. This hierarchical decomposition effectively transforms a complex global learning problem into several simpler local learning problems. For a single-layer RL learner, the size of the search space is |S_l| × 2^{2I}. In contrast, our hierarchical approach reduces the search space to |Q| for the high-level learner and |S_l| for the low-level learner, where |Q| denotes the number of automaton states derived from scLTL. As is well known, learning an optimal policy in a small search space is much easier than in a large one. By decomposing the complex global learning problem into two levels, a high level for task sequencing and a low level for local path planning, the learning cost is reduced. Moreover, the proposed method reduces redundant learning cost, as it eliminates the need to re-learn a local path from s_l^0 to s_l^g after it has already been successfully learned. In contrast, a single-layer RL learner would need to continue learning until finding an optimal global path, resulting in higher learning costs and slower convergence.

6. Results and Discussion

6.1. Reward Design and Environment Setting

The details of the rewards for the high- and low-level learners are designed as shown in Table 3. The other parameters are shown in Table 4. The agent should finish all tasks within the given budget of 100 time steps. The learning parameters in Q-learning are set as α = 0.5, γ = 0.9, and ϵ = 0.2. Figure 7 shows how α, γ, and ϵ affect the learning process. As shown in Figure 7a, the learning rate α influences the convergence speed but does not affect the final convergence reward. In this paper, we set α = 0.5 as a tradeoff between learning stability and speed. The discount factor γ has a significant impact on the final convergence reward, as shown in Figure 7b; here, we set γ = 0.9, as it achieves the highest reward. The exploration rate ϵ in the ϵ-greedy policy is typically set to 0.2; Figure 7c confirms that this value provides a favorable balance between exploration and exploitation. The risk threshold is set to θ = 0.1.
We verify the proposed FSA-RL algorithm in dynamic environments, as shown by the different configurations in Figure 8. In Figure 8a, three workstations, six tasks, and two forklifts are considered. Six doors are set in the environment. To increase the environmental uncertainties, we then increase the number of tasks to twelve, with six workstations and four forklifts. To model the environmental uncertainties, 2000 simulations are performed and the risk distribution Risk is calculated. During the simulation, p_f = 1.0 and p_w = 0.0 for the motion model of the forklifts. Note that the uncertainties are caused by a forklift moving one or two steps ahead, which we randomly simulate. The start time of the agent is set as t_0 = 0, while the forklifts start moving from time step t = 30.
We conduct 100 simulations and evaluate the proposals by the average conflict number, average task success rate, average reward, and average length of generated paths. Moreover, we compare the results with the RL method.

6.2. Results on Scenario 1: Workstation Number I = 3 and Forklift Number K = 2

Scenario 1 verifies the proposed algorithm under the configuration as shown in Figure 8a. We first construct FSA A φ from the scLTL formula φ (Equation (4)) with I = 3 . Figure 9 demonstrates the state transitions of A φ . Based on FSA A φ , we construct the state space of the high-level learner. As shown in Figure 9, it contains 27 nodes, where q 0 = 1 is the initial state and q F = 0 is the accept state.
From q_0 = 1, the agent randomly chooses an atomic proposition from {pick_1, pick_2, pick_3} to explore the state space of the high-level learner. Here, we take pick_1 as an example. The algorithm maps pick_1 to its task position with Equation (9), where s_l = o_1 = f(pick_1) and s̃_l = s_0 = f(null_0). Then, the agent sends (s_0, o_1) to the low-level learner and invokes it to obtain L(s_l) and the corresponding local path x. If L(s_l) = True, that is, when x satisfies the requirement, the agent moves to state 6, written as δ_h(s_h = 1, pick_1, L(s_l) = True) → s_h′ = 6. Otherwise, if L(s_l) = False, the generated path x does not satisfy the requirement and δ_h(s_h = 1, pick_1, L(s_l) = False) → s_h = 1.
The reward of the learning process is shown in Figure 10. The blue curve shows the learning curve of the classic RL method, which is characterized by substantial oscillatory behavior. The green curve shows the learning curve of the proposed FSA-RL; its reward converges at around −1.0, much higher than that of the classic RL method, which converges at around −5.0.
Figure 11 shows the details of the simulation results from the perspective of refused times, average task success rate, average reward, and average length of the generated paths. Note that “refused times” refers to the number of times the robot fails to pass through a door along the generated path because the door rejects the open request. As shown in Figure 11a, when using the classic RL method in 100 simulation runs, the agent was refused 37 times by the door system and failed to pass the door, achieving an average task success rate of 0.63 in Figure 11b. Meanwhile, the proposed method enabled the agent to pass through the door and successfully deliver all workpieces in all 100 simulation runs; thus, the number of refusals in Figure 11a is 0 and the average task success rate in Figure 11b is 1.0.
Figure 11c,d shows the average reward and average length of the generated paths. There is little difference in the average length between the RL and proposed FSA-RL methods, with an average length of 79 for RL and 80 for the proposed method. The difference lies in the average reward; as Figure 11c shows, the average reward of the proposed FSA-RL method is −0.92, much higher than the −1.94 achieved by the RL method. The reason for this is that the door system refuses to allow the agent to pass through when the agent follows the path generated using RL.
Figure 12 shows an example of the generated paths. Figure 12a shows the path generated with RL. The agent visits workstations o_1 → o_2 → o_3 first, then passes through doors #3 and #6 before proceeding to the other workshop for the delivery tasks, in the order g_2 → g_3 → g_1. Unfortunately, when the agent requests to pass through door #3 at time step t = 36, the door system tends to refuse due to the conflict risk from forklifts in the corridor. The risk values of door #3 at time steps 37 and 38 are r(B_3, 37) = 0.5 and r(B_3, 38) = 0.15, respectively (Table 5). Clearly, choosing door #3 is not a safe policy.
The proposed FSA-RL method generates a different policy for moving through the corridor, as shown in Figure 12b. The pickup policy is the same as with the RL method, visiting o_1 → o_2 → o_3 sequentially. The agent then chooses to pass through doors #1 and #4. Moreover, the delivery policy is also different. In the path generated with FSA-RL, the agent performs the delivery tasks in the order g_1 → g_2 → g_3. The agent requests to open door #1 at time step t = 44 with r(B_1, 45) = 0.0 and r(B_1, 46) = 0.0. Clearly, door #1 is safe for the agent to pass through.

6.3. Results on Scenario 2: Increase Forklift Number to K = 4

In this scenario, we increase the number of forklifts to K = 4 and reset the start position of the agent in order to further investigate performance under a complex environment. The four forklifts begin moving at t = 30 . The results are shown in Figure 13 and Figure 14.
As shown in Figure 13, the number of refusals with the RL method is 54, while the number with the proposed FSA-RL method is 2. Compared with the RL method, the proposed method reduces the refusals by about 96.3%. Compared with the results in Figure 11, it can be seen that the number of refusals increases. Consequently, the average task success rate of RL is 0.46 and that of FSA-RL is 0.98 (Figure 13b), for an improvement of about 113.0%.
As the number of forklifts increases, the refusals increase and the task success rate decreases. For the generated paths, the average reward decreases for both the RL and FSA-RL methods. The average reward of FSA-RL is improved by 63.1% compared to that of the RL method. Figure 13c shows that the average reward for the RL method drops to −2.60, while that of the proposed FSA-RL decreases only slightly to −0.96 in comparison to Figure 11c. Meanwhile, the average length also increases, as shown in Figure 13d.
Figure 14 shows examples of the generated paths. Table 6 shows the risk level of each door from time step 36 to 46. In Figure 14a, the agent visits workstations o_2 → o_1 → o_3 to perform the collection tasks. Then, it passes through door #2 and visits g_4 → g_2 → g_1 for the delivery tasks. In detail, the agent requests to open door #2 at time step t = 37 and enters the corridor at time step t = 38. The risk values of door #2 are r(B_2, 38) = 0.147 and r(B_2, 39) = 0.111. Thus, the agent encounters a greater risk of being refused. The refusal results in Figure 13a also verify this point, with the agent being refused by the door 54 times. Figure 14b shows the path generated by the proposed FSA-RL method. The visiting order for the collection tasks is the same as with the RL method, o_2 → o_1 → o_3. After finishing the collection tasks, the agent moves to the other workshop through door #1. The agent enters the corridor at t = 44, where the risk value is r(B_1, 44) = 0.006. Thus, the agent is only refused twice in 100 simulation runs.
The simulation results show that the proposed RL-based double-layer controller is efficient in finding an optimal path for a mobile robot working in human-shared environments. The number of refusals is greatly reduced compared to the classic RL method, and the quality of the generated path is improved from the viewpoint of both task success rate and path reward. The simulation results demonstrate that the proposed method outperforms traditional RL approaches. One advantage comes from our algorithm's ability to decompose the complex path planning problem into several smaller subproblems with reduced search spaces, making them easier to solve. In traditional RL approaches, the size of the search space is |S| × 2^{2I} due to the need to handle all tasks and states. In contrast, our hierarchical method significantly reduces the complexity; the high-level planner operates over a smaller abstract state space of size |Q|, while the low-level planner operates over |S|. For example, the size of the search space for the RL algorithm is |S| × 2^{2I} = 20 × 22 × 2^6 = 28,160. In our proposed algorithm, the size of the high-level search space is |Q| = 26, while the size of the low-level search space is |S| = 22 × 20 = 440. Each subproblem needs to be learned only once, eliminating the need for repeated learning. In contrast, traditional RL must continuously learn optimal policies across multiple tasks. This leads to unavoidable redundant learning, as the learning process is not terminated until all tasks are achieved. The decomposition of the complex path planning problem also enhances the flexibility of our algorithm. Another advantage of our algorithm is its risk foresight ability, thanks to the modeling of environmental uncertainties. As Table 5 and Table 6 show, the risk distribution Risk over a certain time horizon can be obtained using Monte Carlo simulations. Based on Risk, we design the reward function in Equation (17) for the low-level learner in order to handle environmental uncertainties.
In the proposed double-layer learning scheme, the high-level learner learns the global policy for multiple tasks, while the low-level learner performs local path planning while considering environmental risks. Thus, the proposed double-layer algorithm is more efficient at handling environmental uncertainties for a mobile robot designed to perform multiple tasks. The shortcomings of the proposed method are the same as those of other reinforcement learning algorithms, namely, that it cannot provide guarantees and that the search space increases with the number of tasks.

7. Conclusions

This paper addresses the path planning problem of a mobile robot performing multiple pickup and delivery tasks in human-shared environments. We propose a reinforcement learning-based double-layer control algorithm to learn an optimal path in automated factory settings while considering the stochastic movements of human-operated forklifts. The proposed algorithm decomposes the complex global path planning problem into several smaller subproblems that are easier to solve. The high-level layer focuses on global strategy optimization using scLTL and reinforcement learning, while the low-level layer handles local path planning under environmental uncertainties. Compared with conventional RL methods, our approach reduces the number of refusals by approximately 96.3%, improves the average reward by 63.1%, and increases the task success rate by 113.0%. In the future, we plan to: (1) investigate Lyapunov-based stability analysis to provide safety guarantees, (2) develop more robust path planning strategies under uncertainty, and (3) apply the proposed algorithm to real-world robotic systems.

Author Contributions

Conceptualization, J.M.; methodology, J.M.; validation, J.L. and Y.X.; formal analysis, Z.L.; investigation, J.W.; data curation, W.X.; writing—original draft preparation, J.L. and Y.X.; writing—review and editing, J.M. and T.J.; funding acquisition, J.M., J.W. and T.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Natural Science Foundation of China (62403411), the Natural Science Foundation of Jiangsu Province (BK20230600), the Natural Science Foundation of Yangzhou (YZ2023164), the Project of Yangzhou Science and Technology Bureau (YZ2023201), the Local Innovation Talent Project of Yangzhou (2022YXBS122), the National Natural Science Foundation of China (62303040), the Beijing Natural Science Foundation (L233029), and the Project of Cultivation for Young Top-Notch Talents of Beijing Municipal Institutions (BPHR202203232).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data generated by the proposed algorithm are shown in Section 6; no additional data are used.

Conflicts of Interest

Author Wei Xu was employed by the company Smart Lab of Innovation, Sinotrans Innovation & Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
RL	Reinforcement Learning
MDP	Markov Decision Process
scLTL	Syntactically Co-Safe Linear Temporal Logic
FSA	Finite-State Automaton
FSA-RL	Finite-State Automaton Reinforcement Learning

References

  1. Belda, K.; Jirsa, J. Control Principles of Autonomous Mobile Robots Used in Cyber-Physical Factories. In Proceedings of the 2021 23rd International Conference on Process Control (PC), Strbske Pleso, Slovakia, 1–4 June 2021; pp. 96–101. [Google Scholar] [CrossRef]
  2. Ramdani, N.; Panayides, A.; Karamousadakis, M.; Mellado, M.; Lopez, R.; Christophorou, C.; Rebiai, M.; Blouin, M.; Vellidou, E.; Koutsouris, D. A Safe, Efficient and Integrated Indoor Robotic Fleet for Logistic Applications in Healthcare and Commercial Spaces: The ENDORSE Concept. In Proceedings of the 2019 20th IEEE International Conference on Mobile Data Management (MDM), Hong Kong, China, 10–13 June 2019; pp. 425–430. [Google Scholar] [CrossRef]
  3. Özbaran, C.; Dilibal, S.; Sungur, G. Mechatronic System Design of A Smart Mobile Warehouse Robot for Automated Storage/Retrieval Systems. In Proceedings of the 2020 Innovations in Intelligent Systems and Applications Conference (ASYU), Istanbul, Turkey, 15–17 October 2020; pp. 1–6. [Google Scholar] [CrossRef]
  4. Latsou, C.; Farsi, M.; Erkoyuncu, J.A.; Morris, G. Digital twin integration in multi-agent cyber physical manufacturing systems. IFAC-PapersOnLine 2021, 54, 811–816. [Google Scholar] [CrossRef]
  5. Pan, L.; Hu, B.; Sun, Z.; Xu, L.; Gong, G.; Xie, X.; Sun, Z. A Review of the Evolutionary Algorithm Based VRP Problem. In Proceedings of the 2023 42nd Chinese Control Conference (CCC), Tianjin, China, 24–26 July 2023; pp. 1939–1944. [Google Scholar] [CrossRef]
  6. Zhang, C.; Sun, P. Heuristic Methods for Solving the Traveling Salesman Problem (TSP): A Comparative Study. In Proceedings of the 2023 IEEE 34th Annual International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), Toronto, ON, Canada, 5–8 September 2023; pp. 1–6. [Google Scholar] [CrossRef]
  7. Di, L.; Sun, D.; Qi, Y.; Xiao, Z. Research on Shortest Path Planning and Smoothing Without Obstacle Collision Based on Moving Carrier. Int. J. Aerosp. Eng. 2024, 2024, 5235125. [Google Scholar] [CrossRef]
  8. Voloch, N.; Zadok, Y.; Voloch-Bloch, N.; Hajaj, M.M. Using Combined Knapsack and Shortest Path Problems for Planning Optimal Navigation Paths for Robotic Deliveries. In Proceedings of the 2024 10th International Conference on Automation, Robotics and Applications (ICARA), Athens, Greece, 22–24 February 2024; pp. 139–143. [Google Scholar] [CrossRef]
  9. Ferguson, D.; Stentz, A. Using interpolation to improve path planning: The Field D* algorithm. J. Field Robot. 2006, 23, 79–101. [Google Scholar] [CrossRef]
  10. Warren, C. Fast path planning using modified A* method. In Proceedings of the IEEE International Conference on Robotics and Automation, Atlanta, GA, USA, 2–6 May 1993; Volume 2, pp. 662–667. [Google Scholar] [CrossRef]
  11. Calin Belta, B.Y.; Gol, E.A. Formal Methods for Discrete-Time Dynamical Systems; Springer: Berlin/Heidelberg, Germany, 2017. [Google Scholar] [CrossRef]
  12. Chen, Y.; Ding, X.C.; Stefanescu, A.; Belta, C. Formal approach to the deployment of distributed robotic teams. IEEE Trans. Robot. 2012, 28, 158–171. [Google Scholar] [CrossRef]
  13. Mi, J.; Zhang, X.; Long, Z.; Wang, J.; Xu, W.; Xu, Y.; Deng, S. A mobile robot safe planner for multiple tasks in human-shared environments. PLoS ONE 2025, 20, e0324534. [Google Scholar] [CrossRef] [PubMed]
  14. Huang, Y.; Wang, Y.; Zatarain, O. Dynamic Path Optimization for Robot Route Planning. In Proceedings of the 2019 IEEE 18th International Conference on Cognitive Informatics & Cognitive Computing (ICCI-CC), Milan, Italy, 23–25 July 2019; pp. 47–53. [Google Scholar] [CrossRef]
  15. Wang, B.; Liu, Z.; Li, Q.; Prorok, A. Mobile Robot Path Planning in Dynamic Environments Through Globally Guided Reinforcement Learning. IEEE Robot. Autom. Lett. 2020, 5, 6932–6939. [Google Scholar] [CrossRef]
  16. Jiaoyang, L.; Andrew, T.; Scott, K.; Joseph W., D.; Satish, T.K.K.; Sven, K. Lifelong Multi-Agent Path Finding in Large-Scale Warehouses. In Proceedings of the The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21), Auckland, New Zealand, 9–13 May 2021; pp. 11272–11281. [Google Scholar]
  17. Ullah, Z.; Xu, Z.; Zhang, L.; Zhang, L.; Ullah, W. RL and ANN Based Modular Path Planning Controller for Resource-Constrained Robots in the Indoor Complex Dynamic Environment. IEEE Access 2018, 6, 74557–74568. [Google Scholar] [CrossRef]
  18. Hazem, B. Study of Q-learning and deep Q-network learning control for a rotary inverted pendulum system. Discov. Appl. Sci. 2024, 6, 49. [Google Scholar] [CrossRef]
  19. Jiang, J.; Yang, L.; Zhang, L. DQN-based on-line Path Planning Method for Automatic Navigation of Miniature Robots. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 5407–5413. [Google Scholar] [CrossRef]
  20. Carvalho, J.P.; Aguiar, A.P. A Reinforcement Learning Based Online Coverage Path Planning Algorithm. In Proceedings of the 2023 IEEE International Conference on Autonomous Robot Systems and Competitions (ICARSC), Tomar, Portugal, 26–27 April 2023; pp. 81–86. [Google Scholar] [CrossRef]
  21. Chai, R.; Niu, H.; Carrasco, J.; Arvin, F.; Yin, H.; Lennox, B. Design and Experimental Validation of Deep Reinforcement Learning-Based Fast Trajectory Planning and Control for Mobile Robot in Unknown Environment. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 5778–5792. [Google Scholar] [CrossRef]
  22. Mi, J.; Kuze, N.; Ushio, T. A mobile robot controller using reinforcement learning under scLTL specifications with uncertainties. Asian J. Control 2022, 24, 2916–2930. [Google Scholar] [CrossRef]
  23. Guruji, A.K.; Agarwal, H.; Parsediya, D. Time-efficient A* Algorithm for Robot Path Planning. Procedia Technol. 2016, 23, 144–149. [Google Scholar] [CrossRef]
  24. Dakulović, M.; Petrović, I. Two-way D* algorithm for path planning and replanning. Robot. Auton. Syst. 2011, 59, 329–342. [Google Scholar] [CrossRef]
  25. Blekas, K.; Vlachos, K. RL-based path planning for an over-actuated floating vehicle under disturbances. Robot. Auton. Syst. 2018, 101, 93–102. [Google Scholar] [CrossRef]
  26. Dam, T.; Chalvatzaki, G.; Peters, J.; Pajarinen, J. Monte-Carlo robot path planning. IEEE Robot. Autom. Lett. 2022, 7, 11213–11220. [Google Scholar] [CrossRef]
  27. Li, W.; Liu, Y.; Ma, Y.; Xu, K.; Qiu, J.; Gan, Z. A self-learning Monte Carlo tree search algorithm for robot path planning. Front. Neurorobotics 2023, 17, 1039644. [Google Scholar] [CrossRef] [PubMed]
  28. Cheng, R.; Orosz, G.; Murray, R.M.; Burdick, J.W. End-to-End Safe Reinforcement Learning through Barrier Functions for Safety-Critical Continuous Control Tasks. Proc. AAAI Conf. Artif. Intell. 2019, 33, 3387–3395. [Google Scholar] [CrossRef]
  29. El-Shamouty, M.; Wu, X.; Yang, S.; Albus, M.; Huber, M.F. Towards Safe Human-Robot Collaboration Using Deep Reinforcement Learning. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 4899–4905. [Google Scholar] [CrossRef]
  30. Liu, Q.; Liu, Z.; Xiong, B.; Xu, W.; Liu, Y. Deep reinforcement learning-based safe interaction for industrial human-robot collaboration using intrinsic reward function. Adv. Eng. Inform. 2021, 49, 101360. [Google Scholar] [CrossRef]
  31. Shao, Y.S.; Chen, C.; Kousik, S.; Vasudevan, R. Reachability-Based Trajectory Safeguard (RTS): A Safe and Fast Reinforcement Learning Safety Layer for Continuous Control. IEEE Robot. Autom. Lett. 2021, 6, 3663–3670. [Google Scholar] [CrossRef]
  32. Ghandour, M.; Liu, H.; Stoll, N.; Thurow, K. A hybrid collision avoidance system for indoor mobile robots based on human-robot interaction. In Proceedings of the 2016 17th International Conference on Mechatronics–Mechatronika (ME), Prague, Czech Republic, 7–9 December 2016; pp. 1–7. [Google Scholar]
  33. Zeng, L.; Bone, G.M. Mobile Robot Collision Avoidance in Human Environments. Int. J. Adv. Robot. Syst. 2013, 10, 41. [Google Scholar] [CrossRef]
  34. Yu, Q.; Zhou, J. A Review of Global and Local Path Planning Algorithms for Mobile Robots. In Proceedings of the 2024 8th International Conference on Robotics, Control and Automation (ICRCA), Shanghai, China, 12–14 January 2024; pp. 84–90. [Google Scholar] [CrossRef]
  35. De Araujo, P.R.M.; Mounier, E.; Dawson, E.; Noureldin, A. Smart Mobility: Leveraging Perception Sensors for Map-Based Navigation in Autonomous Vehicles. In Proceedings of the 2024 IEEE International Conference on Smart Mobility (SM), Niagara Falls, ON, Canada, 16–18 September 2024; pp. 281–286. [Google Scholar] [CrossRef]
  36. Chan, C.C.; Tsai, C.C. Collision-Free Speed Alteration Strategy for Human Safety in Human-Robot Coexistence Environments. IEEE Access 2020, 8, 80120–80133. [Google Scholar] [CrossRef]
  37. Ziebart, B.D.; Ratliff, N.; Gallagher, G.; Mertz, C.; Peterson, K.; Bagnell, J.A.; Hebert, M.; Dey, A.K.; Srinivasa, S. Planning-based prediction for pedestrians. In Proceedings of the 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, St. Louis, MO, USA, 10–15 October 2009; pp. 3931–3936. [Google Scholar] [CrossRef]
  38. Fisac, J.F.; Bajcsy, A.; Herbert, S.L.; Fridovich-Keil, D.; Wang, S.; Tomlin, C.J.; Dragan, A.D. Probabilistically safe robot planning with confidence-based human predictions. arXiv 2018, arXiv:1806.00109. [Google Scholar] [CrossRef]
  39. Chen, M.; Shih, J.C.; Tomlin, C.J. Multi-vehicle collision avoidance via hamilton-jacobi reachability and mixed integer programming. In Proceedings of the 2016 IEEE 55th Conference on Decision and Control (CDC), Las Vegas, NV, USA, 12–14 December 2016; pp. 1695–1700. [Google Scholar] [CrossRef]
  40. Bansal, S.; Chen, M.; Fisac, J.F.; Tomlin, C.J. Safe Sequential Path Planning of Multi-Vehicle Systems Under Disturbances and Imperfect Information. arXiv 2016, arXiv:1603.05208. [Google Scholar] [CrossRef]
  41. Bajcsy, A.; Herbert, S.L.; Fridovich-Keil, D.; Fisac, J.F.; Deglurkar, S.; Dragan, A.D.; Tomlin, C.J. A Scalable Framework for Real-Time Multi-Robot, Multi-Human Collision Avoidance. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 936–943. [Google Scholar] [CrossRef]
  42. Grenouilleau, F.; Van Hoeve, W.J.; Hooker, J.N. A multi-label A* algorithm for multi-agent pathfinding. In Proceedings of the International Conference on Automated Planning and Scheduling, Berkeley, CA, USA, 11–15 July 2019; Volume 29, pp. 181–185. [Google Scholar] [CrossRef]
  43. Lifelong Multi-Agent Path Finding in A Dynamic Environment. In Proceedings of the 2018 15th International Conference on Control, Automation, Robotics and Vision (ICARCV), Singapore, 18–21 November 2018; pp. 875–882. [CrossRef]
  44. Sadigh, D.; Kim, E.S.; Coogan, S.; Sastry, S.S.; Seshia, S.A. A learning based approach to control synthesis of Markov decision processes for linear temporal logic specifications. In Proceedings of the 53rd IEEE Conference on Decision and Control, Los Angeles, CA, USA, 15–17 December 2014; pp. 1091–1096. [Google Scholar] [CrossRef]
  45. Cho, K.; Suh, J.; Tomlin, C.J.; Oh, S. Cost-aware path planning under co-safe temporal logic specifications. IEEE Robot. Autom. Lett. 2017, 2, 2308–2315. [Google Scholar] [CrossRef]
  46. Hiromoto, M.; Ushio, T. Learning an Optimal Control Policy for a Markov Decision Process Under Linear Temporal Logic Specifications. In Proceedings of the 2015 IEEE Symposium Series on Computational Intelligence, Cape Town, South Africa, 7–10 December 2015; pp. 548–555. [Google Scholar] [CrossRef]
  47. Dong, X.; Wan, G.; Zeng, P.; Song, C.; Cui, S. Optimizing Robotic Task Sequencing and Trajectory Planning on the Basis of Deep Reinforcement Learning. Biomimetics 2024, 9, 10. [Google Scholar] [CrossRef] [PubMed]
  48. Dalal, M.; Chiruvolu, T.; Chaplot, D.; Salakhutdinov, R. Plan-Seq-Learn: Language Model Guided RL for Solving Long Horizon Robotics Tasks. arXiv 2024, arXiv:2405.01534. [Google Scholar]
  49. Teja Singamaneni, P.; Favier, A.; Alami, R. Human-Aware Navigation Planner for Diverse Human-Robot Interaction Contexts. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 5817–5824. [Google Scholar] [CrossRef]
  50. Kupferman, O.; Vardi, M.Y. Model Checking of Safety Properties. Form. Methods Syst. Des. 2001, 19, 291–314. [Google Scholar] [CrossRef]
  51. Latvala, T. Efficient Model Checking of Safety Properties. In Proceedings of the 10th International SPIN Workshop on Model Checking of Software, Portland, OR, USA, 9–10 May 2003; pp. 74–88. [Google Scholar] [CrossRef]
Figure 1. Architecture of the proposed planner. Red circles are the workstations where the agent has to pick up workpieces. The purple triangles are the destinations of the workpieces. #1, #2, … are the doors through which the agent must pass to deliver the collected workpieces.
Figure 2. Motion model and state transition of a forklift.
Figure 3. Communication between the agent and the door control system.
Figure 4. An example of state transition probability: (left) s_h = q, a_h = move; (right) a_h = stay.
Figure 5. State transitions.
Figure 7. Parameter investigation, showing the configuration with one task and two forklifts: (a) learning rate α, (b) discount factor γ, and (c) threshold ϵ in the ϵ-greedy policy.
Figure 8. Environment Setting. (a) The colored circles show I = 3 workstations, o_1, o_2, o_3. There are six doors, denoted #1, …, #6. The colored triangles show the destinations, g_1, g_2, g_3. There are K = 2 forklifts in the environment; the red arrows show their directions. The gray squares are obstacles, while the pink squares show the barrier area of each door. (b) The number of forklifts is increased to K = 4 and the robot's start position is reset.
Figure 9. State transitions of A_φ converted from the scLTL formula φ with I = 3. Note that the state transition q ∈ δ(q, null_i) exists for all q ∈ Q (not drawn in the figure).
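The automaton structure sketched in Figure 9 can be encoded compactly in code. The following minimal Python sketch illustrates an FSA with null_i self-loops of the kind described in the caption; the state names, event labels, and transition table are hypothetical placeholders for illustration and are not the automaton A_φ used in the paper.

```python
# Minimal sketch of a finite-state automaton (FSA) with null_i self-loops,
# in the spirit of Figure 9. States, events, and transitions are placeholders.

class FSA:
    def __init__(self, states, init_state, accepting, delta):
        self.states = states
        self.state = init_state
        self.accepting = accepting
        self.delta = delta  # dict: (state, event) -> next state

    def step(self, event):
        # null_i events leave the automaton state unchanged, mirroring the
        # self-transitions q ∈ δ(q, null_i) noted in the Figure 9 caption.
        if event.startswith("null"):
            return self.state
        self.state = self.delta.get((self.state, event), self.state)
        return self.state

    def is_accepting(self):
        return self.state in self.accepting


# Hypothetical simplified example with I = 3 tasks completed in sequence.
states = {"q0", "q1", "q2", "q_acc"}
delta = {
    ("q0", "task1_done"): "q1",
    ("q1", "task2_done"): "q2",
    ("q2", "task3_done"): "q_acc",
}
automaton = FSA(states, "q0", {"q_acc"}, delta)
for e in ["null1", "task1_done", "task2_done", "null2", "task3_done"]:
    automaton.step(e)
print(automaton.is_accepting())  # True
```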
Figure 10. Convergence curves of the learning process.
Figure 11. Simulation results of scenario 1: (a) refusals, (b) average task success rate, (c) average reward of generated paths, and (d) average length of generated paths.
Figure 12. Examples of generated paths in scenario 1: (a) classic RL method, path length 77; (b) proposed FSA-RL method, path length 79.
Figure 13. Simulation results of scenario 2 (with the number of forklifts increased to K = 4): (a) refusals, (b) average task success rate, (c) average reward of generated paths, and (d) average length of generated paths.
Figure 14. Examples of generated paths in scenario 2: (a) classic RL method, path length 84; (b) proposed FSA-RL method, path length 78.
Table 1. Comparison of different path planning methods.
Methods compared: A*, D*, RRT*, RL, FSA-MDP, MCTS.
Comparison criteria: environment (static/dynamic), risk foresight, and real-time solution.
Table 2. Comparison of different methods for task sequencing.
Comparison criteria: handling of environmental uncertainty, human interaction, and task sequencing.
Methods compared: Product MDP [46]; Feb-MDP [22] (real-time way); DRL [47]; Plan-Seq-Learn [48]; G2RL [15] (real-time way); Co-HAN [49] (real-time way); Ours (at the planning level).
Table 3. Reward design in the high- and low-level learners.
Types of Reward | Value | Learner
E | 1.0 | high-level
r_E | 0.0 | high-level
step_cost | 0.1 | high-/low-level
goal_reward | 1.0 | low-level
Penalty | 1.0 | low-level
c_cost | 1.0 | low-level
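The reward terms in Table 3 can be collected into simple lookups, as in the hedged Python sketch below. The event names, the function signature, and the sign convention (treating step_cost, Penalty, and c_cost as quantities subtracted from the return) are illustrative assumptions, not the authors' exact reward function.

```python
# Sketch of a reward lookup based on Table 3. Names and the sign convention
# for the cost terms are assumptions made for illustration.

HIGH_LEVEL_REWARDS = {
    "accepting_transition": 1.0,   # "E" in Table 3
    "non_accepting": 0.0,          # "r_E" in Table 3
    "step_cost": 0.1,
}

LOW_LEVEL_REWARDS = {
    "goal_reward": 1.0,
    "penalty": 1.0,    # assumed: e.g., entering a barrier area
    "c_cost": 1.0,     # assumed: e.g., collision-related cost
    "step_cost": 0.1,
}

def low_level_reward(reached_goal: bool, violated: bool, collided: bool) -> float:
    """Hypothetical per-step reward for the low-level learner."""
    r = -LOW_LEVEL_REWARDS["step_cost"]
    if reached_goal:
        r += LOW_LEVEL_REWARDS["goal_reward"]
    if violated:
        r -= LOW_LEVEL_REWARDS["penalty"]
    if collided:
        r -= LOW_LEVEL_REWARDS["c_cost"]
    return r

print(low_level_reward(reached_goal=True, violated=False, collided=False))  # 0.9
```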
Table 4. Parameter settings.
Parameter | Value
budget | 100 (Section 4)
p_f | 1.0 (randomly move one step or two steps) (Section 5.1)
p_w | 0.0
simulation times | 2000
α | 0.5 (high-/low-level)
γ | 0.9
ϵ | 0.2
computation_budget | 1000
θ | 0.1 (low-level)
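For context, the learning parameters in Table 4 (α = 0.5, γ = 0.9, ϵ = 0.2) enter the standard ϵ-greedy tabular Q-learning update whose sensitivity is examined in Figure 7. The sketch below shows that update in generic form; the state and action representations are placeholders and do not reproduce the paper's double-layer controller.

```python
import random
from collections import defaultdict

# Hedged sketch of epsilon-greedy tabular Q-learning using the parameter
# values listed in Table 4 (alpha = 0.5, gamma = 0.9, epsilon = 0.2).
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2
ACTIONS = ["up", "down", "left", "right", "stay"]  # hypothetical action set

Q = defaultdict(float)  # Q[(state, action)] -> estimated value

def select_action(state):
    """Epsilon-greedy action selection: explore with probability EPSILON."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    """One-step Q-learning backup with learning rate ALPHA and discount GAMMA."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

# Example usage with placeholder states:
a = select_action("s0")
update("s0", a, -0.1, "s1")
```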
Table 5. Scenario 1: Risk level of each door when the agent passes through.
Time-Step t | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46
door #1 | 0.010 | 0.150 | 0.480 | 0.650 | 0.460 | 0.190 | 0.050 | 0.010 | 0.000 | 0.000
door #2 | 0.500 | 0.150 | 0.110 | 0.050 | 0.010 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
door #3 | 0.500 | 0.150 | 0.060 | 0.020 | 0.010 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
Table 6. Scenario 2: Risk level of each door when the agent passes through.
Time-Step t | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46
door #1 | 0.336 | 0.714 | 0.730 | 0.727 | 0.739 | 0.500 | 0.197 | 0.048 | 0.006 | 0.000 | 0.000
door #2 | 0.807 | 0.500 | 0.147 | 0.111 | 0.020 | 0.010 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
door #3 | 1.178 | 0.728 | 0.208 | 0.024 | 0.020 | 0.010 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
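Tables 5 and 6 can be read as per-door, per-time-step risk lookups. The sketch below stores the Scenario 1 values from Table 5 and queries the lowest-risk door at a given time-step; the data structure and helper name are illustrative and not part of the paper's implementation.

```python
# Scenario 1 door risk levels from Table 5, indexed by time-step t = 37..46.
SCENARIO1_RISK = {
    "#1": [0.010, 0.150, 0.480, 0.650, 0.460, 0.190, 0.050, 0.010, 0.000, 0.000],
    "#2": [0.500, 0.150, 0.110, 0.050, 0.010, 0.000, 0.000, 0.000, 0.000, 0.000],
    "#3": [0.500, 0.150, 0.060, 0.020, 0.010, 0.000, 0.000, 0.000, 0.000, 0.000],
}
T0 = 37  # first time-step covered by the table

def lowest_risk_door(t: int) -> str:
    """Return the door with the lowest tabulated risk at time-step t (hypothetical helper)."""
    idx = t - T0
    return min(SCENARIO1_RISK, key=lambda door: SCENARIO1_RISK[door][idx])

print(lowest_risk_door(39))  # "#3" (risk 0.060 vs. 0.110 and 0.480 at t = 39)
```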
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
