1. Introduction
Currently, the convergence of artificial intelligence (AI), the Internet of Things (IoT), and automation technologies is facilitating intelligent transformation across various industries [1,2,3]. Robotic technology represents a key driver of this evolution [4]. Advancements in intelligent algorithms and hardware enable robots to undertake increasingly complex tasks. As a critical subset of robots, mobile robots are being deployed extensively in industry [5,6], logistics [7], and agriculture [8].
Despite their diverse designs and functionalities, mobile robots commonly face challenges in safe obstacle avoidance and efficient navigation during operation [9]. In dynamic and complex environments, they must accurately identify obstacles, plan safe trajectories, and make real-time adjustments [10]. Ensuring task completion necessitates advanced sensor systems for real-time environmental perception [11]. Moreover, using intelligent algorithms to process information, execute dynamic path planning, and adapt to uncertainties is critical, and enhancing these capabilities is key to ensuring efficient and safe operation in complex environments [12,13,14].
Beyond traditional obstacle avoidance, navigating shared environments requires socially aware navigation capabilities [15]. Robots must not only avoid collisions but also adhere to social norms, exhibit predictable behavior, and respect personal space to achieve smooth and comfortable human–robot coexistence [16,17]. This involves understanding and predicting the motivations and intentions behind human behavior and has become a research focus in the field of human–robot interaction (HRI) [18,19]. However, although the ultimate goal of social navigation is natural and comfortable human–robot coexistence, collision risk mitigation remains the most basic and critical safety prerequisite: any social navigation strategy must be built on collision avoidance before comfort and efficiency can be further optimized [20,21].
While techniques have previously been developed for obstacle avoidance and navigation by mobile robots, significant limitations persist [22,23]. Traditional control methods such as the artificial potential field (APF) method [24] and the dynamic window approach (DWA) [25], often reliant on predefined rules and fixed path planning, perform adequately in simple tasks but lack adaptability and flexibility in complex, dynamic environments [26,27,28]. Their limited capacity for autonomous adjustment frequently results in decision errors or inefficiency when handling intricate scenarios [29]. Furthermore, the safety assurance mechanisms of conventional obstacle avoidance strategies remain inadequate, leading to elevated collision risks during operation. These constraints hinder their widespread deployment in complex and variable settings [30,31].
Advances in AI have facilitated the application of reinforcement learning (RL) in mobile robotics [32]. As a machine learning approach, RL demonstrates significant potential for use in autonomous robot decision-making [33]. It addresses key challenges including task execution policy optimization, environmental adaptability enhancement, and autonomous decision-making for complex tasks [34,35]. By learning optimal policies through environmental interaction, RL enables effective decision-making in dynamic and uncertain environments [36].
To enhance the time efficiency of mobile robot navigation in crowded environments, Zhou et al. [37] proposed a social graph-based double dueling deep Q-Network (DQN). This approach employs a social attention mechanism to extract effective graph representations, optimizes the state–action value approximator, and leverages simulated experiences generated by a learned environmental model for further refinement, demonstrating significantly improved success rates in crowded navigation tasks. Li et al. [38] introduced a fused deep deterministic policy gradient (DDPG) method, integrating a multi-branch deep learning network with a time-critical reward function, which effectively enhanced the convergence speed and navigation performance in complex environments.
On the other hand, the control barrier function (CBF) has garnered attention with regard to obstacle avoidance control for mobile robots [39,40,41]. Originating from control theory, the CBF provides robust control guarantees for safety-critical systems operating under constraints [42]. For mobile robot obstacle avoidance, the CBF enforces strict safety conditions during motion by constructing safety constraints [43,44].
To further enhance the obstacle avoidance performance of the CBF, researchers have proposed various improvements and hybrid approaches. Singletary et al. [45] conducted a comparative analysis of the performance of the CBF and artificial potential fields (APFs) in robotic obstacle avoidance, demonstrating that the CBF generates smoother trajectories, effectively mitigates oscillations, and offers enhanced safety guarantees. Jian et al. [46] introduced a dynamic control barrier function, integrating it with model predictive control (MPC) to ensure collision-free trajectories for robots operating in uncertain and dynamic environments.
This paper proposes a novel safe reinforcement learning framework integrating the control barrier function (CBF) and proximal policy optimization (PPO) to reduce collisions in navigation for a mecanum wheel robot. Implemented within the ROS Gazebo simulation environment, the framework leverages CBF-based safety constraints formulated as a quadratic programming problem. This dynamically adjusts potentially unsafe actions generated by the PPO policy, guaranteeing collision mitigation while preserving motion efficiency. Comparative experiments against baseline algorithms (PPO, DQN, and DDPG) are conducted to rigorously evaluate the method’s effectiveness across key safety and navigation performance metrics.
The remainder of this paper is organized as follows: Section 2 introduces the robot model and basic methods; Section 3 presents the main research content and methods; Section 4 describes the simulation results and provides a discussion; and Section 5 concludes this paper.
2. Preliminary Knowledge
This section establishes the theoretical groundwork for the mobile robot platform, detailing its kinematic model and core implementation approaches. To support the development of subsequent control strategies and navigation algorithms, the unique omnidirectional motion capabilities inherent to the mecanum wheel robot are mathematically characterized. This section then presents fundamental obstacle avoidance techniques utilizing relevant control methodologies.
2.1. Kinematic Model of Robot
The mecanum wheel robot can achieve omnidirectional movement due to its unique wheel hub structure. The passive rollers installed at a 45° angle around each wheel allow the robot to translate or rotate in any direction without changing its orientation. Its kinematic model defines the relationship between the wheel velocities and the overall motion of the chassis. This representation is illustrated in
Figure 1.
2.1.1. Forward and Inverse Kinematics Models
The parameters related to the robot chassis are defined in Figure 1. In the model, $a$ represents half of the body length (the distance from the center of the front wheel to the center of the rear wheel), and $b$ represents half of the body width (the distance between the left and right wheel centers). The robot's linear velocity is $v$, and $v_x$ and $v_y$ are its components in the $x$ and $y$ directions, respectively. In addition, $v_i$ is the linear velocity of each wheel, $\omega_i$ is the angular velocity of the corresponding wheel, and $\omega$ is the angular velocity of the chassis. In summary, the forward kinematics model of the robot maps the four wheel velocities to the chassis velocity $(v_x, v_y, \omega)$ through a matrix $J$ that depends on the characteristic length $(a+b)$ of the robot chassis; $J$ establishes the mapping relationship between the wheel velocities and the chassis motion in the body-fixed frame. The corresponding inverse kinematics of the mecanum wheel robot chassis are obtained by inverting this mapping.
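For reference, a commonly used form of these relations is sketched below, with $r$ denoting the wheel radius and the wheels numbered front-left (1), front-right (2), rear-left (3), rear-right (4); the exact sign pattern depends on the roller orientation and the wheel numbering adopted in Figure 1, so this should be read as an illustrative reconstruction rather than the paper's exact equations.

```latex
% Forward kinematics (wheel angular velocities to chassis velocity), illustrative form:
\begin{aligned}
v_x    &= \tfrac{r}{4}\bigl(\omega_1 + \omega_2 + \omega_3 + \omega_4\bigr), \\
v_y    &= \tfrac{r}{4}\bigl(-\omega_1 + \omega_2 + \omega_3 - \omega_4\bigr), \\
\omega &= \tfrac{r}{4(a+b)}\bigl(-\omega_1 + \omega_2 - \omega_3 + \omega_4\bigr).
\end{aligned}
% Inverse kinematics (chassis velocity to wheel angular velocities):
\begin{aligned}
\omega_1 &= \tfrac{1}{r}\bigl(v_x - v_y - (a+b)\,\omega\bigr), &
\omega_2 &= \tfrac{1}{r}\bigl(v_x + v_y + (a+b)\,\omega\bigr), \\
\omega_3 &= \tfrac{1}{r}\bigl(v_x + v_y - (a+b)\,\omega\bigr), &
\omega_4 &= \tfrac{1}{r}\bigl(v_x - v_y + (a+b)\,\omega\bigr).
\end{aligned}
```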
Therefore, the velocity of the robot chassis and the velocity of the four wheels can be converted into each other using the forward and inverse kinematics, which facilitates control of the robot’s trajectory. Based on this, a schematic diagram of the robot’s wheel odometry model is shown in
Figure 2.
2.1.2. Ideal Model of Robot’s Odometry
Under ideal conditions, the relationship between the robot's current pose, $\mathbf{p}_t = (x_t, y_t, \theta_t)$, and the previous pose, $\mathbf{p}_{t-1}$, is expressed by a pose offset $\Delta\mathbf{p}$. The next pose, $\mathbf{p}_{t+1}$, is the current pose, $\mathbf{p}_t$, plus the offset $\Delta\mathbf{p}$, which contains the translational and rotational increments accumulated over one update step. Assuming that the time interval $\Delta t$ between two updates is very small, the next pose can be expressed in terms of the linear and angular velocity offsets applied to the current pose.
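A standard discrete-time expansion of this update, written with the body-frame velocities $v_x$, $v_y$, and $\omega$ defined above, is given below; this is a reconstruction of the usual holonomic odometry update, and the paper's exact frame convention may differ.

```latex
\begin{aligned}
x_{t+1}      &= x_t + \bigl(v_x\cos\theta_t - v_y\sin\theta_t\bigr)\,\Delta t, \\
y_{t+1}      &= y_t + \bigl(v_x\sin\theta_t + v_y\cos\theta_t\bigr)\,\Delta t, \\
\theta_{t+1} &= \theta_t + \omega\,\Delta t.
\end{aligned}
```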
Therefore, the derived forward and inverse kinematic models, combined with the discrete-time odometry update equations, provide a complete mathematical framework to predict the robot’s chassis motion based on the wheel velocities, which forms the basis for controlling the robot’s trajectory.
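As an illustration, the following Python sketch implements these conversions and one odometry step under the assumptions above (wheel radius `r`, half-length `a`, half-width `b`, and the wheel numbering of the kinematics sketch in Section 2.1.1); it is a minimal example, not the implementation used in the experiments.

```python
import numpy as np

def forward_kinematics(wheel_omega, r, a, b):
    """Map four mecanum wheel angular velocities to the chassis velocity (vx, vy, omega)."""
    w1, w2, w3, w4 = wheel_omega
    vx = r / 4.0 * ( w1 + w2 + w3 + w4)
    vy = r / 4.0 * (-w1 + w2 + w3 - w4)
    wz = r / (4.0 * (a + b)) * (-w1 + w2 - w3 + w4)
    return np.array([vx, vy, wz])

def inverse_kinematics(vx, vy, wz, r, a, b):
    """Map a desired chassis velocity to the four wheel angular velocities."""
    k = a + b
    return np.array([
        (vx - vy - k * wz) / r,   # front-left
        (vx + vy + k * wz) / r,   # front-right
        (vx + vy - k * wz) / r,   # rear-left
        (vx - vy + k * wz) / r,   # rear-right
    ])

def odometry_step(pose, vx, vy, wz, dt):
    """One discrete-time odometry update: body-frame velocities integrated in the world frame."""
    x, y, th = pose
    x  += (vx * np.cos(th) - vy * np.sin(th)) * dt
    y  += (vx * np.sin(th) + vy * np.cos(th)) * dt
    th += wz * dt
    return np.array([x, y, th])
```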
2.2. Control Barrier Function
In the process of robot movement and navigation, obstacle avoidance and velocity limiting are indispensable to prevent uncontrollable behavior. This section mainly introduces the control barrier function (CBF), which is often used in the control field as a controller and can also be used to constrain a robot’s actions.
2.2.1. Definition of Safety Set
For a dynamic system, the safety set $\mathcal{C}$ is defined as the set of safe states of the system. A diagram of the safety set is shown in Figure 3, and $\mathcal{C}$ is defined by a continuously differentiable function $h(x)$ as
$$\mathcal{C} = \{\, x : h(x) \geq 0 \,\},$$
where the safe set $\mathcal{C}$ contains the states $x$ with $h(x) \geq 0$. When the system state $x$ is safe, it is inside the safety set, $h(x) > 0$, or exactly on its boundary, $h(x) = 0$. On the contrary, $h(x) < 0$ means that state $x$ is outside the safe set.
2.2.2. Definition of Control Barrier Function
Consider a nonlinear control affine system, which can be expressed as
$$\dot{x} = f(x) + g(x)\,u,$$
where $x \in \mathbb{R}^n$ is the state, $u \in \mathbb{R}^m$ is the control input, and $f$ and $g$ are Lipschitz continuous. The drift term $f$ describes the behavior of the system when there is no control input, and $g$ represents how the control input $u$ affects the evolution of the system.

For the safe set $\mathcal{C}$ defined by $h(x)$, if for every state $x$ in $\mathcal{C}$ there exists a control input $u$ such that
$$\dot{h}(x, u) = \nabla h(x)\bigl(f(x) + g(x)\,u\bigr) \geq -\alpha\bigl(h(x)\bigr),$$
then $h(x)$ is considered to be a CBF, where $\alpha(\cdot)$ is an extended class-$\mathcal{K}$ function.
2.3. Basic Theory of Reinforcement Learning
Reinforcement learning (RL) is a machine learning approach in which an agent learns to optimize its actions through trial-and-error interactions with a dynamic environment, with the aim of discovering strategies that maximize long-term returns. Its core feature is this trial-and-error mechanism, which relies on reward signals to guide behavior optimization.
RL usually models a problem as a Markov decision process (MDP). An MDP can be represented by a five-tuple, $(S, A, P, R, \gamma)$.
S: The state space represents the set of all possible states of the agent in the environment. A state represents a specific situation in the environment at a certain moment and contains all the information needed by the agent.
A: The action space contains the actions that the agent can take during an interaction with the environment; $A(s)$ denotes the set of possible actions in state $s$.
P: The state transition function represents the probability of transitioning to state $s'$ if an action $a$ is executed in state $s$, that is, $P(s' \mid s, a)$.
R: The reward function can be written as $R(s, a)$ and refers to the immediate reward feedback obtained by the agent when it performs action $a$ in state $s$.
$\gamma$: The discount factor is used to calculate the future cumulative reward, $G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}$. A larger discount factor is suitable for tasks that focus on achieving long-term goals, while a smaller discount factor is more suitable for tasks that emphasize immediate feedback.
A robot can be regarded as an intelligent agent in reinforcement learning, and training it through interaction with the environment follows the RL paradigm. The corresponding environmental interaction is shown in Figure 4, which depicts the iterative learning process through which an agent learns to interact with its environment within an Actor–Critic framework. The Actor generates an action and sends it to the robot agent. The robot explores the environment and obtains the corresponding rewards and states, which are then passed to the Critic network for value calculation. The Critic feeds the computed value estimates back to the Actor, which updates its parameters and outputs a better action strategy.
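For illustration, a minimal Actor–Critic pair of the kind shown in Figure 4 can be sketched as follows; the hidden sizes, the Gaussian policy over the robot's velocity command, and the use of PyTorch are assumptions made for the example, not details taken from the paper.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Gaussian policy: maps the observed state to a distribution over the velocity command."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, hidden), nn.Tanh())
        self.mu = nn.Linear(hidden, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        mean = self.mu(self.body(state))
        return torch.distributions.Normal(mean, self.log_std.exp())

class Critic(nn.Module):
    """State-value network used to evaluate the Actor's behavior."""
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.v = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                               nn.Linear(hidden, hidden), nn.Tanh(),
                               nn.Linear(hidden, 1))

    def forward(self, state):
        return self.v(state).squeeze(-1)
```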
3. Main Research Content and Methods
This section mainly describes the motion control method based on proximal policy optimization (PPO) and the design of a real-time constraint mechanism using the CBF. On this basis, a novel fusion framework is constructed to implement a safe reinforcement learning strategy, ensuring that the robot maintains safety while retaining its exploration advantages and further improving the stability of reinforcement learning.
3.1. Proximal Policy Optimization
The PPO objective function and Actor–Critic architecture are defined as key mechanisms for generating efficient exploratory navigation strategies for omnidirectional mobile robotic platforms. Their core function is to learn adaptive obstacle avoidance strategies for use in complex or dynamic scenarios and output robot velocity commands that balance navigation task completion and safety.
3.1.1. PPO Principle Description
As an Actor–Critic derivative, PPO enforces policy update constraints via a trust-region-inspired mechanism to ensure monotonic improvement. By introducing an objective function clipping mechanism, it achieves efficient learning while ensuring training stability. Its core components are the policy (Actor) network $\pi_\theta(a_t \mid s_t)$ and the value (Critic) network $V_\phi(s_t)$, where $\theta$ and $\phi$ are the parameters of the policy network and the value network, respectively, the current state is $s_t$, and $a_t$ is the generated action.
First, we need to compute the generalized advantage estimation (GAE): the TD error is calculated from the value network $V_\phi$, and the advantage $\hat{A}_t$ is then calculated from the TD errors $\delta_t$. The specific equations are
$$\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t), \qquad \hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma\lambda)^{l}\,\delta_{t+l},$$
with $\gamma$ being the discount factor and $\lambda$ the GAE parameter.
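A compact sketch of this computation is given below, assuming `rewards`, `values`, and `dones` are arrays collected from one trajectory (with `values` holding one extra bootstrap entry); it is an illustration rather than the paper's code.

```python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over a finite trajectory.

    `values` must contain one extra entry for the state after the last step.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]  # TD error
        gae = delta + gamma * lam * nonterminal * gae                         # recursive GAE
        advantages[t] = gae
    returns = advantages + values[:-1]  # targets for the value network
    return advantages, returns
```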
The importance sampling ratio in the policy network update is
$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}.$$
Next, we need to calculate the clipped objective function, which is also one of the core components of PPO. It can be expressed as
$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\bigl(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\bigl(r_t(\theta),\,1-\varepsilon,\,1+\varepsilon\bigr)\,\hat{A}_t\bigr)\right],$$
where $\varepsilon$ is the clipping threshold, which limits the importance sampling ratio $r_t(\theta)$ to within the interval $[1-\varepsilon,\ 1+\varepsilon]$.
For an action with a positive advantage ($\hat{A}_t > 0$), the objective function is clipped to prevent the policy from increasing the probability of the action too aggressively ($r_t(\theta) \leq 1+\varepsilon$). Conversely, for an action with a negative advantage ($\hat{A}_t < 0$), the objective function is clipped to prevent the probability from decreasing too much ($r_t(\theta) \geq 1-\varepsilon$).
In the Critic network update, the value loss function that needs to be calculated is
$$L^{V}(\phi) = \mathbb{E}_t\!\left[\bigl(V_\phi(s_t) - \hat{R}_t\bigr)^{2}\right],$$
where $\hat{R}_t$ is the cumulative return. At the same time, PPO introduces an entropy regularization term to enhance the exploration of strategies:
$$L^{H}(\theta) = \mathbb{E}_t\!\left[\mathcal{H}\bigl(\pi_\theta(\cdot \mid s_t)\bigr)\right],$$
where $\mathcal{H}$ is the entropy of the strategy.
Then the total loss function can be calculated as
$$L(\theta, \phi) = -L^{\mathrm{CLIP}}(\theta) + c_1\,L^{V}(\phi) - c_2\,L^{H}(\theta),$$
where $c_1$ and $c_2$ are weight parameters.
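Putting these pieces together, one possible PPO update step is sketched below, reusing the `Actor`/`Critic` sketch from Section 2.3; the coefficient names `clip_eps`, `c1`, and `c2` and their default values are assumptions made for the example.

```python
import torch

def ppo_loss(actor, critic, states, actions, old_log_probs, advantages, returns,
             clip_eps=0.2, c1=0.5, c2=0.01):
    """Clipped surrogate + value loss + entropy bonus, combined into one scalar loss."""
    dist = actor(states)
    log_probs = dist.log_prob(actions).sum(-1)
    ratio = torch.exp(log_probs - old_log_probs)            # importance sampling ratio r_t

    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()           # -L^CLIP

    value_loss = (critic(states) - returns).pow(2).mean()   # L^V
    entropy = dist.entropy().sum(-1).mean()                  # L^H

    return policy_loss + c1 * value_loss - c2 * entropy
```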
3.1.2. PPO Algorithm Flow
The PPO algorithm is an efficient policy gradient method. By introducing an objective function clipping mechanism, it can achieve efficient learning while ensuring training stability. The core idea of PPO is to limit the step size of policy updates to avoid performance crashes during training. The algorithm flow is shown in Algorithm 1.
Algorithm 1 Proximal policy optimization strategy.
Input: Initial policy parameters $\theta$, value parameters $\phi$, clipping threshold $\varepsilon$, discount factor $\gamma$, GAE parameter $\lambda$, loss weights $c_1$, $c_2$, number of iterations K, number of epochs M
Output: Optimized $\theta$, $\phi$
1: Initialize $\theta$, $\phi$
2: for iteration $= 1$ to K do
3:   Collect a set of trajectories $\mathcal{D}$ via the current policy $\pi_{\theta_{\mathrm{old}}}$
4:   for each time step $t$ in $\mathcal{D}$ do
5:     Compute the TD error $\delta_t$
6:     Compute the advantage $\hat{A}_t$ using GAE
7:   end for
8:   Compute the returns $\hat{R}_t$
9:   for epoch $= 1$ to M do
10:     Sample a minibatch from $\mathcal{D}$
11:     for each sample in the minibatch do
12:       Compute the importance sampling ratio $r_t(\theta)$
13:       Compute the clipped objective $L^{\mathrm{CLIP}}(\theta)$, the value loss $L^{V}(\phi)$, and the entropy term $L^{H}(\theta)$
14:       Compute the total loss $L(\theta, \phi)$
15:     end for
16:     Update $\theta$ with a gradient step on the total loss
17:     Update $\phi$ with a gradient step on the value loss
18:   end for
19: end for
20: return $\theta$, $\phi$
3.2. Application and Implementation of CBF
The CBF is used here to enforce safety conditions derived from the kinematic model of the omnidirectional mobile robot used in this study. The focus is on demonstrating how this function acts as a real-time safety filter, applying online corrections to the original motion output by the PPO algorithm to achieve collision avoidance during navigation. Its quadratic programming formulation explicitly serves a single goal: to minimize the safety corrections to the PPO-determined motion while ensuring that the robot maintains the specified safety distance and stays within its physical motion limits.
3.2.1. Safe Sets and Safe Functions
The set of states, $\mathcal{C}$, in which the robot can safely operate and the barrier function $h(x)$ are defined in terms of the robot's distance to the nearest obstacle, where $\mathbf{p}$ represents the coordinates of the robot in the global coordinate system, $\mathbf{p}_{\mathrm{obs}}$ represents the coordinates of the nearest obstacle, and $d_{\mathrm{safe}}$ is the safety distance threshold.
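One common choice consistent with this description (an illustrative reconstruction; the paper may use a squared-distance variant) is

```latex
h(x) = \lVert \mathbf{p} - \mathbf{p}_{\mathrm{obs}} \rVert - d_{\mathrm{safe}}, \qquad
\mathcal{C} = \{\, x : h(x) \geq 0 \,\}.
```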
3.2.2. Implementation of CBF Safety Constraints
The time derivative of the CBF can be calculated as $\dot{h}(x) = \nabla h(x)\,\dot{x}$, where $\nabla h(x)$ is the gradient of the barrier function and $\dot{x}$ is the state derivative. Then, the robot motion model is substituted into this expression. Since the CBF needs to satisfy the safety condition, substituting Equation (22) into Equation (9) yields a constraint on the control inputs. Next, a relative position vector, $\Delta\mathbf{p} = \mathbf{p} - \mathbf{p}_{\mathrm{obs}}$, and the unit vector of the robot's forward direction, $\mathbf{i}$, are defined. The corresponding constraint can then be transformed into a linear inequality in the control variables.
In summary, the overall constraint-solving framework of the CBF is formulated as a quadratic program that minimally modifies the nominal control command subject to the CBF inequality and the actuator limits, where $v_{\max}$ is the maximum value of the linear velocity $v$, and $\omega_{\max}$ is the maximum value of the angular velocity $\omega$.
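As an illustration of this quadratic program, the following sketch filters a nominal command $(v, \omega)$ with the distance-based barrier above; the class-$\mathcal{K}$ gain `alpha`, the simplified $\dot{h}$ that ignores the obstacle's own motion, and the use of CVXPY are assumptions made for the example, and the paper's exact constraint matrices may differ.

```python
import numpy as np
import cvxpy as cp

def cbf_safety_filter(p_robot, theta, p_obs, v_nom, w_nom,
                      d_safe=0.5, alpha=1.0, v_max=0.5, w_max=1.0):
    """Minimally modify the nominal command (v_nom, w_nom) so that h_dot >= -alpha * h."""
    dp = p_robot - p_obs                                  # relative position vector
    dist = np.linalg.norm(dp) + 1e-6
    h = dist - d_safe                                     # barrier value
    heading = np.array([np.cos(theta), np.sin(theta)])    # unit forward direction

    v = cp.Variable()
    w = cp.Variable()
    # h_dot = dot(dp/dist, heading) * v; the CBF condition reads h_dot + alpha*h >= 0.
    h_dot = float((dp / dist) @ heading) * v
    constraints = [h_dot + alpha * h >= 0,
                   cp.abs(v) <= v_max,
                   cp.abs(w) <= w_max]
    objective = cp.Minimize(cp.square(v - v_nom) + cp.square(w - w_nom))
    cp.Problem(objective, constraints).solve()

    if v.value is None:       # infeasible QP: stop as a conservative fallback
        return 0.0, 0.0       # (in practice a slack variable keeps the QP feasible)
    return float(v.value), float(w.value)
```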
Figure 5 shows a simple diagram of the CBF obstacle avoidance process, as well as an example of the impact of different CBF parameter values on the trajectory. At each step, the CBF combines the current velocity and other state variables to determine the next output that meets the constraint requirements, combines this with the result of the barrier function calculation, and continues moving until the robot reaches the target position.
In summary, a CBF framework for safe robot motion is constructed as follows. First, a safe set representing the collision-free states is encoded by a barrier function. Then, the time derivative of this barrier function is derived from the robot's kinematic model and combined with the extended class-$\mathcal{K}$ function to obtain linear inequalities on the control variables. Finally, a quadratic program is formulated to minimize the modification of the nominal control command while satisfying the CBF safety constraints and the actuator's limits on the linear and angular velocities. The corresponding algorithm for the CBF is shown in Algorithm 2.
Algorithm 2 CBF safety filter.
Input: Robot kinematic model, safety distance threshold $d_{\mathrm{safe}}$, maximum velocity $v_{\max}$, maximum angular velocity $\omega_{\max}$, CBF parameters
Output: Optimized control inputs $v^{*}$, $\omega^{*}$
1: Initialize the robot state $\mathbf{p}$, orientation $\theta$, velocity $v$, and angular velocity $\omega$
2: Calculate the relative position vector $\Delta\mathbf{p}$ to the nearest obstacle
3: Calculate the unit vector $\mathbf{i}$ of the robot's forward direction
4: Calculate the barrier function $h(x)$
5: Calculate the time derivative of the barrier function using Equation (22)
6: Set the initial (nominal) control inputs $v$, $\omega$
7: for $k = 1$ to K do
8:   Calculate the CBF constraint on the control inputs
9:   Solve the optimization problem that minimizes the deviation from the nominal control inputs subject to the CBF constraint and the velocity limits
10:  Update the control inputs $v^{*}$, $\omega^{*}$
11:  Update the robot state using the kinematic model
12: end for
13: return $v^{*}$, $\omega^{*}$
3.3. Risk-Aware, Dynamic, Adaptive Regulation Barrier Policy Optimization
This section designs a risk-aware, dynamic, adaptive regulation barrier policy optimization (RADAR-BPO) method, which combines the stable and efficient exploration of PPO with the safety guarantees of the CBF. The Actor outputs probabilistic actions, and the Critic evaluates their value and optimizes the Actor based on feedback obtained from the agent's motion in the environment. In addition, an extra safety layer is provided by the CBF, which further adjusts the actions output by the PPO Actor, filters out risky actions, and outputs safe actions. This method deeply integrates the exploration capability of PPO and the safety of the CBF, so that the robot agent can maximize the value of exploration in different environments. A corresponding block diagram of the complete system is shown in Figure 6, and the algorithm is shown in Algorithm 3.
Figure 6. Complete RADAR-BPO framework for training to interact with the environment.
Algorithm 3 RADAR-BPO.
Input: Initial policy parameters $\theta$, value parameters $\phi$, safety parameters (safety distance, velocity limits, CBF gain)
Output: Optimized $\theta$, $\phi$
1: Initialize $\theta$, $\phi$
2: for iteration $= 1$ to K do
3:   Data Collection:
4:   Collect a trajectory $\mathcal{D}$ via the policy $\pi_\theta$
5:   Policy Optimization:
6:   Compute advantages and returns for $\mathcal{D}$
7:   Update $\theta$ and $\phi$ using the PPO loss with the clipped objective
8:   for each state $s_t$ in $\mathcal{D}$ do
9:     Get the PPO action $a_t \sim \pi_\theta(\cdot \mid s_t)$
10:    Compute the safety constraint based on the current robot state
11:    Safety Filtering:
12:    Solve the CBF quadratic program to obtain the safe action $a_t^{\mathrm{safe}}$
13:    Execute the safe action $a_t^{\mathrm{safe}}$
14:  end for
15: end for
16: return $\theta$, $\phi$
In addition, this method fully considers the robot's spatial information during the obstacle avoidance process, including its position, orientation, and distance from obstacles, transforms this spatial information into safety constraints using the CBF, and embeds these constraints into the reinforcement learning-based decision-making process, thereby achieving safe and efficient navigation in complex dynamic environments.
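Conceptually, the interaction between the PPO policy and the CBF filter at execution time can be sketched as follows, reusing the `Actor` and `cbf_safety_filter` sketches from the earlier sections; the environment accessors (`robot_position`, `robot_heading`, `nearest_obstacle`) and the two-dimensional $(v, \omega)$ action are assumptions made for the example.

```python
import torch
import numpy as np

def radar_bpo_step(actor, env, state):
    """One environment step: PPO proposes an action, the CBF filter certifies it."""
    with torch.no_grad():
        dist = actor(torch.as_tensor(state, dtype=torch.float32))
        action = dist.sample().numpy()            # nominal (v, w) from the PPO policy

    p_robot, theta = env.robot_position(), env.robot_heading()   # assumed accessors
    p_obs = env.nearest_obstacle()                                # assumed accessor
    v_safe, w_safe = cbf_safety_filter(p_robot, theta, p_obs,
                                       v_nom=action[0], w_nom=action[1])

    next_state, reward, done, info = env.step(np.array([v_safe, w_safe]))
    return next_state, reward, done, (v_safe, w_safe)
```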
To achieve an optimal balance between navigation efficiency and safety guarantees, a multi-objective reward function with adaptive components is designed. The mathematical formulation is defined as
$$r = r_{\mathrm{heading}} + r_{\mathrm{distance}} + r_{\mathrm{obstacle}} + r_{\mathrm{velocity}},$$
where $r_{\mathrm{heading}}$, $r_{\mathrm{distance}}$, $r_{\mathrm{obstacle}}$, and $r_{\mathrm{velocity}}$ are defined below.
The heading reward $r_{\mathrm{heading}}$ encourages alignment with the target direction and is a decreasing function of the angular deviation between the robot's current orientation and the target direction. The distance reward $r_{\mathrm{distance}}$ motivates progression toward the target position and depends on the current Euclidean distance to the target relative to the initial distance.
The obstacle penalty $r_{\mathrm{obstacle}}$ ensures collision risk awareness: it is a penalty that grows as the distance to the nearest obstacle decreases.
The velocity reward $r_{\mathrm{velocity}}$ optimizes the motion efficiency by tracking an adaptive optimal velocity that scales with the remaining distance to the target. This term encourages a higher velocity when the robot is distant from the target while promoting precision during the final phases.
The reward function integrates the four components in an additive manner. Although this is an equally weighted sum, each term is designed to have a comparable magnitude to prevent any one objective from overly dominating the learning process. For example, the heading and distance rewards are bounded, and the obstacle penalty is scaled by a fixed factor and decays exponentially with distance, so that it provides a significant but bounded penalty. This design prioritizes simplicity, interpretability, and minimal parameter tuning in the early stages of the algorithm's implementation, resulting in a reliable benchmark. The velocity component is not a constant reward but a term used to fine-tune the efficiency of the movement.
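The sketch below illustrates one way such an additive reward could be assembled; the specific functional forms and coefficients are assumptions chosen to match the qualitative description above, not the paper's exact definitions.

```python
import numpy as np

def composite_reward(heading_error, d_goal, d_init, d_obs, speed,
                     d_safe=0.5, v_max=0.5):
    """Additive multi-objective reward: heading + distance + obstacle + velocity terms."""
    r_heading = np.cos(heading_error)                     # in [-1, 1], max when facing the goal
    r_distance = 1.0 - d_goal / max(d_init, 1e-6)         # progress toward the target
    r_obstacle = -np.exp(-(d_obs - d_safe))               # bounded penalty, grows near obstacles
    v_opt = v_max * min(1.0, d_goal / max(d_init, 1e-6))  # slow down when close to the goal
    r_velocity = -abs(speed - v_opt)                      # track the adaptive optimal velocity
    return r_heading + r_distance + r_obstacle + r_velocity
```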
The RADAR-BPO framework establishes a safety-critical reinforcement learning paradigm through integrated policy optimization and real-time safety assurance. During each training iteration, the agent executes interactions with the environment using its current policy network $\pi_\theta$, gathering trajectory data that include environmental states, exploratory actions, and reward signals. This collected experience forms the foundation for subsequent policy refinement.
Following data collection, the algorithm performs proximal policy optimization using the GAE. This involves calculating the temporal difference errors to estimate the action advantages, then updating both the policy and value networks using a specialized objective function. The optimization balances exploration incentives with value approximation while maintaining training stability through gradient clipping.
During action execution, each policy-generated velocity command undergoes safety verification using a CBF filter. This module solves a quadratic optimization problem that minimizes deviations from the original actions while enforcing collision avoidance constraints. The safety verification incorporates the robot’s positional relationship with obstacles and its current heading orientation to dynamically adjust the motion commands.
The resulting safe actions preserve the learning direction of the policy network while ensuring formal safety guarantees through real-time constraint enforcement. This synergistic integration creates an evolution mechanism where policy improvement and safety assurance mutually reinforce each other throughout the learning process. The system maintains continuous compliance with safety boundaries while progressively refining its navigation strategy through environmental interactions.
4. Case Study
To rigorously evaluate the performance and validate the efficacy of the proposed risk-aware, dynamic, adaptive regulation barrier policy optimization (RADAR-BPO) framework for collision risk mitigation in mobile robot navigation, this section presents comprehensive simulation experiments conducted within the ROS Gazebo environment.
4.1. Test Environment Setup
The simulation environments visualized in
Figure 7 were constructed within the ROS Gazebo platform to test a robot’s obstacle avoidance navigation. Specifically,
Figure 7a presents a dynamic pedestrian environment,
Figure 7b demonstrates a multi-obstacle environment, and
Figure 7c showcases a complex obstacle environment incorporating both static and dynamic elements.
In
Figure 7a, there are two pedestrians walking back and forth at a certain speed. The robot needs to avoid them and reach the target point while the pedestrians are moving. In
Figure 7b, there are multiple static cylindrical or cubic obstacles. The robot needs to avoid the static obstacles and successfully navigate to the target point.
Figure 7c contains multiple pedestrians and multiple different static obstacles. The robot also needs to complete a navigation task in this complex and changing environment.
4.2. Design of Experimental Test
The motion trajectories of the pedestrians in
Figure 7a,c are shown in
Table 1, and there are two stationary pedestrians in
Figure 7c.
In the Gazebo coordinate system (the positive direction on the y-axis is left and the negative direction is right), pedestrian 1 moves 1 m to the right from their initial position (0.5, 2.5) to (0.5, 1.5), turns on the spot, moves 1 m to the left, returns to their starting point (0.5, 2.5), and stays there. At the same time, pedestrian 2 moves 2 m to the left from their initial position (3.0, 0.0) to (3.0, 2.0), turns on the spot, moves 2 m to the right, returns to their starting point (3.0, 0.0), and stays there. Both pedestrians complete a round trip in the y-axis direction. There are two other stationary pedestrians: one is fixed at (1.0, 0.0), and the other is fixed at (2.0, 3.0). The robot starts from the starting point (0.0, 0.0) and needs to traverse the dynamic environment to reach the target point (3.0, 3.0). Its path will be blocked by moving pedestrians and it will need to avoid stationary pedestrian obstacles and other obstacles.
Table 2 lists the key parameter settings used to train the RADAR-BPO navigation algorithm. These parameters cover aspects such as policy optimization, the safety constraints, the robot’s kinematic model, and the training environment configuration. Some parameter values (such as discount factors and GAE parameters) were within typical ranges or needed to be adjusted according to the specific environment.
The algorithm was trained on each environment for a fixed number of episodes: 150 for Env 1 (dynamic pedestrians), 200 for Env 2 (multiple static obstacles), and 300 for Env 3 (multiple complex obstacles). This design with an increasing environment complexity and training time was intended to mimic curriculum learning, allowing the agent to consolidate foundational skills before tackling more difficult tasks. Each training episode terminated when the agent successfully reached the goal, collided with an obstacle, or reached the maximum step limit.
4.3. Test Results and Analysis
To comprehensively evaluate the performance of the proposed RADAR-BPO framework and quantitatively assess its effectiveness in mitigating collision risks while maintaining navigation efficiency, extensive simulations were conducted across the three distinct environments introduced in
Section 4.1 (
Figure 7). The evaluation employed a three-stage progressive training paradigm: Stage 1 (
Figure 7a) utilized the dynamic pedestrian environment (Env 1), Stage 2 (
Figure 7b) progressed to the multi-static obstacle environment (Env 2), and Stage 3 (
Figure 7c) took place in the complex hybrid environment containing both static obstacles and dynamic pedestrians (Env 3). This staged approach rigorously tested the algorithm to determine its fundamental obstacle avoidance ability.
The core performance metrics included the learning stability, safety performance, and navigation efficiency. Crucially, the cumulative reward curves from throughout the training process were analyzed and compared against those of the baseline PPO algorithm. This direct comparison highlighted the impact of integrating the real-time CBF safety filter within the RADAR-BPO framework. The corresponding segmented training reward curve is shown in
Figure 8.
As shown in
Figure 8, during the progressive training process implemented across three stages (Env 1, Env 2, and Env 3), the RADAR-BPO algorithm significantly outperformed all the baseline algorithms (PPO, DQN, and DDPG) in terms of the average reward in all the environments. This overall performance advantage was demonstrated by the following: in the relatively simple Env 1 stage, while all the algorithms initially experienced low rewards, RADAR-BPO quickly escaped this low-reward zone and stabilized at a higher level. Entering the more complex Env 2 stage, RADAR-BPO exhibited a burst of performance growth, with its reward values rapidly exceeding 1000 and stabilizing at approximately 1100. In contrast, PPO slowly climbed to around 400, while the DQN and DDPG lagged significantly behind. In the most complex Env 3 stage, the challenging environment caused performance degradation for all the algorithms. However, RADAR-BPO maintained an absolute advantage, stabilizing at around 760 with a smooth, volatility-resistant curve. PPO fluctuated violently below 400 and showed weak growth. The DQN and DDPG’s performance remained sluggish, failing to improve significantly, and the gap between their performance and RADAR-BPO’s was the largest.
Comparing the two curves at the beginning of the Env 3 phase reveals that PPO’s reward values exhibited a larger upward gradient and oscillation amplitude, while the rise in RADAR-BPO’s values was more stable. This phenomenon reveals the different learning modes of the two algorithms: The PPO strategy almost failed after an environment switch, requiring a painstaking learning process to recover from an extremely high collision rate, resulting in an unstable learning process. In contrast, RADAR-BPO benefited from the real-time safety guarantees provided by the CBF. Its strategy retained its core obstacle avoidance capabilities after an environment switch, and its learning process involved stable fine-tuning based on a higher performance baseline to adapt to new dynamic obstacles. Therefore, RADAR-BPO sacrificed a seemingly larger learning amplitude in exchange for a higher, more stable, and safer final performance.
This result clearly shows that the RADAR-BPO framework integrated with real-time CBF security filtering can not only achieve higher task returns in complex dynamic environments (based on the higher success rate and average reward in
Table 3) but also significantly improve the convergence speed of the learning process, the final performance ceiling, and the training stability, effectively solving the core problems of low exploration efficiency, large policy fluctuations, and limited performance of traditional PPO in safety-critical scenarios.
According to the results listed in
Table 3, RADAR-BPO outperformed the PPO algorithm in most metrics. In Env 1, RADAR-BPO’s collision rate was 68.67%, lower than PPO’s 82.00%; in Env 3, RADAR-BPO’s collision rate further decreased to 10.67%, compared to PPO’s 30.67%. Furthermore, RADAR-BPO achieved 47, 164, and 268 successes in the three different environments, respectively, all exceeding PPO’s 27, 134, and 208. In terms of the average reward, RADAR-BPO also generally outperformed PPO, with these algorithms achieving 1070.09 and 377.01, respectively, in Env 2. These results demonstrate that RADAR-BPO maintains high task completion efficiency and a high reward yield while reducing the collision rate.
In contrast, the DDPG and DQN algorithms generally underperformed in comparison to RADAR-BPO and PPO. The DDPG’s collision rate was above 75% across all the environments, and its average reward was significantly lower than that of the other algorithms. The DQN had zero successes and a 100% collision rate in Env 1. In Env 2 and Env 3, its collision rates were 55.50% and 68.00%, respectively. While its average reward was higher than that of the DDPG, it was still lower than that of RADAR-BPO and PPO.
To provide an intuitive visualization of the algorithm’s real-time decision-making and safety assurance capabilities in the most challenging scenario, the trajectory evolution of the robot navigating across Env 3 over a critical 9 s interval (from t = 1 s to t = 9 s) is presented and discussed. This sequence illustrates how RADAR-BPO dynamically adjusted the robot’s path to safely avoid both static obstacles and moving pedestrians while progressing towards the target.
In Figure 9, the robot starts from the starting position (0.0, 0.0), initially accelerates to bypass the stationary pedestrian in front of it, and turns left toward the target direction. When it encounters Pedestrian 1 (0.5, 2.5 → 1.5) moving laterally, it makes a sharp turn to achieve emergency avoidance while smoothly avoiding the walls and nearby obstacles. Under the combined threat of the laterally moving Pedestrian 2 (3.0, 2.0 → 0.0) and dense obstacles, the robot flexibly adjusts its path to squeeze through and finally arrives at the end point (3.0, 3.0) accurately. The entire path maintains the straight-line efficiency of PPO in open areas and forms a smooth, conservative contour under the CBF constraints when approaching obstacles. The collision-free nature of the entire run verifies RADAR-BPO's seamless integration of exploration and safety in dynamic, dense scenes.
5. Conclusions and Future Work
This paper proposes RADAR-BPO (risk-aware, dynamic, adaptive regulation barrier policy optimization), a novel safe reinforcement learning framework integrating PPO with CBF-based safety filters to mitigate collision risks in mobile robot navigation. The framework leverages PPO for exploratory policy generation while employing the CBF as a real-time safety filter, formulated as a quadratic programming problem, to minimally modify risky actions and ensure collision avoidance. Implemented on a mecanum wheel robot within the ROS Gazebo simulation environment, the method demonstrated significant improvements in safety performance across diverse dynamic and complex scenarios. Comparative experiments against baseline PPO, DQN, and DDPG algorithms confirmed that RADAR-BPO achieves higher success rates, lower collision rates, and superior average rewards while maintaining navigation efficiency, highlighting its effectiveness in balancing exploration with formal safety guarantees.
In future work, the framework could be extended to address more complex real-world challenges. Potential directions include validating the approach on physical robotic platforms to assess its real-time performance and robustness under sensor noise and hardware limitations. Additionally, exploring the use of adaptive or learned CBF parameters to handle heterogeneous obstacle shapes and uncertain dynamics, integrating multi-robot collision avoidance scenarios, and extending the method to unstructured outdoor environments would further enhance this approach’s applicability.