Article

Collision-Free Robot Path Planning by Integrating DRL with Noise Layers and MPC

by Xinzhan Hong 1,2,†, Qieshi Zhang 3,†, Yexing Yang 1, Tianqi Zhao 1, Zhenyu Xu 3,4, Tichao Wang 3 and Jing Ji 1,2,*

1 Hangzhou Institute of Technology, Xidian University, Hangzhou 311231, China
2 School of Information Mechanics and Sensing Engineering, Xidian University, Xi’an 710071, China
3 CAS Key Laboratory of Human–Machine Intelligence-Synergy Systems, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
4 Faculty of Science and Technology, University of Macau, Macau 999078, China
* Author to whom correspondence should be addressed.
† Xinzhan Hong and Qieshi Zhang contributed equally to this work.
Sensors 2025, 25(20), 6263; https://doi.org/10.3390/s25206263
Submission received: 5 July 2025 / Revised: 1 September 2025 / Accepted: 26 September 2025 / Published: 10 October 2025

Abstract

With the rapid advancement of Autonomous Mobile Robots (AMRs) in industrial automation and intelligent logistics, achieving efficient and safe path planning in dynamic environments has become a critical challenge. These environments require robots to perceive complex scenarios and adapt their motion strategies accordingly, often under real-time constraints. Existing methods frequently struggle to balance efficiency, responsiveness, and safety, especially in the presence of continuously changing dynamic obstacles. While Model Predictive Control (MPC) and Deep Reinforcement Learning (DRL) have each shown promise in this domain, they also face limitations when applied individually—such as high computational demands or insufficient environmental exploration. To address these challenges, we propose a hybrid path planning framework that integrates an optimized DRL algorithm with MPC. We replace the Actor’s output with a learnable noisy linear layer whose mean and scale parameters are optimized jointly with the policy via backpropagation, thereby enhancing exploration while preserving training stability. TD3 produces stepwise control commands that evolve into a short-horizon reference trajectory, while MPC refines this trajectory through constraint-aware optimization to ensure timely obstacle avoidance. This complementary process combines TD3’s learning-based adaptability with MPC’s reliable local feasibility. Extensive experiments conducted in environments with varying obstacle dynamics and densities demonstrate that the proposed method significantly improves obstacle avoidance success rate, trajectory smoothness, and path accuracy compared to traditional MPC, standalone DRL, and other hybrid approaches, offering a robust and efficient solution for autonomous navigation in complex scenarios.

1. Introduction

With the widespread application of Autonomous Mobile Robots (AMRs) in fields such as industrial automation and intelligent logistics, achieving efficient and collision-free path planning and obstacle avoidance in dynamic environments has become a critical research topic [1,2,3]. Particularly in complex dynamic environments, robots must not only avoid static obstacles but also deal with constantly changing dynamic obstacles, such as pedestrians and other robots. Existing path planning methods perform well in static environments, but they often fail to effectively guarantee real-time performance and safety when facing dynamic obstacles.
Model Predictive Control (MPC) [4], as a widely used path planning and obstacle avoidance method, is especially suitable for local planning in dynamic environments. MPC optimizes control inputs at each time step to minimize the objective function while satisfying various constraints (such as obstacle avoidance and velocity limits). This allows real-time adjustments to the control strategy to cope with dynamic changes in the environment, particularly excelling in obstacle avoidance tasks. However, although MPC can provide precise control within a local range, it still faces challenges in global path planning, particularly when dealing with complex non-convex obstacles and high computational complexity, which may lead to low computational efficiency and slow responses to environmental changes. Therefore, relying solely on MPC for global planning may not meet the requirements for real-time performance and computational efficiency, often necessitating its combination with other algorithms to enhance global optimization capabilities [4,5,6,7].
In recent years, Deep Reinforcement Learning (DRL) [8] has been introduced into path planning and obstacle avoidance tasks, showing significant advantages, especially in high-dimensional state spaces and complex action control scenarios. For example, improved versions of DRL algorithms such as Deep Q-Network (DQN) [9], Deep Deterministic Policy Gradient (DDPG) [10], and Twin Delayed Deep Deterministic Policy Gradient (TD3) [11] have been designed with specific mechanisms to enhance exploration to some extent and to actively address local optima, achieving some success. However, DRL still faces the problem of insufficient exploration in complex dynamic environments. Especially in complex scenarios, models tend to get stuck in local optima, limiting the efficiency and precision of path planning and obstacle avoidance tasks, which requires further optimization [12,13,14].
To address these challenges, this paper proposes a collaborative optimization strategy that integrates the TD3 model with MPC, as shown in Figure 1. In this approach, TD3 produces sequential local actions that roll out into a short-horizon reference path, which MPC continuously refines for real-time local adjustments. A dynamic weight adjuster fine-tunes the trajectory to enhance obstacle avoidance and improve navigation efficiency.
Specifically, we propose an optimization scheme that replaces the Actor output layer in the TD3 network with a learnable noise layer, thereby enhancing exploration capability while maintaining training stability. The optimized TD3 is then integrated with MPC: TD3 produces sequential local actions that roll out into a short-horizon reference path, while MPC refines this path through constraint-aware local optimization for real-time obstacle avoidance. This collaboration improves both path planning accuracy and obstacle avoidance capability. Extensive experimental validation demonstrates that the proposed method outperforms traditional MPC, standalone DRL algorithms, and other hybrid optimization approaches in static and dynamic environments, achieving higher success rates, smoother trajectories, and more accurate paths. The main contributions of this paper include:
(1) A TD3-based path planning framework enhanced with an embedded noise layer in the Actor network is proposed, which improves exploration capability and convergence stability compared to conventional action noise injection.
(2) The TD3 planner is integrated with an MPC-based local controller through a dynamic weight-switching mechanism, where TD3 provides sequential local actions that roll out into a reference path, and MPC refines these actions for responsive local obstacle avoidance.
(3) The proposed framework is designed for and evaluated with fused onboard sensor data, combining 360° LiDAR, odometry, and IMU measurements. This sensing configuration enables robust operation in realistic, highly dynamic environments and differs from previous approaches that rely solely on fixed-view visual inputs.
(4) Extensive experiments in static and dynamic scenarios demonstrate superior success rate, path smoothness, and safety margins compared to pure MPC, pure RL, and prior DRL+MPC baselines.

2. Related Works

2.1. MPC Based Path Planning

MPC, as an optimization-based path planning method, is widely applied in robotic path planning and obstacle avoidance tasks. Eckhoff et al. [15] innovatively proposed the application of MPC in safe Human–Robot Interaction (HRI) motion planning. By optimizing and predicting future states based on constrained optimization, it actively avoids robot motion safety conflicts, such as human–robot collisions, overcoming the limitations of traditional control schemes that react only during motion conflicts. This approach provides a proactive path planning strategy for safe HRI. Ramezani et al. [16] combined MPC with Long Short-Term Memory (LSTM) networks to address the unmanned aerial vehicle path planning problem in complex environments. Zhao et al. [17] proposed a solution combining an improved A* algorithm and MPC, where MPC outputs control signals based on vehicle dynamics models, integrating predictive optimization and custom penalty functions to precisely handle complex parking scenarios, effectively improving the quality and control accuracy of parking trajectories.
However, the application of MPC in complex dynamic environments still faces challenges. Especially in high-computation-demand and uncertain environments, MPC often falls into local optima, and its high computational complexity results in poor real-time performance. These limitations make it difficult for MPC to cope with rapidly changing dynamic environments when used alone. To address this, researchers have started exploring the combination of MPC with other intelligent algorithms to enhance its adaptability and real-time responsiveness in dynamic environments.

2.2. DRL Based Path Planning

In recent years, DRL has been widely applied in robotic path planning and control, especially in complex tasks involving high-dimensional state spaces and continuous action spaces, where it shows significant advantages. Yan et al. [18] proposed a DRL-based UAV path planning method, using a situation assessment model and the Duel Deep Q-Network (D3QN) algorithm to predict Q-values and select actions based on specific policies, providing a new approach for DRL in path planning. Li et al. [19] used the TD3 algorithm for mobile robot path planning, introducing Prioritized Experience Replay and Transfer Learning to improve learning efficiency. They also designed a dynamic delay update strategy and incorporated Ornstein-Uhlenbeck (OU) noise, improving the success rate of path planning by 16.6%. Han et al. [20] introduced a novel perspective by designing differentiable deterministic factors, proposing denoising methods, and constructing an online weight adjustment mechanism, offering new insights into the incorporation of noise layers into DRL algorithms and opening new avenues for their optimization.

2.3. MPC and DRL Combined Path Planning

In reference [21], the authors proposed a hybrid switching-driven algorithm that integrates MPC and DRL, successfully applying it to planar non-grasping operations for robots. This approach significantly enhanced both the training and execution efficiency of the original system. Zhang et al. innovatively combined DRL with MPC by training DQN variants tailored to different observations, optimizing action selection, and designing new reward functions and switching strategies [22]. Their experiments across multiple scenarios demonstrated that the proposed hybrid method outperformed standalone MPC and DQN algorithms in various performance metrics, offering an efficient solution to robot navigation challenges. Later, Zhang et al. introduced a hybrid method combining bird’s-eye-view vision-based DRL with MPC to solve complex trajectory generation and obstacle avoidance problems [23]. The study involved designing a DDPG model, optimizing hybrid strategies, and performing evaluations across multiple scenarios, which revealed superior performance in robot settings, particularly in terms of new scene adaptation, computational efficiency, obstacle avoidance, and collaborative performance, when compared to pure DRL or MPC methods. However, despite the high success rate of model tests in static obstacle environments and occasional tests with dynamic obstacles, the motion trajectories of obstacles remained relatively fixed. This suggests there is still potential for improvement, particularly in handling randomly dynamic obstacles and more complex, evolving environments.

3. Methodology

3.1. Problem Definition

In dynamic and complex environments, robot trajectory planning requires effective coordination between global path planning and real-time local optimization while addressing challenges such as dynamic obstacles and diverse scenarios. Figure 2 illustrates the proposed framework, which integrates TD3 with MPC to balance global decision-making and local adjustments. TD3 employs a learnable noise layer design to enhance exploration capabilities and robustness, outputting sequential local actions that, when rolled out, form a short-horizon reference path while optimizing long-term cumulative rewards. Meanwhile, MPC incorporates dynamic obstacle prediction and kinematic constraints to perform real-time local trajectory optimization and dynamic obstacle avoidance. The two components collaborate adaptively through a dynamic weight adjustment module, which allocates control strategy weights based on obstacle density and environmental complexity. This synergy enables the generation of smooth, safe, and efficient robot control commands. The framework achieves closed-loop optimization at both global and local levels, providing an efficient and robust solution for robot navigation in challenging environments.

3.1.1. Kinematic Modeling

In robot path planning, the kinematic model is critical for ensuring that the robot can move efficiently while avoiding obstacles in dynamic environments. The non-holonomic differential drive model is employed in this study, which represents the robot’s motion in terms of its position, orientation, and velocities. The model provides a simple yet effective representation of a mobile robot’s motion and is particularly suitable for systems with two wheels, like differential drive robots, which are commonly used in practical applications. The discrete-time form of the kinematic model is given by the following equations:
x_{t+1} = x_t + v_t \cos(\theta_t)\,\Delta t,
y_{t+1} = y_t + v_t \sin(\theta_t)\,\Delta t,
\theta_{t+1} = \theta_t + \omega_t\,\Delta t,
Here, (x_t, y_t, θ_t) represents the robot’s position and orientation at time t; v_t and ω_t are the linear and angular velocities of the robot, respectively; and Δt is the time step.
The robot’s trajectory planning depends heavily on the accurate modeling of its movement, which can be constrained by real-world limitations such as the robot’s maximum velocity and acceleration. To ensure controllability and stability in the robot’s motion, the following kinematic constraints are applied:
Linear velocity and acceleration: The linear velocity is constrained within the range [−0.5,1.5] m/s, where negative values represent allowable reverse motion. To avoid instability caused by sudden acceleration or deceleration, the linear acceleration is restricted within [−1.0,1.0] m/s2, thus controlling the rate of change in velocity.
Angular velocity and acceleration: To control the amplitude of the robot’s rotational motion, the angular velocity is constrained within [−0.5,0.5] rad/s. The range of angular acceleration is defined as [−3.0,3.0] rad/s2, in order to ensure smooth rotational motion and avoid oscillations.
Discretization and Sampling Time: To implement the above constraints in a digital control system, the kinematic model is discretized with a sampling time of Δt = 0.2 s. This discretized time step ensures that the robot can respond promptly to changes in the dynamic environment during planning and is consistent with the control step size in the MPC optimization framework.
The application of these kinematic constraints ensures that the robot can operate within safe and practical limits. They are essential when integrating the robot’s kinematic model with the global path planning and local optimization components, particularly in the context of dynamic environments where obstacle densities and environmental conditions constantly change. This modeling approach helps maintain a balance where TD3 outputs local control actions that roll out into a reference path, while MPC performs local optimization for real-time obstacle avoidance, which are critical for the robot’s ability to navigate complex, dynamic scenarios.
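For concreteness, a minimal Python sketch of this discrete-time update with the stated velocity and acceleration limits is given below. The function and variable names are illustrative only and are not taken from the authors’ implementation.

```python
import numpy as np

DT = 0.2                                         # sampling time Delta t (s)
V_LIM, W_LIM = (-0.5, 1.5), (-0.5, 0.5)          # linear (m/s), angular (rad/s) limits
A_LIM, ALPHA_LIM = (-1.0, 1.0), (-3.0, 3.0)      # accelerations (m/s^2, rad/s^2)

def step(state, v_cmd, w_cmd):
    """state = (x, y, theta, v, w); returns the next state after one time step."""
    x, y, theta, v, w = state
    # rate-of-change limits (acceleration constraints)
    v = v + np.clip(v_cmd - v, A_LIM[0] * DT, A_LIM[1] * DT)
    w = w + np.clip(w_cmd - w, ALPHA_LIM[0] * DT, ALPHA_LIM[1] * DT)
    # velocity limits
    v = np.clip(v, *V_LIM)
    w = np.clip(w, *W_LIM)
    # discrete-time differential-drive kinematics
    x += v * np.cos(theta) * DT
    y += v * np.sin(theta) * DT
    theta += w * DT
    return (x, y, theta, v, w)

print(step((0.0, 0.0, 0.0, 0.0, 0.0), 1.0, 0.2))
```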

3.1.2. Optimization Objective Modeling

To achieve effective robot navigation in dynamic and complex environments, the trajectory planning task is framed as an optimization problem. The primary objective is to generate a collision-free trajectory from the start point to the target, while maintaining smoothness, efficiency, and control input feasibility. The optimization cost function is designed to combine several critical factors:
J = \sum_{t=0}^{N} \left( w_1 J_{\mathrm{path}} + w_2 J_{\mathrm{smooth}} + w_3 J_{\mathrm{collision}} + w_4 J_{\mathrm{control}} \right),
where the path tracking cost (J_path) measures the deviation between the robot’s actual trajectory and the reference path, which is critical for accurate path following:
J_{\mathrm{path}} = \lVert s_t - s_{\mathrm{ref}} \rVert^2,
Here, s_t = (x_t, y_t) is the robot’s current position and s_ref is the reference trajectory point.
The trajectory smoothness cost (J_smooth) limits abrupt turns or accelerations during trajectory generation:
J_{\mathrm{smooth}} = \lVert u_t - u_{t-1} \rVert^2,
where u_t = (v_t, ω_t) represents the control inputs, i.e., the linear and angular velocities.
The collision cost (J_collision) is used to avoid collisions between the robot and static or dynamic obstacles:
J_{\mathrm{collision}} = \sum_{o \in O} \max\left(0,\; d_{\mathrm{safe}} - \lVert s_t - o_t \rVert \right),
Here, o_t = (x_o, y_o) represents the position of obstacle o at time t, O is the set of all obstacles in the environment, and d_safe is the predefined safe distance.
The control cost (J_control) limits the magnitude of the control inputs, ensuring that the robot operates within its physical capabilities:
J_{\mathrm{control}} = v_t^2 + \omega_t^2.
The weight coefficients w_1, w_2, w_3, w_4 in the cost function can be adjusted according to task requirements to balance path tracking and control overhead.
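The following sketch shows how one stage of this cost could be evaluated. The weight values and the safe distance used here are placeholders, since the paper does not report specific numerical settings for them.

```python
import numpy as np

def stage_cost(s_t, s_ref, u_t, u_prev, obstacles, d_safe=0.5,
               w=(1.0, 0.1, 10.0, 0.01)):
    j_path = np.sum((s_t - s_ref) ** 2)                          # path-tracking term
    j_smooth = np.sum((u_t - u_prev) ** 2)                       # smoothness term
    j_collision = sum(max(0.0, d_safe - np.linalg.norm(s_t - o))
                      for o in obstacles)                        # collision term
    j_control = u_t[0] ** 2 + u_t[1] ** 2                        # control-effort term
    return w[0] * j_path + w[1] * j_smooth + w[2] * j_collision + w[3] * j_control

s_t, s_ref = np.array([1.0, 0.5]), np.array([1.1, 0.4])
u_t, u_prev = np.array([0.8, 0.1]), np.array([0.7, 0.0])
print(stage_cost(s_t, s_ref, u_t, u_prev, obstacles=[np.array([1.3, 0.6])]))
```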

3.1.3. Sensing and Observation

In this work, the observation state s t is derived from onboard multi-sensor data fusion rather than a top-mounted Bird’s-Eye-View (BEV) camera. A 2D LiDAR sensor provides 720 evenly spaced range measurements over 360°, normalized to [0,1] using the sensor’s maximum range of 3.5 m, and transformed into a local occupancy grid aligned with the robot’s current heading. This grid is processed by a lightweight Convolutional Neural Network (CNN) to extract spatial features. These features are combined with proprioceptive information, including linear and angular velocities from wheel odometry, and three-axis angular velocity and linear acceleration from an Inertial Measurement Unit (IMU). In the real-world experiments, a forward-facing RGB camera is optionally used to provide supplementary visual cues for detecting obstacle boundaries and environmental details; however, it is not included in the DRL training process.
The action vector a_t is two-dimensional, consisting of the linear velocity v and angular velocity ω for the differential-drive base. The ranges are constrained to v ∈ [0, 0.22] m/s and ω ∈ [−2.84, 2.84] rad/s to match the physical limits of the robot platform. This sensing-action configuration supports robust operation in both simulated and real-world environments with stochastic dynamic obstacles, without relying on top-mounted BEV cameras [23].
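A minimal sketch of this sensing–action interface is shown below, assuming a simple normalization pipeline; the helper names and the proprioceptive layout are illustrative assumptions rather than the authors’ code.

```python
import numpy as np

MAX_RANGE = 3.5            # LiDAR maximum range (m)
V_MAX, W_MAX = 0.22, 2.84  # platform velocity limits

def build_observation(scan_720, odom_vw, imu_gyro_xyz, imu_acc_xyz):
    """Normalize the 720-beam scan to [0, 1] and stack proprioceptive signals."""
    ranges = np.clip(np.asarray(scan_720), 0.0, MAX_RANGE) / MAX_RANGE
    proprio = np.concatenate([odom_vw, imu_gyro_xyz, imu_acc_xyz])
    return ranges, proprio          # the occupancy grid / CNN features come later

def denormalize_action(a):
    """Map a policy output in [-1, 1]^2 onto the platform's velocity limits."""
    v = (a[0] + 1.0) / 2.0 * V_MAX  # v in [0, 0.22] m/s
    w = a[1] * W_MAX                # omega in [-2.84, 2.84] rad/s
    return v, w

print(denormalize_action(np.array([0.0, 0.5])))
```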

3.1.4. Dynamic Obstacle Prediction Modeling

In dynamic environments, the random motion of obstacles increases the difficulty of path planning. Therefore, this paper designs a kinematic-based obstacle trajectory prediction model to update dynamic constraints in real time during the optimization process. The motion model of an obstacle is given by:
o_{t+1} = o_t + v_o\,\Delta t,
where o_t = (x_o, y_o) represents the obstacle’s position and v_o = (v_x, v_y) represents its velocity. Additionally, to enhance the robustness of the prediction, a combination of short-term trajectory regression and motion trend analysis is employed, and a sliding window is used to smooth the obstacle trajectory:
\hat{o}_{t+1} = \alpha\,o_{t+1} + (1 - \alpha)\left(o_{t-1} + v_o\,\Delta t\right),
where α ∈ [0,1] is the smoothing factor. The predicted obstacle trajectories are used to dynamically update the obstacle constraints in the optimization problem, enabling the MPC module to effectively avoid dynamic obstacles.
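The prediction and smoothing steps can be written compactly as follows; the smoothing factor value is illustrative, and the trend term follows the reconstructed equation above.

```python
import numpy as np

DT, ALPHA = 0.2, 0.7   # time step and smoothing factor (placeholder values)

def predict_obstacle(o_t, o_prev, v_o):
    raw = o_t + DT * v_o                              # constant-velocity prediction
    trend = o_prev + DT * v_o                         # short-term trend term
    return ALPHA * raw + (1.0 - ALPHA) * trend        # smoothed prediction

o_t, o_prev, v_o = np.array([2.0, 1.0]), np.array([1.9, 0.95]), np.array([0.4, 0.2])
print(predict_obstacle(o_t, o_prev, v_o))
```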

3.2. TD3 Model with Noise Layer Optimization

Traditional TD3 implementations typically inject i.i.d. Gaussian noise directly into the action space to promote exploration. However, such post hoc perturbations do not influence the policy parameters during backpropagation, and therefore cannot structurally shape how the policy improves. This limits their ability to provide consistent, gradient-informed exploration.
To overcome this limitation, we redesign the Actor’s output layer as a learnable noisy linear layer inspired by parameter-space noise. Instead of applying external noise after the action is computed, the perturbation is embedded into the weight and bias parameters of the output layer and updated together with the Actor through backpropagation. This mechanism produces state-consistent, rollout-level exploration that directly impacts the replay buffer distribution while preserving the stability of TD3 through a noise-free twin-Critic. In this way, exploration is structurally coupled with learning, which is fundamentally different from conventional post hoc action noise. In our final design, noise injection is applied only to the Actor network to avoid destabilizing Critic value estimation.
During data collection, the noisy layer is enabled, and actions are sampled with stochastic perturbations, generating rollout-consistent exploration that populates the replay buffer. For target policy evaluation, the noisy layer is disabled and its deterministic expectation is used, preserving the stability of TD3’s target update. During Actor updates, we employ a single-sample reparameterization strategy with the sampled perturbation fixed per minibatch, which allows gradients to flow into the noise parameters while keeping variance under control. This design ensures that exploration directly influences the training data while maintaining the deterministic backbone required for stable policy optimization.

3.2.1. Noise Layer Design Principles

The noise layer is a mechanism that injects parameterized randomness into the network during training, promoting exploration by introducing stochasticity into the policy, especially in high-dimensional continuous action spaces [24]. The core design consists of the following three components:
(1) Randomization of weights and bias parameters
Figure 3 compares a standard linear layer with a noise layer, highlighting the parameterized randomness design of the noise layer. The linear transformation of each noise layer is redefined as:
y = \left(\mu_\omega + \sigma_\omega \odot \epsilon_\omega\right) x + \left(\mu_b + \sigma_b \odot \epsilon_b\right),
where μ_ω and μ_b are the learnable mean values of the weight and bias, respectively, acting as deterministic parameters; σ_ω and σ_b are learnable scales (all registered as trainable tensors), updated jointly with the Actor via backpropagation; and ε_ω and ε_b are noise variables drawn from a standard normal distribution N(0,1). This design dynamically adjusts the amplitude of noise injection, facilitating a balance between exploration and exploitation during different training stages.
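As a concrete illustration, a minimal PyTorch sketch of such a learnable noisy linear layer is given below. The initialization constants and method names are common choices for parameter-space noise [24], not values reported in the paper.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Linear layer with learnable mean/scale parameters and Gaussian perturbations."""
    def __init__(self, in_features, out_features, sigma_init=0.017):
        super().__init__()
        self.mu_w = nn.Parameter(torch.empty(out_features, in_features))
        self.sigma_w = nn.Parameter(torch.full((out_features, in_features), sigma_init))
        self.mu_b = nn.Parameter(torch.zeros(out_features))
        self.sigma_b = nn.Parameter(torch.full((out_features,), sigma_init))
        bound = 1.0 / math.sqrt(in_features)
        nn.init.uniform_(self.mu_w, -bound, bound)
        # buffers hold the most recently sampled noise, epsilon ~ N(0, 1)
        self.register_buffer("eps_w", torch.zeros(out_features, in_features))
        self.register_buffer("eps_b", torch.zeros(out_features))

    def sample_noise(self):
        self.eps_w.normal_()
        self.eps_b.normal_()

    def forward(self, x, noisy=True):
        if noisy:   # perturbed weights during data collection / Actor updates
            w = self.mu_w + self.sigma_w * self.eps_w
            b = self.mu_b + self.sigma_b * self.eps_b
        else:       # deterministic expectation, e.g., for target policy evaluation
            w, b = self.mu_w, self.mu_b
        return F.linear(x, w, b)

layer = NoisyLinear(64, 2)
layer.sample_noise()
print(layer(torch.randn(1, 64)).shape)   # torch.Size([1, 2])
```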
(2) Noise decay mechanism
To mitigate over-exploration in later stages without overriding the learnable scale parameters σ ω and σ b , we apply a weak annealing coefficient only to the sampled perturbations. The specific formula is:
\tilde{\epsilon}_t = \alpha_t\,\epsilon, \qquad \alpha_t = \alpha_0 \exp\!\left(-\frac{t}{\tau}\right), \qquad \epsilon \sim \mathcal{N}(0, 1).
This annealing biases the effective injected noise toward smaller values over time, while σ ω and σ b remain fully trainable and continue to be optimized by backpropagation. In this way, exploration stays strong early on and naturally attenuates as learning progresses, without hard-coding a schedule onto the learnable scales.
(3) Noise smoothing and stability enhancement
To address the issue of noise fluctuations potentially affecting model convergence, an Exponential Moving Average (EMA) smoothing mechanism is introduced, with the following formula:
\bar{\epsilon}_\omega^{(t)} = \sigma\,\bar{\epsilon}_\omega^{(t-1)} + (1 - \sigma)\,\epsilon_\omega^{(t)},
where σ is the EMA decay factor that controls the degree of smoothing. The introduction of EMA dynamically smooths the noise by giving higher weight to the historical noise, effectively reducing the severity of noise fluctuations. This smoothing process not only reduces the interference of noise on policy generation but also enhances the model’s stability and convergence efficiency during training, ensuring robust exploration while avoiding excessive noise interference in model optimization. EMA operates on sampled perturbations; the underlying μ and σ parameters remain fully learnable.
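A compact sketch of the decay and EMA mechanisms applied to the sampled perturbations is given below; α_0, τ, and the EMA factor are placeholders, and the learnable scales σ_ω, σ_b are deliberately left untouched, as described above.

```python
import math
import torch

ALPHA_0, TAU, EMA = 1.0, 2e5, 0.9   # placeholder annealing and smoothing constants

def shaped_noise(step, eps_prev, shape):
    alpha_t = ALPHA_0 * math.exp(-step / TAU)     # weak annealing coefficient
    eps = alpha_t * torch.randn(shape)            # annealed fresh sample ~ N(0, alpha_t^2)
    return EMA * eps_prev + (1.0 - EMA) * eps     # EMA gives higher weight to history

eps = torch.zeros(2)
for t in range(3):
    eps = shaped_noise(t, eps, (2,))
print(eps)
```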

3.2.2. TD3 Network Structure with Noise Layer

To improve the exploration capability and convergence stability of TD3 in dynamic environments, this study integrates a noise layer into the output layer of the Actor network while retaining the dual Q-network architecture in the Critic network [25], forming the overall framework shown in Figure 4.
The Actor network extracts high-dimensional features of the environment state s t through a feature extraction module (Conv2D + ReLU) and passes them through two hidden layers (each containing 64 neurons) to generate the control action a t . To enhance the diversity of policy generation, the output layer is redefined as a learnable noisy linear layer, where both the mean and scale parameters are optimized via backpropagation together with the Actor. The Tanh activation function is applied to constrain the action values within the range [−1,1]. The noisy output layer injects perturbations through trainable mean and scale parameters; under gradient pressure (and the weak decay prior), the learned variance gradually decreases, yielding stable policies in later training.
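A possible realization of this Actor is sketched below, reusing the NoisyLinear layer from Section 3.2.1. The grid size, channel counts, and proprioceptive dimension are assumptions for illustration, not the authors’ exact architecture.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, grid_ch=1, proprio_dim=8, action_dim=2):
        super().__init__()
        # lightweight CNN over the local occupancy grid (Conv2D + ReLU)
        self.cnn = nn.Sequential(
            nn.Conv2d(grid_ch, 8, 3, stride=2), nn.ReLU(),
            nn.Conv2d(8, 16, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # two 64-unit hidden layers on the fused features
        self.mlp = nn.Sequential(
            nn.Linear(16 + proprio_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU())
        # learnable noisy output layer (see NoisyLinear in Section 3.2.1)
        self.out = NoisyLinear(64, action_dim)

    def forward(self, grid, proprio, noisy=True):
        h = torch.cat([self.cnn(grid), proprio], dim=-1)
        return torch.tanh(self.out(self.mlp(h), noisy=noisy))   # actions in [-1, 1]
```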
The Critic network evaluates the value of the action a t generated by the Actor using the dual Q-network structure, which computes two independent Q-values, Q1 and Q2, to address the problem of Q-value overestimation. In reinforcement learning, the Q-value represents the expected future rewards for a given state-action pair. However, using a single Q-value can lead to overestimation, particularly when the value function is approximated from noisy or biased data. To mitigate this, the dual Q-network structure independently estimates the Q-value through Q1 and Q2. The final Q-value used for learning is typically selected as the smaller of the two estimates, helping to prevent overestimation and promoting more stable learning.
These Q-values are computed after concatenating the state s t and action a t , which are processed through fully connected layers (each with 64 neurons). The state features, including both external aspects (e.g., visual information) and internal factors (e.g., speed and direction), are combined multidimensionally to improve the network’s understanding of the environment. This holistic representation of the state enhances the ability of the Critic network to evaluate the action’s value accurately.
Through this design, the integration of the noise layer in the Actor network enhances the exploration capability of the policy network while not interfering with the Q-value evaluation in the Critic network. This provides greater robustness and efficiency for policy optimization in dynamic environments.
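The interaction between the noisy Actor and the noise-free twin Critic can be summarized by the target computation below; the Critic call signature is assumed, and the hyperparameter values are placeholders consistent with canonical TD3.

```python
import torch

GAMMA, POLICY_NOISE, NOISE_CLIP = 0.99, 0.2, 0.5

@torch.no_grad()
def td3_target(r, s2_grid, s2_proprio, done, actor_t, critic1_t, critic2_t):
    # target action from the deterministic expectation of the noisy output layer
    a2 = actor_t(s2_grid, s2_proprio, noisy=False)
    # canonical TD3 target-policy smoothing
    eps = (torch.randn_like(a2) * POLICY_NOISE).clamp(-NOISE_CLIP, NOISE_CLIP)
    a2 = (a2 + eps).clamp(-1.0, 1.0)
    # clipped double-Q: take the smaller of the two target estimates
    q = torch.min(critic1_t(s2_grid, s2_proprio, a2),
                  critic2_t(s2_grid, s2_proprio, a2))
    return r + GAMMA * (1.0 - done) * q
```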

3.3. Collaborative Optimization Mechanism of TD3 and MPC

In dynamic environments, robot path planning involves balancing global planning and local real-time control. The collaboration between TD3 and MPC efficiently integrates both aspects, improving robot navigation performance.

3.3.1. TD3 Provides Short-Horizon Action Rollouts

The Actor network outputs stepwise velocity commands; rolling them out over the control horizon yields a short-horizon reference path that conditions MPC’s local optimization. This reference is then supplied to the MPC module, which adapts it to the current environment for locally feasible execution.

3.3.2. MPC Performs Local Optimization Based on the Short-Horizon Reference

MPC takes the provisional path obtained from TD3’s action rollouts and further optimizes it under kinematic constraints and predicted obstacle motion, producing control commands that are both safe and dynamically feasible. The optimization objective function for MPC is defined as:
J = \sum_{t=0}^{N} \left( q_r \lVert x_t - x_{\mathrm{ref}} \rVert^2 + q_u \lVert u_t \rVert^2 + q_{\mathrm{obs}} \sum_{i=1}^{M} \mathrm{dist}(x_t, o_i) \right),
where x_t is the robot’s current state, x_ref is the reference trajectory generated by TD3, u_t is the control input, and o_i represents the position of the i-th dynamic obstacle. The parameters q_r, q_u, and q_obs are the weights for the reference-trajectory deviation, the control input cost, and the obstacle distance cost, respectively.
By optimizing this objective function, MPC generates smooth and safe local trajectories in dynamic environments, avoiding collisions with obstacles while enhancing the smoothness and robustness of local control [26,27]. The local optimization process of MPC effectively adapts the short-horizon candidate path to the dynamic environment, enabling quick responses to environmental changes and improving reliability.
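As an illustration, a simplified MPC step over this objective is sketched below using a generic NLP solver and the kinematic step() from Section 3.1.1; the weights, horizon, solver choice, and the soft safety-distance form of the obstacle term are assumptions, since the paper does not specify its solver.

```python
import numpy as np
from scipy.optimize import minimize

N, DT = 8, 0.2                                  # prediction horizon and time step
QR, QU, QOBS, D_SAFE = 1.0, 0.05, 5.0, 0.5      # placeholder weights and safe distance

def mpc_cost(u_flat, state, x_ref, obstacles):
    """x_ref: (N, 2) reference positions, e.g., rolled out from TD3 actions."""
    u = u_flat.reshape(N, 2)
    s, cost = state, 0.0
    for t in range(N):
        s = step(s, u[t, 0], u[t, 1])           # roll the kinematic model forward
        pos = np.array(s[:2])
        cost += QR * np.sum((pos - x_ref[t]) ** 2) + QU * np.sum(u[t] ** 2)
        for o in obstacles:                      # soft penalty inside the safe radius
            cost += QOBS * max(0.0, D_SAFE - np.linalg.norm(pos - o)) ** 2
    return cost

def mpc_step(state, x_ref, obstacles):
    u0 = np.zeros(2 * N)
    bounds = [(-0.5, 1.5), (-0.5, 0.5)] * N      # v and omega limits per step
    res = minimize(mpc_cost, u0, args=(state, x_ref, obstacles),
                   bounds=bounds, method="L-BFGS-B")
    return res.x.reshape(N, 2)[0]                # apply only the first control
```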

3.3.3. Dynamic Weight Adjustment for Collaborative Optimization

To coordinate the TD3 planner and MPC controller, a logistic switching function is introduced to dynamically adjust their relative contributions according to the observed obstacle density:
\lambda_t = \frac{1}{1 + \exp\left(-k\,(d_t - d_c)\right)},
where d_t denotes the obstacle density at time step t, d_c is a predefined threshold, and k determines the steepness of the transition curve. When d_t > d_c, the value of λ_t increases, giving MPC greater control authority to ensure collision avoidance; otherwise, TD3 retains a larger share to leverage its long-horizon planning capability.
In this work, d_c = 0.5 and k = 5 were determined through preliminary simulation trials, balancing responsiveness and stability while avoiding oscillatory switching between controllers. Empirical observations indicate that moderate variations (±20%) in these parameters have negligible effects on success rate and trajectory smoothness, suggesting that the method is not highly sensitive to exact parameter values. This mechanism ensures smooth and adaptive transitions between global and local control without manual intervention.
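The switching weight with these reported values can be computed as follows; the negative sign inside the exponential reflects the stated behavior that λ_t grows with obstacle density.

```python
import math

def mpc_weight(d_t, d_c=0.5, k=5.0):
    """Logistic weight giving MPC more authority as obstacle density d_t rises."""
    return 1.0 / (1.0 + math.exp(-k * (d_t - d_c)))

print(mpc_weight(0.3), mpc_weight(0.5), mpc_weight(0.8))  # low / threshold / high density
```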

4. Experiments

The effectiveness of the noise layer replacement scheme in the TD3+MPC model is validated through experiments, focusing on the impact of different noise replacement positions on model performance and comparing it with traditional algorithms. The results are presented through reward curves, map scene evaluations, and performance data tables, highlighting the superiority and practical applicability of the proposed design.

4.1. Experimental Setup

The experiments were conducted on a computing platform with a GPU (NVIDIA RTX 4060) and CPU (Intel i9-12900). The TD3 network received fused onboard sensor data as its state input, combining 360° LiDAR range measurements with odometry and inertial sensor readings to provide both obstacle proximity and motion state information. This setup reflects a realistic onboard sensing configuration for mobile robots, differing from prior works [22,23] that rely on overhead or fixed-view camera data. The action space consisted of continuous control over speed and direction. The hyperparameters used in the training process are shown in Table 1.
These parameters ensured sufficient training and stability for the model. During the training process, multiple parallel environments were used to accelerate data sampling, and moderate policy noise and a discount factor γ were introduced to balance exploration and long-term rewards effectively. These hyperparameters were chosen based on several experimental validations, forming a reliable foundation for subsequent comparative studies. Note that the ‘policy noise’ in Table 1 refers to TD3’s target policy smoothing applied to the target action (as in canonical TD3), not to additional exploration noise in the action space; exploration is primarily provided by the learnable parameter-space noisy output layer, whose mean and scale parameters are optimized via backpropagation while sampled noise drives exploration during data collection.
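For reference, the Table 1 settings can be collected into a single configuration dictionary as sketched below. The key names mirror common TD3 trainer arguments (for example, in Stable-Baselines3, where gradient_steps = −1 means as many gradient steps as environment steps collected per rollout); the mapping to any specific library is an assumption.

```python
# Table 1 hyperparameters as a plain configuration dictionary (illustrative keys).
td3_config = {
    "total_timesteps": 1_000_000,   # 10^6
    "n_envs": 8,
    "gamma": 0.99,
    "tau": 0.005,
    "learning_rate": 0.01,
    "buffer_size": 100_000,         # 10^5
    "batch_size": 32,
    "policy_noise": 0.2,            # target-policy smoothing, not exploration noise
    "gradient_steps": -1,
}
print(td3_config)
```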

4.2. Reward Curve Comparison: Different Noise Layer Replacement Schemes

The primary goal of the noise-layer integration is to enhance exploration during policy learning while ensuring stable convergence of the model. Figure 5 presents representative reward curves for four noise-layer configurations. Each curve shows the smoothed average reward obtained from periodic evaluation rollouts conducted during training, illustrating the general learning dynamics and convergence tendencies. These curves are intended to highlight relative trends, while large-scale statistical robustness is examined through the scenario-based evaluations in Section 4.3.
The baseline TD3 model without noise integration (Figure 5a) converges locally around a reward value of 70. Replacing the Actor output layer with a noise layer (Figure 5b) overcomes this premature convergence and yields a substantially higher final reward. Perturbing both the Actor and Critic outputs (Figure 5c), however, reduces stability and results in a lower final reward, confirming that noise at the Critic side injects variance into Q-value estimation and hinders convergence. Similarly, perturbing both the Actor output and a hidden layer (Figure 5d) initially boosts exploration but later causes sharp fluctuations and even declines, indicating that excessive noise interference outweighs any transient benefit.
These findings support the decision to prioritize the Actor network for noise layer replacement. In the TD3 architecture, the Critic’s primary role is to estimate Q-values, while the Actor is responsible for generating policies. Therefore, modifying the Actor network has a more direct impact on exploration and policy optimization than altering the Critic. For similar reasons, this study did not explore noise replacement in the input layer, as it could distort the state feature representation and hinder environmental perception.
Unlike conventional TD3 implementations that inject i.i.d. external action noise, the proposed method embeds a learnable, reparameterized noisy output layer in the Actor. Both mean and scale parameters are optimized by backpropagation, while sampled perturbations are applied during data collection and optionally frozen per minibatch during policy updates to control gradient variance. For target policy evaluation, the deterministic expectation of the noisy layer is used to preserve the stability of TD3. This mechanism integrates exploration directly into the learning pipeline and provides trajectory-level consistency, fundamentally distinguishing it from post hoc action noise.

4.3. Algorithm Performance Evaluation: Comparison Across Multiple Map Scenes

To validate the performance of the TD3+MPC algorithm with Actor output layer replacement in various environments, we designed three representative evaluation scenes (Scenes A, B, and C), shown in Figure 6, which are more challenging than the evaluation environments used in previous studies [23], featuring more dynamic obstacles, more complex obstacle distributions, and higher task difficulty. The aim is to comprehensively test the algorithm’s adaptability in dynamic environments, narrow passages, and dense obstacle scenarios.
The experiments compare the performance of the traditional pure MPC method, the TD3 algorithm, and other deep reinforcement learning-based combinations. The performance evaluation metrics include computation time, action smoothness, minimum safety distance (clearance), task completion time steps, and success rate. Action smoothness is quantified by the mean absolute change in both linear and angular velocities between consecutive time steps, reflecting the stability of robot motion control. Among these, clearance refers to the minimum distance between the robot’s trajectory and the nearest obstacle, reflecting the safety margin of path planning. The results of these evaluations are summarized in Table 2.
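The two less standard metrics can be computed as in the sketch below; array shapes and function names are illustrative.

```python
import numpy as np

def smoothness(controls):
    """Mean absolute change in linear/angular velocity between consecutive steps."""
    u = np.asarray(controls)                     # shape (T, 2): columns v, omega
    dv, dw = np.abs(np.diff(u, axis=0)).mean(axis=0)
    return dv, dw

def clearance(trajectory, obstacles):
    """Minimum robot-obstacle distance along the whole trajectory."""
    traj = np.asarray(trajectory)[:, None, :]    # (T, 1, 2)
    obs = np.asarray(obstacles)[None, :, :]      # (1, M, 2)
    return np.linalg.norm(traj - obs, axis=-1).min()

print(smoothness([[0.10, 0.00], [0.12, 0.05], [0.15, 0.02]]))
```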
In Scene A, representing a static and structured obstacle environment, MPC and MPC+DDPG both maintained full success rates, while MPC+DQN achieved slightly lower stability. TD3 alone performed poorly due to its lack of local optimization. By contrast, MPC+TD3 also achieved a 100% success rate and demonstrated reduced computational burden per step compared with pure MPC. Although its completion steps were somewhat longer than those of MPC and MPC+DQN, the trajectories remained stable and collision-free, indicating that the hybrid method can preserve safety while improving efficiency in controller execution.
In Scene B with randomized dynamic obstacles, MPC+TD3 achieved a higher success rate than all optimization-based baselines, approaching the robustness of standalone TD3 but with significantly improved safety margins. While its action smoothness and completion steps were less favorable, this reflected proactive velocity adjustments and pre-emptive slowdowns in response to obstacle interactions. Such behavior emphasizes that the hybrid method prioritizes collision avoidance and adaptability under uncertainty, ensuring reliable navigation even at the expense of perfect smoothness.
Table 2. Evaluation results (average over 100 runs).

Scene | Method | Computation Time (ms/step) | Action Smoothness (Speed) | Action Smoothness (Angular) | Clearance (m) | Finish Time Step | Success Rate (%)
Scene A | MPC [4] | 17.28 | 0.03 | 0.01 | 1.16 | 312.8 | 100
Scene A | TD3 [28] | 3.07 | 0.50 | 0.99 | 0.48 | 286 | 2
Scene A | MPC+DQN [22] | 21.64 | 0.01 | 0.03 | 0.73 | 130.07 | 90
Scene A | MPC+DDPG [23] | 16.91 | 0.03 | 0.01 | 1.15 | 336.2 | 100
Scene A | MPC+TD3 | 14.85 | 0.03 | 0.05 | 1.16 | 330.96 | 100
Scene B | MPC | 17.52 | 0.03 | 0.07 | 0.77 | 103.55 | 76
Scene B | TD3 | 1.93 | 0.10 | 0.31 | 1.11 | 569.79 | 94
Scene B | MPC+DQN | 22.84 | 0.02 | 0.04 | 0.74 | 126.49 | 80
Scene B | MPC+DDPG | 18.20 | 0.04 | 0.08 | 0.78 | 108.36 | 60
Scene B | MPC+TD3 | 18.13 | 0.52 | 0.48 | 0.74 | 147.85 | 87
Scene C | MPC | / | / | / | / | / | 0
Scene C | TD3 | 2.14 | 0.15 | 0.20 | 0.45 | 126.38 | 16
Scene C | MPC+DQN | 22.17 | 0.02 | 0.03 | 0.75 | 132.87 | 76
Scene C | MPC+DDPG | 18.45 | 0.02 | 0.05 | 0.72 | 147.11 | 90
Scene C | MPC+TD3 | 14.22 | 0.03 | 0.06 | 0.78 | 107.33 | 98
Scene C, involving densely cluttered obstacles, posed the most difficult challenge. Here, MPC failed to complete the task, and standalone TD3 exhibited very low stability. MPC+DQN and MPC+DDPG achieved higher success rates, but both suffered from longer completion times and unstable motion patterns. By contrast, MPC+TD3 reached the highest success rate among all tested methods and required noticeably fewer completion steps than MPC+DDPG, while maintaining superior clearance. These results confirm that the hybrid approach can effectively combine TD3’s exploratory capability with MPC’s reactive refinement to produce efficient and safe navigation even in the most complex scenarios.
Overall, MPC+TD3 consistently demonstrated significant advantages across all scenarios. Compared to standalone TD3 or traditional optimization methods, MPC+TD3 integrates the global exploration capabilities of reinforcement learning with the local optimization strength of MPC, exhibiting enhanced adaptability and robustness in complex, dynamic environments. Furthermore, compared to recently proposed methods like MPC+DQN and MPC+DDPG, MPC+TD3 not only achieved higher success rates in challenging scenarios but also outperformed in key metrics such as completion time and trajectory smoothness. These improvements are attributed to the twin-Critic architecture with delayed policy updates in TD3, together with the learnable parameter-space noisy output layer that structurally integrates exploration into the Actor. This synergy enhances strategy generation and ensures both adaptability and stability in complex dynamic environments. In conclusion, MPC+TD3 offers an effective solution for path planning in complex dynamic environments and provides a promising direction for further integration of reinforcement learning with MPC.

4.4. Real-World Platform Validation Experiment

To further verify the adaptability and effectiveness of the proposed path planning algorithm in real-world conditions, a series of physical experiments were conducted using a self-assembled mobile robot platform. The platform consists of a four-wheel differential drive chassis, an onboard embedded control unit, a power system, and multiple environmental sensors, enabling scene perception and autonomous decision-making capabilities.
The input state of the TD3 agent consists of 360° LiDAR range data, robot odometry (linear and angular velocity), and IMU readings. These proprioceptive and exteroceptive signals are fused to represent the robot’s state in the policy network. An RGB camera was mounted only for visualization and qualitative analysis during real-world experiments, but it was not used as an input to the proposed algorithm. In addition, the system integrates IMU and encoder data to achieve accurate pose estimation and real-time feedback. The overall architecture is built on the Robot Operating System (ROS), supporting seamless deployment and online control of the proposed model. The test environment was set up in a structured indoor laboratory space with a smooth floor and various obstacles arranged to simulate representative indoor navigation scenarios. By deploying the proposed TD3+MPC hybrid algorithm with a noise-layer-optimized Actor network, the robot was able to autonomously generate navigation trajectories and perform smooth obstacle avoidance during motion.
As shown in Figure 7, the robot autonomously planned its path and successfully navigated through multiple static obstacles, with clear temporal progression captured in each frame. The generated trajectory exhibited strong continuity and smooth transitions, while no collisions occurred throughout the experiment. These results confirm the robustness and practical applicability of the proposed method under real-world deployment conditions.

5. Conclusions

This study presented a hybrid optimization framework that integrates TD3 with an output-layer noise mechanism and MPC to address the problem of mobile robot path planning in complex dynamic environments. The embedded noise layer enhances the consistency of exploration during policy learning, while MPC provides real-time local optimization for obstacle avoidance. In this framework, TD3 provides adaptive guidance through learned local actions, and MPC subsequently fine-tunes these actions into executable trajectories, ensuring both long-term adaptability and short-term safety. This division explicitly avoids claiming any global plan from local observations, while preserving long-horizon adaptability through learning and short-term safety via optimization. The proposed approach achieved higher success rates, smoother trajectories, and more efficient task completion compared with baseline schemes in static, dynamic, and densely cluttered scenarios. Overall, the incorporation of a learnable parametric noise layer at the Actor output should be regarded as a practical strategy for enhancing exploration consistency and training stability in TD3. When combined with MPC, the proposed framework demonstrates clear advantages in terms of success rate and safety, particularly in dynamic and densely cluttered environments. While some efficiency or smoothness indicators may be traded off where safety and success rate are prioritized over motion smoothness, the method achieves balanced and reliable performance, offering a promising solution for complex real-world navigation tasks.

Author Contributions

X.H. and Q.Z.: writing—original draft preparation, manuscript, investigation, analysis, and figure preparation. Y.Y. and T.Z.: conceptualization, methodology. Z.X. and T.W.: analyzed data and supervision. J.J.: revision and funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key Research and Development Program (No. 2025YFE0199900), the National Natural Science Foundation of China (No. 62376261), the National Natural Science Foundation of Guangdong Province (No. 2024A1515011754), the Key Program of Concept Validation Foundation of Xidian University Hangzhou Research Institute (No. GNYZ2024QC014), and Shenzhen Technology Project (No. KJZD20240903100000001).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Qin, H.; Shao, S.; Wang, T.; Yu, X.; Jiang, Y.; Cao, Z. Review of autonomous path planning algorithms for mobile robots. Drones 2023, 7, 211. [Google Scholar] [CrossRef]
  2. Sánchez-Ibáñez, J.R.; Pérez-del-Pulgar, C.J.; García-Cerezo, A. Path planning for autonomous mobile robots: A review on path planning algorithms for autonomous ground vehicles and other surface-moving robots. Sensors 2021, 21, 7898. [Google Scholar] [CrossRef]
  3. Wang, B.; Liu, Z.; Li, Q.; Prorok, A. Mobile robot path planning in dynamic environments through globally guided reinforcement learning. IEEE Robot. Autom. Lett. (RA-L) 2020, 5, 6932–6939. [Google Scholar] [CrossRef]
  4. Lindqvist, B.; Mansouri, S.S.; Agha-mohammadi, A.-A.; Nikolakopoulos, G. Nonlinear MPC for collision avoidance and control of UAVs with dynamic obstacles. IEEE Robot. Autom. Lett. (RA-L) 2020, 5, 6001–6008. [Google Scholar] [CrossRef]
  5. Sun, D.; Jamshidnejad, A.; De Schutter, B. A novel framework combining MPC and deep reinforcement learning with application to freeway traffic control. IEEE Trans. Intell. Transp. Syst. (T-ITS) 2024, 25, 6756–6769. [Google Scholar] [CrossRef]
  6. Jiang, S.; Tran, C.Q.; Keyvan-Ekbatani, M. Regional route guidance with realistic compliance patterns: Application of deep reinforcement learning and MPC. Transp. Res. Part C Emerg. Technol. (TR-C) 2024, 158, 104440. [Google Scholar] [CrossRef]
  7. Li, Z.; Shi, N.; Zhao, L.; Zhang, M. Deep reinforcement learning path planning and task allocation for multi-robot collaboration. Alex. Eng. J. 2024, 109, 408–423. [Google Scholar] [CrossRef]
  8. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
  9. Li, J.; Chen, Y.; Zhao, X.; Huang, J. An improved DQN path planning algorithm. J. Supercomput. 2022, 78, 616–639. [Google Scholar] [CrossRef]
  10. Lyu, P. Robot path planning algorithm with improved DDPG algorithm. Int. J. Interact. Des. Manuf. (IJIDeM) 2024, 19, 1123–1133. [Google Scholar] [CrossRef]
  11. Tan, Y.; Lin, Y.; Liu, T.; Min, H. PL-TD3: A dynamic path planning algorithm for mobile robots. In Proceedings of the 2022 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Prague, Czech Republic, 9–12 October 2022; pp. 3040–3045. [Google Scholar] [CrossRef]
  12. Zhang, Y.; Zhao, W.; Wang, J.; Yuan, Y. Recent progress, challenges, and future prospects of applied deep reinforcement learning: A practical perspective in path planning. Neurocomputing 2024, 608, 128423. [Google Scholar] [CrossRef]
  13. Chen, Y.; Ji, C.; Cai, Y.; Yan, T.; Su, B. Deep reinforcement learning in autonomous car path planning and control: A survey. arXiv 2024, arXiv:2404.00340. [Google Scholar] [CrossRef]
  14. Almazrouei, K.; Kamel, I.; Rabie, T. Dynamic obstacle avoidance and path planning through reinforcement learning. Appl. Sci. 2023, 13, 8174. [Google Scholar] [CrossRef]
  15. Eckhoff, M.; Kirschner, R.J.; Kern, E.; Abdolshah, S.; Haddadin, S. An MPC framework for planning safe & trustworthy robot motions. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 4737–4742. [Google Scholar] [CrossRef]
  16. Ramezani, M.; Habibi, H.; Sanchez-Lopez, J.L.; Voos, H. UAV path planning employing MPC-reinforcement learning method considering collision avoidance. In Proceedings of the 2023 International Conference on Unmanned Aircraft Systems (ICUAS), Warsaw, Poland, 6–9 June 2023; pp. 507–514. [Google Scholar] [CrossRef]
  17. Zhao, Y. Automatic parking planning control method based on improved A* algorithm. arXiv 2024, arXiv:2406.15429. [Google Scholar] [CrossRef]
  18. Yan, C.; Xiang, X.; Wang, C. Towards real-time path planning through deep reinforcement learning for a UAV in dynamic environments. J. Intell. Robot. Syst. 2020, 98, 297–309. [Google Scholar] [CrossRef]
  19. Li, P.; Chen, D.; Wang, Y.; Zhang, L.; Zhao, S. Path planning of mobile robot based on improved TD3 algorithm in dynamic environment. Heliyon 2024, 10, e32167. [Google Scholar] [CrossRef] [PubMed]
  20. Han, S.; Zhou, W.; Lu, J.; Liu, J.; Lu, S. NROWAN-DQN: A stable noisy network with noise reduction and online weight adjustment for exploration. Expert Syst. Appl. (ESWA) 2022, 203, 117343. [Google Scholar] [CrossRef]
  21. Zhang, B.; Huang, C.; Zhang, H.; Bai, X. Switching pushing skill combined MPC and deep reinforcement learning for planar non-prehensile manipulation. arXiv 2023, arXiv:2303.17379. [Google Scholar] [CrossRef]
  22. Zhang, Z.; Cai, Y.; Ceder, K.; Enliden, A.; Eriksson, O.; Kylander, S.; Sridhara, R.; Åkesson, K. Collision-free trajectory planning of mobile robots by integrating deep reinforcement learning and model predictive control. In Proceedings of the 2023 IEEE 19th International Conference on Automation Science and Engineering (CASE), Auckland, New Zealand, 26–30 August 2023; pp. 1–7. [Google Scholar] [CrossRef]
  23. Ceder, K.; Zhang, Z.; Burman, A.; Kuangaliyev, I.; Mattsson, K.; Nyman, G.; Petersén, A.; Wisell, L.; Åkesson, K. Bird’s-eye-view trajectory planning of multiple robots using continuous deep reinforcement learning and model predictive control. In Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, United Arab Emirates, 14–18 October 2024; pp. 1–8. [Google Scholar] [CrossRef]
  24. Plappert, M.; Houthooft, R.; Dhariwal, P.; Sidor, S.; Chen, R.Y.; Chen, X.; Asfour, T.; Abbeel, P.; Andrychowicz, M. Parameter space noise for exploration. arXiv 2017, arXiv:1706.01905. [Google Scholar] [CrossRef]
  25. Zhang, S.; Li, Y.; Dong, Q. Autonomous navigation of UAV in multi-obstacle environments based on a deep reinforcement learning approach. Appl. Soft Comput. 2022, 115, 108194. [Google Scholar] [CrossRef]
  26. Gaertner, M.; Bjelonic, M.; Farshidian, F.; Hutter, M. Collision-free MPC for legged robots in static and dynamic scenes. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 8266–8272. [Google Scholar] [CrossRef]
  27. Lotfi, F.; Virji, K.; Faraji, F.; Berry, L.; Holliday, A.; Meger, D.; Dudek, G. Uncertainty-aware hybrid paradigm of nonlinear MPC and model-based RL for offroad navigation: Exploration of transformers in the predictive model. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 2925–2931. [Google Scholar]
  28. Li, P.; Wang, Y.; Gao, Z. Path planning of mobile robot based on improved TD3 algorithm. In Proceedings of the 2022 IEEE International Conference on Mechatronics and Automation (ICMA), Guilin, China, 7–10 August 2022; pp. 715–720. [Google Scholar] [CrossRef]
Figure 1. Schematic of the path planning process using TD3 and MPC for dynamic obstacle avoidance. The TD3 module provides a sequence of local actions that together approximate a reference path, while MPC continuously adjusts this path in response to environmental changes, ensuring collision-free navigation. The dynamic weight adjuster fine-tunes the final trajectory to avoid both static and dynamic obstacles.
Figure 2. Proposed framework integrating TD3 with MPC for dynamic obstacle avoidance.
Figure 3. Graphical Representation of Linear Layer and Noisy Linear Layer.
Figure 4. TD3 Network with noise layer.
Figure 5. Training reward curves under four noise-layer configurations. Each curve shows the smoothed average reward from periodic evaluation rollouts during training. (a) baseline TD3 without noise; (b) Actor output with noise layer; (c) Actor + Critic outputs with noise layers; (d) Actor output + one hidden layer with noise layers.
Figure 6. Evaluation Maps: (a) Scene A: Static regular obstacle distribution; (b) Scene B: Dynamic random obstacle distribution; (c) Scene C: High-density random obstacle distribution.
Figure 7. Real-world deployment and validation of the self-assembled mobile robot platform. The time-stamped images illustrate the robot’s real-time navigation and obstacle avoidance process in a structured indoor environment.
Table 1. Parameters of the training experiment.

Parameters | Value
Total Timesteps | 10^6
Number of Parallel Environments | 8
Discount Factor γ | 0.99
Soft Update Factor τ | 0.005
Learning Rate | 0.01
Replay Buffer Size | 10^5
Batch Size | 32
Policy Noise | 0.2
Gradient Steps | −1

