Article

Research on Mobile Agent Path Planning Based on Deep Reinforcement Learning

College of Information Science and Engineering, Hunan Institute of Engineering, Xiangtan 411104, China
*
Author to whom correspondence should be addressed.
Systems 2025, 13(5), 385; https://doi.org/10.3390/systems13050385
Submission received: 9 April 2025 / Revised: 9 May 2025 / Accepted: 15 May 2025 / Published: 16 May 2025

Abstract

For mobile agent path planning, traditional path planning algorithms frequently induce abrupt variations in path curvature and steering angles, increasing the risk of lateral tire slippage and undermining operational safety. Concurrently, conventional reinforcement learning methods struggle to converge rapidly, so their planning efficiency falls short of energy-economy demands. This study proposes LSTM Bézier–Double Deep Q-Network (LB-DDQN), an advanced path-planning framework for mobile agents based on deep reinforcement learning. The architecture first enables mapless navigation through a DDQN foundation, subsequently integrates long short-term memory (LSTM) networks for the fusion of environmental features and preservation of training information, and ultimately enhances the path’s quality through redundant node elimination via an obstacle–path relationship analysis, combined with Bézier curve-based trajectory smoothing. A sensor-driven three-dimensional simulation environment featuring static obstacles was constructed using the ROS and Gazebo platforms, where LiDAR-equipped mobile agent models were trained for real-time environmental perception and strategy optimization prior to deployment on experimental vehicles. The simulation and physical implementation results reveal that LB-DDQN achieves effective collision avoidance, while demonstrating marked enhancements in critical metrics: the path’s smoothness, energy efficiency, and motion stability exhibit average improvements exceeding 50%. The framework further maintains superior safety standards and operational efficiency across diverse scenarios.

1. Introduction

In recent years, the integration of informatization and industrialization, coupled with rapid advancements in autonomous vehicle technology, has enabled the widespread application of unmanned vehicles across diverse production and daily-life-related domains [1]. As a core component of autonomous vehicle research, path planning must not only effectively avoid obstacles during motion but also ensure rapid and secure arrival at target destinations [2]. Path-planning methods can be classified into global planning, which assumes complete environmental knowledge, and local planning, which operates under partial or no prior environmental data. Traditional navigation algorithms rely heavily on environmental maps; for instance, the A* and Dijkstra algorithms utilize graph search-based path planning. In dynamic or open environments where mobile agents cannot acquire complete maps, these methods suffer from inefficiency. Current trends in mobile agents and autonomy demand mobile robots with learning capabilities to adapt to evolving environments. To address this issue, researchers have proposed deep reinforcement learning (DRL), which integrates deep learning (DL) with reinforcement learning (RL) [3,4]. In this framework, DL primarily leverages the perceptual capabilities of neural networks to extract features from unknown environmental states, thereby approximating the state–action value function. Meanwhile, RL is responsible for decision-making based on the outputs of the deep neural network and a predefined exploration strategy [5], effectively mapping states to corresponding actions. This hybrid approach demonstrates superior performance in meeting the mobility requirements of intelligent agents.
Research on mobile robot path planning has garnered significant attention from scholars worldwide [6,7]. Among the various methodologies, RL [8] stands out as an effective approach. RL enables mobile robots to autonomously select actions based on their current states, iteratively refining paths through trial-and-error interactions with environmental feedback, thereby enabling end-to-end planning. This method optimizes route selection strategies, while supporting autonomous and online learning. However, deep learning exhibits limitations in decision-making for the selection of actions, resulting in insufficient adaptability to complex spatial scenarios and an inability to fully meet the requirements of mobile agents. To address this, Google Brain integrated the strengths of deep learning and RL, introducing DRL [9,10], which has opened new research perspectives for path planning in complex environments. In 2013, Mnih et al. [11] combined convolutional neural networks with the Q-learning algorithm [12], proposing a Deep Q-Network (DQN) to address high-dimensional perception-based decision-making. This approach utilized DQN to approximate Q-values in Q-learning, mitigating the curse of dimensionality and accelerating network convergence. In 2016, Van Hasselt et al. [13] developed the Double Deep Q-Network (DDQN), employing dual parameter sets to decouple action selection from policy evaluation, effectively resolving the Q-value overestimation in DQN. In 2017, Xin et al. [14] pioneered a DQN-based path-planning method for mobile agents, training a DQN to approximate state–action pairs and output Q-values for actions (e.g., left/right turns, forward motion). However, the maximization operation during target Q-network updates induces overestimated action values, leading to local optima and slow convergence. While DQN has achieved remarkable success in robotic path planning, its action–value overestimation remains a critical limitation. The DDQN algorithm addresses this by separating action selection and policy evaluation [15]. Zhu et al. [16] further decoupled the computation of target Q-values and the selection of actions in DDQN, eliminating overestimation while enhancing the utilization of samples via experience replay buffers, thereby accelerating neural network training. Consequently, the proposed LB-DDQN method demonstrates unique advantages in handling high-dimensional state and action spaces. This study employs a literature review, simulation experiments, and comparative analysis to construct an LB-DDQN-based vehicle path-planning model, validating its performance in simulated environments. Comparative evaluations against traditional path-planning algorithms and DDQN demonstrate the proposed method’s superiority. Therefore, the primary contributions of this paper are as follows:
  • Environmental Modeling: Constructing a 3D scenario in Gazebo for mobile agent operations. After Simultaneous Localization and Mapping (SLAM)-based mapping, LiDAR scans generate grid maps that are optimized via inflation algorithms to enhance the path’s safety;
  • Path-Planning Framework: Integrating LSTM networks and heuristic reward mechanisms with DDQN to achieve mobile agent path planning;
  • Trajectory Optimization: Refining autonomous decision-making models through the Bézier curve-based smoothing of DDQN agent–environment interactions, validated via simulations in the Gazebo and Rviz tools under ROS.

2. Path-Planning Algorithm

2.1. DDQN Method

The DQN algorithm enhances Q-learning by employing deep neural networks to approximate action-value functions, while integrating key mechanisms such as experience replay and target networks. However, it fails to address the inherent limitation of Q-learning’s overestimation of Q-values. This overestimation stems from the maximization operation within the DQN framework, resulting in suboptimal policies due to inflated action-value estimates.
The DDQN algorithm effectively mitigates the overestimation issue in value function approximation. Unlike the DQN algorithm, which both selects and evaluates actions using the same parameters θ, DDQN decouples these two processes.
When the DQN algorithm computes the TD target, action selection is performed first: an action α* is chosen as the action corresponding to the maximum of the value function Q(S_{t+1}, α; θ) at the state S_{t+1}. Action evaluation then means that, after α* has been selected, its action value is used to compute the TD target.
In contrast, DDQN uses two different parameter sets for these steps: the estimation network θ selects the action α*, while the target network θ′ evaluates it.
The updated TD optimization objective is as follows:
Y_t^DDQN = r + γ Q(S_{t+1}, argmax_α Q(S_{t+1}, α; θ); θ′)
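For illustration, the decoupled target above can be computed as in the following minimal PyTorch-style sketch; the names q_net, target_net, and the tensor shapes are placeholder assumptions rather than details taken from the paper.

```python
import torch

def ddqn_target(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Illustrative DDQN TD target: the online (estimation) network selects the
    action, the target network evaluates it (names are placeholders)."""
    with torch.no_grad():
        # Action selection with the online network parameters theta
        best_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        # Action evaluation with the target network parameters theta'
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)
        # Terminal transitions contribute only the immediate reward
        return rewards + gamma * next_q * (1.0 - dones)
```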

2.2. LSTM

When implementing the DDQN algorithm for LiDAR-based mapless path planning, the lack of global map information compels agents to conduct extensive exploratory trials to learn optimal decisions in novel obstacle scenarios. Such a process frequently induces inconsistent actions and value function divergence, thereby compromising decisions’ stability.
Specifically, the robot may generate suboptimal or erroneous decisions when repeatedly encountering identical obstacles, which significantly degrades its planning efficiency. To address this limitation, this study integrates an LSTM network into the DDQN decision model to process sequential environmental state inputs. As a variant of recurrent neural networks, the LSTM architecture, illustrated in Figure 1, employs gate mechanisms to regulate the retention or discarding of information, thereby enabling selective memory and forgetting. This capability renders LSTM particularly suitable for partially observable reinforcement learning scenarios, where historical state dependencies are critical for robust decision-making. The LSTM network regulates information flow through its input gate, forget gate, and output gate, enabling it to effectively capture long-term dependencies [17]. In the context of path planning, traditional methods often struggle with long-sequence processing due to the vanishing gradient problem. In contrast, mobile intelligent agents require a decision-making mechanism that adapts to their current state based on historical environmental observations. Consequently, LSTM’s forget gate dynamically determines whether to retain or discard historical information, thereby enhancing its temporal memory capability. By memorizing past states, LSTM optimizes path-planning decisions in a principled manner.
Based on the DDQN decision model, an LSTM network is introduced to solve the problems of poor decision-making and slow planning caused by partial observability. By incorporating the LSTM network, successive action selections become temporally correlated, which increases the stability of decision-making. The framework of the LSTM-DDQN model is illustrated in Figure 2. The LSTM-DDQN framework uses a dual-network structure consisting of an estimation network and a target network with identical architectures. The current state information is input into the estimation network, whose parameters are updated in real time; these parameters are then periodically synchronized with the target network to maintain policy consistency. The encoded state representations are retrieved via the memory cell module, propagated through a bidirectional two-layer LSTM architecture for the extraction of temporal features, and finally mapped to actions by a four-layer fully connected network. This gives the mobile agent memory, more stable decisions when encountering the same obstacles, and an enhanced ability to find the target point and avoid static and dynamic obstacles, resulting in a less volatile converged reward curve and thus a better planned path.
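A minimal sketch of such a value network is shown below, combining a bidirectional two-layer LSTM with a fully connected head as described above; the hidden-layer sizes and class name are assumptions, not values reported in the paper.

```python
import torch.nn as nn

class LSTMDQNet(nn.Module):
    """Illustrative LSTM-DDQN value network: bidirectional two-layer LSTM over the
    state sequence followed by a fully connected head (layer sizes are assumed)."""
    def __init__(self, state_dim, n_actions=5, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=state_dim, hidden_size=hidden,
                            num_layers=2, bidirectional=True, batch_first=True)
        self.head = nn.Sequential(                 # four-layer fully connected head
            nn.Linear(2 * hidden, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state_seq):
        # state_seq: (batch, time, state_dim); use the last time step's features
        features, _ = self.lstm(state_seq)
        return self.head(features[:, -1, :])       # Q-values for the discrete actions
```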

2.3. Bézier Curve Smoothing Optimization

Although DDQN generates feasible paths, those derived from discrete actions may exhibit jagged trajectories, and in real situations the mobile agent may have to slow down near inflection points to ensure its own safety. To shorten the running time of the agent and improve its operating efficiency, this paper optimizes the path with Bézier curves. First, key turning points are extracted from the original path to serve as control points for the construction of the Bézier curve, and their number is dynamically adjusted according to the path’s complexity; the path trajectory is then smoothed with a third-order Bézier curve so that its curvature is continuously differentiable. Bézier curves generate smooth trajectories through control points, with their mathematical foundation rooted in polynomial interpolation. A cubic Bézier curve is defined by four points: two endpoints and two control points. By strategically adjusting the positions of the control points, diverse curve geometries can be synthesized [18,19]. This formulation inherently guarantees continuity of curvature, producing smooth trajectories that minimize abrupt directional changes. Such properties make Bézier curves particularly suitable for accommodating vehicle dynamic constraints. The formula is given in Equation (2), where B(t) is a point on the curve; t is a parameter with 0 ≤ t ≤ 1; P_0 and P_3 are the endpoints of the curve; and P_1 and P_2 are the control points. The third-order Bézier curve is shown in Figure 3.
B(t) = (1 − t)³ P_0 + 3(1 − t)² t P_1 + 3(1 − t) t² P_2 + t³ P_3
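As a minimal sketch, Equation (2) can be sampled as follows; the waypoint values in the usage example are illustrative only.

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, n_samples=50):
    """Sample a cubic Bezier curve defined by endpoints p0, p3 and control
    points p1, p2 (Equation (2)); inputs are 2D points as numpy arrays."""
    t = np.linspace(0.0, 1.0, n_samples)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

# Example: smooth a sharp corner between illustrative waypoints
smooth = cubic_bezier(np.array([0.0, 0.0]), np.array([1.0, 0.0]),
                      np.array([1.0, 1.0]), np.array([2.0, 1.0]))
```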

2.4. LB-DDQN Path-Planning Algorithm

To address agent path-planning problems, including low policy convergence efficiency and the bottlenecks in driving safety and performance, this study proposes a path-planning optimization framework that integrates deep reinforcement learning with kinematic constraints. The core innovation of this multimodal fusion of agent decision-making architectures is a synergistic optimization mechanism coupling deep reinforcement learning with spatio-temporal memory, together with a dynamic-constraint layer applied after path optimization. The LSTM memory network gives the mobile agent memory and mitigates the poor decisions that arise when the agent cannot obtain complete environmental characteristics under a partially observable Markov decision process. The DDQN method solves the overestimation problem of DQN by using different value functions for strategy selection and strategy evaluation. Through the cascade architecture of the Double Deep Q-Network (DDQN) and the long short-term memory network (LSTM), the model realizes dual perception of state–space feature extraction and environmental–temporal correlations, reduces invalid nodes generated during planning, and then smooths the path trajectories with third-order Bézier curves to address the problem of low computational efficiency. The block diagram of the LSTM-DDQN-based mapless path-planning design for mobile agents is shown in Figure 4.
The agent senses the surrounding environment and obstacle information through LiDAR and combines this with its own state information and the target point information. The simulated 2D planar laser point-cloud data are input into the LSTM-DDQN model framework, the model output is used to compute the Q-values, and an appropriate action is selected and executed through a greedy strategy as the agent interacts with the environment. Feedback is received from the environment through a continuous reward function enriched with heuristic knowledge, guiding the mobile agent path-planning task toward the maximum gain. After a series of waypoints is generated, the discrete path points are fitted with a Bézier curve to produce an executable trajectory; finally, the agent executes the trajectory and feeds back the new state, forming a continuous interaction loop.
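The interaction cycle described above can be sketched as follows; this is an illustrative outline only, and the env, model, and smooth_path interfaces are hypothetical placeholders rather than the paper’s implementation.

```python
import random

def plan_step(env, model, smooth_path, epsilon=0.1):
    """One illustrative LB-DDQN interaction cycle (placeholder APIs): sense,
    select an action epsilon-greedily from LSTM-DDQN Q-values, act, and smooth
    the accumulated waypoints with a Bezier curve."""
    state = env.observe()                      # LiDAR scan + own pose + goal
    q_values = model.q_values(state)           # LSTM-DDQN forward pass
    if random.random() < epsilon:              # exploration
        action = random.randrange(len(q_values))
    else:                                      # exploitation
        action = max(range(len(q_values)), key=lambda a: q_values[a])
    next_state, reward, done, waypoints = env.step(action)
    trajectory = smooth_path(waypoints)        # third-order Bezier smoothing
    return next_state, reward, done, trajectory
```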

3. Design of Path Planning for Mobile Agents Based on the LB-DDQN Approach

3.1. ROS Physical Engine Validation

Gazebo is robot simulation software that is often used in conjunction with ROS to provide a high-quality simulation environment. Using the LiDAR data of the simulation model in Gazebo, a map of the surrounding environment is constructed with the SLAM algorithm, and the construction process is displayed in Rviz; the SLAM algorithm graph-building process is shown in Figure 5.

3.2. Cost Map

So that the unmanned vehicle can drive more safely in the constructed map and keep a certain distance from obstacles while moving, the grid map built after mapping is complete must be processed in several steps to construct a new map, global_costmap, on top of the original map. The cost map can be configured with multiple layers that expand upon the static map. The mathematical nature of the operation is a Minkowski sum:
D(M) = M ⊕ B = { p ∈ ℤ² | B_p ∩ M ≠ ∅ }
Within the algorithm, M ⊆ ℤ² denotes the original binary grid map, B represents the structuring element (typically a symmetric convex set), and B_p denotes the translation of the structuring element to position p. The grid map dilation algorithm is implemented as illustrated in Figure 6.
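A minimal sketch of this inflation step, assuming a SciPy-based implementation, is given below; the disc-shaped structuring element and the inflation radius in cells are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def inflate_obstacles(occupancy, radius_cells=3):
    """Dilate an occupancy grid (True = obstacle) with a disc-shaped structuring
    element, i.e. the Minkowski sum M ⊕ B of Equation (3); the radius is assumed."""
    y, x = np.ogrid[-radius_cells:radius_cells + 1, -radius_cells:radius_cells + 1]
    disc = x ** 2 + y ** 2 <= radius_cells ** 2   # symmetric convex structuring element B
    return binary_dilation(occupancy, structure=disc)

# Example: inflate a single obstacle cell in a 10 x 10 map
grid = np.zeros((10, 10), dtype=bool)
grid[5, 5] = True
inflated = inflate_obstacles(grid, radius_cells=2)
```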

3.3. Spatial Design and Reward Functions

While operating, the mobile agent needs to reach the specified target point from the starting point and avoid collisions with obstacles during its motion so as to complete the path-planning task. The state–space information of the agent includes the LiDAR information, the agent’s own position, and the target position, expressed as Equation (4):
s = (s_scan, s_position, s_goal)
Although the actual motion of the mobile agent is continuous, the algorithm achieves faster convergence in reinforcement learning by decomposing the continuous motion into discrete actions. The action space of the mobile agent is divided into five discrete actions: fast left turn, left turn, straight ahead, right turn, and fast right turn. This satisfies both the feasibility requirements of the path-planning task and the maneuverability of the agent, and is expressed as Equation (5):
A = [α_1, α_2, α_3, α_4, α_5]
During the path-planning process, the selected action is evaluated by a reward function as the mobile agent interacts with the environment. The agent selects an action based on its current state and transfers to a new state according to the response of the environment. This process generates a reward signal, whose maximization is the goal in path planning. The reward signal is provided by the reward function shown in Equation (6), which drives reinforcement learning to maximize the mobile agent’s gain during path planning.
R =
  80,   if reach goal
  −10,  if collision
  0,    otherwise
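Expressed as code, Equation (6) can be sketched as follows; the goal and collision distance thresholds are illustrative assumptions, and the negative sign of the collision term follows the reconstructed piecewise form above.

```python
def reward(distance_to_goal, min_lidar_range,
           goal_radius=0.3, collision_range=0.2):
    """Sparse reward of Equation (6); the distance thresholds are assumed values,
    not parameters reported in the paper."""
    if distance_to_goal < goal_radius:
        return 80.0     # reached the target point
    if min_lidar_range < collision_range:
        return -10.0    # collision with an obstacle
    return 0.0          # otherwise
```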

4. Experiment and Performance Analysis

In order to verify the effectiveness of the path-planning algorithm proposed in this paper, simulation experiments were designed on the Gazebo robot simulation platform, with the motion control of the agent realized on ROS, and comparative experiments against other deep reinforcement learning algorithms tested the performance of the algorithms on maps of different sizes. The hardware for the experimental environment was an NVIDIA RTX 3090 GPU, and the Ubuntu 20.04 LTS operating system and Python 3.9 were used for the model’s training and evaluation.

4.1. Performance Analysis

In the training of the DDQN path-planning model, the Adam optimizer is adopted, with the learning rate set to 1.0 × 10−4 and the future reward discount factor to 0.99 to balance training stability and convergence speed. This configuration is experimentally verified to optimize the network parameters effectively, while avoiding the gradient oscillation caused by too large a learning rate and the slow convergence caused by too small a learning rate.
The model training results, shown in Figure 7, indicate that the LB-DDQN model exhibits excellent convergence characteristics in the path-planning task. After 996 k training steps, the test reward reaches 87.5% of the theoretical maximum, and the late-stage fluctuation coefficient (CV = 5.7%) is significantly lower than that of the benchmark DQN model (CV = 12.3%). The three-stage growth pattern of the reward curve is consistent with ideal reinforcement learning training dynamics, which verifies the reasonableness of the parameter configuration γ = 0.99 and batch_size = 256.
The loss curve is shown in Figure 8, and its three-stage characteristics are highly consistent with the theoretical expectations, demonstrating a desirable optimization trajectory. A minor fluctuation (Δ ≈ 0.4) near 400 k steps may originate from the adjustment of the exploration strategy, but the model exhibits good robustness and quickly resumes its convergence trend. This stability validates the advantage of the Adam optimizer (β1 = 0.9, β2 = 0.99) for non-smooth reinforcement learning objectives.
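For reference, the reported hyperparameters can be collected in an optimizer setup such as the following sketch, assuming a PyTorch training loop; the `model` object is a stand-in placeholder, not the actual network.

```python
import torch

# Hyperparameters reported in the text; `model` is a placeholder network
GAMMA = 0.99                     # future reward discount factor
BATCH_SIZE = 256                 # replay mini-batch size
model = torch.nn.Linear(4, 5)    # stand-in for the LSTM-DDQN value network
optimizer = torch.optim.Adam(model.parameters(), lr=1.0e-4, betas=(0.9, 0.99))
```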
The LB-DDQN model training required 48 h on an NVIDIA RTX 3090 GPU (24 GB memory), with the peak GPU memory utilization reaching 18.2 GB during batch processing; as noted above, the Ubuntu 20.04 LTS operating system and Python 3.9 were used for the model’s training and evaluation.

4.2. Comparative Experiment

To verify the effectiveness of the LB-DDQN method, the following problems in agent path planning are addressed: the discrete node-search mechanism causes sudden changes in path curvature and steering angle, which creates a risk of vehicle skidding and frequent actuator overshoot, threatening driving safety; and convergence is difficult in large, complex environments, so the planning efficiency is low and the energy cost is high, failing to satisfy real-time and energy-economy requirements. This experiment takes Dijkstra’s algorithm and the DDQN algorithm as comparison baselines for LB-DDQN, comparing data on path-planning safety, planning efficiency, and economy.

4.2.1. Comprehensive Evaluation of Safety Performance

A set of experimental path-planning data is sampled in the simulation map, and four key indexes are recorded: the maximum curvature κ_max, the average curvature κ̄, the maximum steering angle δ_max, and the mean steering angle δ̄. The experimental results are shown in Table 1.
Curvature is a core indicator of the degree of path bending, and it directly determines the stability of the vehicle’s lateral dynamics: if the maximum curvature exceeds what the vehicle’s minimum turning radius allows, the path becomes infeasible, while a lower average curvature means the path as a whole is gentler and better suited to scenarios with high comfort requirements. The experimental results show that the traditional graph search algorithm Dijkstra, owing to the jagged fluctuations caused by path discretization, has a maximum curvature κ_max as high as 1.905 m−1 and an average curvature κ̄ of 1.018 m−1, indicating poor path continuity that is prone to causing rollover risk. The DDQN model is optimized over successive action spaces, reducing the maximum and mean curvature to 0.410 m−1 and 0.098 m−1, respectively, but local curvature spikes remain. The maximum curvature κ_max of the LB-DDQN proposed in this paper is 0.330 m−1, which is 82.7% lower than that of Dijkstra and 19.5% lower than that of DDQN, indicating that it avoids extremely curved paths more effectively and significantly reduces the risk of sharp turns. Its mean curvature is 0.072 m−1, a further reduction of 92.9% compared with Dijkstra and 26.5% compared with DDQN, indicating better global path smoothness, in line with the demand for continuous trajectories in dynamic systems [20]. According to the vehicle dynamics model, the lateral acceleration is proportional to the curvature and the square of the velocity [21]. Assuming a vehicle speed of 5 m/s, the maximum lateral acceleration of LB-DDQN is 8.25 m/s², which is below the typical loss-of-control threshold of 10 m/s² [22]; its safety is thus verified by combining the theoretical model with the industry safety threshold. Compared with the other algorithms, the path generated by LB-DDQN not only avoids extreme bending risks but also improves global coherence, meeting the needs of dynamic system safety control.
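The lateral-acceleration check cited above can be reproduced with a few lines, assuming the simple relation a_lat = κ·v² used in the text; the speed and curvature values are those reported, and the threshold is the cited 10 m/s².

```python
def lateral_acceleration(curvature_per_m, speed_mps):
    """Simple approximation used in the text: a_lat = curvature * v^2."""
    return curvature_per_m * speed_mps ** 2

a_lat = lateral_acceleration(0.33, 5.0)   # ≈ 8.25 m/s^2 for LB-DDQN at 5 m/s
assert a_lat < 10.0                       # below the cited loss-of-control threshold
```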
The steering angle is one of the core metrics for measuring the safety of path-planning algorithms, and it is directly related to the handling stability, mechanical wear, and energy efficiency of the dynamic system. The steering angle constraint analysis considers the maximum steering angle δ_max and the average steering angle δ̄; a large steering angle or a sharp turn may lead to sideslip or loss of control. The average steering angle is defined as the mean of the turning angles between all neighboring segments of the path: the smaller it is, the smoother the change of direction, which helps the controller track the path stably. The steering angle metric quantifies the dynamic stress imposed by the path on the vehicle’s actuators. Dijkstra’s maximum steering angle is δ_max = 1.571 rad ≈ 90°, far exceeding the physical limits of a real vehicle; the maximum steering angle corresponding to the minimum turning radius of an ordinary vehicle is usually less than 0.5 rad, so such a path may overload the mechanical components or cause loss of control. The average steering angle δ̄ of Dijkstra is 0.211 rad ≈ 12.1°, and frequent large-angle steering easily causes sudden changes in lateral acceleration and increases control error and energy loss. The maximum steering angle δ_max of DDQN is 0.218 rad ≈ 12.5°, which is 86.1% lower than that of Dijkstra, but the risk of occasional sharp turns remains; the average steering angle δ̄ of DDQN is 0.097 rad ≈ 5.56°, which is 54.0% lower than that of Dijkstra, so the overall steering policy is more efficient. The LB-DDQN model uses path smoothing to control the steering angle increment, and its maximum steering angle δ_max and average steering angle δ̄ are further reduced to 0.210 rad ≈ 12.0° and 0.022 rad ≈ 1.3°, which are 3.7% and 77.3% lower than those of DDQN, respectively, while the temporal smoothness of the steering commands (measured by the DTW distance) improves by 68%. This indicates a highly smooth steering strategy, in line with the safety design principle of “gradual steering” [23] and compliant with the hard constraints of the vehicle dynamics model [24].
LB-DDQN demonstrates significant safety performance advantages in static path-planning scenarios by optimizing the path’s smoothness, reducing steering loads and thus simplifying the path structure. The generated path not only avoids extreme bending risks but also improves global coherence, which meets the underlying requirements of dynamic system safety control. Combined with theoretical models and industry safety thresholds, the safety performance of LB-DDQN in static environments has been fully verified.

4.2.2. Planning a Comprehensive Evaluation of Energy Efficiency

In order to verify the comprehensive energy efficiency advantages of the proposed path-planning method in real-world environments, it is analyzed for practical applications. A total of 8 groups of independent tests are conducted, covering planning over different distances and paths; the path-planning effect graphs are shown in Figure 9, where the orange path is the Dijkstra model, the blue path is the DDQN model, and the red path is the LB-DDQN model.
A campus scenario is selected for the modeling, and the performance differences among the three algorithms are measured along four dimensions: computation time, path length, number of turns, and energy consumption. The path quality is comprehensively analyzed by comparing the planned path lengths and computation times of the different algorithms in Figure 10, and the number of turns and energy consumption of the agents running the different algorithms in Figure 11.
As can be seen from Figure 10, the average computation time with the Dijkstra algorithm is 0.604 s, with the DDQN algorithm 0.391 s, and with the LB-DDQN algorithm proposed in this paper 0.360 s. By dynamically capturing the environmental temporal features through the LSTM network and reducing redundant path searches, the average computation time is improved by 40.1% compared with Dijkstra and by 7.4% compared with DDQN. In the fourth experiment, LB-DDQN takes 0.947 s, only 61.8% of Dijkstra’s time, indicating that Bézier curve preplanning significantly reduces the computational complexity.
Owing to its global search characteristics, Dijkstra’s algorithm yields the shortest path length in some experiments, but with larger fluctuations; LB-DDQN approaches the theoretical optimum through multi-step Q-learning decisions, coming close to Dijkstra with higher path quality. In the eighth experiment, the difference between LB-DDQN’s path length of 29,232 m and Dijkstra’s of 27,942 m is reduced to 4.6%, indicating that LB-DDQN optimizes local path continuity by predicting environmental change trends through the LSTM and effectively improves the quality of the path planning, while guaranteeing an efficient computation time.
LB-DDQN combined with Bézier curve interpolation significantly reduces the number of path turns; the average number of turns across the eight experiments is 4.375, which is 64.6% fewer than DDQN and 88.3% fewer than Dijkstra. As can be seen from Figure 11, in the eighth experiment the number of turns for LB-DDQN is only two, while DDQN has nine and Dijkstra has three. LB-DDQN’s control of the number of turns effectively improves the smoothness of the motion. Based on the longitudinal and lateral energy consumption of the vehicle, the total energy consumption E of the mobile agent can be modeled as the sum of cruise energy consumption, lateral friction energy consumption, and steering loss energy consumption, expressed by Formula (7):
E = 0.15 · L + 1.2 · κ̄ · L + 0.08 · δ
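A minimal sketch of Formula (7) is given below; interpreting L as the path length, κ̄ as the mean curvature, and δ as the accumulated steering angle is an assumption based on the surrounding text.

```python
def path_energy(length_m, mean_curvature, total_steering_rad):
    """Energy model of Formula (7): cruise + lateral-friction + steering-loss terms.
    Treating delta as the accumulated steering angle is an assumption."""
    cruise = 0.15 * length_m
    lateral = 1.2 * mean_curvature * length_m
    steering = 0.08 * total_steering_rad
    return cruise + lateral + steering
```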
By smoothing the path with efficient decision-making, LB-DDQN’s average energy consumption is 27.48 J, which is 19.8% lower than DDQN’s 34.26 J and 58.5% lower than Dijkstra’s 66.26 J. In the fourth experiment, the energy consumption of LB-DDQN is 52.1 J, only 39.9% of Dijkstra’s, indicating that it avoids energy waste through multi-objective co-optimization. Analyzing the composition of LB-DDQN’s mean energy consumption (base energy consumption 58.2%, steering energy consumption 23.1%, curvature energy consumption 12.4%, turn energy consumption 6.3%) shows that the Pearson correlation coefficient between the number of turns and the energy consumption reaches 0.92 (p < 0.01), indicating that LB-DDQN’s control of the number of steering turns effectively reduces power loss.
The LB-DDQN algorithm demonstrates outstanding performance in reducing the frequency of steering, minimizing the path’s curvature, and decreasing turning angles through its integration of LSTM-based dynamic decision-making and Bézier curve smoothing optimization. The LSTM module enhances decision sequences through temporal memory mechanisms, effectively reducing redundant exploration behaviors, with the experimental results showing the path-length gap narrowing to 4.6%. Meanwhile, the Bézier curve optimization converts discrete path points into kinematically feasible trajectories, significantly improving control accuracy by reducing the average steering angle from 0.097 rad in DDQN to 0.022 rad. The synergistic combination of LSTM’s coherent decision-making capability and the Bézier curve’s assurance of trajectory feasibility collectively achieves a remarkable 58.5% reduction in comprehensive energy consumption. The efficiency and practicability of the algorithm are thus verified.

4.3. Model Deployment

In order to test the effectiveness of the algorithm in a real environment, it is deployed on an unmanned vehicle for evaluation. The vehicle is first trained in high-precision simulation software and then verified in the real environment. A 3D simulation of the unmanned vehicle was built in Gazebo according to the actual environment; the vehicle used in the real-environment test is shown in Figure 12. For deployment, the LSTM network takes the preceding 5 s of time-series environmental data as input and predicts the distribution of path risk over the next 3 s; the DDQN framework generates candidate path points based on the prediction results and balances exploration and exploitation through a greedy strategy; and the Bézier curve optimization module converts the discrete path points into third-order continuous curves to ensure the path’s smoothness. After testing, the LB-DDQN path-planning algorithm proposed in this paper can be used effectively in real physical scenarios, and its motion smoothness and energy economy are consistent with the simulation.

5. Conclusions

Mobile agents increasingly operate in complex environments, which puts higher demands on their path-planning functions. In addition to reaching the target location safely, the quality and efficiency of the agent body in the planning process also need to be guaranteed. In the face of the security and planning efficiency problems of classical path-planning algorithms, the LB-DDQN path-planning algorithm proposed in this paper effectively solves them and realizes multi-objective optimization through the following techniques in the algorithm: the LSTM’s temporal prediction dynamically captures the environmental change trend, optimizes the path continuity and thus reduces the energy fluctuation due to frequent adjustments of the direction; Bézier curve smoothing interpolates the discrete path points into continuous curves, which directly reduces the number of corners and indirectly shortens the actual distance traveled; DDQN multi-step decision-making balances the exploration of new paths with the use of known low-energy paths and optimizes the comprehensive objective function through Q-value iteration. Validated through simulation experiments and deployed in real-world scenarios, the algorithm is able to plan an optimal path in complex environments and realize real-time obstacle avoidance to satisfy the safety and high efficiency of mobile agents during traveling.
Beyond academic campus scenarios, the proposed framework demonstrates applicability in industrial automation (e.g., warehouse robots), intelligent transportation systems, and drone navigation in cluttered airspaces. Future scaling strategies may involve edge computing deployment for low-latency decision-making and multi-agent coordination frameworks for collaborative tasks.

6. Limitations and Future Work

While this study has demonstrated significant achievements, several limitations should be acknowledged. First, although the LSTM network facilitates the extraction of temporal features, the model lacks explicit dynamic obstacle prediction and reaction mechanisms, potentially limiting its effectiveness in highly dynamic scenarios. Second, the integration of LSTM with Bézier curve optimization introduces additional computational complexity during both training and inference phases. While feasible on high-performance GPUs, deployment on resource-constrained edge devices remains challenging without optimization techniques such as model pruning or quantization. Third, the reward function design relies on simplified heuristic rules, which may constrain the algorithm’s adaptability in complex multi-objective tasks.
Future work will focus on multi-objective optimization in high-speed dynamic environments by integrating LSTM-based obstacle trajectory prediction with real-time replanning mechanisms, while incorporating smoothness-aware reward functions featuring curvature-based penalties and dynamic obstacle proximity penalties to simultaneously enhance the path’s smoothness, safety margins, and energy efficiency. Concurrently, the network architecture will be optimized to reduce the computational latency and memory footprint, thereby facilitating the deployment of this model in embedded systems.

Author Contributions

Writing—review and editing, S.J.; funding acquisition, X.Z.; supervision, Y.H.; software, R.L.; visualization, Q.W.; validation, H.H.; data curation, J.L.; resources, L.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China (Grant nos. 62173134, 62473140 and 62473145), the Key R&D Program Project of Hunan Province (Grant no. 2024JK2029), the Natural Science Foundation of Hunan Province (Grant nos. 2023JJ60151 and 2023JJ50029), and the Excellent Youth Project of Hunan Provincial Department of Education (Grant no. 22B0735).

Data Availability Statement

The data are available upon reasonable request.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have influenced the work reported in this paper.

References

  1. Li, X.; Ma, X.; Wang, X. A review of path planning algorithms for mobile robots. Comput. Meas. Control 2022, 30, 9–19. [Google Scholar]
  2. Zhu, Z.; Lyu, H.; Zhang, J.; Yin, Y. An efficient ship automatic collision avoidance method based on modified artificial potential field. J. Mar. Sci. Eng. 2021, 10, 3. [Google Scholar] [CrossRef]
  3. Jia, C.; He, H.; Zhou, J.; Li, J.; Wei, Z.; Li, K. Learning-based model predictive energy management for fuel cell hybrid electric bus with health-aware control. Appl. Energy 2024, 355, 122228. [Google Scholar] [CrossRef]
  4. Jia, C.; Liu, W.; He, H.; Chau, K. Deep reinforcement learning-based energy management strategy for fuel cell buses integrating future road information and cabin comfort control. Energy Convers. Manag. 2024, 321, 119032. [Google Scholar] [CrossRef]
  5. Li, K.; Zhou, J.; Jia, C.; Yi, F.; Zhang, C. Energy sources durability energy management for fuel cell hybrid electric bus based on deep reinforcement learning considering future terrain information. Int. J. Hydrogen Energy 2024, 52, 821–833. [Google Scholar] [CrossRef]
  6. Tang, G.; Tang, C.; Claramunt, C.; Hu, X.; Zhou, P. Geometric A-star algorithm: An improved A-star algorithm for AGV path planning in a port environment. IEEE Access 2021, 9, 59196–59210. [Google Scholar] [CrossRef]
  7. Miao, C.; Chen, G.; Yan, C.; Wu, Y. Path planning optimization of indoor mobile robot based on adaptive ant colony algorithm. Comput. Ind. Eng. 2021, 156, 107230. [Google Scholar] [CrossRef]
  8. Chen, S.; Zhao, C.; Wang, C.; Zhengbing, Y. Reinforcement learning based multi-intelligence path planning algorithm. In Proceedings of the 32nd Chinese Conference on Program Control, Taiyuan, China, 30 July 2021. [Google Scholar]
  9. Liu, J.W.; Gao, F.; Luo, X.L. Survey of deep reinforcement learning based on value function and policy gradient. J. Comput. 2019, 42, 1406–1438. [Google Scholar]
  10. Liu, Q.; Zhai, J.; Zhang, Z.; Shan, Z.; Qian, Z.; Peng, Z.; Jin, X. A review of deep reinforcement learning. Chin. J. Comput. 2018, 41, 1–27. [Google Scholar]
  11. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with Deep Reinforcement Learning. In Proceedings of the Workshops at the 26th Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; ACM: New York, NY, USA, 2013; pp. 201–220. [Google Scholar]
  12. Watkins, C.J.C.H.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
  13. Van Hasselt, H.; Guez, A.; Silver, D. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 12–17 February 2016; AIAA: Reston, VA, USA, 2016; pp. 2094–2100. [Google Scholar]
  14. Xin, J.; Zhao, H.; Liu, D.; Li, M. Application of deep reinforcement learning in mobile robot path planning. In Proceedings of the Chinese Automation Congress, Jinan, China, 20–22 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 7112–7116. [Google Scholar]
  15. Zhang, X.; Shi, X.; Zhang, Z.; Wang, Z.; Zhang, L. A DDQN path planning algorithm based on experience classification and multi steps for mobile robots. Electronics 2022, 11, 2120. [Google Scholar] [CrossRef]
  16. Zhu, Z.; Hu, C.; Zhu, C.; Zhu, Y.; Sheng, Y. An improved dueling deep double-Q network based on prioritized experience replay for path planning of unmanned surface vehicles. J. Mar. Sci. Eng. 2021, 9, 1267. [Google Scholar] [CrossRef]
  17. Yi, F.; Shu, X.; Zhou, J.; Zhang, J.; Feng, C.; Gong, H.; Zhang, C.; Yu, W. Remaining useful life prediction of PEMFC based on matrix long short-term memory. Int. J. Hydrogen Energy 2025, 111, 228–237. [Google Scholar] [CrossRef]
  18. Shi, K.; Wu, Z.; Jang, B.; Karimi, H.R. Dynamic path planning of mobile robot based on improved simulated annealing algorithm. J. Frankl. Inst. 2023, 360, 4378–4398. [Google Scholar] [CrossRef]
  19. Wang, N.; Wu, Y.; Xu, N. Research on control strategy of agricultural robot. J. Agric. Mech. Res. 2025, 47, 205–209. [Google Scholar]
  20. Paden, B.; Čáp, M.; Yong, S.Z.; Yershov, D.; Frazzoli, E. A Survey of Motion Planning and Control Techniques for Self-Driving Urban Vehicles. IEEE Trans. Intell. Veh. 2016, 1, 33–55. [Google Scholar] [CrossRef]
  21. Rajamani, R. Vehicle Dynamics and Control; Springer: Berlin/Heidelberg, Germany, 2011. [Google Scholar]
  22. NHTSA. Federal Automated Vehicles Policy: Accelerating the Next Revolution in Roadway Safety; Springer: Berlin/Heidelberg, Germany, 2016. [Google Scholar]
  23. Levinson, J.; Askeland, J.; Becker, J.; Dolson, J.; Held, D.; Kammel, S.; Kolter, J.Z.; Langer, D.; Pink, O.; Pratt, V.; et al. Towards Fully Autonomous Driving: Systems and Algorithms. In Proceedings of the 2011 IEEE Intelligent Vehicles Symposium (IV), Baden-Baden, Germany, 5–9 June 2011. [Google Scholar]
  24. Dolgov, D.; Thrun, S.; Montemerlo, M.; Diebel, J. Path Planning for Autonomous Vehicles in Unknown Semi-structured Environments. Int. J. Robot. Res. 2010, 29, 485–501. [Google Scholar] [CrossRef]
Figure 1. Diagram of LSTM network structure.
Figure 2. Framework of the LSTM-DDQN model.
Figure 3. The third-order Bézier curve.
Figure 4. Schematic diagram of the LSTM-DDQN path-planning framework.
Figure 5. SLAM algorithm graph-building process.
Figure 6. Grid map dilation algorithm.
Figure 7. LB-DDQN reward curve.
Figure 8. LB-DDQN loss curve.
Figure 9. Path-planning simulation effect.
Figure 10. Comparison of planning path length and computation time loaded by different algorithms.
Figure 11. Comparison of the number of steering-angle turns and energy consumption of agents loaded with different algorithms for planning paths.
Figure 12. Actual testing environment.
Table 1. Comparison of experimental data for different algorithms.

Model      κ_max (m−1)   κ̄ (m−1)   δ_max (rad)   δ̄ (rad)
Dijkstra   1.905         1.018     1.571         0.211
DDQN       0.410         0.098     0.218         0.097
LB-DDQN    0.330         0.072     0.210         0.022
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

