Path Planning in Complex Environments Using Attention-Based Deep Deterministic Policy Gradient

Chen, Jinlong; Jiang, Yun; Pan, Hongren; Yang, Minghao

doi:10.3390/electronics13183746

by Jinlong Chen ¹, Yun Jiang ¹,*, Hongren Pan ² and Minghao Yang ³

¹ School of Computer and Information Security, Guilin University of Electronic Technology, Guilin 541004, China

² Guilin Shandi Network Technology Co., Ltd., Guilin 541000, China

³ Laboratory of Brain Atlas and Brain-Inspired Intelligence, Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing 100190, China

* Author to whom correspondence should be addressed.

Electronics 2024, 13(18), 3746; https://doi.org/10.3390/electronics13183746

Submission received: 23 August 2024 / Revised: 13 September 2024 / Accepted: 16 September 2024 / Published: 20 September 2024

Abstract

The traditional Deep Deterministic Policy Gradient (DDPG) algorithm frequently exhibits a notable reduction in success rate when transferred to new environments after being trained in complex simulation settings. To address this issue, this paper adopts a Multi-Environment (Multi-Env) parallel training approach and integrates Multi-Head Attention (MHA) and Prioritized Experience Replay (PER) into the DDPG framework, together with an optimized reward function, to form the MAP-DDPG algorithm. This approach enhances the algorithm’s generalization capability and execution efficiency. The DDPG and MAP-DDPG algorithms were trained and tested comparatively in both simulation and real-world environments, and the experimental results demonstrate that MAP-DDPG improves generalization and execution efficiency over DDPG. Specifically, in simulation environment tests, the MAP-DDPG algorithm achieved an average 30% increase in success rate and reduced the average time to reach the target point by 23.7 s compared to the DDPG algorithm. These results indicate that MAP-DDPG provides a more effective solution for path planning in complex environments.

Keywords: path planning; DDPG; MHA; PER; Multi-Env; reward function

1. Introduction

Path planning is one of the core issues in fields such as robotics, autonomous driving, and mobile robot navigation [1]. This problem requires algorithms to find the optimal or feasible path between a start point and an endpoint, while also considering obstacle avoidance and optimizing certain performance metrics such as path length, safety, or adaptability. Autonomous navigation technology primarily relies on maps, which involve three key components: localization, mapping, and path planning. Localization is typically achieved using Adaptive Monte Carlo Localization (AMCL) [2], while mapping relies on Kalman filtering [3] or graph-based SLAM [4]. Path planning can be divided into global and local path planning. Global path planning commonly uses algorithms such as D* [5] and Rapidly-Exploring Random Tree (RRT) [6], whereas local path planning employs methods like the Dynamic Window Approach (DWA) [7] and Timed Elastic Band (TEB) [8]. However, traditional map-based path planning is often costly in terms of map construction and struggles to adapt to dynamic and large-scale environments. As a result, mapless path planning methods have gradually emerged, enabling robots to plan paths through interaction with the environment and autonomous exploration.

In the field of mapless path planning, autonomous learning is the core technology for exploring unknown environments. This approach particularly relies on reactive strategies based on reinforcement learning, as they do not require supervised learning or prior knowledge. The origins of reinforcement learning date back to the 1950s [9]. In 1957, Bellman introduced dynamic programming into stochastic discrete Markov decision processes, simplifying the optimal control problem. In 1989, Watkins advanced the field by developing the Q-learning method [10]. However, in complex environments, the number of state-action pairs in the Q-table can become overwhelmingly large, resulting in significant memory usage and the “curse of dimensionality”. In 2013, Google DeepMind merged deep learning with Q-learning by using neural networks to estimate the value function, which led to the development of the Deep Q-Network (DQN) [11], an end-to-end approach spanning perception to learning. Nevertheless, DQN is restricted to discrete action spaces and struggles with continuous ones. To overcome this, in 2015, Google DeepMind integrated DQN into the Actor–Critic framework, proposing the Deep Deterministic Policy Gradient (DDPG) [12,13] to address the difficulties inherent in continuous action spaces.

When applying the DDPG algorithm to mobile robot path planning, although the algorithm is capable of generating continuous action sequences, it often faces challenges such as slow learning speed and susceptibility to local optima in complex environments. Particularly when transferring the model to new complex environments, the success rate and execution efficiency of the traditional DDPG algorithm frequently decline, sometimes requiring retraining to adapt to environmental changes. To address these issues, this paper proposes an improved path-planning method, with the following key contributions: (1) it introduces a novel network architecture that integrates the Multi-Head Attention (MHA) mechanism with the DDPG algorithm to enhance the efficiency and generalization of path planning strategy generation; (2) it simultaneously trains the same model across multiple randomly generated complex environments, including both dynamic and static obstacles, while incorporating the Prioritized Experience Replay (PER) method to accelerate training and improve the model’s generalization capabilities; and (3) it redesigns the reward function to better align with the specific requirements of this research, further enhancing the agent’s performance in complex environments.

The organization of this paper is as follows: Section 2 examines related work, outlining existing methods in path planning and their limitations. Section 3 offers a comprehensive explanation of our proposed approach, detailing the model architecture, multi-environment training strategy, and the reward function design. Section 4 discusses the experimental setup and analysis of the results. Lastly, Section 5 discusses the effectiveness of the proposed algorithm in other aspects and highlights its limitations, while Section 6 summarizes the research findings of this paper.

2. Related Works

2.1. Deep Deterministic Policy Gradient (DDPG) Algorithm

The DDPG algorithm holds significant application value and research importance in the field of reinforcement learning. Lillicrap et al. were the first to propose the DDPG algorithm, which combines deterministic policy gradient methods with deep neural networks to achieve efficient learning in high-dimensional continuous action spaces [12]. The core idea of DDPG is based on the Actor–Critic architecture, where the Actor network is responsible for generating actions and the Critic network estimates the value of those actions. This approach allows DDPG to output actions directly in continuous action spaces without relying on discretization, which improves learning efficiency. Path planning in complex environments often requires precise control over continuous actions such as the robot’s speed and direction. Because DDPG is designed for continuous action spaces, it can generate smooth and accurate action sequences, making it well suited for fine-grained path planning in complex environments. In addition, because the action policy (Actor network) and the value of that policy (Critic network) are learned simultaneously, path-planning decisions can be adjusted in real time as the environment changes, making the method more adaptable to the uncertainties present in complex settings. As shown in Figure 1, the Critic network updates its parameters by minimizing the following loss function:

$$L(\theta^{Q}) = \mathbb{E}_{(s, a, r, s')} \left[ \left( r + \gamma Q'\left(s', \mu'(s' \mid \theta^{\mu'}) \mid \theta^{Q'}\right) - Q(s, a \mid \theta^{Q}) \right)^{2} \right], \tag{1}$$

Here, $Q(s, a \mid \theta^{Q})$ represents the estimated value from the Critic network, and $Q'(s', \mu'(s' \mid \theta^{\mu'}) \mid \theta^{Q'})$ is the estimated value from the target Critic network. The discount factor $\gamma$ accounts for future rewards, $r$ is the immediate reward, and $s$ and $a$ denote the state and action, respectively. The Actor network updates its parameters by maximizing the action value provided by the Critic network, with the goal of maximizing the following objective function:

$$J(\theta^{\mu}) = \mathbb{E}_{s} \left[ Q(s, \mu(s \mid \theta^{\mu}) \mid \theta^{Q}) \right] \tag{2}$$

The parameters of the Actor network are updated using the gradient ascent method, and the gradient is computed as follows:

$$\nabla_{\theta^{\mu}} J \approx \mathbb{E}_{s} \left[ \nabla_{a} Q(s, a \mid \theta^{Q}) \big|_{a = \mu(s)} \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu}) \right] \tag{3}$$

Through the above formulas, the Actor and Critic networks are alternately updated during each iteration, ultimately converging to an optimal policy.
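As a rough illustration of Equations (1) and (2), the following sketch computes the Critic's mean squared TD error and a sample estimate of the Actor objective for a small mini-batch. The function names (`critic_loss`, `actor_objective`), the tiny linear stand-ins for the four networks, and all numbers are assumptions made for illustration only; they are not the networks or values used in this paper.

```python
def critic_loss(batch, q, q_target, mu_target, gamma=0.99):
    """Mean squared TD error of Equation (1) over a mini-batch.

    batch: list of (s, a, r, s_next) transitions.
    q, q_target: callables (state, action) -> value (Critic and target Critic).
    mu_target: callable state -> action (target Actor).
    """
    total = 0.0
    for s, a, r, s_next in batch:
        # TD target: r + gamma * Q'(s', mu'(s'))
        y = r + gamma * q_target(s_next, mu_target(s_next))
        total += (y - q(s, a)) ** 2
    return total / len(batch)


def actor_objective(states, q, mu):
    """Sample estimate of J in Equation (2): the mean of Q(s, mu(s))."""
    return sum(q(s, mu(s)) for s in states) / len(states)


# Hypothetical linear stand-ins for the Critic, target Critic, Actor, and target Actor.
def q(s, a):
    return 0.5 * s[0] + 1.0 * a


def q_target(s, a):
    return 0.4 * s[0] + 1.0 * a


def mu(s):
    return 0.1 * s[0]


def mu_target(s):
    return 0.2 * s[0]


batch = [([1.0], 0.0, 1.0, [2.0]), ([2.0], 1.0, 0.0, [3.0])]
loss = critic_loss(batch, q, q_target, mu_target, gamma=0.9)
objective = actor_objective([s for s, _, _, _ in batch], q, mu)
print(f"critic loss = {loss:.4f}, actor objective = {objective:.4f}")
```

In a full implementation the Critic parameters would be moved down the gradient of this loss and the Actor parameters up the gradient in Equation (3); the sketch only evaluates the two quantities.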

Since the introduction of the DDPG algorithm, extensive research has been conducted on its improvement and application. Fujimoto et al. [14] proposed the Twin Delayed DDPG (TD3), which introduced mechanisms such as delayed policy updates and twin Critic networks, substantially mitigating the value overestimation caused by function approximation errors in DDPG. Zou et al. [15] combined imitation learning with deep reinforcement learning to propose an end-to-end driving policy learning method based on Imitation Learning–Deep Deterministic Policy Gradient, effectively addressing the inefficiency of DDPG in sparse reward environments. In practical applications, DDPG has been widely applied in robotics control and autonomous driving domains. For instance, the study by Rao et al. demonstrated the effectiveness of DDPG in controlling high-dimensional nonlinear systems [16], while the work by Chu et al. highlighted the potential of DDPG in path planning for autonomous vehicles [17]. Additionally, in multi-agent systems, Lowe et al. [18] proposed the Multi-Agent DDPG (MADDPG), which resolved the non-stationarity problem in multi-agent systems through a centralized training and decentralized execution framework.

Many scholars have also focused on improving DDPG’s efficiency and robustness by refining exploration strategies, optimizing reward functions, or adding regularization terms. Henderson et al. [19] explored the impact of different hyperparameter settings on DDPG performance, finding that appropriate reward design and parameter adjustment are crucial for enhancing DDPG’s learning outcomes. However, despite its superior performance in many domains, DDPG still faces challenges when dealing with complex or unfamiliar environments. These challenges include the algorithm’s tendency to become stuck in local optima and its poor adaptability to environmental changes, leading to a significant decline in success rate and efficiency in new environments. These issues limit its widespread application in real-world scenarios, particularly in high-risk or rapidly changing environments. Future research may focus on addressing these challenges to further enhance the practical value of DDPG.

2.2. Attention Mechanism Algorithm

The attention mechanism represents a deep learning approach that imitates the human visual system’s ability to focus attention. It has found applications in computer vision for enhancing cognitive concentration. Mnih et al. were pioneers in introducing a Recurrent Neural Network (RNN) model capable of selectively processing essential elements to extract information from images [20]. Following this, Jaderberg et al. developed the Spatial Transformer Network, a method designed to select and emphasize significant regions [21]. This methodology eventually became recognized as the attention mechanism. Essentially, the attention mechanism enables models to focus on specific data segments when handling extensive datasets [22]. By allowing models to evaluate the importance of various input components and allocate different levels of attention accordingly, it enhances their capacity to identify complex patterns and dependencies within the data. Initially conceived for tasks like image processing, classification, and text summarization [23,24], the attention mechanism has also been demonstrated to be highly effective in reinforcement learning [25,26].

The application of the attention mechanism in path planning is an emerging research area, and this mechanism can significantly enhance the performance of path-planning algorithms. By employing the attention mechanism, algorithms can focus on the most relevant features and information in the environment, enabling them to more effectively handle complex and dynamic scenarios. The attention mechanism allows models to prioritize regions and factors that are critical to decision making during path planning, thereby reducing unnecessary computational resource consumption while improving the accuracy and efficiency of the paths generated. The introduction of this mechanism provides path planning algorithms with greater adaptability and robustness, particularly in environments characterized by high diversity and uncertainty.

This mechanism can dynamically adjust strategies by calculating the association weights between different parts of the environmental state, enabling adaptation to environmental changes. In 2020, Li et al. [27] proposed a selective gating mechanism enhanced by self-attention combined with entity-aware embeddings to improve distant supervision relation extraction. In 2022, Shiri et al. [28] explored an attention-based communication and control strategy for multi-UAV (Unmanned Aerial Vehicle) path planning. The research team utilized an attention model to optimize coordination and communication among UAVs, allowing each UAV to adjust its flight path based on the states of other UAVs and environmental factors. This strategy not only improved the efficiency of path planning, but also enhanced the collective coordination ability of UAVs when performing complex tasks. Although the self-attention mechanism introduces new possibilities for path planning, it still faces some challenges in practical applications, such as computational complexity and resource consumption. Future research may focus on optimizing the computational efficiency of these algorithms while also exploring how to combine the self-attention mechanism with other types of machine learning techniques to further improve path-planning performance in dynamic and uncertain environments.

2.3. Prioritized Experience Replay Mechanism

The Prioritized Experience Replay (PER) mechanism is an improved experience replay method in reinforcement learning first proposed by Schaul et al. to enhance the learning efficiency and performance of deep reinforcement learning algorithms [29]. In traditional experience replay, the agent stores experiences generated through interactions with the environment in a replay buffer and randomly samples these experiences during training to update the policy or value function. This random sampling method breaks the temporal correlation between experiences, helping to improve sample efficiency and stabilize the training process. However, traditional experience replay assigns the same priority to all experiences, meaning each experience has an equal probability of being sampled. This random sampling approach may reduce learning efficiency because some key experiences, such as those with high TD (Temporal Difference) errors, contribute more to policy updates but do not receive more attention during random sampling.

To address this issue, Schaul et al. proposed the PER mechanism [29]. PER assigns weighted sampling to experiences based on their importance, which is typically measured by the TD error [30,31]. Specifically, the importance of an experience is determined by its temporal difference error (TD error), which is defined as:

$$\delta_{i} = \left| r_{i} + \gamma \max_{a'} Q(s_{i+1}, a'; \theta^{-}) - Q(s_{i}, a_{i}; \theta) \right|, \tag{4}$$

Here, $\delta_{i}$ is the TD error of the $i$th experience, $r_{i}$ is the immediate reward obtained after executing action $a_{i}$, $\gamma$ is the discount factor, $Q(s_{i+1}, a'; \theta^{-})$ is the value of the next state-action pair estimated by the target network, and $Q(s_{i}, a_{i}; \theta)$ is the value of the current state-action pair estimated by the current network. In PER, the sampling probability $P(i)$ of each experience is based on its TD error, typically using the following weighted sampling strategy:

$$p_{i} = |\delta_{i}| + \epsilon, \quad P(i) = \frac{p_{i}^{\alpha}}{\sum_{k} p_{k}^{\alpha}}, \tag{5}$$

Here, $p_{i} = |\delta_{i}| + \epsilon$ represents the priority of the experience, where $\epsilon$ is a small constant added to prevent the sampling probability from being zero, and $\alpha$ controls the degree of sampling bias. When $\alpha = 0$, PER degenerates into uniform sampling. Since the change in sampling probability may introduce bias, the PER mechanism typically incorporates Importance Sampling Weights to correct this bias during the update process. The Importance Sampling Weight is defined as:

$$\omega_{i} = \left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^{\beta}, \tag{6}$$

Here, $N$ is the size of the experience replay buffer, and $\beta$ controls the extent of the importance sampling correction; it is typically increased gradually toward 1 during training so that the bias is fully corrected by the end of training. Finally, the weight correction during updates is given by:

$$\theta \leftarrow \theta + \frac{1}{|B|} \sum_{i \in B} \omega_{i} \cdot \delta_{i} \cdot \nabla_{\theta} Q(s_{i}, a_{i}; \theta), \tag{7}$$

where $B$ denotes the sampled mini-batch.
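The priority, sampling probability, and importance sampling weight defined in Equations (5) and (6) can be sketched in a few lines of plain Python. The helper names and the default values of `eps`, `alpha`, and `beta` below are illustrative assumptions rather than settings reported in this paper, and a list-based buffer is used for clarity instead of the sum-tree structure that practical implementations usually rely on.

```python
import random


def priorities(td_errors, eps=1e-6):
    """p_i = |delta_i| + eps, so that no experience has zero priority."""
    return [abs(d) + eps for d in td_errors]


def sampling_probabilities(prios, alpha=0.6):
    """P(i) = p_i^alpha / sum_k p_k^alpha; alpha = 0 gives uniform sampling."""
    scaled = [p ** alpha for p in prios]
    total = sum(scaled)
    return [x / total for x in scaled]


def importance_weights(probs, beta=0.4):
    """w_i = (1 / (N * P(i)))^beta, where N is the buffer size."""
    n = len(probs)
    return [(1.0 / (n * p)) ** beta for p in probs]


def sample_indices(probs, batch_size, rng):
    """Draw a mini-batch of buffer indices according to P(i), with replacement."""
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)


# Example: three stored experiences with small, medium, and large TD errors.
td_errors = [0.0, 1.0, 3.0]
probs = sampling_probabilities(priorities(td_errors))
weights = importance_weights(probs)
batch = sample_indices(probs, batch_size=5, rng=random.Random(0))
print(probs, weights, batch)
```

Experiences with larger TD errors are drawn more often, and their importance sampling weights are correspondingly smaller, which is what compensates for the non-uniform sampling in the update of Equation (7).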

The Prioritized Experience Replay (PER) mechanism significantly enhances the efficiency and performance of deep reinforcement learning, particularly in environments with high-dimensional state spaces or sparse rewards. It enables the algorithm to quickly focus on key experiences that are most beneficial for policy improvement, thereby accelerating convergence. Numerous studies have shown that PER is broadly applicable in DQN and other deep reinforcement learning algorithms, achieving better performance across various tasks [29]. Additionally, the concept of PER has been further extended and applied in other contexts. PER has become an important tool in deep reinforcement learning (DRL) for improving training efficiency and policy stability.

For example, Zheng et al. (2021) optimized the robustness of DRL-driven network systems by introducing a teacher–student learning framework. This framework incorporated the PER mechanism, ensuring that critical experience samples were prioritized in the student network’s learning process, thereby speeding up convergence and enhancing the network system’s adaptability in dynamic environments [32]. Moreover, Chen et al. (2021) utilized DRL agents to jointly optimize computation offloading and resource allocation in Mobile Edge Computing (MEC). Their study also employed the PER mechanism to improve the learning efficiency of DRL agents in complex environments [33]. These studies demonstrate that PER effectively enhances learning efficiency and performance in DRL applications across various domains, making it a key factor in advancing DRL methodologies in complex systems.

3. Methodology

3.1. Parallel Training across Multiple Complex Environments

In reinforcement learning tasks within complex environments, to enhance the generalization ability and training efficiency of algorithms, this paper proposes a method that involves parallel training across multiple environments, combined with the Prioritized Experience Replay (PER) mechanism to optimize the learning process of the model. The core idea of multi-environment parallel training is to simultaneously train the same agent model in multiple independent environments. These environments may have different obstacle layouts and dynamic changes, with each environment providing the agent with diverse training data, helping the model to comprehensively learn strategies for coping with a variety of scenarios. As shown in Figure 2, in each training step, the agent can perform the following actions in each environment: First, at time step $t$, each environment $i$ has a current state $s_{t}^{i}$, which represents a combination of all relevant factors in the environment (e.g., the robot’s position on the map, the positions of obstacles, etc.). Next, the Actor network generates an action $a_{t}^{i} = \mu(s_{t}^{i} \mid \theta^{\mu})$ based on the current state $s_{t}^{i}$, which directs the agent’s next move in the environment (e.g., the direction and speed of the robot’s movement). The agent then executes the generated action $a_{t}^{i}$ in each environment, interacting with the environment. This interaction yields an immediate reward $r_{t}^{i}$ and the next state $s_{t+1}^{i}$, where the immediate reward $r_{t}^{i}$ measures the contribution of the current action toward achieving the goal. Finally, the generated tuple $(s_{t}^{i}, a_{t}^{i}, r_{t}^{i}, s_{t+1}^{i})$ is stored in the experience pool of the corresponding environment and will be used in subsequent training steps. To ensure more effective utilization of stored experiences during training, this paper introduces the PER mechanism into experience replay. Specifically, the PER method assigns priority to experiences based on their TD error, with experiences that have higher errors being more likely to be selected during replay, thereby accelerating convergence and improving learning efficiency.
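The per-step data collection described above can be sketched as follows: one shared policy acts in every environment, and each transition is stored in that environment's own experience pool. `ToyEnv` is a hypothetical one-dimensional stand-in for a simulated environment, and the proportional `actor` function is a placeholder for the Actor network; neither is taken from this paper.

```python
from dataclasses import dataclass


@dataclass
class ToyEnv:
    """Hypothetical 1-D environment: a point robot moving toward a goal."""
    goal: float
    position: float = 0.0

    def state(self):
        return (self.position, self.goal)

    def step(self, action):
        before = abs(self.goal - self.position)
        self.position += action
        after = abs(self.goal - self.position)
        # Reward is the progress made toward the goal in this step.
        return before - after, self.state()


def actor(state, gain=0.5, max_step=1.0):
    """Placeholder for the shared Actor network mu(s | theta_mu)."""
    position, goal = state
    return max(-max_step, min(max_step, gain * (goal - position)))


def collect_step(envs, buffers, policy):
    """Run one time step in every environment and store (s, a, r, s')."""
    for env, buffer in zip(envs, buffers):
        s = env.state()
        a = policy(s)             # same policy in all environments
        r, s_next = env.step(a)   # interact with environment i
        buffer.append((s, a, r, s_next))


envs = [ToyEnv(goal=g) for g in (2.0, -3.0, 5.0)]
buffers = [[] for _ in envs]      # one experience pool per environment
for _ in range(4):
    collect_step(envs, buffers, actor)
print([len(b) for b in buffers])
```

The loop here is sequential for readability; running the environments in parallel processes changes how the transitions are gathered but not what is stored.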

During training, the system samples experiences from the experience pools of various environments according to their priorities, forming a mini-batch that is used to update the parameters of the Actor and Critic networks. This approach ensures that the model learns from the most critical experiences in each environment, thereby accelerating convergence and enhancing the algorithm’s generalization capability. The strategy of multi-environment parallel training offers several advantages. First, by training in multiple distinct environments simultaneously, the model is exposed to a more diverse set of state-action pairs, allowing it to learn more generalized strategies that improve its performance in new environments. Second, the parallel processing of data from multiple environments, combined with the PER mechanism, effectively utilizes computational resources and speeds up the model’s convergence. Third, training across multiple environments helps the model avoid local optima that may arise in a single environment, enhancing the stability of the training process. Lastly, the PER mechanism dynamically adjusts the sampling probabilities of experiences, ensuring that the model focuses on learning those experiences most critical for policy improvement, thereby improving the efficiency and effectiveness of training.

Overall, the combination of multi-environment parallel training and prioritized experience replay enables the model to adapt more quickly to diverse and complex environments, significantly enhancing its robustness and performance and providing a powerful solution for path-planning problems in challenging environments.

3.2. MAP-DDPG Network Architecture

The overall framework of the Multi-Head Attention Deep Deterministic Policy Gradient (MAP-DDPG) network proposed in this paper is illustrated in Figure 3. This framework is an end-to-end path planning network composed of two parts: the Actor network and the Critic network, designed to address path-planning issues from simulation environments to real-world environments. Specifically, the Actor network is responsible for generating the optimal actions based on the input environmental state and position information, while the Critic network evaluates the value of the actions output by the Actor network. By integrating the Multi-Head Attention (MHA) mechanism with both the Actor and Critic networks, the MAP-DDPG architecture significantly enhances the model’s decision-making capability and robustness when dealing with complex environments. The Multi-Head Attention mechanism allows the model to process input features in parallel across multiple attention heads, thereby extracting richer and more diverse feature representations.

Specifically, as shown in the Actor network in Figure 3, the state vectors from $n$ environments are combined into a state matrix $S_{t}$. This environment state matrix $S_{t}$ is processed through a fully connected layer and a ReLU activation function to produce matrix $X$. The matrix $X$ is then processed by the Multi-Head Attention (MHA) mechanism. The illustration of the Multi-Head Attention mechanism in path planning is shown in Figure 4. The detailed processing steps of the Multi-Head Attention mechanism are as follows.

The matrix $X$ is linearly mapped to generate the query $Q$, key $K$, and value $V$ vectors. Assuming there are $h$ attention heads, the linear mapping for each attention head $i$ can be represented as follows:

$$Q^{i} = X W_{Q}^{i}, \quad K^{i} = X W_{K}^{i}, \quad V^{i} = X W_{V}^{i}, \quad (i = 1, \dots, h), \tag{8}$$

Here, $W_{Q}^{i}$, $W_{K}^{i}$, and $W_{V}^{i}$ are the linear mapping matrices for the query, key, and value, respectively, with dimensions $100 \times d_{k}$, where $d_{k}$ is the dimensionality of each attention head. In each attention head, the attention weight matrix is calculated using the query $Q^{i}$ and key $K^{i}$. This process, through the scaled dot-product attention mechanism, can be represented as:

$$\mathrm{Attention}^{i}(Q^{i}, K^{i}, V^{i}) = \mathrm{softmax}\left( \frac{Q^{i} (K^{i})^{\top}}{\sqrt{d_{k}}} \right) V^{i}, \quad (i = 1, \dots, h), \tag{9}$$

Here, $\sqrt{d_{k}}$ is the scaling factor used to ensure numerical stability during the dot-product operation. The Multi-Head Attention mechanism concatenates the outputs of each head and performs a linear transformation to obtain the final attention output. Specifically, the output of the Multi-Head Attention mechanism is expressed as:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{concat}\left( \mathrm{Attention}^{1}, \mathrm{Attention}^{2}, \dots, \mathrm{Attention}^{h} \right) W_{o}, \tag{10}$$

Here, $h$ represents the number of attention heads, and $W_{o}$ is the linear transformation matrix with dimensions $(h \times d_{k}) \times d_{out}$, where $d_{out}$ is the dimensionality of the final output. In this paper, $d_{out}$ is set to 100. Therefore, the shape of the Multi-Head matrix is $n \times d_{out}$. Next, this Multi-Head matrix is concatenated with the previously mentioned $X$ matrix to generate the $Y$ matrix, which has a shape of $n \times 200$. The concatenated $Y$ matrix is then used to calculate the robot’s linear and angular velocities. This describes the processing flow of the Actor network in the MAP-DDPG network; the structure and processing flow of the Critic network are largely similar.
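To make the shapes in Equations (8) to (10) concrete, the following dependency-free sketch runs the attention steps on a random $n \times 100$ matrix and concatenates the result with the input to obtain an $n \times 200$ matrix. The choices $h = 4$ and $d_k = 25$, the random weights standing in for learned parameters, and the helper names are all assumptions for illustration; the fully connected layer that produces $X$ is skipped, and plain Python lists are used in place of a tensor library.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def attention_head(x, w_q, w_k, w_v):
    """Equations (8) and (9) for a single head."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(w_q[0])
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, transpose(k))]
    return matmul(softmax_rows(scores), v)


def multi_head(x, heads, w_o):
    """Equation (10): concatenate the head outputs row by row, then apply W_o."""
    outputs = [attention_head(x, *weights) for weights in heads]
    concat = [sum((out[i] for out in outputs), []) for i in range(len(x))]
    return matmul(concat, w_o)


rng = random.Random(0)
n, d_model, h, d_k, d_out = 3, 100, 4, 25, 100   # h and d_k are assumed values
x = random_matrix(n, d_model, rng, scale=1.0)     # stands in for the matrix X
heads = [tuple(random_matrix(d_model, d_k, rng) for _ in range(3)) for _ in range(h)]
w_o = random_matrix(h * d_k, d_out, rng)

mha = multi_head(x, heads, w_o)                   # n x d_out
y = [x_row + m_row for x_row, m_row in zip(x, mha)]  # n x 200, the matrix Y
print(len(mha), len(mha[0]), len(y[0]))
```

Because each row of $X$ corresponds to one environment, the softmax in each head weights the environments against one another, which is the cross-environment correlation that the concatenated $Y$ matrix carries forward.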

In the MAP-DDPG network, the Multi-Head Attention mechanism enhances feature extraction and information integration capabilities by applying weighted processing to input features from multiple environments. Compared to a single-head attention mechanism, the Multi-Head Attention mechanism is more powerful in extracting key features and integrating information. Specifically, it can simultaneously extract multiple critical features from different attention heads within the environment, allowing the model to capture information more comprehensively and exhibit greater adaptability across various environments. This parallel processing approach enables the Multi-Head Attention mechanism to identify and focus on important features across multiple environmental inputs, while also reducing sensitivity to noise and irrelevant information. In contrast, a single-head attention mechanism typically focuses on a specific aspect of the input, which may result in incomplete capture of environmental information, potentially affecting the accuracy of decision making and the robustness of the strategy.

Furthermore, the MAP-DDPG network optimizes the feature representation fusion strategy. Building on the features processed by the Multi-Head Attention mechanism, the framework also combines the original input features with the correlation vectors of environmental features extracted through the attention mechanism. This fusion strategy enhances the diversity and depth of feature representation by retaining the rich information of the original input features and integrating the correlation features extracted by the attention mechanism. Specifically, this fusion strategy brings two major improvements: first, the retention of original input features ensures that the model does not lose critical details that might impact path planning when dealing with complex and dynamic scenarios; second, by integrating the correlation vectors of input features from multiple environments, the model can identify and leverage common information across environments, thereby improving the coherence and accuracy of decision making.

In summary, by combining the Multi-Head Attention mechanism with the fusion of environmental feature correlation vectors, the MAP-DDPG network provides a more precise and highly generalizable solution for path planning, demonstrating significant advantages in handling complex environment transitions and practical applications.

3.3. Design of the Reward Function

In this study, a composite reward function was specifically designed for agent path planning in unknown, complex environments. This function works together with the DDPG framework, Multi-Head Attention (MHA), and Prioritized Experience Replay (PER) to improve the efficiency and safety of the mobile robot as it navigates toward the target. As illustrated in Figure 4, the MAP-DDPG network receives input features related to the surrounding environment, captured by a mobile robot equipped with a laser rangefinder (LiDAR). The average distance data measured by 10 laser beams are represented as $d = (d_{1}, d_{2}, \dots, d_{10})$. The variable $\theta_{yaw}$ indicates the current yaw angle of the robot, sourced from the odometer, while $\theta_{target}$ refers to the azimuth angle relative to the target. The angle $\theta_{heading}$ represents the difference between the agent’s current heading and the target direction. $\theta_{obs}$ denotes the azimuth angle of the nearest obstacle relative to the agent, with $d_{obs}$ indicating the distance to the closest obstacle. The variable $d_{current}$ represents the agent’s current distance to the target. The MAP-DDPG network generates the movement strategy for the mobile robot, including linear velocity $vel_{lin}$ and angular velocity $vel_{ang}$. The reward function’s design is detailed as follows.

3.3.1. Heading Adjustment Reward

To facilitate the mobile robot’s efficient progress along the optimal path, the heading adjustment reward is a crucial component. We calculate the heading adjustment reward $r_{tr}$ as shown in Formulas (11) and (12):

$$\theta_{normal} = \begin{cases} \theta_{heading} - 2\pi, & \text{if } \theta_{heading} > \pi \\ \theta_{heading} + 2\pi, & \text{if } \theta_{heading} < -\pi \\ \theta_{heading}, & \text{otherwise} \end{cases} \tag{11}$$

$$r_{tr} = -\exp\left(\left|\theta_{normal}\right|\right) \tag{12}$$

The angle $\theta_{heading}$ represents the difference between the agent’s current heading and the target direction (as shown in Figure 5). Formula (11) normalizes the angle $\theta_{heading}$ using conditional statements to ensure that the calculated angle difference remains within the range $[-\pi, \pi]$. Formula (12) defines the reward function $r_{tr}$ as a negative exponential function of the angle difference. The reward value therefore decreases rapidly as the angle difference increases, which incentivizes the agent to minimize the angle difference with the target direction. This design promotes quick and accurate heading adjustments, enhancing the precision and efficiency of navigation.
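As a minimal sketch of Formulas (11) and (12), the heading adjustment reward could be computed as follows. The function names and the sign convention `theta_target - theta_yaw` for the heading error are assumptions made for illustration; the pass-through branch for angles already inside $[-\pi, \pi]$ is implied by the stated purpose of Formula (11).

```python
import math


def normalize_angle(theta_heading: float) -> float:
    """Wrap an angle difference into [-pi, pi], following Formula (11)."""
    if theta_heading > math.pi:
        return theta_heading - 2.0 * math.pi
    if theta_heading < -math.pi:
        return theta_heading + 2.0 * math.pi
    return theta_heading  # already within range


def heading_reward(theta_target: float, theta_yaw: float) -> float:
    """Heading adjustment reward r_tr = -exp(|theta_normal|), Formula (12)."""
    # Assumed convention: heading error = target azimuth minus current yaw.
    theta_heading = theta_target - theta_yaw
    return -math.exp(abs(normalize_angle(theta_heading)))
```

With this form the reward is -1 when the robot faces the target exactly and becomes more negative as the heading error grows. A single wrap is enough as long as both input angles lie in $[-\pi, \pi]$, since their difference then lies in $[-2\pi, 2\pi]$.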

The heading adjustment reward plays a crucial role in the end-to-end path planning of mobile robots. By encouraging the robot to adjust directly toward the target, it not only effectively shortens the path length to the target point and improves path planning efficiency, but also supports the robot’s adaptability to new obstacles and path constraints in dynamic environments. Additionally, this reward mechanism reduces unnecessary turns along the path, resulting in a smoother trajectory and contributing to energy savings. It ensures that the robot can flexibly adjust its heading, efficiently reaching the target point.

3.3.2. Distance Ratio Reward

The distance ratio reward is designed to encourage the mobile robot to reduce its distance to the target point, thereby improving the efficiency of reaching the target. This reward mechanism ensures that the robot receives a greater reward as it gets closer to the target, motivating the robot to choose the shortest path. The mathematical expression for this reward can be represented as:

$$r_{dis} = C \times d_{rate} \tag{13}$$

$$d_{rate} = 2^{-\frac{d_{current}}{d_{goal}}} \tag{14}$$

In Formula (14), $d_{current}$ represents the actual distance from the robot’s current position to the target point, as shown in Figure 5, and $d_{goal}$ is the straight-line distance from the starting point to the target point. The negative sign in the exponent ensures that as $d_{current}$ decreases, meaning the robot is getting closer to the target point, the reward value increases. $C$ represents a constant used to amplify the reward. The use of an exponential function in this formula allows for non-linear scaling of the rewards at different distance levels, meaning that the benefits of reducing the distance increase as the robot gets closer to the target point. Each unit of distance reduced near the target yields a larger reward. This design promotes efficiency optimization in the path planning process, encouraging the robot to move quickly and directly toward the target point.
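The behaviour of Formulas (13) and (14) can be checked with a short sketch. The default value of `c` is a placeholder chosen for illustration, not a setting taken from the paper.

```python
def distance_ratio_reward(d_current: float, d_goal: float, c: float = 1.0) -> float:
    """Distance ratio reward r_dis = C * 2^(-d_current / d_goal)."""
    if d_goal <= 0.0:
        raise ValueError("d_goal must be positive")
    d_rate = 2.0 ** (-d_current / d_goal)  # Formula (14)
    return c * d_rate                      # Formula (13)
```

Under this definition the reward equals C/2 at the starting point (where $d_{current} = d_{goal}$) and rises to C at the target. Moving from 0.5 m to 0 m with $d_{goal} = 2$ m gains more reward than moving from 2 m to 1.5 m, which matches the non-linear scaling described above.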

3.3.3. Obstacle Penalty

In the end-to-end path planning problem for mobile robots, ensuring the safety of the path is just as crucial as optimizing it to improve the efficiency of reaching the target point. To this end, we have introduced an obstacle penalty term into the reward function, aimed at preventing the robot from colliding with obstacles, thereby enhancing the overall safety of the path-planning process. The obstacle penalty is calculated using Formula (15) as follows:

$$r_{obs} = -\alpha \cdot e^{-\beta \cdot d_{obs}} \cdot \left|\cos(\theta_{obs})\right| \tag{15}$$

Here, $d_{obs}$ represents the distance to the nearest obstacle, while $\theta_{obs}$ is the azimuth angle of the nearest obstacle relative to the agent’s forward direction. When the obstacle is directly in front of the robot, $\theta_{obs} \approx 0$ and $|\cos(\theta_{obs})|$ approaches 1, resulting in the maximum penalty, as the obstacle directly obstructs the forward path. The smaller the obstacle distance $d_{obs}$, the larger the penalty, reflecting the need for urgent obstacle avoidance. The parameter $\alpha$ controls the basic penalty intensity, while $\beta$ adjusts the penalty’s decay with distance; both can be tuned according to the specific application environment to accommodate different navigation tasks and safety requirements. This mechanism effectively enhances the safety of path planning, reduces the likelihood of collisions, and ensures the stable operation of mobile robots in complex environments, enabling them to complete tasks efficiently.
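A minimal sketch of Formula (15) is given below. The default values of `alpha` and `beta` are placeholders for illustration only.

```python
import math


def obstacle_penalty(d_obs: float, theta_obs: float,
                     alpha: float = 1.0, beta: float = 1.0) -> float:
    """Obstacle penalty r_obs = -alpha * exp(-beta * d_obs) * |cos(theta_obs)|."""
    return -alpha * math.exp(-beta * d_obs) * abs(math.cos(theta_obs))
```

The penalty reaches its largest magnitude, `alpha`, for an obstacle at zero distance straight ahead, and vanishes for an obstacle exactly to the side ($\theta_{obs} = \pi/2$). Because the formula uses the absolute value of the cosine, an obstacle directly behind the robot ($\theta_{obs} \approx \pi$) yields the same angular factor as one directly in front.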

3.3.4. Target Reaching Reward

In mobile robot path planning, reaching the target point is the ultimate goal of the entire task. To encourage the robot to accurately and swiftly reach the designated location, it is essential to design a reward mechanism specifically for reaching the target point. This reward is defined as shown in Formula (16).

$$r_{arr} = \begin{cases} C_{1}, & \text{if } d_{arr} > d_{current} \\ 0, & \text{otherwise} \end{cases} \tag{16}$$

Here, $r_{arr}$ represents the reward function for reaching the target point, $d_{current}$ denotes the current distance of the agent from the target point, and $d_{arr}$ is a threshold distance. If $d_{current}$ is less than this threshold, the agent is considered to have reached the target point and the reward is given; otherwise, the reward is set to zero.

3.3.5. Collision Penalty

In the mobile robot path-planning task, in addition to reward mechanisms, a well-designed penalty mechanism is equally critical, especially for safety and collision avoidance. The collision penalty is designed to prevent the robot from making contact with obstacles or walls in the environment, and is defined in Formula (17):

$$r_{col} = \begin{cases} C_{2}, & \text{if } d_{col} > d_{obs} \\ 0, & \text{otherwise} \end{cases} \tag{17}$$

Here, $r_{col}$ represents the penalty function for colliding with an obstacle, $d_{obs}$ denotes the current distance of the agent from the nearest obstacle, and $d_{col}$ is a threshold distance. If $d_{obs}$ is less than this threshold, the agent is considered to have collided with the obstacle and the penalty is applied; otherwise, the penalty is set to zero.
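The two threshold-based terms, Formulas (16) and (17), can be sketched together. All constants below are hypothetical: the arrival threshold is assumed to match the 10 cm success radius mentioned in Section 4.1, the collision threshold and the magnitudes of $C_{1}$ and $C_{2}$ are placeholders, and $C_{2}$ is assumed to be negative because it acts as a penalty. Ending the episode on either event is also an assumption.

```python
D_ARR = 0.10   # arrival threshold in metres (assumed, see lead-in)
D_COL = 0.20   # collision threshold in metres (hypothetical)
C1 = 100.0     # arrival reward C_1 (hypothetical)
C2 = -100.0    # collision penalty C_2 (hypothetical, assumed negative)


def arrival_reward(d_current: float, d_arr: float = D_ARR, c1: float = C1) -> float:
    """Formula (16): C_1 once the robot is within d_arr of the target."""
    return c1 if d_current < d_arr else 0.0


def collision_penalty(d_obs: float, d_col: float = D_COL, c2: float = C2) -> float:
    """Formula (17): C_2 once the nearest obstacle is closer than d_col."""
    return c2 if d_obs < d_col else 0.0


def episode_done(d_current: float, d_obs: float) -> bool:
    """Assumed termination rule: stop on arrival or on collision."""
    return d_current < D_ARR or d_obs < D_COL
```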

3.3.6. Total Reward

In the end-to-end path-planning model proposed in this study, the total reward is calculated by integrating the following key components, as shown in Formula (18):

$$r_{total} = r_{tr} + r_{dis} + r_{obs} + r_{arr} + r_{col} \tag{18}$$

Here, $r_{tr}$ encourages the robot to make effective heading adjustments to face the target point more directly, optimizing path efficiency and reducing unnecessary turns. $r_{dis}$ rewards the robot for reducing the distance to the target point, motivating the robot to choose the shortest feasible path and improving the efficiency of reaching the target. $r_{obs}$ imposes a penalty when the robot gets too close to obstacles, prompting the robot to maintain a safe distance and avoid collisions, thereby ensuring the safety of the path. $r_{arr}$ provides a one-time significant positive reward when the robot successfully reaches the target point, reinforcing the robot’s ability to complete the task. Finally, $r_{col}$ applies a substantial penalty if the robot collides with an obstacle, enhancing the robot’s learning efficiency in avoiding obstacles and ensuring path safety.
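To show how the five terms combine in Formula (18), the sketch below evaluates each term for a single time step and returns them alongside their sum. Every default in `RewardParams` is a placeholder for illustration, not a value reported in this paper, and $C_{2}$ is assumed to be negative.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RewardParams:
    # Placeholder values for illustration only.
    c: float = 1.0       # C in Formula (13)
    alpha: float = 1.0   # penalty intensity in Formula (15)
    beta: float = 1.0    # penalty decay rate in Formula (15)
    c1: float = 100.0    # C_1 in Formula (16)
    c2: float = -100.0   # C_2 in Formula (17), assumed negative
    d_arr: float = 0.10  # arrival threshold in metres
    d_col: float = 0.20  # collision threshold in metres


def reward_terms(theta_heading, d_current, d_goal, d_obs, theta_obs,
                 params: Optional[RewardParams] = None):
    """Return each reward term and their sum r_total for one time step."""
    p = params or RewardParams()
    # Formula (11): wrap the heading error into [-pi, pi].
    if theta_heading > math.pi:
        theta_heading -= 2.0 * math.pi
    elif theta_heading < -math.pi:
        theta_heading += 2.0 * math.pi
    terms = {
        "r_tr": -math.exp(abs(theta_heading)),            # Formula (12)
        "r_dis": p.c * 2.0 ** (-d_current / d_goal),      # Formulas (13), (14)
        "r_obs": -p.alpha * math.exp(-p.beta * d_obs)
                 * abs(math.cos(theta_obs)),              # Formula (15)
        "r_arr": p.c1 if d_current < p.d_arr else 0.0,    # Formula (16)
        "r_col": p.c2 if d_obs < p.d_col else 0.0,        # Formula (17)
    }
    terms["r_total"] = sum(terms.values())                # Formula (18)
    return terms
```

Returning the individual terms as well as the total makes it easier to inspect which component dominates at a given state when tuning the constants.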

4. Experimental Results and Analysis

4.1. Simulation Experiment Environment Setup

To validate the effectiveness of the MAP-DDPG algorithm, we conducted experiments on a platform equipped with an Intel i7-13500 CPU (Intel, Santa Clara, CA, USA) and an Nvidia GeForce RTX 4080 GPU (NVIDIA, Santa Clara, CA, USA). The relevant experimental parameters are detailed in Table 1. As illustrated in Figure 6, we constructed a mobile robot navigation simulation scenario using ROS 1/Gazebo 7, which simulates a realistic indoor environment. Based on this scenario, we developed a mobile robot model incorporating both dynamic and kinematic properties. The algorithm was implemented using Python 3.6 and the PyTorch framework.

In the simulation scenario, the black dot represents the mobile robot (its specific appearance is shown in Figure 7a), the blue area indicates the detection range of the laser sensor, and the gray areas represent obstacles. Among these obstacles, the cylindrical object encircled by a green ring represents a moving obstacle that oscillates along the y-axis, while the cuboid encircled by a yellow rectangle, along with other cuboids, represents static obstacles. The yellow sections represent walls, and the red square indicates the target point. The robot is considered to have successfully reached the target point when the distance between the robot’s center and the center of the red square is less than 10 cm. To ensure the model’s adaptability and generalization across various environments, we adopted a multi-environment training strategy. In this strategy, the robot is trained in multiple different complex environments, each featuring a variety of dynamic and static obstacles. The terrain in each environment varies, simulating the diversity found in real-world scenarios. In these environments, 7–10 static obstacles and 1–2 dynamic obstacles are randomly generated, with their positions also being random. By training in these diverse environments, our model can effectively learn and optimize path-planning strategies, thereby exhibiting stronger generalization capabilities in real-world scenarios.
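The randomized layouts described above can be sketched as a simple sampler. This is only an illustration of the stated obstacle counts and success radius: the arena dimensions, the number of environments, and the uniform placement without overlap checks are assumptions, and the sketch does not reproduce the Gazebo world generation used in the experiments.

```python
import math
import random

SUCCESS_RADIUS_M = 0.10  # success threshold stated in the text: 10 cm


def sample_environment(rng: random.Random, width: float = 10.0, height: float = 10.0):
    """Draw one layout with 7-10 static and 1-2 dynamic obstacles.

    The arena size is a placeholder, not a value from the paper.
    """
    def random_point():
        return (rng.uniform(0.0, width), rng.uniform(0.0, height))

    return {
        "static": [random_point() for _ in range(rng.randint(7, 10))],
        "dynamic": [random_point() for _ in range(rng.randint(1, 2))],
    }


def reached_target(robot_xy, target_xy) -> bool:
    """True when the robot centre is within 10 cm of the target centre."""
    dx = robot_xy[0] - target_xy[0]
    dy = robot_xy[1] - target_xy[1]
    return math.hypot(dx, dy) < SUCCESS_RADIUS_M


rng = random.Random(0)
training_envs = [sample_environment(rng) for _ in range(4)]  # 4 is arbitrary
```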

Figure 8 shows the process within one of the training environments, where the red line represents the robot’s path. The area enclosed by the pink square marks the starting position for path planning, while the red square indicates the target point. In Figure 8b, the robot successfully avoids the moving obstacle; in Figure 8c, the robot is shown avoiding a wall; and finally, in Figure 8d, the robot reaches the target point. This multi-environment training approach is expected to enable our model to perform exceptionally well when facing various challenges in the real world, demonstrating enhanced adaptability and generalization capabilities.

This study evaluates the proposed path-planning algorithm based on three key metrics: success rate, time to reach the target, and training time. First, the success rate and time to reach the target reflect the generalization and execution efficiency of the proposed algorithm in different new environments. Specifically, the success rate measures the algorithm’s ability to reliably guide the robot from the starting point to the target in new environments. A higher success rate indicates better adaptability and reliability of the algorithm across various environments. The time to reach the target represents the efficiency of the generated path when the robot successfully reaches the target point; a shorter time indicates that the algorithm can plan a path more quickly and effectively. On the other hand, training time is used to assess the training efficiency of the algorithm, reflecting the time required for the algorithm to achieve an optimal strategy during training. A shorter training time indicates faster convergence during the learning process, demonstrating higher learning efficiency.
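The two test-time metrics can be computed from per-trial records as in the sketch below. The record layout `(reached_target, elapsed_seconds)` is an assumption for illustration; following the definition above, the time to reach the target is averaged over successful trials only.

```python
from statistics import mean


def summarize_trials(trials):
    """Summarize trials given as (reached_target: bool, elapsed_seconds: float)."""
    trials = list(trials)
    if not trials:
        raise ValueError("at least one trial is required")
    times = [seconds for reached, seconds in trials if reached]
    return {
        "success_rate": len(times) / len(trials),
        # Averaged over successful trials only; NaN if none succeeded.
        "mean_time_s": mean(times) if times else float("nan"),
    }
```

For example, three successes out of four trials with times of 30 s, 40 s, and 35 s give a success rate of 0.75 and a mean time of 35 s.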

4.2. Algorithm Performance Analysis

To validate the performance of the proposed algorithm, we conducted simulations with 2200 episodes of training in the Gazebo simulation environment and compared the results of four different algorithms: the baseline DDPG algorithm; the M-DDPG algorithm trained in multiple complex environments; the MA-DDPG algorithm, which incorporates the Multi-Head Attention (MHA) mechanism; and the MAP-DDPG algorithm, which further introduces Prioritized Experience Replay (PER). Through training and testing, we analyzed the performance of each algorithm in the path-planning task.

Figure 9 illustrates the average reward returned every 10,000 steps for the basic DDPG algorithm and the proposed M-DDPG, MA-DDPG, and MAP-DDPG algorithms during path-planning training in the complex simulation environment depicted in Figure 6. The blue line represents the basic DDPG algorithm, the green line represents the M-DDPG algorithm, the yellow line represents the MA-DDPG algorithm, and the red line represents the MAP-DDPG algorithm.

As shown in Figure 9, the M-DDPG algorithm converges relatively slowly because training a single DDPG model across multiple environments increases training complexity. In contrast, the proposed MAP-DDPG algorithm improves the convergence rate by 15.3% compared to the M-DDPG algorithm. This improvement indicates that, in the MAP-DDPG algorithm, the Prioritized Experience Replay (PER) mechanism enhances convergence speed by prioritizing the training of samples with high TD errors, ensuring that the model focuses on the experiences most critical for policy improvement. Additionally, the Multi-Head Attention (MHA) mechanism, by weighting the input information, enhances the model’s ability to perceive environmental changes, thereby improving its adaptability across different environments and further increasing training efficiency. Specifically, the MAP-DDPG algorithm requires 345,856 training steps, while the DDPG algorithm requires 353,264 steps (about 2.1% fewer), demonstrating that MAP-DDPG also converges faster than the single-environment DDPG algorithm (as shown in Table 2).
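The idea of prioritizing samples with high TD errors can be illustrated with a small replay buffer. This sketch assumes the common proportional variant of PER, in which transition $i$ is drawn with probability proportional to $(|\delta_i| + \epsilon)^{\alpha}$ and corrected with importance weights; the class name, the hyperparameter values, and the normalization of weights within each batch are illustrative choices rather than details confirmed for MAP-DDPG.

```python
import random


class PrioritizedReplay:
    """Minimal proportional prioritized replay buffer (illustrative only)."""

    def __init__(self, capacity, alpha=0.6, eps=1e-3, seed=None):
        self.capacity = capacity
        self.alpha = alpha      # how strongly priorities skew sampling
        self.eps = eps          # keeps zero-error samples reachable
        self.data = []
        self.priorities = []
        self.pos = 0
        self.rng = random.Random(seed)

    def add(self, transition):
        # New transitions get the current maximum priority so that they
        # are likely to be replayed at least once.
        priority = max(self.priorities, default=1.0)
        if len(self.data) < self.capacity:
            self.data.append(transition)
            self.priorities.append(priority)
        else:
            self.data[self.pos] = transition
            self.priorities[self.pos] = priority
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        scaled = [p ** self.alpha for p in self.priorities]
        total = sum(scaled)
        probs = [s / total for s in scaled]
        idxs = self.rng.choices(range(len(self.data)), weights=probs, k=batch_size)
        # Importance weights offset the sampling bias; normalized per batch.
        n = len(self.data)
        weights = [(n * probs[i]) ** (-beta) for i in idxs]
        max_w = max(weights)
        weights = [w / max_w for w in weights]
        return idxs, [self.data[i] for i in idxs], weights

    def update(self, idxs, td_errors):
        # After a learning step, refresh priorities with the new TD errors.
        for i, delta in zip(idxs, td_errors):
            self.priorities[i] = abs(delta) + self.eps
```

In a DDPG-style loop, the critic's absolute TD errors for the sampled batch would be passed to `update`, and the returned weights would scale the per-sample critic loss.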

To further validate the effectiveness of these four algorithms in new, complex environments, we conducted 50 tests targeting multiple goal points in the simulation environment shown in Figure 10. Table 3 presents a comparison of the success rates and the time required to reach the target point for different algorithms. These results reveal the differences in performance among the algorithms in new environments, particularly in terms of efficiency and effectiveness in path planning tasks. The data in the table indicate that the proposed MAP-DDPG algorithm performs best, with a 30% improvement in success rate and a reduction of 23.7 s in the average time to reach the target point compared to the traditional DDPG algorithm.

Specifically, the success rate of DDPG in new environments was 58%, with an average time to reach the target of 52.3 s. Under multi-environment training, the M-DDPG algorithm’s success rate increased by 10%, and the average time to reach the target was reduced by 8.2 s. This suggests that multi-environment training can enhance the generalization and execution efficiency of path planning algorithms in new, complex environments. Further incorporating the Multi-Head Attention (MHA) mechanism into the M-DDPG framework, the MA-DDPG algorithm improved the success rate by 18% compared to M-DDPG and reduced the average time to reach the target by 13.9 s. This demonstrates that the Multi-Head Attention mechanism significantly enhances the model’s perception of environmental changes through weighted processing of input information, thereby improving its generalization and execution efficiency in different environments.

By comparing the performances of different algorithms in new environments, this study validates the significant advantages of the MAP-DDPG algorithm in path-planning tasks. The introduction of the Multi-Head Attention mechanism (MHA) and Prioritized Experience Replay (PER) mechanism significantly improved the model’s success rate and execution efficiency, further proving the effectiveness of these enhancements for path planning in complex environments.

To validate the path-planning performance of the model trained in the simulation environment shown in Figure 6 in new and complex environments, 30 tests were conducted in multiple new environments. Figure 10 illustrates the movement trajectories and performance of the mobile robot in new environments using three different path-planning algorithms: DDPG, M-DDPG, and MAP-DDPG. During the experiments, Rviz subscribed to the odometry messages published by the mobile robot’s Odom, visualizing the robot’s pose information at each moment as green arrows.

As shown in Figure 10a, when using the DDPG algorithm for path planning, the robot was unable to effectively avoid dynamic obstacles, indicating that the traditional DDPG algorithm still faces significant challenges in handling dynamic obstacles. In contrast, Figure 10b presents the path planning results of the M-DDPG algorithm. Although M-DDPG was able to successfully avoid both dynamic and static obstacles and reach the target point, the planned path was longer and more convoluted, indicating that the algorithm has lower efficiency in handling unknown complex environments, and the flexibility and optimization of the path planning are still suboptimal. Figure 10c shows the path planning results using the MAP-DDPG algorithm. Clearly, the path planned by the MAP-DDPG algorithm was the shortest and the overall path was smoother, avoiding sharp turns. This demonstrates that MAP-DDPG has significantly better adaptability and path-planning capabilities in new environments compared to the other two algorithms.

These results demonstrate that the strategy of parallel training in multiple complex environments allows the model to be exposed to a wide range of possible scenarios and situations, thereby exhibiting stronger generalization capabilities when faced with new environments. This training method not only ensures excellent performance in known environments but also enables the model to function effectively in unknown or unseen environments. Additionally, the introduction of the Multi-Head Attention (MHA) mechanism enables the model to more accurately identify and process critical environmental information, significantly enhancing the model’s perception of environmental changes, thereby improving its adaptability in complex environments. This improvement further validates the significant advantages of the MAP-DDPG algorithm in complex path-planning tasks.

4.3. Comparison and Analysis with Other Algorithms

To verify the performance of the algorithm proposed in this paper, we conducted experiments comparing it with several algorithms from recent literature [34,35]. In the simulation environment shown in Figure 6, each algorithm underwent 2200 episodes of training. Subsequently, to evaluate the performance of these three algorithms in new, more complex environments, we tested each of them 50 times, targeting multiple goal points in the simulation environment illustrated in Figure 11. Table 4 presents a comparison of the success rates and the time required to reach the goal points for the different algorithms, clearly reflecting the performance differences in handling complex environments, particularly in terms of efficiency and effectiveness in path-planning tasks.

As shown in the table, the proposed MAP-DDPG algorithm performed the best in terms of success rate, reaching 86%. In comparison, the algorithm in [34] had a success rate of 72%, and the algorithm in [35] had a success rate of 70%. This indicates that MAP-DDPG significantly outperforms the other two algorithms in terms of success rate in complex environments. Moreover, MAP-DDPG also demonstrated a clear advantage in the average time to reach the target point, with an average of 35.2 s, while the algorithms in [34] and [35] took 43.6 s and 40.5 s, respectively. This shows that MAP-DDPG not only surpasses the other algorithms in success rate, but also excels in path-planning efficiency. By comparing the performance of the proposed MAP-DDPG with other state-of-the-art algorithms in complex environments, this study verifies the significant advantages of MAP-DDPG in path planning tasks. The introduction of the Multi-Head Attention mechanism and Prioritized Experience Replay has effectively enhanced the model’s success rate and execution efficiency, demonstrating the algorithm’s outstanding performance in complex path planning tasks.
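The margins implied by these reported figures can be worked out directly. The snippet below uses only the success rates and average times quoted in the paragraph above; the dictionary layout and function name are illustrative.

```python
# (success rate in %, average time to target in s), as quoted in the text.
reported = {
    "MAP-DDPG": (86, 35.2),
    "[34]": (72, 43.6),
    "[35]": (70, 40.5),
}


def margins(baseline):
    """Return MAP-DDPG's lead in percentage points and in seconds."""
    ours_rate, ours_time = reported["MAP-DDPG"]
    rate, time_s = reported[baseline]
    return ours_rate - rate, round(time_s - ours_time, 1)


for name in ("[34]", "[35]"):
    points, seconds = margins(name)
    print(f"vs {name}: +{points} percentage points, {seconds} s faster")
```

This gives a lead of 14 percentage points and 8.4 s over the algorithm in [34], and 16 percentage points and 5.3 s over the algorithm in [35].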

To verify the path-planning performance of the model trained in the simulation environment shown in Figure 6 in new, complex environments, 30 tests were conducted in various new environments. Figure 11 illustrates the trajectories and performance of the mobile robot in the new environments using three different path-planning algorithms: MAP-DDPG, the algorithm proposed in [34], and the algorithm proposed in [35]. During the experiments, Rviz subscribed to the odometry data published by the mobile robot and visualized the robot’s pose at each moment as green arrows. As shown in Figure 11, the path produced by the algorithm proposed in [34] was the longest; it did not shorten the path as effectively as the algorithm in [35] or the MAP-DDPG algorithm proposed in this paper. Additionally, in terms of path flexibility, the MAP-DDPG algorithm clearly outperformed the other two algorithms.

These results indicate that the MAP-DDPG algorithm demonstrates significantly better adaptability and path-planning capabilities in new, complex environments compared to other algorithms in the literature. Through parallel training in multiple complex environments, the model exhibits stronger generalization ability when facing new environments. The introduction of the Multi-Head Attention (MHA) mechanism allows the model to more effectively identify and process key environmental features, improving its ability to perceive changes in the environment. This significantly enhances the model’s adaptability and path-planning performance in complex environments.

4.4. Real-World Performance of the Algorithm

We further applied the MAP-DDPG and DDPG algorithms to two real-world scenarios: Scenario 1 (static obstacles) and Scenario 2 (both static and dynamic obstacles). The experimental results are shown in Figure 12, where the red line represents the complete navigation path of the mobile robot, the yellow box indicates the robot’s starting point, and the red circle marks the target point. The blue lines represent the paths traversed by the dynamic obstacles. In real-world environments, issues such as lower sensor accuracy and network latency mean that the performance of models trained in simulation does not always transfer directly. In Scenario 1, both algorithms successfully completed the navigation task, but the path generated by the MAP-DDPG algorithm was more optimized, demonstrating better path-planning capability. In Scenario 2, only the MAP-DDPG algorithm successfully completed the navigation task, while the DDPG algorithm failed to avoid the dynamic obstacles. These experimental results indicate that, compared to the DDPG algorithm, the MAP-DDPG algorithm performs better in real-world environments, with a higher success rate and better path-planning capability. Overall, the MAP-DDPG algorithm significantly outperforms the DDPG algorithm in terms of navigation and path optimization, demonstrating its potential application value in complex real-world environments.

5. Discussion

The method proposed in this paper has significant potential for applications in the field of autonomous driving. Autonomous vehicles face challenges similar to those encountered by mobile robots when performing path planning in complex road environments, including handling dynamic obstacles, real-time path optimization, and environmental perception. The MAP-DDPG algorithm, by integrating Multi-Head Attention (MHA) and Prioritized Experience Replay (PER), effectively addresses the complexities of dynamic and unpredictable environmental features, enabling autonomous vehicles to accurately perceive their surroundings and make optimal path planning decisions. Specifically, in autonomous driving scenarios, MHA helps vehicles manage multiple information streams simultaneously in congested urban roads, complex intersections, or highways. These information streams include other vehicles, pedestrians, traffic signals, and road signs. The MHA mechanism allows the algorithm to focus on these critical inputs in parallel and dynamically adjust decision-making strategies based on their importance, ensuring stable and safe driving paths in complex environments. Meanwhile, PER facilitates learning from key past experiences by prioritizing the replay of critical experience data, allowing autonomous systems to adapt quickly to various road conditions. This capability is especially valuable in emergency situations, where efficient collision avoidance and hazard management are crucial.

In other scenarios, such as environments with U-shaped and H-shaped obstacles, path-planning tasks pose greater challenges. These obstacle shapes often create enclosed or narrow passages, making it easier for robots to become trapped. The MAP-DDPG algorithm, by incorporating MHA and PER, is designed to handle such complex environments. In these cases, MHA can identify the geometric features of U-shaped and H-shaped obstacles by processing input features in parallel, guiding the model to focus on the most crucial environmental information and avoiding dead ends or enclosed areas. Additionally, the PER mechanism prioritizes training on experience samples with high TD errors, reinforcing the model’s ability to avoid obstacles in such challenging environments. Through iterative learning, as seen in methods similar to those in [36], MAP-DDPG enhances obstacle avoidance and adjusts its path-planning strategy in real time based on the shape of the obstacles, preventing the robot from becoming stuck in U-shaped or H-shaped regions. Moreover, by effectively perceiving the boundaries of obstacles, the algorithm can plan more flexible paths, reducing the risk of collisions in narrow passages and improving navigation efficiency and safety.

However, there are limitations to this approach. While the MAP-DDPG algorithm has been trained in parallel across multiple environments to improve generalization, in certain specific environments, its convergence speed may be relatively slow. Future work could optimize training strategies, such as introducing hierarchical learning or more efficient learning mechanisms in specific environments, to accelerate convergence. Additionally, the integration of MHA and PER increases the algorithm’s computational resource requirements, especially when dealing with high-dimensional state spaces, which may impact real-time navigation efficiency. Future research could explore model compression techniques or lightweight attention mechanisms to reduce computational complexity, thereby enhancing the real-time performance and execution efficiency of the algorithm in practical applications.

6. Conclusions

By integrating the concepts of multi-environment parallel training, Multi-Head Attention (MHA), and Prioritized Experience Replay (PER), this paper proposes the MAP-DDPG algorithm and applies it to mobile robot path planning. The theoretical analysis highlights how multi-environment training and attention mechanisms enhance the model’s generalization capability and execution efficiency, demonstrating the algorithm’s effectiveness in complex environments. Experimental results indicate that, compared to the traditional DDPG algorithm, MAP-DDPG performs significantly better in both simulation and real-world environments. In simulation environments, the MAP-DDPG algorithm achieved a 30% higher success rate and reduced the time to reach the target by an average of 23.7 s compared to the DDPG algorithm, showcasing superior path-planning efficiency and generalization capability. Additionally, in dynamic environments, MAP-DDPG effectively avoids dynamic obstacles and generates smoother paths. In summary, the MAP-DDPG algorithm proposed in this paper demonstrates significant advantages in success rate, path smoothness, and obstacle avoidance capability, offering an efficient and reliable solution for robot path planning in complex environments. Future work will focus on further enhancing the algorithm’s robustness and exploring the integration of sensor data fusion methods to improve its applicability in even more complex and uncertain environments.

Author Contributions

Conceptualization, Y.J. and M.Y.; methodology, J.C. and Y.J.; validation, Y.J.; formal analysis, J.C., Y.J., M.Y. and H.P.; investigation, Y.J.; resources, Y.J.; data curation, Y.J.; writing—original draft preparation, J.C., Y.J., M.Y. and H.P.; writing—review and editing, J.C., M.Y. and H.P.; visualization, Y.J.; supervision, J.C., M.Y. and H.P.; project administration, J.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Guangxi Science and Technology Development Project (AB21220038, AB24010164, AB23026048), the National Natural Science Foundation of China (NSFC) (No. 61873269), the Beijing Natural Science Foundation (J210012, L192005), and the Hebei Natural Science Foundation (F2021205014).

Data Availability Statement

The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.

Conflicts of Interest

Author Hongren Pan was employed by Guilin Shandi Network Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Khan, H.; Iqbal, J.; Baizid, K.; Zielinska, T. Longitudinal and lateral slip control of autonomous wheeled mobile robot for trajectory tracking. Front. Inf. Technol. Electron. Eng. 2015, 16, 166–172.
Chung, M.A.; Lin, C.W. An Improved Localization of Mobile Robotic System Based on AMCL Algorithm. IEEE Sens. J. 2022, 22, 900–908.
Guo, G.; Zhao, S.J. 3D Multi-Object Tracking with Adaptive Cubature Kalman Filter for Autonomous Driving. IEEE Trans. Intell. Veh. 2023, 8, 512–519.
Huang, Y.W.; Shan, T.X.; Chen, F.F.; Englot, B. DiSCo-SLAM: Distributed Scan Context-Enabled Multi-Robot LiDAR SLAM With Two-Stage Global-Local Graph Optimization. IEEE Robot. Autom. Lett. 2022, 7, 1150–1157.
Saranya, C.; Unnikrishnan, M.; Ali, S.A.; Sheela, D.S.; Lalithambika, V.R. Terrain Based D* Algorithm for Path Planning. In Proceedings of the 4th IFAC Conference on Advances in Control and Optimization of Dynamical Systems (ACODS 2016), Tiruchirappalli, India, 1–5 February 2016; pp. 178–182.
Jeong, I.B.; Lee, S.J.; Kim, J.H. Quick-RRT*: Triangular inequality-based implementation of RRT* with improved initial solution and convergence rate. Expert Syst. Appl. 2019, 123, 82–90.
Xu, C.; Xu, Z.B.; Xia, M.Y. Obstacle Avoidance in a Three-Dimensional Dynamic Environment Based on Fuzzy Dynamic Windows. Appl. Sci. 2021, 11, 504.
Wu, J.F.; Ma, X.H.; Peng, T.R.; Wang, H.J. An Improved Timed Elastic Band (TEB) Algorithm of Autonomous Ground Vehicle (AGV) in Complex Environment. Sensors 2021, 21, 8312.
Bellman, R. A Markovian decision process. J. Math. Mech. 1957, 6, 679–684.
Watkins, C.J.C.H. Learning from Delayed Rewards. Ph.D. Thesis, King’s College, Cambridge, UK, 1989.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971.
Heess, N.; Hunt, J.J.; Lillicrap, T.P.; Silver, D. Memory-based control with recurrent neural networks. arXiv 2015, arXiv:1512.04455.
Fujimoto, S.; Hoof, H.; Meger, D. Addressing function approximation error in actor-critic methods. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 1587–1596.
Zou, Q.; Xiong, K.; Hou, Y. An end-to-end learning of driving strategies based on DDPG and imitation learning. In Proceedings of the 2020 Chinese Control and Decision Conference (CCDC), Hefei, China, 22–24 August 2020; pp. 3190–3195.
Rao, J.; Wang, J.; Xu, J.; Zhao, S. Optimal control of nonlinear system based on deterministic policy gradient with eligibility traces. Nonlinear Dyn. 2023, 111, 20041–20053.
Chu, Z.; Wang, F.; Lei, T.; Luo, C. Path planning based on deep reinforcement learning for autonomous underwater vehicles under ocean current disturbance. IEEE Trans. Intell. Veh. 2022, 8, 108–120.
Lowe, R.; Wu, Y.; Tamar, A.; Harb, J.; Abbeel, P.; Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. Adv. Neural Inf. Process. Syst. 2017, 30, 6382–6393.
Henderson, P.; Islam, R.; Bachman, P.; Pineau, J.; Precup, D.; Meger, D. Deep reinforcement learning that matters. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
Mnih, V.; Heess, N.; Graves, A. Recurrent models of visual attention. Adv. Neural Inf. Process. Syst. 2014, 27, 2204–2212.
Jaderberg, M.; Simonyan, K.; Zisserman, A. Spatial transformer networks. Adv. Neural Inf. Process. Syst. 2015, 28, 2017–2025.
Niu, Z.; Zhong, G.; Yu, H. A review on the attention mechanism of deep learning. Neurocomputing 2021, 452, 48–62.
You, Q.; Jin, H.; Wang, Z.; Fang, C.; Luo, J. Image captioning with semantic attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 4651–4659.
Sun, X.; Lu, W. Understanding attention for text classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 3418–3428.
Park, Y.M.; Hassan, S.S.; Tun, Y.K.; Han, Z.; Hong, C.S. Joint trajectory and resource optimization of MEC-assisted UAVs in sub-THz networks: A resources-based multi-agent proximal policy optimization DRL with attention mechanism. IEEE Trans. Veh. Technol. 2023, 73, 2003–2016.
Peng, Y.; Tan, G.; Si, H.; Li, J. DRL-GAT-SA: Deep reinforcement learning for autonomous driving planning based on graph attention networks and simplex architecture. J. Syst. Archit. 2022, 126, 102505.
Li, Y.; Long, G.; Shen, T.; Zhou, T.; Yao, L.; Huo, H.; Jiang, J. Self-attention enhanced selective gate with entity-aware embedding for distantly supervised relation extraction. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 8269–8276.
Shiri, H.; Seo, H.; Park, J.; Bennis, M. Attention-based communication and control for multi-UAV path planning. IEEE Wirel. Commun. Lett. 2022, 11, 1409–1413.
Schaul, T.; Quan, J.; Antonoglou, I.; Silver, D. Prioritized experience replay. arXiv 2015, arXiv:1511.05952.
Wang, G.; Lu, S.; Giannakis, G.; Tesauro, G.; Sun, J. Decentralized TD tracking with linear function approximation and its finite-time analysis. Adv. Neural Inf. Process. Syst. 2020, 33, 13762–13772.
Wu, D.; Dong, X.; Shen, J.; Hoi, S.C. Reducing estimation bias via triplet-average deep deterministic policy gradient. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 4933–4945.
Zheng, Y.; Lin, L.; Zhang, T.; Chen, H.; Duan, Q.; Xu, Y.; Wang, X. Enabling robust DRL-driven networking systems via teacher-student learning. IEEE J. Sel. Areas Commun. 2021, 40, 376–392.
Chen, J.; Xing, H.; Xiao, Z.; Xu, L.; Tao, T. A DRL agent for jointly optimizing computation offloading and resource allocation in MEC. IEEE Internet Things J. 2021, 8, 17508–17524.
Li, P.; Ding, X.; Sun, H.; Zhao, S.; Cajo, R. Research on dynamic path planning of mobile robot based on improved DDPG algorithm. Mob. Inf. Syst. 2021, 2021, 5169460.
Gong, H.; Wang, P.; Ni, C.; Cheng, N. Efficient path planning for mobile robot based on deep deterministic policy gradient. Sensors 2022, 22, 3579.
Zohaib, M.; Pasha, S.M.; Javaid, N.; Salaam, A.; Iqbal, J. An improved algorithm for collision avoidance in environments having U and H shaped obstacles. Stud. Inform. Control 2014, 23, 97–106.

Figure 1. The structure of the DDPG algorithm.

Figure 2. Structure of the PER-DDPG algorithm with multi-environment parallel training. In the simulation environment on the left, black represents the robot, blue represents the robot’s laser, the red square is the target point, the gray squares are obstacles, and yellow represents the walls.

Figure 3. MAP-DDPG network architecture diagram. In the simulation environment, black represents the robot, blue represents the robot’s laser, the red square is the target point, the gray squares are obstacles, and yellow represents the walls.

Figure 4. MHA network architecture diagram.

Figure 5. The basic framework of path planning in this paper.

Figure 6. Parallel training in various complex environments (with dynamic and static obstacles). (a) Environment 1. (b) Environment 2. (c) Environment 3. (d) Environment 4.

Figure 7. Gazebo and real-world two-wheel differential drive robot models. (a) Burger simulation robot. (b) MR600 real robot.

Figure 8. MAP-DDPG algorithm training process. (a) Starting point of the training. (b) Avoiding a dynamic obstacle. (c) Avoiding a protruding wall obstacle. (d) Reaching the target point.

Figure 9. Average reward during training for four different algorithms in the simulation environments of Figure 6.

Figure 10. Path-planning performance of different algorithms in simulation environments (in new environments). (a) Path plot and odometry of the DDPG algorithm. (b) Path plot and odometry of the M-DDPG algorithm. (c) Path plot and odometry of the MAP-DDPG algorithm.

Figure 11. Path-planning performance of different algorithms in simulation environments (in new environments). (a) Path plot and odometry of the DDPG algorithm proposed in [34]. (b) Path plot and odometry of the DDPG algorithm proposed in [35]. (c) Path plot and odometry of the MAP-DDPG algorithm.

Figure 12. Comparison of navigation paths of DDPG and MAP-DDPG in complex real-world environments. (a) Path plot of the DDPG algorithm in real-world scenario 1. (b) Path plot of the MAP-DDPG algorithm in real-world scenario 1. (c) Path plot of the DDPG algorithm in real-world scenario 2. (d) Path plot of the MAP-DDPG algorithm in real-world scenario 2.

Table 1. Hyperparameters used during model training.

Parameter	Value
Sampling batch size	128
Experience pool size	100,000
Discount factor	0.99
Maximum learning rate	0.0001
Network update frequency	0.001
Number of environments	4
State dimension	22
Action dimension	2

Table 2. Comparison of training time and number of training steps for different algorithms in the simulation environments of Figure 6.

Models	Training Time (h)	Training Steps $n_{i}$ (steps)
DDPG	25.02	353,264
M-DDPG	28.21	373,264
MA-DDPG	27.15	362,357
MAP-DDPG	24.47	345,856

Table 3. Comparison of the performance of four algorithms (new environment).

Models	Success Rate (successes/trials)	Average Time to Reach Target (s)
DDPG	29/50 = 0.58	52.3
M-DDPG	34/50 = 0.68	44.1
MA-DDPG	43/50 = 0.86	30.2
MAP-DDPG	44/50 = 0.88	28.6
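
The headline figures in the abstract (a 30% higher success rate and 23.7 s less time to reach the target) follow directly from the DDPG and MAP-DDPG rows of Table 3. The short Python sketch below redoes that arithmetic. The numbers are copied from the table; the names `results`, `TRIALS`, and `compare` are hypothetical helpers introduced only for this illustration and are not part of the authors' code.

```python
# Values copied from Table 3: (successful runs, average time to target in seconds).
# Each algorithm was evaluated over 50 trials in the new environment.
TRIALS = 50
results = {
    "DDPG": (29, 52.3),
    "M-DDPG": (34, 44.1),
    "MA-DDPG": (43, 30.2),
    "MAP-DDPG": (44, 28.6),
}


def compare(baseline: str, candidate: str) -> tuple[float, float]:
    """Return (success-rate gain, seconds saved) of candidate over baseline."""
    base_success, base_time = results[baseline]
    cand_success, cand_time = results[candidate]
    # Gain is an absolute difference in success rate (percentage points / 100).
    rate_gain = (cand_success - base_success) / TRIALS
    time_saved = base_time - cand_time
    # Round to the precision reported in the table to hide float noise.
    return round(rate_gain, 2), round(time_saved, 1)


gain, saved = compare("DDPG", "MAP-DDPG")
print(f"Success-rate gain: {gain:.0%}, time saved: {saved} s")
```

Note that the 30% figure is an absolute difference in success rate (0.88 versus 0.58), not a relative improvement over the DDPG baseline.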

Table 4. Comparison of the performance of three algorithms (new environment).

Models	Success Rate (successes/trials)	Average Time to Reach Target (s)
Algorithm in [34]	36/50 = 0.72	43.6
Algorithm in [35]	35/50 = 0.70	40.5
MAP-DDPG	43/50 = 0.86	35.2

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Chen, J.; Jiang, Y.; Pan, H.; Yang, M. Path Planning in Complex Environments Using Attention-Based Deep Deterministic Policy Gradient. Electronics 2024, 13, 3746. https://doi.org/10.3390/electronics13183746

