To validate the feasibility of the algorithm proposed in this paper, we first constructed a Gazebo simulation environment within the Robot Operating System (ROS) containing randomly distributed obstacles and measuring 17 m in length and 13 m in width, with computation performed on a server equipped with an NVIDIA GeForce RTX 2070 SUPER GPU. The simulation experiments were run on Ubuntu 20.04.5 LTS with ROS Noetic 1.15.15 installed.
4.2. Simulation Experiments
The simulation experiments compared three training methods. Method I used the original deep reinforcement learning approach without any enhancements (DDQN). Method II extended Method I by incorporating dimensional discretization, a two-branch network, and a continuous reward function into the training process. Method III was trained by further incorporating "Expert Experience" into Method II. For each method, we separately evaluated the episode return, the success rate over the most recent 100 episodes, the training time, and the training stability.
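For reference, the defining feature of the DDQN baseline used in Method I is that the online network selects the greedy next action while the target network evaluates it. The following PyTorch sketch illustrates this target computation only; the network classes, batch format, and discount factor are illustrative assumptions rather than the exact implementation used in this paper.

```python
import torch

def ddqn_targets(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double-DQN target: the online network selects the greedy action,
    the target network evaluates it (illustrative sketch)."""
    with torch.no_grad():
        # Action selection with the online network
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)
        # Action evaluation with the target network
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        # Terminal transitions receive no bootstrap term
        return rewards + gamma * (1.0 - dones) * next_q
```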
For Method I, the per-episode reward obtained during training and the success rate over the most recent 100 episodes are depicted in
Figure 11. The training process is characterized by slow convergence and susceptibility to local optima, resulting in a prolonged training time: even with 10× acceleration in the Gazebo environment, 5000 training episodes took roughly 50 h, and the training struggled to converge. Additionally, the success rate over the most recent 100 episodes remained at a mere 5%.
According to the kernel density plot in
Figure 12 and the bar heat map in
Figure 13, the majority of the reward values are concentrated around −300, with only a handful of episodes successfully reaching the target point, indicating an unfavorable training outcome.
Method II builds upon Method I by incorporating dimensional discretization, a two-branch network, and a continuous reward function into training. This improves training efficacy and allows the model to converge, enabling it to complete path-planning and obstacle-avoidance tasks after a period of training.
The reward curve and the success rate over the most recent 100 episodes obtained by Method II are illustrated in
Figure 14. Because the robot initially explores randomly with an imperfect model, its success rate is low during the early stages of training; as training progresses, however, the success rate over the most recent 100 episodes reaches up to 80%. The kernel density plot and the bar heat map are shown in
Figure 15 and
Figure 16. During the early stages of training, the robot predominantly receives low reward values (−300); as the model improves, the reward distribution gradually shifts upward. At this point, the robot has acquired obstacle-avoidance capability and successfully reaches the target point to obtain the high reward of +300. With 10× acceleration in the Gazebo environment, the training duration for 5000 episodes is reduced to 20 h, and model stability and convergence are improved compared to Method I.
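The exact form of the continuous reward used in Method II is defined earlier in the paper; the sketch below only illustrates the general pattern implied by the reported values, namely a terminal reward of +300 on reaching the goal, −300 on collision, and a dense distance-based shaping term in between. The thresholds and the shaping coefficient are hypothetical.

```python
def continuous_reward(dist_to_goal, prev_dist_to_goal, min_laser_dist,
                      goal_radius=0.3, collision_dist=0.2, k_progress=10.0):
    """Illustrative continuous reward: large terminal rewards plus a dense
    term proportional to progress toward the goal (parameters assumed)."""
    if dist_to_goal < goal_radius:          # reached the target point
        return 300.0
    if min_laser_dist < collision_dist:     # collision with an obstacle
        return -300.0
    # Dense shaping: positive when the robot moves closer to the goal
    return k_progress * (prev_dist_to_goal - dist_to_goal)
```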
Figure 17 depicts the reward curve and the success rate over the most recent 100 episodes for Method III, where "ep" denotes the proportion of expert experience incorporated by the model. In the early stage of training, the average reward rises as the proportion of "expert experience" increases. As training progresses, the robot's reliance on "expert experience" decreases and it begins to plan paths more autonomously, which causes a slight drop in the average reward but an increase in stability. The robot also begins to learn from "reverse experience," after which the average reward rises again and the model ultimately acquires navigation and obstacle-avoidance abilities. With 10× acceleration in Gazebo, 3500 training episodes are completed within just 4 h, and the convergence speed and training stability are significantly improved. The model trained with Method III yields high reward values throughout the process, with only a slight dip in the middle of training; overall, the training is stable, as demonstrated by both the kernel density plot (
Figure 18) and the bar heat map (
Figure 19) for Method III.
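One plausible reading of the expert-experience schedule in Figure 17 is an action-selection rule that mixes an expert planner with the learned Q-network according to the proportion "ep", which is varied over training. The sketch below illustrates that idea only; the expert policy, the schedule for "ep", and all names are assumptions for illustration, not the paper's exact mechanism.

```python
import random
import torch

def select_action(q_net, state, expert_policy, ep):
    """Illustrative Method III action selection: with probability `ep` the
    action comes from an expert planner, otherwise from the learned
    Q-network (the schedule for `ep` over training is set elsewhere)."""
    if random.random() < ep:
        return expert_policy(state)  # expert-experience action
    with torch.no_grad():
        q_values = q_net(torch.as_tensor(state, dtype=torch.float32))
    return int(q_values.argmax().item())  # autonomous (greedy) action
```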
Moreover, the three methods are compared in terms of their reward distribution histograms and numerical statistics.
Figure 20 compares the reward distributions of the three methods. Compared with Method I, the reward values of Methods II and III are much more concentrated around +300. Moreover, compared with Method II, Method III receives the +300 reward more often, so its reward distribution is more tightly clustered around this value.
Table 6 presents a numerical comparison of the three methods in terms of the mean, variance, and quantiles of the rewards obtained during training. Combined with
Figure 18, the distribution and dispersion of the reward values obtained by the three methods can be seen intuitively. Method I has a low success rate and does not converge. Compared with Method II, Method III achieves a higher mean reward and lower standard deviation and variance, indicating a more concentrated reward distribution, faster model convergence, and higher stability.
Therefore, the improved training methods proposed in this paper (Methods II and III) outperform the traditional training method (Method I), with shorter training time, higher stability, and faster model convergence.
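The statistics reported in Table 6 (mean, variance, and quantiles of the per-episode rewards) can be reproduced directly from the logged reward curves. A short sketch is given below, assuming only that the per-episode rewards of each method are stored as plain lists; the function name is illustrative.

```python
import numpy as np

def reward_statistics(episode_rewards):
    """Summary statistics of per-episode rewards (as in Table 6):
    mean, standard deviation, variance, and quartiles."""
    r = np.asarray(episode_rewards, dtype=float)
    return {
        "mean": r.mean(),
        "std": r.std(ddof=1),
        "var": r.var(ddof=1),
        "q25": np.percentile(r, 25),
        "median": np.percentile(r, 50),
        "q75": np.percentile(r, 75),
    }
```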
To further verify the advantages of the improved algorithm during training, we trained DQN, DDQN, and the improved DDQN (Method III) in the same environment and analyzed the reward obtained in each episode, as shown in
Figure 21 (a Whittaker smoother is used to produce smoother curves). The improved algorithm proposed in this paper obtains more positive rewards during the training stage and higher overall rewards than the other two algorithms, as shown in
Figure 22.
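The Whittaker smoother used for the curves in Figure 21 penalizes second-order differences of the fitted series; a compact implementation is sketched below. The smoothing parameter `lam` is illustrative and was not reported in the paper.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def whittaker_smooth(y, lam=100.0):
    """Whittaker smoother: minimizes ||y - z||^2 + lam * ||D z||^2,
    where D is the second-order difference operator."""
    y = np.asarray(y, dtype=float)
    n = y.size
    # Second-order difference matrix of shape (n - 2, n)
    D = sparse.diags([1.0, -2.0, 1.0], [0, 1, 2], shape=(n - 2, n))
    A = sparse.eye(n) + lam * (D.T @ D)
    return spsolve(A.tocsc(), y)
```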
To verify the feasibility of the improved algorithm and briefly analyze its interpretability, we used Method III to carry out a complete path-planning run in the simulation environment and recorded the Q values of the available actions at each step, as shown in
Figure 23. In the early stage of path planning, the Q values are relatively low because the robot is still far from the target point. As path finding proceeds, the values dip around steps 5 and 13 because obstacles near the robot indicate potential danger. Finally, after step 20, the robot approaches the target point and the values grow until the target is reached. Since reaching the target point yields a +300 reward, the Q values of different actions diverge sharply as the robot closes in on the target.
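Recording the per-step Q values shown in Figure 23 amounts to evaluating the trained network at every step of a single greedy rollout. A minimal sketch is shown below; the environment interface (`env.reset()` / `env.step()`), the state-to-tensor conversion, and the step limit are assumptions.

```python
import torch

def rollout_with_q_values(q_net, env, max_steps=50):
    """Run one greedy episode and record the Q values of all actions at
    every step (as plotted in Figure 23)."""
    q_log = []
    state, done, step = env.reset(), False, 0
    while not done and step < max_steps:
        with torch.no_grad():
            q_values = q_net(torch.as_tensor(state, dtype=torch.float32))
        q_log.append(q_values.numpy().copy())   # Q values of all actions at this step
        action = int(q_values.argmax().item())  # greedy action
        state, reward, done, info = env.step(action)
        step += 1
    return q_log
```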
Moreover, we set up a simple environment and a complex environment in Gazebo and carried out path planning with models trained by the different algorithms to evaluate the effect of the improved algorithm. Because the model trained with the traditional method suffers from poor stability, long training time, and poor training performance, only the improved methods are used in the test experiments.
Figure 24 and
Figure 25 show the trajectories obtained using Method II and Method III, respectively, for path planning to different target points in a simple environment. The trajectories obtained with Method III are smoother, and the planned paths are shorter and more reasonable.
Figure 26 and
Figure 27 show the trajectories obtained using Method II and Method III, respectively, for path planning to different target points in a complex environment. To eliminate the effect of different initial positions on path planning, the two methods are used to navigate from different initial positions separately, as shown in
Figure 28 and
Figure 29. We also conducted multiple experiments on the same path-planning problem (with the same initial point and target point) using Methods II and III on large-scale maps, as shown in
Figure 30 and
Figure 31. The trajectory obtained using Method III is clearly better. The simulation experiments show that the improvement proposed in this paper (Method III) produces a smoother path with a shorter distance.