A DDQN Path Planning Algorithm Based on Experience Classification and Multi Steps for Mobile Robots
Abstract
1. Introduction
2. DDQN Algorithm
2.1. DQN Algorithm
Algorithm 1: Pseudocode of the DQN algorithm
Input: number of iteration episodes T, state feature dimension, action set A, learning rate α, reward attenuation factor γ, exploration factor ε, current Q network Q with weights w, target Q network Q′ with weights w′, sampling batch size m, minimum sample number N, target network update frequency C
for t = 1 to T do
  Initialization: initialize the first state S in the present state sequence and acquire its eigenvector φ(S)
  Update the basic exploration factor ε according to the number of iteration episodes t
  for each step starting from the initialized first state S do
    (a) Select the action A = argmax_a Q(φ(S), a, w) with probability 1 − ε, or randomly select an action A otherwise
    (b) Execute the action A under the state S to acquire an instant reward R, and obtain the new state S′ and its eigenvector φ(S′); is_done indicates whether the state S′ is terminal
    (c) Store the quintuple {φ(S), A, R, φ(S′), is_done} in the experience container D
    (d) S = S′
    (e) When the total sample size in D is greater than N, randomly collect m state transition samples {φ(S_j), A_j, R_j, φ(S′_j), is_done_j} from the experience container D and calculate the present target Q value y_j:
        y_j = R_j, if is_done_j is true
        y_j = R_j + γ max_a′ Q′(φ(S′_j), a′, w′), if is_done_j is false
    (f) Use the mean square error loss function (1/m) Σ_j (y_j − Q(φ(S_j), A_j, w))² to update the network weights w via the mini-batch gradient descent method
    Update the target network weights w′ = w every C steps
  until S′ is a terminal (is_done) state
  end for
until reaching the maximum number of iteration episodes T
end for
Output: Q network weights w
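As a concrete illustration of steps (e) and (f), the following is a minimal PyTorch-style sketch of one DQN update. The replay layout, the names `q_net`, `target_net`, and `dqn_update`, and the tensor handling are illustrative assumptions, not the authors' implementation; the defaults echo the parameter table (γ = 0.9, batch size 128, replay capacity 20,000).

```python
import random
from collections import deque

import torch
import torch.nn.functional as F

# Hypothetical replay buffer holding (phi_s, a, r, phi_s_next, is_done) quintuples.
replay = deque(maxlen=20000)


def dqn_update(q_net, target_net, optimizer, batch_size=128, gamma=0.9):
    """One DQN training step: sample a mini-batch and regress Q(s, a) onto y_j."""
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)
    s, a, r, s_next, done = (torch.as_tensor(x) for x in zip(*batch))
    s, s_next = s.float(), s_next.float()
    r, done = r.float(), done.float()

    # Step (e): y_j = R_j for terminal states, otherwise R_j + gamma * max_a' Q'(s'_j, a')
    with torch.no_grad():
        max_next_q = target_net(s_next).max(dim=1).values
        y = r + gamma * max_next_q * (1.0 - done)

    # Step (f): mean square error between y_j and Q(s_j, A_j, w), mini-batch gradient descent
    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```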
2.2. DDQN Algorithm
Algorithm 2: Pseudocode of the DDQN algorithm
Input: number of iteration episodes T, state feature dimension, action set A, learning rate α, reward attenuation factor γ, exploration factor ε, current Q network Q with weights w, target Q network Q′ with weights w′, sampling batch size m, minimum sample number N, target network update frequency C
for t = 1 to T do
  Initialization: initialize the first state S in the present state sequence and acquire its eigenvector φ(S)
  Update the basic exploration factor ε according to the number of iteration episodes t
  for each step starting from the initialized first state S do
    (a) Select the action A = argmax_a Q(φ(S), a, w) with probability 1 − ε, or randomly select an action A otherwise
    (b) Execute the action A under the state S to acquire an instant reward R, and obtain the new state S′ and its eigenvector φ(S′); is_done indicates whether the state S′ is terminal
    (c) Store the quintuple {φ(S), A, R, φ(S′), is_done} in the experience container D
    (d) S = S′
    (e) When the total sample size in D is greater than N, randomly collect m state transition samples {φ(S_j), A_j, R_j, φ(S′_j), is_done_j} from the experience container D and calculate the present target Q value y_j:
        y_j = R_j, if is_done_j is true
        y_j = R_j + γ Q′(φ(S′_j), argmax_a′ Q(φ(S′_j), a′, w), w′), if is_done_j is false
    (f) Use the mean square error loss function (1/m) Σ_j (y_j − Q(φ(S_j), A_j, w))² to update the network weights w via the mini-batch gradient descent method
    Update the target network weights w′ = w every C steps
  until S′ is a terminal (is_done) state
  end for
until reaching the maximum number of iteration episodes T
end for
Output: Q network weights w
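The only change relative to Algorithm 1 is the target of step (e): the current network selects the greedy action and the target network evaluates it. A minimal sketch, reusing the assumptions and names of the DQN example above:

```python
import torch


def ddqn_targets(q_net, target_net, r, s_next, done, gamma=0.9):
    """DDQN target of step (e): the current network selects the greedy action,
    the target network evaluates it, which curbs Q-value over-estimation."""
    with torch.no_grad():
        # a*_j = argmax_a Q(phi(S'_j), a, w) -- selection by the current network
        best_actions = q_net(s_next).argmax(dim=1, keepdim=True)
        # Q'(phi(S'_j), a*_j, w') -- evaluation by the target network
        next_q = target_net(s_next).gather(1, best_actions).squeeze(1)
    # y_j = R_j for terminal transitions, otherwise R_j + gamma * next_q
    return r + gamma * next_q * (1.0 - done)
```

Dropping this target into `dqn_update()` above in place of the `max` target turns the DQN update into a DDQN update.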
3. DDQN-ECMS Algorithm
3.1. Multi-Step DDQN Algorithm
Algorithm 3: Pseudocode of the MS-DDQN algorithm
Input: number of iteration episodes T, state feature dimension, action set A, learning rate α, reward attenuation factor γ, exploration factor ε, multi-step number K, multi-step guidance queue D, current Q network Q with weights w, target Q network Q′ with weights w′, sampling batch size m, minimum sample number N, target network update frequency C
for t = 1 to T do
  Initialization: initialize the first state S in the present state sequence and acquire its eigenvector φ(S)
  Update the basic exploration factor ε according to the number of iteration episodes t
  for each step starting from the initialized first state S do
    (a) Select the action A = argmax_a Q(φ(S), a, w) with probability 1 − ε, or randomly select an action A otherwise
    (b) Execute the action A under the state S to acquire an instant reward R, and obtain the new state S′ and its eigenvector φ(S′); is_done indicates whether the state S′ is terminal
    (c) Store the quintuple {φ(S), A, R, φ(S′), is_done} in the multi-step guidance queue D (when len(D) = K, the first quintuple in the queue is popped first)
    (d) When len(D) = K, store the quintuple {φ(S_0), A_0, R_Σ, φ(S_K), is_done_K} into the experience container M, where φ(S_0) and A_0 stand for the φ(S) and A of the first quintuple in the queue D, φ(S_K) refers to the last non-terminal state reached within the subsequent maximum number of steps, and R_Σ represents the cumulative sum of the subsequent discounted instant rewards, truncated at a terminal (is_done) state
    (e) S = S′
    (f) When the total sample size in M is greater than N, randomly select m state transition samples from the experience container M and calculate the present target Q value y_j:
        y_j = R_Σj, if is_done_Kj is true
        y_j = R_Σj + γ^K Q′(φ(S_Kj), argmax_a′ Q(φ(S_Kj), a′, w), w′), if is_done_Kj is false
    (g) Use the mean square error loss function (1/m) Σ_j (y_j − Q(φ(S_0j), A_0j, w))² to update the network weights w via the mini-batch gradient descent method
    Update the target network weights w′ = w every C steps
  until S′ is a terminal (is_done) state
  end for
until reaching the maximum number of iteration episodes T
end for
Output: Q network weights w
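A minimal sketch of the multi-step bookkeeping in steps (c) and (d), assuming the same quintuple layout as before; the queue length `K_STEP = 5` is an assumed value (the extracted text does not preserve the paper's choice), and the exact termination handling may differ from the authors' implementation.

```python
from collections import deque

GAMMA = 0.9       # reward attenuation factor (value from the parameter table)
K_STEP = 5        # multi-step number; an assumed value

ms_queue = deque()               # multi-step guidance queue D
ms_replay = deque(maxlen=20000)  # experience container M


def push_transition(phi_s, a, r, phi_s_next, is_done):
    """Steps (c)-(d): queue single-step transitions and, once the queue holds
    K_STEP of them (or the episode ends), emit one compressed multi-step quintuple."""
    ms_queue.append((phi_s, a, r, phi_s_next, is_done))
    if len(ms_queue) < K_STEP and not is_done:
        return
    # R_sum = sum_k gamma^k * r_{t+k}, truncated at a terminal state
    phi_s0, a0 = ms_queue[0][0], ms_queue[0][1]
    r_sum, discount = 0.0, 1.0
    last_state, last_done = phi_s_next, is_done
    for _, _, rk, sk_next, dk in ms_queue:
        r_sum += discount * rk
        discount *= GAMMA
        last_state, last_done = sk_next, dk
        if dk:
            break
    ms_replay.append((phi_s0, a0, r_sum, last_state, last_done))
    # Slide the window by one step; clear it entirely at the end of an episode
    ms_queue.popleft()
    if is_done:
        ms_queue.clear()
```

Step (f) then bootstraps from the compressed quintuple with a γ^K-discounted DDQN-style target, y_j = R_Σj + γ^K Q′(φ(S_Kj), argmax_a Q(φ(S_Kj), a, w), w′), for non-terminal samples.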
3.2. Experience Classification DDQN Algorithm
3.2.1. Experience Classification Training Method
3.2.2. Experience Classification Training Method
Algorithm 4: Pseudocode of the EC-DDQN algorithm
Input: number of iteration episodes T, state feature dimension, action set A, learning rate α, reward attenuation factor γ, exploration factor ε, current Q network Q with weights w, target Q network Q′ with weights w′, sampling batch size m, minimum sample number N, auxiliary experience pool M0, main experience pool M1, initial sampling proportion p, auxiliary exploration coefficient, target network update frequency C
for t = 1 to T do
  Initialization: initialize the first state S in the present state sequence and acquire its eigenvector φ(S)
  Update the basic exploration factor ε according to the number of iteration episodes t
  for each step starting from the initialized first state S do
    (a) Select the action A = argmax_a Q(φ(S), a, w) with probability 1 − ε, or randomly select an action A otherwise
    (b) Execute the action A under the state S to acquire an instant reward R, and obtain the new state S′ and its eigenvector φ(S′); is_done indicates whether the state S′ is terminal
    (c) Store the quintuple {φ(S), A, R, φ(S′), is_done}: when len(M1) < N, store it simultaneously in M0 and M1; when len(M1) ≥ N, store it in M1, and simultaneously store it in M0 if it satisfies the experience classification condition
    (d) S = S′
    (e) Update the sampling proportion p of the auxiliary experience pool M0 according to the average losses of the two experience pools
    (f) When the total sample size in the containers is greater than N, randomly select m state transition samples from the experience containers M0 and M1 at sampling proportions p and 1 − p, respectively, and calculate the present target Q value y_j:
        y_j = R_j, if is_done_j is true
        y_j = R_j + γ Q′(φ(S′_j), argmax_a′ Q(φ(S′_j), a′, w), w′), if is_done_j is false
    (g) Use the mean square error loss function (1/m) Σ_j (y_j − Q(φ(S_j), A_j, w))² to update the network weights w via the mini-batch gradient descent method, and return the average losses L0 and L1 of the two experience pools
    Update the target network weights w′ = w every C steps
  until S′ is a terminal (is_done) state
  end for
until reaching the maximum number of iteration episodes T
end for
Output: Q network weights w
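A sketch of the classified storage and mixed sampling of steps (c), (e), and (f). The routing condition (`is_valuable`) and the proportion-update rule are assumptions, since the extracted pseudocode does not preserve the corresponding formulas; the constants echo the parameter table's dynamic sampling probability (0.2) and auxiliary exploration coefficient (0.4), though how they enter the paper's update rule is not recoverable from the text.

```python
import random
from collections import deque

N_MIN = 500      # minimum exploration number before classified storage applies (table value)
BATCH = 128      # sampling batch size (table value)

aux_pool = deque(maxlen=20000)    # M0: auxiliary pool of classified, "valuable" experiences
main_pool = deque(maxlen=20000)   # M1: main pool holding every experience
aux_ratio = 0.2                   # sampling proportion of M0; initialised from the table value


def store(transition, is_valuable):
    """Step (c): route a transition into the classified pools. `is_valuable`
    stands in for the paper's classification condition."""
    if len(main_pool) < N_MIN:
        aux_pool.append(transition)
        main_pool.append(transition)
    else:
        main_pool.append(transition)
        if is_valuable:
            aux_pool.append(transition)


def update_aux_ratio(loss_aux, loss_main, base=0.2, coef=0.4):
    """Step (e): one plausible update rule -- weight the auxiliary pool by its
    share of the total loss; the paper's exact formula is not recoverable here."""
    global aux_ratio
    aux_ratio = base + coef * loss_aux / (loss_aux + loss_main + 1e-8)


def sample_batch():
    """Step (f): draw from M0 and M1 at proportions aux_ratio and 1 - aux_ratio."""
    n_aux = min(int(BATCH * aux_ratio), len(aux_pool))
    n_main = min(BATCH - n_aux, len(main_pool))
    return random.sample(list(aux_pool), n_aux) + random.sample(list(main_pool), n_main)
```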
3.3. DDQN Algorithm Based on Experience Classification and Multi Steps
Algorithm 5: Pseudocode of the DDQN-ECMS algorithm
Input: number of iteration episodes T, state feature dimension, action set A, learning rate α, reward attenuation factor γ, exploration factor ε, multi-step number K, sampling batch size m, minimum sample number N, multi-step guidance queue D, auxiliary experience pool M0, main experience pool M1, initial sampling proportion p, current Q network Q(w) and target Q network Q′(w′), target network update frequency C
for t = 1 to T do
  Initialization: initialize the first state S in the present state sequence and acquire its eigenvector φ(S)
  Update the basic exploration factor ε according to the number of iteration episodes t
  for each step starting from the initialized first state S do
    (a) Select the action A = argmax_a Q(φ(S), a, w) with probability 1 − ε; otherwise, randomly select an action A
    (b) Execute the action A under the state S to acquire an instant reward R, and obtain the new state S′ and its eigenvector φ(S′); is_done indicates whether the state S′ is terminal
    (c) Store the quintuple {φ(S), A, R, φ(S′), is_done} in the multi-step (MS) guidance queue D (when len(D) = K, the first quintuple in the queue is popped first)
    (d) When len(D) = K, store the quintuple {φ(S_0), A_0, R_Σ, φ(S_K), is_done_K} into the experience containers: if len(M1) < N, store it in both M0 and M1; if len(M1) ≥ N, store it in M1, and simultaneously store it in M0 if it satisfies the experience classification condition. Here φ(S_0) and A_0 are the φ(S) and A of the first quintuple in the MS queue D, φ(S_K) is the last non-terminal state reached within the subsequent maximum number of steps, and R_Σ is the cumulative sum of the subsequent discounted instant rewards, truncated at a terminal (is_done) state
    (e) S = S′
    (f) Update the sampling proportion p of the auxiliary experience pool M0 according to the average losses of the two experience pools
    (g) If the total number of samples in the containers is greater than N, randomly select m state transition samples from the experience containers M0 and M1 at sampling proportions p and 1 − p, respectively, and calculate the current target Q value y_j:
        y_j = R_Σj, if is_done_Kj is true
        y_j = R_Σj + γ^K Q′(φ(S_Kj), argmax_a′ Q(φ(S_Kj), a′, w), w′), if is_done_Kj is false
    (h) Use the mean square error loss function (1/m) Σ_j (y_j − Q(φ(S_0j), A_0j, w))² to update the network weights w via the mini-batch gradient descent method, and return the average losses L0 and L1 of the two experience replay buffers
    Update the target network weights w′ = w every C steps
  until S′ is a terminal (is_done) state
  end for
until reaching the maximum number of iteration episodes T
end for
Output: Q network weights w
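Putting the pieces together, one DDQN-ECMS update covering steps (f)–(h) might look as follows, reusing `sample_batch()`, `update_aux_ratio()`, and `ddqn_targets()` from the sketches above and assuming the pools now hold the multi-step quintuples emitted by `push_transition()`; this is an illustrative composition, not the authors' code.

```python
import torch
import torch.nn.functional as F


def ddqn_ecms_update(q_net, target_net, optimizer, gamma=0.9, k_step=5):
    """One DDQN-ECMS training step: classified sampling plus a multi-step DDQN target."""
    batch = sample_batch()
    if not batch:
        return
    s0, a0, r_sum, s_k, done = (torch.as_tensor(x) for x in zip(*batch))
    s0, s_k = s0.float(), s_k.float()
    r_sum, done = r_sum.float(), done.float()

    # Multi-step DDQN target:
    # y_j = R_sum_j + gamma^K * Q'(phi(S_Kj), argmax_a Q(phi(S_Kj), a, w), w')
    y = ddqn_targets(q_net, target_net, r_sum, s_k, done, gamma=gamma ** k_step)

    q_sa = q_net(s0).gather(1, a0.long().unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Step (h) would additionally compute the average losses of the M0 and M1
    # portions of the batch and feed them back into update_aux_ratio(); omitted
    # here because sample_batch() does not tag each sample's pool of origin.
```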
4. Test and Analysis
4.1. Setting of the Experimental Simulation Environment
1. Design of reward function
2. Design of value fitting network Q
3. Parameter setting
4.2. Experiments and Result Analysis
4.2.1. Experiment in the Environment with Fixed Positions and Result Analysis
4.2.2. Experiment in the Environment with Random Positions and Result Analysis
5. Conclusions and Future Directions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
Parameter | Value |
---|---|
Initial exploration factor | 1 |
Minimum exploration factor | 0.01 |
Learning rate | 0.001 |
Target network parameter update step interval | 100 |
Maximum number of steps per episode | 1000 |
Maximum number of iteration episodes | 100 |
Reward attenuation factor | 0.9 |
Maximum capacity of the experience replay container | 20,000 |
Minimum exploration number | 500 |
Sampling batch size | 128 |
Initial exploration probability | 0.2 |
Dynamic sampling probability | 0.2 |
Auxiliary exploration coefficient | 0.4 |
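For reference, the values in the table above can be gathered into a single configuration object as it might be passed to a training loop; the key names below are illustrative, not the authors'.

```python
# Hyperparameters from the parameter table above, gathered into one config.
CONFIG = {
    "epsilon_init": 1.0,                 # initial exploration factor
    "epsilon_min": 0.01,                 # minimum exploration factor
    "learning_rate": 0.001,
    "target_update_interval": 100,       # steps between target-network parameter syncs
    "max_steps_per_episode": 1000,
    "max_episodes": 100,
    "gamma": 0.9,                        # reward attenuation factor
    "replay_capacity": 20_000,           # maximum capacity of the experience replay container
    "min_exploration_samples": 500,      # minimum exploration number before training starts
    "batch_size": 128,
    "initial_exploration_probability": 0.2,
    "dynamic_sampling_probability": 0.2,
    "auxiliary_exploration_coefficient": 0.4,
}
```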
Algorithm | DDQN | MS-DDQN | EC-DDQN | DDQN-ECMS |
---|---|---|---|---|
Average time spent (s) | 129.732 | 122.244 | 122.587 | 124.709 |
Average return value | 8.203 | 8.825 | 8.275 | 8.930 |
Average number of steps | 157.815 | 160.267 | 158.132 | 157.330 |
Average reward per step | 0.052 | 0.055 | 0.052 | 0.057 |
Algorithm | DDQN | MS-DDQN | EC-DDQN | DDQN-ECMS |
---|---|---|---|---|
Average time spent (s) | 73.292 | 74.436 | 67.913 | 74.013 |
Average return value | 8.886 | 8.644 | 8.902 | 9.121 |
Average number of steps | 71.684 | 72.767 | 69.888 | 71.102 |
Average reward per step | 0.124 | 0.119 | 0.127 | 0.128 |
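As a quick consistency check (an observation on the tables, not a claim from the paper), the last row appears to be the ratio of the average return to the average number of steps: for DDQN-ECMS, 8.930 / 157.330 ≈ 0.057 in the first table and 9.121 / 71.102 ≈ 0.128 in the second.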