A Reinforcement Learning Approach Based on Automatic Policy Amendment for Multi-AUV Task Allocation in Ocean Current
Abstract
1. Introduction
- (1) In traditional methods, sample reuse extracts experience by learning from samples in the replay buffer; it cannot directly improve sample quality by guiding the behavior of the policy. Moreover, although the experience extracted from samples can improve convergence, the effect depends on how the experience is extracted. The proposed algorithm extracts the available information from samples and uses it directly in decision making, so it is not overly dependent on the training effect and can directly improve sample quality.
- (2) Traditional sample-reuse methods do not take into account the influence of exploitation on policy exploration. The Automatic Policy Amendment Algorithm (APAA) balances exploration and exploitation: it uses entropy to evaluate the information extracted from samples, aiming to maintain a certain exploration ability in action decision making and to avoid being trapped in a non-optimal policy (see the entropy sketch after this list).
- (3) Traditional importance-evaluation methods generally judge the importance of samples by their expected reward and do not distinguish between samples that have the same expected reward under environmental changes. To overcome this shortcoming, a subtask reward evaluation method is incorporated to distinguish the influence of identical reward values on the policy in different situations.
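As a minimal illustration of contribution (2), the Python sketch below computes the Shannon entropy of a policy's action distribution and normalizes it to [0, 1] as an exploration score. The function names and the linear normalization are illustrative assumptions for this sketch, not the paper's Equations (20)–(30).

```python
import numpy as np

def policy_entropy(probs: np.ndarray) -> float:
    """Shannon entropy of an action probability distribution."""
    probs = probs[probs > 0]  # drop zero entries to avoid log(0)
    return float(-np.sum(probs * np.log(probs)))

def exploration_score(probs: np.ndarray) -> float:
    """Normalize entropy by its maximum (uniform distribution).
    High score -> keep exploring; low score -> the policy is already
    confident, so amendment from past samples can weigh more.
    The normalization is an illustrative assumption."""
    max_entropy = np.log(len(probs))
    return policy_entropy(probs) / max_entropy

# Example: a nearly deterministic 4-action policy has low entropy.
print(exploration_score(np.array([0.85, 0.05, 0.05, 0.05])))  # ~0.42
```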
2. Problem Statement
2.1. Ocean Current Environment
2.2. AUVs Model
2.3. Task Model
3. Background
Reinforcement Learning (RL)
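As background, the sketch below shows the standard tabular Q-learning update that underlies the value-based baselines (DDQN, PER) compared later in the experiments. The state/action counts and rates are placeholders, and the paper itself trains a neural network rather than a table.

```python
import numpy as np

# Standard tabular Q-learning update (background sketch only).
n_states, n_actions = 100, 4
alpha, gamma = 0.1, 0.99  # learning rate, discount factor (placeholders)
Q = np.zeros((n_states, n_actions))

def q_update(s: int, a: int, r: float, s_next: int) -> None:
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
```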
4. Model Design
4.1. State Model
4.2. Action Model
4.3. Reward Model
- Energy consumption is determined by the AUV's speed: the faster an AUV moves in a time step, the more energy it consumes and the lower the reward. The reward for energy consumption is defined as
- Moving evaluation is determined by whether the AUV is closer to its nearest target than it was at the previous time step. The reward for the moving evaluation is defined as
- Collision detection judges whether an AUV collides with another AUV; a colliding AUV receives a negative reward. The reward for a collision is defined as
- The task completion reward is determined by whether the AUVs salvage the targets. If the AUVs salvage a target at time t, all AUVs receive a reward, which also depends on the emergency of that target (the four reward components are combined in the sketch after this list). The reward for task completion is defined as
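The closed-form reward equations are not reproduced above, so the following is only a minimal sketch of how the four components could be combined into a per-step reward. The function name, the quadratic energy cost, and the bonus/penalty constants are our own illustrative assumptions, not the paper's definitions; `k` mirrors the drag coefficient from the parameter table.

```python
def step_reward(speed, dist, prev_dist, collided, salvaged, emergency,
                k=3.425, move_bonus=1.0, collision_penalty=-10.0):
    """Illustrative per-step reward combining the four components.
    All functional forms and constants are assumptions."""
    r_energy = -k * speed ** 2                      # faster -> lower reward
    r_move = move_bonus if dist < prev_dist else -move_bonus
    r_collision = collision_penalty if collided else 0.0
    r_task = emergency if salvaged else 0.0         # scaled by target emergency
    return r_energy + r_move + r_collision + r_task

# Example: AUV moved closer at 1 m/s, no collision, no salvage yet.
print(step_reward(1.0, 4.2, 4.8, False, False, 0.0))  # -2.425
```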
5. Automatic Policy Amendment Algorithm (APAA)
5.1. Task Sequence Matrix (TSM)
5.2. Automatic Policy Amendment Matrix (APAM)
- (1) Team Cumulative Reward (TCR)
- (2) Entropy
- (3) Subtask Reward (SR)
- (4) Probability Weighted
- (5) Probability Prediction (these quantities are combined in the sketch after this list)
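The five quantities above are combined by Equations (20)–(30), which are not reproduced in this extraction. As a heavily hedged sketch of the general idea, the code below weights stored task sequences by a softmax of their team cumulative reward and aggregates them into a next-task probability prediction; every functional form and name here is an assumption, and the TCR values are taken from the flattened table later in the paper only for illustration.

```python
import numpy as np

def predict_next_task(sequences, tcrs, current_len, n_tasks):
    """Illustrative stand-in for Equations (20)-(30): softmax-weight the
    stored task sequences by TCR, then accumulate the task each sequence
    performed at the current step into a probability vector."""
    weights = np.exp(tcrs - np.max(tcrs))
    weights /= weights.sum()                 # softmax over sequences
    probs = np.zeros(n_tasks)
    for seq, w in zip(sequences, weights):
        if current_len < len(seq):
            probs[seq[current_len]] += w     # vote for this sequence's next task
    total = probs.sum()
    return probs / total if total > 0 else np.full(n_tasks, 1.0 / n_tasks)

# Example: three stored sequences over 5 tasks with their TCRs.
seqs = [[2, 0, 4], [2, 1, 3], [0, 1, 4]]
tcrs = np.array([-92.0, -108.0, -115.0])
print(predict_next_task(seqs, tcrs, current_len=0, n_tasks=5))
```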
5.3. Action Conduct by APAM
5.4. Algorithm Summary
Algorithm 1 APAA
```
Input: Nu, Nm, D, EPISODE, L, M.
Output: …
 1: Initialize: …;
 2: for … = 1 to EPISODE do
 3:     …;
 4:     Generate APAM by Equations (20)–(30);
 5:     while … is not a terminal state do
 6:         for … = 1 to Nu do
 7:             Generate the action probability distribution … according to state …;
 8:             if … then
 9:                 if … then
10:                     …;
11:                 end if
12:                 for … = 1 to D do
13:                     … is corrected according to Equation (33);
14:                 end for
15:             end if
16:             Choose action … by ε-greedy according to …, and get reward …;
17:             Put … into …;
18:         end for
19:         …;
20:     end while
21:     For a new sequence, update the TSM according to Equation (19);
22:     if … then
23:         Randomly select a batch of samples from …;
24:         Train the network by Equation (13);
25:     end if
26:     … executed after several iterations;
27: end for
```
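Because Equations (20)–(33) are not reproduced above, the Python sketch below only illustrates the shape of steps 7–16: amend the network's action probability distribution with a per-action amendment vector, then choose an action ε-greedily. The multiplicative correction and the renormalization are our assumptions, not the paper's Equation (33).

```python
import numpy as np

rng = np.random.default_rng(0)

def amend_and_choose(probs: np.ndarray, amendment: np.ndarray,
                     epsilon: float = 0.1) -> int:
    """Sketch of steps 7-16: correct the policy's action distribution
    with an amendment vector (stand-in for Equation (33)), then pick
    an action epsilon-greedily. Multiplicative correction is assumed."""
    corrected = probs * amendment
    corrected /= corrected.sum()       # renormalize to a distribution
    if rng.random() < epsilon:         # explore: random action
        return int(rng.integers(len(corrected)))
    return int(np.argmax(corrected))   # exploit the amended policy

# Usage: network output for 4 actions, amendment favoring action 2.
a = amend_and_choose(np.array([0.4, 0.3, 0.2, 0.1]),
                     np.array([1.0, 1.0, 2.5, 1.0]))
print(a)  # 2 in the exploitation case
```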
6. Simulation Results
6.1. Experiment Parameters
6.2. Experiment Results
7. Analysis
7.1. Validity Analysis
7.2. Computational Complexity Analysis
8. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
[Table: rows 1–10; all columns except the last were lost in extraction. Last-column values: −108, −92, −115, −113, −131, −107, −111, −117, −123, −125.]
[Table: five rows of probability values (e.g., 0.403, 0.425, 1, 0.717, 0.508); row and column labels were lost in extraction.]
Hidden Layers | Transfer Function | Optimizer | Epochs | Learning Rate | Batch Size | Regularization |
---|---|---|---|---|---|---|
2 | tanh | adam | 500 | 0.001 | 300 | L2 |
[Table row with headers lost in extraction: 5000 | 0.8 | 0.995]
Parameter | Symbol | Value
---|---|---
Size of the TSM | N | 15
Impact factor | | 0.1
Attenuation factor | | 0.9
Update factor | | 0.7
AUVs | Position | Power (kg) | Speed (m/s) | Energy (J)
---|---|---|---|---
AUV 1 | (5,1) | 5 | 1 | 600
AUV 2 | (10,3) | 2 | 3 | 600
AUV 3 | (3,6) | 2 | 3 | 600
Targets | Position | Weight (kg) | Emergency
---|---|---|---
Target 1 | (5,7) | 4 | 9.7894
Target 2 | (4,9) | 2 | 7.3135
Target 3 | (6,2) | 2 | 8.66
Target 4 | (8,10) | 3 | 7.3227
Target 5 | (2,1) | 2 | 8.405
Parameter | Symbol | Value
---|---|---
Number of AUVs | | 3
Number of Targets | | 5
Salvage radius | | 0.5 m
Collision radius | | 0.1 m
Drag coefficient | k | 3.425
Targets weight attenuation coefficient | | 0.01
Algorithm | Time (h)
---|---
DDQN | 2.75
PER | 3.77
PPO-Clip | 0.74
APAA | 0.5
Algorithm | Task Reward | Collision Frequency | Energy Consumption (J) | Time Consumed (s)
---|---|---|---|---
DDQN | 105 | 3 | 560 | 84
PER | 110 | 2 | 650 | 75
PPO-Clip | 82 | 4 | 1350 | 111
APAA | 117 | 0 | 337 | 12
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).